### Language Model Used:
- The RoBERTa model was proposed in *RoBERTa: A Robustly Optimized BERT Pretraining Approach* by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.
- [Blog-Post](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/)
- [Research Paper](https://arxiv.org/pdf/1907.11692)
- [Documentation for Python](https://huggingface.co/transformers/model_doc/roberta.html)


### Hardware and Software Requirements:
- Python 3.6 and above
- PyTorch, Transformers, and the standard Python ML libraries
- GPU-enabled setup

### Intro:
In this notebook, I show how to get text embeddings from RoBERTa (the same approach applies to BERT, ALBERT, etc.).  
There are two methods to get a text embedding from RoBERTa:
1. Take the CLS token embedding.
2. Pool the RoBERTa output (RoBERTa output = word embeddings).  

In [1]:
#!conda install transformers==3.0.2
#!conda install -c conda-forge transformers
#!pip install fiftyone

In [2]:
# Importing the libraries needed
import os
import pandas as pd
import numpy as np
import torch
import seaborn as sns
import transformers
import json
import random
import logging
from tqdm import tqdm
from torch import cuda
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaModel, RobertaTokenizer
from sklearn.model_selection import train_test_split
import fiftyone as fo

logging.basicConfig(level=logging.ERROR)

In [3]:
class Settings:
    batch_size=16
    max_len=350
    device = "cuda" if torch.cuda.is_available() else "cpu"
    seed = 318

In [4]:
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True
    
set_seed(Settings.seed)

# Model

In [5]:
class RobertaClass(torch.nn.Module):
    def __init__(self):
        super().__init__()
        #super(RobertaClass, self).__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-base")

    def forward(self, input_ids, attention_mask, token_type_ids=None):
#         output = self.roberta(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        output = self.roberta(input_ids, attention_mask=attention_mask)
        return output

In [6]:
model = RobertaClass()
model.to(Settings.device) # produces 768-dimensional output embeddings

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


RobertaClass(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768

# Dataset
### Preparing the Dataset and Dataloader

I will start by defining a few key variables that will be used later during the training/fine-tuning stage.
This is followed by the creation of the Dataset class, which defines how the text is pre-processed before it is sent to the neural network. I will also define the Dataloader that feeds the data to the neural network in batches.
Dataset and Dataloader are constructs of the PyTorch library for defining and controlling data pre-processing and its passage to the neural network. For further reading on Dataset and Dataloader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html).

- I am using the RoBERTa tokenizer, whose `encode_plus` method performs tokenization and generates the necessary outputs, namely `ids` and `attention_mask`.
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/roberta.html#robertatokenizer)
- `target` is the encoded category of the news headline; this still has to be adapted to COCO (see the TODOs in the code).
- I am using the COCO APIs to load the COCO dataset.

#### Dataloader
- The Dataloader is used to create the training and validation dataloaders, which load data into the neural network in a defined manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data loaded into memory and passed to the neural network has to be controlled.
- This control is achieved with parameters such as `batch_size` (the number of texts per batch) and `max_len` (the number of tokens each text is padded or truncated to).
- The training and validation dataloaders are used in the training and validation parts of the flow, respectively.
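
As a rough illustration of how `batch_size` and `max_len` determine the shape of what reaches the model, here is a minimal sketch. `ToyDataset` and its placeholder token ids are hypothetical and only stand in for a tokenized text dataset; they are not part of this notebook's pipeline.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in for a tokenized text dataset."""

    def __init__(self, num_items, max_len):
        self.num_items = num_items
        self.max_len = max_len

    def __len__(self):
        return self.num_items

    def __getitem__(self, idx):
        # Pretend text number idx has idx + 1 real tokens, padded up to max_len.
        real = min(idx + 1, self.max_len)
        ids = torch.zeros(self.max_len, dtype=torch.long)
        ids[:real] = 5  # arbitrary placeholder token id
        mask = torch.zeros(self.max_len, dtype=torch.long)
        mask[:real] = 1  # 1 = real token, 0 = padding
        return {"ids": ids, "mask": mask}


loader = DataLoader(ToyDataset(num_items=10, max_len=8), batch_size=4, shuffle=False)

# Each batch is (texts in batch, max_len); the last batch holds the remainder.
batch_shapes = [tuple(batch["ids"].shape) for batch in loader]
print(batch_shapes)  # [(4, 8), (4, 8), (2, 8)]
```

Only one batch is materialised at a time, which is what keeps memory use bounded regardless of the dataset size.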

In [7]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', truncation=True, do_lower_case=True)

In [8]:
class TrainValidDataset(Dataset):
    def __init__(self, df, tokenizer, max_len):
        self.df = df
        self.text = df["excerpt"].values  #TODO: adjust to coco
        self.target = df["target"].values #TODO: adjust to coco
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        texts = self.text[idx] #text = str(self.text[index]),text = " ".join(text.split()) #TODO: adjust to coco
        tokenized = self.tokenizer.encode_plus(
            texts,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            padding="max_length"
        )
        ids = tokenized["input_ids"]
        mask = tokenized["attention_mask"]
        #token_type_ids = tokenized["token_type_ids"] #TODO: adjust to coco
        targets = self.target[idx]
        
        return {
            "ids": torch.LongTensor(ids),
            "mask": torch.LongTensor(mask),
            #'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long), #TODO: adjust to coco
            "targets": torch.tensor(targets, dtype=torch.float32)
        }

# Get Text Embeddings

### Get Input for Inference

Until we can load the COCO dataset (see the bottom of the notebook), I will stub the input.

In [10]:

ids=[] #some sentence
mask = []
token_type_ids= []

def testing_tokenizer(text):
        tokenized = tokenizer.encode_plus(
            text,
            truncation=True,
            add_special_tokens=True,
            max_length=Settings.max_len,
            padding="max_length"
        )
        ids = tokenized["input_ids"]
        mask = tokenized["attention_mask"]
        #token_type_ids = tokenized["token_type_ids"] #TODO: adjust to coco
        
        return {
            "ids": torch.LongTensor(ids),
            "mask": torch.LongTensor(mask),
            #'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long), #TODO: adjust to coco
        }

In [99]:
# texts=["Saw met applauded favourite deficient engrossed concealed and her","A series of escapades demonstrating the adage"]    
text="Saw met applauded favourite deficient engrossed concealed and her"
tokenizer_output = testing_tokenizer(text)

# print(tokenizer_output)

In [50]:
ids =tokenizer_output["ids"].to(Settings.device).unsqueeze(0)
mask=tokenizer_output["mask"].to(Settings.device).unsqueeze(0)
# target=tokenizer_output["target"].to(Settings.device).unsqueeze(0)

print(ids.shape)
print(mask.shape)
# print(targets.shape)
print("{}= num of texts, {} = num of word tokens per a text  ".format(ids.shape[0],ids.shape[1]))

torch.Size([1, 350])
torch.Size([1, 350])
1= num of texts, 350 = num of word tokens per a text  


350 = number of tokens per text (each text is padded or truncated to `max_len`).  

The meaning of the pooler output will be explained later.  

# Inference

In [59]:
# # inference on input
# output = model(input_ids=ids, attention_mask=mask, token_type_ids=token_type_ids)

output = model(ids, mask)
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0567,  0.0669, -0.0062,  ..., -0.1220, -0.0256, -0.0132],
         [ 0.0362,  0.1375, -0.1234,  ..., -0.0722,  0.0101,  0.2137],
         [ 0.0526,  0.1547, -0.2172,  ..., -0.2524, -0.0212,  0.1244],
         ...,
         [ 0.0210,  0.0784, -0.1106,  ..., -0.3599,  0.0994,  0.0343],
         [ 0.0210,  0.0784, -0.1106,  ..., -0.3599,  0.0994,  0.0343],
         [ 0.0210,  0.0784, -0.1106,  ..., -0.3599,  0.0994,  0.0343]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-1.2758e-02, -2.1751e-01, -2.0590e-01, -8.2266e-02,  1.1925e-01,
          1.8918e-01,  2.5004e-01, -9.0720e-02, -5.8725e-02, -1.7462e-01,
          2.3513e-01, -7.6291e-03, -8.3253e-02,  7.6222e-02, -1.3445e-01,
          4.9221e-01,  2.2376e-01, -4.4706e-01,  2.0791e-02, -2.1818e-02,
         -2.5792e-01,  7.0410e-02,  4.5894e-01,  3.0744e-01,  1.5259e-01,
          6.5493e-02, -1.3735e-01, -1.5289e-03,  1.7140e-01,  2.206

In [87]:
print(output.keys())
print(output["last_hidden_state"].shape)
print(output["pooler_output"].shape)

odict_keys(['last_hidden_state', 'pooler_output'])
torch.Size([1, 350, 768])
torch.Size([1, 768])


RoBERTa returns 2 outputs: `last_hidden_state` and `pooler_output`.  

In [90]:
# last_hidden_state & pooler output
last_hidden_state = output[0] 
pooler_output     = output[1]
print("shape:", last_hidden_state.shape)
print("shape:", pooler_output.shape)

shape: torch.Size([1, 350, 768])
shape: torch.Size([1, 768])


350 = number of tokens in a text, 768 = dimension of each word embedding.  

The meaning of the pooler output is explained below.  

## 1. Get CLS Token

![get cls token](https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png)

In [94]:
# .detach() returns a tensor without gradient tracking (it shares memory with the original; no copy is made)
cls_embeddings = last_hidden_state[:, 0, :].detach()

print("shape:", cls_embeddings.shape)
print("{} = num of texts, {} = dimension of text embedding".format(cls_embeddings.shape[0],cls_embeddings.shape[1]))
print("")
print(cls_embeddings)

shape: torch.Size([1, 768])
1 = num of texts, 768 = dimension of text embedding

tensor([[-5.6703e-02,  6.6869e-02, -6.1948e-03, -1.0451e-01,  6.5941e-02,
         -8.7096e-02, -6.3141e-02,  2.5446e-02,  6.8636e-02, -7.0920e-02,
         -2.4731e-02,  5.4767e-02,  4.4447e-02, -1.2294e-02,  7.7633e-02,
          2.3673e-02, -5.1255e-02,  1.7971e-02,  5.3081e-02, -4.0974e-02,
         -1.0916e-01,  1.2808e-02,  2.7878e-02,  9.9498e-02,  1.4707e-02,
          1.0225e-02,  1.0518e-01,  8.4490e-02, -5.7642e-02, -5.0169e-02,
         -2.7050e-02, -2.5857e-02,  1.8748e-02, -4.6684e-02,  4.2878e-02,
          7.8247e-02,  7.9697e-02,  1.0175e-02, -7.2391e-02, -1.2638e-02,
         -3.3176e-02,  6.5563e-02,  1.3120e-02,  1.4457e-02,  5.8819e-02,
          5.5081e-02,  2.7548e-02,  3.7400e-02, -3.1940e-02,  2.8428e-02,
          4.2821e-02,  7.3411e-02, -5.3415e-02,  1.3923e-02, -8.1630e-02,
          4.3756e-02, -2.8308e-03,  7.0701e-02,  1.5391e-02, -3.0722e-02,
          6.7207e-02, -1.2232e-

768 = dimension of the text embedding for each text

## 2. Pool RoBERTa Output

In [95]:
last_hidden_state.shape

torch.Size([1, 350, 768])

In [96]:
# apply average pooling over the token axis (note: padding positions are included in the mean)
pooled_embeddings = last_hidden_state.detach().mean(dim=1)

print("shape:", pooled_embeddings.shape)
print("")
print(pooled_embeddings)

shape: torch.Size([1, 768])

tensor([[ 2.0323e-02,  7.9437e-02, -1.1032e-01, -3.5171e-02,  2.2646e-01,
         -2.7637e-02, -7.7746e-02, -1.4751e-05,  9.8366e-02, -4.2963e-02,
         -5.9774e-02, -8.9828e-03, -1.1668e-01, -1.1539e-03, -5.1606e-02,
          4.5606e-01,  3.0563e-01,  1.5051e-01,  3.0782e-02,  2.1871e-01,
          5.1562e-02,  2.3083e-02,  4.0371e-01, -1.3139e-01,  1.3204e-01,
         -8.1892e-02,  2.3483e-01,  6.0469e-02,  1.4729e-02, -1.1617e-01,
          8.0583e-02,  6.4143e-02, -2.5171e-01, -2.0018e-01,  5.6423e-02,
          3.3663e-02, -1.1430e-02,  1.8541e-02,  3.3345e-01, -9.3171e-02,
         -1.1654e-01,  2.2614e-01,  1.1700e-01, -4.0023e-02,  1.1706e-01,
         -6.9963e-02,  3.8883e-02,  4.5444e-01, -8.1482e-04,  1.7505e-01,
          1.0154e-01, -1.3573e-01, -1.6017e-02,  3.5862e-03,  6.8536e-02,
          1.3999e-01, -5.0224e-02, -1.3244e-01, -1.3408e-01,  5.0157e-03,
          1.9436e-01, -1.8392e-01, -8.8699e-02, -2.2767e-02,  1.0509e-02,
         

In [97]:
pd.DataFrame(pooled_embeddings.cpu().numpy()).head()  # .cpu() is needed when the tensor lives on the GPU

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.020323,0.079437,-0.110324,-0.035171,0.226464,-0.027637,-0.077746,-1.5e-05,0.098366,-0.042963,...,0.085374,0.023791,-0.26204,0.052758,-0.014822,0.16207,0.219,-0.354602,0.098926,0.033357
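
The mean above runs over all 350 positions, so the padding tokens contribute to the result. A common variant weights each position by the attention mask so that only real tokens are averaged. The sketch below shows the idea on a tiny hand-made tensor; `masked_mean_pool` is a hypothetical helper, not something this notebook or the Transformers library defines.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring positions where the mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # (B, 1), avoids division by zero
    return summed / counts


# Toy input: 1 text, 4 positions, 2 hidden dims; the last two positions are padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0, 0]])

plain = hidden.mean(dim=1)               # padding included
masked = masked_mean_pool(hidden, mask)  # padding excluded

print(plain.tolist())   # [[5.5, 6.0]]
print(masked.tolist())  # [[2.0, 3.0]]
```

With the tensors from this notebook, the call would look like `masked_mean_pool(last_hidden_state.detach(), mask)`.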



Note: the pooler output is **not** equal to the `pooled_embeddings` we calculated.  
What is the pooler output?  
-> It takes the representation of the first token (`[CLS]` in BERT, `<s>` in RoBERTa) from the top layer of the encoder and feeds it through another dense layer followed by a tanh activation.  
reference: https://github.com/google-research/bert/blob/cc7051dc592802f501e8a6f71f8fb3cf9de95dc9/modeling.py#L224-L232  
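
To make the pooler computation concrete, here is a small sketch of the same operation on random data. The tensor and the `dense` layer are hypothetical stand-ins with random weights (and a hidden size of 8 instead of 768), so the numbers are meaningless; only the sequence of steps mirrors the pooler described above.

```python
import torch

torch.manual_seed(0)
hidden_size = 8  # small for illustration; roberta-base uses 768

# Hypothetical stand-ins for the encoder output and the pooler's dense layer.
last_hidden_state = torch.randn(2, 5, hidden_size)  # (batch, tokens, hidden)
dense = torch.nn.Linear(hidden_size, hidden_size)   # randomly initialised here

with torch.no_grad():
    first_token = last_hidden_state[:, 0]    # representation of the first token
    pooled = torch.tanh(dense(first_token))  # dense layer followed by tanh

print(pooled.shape)  # torch.Size([2, 8])
```

Because of the tanh, every value of the pooler output lies between -1 and 1, which is one quick way to tell it apart from a plain slice or mean of `last_hidden_state`.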

# Loading COCO  - (Work in Progress)
- Two of the best tools for loading COCO are the official COCO APIs and FiftyOne.
- The official COCO APIs provide basic functionality to load the dataset and compute dataset-wide evaluation.
- FiftyOne is recommended, as it provides functionality similar to the cocoapi, along with an API and GUI designed to make it easy to explore, analyze, and work with your data.

When importing labeled datasets in formats such as COCO, you may find it more natural to provide the `data_path` and `labels_path` parameters to independently specify the location of the source media on disk and the annotations file containing the labels to import:

In [63]:
COCO_anotation_PATH="C:/Users/caner/Desktop/Master's/TUM WiSe 2022/Explainable AI Praktikum/xai-praktikum/data/MS-COCO/annotations/captions_val2017.json"
COCO_data_PATH="C:/Users/caner/Desktop/Master's/TUM WiSe 2022/Explainable AI Praktikum/xai-praktikum/data/MS-COCO/"

In [10]:
%matplotlib inline
from pycocotools.coco import COCO
import numpy as np
import skimage.io as io
import matplotlib.pyplot as plt
import pylab
pylab.rcParams['figure.figsize'] = (8.0, 10.0)

dataDir= COCO_data_PATH
dataType='val2017'
annFile='{}/annotations/instances_{}.json'.format(dataDir,dataType)


coco_caps=COCO(COCO_anotation_PATH)


ModuleNotFoundError: No module named 'pycocotools'

In [None]:
# display COCO categories and supercategories
# categories live in the instances file (annFile), not in the captions file
coco = COCO(annFile)
cats = coco.loadCats(coco.getCatIds())
nms=[cat['name'] for cat in cats]
print('COCO categories: \n{}\n'.format(' '.join(nms)))

nms = set([cat['supercategory'] for cat in cats])
print('COCO supercategories: \n{}'.format(' '.join(nms)))

In [None]:
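# load and display caption annotations
# pick the first image listed in the captions file (img and I were undefined before)
img = coco_caps.loadImgs(coco_caps.getImgIds()[0])[0]
I = io.imread(img['coco_url'])  # fetched from the URL stored in the annotation
annIds = coco_caps.getAnnIds(imgIds=img['id'])
anns = coco_caps.loadAnns(annIds)
coco_caps.showAnns(anns)
plt.imshow(I); plt.axis('off'); plt.show()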

The cells above use the official COCO APIs.

The cells below use the FiftyOne APIs.

In [18]:
import fiftyone.utils.coco as fouc

print(COCO_anotation_PATH)
dataset_annotation = fouc.load_coco_detection_annotations(COCO_anotation_PATH)


C:/Users/caner/Desktop/Master's/TUM WiSe 2022/Explainable AI Praktikum/xai-praktikum/data/MS-COCO/annotations/captions_val2017.json


In [19]:
# View summary info about the dataset
dataset_annotation

# Print the first few samples in the dataset
# print(dataset_annotation)

({'description': 'COCO 2017 Dataset',
  'url': 'http://cocodataset.org',
  'version': '1.0',
  'year': 2017,
  'contributor': 'COCO Consortium',
  'date_created': '2017/09/01',
  'licenses': [{'url': 'http://creativecommons.org/licenses/by-nc-sa/2.0/',
    'id': 1,
    'name': 'Attribution-NonCommercial-ShareAlike License'},
   {'url': 'http://creativecommons.org/licenses/by-nc/2.0/',
    'id': 2,
    'name': 'Attribution-NonCommercial License'},
   {'url': 'http://creativecommons.org/licenses/by-nc-nd/2.0/',
    'id': 3,
    'name': 'Attribution-NonCommercial-NoDerivs License'},
   {'url': 'http://creativecommons.org/licenses/by/2.0/',
    'id': 4,
    'name': 'Attribution License'},
   {'url': 'http://creativecommons.org/licenses/by-sa/2.0/',
    'id': 5,
    'name': 'Attribution-ShareAlike License'},
   {'url': 'http://creativecommons.org/licenses/by-nd/2.0/',
    'id': 6,
    'name': 'Attribution-NoDerivs License'},
   {'url': 'http://flickr.com/commons/usage/',
    'id': 7,
    'n

In [20]:
import fiftyone as fo

'''
check https://voxel51.com/docs/fiftyone/api/fiftyone.utils.coco.html#fiftyone.utils.coco.COCODetectionDatasetImporter
for COCO import options.
Load via FiftyOne
'''
# Import the dataset
dataset = fo.Dataset.from_dir(
    dataset_type=fo.types.COCODetectionDataset,    #COCO Dataset
    labels_path=COCO_anotation_PATH,   #annotations path
    data_path=COCO_data_PATH, #images path 
)


"""
# FiftyOne Data Loading
dataset = fo.Dataset.from_archive(
            archive_path=COCO_data_PATH,
            data_path=f"{COCO_data_PATH}/val2017/",
            labels_path=f"{COCO_anotation_PATH}",
            dataset_type=fo.types.COCODetectionDataset         
          ) 


"""

 100% |███████████████| 5000/5000 [1.4s elapsed, 0s remaining, 3.7K samples/s]         




'\n# FiftyOne Data Loading\ndataset = fo.Dataset.from_archive(\n            archive_path=COCO_data_PATH,\n            data_path=f"{COCO_data_PATH}/val2017/",\n            labels_path=f"{COCO_anotation_PATH}",\n            dataset_type=fo.types.COCODetectionDataset         \n          ) \n\n\n'

In [43]:
'''The primary way of interacting with your dataset is through views. 
Every query you make will give you a different view into your dataset, like sorting by samples with the most number of objects'''
# View summary info about the dataset
print(dataset)

# Print the first few samples in the dataset
print(dataset.head())

# To visualize your dataset, launch the FiftyOne App:
# session = fo.launch_app(dataset)

Name:        2022.11.25.19.01.04
Media type:  image
Num samples: 5000
Persistent:  False
Tags:        []
Sample fields:
    id:       fiftyone.core.fields.ObjectIdField
    filepath: fiftyone.core.fields.StringField
    tags:     fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
[<Sample: {
    'id': '638102e109ffa2d8f76033ab',
    'media_type': 'image',
    'filepath': "C:\\Users\\caner\\Desktop\\Master's\\TUM WiSe 2022\\Explainable AI Praktikum\\xai-praktikum\\data\\MS-COCO\\val2017\\000000000139.jpg",
    'tags': [],
    'metadata': <ImageMetadata: {
        'size_bytes': None,
        'mime_type': None,
        'width': 640,
        'height': 426,
        'num_channels': None,
    }>,
}>, <Sample: {
    'id': '638102e109ffa2d8f76033ac',
    'media_type': 'image',
    'filepath': "C:\\Users\\caner\\Desktop\\Master's\\TUM WiSe 2022\\Explainable AI Praktik

In [99]:
# Converting it to dictionary for DataFrame handling
data_dict = dataset.to_dict()
data_captions = json.load(open(COCO_anotation_PATH))

data_dict

 100% |███████████████| 5000/5000 [1.1s elapsed, 0s remaining, 4.9K samples/s]         




{'name': '2022.11.25.19.01.04',
 'version': '0.18.0',
 'media_type': 'image',
 'sample_fields': {'id': 'fiftyone.core.fields.ObjectIdField',
  'filepath': 'fiftyone.core.fields.StringField',
  'tags': 'fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)',
  'metadata': 'fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)'},
 'info': {'description': 'COCO 2017 Dataset',
  'url': 'http://cocodataset.org',
  'version': '1.0',
  'year': 2017,
  'contributor': 'COCO Consortium',
  'date_created': '2017/09/01',
  'licenses': [{'url': 'http://creativecommons.org/licenses/by-nc-sa/2.0/',
    'id': 1,
    'name': 'Attribution-NonCommercial-ShareAlike License'},
   {'url': 'http://creativecommons.org/licenses/by-nc/2.0/',
    'id': 2,
    'name': 'Attribution-NonCommercial License'},
   {'url': 'http://creativecommons.org/licenses/by-nc-nd/2.0/',
    'id': 3,
    'name': 'Attribution-NonCommercial-NoDerivs License'},
   {'url': 'http://creativecommons.or

In [86]:
dataDir= COCO_data_PATH
annFile='{}/annotations/instances_{}.json'.format(dataDir,dataType)

annot_file = json.load(open(annFile))
annot_file["images"]

[{'license': 4,
  'file_name': '000000397133.jpg',
  'coco_url': 'http://images.cocodataset.org/val2017/000000397133.jpg',
  'height': 427,
  'width': 640,
  'date_captured': '2013-11-14 17:02:52',
  'flickr_url': 'http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg',
  'id': 397133},
 {'license': 1,
  'file_name': '000000037777.jpg',
  'coco_url': 'http://images.cocodataset.org/val2017/000000037777.jpg',
  'height': 230,
  'width': 352,
  'date_captured': '2013-11-14 20:55:31',
  'flickr_url': 'http://farm9.staticflickr.com/8429/7839199426_f6d48aa585_z.jpg',
  'id': 37777},
 {'license': 4,
  'file_name': '000000252219.jpg',
  'coco_url': 'http://images.cocodataset.org/val2017/000000252219.jpg',
  'height': 428,
  'width': 640,
  'date_captured': '2013-11-14 22:32:02',
  'flickr_url': 'http://farm4.staticflickr.com/3446/3232237447_13d84bd0a1_z.jpg',
  'id': 252219},
 {'license': 1,
  'file_name': '000000087038.jpg',
  'coco_url': 'http://images.cocodataset.org/val2017/000000

In [97]:
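# Getting the dataset labels (captions)
# NOTE: this pairs images and captions by list position; COCO caption annotations carry their own `image_id`, so the positions are not guaranteed to match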
captions = []

for i in range(0, len(annot_file["images"])):
    captions.append({"id": annot_file["images"][i]["id"], "caption":data_captions["annotations"][i]["caption"]})

print(captions)

[{'id': 397133, 'caption': 'A black Honda motorcycle parked in front of a garage.'}, {'id': 37777, 'caption': 'A Honda motorcycle parked in a grass driveway'}, {'id': 252219, 'caption': 'An office cubicle with four different types of computers.'}, {'id': 87038, 'caption': 'A small closed toilet in a cramped space.'}, {'id': 174482, 'caption': 'Two women waiting at a bench next to a street.'}, {'id': 403385, 'caption': 'A black Honda motorcycle with a dark burgundy seat.'}, {'id': 6818, 'caption': 'A tan toilet and sink combination in a small room.'}, {'id': 480985, 'caption': 'The home office space seems to be very cluttered.'}, {'id': 458054, 'caption': 'A beautiful dessert waiting to be shared by two people'}, {'id': 331352, 'caption': 'A woman sitting on a bench and a woman standing waiting for the bus.'}, {'id': 296649, 'caption': 'A woman sitting on a bench in the middle of the city'}, {'id': 386912, 'caption': 'This is an advanced toilet with a sink and control panel.'}, {'id': 5
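
Pairing `images[i]` with `annotations[i]` assumes that both lists are aligned, but each COCO caption annotation stores the id of its image in `image_id`, and an image can have several captions. A possible alternative, sketched below on a tiny hand-written sample (the ids and captions are made up), is to group captions by `image_id` first.

```python
from collections import defaultdict

# Tiny hand-written sample in the COCO captions layout (values are made up).
sample = {
    "images": [{"id": 10}, {"id": 20}],
    "annotations": [
        {"image_id": 20, "id": 1, "caption": "a dog on a sofa"},
        {"image_id": 10, "id": 2, "caption": "a red bus"},
        {"image_id": 20, "id": 3, "caption": "a sleeping dog"},
    ],
}

# Collect every caption under the image it belongs to.
captions_by_image = defaultdict(list)
for ann in sample["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

# One record per image, in the order of the images list.
records = [
    {"id": img["id"], "captions": captions_by_image[img["id"]]}
    for img in sample["images"]
]
print(records)
```

In the notebook, `sample` would be replaced by `data_captions`, since the captions file contains both the `images` and the `annotations` lists.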

In [98]:
# prepare dataset


df_train = pd.DataFrame(data_dict["samples"])

train_dataset = TrainValidDataset(df_train, tokenizer, Settings.max_len)
train_loader  = DataLoader(train_dataset, batch_size=Settings.batch_size,
                          shuffle=True, num_workers=8, pin_memory=True)

KeyError: 'excerpt'
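
The `KeyError` arises because `TrainValidDataset` reads the `excerpt` and `target` columns, which the FiftyOne sample dictionaries do not contain. One possible way forward, shown here only as a sketch, is to build the frame from the captions list instead and rename the caption column. The sample captions are made up, and the constant `target` is a placeholder assumption, since captions carry no numeric label.

```python
import pandas as pd

# Made-up captions in the same shape as the `captions` list built earlier.
captions = [
    {"id": 10, "caption": "a red bus"},
    {"id": 20, "caption": "a dog on a sofa"},
]

# Rename the text column to the name the dataset class expects.
df_train = pd.DataFrame(captions).rename(columns={"caption": "excerpt"})
df_train["target"] = 0.0  # placeholder: captions carry no numeric label

print(df_train.columns.tolist())  # ['id', 'excerpt', 'target']
```

A frame shaped like this has the two columns that `TrainValidDataset` looks up, so the constructor would no longer fail on the missing key.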

## Inference on COCO

In [None]:
# make mini batch data
batch = next(iter(train_loader))
ids = batch["ids"].to(Settings.device)
mask = batch["mask"].to(Settings.device)
targets = batch["targets"].to(Settings.device) # targets are the labels read from df["target"], not token type ids


In [None]:
# # inference on input
# output = model(input_ids=ids, attention_mask=mask, token_type_ids=token_type_ids)
output = model(ids, mask)
output