## SciBERT
- This notebook is based on Chris McCormick's <a href="https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/">BERT Word Embeddings Tutorial</a>.
- It is adapted to work with <a href="https://github.com/allenai/scibert">allenai/scibert</a>.
- The input text data format:

  ```
  Text Context C4 C1
  ```
- Enrique’s comments, guidelines:
    - https://huggingface.co/allenai/scibert_scivocab_uncased
    - https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
    - Evaluation scores
    - Later Ensemble models, train the ranker
- Mihai’s guidelines
    - The model produces an embedding for each piece of text; we use the text plus 2 candidates (G1 and G2), giving 3 embeddings:
    - Embedding(Text)
    - Embedding(G1)
    - Embedding(G2)
    - cosine(Embedding(Text), Embedding(G1)) > cosine(Embedding(Text), Embedding(G2))
    - Forward pass embedding

##### Revised algorithm
- The data consists of 3 features:
  - [CLS] Text
  - [CLS] C1
  - [CLS] C2
- forward([CLS] Text) => get embedding of [CLS]
- forward([CLS] C1) => get embedding of [CLS]
- forward([CLS] C2) => get embedding of [CLS]
- cosine([CLS]_Text, [CLS]_C1) > cosine([CLS]_Text, [CLS]_C2)
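
The comparison in the last step can be sketched without loading a model. The snippet below is a minimal illustration, assuming the three [CLS] embeddings have already been obtained from forward passes. The toy vectors and the helper names `cosine` and `prefers_c1` are made up for this example and are not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prefers_c1(text_emb, c1_emb, c2_emb):
    """True when C1 is closer to the text than C2 under cosine similarity."""
    return cosine(text_emb, c1_emb) > cosine(text_emb, c2_emb)

# Toy 3-dimensional stand-ins for the 768-dimensional [CLS] vectors (hypothetical values).
text_emb = [1.0, 0.0, 1.0]
c1_emb = [1.0, 0.0, 0.5]
c2_emb = [0.0, 1.0, 0.0]

print(prefers_c1(text_emb, c1_emb, c2_emb))
```

With real embeddings the same rule applies unchanged; only the vectors come from the model instead of being written by hand.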

##### Example
This notebook only includes one example of the format:
```
Text Context C4 C1
```
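
As a rough illustration of this format, the sketch below parses two rows with Python's `csv` module. It assumes the column names `Text`, `Context`, `1`, and `2` that appear in the dataset files loaded later in this notebook. The in-memory string stands in for a real file.

```python
import csv
import io

# Two rows copied from the dataset preview shown later in this notebook.
raw = """Text,Context,1,2
for running simulations,for running simulations,simulation,simulation run length
for running simulations,for running simulations,simulation,simulation of disease
"""

# Each row becomes a (text, candidate 1, candidate 2) triplet.
triplets = [
    (row["Text"], row["1"], row["2"])
    for row in csv.DictReader(io.StringIO(raw))
]

for text, c1, c2 in triplets:
    print(f"text={text!r}  c1={c1!r}  c2={c2!r}")
```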

## Citation
Chris McCormick and Nick Ryan. (2019, May 14). *BERT Word Embeddings Tutorial*. Retrieved from http://www.mccormickml.com

## Installation and preprocessing

In [1]:
!pip install torch
!pip install transformers
!pip install pytorch-metric-learning
!pip install accelerate

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/Colab\ Notebooks/skema/ASKEM\ Grounding\ Annotations/dataset
!ls

/content/drive/MyDrive/Colab Notebooks/skema/ASKEM Grounding Annotations/dataset
ged_5Febbuckymodel_webdocs.csv	ged_5Febs00285-015-0906-z.csv
ged_5FebCHIME_SIR.csv		transformers-cache
ged_5FebCHIME_SVIIvR.csv


In [86]:
import torch
from transformers import BertTokenizer, BertModel
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import numpy as np
import os
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
#logging.basicConfig(level=logging.INFO)

import matplotlib.pyplot as plt
%matplotlib inline

import accelerate
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device  # device selected by accelerate (GPU if available, otherwise CPU)
path = "/content/drive/MyDrive/Colab Notebooks/skema/ASKEM Grounding Annotations/dataset"
CACHE_DIR=os.path.join(path, 'transformers-cache')

# Load pre-trained model tokenizer (vocabulary)
# tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', do_lower_case=True)
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', do_lower_case=True)

# tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
# model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

# Load pre-trained model (weights)
# model = BertModel.from_pretrained('allenai/scibert_scivocab_uncased',
#                                   output_hidden_states = True,  use_cache=True,# Whether the model returns all hidden-states.
#                                   cache_dir=CACHE_DIR, low_cpu_mem_usage=True, offload_state_dict=True )
model = AutoModelForMaskedLM.from_pretrained('allenai/scibert_scivocab_uncased',
                                             output_hidden_states=True,  # Whether the model returns all hidden-states.
                                             use_cache=True,
                                             cache_dir=CACHE_DIR, low_cpu_mem_usage=True, offload_state_dict=True)
import tempfile

offload_dir='/content/offload'
os.makedirs(offload_dir, exist_ok=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, max_shard_size="200MB")
    print('Temp Dir Path:', tmp_dir)
    print(sorted(os.listdir(tmp_dir)))
    # model = BertModel.from_pretrained(tmp_dir, low_cpu_mem_usage=True,offload_folder=offload_dir,
    #                               output_hidden_states = True,  use_cache=True,# Whether the model returns all hidden-states.
    #                               cache_dir=CACHE_DIR, offload_state_dict=True )
    model = AutoModelForMaskedLM.from_pretrained(tmp_dir, low_cpu_mem_usage=True, offload_folder=offload_dir,
                                  output_hidden_states=True,  # Whether the model returns all hidden-states.
                                  use_cache=True,
                                  cache_dir=CACHE_DIR, offload_state_dict=True)

Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Temp Dir Path: /tmp/tmp3m7t71ef
['config.json', 'generation_config.json', 'pytorch_model-00001-of-00003.bin', 'pytorch_model-00002-of-00003.bin', 'pytorch_model-00003-of-00003.bin', 'pytorch_model.bin.index.json']


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



In [93]:
path = "/content/drive/MyDrive/Colab Notebooks/skema/ASKEM Grounding Annotations/dataset"
files = [os.path.join(path, file) for file in os.listdir(path) if file.startswith("ged_5Feb") and file.endswith(".csv")]
df = pd.read_csv(files[0])
df.head(), files

(   Unnamed: 0                     Text                  Context           1  \
 0           0  for running simulations  for running simulations  simulation   
 1           1  for running simulations  for running simulations  simulation   
 2           2  for running simulations  for running simulations  simulation   
 3           3  for running simulations  for running simulations  simulation   
 4           4  for running simulations  for running simulations  simulation   
 
                                                   2  
 0                             simulation run length  
 1                           simulation of time step  
 2                             simulation of disease  
 3                   simulation software application  
 4  vaccination efficacy for simulator configuration  ,
 ['/content/drive/MyDrive/Colab Notebooks/skema/ASKEM Grounding Annotations/dataset/ged_5Febbuckymodel_webdocs.csv',
  '/content/drive/MyDrive/Colab Notebooks/skema/ASKEM Grounding Annota

In [94]:
df[:10]

Unnamed: 0.1,Unnamed: 0,Text,Context,1,2
0,0,for running simulations,for running simulations,simulation,simulation run length
1,1,for running simulations,for running simulations,simulation,simulation of time step
2,2,for running simulations,for running simulations,simulation,simulation of disease
3,3,for running simulations,for running simulations,simulation,simulation software application
4,4,for running simulations,for running simulations,simulation,vaccination efficacy for simulator configuration
5,5,for running simulations,for running simulations,simulation,simulation of organism
6,6,for running simulations,for running simulations,simulation,simulation of control measure participant
7,7,for running simulations,for running simulations,simulation,simulation time series
8,8,for running simulations,for running simulations,simulation,simulation time step value
9,9,for running simulations,for running simulations,simulation,simulation of exposed organism


In [95]:
inputs = tokenizer(df['Text'].tolist()[150], return_tensors = 'pt', padding=True)
outputs = model(**inputs, output_hidden_states=True)
data_last_hidden_states = outputs.hidden_states[-1]
data_cls_embeddings = data_last_hidden_states[:, 0, :]  # [CLS] vector from the last layer; hidden_states[0] is the input embedding layer

In [96]:
data_cls_embeddings.shape, outputs.hidden_states[0].shape, df['Text'].tolist()[150], df['1'].tolist()[150], df['2'].tolist()[150]

(torch.Size([1, 768]),
 torch.Size([1, 5, 768]),
 'for running simulations',
 'simulation',
 'simulation of exposed organism')

In [101]:
def get_cls_embeddings(tokenizer, model, data, is_hidden_states_embeddings = False):
  """Return the last-layer [CLS] embedding for `data`, or the full last-layer hidden states if requested."""
  inputs = tokenizer(data, return_tensors = 'pt', padding=True)
  data_outputs = model(**inputs, output_hidden_states=True)
  data_last_hidden_states = data_outputs.hidden_states[-1]  # shape: (batch, tokens, hidden)
  data_cls_embeddings = data_last_hidden_states[:, 0, :]  # [CLS] is the first token
  if is_hidden_states_embeddings:
    return data_last_hidden_states
  return data_cls_embeddings

In [102]:
def get_triplet_embeddings(tokenizer, model, texts, c1, c2, is_hidden_states_embeddings = False):
  # text = tokenizer.cls_token +" "+ df.at[0,'Text']
  # context = tokenizer.cls_token+" "+df.at[0,"Context"]
  # C1 = tokenizer.cls_token+" "+df.at[0,"1"]
  # C2 = tokenizer.cls_token+" "+df.at[0,"2"]
  text_cls_embeddings = get_cls_embeddings(tokenizer, model, texts, is_hidden_states_embeddings)
  c1_cls_embeddings = get_cls_embeddings(tokenizer, model, c1, is_hidden_states_embeddings)
  c2_cls_embeddings = get_cls_embeddings(tokenizer, model, c2, is_hidden_states_embeddings)
  return text_cls_embeddings, c1_cls_embeddings, c2_cls_embeddings

In [103]:
def get_this_triplet_loss(text_cls_embeddings, c1_cls_embeddings, c2_cls_embeddings ):
  triplet_loss = torch.nn.TripletMarginLoss(margin=1.0, p=2)
  this_triplet_loss = triplet_loss(text_cls_embeddings, c1_cls_embeddings, c2_cls_embeddings)
  return this_triplet_loss

In [104]:
text, c1, c2 = df["Text"][10],df["1"][10], df["2"][10]
text_cls_embeddings, c1_cls_embeddings, c2_cls_embeddings = get_triplet_embeddings(tokenizer, model, text, c1, c2, False)
text, c1, c2

('for running simulations',
 'simulation',
 'simulation of organism awaiting effect of control measure')

In [97]:
text_cls_embeddings = get_cls_embeddings(tokenizer, model,  df['Text'].tolist()[150], False)
c1_cls_embeddings = get_cls_embeddings(tokenizer, model,  df['1'].tolist()[150], False)
c2_cls_embeddings = get_cls_embeddings(tokenizer, model,  df['2'].tolist()[150], False)

In [100]:
get_this_triplet_loss(text_cls_embeddings, c1_cls_embeddings, c2_cls_embeddings )

tensor(0.2664, grad_fn=<MeanBackward0>)

In [98]:
text_cls_embeddings.shape, c1_cls_embeddings.shape, c2_cls_embeddings.shape

(torch.Size([1, 768]), torch.Size([1, 768]), torch.Size([1, 768]))

In [None]:
torch.eq(text_cls_embeddings, c1_cls_embeddings), torch.eq(text_cls_embeddings, c2_cls_embeddings)

In [105]:
this_triplet_loss = get_this_triplet_loss(text_cls_embeddings, c1_cls_embeddings, c2_cls_embeddings )
this_triplet_loss

tensor(1.4133, grad_fn=<MeanBackward0>)

### Types of Triplets with TripletLosses:
Reference: [triplet-ranking-loss](https://medium.com/@harsh.kumar.cse20/understanding-pairwise-ranking-loss-and-triplet-ranking-loss-8c10073c13da)

Triplet Loss:

$L(r_a, r_p, r_n) = \max(0, m + d(r_a, r_p) - d(r_a, r_n))$


Here $r_a$, $r_p$, and $r_n$ are the representations of the anchor, positive, and negative examples, $d$ is a distance function, and $m$ is the margin. Depending on how the two distances compare, a triplet falls into one of three categories:
1. easy triplets: triplets with a loss of 0, i.e.

    **d(ra,rn) > d(ra,rp) + m.**
2. semi-hard triplets: triplets where the negative is farther from the anchor than the positive, but which still have a positive loss, i.e.

    **d(ra,rp) < d(ra,rn) < d(ra,rp) + m.**

3. hard triplets: triplets where the negative is closer to the anchor than the positive, i.e.

    **d(ra,rn) < d(ra,rp).**
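
The three categories can be checked with plain numbers. The sketch below is a minimal illustration that works directly on precomputed distances. The function names `triplet_loss` and `triplet_category` and the example distances are hypothetical, and assigning the boundary cases (equal distances) to "easy" or "hard" is a choice made here, since the definitions above use strict inequalities.

```python
def triplet_loss(d_ap, d_an, margin=1.0):
    """Triplet margin loss from the anchor-positive and anchor-negative distances."""
    return max(0.0, margin + d_ap - d_an)

def triplet_category(d_ap, d_an, margin=1.0):
    """Label a triplet as 'easy', 'semi-hard', or 'hard'."""
    if d_an >= d_ap + margin:
        return "easy"       # loss is 0 (boundary d_an == d_ap + margin included here)
    if d_an > d_ap:
        return "semi-hard"  # negative is farther than positive, but within the margin
    return "hard"           # negative is at least as close as the positive

# Hypothetical (d_ap, d_an) pairs, one per category, with margin m = 1.0.
for d_ap, d_an in [(0.5, 2.0), (0.5, 1.0), (1.0, 0.5)]:
    print(d_ap, d_an, triplet_category(d_ap, d_an), triplet_loss(d_ap, d_an))
```

Only easy triplets contribute zero loss. Both semi-hard and hard triplets have a positive loss, and a hard triplet's loss is at least the margin.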

In [110]:
def get_triplet_loss_reults(df):

  triplet_loss_dict = []
  easy_triplets = []
  semi_hard_triplets = []
  hard_triplets = []
  margin = 1.0
  counter = 1

  for i, row in df.iterrows():

    counter += 1
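    # Note: the tokenizer adds [CLS] and [SEP] on its own, so prepending cls_token here
    # yields two leading [CLS] tokens in the encoded input.
    text = tokenizer.cls_token +" "+ df.at[i,'Text']
    context = tokenizer.cls_token+" "+df.at[i,"Context"]
    c1 = tokenizer.cls_token+" "+df.at[i,"1"]
    c2 = tokenizer.cls_token+" "+df.at[i,"2"]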
    text_cls_embeddings, c1_cls_embeddings, c2_cls_embeddings = get_triplet_embeddings(tokenizer, model, text, c1, c2, False)
    this_triplet_loss = get_this_triplet_loss(text_cls_embeddings, c1_cls_embeddings, c2_cls_embeddings )

    positive_distance = (text_cls_embeddings - c1_cls_embeddings).pow(2).sum().sqrt()
    #torch.cdist(txt_last_hidden_states, c1_last_hidden_states)
    negative_distance = (text_cls_embeddings - c2_cls_embeddings).pow(2).sum().sqrt()
    distance_with_margin = torch.add(positive_distance, margin)

    # easy triplet
    if torch.gt(negative_distance, distance_with_margin):
      if counter <=50:
        print("Easy Triplet : ", df.at[i,'Text'], df.at[i,'1'], df.at[i,'2'],this_triplet_loss.item() )
      #print(torch.gt(negative_distance, distance_with_margin))
      easy_triplets.append({"text":df.at[i,'Text'], "c1":df.at[i,'1'], "c2":df.at[i,'2'], "triplet_loss":this_triplet_loss.item()})
    # semi-hard triplet
    elif torch.lt(positive_distance, negative_distance) and torch.lt(negative_distance, distance_with_margin):
      if counter <=50:
        print("Semi-hard Triplet : ", df.at[i,'Text'], df.at[i,'1'], df.at[i,'2'],this_triplet_loss.item() )
      #print(torch.lt(positive_distance, negative_distance) ,  torch.lt(negative_distance, distance_with_margin))
      semi_hard_triplets.append({"text":df.at[i,'Text'], "c1":df.at[i,'1'], "c2":df.at[i,'2'], "triplet_loss":this_triplet_loss.item()})

    # hard-triplet
    elif torch.lt(negative_distance, positive_distance):
      if counter <=50:
        print("Hard Triplet : ", df.at[i,'Text'], df.at[i,'1'], df.at[i,'2'],this_triplet_loss.item() )
      #print(torch.lt(positive_distance, negative_distance))
      hard_triplets.append({"text":df.at[i,'Text'], "c1":df.at[i,'1'], "c2":df.at[i,'2'], "triplet_loss":this_triplet_loss.item()})

    triplet_loss_dict.append({"text":df.at[i,'Text'], "c1":df.at[i,'1'], "c2":df.at[i,'2'], "triplet_loss":this_triplet_loss.item()})
  return triplet_loss_dict, easy_triplets, semi_hard_triplets, hard_triplets

In [111]:
def save_files(this_dictionary, path, filename):
  pd.DataFrame.from_dict(this_dictionary).to_csv(os.path.join(path, filename))

In [118]:
for file in files[:3]:
  df = pd.read_csv(file)
  triplet_loss_dict, easy_triplets, semi_hard_triplets, hard_triplets = get_triplet_loss_reults(df[:200])
  print("File: %s\nTotal, easy, semi-hard and hard triplet counts for this dataset: %s, %s, %s, %s"
            %(file, len(triplet_loss_dict), len(easy_triplets), len(semi_hard_triplets), len(hard_triplets)))
  pd.DataFrame.from_dict(triplet_loss_dict).to_csv(os.path.join(path, "triplet_loss_"+os.path.basename(file)[:-4]+".csv"))
  pd.DataFrame.from_dict(easy_triplets).to_csv(os.path.join( path, "easy_triplets_"+os.path.basename(file)[:-4]+".csv"))
  pd.DataFrame.from_dict(semi_hard_triplets).to_csv(os.path.join( path, "semi_hard_triplets_"+os.path.basename(file)[:-4]+".csv"))
  pd.DataFrame.from_dict(hard_triplets).to_csv(os.path.join( path, "hard_triplets_"+os.path.basename(file)[:-4]+".csv"))

Hard Triplet :  for running simulations simulation simulation run length 1.7775297164916992
Hard Triplet :  for running simulations simulation simulation of time step 4.07520866394043
Hard Triplet :  for running simulations simulation simulation of disease 1.7122650146484375
Hard Triplet :  for running simulations simulation simulation software application 1.0747556686401367
Hard Triplet :  for running simulations simulation vaccination efficacy for simulator configuration 1.6077852249145508
Hard Triplet :  for running simulations simulation simulation of organism 1.588623046875
Semi-hard Triplet :  for running simulations simulation simulation of control measure participant 0.7704906463623047
Hard Triplet :  for running simulations simulation simulation time series 1.9276981353759766
Hard Triplet :  for running simulations simulation simulation time step value 2.1971235275268555
Semi-hard Triplet :  for running simulations simulation simulation of exposed organism 0.5775489807128906
H