<a href="https://colab.research.google.com/github/rprimi/colB5BERT/blob/main/python/colB5BERT_mo_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting BERT embeddings and calculating cosine similarity between item and post tokens
Ricardo Primi
Final Project, UNICAMP, Course IA368 Deep Learning Applied to Search

### General set-up



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!git clone https://github.com/rprimi/colB5BERT.git
%cd /content/colB5BERT
!git pull

Cloning into 'colB5BERT'...
remote: Enumerating objects: 226, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 226 (delta 24), reused 18 (delta 18), pack-reused 196
Receiving objects: 100% (226/226), 31.75 MiB | 15.39 MiB/s, done.
Resolving deltas: 100% (139/139), done.
/content/colB5BERT
Already up to date.


In [None]:
!pip3 install transformers

In [None]:
import os
import pandas as pd
import numpy as np
import textwrap
import pickle
import h5py
import logging

from transformers import BertModel, BertTokenizer
from transformers import RobertaModel, RobertaTokenizer

import torch
from torch.nn.functional import cosine_similarity
from tqdm import tqdm


Modules `vsm`, `utils` and `sst` are from Stanford's CS224u https://github.com/cgpotts/cs224u

In [None]:
import sys
sys.path.append('/content/colB5BERT/python/')

import utils
import vsm
import sst


In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Fri Jun 23 03:17:14 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    43W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Data

In [None]:
b5_data = pd.read_csv('/content/colB5BERT/data/db_textos.splitted.csv', sep=';')
b5_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11537 entries, 0 to 11536
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              11537 non-null  int64 
 1   id_divisao      11537 non-null  int64 
 2   texto_dividido  11537 non-null  object
dtypes: int64(2), object(1)
memory usage: 270.5+ KB


In [None]:
b5_data

Unnamed: 0,id,id_divisao,texto_dividido
0,100,1,ajudando porque a zuzu é um amor e tem a voz f...
1,100,2,vai ter share sim e se reclamar dou share mais...
2,100,3,"quanto parece , A , $NUMBER$ . MvC $NUMBER$ CL..."
3,100,4,$NUMBER$ jeitos de dar entry então é só sucess...
4,100,5,quiser se vira Esse livro é de co-autoria de $...
...,...,...,...
11532,999,6,"amo muito ! < $NUMBER$ < $NUMBER$ "" Fique por ..."
11533,999,7,pai ! Feliz aniversário ! < $NUMBER$ < $NUMBER...
11534,999,8,rei do $NAME$ Club de $NAME$ Oeste : $NAME$ $N...
11535,999,9,todo tipo de público . A realização do projeto...


In [None]:
base_itens_b5 = pd.read_excel('/content/colB5BERT/data/base_itens.xlsx')
base_itens_b5
base_itens_b5.info()
# base_itens_b5['item_pt_text'].tolist()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 415 entries, 0 to 414
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ord0_index    415 non-null    int64  
 1   test          415 non-null    object 
 2   coditem       415 non-null    object 
 3   item_pt_text  415 non-null    object 
 4   item_en_text  415 non-null    object 
 5   domain        413 non-null    object 
 6   facet         413 non-null    object 
 7   pole          415 non-null    int64  
 8   seman_pairs   273 non-null    float64
dtypes: float64(1), int64(2), object(6)
memory usage: 29.3+ KB


### Loading Transformer models
Specify the pretrained weights, then load the matching tokenizer and model:

In [None]:
bert_weights_name = 'neuralmind/bert-base-portuguese-cased'
bert_tokenizer = BertTokenizer.from_pretrained(bert_weights_name)
bert_model = BertModel.from_pretrained(bert_weights_name)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Getting BERT embeddings

In [None]:
def tokenize_texts(bert_tokenizer, texts):
    tokenized_texts = []
    for text in texts:
        # Truncate to BERT's 512-token limit inside the tokenizer, so the closing
        # [SEP] token is kept (slicing the ids afterwards would cut it off)
        encoded_text = bert_tokenizer.encode(
            text, add_special_tokens=True, truncation=True, max_length=512
        )
        tokenized_texts.append(encoded_text)
    return tokenized_texts


def get_bert_embeddings(bert_model, examples, layers):
    # Move model to GPU if available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    bert_model = bert_model.to(device)

    embeddings = {layer: [] for layer in layers}
    for ex_ids in examples:
        # Convert data to tensor and move to GPU
        ex_ids_tensor = torch.tensor([ex_ids]).to(device)
        with torch.no_grad():
            # Output includes 'last_hidden_state', 'pooler_output', 'hidden_states'
            output = bert_model(ex_ids_tensor, output_hidden_states=True)
            hidden_states = output.hidden_states
            for layer in layers:
                # Verify layer index is valid
                if layer < 0 or layer >= len(hidden_states):
                    print(f"Invalid layer {layer}")
                else:
                    # Hidden states is a tuple. Indexing into it gives a tensor of shape
                    # (batch_size, sequence_length, hidden_size). Since batch_size is 1,
                    # we remove the batch dimension.
                    layer_output = hidden_states[layer].squeeze(0)
                    # Convert back to CPU for further processing or storage
                    embeddings[layer].append(layer_output.to('cpu'))
    return embeddings

In [None]:
# Specify the hidden-state layers to extract (0 = embedding output, 12 = last layer).
# Other settings tried: [6, 9, 11, 12] and [6]
layers = [11]

# Number of post chunks to embed
len(b5_data['texto_dividido'].tolist())

embeddings_posts = get_bert_embeddings(bert_model, tokenize_texts(bert_tokenizer, b5_data['texto_dividido'].tolist()), layers)
embeddings_itens = get_bert_embeddings(bert_model, tokenize_texts(bert_tokenizer, base_itens_b5['item_pt_text'].tolist()), layers)



In [None]:
def save_embeddings_to_disk(embeddings, filename):
    # Each layer holds one (n_tokens, hidden_size) array per text. The token counts
    # differ, so store the arrays in an explicit object array rather than relying
    # on NumPy's implicit conversion of a ragged list.
    numpy_embeddings = {}
    for layer, tensors in embeddings.items():
        arr = np.empty(len(tensors), dtype=object)
        for i, t in enumerate(tensors):
            arr[i] = t.numpy()
        numpy_embeddings[str(layer)] = arr

    # We use ** to unpack the dictionary into keyword arguments.
    # Note that np.savez appends '.npz' to the filename.
    np.savez(filename, **numpy_embeddings)

def load_embeddings_from_disk(filename):
    # Object arrays are pickled inside the .npz file, so allow_pickle is required
    with np.load(filename, allow_pickle=True) as data:
        embeddings = {layer: data[layer] for layer in data.files}
    return embeddings

filename = "/content/drive/MyDrive/colB5BERT/embeddings_itens"
save_embeddings_to_disk(embeddings=embeddings_itens, filename=filename)

filename = "/content/drive/MyDrive/colB5BERT/embeddings_posts"
save_embeddings_to_disk(embeddings=embeddings_posts, filename=filename)




In [None]:
def calculate_cosine_similarity2(list_A, list_B, topk=5, batch_size=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    result = []

    # Move list_A tensors to GPU
    list_A = [tensor.to(device) for tensor in list_A]

    # Process batch_size tensor_A at a time
    for i in tqdm(range(0, len(list_A), batch_size), desc='Processing', dynamic_ncols=True):
        batch_A = list_A[i:i+batch_size]
        batch_A = torch.nn.utils.rnn.pad_sequence(batch_A, batch_first=True).unsqueeze(2)  # [batch_size, max_n_tokens_A, 1, 768]

        for tensor_B in list_B:
            tensor_B = tensor_B.to(device).unsqueeze(0).unsqueeze(0)  # [1, 1, n_tokens_B, 768]
            similarity = torch.nn.functional.cosine_similarity(batch_A, tensor_B, dim=-1)  # [batch_size, max_n_tokens_A, n_tokens_B]

            # Rows of batch_A added by padding are all zeros, so their similarities are 0

            current_topk = min(topk, similarity.size(-1))  # Ensure topk isn't larger than the number of tokens in tensor_B

            topk_values, topk_indices = torch.topk(similarity, current_topk, dim=-1)  # [batch_size, max_n_tokens_A, current_topk]
            result.append((topk_values.cpu().numpy(), topk_indices.cpu().numpy()))  # Move topk_values and topk_indices to CPU
            del tensor_B  # Delete tensor_B from GPU memory

    return result


The function `calculate_cosine_similarity2` first loops over `list_A` in batches. For each batch of `list_A`, it computes the cosine similarity with each element of `list_B`, one by one. The result list is therefore ordered so that the first batch of `list_A` is combined with every element of `list_B` in turn, then the second batch of `list_A` with every element of `list_B`, and so forth.

To elaborate:

- Take the first batch of `list_A`, compute cosine similarity with the first tensor of `list_B`, append to the result.
- Still with the first batch of `list_A`, compute cosine similarity with the second tensor of `list_B`, append to the result.
- This process repeats for all tensors in `list_B`.
- Then the next batch of `list_A` is processed similarly.

Thus, the function returns every combination of a batch from `list_A` with an element of `list_B`: each batch is paired with all of `list_B` before the next batch is processed, so `len(result)` equals the number of batches times `len(list_B)`.
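
The ordering can be sketched with a small pure-Python helper that follows the loop order of `calculate_cosine_similarity2` (outer loop over batches of `list_A`, inner loop over `list_B`). The names `result_position` and `locate` are hypothetical, introduced here only for illustration; the sizes are the ones reported by `info()` earlier in this notebook.

```python
import math

def result_position(batch_idx, b_idx, n_b):
    """Index in `result` of the entry for batch `batch_idx` of list_A and element `b_idx` of list_B."""
    return batch_idx * n_b + b_idx

def locate(position, n_b, batch_size):
    """Inverse mapping: return (index of the first list_A element in the batch, list_B index)."""
    batch_idx, b_idx = divmod(position, n_b)
    return batch_idx * batch_size, b_idx

# Sizes used in this notebook: 415 items, 11537 post chunks, batches of 8 items
n_a, n_b, batch_size = 415, 11537, 8
n_batches = math.ceil(n_a / batch_size)   # the last batch is smaller than batch_size
n_results = n_batches * n_b               # expected length of the result list

print(n_batches, n_results)
print(locate(result_position(1, 0, n_b), n_b, batch_size))
```

Under these assumptions, the entry at position `p` of the result covers items `start` to `start + batch_size - 1` (fewer in the last batch) against post `b_idx`, where `(start, b_idx) = locate(p, n_b, batch_size)`.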

In [None]:
cosim_top5_L11 = calculate_cosine_similarity2(embeddings_itens[11], embeddings_posts[11], topk = 5, batch_size=8)


In [None]:
import pickle

with open('/content/drive/MyDrive/colB5BERT/cosim_top5_L11.pkl', 'wb') as f:
    pickle.dump(cosim_top5_L11, f)

### Restructuring data

In [42]:
# Load the pickle file
with open("/content/drive/MyDrive/colB5BERT/cosim_top5_L11.pkl", "rb") as file:
    data = pickle.load(file)


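# data is a list with one (topk_values, topk_indices) tuple per (item batch, post) pair
type(data)
len(data)          # ceil(415 / 8) * 11537 = 52 * 11537 = 599924

type(data[0])      # tuple
data[0][0].shape   # (batch_size, max_n_tokens_A, topk)
data[0][1].shape   # same shape, holding the indices of the matched post tokens

data[0][0]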


array([[[0.9997802 , 0.99976826, 0.10771918, 0.10533468, 0.09761219],
        [0.7115822 , 0.7033297 , 0.7005218 , 0.68985844, 0.68875575],
        [0.7441544 , 0.72323906, 0.7071709 , 0.7032442 , 0.6997534 ],
        [0.7313146 , 0.6971807 , 0.6952636 , 0.6908861 , 0.6886    ],
        [0.6693454 , 0.66504985, 0.65419996, 0.6516441 , 0.64172846],
        [0.66691065, 0.66085696, 0.66057056, 0.65928936, 0.65452605],
        [0.8048103 , 0.7937138 , 0.73199886, 0.7271134 , 0.7210294 ],
        [0.6911128 , 0.6901874 , 0.6858583 , 0.6786149 , 0.6764821 ],
        [0.8492347 , 0.78826624, 0.75495416, 0.74292904, 0.72694695],
        [0.66035986, 0.6451392 , 0.6402346 , 0.63212216, 0.63127613],
        [0.68583214, 0.68259925, 0.6803767 , 0.6758202 , 0.6743665 ],
        [0.7060176 , 0.69146293, 0.68282354, 0.66724193, 0.6656855 ],
        [0.9980432 , 0.9979822 , 0.141135  , 0.14008525, 0.13204612],
        [0.        , 0.        , 0.        , 0.        , 0.        ]],

       [[0.9998142

### To save the complete cosim matrices

In [None]:
# Top-k cosine similarities (already NumPy arrays, so no .numpy() call is needed)
cosims_L11 = [x[0] for x in data]

# Indices of the matching post tokens
ids_cosims_L11 = [x[1] for x in data]

ids_cosims_L11[0:2]

len(cosims_L11)

with open("/content/drive/MyDrive/colB5BERT/ids_cosims_L11.pkl", "wb") as file:
    pickle.dump(ids_cosims_L11, file)

with open("/content/drive/MyDrive/colB5BERT/cosim_L11.pkl", "wb") as file:
    pickle.dump(cosims_L11, file)

# `embeddings_L6` is not defined anywhere in this notebook, so this dump is left disabled
# with open("/content/drive/MyDrive/colB5BERT/cosim_L6_full.pkl", "wb") as file:
#     pickle.dump(embeddings_L6, file)


### Saving only the scores (sum of top5 and average over tokens)
- `data` is a list of `(values, indices)` tuples; each `values` array is a numpy.ndarray of shape (b, tk, r), where r indexes the top-5 cosine similarities.
- The code first sums along the similarity axis r, reducing the shape to (b, tk).
- It then averages along the token dimension, giving an array of shape (b,).
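
As a minimal sketch of these two reductions, the toy array below uses made-up values with b = 2, tk = 3 and r = 2; all-zero rows stand in for padded token positions. The variable names mirror the cell that follows, but the numbers are illustrative assumptions, not results from the notebook.

```python
import numpy as np

# Toy top-k similarities of shape (b=2, tk=3, r=2); zero rows mimic padding
dat = np.array([
    [[0.8, 0.6], [0.4, 0.2], [0.0, 0.0]],
    [[0.5, 0.5], [0.0, 0.0], [0.0, 0.0]],
])

# Sum over the similarity axis r: shape (2, 3)
dat_sum = np.sum(dat, axis=-1)

# Mask token positions whose sum is <= 0 so padding does not dilute the mean
dat_masked = np.ma.masked_array(dat_sum, dat_sum <= 0)

# Average over the remaining token positions: shape (2,)
scores = np.ma.mean(dat_masked, axis=-1)

print(dat_sum.shape, scores.shape)
print(scores)
```

Both toy scores come out as 1.0 up to floating-point rounding: the first row averages 1.4 and 0.6 over its two real tokens, and the second row has a single real token with sum 1.0. Without the mask, the padded positions would pull both means down.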


In [48]:
# Keep only the top-k similarity values from each (values, indices) pair
data0 = [x[0] for x in data]

# Initialize an empty list to hold the final results
final_data = []

# Loop through each numpy.ndarray in data0
for dat in data0:

    # Sum along the r-axis, resulting in an array of shape (b, tk)
    dat_sum = np.sum(dat, axis=-1)

    # Mask token positions whose sum is <= 0 (padded rows are all zeros),
    # then take the mean along the tk-axis while ignoring the masked values,
    # resulting in an array of shape (b,)
    dat_masked = np.ma.masked_array(dat_sum, dat_sum <= 0)
    dat_mean = np.ma.mean(dat_masked, axis=-1)

    # Append the resulting array to final_data
    final_data.append(dat_mean)

# The last item batch holds 7 items instead of 8, so the per-entry arrays are ragged;
# request an object array explicitly
final_data = np.array(final_data, dtype=object)

# final_data now holds one array of shape (b,) per (item batch, post) pair
final_data.shape


(599924,)

In [49]:
final_data = np.concatenate(final_data, axis=0)
final_data.shape




(4787855,)

In [50]:
with open("/content/drive/MyDrive/colB5BERT/cosim_scores_L11.pkl", "wb") as file:
    pickle.dump(final_data, file)