<a href="https://colab.research.google.com/github/rprimi/colB5BERT/blob/main/python/b5_contextualreps_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
__author__ = "Ricardo Primi adapted from modules from Christopher Potts, CS224u, Stanford, Spring 2021"

### General set-up



In [None]:
!pip3 install transformers

In [3]:
!git clone https://github.com/rprimi/colB5BERT.git

%cd /content/colB5BERT
!git pull



Cloning into 'colB5BERT'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (88/88), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 88 (delta 46), reused 38 (delta 15), pack-reused 0
Unpacking objects: 100% (88/88), 11.84 MiB | 3.19 MiB/s, done.
/content/colB5BERT
Already up to date.


In [4]:
import os
import pandas as pd
import torch
from transformers import BertModel, BertTokenizer
from transformers import RobertaModel, RobertaTokenizer


Modules `vsm`, `utils` and `sst` are from Stanford's CS224u https://github.com/cgpotts/cs224u

In [5]:
import sys
sys.path.append('/content/colB5BERT/python/')

import utils
import vsm
import sst


In [6]:
if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


In [7]:
!nvidia-smi

Mon Jun 12 19:38:39 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8     9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Data

In [8]:
b5_data = pd.read_csv('/content/colB5BERT/data/db_textos.splitted.csv', sep=';') 
b5_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11537 entries, 0 to 11536
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              11537 non-null  int64 
 1   id_divisao      11537 non-null  int64 
 2   texto_dividido  11537 non-null  object
dtypes: int64(2), object(1)
memory usage: 270.5+ KB


In [9]:
b5_data 

| | id | id_divisao | texto_dividido |
|---|---|---|---|
| 0 | 100 | 1 | ajudando porque a zuzu é um amor e tem a voz f... |
| 1 | 100 | 2 | vai ter share sim e se reclamar dou share mais... |
| 2 | 100 | 3 | quanto parece , A , $NUMBER$ . MvC $NUMBER$ CL... |
| 3 | 100 | 4 | $NUMBER$ jeitos de dar entry então é só sucess... |
| 4 | 100 | 5 | quiser se vira Esse livro é de co-autoria de $... |
| ... | ... | ... | ... |
| 11532 | 999 | 6 | amo muito ! < $NUMBER$ < $NUMBER$ " Fique por ... |
| 11533 | 999 | 7 | pai ! Feliz aniversário ! < $NUMBER$ < $NUMBER... |
| 11534 | 999 | 8 | rei do $NAME$ Club de $NAME$ Oeste : $NAME$ $N... |
| 11535 | 999 | 9 | todo tipo de público . A realização do projeto... |


### Loading Transformer models
Specify the name of the pretrained weights, then load the matching tokenizer and model:

In [10]:
bert_weights_name = 'neuralmind/bert-base-portuguese-cased'
bert_tokenizer = BertTokenizer.from_pretrained(bert_weights_name)
bert_model = BertModel.from_pretrained(bert_weights_name)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### The basics of tokenizing


In [None]:
import textwrap

def print_cell(df, row, column, wrap_length=80):
    if row < len(df) and column in df.columns:
        text = df.loc[row, column]
        print('\n'.join(textwrap.wrap(text, width=wrap_length)))
    else:
        print("Invalid row or column")

print_cell(b5_data, 3, 'texto_dividido')

ex_ids = bert_tokenizer.encode(b5_data.loc[3, 'texto_dividido'], add_special_tokens=True)
bert_tokenizer.convert_ids_to_tokens(ex_ids)


### Getting BERT embeddings

To obtain the representations for a batch of examples, we call the model on a tensor of token ids (which invokes its `forward` method) and ask it to return all hidden states, as follows:

In [None]:
with torch.no_grad():
    reps = bert_model(torch.tensor([ex_ids]), output_hidden_states=True)
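
A note on layer indexing may help when choosing which layers to extract. In Hugging Face `transformers`, `hidden_states` is a tuple with one entry for the embedding output plus one entry per encoder layer, so a 12-layer base model returns 13 entries and index 12 is the final layer. The sketch below checks this on a tiny, randomly initialised BERT so that nothing has to be downloaded; the configuration sizes and the token ids are made up for illustration and are not those of the model used in this notebook.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialised BERT (hypothetical sizes, illustration only).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(config).eval()

# One "sentence" of four made-up token ids, shape (batch_size=1, seq_len=4).
toy_ids = torch.tensor([[1, 5, 9, 2]])
with torch.no_grad():
    out = tiny_model(toy_ids, output_hidden_states=True)

n_states = len(out.hidden_states)                 # num_hidden_layers + 1
state_shape = tuple(out.hidden_states[-1].shape)  # (batch, seq_len, hidden)
print(n_states, state_shape)
```

Index 0 is the output of the embedding layer, and the last entry is the same tensor as `last_hidden_state`.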

In [11]:
def tokenize_texts(bert_tokenizer, texts, max_length=512):
    """Encode each text as a list of token ids, capped at `max_length` ids."""
    tokenized_texts = []
    for text in texts:
        # BERT accepts at most 512 positions. truncation=True shortens the
        # text itself, so the closing [SEP] token is kept; a plain slice
        # such as encoded_text[:512] would cut it off.
        encoded_text = bert_tokenizer.encode(
            text,
            add_special_tokens=True,
            truncation=True,
            max_length=max_length,
        )
        tokenized_texts.append(encoded_text)
    return tokenized_texts
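
The function above keeps only the first 512 token ids of each text, so anything beyond that limit is never seen by the model. One possible alternative, sketched here under assumptions, is to split a long sequence into several windows that each fit the limit. `chunk_ids` is a hypothetical helper, not part of this notebook or of `transformers`; the ids 101 and 102 are placeholders, and with a real tokenizer the values would come from `bert_tokenizer.cls_token_id` and `bert_tokenizer.sep_token_id`.

```python
def chunk_ids(ids, cls_id, sep_id, max_len=512):
    """Split one encoded text into windows of at most `max_len` ids.

    `ids` is assumed to start with [CLS] and end with [SEP]; every
    returned window gets its own [CLS] and [SEP].
    """
    body = ids[1:-1]                 # drop the original special tokens
    step = max_len - 2               # room left after adding [CLS] and [SEP]
    chunks = [
        [cls_id] + body[i:i + step] + [sep_id]
        for i in range(0, len(body), step)
    ]
    return chunks or [[cls_id, sep_id]]   # empty text: special tokens only


# Toy example with placeholder ids and a very small window.
toy_ids = [101] + list(range(1000, 1010)) + [102]
toy_chunks = chunk_ids(toy_ids, cls_id=101, sep_id=102, max_len=6)
print([len(chunk) for chunk in toy_chunks])
```

Each window can then be passed to the model like any other example; how to recombine the per-window embeddings is a separate design decision.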



In [None]:
ex_of_texts = b5_data.iloc[[0, 1, 2, 3], b5_data.columns.get_loc('texto_dividido')].tolist()
lengths = [len(sublist) for sublist in tokenize_texts(bert_tokenizer, ex_of_texts)]

print(lengths)  # number of token ids per text, capped at 512

[512, 512, 477, 512]


In [12]:
import torch
def get_bert_embeddings(bert_model, examples, layers):
    # Move model to GPU if available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    bert_model = bert_model.to(device)

    embeddings = {layer: [] for layer in layers}
    for ex_ids in examples:
        # Convert data to tensor and move to GPU
        ex_ids_tensor = torch.tensor([ex_ids]).to(device)
        with torch.no_grad():
            # Output includes 'last_hidden_state', 'pooler_output', 'hidden_states'
            output = bert_model(ex_ids_tensor, output_hidden_states=True)
            hidden_states = output.hidden_states
            for layer in layers:
                # Verify layer index is valid
                if layer < 0 or layer >= len(hidden_states):
                    print(f"Invalid layer {layer}")
                else:
                    # Hidden states is a tuple. Indexing into it gives a tensor of shape 
                    # (batch_size, sequence_length, hidden_size). Since batch_size is 1,
                    # we remove the batch dimension.
                    layer_output = hidden_states[layer].squeeze(0)
                    # Convert back to CPU for further processing or storage
                    embeddings[layer].append(layer_output.to('cpu'))
    return embeddings

In [None]:
layers = [6, 9, 11, 12]  # Specify the layers you want
embeddings = get_bert_embeddings(bert_model, tokenize_texts(bert_tokenizer, ex_of_texts), layers)

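import inspect
import pprint

pprint.pprint(embeddings)

# Generic introspection of the returned object (a plain dict)
dir(embeddings)
# vars(embeddings) is omitted: a dict has no __dict__, so it raises TypeError
inspect.getmembers(embeddings)

# Drill down: layer 9 -> list of texts -> tensor for the third text
pprint.pprint(embeddings[9])
len(embeddings[9])       # number of texts
len(embeddings[9][2])    # number of tokens in the third text
pprint.pprint(embeddings[9][2])
x = embeddings[9][2]

# Length of each token vector in the first text
dimensions = [len(token_vector) for token_vector in embeddings[9][0]]

x.shape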



torch.Size([477, 768])

### Finally getting the embeddings

In [2]:
# Specify the layers you want
layers = [ 9, 11]  
len(b5_data['texto_dividido'].tolist())

embeddings = get_bert_embeddings(bert_model, tokenize_texts(bert_tokenizer, b5_data['texto_dividido'].tolist()), layers)

NameError: ignored

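`embeddings` is a dict keyed by layer index. Each value is a list with one tensor per input text, and each tensor has shape `num_of_tokens x embedding_dim`.

The overall structure is therefore `layers x texts x num_of_tokens x embedding_dim`. Because `num_of_tokens` varies from text to text, this is a nested structure rather than a single rectangular tensor.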
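
Before running this on the whole dataset, it can be worth estimating how much memory the nested structure needs. The arithmetic below is an upper bound under stated assumptions: every text reaches the 512-token limit, values are stored as float32, and two layers are kept. The row count and hidden size are taken from the outputs shown earlier; the "one vector per text" figure assumes a hypothetical pooling step (for example averaging over tokens) that this notebook does not perform.

```python
# Upper-bound estimate; all inputs are assumptions listed in the text above.
n_texts = 11537        # rows reported by b5_data.info()
max_tokens = 512       # truncation limit used when tokenizing
hidden_size = 768      # second dimension of torch.Size([477, 768])
bytes_per_value = 4    # float32
n_layers = 2           # layers = [9, 11]

bytes_per_layer = n_texts * max_tokens * hidden_size * bytes_per_value
total_gib = n_layers * bytes_per_layer / 2**30
pooled_mib = n_layers * n_texts * hidden_size * bytes_per_value / 2**20

print(f"token-level, all layers: {total_gib:.1f} GiB")
print(f"one vector per text:     {pooled_mib:.1f} MiB")
```

Under these assumptions the token-level structure needs roughly 34 GiB of CPU memory, which suggests processing the texts in parts or reducing each text to a smaller representation before storing it.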

To compute cosine similarity between tensors, you can use the `torch.nn.functional.cosine_similarity` function provided by PyTorch.

Here's a function that computes cosine similarity between every pair of tokens in `A` and `B`, assuming that `A` and `B` are PyTorch tensors of shape `(batch_size, num_of_tokens, embedding_dim)`.

```python
import torch
import torch.nn.functional as F

def compute_cosine_similarity(A, B):
    batch_size1, num_of_tokens1, embedding_dim1 = A.shape
    batch_size2, num_of_tokens2, embedding_dim2 = B.shape
    
    if embedding_dim1 != embedding_dim2:
        raise ValueError("Embedding dimensions must match!")

    if batch_size1 != batch_size2:
        raise ValueError("Batch sizes must match!")

    # We will compute cosine similarity for each example in the batch separately
    cosine_similarities = []
    for i in range(batch_size1):
        a = A[i]
        b = B[i]

        # Compute cosine similarity between all pairs of tokens.
        # The resulting matrix will have shape (num_of_tokens1, num_of_tokens2)
        similarity_matrix = torch.zeros((num_of_tokens1, num_of_tokens2))

        for j in range(num_of_tokens1):
            for k in range(num_of_tokens2):
                similarity_matrix[j, k] = F.cosine_similarity(a[j], b[k], dim=0)

        cosine_similarities.append(similarity_matrix)

    return cosine_similarities
```

This function will return a list of 2D tensors, each with shape `(num_of_tokens1, num_of_tokens2)`. Each tensor in the list corresponds to an example in the batch. The values in the 2D tensor represent the cosine similarity between the corresponding tokens in `A` and `B`. Note that this is quite a computationally expensive way to compute these similarities due to the nested for loops, and it might be worth looking into more efficient methods if this becomes a bottleneck in your code.

The same computation can be expressed with vector operations, using broadcasting instead of nested loops:

```python
import torch
import torch.nn.functional as F

def compute_cosine_similarity_vectorized(A, B):
    # Check for matching dimensions
    batch_size1, num_of_tokens1, embedding_dim1 = A.shape
    batch_size2, num_of_tokens2, embedding_dim2 = B.shape

    if embedding_dim1 != embedding_dim2:
        raise ValueError("Embedding dimensions must match!")

    if batch_size1 != batch_size2:
        raise ValueError("Batch sizes must match!")

    A_unsqueezed = A.unsqueeze(2)  # Shape becomes [batch_size, num_of_tokens1, 1, embedding_dim]
    B_unsqueezed = B.unsqueeze(1)  # Shape becomes [batch_size, 1, num_of_tokens2, embedding_dim]

    # Calculate cosine similarity. The result has shape [batch_size, num_of_tokens1, num_of_tokens2]
    similarity_matrix = F.cosine_similarity(A_unsqueezed, B_unsqueezed, dim=-1)

    return similarity_matrix

```
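
A further variant, offered as a sketch rather than as the method used in this notebook, normalises each token vector to unit length and then uses a batched matrix product. This avoids broadcasting the inputs to a `[batch_size, num_of_tokens1, num_of_tokens2, embedding_dim]` shape, which may matter for long sequences. `cosine_similarity_matmul` is a hypothetical name, and the random tensors stand in for real embeddings.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matmul(A, B, eps=1e-8):
    """Pairwise cosine similarity between the rows of A and the rows of B.

    A: [..., n1, dim], B: [..., n2, dim]  ->  result: [..., n1, n2]
    """
    A_unit = F.normalize(A, p=2, dim=-1, eps=eps)   # unit-length token vectors
    B_unit = F.normalize(B, p=2, dim=-1, eps=eps)
    return A_unit @ B_unit.transpose(-1, -2)        # dot products of unit vectors


torch.manual_seed(0)
A = torch.randn(2, 5, 8)   # toy data: batch of 2, 5 tokens, dimension 8
B = torch.randn(2, 7, 8)   # toy data: batch of 2, 7 tokens, dimension 8

sim = cosine_similarity_matmul(A, B)
reference = F.cosine_similarity(A.unsqueeze(2), B.unsqueeze(1), dim=-1)
print(sim.shape, torch.allclose(sim, reference, atol=1e-5))
```

Because only the last two dimensions are used, the same function also accepts two unbatched `num_of_tokens x embedding_dim` tensors, such as two entries of `embeddings[layer]`.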


### Miscellaneous

### Python equivalent of R's `str()`

The closest Python counterparts to R's `str()` are its introspection helpers. `dir()` returns a list of all the attributes and methods of any object passed to it¹. `vars()` returns the `__dict__` attribute of an object¹, so it only works for objects that have one. `inspect.getmembers()` returns all the members of an object as a list of (name, value) pairs sorted by name¹.

Source: conversation with Bing, 05/06/2023
(1) Is there a Python equivalent of R's str (), returning only the .... https://stackoverflow.com/questions/27749573/is-there-a-python-equivalent-of-rs-str-returning-only-the-structure-of-an-ob
(2) What are Python pandas equivalents for R functions like str(), summary .... https://stackoverflow.com/questions/27637281/what-are-python-pandas-equivalents-for-r-functions-like-str-summary-and-he
(3) What is the difference between 'string' and r'string' in Python? (in Portuguese). https://pt.stackoverflow.com/questions/80545/qual-%c3%a9-a-diferen%c3%a7a-entre-string-e-rstring-em-python
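
As a small illustration of the three helpers named above (`dir()`, `vars()`, `inspect.getmembers()`), the sketch below applies them to a toy object. The `Sample` class and its attributes are invented for this example; the last part shows that `vars()` fails on a plain dict, which has no `__dict__`.

```python
import inspect


class Sample:
    """Toy class, used only to have something to inspect."""

    def __init__(self):
        self.layer = 9
        self.tokens = ["[CLS]", "ol", "##á", "[SEP]"]


obj = Sample()

attrs = vars(obj)                                   # the instance __dict__
public_names = [n for n in dir(obj) if not n.startswith("_")]
members = dict(inspect.getmembers(obj))             # name -> value pairs

try:
    vars({"a": 1})                                  # a dict has no __dict__
    dict_supports_vars = True
except TypeError:
    dict_supports_vars = False

print(attrs)
print(public_names)
print(dict_supports_vars)
```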

Understanding complex data structures in Python code often requires carefully examining the code and using built-in Python functions that give insights about these structures. Here are a few steps that might help:

1. **Print Statements**: Use `print()` statements liberally to output variables and their types. This can give you an idea of what data structures are being used at various points in the code.

2. **Type Checking**: Use the `type()` function to check the type of data structures. For instance, `type(my_var)` would return the type of `my_var`.

3. **Introspection**: Use `dir()` to view the attributes and methods of an object. For example, `dir(my_var)` would list all the attributes and methods available on `my_var`.

4. **Length and Structure**: Use `len()` to find the length of a data structure. For dictionaries, lists, tuples, etc., you can also print individual elements.

5. **Variable Explorer**: If your environment provides a variable explorer (IDEs such as PyCharm or Spyder do), you can use it to inspect your variables and data structures.

6. **Debugger**: A debugger can help you step through the code one line at a time and examine the changes in your data structures as the code executes. Python's built-in debugger is `pdb`.

7. **Visualization Tools**: For complex data structures like nested dictionaries or dataframes, consider using data visualization tools or libraries like pandas, matplotlib, or seaborn to visualize the data.

Remember, understanding complex data structures can be challenging, but it is often a matter of breaking down the structure into smaller, more manageable parts and understanding those individually.

In Python, lists and dictionaries don't have dimensions in the way that arrays in NumPy or dataframes in pandas do. Instead, they have lengths, and those lengths can be nested. You can use the built-in `len()` function to find out the number of elements in a list or dictionary. 

For a list:

```python
my_list = [1, 2, 3, 4, 5]
print(len(my_list))  # Output: 5
```

For a dictionary:

```python
my_dict = {'one': 1, 'two': 2, 'three': 3}
print(len(my_dict))  # Output: 3
```

For nested structures, you'd need to use additional `len()` calls or use a loop or comprehension to iterate over the elements.

For example, for a list of lists (a 2D list), you could use a list comprehension:

```python
my_list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
dimensions = [len(inner_list) for inner_list in my_list_of_lists]
print(dimensions)  # Output: [3, 3, 3]
```

This tells you that you have a "3x3" list. Note that this only works for regularly-shaped data; if your lists have differing lengths, you'll get a variety of numbers.

For a nested dictionary, things can get more complex, and you may need a recursive function to fully explore the structure if the nesting can be more than one level deep.
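
A minimal sketch of such a recursive function is given below. `describe` is a hypothetical helper written for this note: it reports the length of dicts, lists, and tuples, the shape of anything that exposes a `.shape` attribute (such as a tensor or array), and the type name of everything else. To keep the report short it only descends into the first element of each list, which assumes the elements are structured alike.

```python
def describe(obj, indent=0):
    """Return a list of text lines summarising a nested structure."""
    pad = "  " * indent
    lines = []
    if hasattr(obj, "shape"):                       # tensors, arrays
        lines.append(f"{pad}{type(obj).__name__} shape={tuple(obj.shape)}")
    elif isinstance(obj, dict):
        lines.append(f"{pad}dict len={len(obj)}")
        for key, value in obj.items():
            lines.append(f"{pad}  key {key!r}:")
            lines.extend(describe(value, indent + 2))
    elif isinstance(obj, (list, tuple)):
        lines.append(f"{pad}{type(obj).__name__} len={len(obj)}")
        if obj:                                     # first element only
            lines.extend(describe(obj[0], indent + 1))
    else:
        lines.append(f"{pad}{type(obj).__name__}")
    return lines


# Toy structure shaped like a dict of layers holding lists of rows.
nested = {9: [[1.0, 2.0], [3.0]], 11: []}
report = describe(nested)
print("\n".join(report))
```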

## Some related work

1. [Ethayarajh (2019)](https://www.aclweb.org/anthology/D19-1006/) uses dimensionality reduction techniques (akin to LSA) to derive static representations from contextual models, and explores layer-wise variation in detail, with findings that are likely to align with your experiences using the above techniques.

1. [Akbik et al. (2019)](https://www.aclweb.org/anthology/N19-1078/) explore techniques similar to those of Bommasani et al. specifically for the supervised task of named entity recognition.

1. [Wang et al. (2020)](https://arxiv.org/pdf/1911.02929.pdf) learn static representations from contextual ones using techniques adapted from the word2vec model.

Please note the following:

1. Make sure the GPU has enough memory to hold both the model and the data.

2. The model and the data must be on the same device (both on CPU or both on GPU) for the forward pass to work.

3. PyTorch performs tensor computations on the device where the tensor resides. For further processing or storage, move the results back to the CPU with `.to('cpu')`, as `get_bert_embeddings` does. Transfers between CPU and GPU take time, so do them sparingly.