## Generating Embeddings with BERT (Bidirectional Encoder Representations from Transformers) 

### What is BERT?
BERT is a large, state-of-the-art neural network that has been trained on a large corpus of text (millions of sentences). Its applications include, but are not limited to:

- Sentiment analysis
- Text classification
- Question answering systems
 
In this notebook, we walk through how BERT generates fixed-length embeddings (features) from a sentence. You can think of these embeddings as an alternative feature extraction technique to bag of words. The BERT pipeline has two main components, a tokenizer and a model, each covered in its own section below.



In [None]:
# Install the required libraries (if not already installed)
# !pip install transformers

## 1. Tokenizer (Converting sentences into a series of numerical tokens):

The tokenizer in BERT is like a translator that converts sentences into a series of numerical tokens that the BERT model can understand. Specifically, it does the following:

- Splits Text: It breaks down sentences into smaller pieces called tokens. These tokens can be as short as one character or as long as one word. For example, the word "chatting" might be split into "chat" and "##ting".

- Converts Tokens to IDs: Each token has a unique ID in BERT's vocabulary. The tokenizer maps every token to its corresponding ID. This is like looking up a word's entry number in BERT's dictionary: the ID is only an index, and the model learns what each token means from its embeddings.

- Adds Special Tokens: BERT requires certain special tokens for its tasks, like [CLS] at the beginning of a sentence and [SEP] at the end or between two sentences. The tokenizer adds these in.
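
The three tokenizer steps above can be sketched with a toy example. The snippet below is a minimal illustration, not BERT's actual tokenizer: the vocabulary `TOY_VOCAB` and the helpers `wordpiece_split` and `toy_encode` are hypothetical names made up for this sketch, and the greedy longest-match-first rule is a simplified version of WordPiece splitting. It lowercases its input, as an uncased model would.

```python
# Toy vocabulary mapping token -> ID. Hypothetical: BERT's real vocabulary
# has tens of thousands of entries and different IDs.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "the": 4, "sky": 5, "is": 6, "blue": 7,
    "chat": 8, "##ting": 9, ".": 10,
}


def wordpiece_split(word, vocab):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def toy_encode(sentence, vocab):
    """Step 1: split text. Step 2: add special tokens. Step 3: map to IDs."""
    words = sentence.lower().replace(".", " .").split()
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece_split(word, vocab))
    tokens.append("[SEP]")
    ids = [vocab[token] for token in tokens]
    return tokens, ids


tokens, ids = toy_encode("The sky is chatting.", TOY_VOCAB)
print(tokens)
print(ids)
```

With this toy vocabulary, "chatting" is not a single entry, so it falls back to the two pieces "chat" and "##ting", each with its own ID.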


### Example usage of the tokenizer

In the cell below, we see how BERT tokenizes three sentences into token IDs and then decodes those IDs back into text.

We'll use the following example sentences:

1. "The sky is blue."
2. "Sky is clear today."
3. "Look at the clear blue sky."


In [1]:
# Import required libraries
from transformers import AutoTokenizer

# Define the example sentences and load the pre-trained BERT tokenizer
sentences = ["The sky is blue.", "Sky is clear today.", "Look at the clear blue sky."]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(sentences, padding=True,
                         max_length=10,
                         truncation=True)['input_ids']

print('----------------------------------------------')
print('Examples of tokenizing the sentences with BERT')
print('----------------------------------------------')
for jj, txt in enumerate(sentences):
    print('%s is encoded as : %s'%(txt, encoded_text[jj]))

print('----------------------------------------------')
print('Examples of decoding the tokens back to English')
print('----------------------------------------------')
for enc in encoded_text:
    decoded_text = tokenizer.decode(enc)
    print("Decoded tokens back into text: ", decoded_text)
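
The tokenizer call above passes `padding=True`, `truncation=True`, and `max_length=10`. The snippet below is a rough sketch of what those options do to a batch of token ID lists. It assumes right-side padding with ID 0 and truncation that keeps the final [SEP] token. The helper `pad_and_truncate` and the example IDs are made up for illustration. The real tokenizer also returns an `attention_mask` that marks real tokens with 1 and padding with 0; the cell above discards it by keeping only `input_ids`.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = []
    for ids in batch_ids:
        if len(ids) > max_length:
            # Drop tokens from the end of the text but keep the closing [SEP]
            ids = ids[:max_length - 1] + [ids[-1]]
        truncated.append(list(ids))

    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Illustrative IDs only: 101 and 102 stand in for [CLS] and [SEP]
batch = [
    [101, 7, 8, 102],
    [101, 7, 8, 9, 10, 11, 102],
    [101, 7, 102],
]
padded_ids, mask = pad_and_truncate(batch, max_length=5)
for row_ids, row_mask in zip(padded_ids, mask):
    print(row_ids, row_mask)
```

After this step every row has the same length, which is what allows a batch of sentences to be stacked into a single tensor.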


## 2. Model (Extracting meaningful feature representations from the sentences):

Once the text is tokenized and converted into the necessary format, it's fed into the BERT model. 
The **model** processes these inputs to generate contextual embeddings or representations for each token. These representations can then be utilized for various downstream tasks like classification, entity recognition, and more.

In [None]:
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
import pandas as pd
import os
# Initialize BERT tokenizer and model
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

def get_bert_embedding(sentence_list, pooling_strategy='cls'):
    embedding_list = []
    for nn, sentence in enumerate(sentence_list):
        if nn % 100 == 0 and nn > 0:
            print('Done with %d sentences'%nn)
        
        # Tokenize the sentence and get the output from BERT
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # Take the last hidden state of the only sentence in the batch: shape (num_tokens, hidden_size)
        # The pooling strategy below reduces it to one sentence vector (default: the [CLS] token representation)
        last_hidden_states = outputs.last_hidden_state[0]
        
        # Pooling strategies
        if pooling_strategy == "cls":
            sentence_embedding = last_hidden_states[0]
        elif pooling_strategy == "mean":
            sentence_embedding = torch.mean(last_hidden_states, dim=0)
        elif pooling_strategy == "max":
            sentence_embedding, _ = torch.max(last_hidden_states, dim=0)
        else:
            raise ValueError(f"Unknown pooling strategy: {pooling_strategy}")
        
        embedding_list.append(sentence_embedding)
    return torch.stack(embedding_list)

sentence = [sentences[0]]
embedding = get_bert_embedding(sentence)

np.set_printoptions(precision=3, suppress=True)
print('-----------------------------------------------------------------------------------------------------------')
print('The sentence "%s" has been converted to a feature representation of shape %s'%(sentence[0], embedding.numpy().shape))
print('-----------------------------------------------------------------------------------------------------------')
print(embedding.numpy()[0])
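
To make the three pooling strategies in `get_bert_embedding` concrete, here is a small NumPy sketch on a made-up matrix of token vectors. The array values and the helper name `pool_tokens` are assumptions for illustration only. The real function applies the same reductions to BERT's last hidden state with PyTorch.

```python
import numpy as np

# Toy stand-in for one sentence's last hidden state: 4 tokens x 3 hidden dimensions.
# (bert-base models use 768 hidden dimensions; 3 keeps the numbers readable.)
token_vectors_TH = np.array([
    [1.0,  0.0, 2.0],   # row 0 plays the role of the [CLS] token
    [3.0,  4.0, 0.0],
    [5.0, -2.0, 1.0],
    [-1.0, 2.0, 1.0],
])


def pool_tokens(token_vectors_TH, strategy="cls"):
    """Reduce a (num_tokens, hidden_size) matrix to one (hidden_size,) vector."""
    if strategy == "cls":
        return token_vectors_TH[0]            # vector of the first token only
    if strategy == "mean":
        return token_vectors_TH.mean(axis=0)  # average over all tokens
    if strategy == "max":
        return token_vectors_TH.max(axis=0)   # per-dimension maximum over tokens
    raise ValueError(f"Unknown pooling strategy: {strategy}")


cls_vec = pool_tokens(token_vectors_TH, "cls")
mean_vec = pool_tokens(token_vectors_TH, "mean")
max_vec = pool_tokens(token_vectors_TH, "max")
print("cls :", cls_vec)
print("mean:", mean_vec)
print("max :", max_vec)
```

Whichever strategy is used, the result has one entry per hidden dimension, so sentences of different lengths all map to vectors of the same size.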

## Generate embeddings from BERT for the movie review train and test sets

Below, we generate the BERT embeddings for the movie reviews dataset provided to you. The embeddings are stored as NumPy files in the data_reviews folder:

- x_train_BERT_embeddings.npy : matrix of size 2400 x 768 containing a 768-dimensional feature vector for each of the 2400 sentences in the training set
- x_test_BERT_embeddings.npy : matrix of size 600 x 768 containing a 768-dimensional feature vector for each of the 600 sentences in the test set

Please note that these embeddings are **already provided to you in the data_reviews folder**. You can directly use the provided feature embeddings as inputs to your choice of classifier. If you would like to generate these feature representations again, run the code cells below. 

In [6]:
print('Loading data...')
x_train_df = pd.read_csv('../data_reviews/x_train.csv')
x_test_df = pd.read_csv('../data_reviews/x_test.csv')

tr_text_list = x_train_df['text'].values.tolist()
te_text_list = x_test_df['text'].values.tolist()

Loading data...


In [33]:
print('Generating embeddings for train sequences...')
tr_embedding = get_bert_embedding(tr_text_list)

print('Generating embeddings for test sequences...')
te_embedding = get_bert_embedding(te_text_list)


Generating embeddings for train sequences...
Done with 0 sentences
Done with 100 sentences
Done with 200 sentences
Done with 300 sentences
Done with 400 sentences
Done with 500 sentences
Done with 600 sentences
Done with 700 sentences
Done with 800 sentences
Done with 900 sentences
Done with 1000 sentences
Done with 1100 sentences
Done with 1200 sentences
Done with 1300 sentences
Done with 1400 sentences
Done with 1500 sentences
Done with 1600 sentences
Done with 1700 sentences
Done with 1800 sentences
Done with 1900 sentences
Done with 2000 sentences
Done with 2100 sentences
Done with 2200 sentences
Done with 2300 sentences
Generating embeddings for test sequences...
Done with 0 sentences
Done with 100 sentences
Done with 200 sentences
Done with 300 sentences
Done with 400 sentences
Done with 500 sentences


In [39]:
tr_embeddings_ND = tr_embedding.numpy()
te_embeddings_ND = te_embedding.numpy()

save_dir = os.path.abspath('../data_reviews/')
print('Saving the train and test embeddings to %s'%save_dir)

np.save(os.path.join(save_dir, 'x_train_BERT_embeddings.npy'), tr_embeddings_ND)
np.save(os.path.join(save_dir, 'x_test_BERT_embeddings.npy'), te_embeddings_ND)

Saving the train and test embeddings to /cluster/tufts/hugheslab/prath01/projects/cs135-23f-staffonly/proj_src/projA/data_reviews


## Show similarity between reviews using the embeddings

In [19]:
from utils import calc_k_nearest_neighbors
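
The `calc_k_nearest_neighbors` helper comes from the accompanying `utils` module, and its implementation is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal NumPy sketch. It assumes Euclidean distance and a return value of (distances, indices) with one row per query. The name `knn_sketch` is hypothetical, and the real helper may differ.

```python
import numpy as np


def knn_sketch(data_NF, query_QF, K=5):
    """For each query row, return distances and indices of the K nearest data rows."""
    # Broadcast (Q, 1, F) against (1, N, F) to get all pairwise differences
    diff_QNF = query_QF[:, np.newaxis, :] - data_NF[np.newaxis, :, :]
    dist_QN = np.sqrt(np.sum(diff_QNF ** 2, axis=2))
    # Sort each query's distances and keep the K smallest
    idx_QK = np.argsort(dist_QN, axis=1, kind="stable")[:, :K]
    dist_QK = np.take_along_axis(dist_QN, idx_QK, axis=1)
    return dist_QK, idx_QK


# Tiny 2-dimensional example in place of the 768-dimensional embeddings
data_NF = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
query_QF = data_NF[1][np.newaxis, :]   # use row 1 as the query, as the notebook does

dist_QK, idx_QK = knn_sketch(data_NF, query_QF, K=3)
print(idx_QK[0])
print(dist_QK[0])
```

Because the query is itself a row of the data, it comes back as its own nearest neighbor at distance 0, which is why the first result listed for each query is the query review itself.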

In [21]:
# choose some query sentences
sentence_id_list = [5, 20, 70, 85, 92, 12, 521, 100, 712]

# use K-nearest neighbors to find the 5 reviews that most closely resemble the query review
for sentence_id in sentence_id_list:
    query_QF = tr_embeddings_ND[sentence_id][np.newaxis, :]
    _, nearest_ids_per_query = calc_k_nearest_neighbors(tr_embeddings_ND, query_QF, K=5)
    nearest_ids = nearest_ids_per_query[0]

    print('------------------------------------------------------------------------------------')
    print('Reviews that resemble the sentence : \n%s'%tr_text_list[sentence_id])
    print('------------------------------------------------------------------------------------')
    for ii, idx in enumerate(nearest_ids):
        print('%d) %s'%(ii, tr_text_list[idx]))



------------------------------------------------------------------------------------
Reviews that resemble the sentence : 
Worst customer service.
------------------------------------------------------------------------------------
0) Worst customer service.
1) Poor product.
2) poor quality and service.
3) Bad Quality.
4) Bad Reception.
------------------------------------------------------------------------------------
Reviews that resemble the sentence : 
I'm very disappointed with my decision.
------------------------------------------------------------------------------------
0) I'm very disappointed with my decision.
1) I'm a bit disappointed.
2) I was very disappointed in the movie.  
3) I'm still trying to get over how bad it was.  
4) I can't wait to go back.
------------------------------------------------------------------------------------
Reviews that resemble the sentence : 
There's a horrible tick sound in the background on all my calls that I have never experien