# BERT

**GOAL: At the end of this class, we will be able to USE a pre-trained BERT to (a) generate suggestions and to (b) generate embeddings for classification**



## What is BERT?

After the [transformer](https://arxiv.org/abs/1706.03762), many other advances followed. One of them, of course, is [GPT](https://paperswithcode.com/paper/improving-language-understanding-by), which uses a decoder-only transformer architecture to predict the next word in a sentence. GPT uses a decoder-only architecture because it needs masked multi-head attention: the mask hides future tokens, so the model cannot make trivial predictions by looking at the word it is supposed to predict. Ultimately, GPT learns an embedding space that increases the likelihood of choosing meaningful words for a text continuation.

The Google team found another interesting way to obtain this type of representation. They trained an *encoder*-only transformer that can predict words removed from the text, similarly to how we know what is missing in "Luke, I am your ____". The idea here is that we can use information from both the past and the future for this task, because it is highly dependent on context. Simultaneously, they trained the model to classify whether two given phrases follow each other in a corpus. So, BERT was born.
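The difference between the two attention patterns can be sketched with boolean masks. This is a minimal NumPy illustration, not the models' actual implementation; the function names `causal_mask` and `bidirectional_mask` are made up for this example.

```python
import numpy as np

def causal_mask(n_tokens):
    # Lower-triangular: position i may attend only to positions 0..i (decoder-style)
    return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))

def bidirectional_mask(n_tokens):
    # All True: every position may attend to every other position (encoder-style)
    return np.ones((n_tokens, n_tokens), dtype=bool)

tokens = ["remove", "some", "parts", "[MASK]", "a", "sentence"]
n = len(tokens)
gpt_like = causal_mask(n)
bert_like = bidirectional_mask(n)

# Row i lists which positions token i is allowed to look at
mask_position = tokens.index("[MASK]")
print("decoder-style sees:", [t for t, ok in zip(tokens, gpt_like[mask_position]) if ok])
print("encoder-style sees:", [t for t, ok in zip(tokens, bert_like[mask_position]) if ok])
```

With the causal mask, the `[MASK]` position cannot see "a sentence", which is exactly the context that helps in a fill-in-the-blank task.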


```mermaid
graph LR;
    subgraph Input;
    T["Token embeddings"];
    P["Position embeddings"];
    S["Segment embeddings 
    (indicates if it is sentence 1
     or sentence 2 in NSP task)"];
    ADD(["\+"]);
    T --> ADD;
    P --> ADD;
    S --> ADD; 
    end;

    SEQ["Sequence Model"];
    ADD --> SEQ;
    RES["Result: 1 vector per input token"];
    SEQ --> RES;
```
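The input stage of the diagram above can be sketched as three lookup tables whose rows are added together. This is a toy NumPy version under stated assumptions: the table sizes, the random values, and the name `embed_inputs` are illustrative only, and the real model also normalizes the sum before the sequence model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration; bert-base uses much larger tables
vocab_size, max_positions, n_segments, hidden = 100, 16, 2, 8

token_table = rng.normal(size=(vocab_size, hidden))
position_table = rng.normal(size=(max_positions, hidden))
segment_table = rng.normal(size=(n_segments, hidden))

def embed_inputs(token_ids, segment_ids):
    """Return one vector per token: token + position + segment embedding."""
    token_ids = np.asarray(token_ids)
    segment_ids = np.asarray(segment_ids)
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions] + segment_table[segment_ids]

# Hypothetical ids: four tokens in sentence 1 (segment 0), two in sentence 2 (segment 1)
token_ids = [1, 17, 42, 2, 58, 2]
segment_ids = [0, 0, 0, 0, 1, 1]
x = embed_inputs(token_ids, segment_ids)
print(x.shape)
```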




BERT stands for Bidirectional Encoder Representations from Transformers, and was introduced in [this paper from 2019](https://arxiv.org/pdf/1810.04805). The greatest contribution of BERT, besides its architecture, is the idea of training the language model on different tasks at the same time.

We are definitely not going to train BERT in class, but we will use it for other tasks. We will use the [BERT implementation from Hugging Face](https://huggingface.co/google-bert/bert-base-uncased), whose model page also contains the usage instructions.

## Task 1: Masked Language Model

The first task BERT was trained for was the Masked Language Model. This was inspired by a task called ["Cloze"](https://en.wikipedia.org/wiki/Cloze_test), and the idea is to remove a word from a sentence and let the system predict which word should fill the gap:





```mermaid
graph LR;
    subgraph Inputs;
    INPUT["[CLS]
        remove
        some
        parts
        [MASK]
        a
        sentence"];
    end;
    INPUT --> BERT["BERT"];
    subgraph Outputs;
    OUTPUT["C
    T1
    T2
    T3
    T4
    T5
    T6"];
    end;
    BERT --> OUTPUT;
    Train["Loss: T4 should be the word 'of'"]
    OUTPUT --- Train;
```


This task suggests that the embedding space created by BERT should allow representing words in the context of the rest of the sentence!
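The loss in the diagram ("T4 should be the word 'of'") can be sketched as a cross-entropy computed only at the masked position. The code below is a simplified NumPy illustration, assuming a toy vocabulary and hand-made scores; `masked_lm_loss` is a hypothetical helper, not a function from any library.

```python
import numpy as np

def masked_lm_loss(logits, targets):
    """Mean cross-entropy over the masked positions only.

    logits:  array of shape (n_tokens, vocab_size), one score row per input token
    targets: dict mapping each masked position to the id of the removed word
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    losses = [-log_probs[pos, word_id] for pos, word_id in targets.items()]
    return float(np.mean(losses))

# Toy vocabulary (hypothetical; the real model has tens of thousands of entries)
vocab = ["[CLS]", "remove", "some", "parts", "of", "a", "sentence"]
targets = {4: vocab.index("of")}  # position 4 holds [MASK]; the answer is "of"

uniform_logits = np.zeros((7, len(vocab)))    # a model with no preference
confident_logits = uniform_logits.copy()
confident_logits[4, vocab.index("of")] = 5.0  # a model that favors "of" at T4

loss_uniform = masked_lm_loss(uniform_logits, targets)
loss_confident = masked_lm_loss(confident_logits, targets)
print(loss_uniform, loss_confident)
```

The loss drops as the score for the correct word rises at the masked position; the other positions do not contribute in this sketch.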

To play with this task with Hugging Face's library, you can use:



In [5]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Remove some parts [MASK] a sentence.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[{'score': 0.9431136250495911,
  'token': 1997,
  'token_str': 'of',
  'sequence': 'remove some parts of a sentence.'},
 {'score': 0.04985498636960983,
  'token': 2013,
  'token_str': 'from',
  'sequence': 'remove some parts from a sentence.'},
 {'score': 0.004208952654153109,
  'token': 1999,
  'token_str': 'in',
  'sequence': 'remove some parts in a sentence.'},
 {'score': 0.000622662715613842,
  'token': 2306,
  'token_str': 'within',
  'sequence': 'remove some parts within a sentence.'},
 {'score': 0.0005233758711256087,
  'token': 2076,
  'token_str': 'during',
  'sequence': 'remove some parts during a sentence.'}]

### Algorithmic bias and Hallucinations

Note that BERT generates words that make sense. However, these completions do not necessarily correspond to reality. In fact, they are simply the words that maximize a probability estimated from a specific dataset!

Check, for example, the output for:


In [6]:

unmasker("Kentucky is famous for its [MASK].")


[{'score': 0.07573200762271881,
  'token': 4511,
  'token_str': 'wine',
  'sequence': 'kentucky is famous for its wine.'},
 {'score': 0.06742826849222183,
  'token': 14746,
  'token_str': 'wines',
  'sequence': 'kentucky is famous for its wines.'},
 {'score': 0.02818026952445507,
  'token': 12212,
  'token_str': 'beaches',
  'sequence': 'kentucky is famous for its beaches.'},
 {'score': 0.022783828899264336,
  'token': 12846,
  'token_str': 'cuisine',
  'sequence': 'kentucky is famous for its cuisine.'},
 {'score': 0.021198133006691933,
  'token': 5194,
  'token_str': 'horses',
  'sequence': 'kentucky is famous for its horses.'}]


Kentucky is a state in the USA that may or may not have wineries, but definitely does not have famous beaches! Now, check the output when you replace Kentucky with the Brazilian state of Minas Gerais!

See - there is no "brain" inside BERT. There is merely a system that finds plausible completions for a task. This is something we have been calling "hallucinations" in LLMs. In the end, the model is just as biased as the dataset used for training it.
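The idea that completions only reflect the training data can be illustrated with a much simpler model. The sketch below is a count-based fill-in-the-blank toy, not BERT, and the corpus is made up for this example.

```python
from collections import Counter

# Made-up toy corpus, for illustration only
corpus = [
    "france is famous for its wine",
    "italy is famous for its wine",
    "spain is famous for its beaches",
    "kentucky is famous for its horses",
]

def fill_blank(corpus, prefix_words):
    """Rank the words that follow `prefix_words` in the corpus by relative frequency."""
    n = len(prefix_words)
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - n):
            if words[i:i + n] == prefix_words:
                counts[words[i + n]] += 1
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common()]

# The subject of the sentence is ignored here: only "famous for its" is used as context
ranking = fill_blank(corpus, "famous for its".split())
print(ranking)
```

Because this toy looks only at the words right before the blank, "wine" wins for any subject, Kentucky included. BERT uses far more context, but its scores are still estimated from what appeared in its training text.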

### Algorithmic prejudice

Despite the funny things that the model can output, some of its assertions can be dangerous, or outright sexist. Try to see the output of:


In [9]:

unmasker("A successful man works as a [MASK].")


[{'score': 0.06930015236139297,
  'token': 3460,
  'token_str': 'doctor',
  'sequence': 'a successful man works as a doctor.'},
 {'score': 0.06541887670755386,
  'token': 5160,
  'token_str': 'lawyer',
  'sequence': 'a successful man works as a lawyer.'},
 {'score': 0.04109674692153931,
  'token': 7500,
  'token_str': 'farmer',
  'sequence': 'a successful man works as a farmer.'},
 {'score': 0.03909466415643692,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'a successful man works as a carpenter.'},
 {'score': 0.03854001685976982,
  'token': 22701,
  'token_str': 'tailor',
  'sequence': 'a successful man works as a tailor.'}]


Now, replace "man" with "woman". The result is not as pretty. But, see, this is not a problem of the language model structure per se - rather, it is a problem of the data used to train it.
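A comparison like this can be automated with a small helper. The sketch below assumes only that the fill-mask function returns a list of dictionaries with a `token_str` key, as in the outputs shown above; `top_tokens`, `only_in_each`, and the stub are hypothetical names, and the stub's answers are placeholders rather than real BERT predictions.

```python
def top_tokens(fill_mask, template, subjects, mask_token="[MASK]"):
    """Fill the template once per subject and collect the predicted words."""
    results = {}
    for subject in subjects:
        sentence = template.format(subject=subject, mask=mask_token)
        results[subject] = [pred["token_str"] for pred in fill_mask(sentence)]
    return results

def only_in_each(results, a, b):
    """Return the words predicted only for `a` and only for `b`."""
    set_a, set_b = set(results[a]), set(results[b])
    return sorted(set_a - set_b), sorted(set_b - set_a)

def stub_fill_mask(sentence):
    # Stand-in so the sketch runs offline; these are NOT real model outputs
    marker = "only-for-man" if " man " in sentence else "only-for-woman"
    return [{"token_str": "shared-word"}, {"token_str": marker}]

results = top_tokens(stub_fill_mask, "A successful {subject} works as a {mask}.", ["man", "woman"])
print(only_in_each(results, "man", "woman"))
```

To run it against the real model, pass the `unmasker` pipeline created earlier in place of `stub_fill_mask`.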

We could go on finding examples of other types of prejudice - there are all sorts of sexism and racism lying in the hidden spaces of BERT.

This is bad, but remember this was 2019, and people were impressed that the system could generate coherent words at all! Nowadays, deployed LLMs are typically combined with safeguards, such as filters that look for potentially harmful phrases, so they are less likely to write ugly phrases.


## Task 2: Next Sentence Prediction

BERT was also trained for a task called Next Sentence Prediction. The idea of this task is to insert two sentences in the input of BERT, separating them with a special [SEP] token. Then, the system uses the output of the [CLS] token to classify whether these two sentences do or do not follow each other. It is something like:

```mermaid
graph LR;
    subgraph Inputs;
    INPUT["[CLS]
        Here
        I
        am
        [SEP]
        rock
        you
        like
        a
        hurricane"];
    end;
    INPUT --> BERT["BERT"];
    subgraph Outputs;
    OUTPUT["C
    T1
    T2
    etc"];
    end;
    BERT --> OUTPUT;
    Train["Loss: C should be equal to 1"]
    OUTPUT --- Train;
```

```mermaid
graph LR;
    subgraph Inputs;
    INPUT["[CLS]
        Here
        I
        am
        [SEP]
        rock
        your
        body"];
    end;
    INPUT --> BERT["BERT"];
    subgraph Outputs;
    OUTPUT["C
    T1
    T2
    etc"];
    end;
    BERT --> OUTPUT;
    Train["Loss: C should be equal to 0"]
    OUTPUT --- Train;
```
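The two-sentence input described above, with a `[SEP]` token between the sentences, can be sketched as follows. The helper name `build_pair_input` is hypothetical, and the trailing `[SEP]` follows the original BERT input format even though it is not drawn in the diagrams. In practice the Hugging Face tokenizer builds this layout when it receives two texts.

```python
def build_pair_input(tokens_a, tokens_b):
    """Lay out two tokenized sentences as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    ["here", "i", "am"],
    ["rock", "you", "like", "a", "hurricane"],
)
for token, segment in zip(tokens, segment_ids):
    print(segment, token)
```

The segment ids are what the segment embeddings look up, so the model can tell which sentence each token belongs to.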

The consequence of this training is that the embedding $C$ of the [CLS] token summarizes the content of the rest of the tokens. Hence, we can use it for classification. To do so, we can go straight to the Hugging Face library and use:



In [10]:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)


The embedding for the [CLS] token can be accessed using:


In [11]:
output_cls = output.last_hidden_state[0,0,:]
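The indexing `[0, 0, :]` reads as "first text in the batch, first token, all dimensions". The sketch below uses a NumPy array of zeros as a stand-in for `last_hidden_state`, so it runs without downloading the model. The number of tokens is arbitrary, and the mean-pooling line is an alternative added here for comparison, not something used in the rest of this lesson.

```python
import numpy as np

# Stand-in for output.last_hidden_state, with shape (batch, tokens, hidden)
batch_size, n_tokens, hidden_size = 2, 12, 768  # 768 is the hidden size of bert-base
last_hidden_state = np.zeros((batch_size, n_tokens, hidden_size))
last_hidden_state[:, 0, :] = 1.0  # mark the [CLS] vectors so they are easy to spot

cls_vector = last_hidden_state[0, 0, :]          # [CLS] of the first text only
all_cls = last_hidden_state[:, 0, :]             # [CLS] of every text in the batch
mean_pooled = last_hidden_state[0].mean(axis=0)  # alternative: average over all tokens

print(cls_vector.shape, all_cls.shape, mean_pooled.shape)
```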

    
There are many details in this implementation, so I made a [video exploring them all](https://youtu.be/FXtGq_TYLzM).

## Exercise

Our usual way to approach classification is to do something along the lines of:

In [15]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/tiagoft/NLP/main/wiki_movie_plots_drama_comedy.csv').sample(1000)
X = df['Plot']
y = df['Genre']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with TfidfVectorizer and LogisticRegression
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression(max_iter=1000))
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Test the pipeline
y_pred = pipeline.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      comedy       0.65      0.20      0.30        87
       drama       0.60      0.92      0.72       113

    accuracy                           0.60       200
   macro avg       0.63      0.56      0.51       200
weighted avg       0.62      0.60      0.54       200



Now, instead of using a TF-IDF vectorizer, calculate embeddings for the texts in the dataset using BERT. Then, use *them* to classify. Compare the results with the ones we get with the bag-of-words approach.

Justify these results using the concept of embeddings we have studied in the previous lessons.

In [1]:
# Make your solution here



In [11]:
# This is my solution - do not copy it!

# Step 0: get data
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/tiagoft/NLP/main/wiki_movie_plots_drama_comedy.csv').sample(1000)
X = df['Plot']
y = df['Genre']

# Step 1: compute one [CLS] embedding per text
import torch
from transformers import BertTokenizer, BertModel
from tqdm import tqdm
import numpy as np
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
def get_embeddings(text, model, tokenizer):
    # Tokenize the input text, truncating it to BERT's 512-token limit
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    # Inference only: skip gradient tracking to save memory and time
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] is the first token of the only sequence in this batch
    cls_embedding = outputs.last_hidden_state[0, 0, :]
    return cls_embedding

embeddings = []
for i in tqdm(range(len(X))):
    e = get_embeddings(X.iloc[i], model, tokenizer)
    embeddings.append(e.detach().numpy())
embeddings = np.array(embeddings)
np.save('bert_embeddings.npy', embeddings)


100%|██████████| 1000/1000 [04:35<00:00,  3.63it/s]
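Processing one text per call is simple but slow, as the progress bar above shows. One possible speed-up is to send several texts to the model at once. The sketch below shows only the batching logic, with hypothetical helper names and a stub in place of the model so that it runs offline. With the real model, `embed_batch` could tokenize a list of texts with `padding=True` and `truncation=True` and return `outputs.last_hidden_state[:, 0, :]`.

```python
import numpy as np

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_batch, batch_size=16):
    """Apply `embed_batch` to each batch and stack the results into one matrix."""
    chunks = [np.asarray(embed_batch(batch)) for batch in batched(texts, batch_size)]
    return np.vstack(chunks)

def stub_embed_batch(batch):
    # Stand-in for the model: one 4-dimensional vector per text (NOT a real embedding)
    return np.array([[len(text), 0.0, 0.0, 0.0] for text in batch])

texts = ["a plot", "another plot", "a third, longer plot", "the end", "credits"]
embeddings_demo = embed_in_batches(texts, stub_embed_batch, batch_size=2)
print(embeddings_demo.shape)
```

Whether batching actually helps depends on the hardware and on how much padding the longest text in each batch forces on the others.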


In [12]:
embeddings = np.load('bert_embeddings.npy')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(embeddings, y, test_size=0.2, random_state=42)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

      comedy       0.74      0.75      0.74        79
       drama       0.83      0.83      0.83       121

    accuracy                           0.80       200
   macro avg       0.79      0.79      0.79       200
weighted avg       0.80      0.80      0.80       200

