[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/sk-classroom/asc-bert/blob/main/assignments/assignment_01.ipynb)

We will learn how to generate word embeddings using BERT. BERT produces contextualized word embeddings: the embedding of a word is computed from the sentence it appears in, so the same word can have different embeddings in different contexts.

# Preparation

In [1]:
# If you haven't installed the required packages, please install them using pip
# pip install transformers plotly

In [2]:
import pandas as pd
import numpy as np
import transformers
import torch
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.decomposition import PCA
import plotly.express as px


# Data 

We will use [CoarseWSD-20](https://github.com/danlou/bert-disambiguation/tree/master/data/CoarseWSD-20). The dataset contains sentences with polysemous words and their sense labels. We will see how to use BERT to disambiguate the word senses.



In [3]:
def load_data(focal_word, is_train):
    data_type = "train" if is_train else "test"
    data_file = f"https://raw.githubusercontent.com/danlou/bert-disambiguation/master/data/CoarseWSD-20/{focal_word}/{data_type}.data.txt"
    label_file = f"https://raw.githubusercontent.com/danlou/bert-disambiguation/master/data/CoarseWSD-20/{focal_word}/{data_type}.gold.txt"

    data_table = pd.read_csv(
        data_file,
        sep="\t",
        header=None,
        dtype={"word_pos": int, "sentence": str},
        names=["word_pos", "sentence"],
    )
    label_table = pd.read_csv(
        label_file,
        sep="\t",
        header=None,
        dtype={"label": int},
        names=["label"],
    )
    combined_table = pd.concat([data_table, label_table], axis=1)
    return combined_table


focal_word = "java"

# The test split is loaded to save time; the variable is still called train_data
train_data = load_data(focal_word, is_train=False)

train_data.head(10)

| | word_pos | sentence | label |
|---|---|---|---|
| 0 | 5 | there also exist javascript and java backends ... | 1 |
| 1 | 4 | it is found on java . | 0 |
| 2 | 7 | typed object-oriented programming languages , ... | 1 |
| 3 | 11 | flying saucer ( also called xhtml renderer ) i... | 1 |
| 4 | 82 | november 2013 jelastic announced a partnership... | 1 |
| 5 | 8 | they also traveled to bali , borneo , java , c... | 0 |
| 6 | 21 | sunan bayat is often mentioned in the javanese... | 0 |
| 7 | 18 | mucommander is a lighweight , open-source , cr... | 1 |
| 8 | 8 | following graduation , matthews spent six year... | 0 |
| 9 | 26 | time zones many computer operating systems ( s... | 1 |


Please refer to the [README](https://github.com/danlou/bert-disambiguation/blob/master/data/CoarseWSD-20/README.txt) for details about the data.
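Judging from the preview above, each row pairs a zero-based token index (`word_pos`) with a whitespace-tokenized sentence. A minimal sketch, assuming that reading of the format (the helper name `focal_token` is hypothetical and not part of the dataset tooling), checks that the index points at the focal word:

```python
def focal_token(word_pos, sentence):
    """Return the whitespace-separated token at position word_pos (zero-based)."""
    tokens = sentence.split()
    return tokens[word_pos]


# Second row of the preview above: word_pos=4, label=0
print(focal_token(4, "it is found on java ."))  # java
```

A check like this can be useful before embedding, since any mismatch would indicate that the position column is being read differently than assumed.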

## Define BERT model

We will use the `transformers` library developed by Hugging Face to load a pretrained BERT model. We need two components:  
1. A BERT tokenizer that converts the text into tokens. 
2. A BERT model that computes the embeddings of the tokens. 

We will use the `bert-base-uncased` model and its matching tokenizer. Let's load both. 



In [4]:
model = transformers.BertModel.from_pretrained(
    "bert-base-uncased",
    output_hidden_states=True,
)
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")

model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

BERT expects its input text in a specific format. 
Specifically, we prepend ```[CLS]``` to the text and append ```[SEP]```. We then convert the text to a tensor of token ids, which is ready to be fed into the model. 



In [5]:
def prepare_text(text):
    text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    segments_ids = torch.ones((1, len(indexed_tokens)), dtype=torch.long)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = segments_ids.clone()
    return tokenized_text, tokens_tensor, segments_tensor
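
To give a feel for what `tokenizer.tokenize` and `convert_tokens_to_ids` do, here is a toy sketch of WordPiece-style greedy longest-match-first splitting. The vocabulary `toy_vocab` and the function `wordpiece` are hypothetical illustrations; the real `bert-base-uncased` vocabulary has 30,522 entries and may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary mapping tokens to ids
toy_vocab = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "java": 3, "##nese": 4, "is": 5}

tokens = ["[CLS]"]
for word in "java is javanese".split():
    tokens.extend(wordpiece(word, toy_vocab))
tokens.append("[SEP]")
ids = [toy_vocab[t] for t in tokens]

print(tokens)  # ['[CLS]', 'java', 'is', 'java', '##nese', '[SEP]']
print(ids)     # [0, 3, 5, 3, 4, 1]
```

Note that one word can become several tokens, which is why token positions in the BERT input do not always line up with word positions in the original sentence.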

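> What is the segments tensor?
BERT is designed to process sentence pairs, and segment ids mark which sentence each token belongs to: 0 for the first sentence and 1 for the second. For a single-sentence input, every token carries the same segment id. Note that `prepare_text` fills `segments_tensor` with 1s, and that the second positional argument of Hugging Face's `BertModel` is `attention_mask`. Called as `model(tokens_tensor, segments_tensor)`, the all-ones tensor therefore acts as an attention mask that lets every token be attended to, while the segment ids fall back to their default of 0.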
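
As an illustration of segment ids for a sentence pair, the following sketch builds them by hand under the convention used by Hugging Face tokenizers, where the first segment is 0 and the second is 1. The helper `segment_ids` is hypothetical and only mimics the layout `[CLS] A [SEP] B [SEP]`.

```python
def segment_ids(tokens_a, tokens_b=None):
    """Build tokens and segment ids for '[CLS] A [SEP]' or '[CLS] A [SEP] B [SEP]'."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    ids = [0] * len(tokens)  # first segment, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        ids += [1] * (len(tokens_b) + 1)  # second segment and its [SEP]
    return tokens, ids


tokens, ids = segment_ids(["it", "is", "found"], ["on", "java"])
print(list(zip(tokens, ids)))
```

With a single sentence (`tokens_b=None`), every id is 0, so the segment ids carry no extra information in that case.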

Let's get the BERT embeddings for the sentence "Bank is located in the city of London". 

First, let's prepare the text for BERT. 

In [6]:
text = "Bank is located in the city of London"
tokenized_text, tokens_tensor, segments_tensor = prepare_text(text)

Then, let's get the BERT embeddings for each token. 

In [7]:
outputs = model(tokens_tensor, segments_tensor)

The output of `BertModel` includes `last_hidden_state`, `pooler_output`, and `hidden_states`. We will use `hidden_states`, which contains the embeddings of the tokens at every layer. 



In [8]:
hidden_states = outputs.hidden_states

print("how many layers? ", len(hidden_states))
print("Shape? ", hidden_states[0].shape)

how many layers?  13
Shape?  torch.Size([1, 10, 768])


The hidden states are a list of 13 tensors, where each tensor is of shape (batch_size, sequence_length, hidden_size). The first tensor is the input embeddings, and the subsequent tensors are the hidden states of the BERT layers. 

So, we have 13 choices of hidden states. Layers closer to the output build on all preceding layers, so their embeddings reflect more of the word's context.

Here we will take the average over the last four hidden states for each token. 

In [9]:
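# Indices -1 to -4 are the last four layers; start at 1 because hidden_states[-0] is the input embeddings
emb = torch.cat([hidden_states[-i] for i in range(1, 5)], dim=0).mean(dim=0)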
emb.shape

torch.Size([10, 768])
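
The indexing in this averaging step is easy to get wrong, so here is a small check using NumPy arrays as stand-ins for the BERT tensors (the constant-valued layers are an assumption made purely so the result can be verified by hand). Negative indices must run from -1 to -4, because in Python `-0` equals `0` and would select the input embeddings instead of a late layer.

```python
import numpy as np

# Stand-in for outputs.hidden_states: 13 arrays of shape (1, seq_len, hidden).
# Layer k is filled with the constant k so the average is easy to verify.
seq_len, hidden = 10, 768
hidden_states = [np.full((1, seq_len, hidden), float(k)) for k in range(13)]

# Indices -1, -2, -3, -4 select layers 12, 11, 10, 9.
last_four = [hidden_states[-i] for i in range(1, 5)]
emb = np.concatenate(last_four, axis=0).mean(axis=0)

print(emb.shape)  # (10, 768)
print(emb[0, 0])  # 10.5
```

The value 10.5 is the mean of 12, 11, 10, and 9, confirming that only the last four layers contribute.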

`emb` has shape (sequence_length, hidden_size). Let us wrap these steps into a function. 

In [10]:
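def get_bert_embeddings(text):
    tokenized_text, tokens_tensor, segments_tensor = prepare_text(text)
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensor)
    hidden_states = outputs.hidden_states
    # Average the last four layers (indices -1 to -4) for each token
    emb = torch.cat([hidden_states[-i] for i in range(1, 5)], dim=0).mean(dim=0)
    return emb, tokenized_text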

# Embedding
Let's embed each sentence and extract the embedding of the focal word. 

In [11]:
label_list = []
word_emb_list = []
sent_list = []
for word_pos, sentence, label in tqdm(
    train_data[["word_pos", "sentence", "label"]].itertuples(index=False),
    total=train_data.shape[0],
):
    emb, tokenized_text = get_bert_embeddings(sentence)
    word_pos += 1  # Shift by one for the [CLS] token prepended in prepare_text

    # Skip the sentence if subword splitting has moved the focal word away from this position
    if tokenized_text[word_pos] != focal_word:
        continue

    word_emb_list.append(emb[word_pos].detach().numpy())
    label_list.append(label)
    sent_list.append(sentence)

word_emb_list = np.vstack(word_emb_list)
label_list = np.array(label_list)

100%|██████████| 1929/1929 [00:43<00:00, 44.42it/s]
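
The loop above skips any sentence where the token at `word_pos + 1` is not the focal word, which can happen when an earlier word is split into several subword pieces. One possible alternative, sketched here under the assumption that a per-word tokenizer function is available, is to count the pieces produced by the preceding words. The names `subword_position` and `toy_tokenize` are hypothetical, and `toy_tokenize` is a stand-in rather than the BERT tokenizer.

```python
def subword_position(words, word_pos, tokenize_word):
    """Index of the first subword of words[word_pos], after a leading [CLS]."""
    pos = 1  # slot 0 is reserved for [CLS]
    for word in words[:word_pos]:
        pos += len(tokenize_word(word))
    return pos


def toy_tokenize(word):
    # Hypothetical stand-in: pretend only "lightweight" is split into two pieces.
    return ["light", "##weight"] if word == "lightweight" else [word]


words = "a lightweight java library".split()
tokens = ["[CLS]"] + [p for w in words for p in toy_tokenize(w)] + ["[SEP]"]

pos = subword_position(words, 2, toy_tokenize)
print(pos, tokens[pos])  # 4 java
```

With the real tokenizer, something like `tokenizer.tokenize` could play the role of `tokenize_word`, though tokenizing word by word is only assumed here to match tokenizing the full sentence.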


# Results 

Let's plot the embeddings of the focal word. 

In [12]:
xy = PCA(n_components=2).fit_transform(word_emb_list)

fig = px.scatter(
    x=xy[:, 0],
    y=xy[:, 1],
    color=label_list,
    hover_data=[sent_list],
    title="PCA of Word Embeddings",
)
fig.update_layout(width=700, height=500)
fig.show()
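
For intuition about what `PCA(n_components=2)` computes, the projection can be sketched with a singular value decomposition in NumPy. This is a minimal illustration on synthetic data, not scikit-learn's implementation; the function `pca_2d` is hypothetical, and its output may differ from scikit-learn's by the sign of each axis.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto the two directions of largest variance."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))  # synthetic stand-in for word_emb_list
xy = pca_2d(X)
print(xy.shape)  # (50, 2)
```

The first column of `xy` has the largest variance of any one-dimensional projection of the centered data, and the second column is uncorrelated with it.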

# Assignment

- Run the code for any word available in [the dataset](https://github.com/danlou/bert-disambiguation/tree/master/data/CoarseWSD-20) **except** the word "apple". Use the test data to save time, then save the embeddings of the test data and their labels by running the following cell.  
- Make sure to place the generated file "eval_test.csv" in the "data" folder. 
- Commit the file to your assignment repository. 

In [17]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

K = len(set(label_list))

xy = LinearDiscriminantAnalysis(n_components=K - 1).fit_transform(
    word_emb_list, label_list
)

xy_df = pd.DataFrame(xy)
xy_df["label"] = label_list
xy_df.to_csv("../data/eval_test.csv", index=False)

# pd.DataFrame({"x": xy[:, 0], "y": xy[:, 1], "label": label_list}).to_csv(
#    "../data/eval_test.csv", index=False
# )
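
The cell above asks LDA for `K - 1` components, which for a two-sense word is a single direction. As a rough illustration of that two-class case, the classical Fisher discriminant direction can be computed directly with NumPy. This is a sketch on synthetic data under the assumption of a well-conditioned within-class scatter matrix, not scikit-learn's algorithm; the function `fisher_direction` is hypothetical.

```python
import numpy as np


def fisher_direction(X, y):
    """Two-class Fisher discriminant direction w = Sw^{-1} (mu1 - mu0)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two classes' scatter matrices.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    return np.linalg.solve(Sw, mu1 - mu0)


rng = np.random.default_rng(0)
# Synthetic stand-in for word_emb_list and label_list: two shifted Gaussian clouds
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 5)),
    rng.normal(2.0, 1.0, size=(100, 5)),
])
y = np.array([0] * 100 + [1] * 100)

w = fisher_direction(X, y)
z = X @ w  # one-dimensional projection of each point
print(z[y == 1].mean() > z[y == 0].mean())  # True
```

With 768-dimensional embeddings and few sentences, the scatter matrix may be singular, so this direct solve is only meant to convey the idea.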