[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/sk-classroom/asc-bert/blob/main/assignments/assignment_01.ipynb)

In this notebook, we will learn how to generate word embeddings using BERT. BERT produces contextualized word embeddings: the embedding of a word is computed from the sentence it appears in, so the same word can have different embeddings in different contexts.

# Preparation

In [1]:
# If you haven't installed the required packages, please install them using pip
#!pip install transformers plotly

In [2]:
import pandas as pd
import numpy as np
import transformers
import torch
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.decomposition import PCA
import plotly.express as px

# Data 

We will use [CoarseWSD-20](https://github.com/danlou/bert-disambiguation/tree/master/data/CoarseWSD-20). The dataset contains sentences with polysemous words and their sense labels. We will see how to use BERT to disambiguate the word senses.



In [3]:
def load_data(focal_word, is_train, n_samples=100):
    data_type = "train" if is_train else "test"
    data_file = f"https://raw.githubusercontent.com/danlou/bert-disambiguation/master/data/CoarseWSD-20/{focal_word}/{data_type}.data.txt"
    label_file = f"https://raw.githubusercontent.com/danlou/bert-disambiguation/master/data/CoarseWSD-20/{focal_word}/{data_type}.gold.txt"

    data_table = pd.read_csv(
        data_file,
        sep="\t",
        header=None,
        dtype={"word_pos": int, "sentence": str},
        names=["word_pos", "sentence"],
    )
    label_table = pd.read_csv(
        label_file,
        sep="\t",
        header=None,
        dtype={"label": int},
        names=["label"],
    )
    combined_table = pd.concat([data_table, label_table], axis=1)
    return combined_table.sample(n_samples)


focal_word = "apple"

train_data = load_data(focal_word, is_train=True)

train_data.head(10)

Unnamed: 0,word_pos,sentence,label
988,25,transparency reports are issued today by a var...,0
879,72,maximum files per server : 16 million maximum ...,0
1101,0,"apple , google , and intel are among 1,600 tec...",0
1958,17,the town was first settled around 1763 by jean...,1
587,18,it is made of puff pastry and four fillings : ...,1
2233,20,past projects the firm has contributed to the ...,0
794,43,nutrients and potential health effects sea-buc...,1
1423,13,wheat is the most commonly grown product ; how...,1
2122,16,launched in 1981 by london-based rainbow softw...,0
1656,17,technical background with the transition to ta...,0


Please refer to the [README](https://github.com/danlou/bert-disambiguation/blob/master/data/CoarseWSD-20/README.txt) for details on the data.
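As a rough illustration of the file layout that `load_data` reads, the sketch below parses two in-memory strings shaped like `*.data.txt` and `*.gold.txt`. The rows and labels are made up for illustration and are not taken from CoarseWSD-20, and `parse_wsd` is a hypothetical helper. It assumes that `word_pos` is a 0-based index into the whitespace-separated words of the sentence.

```python
import csv
import io

# Made-up rows in the tab-separated layout "<word_pos>\t<sentence>"
data_txt = "0\tapple released a new phone\n3\tshe ate an apple for lunch\n"
# One sense label per line, aligned with the rows above (made-up labels)
gold_txt = "0\n1\n"


def parse_wsd(data_text, gold_text):
    """Pair each (word_pos, sentence) row with the label on the same line."""
    reader = csv.reader(io.StringIO(data_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    labels = [int(line) for line in gold_text.split()]
    assert len(rows) == len(labels), "data and gold files must have the same length"
    return [
        {"word_pos": int(pos), "sentence": sent, "label": lab}
        for (pos, sent), lab in zip(rows, labels)
    ]


records = parse_wsd(data_txt, gold_txt)
for r in records:
    # Look up the focal word by its position among whitespace-separated words
    focal = r["sentence"].split()[r["word_pos"]]
    print(r["label"], focal, "|", r["sentence"])
```

In both made-up rows the word at `word_pos` is the focal word, which is the property the embedding loop later relies on.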

## Define BERT model

We will use the `transformers` library developed by Hugging Face to define the BERT model. To use the model, we will need:  
1. BERT tokenizer that converts the text into tokens. 
2. BERT model that computes the embeddings of the tokens. 

We will use the `bert-base-uncased` model and tokenizer. Let's define the model and tokenizer. 



In [4]:
# TODO: Define the model and tokenizer 
# Hint: Use the transformers library to load a pre-trained BERT model and tokenizer

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

BERT expects its input in a specific format. 
Specifically, we prepend `[CLS]` to the text and append `[SEP]`. We then convert the text to a tensor of token ids, which is ready to be fed into the model. 



In [5]:
def prepare_text(text):
    text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    segments_ids = torch.ones((1, len(indexed_tokens)), dtype=torch.long)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = segments_ids.clone()
    return tokenized_text, tokens_tensor, segments_tensor

> What is the segment tensor?
BERT is designed to process sentence pairs, with segment ids 0 and 1 marking the first and second sentence respectively. For a single-sentence input, all tokens belong to one segment, so every token gets the same id; `prepare_text` uses a vector of 1s.
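
To make the sentence-pair layout concrete, here is a minimal sketch that builds segment ids by hand for one or two token lists. It follows the convention used by Hugging Face tokenizers, where `0` marks the first sentence and `1` the second. `build_segment_ids` is a hypothetical helper written for this illustration, not part of `transformers`.

```python
def build_segment_ids(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) in the layout [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    # The first sentence, including [CLS] and its [SEP], gets id 0
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        # The second sentence and its closing [SEP] get id 1
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_segment_ids(
    ["bank", "is", "open"], ["it", "closes", "late"]
)
print(list(zip(tokens, segment_ids)))

# With a single sentence, every token shares one segment id
single_tokens, single_ids = build_segment_ids(["bank", "is", "open"])
print(single_ids)
```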

Let's get the BERT embeddings for the sentence "Bank is located in the city of London". 

First, let's prepare the text for BERT. 

In [6]:
text = "Bank is located in the city of London"
tokenized_text, tokens_tensor, segments_tensor = prepare_text(text)
print(tokenized_text)
print(tokens_tensor)
print(segments_tensor)

['[CLS]', 'bank', 'is', 'located', 'in', 'the', 'city', 'of', 'london', '[SEP]']
tensor([[ 101, 2924, 2003, 2284, 1999, 1996, 2103, 1997, 2414,  102]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


Then, let's get the BERT embeddings for each token. 

In [7]:
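# Note: the second positional argument of BertModel.forward is `attention_mask`,
# so the all-ones tensor acts as an attention mask here and the segment ids
# (token_type_ids) fall back to their default of all zeros.
outputs = model(tokens_tensor, segments_tensor)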


The output includes `last_hidden_state`, `pooler_output`, and `hidden_states`. We will use `hidden_states`, which contains the embeddings of the tokens at every layer. 



In [8]:
hidden_states = outputs.hidden_states

print("how many layers? ", len(hidden_states))
print("Shape? ", hidden_states[0].shape)

how many layers?  13
Shape?  torch.Size([1, 10, 768])


The hidden states are a tuple of 13 tensors, where each tensor is of shape (batch_size, sequence_length, hidden_size). The first tensor is the input embeddings, and the subsequent tensors are the hidden states of the 12 BERT layers. 

So, we have 13 choices of hidden states. Layers closer to the output build on the layers before them and therefore capture more of the word's context.

Here we will take the average over the last four hidden states for each token. 

In [9]:
# TODO: Compute the embedding of the token
# Average the last four hidden states, then drop the batch dimension
emb = torch.stack(hidden_states[-4:]).mean(dim=0).squeeze(0)
emb.shape

torch.Size([10, 768])

`emb` is of shape (sequence_length, hidden_size). Let us wrap the embedding computation in a function. 

In [10]:
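def get_bert_embeddings(text):
    tokenized_text, tokens_tensor, segments_tensor = prepare_text(text)
    with torch.no_grad():  # inference only, no gradient tracking
        outputs = model(tokens_tensor, segments_tensor)
    hidden_states = outputs.hidden_states
    # Average the last four hidden states -> (sequence_length, hidden_size)
    emb = torch.stack(hidden_states[-4:]).mean(dim=0).squeeze(0)
    return emb, tokenized_text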
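
The averaging over the last four hidden states can be checked on synthetic data without loading the model. The sketch below uses NumPy arrays filled with random numbers as a stand-in for `outputs.hidden_states`; only the shapes mirror `bert-base-uncased`. With PyTorch tensors, the same step would be `torch.stack(hidden_states[-4:]).mean(dim=0)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for outputs.hidden_states: 13 arrays of shape (batch, seq_len, hidden).
# The values are random; only the shapes mirror bert-base-uncased.
n_layers, batch_size, seq_len, hidden_size = 13, 1, 10, 768
hidden_states = tuple(
    rng.normal(size=(batch_size, seq_len, hidden_size)) for _ in range(n_layers)
)

# Stack the last four layers -> (4, batch, seq_len, hidden), then average over layers
last_four = np.stack(hidden_states[-4:], axis=0)
token_emb = last_four.mean(axis=0)[0]  # drop the batch dimension

print(token_emb.shape)  # (seq_len, hidden)
```

Each row of `token_emb` is the embedding of one token, so the focal word's embedding is the row at its token index.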

# Embedding
Let's embed the text and get the embedding of the focal word.

In [11]:
labels = []  # label
emb = []  # embedding
sentences = []  # sentence

# TODO: Go through the data and get the embedding of the focal word.
for i, row in tqdm(train_data.iterrows(), total=len(train_data)):
    text = row["sentence"]
    word_pos = row["word_pos"]
    label = row["label"]
    tokenized_text, tokens_tensor, segments_tensor = prepare_text(text)

    # word_pos is assumed to index whitespace-separated words, so map it to a
    # token index: skip [CLS] (+1) and the word pieces of all preceding words.
    token_pos = 1 + len(tokenizer.tokenize(" ".join(text.split()[:word_pos])))

    # Pass tokens and segments through the model
    with torch.no_grad():  # Ensure no gradient tracking
        outputs = model(tokens_tensor, segments_tensor)
        hidden_states = outputs.hidden_states

    # Extract the embedding of the focal word (average of the last four hidden states)
    token_emb = torch.stack(hidden_states[-4:]).mean(dim=0).squeeze(0)
    word_embedding = token_emb[token_pos].numpy()

    # Append data to lists
    emb.append(word_embedding)
    labels.append(label)
    sentences.append(tokenized_text)

100%|██████████| 100/100 [00:04<00:00, 20.54it/s]
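
One detail worth spelling out is how a word position relates to a token position. The tokenized input starts with `[CLS]`, and a single word may be split into several word pieces, so the two indices generally differ. The sketch below illustrates the mapping with a toy tokenizer; `word_to_token_index` and `toy_tokenize` are hypothetical helpers, and the splitting rule is invented for the example. With the real tokenizer, `tokenizer.tokenize` could be passed as `tokenize`.

```python
def word_to_token_index(words, word_pos, tokenize):
    """Index of the first sub-token of words[word_pos] in '[CLS] ... [SEP]'.

    `tokenize` maps one word to a list of sub-tokens.
    """
    n_before = sum(len(tokenize(w)) for w in words[:word_pos])
    return 1 + n_before  # +1 skips the leading [CLS]


def toy_tokenize(word):
    """Toy stand-in for WordPiece: splits words longer than 6 characters."""
    if len(word) <= 6:
        return [word]
    return [word[:6], "##" + word[6:]]


words = "the strawberry and apple pie".split()
word_pos = 3  # position of "apple" among whitespace-separated words

tokens = ["[CLS]"] + [t for w in words for t in toy_tokenize(w)] + ["[SEP]"]
idx = word_to_token_index(words, word_pos, toy_tokenize)

print(tokens)
print(idx, tokens[idx])
```

Here "apple" sits at word position 3 but token position 5, because of `[CLS]` and the extra piece produced for "strawberry".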


# Results 

Let's plot the embeddings of the focal word. 

In [12]:
def plot_result(emb, labels, sentences):
    
    xy = PCA(n_components=2).fit_transform(emb)
    
    fig = px.scatter(
        x=xy[:, 0],
        y=xy[:, 1],
        color=labels,
        hover_data=[sentences],
        title="PCA of Word Embeddings",
    )
    fig.update_layout(width=700, height=500)
    fig.update_traces(
        marker=dict(size=12, line=dict(width=2, color="DarkSlateGrey")),
        selector=dict(mode="markers"),
    )
    fig.show()


plot_result(emb, labels, sentences)
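
To see what the PCA projection in `plot_result` does, the following sketch projects synthetic vectors onto their top two principal components using an SVD in NumPy. The data are random numbers arranged in two clusters, not BERT outputs, and `pca_2d` is a hypothetical helper; signs and scaling may differ from scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "embeddings": two clusters of 50 points in 768 dimensions.
# These are random numbers, not BERT outputs.
centers = np.zeros((2, 768))
centers[1, :10] = 5.0
labels = np.repeat([0, 1], 50)
emb = centers[labels] + rng.normal(size=(100, 768))


def pca_2d(x):
    """Project the rows of x onto the top two principal components via SVD."""
    x_centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:2].T


xy = pca_2d(emb)
print(xy.shape)

# Distance between the two cluster means along the first component
gap = abs(xy[labels == 0, 0].mean() - xy[labels == 1, 0].mean())
print(round(float(gap), 2))
```

If the two senses of a word form distinct clusters in embedding space, they should likewise show up as separated groups in the 2D scatter plot.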

# Assignment

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def save_assignment(emb, labels, assignment_id, data_dir):
    K = len(set(labels))
    xy = LinearDiscriminantAnalysis(n_components=K - 1).fit_transform(emb, labels)
    xy_df = pd.DataFrame(xy)
    xy_df["label"] = labels
    xy_df.to_csv(f"{data_dir}/eval_test_{assignment_id}.csv", index=False)

## Assignment 1
- Run this notebook for any word available in [the dataset](https://github.com/danlou/bert-disambiguation/tree/master/data/CoarseWSD-20) except for "apple". 
- Save the (dimensionality-reduced) embeddings of the test data and their labels by running the following cell.  
- Make sure to place the generated file "eval_test_01.csv" in the "data" folder. 
- Commit the file to your assignment repository. 

In [None]:
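# Your code

focal_word = "crane"
test_data = load_data(focal_word, is_train=False)

emb = []
labels = []
sentences = []

for i, row in tqdm(test_data.iterrows(), total=len(test_data)):
    text = row["sentence"]
    word_pos = row["word_pos"]
    label = row["label"]
    tokenized_text, tokens_tensor, segments_tensor = prepare_text(text)

    # word_pos is assumed to index whitespace-separated words, so map it to a
    # token index: skip [CLS] (+1) and the word pieces of all preceding words.
    token_pos = 1 + len(tokenizer.tokenize(" ".join(text.split()[:word_pos])))

    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensor)
        hidden_states = outputs.hidden_states

    # Average the last four hidden states and take the focal word's vector
    token_emb = torch.stack(hidden_states[-4:]).mean(dim=0).squeeze(0)
    word_embedding = token_emb[token_pos].numpy()

    emb.append(word_embedding)
    labels.append(label)
    sentences.append(tokenized_text)

plot_result(emb, labels, sentences)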

In [None]:
save_assignment(emb, labels, assignment_id="01", data_dir="../data")

## Assignment 2

- Use the same dataset as Assignment 1. 
- Compute the word embedding using the first hidden state of the BERT model. 
- Save the embeddings of the test data and their labels by running the following cell.  
- Make sure to place the generated file "eval_test_02.csv" in the "data" folder. 
- Commit the file to your assignment repository. 

In [None]:
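# Your code:
# - Use the same dataset as Assignment 1.
# - Compute the word embedding using the first hidden state of the BERT model.

focal_word = "crane"
# The assignment asks for the test data, so load the test split
test_data = load_data(focal_word, is_train=False)

labels = []  # label
emb = []  # embedding
sentences = []  # sentence

for i, row in tqdm(test_data.iterrows(), total=len(test_data)):
    text = row["sentence"]
    word_pos = row["word_pos"]
    label = row["label"]
    tokenized_text, tokens_tensor, segments_tensor = prepare_text(text)

    # word_pos is assumed to index whitespace-separated words, so map it to a
    # token index: skip [CLS] (+1) and the word pieces of all preceding words.
    token_pos = 1 + len(tokenizer.tokenize(" ".join(text.split()[:word_pos])))

    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensor)
        hidden_states = outputs.hidden_states

    # hidden_states[0] is the input embedding layer; hidden_states[1] is taken
    # here to be the "first hidden state", i.e. the output of the first BERT layer.
    word_embedding = hidden_states[1].squeeze(0)[token_pos].numpy()

    emb.append(word_embedding)
    labels.append(label)
    sentences.append(tokenized_text)

plot_result(emb, labels, sentences)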

In [None]:
save_assignment(emb, labels, assignment_id="02", data_dir="../data")