# KAIST AI605 Assignment 4: Sequence and Token Classification with BERT

TA in charge: Minki Kang (zzxc1133@kaist.ac.kr)

**Due date**: December 6 (Mon) 11:00pm, 2021

## Your Submission
If you are a KAIST student, you will submit your assignment via [KLMS](https://klms.kaist.ac.kr). If you are a NAVER student, you will submit via [Google Form](https://forms.gle/aGZZ86YpCdv2zEVt9). 

You need to submit both (1) a PDF of this notebook, and (2) a link to CoLab for execution (.ipynb file is also allowed).

Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 20 points. For every late day, 2 points will be deducted from your grade (KAIST students only). You can use one of your no-penalty late days (7 days in total). Make sure to mention this in your submission. You will receive a grade of zero if you submit after 7 days.


## Environment
You will only use Python 3.7 and PyTorch 1.10, which is already available on Colab:

In [1]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.8.3
torch 1.7.0


## 1. Hugging Face Transformers
In this assignment, you will use the `transformers` library by Hugging Face. The library provides an easy way to use a variety of pretrained language models.

First, install both `transformers` and `datasets` packages:

In [2]:
!pip install transformers datasets



In Lecture 16, we walked through how we can use pretrained and finetuned BERT for sequence classification (https://huggingface.co/transformers/task_summary.html#sequence-classification).


**Problem 1.1** *(1 point)* Put your favorite emoji here 😇
https://getemoji.com/

Your favorite emoji: 😎

## 2. Sequence Classification with BERT
**Problem 2.1** *(3 points)* The tutorial at https://huggingface.co/transformers/training.html#fine-tuning-in-native-pytorch shows how to finetune a sequence classification model from `bert-base-cased` on the IMDB dataset. Repeat the same process with the SST dataset and report the accuracy here (i.e. it's fine to copy & paste code from the documentation).

Note that you can load the SST dataset via `load_dataset('sst')` from the `datasets` package.


>**Answer 2.1.**   
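On the SST dataset, the accuracy after 3 epochs of training was 83%. This was measured on 1,000 shuffled examples from the test split, after training on 1,000 shuffled examples from the training split.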

In [3]:
from datasets import load_dataset
raw_datasets = load_dataset('sst')

No config specified, defaulting to: sst/default
Reusing dataset sst (/home/sungnyun/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


  0%|          | 0/3 [00:00<?, ?it/s]

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Drop the raw-text columns that the model does not consume.
tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "tokens", "tree"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Loading cached processed dataset at /home/sungnyun/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff/cache-2b75ce99210cfa89.arrow
Loading cached processed dataset at /home/sungnyun/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff/cache-41ad81326f71da0e.arrow
Loading cached processed dataset at /home/sungnyun/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff/cache-6650ca77ac2f166e.arrow
Loading cached shuffled indices for dataset at /home/sungnyun/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff/cache-8ea81e7e77e4c69e.arrow
Loading cached shuffled indices for dataset at /home/sungnyun/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff/cache-59a7617b505bc126.arrow


In [5]:
import torch
import torch.nn.functional as F

from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification
from transformers import AdamW
from transformers import get_scheduler


train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [6]:
from tqdm.auto import tqdm
import time

progress_bar = tqdm(range(num_training_steps))

model.train()
start = time.time()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # SST labels are sentiment scores in [0, 1]; round them to binary class indices.
        batch['labels'] = torch.round(batch['labels']).long()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
end = time.time() - start
print('{:.2f}s'.format(end))

  0%|          | 0/375 [00:00<?, ?it/s]

103.58s
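
The progress bar total of 375 follows from the setup above: 1,000 training examples in batches of 8 give 125 steps per epoch, times 3 epochs. The sketch below reproduces that count and the shape of a linear schedule with zero warmup in plain Python. `linear_lr` is a hypothetical helper written for illustration, not the `transformers` implementation; it assumes the learning rate decays linearly from its initial value to zero over the training steps.

```python
import math

num_examples = 1000   # size of small_train_dataset
batch_size = 8
num_epochs = 3
base_lr = 5e-5

steps_per_epoch = math.ceil(num_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step, total_steps=num_training_steps, warmup_steps=0, lr=base_lr):
    """Hypothetical helper: linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return lr * step / max(1, warmup_steps)
    return lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(steps_per_epoch, num_training_steps)
print(linear_lr(0), linear_lr(num_training_steps))
```

With `num_warmup_steps=0`, the rate starts at its full value on the first step and reaches zero on the last one.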


In [7]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    # Binarize the [0, 1] sentiment scores exactly as in the training loop.
    batch['labels'] = torch.round(batch['labels']).long()
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.83}
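
The `torch.round(...).long()` step in the loops above is there because the code treats the `label` column as a continuous sentiment score rather than a class index, so it is binarized before being compared with the argmax over the two logits. A minimal pure-Python sketch of that binarize-then-compare logic follows. The scores and logits are made-up illustrative values, and `binarize` and `accuracy` are hypothetical helper names, not part of any library.

```python
def binarize(score):
    """Map a sentiment score in [0, 1] to class 0 (negative) or 1 (positive)."""
    return int(round(score))


def accuracy(logits, scores):
    """Fraction of examples whose argmax class equals the binarized score."""
    correct = 0
    for row, score in zip(logits, scores):
        prediction = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(prediction == binarize(score))
    return correct / len(scores)


# Made-up values for illustration only.
example_logits = [[2.0, -1.0], [0.3, 1.2], [-0.5, 0.1], [1.5, 0.2]]
example_scores = [0.12, 0.83, 0.31, 0.45]
example_accuracy = accuracy(example_logits, example_scores)
print(example_accuracy)
```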


**Problem 2.2** *(3 points)* How does your accuracy with BERT compare to your accuracy with LSTM in Assignment 1? How about training speed? If you didn't complete Assignment 1, chat with other classmates who did Assignment 1 (e.g. via GitHub Discussions).



>**Answer 2.2.**   
For my LSTM model in Assignment 1, the validation accuracy after 10 epochs of training was 66.94%, while the BERT model above reaches 83.0% accuracy after 3 epochs. Training took 21 seconds/epoch for the LSTM and 103.58 seconds / 3 epochs ≈ 35 seconds/epoch for BERT. BERT is therefore far more accurate than the LSTM, despite being somewhat slower to train per epoch.
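
The per-epoch figure can be checked directly from the timings reported above. The LSTM number is taken from the answer as stated and is not re-measured here.

```python
bert_total_seconds = 103.58    # printed by the BERT training cell above
bert_epochs = 3
lstm_seconds_per_epoch = 21.0  # as reported for Assignment 1

bert_seconds_per_epoch = bert_total_seconds / bert_epochs
slowdown = bert_seconds_per_epoch / lstm_seconds_per_epoch

print(f"BERT: {bert_seconds_per_epoch:.1f} s/epoch, {slowdown:.2f}x the LSTM time")
```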

**Problem 2.3** *(3 points)* Try your own sentences and find three failure cases. Explain why you think the model got them wrong.

>**Answer 2.3.**   
I wrote three sentences that I expected the model to misclassify (see the code below). In the run shown, the first and third are misclassified, while the prediction for the second matches its label.

>1. I will fucking kill myself if this film does not get an award. -> Label: Positive
>2. Today I had to get new eyes. -> Label: Negative
>3. My favorite actor, my favorite music, my favorite director, and they made a great shit. -> Label: Negative

>The first sentence contains offensive words, but it is actually a positive comment (the movie should get an award); the offensive words likely confused the model. The second sentence contains no explicitly negative word, but it is an idiom-like expression implying that the movie was too bad to watch, so BERT may not capture the meaning behind the phrase. The third sentence uses the positive word "favorite" three times and "great" once, yet it says the movie is bad despite these good ingredients; BERT may have misjudged the sentiment because of the positive words.

In [8]:
# The tokenizer and the fine-tuned model from the cells above are reused here.

input1 = tokenizer('I will fucking kill myself if this film does not get an award', padding="max_length", truncation=True)
label1 = torch.tensor([1], dtype=torch.long)

input2 = tokenizer('Today I had to get new eyes', padding="max_length", truncation=True)
label2 = torch.tensor([0], dtype=torch.long)

input3 = tokenizer('My favorite actor, my favorite music, my favorite director, and they made a great shit', padding="max_length", truncation=True)
label3 = torch.tensor([0], dtype=torch.long)


model.eval()
for inputs, labels in zip([input1, input2, input3], [label1, label2, label3]):
    batch = {}
    batch['input_ids'] = torch.tensor(inputs['input_ids']).long().unsqueeze(0).to(device)
    batch['attention_mask'] = torch.tensor(inputs['attention_mask']).long().unsqueeze(0).to(device)
    batch['token_type_ids'] = torch.tensor(inputs['token_type_ids']).long().unsqueeze(0).to(device)
    batch['labels'] = labels.to(device)
    
    ##################################

    with torch.no_grad():
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    print('Label {}, Prediction {}'.format(labels.item(), predictions.item()))

Label 1, Prediction 0
Label 0, Prediction 0
Label 0, Prediction 1


**Problem 2.4** *(3 points)*  Try `bert-base-uncased` and analyze if it makes any difference. What is the difference between `cased` and `uncased` in English? How about in Korean?

> **Answer 2.4.**   
A cased model is pretrained on text that preserves capitalization, so `Apple` and `apple` are treated as different inputs. An uncased model lowercases all text before tokenization and pretraining, so it cannot distinguish the two and never learns separate embeddings for capitalized words. This distinction matters in English, where capitalization can carry information (for example, in proper nouns). In Korean it makes no difference, because the Korean script has no letter case.
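
The effect of lowercasing can be illustrated without loading any model. The sketch below models only the lowercasing step; `simulate_uncased` is a hypothetical stand-in, and a real `uncased` tokenizer may apply further normalization that is not reproduced here. The example sentences are made up.

```python
def simulate_uncased(text):
    """Hypothetical stand-in for the lowercasing step of an uncased pipeline."""
    return text.lower()


english = "Apple released a new apple-flavored drink"
korean = "이 영화는 정말 재미있다"

# In English, "Apple" and "apple" collapse to the same string.
print(simulate_uncased(english))

# Korean has no letter case, so lowercasing leaves the text unchanged.
print(simulate_uncased(korean) == korean)
```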

## 3. Token Classification with BERT
**Problem 3.1** *(3 points)* Finetune your `bert-base-cased` model for `squad` question answering dataset, following a similar procedure to Problem 2.1. Report your accuracy here. For now, if the input is longer than 256, take the first 256 words as the input and truncate the rest. You are allowed to copy any code from the documentation.  *Hint*: If you are having difficulty in implementation, take a peek at  (but do not copy!) https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering, though keep in mind that the answer extraction module there is quite complex. It is okay to keep it simple here and sacrifice the accuracy a little.



>**Answer 3.1.**   
The validation accuracy on the SQuAD dataset was 29.80% in my implementation (evaluated on 1,000 shuffled validation examples).
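
The evaluation cell further down counts a prediction as correct only when both the predicted start and the predicted end token index equal the labeled ones. This is generally stricter than the text-based exact-match and F1 scores usually reported for SQuAD. A minimal sketch of that criterion in plain Python, using made-up indices and a hypothetical helper name `span_exact_match`:

```python
def span_exact_match(pred_starts, pred_ends, gold_starts, gold_ends):
    """Fraction of examples where both start and end token indices are correct."""
    hits = sum(
        int(ps == gs and pe == ge)
        for ps, pe, gs, ge in zip(pred_starts, pred_ends, gold_starts, gold_ends)
    )
    return hits / len(gold_starts)


# Made-up indices: only the first example matches on both ends.
example_score = span_exact_match([5, 11, 7], [6, 20, 9], [5, 11, 0], [6, 18, 0])
print(example_score)
```

Note that the second example gets no credit even though its start index is right, which is one reason this metric can look low.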

In [9]:
from datasets import load_dataset
from transformers import AutoTokenizer

squad_dataset = load_dataset('squad')

Reusing dataset squad (/home/sungnyun/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


  0%|          | 0/2 [00:00<?, ?it/s]

In [10]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=256,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []
    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        # Character span of the first reference answer within the context.
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)
        # Locate the first and last context tokens (sequence id 1).
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # If the answer lies entirely outside the (possibly truncated) context, label it (0, 0).
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise, find the first token covering the answer start ...
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            # ... and the last token covering the answer end.
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_datasets = squad_dataset.map(preprocess_function, batched=True)

# Drop the raw-text columns that the model does not consume.
tokenized_datasets = tokenized_datasets.remove_columns(["context", "title", "answers", "question", "id"])
tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(1000))

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

Loading cached processed dataset at /home/sungnyun/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-bb50b6a36496624c.arrow
Loading cached processed dataset at /home/sungnyun/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-b5c15d73b5f7701f.arrow
Loading cached shuffled indices for dataset at /home/sungnyun/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-391c407b4d42ccd2.arrow
Loading cached shuffled indices for dataset at /home/sungnyun/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-e96c3f87e7967628.arrow


In [11]:
from transformers import AutoModelForQuestionAnswering
from tqdm.auto import tqdm
import time


model = AutoModelForQuestionAnswering.from_pretrained("bert-base-cased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 10
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

progress_bar = tqdm(range(num_training_steps))

model.train()
start = time.time()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

end = time.time() - start
print('{:.2f}s'.format(end))

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

  0%|          | 0/1250 [00:00<?, ?it/s]

190.06s


In [12]:
model.eval()
cnt = 0
batch_cnt = 0
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}

    with torch.no_grad():
        outputs = model(**batch)

    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    start_predictions = torch.argmax(start_logits, dim=-1)
    end_predictions = torch.argmax(end_logits, dim=-1)
    
    # Count a prediction as correct only if both the start and end positions match.
    correct = (start_predictions == batch["start_positions"]) & (end_predictions == batch["end_positions"])
    cnt += correct.sum().item()
    batch_cnt += len(start_predictions)
    
print('Validation Acc: {:.2f}%'.format(cnt / batch_cnt * 100))

Validation Acc: 29.80%



**Problem 3.2** *(3 points)* Try your own context/questions and find three failure cases. Explain why you think the model got them wrong.

>**Answer 3.2.**   
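I tried my own context and questions and found three failure cases (see the code below). For the first question, I think the context is misleading because "born" and "raised" are easily confused words that appear next to each other. For the second question, the predicted start position (11) matches the label, but the predicted end position (68) lies far beyond the labeled end (18). The third question cannot be answered from the context, which does not give her full name, yet the model still predicts a span (11 to 13) instead of position 0.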
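
The gold positions in the code below are hard-coded token indices. Such indices can also be derived from the answer's character span, in the same spirit as `preprocess_function` above, which uses the offsets returned by `return_offsets_mapping=True`. The sketch below shows the idea with a toy whitespace tokenizer; `whitespace_offsets` and `char_span_to_token_span` are hypothetical helpers, and real BERT word-piece offsets will differ from these.

```python
import re


def whitespace_offsets(text):
    """Hypothetical toy tokenizer: (start, end) character span of each word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Return (first, last) indices of tokens overlapping [start_char, end_char)."""
    covered = [i for i, (s, e) in enumerate(offsets) if s < end_char and e > start_char]
    return covered[0], covered[-1]


context = "Born in Houston and raised in Dallas"
answer = "Houston"
start_char = context.index(answer)
end_char = start_char + len(answer)

token_span = char_span_to_token_span(whitespace_offsets(context), start_char, end_char)
print(token_span)
```

Deriving the labels this way avoids off-by-a-few errors when a name is split into several sub-word tokens.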

In [14]:
custom1 = {}
custom1['context'] = 'Beyonce is an American singer, songwriter, record producer and actress. Born in Houston and raised in Dallas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child.'
custom1['question'] = 'Where was Beyonce born?'
input1 = tokenizer(
        custom1["question"],
        custom1["context"],
        max_length=256,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
input1['start_positions'] = 26
input1['end_positions'] = 26

custom2 = {}
custom2['context'] = 'Beyonce is an American singer, songwriter, record producer and actress. Born in Houston and raised in Dallas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child.'
custom2['question'] = 'What is Beyonce\'s job?'
input2 = tokenizer(
        custom2["question"],
        custom2["context"],
        max_length=256,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
input2['start_positions'] = 11
input2['end_positions'] = 18

custom3 = {}
custom3['context'] = 'Beyonce is an American singer, songwriter, record producer and actress. Born in Houston and raised in Dallas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child.'
custom3['question'] = 'What is Beyone\'s full name?'
input3 = tokenizer(
        custom3["question"],
        custom3["context"],
        max_length=256,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
input3['start_positions'] = 0
input3['end_positions'] = 0


model.eval()
for inputs in [input1, input2, input3]:
    batch = {}
    batch['input_ids'] = torch.tensor(inputs['input_ids']).long().unsqueeze(0).to(device)
    batch['attention_mask'] = torch.tensor(inputs['attention_mask']).long().unsqueeze(0).to(device)
    batch['token_type_ids'] = torch.tensor(inputs['token_type_ids']).long().unsqueeze(0).to(device)
    batch['start_positions'] = torch.tensor(inputs['start_positions']).long().unsqueeze(0).to(device)
    batch['end_positions'] = torch.tensor(inputs['end_positions']).long().unsqueeze(0).to(device)
    
    ##################################

    with torch.no_grad():
        outputs = model(**batch)
        
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    start_predictions = torch.argmax(start_logits, dim=-1)
    end_predictions = torch.argmax(end_logits, dim=-1)
    print('Start Prediction {}, End Prediction {}'.format(start_predictions.item(), end_predictions.item()))

Start Prediction 30, End Prediction 30
Start Prediction 11, End Prediction 68
Start Prediction 11, End Prediction 13


**Problem 3.3** *(2 points)* Can we do better than truncating tokens if the input length is too long? Suggest (but do not code) a strategy for a problem like SQuAD when the input has an arbitrary length with a pretrained model like BERT that has a predefined input length.

>**Answer 3.3.**   
We can use "chunks" rather than a single truncated input. We can split the context into several fixed-length chunks, run the model on each chunk (paired with the question), and then take as the final answer the position predicted with the highest confidence (the largest logit value) across all chunks.
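
Although the problem asks only for a strategy, the chunking idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a tested SQuAD pipeline: `split_into_windows` and `best_span` are hypothetical helpers, the overlap (`stride`) between neighboring chunks is an added assumption so that an answer straddling a chunk boundary still appears whole in some chunk, and the per-chunk scores are made-up numbers standing in for model logits.

```python
def split_into_windows(tokens, max_len, stride):
    """Split tokens into windows of at most max_len that overlap by `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += max_len - stride
    return windows


def best_span(window_predictions):
    """Pick the highest-scoring span and map it back to full-context indices.

    Each item is (window_offset, local_start, local_end, score).
    """
    offset, start, end, score = max(window_predictions, key=lambda p: p[3])
    return offset + start, offset + end, score


tokens = [f"tok{i}" for i in range(10)]
windows = split_into_windows(tokens, max_len=4, stride=1)
window_offsets = [offset for offset, _ in windows]
print(window_offsets)

# Made-up per-window predictions: (window_offset, local_start, local_end, score).
predictions = [(0, 1, 2, 3.1), (3, 0, 1, 7.4), (6, 2, 3, 5.0)]
final_span = best_span(predictions)
print(final_span)
```

In a real setup, each window would be concatenated with the question before being passed to the model, and the score could be, for example, the sum of the start and end logits.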