In this demo, we explore BERT for question answering using the Hugging Face `transformers` library.

In Part 1 of the demo, we take a BERT model that is already fine-tuned on the **SQuAD** dataset and test it on the **CoQA** dataset.  In Part 2, you will learn how to fine-tune BERT for question answering on the **SQuAD** dataset yourself.

# PART 1

### Initialization & Setup


In [None]:
!pip install transformers



In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

### Loading the CoQA dataset

In [None]:
coqa = pd.read_json('http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json')
coqa

Unnamed: 0,version,data
0,1,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,1,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,1,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,1,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
...,...,...
7194,1,"{'source': 'gutenberg', 'id': '34j10vatjfyw0ao..."
7195,1,"{'source': 'cnn', 'id': '3vj40nv2qinjocrcy7k4z..."
7196,1,"{'source': 'race', 'id': '3rjsc4xj10uw0to3vq0v..."
7197,1,"{'source': 'wikipedia', 'id': '3gs6s824sqxty8v..."


### Inspecting the data

In [None]:
coqa["data"][0]

{'source': 'wikipedia',
 'id': '3zotghdk5ibi9cex97fepx7jetpso7',
 'filename': 'Vatican_Library.txt',
 'story': 'The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to 

The **CoQA** dataset contains ~7200 rows, and each row contains one passage (the `story`) and multiple question and answer pairs related to that passage.

If we print the first row, we see that there are 20 questions and answers for the first passage and that each answer comes with a start index and an end index into the passage.  This is a common format for closed-domain question answering datasets, where the answer is a span of the given text.
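
To make the span format concrete, here is a minimal sketch on a made-up record shaped like one CoQA entry. The `record` values and the helper `rationale_span` are assumptions for illustration; only the field names (`story`, `questions`, `answers`, `input_text`, `span_start`, `span_end`) follow the dataset, and the end index is assumed to be exclusive, as in Python slicing.

```python
# Hypothetical record shaped like one CoQA entry (the values are made up).
record = {
    "story": "The library was established in 1475. It holds many books.",
    "questions": [{"input_text": "When was it established?", "turn_id": 1}],
    "answers": [{"input_text": "1475", "span_start": 0, "span_end": 36, "turn_id": 1}],
}

def rationale_span(record, turn):
    """Return the slice of the story marked by the answer's character indices."""
    answer = record["answers"][turn]
    return record["story"][answer["span_start"]:answer["span_end"]]

span = rationale_span(record, 0)
print(span)                                      # supporting span from the story
print(record["answers"][0]["input_text"] in span)  # the free-form answer lies inside it
```

Note that the free-form answer (`input_text`) can be shorter than the span it is drawn from, which is why both are stored.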

In [None]:
# deleting an unnecessary column
del coqa["version"]
coqa

Unnamed: 0,data
0,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
...,...
7194,"{'source': 'gutenberg', 'id': '34j10vatjfyw0ao..."
7195,"{'source': 'cnn', 'id': '3vj40nv2qinjocrcy7k4z..."
7196,"{'source': 'race', 'id': '3rjsc4xj10uw0to3vq0v..."
7197,"{'source': 'wikipedia', 'id': '3gs6s824sqxty8v..."


### Converting the CoQA dataset to a more convenient format
We convert the CoQA dataset to a more convenient format with one question and answer pair per row.  As a result, the passage is repeated in the "text" column once for every question and answer pair that belongs to it.

In [None]:
cols = ["text","question","answer"]
comp_list = []
for index, row in coqa.iterrows():
    for i in range(len(row["data"]["questions"])):
        temp_list = []
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns = cols)
new_df

Unnamed: 0,text,question,answer
0,"The Vatican Apostolic Library (), more commonl...",When was the Vat formally opened?,It was formally established in 1475
1,"The Vatican Apostolic Library (), more commonl...",what is the library for?,research
2,"The Vatican Apostolic Library (), more commonl...",for what subjects?,"history, and law"
3,"The Vatican Apostolic Library (), more commonl...",and?,"philosophy, science and theology"
4,"The Vatican Apostolic Library (), more commonl...",what was started in 2014?,a project
...,...,...,...
108642,(CNN) -- Cristiano Ronaldo provided the perfec...,Who was a sub?,Xabi Alonso
108643,(CNN) -- Cristiano Ronaldo provided the perfec...,Was it his first game this year?,Yes
108644,(CNN) -- Cristiano Ronaldo provided the perfec...,What position did the team reach?,third
108645,(CNN) -- Cristiano Ronaldo provided the perfec...,Who was ahead of them?,Barca.


### Loading BERT fine-tuned on SQuAD
We load a BERT question answering model that is already fine-tuned on SQuAD, together with the corresponding BERT tokenizer (each pre-trained BERT model has a corresponding tokenizer).


In [None]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Experimenting with BERT


In [None]:
# picking out a random question and answer pair from the dataset
random_num = np.random.randint(0,len(new_df))
question = new_df["question"][random_num]
text = new_df["text"][random_num]

In [None]:
# tokenizing the question and text pair
input_ids = tokenizer.encode(question, text)
print("The input has a total of {} tokens.".format(len(input_ids)))

The input has a total of 402 tokens.


We inspect the resulting tokens and observe that each token is mapped to an integer id from the vocabulary, and that some rare words are split into multiple sub-word tokens. Token id 101 (`[CLS]`) always comes first and marks the start of the input, and token id 102 (`[SEP]`) is the separator token, which appears between the question and the text and again at the end of the input.
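
The splitting of rare words can be illustrated with a greedy longest-match-first procedure, which is the general idea behind WordPiece tokenization. This is a minimal sketch, not the tokenizer's actual implementation: the function `wordpiece` and the tiny `toy_vocab` are assumptions for illustration (the real BERT vocabulary has roughly 30,000 entries).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, assumed for illustration only.
toy_vocab = {"elephant", "ko", "##shi", "##k", "zoo"}
print(wordpiece("elephant", toy_vocab))  # ['elephant']
print(wordpiece("koshik", toy_vocab))    # ['ko', '##shi', '##k']
```

A frequent word such as "elephant" stays whole, while a rare name is broken into pieces that the vocabulary does contain.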

In [None]:
# inspecting the resulting tokens
tokens = tokenizer.convert_ids_to_tokens(input_ids)
for token, id in zip(tokens, input_ids):
    print('{:8}{:8,}'.format(token,id))

[CLS]        101
which      2,029
is         2,003
where      2,073
?          1,029
[SEP]        102
seoul     10,884
,          1,010
south      2,148
korea      4,420
(          1,006
cnn       13,229
)          1,007
-          1,011
-          1,011
korean     4,759
is         2,003
considered   2,641
one        2,028
of         1,997
the        1,996
hardest   18,263
languages   4,155
in         1,999
the        1,996
world      2,088
to         2,000
master     3,040
,          1,010
but        2,021
an         2,019
elephant  10,777
in         1,999
a          1,037
south      2,148
korean     4,759
zoo        9,201
is         2,003
making     2,437
a          1,037
good       2,204
start      2,707
.          1,012
ko        12,849
##shi      6,182
##k        2,243
,          1,010
a          1,037
22         2,570
-          1,011
year       2,095
-          1,011
old        2,214
asian      4,004
elephant  10,777
has        2,038
stunned    9,860
experts    8,519
and        

In [None]:
# counting the tokens in the question and in the text
sep_idx = input_ids.index(tokenizer.sep_token_id)
print("SEP token index: ", sep_idx)
num_seg_a = sep_idx + 1
print("Number of tokens in segment A (question): ", num_seg_a)
num_seg_b = len(input_ids) - num_seg_a
print("Number of tokens in segment B (text): ", num_seg_b)

SEP token index:  5
Number of tokens in segment A (question):  6
Number of tokens in segment B (text):  396


In [None]:
#creating the segment ids and making sure every input token has a segment id
segment_ids = [0]*num_seg_a + [1]*num_seg_b
assert len(segment_ids) == len(input_ids)

Now the tokens and the segment ids will be passed to the model

In [None]:
# token input_ids to represent the input and token segment_ids to differentiate
# our segments - question and text
output = model(torch.tensor([input_ids]),  token_type_ids = torch.tensor([segment_ids]))

Getting the start and end tokens from the output
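
Taking the two argmaxes independently can produce an end position that comes before the start position. A common alternative, sketched here on made-up scores, is to pick the pair with the highest combined score among spans whose end is not before their start. The function `best_span` and its `max_len` cap are assumptions for illustration and are not part of the `transformers` API.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) maximising start_scores[s] + end_scores[e] with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Made-up logits: the independent argmaxes would give start=3, end=1 (invalid).
start_scores = [0.1, 0.2, 1.5, 2.0]
end_scores = [0.3, 2.5, 1.0, 0.4]
print(best_span(start_scores, end_scores))  # (1, 1)
```

The double loop is quadratic in the number of tokens, which is acceptable for inputs of a few hundred tokens.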

In [None]:
# tokens with the highest start and end scores
# default answer, kept when the predicted end comes before the predicted start
answer = "I am unable to find the answer to this question. Can you please ask another question?"
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)
if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])

print("Text:\n{}".format(new_df["text"][random_num]))
print("\nQuestion:\n{}".format(question.capitalize()))
print("\nAnswer:\n{}.".format(answer.capitalize()))

Text:
Seoul, South Korea (CNN) -- Korean is considered one of the hardest languages in the world to master, but an elephant in a South Korean zoo is making a good start. 

Koshik, a 22-year-old Asian elephant has stunned experts and his keepers at Everland Zoo near Seoul by imitating human speech. Koshik can say the Korean words for "hello," "sit down," "no," "lie down" and "good." His trainer, Kim Jong Gap, first started to realize Koshik was mimicking him several years ago. 

""In 2004 and 2005, Kim didn't even know that the human voice he heard at the zoo was actually from Koshik," zoo spokesman In Kim In Cherl said. "But in 2006, he started to realize that Koshik had been imitating his voice and mentioned it to his boss." 

Why do elephants have hair on their heads? 

His boss initially called him "crazy." 

Koshik's remarkable antics grabbed the interest of an elephant vocalization expert thousands of kilometers away at the University of Vienna in Austria. 

""There was a YouTube 

The answer needs cleaning up when a word has been split into several tokens. A leading double hash (`##`) marks a token that continues the previous one, so we glue such tokens back onto the preceding token and drop the `##`.

In [None]:
# cleaning up the answer
answer = tokens[answer_start]
for i in range(answer_start+1, answer_end+1):
    if tokens[i][0:2] == "##":
        answer += tokens[i][2:]
    else:
        answer += " " + tokens[i]

In [None]:
print("Answer:\n{}.".format(answer.capitalize()))

Answer:
Seoul , south korea.


In [None]:
# retrieve and print the answer to this question that we had in the training set
answer = new_df["answer"][random_num]
answer

'Seoul'

# PART 2

## Initialization & Setup

In [None]:
# importing required libraries
import requests
import json
import torch
import os
from tqdm import tqdm
from transformers import BertTokenizerFast
from torch.utils.data import DataLoader
from transformers import BertForQuestionAnswering
from torch.optim import AdamW  # the AdamW in transformers is deprecated in favor of the PyTorch implementation

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# creating a directory in Google drive
os.makedirs('/content/drive/MyDrive/BERT-SQuAD', exist_ok=True)

## Loading the SQuAD dataset

In [None]:
# getting the SQuAD dataset
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2023-11-18 06:46:02--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2023-11-18 06:46:02 (430 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2023-11-18 06:46:02--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2023-11-18 06:46:02 (138 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [None]:
# loading the training dataset to inspect it
with open('train-v2.0.json', 'rb') as f:
  squad = json.load(f)

In [None]:
# Each 'data' dict has two keys (title and paragraphs)
squad['data'][0].keys()

dict_keys(['title', 'paragraphs'])

In [None]:
squad['data'][0]

{'title': 'Beyoncé',
 'paragraphs': [{'qas': [{'question': 'When did Beyonce start becoming popular?',
     'id': '56be85543aeaaa14008c9063',
     'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
     'is_impossible': False},
    {'question': 'What areas did Beyonce compete in when she was growing up?',
     'id': '56be85543aeaaa14008c9065',
     'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
     'is_impossible': False},
    {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
     'id': '56be85543aeaaa14008c9066',
     'answers': [{'text': '2003', 'answer_start': 526}],
     'is_impossible': False},
    {'question': 'In what city and state did Beyonce  grow up? ',
     'id': '56bf6b0f3aeaaa14008c9601',
     'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
     'is_impossible': False},
    {'question': 'In which decade did Beyonce become famous?',
     'id': '56bf6b0f3aeaaa14008c9602',
     'answers': [{'text

Here we see that for each topic there are multiple paragraphs, and for each paragraph there are multiple question and answer pairs.

In [None]:
# checking the number of topics
len(squad['data'])

442

In [None]:
# loading the data in triplets of context, questions and answers
def read_data(path):

  with open(path, 'rb') as f:
    squad = json.load(f)

  contexts = []
  questions = []
  answers = []

  for group in squad['data']:
    for passage in group['paragraphs']:
      context = passage['context']
      for qa in passage['qas']:
        question = qa['question']
        for answer in qa['answers']:
          contexts.append(context)
          questions.append(question)
          answers.append(answer)

  return contexts, questions, answers

In [None]:
train_contexts, train_questions, train_answers = read_data('train-v2.0.json')
valid_contexts, valid_questions, valid_answers = read_data('dev-v2.0.json')

In [None]:
print(f'There are {len(train_questions)} training set questions')
print(f'There are {len(valid_questions)} dev set questions')

There are 86821 training set questions
There are 20302 dev set questions


## Dataset pre-processing

In [None]:
# fixing some data quality issues
def add_end_idx(answers, contexts):
  for answer, context in zip(answers, contexts):
    gold_text = answer['text']
    start_idx = answer['answer_start']
    end_idx = start_idx + len(gold_text)

    # sometimes squad answers are off by a character or two so we fix this
    if context[start_idx:end_idx] == gold_text:
      answer['answer_end'] = end_idx
    elif context[start_idx-1:end_idx-1] == gold_text:
      answer['answer_start'] = start_idx - 1
      answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
    elif context[start_idx-2:end_idx-2] == gold_text:
      answer['answer_start'] = start_idx - 2
      answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

add_end_idx(train_answers[:1000], train_contexts[:1000])
add_end_idx(valid_answers[:100], valid_contexts[:100])
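
The off-by-one repair can be seen on a small made-up example. This is a minimal sketch of the same idea, not the function used above: `align_answer` and its `max_shift` parameter are hypothetical names, and the guard against negative indices is an addition.

```python
def align_answer(context, text, start, max_shift=2):
    """Return (start, end) so that context[start:end] == text, trying small left shifts."""
    for shift in range(max_shift + 1):
        s = start - shift
        if s >= 0 and context[s:s + len(text)] == text:
            return s, s + len(text)
    return None  # no alignment found within max_shift characters

# Made-up example: the labelled start (10) is one character too far to the right.
context = "BERT was released in 2018 by Google."
print(align_answer(context, "released", 10))  # (9, 17)
```

Returning `None` when nothing matches makes misaligned answers easy to detect and filter out before tokenization.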

## Fine-tuning BERT on SQuAD

In [None]:
# getting the tokenizer and encoding the data (currently using only 1000 rows as training is very time consuming)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(train_contexts[:1000], train_questions[:1000], truncation=True, padding=True)
valid_encodings = tokenizer(valid_contexts[:100], valid_questions[:100], truncation=True, padding=True)

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

Inspecting the output of the tokenizer: `input_ids` are the token indices, padded with 0s; `token_type_ids` mark which sequence each token belongs to (0 for the first sequence, 1 for the second); and `attention_mask` marks which positions hold real tokens (1) and which hold padding (0) that the model should not attend to.
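
How the three lists relate to each other can be shown by building them by hand for one sequence pair. This is a minimal sketch under stated assumptions: `encode_pair` is a hypothetical helper, the ids 7, 8, 9, 5 and 6 are placeholders for real word pieces, and the input is assumed to fit within `max_len` so no truncation is needed.

```python
def encode_pair(first_ids, second_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Build BERT-style inputs for one sequence pair, padded to max_len."""
    input_ids = [cls_id] + first_ids + [sep_id] + second_ids + [sep_id]
    # Segment 0 covers [CLS] + first sequence + [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    attention_mask = [1] * len(input_ids)
    padding = max_len - len(input_ids)  # assumes the pair fits in max_len
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }

# Placeholder ids stand in for real word pieces.
enc = encode_pair([7, 8, 9], [5, 6], max_len=10)
print(enc["input_ids"])       # [101, 7, 8, 9, 102, 5, 6, 102, 0, 0]
print(enc["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(enc["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Padding positions get a token type of 0 as well, so only the attention mask distinguishes them from the first sequence.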

In [None]:
train_encodings["input_ids"][0]

[101,
 20773,
 21025,
 19358,
 22815,
 1011,
 5708,
 1006,
 1013,
 12170,
 23432,
 29715,
 3501,
 29678,
 12325,
 29685,
 1013,
 10506,
 1011,
 10930,
 2078,
 1011,
 2360,
 1007,
 1006,
 2141,
 2244,
 1018,
 1010,
 3261,
 1007,
 2003,
 2019,
 2137,
 3220,
 1010,
 6009,
 1010,
 2501,
 3135,
 1998,
 3883,
 1012,
 2141,
 1998,
 2992,
 1999,
 5395,
 1010,
 3146,
 1010,
 2016,
 2864,
 1999,
 2536,
 4823,
 1998,
 5613,
 6479,
 2004,
 1037,
 2775,
 1010,
 1998,
 3123,
 2000,
 4476,
 1999,
 1996,
 2397,
 4134,
 2004,
 2599,
 3220,
 1997,
 1054,
 1004,
 1038,
 2611,
 1011,
 2177,
 10461,
 1005,
 1055,
 2775,
 1012,
 3266,
 2011,
 2014,
 2269,
 1010,
 25436,
 22815,
 1010,
 1996,
 2177,
 2150,
 2028,
 1997,
 1996,
 2088,
 1005,
 1055,
 2190,
 1011,
 4855,
 2611,
 2967,
 1997,
 2035,
 2051,
 1012,
 2037,
 14221,
 2387,
 1996,
 2713,
 1997,
 20773,
 1005,
 1055,
 2834,
 2201,
 1010,
 20754,
 1999,
 2293,
 1006,
 2494,
 1007,
 1010,
 2029,
 2511,
 2014,
 2004,
 1037,
 3948,
 3063,
 4969,
 1010,
 36

In [None]:
train_encodings["token_type_ids"][0]

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


In [None]:
train_encodings["attention_mask"][0]

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


In [None]:
# printing the number of training data samples
no_of_encodings = len(train_encodings['input_ids'])
print(f'We have {no_of_encodings} context-question pairs')

We have 1000 context-question pairs


In [None]:
# adding the answers in the training set for fine tuning
def add_token_positions(encodings, answers):
  start_positions = []
  end_positions = []
  for i in range(len(answers)):
    start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
    end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

    # if start position is None, the answer passage has been truncated
    if start_positions[-1] is None:
      start_positions[-1] = tokenizer.model_max_length
    if end_positions[-1] is None:
      end_positions[-1] = tokenizer.model_max_length

  encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers[:1000])
add_token_positions(valid_encodings, valid_answers[:100])
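
The mapping from character positions to token positions, and the reason for the `- 1` on the end index, can be sketched with plain offsets. The helper `char_to_token_sketch` and the `offsets` list are assumptions for illustration; the real lookup is done by the fast tokenizer's encoding object.

```python
def char_to_token_sketch(offsets, char_idx):
    """Return the index of the token whose (start, end) span covers char_idx, else None."""
    for token_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return token_idx
    return None  # whitespace, or a character beyond the (possibly truncated) tokens

context = "born in houston"
# Hypothetical character offsets for the tokens ["born", "in", "houston"].
offsets = [(0, 4), (5, 7), (8, 15)]

answer_start, answer_end = 8, 15  # context[8:15] == "houston", end is exclusive
print(char_to_token_sketch(offsets, answer_start))    # 2
print(char_to_token_sketch(offsets, answer_end - 1))  # 2
print(char_to_token_sketch(offsets, answer_end))      # None
```

Because the end index is exclusive, looking it up directly lands one character past the answer, which is why the last character is used instead.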

In [None]:
# creating the dataset in the format it is required for fine tuning BERT
class SQuAD_Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  def __getitem__(self, idx):
    return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
  def __len__(self):
    return len(self.encodings.input_ids)

In [None]:
train_dataset = SQuAD_Dataset(train_encodings)
valid_dataset = SQuAD_Dataset(valid_encodings)

In [None]:
# Define the dataloaders
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=8)

In [None]:
# loading the BERT model which we will fine tune
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# checking the device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Working on {device}')

Working on cuda


In [None]:
# Fine tuning it per batch
N_EPOCHS = 5
optim = AdamW(model.parameters(), lr=5e-5)

model.to(device)
model.train()

for epoch in range(N_EPOCHS):
  loop = tqdm(train_loader, leave=True)
  for batch in loop:
    optim.zero_grad()
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
    loss = outputs[0]
    loss.backward()
    optim.step()

    loop.set_description(f'Epoch {epoch+1}')
    loop.set_postfix(loss=loss.item())

Epoch 1: 100%|██████████| 125/125 [01:26<00:00,  1.45it/s, loss=2.95]
Epoch 2: 100%|██████████| 125/125 [01:26<00:00,  1.45it/s, loss=1.25]
Epoch 3: 100%|██████████| 125/125 [01:26<00:00,  1.45it/s, loss=0.743]
Epoch 4: 100%|██████████| 125/125 [01:25<00:00,  1.45it/s, loss=0.335]
Epoch 5: 100%|██████████| 125/125 [01:25<00:00,  1.45it/s, loss=0.655]
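
The loss reported in the progress bar combines two classification problems: one over the start position and one over the end position. The sketch below computes the average of the two cross-entropy terms in plain Python on made-up logits. The function names are hypothetical, and the sketch leaves out details such as batching and ignored positions.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target index, computed in a numerically stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def qa_loss(start_logits, end_logits, start_pos, end_pos):
    """Average of the start-position and end-position cross-entropy terms."""
    return (cross_entropy(start_logits, start_pos) + cross_entropy(end_logits, end_pos)) / 2

# Made-up logits over a 4-token input; the true answer spans tokens 1 to 2.
confident = qa_loss([0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0], 1, 2)
uniform = qa_loss([0.0] * 4, [0.0] * 4, 1, 2)
print(confident < uniform)  # a confident, correct prediction has the lower loss
```

With uniform logits the loss equals the logarithm of the number of tokens, which gives a rough baseline for reading the values printed during training.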


In [None]:
# checking the performance
model.eval()

acc = []

for batch in tqdm(valid_loader):
  with torch.no_grad():
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_true = batch['start_positions'].to(device)
    end_true = batch['end_positions'].to(device)

    outputs = model(input_ids, attention_mask=attention_mask)

    start_pred = torch.argmax(outputs['start_logits'], dim=1)
    end_pred = torch.argmax(outputs['end_logits'], dim=1)

    acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
    acc.append(((end_pred == end_true).sum()/len(end_pred)).item())

acc = sum(acc)/len(acc)

100%|██████████| 13/13 [00:02<00:00,  5.23it/s]


In [None]:
acc

0.6634615384615384
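
The loop above averages per-batch accuracies, so a smaller final batch counts as much as a full one. A variant that weights every example equally is sketched here on made-up predictions; `span_accuracy` is a hypothetical helper, not part of the notebook's pipeline.

```python
def span_accuracy(start_pred, end_pred, start_true, end_true):
    """Fraction of start and end positions predicted exactly, over all examples."""
    correct = sum(p == t for p, t in zip(start_pred, start_true))
    correct += sum(p == t for p, t in zip(end_pred, end_true))
    return correct / (2 * len(start_true))

# Made-up predictions for four examples: 3 of 4 starts and 3 of 4 ends are correct.
print(span_accuracy([3, 5, 9, 1], [4, 7, 9, 2], [3, 5, 8, 1], [4, 6, 9, 2]))  # 0.75
```

This counts starts and ends separately; a stricter exact-match score would require both to be correct for the same example.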

# Homework assignment

**Exercise 1: Fine-tune BERT for question answering on the CoQA dataset using the same process as shown in Part 2 for the SQuAD dataset.**

In [None]:
import pandas as pd
import torch
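from transformers import BertTokenizerFast, BertForQuestionAnswering
from torch.optim import AdamW  # the AdamW in transformers is deprecated in favor of the PyTorch implementation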
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm

# Load CoQA dataset
def load_coqa_data(url):
    coqa_data = pd.read_json(url)
    return coqa_data.drop("version", axis=1)

In [None]:
# Extract and organize data
def extract_coqa_data(coqa_data_json):
    stories, queries, ans = [], [], []
    for data_point in coqa_data_json['data']:
        story = data_point['story']
        for question, answer in zip(data_point['questions'], data_point['answers']):
            stories.append(story)
            queries.append(question['input_text'])
            ans.append(answer)
    return stories, queries, ans

In [None]:
# URL for CoQA dataset
url = 'http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json'
coqa_data_json = load_coqa_data(url)
stories, queries, answers = extract_coqa_data(coqa_data_json)

In [None]:
coqa_data_json

Unnamed: 0,data
0,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
...,...
7194,"{'source': 'gutenberg', 'id': '34j10vatjfyw0ao..."
7195,"{'source': 'cnn', 'id': '3vj40nv2qinjocrcy7k4z..."
7196,"{'source': 'race', 'id': '3rjsc4xj10uw0to3vq0v..."
7197,"{'source': 'wikipedia', 'id': '3gs6s824sqxty8v..."


In [None]:
# Split dataset for training and validation
train_split = 1000
train_stories, train_queries, train_answers = stories[:train_split], queries[:train_split], answers[:train_split]
valid_stories, valid_queries, valid_answers = stories[train_split:2000], queries[train_split:2000], answers[train_split:2000]

# Initialize tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [None]:
# Function to tokenize and align labels
def tokenize_and_align(tokenizer, stories, queries, answers):
    tokenized_data = tokenizer(stories, queries, padding=True, truncation=True)
    start_positions, end_positions = [], []
    for i, ans in enumerate(answers):
        start_idx = ans['span_start'] if ans['span_start'] >= 0 else 0
        end_idx = ans['span_end'] if ans['span_end'] >= 0 else start_idx + len(ans['input_text'])

        start_token = tokenized_data.char_to_token(i, start_idx)
        # end_idx is exclusive, so the last answer character sits at end_idx - 1
        end_token = tokenized_data.char_to_token(i, max(end_idx - 1, start_idx))

        start_positions.append(start_token if start_token is not None else tokenizer.model_max_length)
        end_positions.append(end_token if end_token is not None else tokenizer.model_max_length)

    tokenized_data.update({'start_positions': start_positions, 'end_positions': end_positions})
    return tokenized_data

In [None]:
train_data = tokenize_and_align(tokenizer, train_stories, train_queries, train_answers)
valid_data = tokenize_and_align(tokenizer, valid_stories, valid_queries, valid_answers)

# Define custom dataset
class CoQADataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

# Create data loaders
train_dataset = CoQADataset(train_data)
valid_dataset = CoQADataset(valid_data)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=8, shuffle=False)

In [None]:
# Prepare model and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased").to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training loop
def train_model(model, train_loader, optimizer, num_epochs):
    model.train()
    for epoch in range(num_epochs):
        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch + 1}"):
            optimizer.zero_grad()
            input_ids, attention_mask = batch['input_ids'].to(device), batch['attention_mask'].to(device)
            start_true, end_true = batch['start_positions'].to(device), batch['end_positions'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_true, end_positions=end_true)
            loss = outputs.loss
            loss.backward()
            optimizer.step()

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Evaluate model performance
def evaluate_model(model, valid_loader):
    model.eval()
    correct, total = 0, 0
    for batch in tqdm(valid_loader, desc="Evaluating"):
        with torch.no_grad():
            input_ids, attention_mask = batch['input_ids'].to(device), batch['attention_mask'].to(device)
            start_true, end_true = batch['start_positions'].to(device), batch['end_positions'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            start_pred = torch.argmax(outputs.start_logits, dim=1)
            end_pred = torch.argmax(outputs.end_logits, dim=1)
            # count exact matches for the start and the end position of each example
            correct += (start_pred == start_true).sum().item() + (end_pred == end_true).sum().item()
            total += 2 * start_true.size(0)
    # fraction of start and end positions predicted exactly
    return correct / total


In [None]:
# Train and evaluate
train_model(model, train_loader, optimizer, num_epochs=5)

Training Epoch 1:   0%|          | 0/125 [00:00<?, ?it/s]

Training Epoch 2:   0%|          | 0/125 [00:00<?, ?it/s]

Training Epoch 3:   0%|          | 0/125 [00:00<?, ?it/s]

Training Epoch 4:   0%|          | 0/125 [00:00<?, ?it/s]

Training Epoch 5:   0%|          | 0/125 [00:00<?, ?it/s]

In [None]:
avg_accuracy = evaluate_model(model, valid_loader)
print(f"Validation Accuracy: {avg_accuracy}")

Validation Accuracy: 0.1705


In [None]:
# Saving the trained model
model.save_pretrained(".")
tokenizer.save_pretrained(".")

# Saving the trained model to a local file
model.save_pretrained("./model-satya")
tokenizer.save_pretrained("./model-satya")

('./model-satya/tokenizer_config.json',
 './model-satya/special_tokens_map.json',
 './model-satya/vocab.txt',
 './model-satya/added_tokens.json',
 './model-satya/tokenizer.json')

**Exercise 2: Import the BERT model fine-tuned for classification and test its performance on any text classification dataset such as the twitter dataset.**

In [None]:
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

# Load the saved model and tokenizer
model_path = './model-satya'
model = BertForSequenceClassification.from_pretrained(model_path)
tokenizer = BertTokenizer.from_pretrained(model_path)

# Load dataset
df = pd.read_csv('/content/IMDB Dataset.csv')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./model-satya and are newly initialized: ['classifier.bias', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
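
The warning above appears because the saved checkpoint was trained with a question-answering head, while `BertForSequenceClassification` expects a pooler and a classifier head. A minimal sketch of that mismatch, using a small, illustrative subset of parameter names (a real checkpoint has many more entries), is a set difference between the two key sets:

```python
# Illustrative subset of parameter names stored in the QA checkpoint
qa_checkpoint_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "qa_outputs.weight",
    "qa_outputs.bias",
}

# Illustrative subset of parameter names the classification model expects
classifier_model_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.pooler.dense.weight",
    "bert.pooler.dense.bias",
    "classifier.weight",
    "classifier.bias",
}

# Expected by the model but absent from the checkpoint: randomly initialized
newly_initialized = sorted(classifier_model_keys - qa_checkpoint_keys)

# Present in the checkpoint but not used by the classification model
unused = sorted(qa_checkpoint_keys - classifier_model_keys)

print("newly initialized:", newly_initialized)
print("unused:", unused)
```

The four newly initialized names match the ones listed in the warning, which is why the model needs fine-tuning before its predictions are meaningful.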


In [None]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [None]:
# Convert sentiments to binary
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0}).astype(int)

In [None]:
# Balance the dataset
df_pos = df[df['sentiment'] == 1].sample(1500, random_state=42)
df_neg = df[df['sentiment'] == 0].sample(1500, random_state=42)
df_balanced = pd.concat([df_pos, df_neg])

In [None]:
# Tokenization function
max_length = 256
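def tokenize_data(texts):
    # Convert to a plain list so a pandas Series is accepted, then pad or
    # truncate every review to exactly max_length tokens
    return tokenizer(
        list(texts),
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )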
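
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified plain-Python sketch. The helper `pad_or_truncate` and the token ids are illustrative assumptions; the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # padding: how many slots are left
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# A short sequence gets padded, a long one gets cut
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 2023, 3185, 2001, 2307, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, so every batch can share one fixed tensor shape.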

In [None]:
# Split dataset for testing
_, test_texts, _, test_labels = train_test_split(
    df_balanced['review'], df_balanced['sentiment'], test_size=0.2, random_state=42
)

In [None]:
# Tokenizing and encoding
test_encodings = tokenize_data(test_texts)

# Convert labels to tensors
test_labels = torch.tensor(test_labels.values)

# Create TensorDataset
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)

In [None]:
# DataLoader
batch_size = 16
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Move model to device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

In [None]:
# Evaluation
model.eval()
predictions, true_labels = [], []
for batch in test_loader:
    batch = [t.to(device) for t in batch]
    inputs = {'input_ids': batch[0], 'attention_mask': batch[1]}
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    logits = logits.detach().cpu().numpy()
    label_ids = batch[2].to('cpu').numpy()
    predictions.append(logits)
    true_labels.append(label_ids)

In [None]:
flat_predictions = np.concatenate(predictions, axis=0)
flat_predictions = np.argmax(flat_predictions, axis=1)
flat_true_labels = np.concatenate(true_labels, axis=0)

# Print classification report
print(classification_report(flat_true_labels, flat_predictions))

              precision    recall  f1-score   support

           0       0.49      0.82      0.62       287
           1       0.58      0.23      0.33       313

    accuracy                           0.51       600
   macro avg       0.54      0.53      0.48       600
weighted avg       0.54      0.51      0.47       600
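
The report shows a model that leans heavily towards class 0: recall is high for class 0, low for class 1, and overall accuracy is close to chance. The sketch below shows how per-class precision, recall and F1 are computed, using made-up toy labels with a similar bias. The helper name `per_class_metrics` is an assumption for this example, not part of scikit-learn.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: the "model" predicts class 0 for most inputs
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 0, 1]

p0, r0, f0 = per_class_metrics(y_true, y_pred, label=0)
p1, r1, f1 = per_class_metrics(y_true, y_pred, label=1)
print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1:.2f}")
```

In the toy data both classes have a precision of 0.5, but recall differs sharply (0.75 versus 0.25), the same pattern as in the report above.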



**Exercise 3: Fine-tune the BERT model from Exercise 2 on the text classification dataset you used for testing in Exercise 2, and evaluate its performance on a test set that you set aside before fine-tuning.**

In [None]:
# Load dataset
df = pd.read_csv('/content/IMDB Dataset.csv')

In [None]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [None]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [None]:
# Convert sentiments to binary
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0}).astype(int)

# Balance the dataset
df_pos = df[df['sentiment'] == 1].sample(1500, random_state=42)  # Sample 1500 positive reviews
df_neg = df[df['sentiment'] == 0].sample(1500, random_state=42)  # Sample 1500 negative reviews
df_balanced = pd.concat([df_pos, df_neg])


In [None]:
# Tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_length = 256

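def tokenize_data(texts):
    # Convert to a plain list so a pandas Series is accepted, then pad or
    # truncate every review to exactly max_length tokens
    return tokenizer(
        list(texts),
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )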

In [None]:
# Split dataset
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df_balanced['review'], df_balanced['sentiment'], test_size=0.2, random_state=42
)

In [None]:
# Tokenizing and encoding
train_encodings = tokenize_data(train_texts)
test_encodings = tokenize_data(test_texts)

# Convert labels to tensors
train_labels = torch.tensor(train_labels.values)
test_labels = torch.tensor(test_labels.values)

In [None]:
# Create TensorDataset
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)

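# DataLoader (samplers imported explicitly so this cell does not depend on earlier parts of the notebook)
from torch.utils.data import RandomSampler, SequentialSampler

batch_size = 16
train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size)
test_loader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=batch_size)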

# Load pre-trained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

In [None]:
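# Optimizer (AdamW from torch.optim is assumed here and imported explicitly
# so this cell does not depend on earlier parts of the notebook)
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=2e-5)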

# Training loop
epochs = 4
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(train_loader):
        if step % 50 == 0 and step != 0:
            print(f'Epoch {epoch + 1}, Step {step}, Loss: {total_loss / step:.2f}')

        batch = [b.to(device) for b in batch]
        inputs = {'input_ids': batch[0], 'attention_mask': batch[1], 'labels': batch[2]}

        model.zero_grad()
        outputs = model(**inputs)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

    print(f'Training Loss after Epoch {epoch + 1}: {total_loss / len(train_loader):.2f}')



Epoch 1, Step 50, Loss: 0.63
Epoch 1, Step 100, Loss: 0.52
Training Loss after Epoch 1: 0.44
Epoch 2, Step 50, Loss: 0.22
Epoch 2, Step 100, Loss: 0.20
Training Loss after Epoch 2: 0.21
Epoch 3, Step 50, Loss: 0.11
Epoch 3, Step 100, Loss: 0.09
Training Loss after Epoch 3: 0.11
Epoch 4, Step 50, Loss: 0.07
Epoch 4, Step 100, Loss: 0.04
Training Loss after Epoch 4: 0.06
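
The training loop calls `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before each optimizer step. The plain-Python sketch below illustrates the underlying idea of clipping by global norm. The helper `clip_by_global_norm` and the toy gradients are assumptions for illustration, not PyTorch's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale down only when the norm exceeds max_norm; never scale up
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


def global_norm(grads):
    return math.sqrt(sum(g * g for vec in grads for g in vec))


# Two toy "parameter gradients" with a combined norm of 5.0
clipped, norm_before = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], max_norm=1.0)
norm_after = global_norm(clipped)
print(norm_before, norm_after)

# Gradients that are already small are left untouched
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
print(small)
```

Because all gradients share one scale factor, their direction is preserved and only the step size is limited, which helps keep fine-tuning stable when an occasional batch produces very large gradients.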


In [None]:
# Evaluation
model.eval()
predictions, true_labels_list = [], []
for batch in test_loader:
    batch = [t.to(device) for t in batch]
    inputs = {'input_ids': batch[0], 'attention_mask': batch[1]}
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    logits = logits.detach().cpu().numpy()
    predictions.append(logits)
    true_labels_list.append(batch[2].cpu().numpy())

# Flatten predictions and true labels
flat_predictions = np.concatenate(predictions, axis=0)
flat_predictions = np.argmax(flat_predictions, axis=1)

# Concatenate the per-batch label arrays into one flat array
flat_true_labels = np.concatenate(true_labels_list, axis=0)

In [None]:
# Print classification report
report = classification_report(flat_true_labels, flat_predictions)
print(f'Classification Report after Epoch {epoch + 1}:\n{report}')

Classification Report after Epoch 4:
              precision    recall  f1-score   support

           0       0.90      0.89      0.89       287
           1       0.90      0.91      0.90       313

    accuracy                           0.90       600
   macro avg       0.90      0.90      0.90       600
weighted avg       0.90      0.90      0.90       600

