# QA BERT - With Hugging Face

Ref : https://towardsdatascience.com/question-answering-with-a-fine-tuned-bert-bc4dafd45626


## Installs and Imports

In [None]:
## installs

!pip install transformers



In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

## Data load from Stanford Website


In [None]:
coqa = pd.read_json('http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json')
coqa.head()

Unnamed: 0,version,data
0,1,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,1,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,1,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,1,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."


## Data Preprocessing

In [None]:
# Drop column

# We will be dealing with the “data” column, so let’s just delete the “version” column.
del coqa["version"]

In [None]:
# loop through the rows
# for every "story" extract question answer pairs and create a DF

#required columns in our dataframe
cols = ["text","question","answer"]
#list of lists to create our dataframe
comp_list = []
for index, row in coqa.iterrows():
    for i in range(len(row["data"]["questions"])):
        temp_list = []
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols)
#saving the dataframe to csv file for further loading


new_df.to_csv("drive/MyDrive/LLM_data/CoQA_data.csv", index=False)
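
As a quick sanity check of the loop above, the same flattening can be run on a tiny hand-made record. This is only a sketch: `toy_record` and `flatten_record` are hypothetical names, and the record merely imitates the `story` / `questions` / `answers` layout that the loop reads from each row of the `data` column.

```python
# Hypothetical miniature record imitating one entry of the CoQA "data" column.
toy_record = {
    "story": "Anna has a red kite. She flies it in the park.",
    "questions": [
        {"input_text": "What does Anna have?"},
        {"input_text": "Where does she fly it?"},
    ],
    "answers": [
        {"input_text": "a red kite"},
        {"input_text": "in the park"},
    ],
}

def flatten_record(record):
    """Return one [text, question, answer] row per question in the record."""
    rows = []
    # questions and answers are parallel lists, so zip pairs them by position
    for q, a in zip(record["questions"], record["answers"]):
        rows.append([record["story"], q["input_text"], a["input_text"]])
    return rows

rows = flatten_record(toy_record)
for row in rows:
    print(row)
```

Each row repeats the full story, which matches the shape of the `text`, `question`, `answer` columns built above.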

## Data Loading from Local CSV File


In [None]:
data = pd.read_csv("drive/MyDrive/LLM_data/CoQA_data.csv")
data.head()

Unnamed: 0,text,question,answer
0,"The Vatican Apostolic Library (), more commonl...",When was the Vat formally opened?,It was formally established in 1475
1,"The Vatican Apostolic Library (), more commonl...",what is the library for?,research
2,"The Vatican Apostolic Library (), more commonl...",for what subjects?,"history, and law"
3,"The Vatican Apostolic Library (), more commonl...",and?,"philosophy, science and theology"
4,"The Vatican Apostolic Library (), more commonl...",what was started in 2014?,a project


In [None]:
print("Number of question-answer pairs: ", len(data))

Number of question-answer pairs:  108647


## Building the chatbot with a Pre-trained Model

- For question answering, an already fine-tuned model can give decent results without further training, even when the text comes from a completely different domain.


- Here we use a BERT model that has been fine-tuned on the SQuAD benchmark.

- The model is loaded through the `BertForQuestionAnswering` class from the transformers library.

In [None]:
# choose pre trained model
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# choose tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



## Ask a random question

- Select a random question-answer pair
- Tokenize the question and text as a pair
- Print and check the tokenized output




In [None]:
## Randomly select a question answer pair
random_num = np.random.randint(0,len(data))

question = data["question"][random_num]
text = data["text"][random_num]

In [None]:
print(question)

Did people thing slavery was the law of God?


In [None]:
print(text)

Mark twain tells a boy's story in The Adventure of Huckleberry Finn. Huck is a poor child, without a mother or home. His father drinks too much alcohol and always beats him. 

Huck's situation has freed him from the restriction of society. He explores in the woods and goes fishing. He stays out all night and does not go to school. He smokes. 

Huck runs away from home. He meets Jim, a black man who has escaped from slavery . They travel together on a raft made of wood down the Mississippi River. 

Mark twain started writing "Huckleberry Finn" as a children's story. But it soon became serious. The story tells about the social evil of slavery, seen through the eyes of an innocent child. Huck's ideas about people were formed by the white society in which he lived. So, at first, he does not question slavery. Huck knows that important people believe slavery is natural, the law of God. So, he thinks it is his duty to tell Jim's owners where to find him. 

Later, Huck comes to understand that

In [None]:
# tokenize
input_ids = tokenizer.encode(question, text)
# Check token length
print("The input has a total of {} tokens.".format(len(input_ids)))


The input has a total of 286 tokens.
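
BERT checkpoints such as this one accept at most 512 tokens per input, so the count is worth checking before calling the model. 286 tokens fits; a longer text would not. One common workaround is to split the context into overlapping windows and query each window separately. The sketch below shows only the windowing arithmetic on plain lists of ids; `split_into_windows`, `max_len` and `stride` are hypothetical names, not part of the notebook or of the transformers API.

```python
def split_into_windows(question_ids, context_ids, max_len=512, stride=128):
    """Split context_ids into overlapping windows so that
    [CLS] + question + [SEP] + window + [SEP] never exceeds max_len tokens.
    question_ids / context_ids are plain lists of ids without special tokens."""
    # 3 special tokens: [CLS], [SEP] after the question, [SEP] after the context
    room = max_len - len(question_ids) - 3
    if room <= 0:
        raise ValueError("question is too long for the chosen max_len")
    # consecutive windows overlap by `stride` tokens when room > stride
    step = max(1, room - stride)
    windows = []
    start = 0
    while True:
        windows.append(context_ids[start:start + room])
        if start + room >= len(context_ids):
            break
        start += step
    return windows

# toy example (assumed sizes): 10 question ids, 1000 context ids
windows = split_into_windows(list(range(10)), list(range(1000)))
print(len(windows), [len(w) for w in windows])
```

The overlap is there so that an answer cut off at the edge of one window is still likely to appear whole in the next.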


In [None]:
# display tokens and ids
tokens = tokenizer.convert_ids_to_tokens(input_ids)
# use token_id rather than id to avoid shadowing the built-in id()
for token, token_id in zip(tokens, input_ids):
    print('{:8}{:8,}'.format(token, token_id))

[CLS]        101
did        2,106
people     2,111
thing      2,518
slavery    8,864
was        2,001
the        1,996
law        2,375
of         1,997
god        2,643
?          1,029
[SEP]        102
mark       2,928
twain     24,421
tells      4,136
a          1,037
boy        2,879
'          1,005
s          1,055
story      2,466
in         1,999
the        1,996
adventure   6,172
of         1,997
hu        15,876
##ckle    19,250
##berry    9,766
finn       9,303
.          1,012
hu        15,876
##ck       3,600
is         2,003
a          1,037
poor       3,532
child      2,775
,          1,010
without    2,302
a          1,037
mother     2,388
or         2,030
home       2,188
.          1,012
his        2,010
father     2,269
drinks     8,974
too        2,205
much       2,172
alcohol    6,544
and        1,998
always     2,467
beats     10,299
him        2,032
.          1,012
hu        15,876
##ck       3,600
'          1,005
s          1,055
situation   3,663
has        2

## Create Segment Embeddings

**Steps**

- Find the position of the first `[SEP]` token.
- Mark every position up to and including that `[SEP]` as 0 (question); mark the rest as 1 (text).
- Check that the segment id vector has the same length as the input id vector.


In [None]:
#first occurrence of [SEP] token
sep_idx = input_ids.index(tokenizer.sep_token_id)
print("SEP token index: ", sep_idx)
#number of tokens in segment A (question) - this will be one more than the sep_idx as the index in Python starts from 0
num_seg_a = sep_idx+1
print("Number of tokens in segment A: ", num_seg_a)
#number of tokens in segment B (text)
num_seg_b = len(input_ids) - num_seg_a
print("Number of tokens in segment B: ", num_seg_b)
#creating the segment ids
segment_ids = [0]*num_seg_a + [1]*num_seg_b
#making sure that every input token has a segment id
assert len(segment_ids) == len(input_ids)

SEP token index:  11
Number of tokens in segment A:  12
Number of tokens in segment B:  274
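
The segment-id logic can be tried without loading the model. The sketch below rebuilds it on a hand-made id list; `make_segment_ids` is a hypothetical helper, and 101 / 102 are the `[CLS]` / `[SEP]` ids shown in the token listing above. Calling the tokenizer directly as `tokenizer(question, text)` should also return equivalent `token_type_ids`, which may be simpler in practice.

```python
def make_segment_ids(input_ids, sep_token_id=102):
    """Return 0 for every token up to and including the first [SEP], 1 for the rest."""
    sep_idx = input_ids.index(sep_token_id)   # first [SEP] closes the question
    num_seg_a = sep_idx + 1                   # question tokens, [CLS] and [SEP] included
    num_seg_b = len(input_ids) - num_seg_a    # text tokens plus the final [SEP]
    return [0] * num_seg_a + [1] * num_seg_b

# [CLS] did people ? [SEP] mark twain [SEP]   (ids taken from the listing above)
toy_ids = [101, 2106, 2111, 1029, 102, 2928, 24421, 102]
segment_ids = make_segment_ids(toy_ids)
print(segment_ids)
```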


### Create input for model and call model

In [None]:
#token input_ids to represent the input and token segment_ids to differentiate our segments - question and text
output = model(torch.tensor([input_ids]),  token_type_ids=torch.tensor([segment_ids]))

### Check the context


In [None]:
print("\ntext:\n{}".format(text))


text:
Mark twain tells a boy's story in The Adventure of Huckleberry Finn. Huck is a poor child, without a mother or home. His father drinks too much alcohol and always beats him. 

Huck's situation has freed him from the restriction of society. He explores in the woods and goes fishing. He stays out all night and does not go to school. He smokes. 

Huck runs away from home. He meets Jim, a black man who has escaped from slavery . They travel together on a raft made of wood down the Mississippi River. 

Mark twain started writing "Huckleberry Finn" as a children's story. But it soon became serious. The story tells about the social evil of slavery, seen through the eyes of an innocent child. Huck's ideas about people were formed by the white society in which he lived. So, at first, he does not question slavery. Huck knows that important people believe slavery is natural, the law of God. So, he thinks it is his duty to tell Jim's owners where to find him. 

Later, Huck comes to understa

## Find the highest-scoring start and end positions

- Take the token positions with the highest start and end scores
- Check that end >= start
- Join the tokens to create the answer
- Print the question and answer
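
Taking the two argmax positions independently can produce an end that falls before the start. A common alternative, sketched below, scores every pair with start <= end (up to a maximum answer length) and keeps the best one. This is not what the notebook does; `best_span` and `max_answer_len` are hypothetical names, and the logits are made-up numbers in plain Python lists.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximising start_logits[s] + end_logits[e]
    subject to s <= e < s + max_answer_len."""
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# made-up logits where the independent argmaxes disagree (start=3, end=1)
start_logits = [0.1, 2.0, 0.3, 2.5, 0.2]
end_logits = [0.0, 3.0, 0.1, 0.4, 2.8]
start, end, score = best_span(start_logits, end_logits)
print(start, end)
```

The double loop is quadratic in the number of tokens, which is acceptable for inputs of a few hundred tokens like the one used here.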

In [None]:
#tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)
if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])
    print("\nQuestion:\n{}".format(question.capitalize()))
    print("\nAnswer:\n{}.".format(answer.capitalize()))
else:
    print("I am unable to find the answer to this question. Can you please ask another question?")




Question:
Did people thing slavery was the law of god?

Answer:
People believe slavery is natural , the law of god.
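
Joining tokens with spaces leaves WordPiece fragments and punctuation detached (note the space before the comma above). A rough sketch of stitching `##` continuation pieces back onto the previous token follows; `merge_wordpieces` is a hypothetical helper and it does not attempt to fix punctuation spacing.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, gluing '##' continuation pieces to the previous token."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]          # continuation: append without a space
        elif text:
            text += " " + token        # new word: separate with a space
        else:
            text = token               # first token
    return text

# tokens taken from the listing earlier in the notebook
merged = merge_wordpieces(["hu", "##ckle", "##berry", "finn"])
print(merged)
```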


## Convert to functional form

In [None]:
def question_answer(question, text):

    #tokenize question and text as a pair
    input_ids = tokenizer.encode(question, text)

    #string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    #segment IDs
    #first occurrence of [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)
    #number of tokens in segment A (question)
    num_seg_a = sep_idx+1
    #number of tokens in segment B (text)
    num_seg_b = len(input_ids) - num_seg_a

    #list of 0s and 1s for segment embeddings
    segment_ids = [0]*num_seg_a + [1]*num_seg_b
    assert len(segment_ids) == len(input_ids)

    #model output using input_ids and segment_ids
    output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))

    #reconstructing the answer
    answer_start = int(torch.argmax(output.start_logits))
    answer_end = int(torch.argmax(output.end_logits))
    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            if tokens[i][0:2] == "##":
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]
    else:
        #no valid span (end before start); handled by the fallback below
        answer = ""

    if not answer or answer.startswith("[CLS]"):
        answer = "Unable to find the answer to your question."

    print("\nPredicted answer:\n{}".format(answer.capitalize()))
    #return the answer so that callers can store it, e.g. response = question_answer(...)
    return answer

## Test the function

In [None]:
text = """New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium."""
question = "Where was the Auction held?"
question_answer(question, text)
#original answer from the dataset
print("Original answer:\n", data.loc[data["question"] == question]["answer"].values[0])


Predicted answer:
Hard rock cafe in new york ' s times square
Original answer:
 Hard Rock Cafe
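
The predicted span is longer than the reference answer but contains it. One way to quantify that overlap is a token-level F1 in the style of the SQuAD evaluation. The sketch below is a simplified version written for this example; `normalize` and `token_f1` are hypothetical helpers, not the official evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    # multiset intersection counts each shared token at most as often as it appears in both
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

f1 = token_f1("Hard rock cafe in new york ' s times square", "Hard Rock Cafe")
print(round(f1, 2))
```

Here recall is perfect (all three reference tokens are found) while precision is penalised by the extra location words in the prediction.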


## Play around with the model

### Read the Britannia wiki data


In [None]:
# Read wiki dump in .txt format
path = "drive/MyDrive/LLM_data/britannia_wiki.txt"
# the with-block closes the file once it has been read
with open(path, 'r') as britannia_file:
    text = britannia_file.read()

### Write your question regarding Britannia

In [None]:
question = "Which city did Britannia start from ?"
response = question_answer(question, text)


Predicted answer:
Kolkata


In [None]:
question = "Who founded Britannia Industries  ?"
response = question_answer(question, text)



Predicted answer:
A group of british businessmen


In [None]:
question = "in Which city was Britannia industries started ?"
response = question_answer(question, text)


Predicted answer:
Kolkata


### Code logic to loop for multiple questions on same text block

In [None]:
text = input("Please enter your text: \n")
question = input("\nPlease enter your question: \n")
while True:
    question_answer(question, text)

    # keep asking until the user gives a valid Y/N response
    # (empty input and lowercase y/n are handled as well)
    response = ""
    while response[:1] not in ("Y", "N"):
        response = input("\nDo you want to ask another question based on this text (Y/N)? ").strip().upper()

    if response[0] == "N":
        print("\nBye!")
        break

    question = input("\nPlease enter your question: \n")

### End of Notebook