In [2]:
!pip install transformers



In [3]:
!pip install pandas numpy torch



In [4]:
import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer



In [5]:
coqa = pd.read_json('http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json')
coqa.head()

Unnamed: 0,version,data
0,1,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,1,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,1,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,1,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."


Data Cleaning

In [6]:
del coqa["version"]   # we only work with the data column, so drop the version column

In [7]:
cols=["text","question","answer"]

comp_list=[]

for index, row in coqa.iterrows():         # iterrows() is a pandas method that yields the DataFrame rows one at a time
    for i in range(len(row["data"]["questions"])):
        temp_list=[]
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)

new_df=pd.DataFrame(comp_list,columns=cols)
new_df.to_csv("CoQA_data.csv", index=False)
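
To make the nested layout that the loop above walks through easier to see, here is a minimal sketch on a hand-made record. The record is hypothetical toy data that only mimics the `story`, `questions` and `answers` keys read above, and `flatten_record` is an illustrative helper, not part of the notebook.

```python
# Hypothetical toy record with the same keys that the loop above reads.
toy_record = {
    "story": "The library opened in 1475. It is used for research.",
    "questions": [
        {"input_text": "When did it open?"},
        {"input_text": "What is it used for?"},
    ],
    "answers": [
        {"input_text": "1475"},
        {"input_text": "research"},
    ],
}


def flatten_record(record):
    """Return one [story, question, answer] row per question in the record."""
    rows = []
    for q, a in zip(record["questions"], record["answers"]):
        rows.append([record["story"], q["input_text"], a["input_text"]])
    return rows


rows = flatten_record(toy_record)
for row in rows:
    # Every row repeats the full story; only question and answer change.
    print(row[1], "->", row[2])
```

Each question gets its own row with the full story repeated, which is why the resulting DataFrame has far more rows than the original JSON has stories.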

In [8]:
data=pd.read_csv("CoQA_data.csv")
data.head()

Unnamed: 0,text,question,answer
0,"The Vatican Apostolic Library (), more commonl...",When was the Vat formally opened?,It was formally established in 1475
1,"The Vatican Apostolic Library (), more commonl...",what is the library for?,research
2,"The Vatican Apostolic Library (), more commonl...",for what subjects?,"history, and law"
3,"The Vatican Apostolic Library (), more commonl...",and?,"philosophy, science and theology"
4,"The Vatican Apostolic Library (), more commonl...",what was started in 2014?,a project


In [9]:
print("Number of questions and answers", len(data))

Number of questions and answers 108647


Model

In [10]:
model=BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer=BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Asking a Question

In [11]:
random_num=np.random.randint(0,len(data))

question=data["question"][random_num]
text=data["text"][random_num]

In [12]:
input_ids=tokenizer.encode(question,text)
print("The input has a total of {} tokens".format(len(input_ids)))

The input has a total of 395 tokens


In [13]:
tokens=tokenizer.convert_ids_to_tokens(input_ids)

for token, token_id in zip(tokens, input_ids):   # token_id avoids shadowing the built-in id()
    print('{:8}{:8,}'.format(token, token_id))

[CLS]        101
many       2,116
guests     6,368
are        2,024
from       2,013
what       2,054
?          1,029
[SEP]        102
soccer     4,715
star       2,732
david      2,585
beck      10,272
##ham      3,511
will       2,097
be         2,022
there      2,045
with       2,007
his        2,010
pop        3,769
star       2,732
wife       2,564
victoria   3,848
.          1,012
elton     19,127
john       2,198
is         2,003
attending   7,052
with       2,007
partner    4,256
david      2,585
fur        6,519
##nish    24,014
.          1,012
the        1,996
guest      4,113
list       2,862
for        2,005
the        1,996
april      2,258
29         2,756
union      2,586
of         1,997
prince     3,159
william    2,520
and        1,998
kate       5,736
middleton  17,756
is         2,003
still      2,145
being      2,108
kept       2,921
secret     3,595
,          1,010
but        2,021
details    4,751
have       2,031
begun      5,625
to         2,000
leak      17

As long as the same tokenizer and the same model are used, a given word or token always maps to the same ID. Token IDs depend on the vocabulary that was used to train the model, and that vocabulary ships with the model and does not change.

Tokens correspond to whole words, word pieces, or punctuation marks.
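
The fixed token-to-ID mapping can be pictured as a plain dictionary lookup. The sketch below is only an illustration: `TOY_VOCAB` contains a handful of (token, ID) pairs copied from the printed output above, and the helper names are made up for this example. The real vocabulary is far larger and is read from the file that comes with the checkpoint.

```python
# A few (token, ID) pairs copied from the printed output above.
# This dict is a stand-in for the real, much larger vocabulary.
TOY_VOCAB = {
    "[CLS]": 101,
    "[SEP]": 102,
    "many": 2116,
    "guests": 6368,
    "are": 2024,
    "from": 2013,
    "what": 2054,
    "?": 1029,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_tokens_to_ids(tokens):
    """Look up each token in the toy vocabulary."""
    return [TOY_VOCAB[t] for t in tokens]


def toy_ids_to_tokens(ids):
    """Inverse lookup: map each ID back to its token."""
    return [ID_TO_TOKEN[i] for i in ids]


question_tokens = ["[CLS]", "many", "guests", "are", "from", "what", "?", "[SEP]"]
ids = toy_tokens_to_ids(question_tokens)
print(ids)
print(toy_ids_to_tokens(ids))
```

Because the lookup table never changes, encoding the same tokens twice gives the same IDs, and decoding those IDs gives back the original tokens.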

In the output above we can see two special tokens, [CLS] and [SEP]. [CLS] stands for classification: its output vector represents the whole input and is used for sentence-level classification tasks. [SEP] separates the two pieces of text. The encoded input contains two [SEP] tokens, one after the question and another after the text; only the first is visible in the truncated listing above.

[CLS]:
Represents the entire input sequence in a single vector.
This vector can be used for classification, regression, or other text-level tasks.
[SEP]:
Helps the model tell the text segments in the input apart.
In paired-text tasks, it lets the model know which part belongs to which text.
It is also used when judging whether two sentences are related (for example, in Next Sentence Prediction).
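
The pair layout described above can be sketched without the tokenizer. `build_pair` is a hypothetical helper written for this illustration; it only mimics the `[CLS] question [SEP] text [SEP]` arrangement and does no real tokenization. The tokens are taken from the printed output above.

```python
def build_pair(question_tokens, text_tokens):
    """Lay out a question/text pair as [CLS] question [SEP] text [SEP]."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + text_tokens + ["[SEP]"]


pair = build_pair(
    ["many", "guests", "are", "from", "what", "?"],
    ["soccer", "star", "david", "beck", "##ham"],
)
first_sep = pair.index("[SEP]")

print(pair)
print("first [SEP] at index", first_sep)
```

The first [SEP] lands at index 7, which agrees with the SEP token index reported for the real input in this notebook.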

For every token, BERT sums the following three embeddings:

Token Embedding: represents the token itself.
Segment Embedding: indicates which text the token belongs to (the question or the passage).
Position Embedding: indicates the token's position in the sequence.
Summed together, these three embeddings give the model what it needs to interpret each token in its context.
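
A minimal sketch of the element-wise sum described above. All three lookup tables are hypothetical 3-dimensional toy values chosen for readability; the real model learns these tables during training and uses much larger vectors.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


# Hypothetical toy lookup tables (3 dimensions instead of the real size).
token_emb = {"what": [0.5, 0.0, 1.0], "star": [1.0, 2.0, 0.0]}
segment_emb = {0: [0.25, 0.25, 0.25], 1: [-0.25, -0.25, -0.25]}
position_emb = {0: [0.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}


def input_embedding(token, segment, position):
    """Sum the token, segment and position embeddings for one input slot."""
    return add_vectors(
        token_emb[token], segment_emb[segment], position_emb[position]
    )


print(input_embedding("what", 0, 1))  # [0.75, 1.25, 1.25]
# The same token gets a different vector in another segment and position.
print(input_embedding("star", 0, 0))
print(input_embedding("star", 1, 1))
```

The last two lines show why the sum matters: the token "star" appears twice in the passage above, and the position term lets the model tell the two occurrences apart.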

In [15]:
#first occurrence of the [SEP] token
sep_idx = input_ids.index(tokenizer.sep_token_id)
print("SEP token index:", sep_idx)

#number of tokens in segment A (the question, including [CLS] and the first [SEP])
num_seg_a = sep_idx + 1
print("Number of tokens in segment A:", num_seg_a)

#number of tokens in segment B (the text, including the final [SEP])
num_seg_b = len(input_ids) - num_seg_a
print("Number of tokens in segment B:", num_seg_b)

#creating the segment ids
segment_ids = [0] * num_seg_a + [1] * num_seg_b

#making sure that every input token has a segment id
assert len(segment_ids) == len(input_ids)

SEP token index: 7
Number of tokens in segment A: 8
Number of tokens in segment B: 387


In [16]:
output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))

PyTorch tensors are the basic data structure that the PyTorch framework uses for data handling and computation. They behave much like NumPy arrays but can also run on a GPU, which makes them well suited to deep learning models.

Wrapping the list in an extra pair of brackets ([ ]) turns the input into a 2-dimensional tensor.
This adds the batch dimension the model expects, so the input becomes a batch containing one sequence.
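
The effect of the extra brackets can be checked on plain lists, without PyTorch. `nested_shape` is a hypothetical helper written for this note; calling `torch.tensor(...).shape` on the same two lists should report the same two shapes.

```python
def nested_shape(obj):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(obj, list):
        shape.append(len(obj))
        obj = obj[0] if obj else None
    return tuple(shape)


toy_input_ids = [101, 2116, 6368, 102]

print(nested_shape(toy_input_ids))    # (4,)   -> a single sequence
print(nested_shape([toy_input_ids]))  # (1, 4) -> a batch of one sequence
```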

In [18]:
#tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)

if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])
else:
    #fall back to a message so that `answer` is always defined for the print below
    answer = "I am unable to find the answer to this question. Can you please ask another question?"

print("\n Question:\n{}".format(question.capitalize()))
print("\nAnswer:\n{}".format(answer.capitalize()))


 Question:
Many guests are from what?

Answer:
Charities they work with


start_logits holds, for every token in the input, the model's score for the answer starting at that token.
end_logits holds, for every token in the input, the model's score for the answer ending at that token.

torch.argmax returns the index of the largest value in a tensor.
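
Taking the two argmax values independently can yield an end position that lies before the start position. One common alternative, sketched here on toy numbers, is to search for the (start, end) pair with the highest combined score under the constraint end >= start. The logits below are invented for illustration, and `best_span` with its `max_answer_len` parameter is a hypothetical helper, not something the notebook or the library provides.

```python
def argmax(scores):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score, end >= start."""
    best = None
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Invented toy logits where the independent argmax gives end < start.
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1]
end_logits = [0.2, 5.0, 0.1, 3.0, 0.4]

print(argmax(start_logits), argmax(end_logits))  # 2 1
print(best_span(start_logits, end_logits))       # (2, 3)
```

With real model output, the logits would first have to be converted to plain lists, for example with `output.start_logits[0].tolist()`.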

Consider the words run, running, and runner. Without WordPiece tokenization, the model has to store and learn the meaning of all three words independently. With WordPiece tokenization, a word that is not in the vocabulary as a whole is split into a stem and "##"-prefixed continuation pieces (for example, "run", "##ning", "##ner"). The model then learns the context of "run" once, and the rest of the meaning is carried by the suffix pieces, which are learned from other words with similar suffixes.
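
The splitting idea can be sketched as a greedy longest-match-first search over a vocabulary. This is a simplified illustration, not the tokenizer's actual code: `toy_vocab` is a made-up five-entry vocabulary and `wordpiece` is a hypothetical helper, so the splits it produces need not match those of the real BERT vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Made-up toy vocabulary for this example only.
toy_vocab = {"run", "##ning", "##ner", "beck", "##ham"}

for word in ["run", "running", "runner", "beckham", "xyz"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

With this toy vocabulary, "beckham" splits into "beck" and "##ham", the same two pieces that appear in the token listing earlier in the notebook.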

In [20]:
answer=tokens[answer_start]
for i in range(answer_start+1, answer_end+1):
    if tokens[i][0:2]=='##':
        answer+=tokens[i][2:]
    else:
        answer+=" "+tokens[i]

Turning the question-answering process into a function

In [21]:
def question_answer(question, text):

    #tokenize question and text as a pair
    input_ids = tokenizer.encode(question, text)

    #string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    #segment ids: 0 for the question (up to the first [SEP]), 1 for the text
    sep_idx = input_ids.index(tokenizer.sep_token_id)

    num_seg_a = sep_idx + 1
    num_seg_b = len(input_ids) - num_seg_a

    segment_ids = [0] * num_seg_a + [1] * num_seg_b

    assert len(segment_ids) == len(input_ids)

    #inference only, so no gradients are needed
    with torch.no_grad():
        output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))

    #tokens with the highest start and end scores, as plain ints
    answer_start = int(torch.argmax(output.start_logits))
    answer_end = int(torch.argmax(output.end_logits))

    #default message, so that `answer` is defined even when no valid span is found
    answer = "Unable to find the answer to your question."

    if answer_end >= answer_start and tokens[answer_start] != "[CLS]":
        answer = tokens[answer_start]
        #merge '##' word pieces back into whole words
        for i in range(answer_start + 1, answer_end + 1):
            if tokens[i][0:2] == '##':
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]

    print("\nPredicted answer:\n{}".format(answer.capitalize()))

In [23]:
text = """New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium."""
question = "Where was the Auction held?"
question_answer(question, text)
#original answer from the dataset
print("Original answer:\n", data.loc[data["question"] == question]["answer"].values[0])


Predicted answer:
Hard rock cafe in new york ' s times square
Original answer:
 Hard Rock Cafe


In [27]:
text = input("Please enter your text: \n")
question = input("\nPlease enter your question: \n")
while True:
    question_answer(question, text)

    #keep asking until the reply starts with Y or N (case-insensitive; empty input is re-prompted)
    response = ""
    while response not in ("Y", "N"):
        response = input("\nDo you want to ask another question based on this text (Y/N)? ").strip().upper()[:1]

    if response == "N":
        print("\nBye!")
        break

    question = input("\nPlease enter your question: \n")

Please enter your text: 
 The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula.   The Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail.   In March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online.   The Vatican Secret Archives were separated from the library at

TypeError: 'tuple' object is not callable