# Introduction



“Why is the sky blue?”

This is a question an open-domain question answering (QA) system should be able to respond to. QA systems emulate how people look for information by reading the web to return answers to common questions. Machine learning can be used to improve the accuracy of these answers.

Existing natural language models have been focused on extracting answers from a short paragraph rather than reading an entire page of content for proper context. As a result, the responses can be complicated or lengthy. A good answer will be both succinct and relevant.

In this competition, your goal is to predict short and long answer responses to real questions about Wikipedia articles. The dataset is provided by Google's Natural Questions, but contains its own unique private test set. A visualization of examples shows long and—where available—short answers. In addition to prizes for the top teams, there is a special set of awards for using TensorFlow 2.0 APIs.

If successful, this challenge will help spur the development of more effective and robust QA systems.

## What should I expect the data format to be?
Each sample contains a Wikipedia article, a related question, and the candidate long form answers. The training examples also provide the correct long and short form answer or answers for the sample, if any exist.
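
As a rough illustration of that layout, the sketch below builds one hand-written record using the field names that appear later in this notebook. Every value (text, id, URL, token offsets) is invented for illustration and is not taken from the dataset; exclusive `end_token` offsets are an assumption here.

```python
# A hand-made record mimicking one line of the simplified train file.
# All values are invented; only the field names follow the dataset.
sample = {
    "example_id": 1,                                   # hypothetical id
    "question_text": "why is the sky blue",
    "document_url": "https://en.wikipedia.org/wiki/Example",  # placeholder URL
    "document_text": "<P> The sky is blue because of Rayleigh scattering . </P>",
    "long_answer_candidates": [
        {"start_token": 0, "end_token": 11, "top_level": True},
    ],
    "annotations": [
        {
            "yes_no_answer": "NONE",
            "long_answer": {"start_token": 0, "end_token": 11, "candidate_index": 0},
            "short_answers": [{"start_token": 7, "end_token": 9}],
        }
    ],
}

# Test examples have the same fields except for "annotations".
for key, value in sample.items():
    print(key, "->", type(value).__name__)
```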

## What am I predicting?
For each article + question pair, you must predict / select long and short form answers to the question drawn directly from the article.

- A long answer would be a longer section of text that answers the question - several sentences or a paragraph.
- A short answer might be a sentence or phrase, or even in some cases a YES/NO. The short answers are always contained within / a subset of one of the plausible long answers.
- A given article can (and very often will) allow for both long and short answers, depending on the question.
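
Answers are expressed as token spans over the article text, so turning a prediction back into text is a matter of slicing the whitespace-separated tokens. The snippet below is a minimal sketch of that idea; `span_text` is a hypothetical helper, the toy document is invented, and it assumes `end_token` is exclusive.

```python
def span_text(document_text, start_token, end_token):
    """Return the whitespace tokens in [start_token, end_token) joined by spaces."""
    tokens = document_text.split()
    return " ".join(tokens[start_token:end_token])

# Toy document, invented for illustration
doc = "<P> The sky is blue because of Rayleigh scattering . </P>"

long_answer = span_text(doc, 0, 11)   # the whole paragraph
short_answer = span_text(doc, 7, 9)   # a phrase inside the long answer

print(long_answer)
print(short_answer)
```

In this toy case the short answer is a sub-span of the long answer, which mirrors the containment rule described above.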

There is more detail about the data and what you're predicting [on the GitHub page](https://github.com/google-research-datasets/natural-questions/blob/master/README.md) for the Natural Questions dataset. This page also contains helpful utilities and scripts. Note that we are using the simplified text version of the data: most of the HTML tags have been removed, and only those necessary to break up paragraphs / sections are included.



# Imports

In [None]:
import gc
import json
import subprocess

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML, display
from tqdm import tqdm_notebook as tqdm

PATH = '/kaggle/input/tensorflow2-question-answering/'
!ls {PATH}

# Parameters

In [None]:
# Train and test dataset paths
PATH_TRAIN = PATH + 'simplified-nq-train.jsonl'
PATH_TEST = PATH + 'simplified-nq-test.jsonl'
nrows = 1000  # Number of lines to read from the train dataset

# Initialize containers

train_ds = []


# Extract data

A common way to convert a .jsonl file into a pd.DataFrame is pd.read_json(FILENAME, orient='records', lines=True). However, the train dataset is huge and does not fit in the Kaggle notebook's RAM. Instead, I'll read the train file line by line and keep only the first nrows examples:

In [None]:

with open(PATH_TRAIN, 'rt') as f:
    # Each line of the file is one JSON record; parse only the first nrows
    for i in range(nrows):
        train_ds.append(json.loads(f.readline()))

train_df = pd.DataFrame(train_ds)
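
A slightly more defensive variant of the same idea is sketched below, using `itertools.islice` so that asking for more lines than the file contains does not raise an error. `read_first_records` is a hypothetical helper, and the temporary file only stands in for the real train file so the snippet can run anywhere.

```python
import json
import os
import tempfile
from itertools import islice

def read_first_records(path, n):
    """Parse at most n JSON lines from path without loading the whole file."""
    with open(path, "rt") as f:
        return [json.loads(line) for line in islice(f, n)]

# Tiny stand-in file (5 invented records) so the sketch is self-contained
with tempfile.NamedTemporaryFile("wt", suffix=".jsonl", delete=False) as tmp:
    for i in range(5):
        tmp.write(json.dumps({"example_id": i}) + "\n")
    tmp_path = tmp.name

first_three = read_first_records(tmp_path, 3)
all_five = read_first_records(tmp_path, 100)  # more than exists: stops at end of file
os.remove(tmp_path)

print(len(first_three), len(all_five))
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as `train_ds`.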


In [None]:
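# Unpack the first annotation of each example into separate columns
train_df['yes_no'] = train_df.annotations.apply(lambda x: x[0]['yes_no_answer'])
# Long answer as [start_token, end_token]; start_token is -1 when no long answer exists
train_df['long'] = train_df.annotations.apply(lambda x: [x[0]['long_answer']['start_token'], x[0]['long_answer']['end_token']])
train_df['short'] = train_df.annotations.apply(lambda x: x[0]['short_answers'])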

In [None]:
train_df.head()

In [None]:
train_df["annotations"][0]

# Data explained

question_text - the question that the user asked (typed into the Google search engine)

document_url - link to the page in which the answer is searched for

annotations - the labeled answer: the yes/no answer (YES, NO, or NONE when the question is not a yes/no question), the list of short answers (can be empty), and the start and end tokens of the best long answer

# long_answer_candidates

start_token: The token position in the text where the answer begins;

end_token: The token position in the text where the answer ends;

top_level: Whether this candidate is at the top level, i.e. not contained inside another candidate in the text (True/False)


Note: there can be multiple long answer candidates for a single question.
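
To make the candidate fields concrete, here is a small sketch with invented candidates. It assumes, following the Natural Questions README, that `top_level` is True for candidates that are not nested inside another candidate; `is_nested_in` is a hypothetical helper.

```python
# Invented candidates: the second one lies inside the first
candidates = [
    {"start_token": 0, "end_token": 50, "top_level": True},
    {"start_token": 5, "end_token": 20, "top_level": False},
    {"start_token": 60, "end_token": 90, "top_level": True},
]

def is_nested_in(inner, outer):
    """True if inner's token span lies within outer's span (and they differ)."""
    return (
        inner is not outer
        and outer["start_token"] <= inner["start_token"]
        and inner["end_token"] <= outer["end_token"]
    )

# Keep only the candidates that are not contained in another candidate
top_level_candidates = [c for c in candidates if c["top_level"]]

print(len(top_level_candidates))
print(is_nested_in(candidates[1], candidates[0]))
```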

# Get the number of examples in the train/test sets

In [None]:
N_TRAIN_bytes = subprocess.check_output('wc -l {}'.format(PATH_TRAIN), shell=True)
N_TEST_bytes = subprocess.check_output('wc -l {}'.format(PATH_TEST), shell=True)

N_TRAIN = int(N_TRAIN_bytes.split()[0])
N_TEST = int(N_TEST_bytes.split()[0])

In [None]:
print("Train examples : " , N_TRAIN)
print("Test examples : " , N_TEST)
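
If `wc` is not available (for example outside a Unix-like environment), the same count can be obtained in pure Python. The sketch below counts newline characters in fixed-size binary chunks, which keeps memory use flat for very large files; `count_lines` is a hypothetical helper and the temporary file is only a stand-in.

```python
import os
import tempfile

def count_lines(path, chunk_size=1 << 20):
    """Count newline characters in a file, reading it in binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count

# Stand-in file with three lines
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"a\nb\nc\n")
    tmp_path = tmp.name

n_lines = count_lines(tmp_path)
os.remove(tmp_path)
print(n_lines)
```

Like `wc -l`, this counts newline characters, so a final line without a trailing newline is not counted.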

# Visualization

In [None]:
question_text_train_len = np.zeros(N_TRAIN)
document_text_train_len = np.zeros(N_TRAIN)

question_text_test_len = np.zeros(N_TEST)
document_text_test_len = np.zeros(N_TEST)

t_yesno_train = []
n_long_candidates_train = np.zeros(N_TRAIN)
t_long_train = np.zeros((N_TRAIN,2))
n_long_candidates_test = np.zeros(N_TEST)

In [None]:
with open(PATH_TRAIN, 'rt') as f:
    for train_idx in tqdm(range(N_TRAIN)):
        dic = json.loads(f.readline())
        question_text_train_len[train_idx] = len(dic['question_text'].split())
        document_text_train_len[train_idx] = len(dic['document_text'].split())
        t_yesno_train.append(dic['annotations'][0]['yes_no_answer'])
        n_long_candidates_train[train_idx] = len(dic['long_answer_candidates'])
        t_long_train[train_idx,0] = dic['annotations'][0]['long_answer']['start_token']
        t_long_train[train_idx,1] = dic['annotations'][0]['long_answer']['end_token']


In [None]:
with open(PATH_TEST, 'rt') as f:
    for test_idx in tqdm(range(N_TEST)):
        dic = json.loads(f.readline())
        question_text_test_len[test_idx] = len(dic['question_text'].split())
        document_text_test_len[test_idx] = len(dic['document_text'].split())
        n_long_candidates_test[test_idx] = len(dic['long_answer_candidates'])
     

In [None]:
plt.hist(question_text_train_len, density=True , alpha=0.9, color='orange', label='train') 
plt.hist(question_text_test_len, density=True, alpha=0.5, color='b', label='test')
plt.legend()
plt.title('Question text length Vs Sample proportion')
plt.xlabel('Question text length')
plt.ylabel('Sample proportion')


In [None]:
plt.hist(document_text_train_len, color='orange',density=True, alpha=0.9, label='Train') 
plt.hist(document_text_test_len,  color='b',density=True, alpha=0.5, label='Test') 
plt.legend()
plt.title('Document text length Vs Sample proportion')
plt.xlabel('Document text length')
plt.ylabel('Sample proportion')


In [None]:
plt.hist(t_yesno_train, bins=[0,1,2,3], align='left', density=True, rwidth=0.5, label='train')
plt.legend()
plt.title('yes-no answer sample proportion')
plt.xlabel('yes-no answer')
plt.ylabel('sample proportion')


In [None]:
plt.hist(n_long_candidates_train, alpha=0.5, color='b', label='train') 
plt.legend()
plt.title('Long answer candidates Vs Samples')
plt.xlabel('Long answer candidates')
plt.ylabel('Samples')


# Check the proportion of questions that have an answer

In [None]:
display(train_df.long.apply(lambda x: "Answer Doesn't exist" if x[0] == -1 else "Answer Exists").value_counts(normalize=True))

# Short answers distribution - as we can see, less than 50% are yes/no questions

In [None]:
# 'long' holds [start_token, end_token]; a start_token of -1 means no long answer exists
mask_answer_exists = train_df.long.apply(lambda x: x[0] != -1)

yes_no_dist = train_df[mask_answer_exists].yes_no.value_counts(normalize=True)
yes_no_dist

In [None]:
short_dist = train_df[mask_answer_exists].short.apply(lambda x: "Short answer exists" if len(x) > 0 else "Short answer doesn't exist").value_counts(normalize=True)
plt.figure(figsize=(8,6))
sns.barplot(x=short_dist.index,y=short_dist.values).set_title("Short answers distribution Vs questions with answers")

In [None]:
short_size_dist = train_df[mask_answer_exists].short.apply(len).value_counts(normalize=True)
# Group every count of 2 or more short answers into a single bucket
short_size_dist_pretty = pd.Series({
    'No Short answer': short_size_dist.get(0, 0.0),
    '1 Short answer': short_size_dist.get(1, 0.0),
    'More than 1 short answer': short_size_dist[short_size_dist.index >= 2].sum(),
})
plt.figure(figsize=(12,6))
sns.barplot(x=short_size_dist_pretty.index,y=short_size_dist_pretty.values).set_title("Short Answers Distribution Vs questions with answers")
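
The checks above can also be folded into a single label per example. The following is a rough sketch of such a labeling, assuming the annotation layout used in this notebook; `answer_type` is a hypothetical helper and the four annotations are invented.

```python
from collections import Counter

def answer_type(annotation):
    """Label one annotation as 'no answer', 'yes/no', 'short', or 'long only'."""
    if annotation["long_answer"]["start_token"] == -1:
        return "no answer"
    if annotation["yes_no_answer"] in ("YES", "NO"):
        return "yes/no"
    if len(annotation["short_answers"]) > 0:
        return "short"
    return "long only"

# Invented annotations, one per case
toy_annotations = [
    {"yes_no_answer": "NONE", "long_answer": {"start_token": -1, "end_token": -1}, "short_answers": []},
    {"yes_no_answer": "YES", "long_answer": {"start_token": 10, "end_token": 40}, "short_answers": []},
    {"yes_no_answer": "NONE", "long_answer": {"start_token": 10, "end_token": 40},
     "short_answers": [{"start_token": 12, "end_token": 14}]},
    {"yes_no_answer": "NONE", "long_answer": {"start_token": 10, "end_token": 40}, "short_answers": []},
]

type_counts = Counter(answer_type(a) for a in toy_annotations)
print(type_counts)
```

On the real data this could be applied as `train_df.annotations.apply(lambda x: answer_type(x[0]))` followed by `value_counts(normalize=True)`.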