In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from IPython.display import HTML, display  # display() is called explicitly in show_example
import json

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

PATH = '/kaggle/input/tensorflow2-question-answering/'
!ls {PATH}

# Introduction

“Why is the sky blue?”

This is a question an open-domain question answering (QA) system should be able to respond to. QA systems emulate how people look for information by reading the web to return answers to common questions. Machine learning can be used to improve the accuracy of these answers.

Existing natural language models have been focused on extracting answers from a short paragraph rather than reading an entire page of content for proper context. As a result, the responses can be complicated or lengthy. A good answer will be both succinct and relevant.

In this competition, your goal is to predict short and long answer responses to real questions about Wikipedia articles. The dataset is provided by Google's Natural Questions, but contains its own unique private test set. A visualization of examples shows long and—where available—short answers. In addition to prizes for the top teams, there is a special set of awards for using TensorFlow 2.0 APIs.

If successful, this challenge will help spur the development of more effective and robust QA systems.

# Data description
In this competition, we are tasked with selecting the best short and long answers from Wikipedia articles to the given questions.

## What should I expect the data format to be?
Each sample contains a Wikipedia article, a related question, and the candidate long form answers. The training examples also provide the correct long and short form answer or answers for the sample, if any exist.

## What am I predicting?
For each article + question pair, you must predict / select long and short form answers to the question drawn directly from the article.
* A long answer would be a longer section of text that answers the question: several sentences or a paragraph.
* A short answer might be a sentence or phrase, or even in some cases a YES/NO. The short answers are always contained within / a subset of one of the plausible long answers.
* A given article can (and very often will) allow for both long and short answers, depending on the question.

There is more detail about the data and what you're predicting [on the GitHub page](https://github.com/google-research-datasets/natural-questions/blob/master/README.md) for the Natural Questions dataset. This page also contains helpful utilities and scripts. Note that we are using the simplified text version of the data: most of the HTML tags have been removed, and only those necessary to break up paragraphs / sections are included.

## File descriptions
* simplified-nq-train.jsonl - the training data, in newline-delimited JSON format.
* simplified-nq-kaggle-test.jsonl - the test data, in newline-delimited JSON format.
* sample_submission.csv - a sample submission file in the correct format
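
Because the files are newline-delimited JSON, each line can be parsed on its own, so a large file does not have to be loaded whole. Below is a minimal sketch of that idea. The helper name `iter_jsonl` is hypothetical, and the demo writes a tiny stand-in file with made-up records instead of touching the real data.

```python
import json
import os
import tempfile
from itertools import islice

def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny stand-in file (made-up records) so the sketch runs anywhere
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl')
with open(demo_path, 'wt', encoding='utf-8') as f:
    for i in range(10):
        f.write(json.dumps({'example_id': i, 'question_text': f'question {i}'}) + '\n')

# Take only the first three records; the remaining lines are never parsed
first_rows = list(islice(iter_jsonl(demo_path), 3))
print([row['example_id'] for row in first_rows])
```

The same generator could be pointed at `PATH + 'simplified-nq-train.jsonl'` and wrapped in `islice` to sample the head of the training file.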

# Input

In [None]:
train_head = []
nrows = 5

with open(PATH+'simplified-nq-train.jsonl', 'rt') as f:
    for i in range(nrows):
        train_head.append(json.loads(f.readline()))

train_df = pd.DataFrame(train_head)

In [None]:
train_df

In [None]:
index = 0

# Document text
The text of the article in question (with some HTML tags to provide document structure). The text can be tokenized by splitting on whitespace.

In [None]:
ans = train_df.loc[index,'document_text'].split()
' '.join(ans[:30])

# Question text
The question to be answered

In [None]:
train_df.loc[index,'question_text']

The following function is taken from this [kernel](https://www.kaggle.com/yutanakamura/nlp-express-0-data-loading-visualization).

In [None]:
#  this function taken from this kernel
def show_example(example_id):
#     example_id = 5655493461695504401

    example = train_df[train_df['example_id']==example_id]
    document_text = example['document_text'].values[0]
    question = example['question_text'].values[0]

    annotations = example['annotations'].values[0]
    la_start_token = annotations[0]['long_answer']['start_token']
    la_end_token = annotations[0]['long_answer']['end_token']
    long_answer = " ".join(document_text.split(" ")[la_start_token:la_end_token])
    short_answers = annotations[0]['short_answers']
    sa_list = []
    for sa in short_answers:
        sa_start_token = sa['start_token']
        sa_end_token = sa['end_token']
        short_answer = " ".join(document_text.split(" ")[sa_start_token:sa_end_token])
        sa_list.append(short_answer)
    
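    # Only substitute when a long answer exists (replacing '' would insert the marker everywhere)
    if la_start_token != -1:
        document_text = document_text.replace(long_answer,'<LALALALA>')

    # Mark short-answer boundaries in a single pass over the long answer, so the
    # long answer is kept once even with zero or several short answers
    sa_starts = {sa['start_token'] for sa in short_answers}
    sa_ends = {sa['end_token']-1 for sa in short_answers}
    la = ''
    for i,laword in enumerate(long_answer.split(" ")):
        ind = i+la_start_token
        if ind in sa_starts:
            laword = 'SASASASA'+laword
        if ind in sa_ends:
            laword = laword+'SESESESE'
        la = la+' '+laword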
    html = '<div style="font-weight: bold;font-size: 20px;color:#00239CFF">Example Id</div><br/>'
    html = html + '<div>' + str(example_id) + '</div><hr>'
    html = html + '<div style="font-weight: bold;font-size: 20px;color:#00239CFF">Question</div><br/>'
    html = html + '<div>' + question + ' ?</div><hr>'
    html = html + '<div style="font-weight: bold;font-size: 20px;color:#00239CFF">Document Text</div><br/>'
    
    if la_start_token==-1:
        html = html + '<div>There are no answers found in the document</div><hr>'
    else:
        la = la.replace('SASASASA','<span style="background-color:#C7D3D4FF; padding:5px"><font color="#000">')
        la = la.replace('SESESESE','</font></span>')
        document_text = document_text.replace('<LALALALA>','<div style="background-color:#603F83FF; padding:5px"><font color="#fff">'+la+'</font></div>')

        #for simplicity, trim words from end of the document
        html = html + '<div>' + " ".join(document_text.split(" ")[:la_end_token+200]) + ' </div>'
    display(HTML(html))

In [None]:
show_example(train_df.loc[index, "example_id"])

# Long answer candidates
The first task in Natural Questions is to identify the smallest HTML bounding box that contains all of the information required to infer the answer to a question. These long answers can be paragraphs, lists, list items, tables, or table rows. While the candidates can be inferred directly from the HTML or token sequence, the dataset also includes a list of long answer candidates for convenience. Each candidate is defined in terms of document tokens.

In [None]:
train_df.loc[index,'long_answer_candidates']

* **Token** : Either a word or an HTML tag that defines a heading, paragraph, table, or list. Splitting document_text on ' ' gives the list of tokens.
* **Top_level** : A boolean flag that identifies whether a candidate is nested below another (top_level = False) or not (top_level = True). Please be aware that this flag is only included for convenience and it is not related to the task definition in any way.

In [None]:
train_df.loc[index,'long_answer_candidates'][:6]

In this example, you can see that the second through the last of the long answer candidates shown are contained within the first. The dataset does not disallow nested long answer candidates; annotators were simply asked to find the smallest candidate containing all of the information required to infer the answer to the question. However, the dataset authors observe that 95% of all long answers (including all paragraph answers) are not nested below any other candidates.
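
Nesting can also be checked directly from the token offsets. The sketch below uses a made-up document and made-up candidates in the same shape as the `long_answer_candidates` entries; the helper names `candidate_text` and `contains` are hypothetical, and `end_token` is treated as exclusive, matching the slicing used elsewhere in this notebook.

```python
# Made-up document and candidates, shaped like the fields described above
document_text = ("<H1> Sky </H1> <P> The sky is blue . </P> "
                 "<Ul> <Li> Rayleigh scattering </Li> </Ul>")
tokens = document_text.split()

candidates = [
    {'start_token': 3, 'end_token': 10, 'top_level': True},    # <P> ... </P>
    {'start_token': 10, 'end_token': 16, 'top_level': True},   # <Ul> ... </Ul>
    {'start_token': 11, 'end_token': 15, 'top_level': False},  # <Li> ... </Li>
]

def candidate_text(tokens, candidate):
    """Join the tokens covered by a candidate (end_token is exclusive)."""
    return ' '.join(tokens[candidate['start_token']:candidate['end_token']])

def contains(outer, inner):
    """True if `inner` lies entirely within `outer` and is a different entry."""
    return (outer is not inner
            and outer['start_token'] <= inner['start_token']
            and inner['end_token'] <= outer['end_token'])

for cand in candidates:
    # Indices of every candidate that encloses this one
    parents = [i for i, other in enumerate(candidates) if contains(other, cand)]
    print(cand['top_level'], parents, candidate_text(tokens, cand))
```

In this toy data the only candidate with a non-empty parent list is the one flagged `top_level = False`.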

# Annotations
The NQ **training data has a single annotation** with each example and the **evaluation data has five**. Each annotation defines 
* a "long_answer" span 
* a list of short_answers and 
* a yes_no_answer.  

In [None]:
# as this is train dataset, only one annotation
annotation = train_df.loc[index, "annotations"][0]
annotation

## Long answer
If the annotator has marked a long answer, then the long answer dictionary identifies this long answer using token offsets and an index into the list of long answer candidates. If the annotator has marked that no long answer is available, all of the fields in the long answer dictionary are set to -1.

For this example, 'candidate_index': 54 means that the long answer is the entry at index 54 of the long_answer_candidates list.
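
Because a missing long answer is encoded as -1 in every field, it is worth checking for that sentinel before slicing: a slice such as `tokens[-1:-1]` silently returns an empty list instead of failing. A minimal sketch, using a hypothetical helper `long_answer_text` and made-up data shaped like the fields above:

```python
def long_answer_text(tokens, long_answer):
    """Return the long answer text, or None if the annotator marked none."""
    if long_answer['candidate_index'] == -1:
        return None
    return ' '.join(tokens[long_answer['start_token']:long_answer['end_token']])

# Made-up data in the same shape as the dataset fields
tokens = "<P> The sky is blue . </P> <P> Grass is green . </P>".split()
candidates = [
    {'start_token': 0, 'end_token': 7, 'top_level': True},
    {'start_token': 7, 'end_token': 13, 'top_level': True},
]
answered = {'start_token': 7, 'end_token': 13, 'candidate_index': 1}
unanswered = {'start_token': -1, 'end_token': -1, 'candidate_index': -1}

print(long_answer_text(tokens, answered))
print(long_answer_text(tokens, unanswered))

# In this made-up data the indexed candidate covers the same tokens as the annotation
chosen = candidates[answered['candidate_index']]
same_span = (chosen['start_token'], chosen['end_token']) == \
            (answered['start_token'], answered['end_token'])
print(same_span)
```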

In [None]:
long_candidate = annotation['long_answer']
long_candidate

In [None]:

' '.join(ans[long_candidate['start_token']: long_candidate['end_token']])

## Short answer
Each of the short answers is also identified using token indices. There is no limit to the number of short answers. There is also often no short answer, since some questions such as "describe google's founding" do not have a succinct extractive answer. When this is the case, the long answer is given but the "short_answers" list is empty.

In [None]:
for short_candidate in annotation['short_answers']:
    print(' '.join(ans[short_candidate['start_token']: short_candidate['end_token']]))

## Yes no
Finally, if no short answer is given, it is possible that there is a yes_no_answer for questions such as "did larry co-found google". The value of this field is YES or NO if a yes/no answer is given. The default value is NONE when no yes/no answer is given.
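
Putting the three annotation fields together, each annotation falls into one of four cases. The sketch below is one way to label them; the function name `answer_type`, the label strings, and the order of the checks are assumptions made for illustration, and the annotations are made up.

```python
def answer_type(annotation):
    """Label an annotation by the kind of answer it carries."""
    if annotation['long_answer']['candidate_index'] == -1:
        return 'no_answer'
    if annotation['short_answers']:
        return 'short_spans'
    if annotation['yes_no_answer'] in ('YES', 'NO'):
        return 'yes_no'
    return 'long_only'

# Made-up annotations in the same shape as the dataset field
no_long = {'start_token': -1, 'end_token': -1, 'candidate_index': -1}
has_long = {'start_token': 7, 'end_token': 13, 'candidate_index': 1}

examples = [
    {'long_answer': no_long, 'short_answers': [], 'yes_no_answer': 'NONE'},
    {'long_answer': has_long,
     'short_answers': [{'start_token': 8, 'end_token': 9}],
     'yes_no_answer': 'NONE'},
    {'long_answer': has_long, 'short_answers': [], 'yes_no_answer': 'YES'},
    {'long_answer': has_long, 'short_answers': [], 'yes_no_answer': 'NONE'},
]

labels = [answer_type(a) for a in examples]
print(labels)
```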

In [None]:
annotation['yes_no_answer']

# Document url
The URL for the full article, provided for informational purposes only. This is NOT the simplified version of the article, so token indices from it cannot be used directly. The content may also no longer match the HTML used to generate document_text. The URL is only provided for the training set.

In [None]:
train_df.loc[index,'document_url']

# Data Statistics
The following statistics are from the [Natural Questions README](https://github.com/google-research-datasets/natural-questions/blob/master/README.md#data-statistics). The NQ training data contains 307,373 examples. 152,148 have a long answer and 110,724 have a short answer. Short answers can be sets of spans in the document (106,926), or yes or no (3,798). Long answers are HTML bounding boxes, and the distribution of NQ long answer types is as follows:

| HTML tags | Percent of long answers |
|------|------|
| `<P>` | 72.9% |
| `<Table>` | 19.0% |
| `<Tr>` | 1.5% |
| `<Ul>, <Ol>, <Dl>` | 3.2% |
| `<Li>, <Dd>, <Dt>` | 3.4% |
# Evaluation 
Submissions are evaluated using [micro F1](https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1) between the predicted and expected answers. Predicted long and short answers must match exactly the token indices of one of the ground truth labels (or match YES/NO if the question has a yes/no short answer). There may be up to five labels for long answers, and more for short. If no answer applies, leave the prediction blank/null.

The metric in this competition diverges from the original metric in two key respects: 1) short and long answer formats do not receive separate scores, but are instead combined into a micro F1 score across both formats, and 2) this competition's metric does not use confidence scores to find an optimal threshold for predictions.
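
To make the combined score concrete, here is a rough sketch of a micro F1 computed over long and short rows together. It is not the official scorer. The counting rules are an assumption based on the description above: a non-blank prediction that matches any accepted label is a true positive, any other non-blank prediction is a false positive, and a blank prediction where a label exists is a false negative. The function name `micro_f1` and the data are made up.

```python
def micro_f1(predictions, truths):
    """Micro F1 over all rows; `truths` maps row id -> set of accepted strings."""
    tp = fp = fn = 0
    for key, accepted in truths.items():
        pred = predictions.get(key, '')
        if pred:
            if pred in accepted:
                tp += 1   # prediction matches one of the labels
            else:
                fp += 1   # wrong prediction, or prediction where no label exists
        elif accepted:
            fn += 1       # blank prediction although a label exists
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels: an empty set means no answer applies to that row
truths = {
    'a_long': {'6:18'}, 'a_short': {'YES'},
    'b_long': {'105:200', '100:210'}, 'b_short': set(),
    'c_long': set(), 'c_short': set(),
}
predictions = {
    'a_long': '6:18', 'a_short': 'NO',
    'b_long': '', 'b_short': '',
    'c_long': '', 'c_short': '3:4',
}

score = micro_f1(predictions, truths)
print(round(score, 3))
```

Here there is one true positive (`a_long`), two false positives (`a_short`, `c_short`) and one false negative (`b_long`), so precision is 1/3 and recall is 1/2.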

# Submission File
For each ID in the test set, you must predict a) a set of start:end token indices, b) a YES/NO answer if applicable (short answers ONLY), or c) a BLANK answer if no prediction can be made. The file should contain a header and have the following format:
```
-7853356005143141653_long,6:18
-7853356005143141653_short,YES
-545833482873225036_long,105:200
-545833482873225036_short,
-6998273848279890840_long,
-6998273848279890840_short,NO
```
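
A small sketch of how such rows could be produced from per-example predictions. The helper names and the structure of the `predictions` dict are hypothetical, and the header `example_id,PredictionString` is an assumption that should be checked against sample_submission.csv.

```python
import csv
import io

def prediction_string(span=None, yes_no=None):
    """Format one prediction: YES/NO, 'start:end', or blank."""
    if yes_no in ('YES', 'NO'):
        return yes_no
    if span is not None:
        start, end = span
        return f'{start}:{end}'
    return ''

def submission_rows(predictions):
    """Yield a _long row and a _short row for every example id."""
    for example_id, pred in predictions.items():
        yield f'{example_id}_long', prediction_string(span=pred.get('long'))
        yield f'{example_id}_short', prediction_string(span=pred.get('short'),
                                                       yes_no=pred.get('yes_no'))

# Made-up predictions chosen to reproduce the rows shown above
predictions = {
    -7853356005143141653: {'long': (6, 18), 'yes_no': 'YES'},
    -545833482873225036: {'long': (105, 200)},
    -6998273848279890840: {'yes_no': 'NO'},
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['example_id', 'PredictionString'])  # assumed header names
writer.writerows(submission_rows(predictions))

lines = buffer.getvalue().splitlines()
print('\n'.join(lines))
```

Writing to a real file instead of `io.StringIO` only requires swapping the buffer for `open('submission.csv', 'w', newline='')`.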