# Let’s Begin

#### This notebook aims to give an end-to-end review of the problem statement and a simple solution.
#### The problem is a standard extractive question answering task (i.e. extracting an answer from a text, given a question).




## If it's just standard extractive question answering, why is there a competition?

#### While computers are getting better at understanding natural language, they still cannot fully understand every language out there.

#### To build a strong natural language understanding model, one needs good knowledge of the field and also a lot of data specific to that language (either generated from scratch or extracted from the huge amount of unstructured data available on the internet).

#### This is a research competition, which means a lot of research will be required, not only in designing powerful algorithms but also in gathering relevant (language-specific) data.


# Problem statement

#### Given a text (e.g. the history of India) and a question (e.g. "Who is the father of India?"), find the answer in the text.

#### Finding the answer does not mean deducing it. The answer lies somewhere in the given text, and we just have to find its start and end positions in that text.

#### E.g. (100s of words) ........ The Father of India is Mahatma Gandhi and he ........ (100s of words).

#### As we are extracting the answer from the context (the given text), this is called extractive question answering.

#### Here everything is the same; the only difference is that the language is not English but either Hindi or Tamil.

#### To solve such problems, the machine needs to develop a good understanding of the language.


# EDA


## Let’s have a look at the given data

In [None]:
import pandas as pd

DF = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/train.csv')

DF.sample(frac=1).head()

## Context, question, and answer_text are understandable, but what is answer_start?

#### answer_start is the index of the character in the context where the answer begins (this helps in training the model)

#### language identifies which language the example is in (Hindi or Tamil)
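
#### To make the role of answer_start concrete, here is a minimal sketch. It uses a made-up English sentence rather than a row from train.csv, and the variable names are only illustrative. Slicing the context from answer_start for the length of answer_text should give back the answer itself.

```python
# Hypothetical example row; the real rows come from train.csv
context = "The capital of India is New Delhi and it lies in the north."
answer_text = "New Delhi"

# answer_start is a character index into the context
answer_start = context.find(answer_text)
answer_end = answer_start + len(answer_text)

# Slicing the context with these two indices recovers the answer
recovered = context[answer_start:answer_end]
print(answer_start, repr(recovered))
```

#### The same check can be run on the real data to confirm that answer_start and answer_text agree for a given row.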


## Let’s see how the dataset is split between Hindi and Tamil

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Count the number of examples per language
language_counts = DF['language'].value_counts()

# Capitalize the labels for display ('hindi' -> 'Hindi')
languages = [lang.capitalize() for lang in language_counts.index]

fig = plt.figure(figsize=(12, 9))
plt.pie(language_counts.values, labels=languages)

plt.show()

#### It looks like we have a lot of data points for Hindi, but relatively fewer data points for Tamil 
#### ( we might need to add some data points for Tamil )


## Let’s have a closer look at one data point/example

In [None]:
DF.iloc[700]['context']

In [None]:
DF.iloc[700]['question']

In [None]:
DF.iloc[700]['answer_text']

#### This should have cleared up some confusion

## Let’s find out more about the given data

#### In the above example, we saw that the context is quite large (and it's not even the largest)




## Does context size matter?

#### The answer is YES.
#### For an extractive question answering model, the context and the question are the input, and the model has to predict the output based on them. A large context means a large input, which needs more processing power and therefore increases training/inference time.

## Let’s have a look at the context size

In [None]:

def plot_bar_graph(Values,Labels,color):
    
    fig = plt.figure(figsize = (50, 20))

    X = Values[0]
    Y = Values[1]

    plt.bar(X,Y,color = color,width = 0.4)

    plt.xlabel(Labels[0])
    plt.ylabel(Labels[1])
    plt.title(Labels[2])
    plt.show()

In [None]:
# plotting the context length

len_context = [len(c) for c in DF['context']]
Ids = [id_ for id_ in DF['id']]

X_label = "Ids"
Y_label = "Length"
Label = "Context Length "

Values = [Ids,len_context]
Labels = [X_label,Y_label,Label]

plot_bar_graph(Values,Labels,'maroon')


#### We can see that the context length can reach up to 50,000 characters (and that's a large value for a single data point)
    

## How to solve this problem?

#### There are several ways we can tackle this problem (a few are listed below):

* #### Remove the data points with very large context length (at the cost of data loss)
* #### Use the sliding window approach (divide the context into relatively small parts and use them instead)
* #### A different sliding window
    * #### Instead of dividing the context into fixed-size parts, we move over a fixed number of sentences at a time.
* #### ...
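
#### The sliding window idea from the list above can be sketched as follows. This is only an illustration on a toy list of "tokens"; the function name `sliding_windows` and its parameters `window_size` and `step` are assumptions made for this example, not part of any library.

```python
def sliding_windows(tokens, window_size, step):
    """Split `tokens` into overlapping windows.

    Consecutive windows start `step` tokens apart, so they share
    `window_size - step` tokens of overlap.
    """
    if not 0 < step <= window_size:
        raise ValueError("step must be between 1 and window_size")
    windows = []
    start = 0
    while True:
        # Keep the start offset so positions can be mapped back to the context
        windows.append((start, tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows


# Toy context of 10 "tokens"; a real context would be tokenized text
toy_tokens = list(range(10))
for offset, window in sliding_windows(toy_tokens, window_size=4, step=2):
    print(offset, window)
```

#### The overlap matters because an answer that is cut in half by one window boundary can still appear whole in the neighbouring window. Hugging Face tokenizers offer a similar mechanism through the `stride` and `return_overflowing_tokens` arguments, where `stride` is the overlap rather than the step.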
 


#### Looking at this, one might ask:



## Does answer size matter?

#### Well, the answer is yes and no.

#### The answer is the output of the model, but we do not need to predict the entire answer text; instead we just predict the start and end points of the answer in the given text,

#### i.e. the size of the output vector does not depend on the answer length; rather, it is fixed by the input length.
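
#### A rough sketch of how a start point and an end point can be picked: extractive QA models typically produce one start score and one end score per input token, and the answer is the span with the best combined score. The scores below are made up, and the helper `best_span` and its `max_answer_len` parameter are assumptions for illustration only.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end],
    subject to start <= end < start + max_answer_len."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Made-up scores for a 6-token context
tokens = ["The", "father", "is", "Mahatma", "Gandhi", "."]
start_scores = [0.1, 0.2, 0.0, 2.5, 0.3, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.4, 3.0, 0.2]

start, end = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))
```

#### Note that both score lists have one entry per token, however long the answer turns out to be. The answer length only affects which pair of positions wins.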



## Let’s have a closer look to find out more

In [None]:
# plotting the answer length

len_answer = [len(a) for a in DF['answer_text']]
Ids = [id_ for id_ in DF['id']]

X_label = "Ids"
Y_label = "Length"
Label = "Answer Length "

Values = [Ids,len_answer]
Labels = [X_label,Y_label,Label]

plot_bar_graph(Values,Labels,'blue')

#### The above graph clearly shows that there are some very long answers.
#### At the beginning of model development, we might want to remove those few data points with long answer text.


### There is still a lot to explore, not only in the given data but also in the data out there on the internet. Research continues.

# A simple way to generate output ( This will give some intuition about the problem )

## Using a pre-trained model provided by Hugging Face




#### In this part we will use a pre-trained model provided by Hugging Face. This will give us a good idea of how things flow in an extractive question answering model.

#### (This knowledge will help us when we develop a model from scratch.)


In [None]:
# import 
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

In [None]:
# define tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepset/xlm-roberta-large-squad2")

# define model
model = AutoModelForQuestionAnswering.from_pretrained("deepset/xlm-roberta-large-squad2")

* #### The tokenizer converts the input text into a tensor (a vector of numbers)
* #### This tensor goes into the model
* #### The model generates the prediction tensor
* #### The output is converted back into answer text using the tokenizer

#### This whole process is wrapped into a pipeline (in some sense)


In [None]:
# define pipeline
# device = 0 places the model on the first GPU; use device = -1 to run on CPU

QA_pipeline = pipeline('question-answering', model = model, tokenizer = tokenizer, device = 0)

## Let’s have a look at test.csv and sample_submission.csv

In [None]:
td  = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/test.csv')
td

In [None]:
sd = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/sample_submission.csv')
sd

#### We can see that test.csv contains some contexts and questions for which we have to generate the answer text and put it in submission.csv


In [None]:
submission_data = pd.DataFrame(columns = sd.columns)

In [None]:
for id_, c, q in td[["id","context", "question"]].to_numpy():
    
    Output = QA_pipeline(context=c, question=q)
    submission_data.loc[len(submission_data.index)] = [id_,Output['answer']] 

In [None]:
submission_data

### Save the submission_data

In [None]:
submission_data.to_csv('submission.csv', index=False)



### Happy to hear your thoughts/suggestions on this notebook :)

### Thank you for reading