<img src="https://i.imgur.com/RFR6UZX.jpg" width="100%"/>


# 1. The competition
### [chaii - Hindi and Tamil Question Answering](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering) - A quick overview for QA noobs

Hi and welcome! This is the first kernel of the series `chaii - Hindi and Tamil Question Answering - A quick overview for QA noobs`.

**In this short kernel we will go over the competition specifics, define Question Answering and respond to some common questions regarding the submission requirements, laying the foundation for digging into the data and the public models.**

This series aims to get a good understanding of the specific topic (Question Answering for a non-English language), including going over the dataset, learning common approaches, and understanding the best models proposed by the community from both a technical and theoretical point of view. 

The ideal reader is a Data Scientist noob with some general knowledge about Deep Learning, but no technical expertise in Question Answering. 

---

The full series consists of the following notebooks:
1. _[The competition](https://www.kaggle.com/julian3833/1-the-competition-qa-for-qa-noobs) (This notebook)_
2. [The dataset](https://www.kaggle.com/julian3833/2-the-dataset-qa-for-qa-noobs) 
3. [The metric (Jaccard)](https://www.kaggle.com/julian3833/3-the-metric-jaccard-qa-for-qa-noobs) 
4. [Exploring Public Models](https://www.kaggle.com/julian3833/4-exploring-public-models-qa-for-qa-noobs/)
5. [🥇 XLM-Roberta + Torch's extra data [LB: 0.749]](https://www.kaggle.com/julian3833/5-xlm-roberta-torch-s-extra-data-lb-0-749)
6. [🤗 Pre & post processing](https://www.kaggle.com/julian3833/6-pre-post-processing-qa-for-qa-noobs/)



This is an ongoing project, so expect more notebooks to be added to the series soon. Actually, we are currently working on the following ones:
* Exploring Public Models Revisited
* Reviewing `squad2`, `mlqa` and others
* About `xlm-roberta-large-squad2`
* Own improvements


---

# The competition: [chaii - Hindi and Tamil Question Answering](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering)

<p style="text-align:center">
<img src="https://i.imgur.com/74vfye4.png" width="70%"/>
</p>



The main features of this competition are outlined below:

## The task
1. **Task:** Question Answering
1. **Metric:** Jaccard coefficient (see the [third notebook](https://www.kaggle.com/julian3833/3-the-metric-jaccard-qa-for-qa-noobs) to understand the metric)
1. **Particularity:** Non-English language. Actually, two Indian languages: `Hindi` and `Tamil`. (see the [second notebook](https://www.kaggle.com/julian3833/2-the-dataset-qa-for-qa-noobs) to understand the dataset)
1. **Prize:** 10K. Unusually, it is split equally: 2K for each of the top 5.
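
As a quick preview of the metric, here is a minimal sketch of a word-level Jaccard score. It assumes lowercasing and whitespace tokenization, and the handling of two empty strings is our own choice; the competition's Evaluation page is the authoritative definition, and the third notebook covers the metric in detail.

```python
def jaccard(prediction, truth):
    """Word-level Jaccard: |intersection| / |union| of the two word sets."""
    a = set(prediction.lower().split())
    b = set(truth.lower().split())
    union = a | b
    if not union:
        # Assumption: two empty strings count as a perfect match
        return 1.0
    return len(a & b) / len(union)


# 3 shared words out of 4 distinct words overall
print(jaccard("from 1939 to 1945", "1939 to 1945"))
```

A prediction scores 1.0 only when its set of words is the same as that of the expected answer, so extra or missing words are both penalized.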

&nbsp;
&nbsp;
&nbsp;


## Question Answering


While QA might refer to [slightly different NLP tasks](https://en.wikipedia.org/wiki/Question_answering), we can stick to the current mainstream use, `Extractive Question Answering`, which is the following task:


<h2 style="text-align: center; background-color:#C8FF33;padding:40px;border-radius: 30px;">
    Given a <i>question</i> and a <i>context</i>, <b>extract</b> the <i>answer</i> to the <i>question</i> from the <i>context</i>.
</h2>

&nbsp;
&nbsp;


In terms of inputs and outputs, we can put it this way:

<h2 style="text-align: center;padding:10px;border-radius: 30px;">
    (<i>question</i>, <i>context</i>) $\rightarrow$ <i>answer</i>

</h2>


For example:
<h2 style="text-align: center;padding:10px;">
    (<i>"How old is Kevin?"</i>, <i>"Kevin just turned 28 last week."</i>) $\rightarrow$ <i>"28"</i>

</h2>


Note that since this is **Extractive** Question Answering, the answer **must be contained** in the context. It should be a **substring**, not a rephrasing or a generated piece of information.

<p style="text-align:center">
<img src="https://i.imgur.com/TNqhXx0.png" width="50%" />
</p>





Let's see some code:


### Example of Extractive Question Answering

Consider the following text (a reduced version of the first two sentences of [Wikipedia's article](https://en.wikipedia.org/wiki/World_War_II) about the `Second World War`):

>World War II, often abbreviated as WW2, was a global war that lasted from 1939 to 1945. It involved the vast majority of the world's countries forming two opposing military alliances: the Allies and the Axis powers.

Then, the following could be some valid _inputs_ for an extractive QA system:

In [None]:
import pandas as pd

pd.set_option("display.max_colwidth", 60)  # truncate long cells when displaying

# Context shared by all the questions
ctx = """World War II, often abbreviated as WW2, was a global war that lasted from 1939 to 1945. It involved the vast majority of the world's countries forming two opposing military alliances: the Allies and the Axis powers."""

q1 = "How is World War II often abbreviated?"

q2 = "What was World War II?"

q3 = "How long did WW2 last?"

q4 = "Which were the two sides in WW2?"

valid_inputs = [(q1, ctx), (q2, ctx), (q3, ctx), (q4, ctx)]

pd.DataFrame(valid_inputs, columns=['Question', 'Context'])

Finally, adding the expected answers will turn this data into _training samples_ for a potential ML model:



In [None]:
a1 = "WW2" # How is World War II often abbreviated?

a2 = "a global war" # What was World War II?

a3 = "from 1939 to 1945" # How long did WW2 last?

a4 = "the Allies and the Axis powers" # Which were the two sides in WW2?

valid_training_samples = [(q1, ctx, a1), (q2, ctx, a2), (q3, ctx, a3), (q4, ctx, a4)]

pd.DataFrame(valid_training_samples, columns=['Question', 'Context', 'Answer'])

Note that since this is **Extractive** Question Answering, the answer **must be** contained in the context. So, for example, Answer 3 cannot be `6 years`, although it would be a more accurate answer to question 3.
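
The substring constraint is easy to check in code. The sketch below uses a hypothetical helper, `is_extractive`, which is not part of any library; it only tests whether a candidate answer appears verbatim in the context.

```python
ctx = ("World War II, often abbreviated as WW2, was a global war that lasted "
       "from 1939 to 1945. It involved the vast majority of the world's countries "
       "forming two opposing military alliances: the Allies and the Axis powers.")


def is_extractive(answer, context):
    """True when the answer appears verbatim (as a substring) in the context."""
    return answer in context


# The last candidate is a reasonable answer, but it is not a span of the context
for candidate in ["WW2", "a global war", "from 1939 to 1945", "6 years"]:
    print(f"{candidate!r}: {is_extractive(candidate, ctx)}")
```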

One last note: real models typically produce the answer by predicting the positions of its start token and its end token within the context (or, equivalently, the start token and the number of tokens):

<h2 style="text-align: center">(<i>question</i>, <i>context</i>) $\rightarrow$ <i>(start position, end position)</i></h2>
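
To make the (start, end) idea concrete, here is a simplified sketch that works with character offsets instead of token positions. Real models work on tokens produced by a subword tokenizer, so treat `char_span` (a hypothetical helper) only as an illustration of the idea.

```python
def char_span(answer, context):
    """Return (start, end) character offsets of the first occurrence; end is exclusive."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("the answer is not a substring of the context")
    return start, start + len(answer)


context = "Kevin just turned 28 last week."
start, end = char_span("28", context)

# Slicing the context with the span recovers the answer
print(start, end, context[start:end])
```

Going from a span back to the text is just a slice of the context, which is why predicting two positions is enough to produce an answer.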

This is, simply put, the problem at hand. Let's check the code requirements of the competition now.

# Code requirements 

You can check them in the [Code Requirements](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/overview/code-requirements) tab of the competition, but I summarize them below:

* Kernel Only
* Internet disabled
* 5-hour runtime 
* Freely & publicly available external data is allowed, including pre-trained models

What does all this mean?

# Kernel Only
`Kernel Only` means that the only way to submit results is from Kaggle notebooks. We cannot generate a submission on another computer and upload it as a CSV. Instead, we must create a Kernel that writes a `submission.csv` file. In turn, the `sample_submission.csv` and `test.csv` files we have access to are placeholders with just a few examples, and they don't matter, as we will see in the [next notebook](https://www.kaggle.com/julian3833/2-the-dataset-qa-for-qa-noobs).

The actual test set is always hidden, and our notebooks are fed with it when we decide to submit a version of them to the competition. 
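
In practice, the last cell of a submission notebook looks roughly like the sketch below. The column names `id` and `PredictionString` are assumptions here (check `sample_submission.csv` for the exact format), the small DataFrame stands in for the competition's `test.csv`, and `predict` is a hypothetical placeholder rather than a real model.

```python
import pandas as pd

# Stand-in for the hidden test set; on Kaggle this would be read from the competition's test.csv
test = pd.DataFrame({
    "id": ["a1", "b2"],
    "context": ["Kevin just turned 28 last week.",
                "World War II lasted from 1939 to 1945."],
    "question": ["How old is Kevin?", "When did World War II end?"],
})


def predict(question, context):
    """Hypothetical placeholder model: always answers with the first word of the context."""
    return context.split()[0]


submission = pd.DataFrame({
    "id": test["id"],
    "PredictionString": [predict(q, c) for q, c in zip(test["question"], test["context"])],
})

# The notebook must leave this file in its working directory
submission.to_csv("submission.csv", index=False)
```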

# Internet disabled: How to use pretrained-models?

To be allowed to submit a notebook, you must make sure its Internet access is turned off:

<p style="text-align:center" >
<img src="https://i.imgur.com/6TeYNK6.png" width="40%"  />
</p>

(Also, since you will probably be working with transformers, you might want to turn on the GPU!)

This requirement is a security measure, but it is confusing when you start: the `transformers` [Model Hub](https://huggingface.co/models) downloads models over the Internet, and using `transformers` is the go-to approach nowadays.

The common workaround is as follows: people upload the pre-trained models as Kaggle Datasets, and you can add them with the "+ Add data" button at the top-right corner of the Kernel editor:


<p style="text-align:center" >
<img src="https://i.imgur.com/l2tKMLe.png" width="40%"  />
</p>

See, for example, these Datasets:
* [Huggingface BERT Variants](https://www.kaggle.com/sauravmaheshkar/huggingface-bert-variants)
* [Huggingface Roberta Variants](https://www.kaggle.com/sauravmaheshkar/huggingface-roberta-variants)

Then, when calling your `transformers` loader, instead of specifying a model name such as `xlm-roberta-large`, we have to specify the full path to the added input folder, which looks like this: `../input/huggingface-roberta-variants/xlm-roberta-large`.
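
As a rough sketch, loading from a local folder could look like this. `resolve_model_source` and `load_qa_model` are hypothetical helpers, and the folder layout inside the Dataset is an assumption, so check the actual path in the notebook's file browser. `from_pretrained` accepts either a Hub model name or a path to a local directory.

```python
from pathlib import Path

# Kaggle mounts attached Datasets under ../input/ (folder layout assumed here)
MODEL_DIR = Path("../input/huggingface-roberta-variants/xlm-roberta-large")


def resolve_model_source(local_dir, hub_name):
    """Return the local folder if it exists, otherwise fall back to the Hub name."""
    local_dir = Path(local_dir)
    return str(local_dir) if local_dir.is_dir() else hub_name


def load_qa_model(source):
    """Load a tokenizer and a QA model from a Hub name or a local directory."""
    # Imported lazily so the path helper works even without transformers installed
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(source)
    model = AutoModelForQuestionAnswering.from_pretrained(source)
    return tokenizer, model


source = resolve_model_source(MODEL_DIR, "xlm-roberta-large")
print(source)
# tokenizer, model = load_qa_model(source)  # needs the model files to be available
```

The fallback to the Hub name is only useful when developing with the Internet on; in a submission run, the local folder has to exist.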


&nbsp;
&nbsp;
&nbsp;

For example, the models we will review in the fourth notebook, [4 - Exploring Public Models [QA for QA noobs]](https://www.kaggle.com/julian3833/4-exploring-public-models-qa-for-qa-noobs), use the following dataset, added using the "+ Add data" button we just mentioned:



<p style="text-align:center" >
<img src="https://i.imgur.com/T9zUJcL.png" width="60%"  />
</p>

<!--

The dataset, in turn, is just a copy of [xlm-roberta-large-squad2](https://huggingface.co/deepset/xlm-roberta-large-squad2) from the transformers' modelhub, but now instead of using this this token when creating the models, we will need to use a path.

Don't worry if all this doesn't make sense for you now. If this doesn't make sense for you, don't worry. You can continue without this until the notebook 4 and come back later. This is quite advance for now.

-->

The `5-hour runtime` will probably matter later on, when ensembles start to appear. For now, we can ignore that.

That's it for now. We reviewed the main features of the competition.

## What's next?

We just gained a basic idea of what Question Answering is and what this competition is about. Let's go and take a look at the dataset in the [next notebook](https://www.kaggle.com/julian3833/2-the-dataset-qa-for-qa-noobs/)!

&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;

## Remember to upvote the notebook if you found it useful! 🤗


<!--

https://www.kaggle.com/c/quora-question-pairs/overview/description
https://www.kaggle.com/c/tensorflow2-question-answering/overview
https://www.kaggle.com/c/google-quest-challenge/overview

-->