We want to solve an open-domain QA task.

My process is as follows:
#### 0. [Orientation](https://www.kaggle.com/adldotori/notebook-to-read-before-start-nlp-step-0/)
   * ver 1 : init (2021/09/21)
   * ver 2 : update description (2021/10/05)

#### 1. [Tokenization](https://www.kaggle.com/adldotori/tokenizing-hindi-and-tamil-language-nlp-step-1)
#### 2. [Demo](https://www.kaggle.com/adldotori/demo-training-nlp-step-2/)
#### 3. Research QA Model
#### 4. Training
#### 5. Inference

# Notebook to read before starting QA

In this notebook, you'll find information that's good to know before starting the task.

In [None]:
import os
import os.path as osp

import pandas as pd

In [None]:
INPUT_PATH = '../input/chaii-hindi-and-tamil-question-answering/'

In [None]:
train = pd.read_csv(osp.join(INPUT_PATH, 'train.csv'))
test = pd.read_csv(osp.join(INPUT_PATH, 'test.csv'))
sub = pd.read_csv(osp.join(INPUT_PATH, 'sample_submission.csv'))

## Understand the data
Let's look at the data in detail.

### Train

In [None]:
train.head()

The id column appears to be a hash value, so it is not important.

Let's look at the context column.

In [None]:
''.join(train.iloc[0]['context'].split('\n'))

The text is supposed to be in Tamil. If you run it through **Google Translate**, you will get the following text:
> The skeleton of a normal adult human contains the following 206 (208 if the sternum is divided into three parts). This number may vary depending on anatomical differences. For example, a very small number of people have an extra rib (neck) or an extra lower spine; If some of the joined bones are not considered a separate bone, the five joint vertebrae; The number of vertebrae in the 26 (including 3 - 5) vertebrae can be considered as 33. The human skull contains 22 bones (excluding the ear canals); These are divided into eight cranium bones and 14 facial bones. (Bold numbers denote the numbers seen in the adjacent figure.) Skull bones (8) 1 frontal bone 2 parietal bone (2) 3 temporal bone (2) 4 occipital bone sphenoid bone) ethmoid bone (14) 7 mandible (mandible) 6 maxilla (2) palatine bone (2) 5 zygomatic bone (2) 9 nasal bone bone) (2) lacrimal bone (2) laryngeal bone (vomer) inferior nasal conchae (2) in the middle nostrils (6): malleus (incus) stapes in the throat ( 1): hyoid shoulder (4): 25. clavicle 29. scapula thorax (25): 10. sternum (1) into three joints Can be considered: manubrium, body of sternum, xiphoid process 28. Ribs (rib) (24) vertebral column (33): 8. cervical vertebra (7) thoracic vertebra (12) 14. lumbar vertebra (5) 16. Sacrum Tail bone (coccyx) (arm) west (arm) (1): 11. humerus (humerus) 26. condyles of humerus forearm (forearm) (4): 12. ulna (2) 13. Radius (radius) (2) 27. Radius head (radius) In the hands (54): Wrist (carpal): Boat. (scaphoid) (2) lunate (2) triangular bone triquetrum (2) pea bone (pisiform) (2) trapezium (2) trapezoid (2) capitate ( 2) homate (hamate) (2) metacarpal: (5 × 2) phalange: proximal phalanges (5 × 2) middle finger Intermediate phalanges (4 × 2) distal phalanges (5 × 2) pelvis (2): 15. ilium and ischium legs (8): 18. Femur (2) 17. Hip joint (joint, bone) 22. Large trochanter of femur 23. Femoral condyles of femur 19. Patella (2 ) 20. Tibia (2) 21. 
Fibula (2) Fibers (52): Tarsal: heel (calcaneus) (2) Knee (talus) (2) ) Navicular bone (2) inner wedge bone (2) intervertebral bone (2) outer wedge bone (2) cuboidal bone (2) metatarsal (5 × 2) phalange : Proximal phalanges (5 × 2) Intermediate phalanges (4 × 2) Distal phalanges (5 × 2) Child Skeleton The following bones are more common in children's skeletons: the skull and the cranial bones (21), which together form the skull. Sacral vertebrae (4 or 5), in adults they form the coccygeal vertebrae (3 to 5), and in adults they form the tail together.

Next is the question column.

In [None]:
train.iloc[0]['question']

If you run it through **Google Translate**, you will get the following text:

> How many bones are there in the human body?

We can answer this question from the context. The answer is 206.

This answer is the value in the **answer_text column**, and the value in the **answer_start column** indicates where the answer begins in the context.
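
As a quick sanity check of how these two columns relate, the sketch below slices the context at `answer_start` and compares the result with `answer_text`. The `span_matches` helper and the toy rows are hypothetical, made up for illustration. It also assumes that `answer_start` is a character offset (not a byte or token offset), which is worth verifying on the real `train` frame.

```python
import pandas as pd

# Toy rows standing in for `train` (illustrative only, not competition data).
toy = pd.DataFrame({
    'context': [
        'The adult human skeleton has 206 bones.',
        'Delhi is the capital of India.',
    ],
    'answer_text': ['206', 'Delhi'],
    'answer_start': [29, 0],
})

def span_matches(row):
    """Check that slicing the context at answer_start yields answer_text."""
    start = row['answer_start']
    end = start + len(row['answer_text'])
    return row['context'][start:end] == row['answer_text']

# One boolean per row: True when the offset and the answer text agree.
toy['span_ok'] = toy.apply(span_matches, axis=1)
print(toy[['answer_text', 'answer_start', 'span_ok']])
```

Running the same helper with `train.apply(span_matches, axis=1)` would show whether every row of the real data follows this convention.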

In [None]:
import plotly.express as px
fig = px.bar(
    train.language.value_counts(), 
    title='language value counts',
    labels={
        'value':'count',
        'index':'language'
    }
)
fig.update_layout(showlegend=False)
fig.show()

#### Train columns
1. id - hash value (not important)
2. context - Text with information to answer the question
3. question - question
4. answer_text - answer
5. answer_start - answer location in context
6. language - language (tamil, hindi)

### Test

In [None]:
test.head()

#### Test columns
1. id - hash value (not important)
2. context - Text with information to answer the question
3. question - question
4. language - language (tamil, hindi)

### Submission

In [None]:
sub.head()

#### Submission columns
1. id - hash value (not important)
2. PredictionString - Answer
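
A frame with these two columns could be assembled roughly as follows. This is a minimal sketch: the toy `test_ids` list and the placeholder `predict` function are hypothetical, and a real model's predicted answer string would replace the placeholder.

```python
import pandas as pd

# Hypothetical ids standing in for test['id'] (illustrative only).
test_ids = ['id_a', 'id_b', 'id_c']

def predict(example_id):
    """Placeholder predictor (an assumption): returns an empty answer."""
    return ''

# One row per test example, with the same column names as sample_submission.csv.
submission = pd.DataFrame({
    'id': test_ids,
    'PredictionString': [predict(i) for i in test_ids],
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In the notebook, `test['id']` would take the place of `test_ids`, and passing a file path to `to_csv` would write the file to disk instead of returning text.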

## Missing Values Check

In [None]:
train.isnull().sum()

There are no missing values!

### Other

In [None]:
len(train), len(test), len(sub)

The train set has only 1114 rows. This is a really small dataset, so we need to think about how to deal with that.  
The test set has only 5 rows.

In [None]:
# Character length of each text column
train['len_context'] = train['context'].str.len()
train['len_question'] = train['question'].str.len()
train['len_answer'] = train['answer_text'].str.len()
# Relative position of the answer within the context (0 = start)
train['answer_pos_rate'] = train['answer_start'] / train['len_context']

In [None]:
train[['len_context', 'len_question', 'len_answer', 'answer_start', 'answer_pos_rate']].describe()

In [None]:
(
    train.groupby('language')
    [['len_context', 'len_question', 'len_answer', 
      'answer_start', 'answer_pos_rate']]
    .mean()
)

On average, the context is about 11000 characters long, the question 42 characters, and the answer 13 characters.

The **answer_pos_rate column** indicates the relative position of the answer in the context. Since the average is about 16.9% of the context length, most answers are located near the front. This information should be useful for modeling later.

Let's look in more detail.
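
To make the position ratio concrete, here is a small worked example on made-up rows. The toy frame and the 0.2 threshold are assumptions chosen for illustration. A rate of 0 means the answer starts at the very first character, and values near 1 mean it starts close to the end of the context.

```python
import pandas as pd

# Made-up rows: context length in characters and answer start offset.
toy = pd.DataFrame({
    'len_context': [1000, 2000, 500, 4000],
    'answer_start': [50, 100, 400, 200],
})

# Relative position of the answer inside the context.
toy['answer_pos_rate'] = toy['answer_start'] / toy['len_context']

# Average relative position across rows.
mean_rate = toy['answer_pos_rate'].mean()
# Share of rows whose answer starts in the first 20% of the context.
share_early = (toy['answer_pos_rate'] < 0.2).mean()

print(toy['answer_pos_rate'].tolist())
print(mean_rate, share_early)
```

A single late answer pulls the mean up noticeably here, so the share of early answers can be a more robust summary than the mean alone.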

In [None]:
train['answer_pos_rate_round'] = round(train['answer_pos_rate'], 1)

In [None]:
fig = px.bar(
    train['answer_pos_rate_round'].value_counts(),
    title ='answer pos rate',
    labels={
        'value':'count',
        'index':'position'
    }
)
fig.update_layout(showlegend=False)
fig.show()

You can see that most of the answers are located near the beginning of the context.