# chaii - Hindi and Tamil Question Answering

Hello Kagglers! This competition is pretty cool and a bit harder than the Q&A datasets we usually work on. Most datasets and most of the research focus on English. Although such models perform well on English datasets, they don't work nearly as well on Indian languages. The [Indian languages ecosystem](https://en.wikipedia.org/wiki/Languages_of_India#Prominent_languages_of_India) is as diverse as India itself. If you count all the languages and dialects, almost **19,000** of them are spoken by Indians daily.

# The Task

You are given questions in Tamil/Hindi about some Wikipedia articles, and you have to generate the answers for those questions using your model.

## Dataset

We have been provided with a new question-answering dataset with question-answer pairs, and it goes by the name **`chaii-1`**.

## Evaluation Metric

The predictions will be evaluated using the **`word-level Jaccard score`**. Sample code for the metric has also been provided.

```python

def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

```
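
As a quick sanity check, the metric can be worked through by hand on a few short strings. The example answers below are made up for illustration and are not taken from the dataset:

```python
def jaccard(str1, str2):
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


# Identical answers share every word, so the score is 1.0
exact = jaccard("नई दिल्ली", "नई दिल्ली")

# One shared word ("दिल्ली") out of two distinct words overall gives 0.5
partial = jaccard("नई दिल्ली", "दिल्ली")

# No shared words gives 0.0
miss = jaccard("मुंबई", "दिल्ली")

print(exact, partial, miss)
```

Note that the function as written divides by zero when both strings are empty, so it may be worth guarding against empty predictions.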

Let's start diving into the data!

In [None]:
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from wordcloud import WordCloud
from collections import Counter

from spacy.lang.hi import Hindi
from spacy.lang.ta import Tamil
from spacy.lang.hi import STOP_WORDS as hindi_stopwords
from spacy.lang.ta import STOP_WORDS as tamil_stopwords


seed=111
np.random.seed(seed)

%config IPCompleter.use_jedi = False

In [None]:
# Path to the data directory
data_dir = Path("../input/chaii-hindi-and-tamil-question-answering/")

# Read the training and test csv files
train_df = pd.read_csv(data_dir / "train.csv", encoding="utf8")
test_df = pd.read_csv(data_dir / "test.csv", encoding="utf8")

# How many training and test samples have been provided?
print("Number of training samples: ", len(train_df))
print("Number of test samples: ", len(test_df))

There are only ~1100 training samples, so we are in a low-data regime. Transfer learning and fine-tuning are therefore our best shots if we are going to use DNNs for this task. This doesn't mean you shouldn't build your own models!

Let's take a look at the training data and the test data.

In [None]:
train_df.head()

In [None]:
test_df.head()

A few things to note:

1. The questions can contain English words as well
2. The `answer_start` column isn't in the test dataset, but it carries important information in the training dataset: the character index in the context at which the answer starts
3. The `language` column is present in both `train` and `test`. One of the things that we can try is to build two separate models, one for `Hindi` and one for `Tamil`, and then make the predictions accordingly using the values in this column
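
One way to confirm how `answer_start` should be read is to slice the context at that offset and compare the slice with the answer. The sketch below uses a tiny made-up frame; the column names `context` and `answer_text` are assumptions about the training file layout, and `answer_matches_span` is a hypothetical helper:

```python
import pandas as pd

# Toy rows that mimic the training layout; the values are made up for illustration
toy_df = pd.DataFrame(
    {
        "context": ["Delhi is the capital of India.", "Chennai is in Tamil Nadu."],
        "answer_text": ["Delhi", "Tamil Nadu"],
        "answer_start": [0, 14],
    }
)


def answer_matches_span(row):
    """Check that the context slice at answer_start equals the answer text."""
    start = row["answer_start"]
    end = start + len(row["answer_text"])
    return row["context"][start:end] == row["answer_text"]


# True for a row means answer_start is a valid character offset into its context
toy_df["span_ok"] = toy_df.apply(answer_matches_span, axis=1)
print(toy_df[["answer_text", "answer_start", "span_ok"]])
```

Running the same check on `train_df` should flag any rows where the offset and the answer text disagree.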

# Distribution of the languages in the training dataset

Let's see how many samples we have for each language in the training dataset. For this we can use `countplot(..)`

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(data=train_df, x="language")
plt.show()

This suggests that the number of instances for the `Hindi` language is almost double the number of instances for the `Tamil` language in the training dataset. Let's also get the actual counts to see the difference.

In [None]:
# Get the actual count values
train_df["language"].value_counts()
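
If relative shares are easier to read than raw counts, `value_counts` can also normalize the result. A minimal sketch on a toy column (the values are illustrative, not the real counts):

```python
import pandas as pd

# Toy language column; the 2:1 ratio only mimics the imbalance described above
toy_languages = pd.Series(["hindi"] * 4 + ["tamil"] * 2, name="language")

# normalize=True returns each language's share instead of its raw count
shares = toy_languages.value_counts(normalize=True)
print(shares)
```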

# Remove punctuation

All the questions presented here end with a question mark. We will simply remove it and, along with it, strip any whitespace around the text.

In [None]:
train_df["question"] = train_df["question"].str.replace("?", "", regex=False).str.strip()
train_df.head()
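
To see what this cleaning step does in isolation, here is a small sketch on two made-up questions (not taken from the dataset):

```python
import pandas as pd

# Made-up questions, one with trailing whitespace after the question mark
toy_questions = pd.Series(["भारत की राजधानी क्या है? ", "What is chaii?"])

# regex=False treats "?" as a literal character rather than a regex quantifier
cleaned = toy_questions.str.replace("?", "", regex=False).str.strip()
print(cleaned.tolist())
```

Passing `regex=False` matters here: as a regular expression, a bare `?` is a quantifier with nothing to repeat and is not a valid pattern.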

# WordCloud

We will generate two wordclouds, one for each language. 

In [None]:
# Get the text for both the languages
tamil_text = " ".join(train_df[train_df["language"]=="tamil"]["question"])
hindi_text = " ".join(train_df[train_df["language"]=="hindi"]["question"])

For generating the wordcloud, we need the right `font`

1. [Font for Hindi](http://www.lipikaar.com/support/download-unicode-fonts-for-hindi-marathi-sanskrit-nepali)
2. [Font for Tamil](http://www.lipikaar.com/support/download-unicode-fonts-for-tamil)

**Note:** I haven't checked how accurate the given stopwords are. This is something you need to cross-check!

In [None]:
# Download and extract the fonts
!wget -q http://www.lipikaar.com/sites/www.lipikaar.com/themes/million/images/support/fonts/Devanagari.zip
!wget -q http://www.lipikaar.com/sites/www.lipikaar.com/themes/million/images/support/fonts/Tamil.zip

!unzip -qq Devanagari.zip
!unzip -qq Tamil.zip

In [None]:
# Get the tokens and frequencies for Hindi language
hindi_nlp = Hindi()
hindi_doc = hindi_nlp(hindi_text)
# Keep the tokens in a list (not a set) so that Counter sees every occurrence
hindi_tokens = [token.text for token in hindi_doc]
hindi_tokens_counter = Counter(hindi_tokens)


# Get the tokens and frequencies for Tamil language
tamil_nlp = Tamil()
# Use the Tamil pipeline (not the Hindi one) to tokenize the Tamil text
tamil_doc = tamil_nlp(tamil_text)
tamil_tokens = [token.text for token in tamil_doc]
tamil_tokens_counter = Counter(tamil_tokens)
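
The frequency-counting idea can be tried on its own with a handful of tokens. The sketch below is a toy illustration: `toy_tokens` and `toy_stopwords` are hypothetical stand-ins for the spaCy tokens and stopword lists, and it also drops stopwords while counting:

```python
from collections import Counter

# Hypothetical toy tokens and stopword set, standing in for the spaCy output
toy_tokens = ["capital", "of", "india", "capital", "of", "france"]
toy_stopwords = {"of"}

# Count every occurrence (a list keeps duplicates), skipping stopwords
toy_counter = Counter(tok for tok in toy_tokens if tok not in toy_stopwords)
print(toy_counter.most_common(2))
```

The resulting counter maps each remaining token to the number of times it appears, which is the shape `generate_from_frequencies` expects.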

In [None]:
def plot_wordcloud(
    font_path,
    frequencies,
    stopwords,
    width=500,
    height=500,
    background_color="white",
    collocations=True,
    min_font_size=5,
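):
    """Generates wordcloud from word frequencies."""

    # generate_from_frequencies() does not apply `stopwords`, so filter them here
    frequencies = {w: f for w, f in frequencies.items() if w not in stopwords}
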
    wordcloud = WordCloud(font_path=font_path,
                      width=width,
                      height=height,
                      background_color=background_color,
                      stopwords=stopwords,
                      collocations=collocations,
                      min_font_size=min_font_size).generate_from_frequencies(frequencies)

    
    plt.figure(figsize=(10, 10))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

In [None]:
# Plot the wordcloud for Hindi language
plot_wordcloud(font_path="Devanagari/Lohit-Devanagari.ttf",
               frequencies=hindi_tokens_counter,
               stopwords=hindi_stopwords
              )

In [None]:
# Plot the wordcloud for tamil language
plot_wordcloud(font_path="Tamil/Lohit-Tamil.ttf",
               frequencies=tamil_tokens_counter,
               stopwords=tamil_stopwords
              )

That's it for the EDA! We will build a language model (LM) in the next notebook!