## Contents:
* [Competition Objective](#Competition-Objective)
* [Length of Questions & Answers](#Length-of-Questions-&-Answers)
* [Wordcloud](#Wordcloud)
* [Additional Public Datasets](#Additional-Public-datasets)
* [Other Useful Resources](#Good-Additional-Resources)

# Competition Objective
Popular Natural Language Understanding (NLU) models perform worse on Indian languages than on English, which leads to subpar experiences in downstream web applications for Indian users.
We are given questions in Tamil and Hindi about Wikipedia articles, and we have to extract the answers to those questions from the articles.

## Important Points
* Answers are drawn directly from the given context, so no rephrasing is needed.
* **context** is the text (the Wikipedia article) of the Hindi/Tamil sample from which the answer should be derived.
* The evaluation metric in this competition is the word-level **Jaccard score**, as described in the [evaluation tab](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/overview/evaluation).
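To make the metric concrete, here is a minimal sketch of a word-level Jaccard score between a predicted and a reference answer. The evaluation tab linked above remains the authoritative definition. The lowercasing, the whitespace splitting, and the handling of two empty strings are assumptions of this sketch, and the example strings are made up.

```python
def jaccard(str1: str, str2: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over whitespace-split words."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        # Assumption of this sketch: two empty strings count as a perfect match.
        return 1.0
    shared = a & b
    return len(shared) / (len(a) + len(b) - len(shared))


# Toy strings (not taken from the dataset): 2 shared words out of 3 distinct words.
score = jaccard("new delhi india", "new delhi")
print(score)
```

Because the score works on sets of words, predicting a few extra or missing words around the correct span lowers the score without dropping it to zero.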

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

from spacy.lang.hi import Hindi, STOP_WORDS as hindi_stopwords
from spacy.lang.ta import Tamil, STOP_WORDS as tamil_stopwords

In [None]:
train_df = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/train.csv')

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
sns.displot(data=train_df,x='language', color='orange')

# Length of Questions & Answers

In [None]:
chars_per_ques = train_df['question'].str.len()
chars_per_ques.describe()

In [None]:
chars_per_ans = train_df['answer_text'].str.len()
chars_per_ans.describe()

As we can see, answers are much shorter than questions on average: the mean answer length is less than one third of the mean question length. There are a few very long answers, though, with the longest being 286 characters long.
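The comparison above can be checked directly instead of being read off the two `describe()` outputs. The following is a small sketch: `mean_length_ratio` is a hypothetical helper name, and the toy DataFrame is invented purely so the block runs on its own.

```python
import pandas as pd


def mean_length_ratio(df: pd.DataFrame,
                      answer_col: str = "answer_text",
                      question_col: str = "question") -> float:
    """Ratio of mean answer length to mean question length, in characters."""
    return df[answer_col].str.len().mean() / df[question_col].str.len().mean()


# Toy data (made up): question lengths 10 and 20, answer lengths 3 and 6.
toy_df = pd.DataFrame({
    "question": ["abcdefghij", "abcdefghijklmnopqrst"],
    "answer_text": ["abc", "abcdef"],
})
ratio = mean_length_ratio(toy_df)
print(f"answers are {ratio:.2f}x as long as questions on average")
```

In this notebook the same helper would be called as `mean_length_ratio(train_df)`.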

In [None]:
sns.displot(x=chars_per_ques)

In [None]:
sns.displot(data=train_df,x=chars_per_ans, color='green')

Box Plot of answer length (in number of characters)

In [None]:
sns.boxplot(data=chars_per_ans)

# Wordcloud
Thanks to [NAIN's notebook](https://www.kaggle.com/aakashnain/chaii-explore-the-data) for these word clouds.

In [None]:
# Download and extract the fonts
!wget -q http://www.lipikaar.com/sites/www.lipikaar.com/themes/million/images/support/fonts/Devanagari.zip
!wget -q http://www.lipikaar.com/sites/www.lipikaar.com/themes/million/images/support/fonts/Tamil.zip

!unzip -qq Devanagari.zip
!unzip -qq Tamil.zip

In [None]:
# Get the text for both the languages
tamil_text = " ".join(train_df[train_df["language"]=="tamil"]["question"])
hindi_text = " ".join(train_df[train_df["language"]=="hindi"]["question"])

In [None]:
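# Get the tokens and frequencies for Hindi.
# Count every occurrence (wrapping the tokens in set() would make every
# frequency 1) and drop stopwords, punctuation and whitespace here, because
# WordCloud ignores its `stopwords` argument when given precomputed frequencies.
hindi_nlp = Hindi()
hindi_doc = hindi_nlp(hindi_text)
hindi_tokens = [token.text for token in hindi_doc
                if not (token.is_space or token.is_punct)
                and token.text not in hindi_stopwords]
hindi_tokens_counter = Counter(hindi_tokens)

# Get the tokens and frequencies for Tamil (using the Tamil tokenizer)
tamil_nlp = Tamil()
tamil_doc = tamil_nlp(tamil_text)
tamil_tokens = [token.text for token in tamil_doc
                if not (token.is_space or token.is_punct)
                and token.text not in tamil_stopwords]
tamil_tokens_counter = Counter(tamil_tokens)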

In [None]:
def plot_wordcloud(
    font_path,
    frequencies,
    stopwords,
    background_color="white",
    collocations=True,
    min_font_size=8,
):
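    """Generates a wordcloud from word frequencies.

    Note: WordCloud ignores `stopwords` when drawing from precomputed
    frequencies, so stopwords must be removed before counting.
    """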
    
    wordcloud = WordCloud(font_path=font_path,
                      width=400,
                      height=400,
                      background_color=background_color,
                      stopwords=stopwords,
                      collocations=collocations,
                      min_font_size=min_font_size).generate_from_frequencies(frequencies)

    
    plt.figure(figsize=(10, 10))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

In [None]:
# Plot the wordcloud for the Hindi language
plot_wordcloud(font_path="Devanagari/Lohit-Devanagari.ttf",
               frequencies=hindi_tokens_counter,
               stopwords=hindi_stopwords
              )

In [None]:
# Plot the wordcloud for the Tamil language
plot_wordcloud(font_path="Tamil/Lohit-Tamil.ttf",
               frequencies=tamil_tokens_counter,
               stopwords=tamil_stopwords
              )

# Additional Public datasets
As the given training set is quite small, the hosts encourage us to share and use additional public data sources. The following are some of the data sources that other members have shared so far:
* [Samanantar](https://indicnlp.ai4bharat.org/samanantar/#en-indic)
* [Facebook Multilingual QA datasets](https://github.com/facebookresearch/MLQA)
* [Hindi Wikipedia Articles - 172k](https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k)

The main discussion thread for sharing datasets is:
https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/264581
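External QA data usually needs to be reshaped before it can be combined with `train_df`. MLQA is distributed in the SQuAD JSON layout, so the sketch below flattens a SQuAD-style dictionary into one row per question. The function name `squad_to_frame`, the toy record, and the output columns `id` and `answer_start` are assumptions of this sketch; check them against the actual columns of `train.csv` before concatenating.

```python
import pandas as pd


def squad_to_frame(squad: dict, language: str) -> pd.DataFrame:
    """Flatten SQuAD-style JSON (data -> paragraphs -> qas) into one row per question."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers") or []
                if not answers:
                    continue  # skip questions without an annotated answer
                first = answers[0]  # keep only the first annotated answer
                rows.append({
                    "id": qa["id"],
                    "context": context,
                    "question": qa["question"],
                    "answer_text": first["text"],
                    "answer_start": first["answer_start"],
                    "language": language,
                })
    return pd.DataFrame(rows)


# Toy record (made up) in the SQuAD layout, only to show the shape of the output.
toy_squad = {"data": [{"title": "demo", "paragraphs": [{
    "context": "Chennai is the capital of Tamil Nadu.",
    "qas": [{"id": "q1",
             "question": "What is the capital of Tamil Nadu?",
             "answers": [{"text": "Chennai", "answer_start": 0}]}],
}]}]}
extra_df = squad_to_frame(toy_squad, language="demo")
print(extra_df[["question", "answer_text", "answer_start"]])
```

A real file would first be loaded with `json.load`, and the resulting frame could then be appended to the training data with `pd.concat`.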

# Good Additional Resources
* Official Starter Notebook: https://www.kaggle.com/deeplearning10/chaii-1-starter-notebook
* AI4Bharat IndicNLP [homepage](https://indicnlp.ai4bharat.org/home/), a project led by volunteers from IIT Madras and other organizations
* Pre-trained language models: [IndicBERT](https://indicnlp.ai4bharat.org/indic-bert/)
* [Multilingual Transfer Learning for QA Using Translation as Data Augmentation](https://arxiv.org/pdf/2012.05958.pdf)

## Some Issues
* As pointed out by some members in [this discussion thread](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/264395), there are some noisy labels in the training data. However, the competition host has assured participants that each instance in the test set is 3-way annotated (unlike the 1-way annotated training data), so the test set is very unlikely to have such issues.
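
One narrow kind of label noise can be screened for automatically: rows where the stored answer does not appear in the context at the recorded position. The sketch below assumes the training data has an `answer_start` column holding a character offset into `context` (that column is not shown above, so treat it as an assumption). `find_misaligned_answers` is a hypothetical helper, and the toy rows are invented. It cannot tell whether an answer is factually right, only whether the span lines up.

```python
import pandas as pd


def find_misaligned_answers(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose answer_text is not found in context at answer_start."""
    def is_aligned(row) -> bool:
        start = int(row["answer_start"])
        end = start + len(row["answer_text"])
        return row["context"][start:end] == row["answer_text"]

    aligned = df.apply(is_aligned, axis=1)
    return df[~aligned]


# Toy rows (made up): the second offset points at the wrong place in the context.
toy_df = pd.DataFrame({
    "id": ["ok", "bad"],
    "context": ["Chennai is a city.", "Chennai is a city."],
    "answer_text": ["Chennai", "city"],
    "answer_start": [0, 3],
})
suspect_rows = find_misaligned_answers(toy_df)
print(suspect_rows["id"].tolist())
```

Rows flagged this way are candidates for manual inspection rather than automatic removal.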