# Hindi and Tamil Question Answering


In this competition, the goal is to predict answers to real questions about Wikipedia articles.
Question answering is a common Natural Language Understanding (NLU) task, but it is far less explored for Hindi and Tamil. Popular NLU models perform worse on Indian languages than on English, and the intent of this competition is to bridge that gap.

In [None]:
from pathlib import Path
import pandas as pd

# Let's look at the data

In [None]:
data_dir = Path("../input/chaii-hindi-and-tamil-question-answering/")
train_df = pd.read_csv(data_dir / "train.csv", encoding="utf8")
test_df = pd.read_csv(data_dir / "test.csv", encoding="utf8")


print("*"*10)
print("Number of training samples: ", len(train_df))
print("Number of TAMIL samples: ", len(train_df[train_df.language=='tamil']))
print("Number of HINDI samples: ", len(train_df[train_df.language=='hindi']))
print("*"*10)
print("Number of test samples: ", len(test_df))


In [None]:
train_df.head(10)
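
Since each row carries `context`, `answer_text` and `answer_start`, a quick sanity check is to confirm that the answer really sits at the stated position. The sketch below assumes `answer_start` is a character offset into `context`; the helper name `answer_span_matches` and the toy rows are made up for illustration and are not part of the dataset.

```python
def answer_span_matches(context: str, answer_text: str, answer_start: int) -> bool:
    """Return True if `answer_text` sits in `context` exactly at `answer_start`."""
    if answer_start < 0:
        return False
    end = answer_start + len(answer_text)
    return context[answer_start:end] == answer_text


# Toy rows standing in for train_df records (made-up data, not from the dataset).
toy_rows = [
    {"context": "चाय एक पेय है", "answer_text": "पेय", "answer_start": 7},
    # Deliberately wrong offset: the answer actually starts at index 14.
    {"context": "Chennai is in Tamil Nadu", "answer_text": "Tamil Nadu", "answer_start": 3},
]

checks = [
    answer_span_matches(r["context"], r["answer_text"], r["answer_start"])
    for r in toy_rows
]
print(checks)
```

On the real data, something like `train_df.apply(lambda r: answer_span_matches(r.context, r.answer_text, r.answer_start), axis=1).mean()` should give the share of rows whose span lines up.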

# Google Translate API - Let's translate some QA pairs to get a sense of the data
<h4 style="background-color:DodgerBlue;">THIS CAN BE VERY EXPENSIVE! Please use with care if you decide to include your Google credentials</h4>


In [None]:
from google.cloud import translate_v2 as translate
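def translate_text(target, text, translate_client):
    """Translate `text` into the `target` language; return '' on failure."""
    try:
        result = translate_client.translate(text, target_language=target)
        return result["translatedText"]
    except Exception as e:
        # Report the failure instead of silently hiding it
        print(f"Translation failed: {e}")
        return ''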
    
credentials_path = "../input/google-credentials-key/translation-322918-001a60851ad9.json"
translate_client = translate.Client.from_service_account_json(credentials_path)

for i in range(3):
    print("*"*10)
    print(train_df.language[i])
    print(translate_text("en", train_df.question[i], translate_client))
    print(translate_text("en", train_df.answer_text[i], translate_client))
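
Given the cost warning above, one way to keep usage in check is to avoid sending the same text twice and to track how many characters actually go out. This is a minimal sketch, not part of the original notebook: `make_cached_translator` is a hypothetical helper, and `fake_translate` stands in for the real API call so the example runs without credentials.

```python
from typing import Callable, Dict, List


def make_cached_translator(translate_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a translate function so repeated inputs are served from memory."""
    cache: Dict[str, str] = {}

    def cached(text: str) -> str:
        if text not in cache:
            cache[text] = translate_fn(text)  # only new texts reach the backend
        return cache[text]

    return cached


# Stand-in for the real API call; records every text that would be sent.
calls: List[str] = []


def fake_translate(text: str) -> str:
    calls.append(text)
    return text.upper()


cached_translate = make_cached_translator(fake_translate)
outputs = [cached_translate(t) for t in ["chai", "chai", "tea"]]
chars_sent = sum(len(t) for t in calls)
print(outputs, len(calls), chars_sent)
```

With the real client, the wrapped function could be `lambda t: translate_text("en", t, translate_client)`. Note that `translate_text` returns `''` on failure, so a failed call would be cached as an empty string.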

## Let's understand the LENGTH of the context/questions/answers

In [None]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)

#number of chars
train_df['context_chars'] = train_df['context'].str.len()
train_df['question_chars'] = train_df['question'].str.len()
train_df['answer_chars'] = train_df['answer_text'].str.len()

# number of whitespace-separated words
train_df['context_words'] = train_df['context'].str.split().str.len()
train_df['question_words'] = train_df['question'].str.split().str.len()
train_df['answer_words'] = train_df['answer_text'].str.split().str.len()

tamil_df = train_df[train_df.language=='tamil']
hindi_df = train_df[train_df.language=='hindi']

train_df.describe()

# Wordcount Distribution

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
# fill=True replaces the deprecated shade=True argument of kdeplot
sns.kdeplot(train_df['context_words'], fill=True, color='#ff0000', ax=axes[0])
sns.kdeplot(tamil_df['context_words'], fill=True, color='#00ff00', ax=axes[1])
sns.kdeplot(hindi_df['context_words'], fill=True, color='#0000ff', ax=axes[2])
# Titles follow the plotting order: overall, Tamil, Hindi
axes[0].set_title('Context Word Count - Overall', fontdict={'fontsize': 12, 'fontweight': 'bold'})
axes[1].set_title('Context Word Count - Tamil', fontdict={'fontsize': 12, 'fontweight': 'bold'})
axes[2].set_title('Context Word Count - Hindi', fontdict={'fontsize': 12, 'fontweight': 'bold'})


In [None]:
train_df.context_words.hist()

# Answer Start Position Distribution

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
# fill=True replaces the deprecated shade=True argument of kdeplot
sns.kdeplot(train_df['answer_start'], fill=True, color='#ff0000', ax=axes[0])
sns.kdeplot(tamil_df['answer_start'], fill=True, color='#00ff00', ax=axes[1])
sns.kdeplot(hindi_df['answer_start'], fill=True, color='#0000ff', ax=axes[2])
# Titles follow the plotting order: overall, Tamil, Hindi
axes[0].set_title('Answer Start Position - Overall', fontdict={'fontsize': 12, 'fontweight': 'bold'})
axes[1].set_title('Answer Start Position - Tamil', fontdict={'fontsize': 12, 'fontweight': 'bold'})
axes[2].set_title('Answer Start Position - Hindi', fontdict={'fontsize': 12, 'fontweight': 'bold'})

In [None]:
train_df.answer_start.hist()

### Insights from above
1. A majority of answers start very early in the passage. However, we don't know whether this also holds for the test data. It will be an interesting probe to make on the leaderboard (LB).

2. Some answers are very long (up to 51 words), while the longest question is only 22 words.
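
The first insight can be made more concrete by normalising each start position by its context length, since a raw offset of 500 characters means different things in a short and a long passage. The following is a rough sketch on made-up numbers; `relative_start`, `share_within` and the 10% threshold are illustrative choices, not something taken from the competition.

```python
from typing import Sequence


def relative_start(answer_start: int, context_len: int) -> float:
    """Answer start as a fraction of the context length (0.0 = very beginning)."""
    return answer_start / context_len if context_len else 0.0


def share_within(starts: Sequence[int], lengths: Sequence[int], threshold: float = 0.1) -> float:
    """Fraction of answers that begin within the first `threshold` of their context."""
    rel = [relative_start(s, n) for s, n in zip(starts, lengths)]
    if not rel:
        return 0.0
    return sum(r <= threshold for r in rel) / len(rel)


# Made-up offsets and context lengths, only to show the calculation.
toy_starts = [0, 50, 900]
toy_lengths = [1000, 1000, 1000]
early_share = share_within(toy_starts, toy_lengths)
print(round(early_share, 3))
```

On the training data, the inputs would be `train_df['answer_start']` and `train_df['context_chars']`, assuming both are measured in characters.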

# Let's look at the Hindi samples

In [None]:
display(hindi_df.tail(10))

#### Let's look at one Hindi sample

In [None]:
a = 942
print(a)
print("Answer = ", hindi_df.answer_text[a])
print("Answer Start = ", hindi_df.answer_start[a])
print("*"*10)
print("Context Text = ", hindi_df.context[a])


# Let's look at the Tamil samples

In [None]:
display(tamil_df.tail(10))

In [None]:
a = 5
print(a)
print("Answer = ", tamil_df.answer_text[a])
print("Answer Start = ", tamil_df.answer_start[a])
print("*"*10)
print("Context Text = ", tamil_df.context[a])

# Looking at the test data

In [None]:
test_df.head()

In [None]:
#number of chars
test_df['context_chars'] = test_df['context'].str.len()
test_df['question_chars'] = test_df['question'].str.len()

# number of whitespace-separated words
test_df['context_words'] = test_df['context'].str.split().str.len()
test_df['question_words'] = test_df['question'].str.split().str.len()

test_df

# Thank you! I will add more info as I do my own EDA over a cup of Chaii!