# Introduction

A first attempt at question answering in Hindi and Tamil. It will be improved over time.

# Understanding the Problem

Given a pair of inputs, a Context and a Question, return a string that answers the question for the given context. Answers are drawn directly from the Context (each answer is a substring of its context) and include punctuation.

Answers will be evaluated using the word-level Jaccard Score as provided by the competition.
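
For reference, the word-level Jaccard score can be written as below. The symbols $A$ and $B$ are introduced here for illustration: they stand for the sets of lowercased, whitespace-separated words in the predicted answer and the true answer.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
```

For example, comparing "New Delhi India" with "new delhi" gives $|A \cap B| = 2$ and $|A \cup B| = 3$, so the score is $2/3 \approx 0.67$.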

In [None]:
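def jaccard(str1, str2):
    # Compare the two strings as sets of lowercased, whitespace-separated words.
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    # Size of the intersection divided by the size of the union.
    return float(len(c)) / (len(a) + len(b) - len(c))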

# Preprocessing

First load the dataset and tokenize the text.

The training data gives the answer text and its start index, so the end index is computed as the start index plus the length of the answer text.

To-Do: Split the Context at predefined sizes in a way that allows the model to train.
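
One possible way to approach this To-Do is a sliding window over the characters of the context. The sketch below is only an illustration under assumptions: the function name `split_context` and the `window` and `stride` parameters are hypothetical and not part of the notebook, and the sizes would need tuning.

```python
def split_context(context, answer_start, answer_end, window=400, stride=200):
    """Cut `context` into overlapping character windows.

    Returns a list of (chunk, local_start, local_end) tuples. The local
    offsets are relative to the chunk, or None when the answer does not
    lie entirely inside that chunk.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    offset = 0
    while True:
        chunk = context[offset:offset + window]
        if answer_start >= offset and answer_end <= offset + len(chunk):
            # The whole answer is inside this window: shift the offsets.
            local_start, local_end = answer_start - offset, answer_end - offset
        else:
            local_start, local_end = None, None
        chunks.append((chunk, local_start, local_end))
        if offset + window >= len(context):
            break  # this window already reaches the end of the context
        offset += stride
    return chunks


# Toy example: the answer "gh" sits at characters 6..8 of a 10-character context.
example = split_context("abcdefghij", answer_start=6, answer_end=8, window=5, stride=3)
print(example)
```

Because consecutive windows overlap by `window - stride` characters, an answer that straddles one window boundary can still appear whole in a neighbouring window.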

In [None]:
import pandas as pd

In [None]:
base_path = "/kaggle/input/chaii-hindi-and-tamil-question-answering/"
test_path = base_path + "test.csv"
train_path = base_path + "train.csv"

test_set = pd.read_csv(test_path)
train_set = pd.read_csv(train_path)

In [None]:
# Add End Index
train_set['answer_end'] = train_set['answer_text'].str.len() + train_set['answer_start']
train_set.head()
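
A quick sanity check on these offsets is to slice the context and compare the result with the answer text. The helper below is a minimal sketch; the name `answer_span_matches` and the example sentence are made up for illustration.

```python
def answer_span_matches(context, answer_text, answer_start):
    """Return True if slicing `context` at the stored offsets gives `answer_text`."""
    answer_end = answer_start + len(answer_text)
    return context[answer_start:answer_end] == answer_text


# Made-up example sentence, used only to exercise the helper.
context = "Chennai is the capital of Tamil Nadu."
start = context.index("Tamil Nadu")

print(answer_span_matches(context, "Tamil Nadu", start))      # offsets are correct
print(answer_span_matches(context, "Tamil Nadu", start + 1))  # offsets are off by one
```

The same check could be applied row by row to `train_set` to find any rows whose offsets do not line up with the answer text.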

In [None]:
# Derive the maximum sequence length (a hyperparameter) from the longest context.
# There is not enough memory to train on full contexts, so each sequence is
# limited to 1/30 of the longest context (1660 characters).

# To-do: Improve this pipeline by splitting text into smaller pieces for processing.
max_sequence_length = int(train_set['context'].str.len().max() / 30)
print(max_sequence_length)

In [None]:
# Using Keras for text preprocessing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [None]:
# Tokenize at the character level: the character vocabulary (word_index) is much
# smaller than a word vocabulary, which saves memory when training the model.

# To-Do: Optimize the pipeline so words are used instead of characters.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(train_set['context'])
# +1 because index 0 is reserved for padding and never assigned to a character.
max_word_index = len(tokenizer.word_index) + 1
print(max_word_index)
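
To make the character-level vocabulary more concrete, here is a rough pure-Python illustration of the idea. It is not the Keras implementation, and `build_char_index` is a hypothetical helper: each distinct character gets an integer starting at 1, with more frequent characters first, and 0 is kept free for padding.

```python
from collections import Counter


def build_char_index(texts):
    """Map each character to an integer, most frequent first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower())
    # Index 0 is left unused so it can serve as the padding value.
    return {char: i for i, (char, _) in enumerate(counts.most_common(), start=1)}


char_index = build_char_index(["abca", "ab"])
vocabulary_size = len(char_index) + 1  # +1 for the reserved padding index 0
print(char_index, vocabulary_size)
```

This is why the embedding input size is the number of distinct characters plus one.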

In [None]:
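def prep_text(texts, tokenizer, max_sequence_length):
    text_sequences = tokenizer.texts_to_sequences(texts)
    # pad_sequences pads and truncates at the front by default. Do both at the
    # end instead, so the start of each text is kept and sequence positions
    # line up with character offsets such as answer_start.
    return pad_sequences(text_sequences, maxlen=max_sequence_length,
                         padding='post', truncating='post')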

# Convert each of the texts into sequences.
train_context_sequence = prep_text(train_set['context'], tokenizer, max_sequence_length)
train_question_sequence = prep_text(train_set['question'], tokenizer, max_sequence_length)
test_context_sequence = prep_text(test_set['context'], tokenizer, max_sequence_length)
test_question_sequence = prep_text(test_set['question'], tokenizer, max_sequence_length)

# Model

A test model with two inputs and two outputs.
The inputs are the context and the question, each as an integer sequence.
Instead of one-hot encoding, the model learns embeddings for both inputs at training time.
An LSTM processes each input's embeddings, and the two results are concatenated.
The concatenated values are passed to Dense layers that predict the start and end indices.

To-Do: Replace the Dense Layers with something better.

In [None]:
from keras.models import Model
from keras.layers import Embedding, SpatialDropout1D, LSTM, concatenate, Dense
from keras import Input

# Size of the learned embedding vectors (the second argument to Embedding).
text_embedding_size = 1000
question_embedding_size = 1000

text_input = Input(shape=(None,), dtype='int32', name='text')
embedded_text = Embedding(max_word_index, text_embedding_size)(text_input)
embedded_text = SpatialDropout1D(0.2)(embedded_text)
encoded_text = LSTM(32)(embedded_text)

question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = Embedding(max_word_index, question_embedding_size)(question_input)
embedded_question = SpatialDropout1D(0.2)(embedded_question)
encoded_question = LSTM(32)(embedded_question)

concatenated = concatenate([encoded_text, encoded_question])

# Each head predicts a single number (a character offset), so this is treated
# as regression. A softmax over one unit would always output 1.0 and the
# categorical cross-entropy loss would give the model nothing to learn from.
start_index = Dense(1)(concatenated)
end_index = Dense(1)(concatenated)

model = Model([text_input, question_input], outputs=[start_index, end_index])
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

In [None]:
model.fit(
    [train_context_sequence, train_question_sequence],
    [train_set['answer_start'].values, train_set['answer_end'].values],
    epochs=10,
    batch_size=128,
)

In [None]:
predictions = model.predict([test_context_sequence, test_question_sequence])
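
Because the model has two outputs, `model.predict` returns two arrays: one for the start index and one for the end index. To score an answer, those numbers still have to be turned back into text. The sketch below assumes each prediction is a single (possibly fractional) character offset into the original context; `extract_answer` is a hypothetical helper, not part of the notebook.

```python
def extract_answer(context, predicted_start, predicted_end):
    """Slice the answer out of `context` from two predicted character offsets."""
    start = int(round(float(predicted_start)))
    end = int(round(float(predicted_end)))
    # Clamp both offsets to the valid range of the context, and make sure the
    # end never falls before the start.
    start = max(0, min(start, len(context)))
    end = max(start, min(end, len(context)))
    return context[start:end]


# Made-up example: fractional offsets are rounded before slicing.
answer = extract_answer("The capital is Delhi.", 14.7, 20.2)
print(answer)
```

The helper could be applied row by row to the test contexts together with the two predicted arrays. Clamping keeps out-of-range predictions from raising errors, at the cost of sometimes returning an empty string.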