<h2>chaii QA - 5 Fold XLMRoberta Training + Inference in Torch w/o Trainer API</h2>
    
<h3><span style="color:#444">Introduction</span></h3>

This kernel preprocesses the Hindi portions of the MLQA and XQUAD corpora. For more information, see the finetuning notebook.

This is a three-part kernel series:

- [External Data - MLQA, XQUAD Preprocessing](https://www.kaggle.com/rhtsingh/external-data-mlqa-xquad-preprocessing) This kernel preprocesses the Hindi corpora of MLQA and XQUAD. I used this data for training.

- [chaii QA - 5 Fold XLMRoberta Torch | FIT](https://www.kaggle.com/rhtsingh/chaii-qa-5-fold-xlmroberta-torch-fit/edit) This kernel showcases finetuning (FIT) on the competition data plus the external data, combining different strategies.

- [chaii QA - 5 Fold XLMRoberta Torch | Infer](https://www.kaggle.com/rhtsingh/chaii-qa-5-fold-xlmroberta-torch-infer) The inference kernel, where we ensemble our 5-fold XLMRoberta models and make the submission.

### MLQA

In [None]:
!wget https://dl.fbaipublicfiles.com/MLQA/MLQA_V1.zip

In [None]:
import zipfile
with zipfile.ZipFile('/kaggle/working/MLQA_V1.zip') as zip_ref:
    zip_ref.extractall('/kaggle/working/')

In [None]:
import os
import random
import json
import numpy as np
import pandas as pd
from tqdm import tqdm

# Fix the random seeds for reproducibility
random.seed(42)
np.random.seed(42)

In [None]:
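# The Hindi dev split is loaded under the "train" name; the test split is loaded separately.
mlqa_train_data = '/kaggle/working/MLQA_V1/dev/dev-context-hi-question-hi.json'
mlqa_test_data = '/kaggle/working/MLQA_V1/test/test-context-hi-question-hi.json'

# Read explicitly as UTF-8 so the Hindi text does not depend on the platform default encoding
with open(mlqa_train_data, 'r', encoding='utf-8') as file_input:
    train_file = json.load(file_input)

with open(mlqa_test_data, 'r', encoding='utf-8') as file_input:
    test_file = json.load(file_input)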
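
For orientation, here is a minimal sketch of the nested layout that the `preprocess` function in the next cell reads: `data` → `paragraphs` → `qas` → `answers`. The `toy_dataset` content and the `count_questions` helper are hypothetical and only illustrate the shape; they are not taken from MLQA or XQUAD.

```python
# Toy payload with the same nesting that preprocess() walks.
# The text is made up for illustration and is NOT from MLQA or XQUAD.
toy_dataset = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "दिल्ली भारत की राजधानी है।",
                    "qas": [
                        {
                            "question": "भारत की राजधानी क्या है?",
                            "answers": [{"text": "दिल्ली", "answer_start": 0}],
                        }
                    ],
                }
            ]
        }
    ]
}


def count_questions(dataset):
    """Count question entries across all articles and paragraphs (hypothetical helper)."""
    return sum(
        len(paragraph["qas"])
        for article in dataset["data"]
        for paragraph in article["paragraphs"]
    )


print(count_questions(toy_dataset))
```

Calling `count_questions(train_file)` before preprocessing gives a number that can be compared against the count `preprocess` prints.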

In [None]:
def preprocess(dataset, tier):
    """Flatten a SQuAD-style dataset into a list of QA example dicts.

    Only the first answer of each question is kept.
    """
    examples = []

    for article in tqdm(dataset['data'], desc="Preprocessing {}".format(tier)):
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            # Each replacement swaps two characters for two characters,
            # so character offsets into the context are unchanged.
            context = context.replace("''", '" ')
            context = context.replace("``", '" ')
            for qn in paragraph['qas']:
                answer = qn['answers'][0]
                examples.append(
                    {
                        'context': context,
                        'question': qn['question'],
                        'answer_text': answer['text'],
                        'answer_start': answer['answer_start'],
                    }
                )

    # Report how many examples were extracted
    print(len(examples))
    return examples

In [None]:
examples_train = preprocess(train_file, 'dev')
examples_test = preprocess(test_file, 'test')

In [None]:
examples = examples_train + examples_test
mlqa = pd.DataFrame(examples)
mlqa['language'] = 'hindi'
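
Because downstream finetuning relies on `answer_start` pointing at `answer_text` inside `context`, a quick alignment check can be useful at this point. The sketch below is one way to do it; `find_misaligned` is a hypothetical helper name and the toy examples are made up for illustration.

```python
def find_misaligned(examples):
    """Return indices of examples whose answer_start does not point at answer_text."""
    bad = []
    for i, ex in enumerate(examples):
        start = ex["answer_start"]
        end = start + len(ex["answer_text"])
        if ex["context"][start:end] != ex["answer_text"]:
            bad.append(i)
    return bad


# Toy data (not from MLQA): the second example is deliberately shifted by one character.
context = "दिल्ली भारत की राजधानी है।"
toy_examples = [
    {"context": context, "answer_text": "भारत", "answer_start": context.index("भारत")},
    {"context": context, "answer_text": "भारत", "answer_start": context.index("भारत") + 1},
]

print(find_misaligned(toy_examples))
```

Running `find_misaligned(examples)` on the real list should ideally return an empty list; any indices it does return are worth inspecting by hand before training.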

### XQUAD

In [None]:
!git clone https://github.com/deepmind/xquad.git

In [None]:
xquad_train_file = '/kaggle/working/xquad/xquad.hi.json'

with open(xquad_train_file, 'r', encoding='utf-8') as file_input:
    train_file = json.load(file_input)

# The second argument only labels the progress bar
examples_train = preprocess(train_file, 'xquad')
xquad = pd.DataFrame(examples_train)
xquad['language'] = 'hindi'

### Remove downloaded files

In [None]:
import os
import shutil

# Empty /kaggle/working/ so the downloaded archive and cloned repo do not end up in the kernel output
folder = '/kaggle/working/'
for filename in os.listdir(folder):
    file_path = os.path.join(folder, filename)
    try:
        if os.path.isfile(file_path) or os.path.islink(file_path):
            os.unlink(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)
    except Exception as e:
        print('Failed to delete %s. Reason: %s' % (file_path, e))
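
The loop above deletes everything in the working directory. If other files need to survive, a narrower variant is to remove only the artifacts this kernel created: the `MLQA_V1.zip` archive, the extracted `MLQA_V1` folder, and the `xquad` clone. The sketch below assumes those three names; `remove_paths` is a hypothetical helper, and it is demonstrated in a temporary directory rather than in `/kaggle/working/`.

```python
import os
import shutil
import tempfile


def remove_paths(base_dir, names):
    """Delete only the named files or directories under base_dir; return the names removed."""
    removed = []
    for name in names:
        path = os.path.join(base_dir, name)
        if os.path.isfile(path) or os.path.islink(path):
            os.unlink(path)
            removed.append(name)
        elif os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(name)
    return removed


# Demonstration in a throwaway directory with stand-in files
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "MLQA_V1.zip"), "w").close()
os.mkdir(os.path.join(demo_dir, "xquad"))
open(os.path.join(demo_dir, "keep.csv"), "w").close()

removed = remove_paths(demo_dir, ["MLQA_V1.zip", "MLQA_V1", "xquad"])
remaining = sorted(os.listdir(demo_dir))
print(removed, remaining)

shutil.rmtree(demo_dir)  # clean up the demo directory itself
```

Names that do not exist are skipped silently, so the same list can be passed whether or not the archive was extracted.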

### Save Data

In [None]:
mlqa.to_csv('mlqa_hindi.csv', index=False)
xquad.to_csv('xquad.csv', index=False)
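
Contexts can contain newlines and quotes, so it may be worth confirming that a saved file reads back unchanged. The following is a pandas-free sketch using the standard `csv` module; `read_examples` is a hypothetical helper, the toy row is made up, and the column names mirror the ones produced above.

```python
import csv
import os
import tempfile


def read_examples(path):
    """Read a CSV of QA examples, converting answer_start back to int."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["answer_start"] = int(row["answer_start"])
    return rows


# Toy row (not from the corpora) with a newline inside the context
context = "पहली पंक्ति\nदूसरी पंक्ति"
rows = [{
    "context": context,
    "question": "कौन सी पंक्ति?",
    "answer_text": "दूसरी",
    "answer_start": context.index("दूसरी"),
    "language": "hindi",
}]

tmp_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(tmp_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

restored = read_examples(tmp_path)
print(restored == rows)
```

The same `read_examples` call can be pointed at `mlqa_hindi.csv` or `xquad.csv` to spot-check that the answer offsets still line up after saving.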

In [None]:
xquad.head(5)

In [None]:
mlqa.head(5)