# ChAII starter notebook

**Please do not edit this notebook. Make a copy of this notebook and have fun experimenting with different methods.**

This is a starter notebook for running a baseline mBERT model on the task. This is a standalone notebook which will allow you to do the following:
1. Train an mBERT model on the ChAII data (utilizing Colab GPUs), 
2. Get dev evaluation numbers, 
3. Generate a submission for the Kaggle leaderboard with the appropriate format.

We will not submit this notebook. Instead, we will use it to verify our results and use the [ChAII-1 Inference](https://www.kaggle.com/deeplearning10/chaii-1-inference?scriptVersionId=71218416) notebook to submit them.

This notebook uses the [Xtreme](https://github.com/google-research/xtreme) codebase for training QA-finetuned models. Feel free to run your own scripts/variants locally, experiment with other models and pipelines, not necessarily limited to Xtreme. Here are some caveats of this method:
1. Since runtimes (GPU/TPU) are re-allocated when the notebook restarts, some pip package installations need to be rerun every time the notebook is reconnected. (Refer to [this](https://www.kaggle.com/samsammurphy/pip-install-forever) if you want to use Kaggle notebooks for the task.)

Given the above conditions, we encourage you to have local installations and clones of the Xtreme codebase (with some changes mentioned below), use this notebook for training with GPU (and inference), and conduct evaluations locally.

Since internet access must be disabled when submitting the notebook, we add all external packages and the codebase as Kaggle datasets. You can find a list of all these resources below. 
1. [Modified Xtreme codebase](https://www.kaggle.com/deeplearning10/modified-xtreme): This link contains the Xtreme codebase along with some convenience scripts to run the experiments. 
2. [External Packages](https://www.kaggle.com/deeplearning10/external-packages): Since we build the Xtreme codebase from source, we need some Python packages. Normally you can install them via `!pip install package`, but that requires internet access. We have therefore downloaded the .whl files for these packages and uploaded them as a Kaggle dataset, so the packages can be installed offline. 
3. [Bert-base-multilingual-cased](https://www.kaggle.com/deeplearning10/bert-base-multilingual-cased): This is a pretrained mBERT model provided by Hugging Face. 
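
For illustration, the offline install described in item 2 amounts to pointing pip at a local directory of package files instead of the package index. A minimal sketch, assuming the files have been copied to `/kaggle/working/chaii-packages` (the directory this notebook creates further down); the helper name `offline_pip_command` is hypothetical:

```python
import sys

# Assumed location of the copied .whl / .tar.gz files.
WHEEL_DIR = "/kaggle/working/chaii-packages"

def offline_pip_command(package, wheel_dir=WHEEL_DIR):
    """Build a pip command that installs `package` without network access."""
    return [
        sys.executable, "-m", "pip", "install", package,
        "--no-index",               # do not contact the package index
        "--find-links", wheel_dir,  # look for distributions in this local directory
    ]

print(" ".join(offline_pip_command("seqeval")))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` when the install should actually be performed.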

In [None]:
!which python # should return /usr/local/bin/python
!python --version

In [None]:
!echo $PYTHONPATH # returns /env/python

In [None]:
# unset PYTHONPATH to prevent problems later
%env PYTHONPATH= 

In [None]:
# To verify the Miniconda installation
!conda --version # now returns 4.10.3
!python --version # now returns Python 3.6.13 :: Anaconda, Inc.

In [None]:
import sys
print(sys.path)

In [None]:
import sys
sys.path.append("/kaggle/working/chaii-packages")

In [None]:
%%bash
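# -p avoids an error if the directory already exists (e.g. when the cell is rerun)
mkdir -p /kaggle/working/chaii-packages
cd /kaggle/working/chaii-packages
cp /kaggle/input/external-packages/* /kaggle/working/chaii-packages
# The source archives are stored with a .xyz extension in the dataset; restore .tar.gz
mv ./botocore-1.21.17.xyz ./botocore-1.21.17.tar.gz
mv ./jieba-0.42.1.xyz ./jieba-0.42.1.tar.gz
mv ./seqeval-1.2.2.xyz ./seqeval-1.2.2.tar.gz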

### Xtreme codebase setup

Now, we will set up the Xtreme repo ([link](https://github.com/google-research/xtreme)). The cells below do the following:
1. Clone Xtreme. Here we use a modified version of Xtreme that can be added to a Kaggle notebook as a dataset. 
2. Create a Conda env called ```xtreme``` and install dependencies into it.

The cell below is a modified version of ```xtreme/install_tools.sh```. 

**Note:** Those using the Xtreme repo locally may also encounter errors with the original script, such as with ```conda activate```. You can copy-paste this script to resolve the error.

In [None]:
%%bash
# First, we need to install required dependencies. Instead of running their install_tools.sh, run this cell, which has a few minor modifications. This may take a few minutes to run.
# TODO: look into pip install forever
cd /kaggle/input/ # Optional but recommended
cd modified-xtreme/
# Copyright 2020 Google and DeepMind.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

REPO=$PWD
echo $REPO
LIB=$REPO/third_party
mkdir -p $LIB

# install the bundled transformers library from source (offline)
cd $LIB
cd transformers
pip install . --no-index --find-links /kaggle/working/chaii-packages/
cd $LIB

# pip install seqeval --no-index --find-links /kaggle/working/chaii-packages/
# pip install tensorboardx --no-index --find-links /kaggle/working/chaii-packages/
# pip install tqdm --no-index --find-links /kaggle/working/chaii-packages/

# # install XLM tokenizer
# pip install sacremoses --no-index --find-links /kaggle/working/chaii-packages/
# pip install pythainlp --no-index --find-links /kaggle/working/chaii-packages/
# pip install jieba --no-index --find-links /kaggle/working/chaii-packages/

# #git clone https://github.com/neubig/kytea.git && cd kytea
# #./configure --prefix=${CONDA_PREFIX}
# #make && make install
# pip install kytea --no-index --find-links /kaggle/working/chaii-packages/

## Training

Although this codebase can be used for many varieties of training methods and experiments, we will only train a straightforward baseline. We will create a monolingual Hindi QA model. We encourage you to read and experiment with the Xtreme codebase, and also with other repos. Some promising avenues:

* Train model on both Hindi and Tamil ChAII data,
* Multi-task learning with Xtreme,
* Annotate your own data into a QA format and augment training,
* Zero-shot transfer learning

The cells below do the following:

1. Convert the given ChAII data (from the competition) to the QA (SQuAD) format and split it into train and dev sets.
2. Finetune mBERT (bert-base-multilingual-cased) on the ChAII data.
3. Save the model and dev predictions into the Kaggle outputs folder for later evaluation.

In [None]:
# Load ChAII dataset
import json
import random
import pandas as pd
from pathlib import Path

pd.set_option("display.max_rows", 20, "display.max_columns", None)

data_path = Path("/kaggle/input/chaii-hindi-and-tamil-question-answering/")
json_dicts = []

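def get_dataframe(file_path):
    # Read the CSV as UTF-8 (Hindi/Tamil text) and keep every column as a string
    df = pd.read_csv(file_path, encoding="utf-8")
    df = df.astype(str)
    # Strip surrounding whitespace from every cell.
    # Note: if a context starts with whitespace, this shifts its answer_start offset.
    df = df.apply(lambda x: x.str.strip())
    return df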

train_data = get_dataframe(data_path / "train.csv")
test_data = get_dataframe(data_path / "test.csv")
test_data

### Data conversion to QA format

The below cells convert TyDiQA and ChAII Kaggle data format to the SQuAD QA format, so it can be used with the Xtreme pipeline. 
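
As a rough illustration of the target structure, each SQuAD-style entry nests a question and its answers under a paragraph's context, and `answer_start` is a character offset into that context. The sketch below builds one toy entry (the English text is made up for illustration and is not taken from the ChAII data) and checks that the offset really points at the answer; `answer_span_is_consistent` is a hypothetical helper name:

```python
def answer_span_is_consistent(qa_json):
    """Return True if every answer's text is found at its answer_start offset."""
    for paragraph in qa_json["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                end = start + len(answer["text"])
                if start < 0 or context[start:end] != answer["text"]:
                    return False
    return True

# Toy entry in the same shape the conversion functions produce.
toy_entry = {
    "title": "",
    "paragraphs": [{
        "context": "Delhi is the capital of India.",
        "qas": [{
            "question": "What is the capital of India?",
            "id": "hindi-0000",
            "answers": [{"text": "Delhi", "answer_start": 0}],
        }],
    }],
}

print(answer_span_is_consistent(toy_entry))
```

Running such a check over the converted training data is one way to catch offsets that no longer line up with the context, for example after whitespace stripping.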

In [None]:
# Convert a ChAII Kaggle row to the QA (SQuAD) format
def convert_to_qa_format_kaggle(row):
    answer = {}
    try:
        answer["text"] = row["answer_text"]
        answer["answer_start"] = int(row["answer_start"])
    except (KeyError, ValueError):
        # Test rows have no answer columns; fall back to an empty placeholder answer
        answer["text"] = ''
        answer["answer_start"] = -1
    qa_json = {
        "title": "",
        "paragraphs": [
            {
                "context": row["context"],
                "qas": [
                    {
                        "question": row["question"],
                        "id": row["language"] + '-' + str(row["id"]),
                        "answers": [answer]
                    }
                ]
            }
        ],
    }
    
    return qa_json

# Process one language at a time
# Here chaii_data is a pandas dataframe
def get_qa_data_from_kaggle_format(chaii_data, language):
    qa_data = {"data":[], "version":f"chaii_{language}"}
    for index, row in chaii_data.iterrows():
        if row["language"] == language:
            qa_datapoint = convert_to_qa_format_kaggle(row)
            qa_data["data"].append(qa_datapoint)

    print("QA (SQuAD) format:")
    print(qa_data["data"][0])
    return qa_data

hi_qa_data = get_qa_data_from_kaggle_format(train_data, 'hindi')
hi_test_qa_data = get_qa_data_from_kaggle_format(test_data, 'hindi')
ta_qa_data = get_qa_data_from_kaggle_format(train_data, 'tamil')
ta_test_qa_data = get_qa_data_from_kaggle_format(test_data, 'tamil')

In [None]:
# Split datapoints language-wise and into QA format
# Run this cell only if you need to convert from TyDiQA to SQuAD format; otherwise, run the next one.
import re

from pprint import pprint

def byte_str(text):
  return text.encode("utf-8")

def byte_len(text):
  # Python 3 encodes text as character sequences, not byte sequences
  # (like Python 2).
  return len(byte_str(text))

def byte_slice(text, start, end, errors="replace"):
  # Python 3 encodes text as character sequences, not byte sequences
  # (like Python 2).
  return byte_str(text)[start:end].decode("utf-8", errors=errors)

def convert_to_qa_format_tydiqa(tydi_json):
  answer = {}
  for annotation in tydi_json["annotations"]:
    minimal_answer = annotation["minimal_answer"]
    if minimal_answer["plaintext_start_byte"] != -1 and minimal_answer["plaintext_end_byte"] != -1:
      answer["text"] = byte_slice(tydi_json["document_plaintext"],minimal_answer["plaintext_start_byte"],minimal_answer["plaintext_end_byte"])
      # Plain substring search: the answer text must not be interpreted as a regex pattern
      answer["answer_start"] = tydi_json["document_plaintext"].find(answer["text"])
      break
  if answer == {}:
    return {}
  
  qa_json = {
      "title" : tydi_json["document_title"],
      "paragraphs" : [
                      {
                          "context": tydi_json["document_plaintext"],
                          "qas" : [
                                   {
                                    "question" : tydi_json["question_text"],
                                    "id" : tydi_json["language"] + '-' + str(tydi_json["example_id"]),
                                    "answers" : [answer],
                                   }
                          ]
                      }
      ],
  }

  return qa_json

# Here chaii_data is a list of TyDiQA-style JSON dicts
def get_qa_data_from_tydiqa_format(chaii_data, language):
    qa_data = {"data":[], "version":f"chaii_{language}"}
    for json_dict in chaii_data:
      if json_dict["language"] == language:
        qa_datapoint = convert_to_qa_format_tydiqa(json_dict)
        # Keep only examples that have a minimal answer
        if qa_datapoint != {}:
          qa_data["data"].append(qa_datapoint)

    print("QA (SQuAD) format:")
    if qa_data["data"]:
        print(qa_data["data"][0])
    return qa_data
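
TyDiQA reports answer positions as UTF-8 byte offsets, while the SQuAD format expects character offsets. For Hindi or Tamil text the two differ, because Devanagari and Tamil characters each take three bytes in UTF-8. As an alternative to searching for the answer text, a byte offset can be converted directly. A minimal sketch (the helper `byte_to_char_offset` is hypothetical and not part of the Xtreme codebase; the example sentence is invented):

```python
def byte_to_char_offset(text, byte_offset):
    """Convert a UTF-8 byte offset into `text` to a character (code point) offset."""
    prefix = text.encode("utf-8")[:byte_offset]
    # Strict decoding raises UnicodeDecodeError if the offset splits a character.
    return len(prefix.decode("utf-8"))

context = "नमस्ते दुनिया"
answer = "दुनिया"

# Byte offset at which the answer begins, as a TyDiQA-style annotation would give it.
byte_start = len("नमस्ते ".encode("utf-8"))
char_start = byte_to_char_offset(context, byte_start)

print(byte_start, char_start)
print(context[char_start:char_start + len(answer)] == answer)
```

Unlike a text search, this keeps the annotated position even when the answer string occurs more than once in the context.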

In [None]:
# Splitting data into train and dev and saving converted QA formats
def split_data(qa_data, test_qa_data, lang_code):
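    split_data_path = Path("/kaggle/working/chaii_data/")
    # Create the output directory; no error if it already exists (second language, reruns)
    split_data_path.mkdir(parents=True, exist_ok=True)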

    qa_data_datapoints = qa_data["data"]
    test_qa_data_datapoints = test_qa_data["data"]
    random.shuffle(qa_data_datapoints)
    train_size = int(len(qa_data_datapoints)*0.8)
    train_qa_data_datapoints, dev_qa_data_datapoints = qa_data_datapoints[:train_size], qa_data_datapoints[train_size:]
    
    train_qa_data = {"data":train_qa_data_datapoints, "version":f"chaii_{lang_code}_train"}
    dev_qa_data = {"data":dev_qa_data_datapoints, "version":f"chaii_{lang_code}_dev"}
    test_qa_data = {"data": test_qa_data_datapoints, "version":f"chaii_{lang_code}_test"}

    with open(split_data_path / f"train.{lang_code}.qa.jsonl",'w') as f:
      json.dump(train_qa_data,f)

    with open(split_data_path / f"dev.{lang_code}.qa.jsonl",'w') as f:
      json.dump(dev_qa_data,f)

    with open(split_data_path / f"test.{lang_code}.qa.jsonl",'w') as f:
      json.dump(test_qa_data,f)

    print(f"{lang_code} Training data size: %d" % len(train_qa_data_datapoints))
    print(f"{lang_code} Dev data size: %d" % len(dev_qa_data_datapoints))
    print(f"{lang_code} Test data size: %d" % len(test_qa_data_datapoints))
    
split_data(hi_qa_data, hi_test_qa_data, 'hi')
split_data(ta_qa_data, ta_test_qa_data, 'ta')

The cell below is optional (we have not used it for our baseline model), but it downloads the original TyDiQA data in the QA format. You can combine it with our ChAII data and boost training!

### Training mBERT on Hindi ChAII data

The script below uses the Xtreme training scripts to fine-tune the model. To train on the ChAII data, the code in the repo folders needs to be modified. You can double-click the scripts to open and edit them. 

For the baseline, the following changes were made to the Xtreme repo code:


1.   In ```scripts/train.sh```, an additional task called "chaii_hi" was added as follows:
```
...
elif [ $TASK == 'chaii_hi' ]; then
  bash $REPO/scripts/train_qa.sh $MODEL chaii_hi $TASK $GPU $DATA_DIR $OUT_DIR
...
```
2.   In ```scripts/train_qa.sh```, the following flags were added:
```
TRAIN_LANG="en"
EVAL_LANG="en"
```
Another elif branch was added as follows to set the data directory paths:
```
...
elif [ $SRC == 'chaii_hi' ]; then
  TASK_DATA_DIR=${DATA_DIR}
  TRAIN_FILE=${TASK_DATA_DIR}/train.hi.qa.jsonl
  PREDICT_FILE=${TASK_DATA_DIR}/dev.hi.qa.jsonl
  TRAIN_LANG="hi"
  EVAL_LANG="hi"
...
```
Finally, TRAIN_LANG and EVAL_LANG replaced the hardcoded "en":
```
 --weight_decay 0.0001 \
  --threads 8 \
  --train_lang ${TRAIN_LANG} \
  --eval_lang ${EVAL_LANG}
```

If you want to make your own changes for experimentation, clone the xtreme repo locally in the mount folder and mount it as part of the docker container. 

Finally, we create a run.sh script in the current root directory, and paste the following commands:

```
#!/bin/bash

TASK=${1:-chaii_hi}
DATA_DIR=${2:-"/root/xtreme/download/chaii_data/"}
OUT_DIR=${3:-"/root/xtreme/outputs-temp/"}
MODEL=${4:-bert-base-multilingual-cased}
GPU=${5:-0}
TRAIN_FILE_NAME=${6}
PREDICTIONS_DIR=${7:-"/kaggle/working/predictions/"}
PREDICT_FILE_NAME=${8}

source activate xtreme
cd /root/xtreme
bash scripts/train.sh $MODEL $TASK $GPU $DATA_DIR $OUT_DIR $TRAIN_FILE_NAME $PREDICTIONS_DIR $PREDICT_FILE_NAME

```
Your model should be stored in ```/kaggle/working/outputs-temp/```.
Similar changes were made for the chaii_ta task. 
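
For illustration, the positional arguments of `run.sh` can be assembled programmatically, which makes it harder to mix up their order. The sketch below only builds the argument list and does not start training; the default paths are the ones this notebook uses on Kaggle, and the helper name `build_run_command` is hypothetical:

```python
def build_run_command(lang_code,
                      data_dir="/kaggle/working/chaii_data",
                      out_dir="/kaggle/working/outputs-temp",
                      model="/kaggle/input/bert-base-multilingual-cased",
                      gpu=0,
                      predictions_dir="/kaggle/working/eval_dir/predictions"):
    """Return the argument list for run.sh, in the order the script reads $1..$8."""
    return [
        "bash", "/kaggle/input/modified-xtreme/run.sh",
        f"chaii_{lang_code}",           # $1 TASK
        data_dir,                       # $2 DATA_DIR
        out_dir,                        # $3 OUT_DIR
        model,                          # $4 MODEL
        str(gpu),                       # $5 GPU
        f"train.{lang_code}.qa.jsonl",  # $6 TRAIN_FILE_NAME
        predictions_dir,                # $7 PREDICTIONS_DIR
        f"dev.{lang_code}.qa.jsonl",    # $8 PREDICT_FILE_NAME
    ]

print(" ".join(build_run_command("hi")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, and `build_run_command("ta")` gives the Tamil equivalent.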

In [None]:
# Now that the data is prepared, you can run the training script directly from the repo. The easiest way is to create a new file called run.sh in the home folder, paste in the run.sh commands shown above, and then run this cell.
# Also ensure that you set your runtime type to GPU for training.

# Since you will want to experiment with different models, download the modified-xtreme codebase, modify the 
# code and upload the codebase as a new dataset (private or public). 

# run.sh ${TASK} ${DATA_DIR} ${OUT_DIR} ${MODEL} ${GPU} ${TRAIN_FILE_NAME} ${PREDICTIONS_DIR} ${PREDICT_FILE_NAME}
# train.sh ${MODEL} ${TASK} ${GPU} ${DATA_DIR} ${OUT_DIR} ${TRAIN_FILE_NAME} ${PREDICTIONS_DIR} ${PREDICT_FILE_NAME}
# train_qa.sh ${MODEL} ${MODEL_NAME} ${SRC} ${TGT} ${GPU} ${DATA_DIR} ${OUT_DIR} ${TRAIN_FILE_NAME} ${PREDICT_FILE_NAME}
# predict_qa.sh ${MODEL} ${MODEL_TYPE} ${MODEL_PATH} ${TGT} ${GPU} ${DATA_DIR} ${PREDICTIONS_DIR} ${PREDICT_FILE_NAME}

!bash /kaggle/input/modified-xtreme/run.sh chaii_hi /kaggle/working/chaii_data /kaggle/working/outputs-temp /kaggle/input/bert-base-multilingual-cased 0 train.hi.qa.jsonl /kaggle/working/eval_dir/predictions dev.hi.qa.jsonl 

In [None]:
# Tamil Training

!bash /kaggle/input/modified-xtreme/run.sh chaii_ta /kaggle/working/chaii_data /kaggle/working/outputs-temp /kaggle/input/bert-base-multilingual-cased 0 train.ta.qa.jsonl /kaggle/working/eval_dir/predictions dev.ta.qa.jsonl

## Inference and Evaluation

For inference, we make the following modification to the Xtreme repo:
1. In ```predict_qa.sh```, add the following (line 40):
```
elif [ $TGT == 'chaii_hi' ]; then
  langs=( hi )
```


Also, we create a bash file (similar to ```run.sh```) called ```predict.sh```, and copy the commands below into it:

```
#!/bin/bash

source activate xtreme
cd /root/xtreme

MODEL_PATH=${1:-"/root/xtreme/outputs-temp/chaii_hi/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384"}
TASK=${2:-chaii_hi}
DATA_DIR=${3:-"/root/xtreme/download/chaii_data/"}
PREDICTIONS_DIR=${4:-"/root/xtreme/predictions/"}
MODEL=${5:-bert-base-multilingual-cased}
MODEL_TYPE=${6:-bert}
GPU=${7:-0}
PREDICT_FILE_NAME=${8}
 
bash scripts/predict_qa.sh $MODEL $MODEL_TYPE $MODEL_PATH $TASK $GPU $DATA_DIR $PREDICTIONS_DIR $PREDICT_FILE_NAME
```

In [None]:
# Predict on train to see performance

!bash /kaggle/input/modified-xtreme/predict.sh "/kaggle/working/outputs-temp/chaii_hi/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384" \
      chaii_hi "/kaggle/working/chaii_data/" "/kaggle/working/eval_dir/predictions/" "bert-base-multilingual-cased" "bert" 0 train.hi.qa.jsonl

In [None]:
!bash /kaggle/input/modified-xtreme/predict.sh "/kaggle/working/outputs-temp/chaii_ta/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384" \
      chaii_ta "/kaggle/working/chaii_data/" "/kaggle/working/eval_dir/predictions/" "bert-base-multilingual-cased" "bert" 0 train.ta.qa.jsonl

In [None]:
# If you trained the model on a local machine and want to evaluate you can use this cell
# predict.sh ${MODEL_PATH} ${TASK} ${DATA_DIR} ${PREDICTIONS_DIR} ${MODEL} ${MODEL_TYPE} ${GPU} ${PREDICT_FILE_NAME}
# predict_qa.sh ${MODEL} ${MODEL_TYPE} ${MODEL_PATH} ${TGT} ${GPU} ${DATA_DIR} ${PREDICTIONS_DIR} ${PREDICT_FILE_NAME}

!bash /kaggle/input/modified-xtreme/predict.sh "/kaggle/working/outputs-temp/chaii_hi/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384" \
      chaii_hi "/kaggle/working/chaii_data/" "/kaggle/working/eval_dir/predictions/"

In [None]:
# Tamil Inference

!bash /kaggle/input/modified-xtreme/predict.sh "/kaggle/working/outputs-temp/chaii_ta/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384" \
      chaii_ta "/kaggle/working/chaii_data/" "/kaggle/working/eval_dir/predictions/"

In [None]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def evaluate(lang_code):
    # For evaluating the predictions, we will use our custom script which uses jaccard mean 
    import json
#     with open(f"/kaggle/working/outputs-temp/chaii_{lang_code}/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384/predictions_{lang_code}_.json") as f:
#       preds = json.load(f)
    with open(f"/kaggle/working/eval_dir/predictions/predictions_{lang_code}_.json") as f:
        preds = json.load(f)

    with open(f"/kaggle/working/chaii_data/dev.{lang_code}.qa.jsonl") as f:
      dev_data = json.load(f)
    
    # Prediction keys look like "<language>-<id>"; split once so an id containing '-' stays intact
    submission_preds = [{'id':k.split('-', 1)[1], 'PredictionString': v} for k, v in preds.items()]
    
    # write submissions file
    df_ = pd.DataFrame.from_dict(submission_preds)
    df_.to_csv(f'/kaggle/working/eval_dir/chaii_{lang_code}_submission.csv', index=False)
    
    from pprint import pprint
    jaccard_mean = 0
    dev_answer_pair_matches = []
    for d in dev_data['data']:
        for para in d['paragraphs']:
            for qa in para['qas']:
                sample_jaccard = jaccard(qa['answers'][0]['text'], preds[qa['id']])
                jaccard_mean += sample_jaccard
                dev_answer_pair_matches.append({'context':para['context'],'question':qa['question'],'gold_answer':qa['answers'],'mbert_pred':preds[qa['id']],'id':qa['id']})

    jaccard_mean /= len(dev_answer_pair_matches)
    print(f"Jaccard Mean for chaii_{lang_code}: {jaccard_mean}")
    
    return dev_answer_pair_matches
    
    
    
dev_answer_pair_matches_hi = evaluate("hi")
dev_answer_pair_matches_ta = evaluate("ta")
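
To make the metric concrete, here is a small worked example of the word-level Jaccard score. It mirrors the `jaccard` function above, with one added assumption: two empty strings are treated as a perfect match (1.0), where the original would raise `ZeroDivisionError`. The name `jaccard_safe` and the example strings are invented for illustration:

```python
def jaccard_safe(str1, str2):
    """Word-level Jaccard similarity; two empty strings count as a perfect match."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0  # assumption: avoids dividing by zero when both inputs are empty
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))

print(jaccard_safe("New Delhi", "Delhi"))      # one shared word out of two distinct words
print(jaccard_safe("New Delhi", "new delhi"))  # case-insensitive exact match
print(jaccard_safe("Mumbai", "Chennai"))       # no words in common
```

Because the score works on whole words, a prediction that drops or adds a single word in a short answer is penalised heavily.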

In [None]:
%%bash
# Combine predictions for all languages into a single submission.csv file
cd /kaggle/working/eval_dir
# Use > (not >>) for the first file so that rerunning the cell does not duplicate rows
cat chaii_hi_submission.csv > /kaggle/working/submission.csv
tail -n +2 chaii_ta_submission.csv >> /kaggle/working/submission.csv
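
An equivalent way to merge the per-language files, sketched here with Python's standard `csv` module, rewrites the output from scratch on every run, so re-running it cannot duplicate rows. The function name `combine_submissions` is hypothetical; the column names are the ones written by `evaluate`:

```python
import csv

def combine_submissions(csv_paths, out_path):
    """Merge per-language submission CSVs into one file with a single header row."""
    rows = []
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    # Mode "w" overwrites any existing file instead of appending to it.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "PredictionString"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

In this notebook it would be called with the two files under `/kaggle/working/eval_dir/` and `/kaggle/working/submission.csv` as the output path; the return value is the number of prediction rows written.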

In [None]:
!wc -l /kaggle/working/eval_dir/chaii_hi_submission.csv
!wc -l /kaggle/working/eval_dir/chaii_ta_submission.csv
!wc -l /kaggle/working/submission.csv

In [None]:
def write_dev_answer_pair_matches(dev_answer_pair_matches, lang_code):
    #Matches in predictions
    correct_ans = [d for d in dev_answer_pair_matches if d['mbert_pred']==d['gold_answer'][0]['text']]
    with open(f'/kaggle/working/eval_dir/correct_chaii_{lang_code}_mbert.txt','w',encoding='utf-8') as f:
      for c in correct_ans:
        f.write(f"id:{c['id']}\n")
        f.write(f"context:{c['context']}\n")
        f.write(f"question:{c['question']}\n")
        f.write(f"gold_answer:{c['gold_answer'][0]['text']}\n")
        f.write(f"mbert_pred:{c['mbert_pred']}\n")
        f.write("\n\n")
        
    #Mismatches in predictions
    wrong_ans = [d for d in dev_answer_pair_matches if d['mbert_pred']!=d['gold_answer'][0]['text']]
    with open(f'/kaggle/working/eval_dir/wrong_chaii_{lang_code}_mbert.txt','w',encoding='utf-8') as f:
      for c in wrong_ans:
        f.write(f"id:{c['id']}\n")
        f.write(f"context:{c['context']}\n")
        f.write(f"question:{c['question']}\n")
        f.write(f"gold_answer:{c['gold_answer'][0]['text']}\n")
        f.write(f"mbert_pred:{c['mbert_pred']}\n")
        f.write("\n\n")
    
    return correct_ans, wrong_ans
        
correct_ans, wrong_ans = write_dev_answer_pair_matches(dev_answer_pair_matches_hi, "hi")

In [None]:
len(correct_ans),len(wrong_ans)