# **Fine-tune a transformer model for our use case**




### In this notebook we will demonstrate the process of fine-tuning a pretrained Hugging Face transformer model. I hope this notebook helps you understand how to use a pretrained transformer model in the context of this competition.

### In my previous notebook ([here](https://www.kaggle.com/vaibhavrmankar/simple-start-eda-submission)) I did the EDA. If you are new to the competition, you might want to look at that notebook to understand the problem statement and the given data.

### This notebook is built on the example notebook provided by Hugging Face.
### Ref: [here](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb)



## SETUP

### We need to install the Hugging Face Datasets and Transformers libraries.

In [None]:
! pip install datasets transformers

In [None]:
import transformers

print(transformers.__version__)

### Note: if you get an error message after running the following cell, please restart the kernel.


In [None]:
from datasets import load_dataset
from datasets import Dataset

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer
from transformers import default_data_collator
from transformers import BertTokenizer, pipeline

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Which question-answering model to fine-tune?

### There are several question-answering models to choose from. We can try out different models and compare the results.


In [None]:
model_checkpoint = "mrm8488/bert-tiny-5-finetuned-squadv2"
batch_size = 4

max_length = 384 # The maximum length of a feature (question and context), in tokens
doc_stride = 128 # The allowed overlap (in tokens) between two consecutive parts of the context when it has to be split
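
To get a feel for how `max_length` and `doc_stride` interact, here is a minimal sketch of the sliding-window idea. It is a simplification, not the tokenizer's actual implementation: the helper name `split_with_stride` is hypothetical, and it ignores the question tokens and special tokens that also take up room in each feature.

```python
def split_with_stride(num_context_tokens, window, stride):
    """Return (start, end) token ranges of overlapping windows over a context.

    Simplified illustration: `window` plays the role of the space left for the
    context inside max_length, and `stride` the role of doc_stride.
    """
    if stride >= window:
        raise ValueError("stride must be smaller than the window size")
    spans = []
    start = 0
    while True:
        end = min(start + window, num_context_tokens)
        spans.append((start, end))
        if end == num_context_tokens:
            break
        # Step forward while keeping `stride` tokens of overlap with the previous window.
        start += window - stride
    return spans


# Toy numbers: 10 context tokens, windows of 4 tokens, 2 tokens of overlap.
spans = split_with_stride(10, window=4, stride=2)
print(spans)
```

A larger overlap makes it less likely that an answer is cut in half at a window boundary, at the cost of producing more features per example.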


## Load the data

In [None]:
df = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/train.csv').sample(frac = 1)
df

### The training data contains many more Hindi data points than Tamil ones, so here we will use an equal number of data points for both languages.


In [None]:
df_hindi = df[df['language'] == 'hindi'].head(368)  # 368 rows, intended to match the number of Tamil rows
df_hindi

In [None]:
df_tamil = df[df['language'] == 'tamil']
df_tamil

In [None]:
df = pd.concat([df_hindi,df_tamil])
df

### In order to use the given data set we need to convert the pandas DataFrame into a Hugging Face Dataset object, which is done by the convert_to_dataset function:
* loop through the rows of the DataFrame
* represent each data point in the required format (data dict)
* convert the data dict to a Hugging Face Dataset object


In [None]:

def convert_to_dataset(DF): 
    data = {'answers':[],'context':[],'id':[],'question':[],'title':[]}

    for i in range(len(DF)):

        row = DF.iloc[i]
        data['answers'].append({'answer_start': [row['answer_start']], 'text': [row['answer_text']]})
        data['context'].append(row['context'])
        data['id'].append(row['id'])
        data['question'].append(row['question'])
        data['title'].append('NA')


    dataset = Dataset.from_dict(data)

    return dataset
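
The format above relies on `answer_start` being the exact character position of `answer_text` inside `context`. A minimal sketch of a sanity check for that assumption follows; the helper name `answer_is_aligned` and the toy rows are hypothetical, only the column names come from the data.

```python
def answer_is_aligned(context, answer_text, answer_start):
    """Return True if `answer_text` really sits at `answer_start` inside `context`."""
    if answer_start < 0:
        return False
    end = answer_start + len(answer_text)
    return context[answer_start:end] == answer_text


# Hypothetical toy rows with the same column names as train.csv.
rows = [
    {"context": "Chennai is the capital of Tamil Nadu.",
     "answer_text": "Tamil Nadu", "answer_start": 26},
    # Off by one on purpose, to show what a misaligned row looks like.
    {"context": "Chennai is the capital of Tamil Nadu.",
     "answer_text": "Tamil Nadu", "answer_start": 25},
]

misaligned = [r for r in rows if not answer_is_aligned(**r)]
print(len(misaligned), "misaligned row(s)")
```

On the real DataFrame the same check could be applied row by row, for example with `df.apply(lambda r: answer_is_aligned(r['context'], r['answer_text'], r['answer_start']), axis=1)`.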


### Split the data into train and test.

In [None]:
train, test = train_test_split(df, test_size=0.1)

train_dataset = convert_to_dataset(train)
test_dataset = convert_to_dataset(test)

## Preprocess data

### Data preprocessing is an important step; it uses the tokenizer of the pre-trained model.

### To do this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures:

* we get a tokenizer that corresponds to the model architecture we want to use,
* we download the vocabulary used when pretraining this specific checkpoint.
 


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
pad_on_right = tokenizer.padding_side == "right"

In [None]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
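
The core of the labelling step is turning a character span into a token span with the help of the offset mapping. The sketch below isolates that idea on toy data. It is a simplified illustration, not the function above: the helper name `char_span_to_token_span` is hypothetical, and the offsets cover context tokens only, with no question, special, or padding tokens.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to (first_token, last_token).

    `offsets` holds one (start_char, end_char) pair per context token.
    Returns None when the answer is not fully inside these offsets, which is the
    case the notebook labels with the CLS index.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    first = 0
    while first < len(offsets) and offsets[first][0] <= start_char:
        first += 1
    last = len(offsets) - 1
    while last >= 0 and offsets[last][1] >= end_char:
        last -= 1
    # Both loops overshoot by one token, so step back.
    return first - 1, last + 1


# Toy context "Delhi is in India", split into four word-level tokens.
offsets = [(0, 5), (6, 8), (9, 11), (12, 17)]

print(char_span_to_token_span(offsets, 12, 17))  # answer "India"
print(char_span_to_token_span(offsets, 6, 11))   # answer "is in"
print(char_span_to_token_span(offsets, 20, 25))  # answer outside this window
```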

In [None]:

train_tokenized_dataset = train_dataset.map(prepare_train_features, batched=True, remove_columns=train_dataset.column_names)

In [None]:

test_tokenized_dataset = test_dataset.map(prepare_train_features, batched=True, remove_columns=test_dataset.column_names)

### Now we have prepared our data for the given task. We can start fine-tuning the model.


## Fine-tuning the model

### We first load the model using the AutoModelForQuestionAnswering.from_pretrained function and then fine-tune it on our data.


In [None]:

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

In [None]:
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    "test-squad",  # output directory for checkpoints
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1, 
    weight_decay=0.01,
)

### We use the default data collator to batch our processed examples together.


In [None]:
data_collator = default_data_collator

### We use the trainer for model training.

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=test_tokenized_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()
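
Without a metric function, the Trainer reports only the loss. To judge answer quality it helps to compare predicted and true answer strings directly. The sketch below uses a word-level Jaccard score, which to my understanding is the metric this competition is scored on; treat that, and the handling of two empty strings, as assumptions to verify against the competition page.

```python
def jaccard(prediction, reference):
    """Word-level Jaccard similarity between two answer strings."""
    a = set(prediction.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        # Assumption: two empty answers count as a perfect match.
        return 1.0
    return len(a & b) / len(a | b)


# Toy predictions against toy references.
pairs = [("new delhi", "delhi"), ("chennai", "chennai"), ("mumbai", "pune")]
scores = [jaccard(pred, ref) for pred, ref in pairs]
print(scores)
print(sum(scores) / len(scores))
```

Averaging this score over the held-out split gives a number that is easier to interpret than the evaluation loss.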

### Once we reach good enough accuracy, we save the model for future use.

In [None]:
trainer.save_model("test-squad-trained")

### We need to change the file structure to make loading the model and tokenizer easier.

In [None]:
!mkdir tokenizer 
!mkdir model

In [None]:
import shutil  
shutil.move('./test-squad-trained/config.json','./model/config.json')
shutil.move('./test-squad-trained/pytorch_model.bin','./model/pytorch_model.bin')

In [None]:
import os 
os.rename('./test-squad-trained','./tokenizer')
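
The two cells above assume the weights were saved as `pytorch_model.bin`. Recent versions of transformers save `model.safetensors` by default instead, in which case the `shutil.move` call would fail. Here is a minimal sketch that handles either file name; the helper name `split_saved_model` and the temporary demo directory are hypothetical and not part of the original notebook.

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: the weights file has one of these names, depending on the library version.
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def split_saved_model(saved_dir, model_dir, tokenizer_dir):
    """Move config and weights into model_dir; the rest becomes tokenizer_dir."""
    saved_dir, model_dir, tokenizer_dir = map(Path, (saved_dir, model_dir, tokenizer_dir))
    weights = [name for name in WEIGHT_FILES if (saved_dir / name).exists()]
    if not weights:
        raise FileNotFoundError(f"no weights file found in {saved_dir}")
    model_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # config.json is assumed to exist; shutil.move raises if it does not.
    for name in ["config.json", *weights]:
        shutil.move(str(saved_dir / name), str(model_dir / name))
        moved.append(name)
    if tokenizer_dir.exists():
        tokenizer_dir.rmdir()  # only succeeds when the directory is empty
    shutil.move(str(saved_dir), str(tokenizer_dir))
    return moved


# Demo on placeholder files in a temporary directory, so no real model is touched.
with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "test-squad-trained"
    saved.mkdir()
    for name in ("config.json", "model.safetensors", "vocab.txt"):
        (saved / name).write_text("placeholder")
    moved = split_saved_model(saved, Path(tmp) / "model", Path(tmp) / "tokenizer")
    tokenizer_files = sorted(p.name for p in (Path(tmp) / "tokenizer").iterdir())
    print(moved, tokenizer_files)
```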

#### The saved model can be used offline for the submission.


## Use fine-tuned model for Submission 


### After saving the model, we use it to generate the output.

### Load the saved model using the from_pretrained method.

In [None]:
tokenizer = BertTokenizer.from_pretrained("./tokenizer")

In [None]:
model = AutoModelForQuestionAnswering.from_pretrained("./model")

### Setup the question-answering pipeline.

In [None]:
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer, device=0)  # device=0 runs on the first GPU

### Iterate through the submission dataset and add the predictions.


In [None]:

data = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/test.csv')
SUB = pd.DataFrame(columns = ['id','PredictionString'])

for id_,C,Q,lan in data[["id","context", "question","language"]].to_numpy():
    
    result = nlp(context=C, question=Q)    
    SUB.loc[len(SUB.index)] = [id_,result['answer']] 
    
SUB

In [None]:
SUB.to_csv('submission.csv', index=False)

## Thank you for reading. Happy to hear any thoughts/suggestions :)
