<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Introduction</p>

![](https://assets.thehansindia.com/h-upload/2020/11/25/1600x960_1014218-indian-languages.jpg)

With nearly 1.4 billion people, India is the second-most populated country in the world. Yet Indian languages, like Hindi and Tamil, are underrepresented on the web. Popular Natural Language Understanding (NLU) models perform worse with Indian languages compared to English, the effects of which lead to subpar experiences in downstream web applications for Indian users. With more attention from the Kaggle community and your novel machine learning solutions, we can help Indian users make the most of the web.

**In this notebook, I build a PyTorch baseline for this competition, training a custom QA model on top of XLM-RoBERTa. I will keep improving this notebook in the coming days.**

<p p style = "font-family: garamond; font-size:20px; font-style: normal;background-color: #f6f5f6; color :#E94421; border-radius: 10px 10px; text-align:center">Your upvotes and comments motivate me to do more.</p>

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Import everything required</p>

In [None]:
import torch
import numpy as np
import pandas as pd
import torch.nn as nn
import seaborn as sns
from tqdm import tqdm
from collections import Counter
import matplotlib.pyplot as plt
from torch.optim import Adam,AdamW
from torch.utils.data import SequentialSampler
from torch.utils.data import Dataset,DataLoader
from sklearn.model_selection import StratifiedKFold
from transformers import XLMRobertaTokenizer,XLMRobertaModel,XLMRobertaConfig,AutoTokenizer,AutoModel,AutoConfig
from transformers import logging
from transformers import (
    get_cosine_schedule_with_warmup, 
    get_cosine_with_hard_restarts_schedule_with_warmup
)
import wandb
import random
import os
logging.set_verbosity_error()

In [None]:
import sys
sys.path.append("../input/torchcontrib/contrib-master")
import torchcontrib

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Configurations</p>


In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_1 = user_secrets.get_secret("WANDB_KEY")
os.environ["WANDB_API_KEY"] = secret_value_1


def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False ## benchmark mode may pick different kernels between runs

In [None]:
model_type='base'
config = {
    'base_model':f"../input/xlm-roberta-squad2/deepset/xlm-roberta-{model_type}-squad2",
    'batch_size':16,
    "epochs":3,
    'folds':7,
    'device':torch.device('cuda'),
    'num_reinit_layers':1,
    'dropout':0.1,
    'seed':42,
    'lr':5e-5,
    'eval_every':1, ##changed later
    'swa_freq':10 ##assumed value: train_model reads config['swa_freq'], which was never set here
}

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Dataset</p>

BERT-style models such as XLM-RoBERTa can handle at most 512 tokens per input, and most of the contexts in our dataset exceed this limit, so we process the samples in a way that respects the token length limit.

To do that, we use the tokenizer (backed by the `tokenizers` library) to convert each sample in the dataset into one or more examples.

```python
tokenizer(
    self.df['context'].values.tolist(),
    self.df['question'].values.tolist(),
    truncation="only_first",
    max_length=self.max_len,
    stride=self.doc_stride,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    padding="max_length",
)
```
```

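We use the context and the question to form each example. The context is truncated whenever the pair exceeds the maximum token length (`max_len=400` in the dataset class below), and the overflowing part becomes additional examples. Consecutive examples overlap by a `stride` of 128 tokens, so an answer near a window boundary still appears with surrounding context.
`overflow_to_sample_mapping` is used to map these output examples back to the input dataset indices.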
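
To make the windowing concrete, here is a minimal pure-Python sketch of the idea. It is a simplification, not the tokenizer's implementation: `sliding_windows` is a hypothetical helper, integers stand in for token ids, and the room that the question and special tokens take up in each window is ignored.

```python
def sliding_windows(token_ids, max_len, stride):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows share `stride` tokens, so each new window
    starts max_len - stride tokens after the previous one.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += step
    return windows


# Two toy "contexts": one short, one long.
samples = [list(range(5)), list(range(12))]

all_windows, overflow_to_sample = [], []
for sample_idx, tokens in enumerate(samples):
    for window in sliding_windows(tokens, max_len=6, stride=2):
        all_windows.append(window)
        # remember which input sample this window came from
        overflow_to_sample.append(sample_idx)

print(all_windows)
print(overflow_to_sample)
```

The short sample produces a single window, while the long one is split into three overlapping windows that all map back to sample index 1.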

For each example in this list:
- if the `answer_text` is fully present in the example's context window, the start and end token indices of the answer are returned;
- otherwise the example is treated as having no answer in its context.
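
The labelling rule can be sketched on a toy example. This is an illustration under stated assumptions: whitespace-separated words stand in for subword tokens, index 0 plays the role of the leading special token with a dummy `(0, 0)` offset, and both helpers (`whitespace_offsets`, `char_span_to_token_span`) are hypothetical names that do not exist in the notebook.

```python
def whitespace_offsets(text):
    """Return (char_start, char_end) for every whitespace-separated word."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to token indices, or (0, 0) for "no answer".

    offsets[0] is the special token; context tokens start at index 1.
    """
    context = offsets[1:]
    # the answer must lie fully inside the characters covered by this window
    if start_char < context[0][0] or end_char > context[-1][1]:
        return 0, 0
    start_token = next(i for i, (_, e) in enumerate(offsets) if i > 0 and e > start_char)
    end_token = next(i for i, (_, e) in enumerate(offsets) if i > 0 and e >= end_char)
    return start_token, end_token


context = "Chennai is the capital of Tamil Nadu"
answer = "Tamil Nadu"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = [(0, 0)] + whitespace_offsets(context)

full_window = char_span_to_token_span(offsets, start_char, end_char)
# a window that only covers "Chennai is the capital" cannot contain the answer
short_window = char_span_to_token_span(offsets[:5], start_char, end_char)
print(full_window, short_window)
```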



In [None]:
class ChaiiDataset(Dataset):
    
    def __init__(self,df,max_len=400,doc_stride=128):
        
        self.df = df
        self.max_len = max_len 
        self.doc_stride = doc_stride
        self.labelled = 'answer_text' in df
        self.tokenizer = AutoTokenizer.from_pretrained(config['base_model'],add_special_tokens=True)        
        self.tokenized_samples = self.tokenizer(
                                self.df['context'].values.tolist(),
                                self.df['question'].values.tolist(),
                                truncation="only_first",
                                max_length=self.max_len,
                                stride=self.doc_stride,
                                return_overflowing_tokens=True,
                                return_offsets_mapping=True,
                                padding="max_length")
        
    
        
    def __getitem__(self,idx):
        
        data = {}
        ids,mask,offset = self.tokenized_samples['input_ids'][idx],\
                        self.tokenized_samples['attention_mask'][idx],\
                        self.tokenized_samples['offset_mapping'][idx]
        
        data['index'] = idx
        data['ids'] = torch.tensor(ids)
        data['mask'] = torch.tensor(mask)
        data['offset'] = offset
        if self.labelled:
            
            answer_text,start,end = self.get_targets(idx)
            data['answer_text'] = answer_text
            data['start'] = torch.tensor(start)
            data['end'] = torch.tensor(end)
            
        
        return data
        
    
    def get_targets(self,idx):
        
        df_index = self.tokenized_samples['overflow_to_sample_mapping'][idx]
        start_char = (self.df.iloc[df_index]['answer_start'])
        end_char = start_char + len(self.df.iloc[df_index]['answer_text'])
        offset = self.tokenized_samples['offset_mapping'][idx]
        sequence_ids = self.tokenized_samples.sequence_ids(idx)
        end_offset = len(self.tokenized_samples['input_ids'][idx])-1
        start_offset = 1
        while sequence_ids[end_offset] != 0:
            end_offset -= 1
            
            
        start_idx = 0;end_idx=0
        ## answer missing from, or only partially inside, this context window
        if (start_char < offset[start_offset][0] or end_char > offset[end_offset][1]):
            start_idx = 0;end_idx=0
            answer_text=""
        
        ## answer fully inside context
        else:
            #print("In third loop")
            i=0
            while (start_idx < len(offset) and offset[i][0]<=start_char and offset[i][1]<start_char):
                start_idx+=1
                i+=1
            end_idx = i
            while (end_idx < len(offset) and offset[i][1]<end_char):
                end_idx+=1
                i+=1
            answer_text = self.df.iloc[df_index]['answer_text'].strip()
            
        
        return answer_text,start_idx, end_idx 
    
    
    def post_process(self,batch,pred_start,pred_end):
        batch_pred,indices = [],[]
        for idx,start,end in zip(batch['index'],pred_start,pred_end):
            a,b = self.tokenized_samples['offset_mapping'][idx][start][0],self.tokenized_samples['offset_mapping'][idx][end][1]
            df_index = self.tokenized_samples['overflow_to_sample_mapping'][idx]

            if a>b:
                batch_pred.append("")
                indices.append(df_index)
            else: 
                pred_string = self.df.iloc[df_index]['context'][a:b].strip()   
                batch_pred.append(pred_string.strip())
                indices.append(df_index)

        return batch_pred,indices

    
    
    
    def __len__(self):
        return len(self.tokenized_samples['overflow_to_sample_mapping'])

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Model</p>
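I use the last-layer hidden states of XLM-RoBERTa, followed by dropout and a linear layer with two outputs per token, to predict the start and end logits of the answer span.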
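
How two numbers per token turn into a predicted span can be sketched without any deep-learning library. The snippet is only an illustration: `softmax` and `decode_span` are hypothetical helpers and the logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_span(token_logits):
    """token_logits holds one [start_logit, end_logit] pair per token.

    Returns the most likely start index and the most likely end index,
    each chosen independently over all token positions.
    """
    start_probs = softmax([pair[0] for pair in token_logits])
    end_probs = softmax([pair[1] for pair in token_logits])
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    return start, end


# made-up head outputs for a 4-token input
token_logits = [[0.1, 0.0], [2.5, 0.3], [0.2, 3.1], [0.0, 0.1]]
span = decode_span(token_logits)
print(span)
```

Because the two positions are picked independently, the end can land before the start; the notebook's `post_process` returns an empty string in that case.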

In [None]:
class ChaiiModel(nn.Module):
    
    def __init__(self):
        super(ChaiiModel,self).__init__()
        
        self.model_config = AutoConfig.from_pretrained(config['base_model'])
        self.model_config.return_dict=True
        self.model_config.hidden_dropout_prob = config['dropout']
        self.model_config.attention_probs_dropout_prob = config['dropout']
        self.model_config.output_hidden_states=True
        self.model = AutoModel.from_pretrained(config['base_model'],config=self.model_config)
        self.dropout = nn.Dropout(config['dropout'])
        self.fc = nn.Linear(self.model_config.hidden_size,2)
        self.__init_weights(self.fc)
        
    def __init_weights(self,module):
        if isinstance(module,nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.model_config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        
    def forward(self,input_ids,attention_mask):
        
        output = self.model(input_ids,attention_mask)
        hidden_states = output['hidden_states'][-1]
        #x = torch.stack([hidden_states[-1],hidden_states[-2],hidden_states[-3],hidden_states[-4]])
        #x = torch.mean(x,0)
        x = self.dropout(hidden_states)
        x = self.fc(x)
        start_logits,end_logits = x.split(1,dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
                
        return start_logits, end_logits
        
        
        
        

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Differential LR</p>
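The encoder layers get a different learning rate from the rest of the model. In mode `'k'`, parameters whose name contains `encoder` train at 2.6 times the base learning rate, everything else (including the embeddings and the QA head) trains at the base rate, and biases get no weight decay.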
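
The grouping rule can be illustrated on parameter names alone. This is a sketch, not the notebook's function: `group_parameter_names` is a hypothetical helper and the names are a small illustrative sample of what `named_parameters()` might return.

```python
def group_parameter_names(names, base_lr, encoder_lr_mult=2.6):
    """Bucket parameter names by (learning rate, weight decay).

    Names containing 'encoder' get base_lr * encoder_lr_mult,
    names containing 'bias' get no weight decay.
    """
    groups = {}
    for name in names:
        lr = base_lr * encoder_lr_mult if "encoder" in name else base_lr
        weight_decay = 0.0 if "bias" in name else 0.01
        groups.setdefault((lr, weight_decay), []).append(name)
    return groups


names = [
    "model.embeddings.word_embeddings.weight",
    "model.encoder.layer.0.attention.self.query.weight",
    "model.encoder.layer.0.attention.self.query.bias",
    "fc.weight",
    "fc.bias",
]

groups = group_parameter_names(names, base_lr=5e-5)
for (lr, weight_decay), members in groups.items():
    print(lr, weight_decay, members)
```

Note that the embedding matrix ends up in the base-rate group, because its name does not contain `encoder`.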

In [None]:
def get_params(mode,model,lr):
    
    if mode == 'i':
        ## single learning rate; no weight decay for biases and LayerNorm weights
        no_decay = ["bias", "LayerNorm.weight"]
        return [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": 0.01,'lr':lr,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,'lr':lr,
            },
        ]
            
    elif mode =='k':
        ## encoder parameters train at lr*2.6, everything else at lr
        non_bert = [(n,p) for n,p in model.named_parameters() if 'encoder' not in n]
        bert = [(n,p) for n,p in model.named_parameters() if 'encoder' in n]
        no_decay = ['bias']

        return [{'params':[p for n,p in non_bert if not any(k in n for k in no_decay)],
         'lr':lr,'weight_decay':0.01,'name':'non_bert_weights'},
         {'params':[p for n,p in non_bert if any(k in n for k in no_decay)],
         'lr':lr,'weight_decay':0.00,'name':'non_bert_bias'},
         {'params':[p for n,p in bert if not any(k in n for k in no_decay)],
         'lr':lr*2.6,'weight_decay':0.01},
         {'params':[p for n,p in bert if any(k in n for k in no_decay)],
         'lr':lr*2.6,'weight_decay':0.00}

        ]
    else:
        raise ValueError(f"Unknown mode: {mode}")

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Reinitializing pretrained transformer blocks</p>
The idea is to reinitialize the last few layers of the model. It is motivated by transfer-learning results in computer vision, where lower pre-trained layers learn more general features while higher layers, closer to the output, specialize in the pre-training task.
Work on fine-tuning Transformers suggests that keeping every pre-trained layer is not always the most effective choice: it can slow down training and hurt performance.

Thanks to this great [notebook](https://www.kaggle.com/rhtsingh/on-stability-of-few-sample-transformer-fine-tuning) by @rhtsingh.
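
As a small illustration of "reinitialize the last N layers", the sketch below selects the layers by slicing and redraws toy weights from a zero-mean normal distribution. Everything here is hypothetical: the helper names, the layer names and the `std=0.02` value are assumptions for the example, not values read from the model config.

```python
import random


def layers_to_reinit(layer_names, num_reinit_layers):
    """Return the last num_reinit_layers entries.

    The explicit guard matters: layer_names[-0:] is the whole list,
    so slicing with 0 would select every layer instead of none.
    """
    if num_reinit_layers <= 0:
        return []
    return layer_names[-num_reinit_layers:]


def reinit_weights(weights, std, rng):
    """Replace every weight with a fresh draw from N(0, std^2)."""
    return [rng.gauss(0.0, std) for _ in weights]


layers = [f"encoder.layer.{i}" for i in range(12)]
selected = layers_to_reinit(layers, 1)
print(selected)

rng = random.Random(0)
old_weights = [1.0, 1.0, 1.0]
new_weights = reinit_weights(old_weights, std=0.02, rng=rng)
print(new_weights)
```

This is also why the training code only calls the reinitialization step when `num_reinit_layers` is non-zero.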

In [None]:
def reinit_layers(model):
    
    print("Reinitializing layers")
    for layer in model.model.encoder.layer[-config['num_reinit_layers']:]:

        for module in layer.modules():

            if isinstance(module,nn.Linear):
                module.weight.data.normal_(mean=0.0,std=model.model_config.initializer_range)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.Embedding):
                ## ChaiiModel stores its config as model_config (it has no .config attribute)
                module.weight.data.normal_(mean=0.0, std=model.model_config.initializer_range)
                if module.padding_idx is not None:
                    module.weight.data[module.padding_idx].zero_()
            elif isinstance(module, nn.LayerNorm):
                module.bias.data.zero_()
                module.weight.data.fill_(1.0)
                        
    return model


<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Loss function and score</p>

I use cross-entropy as the loss function and the word-level Jaccard score as the metric to track.
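
A small worked example of both quantities, written in plain Python so the arithmetic is visible. The helper names (`word_jaccard`, `cross_entropy`, `span_loss`) are hypothetical and the inputs are made up; the span loss here is for a single example, averaging the start and end terms.

```python
import math


def word_jaccard(a, b):
    """Jaccard similarity between the lower-cased word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 1.0  # two empty strings count as a perfect match
    return len(set_a & set_b) / len(union)


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def span_loss(start_logits, end_logits, start, end):
    """Average of the start-position and end-position cross-entropies."""
    return (cross_entropy(start_logits, start) + cross_entropy(end_logits, end)) / 2


partial_match = word_jaccard("Tamil Nadu", "nadu")   # 1 shared word out of 2
exact_match = word_jaccard("Chennai", "Chennai")

# with uniform logits over 4 tokens each term equals log(4)
uniform_loss = span_loss([0.0] * 4, [0.0] * 4, start=1, end=2)
print(partial_match, exact_match, uniform_loss)
```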

In [None]:
def safe_div(x,y):
    if y == 0:
        return 1
    return x / y

def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return safe_div(float(len(c)) , (len(a) + len(b) - len(c)))

def get_jaccard_score(y_true,y_pred):
    assert len(y_true)==len(y_pred)
    scores=[]
    for i in range(len(y_true)):
        if len(y_true[i])>0:
            scores.append(jaccard(y_true[i], y_pred[i]))
            
        
    return np.mean(scores) if len(scores)>0 else 0.0

def chaii_loss(start_logits, end_logits, start_positions, end_positions):
    ce_loss = nn.CrossEntropyLoss(reduction='sum')
    start_loss = ce_loss(start_logits, start_positions)
    end_loss = ce_loss(end_logits, end_positions)    
    total_loss = (start_loss + end_loss)/2
    return total_loss


<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Train function</p>

In [None]:
def train_one_batch(dataloaders,data,model,criterion,optimizer,scheduler,phase):
    
                if phase=='train':
                    model.train()

                else:
                    model.eval()
                
                input_ids = data['ids'].cuda()
                masks = data['mask'].cuda()
                start,end = data['start'].cuda(),data['end'].cuda()
                optimizer.zero_grad()
                
                with torch.set_grad_enabled(phase == 'train'):

                    start_logits, end_logits = model(input_ids, masks)
                    loss = criterion(start_logits,end_logits,start,end)
                    
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                        if scheduler is not None:
                            scheduler.step()

                    epoch_loss = loss.item()
                    
                    start_logits = torch.softmax(start_logits, dim=1).cpu().detach().numpy()
                    end_logits = torch.softmax(end_logits, dim=1).cpu().detach().numpy()
                    
                    pred_start,pred_end = start_logits.argmax(axis=1),end_logits.argmax(axis=1)
                    prediction_strings,_ = dataloaders[phase].dataset.post_process(data,pred_start, pred_end)
                    
                    epoch_jaccard = get_jaccard_score(data['answer_text'],prediction_strings)
                return epoch_loss,epoch_jaccard
                    
            
    
    

In [None]:
def train_model(model,dataloaders,criterion,optimizer,scheduler=None,epochs=3,filename='saved.pth'):
    
    
    model.cuda()
    best_loss = np.inf
    step = 0
    for epoch in range(epochs):
        
        train_loss = 0.0
        train_score = 0.0
        for i,batch in enumerate(dataloaders['train']):
            loss,score = train_one_batch(dataloaders,batch,model,criterion,optimizer,scheduler,'train')
            train_loss += loss
            train_score += score
            
            if (i>0.75*len(dataloaders['train']) and i%config['swa_freq']==0):
                print(f"taking swa snapshot @{i}")
                optimizer.update_swa()
            
            if (step>0 and step%config['eval_every']==0) or (i == len(dataloaders['train'])-1):
                valid_loss=0.0;valid_score = 0.0
                
                if (i==len(dataloaders['train'])-1):
                    optimizer.swap_swa_sgd()
                        
                for k,batch in enumerate(dataloaders['valid']):
                    loss,score = train_one_batch(dataloaders,batch,model,criterion,optimizer,scheduler,'valid')
                    valid_loss += loss
                    valid_score += score
                
                if (valid_loss/len(dataloaders['valid'].dataset)) < best_loss:
                    torch.save(model.state_dict(),filename)
                    best_loss = valid_loss/len(dataloaders['valid'].dataset)
                    
                if (i==len(dataloaders['train'])-1):
                    optimizer.swap_swa_sgd()

                print('Valid step {} | Loss: {:.4f} | Jaccard: {:.4f}'.format(
                    step, valid_loss/len(dataloaders['valid'].dataset), valid_score/len(dataloaders['valid'])))
            
                wandb.log({'validation loss':valid_loss/len(dataloaders['valid'].dataset)},step=step)
                wandb.log({'validation score':valid_score/len(dataloaders['valid'])},step=step)

            step += 1
            
        print('Train Epoch {}/{} | Loss: {:.4f} | Jaccard: {:.4f}'.format(
                epoch + 1, epochs, train_loss/len(dataloaders['train'].dataset), train_score/len(dataloaders['train'])))
    


<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Train and eval function</p>

In [None]:
def train_and_eval(train,valid,fold):
    
    config['run_name']=f'fold{fold}_dropout_{config["dropout"]}_reinit_{config["num_reinit_layers"]}_lr_{config["lr"]}'
    seed_everything(config['seed'])
    run = wandb.init(reinit=True, project="google-chaii-baseline", config=config)
    wandb.run.name = config['run_name']
    
    with run:

    
        train = ChaiiDataset(train)
        train_loader = DataLoader(train,batch_size=config['batch_size'],shuffle=True)
        config['eval_every'] = int(len(train_loader)//3) ##validate 3 times inside an epoch
        
        valid = ChaiiDataset(valid)
        valid_loader = DataLoader(valid,batch_size=config['batch_size'],shuffle=False) ##no need to shuffle for validation

        model = ChaiiModel()
        if config['num_reinit_layers']:
            model = reinit_layers(model)

        criterion = chaii_loss
        no_decay = ["bias", "LayerNorm.weight"]

        optimizer_grouped_parameters = get_params("k",model,config['lr'])
        optimizer = AdamW(optimizer_grouped_parameters)
        optimizer = torchcontrib.optim.SWA(optimizer)
    
        steps = (len(train)*config['epochs'])//config['batch_size']
        scheduler = get_cosine_schedule_with_warmup(optimizer,num_warmup_steps=int(0*steps),num_training_steps=steps)
        dataloaders = {'train':train_loader,'valid':valid_loader}
        filename = f"{fold}_chaii_model.pth"
        train_model(model,dataloaders,criterion,optimizer,scheduler,config['epochs'],filename)


<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">K Fold training</p>

In [None]:
def run_k_fold(folds=5):
    
    
    
    train = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/train.csv')
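    ## fix random_state so the sampled external data is the same on every run
    external_mlqa = pd.read_csv('../input/mlqa-hindi-processed/mlqa_hindi.csv').sample(354, random_state=config['seed'])
    external_xquad = pd.read_csv('../input/mlqa-hindi-processed/xquad.csv').sample(400, random_state=config['seed'])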
    external_train = pd.concat([external_mlqa, external_xquad])
    external_train['id'] = list(np.arange(1, len(external_train)+1))
    df = pd.concat([train, external_train]).reset_index(drop=True)
    print(f"Number of samples in train data is {df.shape[0]}")

    
    df["kfold"] = -1

    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
    for f, (t_, v_) in enumerate(kf.split(X=df, y=df.language.values)):
        df.loc[v_, 'kfold'] = f
        
    for fold in range(folds):
        
        
            print(f"Training fold {fold}")

            train = df[df['kfold']!=fold]
            valid = df[df['kfold']==fold]

            train_and_eval(train,valid,fold)

            print("----------------------")
        
        

In [None]:
run_k_fold(config['folds'])

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Inference</p>

After predicting with each fold's model, the predicted start and end positions are post-processed to recover the answer strings from the original context. The `get_best_prediction` function then selects the final answer for each question by taking the most common prediction across the folds.
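
The two aggregation steps can be sketched on toy strings: first one answer per question is chosen among its context windows (the shortest non-empty string), then the folds vote. The helper names and the example answers below are made up for illustration.

```python
from collections import Counter


def shortest_non_empty(window_answers):
    """Pick the shortest non-empty answer among one question's windows."""
    candidates = [a for a in window_answers if a]
    return min(candidates, key=len) if candidates else ""


def majority_vote(fold_answers, max_chars=100):
    """Return the most common non-empty answer shorter than max_chars."""
    candidates = [a.strip() for a in fold_answers if 0 < len(a) < max_chars]
    if not candidates:
        return ""
    return Counter(candidates).most_common(1)[0][0]


per_window = shortest_non_empty(["", "Tamil Nadu", "Nadu"])
voted = majority_vote(["Chennai", "Chennai", "Madurai", ""])
no_answer = majority_vote(["", ""])
print(per_window, voted, repr(no_answer))
```

Taking the shortest window answer is a heuristic choice; selecting the window with the highest start and end probabilities would be an alternative worth comparing.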

In [None]:
test_df = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/test.csv")

In [None]:
def post_process(prediction_strings,indices):
    
    df = pd.DataFrame()
    df['index'] = indices
    df['answer'] = prediction_strings
    
    def best_answer(x):
        x = [k for k in x if len(k)>0]
        if len(x)>0:
            return min(x,key=len)
        return ""
    
    answer = df.groupby(['index'])['answer'].apply(lambda x : best_answer(x))
    
    return answer

    
    
    

def inference_fn(test,fold):
    
    prediction_strings,indices = [],[]
    test = ChaiiDataset(test)
    test_loader = DataLoader(test,batch_size=config['batch_size'],shuffle=False)

    model = ChaiiModel()
    filename = f"../input/chaiibaseline/{fold}_chaii_model.pth"
    
    model.load_state_dict(torch.load(filename))
    model.to(config['device'])
    model.eval()
    
    with torch.no_grad(): ##gradients are not needed at inference time
        for i,data in enumerate(test_loader):

            input_ids = data['ids'].cuda()
            masks = data['mask'].cuda()

            start_logits,end_logits = model(input_ids,masks)

            start_logits = torch.softmax(start_logits, dim=1).cpu().detach().numpy()
            end_logits = torch.softmax(end_logits, dim=1).cpu().detach().numpy()

            pred_start,pred_end = start_logits.argmax(axis=1),end_logits.argmax(axis=1)
            preds,ind = test_loader.dataset.post_process(data,pred_start, pred_end)
            prediction_strings.extend(preds);indices.extend(ind)
        
    
    return  post_process(prediction_strings,indices)


##credit goes to https://www.kaggle.com/nbroad/chaii-qa-torch-5-fold-with-post-processing-765
def rule_process(test):
    
    bad_starts = [".", ",", "(", ")", "-", "–",  ",", ";"]
    bad_endings = ["...", "-", "(", ")", "–", ",", ";"]

    tamil_ad = "கி.பி"
    tamil_bc = "கி.மு"
    tamil_km = "கி.மீ"
    hindi_ad = "ई"
    hindi_bc = "ई.पू"

    cleaned_preds = []
    for pred, context in test[["PredictionString", "context"]].to_numpy():
        if pred == "":
            cleaned_preds.append(pred)
            continue
        while any([pred.startswith(y) for y in bad_starts]):
            pred = pred[1:]
        while any([pred.endswith(y) for y in bad_endings]):
            if pred.endswith("..."):
                pred = pred[:-3]
            else:
                pred = pred[:-1]

        if any([pred.endswith(tamil_ad), pred.endswith(tamil_bc), pred.endswith(tamil_km), pred.endswith(hindi_ad), pred.endswith(hindi_bc)]) and pred+"." in context:
            pred = pred+"."

        cleaned_preds.append(pred)
    
    return cleaned_preds
        


def get_best_prediction(df):
    
    all_answers=[]
    for i,row in df.iterrows():
    
        candidates = [k.strip() for k in row.values.tolist() if len(k)<100 and len(k)>0]
        if len(candidates)>0:
            counter = Counter(candidates)
            answer = counter.most_common(1)[0][0]
        else:
            answer = ""

        all_answers.append(answer)
    
    return all_answers


In [None]:

df = pd.DataFrame()
test_df = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/test.csv")
for fold in tqdm(range(config['folds'])):
    
    preds = inference_fn(test_df,fold)
    test_df['PredictionString'] = preds
    preds = rule_process(test_df)
    df[f'fold_{fold}'] = preds
    
    



<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Submission</p>

In [None]:
answers =  get_best_prediction(df)   
submission = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/sample_submission.csv")
submission['PredictionString'] = answers

submission.to_csv('submission.csv',index=False)

In [None]:
submission.head(5)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f6; color :#E94421; border-radius: 10px 10px; text-align:center">Upvote if you liked it</p>