## Using a TPU in this notebook to speed up training
![XLA](https://xla.rocks/_xlawpx/wp-content/uploads/2019/02/logo-xla-2019.svg)
Fine-tuning Transformers.

PyTorch/XLA is a Python package that uses the XLA deep learning compiler to connect the PyTorch deep learning framework and Cloud TPUs. You can try it right now, for free, on a single Cloud TPU with Google Colab, and use it in production and on Cloud TPU Pods with Google Cloud.
It is a good fit for this notebook!


* Note:
     The XLA installation cell must be the first cell you run, otherwise the installation may not succeed.

I could not train the models on Kaggle because of the memory (RAM) requirements, so I trained them on Colab. You can visit this [link](https://colab.research.google.com/drive/16zHlALeStz-2vWBgE5Zol4wguPCz2Avz?usp=sharing) for details.

Please give me an upvote if you find this helpful!


I reached the top 1% in accuracy with an ensemble of BERT + RoBERTa + ELECTRA.

In [None]:
VERSION = "nightly"  #@param ["1.5" , "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION

In [None]:
!pip install transformers==3.0.0

## Data Processing
Preprocessing: none (not in 2020).
* Just split the data into two sets: training and validation.


In [None]:
import pandas as pd  # needed below for pd.read_csv
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

train = pd.read_csv('data/train.csv')
train = train.fillna('')
train['text'] = train['keyword'] + ' ' + train['location'] + ' '+ train['text']
train = train[['text','target']]
df_train, df_valid = train_test_split(train, test_size=0.2, random_state=42)
df_train.to_csv('train.csv')
df_valid.to_csv('valid.csv')

## Model config 

In [None]:
from transformers import BertConfig, RobertaConfig, DistilBertConfig, BertModel, RobertaModel, DistilBertModel
from torch import nn
import torch
class BertStyleModel(torch.nn.Module):
    
    def __init__(self, model_type):
        super().__init__()
        
        self.model_type = model_type
        if (model_type == 'roberta'):
            config_path = 'roberta-base'
            model_path = 'roberta-base'
            config = RobertaConfig.from_pretrained(config_path)
            config.output_hidden_states = True
            self.bert = RobertaModel.from_pretrained(model_path, config=config)
        elif (model_type == 'distilbert'):
            config_path = 'distilbert-base-uncased'
            # 'distilbert-base-uncased' is a model name, not a JSON file path
            config = DistilBertConfig.from_pretrained(config_path)
            model_path = 'distilbert-base-uncased'
            config.output_hidden_states = True
            self.bert = DistilBertModel.from_pretrained(model_path, config=config)
        elif (model_type == 'bert'):
            config_path = 'bert-base-uncased'
            config = BertConfig.from_pretrained(config_path)
            config.output_hidden_states = True
            model_path = 'bert-base-uncased'
            self.bert = BertModel.from_pretrained(model_path, config=config)
            
        self.dropout = nn.Dropout(p=0.2)
        self.high_dropout = nn.Dropout(p=0.5)   
        self.cls_token_head = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(768 * 4, 768),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(768, 1)
        
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None):
        
        if (self.model_type == 'roberta'):
            outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            hidden_layers = outputs[2]
        elif (self.model_type == 'distilbert'):
            outputs = self.bert(input_ids, attention_mask=attention_mask)
            hidden_layers = outputs[1]
        elif (self.model_type == 'bert'):     
            outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            hidden_layers = outputs[2]
        
        hidden_states_cls_embeddings = [x[:, 0] for x in hidden_layers[-4:]]
        x = torch.cat(hidden_states_cls_embeddings, dim=-1)
        cls_output = self.cls_token_head(x)
        logits = torch.mean(torch.stack([
            #Multi Sample Dropout takes place here
            self.classifier(self.high_dropout(cls_output))
            for _ in range(5)
        ], dim=0), dim=0)
        outputs = logits
        return outputs
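
The head above concatenates the first-token (`[CLS]`) vectors of the last four hidden layers, which is why its first linear layer takes `768 * 4` inputs. It then averages the classifier output over five independent dropout masks (multi-sample dropout). The sketch below illustrates only the averaging step, in plain Python with a toy 4-dimensional vector. `dropout`, `linear`, and `multi_sample_logit` are hypothetical helper names written for this illustration; they are not part of the notebook or of PyTorch.

```python
import random

def dropout(vec, p, rng):
    """Inverted dropout: zero each element with probability p, scale survivors by 1/(1-p)."""
    if p == 0.0:
        return list(vec)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in vec]

def linear(vec, weights, bias):
    """A single-output linear layer: dot(weights, vec) + bias."""
    return sum(w * v for w, v in zip(weights, vec)) + bias

def multi_sample_logit(vec, weights, bias, p=0.5, n_samples=5, seed=0):
    """Average the classifier output over n_samples independent dropout masks."""
    rng = random.Random(seed)
    logits = [linear(dropout(vec, p, rng), weights, bias) for _ in range(n_samples)]
    return sum(logits) / n_samples

features = [0.5, -1.0, 2.0, 0.25]   # toy stand-in for the 768-d head output
weights = [0.1, 0.2, -0.3, 0.4]
bias = 0.05

print(multi_sample_logit(features, weights, bias, p=0.5))  # noisy estimate (training mode)
print(multi_sample_logit(features, weights, bias, p=0.0))  # same as the plain linear output
```

After `model.eval()`, `nn.Dropout` is the identity, so the five samples coincide, as in the `p=0.0` call.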

## Training

In [None]:
from torch import nn
import torch
import random
import numpy as np
import pandas as pd
from transformers import T5Tokenizer, T5Model
from tqdm import tqdm
import matplotlib.pyplot as plt
class AverageMeter(object):
    """Computes and stores the average and current value"""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def loss_fn(outputs, targets):
    # Binary cross-entropy on raw logits; targets are reshaped to (batch, 1) to match the outputs
    return nn.BCEWithLogitsLoss()(outputs, targets.view(-1, 1))
def seed_all(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
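
`nn.BCEWithLogitsLoss` applies the sigmoid and the binary cross-entropy in one numerically stable step, which is why the model can return raw logits. Below is a minimal plain-Python sketch of the value it computes with mean reduction, tracked with a sample-weighted running mean in the spirit of `AverageMeter`. `bce_with_logits` and `RunningMean` are hypothetical names used only for this illustration, and the logits and targets are made up.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, in the stable form
    max(x, 0) - x*y + log(1 + exp(-|x|))."""
    losses = [max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
              for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)

class RunningMean:
    """Sample-weighted running mean, same bookkeeping as AverageMeter."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count

meter = RunningMean()
batches = [([2.0, -1.0], [1.0, 0.0]), ([0.0, 0.0], [1.0, 0.0])]  # made-up (logits, targets)
for logits, targets in batches:
    loss = bce_with_logits(logits, targets)
    meter.update(loss, n=len(logits))
    print(f"batch loss={loss:.4f}, running mean={meter.avg:.4f}")
```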

In [None]:

import os
import torch
import pandas as pd
from scipy import stats
import numpy as np

from tqdm import tqdm
from collections import OrderedDict, namedtuple
import torch.nn as nn
from torch.optim import lr_scheduler
import joblib

import logging
import transformers
from transformers import AdamW, get_linear_schedule_with_warmup, get_constant_schedule
import sys
from sklearn import metrics, model_selection

import warnings
import torch_xla
import torch_xla.debug.metrics as met
import torch_xla.distributed.data_parallel as dp
import torch_xla.distributed.parallel_loader as pl
import torch_xla.utils.utils as xu
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.test.test_utils as test_utils
import warnings

warnings.filterwarnings("ignore")

class BERTDatasetTraining:
    def __init__(self, comment_text, targets, tokenizer, max_length):
        self.comment_text = comment_text
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.targets = targets

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, item):
        comment_text = str(self.comment_text[item])
        comment_text = " ".join(comment_text.split())

        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_length,
        )
        ids = inputs["input_ids"]
        token_type_ids = inputs["token_type_ids"]
        mask = inputs["attention_mask"]
        
        padding_length = self.max_length - len(ids)
        
        ids = ids + ([0] * padding_length)
        mask = mask + ([0] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)
        
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[item], dtype=torch.float)
        }
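
`__getitem__` right-pads every encoded example to `max_length` so that all tensors in a batch share one shape, and extends the attention mask with zeros so the model ignores the padded positions. A sketch of that step in isolation, using illustrative token ids; `pad_to_length` is a hypothetical helper, and its `pad_id=0` default mirrors the `0` hard-coded in the class above.

```python
def pad_to_length(ids, mask, max_length, pad_id=0):
    """Right-pad token ids and attention mask to exactly max_length."""
    if len(ids) > max_length:
        raise ValueError("input is longer than max_length; truncate it first")
    padding_length = max_length - len(ids)
    padded_ids = ids + [pad_id] * padding_length
    padded_mask = mask + [0] * padding_length   # 0 = position is ignored by attention
    return padded_ids, padded_mask

ids = [101, 7592, 2088, 102]   # illustrative ids standing in for "[CLS] ... [SEP]"
mask = [1] * len(ids)
padded_ids, padded_mask = pad_to_length(ids, mask, max_length=8)
print(padded_ids)
print(padded_mask)
```

Fixed-length inputs also suit XLA, which compiles a graph per tensor shape and would otherwise recompile whenever a new sequence length appears.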

## BERT

In [None]:
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.distributed.parallel_loader as pl
from sklearn import metrics
import torch
from  transformers import AdamW
from transformers import  get_linear_schedule_with_warmup
import pandas as pd
from sklearn.model_selection import train_test_split
import transformers
mx = BertStyleModel('bert')
def _run():
    def loss_fn(outputs, targets):
        return nn.BCEWithLogitsLoss()(outputs, targets.view(-1, 1))

    def train_loop_fn(data_loader, model, optimizer, device, scheduler=None):
        model.train()
        for bi, d in enumerate(data_loader):
            ids = d["ids"]
            mask = d["mask"]
            token_type_ids = d["token_type_ids"]
            targets = d["targets"]

            ids = ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            token_type_ids = token_type_ids.to(device, dtype=torch.long)
            targets = targets.to(device, dtype=torch.float)

            optimizer.zero_grad()
            outputs = model(
                input_ids=ids,
                attention_mask=mask,
                token_type_ids=token_type_ids
            )

            loss = loss_fn(outputs, targets)
            if bi % 10 == 0:
                xm.master_print(f'bi={bi}, loss={loss}')

            loss.backward()
            xm.optimizer_step(optimizer)
            if scheduler is not None:
                scheduler.step()

    def eval_loop_fn(data_loader, model, device):
        model.eval()
        fin_targets = []
        fin_outputs = []
        for bi, d in enumerate(data_loader):
            ids = d["ids"]
            mask = d["mask"]
            token_type_ids = d["token_type_ids"]
            targets = d["targets"]

            ids = ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            token_type_ids = token_type_ids.to(device, dtype=torch.long)
            targets = targets.to(device, dtype=torch.float)

            outputs = model(
                input_ids=ids,
                attention_mask=mask,
                token_type_ids=token_type_ids
            )

            targets_np = targets.cpu().detach().numpy().tolist()
            outputs_np = outputs.cpu().detach().numpy().tolist()
            fin_targets.extend(targets_np)
            fin_outputs.extend(outputs_np)    

        return fin_outputs, fin_targets

    df_train = pd.read_csv('train.csv')
    df_valid = pd.read_csv('valid.csv')
    MAX_LEN = 192
    TRAIN_BATCH_SIZE = 4
    EPOCHS = 3

    tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

    train_targets = df_train.target.values
    valid_targets = df_valid.target.values

    train_dataset = BERTDatasetTraining(
        comment_text=df_train.text.values,
        targets=train_targets,
        tokenizer=tokenizer,
        max_length=MAX_LEN
    )

    train_sampler = torch.utils.data.distributed.DistributedSampler(
          train_dataset,
          num_replicas=xm.xrt_world_size(),
          rank=xm.get_ordinal(),
          shuffle=True)

    train_data_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=TRAIN_BATCH_SIZE,
        sampler=train_sampler,
        drop_last=True,
        num_workers=1
    )

    valid_dataset = BERTDatasetTraining(
        comment_text=df_valid.text.values,
        targets=valid_targets,
        tokenizer=tokenizer,
        max_length=MAX_LEN
    )

    valid_sampler = torch.utils.data.distributed.DistributedSampler(
          valid_dataset,
          num_replicas=xm.xrt_world_size(),
          rank=xm.get_ordinal(),
          shuffle=False)

    valid_data_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=16,
        sampler=valid_sampler,
        drop_last=False,
        num_workers=1
    )
    
    device = xm.xla_device()
    model = mx.to(device)
    
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}]

    lr = 0.4 * 1e-5 * xm.xrt_world_size()
    num_train_steps = int(len(train_dataset) / TRAIN_BATCH_SIZE / xm.xrt_world_size() * EPOCHS)
    xm.master_print(f'num_train_steps = {num_train_steps}, world_size={xm.xrt_world_size()}')

    optimizer = AdamW(optimizer_grouped_parameters, lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=num_train_steps
    )

    for epoch in range(EPOCHS):
        para_loader = pl.ParallelLoader(train_data_loader, [device])
        train_loop_fn(para_loader.per_device_loader(device), model, optimizer, device, scheduler=scheduler)

        para_loader = pl.ParallelLoader(valid_data_loader, [device])
        o, t = eval_loop_fn(para_loader.per_device_loader(device), model, device)
        xm.save(model.state_dict(), "bert.bin")
        np.save('bert', np.array(o))
        auc = metrics.roc_auc_score(np.array(t) >= 0.5, o)
        xm.master_print(f'AUC = {auc}')
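
The learning rate in `_run` is scaled linearly with the number of TPU cores, and the step count is divided by the same factor because each core only sees its own shard of the data. The sketch below reproduces that arithmetic together with the linear decay that `get_linear_schedule_with_warmup` applies when there are zero warmup steps. The dataset size of 6,000 rows is an assumed, illustrative number (not the real size of `train.csv`), and `linear_schedule_factor` is a hypothetical helper written for this example.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """LR multiplier: linear warmup to 1.0, then linear decay to 0.0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

WORLD_SIZE = 8            # TPU cores, as passed to xmp.spawn(nprocs=8)
TRAIN_BATCH_SIZE = 4      # per-core batch size
EPOCHS = 3
n_train_rows = 6000       # ASSUMPTION: illustrative dataset size

base_lr = 0.4 * 1e-5 * WORLD_SIZE
num_train_steps = int(n_train_rows / TRAIN_BATCH_SIZE / WORLD_SIZE * EPOCHS)

print(f"peak lr = {base_lr:.1e}, effective batch = {TRAIN_BATCH_SIZE * WORLD_SIZE}")
for step in (0, num_train_steps // 2, num_train_steps):
    lr = base_lr * linear_schedule_factor(step, 0, num_train_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```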

In [None]:
# Start training processes
def _mp_fn(rank, flags):
    torch.set_default_tensor_type('torch.FloatTensor')
    a = _run()

FLAGS={}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')

In [None]:
class BERTDatasetTest:
    def __init__(self, comment_text, tokenizer, max_length):
        self.comment_text = comment_text
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, item):
        comment_text = str(self.comment_text[item])
        comment_text = " ".join(comment_text.split())

        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_length,
        )
        ids = inputs["input_ids"]
        token_type_ids = inputs["token_type_ids"]
        mask = inputs["attention_mask"]
        
        padding_length = self.max_length - len(ids)
        
        ids = ids + ([0] * padding_length)
        mask = mask + ([0] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)
        
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long)
           
        }
def test_loop_fn(data_loader, model, device):
        model.eval()
        fin_targets = []
        fin_outputs = []
        for bi, d in enumerate(data_loader):
            ids = d["ids"]
            mask = d["mask"]
            token_type_ids = d["token_type_ids"]
            

            ids = ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            token_type_ids = token_type_ids.to(device, dtype=torch.long)

            outputs = model(
                input_ids=ids,
                attention_mask=mask,
                token_type_ids=token_type_ids
            )

          
            outputs_np = outputs.cpu().detach().numpy().tolist()
            fin_outputs.extend(outputs_np)    

        return fin_outputs
def test_model():
  MAX_LEN = 192
  device = xm.xla_device()
  model = BertStyleModel('bert')
  model.load_state_dict(torch.load("bert.bin"))
  model.to(device)
  df_test = pd.read_csv('data/test.csv')
  df_test = df_test.fillna('')
  df_test['text'] = df_test['keyword'] + ' ' + df_test['location'] + ' '+ df_test['text']
  df_test = df_test[['text']]
  tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

  test_dataset = BERTDatasetTest(
        comment_text=df_test.text.values,
        tokenizer=tokenizer,
        max_length=MAX_LEN
    )

  test_sampler = torch.utils.data.distributed.DistributedSampler(
          test_dataset,
          num_replicas=xm.xrt_world_size(),
          rank=xm.get_ordinal(),
          shuffle=False)

  test_loader = torch.utils.data.DataLoader(
        test_dataset,
        batch_size=16,
        sampler=test_sampler,
        drop_last=False,
        num_workers=1
    )
  para_loader = pl.ParallelLoader(test_loader, [device])
  ypred = test_loop_fn(para_loader.per_device_loader(device), model, device)
  ypred= np.array(ypred)
  np.save('result.npy', ypred)
  return ypred



In [None]:
# Start the test (inference) process
def _mp_fn(rank, flags):
    torch.set_default_tensor_type('torch.FloatTensor')
    a = test_model()

FLAGS={}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=1, start_method='fork')
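
`DistributedSampler` gives each process an interleaved slice of the dataset and, when the size is not divisible by the number of replicas, repeats a few leading indices so that every replica gets the same count. With `nprocs=1`, as in the cell above, the single process therefore sees every row once and in order, so the predictions line up with the rows of the test file. The sketch below imitates that index assignment for `shuffle=False` in plain Python; `shard_indices` is a hypothetical helper, not the PyTorch implementation.

```python
import math

def shard_indices(n_samples, num_replicas, rank):
    """Indices seen by one replica without shuffling (assumes n_samples >= num_replicas)."""
    per_replica = math.ceil(n_samples / num_replicas)
    total_size = per_replica * num_replicas
    indices = list(range(n_samples))
    indices += indices[: total_size - n_samples]   # pad by repeating leading indices
    return indices[rank:total_size:num_replicas]

# One process: every row, in the original order.
print(shard_indices(10, num_replicas=1, rank=0))

# Eight processes: interleaved shards, with some rows repeated as padding.
for rank in range(8):
    print(rank, shard_indices(10, num_replicas=8, rank=rank))
```

The same mechanism means that the validation outputs collected inside each of the eight training processes cover only that process's shard.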

In [None]:
submit = pd.read_csv('data/sample_submission.csv')
ypred = np.load('result.npy')
submit['target'] = ypred
submit['target'].hist()
plt.show()

In [None]:
pred = [int(i> 0.5) for i in ypred] 
plt.hist(pred)
plt.show()
submit['target'] = pred
submit.to_csv('submit.csv', index=False)
### Gets 81.918 accuracy

## ROBERTA

In [None]:
class BERTDatasetTraining:
    def __init__(self, comment_text, targets, tokenizer, max_length):
        self.comment_text = comment_text
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.targets = targets

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, item):
        comment_text = str(self.comment_text[item])
        comment_text = " ".join(comment_text.split())

        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_length,
        )
        ids = inputs["input_ids"]
        mask = inputs["attention_mask"]
        padding_length = self.max_length - len(ids)
        
        ids = ids + ([0] * padding_length)
        mask = mask + ([0] * padding_length)
        
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.targets[item], dtype=torch.float)
        }

In [None]:
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.distributed.parallel_loader as pl
from sklearn import metrics
import torch
from  transformers import AdamW
from transformers import  get_linear_schedule_with_warmup
import pandas as pd
from sklearn.model_selection import train_test_split
import transformers

mx = BertStyleModel('roberta')
def _run():
    def loss_fn(outputs, targets):
        return nn.BCEWithLogitsLoss()(outputs, targets.view(-1, 1))

    def train_loop_fn(data_loader, model, optimizer, device, scheduler=None):
        model.train()
        for bi, d in enumerate(data_loader):
            ids = d["ids"]
            mask = d["mask"]
            targets = d["targets"]

            ids = ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            targets = targets.to(device, dtype=torch.float)

            optimizer.zero_grad()
            outputs = model(
                input_ids=ids,
                attention_mask=mask,
            )

            loss = loss_fn(outputs, targets)
            if bi % 10 == 0:
                xm.master_print(f'bi={bi}, loss={loss}')

            loss.backward()
            xm.optimizer_step(optimizer)
            if scheduler is not None:
                scheduler.step()

    def eval_loop_fn(data_loader, model, device):
        model.eval()
        fin_targets = []
        fin_outputs = []
        for bi, d in enumerate(data_loader):
            ids = d["ids"]
            mask = d["mask"]
            targets = d["targets"]

            ids = ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            targets = targets.to(device, dtype=torch.float)

            outputs = model(
                input_ids=ids,
                attention_mask=mask,
            )

            targets_np = targets.cpu().detach().numpy().tolist()
            outputs_np = outputs.cpu().detach().numpy().tolist()
            fin_targets.extend(targets_np)
            fin_outputs.extend(outputs_np)    

        return fin_outputs, fin_targets

    df_train = pd.read_csv('train.csv')
    df_valid = pd.read_csv('valid.csv')
    MAX_LEN = 192
    TRAIN_BATCH_SIZE = 4
    EPOCHS = 3

    tokenizer = transformers.RobertaTokenizer.from_pretrained("roberta-base", do_lower_case=True)

    train_targets = df_train.target.values
    valid_targets = df_valid.target.values

    train_dataset = BERTDatasetTraining(
        comment_text=df_train.text.values,
        targets=train_targets,
        tokenizer=tokenizer,
        max_length=MAX_LEN
    )

    train_sampler = torch.utils.data.distributed.DistributedSampler(
          train_dataset,
          num_replicas=xm.xrt_world_size(),
          rank=xm.get_ordinal(),
          shuffle=True)

    train_data_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=TRAIN_BATCH_SIZE,
        sampler=train_sampler,
        drop_last=True,
        num_workers=1
    )

    valid_dataset = BERTDatasetTraining(
        comment_text=df_valid.text.values,
        targets=valid_targets,
        tokenizer=tokenizer,
        max_length=MAX_LEN
    )

    valid_sampler = torch.utils.data.distributed.DistributedSampler(
          valid_dataset,
          num_replicas=xm.xrt_world_size(),
          rank=xm.get_ordinal(),
          shuffle=False)

    valid_data_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=16,
        sampler=valid_sampler,
        drop_last=False,
        num_workers=1
    )
    
    device = xm.xla_device()
    model = mx.to(device)

    print("%s: training" % device)
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}]

    lr = 0.4 * 1e-5 * xm.xrt_world_size()
    num_train_steps = int(len(train_dataset) / TRAIN_BATCH_SIZE / xm.xrt_world_size() * EPOCHS)
    xm.master_print(f'num_train_steps = {num_train_steps}, world_size={xm.xrt_world_size()}')

    optimizer = AdamW(optimizer_grouped_parameters, lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=num_train_steps
    )

    for epoch in range(EPOCHS):
        para_loader = pl.ParallelLoader(train_data_loader, [device])
        train_loop_fn(para_loader.per_device_loader(device), model, optimizer, device, scheduler=scheduler)

        para_loader = pl.ParallelLoader(valid_data_loader, [device])
        o, t = eval_loop_fn(para_loader.per_device_loader(device), model, device)
        xm.save(model.state_dict(), "roberta.bin")
        np.save('roberta', np.array(o))  # save the validation outputs, as in the BERT run
        auc = metrics.roc_auc_score(np.array(t) >= 0.5, o)
        xm.master_print(f'AUC = {auc}')
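
ROC AUC depends only on how the scores rank positives against negatives, so it can be computed directly on the raw logits returned by the model, without applying a sigmoid first. Here is a small plain-Python sketch of the pairwise definition with made-up scores; `pairwise_auc` is a hypothetical helper for illustration and is much slower than `sklearn.metrics.roc_auc_score`.

```python
import math

def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

labels = [1, 0, 1, 0, 1]
logits = [2.1, -0.3, 0.4, 0.9, -1.2]   # made-up model outputs

auc_logits = pairwise_auc(labels, logits)
auc_probs = pairwise_auc(labels, [sigmoid(x) for x in logits])
print(auc_logits, auc_probs)   # the two values match: sigmoid preserves the ordering
```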

In [None]:
# Start training processes
def _mp_fn(rank, flags):
    torch.set_default_tensor_type('torch.FloatTensor')
    a = _run()

FLAGS={}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')

In [None]:
class BERTDatasetTest:
    def __init__(self, comment_text, tokenizer, max_length):
        self.comment_text = comment_text
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, item):
        comment_text = str(self.comment_text[item])
        comment_text = " ".join(comment_text.split())

        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_length,
        )
        ids = inputs["input_ids"]
        mask = inputs["attention_mask"]
        
        padding_length = self.max_length - len(ids)
        
        ids = ids + ([0] * padding_length)
        mask = mask + ([0] * padding_length)
        
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),           
        }
def test_loop_fn(data_loader, model, device):
        model.eval()
        fin_targets = []
        fin_outputs = []
        for bi, d in enumerate(data_loader):
            ids = d["ids"]
            mask = d["mask"]
            

            ids = ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)

            outputs = model(
                input_ids=ids,
                attention_mask=mask,
            )

          
            outputs_np = outputs.cpu().detach().numpy().tolist()
            fin_outputs.extend(outputs_np)    

        return fin_outputs
def test_model():
  MAX_LEN = 192
  device = xm.xla_device()
  model = BertStyleModel('roberta')
  model.load_state_dict(torch.load("roberta.bin"))
  model.to(device)
  df_test = pd.read_csv('data/test.csv')
  df_test = df_test.fillna('')
  df_test['text'] = df_test['keyword'] + ' ' + df_test['location'] + ' '+ df_test['text']
  df_test = df_test[['text']]
  tokenizer = transformers.RobertaTokenizer.from_pretrained("roberta-base", do_lower_case=True)

  test_dataset = BERTDatasetTest(
        comment_text=df_test.text.values,
        tokenizer=tokenizer,
        max_length=MAX_LEN
    )

  test_sampler = torch.utils.data.distributed.DistributedSampler(
          test_dataset,
          num_replicas=xm.xrt_world_size(),
          rank=xm.get_ordinal(),
          shuffle=False)

  test_loader = torch.utils.data.DataLoader(
        test_dataset,
        batch_size=16,
        sampler=test_sampler,
        drop_last=False,
        num_workers=1
    )
  para_loader = pl.ParallelLoader(test_loader, [device])
  ypred = test_loop_fn(para_loader.per_device_loader(device), model, device)
  ypred= np.array(ypred)
  np.save('result2.npy', ypred)
  return ypred


In [None]:
# Start test processes XLA
def _mp_fn(rank, flags):
    torch.set_default_tensor_type('torch.FloatTensor')
    a = test_model()

FLAGS={}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=1, start_method='fork')

In [None]:
submit = pd.read_csv('data/sample_submission.csv')
ypred = np.load('result2.npy')
submit['target'] = ypred
plt.hist(ypred)
plt.show()
pred = [int(i> 0.5) for i in ypred] 
plt.hist(pred)
plt.show()
submit['target'] = pred
submit.to_csv('submit2.csv', index=False)
### Gets 0.83389 accuracy

## Postprocessing

In [None]:
Try combining the two models.
I believe RoBERTa is better suited to this task, so I multiply the RoBERTa outputs by a number greater than 1, which gives a better result.
    Summary: roberta * 1.3 reaches the top 15%.
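
Because the models return raw logits and the submission code thresholds them at 0.5, multiplying the RoBERTa outputs by 1.3 is equivalent to lowering the decision threshold on the unscaled outputs to 0.5 / 1.3 (about 0.385). The sketch below checks that equivalence on made-up scores; the values and the helper name `to_labels` are illustrative assumptions.

```python
def to_labels(scores, threshold=0.5):
    """Binary labels from scores: 1 if score > threshold, else 0."""
    return [int(s > threshold) for s in scores]

SCALE = 1.3
roberta_scores = [-1.7, 0.2, 0.41, 0.47, 0.9, 2.3]   # made-up raw outputs

scaled_labels = to_labels([s * SCALE for s in roberta_scores], threshold=0.5)
shifted_labels = to_labels(roberta_scores, threshold=0.5 / SCALE)

print(scaled_labels)
print(shifted_labels)
print("extra positives vs. unscaled:", sum(scaled_labels) - sum(to_labels(roberta_scores)))
```

Read this way, the 1.3 factor is a threshold choice: it labels more tweets as positive, which matches the larger `roberta_.sum()` printed after scaling.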

In [None]:
bert = np.load('result.npy')
roberta= np.load('result2.npy')

In [None]:
bert_ = np.array([int(b>0.5) for b in bert])
roberta_ = np.array([int(b>0.5) for b in roberta])


In [None]:
bert_.sum() , roberta_.sum()

In [None]:
roberta2 = roberta*1.3 #  Best today is 0.83512
roberta_ = np.array([int(b>0.5) for b in roberta2])
print(roberta_.sum())

In [None]:
submit = pd.read_csv('data/sample_submission.csv')  # same path as in the earlier cells
ypred = roberta2
submit['target'] = ypred
plt.hist(ypred)
plt.show()
pred = [int(i> 0.5) for i in ypred] 
plt.hist(pred)
plt.show()
submit['target'] = pred
submit.to_csv('submit2.csv', index=False)

In [None]:
# Submit the file
import pandas as pd
submit = pd.read_csv('/kaggle/input/nlpstarted/submit2 (1).csv')
submit.to_csv('submission.csv', index=False)