# **About this notebook**
## Using the accurate initialization method (ref. 1) as a base,
## I compared BERT and RoBERTa across several random seeds with my own model (ref. 2).

Ref. 1) How to initialize the code correctly (English&日本語) (I would be grateful if you could upvote this one too):
https://www.kaggle.com/chumajin/how-to-initialize-the-code-correctly-english

Ref. 2) Pytorch BERT Beginner's room:
https://www.kaggle.com/chumajin/pytorch-bert-beginner-s-room

### **Thank you for visiting this page. I hope it helps you, even a little. I would be glad if you upvote or follow!**
### **Also, thank you to everyone who has upvoted before.**

#### If you only want to see the results, use the link below to jump ahead. Chapters 0 to 10 contain the training code for reference. (No training is performed in this version.)

## [Jump to 11. Comparison results of BERT and RoBERTa](#section-one)

# 0. Preparation

In [None]:
import numpy as np 
import pandas as pd 
import os
       
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
import matplotlib.pyplot as plt 

import transformers
import random

from transformers import AdamW
from transformers import get_linear_schedule_with_warmup


import warnings
warnings.simplefilter('ignore')

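scaler = torch.cuda.amp.GradScaler() # gradient scaler for mixed-precision training, which speeds things up on the GPU

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # use the GPU when available, otherwise the CPU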
device

In [None]:
def random_seed(SEED):
    
    random.seed(SEED)
    os.environ['PYTHONHASHSEED'] = str(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True
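
As a quick sanity check of what seeding buys you, here is a minimal sketch that uses only Python's built-in `random` module (an assumption made to keep it dependency-free; `draw_numbers` is a hypothetical helper, not part of this notebook). The same idea applies to the NumPy and PyTorch generators seeded in `random_seed`.

```python
import random

def draw_numbers(seed, n=3):
    """Seed Python's RNG and return the first n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw_numbers(100)
second = draw_numbers(100)   # same seed: the sequence repeats
other = draw_numbers(101)    # different seed: a different sequence

print(first == second)
print(first == other)
```

If two runs with the same seed ever disagree, some source of randomness was not reset, which is the kind of problem the initialization method from ref. 1 is meant to rule out.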


# 1. CFG

In [None]:
class CFG():
    
    epochs = 20
    
    train_batch = 16
    valid_batch = 32
    
    kfold = 5
    
    LR = 2e-5
    
    num_steps = 0.1
    
    endepoch = 10
    
    sentence_len = 314   
   
    

In [None]:
CFG = CFG()

# 2. Sample preparation

In [None]:
train = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
train.head(3)

In [None]:
train.excerpt.iloc[0]

In [None]:
test = pd.read_csv("../input/commonlitreadabilityprize/test.csv")
test.head(3)

In [None]:
sample = pd.read_csv("../input/commonlitreadabilityprize/sample_submission.csv")
sample

## 2.1 K-fold

In [None]:
train = train.sort_values("target").reset_index(drop=True)
train

In [None]:
train["kfold"] = train.index % CFG.kfold # rows are sorted by target, so each fold covers the whole target range

In [None]:
train
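
The fold assignment above (sort by target, then take the row position modulo the number of folds) can be sketched without pandas as follows. The targets are made-up values and `assign_folds` is a hypothetical helper; the point is only to show that every fold ends up with the same size and a similar target distribution.

```python
import random
from statistics import mean

def assign_folds(targets, n_folds=5):
    """Sort rows by target, then deal them into folds round-robin (position % n_folds)."""
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    folds = [0] * len(targets)
    for position, row in enumerate(order):
        folds[row] = position % n_folds
    return folds

# Hypothetical targets, only for illustration.
rng = random.Random(0)
targets = [rng.gauss(-1.0, 1.0) for _ in range(100)]
folds = assign_folds(targets)

fold_sizes = [folds.count(k) for k in range(5)]
fold_means = [mean(t for t, f in zip(targets, folds) if f == k) for k in range(5)]

print(fold_sizes)
print([round(m, 3) for m in fold_means])
```

Note that this split is deterministic: the random seed changes the model initialization and the batch order, but not which rows belong to which fold.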

# 3. Tokenizer

## 3.1 BERT

In [None]:
BERT_tokenizer = transformers.BertTokenizer.from_pretrained("../input/bert-base-uncased")

## 3.2 Roberta

In [None]:
Roberta_tokenizer =  transformers.RobertaTokenizer.from_pretrained("../input/roberta-base")

# 4. DataSet

In [None]:
class MyDataSet(Dataset):
    
    def __init__(self,sentences,targets,tokenizer):
        
        self.sentences = sentences
        self.targets = targets
        self.tokenizer = tokenizer
        
    def __len__(self):
        
        return len(self.sentences)
    
    def __getitem__(self,idx):
        
        sentence = self.sentences[idx]
        
       
        bert_sens = self.tokenizer.encode_plus(
                                sentence,
                                add_special_tokens = True, 
                                max_length = CFG.sentence_len, 
                                padding = "max_length", # replaces pad_to_max_length, which is deprecated in recent transformers releases
                                return_attention_mask = True,
                                truncation=True)
        
        

        ids = torch.tensor(bert_sens['input_ids'], dtype=torch.long)
        mask = torch.tensor(bert_sens['attention_mask'], dtype=torch.long)
        #token_type_ids = torch.tensor(bert_sens['token_type_ids'], dtype=torch.long)
     
            
        target = torch.tensor(self.targets[idx],dtype=torch.float)
        
        return {
                'ids': ids,
                'mask': mask,
                #'token_type_ids': token_type_ids,
                'targets': target
            }
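
To make the `ids` and `mask` tensors less abstract, here is a rough plain-Python sketch of what padding and truncating to a fixed length produces. It is not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper, the token ids are invented, the pad id depends on the tokenizer, and a real tokenizer also keeps the closing special token when it truncates, which this sketch does not model.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# Hypothetical token ids (not produced by a real tokenizer).
short_ids, short_mask = pad_and_mask([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_and_mask([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item has the same length (`CFG.sentence_len`), the default `DataLoader` collation can stack them into a batch without a custom collate function.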

# 5. loss function

In [None]:
def loss_fn(output,target):
    return torch.sqrt(nn.MSELoss()(output,target))
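
One detail worth keeping in mind: the mean of per-batch RMSE values is generally not the RMSE over the whole epoch, which is why the functions in the next chapter recompute the score from all predictions. A small sketch with made-up numbers (`rmse` is a hypothetical helper written in plain Python):

```python
import math

def rmse(preds, targets):
    """Root mean squared error over two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

# Hypothetical predictions and targets, split into two batches.
batches = [
    ([0.0, 0.0], [0.0, 0.0]),   # perfect batch: RMSE 0
    ([2.0, 2.0], [0.0, 0.0]),   # every prediction is off by 2: RMSE 2
]

mean_of_batch_rmse = sum(rmse(p, t) for p, t in batches) / len(batches)

all_preds = [x for p, _ in batches for x in p]
all_targets = [x for _, t in batches for x in t]
epoch_rmse = rmse(all_preds, all_targets)

print(mean_of_batch_rmse)  # 1.0
print(epoch_rmse)          # sqrt(2), about 1.414
```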

# 6. Training & validation function

In [None]:
def training(
    train_dataloader,
    model,
    optimizer,
    scheduler
):
    
    model.train()
    torch.backends.cudnn.benchmark = True

    allpreds = []
    alltargets = []
    losses = [] # created once per epoch; resetting it inside the loop would keep only the last batch loss

    for num,a in enumerate(train_dataloader):

        optimizer.zero_grad()

        with torch.cuda.amp.autocast():

            ids = a["ids"].to(device,non_blocking=True)
            mask = a["mask"].to(device,non_blocking=True)
      #      tokentype = a["token_type_ids"].to(device,non_blocking=True)
    
            output = model(ids,mask)
            output = output["logits"].squeeze(-1)

            target = a["targets"].to(device,non_blocking=True)

            loss = loss_fn(output,target)


            # For scoring
            losses.append(loss.item())

            allpreds.append(output.detach().cpu().numpy())
            alltargets.append(target.detach().squeeze(-1).cpu().numpy())

        scaler.scale(loss).backward() # backwards of loss
        
        
        scaler.step(optimizer) # Update optimizer
        scaler.update() # scaler update
            


        scheduler.step() # Update learning rate schedule
        del loss
        
        # Combine the per-batch results from the dataloader

    allpreds = np.concatenate(allpreds)
    alltargets = np.concatenate(alltargets)

    # The loss is collected for reference only; the score below is what I use

    losses = np.mean(losses)

    # Score with rmse
    train_rme_loss = np.sqrt(mean_squared_error(alltargets,allpreds))
    
    

    return losses,train_rme_loss

In [None]:
def validating(
    valid_dataloader,
    model
):
    
    model.eval()

    allpreds = []
    alltargets = []
    losses = [] # created once per epoch; resetting it inside the loop would keep only the last batch loss

    for a in valid_dataloader:

        with torch.no_grad():

            ids = a["ids"].to(device,non_blocking=True)
            mask = a["mask"].to(device,non_blocking=True)
            #tokentype = a["token_type_ids"].to(device,non_blocking=True)

            output = model(ids,mask)
            output = output["logits"].squeeze(-1)

            target = a["targets"].to(device,non_blocking=True)

            loss = loss_fn(output,target)


            # For scoring
            losses.append(loss.item())
            allpreds.append(output.detach().cpu().numpy())
            alltargets.append(target.detach().squeeze(-1).cpu().numpy())
            
            del loss


    # Combine the per-batch results from the dataloader

    allpreds = np.concatenate(allpreds)
    alltargets = np.concatenate(alltargets)

    # The loss is collected for reference only; the score below is what I use

    losses = np.mean(losses)

    # Score with rmse
    valid_rme_loss = np.sqrt(mean_squared_error(alltargets,allpreds))

    return allpreds,losses,valid_rme_loss

# 7. initialize function 

## ref : How to initialize the code correctly (English&日本語) 

https://www.kaggle.com/chumajin/how-to-initialize-the-code-correctly-english/edit/run/64377180

In [None]:
def initialize(fold,SEED,tokenizer,mode):
    random_seed(SEED)

    p_train = train[train["kfold"]!=fold].reset_index(drop=True)
    p_valid = train[train["kfold"]==fold].reset_index(drop=True)


    train_dataset = MyDataSet(p_train["excerpt"],p_train["target"],tokenizer)
    valid_dataset = MyDataSet(p_valid["excerpt"],p_valid["target"],tokenizer)

    train_dataloader = DataLoader(train_dataset,batch_size=CFG.train_batch,shuffle = True,num_workers=4,pin_memory=True)
    valid_dataloader = DataLoader(valid_dataset,batch_size=CFG.valid_batch,shuffle = False,num_workers=4,pin_memory=True)

    
    if mode == "BERT":
        model = transformers.BertForSequenceClassification.from_pretrained("../input/bert-base-uncased",num_labels=1)
    else:
        model = transformers.RobertaForSequenceClassification.from_pretrained("../input/roberta-base",num_labels=1)
    model.to(device)

    optimizer = AdamW(model.parameters(), CFG.LR,betas=(0.9, 0.999), weight_decay=1e-2) # AdamW optimizer

    train_steps = int(len(p_train)/CFG.train_batch*CFG.epochs)

    num_steps = int(train_steps*CFG.num_steps)

    scheduler = get_linear_schedule_with_warmup(optimizer, num_steps, train_steps)
    
    scaler = torch.cuda.amp.GradScaler() # gradient scaler for mixed-precision training, which speeds things up on the GPU
    
    return train_dataloader,valid_dataloader,model,optimizer,scheduler,scaler
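
To see what the scheduler does with these settings, here is a plain-Python sketch of the learning-rate multiplier that a linear warmup schedule applies, as I understand `get_linear_schedule_with_warmup`. `linear_warmup_factor` is a hypothetical helper and `n_train = 2000` is an invented training-set size; the other values mirror `CFG`.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """LR multiplier: rises from 0 to 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Hypothetical training-set size for one fold; the other values mirror CFG.
n_train, batch_size, epochs, warmup_ratio = 2000, 16, 20, 0.1

total_steps = int(n_train / batch_size * epochs)   # same formula as train_steps above
warmup_steps = int(total_steps * warmup_ratio)     # same formula as num_steps above
steps_per_epoch = n_train // batch_size

# The training loop breaks when epoch == CFG.endepoch (10), i.e. after 11 epochs.
last_step = steps_per_epoch * 11
factor_at_stop = linear_warmup_factor(last_step, warmup_steps, total_steps)

print(total_steps, warmup_steps, factor_at_stop)
```

Under these assumptions the schedule is planned for 20 epochs but training stops early, so the learning rate is still about half of `CFG.LR` when the loop ends rather than having decayed to zero.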

# 8. Define the SEEDs to evaluate. In this notebook, the random seeds run from 100 to 104.

In [None]:
SEEDS = [100,101,102,103,104]

# 9. BERT training 
### 【Attention】The training code is shown for reference, but a full run is estimated to take 15 hours and cannot be completed in a Kaggle notebook. I have therefore set train_exe = False. If you want to use it, please adapt it to your needs. (In practice, I spent about 3 hours on each fold; refer to the past versions.) The actual training was done in ver 8 (fold 0), ver 9 (fold 1), ver 7 (fold 2), ver 10 (fold 3), and ver 4 (fold 4).

In [None]:
train_exe = False # I strongly recommend setting it to false.

In [None]:
if train_exe:
    
    trainlosses = []
    vallosses = []
    
    trainscores = []
    validscores = []

    BERTresult = []

    for fold in range(CFG.kfold):

        bestscore = 100

        for SEED in SEEDS:

            train_dataloader,valid_dataloader,model,optimizer,scheduler,scaler = initialize(fold,SEED,BERT_tokenizer,"BERT")    

            for epoch in tqdm(range(CFG.epochs)):

                print("-{}{}--{}{}----{}{}----{}----".format("fold:",str(fold),"seed:",str(SEED),"epoch:",str(epoch),"start"))

                trainloss,trainscore = training(train_dataloader,model,optimizer,scheduler)

                trainlosses.append(trainloss)
                trainscores.append(trainscore)

                print("trainscore is " + str(trainscore))

                preds,validloss,valscore=validating(valid_dataloader,model)

                vallosses.append(validloss)
                validscores.append(valscore)


                print("valscore is " + str(valscore))

                if bestscore > valscore:

                    bestscore = valscore

                    print("found better point")

                    state = {
                                    'state_dict': model.state_dict(),
                                    'optimizer_dict': optimizer.state_dict(),
                                    "bestscore":bestscore
                                }


                    torch.save(state, "BERTmodel_fold" + str(fold) + ".pth")

                else:
                    pass

                BERTresult.append([fold,SEED,epoch,trainloss,trainscore,validloss,valscore,bestscore])

                if epoch == CFG.endepoch:
                    break
            
    BERTdf = pd.DataFrame(BERTresult)
    BERTdf.columns=["fold","SEED","epoch","trainloss","trainscore","validloss","valscore","bestscore"]
    BERTdf.to_csv("BERTdf.csv",index=False)
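
The control flow of the loop above is easy to misread, so here is a small sketch that replays it on precomputed scores instead of training. `run_fold` is a hypothetical helper and the validation curves are invented. It illustrates two things: `bestscore` is shared by all seeds within a fold, and breaking at `epoch == CFG.endepoch` means 11 epochs run per seed (indices 0 to 10).

```python
def run_fold(scores_by_seed, max_epochs=20, end_epoch=10):
    """Replay the loop control on precomputed validation scores.

    scores_by_seed maps seed -> list of per-epoch validation scores.
    Returns (best_score, best_seed, best_epoch, epochs_run_per_seed).
    """
    best_score, best_seed, best_epoch = 100, None, None
    epochs_run = {}
    for seed, scores in scores_by_seed.items():
        epochs_run[seed] = 0
        for epoch in range(max_epochs):
            epochs_run[seed] += 1
            if best_score > scores[epoch]:
                # This is the point where the fold's checkpoint file is overwritten.
                best_score, best_seed, best_epoch = scores[epoch], seed, epoch
            if epoch == end_epoch:
                break
    return best_score, best_seed, best_epoch, epochs_run

# Hypothetical validation curves for two seeds (20 values each).
fake_scores = {
    100: [0.70 - 0.01 * e for e in range(20)],
    101: [0.65 - 0.01 * e for e in range(20)],
}
best_score, best_seed, best_epoch, epochs_run = run_fold(fake_scores)

print(best_seed, best_epoch, epochs_run)
```

Since the checkpoint name contains only the fold, the file left on disk at the end is the best model over every seed and epoch of that fold.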

# 10. RoBERTa Training

In [None]:
if train_exe:
   

    trainlosses = []
    vallosses = []
    bestscore = None

    trainscores = []
    validscores = []

    Robertaresult = []

    for fold in range(CFG.kfold):


        bestscore = 100

        for SEED in SEEDS:

            train_dataloader,valid_dataloader,model,optimizer,scheduler,scaler = initialize(fold,SEED,Roberta_tokenizer,"Roberta") 


            for epoch in tqdm(range(CFG.epochs)):

                print("-{}{}--{}{}----{}{}----{}----".format("fold:",str(fold),"seed:",str(SEED),"epoch:",str(epoch),"start"))

                trainloss,trainscore = training(train_dataloader,model,optimizer,scheduler)

                trainlosses.append(trainloss)
                trainscores.append(trainscore)

                print("trainscore is " + str(trainscore))

                preds,validloss,valscore=validating(valid_dataloader,model)

                vallosses.append(validloss)
                validscores.append(valscore)


                print("valscore is " + str(valscore))



                if bestscore > valscore:

                    bestscore = valscore

                    print("found better point")

                    state = {
                                    'state_dict': model.state_dict(),
                                    'optimizer_dict': optimizer.state_dict(),
                                    "bestscore":bestscore
                                }


                    torch.save(state, "Robertamodel_fold" + str(fold) + ".pth")

                else:
                    pass

                Robertaresult.append([fold,SEED,epoch,trainloss,trainscore,validloss,valscore,bestscore])

                if epoch == CFG.endepoch:
                    break

        Robertadf = pd.DataFrame(Robertaresult)
        Robertadf.columns=["fold","SEED","epoch","trainloss","trainscore","validloss","valscore","bestscore"]
        Robertadf.to_csv("Robertadf.csv",index=False)

# -------------------------- From here on, I show all of the results. --------------------------
## Training was done in ver 8 (fold 0), ver 9 (fold 1), ver 7 (fold 2), ver 10 (fold 3), and ver 4 (fold 4). Each fold takes about 3 hours.

<a id="section-one"></a>
# **11. Comparison results of BERT and RoBERTa**

### Evaluation conditions
* model : BERT or RoBERTa
* fold : 5
* epoch : 10
* metric : valscore (validation score) and Public Leaderboard score. I actually submitted these models: 12 submissions in total (5 single-fold models each for BERT and RoBERTa, plus one k-fold ensemble for each model).
* random seed : 100, 101, 102, 103, 104
* how to adopt the best model : for each fold, the model trained with the random seed that gave the smallest validation score. ※ bestscore means the smallest validation score in each fold.


## 11.1 Loading the all results 

In [None]:
BERTres = pd.read_csv("../input/allresbertvsrobert/BERT_allres.csv")
BERTres

In [None]:
BERTres[BERTres["fold"]==0]

In [None]:
RoBERTa_res = pd.read_csv("../input/allresbertvsrobert/Roberta_allres.csv")
RoBERTa_res

------------------------------

## 11.2 Example: visualizing the validation score of fold 0 for each SEED

In [None]:
BERT_fold0 = BERTres[BERTres["fold"]==0]
Roberta_fold0 = RoBERTa_res[RoBERTa_res["fold"]==0]

In [None]:
BERT_fold0.head(3)

In [None]:
plt.figure(figsize = (20,5))
plt.subplot(1,2,1)
for a in BERT_fold0["SEED"].unique():
    tmpdf = BERT_fold0[BERT_fold0["SEED"]==a]
    plt.scatter(tmpdf["epoch"],tmpdf["valscore"],label = "seed="+str(a))
    plt.plot(tmpdf["epoch"],tmpdf["valscore"],)
plt.title("BERT-fold0",fontsize = 23)
plt.xlabel("epoch",fontsize = 15)
plt.ylabel("validation score",fontsize = 15)

plt.ylim(0.45,0.9)
plt.legend()
plt.grid()


plt.subplot(1,2,2)
for a in Roberta_fold0["SEED"].unique():
    tmpdf = Roberta_fold0[Roberta_fold0["SEED"]==a]
    plt.scatter(tmpdf["epoch"],tmpdf["valscore"],label = "seed="+str(a))
    plt.plot(tmpdf["epoch"],tmpdf["valscore"])

plt.title("RoBERTa-fold0",fontsize = 23)
plt.xlabel("epoch",fontsize = 15)
plt.ylabel("validation score",fontsize = 15)

plt.ylim(0.45,0.9)
plt.legend()
plt.grid()


#### At first glance, RoBERTa appears to reach a smaller (better) validation score.
#### With my model, the validation score also varies considerably from one random seed to another and from one epoch to the next.

#### Comparing the mean value.

In [None]:
BERTdf_mean = BERT_fold0.groupby("epoch")["valscore"].mean().reset_index()
BERTdf_mean.columns = ["epoch","BERT_scores_mean"]
Robertadf_mean = Roberta_fold0.groupby("epoch")["valscore"].mean().reset_index()
Robertadf_mean.columns = ["epoch","RoBERTa_scores_mean"]
Resdf_mean = pd.merge(BERTdf_mean ,Robertadf_mean,on="epoch")
Resdf_mean

In [None]:
x = np.arange(Resdf_mean["epoch"].max()+1)
plt.plot(x,Resdf_mean["BERT_scores_mean"],label="BERT_mean")
plt.plot(x,Resdf_mean["RoBERTa_scores_mean"],label="RoBERTa_mean")

plt.ylim(0.45,0.85)
plt.legend()

plt.title("Mean value comparison of BERT and RoBERTa at fold0",fontsize = 15)
plt.xlabel("epoch",fontsize = 15)
plt.ylabel("validation score",fontsize = 15)

plt.grid()
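
The mean curve hides how far the seeds spread around it. A minimal sketch of summarizing both the mean and the standard deviation per epoch, written with the standard library only; `summarize_by_epoch` is a hypothetical helper and the scores are invented, not taken from the result files.

```python
from statistics import mean, stdev

def summarize_by_epoch(rows):
    """rows: iterable of (seed, epoch, valscore). Returns {epoch: (mean, stdev)}."""
    by_epoch = {}
    for _seed, epoch, score in rows:
        by_epoch.setdefault(epoch, []).append(score)
    return {e: (mean(s), stdev(s)) for e, s in sorted(by_epoch.items())}

# Hypothetical scores for three seeds over two epochs.
rows = [
    (100, 0, 0.80), (101, 0, 0.70), (102, 0, 0.90),
    (100, 1, 0.60), (101, 1, 0.60), (102, 1, 0.60),
]
summary = summarize_by_epoch(rows)

for epoch, (m, s) in summary.items():
    print(epoch, round(m, 3), round(s, 3))
```

With the real data, the pandas equivalent would be a `groupby("epoch")` on `valscore` aggregated with both `mean` and `std`.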


## In this fold, RoBERTa looks better.
#### ※ In other folds, however, some cases were harder to judge.

-----------------------------------------------

## 11.3 Comparison across all folds

#### BERT case

In [None]:
BERTres.head(3)

### Extract the bestscore (minimum validation score) for each fold and SEED.

In [None]:
tmp = BERTres.groupby(["fold","SEED"])["valscore"].min().reset_index()
tmp.head(12)

In [None]:
import seaborn as sns

sns.boxplot(x="fold",y="valscore",data=tmp)


## You can see that the minimum validation score varies considerably from one random seed to another.

-----------------------------------------------------

#### Next, for each fold, get the index of the row with the minimum validation score.

In [None]:
tmp2 = tmp.groupby("fold")["valscore"].idxmin().reset_index()
tmp2

#### Selecting those rows gives, for each fold, the minimum validation score and the seed that produced it.

In [None]:
BERT_bestscore = tmp.loc[tmp2["valscore"],:].copy() # idxmin returns index labels, so select with loc; copy so that a column can be added later
BERT_bestscore.columns = ["fold","BERT_SEED","BERT_val_bestscore"]
BERT_bestscore
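
The two-step groupby used here (minimum per fold and seed, then the best seed per fold) boils down to the selection sketched below in plain Python. `best_seed_per_fold` is a hypothetical helper and the rows are invented values, not the notebook's results.

```python
def best_seed_per_fold(rows):
    """rows: iterable of (fold, seed, epoch, valscore).

    Returns {fold: (seed, best_valscore)}, where best means lowest.
    """
    best = {}
    for fold, seed, _epoch, score in rows:
        if fold not in best or score < best[fold][1]:
            best[fold] = (seed, score)
    return best

# Hypothetical results, only for illustration.
rows = [
    (0, 100, 9, 0.58), (0, 101, 7, 0.55), (0, 102, 10, 0.57),
    (1, 100, 8, 0.53), (1, 101, 10, 0.56),
]
best = best_seed_per_fold(rows)

print(best)
```

Keep in mind that picking the minimum over seeds and epochs on the same validation data makes the selected score optimistic, which is one reason the leaderboard comparison later in this notebook is useful.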

#### To read the table: in fold 0, the validation score was lowest (0.552) when SEED was 101.

------------------------------

#### Next, the RoBERTa results are processed in the same way and merged.

In [None]:
tmp = RoBERTa_res.groupby(["fold","SEED"])["valscore"].min().reset_index()
sns.boxplot(x="fold",y="valscore",data=tmp)


In [None]:
tmp2 = tmp.groupby("fold")["valscore"].idxmin().reset_index()
RoBERTa_bestscore = tmp.loc[tmp2["valscore"],:].copy() # idxmin returns index labels, so select with loc; copy so that a column can be added later
RoBERTa_bestscore.columns = ["fold","RoBERTa_SEED","RoBERTa_val_bestscore"]
RoBERTa_bestscore

In [None]:
Mergeres = pd.merge(BERT_bestscore,RoBERTa_bestscore,on="fold")
Mergeres

#### Compare each fold to see which was better, BERT or RoBERTa.

In [None]:
Mergeres["valscore_better_model"] = np.where(Mergeres["BERT_val_bestscore"]<Mergeres["RoBERTa_val_bestscore"],"BERT","RoBERTa")
Mergeres

#### I had expected RoBERTa to be better in every fold, but the validation scores show BERT doing better in fold 3.
#### So I then submitted all of these models as single models and compared their Public Leaderboard scores.

---------------------------------------

## 11.4 Comparison of Public Leaderboard scores

In [None]:
BERT_bestscore

Enter the scores returned for the submissions.

In [None]:
BERT_publicLB = [0.544,0.554,0.534,0.548,0.572]

In [None]:
BERT_bestscore["BERT_publicLB"] = BERT_publicLB

In [None]:
BERT_bestscore

The RoBERTa results are handled in the same way.

In [None]:
RoBERTa_bestscore

In [None]:
RoBERTa_publicLB = [0.532,0.530,0.511,0.536,0.535]

In [None]:
RoBERTa_bestscore["RoBERTa_publicLB"] = RoBERTa_publicLB
RoBERTa_bestscore

Merge the tables, as was done for the validation score comparison.

In [None]:
Mergeres = pd.merge(BERT_bestscore,RoBERTa_bestscore,on="fold")
Mergeres

Judge BERT against RoBERTa not only by the validation score but also by the Public LB score.

In [None]:
Mergeres["valscore_better_model"] = np.where(Mergeres["BERT_val_bestscore"]<Mergeres["RoBERTa_val_bestscore"],"BERT","RoBERTa")
Mergeres["publicLB_better_model"] = np.where(Mergeres["BERT_publicLB"]<Mergeres["RoBERTa_publicLB"],"BERT","RoBERTa")
Mergeres
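
As a cross-check of the `publicLB_better_model` column, the per-fold gap can be computed directly from the Public LB scores entered above. This is a plain-Python sketch using those same values; the variable names are my own.

```python
from statistics import mean

# Public LB scores entered in this notebook (lower is better), folds 0 to 4.
bert_lb = [0.544, 0.554, 0.534, 0.548, 0.572]
roberta_lb = [0.532, 0.530, 0.511, 0.536, 0.535]

# A positive gap means RoBERTa scored better on that fold.
gaps = [round(b - r, 3) for b, r in zip(bert_lb, roberta_lb)]
roberta_wins = sum(g > 0 for g in gaps)

print(gaps)                                                 # [0.012, 0.024, 0.023, 0.012, 0.037]
print(roberta_wins, "of", len(gaps), "folds")               # 5 of 5 folds
print(round(mean(bert_lb), 4), round(mean(roberta_lb), 4))  # 0.5504 0.5288
```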

## The validation scores suggested that BERT was better for fold 3,

## but the public LB scores confirmed that RoBERTa was better for every fold.

--------------------------------------

## 11.5 Comparison of my validation score and Public LB score

In [None]:
plt.figure(figsize=(5,5))

plt.scatter(Mergeres["BERT_val_bestscore"],Mergeres["BERT_publicLB"],label="BERT")


plt.scatter(Mergeres["RoBERTa_val_bestscore"],Mergeres["RoBERTa_publicLB"],label="RoBERTa")

for num,a in enumerate(Mergeres["fold"]):
    plt.annotate("fold " + str(a),xy=(Mergeres["BERT_val_bestscore"].iloc[num]+0.001,Mergeres["BERT_publicLB"].iloc[num]+0.001),c="blue")
    plt.annotate("fold " + str(a),xy=(Mergeres["RoBERTa_val_bestscore"].iloc[num]+0.001,Mergeres["RoBERTa_publicLB"].iloc[num]+0.001),c="orange")

plt.annotate("y=x",xy=(0.50,0.51),c="black") # label the y=x reference line once, not once per fold



x = y = np.arange(0.49,0.56,0.01)
plt.plot(x,y,c="black")

plt.title("Comparison of my validation score and Public LB score")

plt.xlabel("My validation score")
plt.ylabel("Public LB score")

plt.legend()
plt.grid()

### The result for fold 4 is clearly anomalous, and the BERT result for fold 3 also seems to be slightly off.
### From this, I concluded that my k-fold split was not very good.

------------------------------------

## 11.6 Score when submitting the average of the 5 folds' inferences

In [None]:
Mergeres

In [None]:
BERTres = [Mergeres["BERT_val_bestscore"].mean(),Mergeres["BERT_publicLB"].mean(),0.522]
RoBERTares = [Mergeres["RoBERTa_val_bestscore"].mean(),Mergeres["RoBERTa_publicLB"].mean(),0.505]

In [None]:
foldmeans = pd.DataFrame()
foldmeans["BERTres"] = BERTres
foldmeans["RoBERTares"] = RoBERTares

Indexname = ["Mean validation_score","Mean public LB with single model","5 k-fold public LB result"]
foldmeans["Metric"] = Indexname
foldmeans = foldmeans.set_index("Metric")
foldmeans

In [None]:
x1 = [1, 5, 9]
y1 = foldmeans.BERTres

x2 = [1.3, 5.3, 9.3]
y2 = foldmeans.RoBERTares

label_x = foldmeans.index

# First bar series (BERT)
plt.bar(x1, y1, color='b', width=0.3, label='BERT', align="center")

# Second bar series (RoBERTa)
plt.bar(x2, y2, color='g', width=0.3, label='RoBERTa', align="center")

# Legend
plt.legend(loc=4)

# Replace the x-axis tick labels
plt.xticks([1.15, 5.15, 9.15], label_x)
plt.xticks(rotation=45)
plt.show()
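
The size of the k-fold effect in the chart can also be stated as a number. The sketch below uses only the Public LB values already entered in this notebook (the single-model scores and the two 5-fold submissions, 0.522 and 0.505); the dictionary names are my own.

```python
from statistics import mean

# Single-model Public LB scores per fold, as entered above (lower is better).
single_lb = {
    "BERT": [0.544, 0.554, 0.534, 0.548, 0.572],
    "RoBERTa": [0.532, 0.530, 0.511, 0.536, 0.535],
}
# Public LB score of the 5-fold average submission for each model.
kfold_lb = {"BERT": 0.522, "RoBERTa": 0.505}

# Improvement of the fold average over the mean single-model score.
gain = {name: round(mean(scores) - kfold_lb[name], 4) for name, scores in single_lb.items()}

print(gain)  # {'BERT': 0.0284, 'RoBERTa': 0.0238}
```

In both cases the averaged submission also beats the best single fold (0.534 for BERT, 0.511 for RoBERTa), not just the mean of the single folds.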

## Both BERT and RoBERTa clearly benefited from k-fold averaging.

------------------------------

# 12. Inference with the best-scoring models: the RoBERTa k-fold ensemble, excluding fold 4, whose validation score and LB score were inconsistent.
#### ver15 : I found that removing the \n in the text improves the score, so I tried removing it. (I should have done this in training as well.)

In [None]:
class Inf_DataSet(Dataset):
    
    def __init__(self,sentences,tokenizer):
        
        self.sentences = sentences
        self.tokenizer = tokenizer
       
        
    def __len__(self):
        
        return len(self.sentences)
    
    def __getitem__(self,idx):
        
        sentence = self.sentences[idx]
        
        sentence = str(sentence) # added in ver 16
        sentence = " ".join(sentence.split()) # added in ver 16: collapse all whitespace, including newlines, into single spaces
        
#        sentence = sentence.replace('\n', '') # added in ver 15, now commented out
       
        
        
        
        bert_sens = self.tokenizer.encode_plus(
                                sentence,
                                add_special_tokens = True, # [CLS],[SEP]
                                max_length = CFG.sentence_len,
                                padding = "max_length", # pad up to max_length (replaces the deprecated pad_to_max_length)
                                truncation=True)

        ids = torch.tensor(bert_sens['input_ids'], dtype=torch.long)
        mask = torch.tensor(bert_sens['attention_mask'], dtype=torch.long)
#        token_type_ids = torch.tensor(bert_sens['token_type_ids'], dtype=torch.long)
     
        
    
        
        return {
                'ids': ids,
                'mask': mask,
 #               'token_type_ids': token_type_ids,
                
            }
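
The two newline treatments in `__getitem__` (the commented-out ver 15 line and the ver 16 lines) behave differently, as this small sketch shows. The helper names and the excerpt are made up for illustration.

```python
def drop_newlines(text):
    """ver 15 approach: delete newline characters outright."""
    return text.replace("\n", "")

def normalize_whitespace(text):
    """ver 16 approach: collapse every whitespace run into a single space."""
    return " ".join(str(text).split())

# Hypothetical excerpt with a line break and a double space.
excerpt = "The cat sat\non the  mat."

print(drop_newlines(excerpt))         # The cat saton the  mat.
print(normalize_whitespace(excerpt))  # The cat sat on the mat.
```

Deleting the newline glues the neighboring words together whenever there is no space next to it, whereas splitting and rejoining keeps the word boundary.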

In [None]:
test_dataset = Inf_DataSet(test["excerpt"],Roberta_tokenizer)

In [None]:
test_batch = 32

In [None]:
test_dataloader = DataLoader(test_dataset,batch_size=test_batch,shuffle = False,num_workers=4,pin_memory=True)

In [None]:
model = transformers.RobertaForSequenceClassification.from_pretrained("../input/roberta-base",num_labels=1)

#### Model paths: these models were trained in ver 8 (fold 0), ver 9 (fold 1), ver 7 (fold 2), ver 10 (fold 3), and ver 4 (fold 4). I merged them into a single dataset.

In [None]:
#pthes = [os.path.join("../input/roberta-res",s) for s in os.listdir("../input/roberta-res")]
pthes = ["../input/roberta-res/Robertamodel_fold0_seed102.pth","../input/roberta-res/Robertamodel_fold1_seed103.pth","../input/roberta-res/Robertamodel_fold2_seed100.pth"
        ,"../input/roberta-res/Robertamodel_fold3_seed104.pth"]
pthes

## 12.1 prediction function

In [None]:
def predicting(
    test_dataloader,
    model,
    pthes
    
):

    allpreds = []
    
    for pth in pthes:
        
        state = torch.load(pth, map_location=device) # map_location also lets the checkpoint load on a CPU-only machine
        model.load_state_dict(state["state_dict"])
        model.to(device)
        model.eval()
    
    
        preds = []
    
        with torch.no_grad():


            for a in test_dataloader:



                ids = a["ids"].to(device)
                mask = a["mask"].to(device)
                #tokentype = a["token_type_ids"].to(device)

                output = model(ids,mask)
                output = output["logits"].squeeze(-1)


                preds.append(output.cpu().numpy())

            preds = np.concatenate(preds)
            
            allpreds.append(preds)

    return allpreds

In [None]:
allpreds = predicting(test_dataloader,model,pthes)

In [None]:
findf = pd.DataFrame(allpreds)
findf = findf.T

In [None]:
findf

In [None]:
finpred = findf.mean(axis=1)
finpred
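
The transpose-and-mean step above is a simple row-wise average over the checkpoints. A plain-Python sketch of the same idea, with invented predictions and a hypothetical helper name:

```python
from statistics import mean

def average_predictions(per_model_preds):
    """Average predictions row by row across models (one list per model)."""
    return [mean(row) for row in zip(*per_model_preds)]

# Hypothetical predictions from four checkpoints for three test rows.
per_model_preds = [
    [-0.50, 0.10, -1.20],
    [-0.40, 0.30, -1.00],
    [-0.60, 0.20, -1.10],
    [-0.50, 0.20, -1.10],
]
final = average_predictions(per_model_preds)

print([round(v, 3) for v in final])  # [-0.5, 0.2, -1.1]
```

Every checkpoint gets equal weight here, which matches `findf.mean(axis=1)`; weighting by validation score would be a different design choice that this notebook does not test.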

In [None]:
sample

In [None]:
sample["target"] = finpred

In [None]:
sample

In [None]:
sample.to_csv("submission.csv",index = False)

# 13. Summary

#### I compared BERT and RoBERTa with my model across several random seeds, using code that performs the initialization correctly.

#### In conclusion, **RoBERTa was better than BERT.**

#### In the process, I found the following for my model:

* The score varies widely from epoch to epoch. (It may not stabilize without a few more epochs.)
* The score varies widely from seed to seed. (Trying several seeds and keeping the best score may be a good approach.)
* For particular folds, the validation score is a less reliable guide to the leaderboard score. (Maybe I should change the k-fold method. In the end, the split from https://www.kaggle.com/abhishek/step-1-create-folds worked better.)

ver15
* Removing the \n at inference time improves the score slightly. (Not verified in training.)

## **Please note that this result is for my model.**

# Thank you for reading this far!

# If this was helpful, I would appreciate an upvote.