# **About this notebook**

#### This notebook explains why the score of my shared notebook got worse, from 0.528 (ver6) to 0.546 (ver15), and how I brought it back to 0.528 (ver19).
※ The shared notebook is Pytorch BERT Beginner's Room: https://www.kaggle.com/chumajin/pytorch-bert-beginner-s-room
#### In short, it was a reproducibility problem. I will show how to write code that gives stable results, based on what I learned from a simple experiment.

### **I hope this helps. I would be glad if you could upvote!**
### Thank you to those who have already upvoted!


#### The Japanese edition of the shared notebook went through the same change: its score got worse from 0.528 (ver7) to 0.546 (ver9) and returned to 0.528 (ver10).
※ https://www.kaggle.com/chumajin/pytorch-bert-beginner-s-room-version

# **0. Preparation**

In [None]:
import numpy as np 
import pandas as pd 
import os

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
import matplotlib.pyplot as plt 

import transformers
import random


import warnings
warnings.simplefilter('ignore')

scaler = torch.cuda.amp.GradScaler() # gradient scaler for mixed-precision training (speeds up GPU runs)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # use the GPU when available, otherwise the CPU
device

#### Fix the random seeds.

In [None]:
SEED = 508

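def random_seed(SEED):
    # Seed every random number generator used in this notebook.
    random.seed(SEED)                          # Python's built-in RNG
    os.environ['PYTHONHASHSEED'] = str(SEED)   # hash seed for Python processes started afterwards
    np.random.seed(SEED)                       # NumPy
    torch.manual_seed(SEED)                    # PyTorch global generator
    torch.cuda.manual_seed(SEED)               # current GPU
    torch.cuda.manual_seed_all(SEED)           # all GPUs
    torch.backends.cudnn.deterministic = True  # ask cuDNN for deterministic algorithms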

random_seed(SEED)
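
As a quick check of what fixing the seed does, the sketch below (not part of the original notebook) seeds PyTorch's global generator twice with the same value and compares the random draws. The names first, second and third are only for illustration.

```python
import torch

SEED = 508

torch.manual_seed(SEED)
first = torch.rand(3)    # first draw after seeding

torch.manual_seed(SEED)
second = torch.rand(3)   # same seed again, so the same numbers come out
third = torch.rand(3)    # the generator has advanced, so this draw differs

print(torch.equal(first, second))
print(torch.equal(second, third))
```

The first comparison is expected to be True and the second False: a seed only fixes the starting point, and every later draw moves the generator forward.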

# **1. A simple experiment to understand the order in which data is loaded from the DataLoader**

In [None]:
df = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
df

#### Since this is a simple experiment, only the first 100 rows are used.

In [None]:
df = df.iloc[:100,:]
df

#### Create a Dataset class that returns the row index, so that the loading order can be seen

In [None]:
class EvalDataSet(Dataset):
    
    def __init__(self,df):
        
        self.df = df
        
    def __len__(self):
        
        return len(self.df)
    
    def __getitem__(self,idx):
        
        
        return { "id":self.df.index[idx]  }

#### Create a dataset.

In [None]:
test_dataset = EvalDataSet(df)

Example

In [None]:
test_dataset[0]

In [None]:
df

## **1.1 Understanding the order when iterating over a DataLoader repeatedly**

In [None]:
test_batch = 4

In [None]:
test_dataloader1 = DataLoader(test_dataset,batch_size=test_batch,shuffle = True,num_workers=4,pin_memory=True)

In [None]:
print(len(test_dataloader1))
for a in test_dataloader1:
    print(a)
    break

## **Do it again to check reproducibility**

In [None]:
print(len(test_dataloader1))
for a in test_dataloader1:
    print(a)
    break

#### This confirms that the order differs each time the same DataLoader is iterated.
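
A minimal sketch of the same effect without the CSV file, assuming a plain list of 100 integers as a stand-in dataset and num_workers=0 for simplicity: every pass over a shuffle = True DataLoader draws a new permutation, but each pass still visits every item exactly once. The helper one_pass is hypothetical.

```python
import torch
from torch.utils.data import DataLoader

torch.manual_seed(508)
data = list(range(100))  # stand-in for the 100 row indices
loader = DataLoader(data, batch_size=4, shuffle=True)

def one_pass(dl):
    # Flatten all batches of one full iteration into a list of ints.
    return [int(i) for batch in dl for i in batch]

pass1 = one_pass(loader)
pass2 = one_pass(loader)

print(pass1[:4], pass2[:4])                     # first batch of each pass
print(pass1 == pass2)                           # is the order identical?
print(sorted(pass1) == sorted(pass2) == data)   # are the same items covered?
```

The order comparison should come out False while the coverage comparison is True, which matches what the two cells above show for the first batch.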

## How can we get the original order back?

## **1.2 Understanding the order when creating a new DataLoader**

In [None]:
for a in test_dataloader1:
    result1 = a
    print(result1)
    break

#### Create a new DataLoader (test_dataloader2) for the experiment.

In [None]:
test_dataloader2 = DataLoader(test_dataset,batch_size=test_batch,shuffle = True,num_workers=4,pin_memory=True)

In [None]:
for a in test_dataloader2:
    result2 = a
    print(result2)
    break

In [None]:
result1["id"] == result2["id"]

#### The result is False, which means the loading order is different.
#### I had expected True, because I assumed the order would be reset when a new DataLoader is created.

------------------------------------------------------------------

## **1.3 Fixing the random seeds again solves it**

In [None]:
random_seed(SEED)

In [None]:
test_dataloader1 = DataLoader(test_dataset,batch_size=test_batch,shuffle = True,num_workers=4,pin_memory=True)

In [None]:
for a in test_dataloader1:
    result1 = a
    print(a)
    break

In [None]:
random_seed(SEED)

In [None]:
test_dataloader2 = DataLoader(test_dataset,batch_size=test_batch,shuffle = True,num_workers=4,pin_memory=True)

In [None]:
for a in test_dataloader2:
    result2 = a
    print(a)
    break

In [None]:
result1["id"] == result2["id"]

## **This confirms that the random seeds have to be fixed again, just before the DataLoader is created and read, to reset the loading order.**
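
The finding can also be sketched in a self-contained way. This is an illustration under simplifying assumptions (a list of integers instead of the DataFrame, num_workers=0, and a hypothetical helper first_batch), not the exact cells above.

```python
import torch
from torch.utils.data import DataLoader

data = list(range(100))

def first_batch(seed=None):
    # Optionally re-seed, then build a fresh shuffled loader and read one batch.
    if seed is not None:
        torch.manual_seed(seed)
    loader = DataLoader(data, batch_size=4, shuffle=True)
    return next(iter(loader))

seeded_a = first_batch(seed=508)
unseeded = first_batch()          # new loader, but no re-seeding
seeded_b = first_batch(seed=508)  # new loader after re-seeding

print(torch.equal(seeded_a, unseeded))
print(torch.equal(seeded_a, seeded_b))
```

Only the loader that is built right after re-seeding returns the same first batch; creating a new DataLoader on its own does not reset the order.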

## **1.4 What happens if a DataLoader with shuffle = False is accessed in between?**

In [None]:
random_seed(SEED)

In [None]:
test_dataloader1 = DataLoader(test_dataset,batch_size=test_batch,shuffle = True,num_workers=4,pin_memory=True)

In [None]:
for a in test_dataloader1:
    result1 = a
    print(a)
    break

#### Fix the random seeds, iterate over a DataLoader with shuffle = False, then read the shuffled DataLoader again

In [None]:
random_seed(SEED)

In [None]:
valid_dataloader1 = DataLoader(test_dataset,batch_size=test_batch,shuffle = False,num_workers=4,pin_memory=True)

In [None]:
for a in valid_dataloader1:
    print(a)
    break

In [None]:
for a in test_dataloader1:
    result2 = a
    print(a)
    break

In [None]:
result1["id"] == result2["id"]

#### Even accessing a DataLoader with shuffle = False appears to change the order that the shuffle = True DataLoader returns afterwards.
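
One possible way to make the shuffle order independent of other DataLoader accesses is to give the shuffled DataLoader its own torch.Generator through the generator argument. This is not what the original notebook does; the sketch below only illustrates the idea, again with a list of integers, num_workers=0, and a hypothetical helper make_loader.

```python
import torch
from torch.utils.data import DataLoader

data = list(range(100))

def make_loader(seed):
    # A dedicated generator keeps the shuffle order separate from the global RNG.
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(data, batch_size=4, shuffle=True, generator=g)

batch_a = next(iter(make_loader(508)))

# Unrelated work in between: a shuffle = False loader and a global random draw.
_ = next(iter(DataLoader(data, batch_size=4, shuffle=False)))
_ = torch.rand(10)

batch_b = next(iter(make_loader(508)))
print(torch.equal(batch_a, batch_b))
```

This has only been sketched for num_workers=0; random operations inside worker processes are not covered here.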

## **1.5 Summary so far**
### In my shared notebook, I accessed the DataLoader several times for explanation, so the score seems to have gotten worse when I changed the code a little. I found that it is important to fix the random seeds just before training.

## However, I noticed that the score is sometimes not reproduced even after this fix. The next experiment looks into that.

---------------------------------------------------------

# **2. Experiments on reproducibility when training with the code from my shared notebook※**


※ English : https://www.kaggle.com/chumajin/pytorch-bert-beginner-s-room


※ Japanese : https://www.kaggle.com/chumajin/pytorch-bert-beginner-s-room-version

#### The code below is copied from that notebook up to the training step, so it is fine to just run it. If you want to understand the details, see the notebooks linked above.

In [None]:
import numpy as np 
import pandas as pd 
import os
       
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
import matplotlib.pyplot as plt 

import transformers
import random


import warnings
warnings.simplefilter('ignore')

scaler = torch.cuda.amp.GradScaler() 

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # use the GPU when available, otherwise the CPU
device

In [None]:
tokenizer = transformers.BertTokenizer.from_pretrained("../input/bert-base-uncased")


In [None]:
max_sens=314

In [None]:
class BERTDataSet(Dataset):
    
    def __init__(self,sentences,targets):
        
        self.sentences = sentences
        self.targets = targets
        
    def __len__(self):
        
        return len(self.sentences)
    
    def __getitem__(self,idx):
        
        sentence = self.sentences[idx]
        
        bert_sens = tokenizer.encode_plus(
                                sentence,
                                add_special_tokens = True, 
                                max_length = max_sens, # set to 314 above
                                padding = 'max_length', # replaces the deprecated pad_to_max_length = True
                                return_attention_mask = True,
                                truncation = True)

        ids = torch.tensor(bert_sens['input_ids'], dtype=torch.long)
        mask = torch.tensor(bert_sens['attention_mask'], dtype=torch.long)
        token_type_ids = torch.tensor(bert_sens['token_type_ids'], dtype=torch.long)
     
            
        target = torch.tensor(self.targets[idx],dtype=torch.float)
        
        return {
                'ids': ids,
                'mask': mask,
                'token_type_ids': token_type_ids,
                'targets': target
            }

In [None]:
from transformers import AdamW
LR=2e-5

model = transformers.BertForSequenceClassification.from_pretrained("../input/bert-base-uncased",num_labels=1)
optimizer = AdamW(model.parameters(), LR,betas=(0.9, 0.999), weight_decay=1e-2) 

In [None]:
from transformers import get_linear_schedule_with_warmup


epochs = 20

train_batch = 16


train_steps = int(len(df)/train_batch*epochs)
print(train_steps)

num_steps = int(train_steps*0.1)

scheduler = get_linear_schedule_with_warmup(optimizer, num_steps, train_steps)
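
For reference, the step counts above and the shape of a linear schedule with warmup can be worked through in plain Python. The function lr_multiplier below is a hand-written sketch of the usual rule (linear ramp from 0 to 1 over the warmup steps, then linear decay to 0), not the library code itself; the 100 rows correspond to the truncated df.

```python
import math

n_rows, train_batch, epochs = 100, 16, 20

train_steps = int(n_rows / train_batch * epochs)   # 125, as computed in the cell above
num_steps = int(train_steps * 0.1)                 # 12 warmup steps

# The DataLoader also yields the final partial batch, so the real count is higher.
batches_per_epoch = math.ceil(n_rows / train_batch)   # 7
actual_steps = batches_per_epoch * epochs             # 140

def lr_multiplier(step, warmup_steps, total_steps):
    # Factor applied to the base learning rate at a given optimizer step.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

for step in (0, 6, 12, 125):
    print(step, lr_multiplier(step, num_steps, train_steps))
```

If all 20 epochs were run with these settings, the multiplier would already have reached 0 for the last 15 of the 140 steps, because the schedule is built for 125 steps.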

In [None]:
from tqdm import tqdm

In [None]:
def training(
    train_dataloader,
    model,
    optimizer,scheduler
):
    
    model.train()
    torch.backends.cudnn.benchmark = True

    allpreds = []
    alltargets = []
    losses = []  # per-batch losses; created once so that every batch is kept

    for a in tqdm(train_dataloader):

        optimizer.zero_grad()

        with torch.cuda.amp.autocast():

            ids = a["ids"].to(device,non_blocking=True)
            mask = a["mask"].to(device,non_blocking=True)
            tokentype = a["token_type_ids"].to(device,non_blocking=True)

            output = model(ids,mask)
            output = output["logits"].squeeze(-1)

            target = a["targets"].to(device,non_blocking=True)

            loss = loss_fn(output,target)


            # For scoring
            losses.append(loss.item())
            allpreds.append(output.detach().cpu().numpy())
            alltargets.append(target.detach().squeeze(-1).cpu().numpy())

        scaler.scale(loss).backward() # backwards of loss
        scaler.step(optimizer) # Update optimizer
        scaler.update() # scaler update

        scheduler.step() # Update learning rate schedule
        
        del loss

    # Concatenate the per-batch predictions and targets

    allpreds = np.concatenate(allpreds)
    alltargets = np.concatenate(alltargets)

    # The batch loss is not used for scoring, but it is collected anyway

    losses = np.mean(losses)

    # Score with RMSE
    train_rmse_loss = np.sqrt(mean_squared_error(alltargets,allpreds))

    return losses,train_rmse_loss

In [None]:
def loss_fn(output,target):
    return torch.sqrt(nn.MSELoss()(output,target))
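
The RMSE computed per batch and then averaged is generally not equal to the RMSE computed once over all samples, which is why the training function scores with the latter. A small worked example in plain Python, with made-up numbers and a hypothetical helper rmse:

```python
import math

def rmse(preds, targets):
    # Root mean squared error of two equally long sequences.
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

preds = [0.0, 0.0, 0.0, 0.0]
targets = [1.0, 1.0, 3.0, 3.0]

# Two batches of two samples each
batch_rmses = [rmse(preds[:2], targets[:2]), rmse(preds[2:], targets[2:])]
mean_of_batches = sum(batch_rmses) / len(batch_rmses)   # (1.0 + 3.0) / 2 = 2.0
overall = rmse(preds, targets)                          # sqrt(20 / 4) = sqrt(5)

print(mean_of_batches, overall)
```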

### The important part starts here. Initialize everything and train. The scheduler and scaler are updated during training, so they must also be created again if you use them; otherwise the result is not reproducible.

In [None]:
random_seed(SEED)
train_dataset = BERTDataSet(df["excerpt"],df["target"])

train_dataloader = DataLoader(train_dataset,batch_size=16,shuffle = True,num_workers=4,pin_memory=True)

model = transformers.BertForSequenceClassification.from_pretrained("../input/bert-base-uncased",num_labels=1)

model.to(device)
LR=2e-5
optimizer = AdamW(model.parameters(), LR,betas=(0.9, 0.999), weight_decay=1e-2) # AdamW optimizer

train_steps = int(len(df)/train_batch*epochs)

num_steps = int(train_steps*0.1)

scheduler = get_linear_schedule_with_warmup(optimizer, num_steps, train_steps)

scaler = torch.cuda.amp.GradScaler() 


In [None]:
losses,train_rmse_loss = training(train_dataloader,model,optimizer,scheduler)

In [None]:
losses

In [None]:
train_rmse_loss

## Do it again to check reproducibility

In [None]:
random_seed(SEED)
train_dataset = BERTDataSet(df["excerpt"],df["target"])

train_dataloader = DataLoader(train_dataset,batch_size=16,shuffle = True,num_workers=4,pin_memory=True)

model = transformers.BertForSequenceClassification.from_pretrained("../input/bert-base-uncased",num_labels=1)

model.to(device)
LR=2e-5
optimizer = AdamW(model.parameters(), LR,betas=(0.9, 0.999), weight_decay=1e-2) # AdamW optimizer

train_steps = int(len(df)/train_batch*epochs)

num_steps = int(train_steps*0.1)

scheduler = get_linear_schedule_with_warmup(optimizer, num_steps, train_steps)

scaler = torch.cuda.amp.GradScaler() 


In [None]:
losses2,train_rmse_loss2 = training(train_dataloader,model,optimizer,scheduler)

In [None]:
losses2

In [None]:
train_rmse_loss2

## Comparing the scores

In [None]:
train_rmse_loss == train_rmse_loss2

#### With this method, the score was reproduced properly. The problem appears when the initialization is put into a function.

--------------------------------------------------------------------------------------
### Failure example

In [None]:
def initialization1():
    random_seed(SEED)
    train_dataset = BERTDataSet(df["excerpt"],df["target"])

    train_dataloader = DataLoader(train_dataset,batch_size=16,shuffle = True,num_workers=4,pin_memory=True)

    model = transformers.BertForSequenceClassification.from_pretrained("../input/bert-base-uncased",num_labels=1)

    model.to(device)
    LR=2e-5
    optimizer = AdamW(model.parameters(), LR,betas=(0.9, 0.999), weight_decay=1e-2) # AdamW optimizer

    train_steps = int(len(df)/train_batch*epochs)

    num_steps = int(train_steps*0.1)

    scheduler = get_linear_schedule_with_warmup(optimizer, num_steps, train_steps)

    scaler = torch.cuda.amp.GradScaler() 
    
    


In [None]:
initialization1()
losses,train_rmse_loss = training(train_dataloader,model,optimizer,scheduler)
print(losses,train_rmse_loss)

## Do it again to check reproducibility

In [None]:
initialization1()
losses2,train_rmse_loss2 = training(train_dataloader,model,optimizer,scheduler)
print(losses2,train_rmse_loss2)

In [None]:
train_rmse_loss == train_rmse_loss2

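## When the initialization was put into a function, the score was not reproduced.
The objects created inside initialization1 are local variables and are discarded when the function ends, so training still receives the model, optimizer and scheduler left over from the previous run.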
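
Python's scoping behaviour behind this can be reproduced in a few lines. In this sketch, plain strings stand in for the model and all names are hypothetical.

```python
model = "already trained"   # stand-in for the global model left over from the first run

def init_without_return():
    model = "fresh"         # local name only; the global model is untouched

def init_with_return():
    model = "fresh"
    return model            # hand the new object back to the caller

init_without_return()
seen_after_no_return = model     # what a following training call would receive

model = init_with_return()
seen_after_return = model

print(seen_after_no_return, "|", seen_after_return)
```

Without a return value (or a global declaration), the second training run starts from the object that was already trained, so its score cannot match the first run.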

------------------------------------------------------------
## Below is a successful example. Return each object from the function.

In [None]:
def initialization2():
    random_seed(SEED)
    train_dataset = BERTDataSet(df["excerpt"],df["target"])

    train_dataloader = DataLoader(train_dataset,batch_size=16,shuffle = True,num_workers=4,pin_memory=True)

    model = transformers.BertForSequenceClassification.from_pretrained("../input/bert-base-uncased",num_labels=1)

    model.to(device)
    LR=2e-5
    optimizer = AdamW(model.parameters(), LR,betas=(0.9, 0.999), weight_decay=1e-2) # AdamW optimizer

    train_steps = int(len(df)/train_batch*epochs)

    num_steps = int(train_steps*0.1)

    scheduler = get_linear_schedule_with_warmup(optimizer, num_steps, train_steps)

    scaler = torch.cuda.amp.GradScaler() 
    
    return train_dataloader,model,optimizer,scheduler,scaler


In [None]:
train_dataloader,model,optimizer,scheduler,scaler = initialization2()
losses,train_rmse_loss = training(train_dataloader,model,optimizer,scheduler)
print(losses,train_rmse_loss)

## Do it again to check reproducibility

In [None]:
train_dataloader,model,optimizer,scheduler,scaler = initialization2()
losses2,train_rmse_loss2 = training(train_dataloader,model,optimizer,scheduler)
print(losses2,train_rmse_loss2)

Comparing the scores

In [None]:
train_rmse_loss == train_rmse_loss2

## **With an initialization function that returns the objects, the score was reproduced properly.**

## v11: Adding a suggestion I received in the comments (assign the objects to global variables)

In [None]:
def initialization3():
    global train_dataloader,model,optimizer,scheduler,scaler
    random_seed(SEED)
    train_dataset = BERTDataSet(df["excerpt"],df["target"])
    train_dataloader = DataLoader(train_dataset,batch_size=16,shuffle = True,num_workers=4,pin_memory=True)
    model = transformers.BertForSequenceClassification.from_pretrained("../input/bert-base-uncased",num_labels=1)
    model.to(device)
    LR=2e-5
    optimizer = AdamW(model.parameters(), LR,betas=(0.9, 0.999), weight_decay=1e-2) # AdamW optimizer
    train_steps = int(len(df)/train_batch*epochs)
    num_steps = int(train_steps*0.1)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_steps, train_steps)
    scaler = torch.cuda.amp.GradScaler()

initialization3()
losses,train_rmse_loss3 = training(train_dataloader,model,optimizer,scheduler)
print(train_rmse_loss == train_rmse_loss3)

This returns True as expected. ※ I learned something new. Thank you!

--------------------------------------------------------------------------------

# Summary

## To get properly reproducible results ...

## 1. You need to fix the random seeds again before accessing the DataLoader.

## 2. If you put the initialization into a function, you need to return the created objects (or assign them to global variables).

#### I applied these principles in my shared notebook and brought the score back from 0.546 to 0.528.


#### I think this approach is useful for running K-fold accurately and for comparing models.
## Thank you for reading. If you find it useful, I would be grateful if you could **upvote**.
## Using this technique, I compared BERT and RoBERTa with various random seeds.
## Please have a look: https://www.kaggle.com/chumajin/bert-v-s-roberta-english
