## Introduction
This notebook is my own guide to understanding BERT.  
I am a beginner myself, so it is very basic and serves mainly as a personal memo, but it may help other beginners too.  
I forgot to mention offset mapping, so I added a comment about it.

## References  
[NBME / Deberta-base baseline [train]](https://www.kaggle.com/yasufuminakama/nbme-deberta-base-baseline-train)  
[[NBME]BERT_for_beginners](https://www.kaggle.com/tomohiroh/nbme-bert-for-beginners)

In [None]:
import os

data_path = "../input/nbme-score-clinical-patient-notes"
print("File list of data_path:\n", os.listdir(data_path))
print('\n')

forMyself = False

# Note to self: I forget how absolute and relative paths work at every competition.
if forMyself:
    # current path
    print("Name of current path:\n", os.getcwd())
    print("File list of current path:\n", os.listdir(os.getcwd()))
    print("File list of current path:\n", os.listdir('./'))  # same directory
    # one level up
    print('\n')
    print(os.listdir('/kaggle'))
    print(os.listdir('../'))
    # root path
    print("\n", os.listdir('/'))

### **Preparations**

* I copied the code from nakama's notebook to check how BERT operates.

In [None]:
# ====================================================
# CFG
# ====================================================
class CFG:
    model="bert-base-uncased"
    max_len=512
    batch_size=12
    n_fold=5
    trn_fold=[0, 1, 2, 3, 4]

In [None]:
# ====================================================
# Library
# ====================================================
import gc
import random
import ast
import itertools
import warnings
warnings.filterwarnings("ignore")

import scipy as sp
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import matplotlib.pyplot as plt

from tqdm.auto import tqdm

from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold

import torch
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.optim import Adam, SGD, AdamW
from torch.utils.data import DataLoader, Dataset

import tokenizers
import transformers
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
#%env TOKENIZERS_PARALLELISM=true

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# ====================================================
# Utils
# ====================================================
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=42)

In [None]:
# ====================================================
# Data Loading
# ====================================================
train = pd.read_csv(os.path.join(data_path,'train.csv'))
features = pd.read_csv(os.path.join(data_path,'features.csv'))
patient_notes = pd.read_csv(os.path.join(data_path,'patient_notes.csv'))

* Convert annotation and location from strings to lists.
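
Both columns are read from the CSV as strings that look like Python lists, which is why `ast.literal_eval` is applied. A minimal sketch with made-up values in the same format (the strings below are illustrative assumptions, not read from the file):

```python
import ast

# Illustrative raw cell values in the string form found in the CSV (assumed examples)
raw_annotation = "['chest pain', 'pain in chest']"
raw_location = "['10 20', '35 40;50 58']"
raw_empty = "[]"

annotation = ast.literal_eval(raw_annotation)
location = ast.literal_eval(raw_location)
empty = ast.literal_eval(raw_empty)

print(type(raw_annotation).__name__, "->", type(annotation).__name__)
print(annotation)
print(location)
print(len(empty))
```

After conversion, `len()` counts list elements rather than characters, so a row without annotations has length 0.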

In [None]:
# ====================================================
# Convert annotation and location to lists
# ====================================================
print(f"train.shape: {train.shape}")
display(train.head())  # unlike print, renders the DataFrame as a table

# train
train['annotation'] = train['annotation'].apply(ast.literal_eval)
train['location'] = train['location'].apply(ast.literal_eval)

print(f"train.shape: {train.shape}")
display(train.head())  # unlike print, renders the DataFrame as a table

In [None]:
train = train.merge(features, on = ['feature_num', 'case_num'], how = 'left')
train = train.merge(patient_notes, on = ['pn_num', 'case_num'], how = 'left')

* Add a column that counts the number of elements in each annotation.

In [None]:
train['annotation_length'] = train['annotation'].apply(len)
# number of elements in the annotation list, not the number of characters
train[train['annotation_length']==2][0:3]

## **Tokenizer**

* Define the tokenizer. AutoTokenizer appears to pick the right tokenizer once the model name is given.

In [None]:
# ====================================================
# tokenizer
# ====================================================
tokenizer = AutoTokenizer.from_pretrained(CFG.model)
CFG.tokenizer = tokenizer

### Outputs of tokenizer
* Check the tokenizer output.
* The special tokens are CLS at the beginning of the input and SEP at the end of each text. Because 'pn_history' and 'feature_text' are tokenized together as a pair, SEP appears twice.  
* Tokens starting with ## are sub-word pieces that continue the previous token.
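
To make the ## convention concrete, here is a toy sketch of greedy longest-match-first splitting, the idea behind WordPiece. The helper `wordpiece` and the tiny vocabulary are hypothetical and only for illustration; the real tokenizer has a far larger vocabulary and applies extra normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the previous piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Tiny hypothetical vocabulary, for illustration only
toy_vocab = {"learn", "##ing", "token", "##izer", "##s"}

tokens = (["[CLS]"] + wordpiece("learning", toy_vocab) + ["[SEP]"]
          + wordpiece("tokenizers", toy_vocab) + ["[SEP]"])
print(tokens)
```

The printed list has the same shape as the real output for a text pair: one CLS, the pieces of the first text, SEP, the pieces of the second text, SEP.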

In [None]:
text1 = train['pn_history'][0]
text2 = 'This is a feature text.'
print("Text1:\n",text1, "\nText2:\n",text2)
print("#"*100)
tokenized_text = tokenizer.tokenize(text1, text2, add_special_tokens=True)
print("Tokenized:\n", tokenized_text)
print("#"*100)
print("Encoded:\n", tokenizer.encode(text1))
print("#"*100)

#special tokens
s1 = tokenizer.all_special_ids
print("Special tokens:")
for s in s1:
    print(f'{s} --> {tokenizer.decode(s)}')

### Operation tests
* Text can be passed as a pair (pn_history and feature_text), so the model can relate the two texts to each other. 
* The output is in dictionary format.  
* attention_mask is aligned with input_ids: 1 for real tokens and 0 for padding.  
* token_type_ids is 0 for tokens of the first text and 1 for tokens of the second text.
* The way numbers are tokenized (text3) gave an interesting result.
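
The three output fields can be sketched without a real tokenizer. The helper `encode_pair` below is hypothetical; the content token ids are placeholders, while 101, 102 and 0 are assumed to be the CLS, SEP and PAD ids of bert-base-uncased.

```python
def encode_pair(ids_a, ids_b, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Lay out a text pair as [CLS] A [SEP] B [SEP] and pad to max_length."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # CLS, text A and the first SEP are segment 0; text B and the last SEP are segment 1
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


# Placeholder ids standing in for the tokens of two short texts
out = encode_pair([11, 12, 13], [21, 22], max_length=10)
for key, value in out.items():
    print(key, value)
```

All three lists have the same length, which is what lets them be stacked into tensors for a batch.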

In [None]:
text1 = "This is a pen."
text2 = "So, we must keep learning."
text3 = "12345678 9 date is 20220314"
tokenizer_output = tokenizer(text1, text2, add_special_tokens=True)
print("tokenizer output with special_tokens:\n", tokenizer_output)

tokenizer_output2 = tokenizer(text3, add_special_tokens=True)
print("\ntokenizer output2:\n", tokenizer_output2)

print("\n")
for i in tokenizer_output2['input_ids']:
    print(tokenizer.decode(i))

* sequence_ids() returns None for special tokens, 0 for the first text, and 1 for the second text, so it identifies only the text tokens. Note that padding is also a special token and is therefore None.
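
A small sketch of that layout, using a hypothetical helper `toy_sequence_ids` rather than the real tokenizer. It also shows why testing `!= 0` selects everything outside the first text, which is the same test `create_label` relies on in this notebook.

```python
def toy_sequence_ids(len_a, len_b, max_length):
    """Mimic the layout of sequence_ids() for [CLS] A [SEP] B [SEP] [PAD]..."""
    seq = [None] + [0] * len_a + [None] + [1] * len_b + [None]
    return seq + [None] * (max_length - len(seq))


seq = toy_sequence_ids(len_a=3, len_b=2, max_length=10)
print(seq)

# Positions that do not belong to the first text; None != 0 is True in Python
ignore_idxes = [i for i, s in enumerate(seq) if s != 0]
print(ignore_idxes)
```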

In [None]:
print(tokenizer_output.items())

seq = tokenizer_output.sequence_ids()
print("Sequence ids:\n", seq)

* The padding token is the special token with id 0. 
* Padding brings the number of tokens up to the length specified by max_length.
* return_offsets_mapping = True returns, for each token, a (start, end) tuple giving its character position in the original text. Special tokens came out as (0, 0).
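
The idea of an offset mapping can be sketched with plain string positions. Whitespace splitting stands in for the real sub-word tokenizer here, and `toy_offsets` is a hypothetical helper, so this only illustrates the (start, end) convention.

```python
import re


def toy_offsets(text):
    """Return (token, (start, end)) pairs for whitespace-separated tokens."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


text = "chest pain since monday"
pairs = toy_offsets(text)

# Special tokens have no source characters, so they are given (0, 0)
offsets = [(0, 0)] + [span for _, span in pairs] + [(0, 0)]
print(offsets)

# Each offset slices the token back out of the original text
for token, (start, end) in pairs:
    assert text[start:end] == token
```

Because `end` is exclusive, `text[start:end]` recovers the token exactly, which is what makes character-level annotations convertible to token-level labels.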

In [None]:
# Note: len here is the number of tokens, not the number of characters!
text1_len = len(tokenizer.encode(text1, add_special_tokens = False))
text2_len = len(tokenizer.encode(text2, add_special_tokens = False))

max_len = text1_len + text2_len + 10

tokenizer_output = tokenizer(text1, text2, 
                             add_special_tokens = True,
                             max_length = max_len,
                             padding= "max_length",
                             return_offsets_mapping = True)
print("tokenizer output with padding:\n", tokenizer_output)
print(f'tokens:\n {tokenizer.decode(tokenizer_output["input_ids"])}')

### Equalize the number of tokens
* In preparation for training, find the maximum number of tokens (not characters) in pn_history and in feature_text, and pad every input to the combined length.
* The texts are encoded with add_special_tokens=False; the real input contains one CLS and two SEPs, so 3 is added to the maximum.
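
The arithmetic, sketched with hypothetical token counts (the numbers are made up for illustration and are not taken from the dataset):

```python
# Hypothetical token counts with special tokens excluded, for illustration only
toy_pn_history_lengths = [120, 310, 275]
toy_features_lengths = [4, 12, 9]

n_special = 3  # [CLS] + [SEP] + [SEP]
toy_max_len = max(toy_pn_history_lengths) + max(toy_features_lengths) + n_special
print(toy_max_len)  # 310 + 12 + 3 = 325
```

Taking the maximum of each column separately guarantees that even the longest note paired with the longest feature text fits without truncation.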

In [None]:
# nakama's code
# https://www.kaggle.com/yasufuminakama/nbme-deberta-base-baseline-train
# ====================================================
# Define max_len
# ====================================================
print(patient_notes['pn_history'].isnull().sum())
print(features['feature_text'].isnull().sum())

for text_col in ['pn_history']:
    pn_history_lengths = []
    tk0 = tqdm(patient_notes[text_col].fillna("").values, total=len(patient_notes))
    for text in tk0:
        length = len(tokenizer(text, add_special_tokens=False)['input_ids'])  # token count of the patient note
        pn_history_lengths.append(length)
    #LOGGER.info(f'{text_col} max(lengths): {max(pn_history_lengths)}')

for text_col in ['feature_text']:
    features_lengths = []
    tk0 = tqdm(features[text_col].fillna("").values, total=len(features))
    for text in tk0:
        length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
        features_lengths.append(length)
    #LOGGER.info(f'{text_col} max(lengths): {max(features_lengths)}')

CFG.max_len = max(pn_history_lengths) + max(features_lengths) + 3
# cls & sep & sep

print(CFG.max_len)
# model="microsoft/deberta-base" -> 466; the maximum token count differs by model.

Plot the token-count distributions for reference.

In [None]:
fig = plt.figure(figsize=(8, 3))
fig.add_subplot(121)
plt.hist(pn_history_lengths)
fig.add_subplot(122)
plt.hist(features_lengths)
plt.show()

### Inputs and labels

* Define the inputs to the model.

In [None]:
# nakama's code
# https://www.kaggle.com/yasufuminakama/nbme-deberta-base-baseline-train
#
def prepare_input(cfg, text, feature_text):
    inputs = cfg.tokenizer(text, feature_text, 
                           add_special_tokens=True,
                           max_length=cfg.max_len,
                           padding="max_length",
                           return_offsets_mapping=False)
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)  # convert to torch.long
    return inputs


def create_label(cfg, text, annotation_length, location_list):
    encoded = cfg.tokenizer(text,
                            add_special_tokens=True,
                            max_length=cfg.max_len,
                            padding="max_length",
                            return_offsets_mapping=True)
    offset_mapping = encoded['offset_mapping']
    # positions that are not part of the text (special tokens, padding) get -1
    ignore_idxes = np.where(np.array(encoded.sequence_ids()) != 0)[0]
    label = np.zeros(len(offset_mapping))
    label[ignore_idxes] = -1
    if annotation_length != 0:
        for location in location_list:
            # one location may hold several "start end" spans separated by ';'
            for loc in [s.split() for s in location.split(';')]:
                start_idx = -1
                end_idx = -1
                start, end = int(loc[0]), int(loc[1])
                for idx in range(len(offset_mapping)):
                    # first token starting after `start`: the span begins one token earlier
                    if (start_idx == -1) & (start < offset_mapping[idx][0]):
                        start_idx = idx - 1
                    # first token ending at or after `end`: exclusive end of the slice
                    if (end_idx == -1) & (end <= offset_mapping[idx][1]):
                        end_idx = idx + 1
                if start_idx == -1:
                    start_idx = end_idx
                if (start_idx != -1) & (end_idx != -1):
                    label[start_idx:end_idx] = 1  # tokens covered by the annotation
    return torch.tensor(label, dtype=torch.float)

* inputs holds the tokens of pn_history and feature_text. Every input is padded to the same number of tokens (not characters).
* label is computed from annotation, location, and offset_mapping: the tokens of pn_history that describe the feature_text are set to 1.
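
The conversion from character spans to token labels can be sketched as follows. This is a simplified overlap-based variant, not the index-scanning logic of `create_label`; the helper `spans_to_labels`, the offsets, and the location string are hypothetical.

```python
def spans_to_labels(offsets, sequence_ids, spans):
    """1 for first-text tokens overlapping a span, 0 otherwise, -1 for ignored positions."""
    labels = []
    for (tok_start, tok_end), seq_id in zip(offsets, sequence_ids):
        if seq_id != 0:
            labels.append(-1)  # special token, second text, or padding
            continue
        hit = any(tok_start < end and start < tok_end for start, end in spans)
        labels.append(1 if hit else 0)
    return labels


# "chest pain since monday" as whitespace tokens, wrapped in [CLS] ... [SEP] plus one [PAD]
offsets = [(0, 0), (0, 5), (6, 10), (11, 16), (17, 23), (0, 0), (0, 0)]
sequence_ids = [None, 0, 0, 0, 0, None, None]

location = "0 10"  # same "start end" format as the location column
spans = [tuple(int(x) for x in part.split()) for part in location.split(";")]

labels = spans_to_labels(offsets, sequence_ids, spans)
print(labels)
```

The -1 entries mark positions that should be excluded when the loss is computed, while 0 and 1 are the actual targets for the tokens of the patient note.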

In [None]:
text = train['pn_history'][0]
feature_text = train['feature_text'][0]

inputs = prepare_input(CFG, text, feature_text)
print(inputs.keys())

* inputs are converted to torch.long for training.

In [None]:
for k, v in inputs.items():
    print(k, "\n", v[0:10],".....", v.dtype)

In [None]:
text = train['pn_history'][0]
annotation_length = train['annotation_length'][0]
location_list = train['location'][0]
label = create_label(CFG, text, annotation_length, location_list)

* Check the contents of label.
* The label is built using return_offsets_mapping=True.

In [None]:
print("annotation location: ", location_list)
print(f'location{location_list} is --> {text[696:724]}')

print(f'\nlabel = 1: {np.where(label == 1)}')
tokenized_text = tokenizer.tokenize(text, add_special_tokens=True)
print(f'tokenized_text[181:187]: {tokenized_text[181:187]}')

## Dataset

* Define a dataset that passes inputs and label to the dataloader.

In [None]:
# nakama's code
# https://www.kaggle.com/yasufuminakama/nbme-deberta-base-baseline-train
#
class TrainDataset(Dataset):
    def __init__(self, cfg, df):
        self.cfg = cfg
        self.feature_texts = df['feature_text'].values
        self.pn_historys = df['pn_history'].values
        self.annotation_lengths = df['annotation_length'].values
        self.locations = df['location'].values

    def __len__(self):
        return len(self.feature_texts)

    def __getitem__(self, item):
        inputs = prepare_input(self.cfg, 
                               self.pn_historys[item], 
                               self.feature_texts[item])
        label = create_label(self.cfg, 
                             self.pn_historys[item], 
                             self.annotation_lengths[item], 
                             self.locations[item])
        return inputs, label

### Model

* Let's look at the structure of the model. Install torchsummaryX first.

In [None]:
pip install -q torchsummaryX

In [None]:
from torchsummaryX import summary

* Define the model. AutoConfig and AutoModel load the pretrained model specified by CFG.model.
* Setting output_hidden_states=True returns the output of every encoder layer, including the final one. Normally only the final layer is used; the outputs of the other layers can be tried later to improve accuracy once a baseline is in place.

In [None]:
config = AutoConfig.from_pretrained(CFG.model)
config.update({"output_hidden_states": True})
model = AutoModel.from_pretrained(CFG.model, config = config)

In [None]:
model.config

In [None]:
s = summary(model, torch.zeros((1, CFG.max_len), dtype=torch.long))

* The 12 Transformer encoder layers, encoder.layer.0 through encoder.layer.11, are displayed.

In [None]:
display(s)

### Try out the output of the model

* Check the output of the model. The mini-batch size is set to 1.

In [None]:
# ====================================================
# CV split
# ====================================================
Fold = GroupKFold(n_splits=CFG.n_fold)
groups = train['pn_num'].values
for n, (train_index, val_index) in enumerate(
    Fold.split(train, train['location'], groups)):
    train.loc[val_index, 'fold'] = int(n)
train['fold'] = train['fold'].astype(int)

In [None]:
fold = 0
CFG.batch_size = 1

folds = train.copy()

train_folds = folds[folds['fold'] != fold].reset_index(drop=True)
train_dataset = TrainDataset(CFG, train_folds)

train_loader = DataLoader(train_dataset,
                          batch_size=CFG.batch_size,
                          shuffle = False,  # for checking behavior
                          num_workers=0, pin_memory=True, drop_last=True)

In [None]:
# contents of the train loader
batch_iterator = iter(train_loader)
inputs, label = next(batch_iterator)
print(inputs.keys())
print(label.size())

In [None]:
y = model(**inputs)

* With output_hidden_states=True, hidden_states from layers other than the last one can also be retrieved.
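
A rough bookkeeping sketch of what to expect from the output. The layer count and hidden size are assumed values for bert-base-uncased, and the sequence length is illustrative; model.config holds the actual numbers for whichever model is loaded.

```python
# Values assumed for bert-base-uncased; check model.config for the model actually loaded
num_hidden_layers = 12
hidden_size = 768
batch_size, seq_len = 1, 16  # illustrative values

# hidden_states = embedding output + one tensor per encoder layer
n_hidden_states = num_hidden_layers + 1
expected_shapes = [(batch_size, seq_len, hidden_size)] * n_hidden_states

print(n_hidden_states)
print(expected_shapes[0], expected_shapes[-1])
```

The extra entry at index 0 is the embedding output, and the last entry corresponds to last_hidden_state.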

In [None]:
print(f'BERT model output: {y.keys()}')
print(f'max_len = {CFG.max_len}')
print(f'last_hidden_state --> {y["last_hidden_state"].size()}')
print(f'pooler_output     --> {y["pooler_output"].size()}')
print(f'hidden_states     --> {len(y["hidden_states"])}')

for i in range(len(y["hidden_states"])):
    print(f'hidden_states-{i}     --> {y["hidden_states"][i].size()}')