This notebook shows how to get text embeddings from RoBERTa (the same approach applies to BERT, ALBERT, and similar models).  
Two methods are commonly used:
1. Take the hidden state of the CLS token (the first token, `<s>` in RoBERTa) and treat it as the embedding of the whole text.
2. Pool the RoBERTa output, i.e. the per-token embeddings.  
  
If you find a mistake, please let me know in the comments.

In [None]:
import os
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import transformers
from transformers import RobertaModel, RobertaTokenizer

In [None]:
class Settings:
    batch_size=16
    max_len=350
    device = "cuda" if torch.cuda.is_available() else "cpu"
    seed = 318

In [None]:
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True
    
set_seed(Settings.seed)

# Dataset

In [None]:
class TrainValidDataset(Dataset):
    def __init__(self, df, tokenizer, max_len):
        self.df = df
        self.text = df["excerpt"].values
        self.target = df["target"].values
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        texts = self.text[idx]
        tokenized = self.tokenizer.encode_plus(texts, truncation=True, add_special_tokens=True,
                                               max_length=self.max_len, padding="max_length")
        ids = tokenized["input_ids"]
        mask = tokenized["attention_mask"]
        targets = self.target[idx]
        return {
            "ids": torch.LongTensor(ids),
            "mask": torch.LongTensor(mask),
            "targets": torch.tensor(targets, dtype=torch.float32)
        }

# Model

In [None]:
class CommonLitRoBERTa(nn.Module):
    def __init__(self, pretrained_path):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained(pretrained_path)
        
    def forward(self, ids, mask):
        output = self.roberta(ids, attention_mask=mask)
        return output

In [None]:
model = CommonLitRoBERTa("../input/roberta-transformers-pytorch/roberta-base")
model.to(Settings.device)

# Get Text Embeddings

In [None]:
tokenizer = RobertaTokenizer.from_pretrained("../input/roberta-transformers-pytorch/roberta-base")
tokenizer

In [None]:
# Prepare the dataset
df_train = pd.read_csv("../input/commonlitreadabilityprize/train.csv")

train_dataset = TrainValidDataset(df_train, tokenizer, Settings.max_len)
train_loader = DataLoader(train_dataset, batch_size=Settings.batch_size,
                          shuffle=True, num_workers=8, pin_memory=True)

In [None]:
# Draw one mini-batch from the loader
batch = next(iter(train_loader))

In [None]:
ids = batch["ids"].to(Settings.device)
mask = batch["mask"].to(Settings.device)
targets = batch["targets"].to(Settings.device)

print(ids.shape)
print(mask.shape)
print(targets.shape)

16 = number of texts in the batch, 350 = number of (subword) tokens per text after padding or truncation to `max_len`

In [None]:
output = model(ids, mask)
output

RoBERTa returns two outputs: `last_hidden_state` and `pooler_output`.

In [None]:
# last_hidden_state
last_hidden_state = output[0]
print("shape:", last_hidden_state.shape)

16 = number of texts, 350 = number of tokens per text, 768 = dimension of each token embedding

In [None]:
# pooler output
pooler_output = output[1]
print("shape:", pooler_output.shape)

The meaning of the pooler output is explained later.

## 1. Get CLS Token

![get cls token](https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png)

In [None]:
# .detach() returns a tensor cut off from the autograd graph.
# It shares the same underlying data; no copy is made.
cls_embeddings = last_hidden_state[:, 0, :].detach()

print("shape:", cls_embeddings.shape)
print("")
print(cls_embeddings)

16 = number of texts, 768 = dimension of the text embedding
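
As a small aside on `.detach()`, the sketch below uses toy tensors (not the notebook's model) to check that a detached tensor stops tracking gradients but still shares memory with the original, and that adding `.clone()` gives an independent copy. The variable names are illustrative only.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2                     # y is part of the autograd graph

detached = y.detach()         # same storage as y, no gradient tracking
copied = y.detach().clone()   # independent copy, no gradient tracking

print("y tracks gradients:       ", y.requires_grad)
print("detached tracks gradients:", detached.requires_grad)
print("detached shares memory:   ", detached.data_ptr() == y.data_ptr())
print("copied shares memory:     ", copied.data_ptr() == y.data_ptr())
```

When the tensor lives on a GPU, it also has to be moved with `.cpu()` before calling `.numpy()`.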

In [None]:
# Move to CPU first: .numpy() raises an error on CUDA tensors
pd.DataFrame(cls_embeddings.cpu().numpy()).head()

## 2. Pool RoBERTa Output

Use `last_hidden_state`.

In [None]:
last_hidden_state.shape

In [None]:
# Apply average pooling over the token dimension (dim=1)
# Note: this plain mean also averages over padding positions
pooled_embeddings = last_hidden_state.detach().mean(dim=1)

print("shape:", pooled_embeddings.shape)
print("")
print(pooled_embeddings)

In [None]:
# Move to CPU first: .numpy() raises an error on CUDA tensors
pd.DataFrame(pooled_embeddings.cpu().numpy()).head()
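
Because every text is padded to `max_len`, a plain `.mean(dim=1)` also averages the hidden states at padding positions. A common alternative, which is not what the cell above does, is to weight the mean by the attention mask. The sketch below shows this on toy tensors. `masked_mean_pool` is a hypothetical helper name, and in the notebook it would presumably be called as `masked_mean_pool(last_hidden_state.detach(), mask)`.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring positions where attention_mask == 0."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per text
    return summed / counts

# Toy batch: 2 texts, 3 tokens each, hidden size 2.
# The last token of the first text is padding (mask = 0).
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]]])
toy_mask = torch.tensor([[1, 1, 0],
                         [1, 1, 1]])

masked = masked_mean_pool(hidden, toy_mask)
plain = hidden.mean(dim=1)

print("masked mean:", masked)
print("plain mean: ", plain)
```

For the second text, which has no padding, both versions agree. For the first, the plain mean is pulled toward the padding vector while the masked mean is not.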

Note: `pooler_output` is not the same as the `pooled_embeddings` computed above.  
What is the pooler output?  
-> It takes the top-layer hidden state of the first token ([CLS] in BERT, `<s>` in RoBERTa), i.e. the CLS embedding from section 1, and feeds it through one more dense layer with a tanh activation.  
Reference: https://github.com/google-research/bert/blob/cc7051dc592802f501e8a6f71f8fb3cf9de95dc9/modeling.py#L224-L232
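
The pooler computation can be sketched on toy tensors as follows. This is only an illustration of the idea: the `dense` layer here is a randomly initialized, hypothetical stand-in for the trained pooler weights, and `hidden_size = 8` is chosen for readability (roberta-base uses 768).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 8                      # assumption for the toy example
batch_size, seq_len = 2, 5

# Stand-in for the encoder output: (batch, seq_len, hidden_size)
toy_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Hypothetical pooler: one dense layer followed by tanh
dense = nn.Linear(hidden_size, hidden_size)

first_token = toy_hidden_state[:, 0, :]          # same slice as in section 1
pooler_like = torch.tanh(dense(first_token))     # (batch, hidden_size)

print("first token shape:", first_token.shape)
print("pooler-like shape:", pooler_like.shape)
```

The result has the same shape as the CLS embedding but different values, which is why `pooler_output` matches neither `cls_embeddings` nor `pooled_embeddings`.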