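## Chapter 9: Pretrained Language Models (BERT-style)
In this chapter, we use a BERT-style pretrained model to predict masked words, compute sentence vectors, and build a sentiment analyzer (a positive/negative classifier).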

### 80. Tokenization
Tokenize the sentence “The movie was full of incomprehensibilities.” and display the resulting token sequence.

In [1]:
from transformers import AutoTokenizer

model_name = 'google-bert/bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = 'The movie was full of incomprehensibilities.'
token = tokenizer.tokenize(text)
print(token)


['the', 'movie', 'was', 'full', 'of', 'inc', '##omp', '##re', '##hen', '##si', '##bilities', '.']
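
The split of "incomprehensibilities" into `##`-prefixed pieces is consistent with greedy longest-match-first subword splitting. The following is a minimal sketch of that idea, not the tokenizer's actual implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the toy vocabulary holds only the pieces seen in the output above.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption): just enough entries for this example.
toy_vocab = {"movie", "inc", "##omp", "##re", "##hen", "##si", "##bilities"}
pieces = wordpiece_tokenize("incomprehensibilities", toy_vocab)
print(pieces)
```

With a real vocabulary the longest match at each position may differ, so the pieces depend entirely on which subwords the vocabulary contains.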


### 81. Mask Prediction
Find the most appropriate token to fill the “[MASK]” in “The movie was full of [MASK].”

In [2]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model_name, top_k=1)
text = 'The movie was full of [MASK].'
result = unmasker(text)
print(result)

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[{'score': 0.10711904615163803, 'token': 4569, 'token_str': 'fun', 'sequence': 'the movie was full of fun.'}]


### 82. Top-k Mask Prediction
Find the top 10 tokens that best fill the “[MASK]” in “The movie was full of [MASK].”, together with their probabilities (likelihoods).

In [3]:
unmasker = pipeline('fill-mask', model_name, top_k=10)
result = unmasker(text)

for mask in result:
    # Double quotes outside, single quotes inside: also valid before Python 3.12
    print(f"token:{mask['token_str']} score:{mask['score']}")



token:fun score:0.10711904615163803
token:surprises score:0.06634485721588135
token:drama score:0.04468414932489395
token:stars score:0.027217062190175056
token:laughs score:0.025412775576114655
token:action score:0.01951693743467331
token:excitement score:0.019038118422031403
token:people score:0.018290281295776367
token:tension score:0.015030575916171074
token:music score:0.014646227471530437
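
The scores above are probabilities over the vocabulary at the masked position. As a rough sketch of how such a ranking can be derived from raw logits, the snippet below applies a softmax and keeps the k largest entries. The logit values and the four-token vocabulary are made up for illustration, and `softmax` and `top_k` are hypothetical helpers, not part of the `pipeline` API.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, id_to_token, k):
    """Return the k most probable (token, probability) pairs, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]


# Toy vocabulary and logits (assumptions, not model output).
toy_tokens = ["drama", "fun", "stars", "surprises"]
toy_logits = [1.0, 2.0, 0.5, 1.5]

probs = softmax(toy_logits)
for token, p in top_k(probs, toy_tokens, k=2):
    print(f"token:{token} score:{p:.4f}")
```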


### 83. Sentence Vectors from the [CLS] Token
For every pair of the following sentences, compute the cosine similarity using the final-layer embedding vector of the [CLS] token.

- “The movie was full of fun.”
- “The movie was full of excitement.”
- “The movie was full of crap.”
- “The movie was full of rubbish.”

In [4]:
from transformers import AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

model = AutoModel.from_pretrained(model_name)

sentences = [
    'The movie was full of fun.',
    'The movie was full of excitement.',
    'The movie was full of crap.',
    'The movie was full of rubbish.'
]

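# The tokenizer returns input_ids and attention_mask as tensors; pass both to the model
tokenize_text = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(**tokenize_text)

# outputs[0] is the last hidden state with shape (batch, seq_len, hidden)
last_hidden_states = outputs[0]
# Position 0 of every sequence is the [CLS] token
sentencevec = last_hidden_states[:,0,:].cpu().numpy()

cos = cosine_similarity(sentencevec)
print(cos)
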


[[0.9999999  0.98806083 0.955766   0.94753236]
 [0.98806083 1.         0.9541273  0.9486635 ]
 [0.955766   0.9541273  0.99999976 0.9806931 ]
 [0.94753236 0.9486635  0.9806931  1.0000001 ]]
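
Each entry of the matrix above is the cosine similarity cos(u, v) = u·v / (‖u‖‖v‖) between two sentence vectors. The sketch below spells that formula out in plain Python; the three short vectors are invented for illustration, and `cosine` and `cosine_matrix` are hypothetical helper names rather than scikit-learn functions.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(vectors):
    """All-pairs similarity matrix, analogous in shape to the output above."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy 3-dimensional "sentence vectors" (assumption); real ones have 768 dimensions.
vecs = [[1.0, 0.0, 1.0], [1.0, 0.1, 0.9], [0.0, 1.0, 0.2]]
for row in cosine_matrix(vecs):
    print(["%.3f" % x for x in row])
```

The matrix is symmetric and its diagonal is 1 up to floating-point rounding, which is why the recorded diagonal shows values such as 0.9999999 and 1.0000001.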


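### 84. Sentence Vectors by Averaging
For every pair of the following sentences, compute the cosine similarity using the average of the final-layer embedding vectors.

- “The movie was full of fun.”
- “The movie was full of excitement.”
- “The movie was full of crap.”
- “The movie was full of rubbish.”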

In [5]:
sentences = [
    'The movie was full of fun.',
    'The movie was full of excitement.',
    'The movie was full of crap.',
    'The movie was full of rubbish.'
]

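# The tokenizer returns input_ids and attention_mask as tensors; pass both to the model
tokenize_text = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(**tokenize_text)

last_hidden_states = outputs[0]
# Average over real tokens only, so that padding positions do not dilute the mean
mask = tokenize_text['attention_mask'].unsqueeze(-1).float()
sentencevec = ((last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)).cpu().numpy()

cos = cosine_similarity(sentencevec)
print(cos)
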


[[1.0000001  0.95681167 0.8489994  0.8168843 ]
 [0.95681167 0.9999999  0.83518374 0.7938444 ]
 [0.8489994  0.83518374 0.9999999  0.92255414]
 [0.8168843  0.7938444  0.92255414 1.        ]]
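
When the sentences in a batch differ in length, the padding positions should be left out of the average. The following is a small sketch of that masked mean using nested lists in place of tensors; `masked_mean` and `plain_mean` are hypothetical helpers and the numbers are invented so the effect is easy to see.

```python
def masked_mean(hidden, mask):
    """Average token vectors, counting only positions where mask == 1."""
    dim = len(hidden[0])
    count = sum(mask)
    return [
        sum(h[d] for h, m in zip(hidden, mask) if m) / count
        for d in range(dim)
    ]


def plain_mean(hidden):
    """Average over every position, padding included."""
    dim = len(hidden[0])
    return [sum(h[d] for h in hidden) / len(hidden) for d in range(dim)]


# Two real tokens followed by one padding position (toy values, assumption).
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(hidden, mask))  # uses only the first two rows
print(plain_mean(hidden))         # the padding row pulls the mean away
```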


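### 85. Preparing the Dataset
From the Stanford Sentiment Treebank (SST) distributed with the General Language Understanding Evaluation (GLUE) benchmark, load the text and polarity labels of the training set (train.tsv) and the development set (dev.tsv), and convert every text into a token sequence.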

In [6]:
import pandas as pd

def make_token(df):
    result = []
    for i, item in df.iterrows():
        label = item['label']
        text = item['sentence']
        token = tokenizer.tokenize(text)

        result.append({'text': text, 'label': label, 'token': token})
    return result

df_train = pd.read_csv('cp07-data/SST-2/train.tsv', sep='\t')
df_dev = pd.read_csv('cp07-data/SST-2/dev.tsv', sep='\t')

data_train = make_token(df_train)
data_dev = make_token(df_dev)

for i in range(5):
    print(data_train[i])

{'text': 'hide new secretions from the parental units ', 'label': 0, 'token': ['hide', 'new', 'secret', '##ions', 'from', 'the', 'parental', 'units']}
{'text': 'contains no wit , only labored gags ', 'label': 0, 'token': ['contains', 'no', 'wit', ',', 'only', 'labor', '##ed', 'gag', '##s']}
{'text': 'that loves its characters and communicates something rather beautiful about human nature ', 'label': 1, 'token': ['that', 'loves', 'its', 'characters', 'and', 'communicate', '##s', 'something', 'rather', 'beautiful', 'about', 'human', 'nature']}
{'text': 'remains utterly satisfied to remain the same throughout ', 'label': 0, 'token': ['remains', 'utterly', 'satisfied', 'to', 'remain', 'the', 'same', 'throughout']}
{'text': 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up ', 'label': 0, 'token': ['on', 'the', 'worst', 'revenge', '-', 'of', '-', 'the', '-', 'ne', '##rds', 'cl', '##iche', '##s', 'the', 'filmmakers', 'could', 'dr', '##edge', 'up']}


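### 86. Building a Mini-batch
For a portion of the training data loaded in Problem 85 (for example, the first four examples), apply padding and related processing so that the token sequences have equal length, and build a mini-batch.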

In [23]:
import pandas as pd

def make_token(df):
    result = []
    for i, item in df.iterrows():
        label = item['label']
        text = item['sentence']
        token = tokenizer(text, return_tensors="pt", max_length=128, padding="max_length")

        result.append({'text': text, 'label': label, 'input_ids': token})
    return result

df_train = pd.read_csv('cp07-data/SST-2/train.tsv', sep='\t')
df_dev = pd.read_csv('cp07-data/SST-2/dev.tsv', sep='\t')

data_train = make_token(df_train)
data_dev = make_token(df_dev)

print(data_train[0])

{'text': 'hide new secretions from the parental units ', 'label': 0, 'input_ids': {'input_ids': tensor([[  101,  5342,  2047,  3595,  8496,  2013,  1996, 18643,  3197,   102,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0
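
The cell above pads every example to a fixed length of 128. An alternative, sketched here under stated assumptions, is to pad only up to the longest sequence in the mini-batch and to record which positions are real in an attention mask. `pad_batch` is a hypothetical helper. The first ID list is copied from the output above, the other three are invented, and the padding ID 0 matches the zeros seen in that output.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],  # from the output above
    [101, 2023, 102],              # hypothetical IDs
    [101, 2023, 3185, 102],        # hypothetical IDs
    [101, 2023, 3185, 2001, 102],  # hypothetical IDs
])

for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Padding to the longest sequence keeps the batch as short as the data allows, and the mask lets the model ignore the padded positions.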

In [14]:
import transformers
print(transformers.__version__)
print(transformers.__file__)


4.51.0
/net/nas8/data/home/yokoyama/nlp-100/.venv/lib/python3.12/site-packages/transformers/__init__.py


### 87. Fine-tuning
Using the training set, fine-tune the pretrained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation set.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, BatchEncoding, DataCollatorWithPadding
from torch.utils.data import DataLoader
from datasets import Dataset
import numpy as np

def make_dataset(file_name):
  df = pd.read_csv(file_name, sep='\t')
  df['label'] = df['label'].astype(int)
  df['tokens'] = df['sentence'].apply(tokenizer.tokenize)
  return Dataset.from_pandas(df)

def compute_accuracy(eval_pred: tuple[np.ndarray, np.ndarray]) -> dict[str, float]:
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  return {"accuracy": (predictions == labels).mean()}

train_dataset =  make_dataset('cp07-data/SST-2/train.tsv')
dev_dataset = make_dataset('cp07-data/SST-2/dev.tsv')

def preprocess_text_classification(
        example: dict[str, str | int]
) -> BatchEncoding:
    # Tokenize one example; truncation=True makes the max_length limit explicit
    encoded_example = tokenizer(example['sentence'], max_length=128, truncation=True)
    encoded_example['labels'] = example['label']
    return encoded_example

encoded_train_dataset = train_dataset.map(preprocess_text_classification, remove_columns=train_dataset.column_names)
encoded_dev_dataset = dev_dataset.map(preprocess_text_classification, remove_columns=dev_dataset.column_names)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

class_label = train_dataset.features['label']

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
training_args = TrainingArguments(
    output_dir='/home/yokoyama/nlp-100/models/cp09/',
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy='epoch',
    warmup_ratio=0.1,
    save_total_limit=1,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_dev_dataset,
    data_collator=data_collator,
    compute_metrics=compute_accuracy,
)

trainer.train()

Map: 100%|██████████| 67349/67349 [00:19<00:00, 3512.06 examples/s]
Map: 100%|██████████| 872/872 [00:00<00:00, 3011.59 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2905 | 0.21302 | 0.919725 |
| 2 | 0.1251 | 0.256958 | 0.909404 |
| 3 | 0.082 | 0.288396 | 0.926606 |
| 4 | 0.0608 | 0.328246 | 0.931193 |
| 5 | 0.0571 | 0.496765 | 0.924312 |




TrainOutput(global_step=5265, training_loss=0.12310158927895744, metrics={'train_runtime': 1298.3701, 'train_samples_per_second': 259.36, 'train_steps_per_second': 4.055, 'total_flos': 7721667113841000.0, 'train_loss': 0.12310158927895744, 'epoch': 5.0})
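
The training arguments set `warmup_ratio=0.1` and `learning_rate=2e-5`, and the log reports 5265 optimizer steps. Assuming a linear schedule with warmup (believed to be the `Trainer` default, but not confirmed by this notebook), the learning rate would rise linearly over the first 10% of steps and then decay linearly to zero. The sketch below computes that curve; `linear_schedule_lr` is a hypothetical helper, not a library function, and rounding the warmup length up is also an assumption.

```python
import math


def linear_schedule_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps = 5265                           # global_step from the log above
warmup_steps = math.ceil(total_steps * 0.1)  # warmup_ratio=0.1, rounded up (assumption)
base_lr = 2e-5

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_schedule_lr(step, total_steps, warmup_steps, base_lr)
    print(f"step {step:5d}  lr {lr:.3e}")
```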

### 88. Polarity Analysis
Using the model fine-tuned in Problem 87, predict the polarity of the following sentences.

In [33]:
sentences = [
    'The movie was full of fun.',
    'The movie was full of excitement.',
    'The movie was full of crap.',
    'The movie was full of rubbish.'
]

model = AutoModelForSequenceClassification.from_pretrained('/home/yokoyama/nlp-100/models/cp09/checkpoint-5265')
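# Pad only to the longest sentence; the encoding carries input_ids and attention_mask
token = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=128)

with torch.no_grad():  # inference only
    output = model(**token)  # the attention mask keeps padding from being attended to

for sentence, logits in zip(sentences, output.logits):
    print(f'{sentence}  {np.argmax(logits.cpu().numpy())}')

In the recorded run below, all four sentences were assigned class 0 (negative), including the two clearly positive ones. That run passed only `input_ids`, padded to 128 tokens, without an attention mask, which is a likely cause of the uniform predictions.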

The movie was full of fun.  0
The movie was full of excitement.  0
The movie was full of crap.  0
The movie was full of rubbish.  0
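
The printed numbers are class indices. In the SST-2 data loaded earlier, label 0 marks negative and label 1 marks positive sentences. As a small sketch, a pair of logits can be mapped to a readable label and a confidence with a softmax followed by an argmax. `predict_label` and `ID2LABEL` are hypothetical names, and the logits are invented rather than taken from the fine-tuned model.

```python
import math

ID2LABEL = {0: "negative", 1: "positive"}  # SST-2 label convention


def predict_label(logits):
    """Return (label name, probability) for one example's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return ID2LABEL[best], probs[best]


# Invented logits (assumption): first favors class 1, second favors class 0.
for logits in ([-2.0, 3.0], [1.5, -0.5]):
    label, prob = predict_label(logits)
    print(f"{logits} -> {label} ({prob:.3f})")
```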


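### 89. Changing the Architecture
Design a classification model with an architecture different from Problem 87 (for example, one that uses the [CLS] token, or max pooling over the tokens), and fine-tune the pretrained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation set.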

In [7]:
from transformers import AutoModel, AutoConfig, BatchEncoding, DataCollatorWithPadding, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from transformers.modeling_outputs import SequenceClassifierOutput

model_name = 'google-bert/bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

def make_dataset(file_name):
  df = pd.read_csv(file_name, sep='\t')
  df['label'] = df['label'].astype(int)
  df['tokens'] = df['sentence'].apply(tokenizer.tokenize)
  return Dataset.from_pandas(df)

def compute_accuracy(eval_pred: tuple[np.ndarray, np.ndarray]) -> dict[str, float]:
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  return {"accuracy": (predictions == labels).mean()}

train_dataset =  make_dataset('cp07-data/SST-2/train.tsv')
dev_dataset = make_dataset('cp07-data/SST-2/dev.tsv')

def preprocess_text_classification(
        example: dict[str, str | int]
) -> BatchEncoding:
    # Tokenize one example; truncation=True makes the max_length limit explicit
    encoded_example = tokenizer(example['sentence'], max_length=128, truncation=True)
    encoded_example['labels'] = example['label']
    return encoded_example

encoded_train_dataset = train_dataset.map(preprocess_text_classification, remove_columns=train_dataset.column_names)
encoded_dev_dataset = dev_dataset.map(preprocess_text_classification, remove_columns=dev_dataset.column_names)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

class_label = train_dataset.features['label']

class CommonLitModel(nn.Module):
    """BERT encoder followed by masked max pooling and a linear head."""

    def __init__(self, model_name, num_labels):
        super().__init__()
        # Store num_labels in the config so that forward() selects the matching loss
        self.config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(self.config.hidden_dropout_prob)
        self.regressor = nn.Linear(self.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        hidden_states = outputs.last_hidden_state
        # Max pooling: fill padding positions with a large negative value so they never win the max
        mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        hidden_states = hidden_states.masked_fill(mask_expanded == 0, -1e9)
        pooled_output, _ = torch.max(hidden_states, dim=1)  # (batch_size, hidden_dim)

        pooled_output = self.dropout(pooled_output)
        logits = self.regressor(pooled_output)

        loss = None
        if labels is not None:
            if self.config.num_labels == 1:
                # Regression (e.g. CommonLit)
                loss_fct = nn.MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                # Classification (e.g. SST-2)
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

model = CommonLitModel(model_name=model_name, num_labels=2)
training_args = TrainingArguments(
    output_dir='/home/yokoyama/nlp-100/models/cp09/',
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy='epoch',
    warmup_ratio=0.1,
    save_total_limit=1,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_dev_dataset,
    data_collator=data_collator,
    compute_metrics=compute_accuracy,
)
trainer.train()

Map: 100%|██████████| 67349/67349 [00:16<00:00, 4032.03 examples/s]
Map: 100%|██████████| 872/872 [00:00<00:00, 3409.91 examples/s]


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2892 | 0.220065 | 0.920872 |
| 2 | 0.1228 | 0.242296 | 0.917431 |
| 3 | 0.0804 | 0.280345 | 0.925459 |
| 4 | 0.0595 | 0.328229 | 0.930046 |
| 5 | 0.0589 | 0.523736 | 0.924312 |




TrainOutput(global_step=5265, training_loss=0.12215669854753718, metrics={'train_runtime': 1307.5842, 'train_samples_per_second': 257.532, 'train_steps_per_second': 4.027, 'total_flos': 0.0, 'train_loss': 0.12215669854753718, 'epoch': 5.0})
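
The model above pools with a masked maximum: padding positions are pushed to a very negative value before taking the per-dimension maximum over tokens. The sketch below illustrates why the mask matters, using nested lists in place of tensors. `masked_max` and `plain_max` are hypothetical helpers and the values are invented.

```python
def masked_max(hidden, mask):
    """Per-dimension maximum over positions where mask == 1."""
    dim = len(hidden[0])
    return [
        max(h[d] for h, m in zip(hidden, mask) if m)
        for d in range(dim)
    ]


def plain_max(hidden):
    """Per-dimension maximum over every position, padding included."""
    dim = len(hidden[0])
    return [max(h[d] for h in hidden) for d in range(dim)]


# Two real tokens with negative activations, then one padding row (toy values, assumption).
hidden = [[-1.0, -2.0], [-3.0, -0.5], [0.0, 0.0]]
mask = [1, 1, 0]

print(masked_max(hidden, mask))  # ignores the padding row
print(plain_max(hidden))         # the padding row wins every dimension
```

Whenever all real activations in a dimension fall below the value at a padded position, an unmasked maximum would return the padding value instead of a token feature.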