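# Chapter 9: Pretrained Language Models (BERT-style)

In this chapter, we use a BERT-style pretrained model to predict masked words, compute sentence vectors, and build a sentiment analyzer (positive/negative classifier).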

```{warning}
In this chapter the code is written in Markdown code blocks rather than `code-cell` directives, so it cannot be run directly on Google Colab.
```

## 80. Tokenization

Split the sentence "The movie was full of incomprehensibilities." into tokens and display the token sequence.

```python
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "The movie was full of incomprehensibilities."
tokens = tokenizer.tokenize(text)

print(tokens)
```

```bash
['the', 'movie', 'was', 'full', 'of', 'inc', '##omp', '##re', '##hen', '##si', '##bilities', '.']
```
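
The `##` prefix marks a piece that continues the preceding piece of the same word. The split shown above is consistent with a greedy longest-match-first strategy, which can be sketched in plain Python as follows. The vocabulary `toy_vocab` and the helper `wordpiece` are hypothetical and chosen only to reproduce the pieces printed above; they are not the real `bert-base-uncased` vocabulary or the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not the real BERT vocabulary).
toy_vocab = {"inc", "##omp", "##re", "##hen", "##si", "##bilities", "movie"}

pieces = wordpiece("incomprehensibilities", toy_vocab)
print(pieces)
```

A word that is itself in the vocabulary (here `movie`) comes back as a single piece, which is why only the rare word in the sentence is broken up.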

## 81. Mask prediction

Find the most appropriate token to fill "[MASK]" in "The movie was full of [MASK]."

```python
from transformers import pipeline

model_name = "bert-base-uncased"
unmasker = pipeline("fill-mask", model=model_name)

masked_text = "The movie was full of [MASK]."
results = unmasker(masked_text)

print(f'[MASK]: {results[0]["token_str"]}')
```

```bash
[MASK]: fun
```

## 82. Top-k mask prediction

Find the ten most appropriate tokens to fill "[MASK]" in "The movie was full of [MASK]." together with their probabilities (likelihoods).

```python
from transformers import pipeline

model_name = "bert-base-uncased"
unmasker = pipeline("fill-mask", model=model_name)

masked_text = "The movie was full of [MASK]."
results = unmasker(masked_text, top_k=10)

for i in range(10):
    print(f'[MASK]: {results[i]["token_str"]} (score: {results[i]["score"]})')
```

```bash
[MASK]: fun (score: 0.10711898654699326)
[MASK]: surprises (score: 0.06634481996297836)
[MASK]: drama (score: 0.04468412697315216)
[MASK]: stars (score: 0.027217047289013863)
[MASK]: laughs (score: 0.02541276253759861)
[MASK]: action (score: 0.019516924396157265)
[MASK]: excitement (score: 0.019038109108805656)
[MASK]: people (score: 0.018290270119905472)
[MASK]: tension (score: 0.015030566602945328)
[MASK]: music (score: 0.014646219089627266)
```
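
The scores are probabilities over the vocabulary: the logits at the `[MASK]` position are passed through a softmax and the largest entries are reported. A minimal sketch of that step is given below. The five-word vocabulary `toy_tokens` and the values in `toy_logits` are made up for illustration and do not come from the model.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(tokens, logits, k):
    # Pair each token with its probability and keep the k most probable.
    probs = softmax(logits)
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and logits (assumptions for illustration only).
toy_tokens = ["fun", "surprises", "drama", "table", "the"]
toy_logits = [4.0, 3.5, 3.1, -1.0, 0.5]

best = top_k(toy_tokens, toy_logits, k=3)
for token, prob in best:
    print(f"{token}: {prob:.4f}")
```

Because the softmax runs over the whole vocabulary, the ten scores listed above do not sum to 1; the remaining mass is spread over all other tokens.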

## 83. Sentence vectors from the CLS token

For every pair of the following sentences, compute the cosine similarity using the final-layer embedding of the [CLS] token.

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."

```python
from transformers import AutoModel, AutoTokenizer
import torch
from sklearn.metrics.pairwise import cosine_similarity

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = [
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    
cls_embs = outputs.last_hidden_state[:, 0, :]

cls_np = cls_embs.cpu().numpy()
cos_sim_mat = cosine_similarity(cls_np)

print("Cosine Similarity Matrix:")
print(cos_sim_mat)
```

```bash
Cosine Similarity Matrix:
[[0.99999976 0.98806083 0.95576596 0.9475324 ]
 [0.98806083 1.         0.95412743 0.9486636 ]
 [0.95576596 0.95412743 0.9999998  0.9806931 ]
 [0.9475324  0.9486636  0.9806931  1.0000002 ]]
```
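
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. The sketch below computes the same kind of matrix in plain Python on hypothetical two-dimensional vectors (`toy_vectors` is an assumption, not model output). Diagonal entries such as `0.99999976` in the output above are 1 up to float32 rounding.

```python
import math


def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(vectors):
    # Pairwise similarities; the result is symmetric with ones on the diagonal.
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical toy vectors (an assumption for illustration).
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
sim = cosine_matrix(toy_vectors)
for row in sim:
    print([round(x, 4) for x in row])
```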

## 84. Sentence vectors by averaging

For every pair of the following sentences, compute the cosine similarity using the mean of the final-layer embedding vectors.

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."

```python
from transformers import AutoModel, AutoTokenizer
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = [
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    
last_hidden = outputs.last_hidden_state
attention_mask = inputs['attention_mask']

mask = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
masked_hidden = last_hidden * mask
sum_hidden = masked_hidden.sum(1)
sum_mask = mask.sum(1)
mean_pooled = sum_hidden / sum_mask

mean_vecs = mean_pooled.cpu().numpy()
cos_sim_mat = cosine_similarity(mean_vecs)
cos_sim_mat = np.clip(cos_sim_mat, -1.0, 1.0)

print("Cosine Similarity Matrix (Mean Pooling):")
print(cos_sim_mat)
```

```bash
Cosine Similarity Matrix (Mean Pooling):
[[1.         0.95681167 0.8489993  0.81688446]
 [0.95681167 0.99999976 0.83518374 0.79384434]
 [0.8489993  0.83518374 1.         0.9225539 ]
 [0.81688446 0.79384434 0.9225539  0.9999999 ]]
```
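
The listing above multiplies the hidden states by the attention mask before averaging, so that padded positions do not contribute to the mean. The effect matters whenever the sentences in a batch have different lengths. The following plain-Python sketch shows it on a hypothetical three-token sequence whose last position is padding; `toy_hidden` and `toy_mask` are made-up values.

```python
def masked_mean(hidden, mask):
    """hidden: [seq_len][dim] token vectors; mask: [seq_len] with 1 = real, 0 = padding."""
    dim = len(hidden[0])
    count = sum(mask)  # number of real tokens
    return [
        sum(vec[d] * m for vec, m in zip(hidden, mask)) / count
        for d in range(dim)
    ]


# Hypothetical values: the last row plays the role of a [PAD] position.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

pooled = masked_mean(toy_hidden, toy_mask)
# Averaging without the mask lets the padding row dominate the result.
naive = [sum(vec[d] for vec in toy_hidden) / len(toy_hidden) for d in range(2)]

print(pooled)
print(naive)
```

Note that the attention mask is 1 for `[CLS]` and `[SEP]` as well, so the mean in the listing above includes those two special tokens.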

## 85. Preparing the dataset

From the [Stanford Sentiment Treebank (SST)](https://dl.fbaipublicfiles.com/glue/data/SST-2.zip) distributed as part of the [General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) benchmark, load the texts and polarity labels of the training set (train.tsv) and the development set (dev.tsv), and convert every text into a token sequence.

```python
from transformers import AutoTokenizer
import pandas as pd

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The SST-2 files are tab-separated with "sentence" and "label" columns.
train_df = pd.read_table("../ch07/SST-2/train.tsv", sep='\t')
dev_df = pd.read_table("../ch07/SST-2/dev.tsv", sep='\t')

def tokenize_all_text(df, tokenizer):
    # Assign the whole column at once. Element-wise chained assignment
    # (df["tokens"][i] = ...) can silently write to a temporary copy in pandas.
    df = df.copy()
    df["tokens"] = df["sentence"].map(tokenizer.tokenize)
    return df

train_df = tokenize_all_text(train_df, tokenizer)
dev_df = tokenize_all_text(dev_df, tokenizer)

# Save the tokenized copies; the SST-2/ directory must already exist.
train_df.to_csv("SST-2/train.tsv", index=False, sep='\t')
dev_df.to_csv("SST-2/dev.tsv", index=False, sep='\t')

print("Train:")
print(train_df.head())
print("Dev:")
print(dev_df.head())
```

```bash
Train:
                                            sentence  label                                             tokens
0       hide new secretions from the parental units       0  [hide, new, secret, ##ions, from, the, parenta...
1               contains no wit , only labored gags       0  [contains, no, wit, ,, only, labor, ##ed, gag,...
2  that loves its characters and communicates som...      1  [that, loves, its, characters, and, communicat...
3  remains utterly satisfied to remain the same t...      0  [remains, utterly, satisfied, to, remain, the,...
4  on the worst revenge-of-the-nerds clichés the ...      0  [on, the, worst, revenge, -, of, -, the, -, ne...
Dev:
                                            sentence  label                                             tokens
0    it 's a charming and often affecting journey .       1  [it, ', s, a, charming, and, often, affecting,...
1                 unflinchingly bleak and desperate       0  [un, ##fl, ##in, ##ching, ##ly, bleak, and, de...
2  allows us to hope that nolan is poised to emba...      1  [allows, us, to, hope, that, nolan, is, poised...
3  the acting , costumes , music , cinematography...      1  [the, acting, ,, costumes, ,, music, ,, cinema...
4                  it 's slow -- very , very slow .       0     [it, ', s, slow, -, -, very, ,, very, slow, .]
```

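## 86. Building a mini-batch

For a part of the training data loaded in problem 85 (for example, the first four examples), apply padding and related processing to equalize the token sequence lengths and build a mini-batch.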

```python
from transformers import AutoTokenizer
import pandas as pd
import torch

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_df = pd.read_table("../ch07/SST-2/train.tsv", sep='\t')

# Take the first four training examples as one mini-batch.
batch_df = train_df.head(4)
texts = batch_df["sentence"].tolist()

# Calling the tokenizer on a list of texts (rather than tokenizer.tokenize on
# each text) pads every sequence to the longest one in the batch and returns
# tensors of shape (batch_size, seq_len) plus an attention mask.
batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
batch["labels"] = torch.tensor(batch_df["label"].tolist())

print("input_ids shape:", batch["input_ids"].shape)
print("attention_mask:")
print(batch["attention_mask"])
print("labels:", batch["labels"])
```
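
Padding to a common length amounts to appending a pad ID to the shorter sequences and recording which positions are real in an attention mask. A plain-Python sketch follows. The helper `pad_batch` is hypothetical, and the token IDs in `toy_ids` are made up except for 101 (`[CLS]`), 102 (`[SEP]`), and the pad ID 0, which are the `bert-base-uncased` values.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token IDs to the longest length and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Hypothetical token IDs; 101 and 102 stand for [CLS] and [SEP].
toy_ids = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]]

input_ids, attention_mask = pad_batch(toy_ids)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Padding only to the longest sequence in each batch, instead of a fixed maximum length, keeps the tensors as small as the batch allows.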

## 87. Fine-tuning

Using the training set, fine-tune the pretrained model for the sentiment analysis task. Measure the accuracy of the fine-tuned model on the validation set.

```python
import torch
import pandas as pd
import argparse
import csv
import json
import logging
from datetime import datetime
from torch.utils.data import DataLoader
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification, 
)
import pytorch_lightning as pl
from sklearn.metrics import accuracy_score

def tsv2json(input_file, output_file):
    with open(input_file, 'r') as f_in, open(output_file, 'w') as f_out:
        reader = csv.DictReader(f_in, delimiter='\t')
        for row in reader:
            data = {
                "sentence": row['sentence'].strip(),
                "label": int(row['label'].strip())
            }
            f_out.write(json.dumps(data) + "\n")

tsv2json("../ch07/SST-2/train.tsv", "SST-2/train.json")
tsv2json("../ch07/SST-2/dev.tsv", "SST-2/dev.json")

def parse_args():
    parser = argparse.ArgumentParser(description="Fine-tune BERT model for sentiment analysis")
    parser.add_argument("--model", type=str, required=True, help="Path to the pre-trained BERT model")
    parser.add_argument("--lr", type=float, required=True, help="Learning rate for training")
    parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay for regularization (default: 0.01)")
    parser.add_argument("--max_epochs", type=int, required=True, help="Maximum number of epochs for training")
    return parser.parse_args()

def setup_logging():
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_filename = f"log/training_{timestamp}.log"

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        handlers=[
            logging.FileHandler(log_filename),
            logging.StreamHandler()
        ]
    )
    logging.info(f"Logging setup complete. Logs will be saved to: {log_filename}")
    return log_filename

def make_dataset(tokenizer, max_length, texts, labels):
    dataset_for_loader = list()
    for text, label in zip(texts, labels):
        encoding = tokenizer(text, max_length=max_length, padding="max_length", truncation=True)
        encoding["labels"] = label
        encoding = {key: torch.tensor(value) for key, value in encoding.items()}
        dataset_for_loader.append(encoding)
    return dataset_for_loader

class SentimentAnalyzer(pl.LightningModule):
    def __init__(self, model, num_labels, lr, weight_decay):
        super().__init__()
        self.save_hyperparameters()
        self.bert_sc = AutoModelForSequenceClassification.from_pretrained(
            model, num_labels=num_labels, ignore_mismatched_sizes=True
        )

    def forward(self, **inputs):
        return self.bert_sc(**inputs)

    def training_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        loss = output.loss
        labels_predicted = output.logits.argmax(-1)
        labels = batch["labels"]
        acc = accuracy_score(labels.cpu().numpy(), labels_predicted.cpu().numpy())
        self.log("train_loss", loss)
        self.log("train_acc", acc)
        return loss

    def validation_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        val_loss = output.loss
        labels_predicted = output.logits.argmax(-1)
        labels = batch["labels"]
        acc = accuracy_score(labels.cpu().numpy(), labels_predicted.cpu().numpy())
        self.log("val_loss", val_loss)
        self.log("val_acc", acc)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(
            self.parameters(), lr=self.hparams.lr, weight_decay=self.hparams.weight_decay
        )
        return optimizer
    
def main():
    log_file = setup_logging()
    args = parse_args()
    logging.info("Starting training with arguments:")
    logging.info(vars(args))

    train_df = pd.read_json('SST-2/train.json', lines=True)
    val_df = pd.read_json('SST-2/dev.json', lines=True)

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    max_length = 100

    train = make_dataset(tokenizer, max_length, train_df["sentence"].tolist(), train_df["label"].tolist())
    val = make_dataset(tokenizer, max_length, val_df["sentence"].tolist(), val_df["label"].tolist())

    dataloader_train = DataLoader(train, batch_size=64, shuffle=True)
    dataloader_val = DataLoader(val, batch_size=512, shuffle=False)

    model = SentimentAnalyzer(
        args.model, num_labels=2, lr=args.lr, weight_decay=args.weight_decay
    )

    checkpoint = pl.callbacks.ModelCheckpoint(
        monitor="val_acc", mode="max", save_top_k=1,
        save_weights_only=True, dirpath="model/"
    )

    early_stopping = pl.callbacks.EarlyStopping(
        monitor='val_acc',
        patience=3,
        verbose=True,
        mode='max'
    )

    trainer = pl.Trainer(
        max_epochs=args.max_epochs,
        callbacks=[checkpoint, early_stopping],
        devices=1,
        accelerator="auto",
        gradient_clip_val=1.0,
        log_every_n_steps=10
    )
    logging.info("Trainer initialized. Starting training...")
    trainer.fit(model, dataloader_train, dataloader_val)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    model_path = f"./models/{args.model.split('/')[-1]}_lr{args.lr}_wd{args.weight_decay}_epochs{args.max_epochs}_{timestamp}.ckpt"
    trainer.save_checkpoint(model_path)

    logging.info(f"Best model saved at: {checkpoint.best_model_path}")
    logging.info(f"Best validation ACC: {checkpoint.best_model_score}")

if __name__ == "__main__":
    main()
```

```bash
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1053/1053 [07:17<00:00,  2.40it/s, v_num=2]
Metric val_acc improved. New best score: 0.817
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1053/1053 [07:25<00:00,  2.36it/s, v_num=2]
Metric val_acc improved by 0.010 >= min_delta = 0.0. New best score: 0.827
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1053/1053 [07:25<00:00,  2.36it/s, v_num=2]
`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1053/1053 [07:25<00:00,  2.36it/s, v_num=2]
2025-04-16 11:35:05,228 - INFO - Best model saved at: /net/nas8/data/home/yoneyama/workspace/nlp100_2025/ch09/model/epoch=1-step=2106.ckpt
2025-04-16 11:35:05,244 - INFO - Best validation ACC: 0.8268348574638367
```
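
The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes the usual GLUE SST-2 split sizes of 67,349 training and 872 development examples; these figures are not stated in this chapter and are an assumption. Under that assumption, a batch size of 64 gives the 1053 steps per epoch shown in the progress bar, the best checkpoint `epoch=1-step=2106` corresponds to the end of the second epoch, and the reported accuracy is consistent with 721 correct predictions.

```python
import math

train_size = 67_349   # assumption: size of the GLUE SST-2 training set
dev_size = 872        # assumption: size of the GLUE SST-2 development set
batch_size = 64       # from DataLoader(train, batch_size=64, shuffle=True)

# The last, partially filled batch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_size / batch_size)

# Epochs are numbered from 0, so "epoch=1" is the second epoch.
best_step = 2 * steps_per_epoch

best_acc = 0.8268348574638367   # value reported in the log above
correct = round(best_acc * dev_size)

print(steps_per_epoch, best_step, correct)
```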

## 88. Sentiment analysis

Using the model fine-tuned in problem 87, predict the polarity of the following sentences.

- "The movie was full of incomprehensibilities."
- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."

```python
import torch
import pandas as pd
import argparse
import torch.nn.functional as F
from datetime import datetime
from torch.utils.data import DataLoader
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification, 
)
import pytorch_lightning as pl
from sklearn.metrics import accuracy_score

def parse_args():
    parser = argparse.ArgumentParser(description="Predict sentiment with a fine-tuned BERT model")
    parser.add_argument("--model", type=str, required=True, help="Path to the pre-trained BERT model")
    parser.add_argument("--ckpt", type=str, required=True, help="Path to the .ckpt")
    return parser.parse_args()

class SentimentAnalyzer(pl.LightningModule):
    def __init__(self, model, num_labels, lr, weight_decay):
        super().__init__()
        self.save_hyperparameters()
        self.bert_sc = AutoModelForSequenceClassification.from_pretrained(
            model, num_labels=num_labels, ignore_mismatched_sizes=True
        )

    def forward(self, **inputs):
        return self.bert_sc(**inputs)

    def training_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        loss = output.loss
        labels_predicted = output.logits.argmax(-1)
        labels = batch["labels"]
        acc = accuracy_score(labels.cpu().numpy(), labels_predicted.cpu().numpy())
        self.log("train_loss", loss)
        self.log("train_acc", acc)
        return loss

    def validation_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        val_loss = output.loss
        labels_predicted = output.logits.argmax(-1)
        labels = batch["labels"]
        acc = accuracy_score(labels.cpu().numpy(), labels_predicted.cpu().numpy())
        self.log("val_loss", val_loss)
        self.log("val_acc", acc)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(
            self.parameters(), lr=self.hparams.lr, weight_decay=self.hparams.weight_decay
        )
        return optimizer
    
def inference(text, tokenizer, model, device='cuda' if torch.cuda.is_available() else 'cpu'):
    model = model.to(device)
    model.eval()
    
    tokens = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    tokens = {key: value.to(device) for key, value in tokens.items()}
    
    with torch.no_grad():
        outputs = model.bert_sc(**tokens)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
        pred_class = torch.argmax(probs, dim=-1).item()
    return pred_class

def main():
    args = parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    model = SentimentAnalyzer.load_from_checkpoint(args.ckpt)
    
    example = [
        "The movie was full of incomprehensibilities.",
        "The movie was full of fun.",
        "The movie was full of excitement.",
        "The movie was full of crap.",
        "The movie was full of rubbish."
    ]
    
    for text in example:
        pred = inference(text, tokenizer, model)
        print(f"{text} -> {pred}")

if __name__ == "__main__":
    main()

```

```bash
The movie was full of incomprehensibilities. -> 0
The movie was full of fun. -> 1
The movie was full of excitement. -> 1
The movie was full of crap. -> 0
The movie was full of rubbish. -> 0
```
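
In SST-2, label 0 denotes negative and label 1 denotes positive sentiment. The short sketch below maps the class IDs printed above to readable names; `ID2LABEL` and `readable` are names introduced here for illustration, and the predictions are copied from the output above rather than recomputed.

```python
ID2LABEL = {0: "negative", 1: "positive"}  # SST-2 label convention

# Predictions copied from the output above.
predictions = [
    ("The movie was full of incomprehensibilities.", 0),
    ("The movie was full of fun.", 1),
    ("The movie was full of excitement.", 1),
    ("The movie was full of crap.", 0),
    ("The movie was full of rubbish.", 0),
]

readable = [(text, ID2LABEL[pred]) for text, pred in predictions]
for text, label in readable:
    print(f"{label:>8}: {text}")
```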

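## 89. Changing the architecture

Design a classification model with an architecture different from problem 87 (for example, one that uses the [CLS] token, or max pooling over the tokens), and fine-tune the pretrained model for the sentiment analysis task. Measure the accuracy of the fine-tuned model on the validation set.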

```python
import torch
import pandas as pd
import argparse
import csv
import json
import logging
from datetime import datetime
from torch.utils.data import DataLoader
from transformers import (
    AutoTokenizer, 
    AutoModel
)
import pytorch_lightning as pl
from sklearn.metrics import accuracy_score

def tsv2json(input_file, output_file):
    with open(input_file, 'r') as f_in, open(output_file, 'w') as f_out:
        reader = csv.DictReader(f_in, delimiter='\t')
        for row in reader:
            data = {
                "sentence": row['sentence'].strip(),
                "label": int(row['label'].strip())
            }
            f_out.write(json.dumps(data) + "\n")

tsv2json("../ch07/SST-2/train.tsv", "SST-2/train.json")
tsv2json("../ch07/SST-2/dev.tsv", "SST-2/dev.json")

def parse_args():
    parser = argparse.ArgumentParser(description="Fine-tune BERT model for sentiment analysis")
    parser.add_argument("--model", type=str, required=True, help="Path to the pre-trained BERT model")
    parser.add_argument("--lr", type=float, required=True, help="Learning rate for training")
    parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay for regularization (default: 0.01)")
    parser.add_argument("--max_epochs", type=int, required=True, help="Maximum number of epochs for training")
    return parser.parse_args()

def setup_logging():
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_filename = f"log/training_{timestamp}.log"

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        handlers=[
            logging.FileHandler(log_filename),
            logging.StreamHandler()
        ]
    )
    logging.info(f"Logging setup complete. Logs will be saved to: {log_filename}")
    return log_filename

def make_dataset(tokenizer, max_length, texts, labels):
    dataset_for_loader = list()
    for text, label in zip(texts, labels):
        encoding = tokenizer(text, max_length=max_length, padding="max_length", truncation=True)
        encoding["labels"] = label
        encoding = {key: torch.tensor(value) for key, value in encoding.items()}
        dataset_for_loader.append(encoding)
    return dataset_for_loader

class SentimentAnalyzer(pl.LightningModule):
    def __init__(self, model, num_labels, lr, weight_decay):
        super().__init__()
        self.save_hyperparameters()
        self.bert = AutoModel.from_pretrained(model)
        hidden_size = self.bert.config.hidden_size
        self.classifier = torch.nn.Linear(hidden_size, num_labels)
        self.lr = lr
        self.weight_decay = weight_decay

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state
        pooled = torch.max(last_hidden_state, dim=1).values
        logits = self.classifier(pooled)
        return logits

    def training_step(self, batch, batch_idx):
        labels = batch["labels"]
        logits = self.forward(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
        loss = torch.nn.functional.cross_entropy(logits, labels)
        preds = logits.argmax(dim=-1)
        acc = accuracy_score(labels.cpu().numpy(), preds.cpu().numpy())
        self.log("train_loss", loss)
        self.log("train_acc", acc)
        return loss

    def validation_step(self, batch, batch_idx):
        labels = batch["labels"]
        logits = self.forward(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
        val_loss = torch.nn.functional.cross_entropy(logits, labels)
        preds = logits.argmax(dim=-1)
        acc = accuracy_score(labels.cpu().numpy(), preds.cpu().numpy())
        self.log("val_loss", val_loss)
        self.log("val_acc", acc)

    def configure_optimizers(self):
        return torch.optim.AdamW(
            self.parameters(), lr=self.lr, weight_decay=self.weight_decay
        )
    
def main():
    log_file = setup_logging()
    args = parse_args()
    logging.info("Starting training with arguments:")
    logging.info(vars(args))

    train_df = pd.read_json('SST-2/train.json', lines=True)
    val_df = pd.read_json('SST-2/dev.json', lines=True)
    
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    max_length = 100

    train = make_dataset(tokenizer, max_length, train_df["sentence"].tolist(), train_df["label"].tolist())
    val = make_dataset(tokenizer, max_length, val_df["sentence"].tolist(), val_df["label"].tolist())

    dataloader_train = DataLoader(train, batch_size=64, shuffle=True)
    dataloader_val = DataLoader(val, batch_size=512, shuffle=False)

    model = SentimentAnalyzer(
        args.model, num_labels=2, lr=args.lr, weight_decay=args.weight_decay
    )

    checkpoint = pl.callbacks.ModelCheckpoint(
        monitor="val_acc", mode="max", save_top_k=1,
        save_weights_only=True, dirpath="model/"
    )

    early_stopping = pl.callbacks.EarlyStopping(
        monitor='val_acc',
        patience=3,
        verbose=True,
        mode='max'
    )

    trainer = pl.Trainer(
        max_epochs=args.max_epochs,
        callbacks=[checkpoint, early_stopping],
        devices=1,
        accelerator="auto",
        gradient_clip_val=1.0,
        log_every_n_steps=10
    )
    logging.info("Trainer initialized. Starting training...")
    trainer.fit(model, dataloader_train, dataloader_val)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    model_path = f"./models/{args.model.split('/')[-1]}_lr{args.lr}_wd{args.weight_decay}_epochs{args.max_epochs}_{timestamp}.ckpt"
    trainer.save_checkpoint(model_path)

    logging.info(f"Best model saved at: {checkpoint.best_model_path}")
    logging.info(f"Best validation ACC: {checkpoint.best_model_score}")

if __name__ == "__main__":
    main()

```

```bash
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1053/1053 [07:30<00:00,  2.34it/s, v_num=3]
Metric val_acc improved. New best score: 0.794
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1053/1053 [07:38<00:00,  2.30it/s, v_num=3]
Metric val_acc improved by 0.022 >= min_delta = 0.0. New best score: 0.815
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1053/1053 [07:36<00:00,  2.31it/s, v_num=3]
Metric val_acc improved by 0.011 >= min_delta = 0.0. New best score: 0.827
`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1053/1053 [07:43<00:00,  2.27it/s, v_num=3]
2025-04-16 12:10:02,356 - INFO - Best model saved at: /net/nas8/data/home/yoneyama/workspace/nlp100_2025/ch09/model/epoch=2-step=3159.ckpt
2025-04-16 12:10:02,376 - INFO - Best validation ACC: 0.8268348574638367
```
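
One design point worth noting: because `make_dataset` pads every input to `max_length=100`, the `torch.max` in `forward` above runs over all 100 positions, padding included. A possible variant, shown here only as a sketch and not as the method used to obtain the results above, excludes padded positions before taking the maximum. The plain-Python version below illustrates the idea on made-up values (`toy_hidden`, `toy_mask`, and `masked_max` are hypothetical).

```python
def masked_max(hidden, mask):
    """hidden: [seq_len][dim] token vectors; mask: [seq_len] with 1 = real, 0 = padding."""
    dim = len(hidden[0])
    # Keep only the vectors at real (non-padding) positions.
    real = [vec for vec, m in zip(hidden, mask) if m == 1]
    return [max(vec[d] for vec in real) for d in range(dim)]


# Hypothetical values: the last row plays the role of a [PAD] position.
toy_hidden = [[0.2, -1.0], [0.5, -3.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

pooled = masked_max(toy_hidden, toy_mask)
# Without the mask, the padding row wins in every dimension.
unmasked = [max(vec[d] for vec in toy_hidden) for d in range(2)]

print(pooled)
print(unmasked)
```

In PyTorch, the same effect could be obtained by filling padded positions with negative infinity first, for example `last_hidden_state.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf")).max(dim=1).values`. Whether this changes the validation accuracy reported above has not been measured here.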