## 6-5 Fine-Tuning BERT and Its Performance

In [1]:
# !mkdir chap6
%cd ./chap6

/content/chap6


In [2]:
!pip install transformers==4.18.0 fugashi==1.1.0 ipadic==1.0.0 pytorch-lightning==1.6.1



In [3]:
import glob
import torch
import random
import numpy as np
import pytorch_lightning as pl

from tqdm import tqdm
from torch.utils.data import DataLoader
from transformers import BertJapaneseTokenizer, BertForSequenceClassification

In [4]:
MODEL_NAME = 'tohoku-nlp/bert-base-japanese-whole-word-masking'

In [5]:
tokenizer = BertJapaneseTokenizer.from_pretrained(MODEL_NAME)
bert_sc = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
bert_sc = bert_sc.cuda()

Some weights of the model checkpoint at tohoku-nlp/bert-base-japanese-whole-word-masking were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initializ

In [6]:
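text_list = [
    'この映画は面白かった。',  # "This movie was fun."
    'この映画の最後にはがっかりさせられた。',  # "The ending of this movie was disappointing."
    'この映画を見て幸せな気持ちになった。',  # "Watching this movie made me happy."
]

# 1 = positive, 0 = negative
label_list = [1, 0, 1]

# Tokenize the batch, pad to the longest sentence, and move everything to the GPU.
encoding = tokenizer(text_list, padding='longest', return_tensors='pt')
encoding = {k: v.cuda() for k, v in encoding.items()}
labels = torch.tensor(label_list).cuda()

# Inference only, so gradients are not needed.
with torch.no_grad():
  output = bert_sc(**encoding)
score = output.logits  # shape: (batch_size, num_labels)
labels_predicted = score.argmax(-1)
num_correct = (labels_predicted == labels).sum().item()
accuracy = num_correct / labels.size(0)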

print('# scores:')
print(score.size())
print('# predicted labels:')
print(labels_predicted)
print('# accuracy')
print(accuracy)

# scores:
torch.Size([3, 2])
# predicted labels:
tensor([0, 1, 0], device='cuda:0')
# accuracy
0.0
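
The `score` tensor holds raw logits, one row per sentence and one column per label. Because the classification head has not been trained yet, the predictions above carry no information. As a minimal sketch in plain Python, with made-up logit values, this is how softmax would turn one row of logits into probabilities. The predicted label is the same either way, since softmax preserves the ordering of the logits. The helper name `softmax` and the numbers are illustrative assumptions, not output from the model above.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence and two labels (not real model output).
logits = [0.3, -0.1]
probs = softmax(logits)

# argmax over logits and argmax over probabilities pick the same label.
predicted_from_logits = max(range(len(logits)), key=lambda i: logits[i])
predicted_from_probs = max(range(len(probs)), key=lambda i: probs[i])
print(probs, predicted_from_logits, predicted_from_probs)
```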


In [7]:
encoding = tokenizer(text_list, padding='longest', return_tensors='pt')
encoding['labels'] = torch.tensor(label_list)
encoding = {k: v.cuda() for k, v in encoding.items()}

output = bert_sc(**encoding)
loss = output.loss
print(loss)

tensor(0.7364, device='cuda:0', grad_fn=<NllLossBackward0>)
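
When `labels` is passed in, the model also returns a loss. The `NllLossBackward0` in the output suggests a cross-entropy (negative log-likelihood) loss. The following is a rough sketch of that computation in plain Python, assuming single-label classification. It also hints at why a value around 0.7 is unsurprising for an untrained two-class head: with equal logits for both labels the loss is exactly ln 2 (about 0.693). The helper name `cross_entropy` and the logits are hypothetical.

```python
import math

def cross_entropy(logits_batch, labels):
    """Mean negative log-likelihood of the correct label under softmax."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[label])  # equals -log softmax(logits)[label]
    return sum(losses) / len(losses)

labels = [1, 0, 1]
# An "uninformed" classifier: identical logits for both labels (hypothetical values).
uninformed_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss = cross_entropy(uninformed_logits, labels)
print(loss)
```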


In [8]:
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz
!tar -zxf ldcc-20140209.tar.gz

--2024-05-22 07:11:22--  https://www.rondhuit.com/download/ldcc-20140209.tar.gz
Resolving www.rondhuit.com (www.rondhuit.com)... 59.106.19.174
Connecting to www.rondhuit.com (www.rondhuit.com)|59.106.19.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8855190 (8.4M) [application/x-gzip]
Saving to: ‘ldcc-20140209.tar.gz.1’


2024-05-22 07:11:27 (2.55 MB/s) - ‘ldcc-20140209.tar.gz.1’ saved [8855190/8855190]



In [9]:
!cat ./text/it-life-hack/it-life-hack-6342280.txt

http://news.livedoor.com/article/detail/6342280/
2012-03-06T13:00:00+0900
USB3.0対応で爆速データ転送！　9倍速のリーダー／ライター登場
USB3.0が登場してから今年で4年目となるがパソコン側でのUSB3.0ポート搭載が進んで来ても対応機器がなかなか充実していない現状がある。そんな中で新しく高速な読み取りが可能なメモリーカードリーダー／ライターが登場した。

バッファローコクヨサプライがUSB3.0対応のカードリーダー／ライターを発表した。SDHC対応のSD系メディアやコンパクトフラッシュ、メモリースティック系メディア、xDピクチャーカードといったデジカメやスマホ、携帯ゲームといった機器で使われている各種メディアを従来よりも短時間でPCに取り込むことが可能になる。

転送速度が5Gbps（理論値）とUSB2.0の480Mbpsと比べて爆速になったUSB3.0はPC側の対応が進んで来ていたが高速転送が生かせる周辺機器としては、外付けHDDや一部のUSBメモリーくらいしかなかった。これに多くのメディアが扱えるリーダー／ライターが加わることで手軽にUSB3.0の恩恵を受けることができるようになる。

今回発表されたのは、USB3.0ケーブルとカードリーダー本体が分かれるタイプの「BSCR09U3」シリーズ（3,240円）、USB3.0コネクタをカードリーダー本体に内蔵している「BSCRD04U3」シリーズ（2,690円）だ。共にホワイトとブラックのカラーバリエーションが用意される（発売は3月下旬以降）。

■リリースページ
■バッファローコクヨサプライ




■バッファローの記事をもっと見る
・約283gでカバンに入る！小型キーボードの驚くべき機能
・3種類のホットキーで使いやすい！AndroidとPCで使えるキーボードの魅力
・ドラえもんもビックリの新アイテム！マウスとキーボードが合体"OPAir"
・ありそうでなかった便利機能！ファイル仕分けする画期的なHDD


サンディスク SanDisk microSDHC 32GB（microSD 32GB） 超高速クラス4  変換アダプター付 世界国内シェアNo.1 バルク品
クチコミを見る
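
Judging from the file shown above, each article starts with three header lines (URL, timestamp, title), followed by the body. Below is a small sketch of splitting a file's text into those parts. `parse_article` and the sample string are hypothetical, and the three-line header is an assumption based on this one file.

```python
def parse_article(raw_text):
    """Split an article file into its three header fields and the body."""
    lines = raw_text.splitlines()
    url, timestamp, title = lines[0], lines[1], lines[2]
    body = '\n'.join(lines[3:])
    return {'url': url, 'timestamp': timestamp, 'title': title, 'body': body}

# Hypothetical sample in the same layout as the file above.
sample = (
    'http://news.livedoor.com/article/detail/6342280/\n'
    '2012-03-06T13:00:00+0900\n'
    'Sample title\n'
    'First body line.\n'
    '\n'
    'Second body paragraph.\n'
)
article = parse_article(sample)
print(article['title'])
print(article['body'])
```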


In [10]:
dataset_for_loader = [
    {'data': torch.tensor([0, 1]), 'labels': torch.tensor(0)},
    {'data': torch.tensor([2, 3]), 'labels': torch.tensor(1)},
    {'data': torch.tensor([4, 5]), 'labels': torch.tensor(2)},
    {'data': torch.tensor([6, 7]), 'labels': torch.tensor(3)},
]

loader = DataLoader(dataset_for_loader, batch_size=2)

for idx, batch in enumerate(loader):
  print(f'# batch {idx}')
  print(batch)

# batch 0
{'data': tensor([[0, 1],
        [2, 3]]), 'labels': tensor([0, 1])}
# batch 1
{'data': tensor([[4, 5],
        [6, 7]]), 'labels': tensor([2, 3])}


In [11]:
loader = DataLoader(dataset_for_loader, batch_size=2, shuffle=True)

for idx, batch in enumerate(loader):
  print(f'# batch {idx}')
  print(batch)

# batch 0
{'data': tensor([[2, 3],
        [0, 1]]), 'labels': tensor([1, 0])}
# batch 1
{'data': tensor([[4, 5],
        [6, 7]]), 'labels': tensor([2, 3])}
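
To make the two runs above easier to reason about, here is a simplified sketch of what batching with and without shuffling amounts to. This is not how `DataLoader` is implemented. In particular, it does not stack the items of a batch into tensors as `DataLoader` does. The function `iterate_batches` and its `seed` parameter are hypothetical.

```python
import random

def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield lists of up to batch_size items, optionally in shuffled order."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

data = [
    {'data': [0, 1], 'labels': 0},
    {'data': [2, 3], 'labels': 1},
    {'data': [4, 5], 'labels': 2},
    {'data': [6, 7], 'labels': 3},
]

batches = list(iterate_batches(data, batch_size=2))
shuffled = list(iterate_batches(data, batch_size=2, shuffle=True, seed=0))
for idx, batch in enumerate(shuffled):
    print(f'# batch {idx}', [item['labels'] for item in batch])
```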


In [12]:
category_list = [
    'dokujo-shushin',
    'it-life-hack',
    'kaden-channel',
    'livedoor-home',
    'movie-enter',
    'peachy',
    'smax',
    'sports-watch',
    'topic-news'
]

tokenizer = BertJapaneseTokenizer.from_pretrained(MODEL_NAME)

max_length = 128
dataset_for_loader = []
for label, category in enumerate(tqdm(category_list)):
  for file in glob.glob(f'./text/{category}/{category}*'):
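    # Skip the first three lines (URL, timestamp, title) and keep the article body.
    with open(file, encoding='utf-8') as f:
      lines = f.read().splitlines()
    text = '\n'.join(lines[3:])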
    encoding = tokenizer(
        text,
        max_length=max_length,
        padding='max_length',
        truncation=True
    )

    encoding['labels'] = label
    encoding = {k: torch.tensor(v) for k, v in encoding.items()}
    dataset_for_loader.append(encoding)

100%|██████████| 9/9 [00:22<00:00,  2.53s/it]


In [13]:
print(dataset_for_loader[0])

{'input_ids': tensor([    2, 10994,     9,     6,   159,    37,    48,    32,     7,    36,
         9574,     5,  2358,    32,   833,    19,   174,    11,  1488,    38,
           15,    16,  6724,  5464,    12,  5210,  8585,    11,  8065, 12932,
           36,   546, 10780,  1064, 28555,    38,    11,  1174,    15,    10,
            8,   546, 10780,  1064, 28555,    12,     9,     6,  4430,   331,
            5,  5210,  8585,    12,  3593,     7,  2856,    16,    36,   804,
         1723,    28,  3871,  1058,    38,    13,  2547,    13,     6,  8585,
          854,     5,  4749,   109,     7,     6,    73, 22982,     5, 14344,
           12, 24127,   181,  7488,     5,  4310,   118,     5,  5831,    14,
         1876,    26,    62,     8,    59,  4310,    12,     9,     6,   908,
           19,   265,   587, 23149,     5,  1377,  1043,    12,  2519,    26,
           20,    10, 24127,   181,  7488,    11,  1951,    15,    16,    33,
            8,   106,     6,  9574,     5, 24127, 
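
With `padding='max_length'` and `truncation=True`, every example ends up with exactly `max_length` token IDs. The idea can be sketched as follows. This is a simplification: the real tokenizer also keeps the special tokens in place when it truncates, which this sketch does not attempt. `pad_or_truncate` is a hypothetical helper, and `pad_id=0` is an assumption about the padding token's ID.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])   # cut off anything past max_length
    mask = [1] * len(ids)                # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# A short sequence gets padded; a long one gets truncated.
short_ids, short_mask = pad_or_truncate([2, 10994, 9, 3], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```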

In [14]:
random.shuffle(dataset_for_loader)
n = len(dataset_for_loader)
n_train = int(0.6*n)
n_val = int(0.2*n)

dataset_train = dataset_for_loader[:n_train]
dataset_val = dataset_for_loader[n_train:n_train+n_val]
dataset_test = dataset_for_loader[n_train+n_val:]

dataloader_train = DataLoader(dataset_train, batch_size=32, shuffle=True)
dataloader_val = DataLoader(dataset_val, batch_size=256)
dataloader_test = DataLoader(dataset_test, batch_size=256)
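
The split above depends on the state of the global random number generator, so it changes from run to run. If a repeatable split is wanted, one option is to shuffle with a seeded generator. The sketch below is plain Python; `split_dataset` and its `seed` parameter are hypothetical. Note that, as in the cell above, `int()` rounds the training and validation sizes down, so the test set receives the remainder.

```python
import random

def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of items with a fixed seed and split it three ways."""
    items = list(items)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(items)
    n_train = int(train_frac * len(items))
    n_val = int(val_frac * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(103))
print(len(train), len(val), len(test))
```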

In [15]:
class BertForSequenceClassification_pl(pl.LightningModule):

  def __init__(self, model_name, num_labels, lr):
    super().__init__()
    self.save_hyperparameters()
    self.bert_sc = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

  def training_step(self, batch, batch_idx):
    output = self.bert_sc(**batch)
    loss = output.loss
    self.log('train_loss', loss)
    return loss

  def validation_step(self, batch, batch_idx):
    output = self.bert_sc(**batch)
    val_loss = output.loss
    self.log('val_loss', val_loss)

  def test_step(self, batch, batch_idx):
    labels = batch.pop('labels')
    output = self.bert_sc(**batch)
    labels_predicted = output.logits.argmax(-1)
    num_correct = (labels_predicted == labels).sum().item()
    accuracy = num_correct / labels.size(0)
    self.log('accuracy', accuracy)

  def configure_optimizers(self):
    return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

In `test_step`, `labels` is popped from the batch so that the predictions can be compared against it to compute `accuracy`.
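
`test_step` logs one accuracy value per batch, and PyTorch Lightning combines the logged values into a single number. If you prefer to compute accuracy yourself, counting correct predictions over the whole set avoids any dependence on batch sizes. The plain-Python sketch below, with made-up predictions, shows how a simple mean of per-batch accuracies can differ from the overall accuracy when the last batch is smaller. `overall_accuracy` is a hypothetical helper.

```python
def overall_accuracy(batches):
    """batches: iterable of (predicted_labels, true_labels) pairs."""
    num_correct = 0
    num_total = 0
    for predicted, true in batches:
        num_correct += sum(p == t for p, t in zip(predicted, true))
        num_total += len(true)
    return num_correct / num_total

# Hypothetical predictions: a full batch of 4 and a final batch of 1.
batches = [
    ([0, 1, 2, 2], [0, 1, 2, 0]),  # 3 of 4 correct
    ([1], [0]),                    # 0 of 1 correct
]

per_batch = [
    sum(p == t for p, t in zip(pred, true)) / len(true)
    for pred, true in batches
]
mean_of_batches = sum(per_batch) / len(per_batch)  # treats both batches equally
acc = overall_accuracy(batches)                    # weights every example equally
print(mean_of_batches, acc)
```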

In [16]:
checkpoint = pl.callbacks.ModelCheckpoint(
    monitor='val_loss',
    mode='min',
    save_top_k=1,
    save_weights_only=True,
    dirpath='model/'
)

trainer = pl.Trainer(
    gpus=1,
    max_epochs=10,
    callbacks=[checkpoint]
)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True, used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
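
The `ModelCheckpoint` settings above (`monitor='val_loss'`, `mode='min'`, `save_top_k=1`) ask the trainer to keep only the weights from the epoch with the lowest validation loss. The selection rule itself can be sketched in a few lines of plain Python. `best_epoch` is a hypothetical helper, and the loss values are made up rather than taken from this run.

```python
def best_epoch(val_losses):
    """Return (epoch_index, loss) for the lowest validation loss."""
    best_idx = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_idx, val_losses[best_idx]

# Hypothetical per-epoch validation losses (not the values from this run).
val_losses = [0.9, 0.6, 0.5, 0.55, 0.7]
epoch, loss = best_epoch(val_losses)
print(epoch, loss)
```

Selecting on validation loss rather than training loss matters here: in the made-up series the loss rises again after the third epoch, which is the typical sign of overfitting that the checkpoint is meant to guard against.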


In [17]:
model = BertForSequenceClassification_pl(MODEL_NAME, num_labels=9, lr=1e-5)
trainer.fit(model, dataloader_train, dataloader_val)

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

In [18]:
best_model_path = checkpoint.best_model_path
print('Best model file: ', checkpoint.best_model_path)
print('Validation loss of the best model: ', checkpoint.best_model_score)

Best model file:  /content/chap6/model/epoch=5-step=678-v1.ckpt
Validation loss of the best model:  tensor(0.2359, device='cuda:0')


In [19]:
# %load_ext tensorboard
# %tensorboard --logdir ./

In [20]:
test = trainer.test(dataloaders=dataloader_test)
print(f'Accuracy: {test[0]["accuracy"]:.2f}')

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/chap6/model/epoch=5-step=678-v1.ckpt
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from checkpoint at /content/chap6/model/epoch=5-step=678-v1.ckpt


Testing: 0it [00:00, ?it/s]

Accuracy: 0.93


In [21]:
model = BertForSequenceClassification_pl.load_from_checkpoint(best_model_path)
model.bert_sc.save_pretrained('./model_transformers')

In [22]:
bert_sc = BertForSequenceClassification.from_pretrained('./model_transformers')
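
Once the model is reloaded, its output logits have one entry per category, in the order used for `category_list` when the dataset was built. Below is a minimal sketch of mapping a row of logits back to a category name, in plain Python. `predict_category` is a hypothetical helper and the logits are made up. In practice they would come from `bert_sc(**encoding).logits`.

```python
category_list = [
    'dokujo-shushin',
    'it-life-hack',
    'kaden-channel',
    'livedoor-home',
    'movie-enter',
    'peachy',
    'smax',
    'sports-watch',
    'topic-news',
]

def predict_category(logits):
    """Return the category name whose logit is largest."""
    label = max(range(len(logits)), key=lambda i: logits[i])
    return category_list[label]

# Hypothetical logits for a single article (not real model output).
hypothetical_logits = [0.1, 2.3, -0.5, 0.0, 0.4, -1.2, 0.9, 0.2, -0.3]
print(predict_category(hypothetical_logits))
```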