Based on the article at (https://zenn.dev/novel_techblog/articles/362fceec01c8b1).  
This notebook classifies news articles into one of four labels: World, Sports, Business, or Science/Technology.

- transformers module
Tools for using pretrained models, running pretraining on new datasets, and fine-tuning models.

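- Tokenizer
Splits text into tokens, the smallest units the model works with, so that the text can be used for natural language processing.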

- Datasets
Loads, manipulates, and preprocesses datasets.
It also provides access to a large repository of datasets.

# Overall flow
Fine-tuning a pretrained model and applying it to a specific task produces an accurate model at low cost (transfer learning).
Here, DistilBERT is fine-tuned.

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch

2025-05-17 01:31:31.210662: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747413091.227567 3148287 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747413091.231322 3148287 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1747413091.241885 3148287 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.

In [2]:
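# Load the AG News dataset (train and test splits)
dataset = load_dataset("ag_news")
# Keep only the first 10,000 training examples to shorten training
dataset["train"] = dataset["train"].select(range(10000))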

In [3]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    # Pad every example to the model's maximum length and truncate longer texts
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
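
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the real DistilBERT WordPiece tokenizer: the whitespace splitting, the tiny vocabulary, the `toy_encode` helper, and `max_length=6` are all assumptions made for illustration. It only shows how every example ends up with the same length and how the attention mask marks the real tokens.

```python
# Toy illustration only: a whitespace "tokenizer" with a made-up vocabulary.
PAD_ID, UNK_ID = 0, 1
VOCAB = {"stocks": 2, "rise": 3, "after": 4, "strong": 5,
         "earnings": 6, "report": 7, "today": 8}

def toy_encode(text, max_length=6):
    """Return input_ids and attention_mask, truncated or padded to max_length."""
    ids = [VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]        # truncation: drop tokens past max_length
    mask = [1] * len(ids)         # 1 marks a real token
    pad = max_length - len(ids)
    ids += [PAD_ID] * pad         # padding: fill up to max_length
    mask += [0] * pad             # 0 marks a padding position
    return {"input_ids": ids, "attention_mask": mask}

short = toy_encode("Stocks rise today")
long = toy_encode("Stocks rise after strong earnings report today")
print(short)
print(long)
```

With the real tokenizer, `padding="max_length"` pads every example to the model's maximum length (512 tokens for `distilbert-base-uncased`), which is simple but makes short news snippets more expensive to process than they need to be.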

In [4]:
train_dataset = tokenized_dataset["train"]
test_dataset = tokenized_dataset["test"]

In [5]:
# The pretrained model used here is DistilBERT; num_labels=4 matches the four news categories
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
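# Checkpoints go to "test_trainer"; evaluate on the test set at the end of every epoch
# (older transformers releases name this argument evaluation_strategy)
training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch")
trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset
)

# All other settings use the TrainingArguments defaults (for example, 3 epochs)
trainer.train()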

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.291758 |
| 2 | 0.301700 | 0.275978 |
| 3 | 0.301700 | 0.304875 |




TrainOutput(global_step=939, training_loss=0.2121425746474911, metrics={'train_runtime': 23043.089, 'train_samples_per_second': 1.302, 'train_steps_per_second': 0.041, 'total_flos': 3974163701760000.0, 'train_loss': 0.2121425746474911, 'epoch': 3.0})
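
The "No log" entry and the repeated training loss in the per-epoch results above follow from the step counts. The short calculation below is a sketch: the 500-step logging interval is the `TrainingArguments` default (`logging_steps`), while the effective batch size of 32 is not stated anywhere in the notebook and is only inferred here from `global_step=939` over 3 epochs of 10,000 examples.

```python
import math

num_examples = 10_000   # training subset selected in In [2]
num_epochs = 3          # TrainingArguments default, matches 'epoch': 3.0
global_step = 939       # from TrainOutput
logging_steps = 500     # TrainingArguments default

steps_per_epoch = global_step // num_epochs
# Assumption: infer the effective batch size from the step count.
effective_batch = math.ceil(num_examples / steps_per_epoch)

# Steps at which the training loss is logged during the run.
log_points = list(range(logging_steps, global_step + 1, logging_steps))

print(f"steps per epoch: {steps_per_epoch}, inferred batch size: {effective_batch}")
for epoch in range(1, num_epochs + 1):
    end_step = epoch * steps_per_epoch
    logged = [s for s in log_points if s <= end_step]
    status = f"last logged at step {logged[-1]}" if logged else "No log"
    print(f"epoch {epoch}: ends at step {end_step}, training loss {status}")
```

Only one training loss value is logged (at step 500), so epochs 2 and 3 both display 0.301700. The `training_loss` of about 0.212 in `TrainOutput` is a separate figure averaged over the whole run.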

In [7]:
eval_results = trainer.evaluate()

print(f"Eval results: {eval_results}")

Eval results: {'eval_loss': 0.3048754036426544, 'eval_runtime': 628.4933, 'eval_samples_per_second': 12.092, 'eval_steps_per_second': 0.379, 'epoch': 3.0}
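
The evaluation results above contain only the loss. As a hedged sketch, an accuracy metric could be added through the `compute_metrics` argument of `Trainer`. The function name `compute_accuracy` and the sample arrays below are hypothetical and were not part of the original run.

```python
import numpy as np

def compute_accuracy(eval_pred):
    """Turn the (logits, labels) pair from the Trainer into an accuracy value."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # highest-scoring class per example
    return {"accuracy": float((predictions == labels).mean())}

# Stand-alone check with made-up logits for 3 examples and 4 classes.
fake_logits = np.array([[2.0, 0.1, 0.3, 0.0],
                        [0.2, 1.5, 0.1, 0.4],
                        [0.1, 0.2, 0.3, 0.9]])
fake_labels = np.array([0, 1, 2])
metrics = compute_accuracy((fake_logits, fake_labels))
print(metrics)
```

Passing `compute_metrics=compute_accuracy` when constructing the `Trainer` in In [6] should make `trainer.evaluate()` report an `eval_accuracy` entry alongside `eval_loss`.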


In [8]:
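device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()  # disable dropout for inference
input_text = "Text"  # placeholder; replace with the news text to classify
input_data = tokenizer(input_text, return_tensors='pt', truncation=True).to(device)
with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**input_data)
# Index of the highest-scoring class
predicted_class_idx = outputs.logits.argmax(-1).item()
# AG News label ids and their names
class_dict = {0: "World", 1: "Sports", 2: "Business", 3: "Science/Technology"}
predicted_class_name = class_dict[predicted_class_idx]
print(f"The input text is classified as: {predicted_class_name}")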

The input text is classified as: Science/Technology
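
`argmax` over the logits returns only the winning class. If a confidence value is also wanted, the logits can be converted to probabilities with a softmax. The sketch below uses plain Python and made-up logits (`example_logits` is hypothetical, not output from the trained model); with the real model, `torch.softmax(outputs.logits, dim=-1)` would play the same role.

```python
import math

CLASS_NAMES = {0: "World", 1: "Sports", 2: "Business", 3: "Science/Technology"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, -0.8, 0.3, 2.1]   # hypothetical values for one input
probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{CLASS_NAMES[best]} (p={probs[best]:.3f})")
for idx, p in enumerate(probs):
    print(f"  {CLASS_NAMES[idx]}: {p:.3f}")
```

A low top probability is a hint that the input may not fit any of the four categories well, which a bare class name does not reveal.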
