Based on the article at https://zenn.dev/novel_techblog/articles/362fceec01c8b1  
The task is to classify news articles into four labels: World, Sports, Business, and Science/Technology.

- transformers module
Tools for using pretrained models, pretraining on new datasets, and fine-tuning models.

- Tokenizer
Splits text into tokens, the smallest units the model works with, so the text can be used in natural language processing.

- Datasets
Loading, manipulating, and preprocessing datasets.
The library also gives access to a large repository of ready-made datasets.
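
To make the Tokenizer item more concrete, here is a minimal sketch of greedy longest-match subword splitting, the idea behind the WordPiece tokenizers used by BERT-family models. The vocabulary `TOY_VOCAB` and the function `wordpiece_tokenize` are hypothetical and exist only for illustration; a real tokenizer has a far larger vocabulary and also handles normalization, punctuation, and special tokens.

```python
# Toy vocabulary (hypothetical); "##" marks a piece that continues a word.
TOY_VOCAB = {"[UNK]", "token", "##ize", "##r", "##s", "news", "the", "class", "##ify"}


def wordpiece_tokenize(text, vocab=TOY_VOCAB, unk_token="[UNK]"):
    """Split lowercased, whitespace-separated words into the longest matching vocabulary pieces."""
    tokens = []
    for word in text.lower().split():
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            match = None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate
                if candidate in vocab:
                    match = candidate
                    break
                end -= 1
            if match is None:
                # No piece fits: the whole word becomes the unknown token.
                pieces = [unk_token]
                break
            pieces.append(match)
            start = end
        tokens.extend(pieces)
    return tokens


print(wordpiece_tokenize("The tokenizers classify news"))
```

Words that are not in the vocabulary as a whole are broken into known pieces, which is why a fixed vocabulary can still cover unseen words.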

# Overall flow
Fine-tuning a pretrained model for a specific task yields an accurate model at low cost (transfer learning).
Here, DistilBERT is the model being fine-tuned.

In [7]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch

In [2]:
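# Download the AG News dataset (train and test splits)
dataset = load_dataset("ag_news")
# Keep only the first 10,000 training examples to shorten training
dataset["train"] = dataset["train"].select(range(10000))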

Generating train split: 100%|██████████| 120000/120000 [00:00<00:00, 372299.46 examples/s]
Generating test split: 100%|██████████| 7600/7600 [00:00<00:00, 718639.91 examples/s]


In [3]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 10000/10000 [00:02<00:00, 4474.43 examples/s]
Map: 100%|██████████| 7600/7600 [00:01<00:00, 4657.29 examples/s]
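
The `tokenize_function` above passes `padding="max_length"` and `truncation=True`, so every example ends up with the same length. A minimal plain-Python sketch of what that means for a single sequence of token IDs follows. The helper `pad_or_truncate`, the pad ID `0`, the length `8`, and the ID values are assumptions chosen for illustration; the real tokenizer pads to the model's maximum length and keeps its special tokens in place when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids plus an attention mask (1 = real token, 0 = padding)."""
    kept = token_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(kept)     # padding: number of slots still empty
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Arbitrary example IDs: one sequence shorter and one longer than max_length.
short = pad_or_truncate([101, 2739, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short)
print(long)
```

Padding every example to the maximum length is simple but spends compute on padding tokens, which is one reason training on this dataset can be slow.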


In [4]:
train_dataset = tokenized_dataset["train"]
test_dataset = tokenized_dataset["test"]

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4) # the pretrained model used here is DistilBERT; num_labels=4 matches the four news categories

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# "test_trainer" is the output directory; evaluate on eval_dataset at the end of every epoch
training_args = TrainingArguments("test_trainer", eval_strategy="epoch")
trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset
)

trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
eval_results = trainer.evaluate()

print(f"Eval results: {eval_results}")
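
With the `Trainer` set up as above, `trainer.evaluate()` reports the loss but no accuracy. One common way to add accuracy is to pass a `compute_metrics` function to `Trainer`, which receives the logits and labels of the evaluation set. The sketch below is an assumption and does not come from the article; it is written in plain Python so that it can be checked on small lists, and it should also accept the NumPy arrays that `Trainer` supplies.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); works on nested lists as well as NumPy arrays."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        # Predicted class = index of the largest logit in this row.
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == label)
        total += 1
    return {"accuracy": correct / total if total else 0.0}


# Tiny hand-made check: 4 classes, 3 examples, 2 of them predicted correctly.
toy_logits = [[2.0, 0.1, 0.3, 0.0], [0.2, 1.5, 0.1, 0.0], [0.9, 0.1, 0.2, 0.3]]
toy_labels = [0, 1, 2]
print(compute_metrics((toy_logits, toy_labels)))
```

It would be wired in as `Trainer(..., compute_metrics=compute_metrics)`, after which the evaluation results should contain an `eval_accuracy` entry next to `eval_loss`.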

In [None]:
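# Run on the GPU when one is available, otherwise on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()  # make sure dropout is disabled for inference
input_text = "Text"  # placeholder; replace with the news text to classify
input_data = tokenizer(input_text, return_tensors='pt', truncation=True).to(device)
with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**input_data)
# The index of the largest logit is the predicted class
predicted_class_idx = outputs.logits.argmax(-1).item()
# AG News label ids and their names
class_dict = {0: "World", 1: "Sports", 2: "Business", 3: "Science/Technology"}
predicted_class_name = class_dict[predicted_class_idx]
print(f"The input text is classified as: {predicted_class_name}")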
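
The `argmax` in the cell above returns only the winning class. To see how confident the model is, the logits can be turned into probabilities with softmax. Below is a minimal sketch in plain Python; the logit values are made up for illustration and are not real model output. In the notebook itself, `torch.softmax(outputs.logits, dim=-1)` would be the more direct route.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (the max is subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_dict = {0: "World", 1: "Sports", 2: "Business", 3: "Science/Technology"}
example_logits = [0.5, 3.0, -1.0, 0.2]  # made-up values, not real model output
probs = softmax(example_logits)
best = max(range(len(probs)), key=lambda i: probs[i])
print(f"{class_dict[best]}: {probs[best]:.2f}")
```

A low top probability is a hint that the input may not fit any of the four categories well.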