<a href="https://colab.research.google.com/github/raymondwcs/learning_bert/blob/master/Fine_tuning_a_prertrained_model_(huggingface_Trainer).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Reference:

https://huggingface.co/transformers/training.html


In [1]:
!git clone https://github.com/Christainx/Openrice_Cantonese
!7z e Openrice_Cantonese/Openrice_Cantonese.7z -aoa

Cloning into 'Openrice_Cantonese'...
remote: Enumerating objects: 17, done.
remote: Total 17 (delta 0), reused 0 (delta 0), pack-reused 17
Unpacking objects: 100% (17/17), done.

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.30GHz (306F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 8138559 bytes (7948 KiB)

Extracting archive: Openrice_Cantonese/Openrice_Cantonese.7z
--
Path = Openrice_Cantonese/Openrice_Cantonese.7z
Type = 7z
Physical Size = 8138559
Headers Size = 154
Method = LZMA2:25
Solid = -
Blocks = 1

Everything is Ok

Size:       27969449
Compressed: 8138559


In [2]:
!pip install --quiet transformers
!pip install --quiet datasets
from transformers import AutoModelForSequenceClassification,AutoTokenizer,TrainingArguments,Trainer,pipeline

import random

import numpy as np
import pandas as pd
import torch

MAX_LEN = 256
CHECKPOINT = 'uer/roberta-base-finetuned-jd-full-chinese'  # https://huggingface.co/uer/roberta-base-finetuned-jd-full-chinese

# Set the seed value all over the place to make this reproducible.
seed_val = 0

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=5)


In [3]:
df_train = pd.read_csv('Openrice_Cantonese.txt', delimiter="\t\t", header=None, nrows=128, engine='python')
df_eval = pd.read_csv('Openrice_Cantonese.txt', delimiter="\t\t", header=None, skiprows=200, nrows=64, engine='python')
df_test = pd.read_csv('Openrice_Cantonese.txt', delimiter="\t\t", header=None, skiprows=300, nrows=16, engine='python')

# Shift the 1-5 star ratings to zero-based class ids 0-4 (5 -> 4, 4 -> 3, 3 -> 2, 2 -> 1, 1 -> 0)
train_labels = df_train.iloc[:,0].transform(lambda x: x -1)
eval_labels = df_eval.iloc[:,0].transform(lambda x: x -1)
test_labels = df_test.iloc[:,0].transform(lambda x: x -1)

train_encodings = tokenizer(df_train.iloc[:,1].values.tolist(), truncation=True, padding=True, max_length=MAX_LEN)
eval_encodings = tokenizer(df_eval.iloc[:,1].values.tolist(), truncation=True, padding=True, max_length=MAX_LEN)
test_encodings = tokenizer(df_test.iloc[:,1].values.tolist(), truncation=True, padding=True, max_length=MAX_LEN)

class OpenRiceDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = OpenRiceDataset(train_encodings, train_labels)
eval_dataset = OpenRiceDataset(eval_encodings, eval_labels)
test_dataset = OpenRiceDataset(test_encodings, test_labels)

# debug...
# data = next(iter(train_dataset))
# data
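
The three `read_csv` calls above carve the splits out of one file using `skiprows` and `nrows`. The sketch below only reproduces that row arithmetic, assuming one review per line, to confirm the splits do not share any rows. `row_range` and `splits` are illustrative names, not part of the notebook.

```python
def row_range(skiprows, nrows):
    """Half-open range of file rows read for a given skiprows/nrows pair."""
    return range(skiprows, skiprows + nrows)

# Same skiprows/nrows values as the read_csv calls above (train skips nothing).
splits = {
    "train": row_range(0, 128),
    "eval": row_range(200, 64),
    "test": row_range(300, 16),
}

for name, rows in splits.items():
    print(f"{name}: rows {rows.start}..{rows.stop - 1} ({len(rows)} reviews)")

# Check that no row is shared between any two splits.
names = list(splits)
disjoint = all(
    set(splits[a]).isdisjoint(splits[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
)
print("splits are disjoint:", disjoint)
```

Rows 128-199 and 264-299 are simply unused, which leaves a gap between the splits.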

In [4]:
# Let's find out the classifier's performance before fine-tuning
# print(df_test)

text_classification = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
results = text_classification(df_test.iloc[:,1].values.tolist())

for i in range(len(df_test)):
  print(results[i], test_labels[i], df_test[1][i])

    0                                                  1
0   5  <sssss>   一條舊街，一條石屎樓梯，一望親切嘅裝修，原來係金利來老闆，佢張金利來定咗...
1   3  工作忙碌, 有時候連吃飯的時間也減少了, 唯有近近地找個便利的地方吃午餐, 其實這裡勝在地方...
2   4  串燒宵夜🍢😋聽講美味佳好出名，終於俾我試到啦。十一點左右去到唔洗等位，野食等左大約半個鐘就黎...
3   3  怕排隊，特意提前出門的時間半走半跑的來到有「紅磡之寶」之稱（誰改的名字？）的小店，誰知道甫坐...
4   3  呢到食面都係普通咁叫幾餸。特別之處係叫湯可以有湯渣。小弟最愛食湯渣，D肉煲到好軟綿綿，但又未...
5   2  我好少寫食評, 今日都忍唔住要寫一寫見到網上介紹所以試下~ 6:30到達不用等位先來招牌龍蝦...
6   1  食物質量差，超低性價比牛肉很薄，沒有牛肉味，賣295元，”盜鄉”的安格斯牛肉比他好食好多水餃...
7   4  好好彩當晚唔洗等位，一入就已被告知只有1.5 小時。去為食仔就當然要食卜卜蜆煲啦！一入去立刻...
8   4  見到好多朋友是上網食卜卜蜆呃like，所以就去試下啦。嗰度環境就比較窄，但係服務人員態度良好...
9   4  係太古睇完戲都10點幾，好多餐廳都關門。所以專登搭車出去銅鑼灣食野，咁放諗住食麵算啦，突然諗...
10  4  中辣的傷心酸辣粉真的超級辣 快要辣哭了水煮魚好好食，魚肉超級嫩 辣度也剛剛好夥伴點了雞絲涼麵...
11  2  兒時爸爸最喜歡飲茶嘅地方～鳯城，好耐冇食，今日同朋友去懷舊一下。一般酒家嘅裝修及格局，點了菠...
12  4  媽咪生日揀咗好耐唔知去邊度食飯, 最後決定揀咗鳳城！頭盤叫咗個例牌燒肉+蝦多士,味道新鮮又夠...
13  3  只來過食晚餐, 飲茶倒是第一次.星期日中午都竟不太多人, 上樓上馬上有位.叫了幾味標準點心,...
14  3  今日下午去鳳城不過我地選擇唔飲茶...叫餸吃飯~~叫左:大良野雞卷:麻麻地，炸到好乾.......
15  3  係英皇道行，完全無心水食咩，行行下見到薩利亞同大x樂, 諗住是但食一間算，所以就揀左薩利亞蒜...
{'label': 'star 4', 'score': 0.
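
Eyeballing the printed predictions gets tedious, so here is a rough sketch of how the pipeline output could be scored against the zero-based labels. It assumes every label has the form `'star N'`, as in the output above. `star_to_index`, `score_predictions` and the example values are made up for illustration and are not results from this notebook.

```python
def star_to_index(label):
    """Map a pipeline label such as 'star 4' to its zero-based class id (3)."""
    return int(label.split()[-1]) - 1

def score_predictions(results, gold):
    """Return (accuracy, mean absolute star error) for pipeline results."""
    predicted = [star_to_index(r["label"]) for r in results]
    n = len(gold)
    accuracy = sum(p == g for p, g in zip(predicted, gold)) / n
    mae = sum(abs(p - g) for p, g in zip(predicted, gold)) / n
    return accuracy, mae

# Made-up values for illustration only, not output from the model above.
example_results = [
    {"label": "star 4", "score": 0.61},
    {"label": "star 3", "score": 0.48},
    {"label": "star 5", "score": 0.72},
    {"label": "star 1", "score": 0.80},
]
example_gold = [4, 2, 3, 0]  # zero-based labels, like test_labels

accuracy, mae = score_predictions(example_results, example_gold)
print(f"accuracy={accuracy:.2f}, mean absolute error={mae:.2f} stars")
```

In the notebook this would presumably be called as `score_predictions(results, test_labels.tolist())`. The mean absolute error is included because star ratings are ordinal: predicting 4 stars for a 5-star review is a smaller miss than predicting 1 star.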

In [5]:
# Fine-tuning with Trainer

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler (more than the 24 total steps of this run, so warmup never completes)
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=eval_dataset            # evaluation dataset
)

trainer.train()

trainer.save_model()

***** Running training *****
  Num examples = 128
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 24


Step,Training Loss
10,1.2147
20,1.1434




Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to ./results
Configuration saved in ./results/config.json
Model weights saved in ./results/pytorch_model.bin
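
The log above reports 24 optimization steps, while `warmup_steps` is 500. The sketch below redoes that arithmetic and estimates the learning rate the run actually reaches. It assumes a linear warmup followed by linear decay and a base learning rate of 5e-5, which I believe are the `TrainingArguments` defaults. `total_steps` and `linear_warmup_lr` are illustrative helpers, not library functions.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

def linear_warmup_lr(step, base_lr, warmup_steps, training_steps):
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, training_steps - step)
    return base_lr * remaining / max(1, training_steps - warmup_steps)

BASE_LR = 5e-5  # assumption: default learning_rate, not set in the cell above

steps = total_steps(num_examples=128, batch_size=16, epochs=3)
peak = max(linear_warmup_lr(s, BASE_LR, 500, steps) for s in range(steps))

print("total optimization steps:", steps)
print(f"highest learning rate reached: {peak:.2e} (base is {BASE_LR:.0e})")
```

Under these assumptions the whole run stays inside the warmup ramp, so the learning rate never gets close to its nominal value. With only 128 training examples, a much smaller `warmup_steps` (or a warmup ratio) might be worth trying.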


In [None]:
# Fine-tuning with native PyTorch

# from torch.utils.data import DataLoader
# from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch optimizer

# device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
# model.to(device)
# model.train()

# train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# optim = AdamW(model.parameters(), lr=5e-5)

# for epoch in range(3):
#     for batch in train_loader:
#         optim.zero_grad()
#         input_ids = batch['input_ids'].to(device)
#         attention_mask = batch['attention_mask'].to(device)
#         labels = batch['labels'].to(device)
#         outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
#         loss = outputs[0]
#         loss.backward()
#         optim.step()

# model.eval()
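
Neither training route above reports a metric on `eval_dataset`. `Trainer` accepts a `compute_metrics` callable that receives the predictions and label ids at evaluation time. The sketch below is one possible accuracy function, written without NumPy so it runs on plain lists as well as arrays. The toy logits and labels are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a sequence of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair, as unpacked from an EvalPrediction."""
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p) == int(y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy values for illustration only: three examples, five star classes.
toy_logits = [
    [0.1, 0.2, 0.3, 2.0, 0.4],  # predicts class 3
    [1.5, 0.2, 0.1, 0.0, 0.3],  # predicts class 0
    [0.0, 0.1, 0.2, 0.3, 0.9],  # predicts class 4
]
toy_labels = [3, 0, 2]

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

In the notebook it would be passed as `Trainer(..., compute_metrics=compute_metrics)`, after which `trainer.evaluate()` should include the accuracy in its returned dictionary.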

In [None]:
# Give the fine-tuned model a try!

model = AutoModelForSequenceClassification.from_pretrained('./results')

print(df_test)

text_classification = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
results = text_classification(df_test.iloc[:,1].values.tolist())

for i in range(len(df_test)):
  print(results[i], test_labels[i], df_test[1][i])


loading configuration file ./results/config.json
Model config BertConfig {
  "_name_or_path": "uer/roberta-base-finetuned-jd-full-chinese",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "star 1",
    "1": "star 2",
    "2": "star 3",
    "3": "star 4",
    "4": "star 5"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "star 1": 0,
    "star 2": 1,
    "star 3": 2,
    "star 4": 3,
    "star 5": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.9.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_s

    0                                                  1
0   5  <sssss>   一條舊街，一條石屎樓梯，一望親切嘅裝修，原來係金利來老闆，佢張金利來定咗...
1   3  工作忙碌, 有時候連吃飯的時間也減少了, 唯有近近地找個便利的地方吃午餐, 其實這裡勝在地方...
2   4  串燒宵夜🍢😋聽講美味佳好出名，終於俾我試到啦。十一點左右去到唔洗等位，野食等左大約半個鐘就黎...
3   3  怕排隊，特意提前出門的時間半走半跑的來到有「紅磡之寶」之稱（誰改的名字？）的小店，誰知道甫坐...
4   3  呢到食面都係普通咁叫幾餸。特別之處係叫湯可以有湯渣。小弟最愛食湯渣，D肉煲到好軟綿綿，但又未...
5   2  我好少寫食評, 今日都忍唔住要寫一寫見到網上介紹所以試下~ 6:30到達不用等位先來招牌龍蝦...
6   1  食物質量差，超低性價比牛肉很薄，沒有牛肉味，賣295元，”盜鄉”的安格斯牛肉比他好食好多水餃...
7   4  好好彩當晚唔洗等位，一入就已被告知只有1.5 小時。去為食仔就當然要食卜卜蜆煲啦！一入去立刻...
8   4  見到好多朋友是上網食卜卜蜆呃like，所以就去試下啦。嗰度環境就比較窄，但係服務人員態度良好...
9   4  係太古睇完戲都10點幾，好多餐廳都關門。所以專登搭車出去銅鑼灣食野，咁放諗住食麵算啦，突然諗...
10  4  中辣的傷心酸辣粉真的超級辣 快要辣哭了水煮魚好好食，魚肉超級嫩 辣度也剛剛好夥伴點了雞絲涼麵...
11  2  兒時爸爸最喜歡飲茶嘅地方～鳯城，好耐冇食，今日同朋友去懷舊一下。一般酒家嘅裝修及格局，點了菠...
12  4  媽咪生日揀咗好耐唔知去邊度食飯, 最後決定揀咗鳳城！頭盤叫咗個例牌燒肉+蝦多士,味道新鮮又夠...
13  3  只來過食晚餐, 飲茶倒是第一次.星期日中午都竟不太多人, 上樓上馬上有位.叫了幾味標準點心,...
14  3  今日下午去鳳城不過我地選擇唔飲茶...叫餸吃飯~~叫左:大良野雞卷:麻麻地，炸到好乾.......
15  3  係英皇道行，完全無心水食咩，行行下見到薩利亞同大x樂, 諗住是但食一間算，所以就揀左薩利亞蒜...
{'label': 'star 4', 'score': 0.