<a href="https://colab.research.google.com/github/zzxx666413/AI/blob/main/17.Bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Check the GPU
`nvidia-smi` is the command for inspecting GPU devices. In this run Colab provided a Tesla K80 with 11441 MiB (about 12 GB) of GPU memory, as the output below shows.
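
As a small aid for reading that output, the Memory-Usage column can be pulled out of an `nvidia-smi` row with a regular expression. This is a minimal sketch: the helper name `parse_memory_usage` is hypothetical, and the sample row is copied from the output in this notebook.

```python
import re

def parse_memory_usage(row):
    """Return (used_mib, total_mib) from one nvidia-smi table row, or None."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Row copied from the nvidia-smi output in this notebook.
row = "| N/A   59C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |"
used, total = parse_memory_usage(row)
print(f"{used} MiB used of {total} MiB ({total / 1024:.1f} GiB)")
```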



In [None]:
!nvidia-smi

Thu May 19 07:01:14 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Install the required BERT packages


*   `transformers`: Hugging Face's library of pretrained Transformer models; see https://huggingface.co/docs/transformers/index
*   `nlp`: Hugging Face's library of datasets (since renamed `datasets`); see https://github.com/huggingface/datasets
*   `torch`: the PyTorch package






In [None]:
!pip install transformers -U
!pip install nlp -U
!pip install torch -U



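### Load the packages


*   BertForSequenceClassification: a BERT model for sequence classification/regression tasks such as GLUE; it adds a linear layer on top of the BERT model automatically
*   BertTokenizerFast: a WordPiece tokenizer, a subword method similar to BPE (Byte-Pair Encoding) that splits words into smaller vocabulary pieces. For example, loved, loving, and loves could be split into lov, ed, ing, and es
*   Trainer: the class that runs model training
*   TrainingArguments: the class that holds the training settings for Transformers
*   load_dataset, Dataset: tools for loading datasets
*   torch: the PyTorch package
*   random: the random-number module
*   pandas: a package for reading and handling tabular data
*   sklearn.metrics: sklearn (scikit-learn) is a well-known machine learning package; metrics is its module of evaluation functions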
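
The WordPiece splitting mentioned in the list above can be sketched as a greedy longest-match-first lookup. This is a toy illustration, not the tokenizer's actual implementation: the function name `wordpiece` and the four-entry vocabulary are assumptions made for the example. Real WordPiece vocabularies mark continuation pieces with a `##` prefix, which the sketch follows.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"lov", "##ed", "##ing", "##es"}  # hypothetical vocabulary
for w in ["loved", "loving", "loves"]:
    print(w, wordpiece(w, toy_vocab))
```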








In [None]:
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from nlp import load_dataset, Dataset
import nlp
import torch
import random
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

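### Read the training dataset


*   This dataset comes from an AI Go industry challenge. The goal is to have AI automatically judge whether health-food advertising copy contains wording that claims therapeutic effects, which Taiwan's food sanitation law (食品衛生管理法) prohibits.
*   There are roughly a thousand rows, about half legal (passed, label=0) and half illegal (failed, label=1)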
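
One quick way to check the class balance described above is to count the values in the `label` column. The sketch below uses only the standard library and a few toy rows; the rows are made up for illustration and are not taken from `train.csv`, and `label_counts` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

def label_counts(csv_text, label_column="label"):
    """Count how many rows carry each label value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row[label_column]) for row in reader)

# Toy rows for illustration only; they are not taken from train.csv.
toy_csv = "ID,Name,Description,label\n1,a,x,0\n2,b,y,1\n3,c,z,1\n4,d,w,0\n"
counts = label_counts(toy_csv)
print(counts[0], counts[1])
```

With the real file, the same count could be read from the `label` column of the DataFrame loaded in the next cell.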



In [None]:
df = pd.read_csv('train.csv')
dataset = Dataset.from_pandas(df)


In [None]:
df.sample(10)

Unnamed: 0,ID,Name,Description,label
861,861,https://tw.buy.yahoo.com/gdsale/GNC健安喜-BSN-Syn...,●優質蛋白質補充飲品，每份熱量200大卡，含有:-6種優質長、中、速效蛋白質共22公克-必需...,0
467,467,長庚純正牛樟菇菌絲體純液搶購組,「2014民族藥物學國際期刊牛樟菇菌絲體防止肝臟發炎作用明顯增進肺癌抑制療效並緩化療藥物的肝...,1
165,165,超級好抗膠囊(60日份)1瓶,「...多重影響讓過敏族的噴嚏咳嗽樣樣來...光戴口罩根本不夠...總是掛號看病治標不治本....,1
924,924,https://tw.buy.yahoo.com/gdsale/GNC健安喜-乳清蛋白-AM...,進階級能量飲品高熱量、高蛋白、低脂肪。每份4匙提供熱量750大卡蛋白質50公克複合性醣類12...,0
212,212,活氣靈芝多醣體液,「可增加細胞內殼膀胱甘?(GSH)和抗氧化氫H2O2，誘導抗氧化?，以保護氧化壓力...具有...,1
113,113,OCanada燕米1kg,「降低T2DM(第二型糖尿病)的風險...抗胰島素的二型糖尿病患者若以燕米取代米飯，可以期望...,1
368,368,【出生快樂】芝麻素、蜂王漿、黑芝麻、蝦紅素綜合錠,「降低膽固醇、降血壓…抗氧化作用…抑制高血壓…抗發炎、抗腫瘤抗菌、防老年體衰、傳染性肝炎、風...,1
847,847,https://tw.buy.yahoo.com/gdsale/GNC健安喜-雙12限定-活...,【三效魚油】-每顆含有EPA540毫克及DHA360毫克，含量為一般魚油的三倍。-採最新純化...,0
768,768,維生素C(錠狀食品)(Usana),USANA的礦物維生素C配方獨特，每錠提供250毫克的維生素C，含西印度櫻桃萃取物，提供植物...,0
157,157,維他露澳洲神果KKD青纖錠口碑囤貨檔,「杜詩梅口述：...因為在花蓮我都穿運動服所以沒有太大感受都很舒服很寬鬆然後準備要來上節目的...,1


### Load the base model


*   This loads a pretrained BERT model released by HFL (the Joint Laboratory of HIT and iFLYTEK Research)
*   Models are listed at https://huggingface.co/hfl



In [None]:
model = BertForSequenceClassification.from_pretrained('hfl/chinese-bert-wwm-ext')

Some weights of the model checkpoint at hfl/chinese-bert-wwm-ext were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkp

### Load the tokenizer


*   Load the tokenizer that matches the Chinese pretrained BERT model above; it is used later to split the text into tokens



In [None]:
tokenizer = BertTokenizerFast.from_pretrained('hfl/chinese-bert-wwm-ext')

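### Set the run parameters

*   EPOCHS: train for 5 epochs
*   MAX_LEN: the maximum number of tokens kept per description; longer text is truncated
*   BATCH_SIZE: how many examples go into each training step. The limit depends on GPU memory, and a larger value shortens training time
*   RANDOM_SEED: the seed used when shuffling the dataset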
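
Batch size and epoch count together determine how many optimizer steps a run takes. The sketch below works this out for the 749 training examples reported in the training log later in this notebook. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; `total_optimization_steps` is a hypothetical helper name.

```python
import math

def total_optimization_steps(num_examples, batch_size, epochs):
    """Optimizer steps when every batch, including a final partial one, is one step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

print(total_optimization_steps(749, 8, 5))
```

Under those assumptions the result agrees with the `Total optimization steps` value in the training log.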






In [None]:
RANDOM_SEED = 5
MAX_LEN = 512
EPOCHS=5
BATCH_SIZE = 8

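### Split the dataset into training, test, and validation sets


*   dataset.shuffle(RANDOM_SEED): shuffles the dataset, using RANDOM_SEED as the random seed
*   shuffled_ds.train_test_split(test_size=0.2): 80% of the data becomes the training set and 20% is held out
*   test_val_dataset.train_test_split(test_size=0.5): the held-out 20% is split in half; one half is the test set used for evaluation during training, and the other half is the validation set for checking the model after training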
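
The two-stage split above works out to roughly 80/10/10. The arithmetic can be sketched as follows. Two things here are assumptions: the rounding rule (the held-out part is rounded up and the rest goes to training) and the total of 937 rows, which is not stated in the notebook and was chosen because it reproduces the 749 training and 94 evaluation examples shown in the training log.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 937  # assumed row count, chosen because it reproduces the logged sizes
n_train, n_holdout = split_sizes(n_total, 0.2)
n_eval, n_val = split_sizes(n_holdout, 0.5)
print(n_train, n_eval, n_val)
```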





In [None]:
shuffled_ds = dataset.shuffle(RANDOM_SEED)
split_ds = shuffled_ds.train_test_split(test_size=0.2)
train_dataset = split_ds['train']
test_val_dataset = split_ds['test']
split_tv = test_val_dataset.train_test_split(test_size=0.5)
test_dataset = split_tv['train']
val_dataset = split_tv['test']

### Look at one example (already shuffled)

In [None]:
train_dataset[2]

{'Description': '最高320毫克德國大廠檸檬酸鈣一天2包相當於6.4杯牛奶鈣含量，輕鬆補足每天所需。奶素可食',
 'ID': 811,
 'Name': '康采檸檬酸鈣粉-avon',
 'label': 0}

### Tokenization demo


*   Take an arbitrary sentence and show how the tokenizer splits it
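
The tokenizer output in this section has one token per Chinese character, because BERT-style tokenizers treat each CJK character as its own token before WordPiece is applied. The sketch below imitates only that step. It is a simplification: it checks just the main CJK Unified Ideographs block and does no lowercasing, punctuation splitting, or WordPiece lookup. The names `is_cjk` and `split_cjk` are hypothetical.

```python
def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF

def split_cjk(text):
    """Isolate each CJK character; keep runs of other non-space characters together."""
    tokens, run = [], ""
    for ch in text:
        if is_cjk(ch) or ch.isspace():
            if run:  # flush any pending non-CJK run
                tokens.append(run)
                run = ""
            if is_cjk(ch):
                tokens.append(ch)
        else:
            run += ch
    if run:
        tokens.append(run)
    return tokens

print(split_cjk("體重跟體脂"))
print(split_cjk("GNC 健安喜"))
```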




In [None]:
tokenizer.tokenize("體重跟體脂都有在持續緩慢的下降中")

['體',
 '重',
 '跟',
 '體',
 '脂',
 '都',
 '有',
 '在',
 '持',
 '續',
 '緩',
 '慢',
 '的',
 '下',
 '降',
 '中']

### Define the tokenization function

In [None]:
def tokenize(batch):
  return tokenizer(batch['Description'], max_length=MAX_LEN, padding=True, truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

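### Define the evaluation metrics and the trainer

*   These are the standard settings for training a BERT model with Hugging Face and can be copied as they are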
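
To make the four reported metrics concrete, here is a dependency-free sketch of what they measure for binary labels, with label 1 (failed) as the positive class. It is not the scikit-learn implementation; `binary_metrics` is a hypothetical helper, and the logits and labels are invented for the example.

```python
def binary_metrics(labels, preds, positive=1):
    """Accuracy, precision, recall and F1 for binary labels, without scikit-learn."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical logits, one [passed, failed] pair per example.
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
preds = [row.index(max(row)) for row in logits]  # same idea as argmax(-1)
labels = [0, 1, 0, 0]
metrics = binary_metrics(labels, preds)
print(metrics)
```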




In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=50,
    weight_decay=0.01,
    # evaluate_during_training=True,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

### Start training

In [None]:
trainer.train()
trainer.evaluate()

***** Running training *****
  Num examples = 749
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 470



Training completed. Do not forget to share your model on huggingface.co/models =)


***** Running Evaluation *****
  Num examples = 94
  Batch size = 8


{'epoch': 5.0,
 'eval_accuracy': 0.9893617021276596,
 'eval_f1': 0.99009900990099,
 'eval_loss': 0.09061499685049057,
 'eval_precision': 0.9803921568627451,
 'eval_recall': 1.0,
 'eval_runtime': 6.1999,
 'eval_samples_per_second': 15.162,
 'eval_steps_per_second': 1.936}
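
The evaluation numbers above can be cross-checked against a confusion matrix. Assuming label 1 (failed) is the positive class, the counts below are the ones consistent with 94 evaluation examples and the reported accuracy, precision, recall, and F1. They are inferred from those metrics rather than printed by the trainer.

```python
from fractions import Fraction

# Counts inferred from the reported metrics, not logged directly.
tp, fp, fn, tn = 50, 1, 0, 43
total = tp + fp + fn + tn

accuracy = Fraction(tp + tn, total)
precision = Fraction(tp, tp + fp)
recall = Fraction(tp, tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(total, float(accuracy), float(precision), float(recall), float(f1))
```

Read this way, the model flagged one compliant ad as failed and missed no non-compliant ads on this split.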

### Save the model

In [None]:
model.save_pretrained("bert_food_ad")

Configuration saved in bert_food_ad/config.json
Model weights saved in bert_food_ad/pytorch_model.bin


### View the training results with TensorBoard

In [None]:
# Load the TensorBoard notebook extension first; without it the %tensorboard magic is not defined
%load_ext tensorboard
%tensorboard --logdir ./logs
# from tensorboard import notebook
# notebook.list()
# notebook.display(port=6006, height=1000)


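### Load the trained model


*   Loading the trained model takes two things
*   The first is config.json, the configuration file saved when the fine-tuned model was stored
*   The second is the BERT model weights file (pytorch_model.bin)
*   Together they reconstruct the trained model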
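
Before loading, it can help to confirm that both files are present in the save directory. A minimal sketch with a hypothetical helper, `missing_model_files`; the file names are the ones printed when the model was saved.

```python
from pathlib import Path

# File names printed when the model was saved earlier in this notebook.
REQUIRED_FILES = ("config.json", "pytorch_model.bin")

def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]

print(missing_model_files("bert_food_ad"))
```

An empty list means both files were found.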



In [None]:
from transformers import BertConfig

test_config = BertConfig.from_json_file('./bert_food_ad/config.json')
test_model = BertForSequenceClassification.from_pretrained('./bert_food_ad/pytorch_model.bin', config=test_config)
test_model.eval()  # switch the reloaded model to inference mode

loading weights file ./bert_food_ad/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the model checkpoint at ./bert_food_ad/pytorch_model.bin.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:

print(df.iloc[220]['label'], df.iloc[220]['Description'])
print(df.iloc[578]['label'], df.iloc[578]['Description'])

1 「增強免疫力...護眼明目...抗氧化...保肝...皮膚：...保護皮膚，減緩皮屑問題，降低敏感...毛髮……促進毛髮亮麗...消化系統」「強化免疫系統，有助調整過敏體質...清除自由基延緩老化...調整腸胃道機能...調理皮膚問題...諾麗果含有許多強大的抗氧化劑」「阻止或減少病痛的發生，並迅速修補已受傷的細胞，將代謝殘渣或毒素排出體外」「它能維護身體細胞組織正常運作，阻止或減少病痛的發生，並迅速修補已受傷的細胞，將代謝殘渣或毒素排出體外，使身體恢復正常」
0 ·???????每份含有30mg輔Q10。輔Q10可參與維生素E再生，兩者具協同作用，可促進新陳代謝。保護循環順暢維持健康。暢銷日本的最佳美容聖品。


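### Enter the test text here


*   test_text: the text to classify
*   encoded_review: the text encoded as PyTorch tensors and truncated to at most 512 tokens; no padding is applied here, so every entry of the attention mask is 1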
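
The relationship between truncation, padding, and the attention mask can be sketched on plain lists. This is a toy model of the encoding step, not the tokenizer's code: `encode_ids` is a hypothetical helper, and the special-token ids are assumed to follow the standard BERT vocabulary layout (101 for [CLS], 102 for [SEP], 0 for [PAD]).

```python
# Assumed special-token ids, following the standard BERT vocabulary layout.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_ids(token_ids, max_length, pad=False):
    """Add [CLS]/[SEP], truncate to max_length, optionally pad, and build the mask."""
    body = token_ids[: max_length - 2]  # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token
    if pad:
        n_pad = max_length - len(input_ids)
        input_ids += [PAD_ID] * n_pad
        attention_mask += [0] * n_pad  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_ids = [1872, 2485, 1048, 4554, 1213]  # first token ids of the encoded text in this notebook
print(encode_ids(toy_ids, max_length=8))
print(encode_ids(toy_ids, max_length=8, pad=True))
print(encode_ids(toy_ids, max_length=4))
```

Padding positions get id 0 and mask 0, while real tokens always get mask 1.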



In [None]:
test_text = '增強免疫力...護眼明目...抗氧化...保肝...皮膚：...保護皮膚，減緩皮屑問題，降低敏感...毛髮……促進毛髮亮麗...消化系統」「強化免疫系統，有助調整過敏體質...清除自由基延緩老化...調整腸胃道機能...調理皮膚問題...諾麗果含有許多強大的抗氧化劑」「阻止或減少病痛的發生，並迅速修補已受傷的細胞，將代謝殘渣或毒素排出體外」「它能維護身體細胞組織正常運作，阻止或減少病痛的發生，並迅速修補已受傷的細胞，將代謝殘渣或毒素排出體外，使身體恢復正常'

MAX_LEN = 512

encoded_review = tokenizer.encode_plus(
    test_text,
    max_length=MAX_LEN,
    add_special_tokens=True,
    return_token_type_ids=False,
    return_attention_mask=True,
    return_tensors='pt',
    truncation=True
)

In [None]:
encoded_review

{'input_ids': tensor([[ 101, 1872, 2485, 1048, 4554, 1213,  119,  119,  119, 6362, 4706, 3209,
         4680,  119,  119,  119, 2834, 3709, 1265,  119,  119,  119,  924, 5498,
          119,  119,  119, 4649, 5604, 8038,  119,  119,  119,  924, 6362, 4649,
         5604, 8024, 3938, 5227, 4649, 2244, 1558, 7539, 8024, 7360,  856, 3130,
         2697,  119,  119,  119, 3688, 7773,  100,  100,  914, 6868, 3688, 7773,
          778, 7927,  119,  119,  119, 3867, 1265, 5143, 5186,  520,  519, 2485,
         1265, 1048, 4554, 5143, 5186, 8024, 3300, 1221, 6310, 3146, 6882, 3130,
         7768, 6549,  119,  119,  119, 3926, 7370, 5632, 4507, 1825, 2454, 5227,
         5439, 1265,  119,  119,  119, 6310, 3146, 5591, 5517, 6887, 3582, 5543,
          119,  119,  119, 6310, 4415, 4649, 5604, 1558, 7539,  119,  119,  119,
         6330, 7927, 3362, 1419, 3300, 6258, 1914, 2485, 1920, 4638, 2834, 3709,
         1265, 1212,  520,  519, 7349, 3632, 2772, 3938, 2208, 4567, 4578, 4638,
         4634,

### Run the test


*   The result is Failed (rejected: the text contains prohibited wording such as therapeutic claims)
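
The classification step amounts to picking the larger of two logits. If a confidence value is also wanted, a softmax turns the logits into probabilities. The sketch below shows this in plain Python; the logit values are hypothetical and are not the model's actual output, and `classify` is an illustrative helper name.

```python
import math

CLASS_NAMES = ["Passed", "Failed"]  # index 0 = label 0, index 1 = label 1

def classify(logits):
    """Turn a pair of logits into (class name, probability of that class)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return CLASS_NAMES[best], probs[best]

name, prob = classify([-2.1, 3.4])  # hypothetical logits, not the model's actual output
print(name, round(prob, 4))
```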




In [None]:
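device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when no GPU is present
test_model = test_model.to(device)

input_ids = encoded_review['input_ids'].to(device)
attention_mask = encoded_review['attention_mask'].to(device)
with torch.no_grad():  # inference only, so skip gradient tracking
    output = test_model(input_ids, attention_mask=attention_mask)
result = torch.argmax(output[0][0]).item()  # index of the larger logit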

classnames = ['Passed', 'Failed']

print(f'Review text: {test_text}')
print(f'Result  : {classnames[result]}')

Review text: 增強免疫力...護眼明目...抗氧化...保肝...皮膚：...保護皮膚，減緩皮屑問題，降低敏感...毛髮……促進毛髮亮麗...消化系統」「強化免疫系統，有助調整過敏體質...清除自由基延緩老化...調整腸胃道機能...調理皮膚問題...諾麗果含有許多強大的抗氧化劑」「阻止或減少病痛的發生，並迅速修補已受傷的細胞，將代謝殘渣或毒素排出體外」「它能維護身體細胞組織正常運作，阻止或減少病痛的發生，並迅速修補已受傷的細胞，將代謝殘渣或毒素排出體外，使身體恢復正常
Result  : Failed


In [None]:
!nvidia-smi

Thu May 19 07:19:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    72W / 149W |   7532MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces