## GPT2-Fine-Tuning

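> The following fine-tunes the Hugging Face GPT-2 model on an open-source medical corpus.<br>
> <font color=red>Fine-tune with care: it can cause a large model to forget knowledge acquired during pretraining.</font>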

In [1]:
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline, Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

from datasets import Dataset, load_dataset
import pandas as pd
import numpy as np

import os

In [2]:
# Make only GPU 0 visible to this process.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [3]:
tokenizer = BertTokenizer.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")

In [4]:
context_length = 64

In [5]:
def tokenize(element):
    # Tokenize each sentence: pad short ones with 0 and truncate long ones,
    # so that every sequence has exactly context_length tokens.
    outputs = tokenizer(element['inputs'],
                        padding='max_length',
                        truncation=True,
                        max_length=context_length,
                        return_overflowing_tokens=True,
                        return_length=True)
    # Overwrite the text column with the token ids.
    return {'inputs': outputs['input_ids']}
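
The padding and truncation that `tokenize` requests can be illustrated without loading a tokenizer. The sketch below is a pure-Python approximation, assuming BERT-style special-token ids (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102, which matches the 101 and 102 at the ends of the token list printed later in this notebook). `pad_or_truncate` is a hypothetical helper, not part of `transformers`.

```python
CONTEXT_LENGTH = 64
# Assumed BERT-style special-token ids; confirm with tokenizer.pad_token_id etc.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def pad_or_truncate(token_ids, context_length=CONTEXT_LENGTH):
    """Add [CLS]/[SEP], cut overly long input, and right-pad short input."""
    body = token_ids[: context_length - 2]       # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    return ids + [PAD_ID] * (context_length - len(ids))

short = pad_or_truncate([7770, 6117, 1327])      # 3 content tokens
long = pad_or_truncate(list(range(1000, 1200)))  # 200 content tokens

print(len(short), short[:6])
print(len(long), long[-1])
```

Both results have exactly 64 ids: the short input is filled with pad ids after `[SEP]`, while the long input keeps only its first 62 content tokens.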

> datasets

In [6]:
path = './input/Chinese-medical-dialogue-data/internal_examples_7700.csv'

In [7]:
df_internal = pd.read_csv(path, encoding='gbk')

In [8]:
df_internal.head()

| Unnamed: 0 | department | title | ask | answer |
| --- | --- | --- | --- | --- |
| 0 | 心血管科 | 高血压患者能吃党参吗？ | 我有高血压这两天女婿来的时候给我拿了些党参泡水喝，您好高血压可以吃党参吗？ | 高血压病人可以口服党参的。党参有降血脂，降血压的作用，可以彻底消除血液中的垃圾，从而对冠心病... |
| 1 | 心血管科 | 高血压该治疗什么？ | 我是一位中学教师，平时身体健康，最近学校组织健康检查，结果发觉我是高血压，去年还没有这种情况... | 高血压患者首先要注意控制食盐摄入量，每天不超过六克，注意不要吃太油腻的食物，多吃新鲜的绿色蔬... |
| 2 | 心血管科 | 老年人高血压一般如何治疗？ | 我爷爷今年68了，年纪大了，高血压这些也领着来了，这些病让老人很痛苦，每次都要按时喝药，才能... | 你爷爷患高血压，这是老年人常见的心血管病，血管老化硬化，血压调整能力消退了，目前治疗高血压最... |
| 3 | 内分泌科 | 糖尿病还会进行遗传吗？ | 糖尿病有隔代遗传吗？我妈是糖尿病，很多年了，也没养好，我现在也是，我妹子也是，我儿子现在二十... | 2型糖尿病的隔代遗传概率为父母患糖尿病，临产的发生率为40%，比一般人患糖尿病，疾病，如何更... |
| 4 | 内分泌科 | 糖尿病一般需要怎么治疗？ | 我妈定期检查仔细检查的时候，仔细检查出患糖尿病，糖尿病需要有怎么治疗？我大概知晓是需要有控制... | 糖尿病患者首先通过饮食控制和锻练运动，肥胖患者把体重降下来等方式调整一下看一看，如果血糖仍然... |


In [9]:
# Concatenate each question title with its answer to form one training text.
train_texts = []
for _, row in df_internal.iterrows():
    sentence = row['title'] + row['answer']
    train_texts.append(sentence)

In [10]:
len(train_texts)

7644

In [11]:
raw_datasets = Dataset.from_dict({'inputs': train_texts})

> tokenize the dataset

In [12]:
raw_datasets

Dataset({
    features: ['inputs'],
    num_rows: 7644
})

In [13]:
tokenized_datasets = raw_datasets.map(tokenize, batched=True)


In [14]:
tokenized_datasets

Dataset({
    features: ['inputs'],
    num_rows: 7644
})

In [15]:
raw_datasets[0]['inputs']

'高血压患者能吃党参吗？高血压病人可以口服党参的。党参有降血脂，降血压的作用，可以彻底消除血液中的垃圾，从而对冠心病以及心血管疾病的患者都有一定的稳定预防工作作用，因此平时口服党参能远离三高的危害。另外党参除了益气养血，降低中枢神经作用，调整消化系统功能，健脾补肺的功能。感谢您的进行咨询，期望我的解释对你有所帮助。'

In [16]:
print(tokenized_datasets[0]['inputs'])

[101, 7770, 6117, 1327, 2642, 5442, 5543, 1391, 1054, 1346, 1408, 8043, 7770, 6117, 1327, 4567, 782, 1377, 809, 1366, 3302, 1054, 1346, 4638, 511, 1054, 1346, 3300, 7360, 6117, 5544, 8024, 7360, 6117, 1327, 4638, 868, 4500, 8024, 1377, 809, 2515, 2419, 3867, 7370, 6117, 3890, 704, 4638, 1796, 1769, 8024, 794, 5445, 2190, 1094, 2552, 4567, 809, 1350, 2552, 6117, 5052, 102]


In [17]:
len(tokenized_datasets[0]['inputs'])

64

In [18]:
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
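
With `mlm=False` the collator prepares batches for causal language modeling: the labels are a copy of `input_ids`, with padding positions replaced by -100 so that the loss ignores them (the one-token shift between inputs and targets happens inside the model). The following is a plain-Python sketch of that labeling step, assuming a pad id of 0. `make_clm_labels` is a hypothetical name, and the real collator operates on tensors rather than lists.

```python
PAD_ID = 0           # assumed pad token id
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def make_clm_labels(batch_input_ids, pad_id=PAD_ID):
    """Copy input ids into labels, masking padding so it adds no loss."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row]
            for row in batch_input_ids]

batch = [[101, 7770, 6117, 102, 0, 0],          # short sequence, padded
         [101, 1054, 1346, 4638, 511, 102]]     # full-length sequence
labels = make_clm_labels(batch)
for row in labels:
    print(row)
```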

In [19]:
model = GPT2LMHeadModel.from_pretrained('uer/gpt2-distil-chinese-cluecorpussmall')

In [20]:
args = TrainingArguments(output_dir='./models/', 
                         num_train_epochs=10, 
                         per_device_train_batch_size=128,
                         warmup_steps=50,
                         weight_decay=0.01, 
                         logging_dir='./logs',
                         logging_steps=100,
                         save_total_limit=1)
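
`warmup_steps=50` only has meaning relative to the learning-rate schedule. Assuming the `TrainingArguments` defaults (a peak learning rate of 5e-5 and a linear schedule, neither of which is set explicitly above) and the 600 total optimization steps reported in the training log, the rate rises linearly for 50 steps and then decays linearly to zero. A sketch with a hypothetical `linear_lr` helper:

```python
def linear_lr(step, base_lr=5e-5, warmup_steps=50, total_steps=600):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 25, 50, 325, 600):
    print(step, linear_lr(step))
```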

In [21]:
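# train_dataset is the plain list of token-id lists; the collator wraps each
# batch as input_ids and builds the labels from it.
trainer = Trainer(model=model,
                  tokenizer=tokenizer,
                  args=args,
                  data_collator=data_collator,
                  train_dataset=tokenized_datasets['inputs'])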

In [22]:
trainer.train()

***** Running training *****
  Num examples = 7644
  Num Epochs = 10
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 600
  Number of trainable parameters = 59541504


| Step | Training Loss |
| --- | --- |
| 100 | 2.6819 |
| 200 | 2.229 |
| 300 | 2.0851 |
| 400 | 1.9987 |
| 500 | 1.9504 |
| 600 | 1.9234 |
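
The logged loss is a mean per-token cross-entropy in nats, so exponentiating it gives a perplexity, which can be easier to interpret. A small sketch using the logged values; note that these are training-set figures, not a held-out evaluation.

```python
import math

# Training loss logged every 100 steps (copied from the output above).
logged_loss = {100: 2.6819, 200: 2.229, 300: 2.0851,
               400: 1.9987, 500: 1.9504, 600: 1.9234}

perplexity = {step: math.exp(loss) for step, loss in logged_loss.items()}
for step, ppl in perplexity.items():
    print(f"step {step}: loss {logged_loss[step]:.4f} -> perplexity {ppl:.2f}")
```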


Saving model checkpoint to ./models/checkpoint-500
Configuration saved in ./models/checkpoint-500/config.json
Model weights saved in ./models/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./models/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./models/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=600, training_loss=2.144769592285156, metrics={'train_runtime': 440.4822, 'train_samples_per_second': 173.537, 'train_steps_per_second': 1.362, 'total_flos': 1248345225953280.0, 'train_loss': 2.144769592285156, 'epoch': 10.0})
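
The figures in the log and in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes the last partial batch of each epoch is kept, which is the `Trainer` default (`dataloader_drop_last=False`).

```python
import math

num_examples, batch_size, epochs = 7644, 128, 10
train_runtime = 440.4822  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)    # last batch is partial
total_steps = steps_per_epoch * epochs
samples_per_second = num_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

# The reported training_loss is close to the mean of the six logged values.
logged = [2.6819, 2.229, 2.0851, 1.9987, 1.9504, 1.9234]
mean_logged_loss = sum(logged) / len(logged)

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(round(mean_logged_loss, 4))
```

Assuming the default `save_steps=500`, the same step count also explains why the only checkpoint written during this run is `checkpoint-500`.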

In [23]:
model.save_pretrained('models/')

Configuration saved in models/config.json
Model weights saved in models/pytorch_model.bin


> Training complete!
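
To try the fine-tuned model, the saved weights can be loaded back and wrapped in the `TextGenerationPipeline` imported in the first cell. This is a hedged sketch rather than part of the original run: `generate_answer` and its default arguments are hypothetical, and it assumes `models/` holds the weights saved by `model.save_pretrained`, with the tokenizer reloaded from the Hub because that call does not save it.

```python
def generate_answer(prompt, model_dir="models/", max_length=64):
    """Generate a continuation of `prompt` with the fine-tuned weights."""
    # Imported lazily so that defining the function stays cheap.
    from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

    tokenizer = BertTokenizer.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    generator = TextGenerationPipeline(model=model, tokenizer=tokenizer)
    outputs = generator(prompt, max_length=max_length, do_sample=True)
    return outputs[0]["generated_text"]

# Example call (requires the saved model and network access for the tokenizer):
# print(generate_answer("高血压患者能吃党参吗？"))
```

Sampling is enabled here, so repeated calls return different text. Given the forgetting risk noted at the top, it is worth comparing the output with that of the base model on non-medical prompts as well.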