[Open In Colab](https://colab.research.google.com/github/shibing624/textgen/blob/main/examples/language_generation/GPT2_Finetune_Chinese_Poem.ipynb)


# GPT2 Poem Writing
- Design: pretrained GPT2 + fine-tuning on a "write a poem" prompt
  - Compare with my [T5 training from scratch](https://github.com/shibing624/textgen/blob/main/examples/T5/T5_Finetune_Chinese_Poem.ipynb)
  - Goal: accept the author as an optional input
    - Each poem is fed in twice: once with the author's name and once with a "None" author (generic)
- Data: [chinese-poetry on GitHub](https://github.com/chinese-poetry/chinese-poetry)
- Related resources
  - [Huggingface](https://huggingface.co/)
  - Langboat Chinese [MengZi T5 pretrained model](https://huggingface.co/Langboat/mengzi-t5-base) and [paper](https://arxiv.org/pdf/2110.06696.pdf)
  - [textgen](https://github.com/shibing624/textgen)
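
The "fed in twice" design can be sketched in plain Python. This is only an illustration: `make_examples` is a hypothetical helper, while the prompt strings and the `</s>` separator mirror the constants defined later in this notebook.

```python
TITLE_PROMPT = "作诗："   # "Write a poem:" prefix, same value as later in the notebook
AUTHOR_PROMPT = "作者："  # "Author:" marker, same value as later in the notebook
EOS_TOKEN = "</s>"


def make_examples(title, author, content):
    """Return the two training examples derived from one poem (hypothetical helper)."""
    with_author = {
        "prefix": TITLE_PROMPT,
        "input_text": title + EOS_TOKEN + AUTHOR_PROMPT + author,
        "target_text": content,
    }
    # Generic variant: the author part is simply left out
    generic = {"prefix": TITLE_PROMPT, "input_text": title, "target_text": content}
    return [with_author, generic]


examples = make_examples("帝京篇十首一", "太宗皇帝", "秦川雄帝宅，函谷壮皇居。")
for ex in examples:
    print(ex["input_text"])
```

Both variants share the same target text, so the model can be prompted with or without an author at prediction time.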


## Prepare Data

In [1]:
!nvidia-smi

Sat Aug 13 13:03:42 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   37C    P0    36W / 300W |   6144MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0  

In [2]:
IS_TEST_FLOW = True  #@param {type: "boolean"}

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [24]:
import json
import urllib.request
from loguru import logger
import pandas as pd
!pip install -q "tqdm>=4.36.1" > /tmp/na
from tqdm.notebook import tqdm
!pip install -q chinese-converter > /tmp/na
import chinese_converter  # needed for Traditional-to-Simplified conversion
import pickle
import os
import numpy as np
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizerFast

In [5]:
# https://github.com/chinese-poetry/chinese-poetry
POEM_CONTENT = {
    'tang': {
        'total': 58,
        'pattern': "https://raw.githubusercontent.com/chinese-poetry/chinese-poetry/master/json/poet.tang.{0}.json"
    },
    'song': {
        'total': 255,
        'pattern': "https://raw.githubusercontent.com/chinese-poetry/chinese-poetry/master/json/poet.song.{0}.json"
    }
}


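def get_poems(is_test=True, verbose=True):
    """Download every JSON shard of each dynasty and concatenate them into one DataFrame.

    Note: `is_test` is currently unused; all shards are always downloaded.
    """
    df_list = []
    for dynasty, info in POEM_CONTENT.items():
        size = info['total']
        pbar = tqdm(total=size, desc="Dynasty " + dynasty)
        for i in range(size):
            # Shards are named by their starting offset: 0, 1000, 2000, ...
            url = info['pattern'].format(i * 1000)
            if verbose:
                print(f"download {url} now")
            df_list.append(pd.read_json(url))
            pbar.update(1)
        pbar.close()
    return pd.concat(df_list)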
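
As a rough illustration of how the shard URLs are enumerated, the sketch below builds the list without downloading anything. `shard_urls` and its `limit` parameter are hypothetical (a possible way to shorten a test run), not part of the notebook.

```python
TANG_PATTERN = ("https://raw.githubusercontent.com/chinese-poetry/chinese-poetry/"
                "master/json/poet.tang.{0}.json")


def shard_urls(pattern, total, limit=None):
    """List shard URLs; `limit` (hypothetical) keeps only the first few shards."""
    count = total if limit is None else min(limit, total)
    # Each shard is addressed by its starting offset: 0, 1000, 2000, ...
    return [pattern.format(i * 1000) for i in range(count)]


all_tang = shard_urls(TANG_PATTERN, total=58)
quick = shard_urls(TANG_PATTERN, total=58, limit=2)
print(len(all_tang), all_tang[-1])
print(quick)
```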

In [6]:
poem_file = 'poems.csv'
if os.path.exists(poem_file):
    df = pd.read_csv(poem_file)
else:
    df = get_poems(is_test=IS_TEST_FLOW, verbose=False)
    # Join each poem's list of lines into a single string
    df['concat_paragraphs'] = [''.join(map(str, lines)) for lines in df['paragraphs']]
    df = df[['author', 'title', 'concat_paragraphs']].copy()

    def convert_schinese(tchinese):
        """Convert Traditional Chinese text to Simplified Chinese."""
        return chinese_converter.to_simplified(tchinese)

    df['s_content'] = df['concat_paragraphs'].apply(convert_schinese)
    df['s_title'] = df['title'].apply(convert_schinese)
    df['s_author'] = df['author'].apply(convert_schinese)
    df.to_csv(poem_file, index=False)

my_df = df.astype('string')
my_df = my_df.dropna()
print("my_df size", len(my_df))

my_df size 311660


In [7]:
MAX_AUTHOR_CHAR = 4
MAX_TITLE_CHAR = 12
MIN_CONTENT_CHAR = 10
MAX_CONTENT_CHAR = 64


def trim_author_fn(row):
    return row.s_author[:MAX_AUTHOR_CHAR]


def trim_title_fn(row):
    trimmed_title = row.s_title[:MAX_TITLE_CHAR].replace(" ", "").replace("(", "").replace(")", "")
    return trimmed_title


def trim_content_fn(row):
    trimmed_content = row.s_content[:MAX_CONTENT_CHAR]
    return trimmed_content


# Trim each field; apply on a copy to avoid pandas' view/copy (SettingWithCopy) warning
my_df['s_author_trim'] = my_df.copy().apply(trim_author_fn, axis=1)
my_df['s_title_trim'] = my_df.copy().apply(trim_title_fn, axis=1)
my_df['s_content_trim'] = my_df.copy().apply(trim_content_fn, axis=1)

In [8]:
# Drop rows with an empty title, too-short content, or the "无正文" (no text) placeholder
empty_title_mask = (my_df['s_title_trim'].str.len() == 0)
too_short_content_mask = (my_df['s_content_trim'].str.len() <= MIN_CONTENT_CHAR)
invalid_mask = (('无正文' == my_df['s_content_trim']) | ('无正文' == my_df['s_author_trim']))
drop_mask = empty_title_mask | too_short_content_mask | invalid_mask

qualitied_df = my_df.loc[~drop_mask][['s_author_trim', 's_title_trim', 's_content_trim']]
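
The trimming and filtering rules can also be read as a per-record check. The following is a minimal pure-Python sketch of the same rules; `clean_record` is a hypothetical helper and the notebook itself applies them with pandas masks.

```python
MAX_AUTHOR_CHAR = 4
MAX_TITLE_CHAR = 12
MIN_CONTENT_CHAR = 10
MAX_CONTENT_CHAR = 64
PLACEHOLDER = "无正文"  # "no text" marker found in the source data


def clean_record(author, title, content):
    """Trim one poem; return None when the filters would drop it."""
    author = author[:MAX_AUTHOR_CHAR]
    title = title[:MAX_TITLE_CHAR].replace(" ", "").replace("(", "").replace(")", "")
    content = content[:MAX_CONTENT_CHAR]
    # Empty title or content of at most MIN_CONTENT_CHAR characters is rejected
    if len(title) == 0 or len(content) <= MIN_CONTENT_CHAR:
        return None
    if content == PLACEHOLDER or author == PLACEHOLDER:
        return None
    return author, title, content


kept = clean_record("叶茵", "枕簟入林僻茶瓜留客迟十韵", "君顔犹少年，我发不胜帻。百岁垒一丘，主翁惊是客。")
dropped = clean_record("陆游", "无题", "无正文")
print(kept)
print(dropped)
```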

In [9]:
qualitied_df.sample(3)

| | s_author_trim | s_title_trim | s_content_trim |
|---|---|---|---|
| 186711 | 陆游 | 别张敎授归独登拟岘 | 小阁敞朱扉，停车暂息机。行人呼晚渡，幼妇浣秋衣。霜树欹危堞，风鸦满落晖。登临客愁裏，况是送将归。 |
| 136328 | 葛胜仲 | 次韵叔才幽居二首其二 | 粗才濩落与时乖，湖海幽居一竹斋。月径清游疑不夜，风窗高趣到无怀。身闲有分白莲社，地禁无心红药... |
| 268913 | 叶茵 | 枕簟入林僻茶瓜留客迟十韵 | 君顔犹少年，我发不胜帻。百岁垒一丘，主翁惊是客。 |


In [10]:
TITLE_PROMPT = "作诗："
AUTHOR_PROMPT = "作者："
EOS_TOKEN = '</s>'


def build_dataset_df(df, include_author=True):
    dfc = df.copy()
    dfc['prefix'] = TITLE_PROMPT
    if include_author:
        dfc['input_text'] = df['s_title_trim'] + EOS_TOKEN + AUTHOR_PROMPT + df['s_author_trim']
    else:
        dfc['input_text'] = df['s_title_trim']
    dfc['target_text'] = df['s_content_trim']
    dfc = dfc[['prefix', 'input_text', 'target_text']]
    return dfc

In [11]:
df_author_title_content = build_dataset_df(qualitied_df, True)
df_author_title_content[:3]

| | prefix | input_text | target_text |
|---|---|---|---|
| 0 | 作诗： | 帝京篇十首一</s>作者：太宗皇帝 | 秦川雄帝宅，函谷壮皇居。绮殿千寻起，离宫百雉余。连甍遥接汉，飞观迥凌虚。云日隐层阙，风烟出绮疎。 |
| 1 | 作诗： | 帝京篇十首二</s>作者：太宗皇帝 | 岩廊罢机务，崇文聊驻辇。玉匣啓龙图，金绳披凤篆。韦编断仍续，缥帙舒还卷。对此乃淹留，欹案观坟典。 |
| 2 | 作诗： | 帝京篇十首三</s>作者：太宗皇帝 | 移步出词林，停舆欣武宴。琱弓写明月，骏马疑流电。惊雁落虚弦，啼猿悲急箭。阅赏诚多美，于兹乃忘倦。 |


In [12]:
df_title_content = build_dataset_df(qualitied_df, False)
df_title_content[:3]

| | prefix | input_text | target_text |
|---|---|---|---|
| 0 | 作诗： | 帝京篇十首一 | 秦川雄帝宅，函谷壮皇居。绮殿千寻起，离宫百雉余。连甍遥接汉，飞观迥凌虚。云日隐层阙，风烟出绮疎。 |
| 1 | 作诗： | 帝京篇十首二 | 岩廊罢机务，崇文聊驻辇。玉匣啓龙图，金绳披凤篆。韦编断仍续，缥帙舒还卷。对此乃淹留，欹案观坟典。 |
| 2 | 作诗： | 帝京篇十首三 | 移步出词林，停舆欣武宴。琱弓写明月，骏马疑流电。惊雁落虚弦，啼猿悲急箭。阅赏诚多美，于兹乃忘倦。 |


In [13]:
merged_df = pd.concat([df_author_title_content, df_title_content])

In [14]:
merged_df

| | prefix | input_text | target_text |
|---|---|---|---|
| 0 | 作诗： | 帝京篇十首一</s>作者：太宗皇帝 | 秦川雄帝宅，函谷壮皇居。绮殿千寻起，离宫百雉余。连甍遥接汉，飞观迥凌虚。云日隐层阙，风烟出绮疎。 |
| 1 | 作诗： | 帝京篇十首二</s>作者：太宗皇帝 | 岩廊罢机务，崇文聊驻辇。玉匣啓龙图，金绳披凤篆。韦编断仍续，缥帙舒还卷。对此乃淹留，欹案观坟典。 |
| 2 | 作诗： | 帝京篇十首三</s>作者：太宗皇帝 | 移步出词林，停舆欣武宴。琱弓写明月，骏马疑流电。惊雁落虚弦，啼猿悲急箭。阅赏诚多美，于兹乃忘倦。 |
| 3 | 作诗： | 帝京篇十首四</s>作者：太宗皇帝 | 鸣笳临乐馆，眺听欢芳节。急管韵朱弦，清歌凝白雪。彩凤肃来仪，玄鹤纷成列。去兹郑卫声，雅音方可悦。 |
| 4 | 作诗： | 帝京篇十首五</s>作者：太宗皇帝 | 芳辰追逸趣，禁苑信多奇。桥形通汉上，峰势接云危。烟霞交隐映，花鸟自参差。何如肆辙迹？万里赏瑶池。 |
| ... | ... | ... | ... |
| 311850 | 作诗： | 状元峰 | 马蹄一日遍长安，萤火鸡窗千载寒。从此锦衣归故里，文峰高并彩云端。 |
| 311851 | 作诗： | 蜕龙洞 | 苍岩磊落任龙蟠，绵亘千年露未干。一自爲霖破壁去，至今风雨逼山寒。 |
| 311852 | 作诗： | 登竺云山 | 独上千峰与万峰，晴岚淡写海江容。偶从动问山居事，笑拍岩前一树松。 |
| 311853 | 作诗： | 寒云千叠山 | 松竹阴森护上方，老仙蓬髪一簪霜。闲来欹枕松风裏，归夢不知山水长。 |


In [15]:
from sklearn.model_selection import train_test_split
merged_df = merged_df.sample(frac=1) # Shuffle
train_df, eval_df = train_test_split(merged_df, test_size=0.01)
print("train", len(train_df), "eval", len(eval_df))

train_df = train_df.sample(300) if IS_TEST_FLOW else train_df
eval_df = eval_df.sample(30) if IS_TEST_FLOW else eval_df
print("train", len(train_df), "eval", len(eval_df))

train 613978 eval 6202
train 300 eval 30
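
The printed sizes can be reproduced with a little arithmetic. The sketch below assumes the fractional test share is rounded up, which is consistent with the numbers above; `split_sizes` is a hypothetical helper, not the scikit-learn implementation.

```python
import math


def split_sizes(n_samples, test_size):
    """Sizes for a fractional test split, assuming the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


n_rows = 613978 + 6202  # total rows in merged_df, taken from the printed sizes
n_train, n_eval = split_sizes(n_rows, 0.01)
print(n_rows, n_train, n_eval)
```

Because every poem appears twice in `merged_df` (with and without the author), a random split can place the two variants of one poem on opposite sides, which is worth keeping in mind when reading the evaluation loss.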


## Modeling

In [16]:
# Quiet install textgen package
!pip install -q textgen

In [17]:
import sys

sys.path.append('../..')
from textgen.language_generation import LanguageGenerationModel
from textgen.language_modeling import LanguageModelingModel



In [18]:
model_type = 'gpt2'
model_name = "uer/gpt2-distil-chinese-cluecorpussmall"
output_dir = 'outputs/gpt2_distil_poem/'
max_seq_length = 50
num_epochs = 5
batch_size = 32
num_return_sequences = 1

In [25]:
def encode(data):
    """Encode a (tokenizer, src, trg) tuple as ids: [CLS] src [SEP] trg [SEP]."""
    tokenizer, src, trg = data
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id
    # truncation=True makes the max_length cap explicit for each segment
    src_ids = tokenizer.encode(src, add_special_tokens=False, truncation=True, max_length=max_seq_length)
    trg_ids = tokenizer.encode(trg, add_special_tokens=False, truncation=True, max_length=max_seq_length)
    return [cls_id] + src_ids + [sep_id] + trg_ids + [sep_id]


class SrcTrgDataset(Dataset):
    """Custom dataset, use it by dataset_class from train args"""

    def __init__(self, tokenizer, args, data, mode, block_size=512, special_tokens_count=2):
        cached_features_file = os.path.join(
            args.cache_dir,
            args.model_name.replace("/", "_")
            + "_cached_"
            + str(args.max_seq_length)
            + str(len(data)),
        )

        if os.path.exists(cached_features_file) and (
                (not args.reprocess_input_data and not args.no_cache)
                or (mode == "dev" and args.use_cached_eval_features and not args.no_cache)
        ):
            logger.info(f" Loading features from cached file {cached_features_file}")
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info(f" Creating features from dataset file at {args.cache_dir}")
            lines = [
                (tokenizer, input_text, target_text)
                for input_text, target_text in zip(data["input_text"], data["target_text"])
            ]
            self.examples = [encode(line) for line in lines]
            logger.info(f" Saving features into cached file {cached_features_file}")
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
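
To make the sequence layout produced by `encode` concrete, here is a small sketch with a toy tokenizer. `ToyTokenizer` and `encode_pair` are hypothetical stand-ins (one id per character) and do not reproduce the real BERT vocabulary; only the `[CLS] src [SEP] trg [SEP]` layout and the per-segment cap are meant to match.

```python
MAX_SEQ_LENGTH = 50  # same value as max_seq_length in this notebook


class ToyTokenizer:
    """Hypothetical stand-in: one id per character, with fixed [CLS]/[SEP] ids."""
    cls_token_id = 101
    sep_token_id = 102

    def encode(self, text, max_length):
        # Use the Unicode code point as a fake token id, truncated to max_length
        return [ord(ch) for ch in text][:max_length]


def encode_pair(tokenizer, src, trg, max_length=MAX_SEQ_LENGTH):
    """Build [CLS] src [SEP] trg [SEP], truncating src and trg separately."""
    return ([tokenizer.cls_token_id] + tokenizer.encode(src, max_length)
            + [tokenizer.sep_token_id] + tokenizer.encode(trg, max_length)
            + [tokenizer.sep_token_id])


ids = encode_pair(ToyTokenizer(), "状元峰", "马蹄一日遍长安，萤火鸡窗千载寒。")
longest = encode_pair(ToyTokenizer(), "题" * 80, "诗" * 80)
print(len(ids), len(longest))
```

Since the cap applies to each segment on its own, a full example can be up to `2 * max_seq_length + 3` ids long, which still fits within the `block_size` of 512 used here.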

In [26]:
train_args = {
    "dataset_class": SrcTrgDataset,
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "block_size": 512,
    "max_seq_length": max_seq_length,
    "learning_rate": 5e-6,
    "train_batch_size": batch_size,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": num_epochs,
    "mlm": False,
    "output_dir": output_dir,
    "save_best_model": True,
    "evaluate_during_training": True,
    "num_return_sequences": num_return_sequences,
}
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = LanguageModelingModel(model_type, model_name, args=train_args, tokenizer=tokenizer)

2022-08-13 13:07:23.944 | DEBUG    | textgen.language_modeling.language_modeling_model:__init__:153 - Device: cuda
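
A quick back-of-the-envelope estimate shows how few optimizer updates these settings give in the test flow. The sketch assumes one update per `gradient_accumulation_steps` full batches and no extra update for leftover batches; how textgen actually handles the remainder is not verified here, and `optimizer_updates` is a hypothetical helper.

```python
import math


def optimizer_updates(num_examples, batch_size, accumulation_steps, epochs):
    """Estimate batches per epoch and total optimizer updates.

    Assumes one update per `accumulation_steps` full batches, with leftover
    batches at the end of an epoch not triggering an extra update.
    """
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = batches_per_epoch // accumulation_steps
    return batches_per_epoch, updates_per_epoch * epochs


test_flow = optimizer_updates(300, 32, 8, 5)     # IS_TEST_FLOW sample
full_data = optimizer_updates(613978, 32, 8, 5)  # full training split
print(test_flow)
print(full_data)
```

Under these assumptions the 300-example test flow amounts to only a handful of updates at a learning rate of 5e-6, so it is best read as a smoke test of the pipeline rather than real fine-tuning.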


In [27]:
model.tokenizer("桥形通汉上，峰势接云危。</s>烟霞交隐映，花鸟自参差。")

{'input_ids': [101, 3441, 2501, 6858, 3727, 677, 8024, 2292, 1232, 2970, 756, 1314, 511, 133, 120, 161, 135, 4170, 7459, 769, 7391, 3216, 8024, 5709, 7881, 5632, 1346, 2345, 511, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [29]:
model.tokenizer.decode([101, 3441, 2501, 6858, 3727, 677, 8024, 2292, 1232, 2970, 756, 1314, 511, 133, 120, 161, 135, 4170, 7459, 769, 7391, 3216, 8024, 5709, 7881, 5632, 1346, 2345, 511, 102])

'[CLS] 桥 形 通 汉 上 ， 峰 势 接 云 危 。 < / s > 烟 霞 交 隐 映 ， 花 鸟 自 参 差 。 [SEP]'
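
The decoded output shows that `</s>` is not a single token in this vocabulary: it is split into four pieces. The sketch below only inspects the decoded string copied from the output above; `find_subsequence` is a hypothetical helper.

```python
# Decoded string copied from the cell output above
decoded = "[CLS] 桥 形 通 汉 上 ， 峰 势 接 云 危 。 < / s > 烟 霞 交 隐 映 ， 花 鸟 自 参 差 。 [SEP]"
tokens = decoded.split(" ")


def find_subsequence(seq, sub):
    """Return the first index where `sub` occurs in `seq`, or -1."""
    for start in range(len(seq) - len(sub) + 1):
        if seq[start:start + len(sub)] == sub:
            return start
    return -1


marker_pieces = list("</s>")  # ['<', '/', 's', '>']
start = find_subsequence(tokens, marker_pieces)
print(len(tokens), start, tokens[start:start + 4])
```

Each use of the separator therefore costs four positions out of the `max_seq_length` budget. Registering it as a special token, or reusing the tokenizer's own `[SEP]`, might be alternatives, though neither is tested in this notebook.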

In [30]:
def predict_now(sentences, model_dir):
    tokenizer = BertTokenizerFast.from_pretrained(model_dir)
    m = LanguageGenerationModel(
        model_type,
        model_dir,
        args={"max_length": max_seq_length, "num_return_sequences": num_return_sequences},
        tokenizer=tokenizer,
    )
    for prompt in sentences:
        generated = m.generate(prompt, verbose=False, add_cls_head=True, split_on_space=False, keep_prompt=True)
        print("inputs:", prompt)
        print("outputs:", generated)

In [31]:
predict_now(["过温汤"], model_name)

2022-08-13 13:07:51.160 | DEBUG    | textgen.language_generation.language_generation_model:__init__:107 - Device: cuda


inputs: 过温汤
outputs: ['过温汤的话价格一般，菜的种类也不是太多，每次去都会点大排档的特价菜，性价比不错哦~我个人认为还不错的一']


## Training

In [33]:
def sim_text_chars(text1, text2):
    """Character-overlap score: shared distinct characters / length of the longer text."""
    if not text1 or not text2:
        return 0.0
    same = set(text1) & set(text2)  # characters present in both texts
    m = len(same)
    n = max(len(text1), len(text2))
    return m / n


def count_matches(labels, preds):
    logger.debug(f"labels: {labels[:10]}")
    logger.debug(f"preds: {preds[:10]}")
    match = sum([sim_text_chars(label, pred) for label, pred in zip(labels, preds)]) / len(labels)
    logger.debug(f"match: {match}")
    return match


# Train the model on paired data (DataFrame columns: input_text, target_text)
model.train_model(train_df, eval_file=eval_df)
print(model.eval_model(eval_df, matches=count_matches))

2022-08-13 13:08:19.417 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 13:08:19.492 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_50300
2022-08-13 13:08:19.498 | INFO     | textgen.language_modeling.language_modeling_model:train:562 -  Training started


Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

2022-08-13 13:08:19.519 | INFO     | textgen.language_modeling.language_modeling_model:train:594 -    Starting fine-tuning.


Running Epoch 0 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 13:08:20.448 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 13:08:20.457 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Epoch 1 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 13:08:23.998 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 13:08:24.007 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Epoch 2 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 13:08:26.735 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 13:08:26.743 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Epoch 3 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 13:08:29.604 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 13:08:29.612 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Epoch 4 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 13:08:32.457 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 13:08:32.466 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030
2022-08-13 13:08:35.312 | INFO     | textgen.language_modeling.language_modeling_model:train_model:386 -  Training of gpt2 model complete. Saved to outputs/gpt2_distil_couplet/.
2022-08-13 13:08:35.315 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 13:08:35.323 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

2022-08-13 13:08:35.375 | INFO     | textgen.language_modeling.language_modeling_model:eval_model:926 - {'eval_loss': 6.89624547958374, 'perplexity': tensor(988.5562)}


{'eval_loss': 6.89624547958374, 'perplexity': tensor(988.5562)}
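
The two reported numbers are related: assuming perplexity is the exponential of the mean cross-entropy loss (natural log), the value can be checked directly.

```python
import math

eval_loss = 6.89624547958374  # value reported in the log above
perplexity = math.exp(eval_loss)
print(round(perplexity, 4))
```

A perplexity near 989 means the model is, on average, about as uncertain as if it were choosing among roughly 989 equally likely tokens at each step.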


## Predict

In [34]:
predict_now(["过温汤"], output_dir)

2022-08-13 13:08:39.006 | DEBUG    | textgen.language_generation.language_generation_model:__init__:107 - Device: cuda


inputs: 过温汤
outputs: ['过温汤、清蒸、咸鱼、青鱼。为什么会变色呢？我是吃了不长时间的鱼，一个星期都不怎么长，而且每次上班都是吃青鱼']


End of this section.