[Open In Colab](https://colab.research.google.com/github/shibing624/textgen/blob/main/examples/language_generation/GPT2_Finetune_Chinese_Couplet.ipynb)


# GPT2 Couplets
- Design: pretrained GPT2 + "couplet prompt" fine-tuning
  - Compare with my [T5 training from scratch](https://github.com/shibing624/textgen/blob/main/examples/T5/T5_Finetune_Chinese_Couplet.ipynb)
- Data: [couplet dataset on GitHub](https://github.com/wb14123/couplet-dataset)
- Related resources
  - [Huggingface](https://huggingface.co/)
  - Langboat Chinese [MengZi T5 pretrained model](https://huggingface.co/Langboat/mengzi-t5-base) and [paper](https://arxiv.org/pdf/2110.06696.pdf)
  - [textgen](https://github.com/shibing624/textgen)

In [1]:
# Quick-test switch: when True, train on a small sample (300 pairs, see the sampling cell) instead of the full ~800k
IS_TEST_FLOW = True  #@param {type:"boolean"}

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

## Prepare Data

In [3]:
from loguru import logger
import pandas as pd
from transformers import BertTokenizerFast
import torch
from torch.utils.data import Dataset
import os
import pickle
import sys

In [5]:
working_dir = "./"
!mkdir -p {working_dir}
!wget https://github.com/wb14123/couplet-dataset/releases/download/1.0/couplet.tar.gz -P {working_dir}
!ls -l {working_dir}

total 533212
drwxrwxr-x  2 xuming xuming      4096 Jun 29 22:22 cache_dir
drwxrwxr-x  3 xuming xuming        31 Jun 27 18:16 ckiplab
-rw-------  1 xuming xuming 379153061 Jun 27 18:23 ckiplab.tar.gz
drwxrwxr-x  3 xuming xuming        21 Aug 13 11:14 couplet_files
-rw-r--r--  1 xuming xuming  27412598 Aug 13 11:14 couplet.tar.gz
-rw-rw-r--  1 xuming xuming    867741 Aug 13 11:44 couplet.tar.gz.1
-rw-rw-r--  1 xuming xuming    125989 Aug 13 11:28 GPT2_Finetune_Chinese_Couplet.ipynb
-rw-rw-r--  1 xuming xuming     94552 Aug 13 11:43 GPT2_Finetune_Chinese_Poem.ipynb
-rw-------  1 xuming xuming   1458764 Jun 30 11:07 nohup.out
drwxrwxr-x  4 xuming xuming        50 Aug 13 11:31 outputs
-rw-------  1 xuming xuming 136859610 Aug 13 11:14 poems.csv
drwxrwxr-x 37 xuming xuming      4096 Aug 13 11:29 runs
-rw-rw-r--  1 xuming xuming      6126 Aug 13 10:52 training_couplet_gpt2_demo.py
-rw-rw-r--  1 xuming xuming      2439 Jun 29 20:45 training_en_gpt2_demo.py
-rw-rw-r--  1 xuming xuming      3133

In [6]:
!mkdir -p {working_dir}/couplet_files
!tar -xf {working_dir}/couplet.tar.gz -C {working_dir}/couplet_files

In [7]:
!head -1 {working_dir}/couplet_files/couplet/train/in.txt {working_dir}/couplet_files/couplet/train/out.txt

==> .//couplet_files/couplet/train/in.txt <==
晚 风 摇 树 树 还 挺 

==> .//couplet_files/couplet/train/out.txt <==
晨 露 润 花 花 更 红 


## Load Data

In [8]:
COUPLET_PATH = f'{working_dir}/couplet_files/couplet'
MAX_SEQ_LEN = 32  # Keep at most 32 Chinese characters per line, punctuation included
COUPLET_PROMPT = '对联：'  # "Couplet:" prompt prefix

train_data = []
test_data = []
for t in ['train', 'test']:
    ins, outs = [], []
    for i in ['in', 'out']:
        with open(f"{COUPLET_PATH}/{t}/{i}.txt", "r", encoding="utf-8") as f:
            for line in f:
                clean_line = line.strip().replace(' ', '').replace('\n', '').replace('\r', '')[:MAX_SEQ_LEN]
                if i == 'in':
                    ins.append(clean_line)
                else:
                    outs.append(clean_line)
    # The column names to match textgen
    data_dict = {
        'input_text': ins,
        'target_text': outs,
    }
    if t == 'train':
        train_df = pd.DataFrame(data_dict)
    else:
        test_df = pd.DataFrame(data_dict)
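
The cleaning step in the loop above can be tried on its own. The sketch below applies the same chain of string operations to the first line of `train/in.txt` shown earlier; the helper name `clean_couplet_line` is hypothetical and only used for illustration.

```python
MAX_SEQ_LEN = 32  # same cap as in the loading cell


def clean_couplet_line(line: str, max_len: int = MAX_SEQ_LEN) -> str:
    """Drop spaces and line breaks, then keep at most max_len characters."""
    # Hypothetical helper: mirrors the strip/replace/slice chain used in the loop.
    return line.strip().replace(" ", "").replace("\n", "").replace("\r", "")[:max_len]


raw = "晚 风 摇 树 树 还 挺 \n"   # first line of train/in.txt, as printed by `head`
print(clean_couplet_line(raw))   # 晚风摇树树还挺
```

The dataset stores one character per space-separated token, so removing the spaces restores the original seven-character line.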

In [23]:
eval_df = test_df
print("train", len(train_df), "eval", len(eval_df))

train_df = train_df.sample(300) if IS_TEST_FLOW else train_df
eval_df = eval_df.sample(30) if IS_TEST_FLOW else eval_df
print("train", len(train_df), "eval", len(eval_df))

train 300 eval 4000
train 300 eval 30


## Prepare Model

In [24]:
!nvidia-smi  # Check the GPU; on a P100 (16 GB) one epoch takes about 100 minutes, similar to a 1080

Sat Aug 13 11:47:02 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   37C    P0    36W / 300W |   4737MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0  

In [25]:
# Quietly install the textgen package
!pip install -q textgen

In [26]:
import sys

sys.path.append('../..')
from textgen.language_generation import LanguageGenerationModel
from textgen.language_modeling import LanguageModelingModel

In [32]:
model_type = 'gpt2'
model_name = "uer/gpt2-distil-chinese-cluecorpussmall"
output_dir = 'outputs/gpt2_distil_couplet/'
max_seq_length = 50
num_epochs = 5
batch_size = 32
num_return_sequences = 1

In [33]:
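def encode(data):
    """Encode one (tokenizer, src, trg) tuple as [CLS] src [SEP] trg [SEP] token ids"""
    tokenizer, src, trg = data
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id
    # truncation=True makes the max_length cut explicit instead of relying on the tokenizer's fallback
    src_ids = tokenizer.encode(src, add_special_tokens=False, truncation=True, max_length=max_seq_length)
    trg_ids = tokenizer.encode(trg, add_special_tokens=False, truncation=True, max_length=max_seq_length)
    input_ids = [cls_id] + src_ids + [sep_id] + trg_ids + [sep_id]
    return input_ids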


class SrcTrgDataset(Dataset):
    """Custom dataset; pass it to the model through the dataset_class training argument"""

    def __init__(self, tokenizer, args, data, mode, block_size=512, special_tokens_count=2):
        cached_features_file = os.path.join(
            args.cache_dir,
            args.model_name.replace("/", "_")
            + "_cached_"
            + str(args.max_seq_length)
            + str(len(data)),
        )

        if os.path.exists(cached_features_file) and (
                (not args.reprocess_input_data and not args.no_cache)
                or (mode == "dev" and args.use_cached_eval_features and not args.no_cache)
        ):
            logger.info(f" Loading features from cached file {cached_features_file}")
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info(f" Creating features from dataset file at {args.cache_dir}")
            lines = [
                (tokenizer, input_text, target_text)
                for input_text, target_text in zip(data["input_text"], data["target_text"])
            ]
            self.examples = [encode(line) for line in lines]
            logger.info(f" Saving features into cached file {cached_features_file}")
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
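
The cache file name built in `__init__` concatenates `max_seq_length` and the number of rows with no separator. A minimal sketch of that naming rule follows; the function name `cache_name` is hypothetical, and the point is only to show that two different settings can map to the same name.

```python
def cache_name(model_name: str, max_seq_length: int, num_rows: int) -> str:
    """Rebuild the cache file name the same way SrcTrgDataset.__init__ does."""
    # Hypothetical helper for illustration; it is not part of textgen.
    return model_name.replace("/", "_") + "_cached_" + str(max_seq_length) + str(num_rows)


# Settings used in this notebook: max_seq_length=50 and 300 training rows.
print(cache_name("uer/gpt2-distil-chinese-cluecorpussmall", 50, 300))
# uer_gpt2-distil-chinese-cluecorpussmall_cached_50300

# Different settings, identical name, because the two numbers are glued together.
print(cache_name("m", 12, 345) == cache_name("m", 123, 45))  # True
```

With `reprocess_input_data` set to `True` the features are rebuilt on every run, so such a collision should not matter here. If caching were enabled, putting a separator such as `_` between the two numbers would be one way to avoid it.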


In [34]:
train_args = {
    "dataset_class": SrcTrgDataset,
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "block_size": 512,
    "max_seq_length": max_seq_length,
    "learning_rate": 5e-6,
    "train_batch_size": batch_size,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": num_epochs,
    "mlm": False,
    "output_dir": output_dir,
    "save_best_model": True,
    "evaluate_during_training": True,
    "num_return_sequences": num_return_sequences,
}
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = LanguageModelingModel(model_type, model_name, args=train_args, tokenizer=tokenizer)

2022-08-13 11:48:52.195 | DEBUG    | textgen.language_modeling.language_modeling_model:__init__:153 - Device: cuda
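
A quick back-of-the-envelope check of the batch settings in `train_args`, for the 300-row quick-test sample. This is only arithmetic on the values above; it assumes the last partial batch is kept and that the optimizer steps once every `gradient_accumulation_steps` batches, which is how gradient accumulation usually works but is not verified against textgen here.

```python
import math

train_rows = 300                  # sample size when IS_TEST_FLOW is True
train_batch_size = 32             # batch_size in this notebook
gradient_accumulation_steps = 8   # from train_args

# Batches seen per epoch, assuming the last partial batch is kept.
batches_per_epoch = math.ceil(train_rows / train_batch_size)

# Examples contributing to one optimizer step under gradient accumulation.
effective_batch = train_batch_size * gradient_accumulation_steps

print(batches_per_epoch, effective_batch)  # 10 256
```

Ten batches per epoch agrees with the `0/10` shown in the training progress bars. An effective batch of 256 is close to the whole 300-row sample, so under these assumptions a quick-test run makes only a handful of weight updates.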


In [35]:
model.tokenizer("回答：天上有没有云彩？")

{'input_ids': [101, 1726, 5031, 8038, 1921, 677, 3300, 3766, 3300, 756, 2506, 8043, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [36]:
model.tokenizer.decode([101, 1726, 5031, 8038, 1921, 677, 3300, 3766, 3300, 756, 2506, 8043, 102])

'[CLS] 回 答 ： 天 上 有 没 有 云 彩 ？ [SEP]'
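
The output above shows that this tokenizer uses id 101 for `[CLS]` and 102 for `[SEP]`. The sketch below illustrates the sequence layout that `encode` produces, `[CLS] src [SEP] trg [SEP]`, using a toy tokenizer so it runs without downloading a model. `ToyTokenizer` and `encode_pair` are hypothetical stand-ins, and the per-character ids are Unicode code points, not real vocabulary ids.

```python
class ToyTokenizer:
    """Hypothetical stand-in: one id per character, no real vocabulary."""

    cls_token_id = 101  # same special ids as in the tokenizer output above
    sep_token_id = 102

    def encode(self, text, add_special_tokens=False):
        # Use Unicode code points as fake token ids.
        return [ord(ch) for ch in text]


def encode_pair(tokenizer, src, trg):
    """Build [CLS] src [SEP] trg [SEP], the layout used for fine-tuning."""
    return (
        [tokenizer.cls_token_id]
        + tokenizer.encode(src, add_special_tokens=False)
        + [tokenizer.sep_token_id]
        + tokenizer.encode(trg, add_special_tokens=False)
        + [tokenizer.sep_token_id]
    )


ids = encode_pair(ToyTokenizer(), "晚风摇树树还挺", "晨露润花花更红")
print(len(ids))                  # 17 = 1 + 7 + 1 + 7 + 1
print(ids[0], ids[8], ids[-1])   # 101 102 102
```

The first `[SEP]` marks the boundary between the first and second line of the couplet, which is what the model has to learn to continue from.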

In [None]:
def predict_now(sentences, model_dir):
    """Load a model from model_dir and print one generation per prompt"""
    tokenizer = BertTokenizerFast.from_pretrained(model_dir)
    m = LanguageGenerationModel(
        model_type,
        model_dir,
        args={"max_length": max_seq_length, "num_return_sequences": num_return_sequences},
        tokenizer=tokenizer,
    )
    for prompt in sentences:
        generated = m.generate(prompt, verbose=False, add_cls_head=True, split_on_space=False, keep_prompt=True)
        print("inputs:", prompt)
        print("outputs:", generated)

In [43]:
predict_now(["灵蛇出洞千山秀"], model_name)

2022-08-13 11:56:06.356 | DEBUG    | textgen.language_generation.language_generation_model:__init__:107 - Device: cuda


inputs: 灵蛇出洞千山秀
outputs: ['灵蛇出洞千山秀水的美景怎么能错过?从此万人狂叫，到山外大水如梭，直奔梦想，每次去过都是那么那么那么心情，而且最爱的']


## Training

In [38]:
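def sim_text_chars(text1, text2):
    """Share of distinct characters common to both texts, relative to the longer text"""
    if not text1 or not text2:
        return 0.0
    # Intersection, not union: count only characters that occur in both texts
    same = set(text1) & set(text2)
    m = len(same)
    n = max(len(text1), len(text2))
    return m / n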


def count_matches(labels, preds):
    logger.debug(f"labels: {labels[:10]}")
    logger.debug(f"preds: {preds[:10]}")
    match = sum([sim_text_chars(label, pred) for label, pred in zip(labels, preds)]) / len(labels)
    logger.debug(f"match: {match}")
    return match


# Fine-tune on the couplet pairs (DataFrame columns: input_text, target_text)
model.train_model(train_df, eval_file=eval_df)
print(model.eval_model(eval_df, matches=count_matches))

2022-08-13 11:51:30.416 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 11:51:30.459 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_50300
2022-08-13 11:51:30.464 | INFO     | textgen.language_modeling.language_modeling_model:train:562 -  Training started


Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

2022-08-13 11:51:30.483 | INFO     | textgen.language_modeling.language_modeling_model:train:594 -    Starting fine-tuning.


Running Epoch 0 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 11:51:31.141 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 11:51:31.147 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Epoch 1 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 11:51:33.675 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 11:51:33.680 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Epoch 2 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 11:51:36.297 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 11:51:36.304 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Epoch 3 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 11:51:38.945 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 11:51:38.950 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Epoch 4 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

2022-08-13 11:51:41.618 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 11:51:41.624 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030
2022-08-13 11:51:44.514 | INFO     | textgen.language_modeling.language_modeling_model:train_model:386 -  Training of gpt2 model complete. Saved to outputs/gpt2_distil_couplet/.
2022-08-13 11:51:44.517 | INFO     | __main__:__init__:32 -  Creating features from dataset file at cache_dir/
2022-08-13 11:51:44.523 | INFO     | __main__:__init__:37 -  Saving features into cached file cache_dir/uer_gpt2-distil-chinese-cluecorpussmall_cached_5030


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

2022-08-13 11:51:44.569 | INFO     | textgen.language_modeling.language_modeling_model:eval_model:926 - {'eval_loss': 8.068880081176758, 'perplexity': tensor(3193.5234)}


{'eval_loss': 8.068880081176758, 'perplexity': tensor(3193.5234)}
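
The two numbers in this result are related: the reported perplexity matches the exponential of the evaluation loss. A short check, using only the values printed above:

```python
import math

eval_loss = 8.068880081176758   # value reported by eval_model above
perplexity = math.exp(eval_loss)

print(round(perplexity, 1))     # 3193.5, in line with the reported tensor(3193.5234)
```

A perplexity of about 3,200 means the model is, on average, roughly as uncertain as if it were choosing uniformly among that many tokens at each step, which is high and consistent with the very short quick-test run.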


In [42]:
predict_now(["灵蛇出洞千山秀"], output_dir)

2022-08-13 11:54:37.166 | DEBUG    | textgen.language_generation.language_generation_model:__init__:107 - Device: cuda


inputs: 灵蛇出洞千山秀
outputs: ['灵蛇出洞千山秀水，不要轻易说脏话！这两天，我们几个同事去灵蛇湖玩，刚开始大家是以为灵蛇会出来，结果他们说这里环境很']


End of this section.