# Generate Poems with Pre-Trained Language Models

In this notebook, we are developing a way to generate poem-like content with a pre-trained language model.

## Sentence-by-sentence approach

It is hard to control the lengthy content generated by the pre-trained model, and hence we design a sentence-by-sentence approach. The basic idea is to generate multiple short sentences at a time, evaluate the meaning-vectors of those short sentences, select the best sentence (so that we may control the flow with the meaning-vectors), and then generate the next sentence with the selected sentence.

A rough pseudo-code is like:

- starting with the seed_sentence
- while(not stopping_criteria):
    - generate multiple sentences with `model.generate()`
    - post-process the generated sentences:
        - evaluate the word-embedding of the seed_sentence, vec_seed
        - for s in sentences:
            - remove the seed_sentence
            - split s into sentences, take the first K, where k=random(1,2,3)
            - segment s_k into terms and evaluate the word-embedding, vec_s
            - calculate the inner product of (vec_seed, vec_s)
        - s_next = argmax(np.dot(vec_seed, vec_s))
    - s_next becomes the new seed_sentence
    - continue



In [40]:
from __future__ import print_function
import logging, os, json, argparse
from transformers import BertTokenizerFast, AutoModelForCausalLM
import pandas as pd
import numpy as np
import jieba


def decode_generated_ids(generated, tokenizer):
    ''' Decode the ids generated by the language model. '''
    output=[]
    for i in range(10):
        text = tokenizer.decode(generated[i], skip_special_tokens= True)    # Decode the generated text
        text = text.replace(' ','')                                         # Remove spaces between tokens
        text = text.replace(',','，')
        output.append(text)
    return(output)

def generate_new_sentences(input, tokenizer, model):
    ''' Generate new sentences with specified model and tokenizer. '''
    # Parse seeding string
    input_ids = tokenizer.encode(input, return_tensors='pt')
    # Generate text
    generated = model.generate(input_ids, 
                            max_length=50,  
                            num_return_sequences=10,
                            no_repeat_ngram_size=2,
                            repetition_penalty=1.05,
                            length_penalty=0.8,
                            top_p=0.75,
                            temperature=.5,
                            do_sample=True,
                            top_k=128,
                            early_stopping=True)
    # Decode
    output = decode_generated_ids(generated, tokenizer)
    # Done
    return(output)




In [46]:
tokenizer = BertTokenizerFast.from_pretrained('../data/tokenizer_bert_base_chinese', clean_text=True)
#model = AutoModelForCausalLM.from_pretrained('../model/ckip', from_tf=True)
model = AutoModelForCausalLM.from_pretrained('../model/gpt2pt_gulong', from_tf=False)

In [4]:
# Generate random numbers
np.random.seed(10)                          # Set random-state
total_lines = np.random.randint( 5,15)      # Define the total lines
print(total_lines)
#
seed_sentence = "巴格達到台北的距離"

14


In [5]:
generated = generate_new_sentences(seed_sentence, tokenizer, model)
print(generated)

Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


['巴格達到台北的距離有三十公里，而且是在中東地區。美國國務院發言人勃恩斯說，布希總統將於今天稍後訪', '巴格達到台北的距離在二十四小時內，台灣的中華民國是一個主權獨立的國家。他說，中共對台政策的改變，', '巴格達到台北的距離不過，他也說，如果台灣要加入世界衛生組織，必須先經過中國大陸的同意才能入會。他', '巴格達到台北的距離不過，他也強調，中共在台灣海峽附近的軍事演習，對台海兩岸都有利。他說，美國的政', '巴格達到台北的距離但是在台灣的大陸人民，不會有任何理由對他們進行政治迫害。」他說:「我們不希望中', '巴格達到台北的距離但是，這種情況在台灣也不會有任何改變。」他說:「我們必須繼續努力，以便儘快恢復', '巴格達到台北的距離是一個小時。這項協議將使巴國政府在未來幾年內，能夠加速推動經濟發展，並且讓巴拉', '巴格達到台北的距離在十二月二十九日，美國總統柯林頓和中共國家主席江澤民在北京舉行高峰會談，雙方就', '巴格達到台北的距離有二十公里，因此，他們不會因為中共的軍事演習而受到影響。但是，在台灣的外交官也', '巴格達到台北的距離只有一百五十公里，而且在台灣，它的飛行時間比較長。他說:「我們的目標是要把台海']


In [6]:
# Load the word embedding for evaluation
import pickle

with open('D:\workspace\word_embedding\zh_wiki_word2vec_300.pkl', 'rb') as f:
    we = pickle.load(f)

print(len(we.keys()))

614083


In [55]:
def get_word_vector(word, word_embedding):
    ''' Look-up the word vector from the given word-embedding. '''
    try:
        vec = word_embedding[word]
    except KeyError as e:
        print(e)
        vec = np.zeros((300))    # Return zero vector if not found
    return(vec)


def evaluate_tokens(tokens, word_embedding):
    ''' Evaluate the  '''
    n_token = len(tokens)
    vec = np.zeros((300))
    for i in range(n_token):
        vec = vec + get_word_vector(tokens[i], word_embedding)
    vec = vec/n_token
    return(vec)


def postprocess_generated_sentences(sentences, seed_sentence, vdict):
    ''' Post-process the generated paragraph. '''
    # Define sentence-break symbols
    bs = ['，','。','；','！','？']
    # Loop through all generated snetences
    svecs = []
    stokens = []
    for s in sentences:
        temp = s.replace(seed_sentence, '')     # Remove the seed sentence
        tokens = list(jieba.cut(temp))          # Tokenize the sentence with jieba
        # Looking for sentence-break symbols
        idxs = [i for i, x in enumerate(tokens) if x in bs]
        print(len(idxs))
        if len(idxs)>1:                         # Keep tokens before the fisrt break
            tokens = tokens[idxs[0]+1:idxs[1]]
        elif len(idxs)>0:
            tokens = tokens[:idxs[0]]
        else:
            tokens = tokens
        svec = evaluate_tokens(tokens, vdict)   # Calculate the word-embedding vectors of the tokens
        svecs.append(svec)
        stokens.append(tokens)
    #
    return({'sentences':stokens,'wvectors':svecs})


def select_next_sentence(candidates, seed_vec):
    ''' Select the best candidate. '''
    scores = []
    for i in range(len(candidates['sentences'])):
        print(candidates['sentences'][i])
        score = np.dot(seed_vec, candidates['wvectors'][i])
        print(score)
        scores.append(score)
    return({'sentence':candidates['sentences'][scores.index(max(scores))], 
            'vector':candidates['wvectors'][scores.index(max(scores))]})

In [9]:
pps = postprocess_generated_sentences(generated, seed_sentence, we)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\tsyo\AppData\Local\Temp\jieba.cache
Loading model cost 0.627 seconds.
Prefix dict has been built successfully.


'一百五十公里'


In [12]:
seed_vec = evaluate_tokens(list(jieba.cut(seed_sentence)), we)
print(seed_sentence)

scores = []
for i in range(len(generated)):
    print(pps['sentences'][i])
    score = np.dot(seed_vec, pps['wvectors'][i])
    print(score)
    scores.append(score)


巴格達到台北的距離
['有', '三十公里']
2.065612835053948
['在', '二十四小', '時內']
2.0885943965330505
['不過']
2.240431904950311
['不過']
2.240431904950311
['但是', '在', '台灣', '的', '大陸', '人民']
1.9036225756816176
['但是']
1.6261310035297365
['是', '一個', '小時']
2.4542005939882654
['在', '十二月', '二十九日']
1.3226899392892093
['有', '二十公里']
1.8746241015857688
['只有', '一百五十公里']
0.9610658728828538


In [14]:
pps['sentences'][scores.index(max(scores))]

['是', '一個', '小時']

In [56]:
# Generate random numbers
np.random.seed()                 # Set random-state
total_lines = np.random.randint( 8,12)      # Define the total lines
print('To generate '+str(total_lines)+' sentences in total.')
# Generate starting sentence
output = []
seed_sentence = '巴格達到台北的距離'
output.append(seed_sentence)
seed_vec = evaluate_tokens(list(jieba.cut(seed_sentence)), we)
# Generate followed-up sentences
for i in range(total_lines):
    generated = generate_new_sentences(seed_sentence, tokenizer, model)
    candidates = postprocess_generated_sentences(generated, seed_sentence, we)
    selected = select_next_sentence(candidates, seed_vec)
    output.append(''.join(selected['sentence']))
    seed_vec = selected['vector']
    seed_sentence = ''.join(selected['sentence'])
# done
print('\n'.join(output))

Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


To generate 11 sentences in total.


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


1
2
1
2
2
2
2
1
1
1
['他們', '的', '目的地']
1.9705591310022152
['就', '在', '那裡']
1.7935014486517125
['他們', '一定', '要', '等到', '明年', '才能', '看到', '他']
1.4773623678492964
['就是', '要', '讓', '這個', '世界', '上', '有', '一個', '人', '能夠', '看見', '他', '的', '蹤', '影']
1.8235508858579936
['他們', '的', '行動', '已經', '是', '非常', '危險', '的']
1.9119363037679267
['只不過', '是', '一種', '非常', '危險', '的', '行為']
1.7566897225904405
['就是', '要', '讓', '別人', '瞭解', '他', '的', '生死']
1.5969375718779921
['他們', '都', '很', '遙遠']
1.9634429421973043
['他們', '都', '不會', '像', '楚留香', '那樣']
1.5553901897471025
['這個', '世界', '上', '有', '很多', '事', '都', '是', '這樣子', '的']
1.6968698230985952


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


1
1
1
1
1
1
0
'?'
'」'
1
1
1
['在', '那裡']
2.5753824364244142
['在', '那裡']
2.5753824364244142
['在', '那裡']
2.5753824364244142
['在', '那裡']
2.5753824364244142
['在', '於', '阻止', '他', '的', '人']
2.265638248031104
['在', '那裡']
2.5753824364244142
['在', '哪裡', '?', '」']
1.1062530919681064
['在', '那裡']
2.5753824364244142
['在', '那裡']
2.5753824364244142
['在', '那裡']
2.5753824364244142


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


1
1
1
2
1
1
0
'」'
1
1
1
[]
nan
[]
nan
[]
nan
['就是', '他們', '的', '朋友']
2.690904658188446
[]
nan
['他們', '都', '不', '知道']
2.70154428853963
['」']
0.0
[]
nan
['他們', '的', '人', '都', '已', '經死', '了']
2.6765193003630356
[]
nan


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


1
'「'
2
1
'「'
2
3
3
2
'」'
'：'
'「'
2
0
'「'
'?'
'」'
1
'「'
['「', '這個', '人', '是不是', '因為', '他們', '的', '身份', '而', '改變', '了']
nan
['就是', '我們', '這個', '男人']
nan
['「', '我們', '是不是', '也', '不', '知道']
nan
['不是', '為', '了', '我', '而', '做']
nan
['就算', '他', '是', '個', '死', '人', '也', '不會', '相應']
nan
['是', '為', '了', '要', '讓', '人', '知道', '他們', '的', '行', '蹤']
nan
['」', '他', '說', '：', '「', '我', '知道', '這一點']
nan
['你們', '這次', '行動', '的', '目的', '是', '要', '讓', '人', '知道', '他們', '是', '在', '想要', '他', '的', '計劃', '中', '遭到', '的']
nan
['「', '你們', '不是', '要', '看見', '這個', '人', '?', '」']
nan
['「', '這個', '世界', '上', '有', '很多', '事', '都', '是', '不', '可能', '的', '事']
nan


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


1
1
1
1
1
1
1
1
0
'」'
1
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
['」']
0.0
[]
nan


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


2
1
'「'
2
1
'「'
0
'「'
'?'
'」'
1
'「'
1
'「'
1
'「'
1
'「'
0
'「'
'?'
'」'
['而是', '因為', '他', '是', '個', '非常', '聰明', '的', '女人']
nan
['「', '你們', '都', '是', '這樣子', '的']
nan
['而是', '想要', '他', '殺', '死', '他']
nan
['「', '我們', '的', '確有', '一點', '意思']
nan
['「', '這個', '世界', '上', '有', '一些', '人', '是不是', '能夠', '看', '得', '見', '的', '?', '」']
nan
['「', '這個', '世界', '上', '有', '很多', '事', '都', '是', '不能', '算是', '事實', '的']
nan
['「', '我', '知道', '這個', '世界', '上', '有', '很多', '事', '都', '是', '這樣子', '的']
nan
['「', '你', '知道']
nan
['「', '我', '不', '知道']
nan
['「', '我們', '是不是', '要', '殺', '他們', '的', '人', '?', '」']
nan


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


0
'」'
2
0
'」'
0
'」'
0
'」'
2
'聰麗'
0
'：'
'「'
'?'
'」'
2
2
1
['」']
0.0
['而且', '很快', '就會', '變得', '非凡']
2.174045268942904
['」']
0.0
['」']
0.0
['」']
0.0
['就', '好像', '是', '一個', '很', '聰麗', '的', '人']
2.4377726911408506
['所以', '她', '才', '會問', '：', '「', '你', '說', '的', '是', '什麼', '?', '」']
1.8519584559064883
['就', '像是', '一個', '不能', '讓', '別人', '看到', '的', '小鬼', '一樣']
2.5564123898508813
['忽然', '間', '就', '已', '經變', '得', '很', '奇怪']
2.15733491654467
['所以', '她', '才', '會', '覺得', '自己', '一定', '會', '想到', '自我']
2.536676634685342


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


2
2
3
0
'」'
3
2
1
2
2
'每一處'
1
['就', '好像', '一', '隻', '眼']
2.320698607242605
['一定', '要', '有', '這種', '感覺']
2.7739131254219345
['他們', '的', '行動']
2.474049664988776
['」']
0.0
['有', '一些', '人', '是不是', '要', '看', '見', '的']
2.4674524629943737
['每一條', '路', '都', '有', '一片', '狼來', '的', '痕跡']
1.9661304287162713
['他們', '也', '許還', '是', '會', '看', '得', '見', '的']
2.4472775472269057
['一定', '要', '有', '一種', '非常', '嚴肅', '的', '態度']
2.5787584488465294
['每一處', '都', '有', '一些', '小小的', '屍體']
1.923661172433699
[]
nan


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


0
'」'
0
'」'
1
0
'」'
1
0
'」'
1
1
0
'」'
0
'」'
['」']
0.0
['」']
0.0
['因為', '他們', '的', '感情', '都', '好像', '是', '一樣']
2.8725921333903526
['」']
0.0
['才能', '讓', '人', '瞭解']
2.691455452280936
['」']
0.0
['才能', '讓', '人', '看得出', '來']
2.6557389053110065
['才能', '讓', '他們', '知道']
3.034763665752032
['」']
0.0
['」']
0.0


Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


0
0
0
0
0
0
0
0
0
0
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
[]
nan
1
'「'
2
1
'「'
1
'「'
1
'「'
2
0
'「'
'?'
'」'
0
'「'
'?'
'」'
0
'「'
'?'
'」'
2
['「', '我', '不是', '這樣子', '的']
nan
['就是', '要', '讓', '你', '們', '的', '人', '都', '瞭解', '他們']
nan
['「', '你們', '不能不', '承認', '的']
nan
['「', '這個', '世界', '上', '有', '很多', '事', '都', '是不是', '事實', '的']
nan
['「', '我', '不是', '這樣子', '的']
nan
['而是', '你', '的', '兄弟']
nan
['「', '你', '是不是', '要', '把', '你', '的', '頭顱', '割', '下來', '?', '」']
nan
['「', '你', '是不是', '這樣子', '的', '?', '」']
nan
['「', '這個', '人', '是', '誰', '?', '」']
nan
['有', '很多', '事', '都', '是', '不能', '回答', '的']
nan
巴格達到台北的距離
他們的目的地
在那裡

「這個人是不是因為他們的身份而改變了

而是因為他是個非常聰明的女人
就像是一個不能讓別人看到的小鬼一樣
一定要有這種感覺
才能讓他們知道

「我不是這樣子的
