## Use an LSTM to generate poems

The procedure:

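1. Embed each character with an embedding layer;
2. Model the sequence with a decoder-only LSTM;
3. Generate poems by sampling from the predicted distribution;
4. Train on the entire dataset with no validation split, so the model is expected to overfit the training data.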

### 1. Download the dataset

In [1]:
# Download the dataset
!mkdir -p data
!wget -nv --show-progress https://raw.githubusercontent.com/xiu-ze/Poetry/refs/heads/main/%E8%AF%97%E6%AD%8C%E6%95%B0%E6%8D%AE%E9%9B%86/%E5%94%90.csv -O data/tang_poetry.csv

2025-08-19 17:59:43 URL:https://raw.githubusercontent.com/xiu-ze/Poetry/refs/heads/main/%E8%AF%97%E6%AD%8C%E6%95%B0%E6%8D%AE%E9%9B%86/%E5%94%90.csv [12482729/12482729] -> "data/tang_poetry.csv" [2]


In [2]:
import os
import pandas as pd

file = os.path.expanduser('data/tang_poetry.csv')
df = pd.read_csv(file)

filter_by_wujue = df[df["体裁"].astype(str).str.contains("五言绝句", na=False)].copy()
wujue = filter_by_wujue['内容']
wujue[:10]

25    写书今日了，先生莫嫌迟。明朝是假日，早放学生归。
26    他道侧书易，我道侧书难。侧书还侧读，还须侧眼看。
36    攀藤招逸客，偃桂协幽情。水中看树影，风里听松声。
37    携琴侍叔夜，负局访安期。不应题石壁，为记赏山时。
38    泉石多仙趣，岩壑写奇形。欲知堪悦耳，唯听水泠泠。
39    岩壑恣登临，莹目复怡心。风篁类长笛，流水当鸣琴。
40    懒步天台路，惟登地肺山。幽岩仙桂满，今日恣情攀。
41    暂游仁智所，萧然松桂情。寄言栖遁客，勿复访蓬瀛。
42    瀑溜晴疑雨，丛篁昼似昏。山中真可玩，暂请报王孙。
43    傍池聊试笔，倚石旋题诗。豫弹山水调，终拟从钟期。
Name: 内容, dtype: object

### 2. Clean the dataset

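1. Truncate each poem to 24 characters (four five-character lines, each followed by one punctuation mark);
2. Check that positions 5, 11, 17, and 23 (0-based) hold valid punctuation, and remove the poems where they do not.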
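The two steps above can be sketched on plain Python strings before moving to numpy. This is only an illustration: `is_well_formed`, `PUNCT_POSITIONS`, and `VALID_MARKS` are hypothetical names, and the two sample poems are taken from the notebook output.

```python
# Minimal sketch of the cleaning rule, assuming a wujue is 4 lines of
# 5 characters, each followed by one punctuation mark (24 characters total).
PUNCT_POSITIONS = (5, 11, 17, 23)   # 0-based index of the mark ending each line
VALID_MARKS = set("，。！？")


def is_well_formed(poem, length=24):
    """Return True if the first `length` characters follow the wujue layout."""
    text = poem[:length]            # step 1: truncate
    if len(text) < length:          # too short to hold four full lines
        return False
    # step 2: every fixed position must hold a valid punctuation mark
    return all(text[i] in VALID_MARKS for i in PUNCT_POSITIONS)


poems = [
    "写书今日了，先生莫嫌迟。明朝是假日，早放学生归。",
    "勒马问樵夫，前村酒有无。「杜康家在此，一任君来沽",  # quote mark shifts the layout
]
kept = [p for p in poems if is_well_formed(p)]
print(len(kept), "of", len(poems), "poems kept")
```

The second poem is rejected because the opening quotation mark pushes `此` and `沽` into punctuation positions, which is the abnormal item the numpy version reports.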

In [3]:
# Transform the pandas Series to numpy array and check
import numpy as np

def check_wujue_signs(wujue_array):
    # Check the fixed location values
    signs_in_wujue = set(wujue_array[:, [5, 11, 17, 23]].reshape(-1))
    print('signs in wujue:', ''.join(signs_in_wujue))

    # Anything other than these punctuation marks at the fixed positions is abnormal
    valid_chars = set("！，？。")
    invalid_chars = signs_in_wujue - valid_chars
    print('invalid signs:', ''.join(invalid_chars))

    return invalid_chars

# Truncate every poem to 24 characters
wujue_truncated = wujue.map(lambda x: x[:24])

# Transform the pandas Series to numpy array
wujue_array = np.array(
    wujue_truncated.map(list).to_list()
)
print('shape of numpy array', wujue_array.shape)

# Check the signs in the wujue array
invalid_chars = check_wujue_signs(wujue_array)

# Find the abnormal items
abnormal_items = wujue_array[
    np.isin(wujue_array[:, [5, 11, 17, 23]], list(invalid_chars)).any(axis=1)
]
print('abnormal count:', len(abnormal_items))
print('abnormal item: ', ''.join(abnormal_items[0]))

# Remove the abnormal item
wujue_array = wujue_array[
    ~np.isin(wujue_array[:, [5, 11, 17, 23]], list(invalid_chars)).any(axis=1)
]

print('===== After removing the abnormal items =====')
print('shape of numpy array', wujue_array.shape)
check_wujue_signs(wujue_array)

shape of numpy array (2711, 24)
signs in wujue: 。！？，沽此
invalid signs: 沽此
abnormal count: 1
abnormal item:  勒马问樵夫，前村酒有无。「杜康家在此，一任君来沽
===== After removing the abnormal items =====
shape of numpy array (2710, 24)
signs in wujue: ？，。！
invalid signs: 


set()

### 3. Convert the dataset to token id sequences

In [4]:
from keras import layers

tv = layers.TextVectorization(
    max_tokens=10000,
    standardize=None,
    split="character",
    output_mode="int",
    output_sequence_length=24
)
tv.adapt(wujue_truncated)



In [5]:
# Demo usage of tv

# Print the vocabulary
print('Vocabulary size:', tv.vocabulary_size())
print('Vocabulary:', ''.join(tv.get_vocabulary()[:20]))

# Encode
encoded = tv(wujue_truncated.values[0])

# Decode
vocab = tv.get_vocabulary()
decoded = [vocab[i] for i in encoded]
print('Encoded:', encoded.numpy())
print('Decoded:', ''.join(decoded))

Vocabulary: [UNK]，。不人风山一无日何花来春水月中上知
Encoded: [1230  202   44   10 1137    2  297   42   67  608  323    3   48   85
   38 1001   10    2  288  576  441   42   40    3]
Decoded: 写书今日了，先生莫嫌迟。明朝是假日，早放学生归。
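To make the encode/decode round trip concrete, here is a rough pure-Python sketch of a character-level vocabulary. It is not the `TextVectorization` implementation: `build_vocab`, `encode`, and `decode` are hypothetical helpers, and the only behavior borrowed from the layer is the layout in which index 0 is the padding token (the empty string), index 1 is `[UNK]`, and the remaining characters are ordered by frequency. How ties are ordered here is an assumption.

```python
from collections import Counter


def build_vocab(texts):
    """Build a character vocabulary: padding, unknown, then by frequency."""
    counts = Counter(ch for text in texts for ch in text)
    return ["", "[UNK]"] + [ch for ch, _ in counts.most_common()]


def encode(text, vocab):
    """Map each character to its id; unseen characters map to 1 ([UNK])."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index.get(ch, 1) for ch in text]


def decode(ids, vocab):
    """Map ids back to characters."""
    return "".join(vocab[i] for i in ids)


texts = ["写书今日了，先生莫嫌迟。", "明朝是假日，早放学生归。"]
vocab = build_vocab(texts)
ids = encode(texts[0], vocab)
print(ids)
print(decode(ids, vocab))
```

Characters that never appeared during vocabulary building all collapse to id 1, so decoding them yields `[UNK]` rather than the original character.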


In [6]:
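# Encode all the poems
# Note: this encodes wujue_truncated, which still contains the abnormal
# poem found in step 2 (hence 2711 rows rather than 2710)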
wujue_token_ids = tv(wujue_truncated.values)
print('shape of wujue_token_ids:', wujue_token_ids.shape)

shape of wujue_token_ids: (2711, 24)


### 4. Build the LSTM Decoder model

In [7]:
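# Prepare the training dataset: the inputs are tokens 0..22 of each poem and
# the targets are the same tokens shifted left by one (tokens 1..23)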
train_sequences = wujue_token_ids[:, :-1]
target_sequences = wujue_token_ids[:, 1:]

train_sequences.shape, target_sequences.shape

(TensorShape([2711, 23]), TensorShape([2711, 23]))
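The slicing above sets up next-token prediction: at every position the model sees the tokens so far and is asked for the one that follows. A small sketch on a plain list shows the pairing; `shift_for_next_token` is a hypothetical helper, and the ids are the first six from the encoded poem printed earlier.

```python
def shift_for_next_token(token_ids):
    """Split one sequence into (inputs, targets) offset by one position."""
    inputs = token_ids[:-1]    # everything except the last token
    targets = token_ids[1:]    # everything except the first token
    return inputs, targets


ids = [1230, 202, 44, 10, 1137, 2]   # 写 书 今 日 了 ，
inputs, targets = shift_for_next_token(ids)

# Each input token is paired with the token the model should predict next.
for current, following in zip(inputs, targets):
    print(current, "->", following)
```

Both halves have length `n - 1`, which is why the 24-token poems become 23-step sequences.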

In [69]:
# Build a simple LSTM Decoder model

import keras
from keras import models, layers

def build_model(vocab_size):
    inputs = keras.Input(shape=(None,), dtype="int32", name="inputs")
    x_embedded = layers.Embedding(
        input_dim=vocab_size, output_dim=100, name="embedding"
    )(inputs)
    x_lstm_output = layers.LSTM(
        128, return_sequences=True, name="lstm"
    )(x_embedded)
    outputs = layers.Dense(
        vocab_size, activation="softmax", name="output"
    )(x_lstm_output)

    return models.Model(inputs=inputs, outputs=outputs, name="lstm_decoder")

model = build_model(tv.vocabulary_size())
model.summary()
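As a sanity check on what `model.summary()` reports, the trainable parameter counts can be worked out by hand. The sketch below assumes the default Keras layer settings (biases enabled, a single bias vector per LSTM gate). `lstm_decoder_param_count` is a hypothetical helper, and `vocab_size=3000` is a placeholder, since the real vocabulary size is not shown in this notebook's output.

```python
def lstm_decoder_param_count(vocab_size, embed_dim=100, units=128):
    """Count trainable parameters of Embedding -> LSTM -> Dense(softmax)."""
    # Embedding: one embed_dim-sized vector per vocabulary entry
    embedding = vocab_size * embed_dim
    # LSTM: 4 gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (units * (embed_dim + units) + units)
    # Dense: one weight per (unit, vocabulary entry) pair plus one bias each
    dense = units * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "lstm": lstm,
        "output": dense,
        "total": embedding + lstm + dense,
    }


counts = lstm_decoder_param_count(vocab_size=3000)  # placeholder vocabulary size
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

The LSTM count does not depend on the vocabulary size, so under these assumptions most of the parameters sit in the embedding and output layers.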


In [67]:
# Define the generation and sampling functions
def generate(prompt, max_length=24, temperature=1.0):
    """
    Generate a poem based on the start prompt

    Returns:
        A generated poem as a string.
    """
    generated = tv(prompt)[:len(prompt)].numpy().tolist()
    while len(generated) < max_length:
        input_sequence = np.array(generated).reshape(1, -1)
        predictions = model.predict(input_sequence, verbose=0)[0]
        next_token_id = sample(predictions[-1], temperature)
        generated.append(next_token_id)
    # Fetch the vocabulary once instead of once per token
    vocab = tv.get_vocabulary()
    return ''.join(vocab[token_id] for token_id in generated)

def sample(predictions, temperature=1.0, eps1=1e-20, eps2=1e-9):
    """
    Sample a token id from the predictions with temperature scaling.

    eps1 keeps log() finite when a probability is 0, and eps2 keeps the
    division finite when temperature is 0, in which case sampling
    effectively picks the most likely token.
    """
    p = np.asarray(predictions, dtype=np.float64)

    # The two key points: log(p + eps1) divided by (T + eps2)
    logits = np.log(p + eps1) / (float(temperature) + eps2)

    # Subtract the max logit to prevent overflow
    logits -= np.max(logits)

    q = np.exp(logits)
    q /= q.sum()
    return int(np.random.choice(len(q), p=q))


generate("海外", temperature=0)

'海外不山里，不是不人人。不知不不处，不是不人人。'
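The repetitive output at temperature 0 follows from how temperature reshapes the distribution before sampling. The sketch below is a simplified, standard-library version of that rescaling without the epsilon guards, so it assumes strictly positive probabilities and a temperature above 0. `apply_temperature` is a hypothetical name and the three-token distribution is made up for illustration.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i ** (1 / T), renormalized."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)                       # subtract the max for stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


probs = [0.6, 0.3, 0.1]                      # toy distribution over 3 tokens
sharp = apply_temperature(probs, 0.5)        # lower T: mass moves to the top token
flat = apply_temperature(probs, 1.5)         # higher T: mass spreads out

print("T=0.5:", [round(p, 3) for p in sharp])
print("T=1.5:", [round(p, 3) for p in flat])
```

As the temperature approaches 0 the top token takes nearly all the mass, so generation becomes close to greedy decoding and tends to loop on frequent characters.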

In [68]:
# Define a callback that prints sample poems at epochs 1, 2, 4, 8, ...
class PoetryGenerateCallback(keras.callbacks.Callback):
    def __init__(self):
        super().__init__()

        self.next_print_epoch = 1

    def on_epoch_end(self, epoch, logs=None):
        epoch += 1
        if epoch != self.next_print_epoch:
            return

        # epoch is already 1-based here, so print it as is
        print(f"Generating poems at epoch {epoch}:\n")
        self._print_generated_poems()
        self.next_print_epoch *= 2

    @staticmethod
    def _print_generated_poems():
        temperatures = [0, 0.5, 1.0, 1.5]
        generated_texts = [
            generate('海外', max_length=24, temperature=temp)
            for temp in temperatures
        ]

        for temp, text in zip(temperatures, generated_texts):
            print(f"temperature {temp}:{text}\n")

In [70]:
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)
model.fit(
    train_sequences,
    target_sequences,
    batch_size=64,
    epochs=10,
    callbacks=[PoetryGenerateCallback()],
    verbose=2
)

Epoch 1/10
Generated poem with temperature 0:
海外。。。。。。。。。。。。。。。。。。。。。。

Generated poem with temperature 0.5:
海外缩寂，，，，愁，。，，。。。。，。，。，。不

Generated poem with temperature 1.0:
海外早和医峡小城，二。，满。辞数。垂。有寒。饥嗟

Generated poem with temperature 1.5:
海外萧虑静泪舟残逗动弄亭禄枯草渺逐彭苦甘撩无生犹

43/43 - 13s - 307ms/step - accuracy: 0.0855 - loss: 7.2277
Epoch 2/10
Generated poem with temperature 0:
海外，，。。。。。。。。。。。。。。。。。。。。

Generated poem with temperature 0.5:
海外情水。。。，。。。家，，。，。。时。，。。，

Generated poem with temperature 1.0:
海外逢弃白天。浮。株年旧渔我人嫌唯，，三送，，起

Generated poem with temperature 1.5:
海外透松待客财碧黄肠应谈羡争尽终容子语侧范高欲叠

43/43 - 11s - 256ms/step - accuracy: 0.0902 - loss: 6.3583
Epoch 3/10
Generated poem with temperature 0:
海外，，，。，。，。，。，。，。，。，。，。，。

Generated poem with temperature 0.5:
海外溪，，，。。。，人，，，。。。。，。，。。。

Generated poem with temperature 1.0:
海外越时韵年。，乡颓龙君独齐君远不阁年。望水附，

Generated poem with temperature 1.5:
海外联鸡檐劣棋辈璧用舞，教趣月心喜忍终婉危。每实

43/43 - 11s - 263ms/step - accuracy: 0.0939 - loss: 6.3154
Epoch 4/10
Generated poem with temperatu

<keras.src.callbacks.history.History at 0x14f81b740>
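One way to read the loss values in the training log is to convert them to perplexity. Because `sparse_categorical_crossentropy` is measured in nats, `exp(loss)` is the perplexity, which can be loosely read as the number of equally likely characters the model is choosing between at each step. The sketch below uses only the three loss values visible in the log above.

```python
import math

# Per-epoch training loss copied from the log above
losses = {1: 7.2277, 2: 6.3583, 3: 6.3154}

# Perplexity is the exponential of the mean cross-entropy (in nats)
perplexity = {epoch: math.exp(loss) for epoch, loss in losses.items()}

for epoch, value in perplexity.items():
    print(f"epoch {epoch}: loss {losses[epoch]:.4f} -> perplexity {value:.0f}")
```

These figures are computed on the training data only, so they say nothing about how the model would do on unseen poems.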

### 5. Use the model to generate poems

In [71]:
generate('海外')

'海外声春落。已为莹别若，虽二难幕争。写夜渔薜白，'