# Using an LSTM to generate poems in the style of Percy Bysshe Shelley with PyTorch

Teaching a computer to write poems in the style of a particular poet is an interesting task. One way to approach it is an N-gram model. However, with the development of deep learning, recurrent neural networks (RNNs) have become a popular and often more effective choice for this task. In this tutorial, I will show how to use long short-term memory (LSTM), which is a kind of RNN, to train a computer to write poems. We will use PyTorch as our tool. Since Percy Bysshe Shelley is one of my favorite poets, let's start with him!

The tutorial contains these parts:
1. Some basic concepts of RNNs and LSTMs, to help readers who are new to neural networks understand the basics.
2. How to collect poems by your favorite poet. This part shows how to use Beautiful Soup to scrape poems from the web and NLTK to clean them.
3. How to transform the data into a form suitable for feeding into an LSTM network. This part shows how to use PyTorch's Dataset and DataLoader to preprocess the training data and generate batches.
4. How to build and train the LSTM model.
5. How to use the model to write poems.

Note: This tutorial only shows how to build a basic LSTM model, so I leave out some important parts, such as splitting the data into training and validation sets and evaluating model performance.

## 1. Introduction to RNN and LSTM

A recurrent neural network (RNN) is a kind of neural network that takes sequence data as input and shares its parameters across time steps. Long short-term memory (LSTM) is a variant of the RNN that uses three gates (input, forget, and output) to control what its memory cell stores, discards, and exposes. This kind of network is good at dealing with data whose current state depends on previous states.

To understand them better, see:
1. RNN: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns
2. LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs
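
To make the three gates concrete, here is a minimal sketch of one LSTM time step for a single scalar input and a hidden state of size one. The function name `lstm_step` and the weight layout (a dictionary of scalar triples) are assumptions made for illustration; a real LSTM layer performs the same kind of update with weight matrices instead of scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    w maps a gate name to a (w_x, w_h, bias) triple (hypothetical layout).
    """
    def pre_activation(gate):
        w_x, w_h, bias = w[gate]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))         # how much new information to write
    f = sigmoid(pre_activation("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre_activation("output"))        # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))   # the new candidate value

    c = f * c_prev + i * g                       # updated memory cell
    h = o * math.tanh(c)                         # updated hidden state
    return h, c

# With all weights at zero every gate equals 0.5, so half of c_prev is kept.
zero_weights = {gate: (0.0, 0.0, 0.0)
                for gate in ("input", "forget", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=zero_weights)
print(h, c)
```

The cell state `c` is what lets the network carry information across many time steps: the forget gate scales the old value and the input gate decides how much of the new candidate is added.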

## 2. Data collection and cleaning

To imitate Percy Bysshe Shelley's poems, we need to collect some training data, so let's use BeautifulSoup to crawl multiple poems. BeautifulSoup is a Python package that parses HTML into a tree, so that we can easily get the text or attributes of a tag. We collect the poems from https://www.poemhunter.com/poems.

Here are the steps we follow to collect the data.
1. We look at the page source to find which part of the HTML contains the poem we need. From my manual inspection in Chrome, I know that the poem is in the second 'p' tag of each page, which gives us one poem. Then we find the link to the next poem, so that we can reach the following poems by iteration. The function get_poem() does this.
2. The function get_poems() iterates over the links to collect all the poems.

In [2]:
from bs4 import BeautifulSoup
import requests
import time

# Get one poem and find the link to the next poem.
# html: The URL of the web page that contains the poem.
# return:
# poem: The text of one poem, or None if the page has fewer than two <p> tags.
# next_html: The URL of the next poem, or None if there is no next page.
def get_poem(html):
    html = requests.get(html).content.decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")

    # The poem is in the second <p> tag of the page.
    paragraphs = soup.find_all("p", limit=2)
    poem = paragraphs[1].text if len(paragraphs) >= 2 else None

    html_prefix = "https://www.poemhunter.com"
    html_body = soup.find("li", class_="next")
    if html_body:
        html_body = html_body.find("a")["href"]
        next_html = html_prefix + html_body
        return poem, next_html
    else:
        return poem, None

# Follow the "next" links to collect all the poems.
def get_poems(start_html):
    poems = []
    next_html = start_html
    while next_html:
        poem, next_html = get_poem(next_html)
        if poem is not None:
            poems.append(poem)
        # Pause between requests to avoid overloading the server.
        if next_html:
            time.sleep(1)
    # Return the real number of poems (the old counter was off by one).
    return poems, len(poems)
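
The "second 'p' tag plus next link" idea can be tried offline before crawling the real site. The sketch below deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The sample HTML, the class name `PoemPageParser` and the function `parse_poem_page` are all hypothetical, and the real page markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical page fragment used only for this offline test.
SAMPLE_HTML = """
<p>Navigation text</p>
<p>first line<br>second line</p>
<ul><li class="next"><a href="/poem/example-next-poem/">Next</a></li></ul>
"""

class PoemPageParser(HTMLParser):
    """Collects the text of the second <p> and the href inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self.p_count = 0
        self.in_target_p = False
        self.in_next_li = False
        self.poem_parts = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self.p_count += 1
            self.in_target_p = self.p_count == 2
        elif tag == "li" and "next" in (attrs.get("class") or "").split():
            self.in_next_li = True
        elif tag == "a" and self.in_next_li and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_target_p = False
        elif tag == "li":
            self.in_next_li = False

    def handle_data(self, data):
        if self.in_target_p:
            self.poem_parts.append(data)

def parse_poem_page(html):
    parser = PoemPageParser()
    parser.feed(html)
    poem = "".join(parser.poem_parts) if parser.poem_parts else None
    return poem, parser.next_href

poem, next_href = parse_poem_page(SAMPLE_HTML)
print(repr(poem))
print(next_href)
```

Note that the text on either side of `<br>` is joined without a space here. BeautifulSoup's `p.text` concatenates text in the same way, which is probably why merged words such as "landWho" show up in the raw poems.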

In [13]:
# Here are the poems that we collected.
start_html = "https://www.poemhunter.com/poem/ozymandias/"
poems, count = get_poems(start_html)
print("The first 3 raw poems:")
print(poems[:3])
print("I got " + str(count) + " poems.")

The first 3 raw poems:
['\r\n                        I met a traveller from an antique landWho said: `Two vast and trunkless legs of stoneStand in the desert. Near them, on the sand,Half sunk, a shattered visage lies, whose frown,And wrinkled lip, and sneer of cold command,Tell that its sculptor  well those passions readWhich yet survive, stamped on these lifeless things,The hand that mocked them and the heart that fed.And on the pedestal these words appear --"My name is Ozymandias, king of kings:Look on my works, ye Mighty, and despair!"Nothing beside remains. Round the decayOf that colossal wreck, boundless and bareThe lone and level sands stretch far away.\' \r\n                        \n', '\r\n                        Listen, listen, Mary mine,To the whisper of the Apennine,It bursts on the roof like the thunder’s roar,Or like the sea on a northern shore,Heard in its raging ebb and flowBy the captives pent in the cave below.The Apennine in the light of dayIs a mighty mountain dim a

In [4]:
# Save the raw dataset so that we do not have to crawl the site again.
import pickle
# with open("poems.pickle", "wb") as f:
#     pickle.dump(poems, f)
with open("poems.pickle", "rb") as f:
    poems = pickle.load(f)

However, the data still contains some "bad" characters, such as "\r\n" and "II". We need to remove these. In more detail, we keep only the characters a-z and 0-9 plus some punctuation, and then split the text into lower-case words. Then we join all the words into one big list. Also, to train the model, we need to transform the words into numbers, which means each word is represented by an integer index.

In [8]:
# First Step:
# Clean the text.
# Note: word_tokenize needs NLTK's tokenizer data to be downloaded first (see nltk.download).
import nltk
from nltk.tokenize import word_tokenize
import re

def clean_text(poems):
    word_list = []
    # replace all whitespace with a single space
    poems_content = [re.sub(r"\s+", " ", p).lower() for p in poems]

    # add spaces around all punctuation (except - and '), so they become separate tokens
    punctuation = set(re.findall(r"[^\w\s]+", " ".join(poems_content))) - {"-", "'"}

    for c in punctuation:
        poems_content = [p.replace(c, " " + c + " ") for p in poems_content]

    # collapse the extra spaces introduced above
    poems_content = [re.sub(r"\s+", " ", p).strip() for p in poems_content]

    # remove every character that is not whitespace or one of a-z0-9,!?();:-
    poems_content = [re.sub(r"[^\sa-z0-9,!?();:-]", "", p) for p in poems_content]

    for poem in poems_content:
        token_poem = word_tokenize(poem)
        word_list += token_poem

    return word_list

word_list = clean_text(poems)
print(word_list[:100])

['i', 'met', 'a', 'traveller', 'from', 'an', 'antique', 'landwho', 'said', ':', 'two', 'vast', 'and', 'trunkless', 'legs', 'of', 'stonestand', 'in', 'the', 'desert', 'near', 'them', ',', 'on', 'the', 'sand', ',', 'half', 'sunk', ',', 'a', 'shattered', 'visage', 'lies', ',', 'whose', 'frown', ',', 'and', 'wrinkled', 'lip', ',', 'and', 'sneer', 'of', 'cold', 'command', ',', 'tell', 'that', 'its', 'sculptor', 'well', 'those', 'passions', 'readwhich', 'yet', 'survive', ',', 'stamped', 'on', 'these', 'lifeless', 'things', ',', 'the', 'hand', 'that', 'mocked', 'them', 'and', 'the', 'heart', 'that', 'fed', 'and', 'on', 'the', 'pedestal', 'these', 'words', 'appear', '--', 'my', 'name', 'is', 'ozymandias', ',', 'king', 'of', 'kings', ':', 'look', 'on', 'my', 'works', ',', 'ye', 'mighty', ',']


In [9]:
# Second Step:
# Build the word <-> index dictionaries and convert the word list into indices.
def construct_dictionary_and_word_transform(word_list):
    word2index = {}
    index2word = {}
    index = 0
    index_word_list = []
    for word in word_list:
        if word not in word2index:
            word2index[word] = index
            index2word[index] = word
            index += 1
        index_word_list.append(word2index[word])
    return word2index, index2word, index_word_list, len(index2word)

word2index, index2word, index_word_list, dic_length = construct_dictionary_and_word_transform(word_list)
print(len(word2index))
print(len(index2word))
print(index_word_list[0:100])
print(dic_length)

13741
13741
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 18, 24, 22, 25, 26, 22, 2, 27, 28, 29, 22, 30, 31, 22, 12, 32, 33, 22, 12, 34, 15, 35, 36, 22, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 22, 47, 23, 48, 49, 50, 22, 18, 51, 38, 52, 21, 12, 18, 53, 38, 54, 12, 23, 18, 55, 48, 56, 57, 58, 59, 60, 61, 62, 22, 63, 15, 64, 9, 65, 23, 59, 66, 22, 67, 68, 22]
13741
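
Since the model will later produce indices rather than words, it is worth checking that the mapping can be reversed. The following is a small self-contained sketch on a toy word list; `build_vocab`, `encode` and `decode` are hypothetical helper names, not functions used elsewhere in this tutorial.

```python
def build_vocab(words):
    """Assign each distinct word the next free integer index."""
    word2index, index2word = {}, {}
    for word in words:
        if word not in word2index:
            index = len(word2index)
            word2index[word] = index
            index2word[index] = word
    return word2index, index2word

def encode(words, word2index):
    return [word2index[w] for w in words]

def decode(indices, index2word):
    return " ".join(index2word[i] for i in indices)

words = ["the", "hand", "that", "mocked", "them", "and", "the", "heart"]
w2i, i2w = build_vocab(words)
ids = encode(words, w2i)
print(ids)               # repeated words reuse their first index
print(decode(ids, i2w))  # round trip back to text
```

The same `decode` idea can be applied to the real `index2word` dictionary to turn generated indices back into readable text.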


## 3. Using Dataset and DataLoader to prepare data

In the previous section, we finished the basic processing of the text and converted it into integers. However, since we want to train a neural network, we need more data preparation. With neural networks, we usually pack the data into batches before training. This is because creating a batch amounts to packing many single vectors into one large matrix, which lets us take advantage of matrix multiplication and GPU acceleration.

To learn more about batches, see:
https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
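
As a small illustration of why a batch is "many vectors packed into one matrix", the sketch below multiplies a weight matrix by three input vectors one at a time and then by all three at once. It uses plain Python lists so it runs anywhere; the helper names `matvec` and `matmul` are made up for this example, and in practice a tensor library does the batched product in one optimized call.

```python
def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a single vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def matmul(weights, batch):
    """Multiply a weight matrix by a batch whose columns are the input vectors."""
    n_cols = len(batch[0])
    return [
        [sum(w * batch[k][j] for k, w in enumerate(row)) for j in range(n_cols)]
        for row in weights
    ]

weights = [[1, 0, 2],
           [0, 3, 1]]
vectors = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# One multiplication per input vector.
one_by_one = [matvec(weights, v) for v in vectors]

# Pack the vectors as columns of one matrix and multiply once.
batch = [list(col) for col in zip(*vectors)]
batched = matmul(weights, batch)
batched_columns = [list(col) for col in zip(*batched)]

print(one_by_one == batched_columns)
```

Both routes give the same numbers. The batched form simply expresses the work as one matrix product, which is the operation GPUs are built to accelerate.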

PyTorch provides two classes that help us create our own dataset and batches clearly and easily.
First, let's start with Dataset.

The role of a Dataset is to override `__getitem__()` so that the DataLoader can automatically generate batches. The example here is simple, since we just need to return one item directly (one item consists of one training sample and its label). For more complex problems, where the data needs further processing, feel free to do that inside the Dataset.

PS: For this particular problem, the label of each word is the word that follows it.

Example:

data  : I    love you . 

label : love you  .

I -> love, love -> you, you -> .
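
The same shifting can be tried on a short list of indices. This is a minimal sketch of the slicing that the PoemsDataset class below performs, written with plain lists; the function name `make_windows` is hypothetical.

```python
def make_windows(indices, bptt):
    """Cut a sequence into (data, label) pairs of length bptt.

    The label window is the data window shifted right by one position.
    """
    n_windows = (len(indices) - 1) // bptt  # keep one item spare for the last label
    windows = []
    for i in range(n_windows):
        start = i * bptt
        data = indices[start : start + bptt]
        label = indices[start + 1 : start + 1 + bptt]
        windows.append((data, label))
    return windows

for data, label in make_windows(list(range(7)), bptt=3):
    print(data, "->", label)
```

With seven indices and `bptt=3` this yields two samples. The last index only ever appears as a label, which is why the length formula subtracts one before dividing.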

In [11]:
import torch.utils.data as data
import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable

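class PoemsDataset(data.Dataset):
    def __init__(self, data):
        self.data = np.array(data)
        # Length of each training sequence (number of words per sample).
        self.bptt = 35

    def __getitem__(self, i):
        # Sample i covers words [i * bptt, (i + 1) * bptt);
        # its label is the same window shifted by one word.
        i *= self.bptt
        return self.data[i : i + self.bptt], self.data[i + 1: i + 1 + self.bptt]

    def __len__(self):
        # Subtract one so that the last label window stays inside the data.
        return (len(self.data) - 1) // self.bptt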

Now let's turn to the DataLoader. Taking a dataset as input, a DataLoader can generate batches automatically. However, sometimes its default batches are not quite what we need, so we have to modify them. The way to do that is to override **collate_fn()**.

PyTorch's LSTM expects the sequence dimension first ([seq_len, batch_size, ...]) by default, so each batch has to be transposed. In this tutorial the transpose is done later, in the training loop, so the collate_fn() below leaves the batch unchanged.

In [131]:
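# collate_fn receives a list of batch_size samples, each a (data, label) pair.
# Here we do not need to do anything, so the DataLoader below simply uses
# PyTorch's default collate function, which stacks the samples into tensors.

def collate_fn(batch):
    return batch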
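
For reference, a collate function that does transpose each batch could look roughly like the sketch below. It works on plain lists so that it runs without PyTorch, and the name `transpose_collate` is hypothetical. With tensors one would stack the samples and transpose the result instead.

```python
def transpose_collate(batch):
    """Turn a list of (data, label) samples into two [seq_len][batch_size] lists."""
    data_rows = [data for data, _ in batch]
    label_rows = [label for _, label in batch]
    # zip(*rows) swaps the two axes: row j of the result holds position j of every sample.
    data_t = [list(col) for col in zip(*data_rows)]
    label_t = [list(col) for col in zip(*label_rows)]
    return data_t, label_t

toy_batch = [([0, 1, 2], [1, 2, 3]),
             ([3, 4, 5], [4, 5, 6])]
data_t, label_t = transpose_collate(toy_batch)
print(data_t)
print(label_t)
```

After the transpose, each inner list holds one time step for all samples in the batch, which is the sequence-first layout.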

Now we can use Dataset and DataLoader to create our batches.

In [12]:
batch_size = 10

dataset = PoemsDataset(index_word_list)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=False, drop_last=True)

count = 0
for (data, label) in data_loader:
    if count == 1:
        break
    print("Data: ")
    print(data)
    print("Label: ")
    print(label)
    count += 1

Data: 


Columns 0 to 12 
    0     1     2     3     4     5     6     7     8     9    10    11    12
   30    31    22    12    32    33    22    12    34    15    35    36    22
   12    18    53    38    54    12    23    18    55    48    56    57    58
   73    74    18    75    38    76    77    22    78    12    79    80    12
   18    94    95    18    96    97    98    22    99    95    18   100    23
  111    15   112     2    68   113   114    12   115    22   116   117    18
   18    91   132   133   134    18   135    22   136   137    17   138   139
   23   152   153   154   155    39   156    17   138   157   158   159   160
    2   177    22   178   179   180   181    23   118   182   178    22   152
   22   172   194   195   196   197   198   199    22    59   165   200   201

Columns 13 to 25 
   13    14    15    16    17    18    19    20    21    22    23    18    24
   37    38    39    40    41    42    43    44    45    46    22    47    23
   59    60    61  

## 4. Build and train an LSTM model

Here we build a simple LSTM model and train it on the data, to show how to use PyTorch to build a basic neural network. In the code you will see parts named encoder (embedding) and decoder. The encoder maps each word index to a small dense vector (an embedding), which gives the network a much denser input than a one-hot vector over the whole vocabulary, and the decoder maps the LSTM output back to a score for each word in the vocabulary.

To learn more about embeddings, see:
http://deeplearning.net/tutorial/rnnslu.html

As you can see from the code, a simple neural network in PyTorch consists of an `__init__` function and a `forward` function. The first defines the basic architecture of the network, and the second defines the operations performed during the forward pass.

In [152]:
class LSTMModel(torch.nn.Module):
    # A model with an encoder, an LSTM module, and a decoder.
    def __init__(self, vocab_size, hidden_size, nlayers):
        super(LSTMModel, self).__init__()
        self.encoder = nn.Embedding(vocab_size, hidden_size)
        self.LSTM = nn.LSTM(hidden_size, hidden_size, nlayers)
        self.decoder = nn.Linear(hidden_size, vocab_size)
        # Tie the weights: the decoder reuses the embedding matrix.
        self.decoder.weight = self.encoder.weight

    def forward(self, data, h):
        # data: [seq_len, batch_size] word indices; h: previous hidden state or None.
        embedding = self.encoder(data)
        o, h = self.LSTM(embedding, h)
        # Flatten to [seq_len * batch_size, hidden_size] for the linear layer,
        # then restore the [seq_len, batch_size, vocab_size] shape.
        decoded = self.decoder(o.view(o.size(0) * o.size(1), o.size(2)))
        return decoded.view(o.size(0), o.size(1), decoded.size(1)), h

After building the model, let's begin to train it. For a model, you need to specify its loss function, hidden size, number of layers, and so on. For simplicity, these parameters are all set to simple values here.

In [157]:
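def train(data_loader, batch_size, learning_rate, dic_length):
    model = LSTMModel(vocab_size=dic_length, hidden_size=100, nlayers=1)

    loss_function = nn.CrossEntropyLoss()
    # An optimizer is needed to actually update the weights;
    # Adam is an assumption here, any torch.optim optimizer would do.
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    model.train()
    total_loss = 0.0

    h = None
    for batch, (data, label) in enumerate(data_loader):
        # Detach the hidden state so gradients do not flow back into earlier batches.
        if h is not None:
            h = tuple(state.detach() for state in h)
        optimizer.zero_grad()
        # The LSTM expects [seq_len, batch_size], so transpose data and label alike.
        o, h = model(data.t().contiguous().long(), h)
        loss = loss_function(o.view(-1, dic_length), label.t().contiguous().view(-1).long())
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        if batch % 100 == 0 and batch > 0:
            cur_loss = total_loss / 100
            print('loss:{}'.format(cur_loss))
            total_loss = 0.0

    return model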

In [None]:
model = train(data_loader, batch_size, 0.001, dic_length)
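
To help read the loss values printed during training, the sketch below computes the cross-entropy for a single prediction by hand: it is the negative log-probability that a softmax over the scores assigns to the correct word. The function name `cross_entropy` is made up for this example, and it handles one prediction only, whereas `nn.CrossEntropyLoss` averages this quantity over all predictions in a batch by default.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]

# A model that scores every word equally has loss ln(vocabulary size).
vocab_size = 13741
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)
print(uniform_loss)

# A confident, correct prediction gives a much smaller loss.
confident_loss = cross_entropy([10.0, 0.0, 0.0], target=0)
print(confident_loss)
```

With the 13741-word vocabulary built earlier, an untrained model should therefore start with a loss of roughly 9.5, and training is working if the printed loss falls clearly below that.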

## 5. Using the LSTM model to generate poems

Now the exciting part! Let's use our trained model to generate poems!

In [145]:
# forward is the number of words we want to generate
# start_sequence lets us choose some word indices to begin with (it must not be empty)
def generation(model, forward, start_sequence, dic_length):
    model.eval()

    result = [0] * forward
    hidden = None
    single_input = torch.rand(1, 1).mul(dic_length).long()

    # Gradients are not needed for generation.
    with torch.no_grad():
        # Feed the start sequence to build up the hidden state.
        for word_idx in start_sequence:
            single_input.fill_(int(word_idx))
            output, hidden = model(single_input, hidden)

        for i in range(forward):
            # Sample the next word with probability proportional to exp(score).
            word_weights = output.squeeze().div(1).exp().cpu()
            word_idx = torch.multinomial(word_weights, 1)[0].item()
            result[i] = word_idx
            # Feed the sampled word back in to get the scores for the next one.
            single_input.fill_(word_idx)
            output, hidden = model(single_input, hidden)

    return result

In [None]:
generation(model, 10, [0, 2], dic_length)
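
The sampling step inside generation() can be illustrated without a trained model. In the sketch below, the `.div(1)` from the code above is generalised to a `temperature` parameter, so dividing by 1 corresponds to a temperature of 1. The function name `sample_index` and the use of the standard library's `random.choices` in place of `torch.multinomial` are assumptions for this example.

```python
import math
import random

def sample_index(logits, temperature=1.0, rng=random):
    """Sample an index with probability proportional to exp(logit / temperature)."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the maximum to avoid overflow in exp
    weights = [math.exp(z - max_scaled) for z in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [1.0, 2.0, 0.5]  # made-up scores for a three-word vocabulary

# Temperature 1 samples from the plain softmax distribution.
samples = [sample_index(logits, temperature=1.0, rng=rng) for _ in range(1000)]
print([samples.count(i) for i in range(len(logits))])

# A low temperature makes the highest-scoring word dominate.
print(sample_index([0.0, 50.0, 0.0], temperature=0.1, rng=rng))
```

A lower temperature gives safer, more repetitive text, while a higher one gives more varied but less coherent output. The indices returned by generation() can then be mapped back to words with the index2word dictionary built earlier.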

## References

[1] https://github.com/pytorch/examples/tree/master/word_language_model

[2] http://deeplearning.net/tutorial/rnnslu.html

[3] https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network

[4] http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns

[5] http://colah.github.io/posts/2015-08-Understanding-LSTMs