Objective:
My objective for this machine learning project is to create a Recurrent Neural Network (RNN) that can generate (somewhat) legible poetry from a collection of poems, with an external GAN framework used to improve the RNN's output. To do so, I take a Kaggle dataset of 500+ poems from the Renaissance and modern eras of poetry and use it as the 'real' data for the discriminator.

Tools:
Libraries include pandas, numpy, keras.preprocessing for the tokenizer (discussed below), the GloVe word vectors (below), and keras.layers/keras.models for the neural network architecture and training.

Several hygiene steps are needed to bring every poem to the same length, which means padding the poems that are shorter than the longest one. Instead of working with raw words, I run each poem through a tokenizer that maps every word to an integer, so each poem becomes a sequence of integers rather than strings. Further, to help the network write legible poetry, I use the pre-trained GloVe vectors from https://nlp.stanford.edu/projects/glove/ to map each word into a higher-dimensional (50-dimensional) space, where words with similar meanings should have similar vectors.
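
As a rough illustration of the encoding and padding idea, here is a minimal pure-Python sketch. The helper names `build_word_index` and `encode_and_pad` are hypothetical and only approximate what the Keras `Tokenizer` and `pad_sequences` do in this notebook; the two toy poems are made up, and index 0 is assumed to be reserved for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first (1-based; 0 is padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, breaking ties alphabetically for determinism
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def encode_and_pad(texts, word_index):
    """Turn texts into equal-length integer lists, padding with 0 at the end."""
    sequences = [[word_index[w] for w in t.lower().split()] for t in texts]
    maxlen = max(len(s) for s in sequences)
    return [s + [0] * (maxlen - len(s)) for s in sequences]

toy_poems = ["the fog comes", "the fog sits on silent haunches"]
word_index = build_word_index(toy_poems)
padded = encode_and_pad(toy_poems, word_index)
print(padded)
```

Padding at the end (rather than the start) mirrors the `padding='post'` choice made further down.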

Both the generator and the discriminator are RNNs, each with a single LSTM layer that processes every word in the context of the words that precede it in the sequence. The discriminator takes samples of real poems and of fully generated poems and tries to classify real from fake.


Link to data: https://www.kaggle.com/ishnoor/poetry-analysis-with-machine-learning

References: some hygiene steps came from https://www.kaggle.com/hsankesara/mr-poet and some of the GAN framework code came from https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.5-introduction-to-gans.ipynb

In [1]:
import pandas as pd
import numpy as np

poems = pd.read_csv("all.csv")

### Add character length column

In [2]:
# Vectorized character count; this avoids the chained assignment
# poems['length'][i] = ..., which triggers pandas' SettingWithCopyWarning
poems['length'] = poems['content'].str.len()


### Clean data by deleting null entries and non-poems

In [3]:
poems = poems.sort_values(by='length')  # Sort by length of poem
poems = poems[14:len(poems)-5]  # Delete the tails on both sides
# Eliminate non-poems containing 'Published'
poems = poems[poems['content'].str.contains('Published') == False]
print(len(poems))
# Eliminate non-poems containing 'from Selected Poems'
poems = poems[poems['content'].str.contains('from Selected Poems') == False]
print(len(poems))
# Eliminate non-poems containing 'Collected Poems'
poems = poems[poems['content'].str.contains('Collected Poems') == False]
print(len(poems))
# Eliminate entries where the content is just an intro: it repeats the
# author's name or has the poem's title within its first 40 characters
intro_rows = [ind for ind, row in poems.iterrows()
              if row['author'] in row['content'].upper()
              or str(row['poem name']) in row['content'][:40]]
poems = poems.drop(intro_rows)
print(len(poems))

552
536
518
465


### Hygiene: only poems between 100 & 1000 in character length

In [4]:
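# Keep only poems whose character length is strictly between 100 and 1000.
# Both conditions go into one mask so the mask always matches the data it filters.
length_mask = (poems['length'] > 100) & (poems['length'] < 1000)
poem = poems.loc[length_mask, 'content'].reset_index(drop=True)
X = poem
num_poems = len(poem)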

### Create vocab size and word dictionary

In [5]:
import re

# Join all poems into one lowercase string
poem = ' '.join(poem).lower()
# Split into word tokens and individual punctuation marks
poem = re.findall(r'[\w]+|[\'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]',poem)
words = list(set(poem))  # unique tokens
vocab_size = len(words)
#print(vocab_size)
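
To make the tokenizing pattern concrete, the following small sketch applies the same regular expression to a short sample (the opening of the first poem shown further down). The names `TOKEN_PATTERN`, `sample`, and `tokens` are illustrative only.

```python
import re

# Same pattern as in the cell above: a run of word characters, or one punctuation mark
TOKEN_PATTERN = r'[\w]+|[\'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~]'

sample = 'The fog comes\r\non little cat feet.'
tokens = re.findall(TOKEN_PATTERN, sample.lower())

print(tokens)            # line breaks are dropped, the final period is its own token
print(len(set(tokens)))  # vocabulary size of this sample
```

Because whitespace matches neither alternative, the `\r\n` line breaks disappear, while each punctuation mark is kept as a separate token.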


In [6]:
print(X.describe())

count                                                   349
unique                                                  311
top       Potuia, potuia\r\nWhite grave goddess,\r\nPity...
freq                                                      3
Name: content, dtype: object


In [7]:
X[0]

'The fog comes\r\non little cat feet.\r\n\r\nIt sits looking\r\nover harbor and city\r\non silent haunches\r\nand then moves on.'

In [8]:
# Replace Windows line breaks with spaces in every poem at once
X = X.str.replace("\r\n", " ", regex=False)

### Feature Engineering: Convert words to integers (tokenizer)  

In [9]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [10]:
tokenizer = Tokenizer(num_words=vocab_size)

In [11]:
tokenizer.fit_on_texts(X)

In [12]:
text = tokenizer.texts_to_sequences(X)
maxlen = max(len(seq) for seq in text)  # length of the longest poem, in tokens
# Pad shorter poems with trailing zeros so every row has maxlen entries
text = pad_sequences(text, maxlen=maxlen, padding='post')

In [13]:
word_dict = tokenizer.word_index

In [14]:
maxwords = len(word_dict)

### Map words to 50-dim vector (embedding_matrix)

In [15]:
embedding_matrix = np.zeros((maxwords+1,50))

In [16]:
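# Fill each vocabulary word's row with its GloVe vector;
# words that do not appear in GloVe keep an all-zero row
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        if word in word_dict:
            embedding_matrix[word_dict[word]] = np.asarray(values[1:], dtype='float32')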
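
One way to sanity-check the claim that similar words get similar vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors rather than real GloVe rows, and `cosine_similarity` and `toy_vectors` are hypothetical names introduced for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional vectors for illustration only; real GloVe rows here have 50 values
toy_vectors = {
    "sea":   np.array([0.9, 0.1, 0.0]),
    "ocean": np.array([0.8, 0.2, 0.1]),
    "clock": np.array([0.0, 0.1, 0.9]),
}

sim_related = cosine_similarity(toy_vectors["sea"], toy_vectors["ocean"])
sim_unrelated = cosine_similarity(toy_vectors["sea"], toy_vectors["clock"])
print(sim_related, sim_unrelated)
```

With the notebook's own data, the same function could presumably be called on `embedding_matrix[word_dict[w]]` for two words `w` that both occur in the vocabulary and in GloVe; rows left at zero would need to be skipped, since their norm is zero.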

### Create 349 x 177 x 50 X_train matrix

In [17]:
# Look up the embedding of every token in every poem.
# Result shape: (number of poems, maxlen, 50), i.e. (349, 177, 50) for this data
x_train = embedding_matrix[text].astype('float32')
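
The lookup that builds `x_train` can be written either as an explicit per-word loop or as a single NumPy integer-array index. The sketch below, on small made-up arrays (`toy_embedding` and `toy_text` are stand-ins, not the notebook's data), suggests the two give the same tensor.

```python
import numpy as np

# Toy stand-ins (assumed values): 4 "words" (row 0 is padding), 3-dim embeddings
toy_embedding = np.array([[0.0, 0.0, 0.0],
                          [0.1, 0.2, 0.3],
                          [0.4, 0.5, 0.6],
                          [0.7, 0.8, 0.9]], dtype='float32')
# Two padded "poems" of length 3, stored as integer word indices
toy_text = np.array([[1, 2, 0],
                     [3, 1, 2]])

# Explicit loop: copy one embedding row per word position
looped = np.zeros((2, 3, 3), dtype='float32')
for indp, seq in enumerate(toy_text):
    for indw, word in enumerate(seq):
        looped[indp, indw, :] = toy_embedding[word]

# Integer-array indexing performs the same lookup in one step
indexed = toy_embedding[toy_text]

print(indexed.shape)
print(np.array_equal(looped, indexed))
```

Padding positions (index 0) map to the all-zero row, so padded time steps reach the network as zero vectors.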

## Network 

In [18]:
from keras.models import Sequential, Model
from keras.layers import Embedding, LSTM, Dropout, TimeDistributed, Dense, Activation, Input
from keras.optimizers import RMSprop


num_steps = 177
#hidden_size = 350
feature_dim = 50

### Generator

In [19]:
generator_input = Input(shape=(num_steps,1))

x = LSTM(feature_dim, return_sequences=True)(generator_input)
#x = TimeDistributed(Dense(feature_dim))(x)

generator = Model(generator_input, x)

### Discriminator

In [20]:
discriminator_input = Input(shape=(num_steps,feature_dim))

x = LSTM(feature_dim)(discriminator_input) 
#x = LSTM(hidden_size, return_state=True)(x)
x = Dense(1, activation='sigmoid')(x)

discriminator = Model(discriminator_input, x)

discriminator_optimizer = RMSprop(lr=0.0008, clipvalue=1.0) #decay=1e-8
discriminator.compile(optimizer=discriminator_optimizer, loss='binary_crossentropy')

### GAN Framework

In [21]:
from keras import backend

# Set discriminator weights to non-trainable
# (will only apply to the `gan` model)
discriminator.trainable = False

gan_input = Input(shape=(num_steps,1))
gen_output = generator(gan_input)
gan_output = discriminator(gen_output)
gan = Model(gan_input, gan_output)

gan_optimizer = RMSprop(lr=0.0004, clipvalue=1.0) #decay=1e-8
gan.compile(optimizer=gan_optimizer, loss='binary_crossentropy')
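
Both models are compiled with binary cross-entropy. The following NumPy sketch of that loss is an approximation written for illustration, not Keras's own implementation (the function name and the `eps` clipping value are assumptions). It shows that the loss is non-negative for targets in [0, 1] but can fall below zero once a target exceeds 1, which can happen when noise is added to a label of 1.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped away from 0 and 1."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1.0 - y_true) * np.log(1.0 - y_pred))))

confident = np.array([0.999])  # the model is almost certain the label is 1

loss_hard = binary_crossentropy(np.array([1.0]), confident)   # target exactly 1
loss_soft = binary_crossentropy(np.array([1.05]), confident)  # target nudged above 1

print(loss_hard)  # small positive number
print(loss_soft)  # negative, because (1 - y_true) is negative here
```

If negative loss values are unwanted, one option would be to keep noisy labels inside [0, 1], for example by subtracting the noise from the 1-labels instead of adding it.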

### Training Steps

In [22]:
iterations = 500
batch_size = 30

# Start training loop
start = 0
for step in range(iterations):
    # Sample random points in the latent space
    random_latent_vectors = np.random.normal(size=(batch_size,num_steps, 1))

    # Decode them to fake poems
    generated_poems = generator.predict(random_latent_vectors)

    # Combine them with real poems
    stop = start + batch_size
    real_poems = x_train[start: stop]
    combined_poems = np.concatenate([generated_poems, real_poems])

    # Assemble labels discriminating real from fake poems
    labels = np.concatenate([np.ones((batch_size, 1)),
                             np.zeros((batch_size, 1))])
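    # Add random noise to the labels - important trick!
    # Note: the noise lies in [0, 0.05), so the "1" labels end up slightly
    # above 1, which lets binary cross-entropy drop below zero
    labels += 0.05 * np.random.random(labels.shape)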

    # Train the discriminator
    d_loss = discriminator.train_on_batch(combined_poems, labels)

    # sample random points in the latent space
    random_latent_vectors = np.random.normal(size=(batch_size, num_steps, 1))

    # Assemble labels that say "all real poems"
    misleading_targets = np.zeros((batch_size, 1))

    # Train the generator (via the gan model,
    # where the discriminator weights are frozen)
    g_loss = gan.train_on_batch(random_latent_vectors, misleading_targets)
    
    start += batch_size
    if start > len(x_train) - batch_size:
        start = 0
    
    if step % 10 == 0:
        # Save model weights
        #gan.save_weights('gan.h5')

        # Print metrics
        print('discriminator loss at step %s: %s' % (step, d_loss))
        print('generator loss at step %s: %s' % (step, g_loss))



discriminator loss at step 0: 0.692512
generator loss at step 0: 0.705456
discriminator loss at step 10: 0.69128674
generator loss at step 10: 0.7360985
discriminator loss at step 20: 0.69160616
generator loss at step 20: 0.74496174
discriminator loss at step 30: 0.68997955
generator loss at step 30: 0.7435515
discriminator loss at step 40: 0.68980426
generator loss at step 40: 0.7476873
discriminator loss at step 50: 0.67291605
generator loss at step 50: 0.83934313
discriminator loss at step 60: 0.106737964
generator loss at step 60: 2.5219164
discriminator loss at step 70: 0.042421106
generator loss at step 70: 3.4494157
discriminator loss at step 80: 0.015179757
generator loss at step 80: 4.2981086
discriminator loss at step 90: 0.005743395
generator loss at step 90: 5.1372514
discriminator loss at step 100: -0.017270032
generator loss at step 100: 6.065909
discriminator loss at step 110: -0.031956997
generator loss at step 110: 7.2712293
discriminator loss at step 120: -0.032872453