# Long Short-Term Memory (LSTM) with Keras (1)

We will use an LSTM to generate text character by character.

In [1]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
from keras.models import load_model
import numpy as np
import random
import sys
import json


Using TensorFlow backend.


#### Configuration

In [2]:
num_epochs = 10
batch_size = 128
generate_len = 160

seq_len = 60
step = 6

model_exists = False
model_name = "char_rnn_4.h5"

The dataset consists of Trump's tweets from 01/2015 to 07/2017.
The data were downloaded from https://github.com/bpb27/trump_tweet_data_archive.

In [3]:
# extract text from the tweets and store everything in one long string
import re
pattern = re.compile(r'text": "(.+?)", "created')

with open('condensed_201567.json', 'r') as content_file:
    content = content_file.read()
    #print(content[:1000])
    tweets = re.findall(pattern, content)
    
print(len(tweets))

text = " ".join(tweets)
text[:1000]

12894


'Heading back to Washington, D.C. Much will be accomplished this week on trade, the military and security! Congratulations to Sung Hyun Park on winning the 2017 @USGA #USWomensOpen\\ud83c\\uddfa\\ud83c\\uddf8 I am at the @USGA  #USWomensOpen. An amateur player is co-leading for the first time in many decades - very exciting! The #USSJohnFinn will provide essential capabilities to keep America safe. Our sailors are the best anywhere in the world. Congratulations! https://t.co/yTnMwSh1Kg The ABC/Washington Post Poll, even though almost 40% is not bad at this time, was just about the most inaccurate poll around election time! With all of its phony unnamed sources &amp; highly slanted &amp; even fraudulent reporting, #Fake News is DISTORTING DEMOCRACY in our country! Thank you to former campaign adviser Michael Caputo for saying so powerfully that there was no Russian collusion in our winning campaign. Thank you to all of the supporters, who far out-numbered the protesters, yesterday at th
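The extracted text above still contains raw JSON escapes (`\\ud83c`) and HTML entities (`&amp;`), because the regular expression treats the file as plain text. A minimal sketch of an alternative follows. It assumes the file is a JSON array of objects that each carry a `text` field, which is inferred from the regex pattern and not verified against the file. `extract_tweets` and `sample_json` are hypothetical names used only for illustration.

```python
import html
import json


def extract_tweets(raw_json):
    """Return decoded tweet texts from a JSON array of tweet objects (assumed layout)."""
    records = json.loads(raw_json)  # decodes \uXXXX escapes, including surrogate pairs
    # html.unescape turns entities such as &amp; back into plain characters
    return [html.unescape(r["text"]) for r in records if "text" in r]


# tiny stand-in for the real file content
sample_json = '[{"text": "Fake &amp; slanted \\ud83c\\uddfa\\ud83c\\uddf8", "created_at": "x"}]'
tweets_decoded = extract_tweets(sample_json)
print(ascii(tweets_decoded[0]))  # ascii() keeps the output printable on any console
```

In the notebook this would be called on `content`. Decoding changes the set of characters in `text`, so the vocabulary size and sequence counts reported below would no longer match.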

In [4]:
# create dictionaries for char -> index lookup and index -> char lookup, respectively
unique_chars = sorted(set(text))
print('Total unique chars:', len(unique_chars))
char2index = dict((c, i) for i, c in enumerate(unique_chars))
index2char = dict((i, c) for i, c in enumerate(unique_chars))
#char2index
#index2char

Total unique chars: 92
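To make the two lookup tables concrete, here is a small round trip on a toy string. The `toy_` names are illustrative and not part of the notebook.

```python
toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_char2index = {c: i for i, c in enumerate(toy_chars)}
toy_index2char = {i: c for i, c in enumerate(toy_chars)}

# encode the string as integer indices, then decode it again
encoded = [toy_char2index[c] for c in toy_text]
decoded = "".join(toy_index2char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```

Because the characters are sorted before enumeration, the mapping is reproducible across runs as long as the text does not change.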


In [5]:
# generate training data
# length of every sequence will be seq_len
# degree of overlap is determined by step

# this will yield X_train
seqs = []
# this will yield y_train
next_chars = []

for i in range(0, len(text) - seq_len, step):
    seqs.append(text[i: i + seq_len])
    next_chars.append(text[i + seq_len])
print('Number of training sequences: ', len(seqs))


Number of training sequences:  253041


In [6]:
seqs[:10]

['Heading back to Washington, D.C. Much will be accomplished t',
 'g back to Washington, D.C. Much will be accomplished this we',
 ' to Washington, D.C. Much will be accomplished this week on ',
 'shington, D.C. Much will be accomplished this week on trade,',
 'on, D.C. Much will be accomplished this week on trade, the m',
 'C. Much will be accomplished this week on trade, the militar',
 'h will be accomplished this week on trade, the military and ',
 ' be accomplished this week on trade, the military and securi',
 'complished this week on trade, the military and security! Co',
 'shed this week on trade, the military and security! Congratu']
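The same windowing logic is easier to follow on a toy string. The sketch below wraps the loop from cell 5 in a hypothetical helper, `make_windows`, and applies it with a short window and step.

```python
def make_windows(text, seq_len, step):
    """Slide a window of length seq_len over text, moving step characters each time."""
    seqs, next_chars = [], []
    for i in range(0, len(text) - seq_len, step):
        seqs.append(text[i:i + seq_len])      # input sequence
        next_chars.append(text[i + seq_len])  # character the model should predict
    return seqs, next_chars


toy_seqs, toy_next = make_windows("abcdefghij", seq_len=4, step=3)
print(toy_seqs)
print(toy_next)
```

A smaller `step` produces more, heavily overlapping training sequences. A `step` equal to `seq_len` produces no overlap at all.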

In [7]:
# Prepare the X and y matrices for training
# X shape is (number of sequences, seq_len, number of unique characters)
X = np.zeros((len(seqs), seq_len, len(unique_chars)), dtype=bool)
X.shape

(253041, 60, 92)

In [8]:
# y shape is (number of sequences,  number of unique characters)
y = np.zeros((len(seqs), len(unique_chars)), dtype=bool)
y.shape

(253041, 92)

In [9]:
# Fill the X and y matrices, one-hot-encoding the characters
# this yields the <number of unique characters> features for the LSTM
for i, s in enumerate(seqs):
    for j, char in enumerate(s):
        X[i, j, char2index[char]] = 1
    # the target is set once per sequence, not once per character
    y[i, char2index[next_chars[i]]] = 1
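With roughly 250,000 sequences of 60 characters each, the nested Python loops are slow. One possible alternative, shown here on toy data, is to build integer index arrays first and then fill the one-hot tensors with NumPy fancy indexing. The `toy_` names are illustrative, and this is a sketch, not the notebook's method.

```python
import numpy as np

toy_seqs = ["abc", "bca"]
toy_next = ["a", "b"]
toy_char2index = {"a": 0, "b": 1, "c": 2}

# integer index of every character: shape (n_seqs, seq_len)
idx = np.array([[toy_char2index[c] for c in s] for s in toy_seqs])
# integer index of every target character: shape (n_seqs,)
next_idx = np.array([toy_char2index[c] for c in toy_next])

n_seqs, toy_seq_len = idx.shape
X_toy = np.zeros((n_seqs, toy_seq_len, len(toy_char2index)), dtype=bool)
y_toy = np.zeros((n_seqs, len(toy_char2index)), dtype=bool)

# the row and column indices broadcast against idx, so each (i, j) position gets one True
rows = np.arange(n_seqs)[:, None]
cols = np.arange(toy_seq_len)[None, :]
X_toy[rows, cols, idx] = True
y_toy[np.arange(n_seqs), next_idx] = True

print(X_toy.astype(int))
print(y_toy.astype(int))
```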

In [10]:
if not model_exists:
    model = Sequential()
    # LSTM input is shaped (batch_size, timesteps, input_dim) where input_dim == number of features
    model.add(LSTM(128, input_shape=(seq_len, len(unique_chars))))
    model.add(Dense(len(unique_chars)))
    model.add(Activation('softmax'))
    model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               113152    
_________________________________________________________________
dense_1 (Dense)              (None, 92)                11868     
_________________________________________________________________
activation_1 (Activation)    (None, 92)                0         
Total params: 125,020
Trainable params: 125,020
Non-trainable params: 0
_________________________________________________________________
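The parameter counts in the summary can be checked by hand. An LSTM layer has four weight sets (input gate, forget gate, cell candidate, output gate), each with input weights, recurrent weights, and a bias. The helper names below are illustrative.

```python
def lstm_param_count(units, input_dim):
    """Four gates, each with (input_dim + units) * units weights plus units biases."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(inputs, outputs):
    """One weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs


lstm_params = lstm_param_count(units=128, input_dim=92)
dense_params = dense_param_count(inputs=128, outputs=92)
print(lstm_params, dense_params, lstm_params + dense_params)
```

The results agree with the summary: 113,152 for the LSTM layer, 11,868 for the dense layer, and 125,020 in total.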


In [11]:
if not model_exists:
    optimizer = RMSprop(lr=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)

In [12]:
# this function rescales the probabilities returned by the network before sampling from them
# temperature > 1 flattens the distribution, making low-probability characters more likely
# temperature < 1 sharpens the distribution, favoring high-probability characters even more

def sample(preds, temperature=1.0):
    preds = preds.astype('float64')
    #print("Original preds: {}".format(preds))
    preds = np.log(preds) / temperature
    preds = np.exp(preds)
    preds = preds / np.sum(preds)
    #print("Adjusted preds: {}".format(preds))
    outcome = np.random.multinomial(1, preds, 1)
    draw = np.argmax(outcome)
    #print("Multinomial draw: {} - max index is {}".format(outcome, draw))
    return draw

In [13]:
# illustrate sample function
#preds = np.array([0.15, 0.2, 0.5, 0.15])

#print(sample(preds, temperature = 0.2))
#print(sample(preds, temperature = 0.5))
#print(sample(preds, temperature = 1))
#print(sample(preds, temperature = 1.2))
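To see the effect of the temperature without the randomness of the multinomial draw, the rescaling step can be inspected on its own. `apply_temperature` is a hypothetical helper that repeats the first three steps of `sample` and returns the adjusted distribution.

```python
import numpy as np


def apply_temperature(preds, temperature):
    """Rescale a probability vector: p_i ** (1 / temperature), renormalized to sum to 1."""
    logits = np.log(np.asarray(preds, dtype="float64")) / temperature
    scaled = np.exp(logits)
    return scaled / scaled.sum()


toy_preds = [0.15, 0.2, 0.5, 0.15]
for t in (0.2, 0.5, 1.0, 1.2):
    print(t, np.round(apply_temperature(toy_preds, t), 3))
```

At a temperature of 1.0 the distribution is unchanged. Lower values concentrate almost all of the mass on the most likely character, and values above 1.0 move mass toward the less likely ones.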


In [14]:
# Generate text after every epoch, to allow for comparisons
# For every epoch, text is generated using different temperatures/diversities
def generate():
    
    # pick a random seed sequence from the corpus
    start_index = random.randint(0, len(text) - seq_len - 1)
    seed = text[start_index: start_index + seq_len]
    print("####################################################################")
    print('#####    Seed: "' + seed + '"    #####')
    print("####################################################################")

    for diversity in [0.1, 0.3, 0.6, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity, " -----")

        generated = ''
        seed = text[start_index: start_index + seq_len]
        generated += seed

        for i in range(generate_len):

            # prepare the test input data
            x = np.zeros((1, seq_len, len(unique_chars)))
            for j, char in enumerate(seed):
                x[0, j, char2index[char]] = 1.
                
            preds = model.predict(x)[0]
            
            next_index = sample(preds, diversity)
            next_char = index2char[next_index]

            generated += next_char
            seed = seed[1:] + next_char
        print(generated)
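The core of `generate` is a rolling window: after each prediction, the oldest character is dropped and the new one is appended. The sketch below isolates that loop and replaces the trained model with a stub that returns a uniform distribution, so it runs without Keras. `generate_with` and `uniform_predict` are hypothetical names, and sampling uses NumPy's `Generator.choice` in place of the notebook's `sample` function.

```python
import numpy as np


def generate_with(predict_fn, seed, length, index2char, rng):
    """Generate `length` characters, feeding each prediction back into the window."""
    generated = seed
    window = seed
    for _ in range(length):
        probs = predict_fn(window)                         # distribution over characters
        next_index = int(rng.choice(len(probs), p=probs))  # draw one character index
        next_char = index2char[next_index]
        generated += next_char
        window = window[1:] + next_char                    # slide the window by one
    return generated


toy_index2char = {0: "a", 1: "b", 2: "c"}


def uniform_predict(window):
    # stand-in for model.predict: every character is equally likely
    return np.full(3, 1.0 / 3.0)


out = generate_with(uniform_predict, "abc", 5, toy_index2char, np.random.default_rng(0))
print(out)
```

The window keeps a constant length, so the model always sees an input of the same shape it was trained on.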
    

In [15]:
if not model_exists:

    # train the model, output generated text after each iteration
    for iteration in range(0, num_epochs):
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X, y,
                  batch_size=batch_size,
                  epochs=1)
        model.save("char_rnn_{}.h5".format(iteration))
        generate()

else:
    model = load_model(model_name)
    generate()


--------------------------------------------------
Iteration 0
Epoch 1/1
####################################################################
#####    Seed: "e- no cred! Why does FOX put him on? .@lindseygraham, who ha"    #####
####################################################################

----- diversity: 0.1  -----
e- no cred! Why does FOX put him on? .@lindseygraham, who have the @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump 

----- diversity: 0.3  -----
e- no cred! Why does FOX put him on? .@lindseygraham, who have is the great support to know the going the support the great said a mad say this show @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @re

----- diversity: 0.6  -----
e- no cred! Why does FOX put him on? .@lindseygraham, who has guy be and my beall the DE PAN I.CL campaign Benca and the me in the leasting it is the great and I am with th

####################################################################
#####    Seed: "e highly respected new national poll that just came out via "    #####
####################################################################

----- diversity: 0.1  -----
e highly respected new national poll that just came out via @realDonaldTrump with the Was @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonald

----- diversity: 0.3  -----
e highly respected new national poll that just came out via @realDonaldTrump is a the worse and back and but her country and the Fall on her being on @realDonaldTrump @realDonaldTrump @realDonaldTrump @realDonaldTrump @r

----- diversity: 0.6  -----
e highly respected new national poll that just came out via @DonaldJTrump\u2019 Great @realDonaldTrump to make the tough hate formillely by my more his the hirty for that supporters and supporter and his big a big reacti

----- diversity: 1.0  -----
[output truncated]

ary Clinton. What she did was wrong! What Bill did was stupide to have been the was the makes will be a great support to get the believates and problem! \"@Hepichanninger: @realDonaldTrump @realDonaldTrump #CelebApprenti

----- diversity: 0.6  -----
ary Clinton. What she did was wrong! What Bill did was stupides to #BollNews to the world\" \"@Andersovan: @realDonaldTrump our appringion! We woral debate about for no our job he is from the vote our crose want on @DanS

----- diversity: 1.0  -----
ary Clinton. What she did was wrong! What Bill did was stuping win inte fer lampine given En start!\" \"@GloranonnapaneGh:!!!\" \"@trhanukeonyfordo: @realDonaldTrump How case of North I arcies. I hope they condoess Iran 

----- diversity: 1.2  -----
ary Clinton. What she did was wrong! What Bill did was stupided lool Duffoon! \"@DChe2: Marons to @jovenoveryzadin does what ChrisLy A. dandy Hillary...Sleds. Bloe 'Bf Give Hispordanis! He is races its a onne you \nis m.

----------------------------