OUTLINE OF FINAL REPORT?

- Data
- Sentiment Analysis
- Character-level Text Generation
 - Maximum likelihood language model
 - LSTMs
 - Prediction methods (temperature, etc. -- could try beam search or ensemble methods)
- Word-level Text Generation
 - (Embedding methods?)
 - Maximum likelihood language model
 - LSTMs
 - Prediction methods (temperature, etc.)
 - (Look at some form of variable importance for the final models?)
- Conclusions and Future Directions

In [2]:
import os
import sys
os.environ['KERAS_BACKEND'] = 'tensorflow'

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Lambda
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


In [3]:
# Load in STAT 110 for now; convert to lower case
filename = "cleaned_data/STAT/STAT 110.txt"
# Use a context manager so the file handle is closed after reading
with open(filename) as f:
    raw_text = f.read()
raw_text = raw_text.lower()

## Simple character-level maximum likelihood language model

In [18]:
# Try a (not-so-)cute baseline: a maximum likelihood (n-gram) language model.
# Output looks pretty good locally, but the grammar is bad over longer spans?
# Code adapted from http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

from collections import Counter, defaultdict
from random import random

def train_char_lm(data, order = 4):    
    # Initialize dictionary to hold sequences and their probable next letters
    lm = defaultdict(Counter)
    # Pad the data to start
    pad = "~" * order
    data = pad + data
    
    # Loop over every sequence in the corpus, tracking the letters that tend to appear after each sequence
    for i in range(len(data)-order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char]+=1
    # Normalize the counts into probabilities
    def normalize(counter):
        s = float(sum(counter.values()))
        return [(c, cnt/s) for c, cnt in counter.items()]
    outlm = {hist:normalize(chars) for hist, chars in lm.items()}
    return outlm

def generate_letter(lm, history, order):
    # Get previous sequence for which we'll predict the next char
    history = history[-order:]
    # Get distribution of probable chars to follow
    dist = lm[history]
    # Roll the dice to generate next char with the probability given in the dist
    x = random()
    for c, v in dist:
        x = x - v
        if x <= 0:
            return c
    # Floating-point rounding can leave x slightly above 0 after the loop;
    # fall back to the last char instead of returning None
    return c

def generate_text(lm, order, nletters=1000):
    # Initialize with padding tildes
    history = "~" * order
    out = []
    # Generate letters
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)

lm = train_char_lm(raw_text, order = 10)
print(generate_text(lm, 10))


make sure to do plenty of cuties, most of the problems. this means you need to devote to this course with very difficult but very hard. but one of the toughest course i have take a high school. they aren't necessarily translate into the night to solve the problem sets! don't start the problem solving skills and a strong math background knowledge is invaluable.
only take a pass/fail. i should have failed the course simply because of scheduling and it was really good compared to other math class will be easy). but if you don't do well in the park. still worth attending lectures. highly recommend it. but if you are interesting
very useful material than any other introductory level course. then i heard that the course, but i also see changes.  that said, be forewarned that it's over, but i'd recommend this class is definitely try to attempt each problems as you can get through it, and it teaches you how to be a better understanding that he does.yeah, yeah, it will definitely, definitely ta
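
To make the counting step concrete, here is a minimal sketch of the same padding, counting, and normalizing scheme on a toy string. The function name `toy_counts` and the input string are made up for illustration and are not part of the course data or the adapted code. With order 2, each two-character history maps to a distribution over the characters that followed it.

```python
from collections import Counter, defaultdict

def toy_counts(data, order=2):
    # Same padding and counting scheme as train_char_lm, on a toy input
    data = "~" * order + data
    lm = defaultdict(Counter)
    for i in range(len(data) - order):
        lm[data[i:i + order]][data[i + order]] += 1
    # Normalize the counts for each history into probabilities
    return {hist: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for hist, cnt in lm.items()}

toy_lm = toy_counts("abababac", order=2)
# "ba" is followed by "b" twice and by "c" once
print(toy_lm["ba"])
```

A higher order gives more specific histories, so more of them are followed by a single character. That is likely part of why the order-10 output above reproduces long stretches of plausible text.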

## Long short-term memory recurrent neural networks!

### Data processing

In [4]:
# Code from https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

In [5]:
# Map unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

# Prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# Reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# Normalize
X = X / float(n_vocab)
# One hot encode the output variable
y = np_utils.to_categorical(dataY)

Total Characters:  349982
Total Vocab:  66
Total Patterns:  349882
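
As a sanity check on the windowing, here is a small sketch of the same sliding-window encoding on a toy string. The helper `make_windows` and the names `X_toy` and `y_toy` are hypothetical and chosen so they do not overwrite the notebook's `X` and `y`. It shows why the number of patterns equals the number of characters minus `seq_length` (349982 - 100 = 349882 above).

```python
def make_windows(text, seq_length):
    # Map each unique char to an integer, as in the cell above
    chars = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(chars)}
    X, y = [], []
    # Each window of seq_length chars predicts the char that follows it
    for i in range(len(text) - seq_length):
        X.append([char_to_int[c] for c in text[i:i + seq_length]])
        y.append(char_to_int[text[i + seq_length]])
    return X, y, char_to_int

X_toy, y_toy, mapping = make_windows("hello", seq_length=3)
print(mapping)       # e -> 0, h -> 1, l -> 2, o -> 3
print(X_toy, y_toy)  # windows "hel" and "ell", targets "l" and "o"
```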


### Some useful functions

In [9]:
# Implement some helpful generating functions
# Includes a temperature parameter, which we'll play with later -- this changes how conservative the predictions are

def sample(preds, temperature = 1.0):
    # Helper function to sample an index from a probability array
    # Code from https://stackoverflow.com/questions/37246030/how-to-change-the-temperature-of-a-softmax-output-in-keras
    # Raising to the power 1/temperature and renormalizing sharpens the
    # distribution for temperature < 1 and flattens it for temperature > 1
    preds = np.asarray(preds).astype('float64')
    preds = preds**(1/temperature)
    probas = np.random.multinomial(1, preds / preds.sum(), 1)
    return np.argmax(probas)

def generate_from_lstm(dataX, model, num_chars = 1000, temperature = None):
    # Pick a random seed; copy it so that generating does not modify dataX
    start = np.random.randint(0, len(dataX)-1)
    pattern = list(dataX[start])
    print("Seed:")
    print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")

    # Generate characters
    for i in range(num_chars):
        x = np.reshape(pattern, (1, len(pattern), 1))
        # Scale inputs the same way as the training data
        # (avoids rescanning dataX for its max on every step)
        x = x / float(n_vocab)
        prediction = model.predict(x, verbose=0)
        if temperature is None:
            # Greedy decoding: always take the most likely char
            index = np.argmax(prediction)
        else:
            index = sample(prediction[0], temperature = temperature)
        result = int_to_char[index]
        sys.stdout.write(result)
        # Slide the window forward by one char
        pattern.append(index)
        pattern = pattern[1:]
    print("\nDone.")
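
To see what the temperature parameter does to a distribution, here is a minimal pure-Python sketch of the rescaling used in `sample`. The helper `apply_temperature` and the example probabilities are illustrative assumptions, not model outputs.

```python
def apply_temperature(probs, temperature):
    # Raise each probability to the power 1/temperature, then renormalize
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.6, 0.3, 0.1]
for t in (0.25, 1.0, 2.0):
    # Low temperature concentrates mass on the most likely char;
    # high temperature spreads it out
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

A temperature of 1.0 leaves the distribution unchanged, and as the temperature approaches 0 the sampling approaches the greedy argmax used when `temperature` is `None`.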

### Fit a 1-layer LSTM

In [38]:
# Define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Define the checkpoint
filepath="char_keras_checkpoints/weights-improvement-{epoch:02d}-{loss:.4f}-stat110-chars.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [6]:
# Fit
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20

Epoch 00001: loss improved from inf to 2.84595, saving model to keras_checkpoints/weights-improvement-01-2.8459.hdf5
Epoch 2/20

Epoch 00002: loss improved from 2.84595 to 2.60772, saving model to keras_checkpoints/weights-improvement-02-2.6077.hdf5
Epoch 3/20

Epoch 00003: loss improved from 2.60772 to 2.39548, saving model to keras_checkpoints/weights-improvement-03-2.3955.hdf5
Epoch 4/20

Epoch 00004: loss improved from 2.39548 to 2.25133, saving model to keras_checkpoints/weights-improvement-04-2.2513.hdf5
Epoch 5/20

Epoch 00005: loss improved from 2.25133 to 2.14723, saving model to keras_checkpoints/weights-improvement-05-2.1472.hdf5
Epoch 6/20

Epoch 00006: loss improved from 2.14723 to 2.06885, saving model to keras_checkpoints/weights-improvement-06-2.0689.hdf5
Epoch 7/20

Epoch 00007: loss improved from 2.06885 to 2.00851, saving model to keras_checkpoints/weights-improvement-07-2.0085.hdf5
Epoch 8/20

Epoch 00008: loss improved from 2.00851 to 1.95739, saving mo

<keras.callbacks.History at 0x18268a2f4e0>

In [51]:
# Load the network weights
filename = "char_keras_checkpoints/weights-improvement-20-1.6601.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

generate_from_lstm(dataX, model, num_chars = 1000)

Seed:
"  do not take it. you're much better taking stat 104 or stat 139, where you will actually learn and r "
oe th the mrte taae the course an a sratt cnass and toeerstand the material in the course and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the material in the coass and toeerstand the m

### 2-layer LSTMs

In [53]:
# Try a bigger model

model2 = Sequential()
model2.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model2.add(Dropout(0.2))
model2.add(LSTM(256))
model2.add(Dropout(0.2))
model2.add(Dense(y.shape[1], activation='softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='adam')

filepath="char_keras_checkpoints/weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [25]:
# Fit
model2.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20

Epoch 00001: loss improved from inf to 2.69919, saving model to weights-improvement-01-2.6992-bigger.hdf5
Epoch 2/20

Epoch 00002: loss improved from 2.69919 to 2.16834, saving model to weights-improvement-02-2.1683-bigger.hdf5
Epoch 3/20

Epoch 00003: loss improved from 2.16834 to 1.91078, saving model to weights-improvement-03-1.9108-bigger.hdf5
Epoch 4/20

Epoch 00004: loss improved from 1.91078 to 1.77462, saving model to weights-improvement-04-1.7746-bigger.hdf5
Epoch 5/20

Epoch 00005: loss improved from 1.77462 to 1.68462, saving model to weights-improvement-05-1.6846-bigger.hdf5
Epoch 6/20

Epoch 00006: loss improved from 1.68462 to 1.62170, saving model to weights-improvement-06-1.6217-bigger.hdf5
Epoch 7/20

Epoch 00007: loss improved from 1.62170 to 1.57220, saving model to weights-improvement-07-1.5722-bigger.hdf5
Epoch 8/20

Epoch 00008: loss improved from 1.57220 to 1.53127, saving model to weights-improvement-08-1.5313-bigger.hdf5
Epoch 9/20

Epoch 00009: los

<keras.callbacks.History at 0x182044d0278>

In [54]:
# Load the network weights
filename = "char_keras_checkpoints/weights-improvement-20-1.3097-bigger.hdf5"
model2.load_weights(filename)
model2.compile(loss='categorical_crossentropy', optimizer='adam')

generate_from_lstm(dataX, model2, num_chars = 1000)

Seed:
" h had friday due dates. this made for a very challenging semester and some very late and stressful n "
athematical and the material is very difficult and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very interesting and the material is very int

In [59]:
# Temperature?

generate_from_lstm(dataX, model2, num_chars = 1000, temperature = 0.25)

Seed:
" tration requirement. it starts out easy but becomes surprisingly difficult, and the exams are harder "
 than you the course that i have taken at harvard. this class is very interesting and the material is a great course that is alazing. but it is a great class that is a great course that is alazing the tes are gard and the problem sets are very interesting and the material is very hilarious and i don't take it if you want to do well in the eod of the course that is all about the class and the material is a great class. but it is a great class. but it is very dasefrlly and the most of the course is very hard and the material is very difficult and interesting and worth it if you are a lot of time in the work in the class is teaching the course is in the end of the course with the material and tecching statistics and the material is a great course that is interesting and the tes are interesting and the material is a great class. but it is a great course that i vould recommend this

In [61]:
# Batch size?

filepath="char_keras_checkpoints/weights-improvement-{epoch:02d}-{loss:.4f}-bigger-batch512.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

model2.fit(X, y, epochs=20, batch_size=512, callbacks=callbacks_list)

Epoch 1/20

Epoch 00001: loss improved from inf to 1.26079, saving model to keras_checkpoints/weights-improvement-01-1.2608-bigger-batch512.hdf5
Epoch 2/20

Epoch 00002: loss improved from 1.26079 to 1.24609, saving model to keras_checkpoints/weights-improvement-02-1.2461-bigger-batch512.hdf5
Epoch 3/20

Epoch 00003: loss improved from 1.24609 to 1.23699, saving model to keras_checkpoints/weights-improvement-03-1.2370-bigger-batch512.hdf5
Epoch 4/20

Epoch 00004: loss improved from 1.23699 to 1.23137, saving model to keras_checkpoints/weights-improvement-04-1.2314-bigger-batch512.hdf5
Epoch 5/20

Epoch 00005: loss improved from 1.23137 to 1.22275, saving model to keras_checkpoints/weights-improvement-05-1.2228-bigger-batch512.hdf5
Epoch 6/20

Epoch 00006: loss improved from 1.22275 to 1.21994, saving model to keras_checkpoints/weights-improvement-06-1.2199-bigger-batch512.hdf5
Epoch 7/20

Epoch 00007: loss improved from 1.21994 to 1.21327, saving model to keras_checkpoints/weights-impr

<keras.callbacks.History at 0x20791c7bb38>

In [65]:
# Load the network weights
filename = "char_keras_checkpoints/weights-improvement-20-1.1602-bigger-batch512.hdf5"
model2.load_weights(filename)
model2.compile(loss='categorical_crossentropy', optimizer='adam')

generate_from_lstm(dataX, model2, num_chars = 500, temperature = 0.3)

Seed:
"  incredible amount of support (lots of sections, office hours, quora intuitive explanations, practic "
e problems and sections are very difficult and interesting and that she course is very difficult to start the world and the course is very hnteresting and terribly difficult and the class is very difficult and it is a great course that i vould recommend this class that i would recommend this class in the class in the concepts and make sure you are all the material in the class iard and teaching the course statt classes that i dad ciallenge your gand class that i dound the course seations are rea
Done.


In [6]:
# Bigger hidden layers? 2 layers, dim 512, and dropout 0.5
# http://karpathy.github.io/2015/05/21/rnn-effectiveness/

model3 = Sequential()
model3.add(LSTM(512, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model3.add(Dropout(0.5))
model3.add(LSTM(512))
model3.add(Dropout(0.5))
model3.add(Dense(y.shape[1], activation='softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='adam')

filepath = "char_keras_checkpoints/weights-improvement-{epoch:02d}-{loss:.4f}-bigger_dim512.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [7]:
# Fit (only 5 epochs for the sake of time)
model3.fit(X, y, epochs=5, batch_size=128, callbacks=callbacks_list)

Epoch 1/5

Epoch 00001: loss improved from inf to 2.72102, saving model to keras_checkpoints/weights-improvement-01-2.7210-bigger_dim512.hdf5
Epoch 2/5

Epoch 00002: loss improved from 2.72102 to 2.13929, saving model to keras_checkpoints/weights-improvement-02-2.1393-bigger_dim512.hdf5
Epoch 3/5

Epoch 00003: loss improved from 2.13929 to 1.84744, saving model to keras_checkpoints/weights-improvement-03-1.8474-bigger_dim512.hdf5
Epoch 4/5

Epoch 00004: loss improved from 1.84744 to 1.69425, saving model to keras_checkpoints/weights-improvement-04-1.6943-bigger_dim512.hdf5
Epoch 5/5

Epoch 00005: loss improved from 1.69425 to 1.59923, saving model to keras_checkpoints/weights-improvement-05-1.5992-bigger_dim512.hdf5


<keras.callbacks.History at 0x282b99c3518>

In [14]:
# We need more epochs
# filename = "char_keras_checkpoints/weights-improvement-05-1.5992-bigger_dim512.hdf5"
# model3.load_weights(filename)
# model3.compile(loss='categorical_crossentropy', optimizer='adam')

filepath = "char_keras_checkpoints/weights-improvement-{epoch:02d}-2-{loss:.4f}-bigger_dim512.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
model3.fit(X, y, epochs=15, batch_size=128, callbacks=callbacks_list)

Epoch 1/15

Epoch 00001: loss improved from inf to 1.52643, saving model to keras_checkpoints/weights-improvement-01-2-1.5264-bigger_dim512.hdf5
Epoch 2/15

Epoch 00002: loss improved from 1.52643 to 1.46950, saving model to keras_checkpoints/weights-improvement-02-2-1.4695-bigger_dim512.hdf5
Epoch 3/15

Epoch 00003: loss improved from 1.46950 to 1.42379, saving model to keras_checkpoints/weights-improvement-03-2-1.4238-bigger_dim512.hdf5
Epoch 4/15

Epoch 00004: loss improved from 1.42379 to 1.38595, saving model to keras_checkpoints/weights-improvement-04-2-1.3860-bigger_dim512.hdf5
Epoch 5/15

Epoch 00005: loss improved from 1.38595 to 1.35453, saving model to keras_checkpoints/weights-improvement-05-2-1.3545-bigger_dim512.hdf5
Epoch 6/15

Epoch 00006: loss improved from 1.35453 to 1.32524, saving model to keras_checkpoints/weights-improvement-06-2-1.3252-bigger_dim512.hdf5
Epoch 7/15

Epoch 00007: loss improved from 1.32524 to 1.29963, saving model to keras_checkpoints/weights-impr

<keras.callbacks.History at 0x2831f9b1b38>

In [17]:
# Load the network weights
filename = "char_keras_checkpoints/weights-improvement-15-2-1.1802-bigger_dim512.hdf5"
model3.load_weights(filename)
model3.compile(loss='categorical_crossentropy', optimizer='adam')

generate_from_lstm(dataX, model3, num_chars = 500, temperature = 0.3)

Seed:
" f thing, so if you have a busy schedule be prepared for an all-nighter almost every thursday. if you "
 are willing to do well. but it is nnt a large amount of time in the class to anyone whoh a group of probability is and the tes are a great class to take concepts and think in a teal sequirg of the her a sections and office hours. it is iard to detote to this class to the problem sets and exams. it's a great class, but it is a great class, it's a great class, but it is a great class. bnd the tes are a great class. the material is very hnteresting and the material is interesting and the tes are a
Done.


In [None]:
# Plot loss vs. epoch for the various models
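
One possible way to get the loss curves is to parse them out of the checkpoint messages printed during training. The sketch below assumes the message format shown in the outputs above. The names `parse_losses` and `plot_losses` are hypothetical, the log excerpts copy the first three epochs of the 1-layer and 2-layer runs with the file paths abbreviated to `...`, and the plotting helper assumes matplotlib is installed.

```python
import re

# Matches lines such as "Epoch 00001: loss improved from inf to 2.84595, ..."
LOSS_RE = re.compile(r"Epoch (\d+): loss improved from \S+ to ([\d.]+)")

def parse_losses(log_text):
    # Return a list of (epoch, loss) pairs found in the checkpoint messages
    return [(int(epoch), float(loss)) for epoch, loss in LOSS_RE.findall(log_text)]

log_1layer = """
Epoch 00001: loss improved from inf to 2.84595, saving model to ...
Epoch 00002: loss improved from 2.84595 to 2.60772, saving model to ...
Epoch 00003: loss improved from 2.60772 to 2.39548, saving model to ...
"""
log_2layer = """
Epoch 00001: loss improved from inf to 2.69919, saving model to ...
Epoch 00002: loss improved from 2.69919 to 2.16834, saving model to ...
Epoch 00003: loss improved from 2.16834 to 1.91078, saving model to ...
"""

curves = {"1-layer": parse_losses(log_1layer),
          "2-layer": parse_losses(log_2layer)}

def plot_losses(curves):
    # curves: dict mapping a model label to a list of (epoch, loss) pairs
    import matplotlib.pyplot as plt  # imported lazily; assumes matplotlib is available
    for label, pairs in curves.items():
        epochs, losses = zip(*pairs)
        plt.plot(epochs, losses, marker="o", label=label)
    plt.xlabel("Epoch")
    plt.ylabel("Training loss")
    plt.legend()
    plt.show()

# plot_losses(curves)  # uncomment in the notebook to draw the figure
```

Keeping the `History` object returned by `model.fit` would avoid the parsing step, since its `history` attribute stores the per-epoch loss directly.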

In [None]:
# Beam search?
# https://github.com/karpathy/char-rnn/issues/138
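
A rough sketch of what beam search could look like here, written against a generic `next_probs` function rather than the LSTM itself. Everything in it is an assumption for illustration: `beam_search`, `toy_next_probs`, and the toy transition table stand in for a wrapper around `model.predict`. Instead of committing to one char per step, it keeps the `beam_width` most probable partial sequences.

```python
import math

def beam_search(next_probs, start, steps, beam_width=3):
    # Each beam entry is (log probability, sequence so far)
    beams = [(0.0, list(start))]
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            # Extend every kept sequence by every possible next token
            for token, p in next_probs(seq).items():
                if p > 0:
                    candidates.append((logp + math.log(p), seq + [token]))
        # Keep only the beam_width highest-scoring sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_next_probs(seq):
    # Toy stand-in for the model's predicted next-char distribution
    table = {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.6, "b": 0.4}}
    return table[seq[-1]]

best_logp, best_seq = beam_search(toy_next_probs, ["a"], steps=3, beam_width=2)[0]
print("".join(best_seq), math.exp(best_logp))
```

Note that plain beam search is still deterministic, so on its own it may not fix the repeated phrases seen with greedy decoding.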

In [None]:
# GRUs? Sequence length?
# https://stackoverflow.com/questions/47125723/keras-lstm-for-text-generation-keeps-repeating-a-line-or-a-sequence/48430652#48430652

In [4]:
# Ensembling?
# https://arxiv.org/pdf/1704.00109.pdf

# Other background / params
# https://cs.stanford.edu/~zxie/textgen.pdf
# https://arxiv.org/pdf/1707.05589.pdf
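
One simple form of ensembling to try is averaging the predicted next-char distributions from several models before sampling. The sketch below is only an illustration of that idea, not the method of any of the linked papers. The helper `average_predictions` and the example vectors are hypothetical; in the notebook the vectors would presumably come from `model.predict(x)[0]` for each trained model.

```python
def average_predictions(pred_list):
    # pred_list: one probability vector per model, all over the same vocabulary
    n_models = len(pred_list)
    vocab_size = len(pred_list[0])
    return [sum(p[i] for p in pred_list) / n_models for i in range(vocab_size)]

preds_a = [0.7, 0.2, 0.1]  # illustrative output of one model
preds_b = [0.3, 0.6, 0.1]  # illustrative output of another model
avg = average_predictions([preds_a, preds_b])
print([round(p, 3) for p in avg])
```

The averaged vector is still a valid probability distribution, so it could be passed to `sample` with a temperature in the same way as a single model's prediction.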