<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Deep-Learning-for-NLP" data-toc-modified-id="Deep-Learning-for-NLP-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Deep Learning for NLP</a></span></li><li><span><a href="#Embeddings-Example:-Skipgrams!" data-toc-modified-id="Embeddings-Example:-Skipgrams!-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Embeddings Example: Skipgrams!</a></span><ul class="toc-item"><li><span><a href="#Loading-in-Data" data-toc-modified-id="Loading-in-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Loading in Data</a></span></li><li><span><a href="#Keras-Tools-for-Preprocessing-Text-Data" data-toc-modified-id="Keras-Tools-for-Preprocessing-Text-Data-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Keras Tools for Preprocessing Text Data</a></span></li><li><span><a href="#Generating-Input-and-Output-Labels" data-toc-modified-id="Generating-Input-and-Output-Labels-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Generating Input and Output Labels</a></span></li><li><span><a href="#Creating-the-Model-Architecture" data-toc-modified-id="Creating-the-Model-Architecture-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Creating the Model Architecture</a></span></li><li><span><a href="#Compiling-and-Training-the-Model" data-toc-modified-id="Compiling-and-Training-the-Model-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Compiling and Training the Model</a></span></li><li><span><a href="#Saving-the-Word-Vectors" data-toc-modified-id="Saving-the-Word-Vectors-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Saving the Word Vectors</a></span></li><li><span><a href="#Examining-the-Vectors" data-toc-modified-id="Examining-the-Vectors-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Examining the Vectors</a></span></li></ul></li><li><span><a href="#RNN-Example:-Text-Classification" data-toc-modified-id="RNN-Example:-Text-Classification-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>RNN 
Example: Text Classification</a></span></li><li><span><a href="#LSTM-Example:-Sentiment-Analysis" data-toc-modified-id="LSTM-Example:-Sentiment-Analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>LSTM Example: Sentiment Analysis</a></span></li><li><span><a href="#CNN-Example:-Sentiment-Analysis" data-toc-modified-id="CNN-Example:-Sentiment-Analysis-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>CNN Example: Sentiment Analysis</a></span></li><li><span><a href="#BRNN-Example:-Sentiment-Analysis" data-toc-modified-id="BRNN-Example:-Sentiment-Analysis-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>BRNN Example: Sentiment Analysis</a></span></li><li><span><a href="#LSTM-Example:-Text-Generation" data-toc-modified-id="LSTM-Example:-Text-Generation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>LSTM Example: Text Generation</a></span></li><li><span><a href="#Save-Trained-Models" data-toc-modified-id="Save-Trained-Models-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Save Trained Models</a></span></li></ul></div>

In [None]:
# Imports
# Basics
import sys
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
%matplotlib inline

# gensim
import gensim

# keras
np.random.seed(13)
from keras.models import Sequential
from keras.layers import (Dense, Embedding, Reshape, Activation, 
                          SimpleRNN, LSTM, Convolution1D, 
                          MaxPooling1D, Dropout, Bidirectional)
from keras.utils import np_utils
from keras.utils.data_utils import get_file
from keras.preprocessing.text import Tokenizer
from keras.datasets import imdb, reuters
from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop


# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Deep Learning for NLP

We took a quick look at an NLP task in the previous notebook's Dropout example. In this notebook we'll look at some more complete NLP pipelines and learn how to work with several new types of keras Layers.

We'll start with a detailed look at building a network with an Embeddings model, and then we'll run through several examples using more advanced layer types: RNNs, LSTMs, CNNs, and BRNNs.

# Embeddings Example: Skipgrams!

Let's use keras to train a word embeddings model. Along the way, we'll use some special keras tools to create a complete NLP pipeline.

## Loading in Data

We'll work with text data from H.P. Lovecraft's story, [The Nameless City](https://raw.githubusercontent.com/urschrei/lovecraft/master/lovecraft.txt). Each item in our corpus will be a paragraph from the story. Note that there are line breaks in the text that will lead to empty elements, and poems in which each line gets its own item in the corpus.

In [None]:
# Download the corpus with curl
# We'll use some Lovecraft
!curl -o lovecraft.txt https://raw.githubusercontent.com/urschrei/lovecraft/master/lovecraft.txt

# Keep only the first 200 lines of the file
with open("lovecraft.txt") as f:
    corpus = f.readlines()[0:200]

corpus[6]

## Keras Tools for Preprocessing Text Data

`Keras` has some nice text preprocessing functions too!

In [None]:
# For simplicity, one "sentence" per line 
corpus = [sentence for sentence in corpus if sentence.count(" ") >= 2]

# Tokenize using Keras
tokenizer = Tokenizer(filters="""!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',""")
tokenizer.fit_on_texts(corpus)

# Convert tokenized sentences to sequence format
sequences = tokenizer.texts_to_sequences(corpus)
# Total number of tokens across all sequences
nb_samples = sum(len(s) for s in sequences)

print(corpus[3])
print(sequences[3])
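
To make the tokenizing step more concrete, here is a minimal pure-Python sketch of the idea: build a word index ordered by frequency, then replace each word with its integer id. It is a simplified illustration, not Keras's actual implementation (it ignores the `filters` argument, for instance), and the helper names `fit_word_index` and `texts_to_ids` are made up for this example.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused, which is why the vocabulary size is len(word_index) + 1
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word with its integer id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_corpus = ["the city was old", "the old city slept"]  # made-up sentences
word_index = fit_word_index(toy_corpus)
toy_sequences = texts_to_ids(toy_corpus, word_index)
print(word_index)
print(toy_sequences)
```

Frequent words get small ids, so the most common word in the toy corpus ends up with id 1.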

## Generating Input and Output Labels
Now we need to generate our `X_train` and `y_train` so we can train the model.

In [None]:
# Setting parameters for our model:

# Vocab size
V = len(tokenizer.word_index) + 1

# Dimension to reduce to (length of word embedding vectors)
dim = 100
window_size = 2

print("vocabulary size: ", V)

In [None]:
# Function to generate the inputs and outputs for all windows
def generate_data(sequences, window_size, V):
    maxlen = window_size*2
    # For each line (sentence)
    for words in sequences:
        L = len(words)
        # Choose the target word
        for index, word in enumerate(words):
            # Create the window
            s = index-window_size
            e = index+window_size+1
                    
            in_words = []
            labels = []
            # Create the input/outputs for skipgrams
            for i in range(s, e):
                if i != index and 0 <= i < L:
                    in_words.append([word])
                    labels.append(words[i])

            x = np.array(in_words,dtype=np.int32)
            y = np_utils.to_categorical(labels, V)
            yield (x, y)
            
# We'll call this later on, from within our train_skipgram_model function.
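
As a small worked example of the windowing logic, the sketch below lists the (target, context) pairs that a skipgram window produces for one short sentence. The word ids are hypothetical and the helper `skipgram_pairs` is an illustrative stand-in; it returns plain tuples instead of the arrays that `generate_data` yields.

```python
def skipgram_pairs(words, window_size):
    """Return (target, context) pairs for one tokenized sentence."""
    pairs = []
    for index, target in enumerate(words):
        # Clip the window so it never runs past either end of the sentence
        start = max(0, index - window_size)
        end = min(len(words), index + window_size + 1)
        for i in range(start, end):
            if i != index:
                pairs.append((target, words[i]))
    return pairs

toy_sentence = [7, 3, 9, 4]  # hypothetical word ids
pairs = skipgram_pairs(toy_sentence, window_size=1)
print(pairs)
```

Each target word is paired with every neighbour inside its window, so words at the edges of a sentence contribute fewer training pairs.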

## Creating the Model Architecture
- Lastly, we create the (shallow) network!

In [None]:
# Create the Keras model and view it 
skipgram = Sequential()
skipgram.add(Embedding(input_dim=V, input_length=1, embeddings_initializer="glorot_uniform", output_dim=dim))
skipgram.add(Reshape((dim, )))
skipgram.add(Dense(input_dim=dim, units=V, activation='softmax'))

skipgram.summary()
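
To get a feel for the numbers a summary of this architecture reports, the parameter counts can be worked out by hand. The sketch below assumes a made-up vocabulary size of 5000; the real `V` depends on the corpus.

```python
def skipgram_param_count(vocab_size, embedding_dim):
    """Count trainable parameters in the Embedding and Dense layers."""
    embedding_params = vocab_size * embedding_dim            # one vector per word
    dense_params = embedding_dim * vocab_size + vocab_size   # weights plus biases
    return embedding_params, dense_params

# 5000 is a hypothetical vocabulary size, used only for illustration
emb, dense = skipgram_param_count(vocab_size=5000, embedding_dim=100)
print(emb, dense, emb + dense)
```

The Reshape layer has no parameters of its own; it only drops the length-1 sequence axis.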

## Compiling and Training the Model
- Time to compile and train!
- We use categorical cross-entropy, a common loss for multi-class classification

In [None]:
def train_skipgram_model(skipgram, sequences, window_size, dimension_size=V):
    """
    skipgram: Keras model to train
    sequences: list of lists of integers. 
               sequences[i][j] is the encoding of word j in document i
    window_size: number of words on each side of the target word
    dimension_size: integer. Number of output classes for the one-hot labels,
                    i.e. the vocabulary size V (not the embedding dimension).
    
    Note: this is slow to train. It took 1 hr 40 min (no GPU) on a 2016 MacBook Pro.
    """
    # Compile the Keras Model
    skipgram.compile(loss='categorical_crossentropy', optimizer="adadelta")

    # Fit the Skipgrams
    for iteration in range(10):
        loss = 0.
        for sequence, label in generate_data(sequences, window_size, dimension_size):
            loss += skipgram.train_on_batch(sequence, label)

        print(iteration, loss)
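
For intuition about the loss being summed here, categorical cross-entropy for a single example can be sketched in plain Python. The probabilities are made up, and the `eps` clipping is an assumption added to avoid `log(0)`.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

one_hot = [0, 0, 1, 0]                     # the true context word is class 2
confident = [0.05, 0.05, 0.85, 0.05]       # made-up softmax outputs
unsure = [0.25, 0.25, 0.25, 0.25]
print(categorical_crossentropy(one_hot, confident))
print(categorical_crossentropy(one_hot, unsure))
```

The loss shrinks as the model puts more probability on the correct word, which is why a falling loss across iterations suggests the embeddings are improving.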

## Saving the Word Vectors
- Let's save the vectors to a file
- That way we can load them with `gensim`'s word2vec tools and test them out!

In [None]:
def write_model_weights(skipgram, tokenizer, filename='vectors.txt'):
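    # Write the resulting vectors to a text file in word2vec format
    with open(filename, 'w') as f:
        # The first layer's weights are the embedding matrix, shape (V, dim)
        vectors = skipgram.get_weights()[0]
        # Header: number of words and vector length (index 0 is reserved, so V-1 words)
        f.write(f"{len(tokenizer.word_index)} {vectors.shape[1]}\n")
        for word, i in tokenizer.word_index.items():
            line = f"{word} " + " ".join(str(num) for num in vectors[i, :]) + "\n"
            f.write(line)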
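
The file layout written above is the plain-text word2vec format: a header line with the word count and vector length, then one word per line followed by its values. The sketch below writes and re-reads that layout with made-up vectors; `write_word2vec_text` and `read_word2vec_text` are hypothetical helpers for illustration, not gensim functions.

```python
import io

def write_word2vec_text(vectors_by_word, handle):
    """Write {word: [floats]} in the plain-text word2vec layout."""
    dim = len(next(iter(vectors_by_word.values())))
    handle.write(f"{len(vectors_by_word)} {dim}\n")   # header: word count, vector length
    for word, vec in vectors_by_word.items():
        handle.write(word + " " + " ".join(str(x) for x in vec) + "\n")

def read_word2vec_text(handle):
    """Parse the same layout back into a dict."""
    count, dim = (int(n) for n in handle.readline().split())
    vectors = {}
    for _ in range(count):
        word, *values = handle.readline().split()
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}")
        vectors[word] = [float(v) for v in values]
    return vectors

toy_vectors = {"city": [0.1, -0.2, 0.3], "night": [0.0, 0.5, -0.5]}  # made-up values
buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
buffer.seek(0)
round_trip = read_word2vec_text(buffer)
print(buffer.getvalue())
```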

In [None]:
import os

if not os.path.exists('vectors.txt'):
    train_skipgram_model(skipgram, sequences, window_size, V)
    write_model_weights(skipgram, tokenizer)

## Examining the Vectors
- Let's load the vectors in to query them with `gensim`!

In [None]:
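# Load the vectors with gensim's word2vec loader and see how we did!
w2v = gensim.models.KeyedVectors.load_word2vec_format('./vectors.txt', binary=False)
# Note: every query word must be in the trained vocabulary, or gensim raises a KeyError
w2v.most_similar(positive=['white', 'rabbit'])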
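
Similarity queries like this one rank words by cosine similarity between their vectors. The sketch below shows that ranking on three made-up 2-D vectors; it handles only a single query word and is a simplified stand-in, not gensim's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, purely for illustration
toy = {"dark": [1.0, 0.0], "black": [0.9, 0.1], "bright": [-1.0, 0.2]}
ranking = most_similar("dark", toy)
print(ranking)
```

Words whose vectors point in nearly the same direction score close to 1, and words pointing in opposite directions score close to -1.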

# RNN Example: Text Classification
- Let's try a Recurrent Neural Net for the same Reuters classification task from the Intro to Keras notebook:

In [None]:
# Same data loading as before
max_features = 2000
(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=max_features)
maxlen = 10
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
nb_epoch = 20
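
With its default settings, `pad_sequences` pads and truncates at the front of each sequence, so the last `maxlen` tokens are the ones that survive. A minimal pure-Python sketch of that behaviour, using a hypothetical helper `pad_pre`:

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    truncated = seq[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([5, 8, 2], maxlen=5))
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```

With `maxlen = 10`, each Reuters article is reduced to its final ten word ids, which keeps training fast at the cost of discarding most of the text.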

In [None]:
# Create the model
rnn_model = Sequential()
# The Embedding layer allows us to map words into dense vectors as inputs, common first layer
rnn_model.add(Embedding(input_dim=max_features, output_dim=100, embeddings_initializer='glorot_uniform', input_length=maxlen))
# This is the most basic kind of RNN! We're using 20 units; the size of the
# hidden state loosely reflects how much "memory" of past events in a
# sequence the layer can carry.
# For the purposes of keras, it's just another type of layer you can try!
rnn_model.add(SimpleRNN(20, return_sequences=False))
rnn_model.add(Dense(46))
rnn_model.add(Activation('softmax'))

rnn_model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])

rnn_model.summary()
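
The recurrence behind a simple RNN can be sketched with a single unit and scalar inputs: each step mixes the current input with the previous hidden state and squashes the result with `tanh`. The weights below are made up, and a real layer uses weight matrices rather than scalars.

```python
import math

def simple_rnn(inputs, w_x, w_h, bias):
    """Run a one-unit recurrent layer over a sequence and return the final state."""
    h = 0.0                                    # initial hidden state
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + bias)
    return h

# Made-up weights and inputs, purely for illustration
final_state = simple_rnn([1.0, 0.5, -0.3], w_x=0.8, w_h=0.5, bias=0.0)
print(final_state)
```

Returning only the final state mirrors `return_sequences=False`: the layer summarizes the whole sequence in one vector, which the Dense layer then classifies.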

In [None]:
# Train the model
rnn_model.fit(X_train, y_train, batch_size=256, epochs=nb_epoch, 
              validation_data=(X_test, y_test))

**WHOA!** That is more than a 100% improvement over the plain feed-forward network (ANN) from the Intro to Keras notebook!

# LSTM Example: Sentiment Analysis

Here is some code to train sentiment analysis on IMDB reviews.

*(Note that this network takes about 20 minutes to train over 15 epochs.)*

In [None]:
# Set parameters
max_features = 20000
maxlen = 80  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

# Load data (Keras utility)
print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

# Pad Short sentences
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

# Build our model!
print('Build model...')
lstm_model = Sequential()
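# Keras 2 removed the `dropout` argument from Embedding, so we apply Dropout as a separate layer
lstm_model.add(Embedding(max_features, 128))
lstm_model.add(Dropout(0.2))
# Here's the LSTM magic!
lstm_model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))  # try using a GRU instead, for fun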
lstm_model.add(Dense(1))
# Sigmoid for binary classification
lstm_model.add(Activation('sigmoid'))

# try using different optimizers and different optimizer configs
lstm_model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
lstm_model.fit(X_train, y_train, batch_size=batch_size, epochs=15,
               validation_data=(X_test, y_test))
score, acc = lstm_model.evaluate(X_test, y_test,
                                 batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
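
The final `Dense(1)` plus sigmoid turns the LSTM's output into a probability that a review is positive, and binary cross-entropy scores that probability against the 0/1 label. A minimal sketch with a made-up pre-activation value (the `eps` clipping is an assumption to avoid `log(0)`):

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example with label 0 or 1 and predicted probability p."""
    p = min(max(p, eps), 1 - eps)              # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p_positive = sigmoid(2.0)                      # made-up pre-activation value
print(p_positive)
print(binary_crossentropy(1, p_positive))      # label agrees with the prediction
print(binary_crossentropy(0, p_positive))      # label disagrees with the prediction
```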

# CNN Example: Sentiment Analysis
- Here's the same Sentiment Analysis task with a CNN + LSTM:

In [None]:
# Embedding
max_features = 20000
maxlen = 100
embedding_size = 128

# Convolution
filter_length = 5
nb_filter = 64
pool_length = 4

# LSTM
lstm_output_size = 70

# Training
batch_size = 30
nb_epoch = 2

'''
Note:
Results are highly sensitive to batch_size.
Only 2 epochs are needed as the dataset is very small.
'''

# Load data
print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

# Pad sentences
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')

cnn_model = Sequential()
cnn_model.add(Embedding(max_features, embedding_size, input_length=maxlen))
cnn_model.add(Dropout(0.25))

# Convolution!
# Keras 2 argument names: filters, kernel_size, padding, strides, pool_size
cnn_model.add(Convolution1D(filters=nb_filter,
                            kernel_size=filter_length,
                            padding='valid',
                            activation='relu',
                            strides=1))
cnn_model.add(MaxPooling1D(pool_size=pool_length))

# LSTM!
cnn_model.add(LSTM(lstm_output_size))
cnn_model.add(Dense(1))
cnn_model.add(Activation('sigmoid'))

cnn_model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
cnn_model.fit(X_train, y_train, batch_size=batch_size, epochs=nb_epoch,
          validation_data=(X_test, y_test))
score, acc = cnn_model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
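
It can help to trace how the sequence length shrinks before it reaches the LSTM. The arithmetic below assumes 'valid' padding for both layers and a pooling stride equal to the pool size (the `MaxPooling1D` default); the helper names are made up for this sketch.

```python
def conv1d_valid_length(input_length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

def max_pool_length(input_length, pool_size):
    """Output length of 1-D max pooling whose stride equals the pool size."""
    return (input_length - pool_size) // pool_size + 1

after_conv = conv1d_valid_length(100, 5)       # maxlen=100, filter_length=5
after_pool = max_pool_length(after_conv, 4)    # pool_length=4
print(after_conv, after_pool)
```

So the LSTM processes a much shorter sequence of 64-channel feature vectors instead of 100 raw embeddings, which is a large part of why this model trains quickly.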

# BRNN Example: Sentiment Analysis
- One more time on the sentiment, now with a Bidirectional Recurrent Neural Net!

*(This takes about 2 minutes per epoch to train)*

In [None]:
max_features = 20000
maxlen = 100  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print("Pad sequences (samples x time)")
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
y_train = np.array(y_train)
y_test = np.array(y_test)

brnn_model = Sequential()
brnn_model.add(Embedding(max_features, 128, input_length=maxlen))

# Bidirectional LSTM!!!
brnn_model.add(Bidirectional(LSTM(64)))
brnn_model.add(Dropout(0.5))
brnn_model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
brnn_model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

print('Train...')
brnn_model.fit(X_train, y_train,
               batch_size=batch_size,
               epochs=4,
               validation_data=(X_test, y_test))

score, acc = brnn_model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

In [None]:
cnn_model.summary()

In [None]:
brnn_model.summary()
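
When reading these summaries, the LSTM parameter counts can be checked by hand. An LSTM has four weight blocks (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias. A bidirectional wrapper trains one LSTM per direction and, with its default merge mode, concatenates their outputs. The helper name below is made up for this sketch.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters in one LSTM layer."""
    # Four weight blocks, each with input weights, recurrent weights, and biases
    return 4 * (units * (input_dim + units) + units)

one_direction = lstm_param_count(input_dim=128, units=64)
bidirectional = 2 * one_direction              # forward and backward copies
output_size = 2 * 64                           # concatenated forward/backward states
print(one_direction, bidirectional, output_size)
```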

# LSTM Example: Text Generation
- Finally, a different example!
- We're going to do **Text Generation** with **LSTM**
- We'll watch our model start spitting out words of Nils's favorite philosopher in real time!
  - P.S. It's Friedrich Nietzsche
- As we go through epochs, the ability to generate Nietzschean sentences will get better and better!
- Let's go!
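
Character-level generation is framed as next-character prediction: slide a fixed-length window over the text and ask the model to predict the character that follows it. A minimal sketch of that slicing on a toy string (the helper `make_windows` and the tiny `maxlen` are illustrative assumptions):

```python
def make_windows(text, maxlen, step):
    """Slice text into overlapping windows and the character that follows each."""
    windows, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])
        next_chars.append(text[i + maxlen])
    return windows, next_chars

windows, targets = make_windows("hello world", maxlen=4, step=3)
print(list(zip(windows, targets)))
```

A smaller `step` gives more, heavily overlapping training examples; a larger one gives fewer examples but covers the text faster.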

In [None]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
# np.bool was removed in NumPy 1.24; the built-in bool works on all versions
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
textgen_model = Sequential()
textgen_model.add(LSTM(128, input_shape=(maxlen, len(chars))))
textgen_model.add(Dense(len(chars)))
textgen_model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
textgen_model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# train the model, output generated text after each iteration
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    textgen_model.fit(X, y, batch_size=128, epochs=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = textgen_model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
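
The `diversity` value acts as a sampling temperature: the predicted probabilities are rescaled in log space before a character is drawn. The sketch below isolates that rescaling step with made-up probabilities, so the effect of each temperature is visible without training a model.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution; low temperatures sharpen it."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

base = [0.5, 0.3, 0.2]                         # made-up next-character probabilities
for t in (0.2, 1.0, 1.2):
    print(t, [round(p, 3) for p in apply_temperature(base, t)])
```

Low temperatures push almost all the probability onto the most likely character, giving repetitive but safe text. Temperatures above 1 flatten the distribution, giving more surprising and more error-prone output.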

# Save Trained Models

In [None]:
import os
import pickle

# Make sure the output directory exists before writing
os.makedirs('models', exist_ok=True)

trained_models = {
    'skipgram': skipgram,
    'rnn_model': rnn_model,
    'lstm_model': lstm_model,
    'cnn_model': cnn_model,
    'brnn_model': brnn_model,
    'textgen_model': textgen_model,
}

# If pickling fails on your Keras version, model.save(path) is the Keras-native alternative
for name, model in trained_models.items():
    with open(f'models/{name}.pkl', 'wb') as pkl:
        pickle.dump(model, pkl)
