# Natural Language Processing in Keras, Part 2

Alright, let's look at some code! First, let's load in the libraries we will need.

# How does this relate to deep learning?

In [1]:
# LSTM and CNN for sequence classification in the IMDB dataset
import numpy as np
import keras
from keras.datasets import imdb, reuters
from keras.models import Sequential, Model
from keras.layers import Input, Dense, LSTM, Dropout, RepeatVector

from keras.layers.noise import GaussianNoise
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import Convolution1D, MaxPooling1D

from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras_tqdm import TQDMNotebookCallback

import helper

Using TensorFlow backend.


#### Keras recently rolled out V2.0 with some breaking changes. I'll try to keep this tutorial as v1/v2 compatible as possible in case people are still running v1. 
If you see errors like `TypeError: Received unknown keyword arguments: {'epochs': 3}`, or size/shape mismatches, you probably have a version mismatch

In [2]:
print(keras.__version__)

2.0.2


In [3]:
# Fix random seed for reproducibility
np.random.seed(7)

# Preprocess the dataset into suitable shape to feed to NN

We will load in the data into memory, and then process it from the current form into a form that our model can accept. 

### Note on sequence length

The variable `sequence_len` here describes the length of the review data that we feed into the network. The longer it is, the more context the network has about each review, which should hopefully lead to better accuracy. However, this comes at the cost of longer training times. Initially, I was using a length of 500. I cut it down to 50 so that this demo can run in less time, and amazingly the network functions nearly as well! That is the awesome power of deep learning. Feel free to experiment with any of these parameters!

### Basic parameters

In [4]:

nb_top_words           = 5000  # Keep the top n words, zero the rest. These will be our 'vocab' in our abridged corpus
sequence_len           = 50    # This is the length in the "time axis". Originally was 500, but it turns out 50 works *almost* as well, and way faster!
embed_vector_len       = 32    # Size of the feature vector each word will map to
nb_lstm                = 100   # Number of LSTM nodes
batch_size             = 64    # Number of samples to feed into the model for each forward/backward pass

DEFAULT_EPOCHS         = 1     # For testing the whole notebook quickly. 
RUN_EVERYTHING         = False # switch for letting the whole notebook execute (including training, which can take a while)

### Load the data
In case you missed the preprocessing part, or for some reason the data isn't working, you can get the [data for this step directly](https://www.dropbox.com/s/cu1kpojxzai5ru4/imdb_rotten_proc.zip?dl=0). 

In [5]:
data_path = '/media/mike/tera/data/nlp/techvalley/' # Point this to the path to where IMDB data is stored

(x_train, y_train), (x_test, y_test) = helper.unpackage_dataset(data_path + 'imdb.npz')

(x_rotten_train, y_rotten_train), (x_rotten_test, y_rotten_test) = helper.unpackage_dataset(data_path + 'rotten.npz')


The data is split into training and test sets so we can perform simple validation. A better approach would be k-fold cross-validation, but that is a topic for another talk. 
<script>/* unused section, ignore this!

# This is for running data from a slightly different data set. You can probably ignore this, but I'm keeping it
# in the tutorial in case I need to make last minute changes. 

# (x_train, y_train), (x_test, y_test) = imdb.load_data(nb_words=nb_top_words, skip_top=0, index_from=3)#nb_words=nb_top_words)
# twitter_x = np.load('data/twitter_phrases.npy') # pre-converted indexes from Twitter set using IMDB word list
# twitter_y = np.load('data/twitter_labels.npy')
# rotten_x = np.load('data/rotten_phrases.npy') # pre-converted indexes from Twitter set using IMDB word list
# rotten_y = np.load('data/rotten_labels.npy')


## Let's peer into the data. I always like to start by getting a feel for the data

We processed the data ourselves, but since I wrote this section before the preprocessing section, it's a little redundant. 

> This data is in the form of a sequence of integers.  Each `int` maps to a word in the corpus dictionary. It is a giant lookup table, with the index roughly corresponding to the frequency rank of the word. This is necessary because computers do not directly process words, only numbers. 

word_index = imdb.get_word_index()
index_word = {value: key for (key, value) in word_index.items()} # flip key:value pairs to get the integer as the key
index_word.update({0: " "})
list(index_word.items())[:10]

sample_num = 3
x_train[sample_num][:10]

' '.join([index_word[idx] for idx in x_train[sample_num]])

*/ </script>

<script> /* 
Unused stuff, ignore this!
# wat.
Ok, I totally did not expect that. It seems super nonsensical. I dug into it and I think the situation is that the Keras IMDB data has already been processed with bag-of-words. I left this in here because it's important to remember that **Data science is a science**. Unexpected things happen all the time. 
*/ </script>

### Resize the arrays
#### Next, we need to pad the tensors out to the proper dimension in the time axis. 
Even though LSTM can handle variable length data, the backend still prefers rectangular tensors. We will pad/crop our variable-length movie reviews so that they are all exactly the same length. Once we have regularly-sized tensors, we are ready to build the model. 
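
To make the pad/crop step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings: short sequences get zeros added at the front, and long sequences keep only their last `maxlen` tokens. The helper name `pad_or_crop` is made up for illustration and is not part of Keras.

```python
import numpy as np

def pad_or_crop(seqs, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences with 'pre' padding and 'pre' truncating."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue                      # nothing to copy, row stays all padding
        kept = seq[-maxlen:]              # keep the last maxlen tokens
        out[row, -len(kept):] = kept      # right-align, padding on the left
    return out

# Two toy "reviews": one too short, one too long
demo = pad_or_crop([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(demo)
```

Every row comes out with exactly `maxlen` entries, which is the rectangular shape the backend wants.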

In [6]:
x_train = sequence.pad_sequences(x_train, maxlen=sequence_len)
x_test = sequence.pad_sequences(x_test, maxlen=sequence_len)

In [7]:
# twitter_x = sequence.pad_sequences(twitter_x, maxlen=sequence_len)
x_rotten_test = sequence.pad_sequences(x_rotten_test, maxlen=sequence_len)
print(x_train.shape, x_test.shape,)# twitter_x.shape, rotten_x.shape)

(25000, 50) (25000, 50)


### Without further ado, let's build an RNN in 5 lines of code. I'll walk you through each layer in detail. 
___
# (Model 1) Basic LSTM

First, we initialize the model.

> ```python
model = Sequential()
```

The [Keras Sequential Model](https://keras.io/getting-started/sequential-model-guide/) is based around building up the model layer-by-layer, like a cake. This is the easiest to grasp for beginners, and works well, since many, if not most, neural networks can be represented this way. 
Calling `model.add(layer)` sticks the layer on top of the stack, and that layer becomes the new top. 



### Embedding

We start off with an efficient embedding layer which maps our vocabulary to a lower-dimensional space.

> ```python
model.add(Embedding(nb_top_words,                 
                    embed_vector_len, 
                    input_length=sequence_len
                   ))
```

#### One-hot encoding

Currently, we have every word mapped to some integer, which is great because the computer can parse it, but another problem arises: These numbers are sort of arbitrary, and not much like the "thought vector" idea from before. Take the numbers 55 (time) and 56 (she): they have absolutely no connection, even though they are numerically close. Since neural networks are essentially giant (non)linear equations, this is not ideal. Larger numbers would get more 'weight' than smaller ones. What we need to do is map it into some sort of categorical representation, where each word is on 'equal footing'. 

We achieve this by using a [*one hot* representation](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science). This is where you have an Nx1 vector, where the dimension associated with the integer has value 1 ('hot') and all others are 0. N is the size of your vocabulary. Each word would look like this:

[0,0,0,0,0,0,0,...,0,**1**,0,0,...,0,0,0]
<br>        ^ this is the $k$th dim, where $k$ corresponds to the index in the table


Here is a toy example:

```
here    = [0,0,0,0,1,0]
is      = [0,1,0,0,0,0]
a       = [1,0,0,0,0,0]
toy     = [0,0,0,1,0,0]
example = [0,0,0,0,0,1]
```

Normally, you would have way more dimensions (typically thousands). Fortunately, Keras takes care of this for us. We give it a sequence of M integers, and the Embedding layer treats it as if it were an NxM one-hot matrix (under the hood it uses a lookup table rather than building the sparse matrix).
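
Here is the toy example above as a quick NumPy sketch. The vocabulary ordering is my own assumption, chosen so the vectors match the ones listed (the word `the` just fills the one unused slot). Note that this builds an MxN array with one row per word, which is the transpose of the NxM layout described above.

```python
import numpy as np

# Toy vocabulary, N = 6. The ordering is an assumption picked to match the vectors above.
vocab = ['a', 'is', 'the', 'toy', 'here', 'example']
word_to_idx = {word: i for i, word in enumerate(vocab)}

sentence = ['here', 'is', 'a', 'toy', 'example']       # M = 5 words
indices = [word_to_idx[w] for w in sentence]

# Row k of the identity matrix is the one-hot vector for index k.
one_hot = np.eye(len(vocab), dtype=int)[indices]       # shape (M, N)
print(one_hot)
```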

Currently, the "shape" of each data vector is Nx1, where N = 5000. That's really problematic, due to the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) and also computational constraints. We want to transform our **uncondensed** and **sparse** data into a **compact** and **dense** representation, in our case a 32-vector. This process is known as embedding and is what gives us our **dense** thought vector. **Word2Vec** is a popular word embedding system. The Keras Embedding layer produces a similar kind of dense word vector (though it is trained along with the rest of the model), so if you want to know more, I suggest checking out how Word2Vec works. 

Keras will do this automatically for us with the Embedding layer. After embedding, we will have a much more manageable 32x50 matrix. Each "thought" has dimensionality 32, and a sequence consists of 50 of these vectors in order. 
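
If you want to convince yourself that an embedding is "just" a lookup table, here is a small NumPy sketch. The weight matrix `W` is random here, standing in for weights that a real model would learn, and the fake review is random word indices. It shows that picking rows out of `W` gives the same answer as multiplying by an explicit one-hot matrix.

```python
import numpy as np

rng = np.random.RandomState(7)
nb_top_words, embed_vector_len, sequence_len = 5000, 32, 50

# Stand-in for the trainable embedding weights (random here, learned in a real model).
W = rng.normal(size=(nb_top_words, embed_vector_len))

# A fake padded review: 50 word indices
review = rng.randint(0, nb_top_words, size=sequence_len)

# The lookup: one 32-dim row per word
embedded = W[review]

# The same result via an explicit (and wasteful) one-hot matrix multiply
one_hot = np.zeros((sequence_len, nb_top_words))
one_hot[np.arange(sequence_len), review] = 1.0
same = np.allclose(one_hot @ W, embedded)

print(embedded.shape, same)
```

The lookup skips building the 50x5000 one-hot matrix entirely, which is why the layer is cheap.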

![Socher-Bilingual-tsne](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/img/Socher-BillingualTSNE.png)
<center> t-SNE visualization of the bilingual word embedding. Green is Chinese, Yellow is English. (Socher et al. (2013a))</center>

### LSTM

> ```python
model.add(LSTM(nb_lstm))
```


The LSTM (long short-term memory) cell is a neuron with memory. This allows the neuron to remember things over a span of time (such as from the start of a review to the end). It accomplishes this by having a memory state, which can be written to and read from. It's like a tiny ~~cassette tape~~  [~~floppy disk~~](http://i.imgur.com/Osxo1UF.jpg)  USB flash drive. In short, the inputs from the prior layer (mathematically) control gates. These gates determine whether to erase, write, and/or read from the memory cell. The actual workings of these units are quite involved, so I won't go into much detail. LSTMs have been covered really well in depth in a lot of places. In particular I recommend the articles by [Chris Olah](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and [Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). 
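
For the curious, here is a rough NumPy sketch of a single LSTM time step using the textbook equations. This is not how Keras implements it internally: the weights are random, the gate ordering inside the weight matrices is arbitrary, and I use a plain sigmoid for the gates. It is only meant to show the erase/write/read idea in code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a textbook LSTM cell.
    W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)"""
    z = x @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate values to write
    c = f * c_prev + i * g                         # erase some memory, write some new
    h = o * np.tanh(c)                             # read out a gated view of the memory
    return h, c

rng = np.random.RandomState(7)
input_dim, units = 32, 100
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(50, input_dim))        # fake embedded review
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)
```

After the loop, `h` is a single 100-dimensional summary of the whole 50-step sequence.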

![lstm](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)

![LSTM](http://tc.sinaimg.cn/maxwidth.800/tc.service.weibo.com/cdn_images_1_medium_com/58ad765e09eacb5116c9dfc5897c7296.png)

### Dense

> ```python
model.add(Dense(1, activation='sigmoid'))
```


Finally, we have a densely-connected layer. This is your "typical" neural network layer - each node from the prior layer connects to each node of the following. In this case, we are crunching down to a single node since we want a single answer - "Positive" or "Negative". We'll use a sigmoid activation to squash the output to the range 0-1. 
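
A tiny sketch of the squashing step, with made-up raw scores. The 0.5 cutoff and the "1 means positive" convention are assumptions for illustration; you are free to pick a different threshold.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

raw_scores = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])   # made-up pre-activation values
probs = sigmoid(raw_scores)                           # squashed into (0, 1)
labels = (probs >= 0.5).astype(int)                   # assumed: 1 = positive, 0 = negative
print(np.round(probs, 3), labels)
```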

That's it for our network! Pretty simple, right? 

> ["Keras is so good that it is effectively cheating in machine learning"](https://news.ycombinator.com/item?id=13872670)

### Compiling 

> ```python
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

All that's left is compiling the model. This tells the model which loss function, optimizer, and metrics to use. 

- **Optimizer**: This is the algorithm which describes how we approach gradient descent. Adam is pretty modern and works quite well for a lot of problems, so it is typically the first go-to when picking an optimizer.
- **Metrics**: This does not directly affect the training of the model. Rather, it gives us humans a way to track the performance of the model over time. It can also be used for automatic early stopping to avoid overfitting. 
- **Loss function**: This determines how the "penalty" for incorrect predictions is calculated. 
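
Since early stopping came up, here is a plain-Python sketch of the idea: stop once the validation loss has failed to improve for `patience` epochs in a row. The function name and the loss values are made up for illustration. Keras ships an `EarlyStopping` callback that does this for you during `fit`.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the (0-based) epoch at which training would stop."""
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered, ran to the end

val_losses = [0.60, 0.45, 0.40, 0.42, 0.47, 0.55]   # made-up numbers
stop_at = early_stopping_epoch(val_losses, patience=2)
print(stop_at)
```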

### Brief tangent: Cross-entropy
I'm working on a [simple summary of Cross-Entropy](https://github.com/xkortex/TechValleyMachineLearning/blob/master/CrossEntropy.ipynb). If you would like to know more, then check out that post, otherwise it's a bit of a tangent for this particular project. 

For now, all we need to know is that cross-entropy is a very common loss function, and many Keras models use binary (yes/no problems) or categorical (multiple labels) cross-entropy. You are probably familiar with Mean Squared Error, which is a commonly used loss function if you are performing continuous regression. Cross-entropy is used when predicting labels (classification, as in logistic regression). The legendary [Andrew Ng Coursera Course](https://www.coursera.org/learn/machine-learning) covers this in more detail. 
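
If you just want the gist, binary cross-entropy fits in a few lines of NumPy. This is a sketch of the standard formula with made-up predictions, not the Keras implementation. The `eps` clipping is there only to avoid taking the log of zero.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)     # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])   # made-up predictions
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_right = binary_crossentropy(y_true, confident_right)
loss_wrong = binary_crossentropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

Confidently wrong predictions get punished much harder than confidently right ones get rewarded, which is the behavior we want from a classification loss.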
___
# Let's write the code for the model


In [8]:
# Really basic LSTM model. 
model = Sequential()
model.add(Embedding(nb_top_words,                 
                    embed_vector_len, 
                    input_length=sequence_len
                   ))
model.add(LSTM(nb_lstm))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [9]:
# Have a look at our model
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 32)            160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301.0
Trainable params: 213,301
Non-trainable params: 0.0
_________________________________________________________________
None


#### Note the shape of each layer
Each layer will have a shape (N, a, b, ...). 

- The "None" for the first dimension indicates that the model will be fit to whatever our batch size is. 
- The embedding layer is (None, 50, 32), indicating it has a variable batch size, a time sequence length of 50, and a vector size of 32. 
- The LSTM is (None, 100). Since it consumes the whole sequence and returns only its final output, the sequence length dimension disappears, and we are left with the number of nodes (100)
- The dense layer smashes the output of 100 LSTMs down to 1 node
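
The `Param #` column in the summary is also easy to check by hand. Here is a quick sketch of the arithmetic; the helper function names are my own and not part of Keras.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per word in the vocabulary
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gate/candidate blocks, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # one weight per connection, plus one bias per output node
    return input_dim * units + units

counts = [embedding_params(5000, 32), lstm_params(32, 100), dense_params(100, 1)]
print(counts, sum(counts))
```

These should line up with the 160000, 53200, and 101 reported by `model.summary()` above.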


# Fit the model to the data
This is where the magic happens! The Tensor Gnomes will do their ritual dance, and the weights will manifest new values.

...What do you mean, that's not how it works?

### Training

In [10]:
# This will take about 20-60 seconds per epoch on a GPU. CPU will take anywhere from 30s (i7 w/ 8 threads) to several minutes per epoch. 
# Larger networks will see a bigger advantage to GPU, though CPU did not do as poorly as I expected!
# We'll be making improvements to the model, so you don't have to run this just yet.
RUN_MODEL1 = False # I'm just using a switch here so I can run Kernel -> Restart and Run All
if RUN_MODEL1 or RUN_EVERYTHING: 
    model.fit(x_train, y_train, 
              validation_data=(x_test, y_test), 
              epochs=DEFAULT_EPOCHS, 
              batch_size=batch_size, 
              verbose=0, # Some versions of Jupyter bork on Keras' progress bar. We replace it with Keras-TQDM instead
              callbacks=[TQDMNotebookCallback()])
    loss, metric = model.evaluate(x_test, y_test, verbose=0)
    print('\nIMDB:\nLabel ratio: {:.2f}%'.format(np.mean(y_test*100)))
    print('Loss: {}\nAccuracy: {:.2f}%'.format(loss, metric*100))

This network will get us to about 87% accuracy. However, we had to stop pretty early because of the risk of overfitting. A model that overfits easily is often a strong sign that the model will generalize poorly to new, unseen data, or data from different distributions. Let's see how it performs on a similar dataset, the ~~Twitter~~ Rotten Tomatoes dataset. This will be challenging, as the reviews are much shorter, which gives the RNN less time to 'get up to speed'.  

In [11]:
if RUN_MODEL1 or RUN_EVERYTHING:
    loss, metric = model.evaluate(x_rotten_test, y_rotten_test)
    print('\nRotten dataset:\nLabel ratio: {:.2f}%'.format(np.mean(y_rotten_test*100)))
    print('Loss: {}\nAccuracy: {:.2f}%'.format(loss, metric*100))

# Woohoo!
### It's not quite as good as the IMDB, but considering it's from a totally different corpus, this is quite good.
Let's see if we can do a bit better with the generalization. Since the dataset we want to extrapolate to (Twitter, Rotten Tomatoes, etc) is different in several ways, we want our IMDB-trained network to be robust to noise, idiosyncrasies, and quirks unique to IMDB that do not generalize to other formats.  

___
# (Model 2) Adding dropout and noise

**Dropout** is a technique for reducing overfitting in neural networks by preventing over-adaptation to the training data. The general idea is, every cycle you randomly drop a certain percentage of nodes or connections. This forces the network to compensate by distributing over multiple nodes and prevents any given node from getting too "specialized". 

Adding noise is another way of reducing overfitting and improving generalization. Adding noise to the vectors forces the network to learn to compensate, just as humans learn to process stimuli in a noisy environment.
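
Here is roughly what those two tricks do to a tensor, as a NumPy sketch. The function names are mine, and the input is a block of ones so the effect is easy to see. The dropout shown is the "inverted" flavor, which rescales the surviving values so the expected total stays the same. In Keras, both layers are only active during training and pass data through untouched at prediction time.

```python
import numpy as np

rng = np.random.RandomState(7)

def dropout(x, rate, rng):
    """Inverted dropout: zero a random fraction of units and rescale the survivors."""
    keep_mask = rng.uniform(size=x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

def gaussian_noise(x, stddev, rng):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return x + rng.normal(scale=stddev, size=x.shape)

x = np.ones((50, 32))                          # stand-in for one embedded review
dropped = dropout(x, rate=0.2, rng=rng)
noisy = gaussian_noise(x, stddev=0.5, rng=rng)

# Fraction of zeroed units, and the spread of the noisy values
print((dropped == 0).mean(), noisy.std())
```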

![dropout](http://cdn-ak.f.st-hatena.com/images/fotolife/o/olanleed/20131130/20131130221427.png)

Dropout, Noise, Batch Normalization, and other similar regularization techniques are largely empirical. That's a fancy way of saying, "We have no mathematical underpinning as to why these work well". Data scientists just sort of messed around, or used large hyperparameter searches to find configurations that work well. If you hear someone refer to deep learning as "voodoo science" or "dark arts", this is what they are talking about.

![no idea](https://img.memesuper.com/5da3ce3ad2ddff8c4b752e089ede7d8d_download-pin-have-no-idea-dog-meme-i-have-no-idea-what-im-doing_455-290.jpeg)

In [24]:

# Simple LSTM with noise and dropout
dropout_rate = 0.2  # Rate of input units to drop
sigma        = 0.5  # Amount of noise to add (in terms of standard deviation)

model = Sequential()
model.add(Embedding(nb_top_words,                 
                    embed_vector_len, 
                    input_length=sequence_len
                   ))
model.add(GaussianNoise(stddev=sigma))
model.add(Dropout(dropout_rate))
model.add(LSTM(nb_lstm))
model.add(Dropout(dropout_rate))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [25]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 50, 32)            160000    
_________________________________________________________________
gaussian_noise_3 (GaussianNo (None, 50, 32)            0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 50, 32)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dropout_8 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 101       
Total params: 213,301.0
Trainable params: 213,301.0
Non-trainable params: 0.0
________________________________________________________________

#### Note the shape of each layer

- The gaussian noise and dropout layers are (None, 50, 32), since they are merely applying to the output of embedding (like a filter)
- The LSTM is (None, 100), like before


### Training

In [13]:
# This will take about 1-6 minutes per epoch on a GPU. 
RUN_MODEL2 = False
if RUN_MODEL2 or RUN_EVERYTHING:
    model.fit(x_train, y_train, 
              validation_data=(x_test, y_test), 
              epochs=DEFAULT_EPOCHS, # Keras 2 name; on Keras 1 this argument is nb_epoch
              batch_size=batch_size, 
              verbose=0, # Some versions of Jupyter bork on Keras' progress bar. We replace it with Keras-TQDM instead
              callbacks=[TQDMNotebookCallback()])
    loss, metric = model.evaluate(x_test, y_test, verbose=0)
    print('IMDB:\nLabel ratio: {:.2f}%'.format(np.mean(y_test*100)))
    print('Loss: {}\nAccuracy: {:.2f}%'.format(loss, metric*100))

In [14]:
if RUN_MODEL2 or RUN_EVERYTHING:
    loss, metric = model.evaluate(x_rotten_test, y_rotten_test)
    print('\nRotten Dataset:\nLabel ratio: {:.2f}%'.format(np.mean(y_rotten_test*100)))
    print('Loss: {}\nAccuracy: {:.2f}%'.format(loss, metric*100))

### Accuracy dropped a little bit to about 64%, but we are running the same number of epochs. 
Noise layers trade training speed for generalizability. If we run this for a total of 6 epochs, we get up to about 74%

___
# (Model 3) LSTM + CNN = REAL ULTIMATE POWER!

![POWAH!](https://memecrunch.com/meme/9IO88/phenomenal-cosmic-power/image.jpg?w=650&c=1)

#### LSTMs are awesome. Convolutional neural networks are awesome. I was blown away when I learned you can simply and easily combine both into the same model!

### But why does this work?

Convolutional layers look at local structures in the data. Think of each node in a CNN as a filter which looks for a specific "shape" in the data. In image classifiers, these are visual features. Low-level features are things like edges and gradients. Higher level features correspond to more complex shapes.

![cnn](https://upload.wikimedia.org/wikipedia/commons/6/63/Typical_cnn.png)

In NLP, this looks at groups of words, or N-grams (sequence of N words). For instance, many English sentences contain N-grams of the form [**subject verb object**]. For example (dropping articles for simplicity):

- John rode bike
- Suzie hit ball
- Bobby made memes

Another common sequence is [**subject copula predicate**] (where copula verbs are 'linking' verbs: `is`, `are`, `was`, `will be`): 

- Roses are red
- Movie was bad
- Keras is awesome
- CUDA is rad
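
To make the N-gram idea concrete, here is a tiny pure-Python sketch that slides a window of 3 words across a made-up review. The convolutional layer does something similar, except it slides over word vectors and learns which patterns matter.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "movie was bad but soundtrack was awesome".split()   # made-up example
trigrams = ngrams(review, 3)
for gram in trigrams:
    print(gram)
```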

In [26]:
nb_filter     = 256  # This is the number of convolutional filters to use
filter_length = 5    # This is the size of the filter kernel. Since this is 1D, the kernel is Nx1
pool_length   = 2    # Size of our max pooling structures
dropout_rate  = 0.2  # Rate of input units to drop
sigma         = 0.5  # Amount of noise to add (in terms of standard deviation)

model = Sequential()
model.add(Embedding(nb_top_words, embed_vector_len, input_length=sequence_len))
model.add(GaussianNoise(stddev=sigma))
model.add(Convolution1D(filters=nb_filter, kernel_size=filter_length, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=pool_length))
model.add(Dropout(dropout_rate))
model.add(LSTM(nb_lstm))
model.add(Dropout(dropout_rate))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [27]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 50, 32)            160000    
_________________________________________________________________
gaussian_noise_4 (GaussianNo (None, 50, 32)            0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 50, 256)           41216     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 25, 256)           0         
_________________________________________________________________
dropout_9 (Dropout)          (None, 25, 256)           0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               142800    
_________________________________________________________________
dropout_10 (Dropout)         (None, 100)               0         
__________

#### Note the shape of each layer
Each layer will have a shape (N, a, b, ...). 

- The convolutional layer is (None, 50, 256). It has sequence length 50 and 256 filters. It basically takes our input thought vector sequence, and creates a sequence of N-gram thought vectors
- The max pooling layer is (None, 25, 256). It takes two adjacent points in the sequence, and only retains the maximum. This halves the number of points in the sequence.
- Dropout, LSTM and dense layers are like before. 
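
If the shapes above feel like magic, here is a slow but readable NumPy sketch of a 1D convolution with `'same'` padding followed by max pooling. The weights and input are random stand-ins, the function names are mine, and a real backend does this far more efficiently. The point is only to show where (50, 256) and (25, 256) come from.

```python
import numpy as np

def conv1d_same(x, kernels, bias):
    """x: (steps, in_dim); kernels: (width, in_dim, filters). Assumes an odd width."""
    width = kernels.shape[0]
    pad = width // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode='constant')   # zero-pad the time axis
    out = np.stack([
        np.tensordot(padded[t:t + width], kernels, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0])
    ]) + bias
    return np.maximum(out, 0.0)                                  # ReLU

def max_pool1d(x, pool):
    steps = (x.shape[0] // pool) * pool                          # drop a ragged tail, if any
    return x[:steps].reshape(-1, pool, x.shape[1]).max(axis=1)

rng = np.random.RandomState(7)
x = rng.normal(size=(50, 32))                          # fake embedded review
kernels = rng.normal(scale=0.1, size=(5, 32, 256))     # 256 filters, each spanning 5 words
bias = np.zeros(256)

conv_out = conv1d_same(x, kernels, bias)
pool_out = max_pool1d(conv_out, pool=2)
print(conv_out.shape, pool_out.shape, kernels.size + bias.size)
```

The last number printed is the weight count for the layer (5 x 32 x 256 kernel weights plus 256 biases), which you can compare with the `conv1d` row of the summary.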

### Training

In [17]:
# This will take about 0.5-2 minutes per epoch on a GPU. 
RUN_MODEL3 = False
if RUN_MODEL3 or RUN_EVERYTHING:
    model.fit(x_train, y_train, 
              validation_data=(x_test, y_test), 
              epochs=DEFAULT_EPOCHS, # Keras 2 name; on Keras 1 this argument is nb_epoch
              batch_size=batch_size, 
              verbose=0, # Some versions of Jupyter bork on Keras' progress bar. We replace it with Keras-TQDM instead
              callbacks=[TQDMNotebookCallback()])
    
    loss, metric = model.evaluate(x_test, y_test, verbose=0)
    print('IMDB:\nLabel ratio: {:.2f}%'.format(np.mean(y_test*100)))
    print('Loss: {}\nAccuracy: {:.2f}%'.format(loss, metric*100))

In [18]:
if RUN_MODEL3 or RUN_EVERYTHING:
#     loss, metric = model.evaluate(twitter_x, twitter_y)
#     print('Label ratio: {:.2f}%'.format(np.mean(twitter_y*100)))
#     print('Loss: {}\nMetric: {:.2f}%'.format(loss, metric*100))
    loss, metric = model.evaluate(x_rotten_test, y_rotten_test)
    print('\nRotten Data:\nLabel ratio: {:.2f}%'.format(np.mean(y_rotten_test*100)))
    print('Loss: {}\nMetric: {:.2f}%'.format(loss, metric*100))

# Excellent! After 3 epochs, we get about 73% accuracy on the Rotten Tomatoes dataset, when trained on the IMDB dataset!

# References

## Datasets

- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- http://snap.stanford.edu/data/web-Amazon.html

## Papers
- [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)
- [Semi-supervised Sequence Learning](https://arxiv.org/abs/1511.01432)
- [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078)

## Code

- [Keras Examples: IMDB CNN](https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py)
- [Machine Learning Mastery: Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras](http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/)
- [Microway: Building a Movie Review Sentiment Classifier using Keras and Theano Deep Learning Frameworks](https://www.microway.com/hpc-tech-tips/keras-theano-deep-learning-frameworks/)
- [Using  Pre-trained Word Embeddings in Keras](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)

## Tangentially related

- [McCormick: Word2vec Tutorial](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)

___
# Bonus feature: Pure CNN Model

I was not sure how much ground I would be able to cover, so this section is here in case people are having a great time, and are really thirsty for knowledge! This is a pure-CNN based model, and works by essentially dumping all the N-grams together (global pooling). It does not care about the respective order of N-grams. And yet, it works quite well! It is a bit slower to train than the LSTM with seq_len=50, however.
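
The "does not care about order" claim is easy to check with a NumPy sketch. Global max pooling just takes the maximum of each filter over all positions, so shuffling the positions changes nothing. The feature map here is random, standing in for the output of a convolutional layer.

```python
import numpy as np

rng = np.random.RandomState(7)
feature_map = rng.normal(size=(398, 250))      # fake Conv1D output: (positions, filters)

pooled = feature_map.max(axis=0)               # global max pooling -> one value per filter

# Scramble the order of the positions and pool again
shuffled = feature_map[rng.permutation(feature_map.shape[0])]
pooled_shuffled = shuffled.max(axis=0)

print(pooled.shape, np.array_equal(pooled, pooled_shuffled))
```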

In [19]:
'''This example demonstrates the use of Convolution1D for text classification.
Gets to 0.89 test accuracy after 2 epochs.
90s/epoch on Intel i5 2.4Ghz CPU.
10s/epoch on Tesla K40 GPU.
'''

from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, BatchNormalization
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb

# set parameters:
max_features = 1000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 2

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(nb_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Loading data...




25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 400)
x_test shape: (25000, 400)


In [20]:
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))

# we add a Conv1D layer, which will learn `filters`
# word-group filters, each of size `kernel_size`:
for i in range(1):
    model.add(BatchNormalization())
    model.add(Conv1D(filters,
                     kernel_size,
                     padding='valid',
                     activation='relu'))
# we use max pooling:
model.add(GlobalMaxPooling1D())

# for i in range(1):
#     model.add(BatchNormalization())
#     model.add(Conv1D(filters*2,
#                      kernel_size,
#                      padding='valid',
#                      activation='relu'))

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [21]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 400, 50)           50000     
_________________________________________________________________
dropout_5 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 400, 50)           200       
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 250)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_6 (Dropout)          (None, 250)               0         
__________

In [22]:
RUN_FINAL = False
if RUN_FINAL:
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_test, y_test), verbose=0, callbacks=[TQDMNotebookCallback()])