### Class Lab:  NLP

Welcome to tonight's lab!  Tonight we're going to build a neural network to analyze text data.  We'll train our model on movie reviews from the IMDB dataset to predict whether each review conveys a positive or negative sentiment.

During the lab we'll use Keras to build a 3-layer neural network with word embeddings and densely connected outer layers.

#### Step 1:  Read in the IMDB dataset

In [9]:
# your answer here
import pandas as pd
df = pd.read_csv('IMDB.csv')

#### Step 2: Process Your Data

Take the following steps:

 - For the target variable, encode `positive` and `negative` to `1` and `0`
 - Create a training and a test set.  Since there's no order to this dataset, randomly shuffling is fine.  

In [10]:
# your answer here
# assign the encoded column back instead of relying on an in-place replace of a column view
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=0)
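
As a rough picture of what this step does, here is a pure-Python sketch on a handful of made-up reviews. The `toy_rows` data and the `encode_and_split` helper are hypothetical and exist only for illustration; scikit-learn's `train_test_split` does its own shuffling and rounding, so treat this as a sketch of the idea rather than its implementation.

```python
import random

# Hypothetical toy data standing in for the (review, sentiment) rows of the lab's CSV.
toy_rows = [(f"review {i}", "positive" if i % 2 == 0 else "negative") for i in range(10)]

LABELS = {"positive": 1, "negative": 0}

def encode_and_split(rows, test_size=0.2, seed=0):
    """Map string labels to 1/0, shuffle, and hold out a fraction for testing."""
    encoded = [(text, LABELS[label]) for text, label in rows]
    random.Random(seed).shuffle(encoded)      # seeded so the split is repeatable
    n_test = int(len(encoded) * test_size)    # simplified rounding, not sklearn's exact rule
    return encoded[n_test:], encoded[:n_test]

train_rows, test_rows = encode_and_split(toy_rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=0` above: the same rows land in the test set every time you rerun the cell.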

#### Step 3:  Tokenize Your Word Documents


**3a:** Import the necessary portions of the Keras library:

To do this, you'll need the following parts of Keras:

 - `keras.preprocessing.text.Tokenizer`
 - `keras.preprocessing.sequence.pad_sequences`

In [5]:
## your answer here
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

**3b:** Use the `Tokenizer` to process your text data.

Use the following methods to appropriately process your training and test data:

 - `fit_on_texts`
 - `texts_to_sequences`
 
**Note:** Use a maximum vocabulary size of 10000 words when you initialize the Tokenizer.

In [16]:
# your answer here
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test  = tokenizer.texts_to_sequences(X_test)
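
To see what `fit_on_texts` and `texts_to_sequences` are doing, here is a simplified pure-Python sketch. The helper names `fit_word_index` and `to_sequences` are hypothetical, and the sketch only lowercases and splits on whitespace (the real `Tokenizer` also strips punctuation). It assumes the default behavior: words are ranked by frequency starting at 1, only words ranked below `num_words` are kept, and unknown words are dropped.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most common word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace each known word with its index, dropping rare and unseen words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None and (num_words is None or idx < num_words):
                seq.append(idx)
        sequences.append(seq)
    return sequences

train_texts = ["the movie was great", "the movie was bad", "the plot"]
word_index = fit_word_index(train_texts)          # built from training text only
train_seqs = to_sequences(train_texts, word_index, num_words=4)
test_seqs = to_sequences(["the ending was great"], word_index, num_words=4)
print(word_index)
print(train_seqs, test_seqs)
```

Notice that the index is built from the training text only, which is why the lab calls `fit_on_texts` on `X_train` and then reuses the same tokenizer on `X_test`.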

**3c:** Equalize the length of each review

You have some discretion on this step, and you might want to experiment with different variations if you have additional time.  For now, set each document to 150 tokens (word indices) long, using the `pad_sequences` function in Keras.

In [20]:
# your answer here
X_train = pad_sequences(X_train, maxlen=150)
X_test  = pad_sequences(X_test, maxlen=150)
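
By default, `pad_sequences` pads short sequences with zeros at the front and truncates long ones from the front, so the end of each review is what survives. A minimal pure-Python sketch of that default behavior (the `pad_pre` helper is hypothetical, and it returns a list rather than the NumPy array Keras gives you):

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncating on a single sequence."""
    trimmed = sequence[-maxlen:]                          # keep the last `maxlen` tokens
    return [value] * (maxlen - len(trimmed)) + trimmed    # left-pad with zeros

short_review = [12, 7, 45]
long_review = [3, 9, 27, 81, 243, 729]

padded_short = pad_pre(short_review, maxlen=5)
padded_long = pad_pre(long_review, maxlen=5)
print(padded_short)
print(padded_long)
```

If you would rather keep the start of each review, `pad_sequences` accepts `padding='post'` and `truncating='post'`.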

**3d:** Double check your data

At this point, it's probably a good idea to make sure you understand what you just did, and how your data is set up.

Try to do the following, and make sure you can connect the dots:

 - Check the `word_index` of your tokenizer
 - Check the data type of your new training and test sets -- what are they?
 - What does each document consist of?  What about documents that are fewer than 150 words long?
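
One way to connect the dots is to turn a padded row back into words. The sketch below uses a tiny hypothetical `toy_word_index` in place of `tokenizer.word_index`, and the `decode` helper is an illustration only; it assumes `0` is the padding value, which is the `pad_sequences` default.

```python
# Hypothetical stand-in for tokenizer.word_index
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Invert the mapping so we can look words up by index
index_to_word = {index: word for word, index in toy_word_index.items()}

def decode(padded_row):
    """Skip padding zeros and map each remaining index back to its word."""
    return " ".join(index_to_word.get(i, "?") for i in padded_row if i != 0)

decoded = decode([0, 0, 1, 2, 3, 4])
print(decoded)
```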

In [21]:
# your answer here -- the word index tells you what index position each word is at
tokenizer.word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'with': 15,
 'for': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'about': 41,
 'out': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'some': 46,
 'there': 47,
 'what': 48,
 'good': 49,
 'more': 50,
 'when': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'even': 55,
 'time': 56,
 'my': 57,
 'she': 58,
 'would': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'can': 66,
 'had': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'been': 74,
 'get': 75,
 'bad': 76,
 'will': 77,
 'also': 78,
 'great': 79,
 'other': 80,
 'do': 81,
 'into': 82,
 'h

#### Step 4:  Initialize Your Keras Model

**What you'll need:** The `Sequential` class from Keras.  This is how you connect different neural network layers together.

**How it will be set up:** Give it the following layers:

 - A word embedding -- make sure the dimensions are passed in this order:
  - `num_words` (vocabulary size), `num_weights` (length of each word vector), `document_length` (padded review length)
 - A Dense layer with your choice of neurons and activation function
 - A Dense layer with **1** neuron and your choice of activation function

In [25]:
# your answer here
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

mod = Sequential([
    Embedding(10000, 25, input_length=150),
    Flatten(),
    Dense(12, activation='relu'),
    Dense(1, activation='sigmoid')
])
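
It can help to work out how many weights this architecture contains before training it. The arithmetic below is a sketch for the exact layer sizes used in this cell (10000 words, 25 weights per word, 150 tokens per review, 12 hidden neurons); the variable names are just for illustration. You can compare the total against what `mod.summary()` reports.

```python
num_words, num_weights, document_length = 10000, 25, 150
hidden_neurons = 12

embedding_params = num_words * num_weights                       # one vector per word
flattened_size = document_length * num_weights                   # Flatten has no weights
hidden_params = flattened_size * hidden_neurons + hidden_neurons # weights + biases
output_params = hidden_neurons * 1 + 1                           # weights + bias

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```

Most of the weights sit in the embedding layer, which is one reason the vocabulary size is worth tuning later on.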

#### Step 5: Compile Your Neural Network

Unlike scikit-learn, you have to specify a few additional parameters to fit your neural network.  

They are as follows:

 - `optimizer`: this is the technique you use to update your weights.  The standard method is **Stochastic Gradient Descent**, which can be entered as `sgd`.  The more modern method is **Adam**, which can be entered as `adam`.  Either one will work here.
 - `loss`: this is the loss function you use to **train** your weights.  Since we are doing binary classification, the correct one to use is **binary cross-entropy**, which can be entered as `binary_crossentropy`.
 - `metrics`: this is how you **score** your model.  This is optional, but accuracy is always a solid choice here.  It can be entered as `acc`, passed in as a list.

In [26]:
# your answer here
mod.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
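
To make the difference between the loss and the metric concrete, here is a small pure-Python sketch of both on made-up predictions. The helper names and the sample numbers are hypothetical; Keras computes these internally, and the clipping constant below is only an assumption to avoid taking `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip so log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]   # made-up sigmoid outputs

loss_value = binary_crossentropy(labels, probs)
acc_value = accuracy(labels, probs)
print(loss_value, acc_value)
```

The loss rewards confident correct predictions and punishes confident wrong ones, while accuracy only checks which side of 0.5 each prediction falls on. That is why the two can move in different directions during training.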

#### Step 6:  Fit Your Model

Now you can go ahead and call fit.  A few arguments to keep in mind:  

 - `validation_split`: the fraction of your training data to hold out as validation data.  This takes a decimal between 0 and 1 as an argument.
 - `epochs`:  how many rounds of training to do to update your weights
 
You can choose the appropriate values for these as you see fit.

**Hint:** Keras does not take pandas objects as input, so you'll need to make sure your data is converted to NumPy first.

In [27]:
# your answer here
mod.fit(X_train, y_train.values, validation_split=0.2, epochs=10)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 32000 samples, validate on 8000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7fdd23f0c358>
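
Keras takes the validation set from the last rows of the data you pass to `fit`, before any shuffling happens. A minimal sketch of that slicing, using a hypothetical `split_validation` helper on ten toy samples:

```python
def split_validation(samples, validation_split):
    """Hold out the final fraction of the samples for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

toy_samples = list(range(10))
train_part, val_part = split_validation(toy_samples, validation_split=0.2)
print(train_part, val_part)
```

With the 40,000 training rows and `validation_split=0.2` used here, the same slicing gives the 32,000 / 8,000 split reported in the log above. Because the training data was already shuffled by `train_test_split`, taking the last rows is not a problem in this lab.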

#### Step 7: Diagnostics

Now is a good time to take a look at your results.  

By the end of your training run, were you overfitting or underfitting?  Did it look like your results were converging towards a stable answer, or was there more training that needed to be done?  
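
The object returned by `fit` has a `.history` attribute: a dictionary of per-epoch lists with keys such as `loss` and `val_loss`. The sketch below shows one way to read it. The numbers in `toy_history` are made up purely for illustration (they are not results from this lab), and the two helper functions are hypothetical.

```python
# Hypothetical per-epoch values, invented for illustration only.
toy_history = {
    "loss":     [0.45, 0.25, 0.15, 0.08, 0.04],
    "val_loss": [0.33, 0.30, 0.34, 0.41, 0.52],
}

def best_epoch(history, key="val_loss"):
    """Return the 1-based epoch with the lowest value for `key`."""
    values = history[key]
    return min(range(len(values)), key=values.__getitem__) + 1

def is_overfitting(history):
    """Training loss kept falling while validation loss ended above its minimum."""
    train, val = history["loss"], history["val_loss"]
    return train[-1] < train[0] and val[-1] > min(val)

print(best_epoch(toy_history), is_overfitting(toy_history))
```

A pattern like this one, where training loss keeps dropping while validation loss turns upward, is the usual sign of overfitting. If both are still falling at the last epoch, the model probably needs more training.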

A reasonably good performance on this dataset is a validation accuracy of about 86-89%. 

If you hit this level, you should be fine.  If you didn't, you might try changing a few things, including:

 - Adding more neurons to give your model greater potential for accuracy
 - Trying a different optimizer (this probably won't help much, but it never hurts)
 - Using a different set of activation functions
 - Making your samples longer or shorter in length, or changing the size of the vocabulary
 
Try and fiddle around with a few parameters to see if you can get some measurable improvement.  

**Bonus:** The Deep Learning antidote to overfitting is a special type of layer called **dropout**: during training, it randomly zeroes out a portion of the values passed from one layer to the next, which keeps a neural network from memorizing spurious connections within your data.

It's very easy to set up:

`keras.layers.Dropout(0.3)`, where `0.3` is the fraction of values to randomly drop.  You can add it just like any other layer in your model.
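
To see what a dropout layer does to the values flowing through it, here is a pure-Python sketch. The `dropout` helper is hypothetical; it assumes the usual "inverted dropout" scheme, where the surviving values are scaled up by `1 / (1 - rate)` during training and nothing is changed at prediction time.

```python
import random

def dropout(activations, rate, training=True, rng=None):
    """Zero each value with probability `rate`; rescale the rest while training."""
    if not training or rate == 0.0:
        return list(activations)          # dropout is switched off outside training
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - rate)            # keeps the expected sum unchanged
    return [a * scale if rng.random() >= rate else 0.0 for a in activations]

activations = [1.0] * 10
during_training = dropout(activations, rate=0.2, rng=random.Random(0))
during_prediction = dropout(activations, rate=0.2, training=False)
print(during_training)
print(during_prediction)
```

Because a different random subset is dropped on every batch, the network cannot rely on any single connection, which is what discourages memorization.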

In [29]:
# the model began overfitting @ epoch 3, so we'll add a dropout layer of 0.2
from keras.layers import Dropout

# build
mod = Sequential([
    Embedding(10000, 25, input_length=150),
    Flatten(),
    Dropout(0.2),
    Dense(12, activation='relu'),
    Dense(1, activation='sigmoid')
])

# compile
mod.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# fit
mod.fit(X_train, y_train.values, validation_split=0.2, epochs=10)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 32000 samples, validate on 8000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7fdd238460b8>

In [30]:
# validation performance was best after one epoch, so we fit with epochs=1
# note: calling fit on the same model continues from its current weights; rebuild and recompile it to train from scratch
mod.fit(X_train, y_train.values, validation_split=0.2, epochs=1)

Train on 32000 samples, validate on 8000 samples
Epoch 1/1


<keras.callbacks.callbacks.History at 0x7fdd23e57ac8>

In [31]:
# and finally score on our test set
mod.evaluate(X_test, y_test.values)



[0.886080420306325, 0.8533999919891357]