## 15.9.1 Loading the IMDb Movie Reviews Dataset (1 of 2)
* Contains **25,000 training samples** and **25,000 testing samples**, each **labeled** with its positive (1) or negative (0) sentiment

In [1]:
from tensorflow.keras.datasets import imdb

* **Over 88,000 unique words** in the dataset
* Can specify **number of unique words to import** when loading **training and testing data**
* We'll use top **10,000 most frequently occurring words** 
    * Due to **system memory limitations** and **training on a CPU** (intentionally)
    * Most people don't have systems with TensorFlow-compatible **GPUs** or **TPUs**
* **More data** takes **longer to train**, but may produce **better models**

## 15.9.1 Loading the IMDb Movie Reviews Dataset (2 of 2)
* **`load_data`** **replaces** any words **outside the top 10,000** with a **placeholder** value (discussed shortly)

In [2]:
number_of_words = 10000

**NOTE:** The following cell works around a **known issue with TensorFlow/Keras and NumPy**; the fix is included in a forthcoming version. [See this cell's code on StackOverflow.](https://stackoverflow.com/questions/55890813/how-to-fix-object-arrays-cannot-be-loaded-when-allow-pickle-false-for-imdb-loa)

In [3]:
import numpy as np

# save np.load
np_load_old = np.load

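# The original workaround forced allow_pickle=True (kept below for reference).
# It is commented out here, so the wrapper simply forwards its arguments unchanged.
# np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)
np.load = lambda *a,**k: np_load_old(*a, **k)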

In [4]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=number_of_words)

In [5]:
# This cell completes the workaround mentioned above
# restore np.load for future normal usage
np.load = np_load_old

<hr style="height:2px; border:none; color:black; background-color:black;">

## 15.9.2 Data Exploration (1 of 2)
* Check sample and target dimensions
* **Note that `X_train` and `X_test` appear to be one-dimensional**
    * They're actually **NumPy arrays of objects** (lists of integers)

In [6]:
X_train.shape

(25000,)

In [7]:
y_train.shape

(25000,)

In [8]:
X_test.shape

(25000,)

In [9]:
y_test.shape

(25000,)

<hr style="height:2px; border:none; color:black; background-color:black;">

## 15.9.2 Data Exploration (2 of 2)
* The **arrays `y_train` and `y_test`** are **one-dimensional** arrays containing **1s and 0s**, indicating whether each review is **positive** or **negative**
* Each element of `X_train` and `X_test` is a **list** of integers representing one review’s contents
* **Keras models require numeric data** &mdash; **IMDb dataset is preprocessed for you**

In [10]:
%pprint  # toggle pretty printing, so elements don't display vertically

Pretty printing has been turned OFF


In [11]:
X_train[123]

[1, 307, 5, 1301, 20, 1026, 2511, 87, 2775, 52, 116, 5, 31, 7, 4, 91, 1220, 102, 13, 28, 110, 11, 6, 137, 13, 115, 219, 141, 35, 221, 956, 54, 13, 16, 11, 2714, 61, 322, 423, 12, 38, 76, 59, 1803, 72, 8, 2, 23, 5, 967, 12, 38, 85, 62, 358, 99]

<hr style="height:2px; border:none; color:black; background-color:black;">

### Movie Review Encodings (1 of 2)
* Because the **movie reviews** are **numerically encoded**, to view their original text, you need to know the word to which each number corresponds
* **Keras’s IMDb dataset** provides a **dictionary** that **maps the words to their indexes**
* **Each word’s value** is its **frequency ranking** among all words in the dataset
    * **Ranking 1** is the **most frequently occurring word**
    * **Ranking 2** is the **second most frequently occurring word**
    * ...

<hr style="height:2px; border:none; color:black; background-color:black;">

### Movie Review Encodings (2 of 2)
* Ranking values are **offset by 3** in the training/testing samples
    * **Most frequently occurring word has the value 4** wherever it appears in a review
* **0, 1 and 2** in each encoded review are **reserved**:
    * **padding (0)** 
        * All training/testing samples **must have same dimensions**
        * Some reviews may need to be padded with **0** and some shortened
    * **start of a sequence (1)** &mdash; a **token** that Keras uses internally for learning purposes
    * **unknown word (2)** &mdash; typically a word that was **not loaded**
        * **`load_data`** uses **2** for words with **frequency rankings greater than `num_words`** 

<hr style="height:2px; border:none; color:black; background-color:black;">

### Decoding a Movie Review (1 of 3)
* Must account for offset when **decoding reviews**
* Get the **word-to-index dictionary**

In [12]:
word_to_index = imdb.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


* The word `'great'` might appear in a positive movie review:

In [13]:
word_to_index['great']  # 84th most frequent word

84

<hr style="height:2px; border:none; color:black; background-color:black;">

### Decoding a Movie Review (2 of 3)
* **Reverse `word_to_index` mapping**, so we can **look up words** by **frequency ranking**

In [14]:
index_to_word = {index: word for (word, index) in word_to_index.items()}

* **Top 50 words**—**most frequent word** has the key **1** in the **new dictionary**

In [15]:
[index_to_word[i] for i in range(1, 51)]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'on', 'not', 'you', 'are', 'his', 'have', 'he', 'be', 'one', 'all', 'at', 'by', 'an', 'they', 'who', 'so', 'from', 'like', 'her', 'or', 'just', 'about', "it's", 'out', 'has', 'if', 'some', 'there', 'what', 'good', 'more']

<hr style="height:2px; border:none; color:black; background-color:black;">

### Decoding a Movie Review (3 of 3)
* Now, we can **decode a review**
* **`i - 3`** accounts for the **frequency-ranking offset** in the encoded reviews
* For `i` values `0`–`2`, `get` returns `'?'`; otherwise, `get` returns the word with the **key `i - 3`** in the **`index_to_word` dictionary**

In [16]:
' '.join([index_to_word.get(i - 3, '?') for i in X_train[123]])

'? beautiful and touching movie rich colors great settings good acting and one of the most charming movies i have seen in a while i never saw such an interesting setting when i was in china my wife liked it so much she asked me to ? on and rate it so other would enjoy too'

* Can see from **`y_train[123]`** that this **review** is **classified as positive**

In [17]:
y_train[123]

1
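
A variation on the decoding expression, sketched with a toy dictionary so it runs without downloading anything. The `decode_review` helper, the `toy_index_to_word` dictionary and the `<pad>`/`<start>`/`<unk>` labels are assumptions made for illustration; the notebook itself displays every reserved value as `'?'`.

```python
RESERVED = {0: '<pad>', 1: '<start>', 2: '<unk>'}  # hypothetical display labels
OFFSET = 3

# Hypothetical ranking-to-word dictionary standing in for index_to_word
toy_index_to_word = {1: 'the', 2: 'and', 3: 'movie', 4: 'great'}


def decode_review(encoded, index_to_word):
    """Translate encoded values back to words, naming the reserved values."""
    words = []
    for i in encoded:
        if i in RESERVED:
            words.append(RESERVED[i])
        else:
            # undo the offset; fall back to '?' for rankings not in the dictionary
            words.append(index_to_word.get(i - OFFSET, '?'))
    return ' '.join(words)


decoded = decode_review([0, 0, 1, 4, 7, 2, 6], toy_index_to_word)
print(decoded)  # <pad> <pad> <start> the great <unk> movie
```

Naming the reserved values makes it easier to tell padding apart from words that were not loaded when inspecting a decoded review.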

<hr style="height:2px; border:none; color:black; background-color:black;">

## 15.9.3 Data Preparation (1 of 2)
* Number of words per review varies
* Keras **requires all samples to have the same dimensions**
* **Prepare data** for learning
	* Restrict every review to the **same number of words**
	* **Pad** some with **0s**, **truncate** others
* **`pad_sequences` function** reshapes samples and **returns a 2D array**

In [18]:
words_per_review = 200  

In [19]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [20]:
X_train = pad_sequences(X_train, maxlen=words_per_review)

In [21]:
X_train.shape

(25000, 200)
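
To see what padding and truncating do to individual reviews, here is a pure-Python sketch. The `pad_or_truncate` helper is hypothetical; it assumes the default behavior of `pad_sequences`, which pads and truncates at the *beginning* of a sequence, and it returns nested lists rather than the NumPy array that `pad_sequences` produces.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Return each sequence with exactly maxlen items (illustrative only)."""
    result = []
    for seq in sequences:
        if len(seq) >= maxlen:
            # keep only the last maxlen items
            result.append(list(seq[len(seq) - maxlen:]))
        else:
            padding = [value] * (maxlen - len(seq))
            result.append(padding + list(seq))  # pad at the front
    return result


toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9, 10]]
padded = pad_or_truncate(toy_reviews, maxlen=5)
print(padded)  # [[0, 0, 1, 14, 22], [6, 7, 8, 9, 10]]
```

Under this assumption, a long review loses its opening words (including the start-of-sequence value `1`), and a short review gains leading `0`s.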

## 15.9.3 Data Preparation (2 of 2)
* Must also **reshape `X_test`** for evaluating the model later

In [22]:
X_test = pad_sequences(X_test, maxlen=words_per_review) 

In [23]:
X_test.shape

(25000, 200)

<hr style="height:2px; border:none; color:black; background-color:black;">

### Splitting the Test Data into Validation and Test Data
* Split the **25,000 test samples** into **20,000 test samples** and **5,000 validation samples**
* We'll pass the validation samples to the model’s `fit` method via the **`validation_data`** argument
* Use **scikit-learn’s `train_test_split` function**

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, random_state=11, test_size=0.20) 

* Confirm the split by checking `X_test`’s and `X_val`’s shapes:

In [26]:
X_test.shape

(20000, 200)

In [27]:
X_val.shape

(5000, 200)
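
A sketch of the idea behind the split, using only the standard library. The `split_indices` helper is hypothetical and does not reproduce `train_test_split`'s shuffling, so it would select different samples than the cell above even with the same seed; it only shows how `test_size=0.20` yields the 20,000/5,000 division.

```python
import random


def split_indices(n_samples, holdout_fraction, seed):
    """Shuffle sample indices and split them into two disjoint groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_samples * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


test_idx, val_idx = split_indices(25000, holdout_fraction=0.20, seed=11)
print(len(test_idx), len(val_idx))  # 20000 5000
```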

<hr style="height:2px; border:none; color:black; background-color:black;">

## 15.9.4 Creating the Neural Network
* Begin with a **`Sequential` model** and import the other layers

In [28]:
from tensorflow.keras.models import Sequential

In [29]:
rnn = Sequential()

In [30]:
from tensorflow.keras.layers import Dense, LSTM, Embedding

<hr style="height:2px; border:none; color:black; background-color:black;">

### Adding an Embedding Layer (1 of 2)
* RNNs that process **text sequences** typically begin with an **embedding layer** 
* Encodes each word in a **dense-vector representation**
* These capture the **word’s context**—how a given word **relates to words around it**
* Help **RNN learn word relationships** 
* **Predefined word embeddings**, such as **Word2Vec** and **GloVe**
	* Can **load** into neural networks to **save training time**
	* Sometimes used to **add basic word relationships** to a model when **smaller amounts of training data** are available
	* **Improve model accuracy** by **building upon previously learned word relationships**, rather than trying to learn those relationships with insufficient data

<hr style="height:2px; border:none; color:black; background-color:black;">

### Adding an `Embedding` Layer (2 of 2)

In [31]:
rnn.add(Embedding(input_dim=number_of_words, output_dim=128,
                  input_length=words_per_review))

* **`input_dim=number_of_words`**—Number of **unique words**
* **`output_dim=128`**—Size of each word embedding
    * If you [load pre-existing embeddings](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) like **Word2Vec** and **GloVe**, you must set this to **match the size of the word embeddings you load**
* **`input_length=words_per_review`**—Number of words in each input sample
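
Conceptually, an embedding layer is a lookup table with one row of `output_dim` numbers per vocabulary index. The sketch below uses small made-up sizes and random values; the names `toy_table` and `embed` are hypothetical, and a real `Embedding` layer learns its values during training rather than leaving them random.

```python
import random

rng = random.Random(11)
toy_input_dim = 10   # vocabulary size (10,000 in the model above)
toy_output_dim = 4   # embedding size (128 in the model above)

# One row of toy_output_dim values per vocabulary index
toy_table = [[rng.uniform(-0.05, 0.05) for _ in range(toy_output_dim)]
             for _ in range(toy_input_dim)]


def embed(encoded_review, table):
    """Replace each integer in a review with its row from the table."""
    return [table[i] for i in encoded_review]


vectors = embed([0, 0, 1, 4, 7], toy_table)
print(len(vectors), len(vectors[0]))  # 5 4
print(len(toy_table) * len(toy_table[0]))  # 40
```

A five-word review becomes five vectors of four values each, and the table holds 10 × 4 = 40 values to learn. The same arithmetic with 10,000 words and 128 values per word gives the model's embedding size.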

<hr style="height:2px; border:none; color:black; background-color:black;">

### Adding an LSTM Layer

In [32]:
rnn.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))

* **`units`**—**number of neurons** in the layer
	* **More neurons** means **network can remember more**
	* [**Guideline**](https://towardsdatascience.com/choosing-the-right-hyperparameters-for-a-simple-lstm-using-keras-f8e9ed76f046): Value between **length of the sequences** (200 in this example) and **number of classes to predict** (2 in this example)
* **`dropout`**—**percentage of neurons to randomly disable** when processing the layer’s input and output
	* Like **pooling layers** in a **convnet**, **dropout** is a proven technique that **reduces overfitting**
        * Gal, Yarin, and Zoubin Ghahramani. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” October 05, 2016. https://arxiv.org/abs/1512.05287
        * Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” _Journal of Machine Learning Research_ 15 (June 14, 2014): 1929-1958. http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
	* Keras also provides a **`Dropout`** layer that you can add to your models 
* **`recurrent_dropout`**—**percentage of neurons to randomly disable** when the **layer’s output** is **fed back into the layer** again to allow the network to **learn from what it has seen previously**
    * **Mechanics of how the LSTM layer performs its task are beyond scope**.
        * Chollet says: “you don’t need to understand anything about the specific architecture of an LSTM cell; **as a human, it shouldn’t be your job to understand it**. Just keep in mind what the LSTM cell is meant to do: allow past information to be reinjected at a later time.”
		* Chollet, François. _Deep Learning with Python_. p. 204. Shelter Island, NY: Manning Publications, 2018.
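
A rough sketch of the idea behind dropout, in plain Python. It assumes the common "inverted dropout" formulation, in which surviving values are scaled up by `1 / (1 - rate)` during training; the `apply_dropout` helper is hypothetical and is not how the `LSTM` layer applies `dropout` or `recurrent_dropout` internally.

```python
import random


def apply_dropout(values, rate, rng):
    """Zero each value with probability rate and rescale the survivors."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(11)
activations = [1.0] * 1000
dropped = apply_dropout(activations, rate=0.2, rng=rng)

fraction_disabled = dropped.count(0.0) / len(dropped)
print(f'{fraction_disabled:.1%} of values disabled')  # roughly 20%
```

Because a different random subset is disabled on each pass, the network cannot rely too heavily on any single value, which is the intuition behind dropout's effect on overfitting.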

<hr style="height:2px; border:none; color:black; background-color:black;">

### Adding a Dense Output Layer 
* Reduce the **LSTM layer’s output** to **one result** indicating whether a review is **positive** or **negative**, thus the value **`1` for the `units` argument**
* **`'sigmoid'` activation function** is preferred for **binary classification**
	* Chollet, François. _Deep Learning with Python_. p. 114. Shelter Island, NY: Manning Publications, 2018.
	* Reduces arbitrary values into the range **0.0–1.0**, producing a probability

In [33]:
rnn.add(Dense(units=1, activation='sigmoid'))
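
For reference, the sigmoid function is `1 / (1 + exp(-x))`. A minimal sketch follows; the `sigmoid` helper and the 0.5 cutoff for calling a review positive are illustrative assumptions, not something the notebook's cells define.

```python
import math


def sigmoid(x):
    """Squash any real number into the range 0.0-1.0."""
    return 1.0 / (1.0 + math.exp(-x))


for raw in (-4.0, 0.0, 4.0):
    probability = sigmoid(raw)
    label = 'positive' if probability >= 0.5 else 'negative'
    print(f'{raw:5.1f} -> {probability:.3f} ({label})')
```

Large negative inputs map close to 0.0, large positive inputs map close to 1.0, and an input of 0 maps to exactly 0.5.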

<hr style="height:2px; border:none; color:black; background-color:black;">

### Compiling the Model and Displaying the Summary
* **Two possible outputs**, so we use the **`binary_crossentropy` loss function**:

In [34]:
rnn.compile(optimizer='adam',
            loss='binary_crossentropy', 
            metrics=['accuracy'])
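
As a rough illustration of what this loss measures, the sketch below computes binary cross-entropy for a few made-up predictions. The `binary_crossentropy` helper here is a hand-written stand-in, not the Keras function, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1 - y)*log(1 - p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
confident_and_right = binary_crossentropy(labels, [0.9, 0.1, 0.8, 0.2])
confident_and_wrong = binary_crossentropy(labels, [0.1, 0.9, 0.2, 0.8])
print(confident_and_right < confident_and_wrong)  # True
```

Predictions that are confident and wrong are penalized far more heavily than predictions that are confident and right, which is why the loss can rise even while accuracy stays roughly constant.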

* **Fewer layers** than our **convnet**, but nearly **three times as many parameters** (the network’s **weights**)  
	* **More parameters means more training time**
	* The large number of parameters primarily comes from the **number of words in the vocabulary** (we loaded 10,000) **times the number of neurons in the `Embedding` layer’s output (128)**

In [35]:
rnn.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 200, 128)          1280000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________
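
The parameter counts in the summary can be reproduced by hand. The sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias per unit; the variable names are chosen for illustration.

```python
number_of_words = 10000  # vocabulary size
embedding_size = 128     # Embedding output_dim
lstm_units = 128
dense_units = 1

# Embedding: one vector of embedding_size values per word
embedding_params = number_of_words * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (embedding_size + lstm_units + 1) * lstm_units

# Dense: one weight per LSTM output plus one bias
dense_params = (lstm_units + 1) * dense_units

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 1280000 131584 129 1411713
```

The `Embedding` layer accounts for roughly 90% of the total, so the vocabulary size is the main lever on model size.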


<hr style="height:2px; border:none; color:black; background-color:black;">

## 15.9.5 Training and Evaluating the Model (1 of 2)
* For each **epoch** the **RNN model** takes **significantly longer to train** than our **convnet**
    * Due to the **larger number of parameters** (weights) our **RNN model** needs to learn

In [36]:
rnn.fit(X_train, y_train, epochs=10, batch_size=32, 
        validation_data=(X_val, y_val))

Train on 25000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History object at 0x15edf6a50>

<!--
```
Train on 25000 samples, validate on 20000 samples
WARNING:tensorflow:From /Users/pauldeitel/anaconda3/envs/tf_env/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/10
25000/25000 [==============================] - 297s 12ms/sample - loss: 0.4827 - acc: 0.7673 - val_loss: 0.3925 - val_acc: 0.8324
Epoch 2/10
25000/25000 [==============================] - 291s 12ms/sample - loss: 0.3327 - acc: 0.8618 - val_loss: 0.3614 - val_acc: 0.8461
Epoch 3/10
25000/25000 [==============================] - 272s 11ms/sample - loss: 0.2662 - acc: 0.8937 - val_loss: 0.3503 - val_acc: 0.8492
Epoch 4/10
25000/25000 [==============================] - 272s 11ms/sample - loss: 0.2066 - acc: 0.9198 - val_loss: 0.3695 - val_acc: 0.8623
Epoch 5/10
25000/25000 [==============================] - 271s 11ms/sample - loss: 0.1612 - acc: 0.9403 - val_loss: 0.3802 - val_acc: 0.8587
Epoch 6/10
25000/25000 [==============================] - 291s 12ms/sample - loss: 0.1218 - acc: 0.9556 - val_loss: 0.4103 - val_acc: 0.8421
Epoch 7/10
25000/25000 [==============================] - 295s 12ms/sample - loss: 0.1023 - acc: 0.9634 - val_loss: 0.4634 - val_acc: 0.8582
Epoch 8/10
25000/25000 [==============================] - 273s 11ms/sample - loss: 0.0789 - acc: 0.9732 - val_loss: 0.5103 - val_acc: 0.8555
Epoch 9/10
25000/25000 [==============================] - 273s 11ms/sample - loss: 0.0676 - acc: 0.9775 - val_loss: 0.5071 - val_acc: 0.8526
Epoch 10/10
25000/25000 [==============================] - 273s 11ms/sample - loss: 0.0663 - acc: 0.9787 - val_loss: 0.5156 - val_acc: 0.8536
<tensorflow.python.keras.callbacks.History object at 0x141462e48>
```
-->

## 15.9.5 Training and Evaluating the Model (2 of 2)
* Function **`evaluate`** returns the **loss and accuracy values**

In [37]:
results = rnn.evaluate(X_test, y_test)



In [38]:
results

[0.6083436147212982, 0.84665]
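
The two values can be unpacked and related back to the 20,000 test samples. The numbers below are copied from the output above; the variable names are illustrative.

```python
shown_results = [0.6083436147212982, 0.84665]  # copied from the output above
loss, accuracy = shown_results  # evaluate returns the loss first, then the metric

test_samples = 20000
correct = round(accuracy * test_samples)
print(f'loss: {loss:.4f}, accuracy: {accuracy:.2%}')
print(f'about {correct:,} of {test_samples:,} reviews classified correctly')
```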

* **Accuracy seems low** compared to our **convnet**, but this is a **much more difficult problem**
    * Many **IMDb sentiment-analysis binary-classification studies** show results **in the high 80s**
* We did **reasonably well** with our **small recurrent neural network** of only **three layers**
    * We have not tried to tune our model