# Recurrence and sentiment prediction on the IMDB reviews dataset

In this notebook we train a simple model on the IMDB reviews dataset. We will cover the following topics:
- IMDB reviews dataset:
    - Data types: working with words, strings, and texts
    - Data preparation
- Model architecture:
    - Embedding layer
    - LSTM layer

### The dataset: imdb reviews

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). The reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).

In [4]:
import numpy
import os
import tensorflow as tf
# physical_devices = tf.config.experimental.list_physical_devices('GPU')
# tf.config.experimental.set_memory_growth(physical_devices[0], True)
tf.keras.backend.clear_session()  # For easy reset of notebook state.


from tensorflow.keras.datasets import imdb
# fix random seed for reproducibility
numpy.random.seed(7)

# Load the dataset, keeping only the top_words most frequent words;
# rarer words are replaced by the out-of-vocabulary index (2 by default)
top_words = 5000
index_from = 3
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words, index_from=index_from)

In [5]:
X_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 2,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 2,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 2,
 8,
 4,
 107,
 117,
 2,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 2,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 2,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,
 2,
 18,
 51,
 36,
 

Let's decode the review into something more readable, using the word index that ships with the dataset. Note that we limited the vocabulary to *top_words*, so any token outside that vocabulary has been replaced by the unknown marker `<UNK>`.

In [6]:
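# Map each word to its index, shifted by index_from so that
# indices 0 to 3 stay free for the special tokens below
word_to_id = imdb.get_word_index()
word_to_id = {k: (v + index_from) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3

# Invert the mapping to decode a review back into words
id_to_word = {value: key for key, value in word_to_id.items()}
print(' '.join(id_to_word[idx] for idx in X_train[0]))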

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly <UNK> was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little <UNK> that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up a
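
To make the role of the vocabulary limit concrete, here is a minimal sketch of how a sentence could be encoded under the same convention (indices shifted by `index_from`, 1 for `<START>`, 2 for `<UNK>`). The toy word ranks and the helper `encode_review` are hypothetical and only mimic the behaviour described above; they are not part of Keras.

```python
# Toy word ranks (1 = most frequent), standing in for imdb.get_word_index().
toy_word_rank = {"the": 1, "film": 2, "was": 3, "brilliant": 4, "scottish": 5}

START_ID, UNK_ID = 1, 2  # same special indices as in the cell above


def encode_review(text, word_rank, num_words, index_from=3):
    """Encode text as indices; unknown or too-rare words become UNK_ID."""
    ids = [START_ID]
    for word in text.lower().split():
        rank = word_rank.get(word)
        idx = rank + index_from if rank is not None else UNK_ID
        # Indices at or above num_words fall outside the kept vocabulary.
        ids.append(idx if idx < num_words else UNK_ID)
    return ids


encoded = encode_review("The film was brilliant scottish cinema",
                        toy_word_rank, num_words=7)
print(encoded)  # [1, 4, 5, 6, 2, 2, 2]
```

With `num_words=7`, "brilliant" and "scottish" are known words but too rare to keep, while "cinema" is not in the toy dictionary at all; all three end up as `<UNK>`.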

For our purposes, all inputs should have the same dimensions. So we pad the sequences: reviews shorter than the target length are filled with the padding index 0 (`<PAD>`), and longer reviews are truncated. By default, `pad_sequences` pads and truncates at the start of each sequence.

<font color=red><b>Pad the train and test sequences so they share the same length (500)
<br>Hint: use the function pad_sequences from tensorflow.keras.preprocessing.sequence</b>
</font>

In [7]:
from tensorflow.keras.preprocessing import sequence

# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
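
As an illustration of what the padding step does to a single review, the sketch below re-implements the default behaviour (pad at the start, truncate at the start) in plain Python. The helper `pad_or_truncate` is a hypothetical name used only for this example; the notebook itself relies on `pad_sequences`.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Left-pad short sequences; keep only the last maxlen items of long ones."""
    if len(seq) >= maxlen:
        return list(seq[len(seq) - maxlen:])
    return [pad_value] * (maxlen - len(seq)) + list(seq)


short_review = pad_or_truncate([1, 14, 22], maxlen=5)
long_review = pad_or_truncate([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short_review)  # [0, 0, 1, 14, 22]
print(long_review)   # [22, 16, 43, 530, 973]
```

Padding at the start keeps the real words at the end of the sequence, so they are the last thing the recurrent layer reads.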

In [5]:
X_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

### The Embedding layer
A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over the more traditional bag-of-words encoding schemes, where large sparse vectors were used to represent each word, or to score each word within a vector representing an entire vocabulary. These representations were sparse because the vocabularies were vast, so a given word or document would be represented by a large vector made up mostly of zeros. In an embedding, by contrast, each word is represented by a dense vector: the projection of the word into a continuous vector space. The position of a word within that space is learned from text and is based on the words that surround it when it is used. This learned position is referred to as the word's embedding.

A word embedding can be learned as part of a deep learning model. This can be a slower approach, but it tailors the embedding to a specific training dataset.
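
The contrast between the two representations can be sketched in a few lines of plain Python. This is only an illustration under toy assumptions: the sizes are made up, the table is randomly initialised instead of learned, and the names `one_hot`, `embedding_table` and `embed` are hypothetical.

```python
import random

random.seed(7)

vocab_size = 6      # toy vocabulary, far smaller than top_words
embedding_dim = 4   # toy dimension; the notebook uses 32


def one_hot(index, size):
    """Sparse representation: one slot per vocabulary entry, a single 1."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Dense representation: a table holding one short vector per word index.
# The values are random here; in a model they are learned during training.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(sequence):
    """Look up the dense vector of every index in the sequence."""
    return [embedding_table[idx] for idx in sequence]


review = [1, 4, 5, 2]
vectors = embed(review)
print(len(one_hot(4, vocab_size)), "values per word (one-hot)")
print(len(vectors), "x", len(vectors[0]), "matrix for the review (dense)")
```

With a real vocabulary of 5000 words, the one-hot vector would need 5000 values per word, while the dense vector stays at the chosen embedding size.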

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence

## Model Architecture
Build the RNN model to be trained on the data, with this configuration:
- Embedding layer, with the vocabulary size as the input dimension and 32 as the output dimension. Limit the input length to the maximum review length.
- LSTM layer with 100 units
- Final dense layer with a single unit and sigmoid activation
- On compilation, use Adam as the optimizer. Which loss is the most suitable? Add 'accuracy' as a metric
- Print the summary
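
The sizes in this configuration determine how many trainable weights each layer has, which is what the summary reports. The sketch below works through the arithmetic, assuming the standard LSTM formulation with four gates and one bias vector per gate; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, output_dim):
    # One output_dim-sized vector per vocabulary entry
    return vocab_size * output_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


emb = embedding_params(5000, 32)   # 160000
lstm = lstm_params(32, 100)        # 53200
dense = dense_params(100, 1)       # 101
total = emb + lstm + dense
print(emb, lstm, dense, total)     # 160000 53200 101 213301
```

Most of the weights sit in the embedding table, which is why the vocabulary limit has such a large effect on model size.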


<font color=red><b> What does the embedding structure mean? <br> Hint: 
- Use the imported libraries</b></font>

In [None]:
embedding_vector_length = 32
batch_size = 64
epochs = 3
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # Binary cross-entropy: the model ends in a single output variable
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=batch_size)

"""
Look into callbacks: they can be used to stop training at the end of an epoch.
For example, I can tell the model to stop once accuracy exceeds 0.99, or to save only the models with high accuracy.
"""

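The note on callbacks could be sketched as follows. This is one possible approach, not the notebook's own method: `StopAtAccuracy` is a hypothetical helper written for this example, the file name `best_model.h5` is an assumption, and the log key `"accuracy"` assumes the model was compiled with `metrics=['accuracy']` as above.

```python
import tensorflow as tf


class StopAtAccuracy(tf.keras.callbacks.Callback):
    """Hypothetical helper: stop training once accuracy reaches a threshold."""

    def __init__(self, threshold=0.99):
        super().__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        accuracy = (logs or {}).get("accuracy")
        if accuracy is not None and accuracy >= self.threshold:
            # Keras checks this flag and ends training after the epoch.
            self.model.stop_training = True


callbacks = [
    StopAtAccuracy(threshold=0.99),
    # Keep only the best model seen so far, judged by validation accuracy.
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.h5",  # assumed file name
        monitor="val_accuracy",
        save_best_only=True,
    ),
]

# Usage, assuming the model and data from the cell above:
# model.fit(X_train, y_train, validation_data=(X_test, y_test),
#           epochs=epochs, batch_size=batch_size, callbacks=callbacks)
```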
<font color=red>Evaluate the model </font>



In [None]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
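
To see what the two numbers returned by `model.evaluate` represent, the loss and the accuracy can be reproduced by hand on a few made-up predictions. This is a plain Python sketch under stated assumptions: the probabilities are invented, the 0.5 threshold and the small clipping constant are the usual defaults, and the function names are hypothetical.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]   # made-up sigmoid outputs
loss = binary_crossentropy(labels, probs)
acc = binary_accuracy(labels, probs)
print("Loss: %.4f" % loss)
print("Accuracy: %.2f%%" % (acc * 100))
```

The first two predictions are on the right side of 0.5 and the last two are not, so the accuracy is 50%, while the loss also reflects how confident each prediction was.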

### Convolutions + LSTM

What if we add a convolution to the model structure? Does it make sense? Essentially, it will capture the local structure of the embedded sequence and extract key features from combinations of nearby words. Let's see whether it helps.

#### Model Architecture
<font color=red>Build the Conv-LSTM model to be trained on the data, </font> with this configuration:
- Embedding layer, with the vocabulary size as the input dimension and 32 as the output dimension. Limit the input length to the maximum review length.
- Conv1D layer, with 32 filters and a kernel of size 3. Use 'same' as the padding and ReLU as the activation
- Max pooling layer with pool size 2
- LSTM layer with 100 units
- Final dense layer with a single unit and sigmoid activation
- On compilation, use Adam as the optimizer. Which loss is the most suitable? Add 'accuracy' as a metric
- Print the summary
- Evaluate the model
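
The effect of the convolution and pooling steps on the sequence length can be sketched with a single-channel toy version in plain Python. This is a simplification under stated assumptions: the real layer applies 32 filters across 32 embedding channels, the filter weights below are made up, and the function names are hypothetical.

```python
def conv1d_same(sequence, kernel, bias=0.0):
    """Single-channel 1D convolution with 'same' zero padding and ReLU."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(sequence) + [0.0] * right
    out = []
    for i in range(len(sequence)):
        window = padded[i:i + k]
        value = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(max(0.0, value))  # ReLU
    return out


def max_pool1d(sequence, pool_size=2):
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [
        max(sequence[i:i + pool_size])
        for i in range(0, len(sequence) - pool_size + 1, pool_size)
    ]


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]
kernel = [1.0, 0.0, -1.0]   # made-up filter weights
features = conv1d_same(signal, kernel)
pooled = max_pool1d(features, pool_size=2)
print(len(signal), len(features), len(pooled))  # 6 6 3
```

With 'same' padding the convolution keeps the sequence length, and pooling with size 2 halves it, so a review of 500 steps reaches the LSTM as 250 steps.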


In [None]:
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))  # Maps words to vectors so that related words and word groups reinforce each other, improving the prediction
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))  # Shortens the sequence, which avoids memory problems
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 216,405
Trainable params: 216,405
Non-trainable params: 0
_________________________________________________________________
None
Train on 25000 samples
Epoch 1/3