# Sentiment Analysis

### Movie Review Dataset

This dataset contains 25,000 reviews from IMDB. Each review is already preprocessed and labeled as either positive or negative. Each review is encoded as a sequence of integers, where each integer represents how common a word is in the entire dataset. For example, a word encoded by the integer 3 is the 3rd most common word in the dataset.
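To make the idea of frequency-rank encoding concrete, here is a minimal sketch on a toy corpus. The function `build_rank_index` and the three toy reviews are hypothetical and only illustrate the principle; this is not how the IMDB dataset itself was built.

```python
from collections import Counter

def build_rank_index(texts):
    """Map each word to its frequency rank (1 = most common)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; break ties alphabetically so the result is deterministic
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}

# Toy data, invented for illustration
toy_reviews = ["the movie was great", "the movie was bad", "the plot was thin"]
rank_index = build_rank_index(toy_reviews)

# Encode the first review as a list of ranks
encoded_review = [rank_index[word] for word in toy_reviews[0].split()]
print(rank_index)
print(encoded_review)
```

Frequent words such as "the" get small integers, while rare words get large ones.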

In [15]:
import tensorflow as tf
import os
import numpy as np
from keras.datasets import imdb
from keras.preprocessing import sequence
from tensorflow import keras

In [2]:
VOCAB_SIZE = 88584
MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 2s 0us/step


In [4]:
len(train_data[0])

218

### More preprocessing

We can see that the reviews have different lengths, but we cannot pass inputs of different lengths into our neural network. Therefore we must make every review the same length:
- If a review is longer than 250 words, trim off the extra words.
- If a review is shorter than 250 words, add as many 0s as needed to bring it to 250.
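The two rules above can be sketched in plain Python. The helper `pad_or_trim` is hypothetical. It mimics the default behavior of Keras's `pad_sequences`, which pads at the front and also truncates from the front, but it is not the library's implementation.

```python
def pad_or_trim(seq, maxlen, pad_value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items (maxlen > 0)."""
    trimmed = list(seq)[-maxlen:]                     # keep only the last maxlen items
    padding = [pad_value] * (maxlen - len(trimmed))   # zeros go in front
    return padding + trimmed

short_example = pad_or_trim([5, 6, 7], 5)
long_example = pad_or_trim([1, 2, 3, 4, 5, 6], 4)
print(short_example)
print(long_example)
```

Padding at the front means the real words sit at the end of the sequence, which matches the encoded review printed later in this notebook.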

In [9]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)

### Creating the model

In [10]:
model = tf.keras.Sequential([
    # Embedding: turns each word into a vector of real numbers that can capture semantic relationships between words
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    # LSTM: recurrent layer able to learn long-range dependencies in text sequences
    tf.keras.layers.LSTM(32),
    # Dense (fully connected) output layer; the sigmoid activation squashes the output into the range [0, 1], which suits binary classification.
    # An instance can then be assigned to a class by comparing this probability with a threshold, typically 0.5.
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
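As a sanity check on the size of this architecture, the trainable parameter count can be worked out by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the variable names are ours, and the numbers assume the layer sizes used above (vocabulary 88584, embedding width 32, 32 LSTM units).

```python
VOCAB_SIZE = 88584
EMBED_DIM = 32
LSTM_UNITS = 32

# Embedding: one EMBED_DIM-wide vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((EMBED_DIM + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)

# Dense: one weight per LSTM unit plus a single bias
dense_params = LSTM_UNITS * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters live in the embedding table. Once the model has been built, `model.summary()` should report comparable figures.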

### Training

In [38]:
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['acc'])
history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 24s 36ms/step - acc: 0.9939 - loss: 0.0193 - val_acc: 0.8782 - val_loss: 0.6331
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 25s 39ms/step - acc: 0.9975 - loss: 0.0139 - val_acc: 0.8808 - val_loss: 0.6870
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 43ms/step - acc: 0.9975 - loss: 0.0090 - val_acc: 0.8724 - val_loss: 0.7055
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 42ms/step - acc: 0.9974 - loss: 0.0111 - val_acc: 0.8740 - val_loss: 0.7589
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 42ms/step - acc: 0.9986 - loss: 0.0070 - val_acc: 0.8768 - val_loss: 0.7397
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 42ms/step - acc: 0.9975 - loss: 0.0081 - val_acc: 0.8686 - val_loss: 0.7492
Epoch 7/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s

In [39]:
results = model.evaluate(test_data, test_labels)
print(results)

782/782 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - acc: 0.8536 - loss: 0.9737
[0.988158643245697, 0.8536400198936462]


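Evaluating on the held-out test set gives a loss of about 0.99 and an accuracy of about 85%. `model.evaluate` returns the loss first, followed by the metrics passed to `compile`. In the training log, training accuracy is above 99% while validation accuracy stays near 87% and validation loss keeps rising, which suggests the model is overfitting.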
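One way to read the training log is to find the epoch where validation loss was lowest. The sketch below does that with the six `val_loss` values copied from the log above (the log is cut off after epoch 6). The helper `best_epoch` is a hypothetical name introduced for illustration.

```python
# Validation losses for epochs 1-6, copied from the training log above
val_loss = [0.6331, 0.6870, 0.7055, 0.7589, 0.7397, 0.7492]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest loss."""
    return min(range(len(losses)), key=lambda i: losses[i]) + 1

best = best_epoch(val_loss)
got_worse = val_loss[-1] > val_loss[0]
print(best)
print(got_worse)
```

In this log the lowest validation loss is at the first logged epoch, so further epochs did not help on held-out data. Keras offers the `tf.keras.callbacks.EarlyStopping` callback for stopping automatically in such cases; whether it would improve the test accuracy here has not been tried in this notebook.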

### Making predictions

We will use the network to make predictions on our own reviews. Because the training reviews are integer-encoded, we'll need to convert any review we write into the same form so the network can understand it. To do that, we load the word-to-integer mapping from the dataset and use it to encode our own text.

ENCODING

In [40]:
word_index = imdb.get_word_index()

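def encode_text(text):
    # Split the text into lowercase word tokens, dropping punctuation
    tokens = keras.preprocessing.text.text_to_word_sequence(text)
    # Look up each word's integer; words missing from the vocabulary become 0
    tokens = [word_index.get(word, 0) for word in tokens]
    # Pad (or trim) to MAXLEN and return the single resulting row
    return sequence.pad_sequences([tokens], MAXLEN)[0]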

text = 'that movie was just amazing, so amazing'
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  17  13  4
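One caveat about this encoding: with its default arguments, `imdb.load_data` reserves 0 for padding, 1 for a start-of-review marker, and 2 for unknown words, and shifts every word index up by 3. `imdb.get_word_index()` returns the unshifted ranks, so the integers produced by `encode_text` may not line up exactly with the ones the model was trained on. Below is a minimal sketch of an offset-aware encoder. The function `encode_with_offset` and the toy `toy_word_index` are hypothetical, and the constants assume the Keras defaults (`index_from=3`, `oov_char=2`).

```python
INDEX_FROM = 3  # assumed default offset applied by imdb.load_data
OOV_INDEX = 2   # assumed default index for out-of-vocabulary words

def encode_with_offset(tokens, word_index, vocab_size):
    """Encode tokens using the shifted indices seen during training."""
    encoded = []
    for word in tokens:
        rank = word_index.get(word)
        if rank is None or rank + INDEX_FROM >= vocab_size:
            encoded.append(OOV_INDEX)          # unknown or out-of-range word
        else:
            encoded.append(rank + INDEX_FROM)  # shift the rank by the offset
    return encoded

# Toy mapping for illustration; the real one comes from imdb.get_word_index()
toy_word_index = {"the": 1, "and": 2, "movie": 17}
shifted = encode_with_offset(["the", "movie", "zzz"], toy_word_index, vocab_size=88584)
print(shifted)
```

The outputs shown in this notebook were produced with the unshifted `encode_text`, so switching to an offset-aware version would change the printed predictions.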

DECODING

In [41]:
# Decode function: converts a sequence of integers back into text

reverse_word_index = {value: key for (key, value) in word_index.items()}  # invert word_index

def decode_integers(integers):
    PAD = 0  # padding value; zeros are skipped when decoding
    text = ""
    for num in integers:
        if num != PAD:  # look up the matching word in reverse_word_index and append it to text
            text += reverse_word_index[num] + " "
    return text[:-1]  # return the decoded text without the trailing space added during concatenation

print(decode_integers(encoded))


that movie was just amazing so amazing


Now it's time to make a prediction.

In [52]:
def predict(text):
    encoded_text = encode_text(text)
    # Create a NumPy array of zeros with shape (1, 250): 1 is the number of samples, 250 the maximum sequence length
    pred = np.zeros((1, 250))
    # Place the encoded sequence in the first (and only) row. Keras models expect input of shape (batch_size, sequence_length), even when the batch size is 1.
    pred[0] = encoded_text
    result = model.predict(pred)  # use the trained model to make the prediction
    print(result[0])  # array holding a single probability

positive_review = 'That movie was so awesome! I really loved it and would watch it again becuse it was amazingly great'
predict(positive_review)

negative_review = 'that movie sucked. I hated it and would not watch it again. Was one of the worst things I have ever watched '
predict(negative_review)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
[0.9839056]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
[0.4179658]
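The model outputs a probability, not a label. A minimal sketch of turning the two probabilities above into labels with the usual 0.5 threshold follows; the helper `to_label` is hypothetical, and the two probabilities are copied from the output above.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a sentiment label."""
    return "positive" if probability >= threshold else "negative"

# Probabilities copied from the prediction output above
labels = [to_label(p) for p in (0.9839056, 0.4179658)]
print(labels)
```

The first review lands clearly on the positive side, while the second falls only slightly below the threshold, so the model is far less certain about it.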
