# IMDb movie review classification using Keras 
## Interview presentation for NAV
## By Torgeir Sandnes Laurvik

### External packages used:
* NumPy for array operations
* TensorFlow for importing Keras
* Keras for building the model and loading the dataset

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

### Data preparation

In [2]:
imdb = keras.datasets.imdb
imdb_dataset = imdb.load_data(num_words=10000)
train_data, train_labels = imdb_dataset[0]
test_data, test_labels = imdb_dataset[1]
word_index = imdb.get_word_index()

### Add special tokens to the dictionary
<!-- https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset -->

In [3]:
word_index = {k: (v+3) for k, v in word_index.items()}

word_index['<PAD>'] = 0
word_index['<START>'] = 1
word_index['<UNKNOWN>'] = 2
word_index['<UNUSED>'] = 3
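
The shift by 3 is needed because `load_data` by default reserves the lowest indices for special tokens (`index_from=3`), while `get_word_index()` returns unshifted frequency ranks that start at 1. The sketch below illustrates this with a made-up three-word vocabulary; `toy_index` is hypothetical and not the real IMDb index.

```python
# Hypothetical miniature of what imdb.get_word_index() returns:
# word -> frequency rank, starting at 1.
toy_index = {'the': 1, 'and': 2, 'a': 3}

# Shift every rank by 3 to free indices 0-3 for the special tokens.
shifted = {word: rank + 3 for word, rank in toy_index.items()}
shifted['<PAD>'] = 0
shifted['<START>'] = 1
shifted['<UNKNOWN>'] = 2
shifted['<UNUSED>'] = 3

# Each index now belongs to exactly one token, so the reverse lookup is unambiguous.
reverse = {index: word for word, index in shifted.items()}
print(reverse[1], reverse[4])  # <START> the
```

Without the shift, `'the'` would share index 1 with `<START>`, and decoded reviews would come out as the wrong words.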

### Helper functions:

In [4]:
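reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(indices):  # Converts an encoded review (word indices) back to text
    return ' '.join(reverse_word_index.get(index, '?') for index in indices)

def code_review(text, maxlen=256, vocab_size=10000):  # Converts a text review to a vector of word indices
    coded_string = np.full(maxlen, word_index['<PAD>'], dtype='int32')
    for i, word in enumerate(text.lower().split()[:maxlen]):  # Ignore words beyond maxlen
        index = word_index.get(word, word_index['<UNKNOWN>'])  # Words missing from the dictionary
        if index >= vocab_size:  # Words outside the vocabulary the model was trained on
            index = word_index['<UNKNOWN>']
        coded_string[i] = index
    return coded_string

def find_maxlen(reviews):  # Finds the index and word count of the longest review
    j, maxlen = -1, -1
    for i, review in enumerate(reviews):
        if len(review) > maxlen:
            j, maxlen = i, len(review)
    return j, maxlen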
        
    

### Examples from dataset

In [5]:
ex1 = decode_review(train_data[0])
ex2 = decode_review(train_data[100])
print('EXAMPLE 1: \n',ex1,'\n')
print('EXAMPLE 2: \n',ex2)

EXAMPLE 1: 
 <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNKNOWN> is an amazing actor and now the same being director <UNKNOWN> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNKNOWN> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNKNOWN> to the two little boy's that played the <UNKNOWN> of norman and paul they were just brilliant children are often left out of the <UNKNOWN> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are 

### Zero padding so every review has the shape the input layer expects
#### I set maxlen to 256 words

In [6]:
print('Length of first review: ',len(train_data[0]))
print('Length of second review: ',len(train_data[1]))
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index['<PAD>'], padding='post', maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index['<PAD>'], padding='post', maxlen=256)
print('Length of first review: ',len(train_data[0]))
print('Length of second review: ',len(train_data[1]))

Length of first review:  218
Length of second review:  189
Length of first review:  256
Length of second review:  256
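
For intuition, the effect of `pad_sequences` with `padding='post'` can be reproduced in plain Python. This is a sketch and not the Keras implementation; `pad_post` is a hypothetical helper. It mirrors one detail that is easy to overlook: with the default `truncating='pre'`, reviews longer than `maxlen` lose words from the beginning, not from the end.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; drop the earliest tokens of over-long sequences."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[1, 14, 22], [1, 5, 6, 7, 8, 9]], maxlen=4))
# [[1, 14, 22, 0], [6, 7, 8, 9]]
```

Note that the second sequence has lost its leading `1`, the `<START>` token, which is also what happens to the longest reviews in the dataset.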


### Examples from dataset after zero padding

In [7]:
ex1 = decode_review(train_data[0])
ex2 = decode_review(train_data[100])
print('EXAMPLE 1: \n',ex1,'\n')
print('EXAMPLE 2: \n',ex2)

EXAMPLE 1: 
 <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNKNOWN> is an amazing actor and now the same being director <UNKNOWN> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNKNOWN> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNKNOWN> to the two little boy's that played the <UNKNOWN> of norman and paul they were just brilliant children are often left out of the <UNKNOWN> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are 

### Split training data into training and validation data

In [8]:
x_train = train_data[10000:]
x_valid = train_data[:10000]

y_train = train_labels[10000:]
y_valid = train_labels[:10000]

### Building the neural network by adding layers
<!-- https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0 -->

In [9]:
vocab_size = 10000
model = keras.Sequential()

embedding = keras.layers.Embedding(vocab_size, 16)  # Learns a 16-dimensional vector per word; words used in similar contexts end up with a small angle between their vectors.
model.add(embedding)

globalAverage = keras.layers.GlobalAveragePooling1D()
model.add(globalAverage)

model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
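
The parameter counts in the summary can be checked by hand: the embedding layer stores one vector per vocabulary entry, and each dense layer stores one weight per input-output pair plus one bias per output. A quick check of that arithmetic:

```python
vocab_size, embedding_dim = 10000, 16

embedding_params = vocab_size * embedding_dim  # one 16-dimensional vector per word index
pooling_params = 0                             # averaging has nothing to learn
hidden_params = embedding_dim * 16 + 16        # weights + biases of Dense(16)
output_params = 16 * 1 + 1                     # weights + bias of Dense(1)

total = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total)
# 160000 272 17 160289
```

Almost all of the model's capacity sits in the embedding table; the two dense layers together account for fewer than 300 parameters.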


## ReLU activation function:

<img src='img/relu.png' width=400 height=400>

## Sigmoid activation function:

<img src='img/sigmoid.png' width=400 height=400>
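
Both functions are short enough to write out directly. The sketch below uses only the standard library and is meant for intuition, not as a numerically hardened implementation (very large negative inputs would overflow `math.exp`).

```python
import math

def relu(x):
    """Pass positive values through unchanged, clamp negative values to 0."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(sigmoid(x), 3))
# -2.0 0.0 0.119
# 0.0 0.0 0.5
# 2.0 2.0 0.881
```

ReLU is used in the hidden layer, while the sigmoid on the single output unit turns the final value into a score between 0 and 1 that can be read as the probability of a positive review.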

### Training and validation phase
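#### In this phase I train the model for 40 epochs with a batch size of 512, evaluating on the validation data after each epoch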

In [10]:
history = model.fit(x_train, y_train, epochs=40, batch_size=512, validation_data=(x_valid, y_valid), verbose=0)

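
The `binary_crossentropy` loss minimised during training penalises confident wrong predictions much more heavily than confident correct ones. Below is a plain-Python sketch of the batch average; the function name and the clipping constant `eps` are assumptions for illustration, not Keras internals.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and predicted scores."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(round(binary_crossentropy([1, 0], [0.9, 0.1]), 4))  # 0.1054
print(round(binary_crossentropy([1, 0], [0.1, 0.9]), 4))  # 2.3026
```

The same two reviews give a loss more than twenty times higher when the model is confidently wrong than when it is confidently right.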


### Testing phase

In [11]:
# Testing phase
result = model.evaluate(test_data, test_labels)
print(f'Test loss: {result[0]}\nTest accuracy: {result[1]}')

Test loss: 0.336444063706398
Test accuracy: 0.871720016002655


### Predicting my own review

In [12]:
# Prediction of my review
my_review = 'this movie was really great i liked the ending but the start was not good when i first saw this movie i thought it would be a horrible movie but it ended up being a good experience'
coded_review = code_review(my_review)
array_form = np.array([coded_review, ])
print(f'My review got predicted to {model.predict(array_form)}=> closer to 1 than 0 => positive')

My review got predicted to [[0.74220806]]=> closer to 1 than 0 => positive
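
Because the sigmoid output is a score between 0 and 1, turning it into a label means choosing a cut-off. The sketch below uses the conventional 0.5 threshold; `to_label` is a hypothetical helper, and 0.5 is an assumption, not something the model dictates.

```python
def to_label(score, threshold=0.5):
    """Map a sigmoid score to a sentiment label."""
    return 'positive' if score >= threshold else 'negative'

print(to_label(0.74220806))  # positive
```

A score of 0.74 is clearly above the cut-off but still some way from 1, which fits a review that mixes praise with criticism.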
