## Practical 3

## Training a binary classifier on IMDB dataset

In [1]:
# Training a binary classifier
import numpy as np
import keras
import cv2

dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative).

returns:
x_train, x_test: lists of sequences, which are lists of indexes (integers). If the num_words argument was specific, the maximum possible index value is num_words - 1. If the maxlen argument was specified, the largest possible sequence length is maxlen.

y_train, y_test: lists of integer labels (1 or 0).


In [2]:
import tensorflow as tf
(data_train, labels_train), (data_test, labels_test) = tf.keras.datasets.imdb.load_data(path='imdb.npz', num_words=10000, 
                                                                                      skip_top=0, maxlen=None, seed=113, 
                                                                                      start_char=1, oov_char=2, index_from=3)

  x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
  x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])


In [3]:
data = np.concatenate((data_train, data_test), axis=0)
targets = np.concatenate((labels_train, labels_test), axis=0)
# Single training example
print("Label: ", targets[0], "\n") # is 1, so a positive movie review
print(data[0], ) # is a movie review encoded with the frequency of the words in that review

Label:  1 

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [4]:
print("Categories", np.unique(targets))
print("Number of unique words:", len(np.unique(np.hstack(data))))
length = [len(i) for i in data]
print("Average Review Length:", np.mean(length))
print("Standard Deviation:", np.std(length))

Categories [0 1]
Number of unique words: 9998
Average Review Length: 234.75892
Standard Deviation: 172.91149458735703


In [5]:
# Decoding the review by mapping word indices back to original words
# Unknown words are replaced with "?"
# Using get_word_index() function

word_index = tf.keras.datasets.imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join((reverse_word_index.get(i-3, '?') for i in data_train[0]))

print(decoded_review)

? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you thi

In [6]:
# Preprocessing of data for Neural Network
# Vectorizing every review and fill it with zeros so it contains exactly 10000 numers.
# We do this because the longest review is nearly that long
# Every input of neural network needs to have the same size

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence]=1
    return results

In [7]:
# Further processing of data
x_train = vectorize_sequences(data_train)
x_test = vectorize_sequences(data_test)

y_train = np.asarray(labels_train).astype('float32')
y_test = np.asarray(labels_train).astype('float32')

In [8]:
# Building neural network
model = keras.models.Sequential()
model.add(keras.layers.Dense(16, activation='relu', input_shape=(10000, )))
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

In [9]:
# Compiling the network
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
#model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [10]:
# Preparing validation dataset
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

In [11]:
# Training the neural network
history = model.fit(partial_x_train, partial_y_train, epochs=1, batch_size=512, validation_data=(x_val, y_val))



In [12]:
# Record of different epochs of the training process
hist_dict = history.history
hist_dict['accuracy'][-1]
print(np.mean(hist_dict['val_accuracy']))
print(hist_dict)

0.8414000272750854
{'loss': [0.5320253968238831], 'accuracy': [0.7728666663169861], 'val_loss': [0.4232763648033142], 'val_accuracy': [0.8414000272750854]}


In [13]:
# Test results:
results = model.evaluate(x_test, y_test)
print(results)
print('Test loss = {}, test accuracy = {}'.format(results[0], results[1]))

[0.9388705492019653, 0.4954800009727478]
Test loss = 0.9388705492019653, test accuracy = 0.4954800009727478


In [14]:
print(model.predict(x_test))

[[0.3796627 ]
 [0.79477733]
 [0.6398718 ]
 ...
 [0.22786322]
 [0.24812627]
 [0.44605574]]


Accuracy decrease for adam optimizer and increasing the number of epochs.
It increases for increase in batch size