# Project Updates

Updates since project walkthrough.

### Refactored Data Collection

The code to collect the Keras datasets has been refactored into a separate file, which is imported like so:

In [None]:
from cornell import Cornell

And loaded in like so:

In [None]:
dataset = Cornell()
(train_data, train_labels), (test_data, test_labels) = dataset.load_data()

As before, loading the data takes a minute or two, most of it spent collecting the dialog strings for each movie.

### Class vs. Functional Module

The dataset is built around a class so that the project submission can quickly retrieve the genre index and report example predictions in "human-readable" form. The index is first saved as a dictionary that translates genre strings to encoded integers (used to transform the labels during the dataset load), then inverted into a list in which each position holds the genre name for that encoded integer.

In [None]:
# Encode each genre name into an integer and link them via dict
genre_ints = encoder.fit_transform(genre_names)
for name, idx in zip(genre_names, genre_ints):
    self.genre_to_idx[name] = idx

# Create a conversion table for index back to genre name
self.idx_to_genre = [''] * len(genre_names)
for genre in genre_names:
    self.idx_to_genre[self.genre_to_idx[genre]] = genre

Simple getters are thus implemented to retrieve these.
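
A minimal sketch of what such getters might look like. The class name `GenreIndex` and the method names `get_genre_to_idx` and `get_idx_to_genre` are hypothetical and not taken from `cornell.py`, and a sorted list stands in for the encoder so that the example has no dependencies.

```python
class GenreIndex:
    """Hypothetical stand-in for the genre bookkeeping inside the dataset class."""

    def __init__(self, genre_names):
        # Assumption: integers are assigned in sorted order of the unique names,
        # standing in for the encoder used in the real code.
        unique = sorted(set(genre_names))
        self.genre_to_idx = {name: i for i, name in enumerate(unique)}
        # Invert the dictionary into a list: position i holds the name for i.
        self.idx_to_genre = [''] * len(unique)
        for name, i in self.genre_to_idx.items():
            self.idx_to_genre[i] = name

    def get_genre_to_idx(self):
        return self.genre_to_idx

    def get_idx_to_genre(self):
        return self.idx_to_genre


index = GenreIndex(['drama', 'comedy', 'thriller', 'comedy'])
print(index.get_genre_to_idx())
print(index.get_idx_to_genre())
```

Keeping both directions of the mapping means encoding labels and decoding predictions are each a single lookup.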

### Experimentation with Model Configurations

As explained in the report, the initial model, based on Chollet's IMDb reviews exercise, performed well from the start. Only a few alterations were needed to push the accuracy to around 90.5% on the test set, and no further changes improved on that figure.

In [None]:
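# Assumes standalone Keras; adjust the import if using tensorflow.keras
from keras import layers, models

# Two hidden layers with dropout, then one sigmoid output per genre
model = models.Sequential()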
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(num_genres, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=20, batch_size=16, validation_split=0.1)

### Some Extra Code

To confirm that the built-in Keras `evaluate()` function was reporting accuracy correctly, I also wrote a short piece of code to compute the accuracy by hand.

In [None]:
# Threshold the sigmoid outputs at 0.5 to get binary predictions
predictions = model.predict(x_test)
predictions[predictions >= 0.5] = 1
predictions[predictions < 0.5] = 0

entries = len(y_test)
total = num_genres * entries

# Count every (entry, genre) pair where prediction and label agree
correct = 0
for i in range(entries):
    for j in range(num_genres):
        correct += int(predictions[i][j] == y_test[i][j])

print(correct / total)

As seen in the report, this prints the same accuracy value as `evaluate()`.
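
The same element-wise check can also be written without the nested loops. The sketch below uses small made-up arrays in place of `model.predict(x_test)` and `y_test`; the names `probs` and `labels` are placeholders, not names from the project.

```python
import numpy as np

# Made-up sigmoid outputs and multi-hot labels: 3 entries, 4 genres.
probs = np.array([[0.9, 0.2, 0.6, 0.1],
                  [0.4, 0.8, 0.3, 0.7],
                  [0.5, 0.1, 0.2, 0.9]])
labels = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 0, 1, 1]])

# Threshold at 0.5, matching the loop-based check.
binary = (probs >= 0.5).astype(int)

# Fraction of (entry, genre) cells where prediction and label agree.
accuracy = np.mean(binary == labels)
print(accuracy)
```

Here 10 of the 12 cells agree, so the printed value is 10/12.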

In [None]:
def print_results(idx):
    predicted = ""
    actual = ""
    for i in range(num_genres):
        if predictions[idx][i] > 0:
            predicted += idx_to_genre[i] + ' '
        if y_test[idx][i] > 0:
            actual += idx_to_genre[i] + ' '
    print("Predicted: " + predicted)
    print("Actual: " + actual)

for i in range(20):
    print_results(i)
    print()

This prints the predicted and actual genres for the first 20 test entries, translated back into readable genre names.
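
If the genre names are needed as values rather than printed text, the lookup could return a list instead. A minimal sketch, in which `decode_genres` is a hypothetical helper and the genre list and label row are made up:

```python
def decode_genres(row, idx_to_genre):
    """Return the genre names whose positions are set in a multi-hot row."""
    return [idx_to_genre[i] for i, flag in enumerate(row) if flag > 0]


idx_to_genre = ['comedy', 'drama', 'thriller']
print(', '.join(decode_genres([1, 0, 1], idx_to_genre)))
```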