# Predicting Movie Genre From Only Its Dialog

For CS-344: Artificial Intelligence by Professor Keith VanderLinden, Calvin University.

Project and report by Nathan Meyer.

## Vision

This project is an experiment in machine learning, specifically neural networks, that classifies movies by genre using only each movie's dialog text. The project was envisioned as a multi-class, multi-label extension of the [IMDb reviews classification exercise](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/3.5-classifying-movie-reviews.ipynb) from F. Chollet's Deep Learning with Python textbook, expanding that exercise's binary classification model into one that can not only classify across several categories, but also assign multiple labels to each entry.

The inspiration for this project came from finding the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), which was originally conceived as a part of another machine learning research project. Given the work that had already been performed on this dataset to categorize movie lines by which movie they were in, it seemed feasible to convert this data into something I could use for this project.

Since this dataset includes movies, their (zero or more) genres, and all of the dialog lines contained within each movie, I could picture a translation of the IMDb classification exercise. Instead of classifying examples as merely positive or negative, this model could classify each film, using only its dialog, into zero or more of the genres it is associated with. This would mean that the model would not only be multi-class, like our MNIST example, but multi-label, where each example could have zero or more labels applied to it. Finding out how simple or difficult this multi-label expansion would be was another aspect of interest in this project.

Another point of both concern and interest was the dataset itself. Instead of a collection of multiple thousands of examples, the dataset is a collection of only 617 movies, but 304,713 utterances are spread between them. This is a very different distribution than the multiple thousands of reviews with shorter sequences in the IMDb example.

So the purpose of this experiment is to see how effective a machine learning model can be in identifying multi-class, multi-label classifications under relatively restricted circumstances. Since screenplays are relatively easy to find after their movies are released, I could also see, if this experiment proves successful enough, that something like this could be used to assist movie classification tools with quickly gathering initial data about their genres.

## Background

The main technology used for this machine learning model is [Keras](https://www.keras.io/), a high-level neural network API that runs on top of Google's TensorFlow. Since it is also the framework used in the exercises that formed the basis for this project, it made sense to keep using it rather than switch to another technology. I was not planning on applying an unfamiliar machine learning architecture to this project, so I simply used Keras' Sequential model with Dense layers. Again, the model is based upon the [IMDb reviews classification exercise](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/3.5-classifying-movie-reviews.ipynb). The only modifications made to it were in the size of the layers and in the addition of Dropout layers to help fight overfitting.

While other avenues were considered, the implementation that ultimately worked for the multi-class, multi-label problem was almost identical to the IMDb example. Particularly with the help of [this Stack Overflow answer](https://stackoverflow.com/a/44165755), the realization was to have the model essentially treat each genre as its own binary classification problem. Using sigmoid on the output layer instead of an activation like softmax lets every genre receive its own independent probability, rather than forcing the genres to compete until one answer wins out.
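To make the sigmoid-versus-softmax distinction concrete, here is a minimal NumPy sketch. The three logits are made-up values standing in for the raw outputs of a final layer for three genres; they are not taken from this project's model.

```python
import numpy as np

def sigmoid(z):
    # Squashes each logit independently into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Normalizes all logits together so the outputs sum to 1.
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

# Hypothetical raw outputs for three genres (illustrative values only).
logits = np.array([3.0, 2.5, -4.0])

sigmoid_scores = sigmoid(logits)
softmax_scores = softmax(logits)

print("sigmoid:", np.round(sigmoid_scores, 3))
print("softmax:", np.round(softmax_scores, 3))
```

With sigmoid, both of the first two genres score above 0.5 at the same time. With softmax, the scores must share a total of 1, so at most one genre can exceed 0.5.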

Then there is the dataset itself. This dataset is structured by several different files that are linked together by IDs in an SQL-like style. Because of this, and my higher familiarity with SQL than with processing .csv-like files, the dataset I opted to use is an SQLite conversion by Lee Richards, [found on Kaggle here](https://www.kaggle.com/mrlarichards/cornell-movie-dialogs-corpus-sqlite). The database is structured by tables for movies, genres, a linking table between movies and genres, characters, lines, and a conversations table linking characters to lines. For this experiment, only the movies, genres, and lines tables will be used, in order to collect the lines and genres for each movie in a fairly straightforward manner.

To process this dataset, I primarily used four technologies, with the first being [SQLite3 for Python](https://docs.python.org/2/library/sqlite3.html). As a library already included in Python, this turned out to be a very practical way to access the dataset and use familiar queries to gather what I needed. My usage of it amounted to opening the database, creating a cursor, and repeatedly executing queries and fetching their results.
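The open, query, and fetch pattern looks roughly like the sketch below. It runs against a tiny in-memory database rather than the Kaggle file, and the table and column names (`movies`, `lines`, `movie_id`, `line_text`) are assumptions for illustration; the real schema may differ.

```python
import sqlite3

# In-memory stand-in for the corpus database (hypothetical schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE lines (id INTEGER PRIMARY KEY, movie_id INTEGER, line_text TEXT);
    INSERT INTO movies VALUES (1, 'example movie');
    INSERT INTO lines VALUES (1, 1, 'Hello there.');
    INSERT INTO lines VALUES (2, 1, 'Goodbye.');
""")

# Fetch every line of one movie and join them into a single dialog string.
cur.execute("SELECT line_text FROM lines WHERE movie_id = ? ORDER BY id", (1,))
dialog = " ".join(row[0] for row in cur.fetchall())
conn.close()

print(dialog)
```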

The second was [Numpy](https://numpy.org), which we had learned about in the course. Primarily, it was used to create the necessary Numpy arrays for Keras and to shuffle the data once it had been collected.

Third, I utilized the [Keras Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer). This was vital in being able to encode the dialog into indexed integers. In particular, I used the `fit_on_texts()` and `texts_to_sequences()` functions to translate the dialog strings into sequences of integers, which are later multi-hot encoded.
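For intuition about what this two-step encoding does, here is a simplified pure-Python stand-in. It is not Keras' implementation and its punctuation handling is cruder, but it follows the same idea: build a word index ordered by frequency (starting at 1), then map each text to a list of indices. The helper names and sample sentences are hypothetical.

```python
from collections import Counter
import string

def simple_split(text):
    # Lowercase, strip punctuation, split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def fit_word_index(texts):
    # More frequent words receive smaller indices; index 0 is left unused.
    counts = Counter()
    for text in texts:
        counts.update(simple_split(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    # Replace each known word with its integer index.
    return [[word_index[w] for w in simple_split(t) if w in word_index]
            for t in texts]

texts = ["You know the plan.", "The plan is the plan."]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index)
print(word_index)
print(sequences)
```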

Finally, for the genre collection steps, I used [scikit-learn's](https://scikit-learn.org/stable/) LabelEncoder class to encode the genres into indexes. It is a simple class which can take values (strings) and encode them into integer form. [This tutorial from Machine Learning Mastery](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/) was vital in learning how to use it for this, as it explained in detail how to use the LabelEncoder's fit_transform() function to quickly encode each genre into an integer.

It is worth noting that I initially based my approach for data collection on the `imdb` dataset module from `keras.datasets` as well. This is particularly evident in the usage of load_data() as a function and in some of its arguments. However, as implementation continued, most of the code diverged significantly from said module.

## Implementation

### Data Collection

A significant portion of the project's time was spent collecting and preparing the data. Although the SQLite database added some degree of convenience, it was still a task to translate it into something that Keras could use.

For the sake of simplicity within the implementation, I decided to refactor the code for the data collection into a Cornell class module found in `cornell.py`. It can be instantiated as a class object which contains genre-to-index and index-to-genre conversion tables as its only instance variables. I made this decision so that gathering the conversion tables would be fast, in the case of visualizing the results in human-readable form.

The Cornell class' meat is in the code for load_data(). This code utilizes the technologies described above, first performing SQL queries to gather the data (movie lines), tokenizing them with Keras' Tokenizer, and storing the results in a Numpy array. Then, for the labels (movie genres), the genres are first collected as a list of strings, then encoded into a list of integers via the LabelEncoder, and then linked together as a dictionary. This is then reversed back into a normal list in order to create the index-to-genre conversion table. Then, through numerous SQL queries, the genres for each movie are collected and, once finished, translated into integer encodings through the genre-to-index conversion table.
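The two conversion tables can be pictured with the plain-Python sketch below. It assumes indices are assigned in sorted order of the unique genre names, which is also how scikit-learn's LabelEncoder orders its classes. The genre strings are sample values, not the corpus's actual genre list.

```python
# Sample genre strings as they might come back from the database (illustrative).
genres = ["thriller", "comedy", "drama", "comedy", "thriller"]

# Index-to-genre table: position in the list is the genre's integer code.
idx_to_genre = sorted(set(genres))

# Genre-to-index table: the reverse lookup.
genre_to_idx = {genre: idx for idx, genre in enumerate(idx_to_genre)}

# Encode one movie's genres into integer labels.
movie_genres = ["drama", "thriller"]
encoded = [genre_to_idx[g] for g in movie_genres]

print(idx_to_genre)
print(encoded)
```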

After this is done, Numpy is used to shuffle both the data and the labels before each tuple of data and labels is returned. This section in particular is based upon the original implementation within `keras.datasets.imdb`.
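Shuffling data and labels together only works if both are reordered by the same permutation. A minimal sketch of that idea follows; the toy sequences and the seed are arbitrary, and the actual code in `cornell.py` may differ in detail.

```python
import numpy as np

# Toy ragged sequences and their label lists (illustrative values only).
data = np.array([[1, 2], [3], [4, 5, 6], [7]], dtype=object)
labels = np.array([[0], [1, 2], [2], [0, 1]], dtype=object)
original_pairs = {(tuple(d), tuple(l)) for d, l in zip(data, labels)}

# Shuffle one index array, then apply it to both arrays so pairs stay aligned.
rng = np.random.RandomState(113)  # arbitrary seed, for reproducibility
indices = np.arange(len(data))
rng.shuffle(indices)
data, labels = data[indices], labels[indices]

shuffled_pairs = {(tuple(d), tuple(l)) for d, l in zip(data, labels)}
print(shuffled_pairs == original_pairs)
```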

All of these steps are thus abstracted away in the following lines of code:

In [1]:
from cornell import Cornell

dataset = Cornell()
(train_data, train_labels), (test_data, test_labels) = dataset.load_data()

Using TensorFlow backend.


This will take a minute or two. Here is an example of some dialog lines encoded to word indices (the first ten since it is quite lengthy):

In [4]:
train_data[0][:10]

[554, 7, 62, 18, 2, 52, 10, 1115, 1191, 16]

And the genre indices associated with this same movie example:

In [3]:
train_labels[0]

[17, 5, 8]

From there we can collect the `idx_to_genre` list to use for later translation and collecting the number of genres.

In [5]:
idx_to_genre = dataset.get_idx_to_genre()

Then we vectorize the datasets into multi-hot encoded vectors. This is done in the same way as in the IMDb exercise: `vectorize_sequences` receives a list of index sequences and a `dimension` value, and returns a two-dimensional array in which each row is a vector of length `dimension` with a 1 at every index that appears in the corresponding sequence.

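In [6]:
import numpy as np

num_words = 10000
num_genres = len(idx_to_genre)

def vectorize_sequences(sequences, dimension):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data, num_words)
x_test = vectorize_sequences(test_data, num_words)

y_train = vectorize_sequences(train_labels, num_genres)
y_test = vectorize_sequences(test_labels, num_genres)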

Now the data is ready to go. As in the IMDb exercise, the data and labels are now multi-hot encoded arrays, where a 1 marks each word index or genre index present in the example, like so:

In [7]:
y_train[0]

array([0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0.])
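Reading such a vector back into genre names is just a lookup of the positions that hold a 1. In the sketch below the vector matches the output above (ones at indices 5, 8, and 17), but the genre names are placeholders, since the real `idx_to_genre` contents are not shown here.

```python
import numpy as np

# Placeholder names; the real list comes from dataset.get_idx_to_genre().
idx_to_genre = ["genre_%d" % i for i in range(24)]

# Multi-hot label vector with ones at indices 5, 8, and 17.
label_vector = np.zeros(24)
label_vector[[5, 8, 17]] = 1.

# Indices of the nonzero entries, translated back to names.
active = np.flatnonzero(label_vector)
names = [idx_to_genre[i] for i in active]
print(names)
```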

### Building the Model

Although numerous tweaked configurations were attempted, the base configuration from the IMDb example already produced surprisingly effective results. However, a few tweaks were made.

The number of layers remains the same, but the size of the first hidden layer is 256, a little over half the size of the training set, and the second hidden layer is half that. This seemed to show slightly better results than networks with more or fewer layers than this.

Additionally, two dropout layers were added. Without them, the model was very quick to overfit, but with a relatively steep dropout of 50%, the training curve improved dramatically and fell more in line with the validation set.

The output layer is almost identical to the one in the IMDb example, except that its size is now the number of genres, so that it outputs one probability for each encoded genre.

In [17]:
from keras import models, layers, backend

model = models.Sequential()
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(num_genres, activation='sigmoid'))
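As a rough picture of what a Dropout layer with a rate of 0.5 does during training, here is a NumPy sketch of "inverted" dropout: each activation is zeroed with probability `rate`, and the survivors are scaled up by `1 / (1 - rate)` so the expected sum stays the same. This is an illustrative stand-in, not Keras' internal code.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Keep each unit with probability (1 - rate), then rescale the survivors.
    keep_mask = rng.random_sample(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.RandomState(0)   # arbitrary seed
hidden = np.ones((4, 8))         # toy hidden-layer activations
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
```

At inference time no units are dropped, so the layer passes its input through unchanged.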

For loss, binary cross-entropy is still used. The suggestions made from [the aforementioned Stack Overflow answer](https://stackoverflow.com/a/44165755) included pointing out that this sort of multi-label model should be thought of as output that is decomposed into several binary classification labels. This makes it such that the model continuously improves each individual label regardless of the others, while a categorical classification structure would seek to only identify each example with one label. This part is, in many ways, the "secret sauce" to how the model functions properly at all.

In [19]:
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
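To show how binary cross-entropy scores each label on its own, here is a small NumPy sketch. The target and prediction vectors are made-up values for three genres, and `label_losses` is a hypothetical helper name.

```python
import numpy as np

def label_losses(y_true, y_pred, eps=1e-7):
    # Per-label binary cross-entropy; clipping avoids log(0).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1., 0., 1.])        # the movie has genres 0 and 2
good = np.array([0.9, 0.1, 0.8])       # confident and correct on all labels
bad = np.array([0.9, 0.1, 0.2])        # misses genre 2 only

loss_good = label_losses(y_true, good).mean()
loss_bad = label_losses(y_true, bad).mean()
print(loss_good, loss_bad)
```

Only the third label's term differs between the two predictions, which is the sense in which each genre is trained as its own binary problem.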

Now the model can be fit to the training set. As Keras has built-in functionality to split the dataset into a training and validation set, the argument `validation_split=0.1` takes the last 10% of the training set to use for validation. This configuration, with 13 epochs and a batch size of 16, seemed to produce the most solid results I could attain.

In [20]:
model.fit(x_train, y_train,
          epochs=13, batch_size=16, validation_split=0.1)

Train on 415 samples, validate on 47 samples
Epoch 1/13
Epoch 2/13
Epoch 3/13
Epoch 4/13
Epoch 5/13
Epoch 6/13
Epoch 7/13
Epoch 8/13
Epoch 9/13
Epoch 10/13
Epoch 11/13
Epoch 12/13
Epoch 13/13


<keras.callbacks.History at 0x14d289990>
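The sample counts in the log above follow from the split arithmetic. The sketch below assumes the split index is computed by truncating `num_samples * (1 - validation_split)` to an integer, which is consistent with the 415/47 figures reported.

```python
num_samples = 415 + 47        # total examples passed to fit(), per the log
validation_split = 0.1

# Assumed rule: everything before split_at trains, the rest validates.
split_at = int(num_samples * (1.0 - validation_split))
num_train = split_at
num_val = num_samples - split_at

print(num_train, num_val)
```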

In [6]:
import numpy as np

num_words = 10000
num_genres = len(idx_to_genre)

def vectorize_sequences(sequences, dimension):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data, num_words)
x_test = vectorize_sequences(test_data, num_words)

y_train = vectorize_sequences(train_labels, num_genres)
y_test = vectorize_sequences(test_labels, num_genres)

## Results

## Implications