As a baseline for comparison, this notebook creates a Naive Bayes classifier to predict the genre of a piece of media based on its synopsis.

In [1]:
# sklearn for various machine learning algorithms
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# CountVectorizer for converting text data into a matrix of token counts
# Not the only way to do this, but a simple and common way
from sklearn.feature_extraction.text import CountVectorizer

# Pandas for working with dataframes (including the one already created by the data processing notebook)
import pandas as pd

In [2]:
# Load the data
data = pd.read_csv('Datasets/onehotplotgenre.csv')

# Print the first few rows of the data
print(data.head())

                                    Title  \
0          #1 Cheerleader Camp (2010) (V)   
1                 #1 Serial Killer (2013)   
2  #1 at the Apocalypse Box Office (2015)   
3                             #137 (2011)   
4                              #30 (2013)   

                                                Plot  Action  Adventure  \
0  When they're hired to work at a cheerleading c...       0          0   
1  Years of seething rage against the racism he's...       0          0   
2  Jules is, self declared, the most useless pers...       0          0   
3  #137 is a SCI/FI thriller about a girl, Marla,...       0          0   
4  A bright and talented performer, Chelsea Johns...       0          0   

   Biography  Comedy  Crime  Drama  Family  Fantasy  ...  Horror  Music  \
0          0       1      0      0       0        0  ...       0      0   
1          0       0      0      0       0        0  ...       1      0   
2          0       1      0      0       0        0  

To start, we'll train a single binary Naive Bayes classifier that decides whether a movie is an action movie or not.

In [3]:
# Extract just the title, plot, and Action columns
# This is the data that we will use to train the action classifier
actiondata = data[['Title', 'Plot', 'Action']]
print(actiondata.head())

# Split the data into training and testing sets
# The training set will be used to train the classifier
# The testing set will be used to evaluate the classifier
X_train, X_test, y_train, y_test = train_test_split(actiondata['Plot'], actiondata['Action'], test_size=0.2, random_state=42)

# Create a CountVectorizer object
# This object will convert the text data into a matrix of token counts
vectorizer = CountVectorizer()

# Fit the vectorizer to the training data
# This step determines which words are in the vocabulary
vectorizer.fit(X_train)

# Transform both the training and testing data using the fitted vectorizer
# Each plot becomes a row of token counts; words not seen during fitting are ignored
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)


                                    Title  \
0          #1 Cheerleader Camp (2010) (V)   
1                 #1 Serial Killer (2013)   
2  #1 at the Apocalypse Box Office (2015)   
3                             #137 (2011)   
4                              #30 (2013)   

                                                Plot  Action  
0  When they're hired to work at a cheerleading c...       0  
1  Years of seething rage against the racism he's...       0  
2  Jules is, self declared, the most useless pers...       0  
3  #137 is a SCI/FI thriller about a girl, Marla,...       0  
4  A bright and talented performer, Chelsea Johns...       0  
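
To make the "matrix of token counts" concrete, here is a minimal standard-library sketch of the fit/transform idea. It is not scikit-learn's implementation: it only assumes CountVectorizer's documented defaults (lowercasing, tokens of two or more word characters), and the helper names and toy synopses are hypothetical.

```python
import re
from collections import Counter

# Assumption: imitate CountVectorizer's default behaviour, i.e. lowercase the
# text and keep tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Split a synopsis into lowercase tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Return one row of counts per document; unseen tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical toy synopses, not taken from the dataset
train_docs = ["A spy saves the world.", "The kittens nap. The kittens play."]
vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)

# "adopts" was never seen while fitting, so it contributes no column
test_matrix = transform(["A spy adopts kittens"], vocabulary)

print(vocabulary)
print(train_matrix)
print(test_matrix)
```

The vocabulary is learned from the training text only, which is why the notebook calls `fit` on `X_train` and then `transform` on both splits.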


In [4]:
# Create a Multinomial Naive Bayes classifier
actionclassifier = MultinomialNB()

# Train the classifier
actionclassifier.fit(X_train, y_train)

# Evaluate the classifier
score = actionclassifier.score(X_test, y_test)
print('Accuracy:', score)

Accuracy: 0.8676405801187728


In [5]:
# Predict if a new plot is an action movie
plot = "So many fuzzy kittens."

# Convert the plot into a matrix of token counts
plot = vectorizer.transform([plot])

# Estimate the class probabilities for the plot
# Each row is [P(not action), P(action)], following actionclassifier.classes_
prediction = actionclassifier.predict_proba(plot)

# Print the probabilities
print(prediction)

# Predict if a new plot is an action movie
plot2 = "James Bond saves the world from evil."

# Convert the plot into a matrix of token counts
plot2 = vectorizer.transform([plot2])

# Estimate the class probabilities for the second plot
# Each row is [P(not action), P(action)], following actionclassifier.classes_
prediction2 = actionclassifier.predict_proba(plot2)

# Print the probabilities
print(prediction2)

[[0.96844433 0.03155567]]
[[0.32634018 0.67365982]]
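
Each printed row holds one probability per class, in class order, so with 0/1 labels the second number is the estimated probability that the plot is an action movie. As a rough illustration of where such numbers come from, the sketch below implements a small multinomial Naive Bayes with add-one (Laplace) smoothing using only the standard library. It is a simplified stand-in rather than scikit-learn's code: tokenization is a plain whitespace split, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    classes = sorted(set(labels))
    log_prior, log_like = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        # Add alpha to every word count so unseen words keep a nonzero probability
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return classes, log_prior, log_like


def predict_proba(doc, classes, log_prior, log_like):
    """Return one probability per class, in the order given by `classes`."""
    scores = []
    for c in classes:
        score = log_prior[c]
        for w in doc.split():
            if w in log_like[c]:  # words outside the vocabulary are ignored
                score += log_like[c][w]
        scores.append(score)
    # Normalise in log space for numerical stability
    top = max(scores)
    exp_scores = [math.exp(s - top) for s in scores]
    norm = sum(exp_scores)
    return [e / norm for e in exp_scores]


# Hypothetical toy data: 1 = action, 0 = not action
docs = ["spy chase explosion", "spy saves world", "kittens nap", "fuzzy kittens play"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

kitten_proba = predict_proba("kittens kittens", *model)
spy_proba = predict_proba("spy chase", *model)
print(kitten_proba)
print(spy_proba)
```

A plot made up entirely of words the model has never seen falls back to the class priors, which is one reason very short synopses can receive unconvincing predictions.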


In [6]:
# Get the list of genres from the dataset
# The first two columns are Title and Plot; every remaining column is a one-hot genre label
genres = data.columns[2:]
print(genres)

# Create a dictionary to store the classifiers for each genre
classifiers = {}

Index(['Action', 'Adventure', 'Biography', 'Comedy', 'Crime', 'Drama',
       'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Musical', 'Mystery',
       'Romance', 'Sci-Fi', 'Sport', 'Thriller', 'War', 'Western'],
      dtype='object')


In [7]:
# Train a classifier for each genre
for genre in genres:
    # If the genre has already been trained, skip it
    if genre in classifiers:
        continue
    print('Training classifier for:', genre)
    # Split the data into training and testing sets
    # The same random_state as before puts the same rows in each split,
    # so the vectorizer fitted earlier has only seen these training plots
    X_train, X_test, y_train, y_test = train_test_split(data['Plot'], data[genre], test_size=0.2, random_state=42)
    
    # Create a Multinomial Naive Bayes classifier
    classifier = MultinomialNB()

    # Use the vectorizer to convert the text data into a matrix of token counts
    X_train = vectorizer.transform(X_train)
    
    # Train the classifier
    classifier.fit(X_train, y_train)
    
    # Store the classifier in the dictionary
    classifiers[genre] = classifier
    print('Done training classifier for:', genre)

Training classifier for: Action
Done training classifier for: Action
Training classifier for: Adventure
Done training classifier for: Adventure
Training classifier for: Biography
Done training classifier for: Biography
Training classifier for: Comedy
Done training classifier for: Comedy
Training classifier for: Crime
Done training classifier for: Crime
Training classifier for: Drama
Done training classifier for: Drama
Training classifier for: Family
Done training classifier for: Family
Training classifier for: Fantasy
Done training classifier for: Fantasy
Training classifier for: History
Done training classifier for: History
Training classifier for: Horror
Done training classifier for: Horror
Training classifier for: Music
Done training classifier for: Music
Training classifier for: Musical
Done training classifier for: Musical
Training classifier for: Mystery
Done training classifier for: Mystery
Training classifier for: Romance
Done training classifier for: Romance
Training classifie
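
With one binary classifier per genre, a multi-label prediction for a synopsis could be assembled by keeping every genre whose positive-class probability clears a threshold. The sketch below shows only that combining step. The `predict_genres` helper and the probabilities are hypothetical; in the notebook the values would come from each classifier's `predict_proba` output for the vectorized plot.

```python
def predict_genres(probabilities, threshold=0.5):
    """Return the genres whose positive-class probability meets the threshold,
    most likely first."""
    chosen = [(genre, p) for genre, p in probabilities.items() if p >= threshold]
    chosen.sort(key=lambda item: item[1], reverse=True)
    return [genre for genre, _ in chosen]


# Hypothetical positive-class probabilities for a single synopsis
example = {"Action": 0.67, "Comedy": 0.12, "Thriller": 0.81, "Western": 0.03}

predicted = predict_genres(example)
strict = predict_genres(example, threshold=0.9)
print(predicted)
print(strict)
```

A threshold of 0.5 matches what each binary classifier's own `predict` would decide; raising it trades recall for fewer false positives.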

In [8]:
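# Evaluate the classifiers
# Note: this scores each classifier on the full dataset, which includes the
# plots it was trained on, so these accuracies are optimistic compared with
# a score on the held-out test split alone
scores = {}

# Convert every plot into a matrix of token counts once, outside the loop
X_all = vectorizer.transform(data['Plot'])

for genre, classifier in classifiers.items():
    # Evaluate the classifier
    score = classifier.score(X_all, data[genre])
    scores[genre] = score
    print('Accuracy for', genre, ':', score)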

Accuracy for Action : 0.8805505136835428
Accuracy for Adventure : 0.8876874548434278
Accuracy for Biography : 0.8556328968932279
Accuracy for Comedy : 0.8171180855375615
Accuracy for Crime : 0.8887729747828079
Accuracy for Drama : 0.757707015348829
Accuracy for Family : 0.8984721659294764
Accuracy for Fantasy : 0.9057430348740902
Accuracy for History : 0.8737272454931538
Accuracy for Horror : 0.914582268666185
Accuracy for Music : 0.9006326325620738
Accuracy for Musical : 0.950665233404409
Accuracy for Mystery : 0.8974500854670732
Accuracy for Romance : 0.8081413995453505
Accuracy for Sci-Fi : 0.9307910550337463
Accuracy for Sport : 0.9649179692318537
Accuracy for Thriller : 0.8371508626006661
Accuracy for War : 0.9430313496748727
Accuracy for Western : 0.9679806862036759
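
One way to read these numbers is against a trivial baseline: when a genre is rare, always answering "not this genre" already scores a high accuracy. The sketch below computes that majority-class baseline. The label lists are hypothetical and are not the dataset's real genre counts.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical one-hot label columns for 100 titles each
rare_genre = [1] * 5 + [0] * 95      # 5% of titles carry the genre
common_genre = [1] * 40 + [0] * 60   # 40% of titles carry the genre

rare_baseline = majority_baseline_accuracy(rare_genre)
common_baseline = majority_baseline_accuracy(common_genre)
print(f"rare genre baseline:   {rare_baseline:.2f}")
print(f"common genre baseline: {common_baseline:.2f}")
```

Comparing each genre's accuracy with its own baseline, using the real label columns in `data`, would show how much the classifier adds beyond class imbalance.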


In [9]:
# Write the scores to a file, alongside the number of movies in each genre
with open('scores.txt', 'w') as f:
    for genre, score in scores.items():
        # Each genre column is one-hot, so its sum is the number of movies in that genre
        num_movies = data[genre].sum()
        f.write(f'{genre}: {score} ({num_movies})\n')