# Milestone 2: Assembling training data

We are aware that you have little time this week, due to the midterm. So this milestone is a bit easier to achieve than the others. The goal for this week is to prepare the data for the modeling phase of the project. You should end up with a typical data setup of training data X and data labels Y.

The exact form of X and Y depends on the ideas you had previously. In general, though, Y should involve the genre of a movie, and X the features you want to include to predict the genre. Remember from the lecture that more features do not necessarily mean better prediction performance. Use your application knowledge and the insight you gathered from your genre pair analysis and additional EDA to design Y. Do you want to include all genres? Are there genres that you assume to be easier to separate than others? Are there genres that could be grouped together? There is no one right answer here. We are looking for your insight, so be sure to describe your decision process in your notebook.

In preparation for the deep learning part we strongly encourage you to have two sets of training data X, one with the metadata and one with the movie posters. Make sure to have a common key, like the movie ID, to be able to link the two sets together. Also be mindful of the data rate when you obtain the posters. Time your requests and choose which poster resolution you need. In most cases w500 should be sufficient, and probably a lower resolution will be fine.

The notebook to submit this week should at least include:

- Discussion about the imbalanced nature of the data and how you want to address it
- Description of your data
- What does your choice of Y look like?
- Which features do you choose for X and why? 
- How do you sample your data, how many samples, and why?

*Important*: You do not need to upload the data itself to Canvas.

## Imbalance and Model Choice

After getting a sense of both the IMDB and TMDB databases in the first milestone, we decided to use only TMDB for the rest of the project. The TMDB database is easier to work with, and both databases contain essentially the same information.

One of the notable concerns we discovered in the first milestone was the imbalanced nature of the genre data. Because TMDB provides a list of genres for each movie, some genres co-occur far more often than others. Specifically, a few genres were highly collinear with other genres. For example, a movie labeled “War” had roughly a 75 percent chance of also being tagged as “Drama.”

We knew the imbalanced nature of the data would be a problem in developing a model, because it would be hard to create a classification model that could distinguish movies that were always grouped together. Returning to the example given above, it would be quite difficult to separate “War” movies from “Drama” movies when “War” movies are essentially a subset of “Drama” movies. We spent a decent amount of time thinking about this problem, and worked through two possible solutions.

The first was to manually check for frequently occurring genre pairs (e.g. action-adventure) by examining the heatmap we generated in milestone 1, and then to include those pairs as possible labels alongside the individual genres. To clarify, the possible labels would include action, adventure, and action-adventure. This would be a single model that runs through the data points in our training set and assigns each movie either a single genre or one of the frequently occurring genre pairs.

The second solution we considered was to have a separate model for each possible genre, each returning “yes” or “no”. From this, we would be able to assign multiple genres to each movie, as in the TMDB and IMDB databases.

After we talked to Rashmi about the two options, she pointed out a solution that combines the fundamental approaches of both. The sklearn library supports multilabel classification through the “OneVsRestClassifier” class, which automatically fits one classifier per class, with each class fitted against all the other classes. We can choose which underlying classifier to use (e.g. RandomForest, SVM, etc.), and, provided that classifier can produce probability estimates, the output gives us the probability that a sample belongs to each class. For example, if we applied this classifier to one movie, and we were looking at how well the movie fit into 3 different genres, then it could output the probability of that movie fitting the first genre as 0.3, the second genre as 0.6, and the third genre as 1.0. Then, based on whatever threshold we consider appropriate, we could assign that movie to every genre whose probability is greater than or equal to the threshold. So with a threshold of 0.5, this movie would be labeled with the second and third genres.
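The thresholding step described above can be sketched in a few lines. This is only an illustration of the final label assignment, not of the classifier itself; the genre names and the `assign_genres` helper are hypothetical, and the probabilities are the ones from the example in the text.

```python
# Hypothetical per-genre probabilities for one movie (numbers from the text above)
probabilities = {"genre_1": 0.3, "genre_2": 0.6, "genre_3": 1.0}


def assign_genres(probs, threshold=0.5):
    """Return every genre whose probability is at or above the threshold."""
    return sorted(genre for genre, p in probs.items() if p >= threshold)


assigned = assign_genres(probabilities, threshold=0.5)
print(assigned)  # ['genre_2', 'genre_3']
```

Raising the threshold makes the labeling more conservative: with a threshold of 0.7, only the third genre would be assigned.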

## The Data

As stated before, we decided we would work solely with TMDB data for this project.

### What does your choice of Y look like?

Because we will be using the “OneVsRestClassifier” to run our model, our Y must be a 2-d matrix of indicators, in which cell [i, j] is 1 if sample i has label j and 0 if it does not. The columns of this matrix are the 20 different genres we will be predicting: 'Mystery', 'Romance', 'Family', 'Science Fiction', 'Horror', 'Crime', 'Drama', 'Fantasy', 'Western', 'Animation', 'Music', 'Adventure', 'Foreign', 'Action', 'TV Movie', 'Comedy', 'Documentary', 'War', 'Thriller', and 'History'. So, for instance, a movie with the genres of ‘Mystery’ and ‘Horror’ would have a ‘1’ under these 2 columns and 0’s under every other column. This is both a computationally efficient and highly interpretable way of predicting multiple labels at once for a certain movie.  
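As a minimal sketch of this layout, the snippet below builds a small indicator matrix by hand. It assumes genres are stored as comma-separated strings; the movie IDs, the four-genre subset, and the `to_indicator_row` helper are made up for illustration.

```python
# Illustrative subset of the genre columns; the real matrix has all 20
genre_columns = ["Mystery", "Horror", "Drama", "Comedy"]

# Hypothetical movies keyed by ID, each with a comma-separated genre string
movie_genres = {
    101: "Mystery, Horror",
    102: "Comedy",
    103: "Drama, Comedy",
}


def to_indicator_row(genre_str, columns):
    """Return a 0/1 list with a 1 for each column genre present in genre_str."""
    genres = genre_str.split(", ")
    return [1 if column in genres else 0 for column in columns]


movie_ids = sorted(movie_genres)
Y = [to_indicator_row(movie_genres[m_id], genre_columns) for m_id in movie_ids]

for m_id, row in zip(movie_ids, Y):
    print(m_id, row)
```

Each row corresponds to one movie and each column to one genre, so the first row here has a 1 under 'Mystery' and 'Horror' and 0 everywhere else.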

### Which features do you choose for X and why? 

Features: title, tagline, overview, runtime, budget, revenue, release date, popularity, user rating, votes, adult, studios, countries, keywords, director, producer, editor, main actor and supporting actor.

Title, tagline, overview, keywords - These features require text analysis, but should be very helpful in predicting genres. We might have to use a separate Bayesian model to analyze the text as priors, but these features provide more information about the plot of a movie, which should point toward particular genres. For example, certain words are associated with horror movies (e.g. scary, blood, etc.).
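To make the idea of genre-associated words concrete, here is a rough sketch that counts how many cue words appear in an overview. The cue list, the sample overview, and both helper functions are hypothetical; a real vocabulary would be learned from the training data rather than written by hand.

```python
import re
from collections import Counter

# Hypothetical cue words for horror; not derived from the actual data
horror_cues = {"scary", "blood", "haunted"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cue_count(text, cues):
    """Count how many tokens in the text belong to the cue set."""
    counts = Counter(tokenize(text))
    return sum(counts[word] for word in cues)


overview = "A scary night in a haunted house ends in blood."
score = cue_count(overview, horror_cues)
print(score)  # 3
```

A count like this could serve as one numeric feature per genre, although a proper text model would weigh words by how strongly they separate genres.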

Runtime, budget, revenue, release date, popularity, user rating, votes, adult, studios, countries - these will likely be useful in separating genres because different genres often follow distinct patterns in these regards. For instance, we believe a drama is more likely to be a long film than a comedy. Similarly, dramatic movies that are expected to do well are released around the holidays every year because the market is bigger while horror movies are more frequently released in January and February. The other features follow the same pattern of playing off of cultural trends to distinguish genres of movies.
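Release timing could be turned into a feature by extracting the month from the release date. The sketch below assumes dates are stored as `YYYY-MM-DD` strings, as in the metadata table; the `is_holiday_release` flag and its November/December definition are our own assumptions for illustration.

```python
from datetime import datetime


def release_month(date_str):
    """Parse a YYYY-MM-DD string and return the month as an integer (1-12)."""
    return datetime.strptime(date_str, "%Y-%m-%d").month


def is_holiday_release(date_str):
    """Assumed definition: a release in November or December."""
    return release_month(date_str) in (11, 12)


month = release_month("2000-12-22")
print(month)                             # 12
print(is_holiday_release("2000-12-22"))  # True
print(is_holiday_release("2000-05-01"))  # False
```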

Director, producer, editor, main actor and supporting actor - These are the members of the cast and crew that we deemed most important to look at. Because these are the main roles in creating a movie, we expect to see trends in the types of movies that particular directors and producers create and that particular actors appear in. People tend to specialize, so they gravitate toward certain types of movies. For example, Adam Sandler acts almost exclusively in comedies. As a result, having him as the main or supporting actor would be a strong signal that the movie is a comedy.

Certain features that we decided to remove include: original title, home page, collection, languages, alternate titles, releases (countries), youtube trailers, apple trailers, and translations. Some of these features, such as original title and alternate titles, were removed because they were repetitive and did not tell us anything substantial about the genre. The youtube and apple trailer features simply provide links to the trailers, and thus are also uninformative for our model. Finally, we believed that features like “translations”, “languages”, and “releases (countries)” added unnecessary noise to our model. Knowing which translations a movie has likely speaks more to the movie’s popularity (a more popular movie, whether a drama, horror, or mystery movie, will have more translations) than to its genre. Meanwhile, where a movie was first released or what language it was shot in likely tells us little about the genre either, since all countries produce all types of movies in all types of languages. If on the other hand we saw that Spanish-speaking countries only produced dramas, for instance, then perhaps language and release country would be pertinent information; since this is not the case, however, we have removed these features.

### How do you sample your data, how many samples, and why?

Because using the entire TMDB database would cause runtime issues, we decided it would be best to use only the 500 most popular movies for each year since 2000. We used popularity rather than rating as the criterion for our “top 500” because some movies had been rated by only one critic, which can skew which movies are considered influential and worth analyzing. We chose 2000 as the cutoff because we started to see that before then (and even in 2000), many of the movies had blanks for multiple pieces of information. Going further back would have increased the runtime without giving us more information to train our model on. Even when grabbing the features above, we found that one movie released in 2001 had an issue with its ID, which caused errors in our API calls. We deleted this movie from our training set, and did not want to deal with similar issues in older movies. After this removal, we have 8,499 movies in our training set.
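The sample size follows directly from the request pattern used in the code below. As a quick check, the arithmetic is sketched here; the figure of 20 results per page is an assumption inferred from needing 25 pages to reach 500 movies per year.

```python
movies_per_page = 20                      # assumed page size of the discover endpoint
pages_per_year = len(range(1, 26))        # pages 1 through 25
years = range(2000, 2017)                 # 2000 through 2016 inclusive

per_year = movies_per_page * pages_per_year
total_requested = per_year * len(years)
removed = 1                               # the one movie dropped because of its ID
total = total_requested - removed

print(per_year)         # 500
print(total_requested)  # 8500
print(total)            # 8499
```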

In [1]:
from itertools import chain
from PIL import Image
from StringIO import StringIO
from time import sleep
import json
import os.path
import pandas as pd
import tmdb3 as tmdb
import tmdbsimple as t_simple
import urllib

In [2]:
tmdb.set_key('98ca260f92ac9b42606914b232546089')
t_simple.API_KEY = '98ca260f92ac9b42606914b232546089'

In [3]:
def get_page(discover, year, page):
    results = discover.movie(page=page, sort_by='popularity.desc', primary_release_year=year)['results']
    return [movie['id'] for movie in results]

# Merge lists of lists into a single list
def flatten(lst):
    return list(chain.from_iterable(lst))

def get_yearly_ids(discover, year):
    # Ensure that we do not make more than 40 requests per 10 seconds
    sleep(10)
    return flatten([get_page(discover, year, page) for page in range(1, 26)])

def get_all_ids():
    discover = t_simple.Discover()
    results = flatten([get_yearly_ids(discover, year) for year in range(2000, 2017)])
    
    # Remove Super Gals (2001) from the list since it can no longer be found with the API
    super_gals = 440656
    if super_gals in results:
        results.remove(super_gals)
        
    return results

def list_to_str(lst):
    return ', '.join([x.name for x in lst])

def get_role(crew, role):
    return next((person.name for person in crew if person.job == role), None)

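def get_actor(cast, main):
    # The main actor is billed first; the supporting actor is billed second,
    # falling back to the first entry when only one cast member is listed
    if len(cast) == 0:
        return None
    index = 0 if main or len(cast) == 1 else 1
    return cast[index].name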

def get_row(m):
    return {'id': m.id, 'title': m.title, 'tagline': m.tagline, 'overview': m.overview, 'runtime': m.runtime, 
            'budget': m.budget, 'revenue': m.revenue, 'releasedate': m.releasedate, 'popularity': m.popularity, 
            'userrating': m.userrating, 'votes': m.votes, 'adult': m.adult, 'genres': list_to_str(m.genres), 
            'studios': list_to_str(m.studios), 'countries': list_to_str(m.countries), 
            'keywords': list_to_str(m.keywords), 'director': get_role(m.crew, 'Director'), 
            'producer': get_role(m.crew, 'Producer'), 'editor': get_role(m.crew, 'Editor'), 
            'main_actor': get_actor(m.cast, True), 'supporting_actor': get_actor(m.cast, False), 
            'poster': m.poster.geturl('w500') if m.poster else None}

def get_all_movies(movie_ids):
    return pd.DataFrame([get_row(tmdb.Movie(m_id)) for m_id in movie_ids])

def genre_to_matrix(genre_str):
    # Relies on the global genre_lst, which must be defined before this is called
    genres = genre_str.split(', ')
    return [(1 if g in genres else 0) for g in genre_lst]

def get_genre_lst():
    genres = json.loads(tmdb.request.Request('/genre/movie/list').read())['genres']
    return [genre['name'] for genre in genres]

def get_posters(links, indices):
    posters = []
    for img in links:
        posters.append(Image.open(StringIO(urllib.urlopen(img).read())))
        # Pause between downloads to limit the request rate
        sleep(0.25)
    return pd.DataFrame({'poster': posters}, index=indices)

In [4]:
movie_path = './Datasets/movies.csv'
poster_path = './Datasets/posters.pkl'

if os.path.isfile(movie_path):
    movies = pd.read_csv(movie_path)
else:
    movies = get_all_movies(get_all_ids())
    # Skip the default integer index so the cached file reloads with the same columns
    movies.to_csv(movie_path, index=False)
    
movies.set_index('id', inplace=True)

# Drop movies without posters
movies = movies.dropna(subset=['poster'])

# Convert genre lists to binary lists
genre_lst = get_genre_lst()
genres = [genre_to_matrix(str(genre_str)) for genre_str in movies['genres']]

# Set up metadata dataset
metadata = movies.drop(labels=['genres', 'poster'], axis=1)
metadata.head()

id,adult,budget,countries,director,editor,keywords,main_actor,overview,popularity,producer,releasedate,revenue,runtime,studios,supporting_actor,tagline,title,userrating,votes
77,False,9000000,United States of America,Christopher Nolan,Dody Dorn,"individual, insulin, tattoo, waitress, amnesia...",Guy Pearce,Suffering short-term memory loss after a head ...,3.666593,Jennifer Todd,2000-10-20,39723096,113.0,"Summit Entertainment, Newmarket Capital Group,...",Joe Pantoliano,Some memories are best forgotten.,Memento,8.0,2927
98,False,103000000,"United Kingdom, United States of America",Ridley Scott,Pietro Scalia,"rome, gladiator, roman empire, slavery, battle...",Russell Crowe,General Maximus' success in battle earns the f...,4.943334,David Franzoni,2000-05-01,457640427,155.0,"DreamWorks SKG, Universal Pictures, Scott Free...",Joaquin Phoenix,A Hero Will Rise.,Gladiator,7.8,4343
107,False,10000000,"United Kingdom, United States of America",Guy Ritchie,Jon Harris,"gypsy, bare knuckle boxing, slang, trailer par...",Jason Statham,The second film from British director Guy Ritc...,4.769231,Matthew Vaughn,2000-09-01,83557872,103.0,"Columbia Pictures Corporation, Screen Gems, SK...",Brad Pitt,Stealin' Stones and Breakin' Bones.,Snatch,7.6,2414
8358,False,90000000,United States of America,Robert Zemeckis,Arthur Schmidt,"exotic island, suicide attempt, volleyball, lo...",Tom Hanks,"Chuck, a top international manager for FedEx, ...",3.126307,Tom Hanks,2000-12-22,429632142,143.0,"DreamWorks SKG, Twentieth Century Fox Film Cor...",Helen Hunt,"At the edge of the world, his journey begins.",Cast Away,7.4,2440
9741,False,75000000,United States of America,M. Night Shyamalan,Dylan Tichenor,"train accident, marriage crisis, invulnerabili...",Bruce Willis,An ordinary man makes an extraordinary discove...,2.619656,Barry Mendel,2000-11-13,248118121,106.0,"Limited Edition Productions Inc., Touchstone P...",Samuel L. Jackson,Some things are only revealed by accident.,Unbreakable,6.8,1416


In [5]:
# Set up poster dataset
if os.path.isfile(poster_path):
    posters = pd.read_pickle(poster_path)
else:
    posters = get_posters(movies['poster'].values, movies.index)
    posters.to_pickle(poster_path)

posters.head()

id,poster
77,<PIL.JpegImagePlugin.JpegImageFile image mode=...
98,<PIL.JpegImagePlugin.JpegImageFile image mode=...
107,<PIL.JpegImagePlugin.JpegImageFile image mode=...
8358,<PIL.JpegImagePlugin.JpegImageFile image mode=...
9741,<PIL.JpegImagePlugin.JpegImageFile image mode=...
