# CS109B - Milestone 3
Authors: Stephanie von Klot-Heydenfeldt, Roberto Vitillo, Alessio Placitelli

In [1]:
import itertools as it
import sys
import os
import operator
import math
import matplotlib.pyplot as plt
import numpy as np
import numpy.ma as ma
import pandas as pd
import tables

sys.path.append(os.path.realpath('../lib'))

from tmdbw import TMDBW
from pandas.io.json import json_normalize
from skimage.viewer import ImageViewer
from skimage.color import rgb2gray
from skimage import transform
from sklearn.decomposition import PCA
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.metrics import hamming_loss
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import zero_one_loss

%matplotlib inline

  warn("Recommended matplotlib backend is `Agg` for full "


# Overview
### Building a shared train/test set
As described in Milestone 2, we built a library to download movie posters and metadata from TMDB. However, scraping the data and making the heavy poster data file available to all the team members was only the first step. We needed a way to correctly reproduce the very same train/test splits on every machine, with a balanced representation of all the available movie genres.

To satisfy these requirements, since we couldn’t find any suitable function in sklearn, we decided to implement our own stratified iterative sampling function. Here’s how the algorithm works:

1. Enumerate all the possible genres available in the dataset and count their frequency.
2. Sort them from the least frequent to the most frequent genre.
3. For each sorted genre:
  * Check if the label was already partially sampled. This can happen as movies have multiple labels.
  * Compute the number of samples we expect to have in the train set for this label by multiplying the requested ratio of train set samples with the number of occurrences for this label.
    * If the number of samples with this label is greater than the number of expected samples, move on to the next genre
  * Compute the number of samples that we need to add for this label, as a difference between the expected number of samples with the current label and the number of desired samples.
  * Randomly sample from the original dataset, add them to the train set and remove them from the original dataset.
  * Move to the next genre.

This algorithm allows to have train/test sets that contain the same proportion of labels that can be found in the original dataset. It’s trivial to produce the same results with this algorithm by specifying the random seed of the sampling algorithm.

### The process
After generating the train and test sets used to evaluate our models, we started investigating their performance with various traditional machine learning algorithms.

Since we have a multi-label problem, we decided to use a one-vs-the rest classification strategy, basically training a model for each label: each model having the the Y for the related label set to a positive value and the rest being negative.

This strategy is implemented by the sklearn [OneVsRestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) class, which plays nicely with the [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) class, conveniently allowing to model multi-label problems with a few lines of code.

Each investigated machine learning algorithm was fine-tuned by performing an exhaustive search over specified hyperparameter values, with cross-validation.

# Part 1 - Investigate movie metadata
We created explanatory features based on the keywords, overview and tagline that were provided for the movies. We included features based on one combined bag of words, including those words that occurred at least 100 times total. We also encoded the actors as dummy variables and kept those that appeared more than 50 times as either the first or the secondary actor in the data. The year of the movie release, runtime, revenue and budget were included as numerical variables.

We performed stratified sampling of the data for creation of a training set that consisted of 10% of the data.
On the standardized training data with the features described above we implemented multi-label SVM (one-vs-the rest) to predict genre and determined the best parameters using grid search with 5 fold cross validation.  We report performance (genre specific precision, recall, accuracy and summary measures as described below) on the test data (90% of the dataset).  We compared these performance measures with a naive classifier that predicts genre randomly. 

Similarly we tuned a random forest classifier.

## Import TMDB data
We built an external library to import, so we just import the Tmdb data.

In [2]:
TMDB_all = pd.read_pickle("../data/data_export")

Remove missing values and de-duplicate movie data.

In [3]:
print("Found {} duplicated movies.".format(TMDB_all.duplicated('tmdb_id').sum()))

# Keep only the first of duplicate observations.
TMDB_clean = TMDB_all.drop_duplicates(['tmdb_id'])

# Create a release year variable out of the release_date.
TMDB_clean['release_year'] = pd.to_numeric(TMDB_clean['release_date'].str[:4])

# Remove observations with some missing values in the fields
# we care for.
TMDB_clean = TMDB_clean.dropna(subset = ["director", "overview", "actor2", "release_year"])

# Drop the "producer" and "writer" columns, they have too many missings
drop_variables = [i for i, item in enumerate(TMDB_all.isnull().mean()) if item > 0.2]
TMDB_clean = TMDB_clean.drop(TMDB_clean.columns[drop_variables], axis = 1)

# Reset the index.
TMDB_clean.reset_index(inplace = True)

print("The shape of the clean metadata dataset is {}".format(TMDB_clean.shape))

TMDB_clean.head(2)

Found 0 duplicated movies.
The shape of the clean metadata dataset is (9790, 18)


Unnamed: 0,index,actor1,actor2,adult,budget,director,genres,imdb_id,keywords,language,overview,release_date,revenue,runtime,tagline,title,tmdb_id,release_year
0,0,Emma Watson,Dan Stevens,False,160000000,Bill Condon,"[Fantasy, Music, Romance]",tt2771200,"france,magic,castle,fairy tale,musical,curse,c...",en,A live-action adaptation of Disney's version o...,2017-03-16,959241034,129.0,Be our guest.,Beauty and the Beast,321612,2017.0
1,1,Alec Baldwin,Miles Christopher Bakshi,False,125000000,Tom McGrath,"[Animation, Comedy, Family]",tt3874544,"family relationships,unreliable narrator,3d",en,A story about how a new baby's arrival impacts...,2017-03-23,137547590,97.0,Born leader,The Boss Baby,295693,2017.0


## Create X and Y matrix

In [5]:
# This should use tmdb.get_genres()
GENRE_LIST = [
    'Action', 'Adventure', 'Animation', 'Comedy', 'Crime',
    'Documentary', 'Drama', 'Family', 'Fantasy', 'Foreign', 'History',
    'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction',
    'TV Movie', 'Thriller', 'War', 'Western'
]

# Build the binarized labels.
label_binarizer = MultiLabelBinarizer()
binarized_y = label_binarizer.fit_transform(TMDB_clean['genres'])
Y = pd.DataFrame(binarized_y, columns=label_binarizer.classes_)

# Create a new table without the undesired colums.
X = TMDB_clean.drop(["genres", "imdb_id", "tmdb_id", "adult"], axis = 1)

### Use bag of words for the keywords / overview /tagline columns 

In [6]:
import re

def cleanup_text(s):
    s = s.lower()
    s = s.replace('-', ' ')
    s = s.replace(')', ' ')
    s = s.replace('(', ' ')
    s = s.replace(',', ' ')
    s = s.replace('.', ' ')
    s = s.replace('"', ' ')
    s = s.replace(' br ', '')
    s = s.replace(' quot ', ' ')
    s = s.replace(' amp ', ' and ')
    s = s.replace(' s ', "'s ")
    s = s.replace(' t ', "'t ")
    s = s.replace(' m ', "'m ")
    s = s.replace(' ve ', "'ve ")
    s = s.replace(' ll ', "ll ")
    s = s.replace(' ', " ")
    s = re.sub(r'(\d) (\d{3})', r'\1,\2', s)
    return s

def bag_of_words(feature):
    feature_cleaned =  X[feature].apply(cleanup_text)
  
    vectorizer = CountVectorizer(stop_words='english',
                                 min_df = 100,
                                 ngram_range=(1,2))
    word_counts = vectorizer.fit_transform(feature_cleaned)

    feature_names = np.array(vectorizer.get_feature_names())
    
    feature_names[::(len(feature_names)/20)]

    word_count_sum = pd.DataFrame(word_counts.sum(axis=0).T, columns=['count'])
    word_count_sum['ngram'] = feature_names
    print "number of features: " + str(len(feature_names)) + " for: " + str(feature)
    print  word_count_sum.sort_values(by=['count'], ascending=[0])
    return word_counts, feature_names


# Create one big suitcase of words out of all 3 variables
X['allwords'] = X[['keywords', 'overview', 'tagline']].apply(lambda x: ','.join(x), axis=1)
all_word_counts, all_feature_names = bag_of_words('allwords')

number of features: 618 for: allwords
     count                ngram
315   2244                 life
333   2105                 love
379   1993                  new
343   1799                  man
605   1727                world
598   1669                woman
193   1534                 film
614   1528                young
177   1478               family
35    1270                based
585   1236                  war
514   1194                story
546   1188                 time
435   1168         relationship
182   1033               father
367   1020               murder
460    976               school
125    959                death
388    896                  old
595    839                 wife
478    811                  sex
363    795               mother
465    785               secret
409    783               police
138    769             director
209    751              friends
611    744                years
493    730                  son
118    724             daughter
22

### Create dummy variables of actors

Use actor1 and actor2 for this and add both dummy variable matrices, so that the combinations are available per movie

In [7]:
#hot encode actor1 and then actor2
actor1_dummies = pd.get_dummies(X["actor1"])
actor2_dummies = pd.get_dummies(X["actor2"])

#add the two dummy list in order to have both actors
actor_dummies =  actor1_dummies.add(actor2_dummies, fill_value=0)

In [8]:
actor1_dummies.shape, actor2_dummies.shape, actor_dummies.shape

((9790, 4111), (9790, 5472), (9790, 7808))

Reduce the number of dummy variables by just considering those actors with a certain frequency (N>50).

In [9]:
varlist = ['budget', 'revenue', 'runtime', 'release_year', 'actor1', 'actor2']
X_combined = pd.concat([actor_dummies[actor_dummies.sum()[actor_dummies.sum()>50].index],
                        pd.DataFrame(all_word_counts.A, columns=all_feature_names),
                        pd.DataFrame(X[varlist])], axis = 1)

In [10]:
X_combined.shape

(9790, 626)


## Build the train, test and validation sets

We build a training set that has the same proportion of labels as the full dataset, so that every genre is represented.

We start sampling from the least frequent class to the most frequent one.


In [11]:
TMDB_binarized_outcome = binarized_y

#generate train and test sets
def test_train(data, setsize = 0.1):
    genre_counts = {}
    for genre in label_binarizer.classes_:
        genre_counts[genre] = data['genres'].apply(lambda x: genre in x).sum()

    # Sort the genres from the least occurring to the most frequent.
    num_movies = data.shape[0]
    sorted_genres = sorted(genre_counts.items(), key=operator.itemgetter(1))

    #for now make training data small, for computational reasons
    train_set_size = setsize*1.0

    # Generate a train/test set.
    train_set = pd.DataFrame()
    original_set = data.copy()
    for genre, count in sorted_genres:
        # Check how many samples have this label in the set.
        # The dataframe may be empty on the first run, so account for
        # that.
        already_sampled = 0
        if 'genres' in train_set:
            already_sampled = train_set['genres'].apply(lambda x: genre in x).sum()

        # If the test set already contains all the samples we expect
        # for this label, continue to the next label.
        expected_samples = int(math.floor(train_set_size * count))        
        if already_sampled >= expected_samples:
            continue

        # If not, then randomly sample |expected - already_there| samples
        num_to_sample = expected_samples - already_sampled
        samples = original_set[original_set['genres'].apply(lambda x: genre in x)]\
                                                     .sample(n=num_to_sample, random_state = 42)
        # Append the random samples to the train_set
        train_set = train_set.append(samples)
        # Remove them from the original set, so we don't sample them again
        # for a different label.
        original_set = original_set.drop(samples.index)
        test_set = data.drop(train_set.index)
    
    
    # use the indices above to create the train and test sets of the different data frames
    # for this make sure that the indices match
    X_train = X_combined.loc[train_set.index]#original dataframe
    X_test = X_combined.drop(train_set.index)
    Y_train = Y.loc[train_set.index]#hand created labels
    Y_test = Y.drop(train_set.index)
    
    y1_train = MultiLabelBinarizer().fit_transform(data['genres'].loc[train_set.index])
    y1_test = MultiLabelBinarizer().fit_transform(data['genres'].drop(train_set.index))
    
    return X_train, X_test, y1_train, y1_test, Y_train, Y_test

In [12]:
X_train01, X_test01, y1_train01, y1_test01, Y_train01, Y_test01 = test_train(TMDB_clean)

#### Find actors that in the training dataset represent always a certain genre

In [13]:
#test (only in train data) if actors are predictive of genre, example action
subset =pd.crosstab(X_train01.actor1,Y_train01["Action"], margins = True)
subset.pred = np.where(((subset.All - subset.iloc[:,1] == 0) & (subset.All>2)), 1,0)
subset[subset.pred ==1].sort_values(by=['All'], ascending=[0])

Action,0,1,All
actor1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Jean-Claude Van Damme,0,3,3
Jet Li,0,3,3
Masako Nozawa,0,3,3


In [14]:
#test (only in train data) if actors are predictive of genre, example comedy
subset =pd.crosstab(X_train01.actor1,Y_train01["Comedy"], margins = True)
subset.pred = np.where(((subset.All - subset.iloc[:,1] == 0) & (subset.All>2)), 1,0)
subset[subset.pred ==1].sort_values(by=['All'], ascending=[0])

Comedy,0,1,All
actor1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Adam Sandler,0,4,4
Chevy Chase,0,4,4
Bill Murray,0,3,3
Eddie Murphy,0,3,3
Jack Black,0,3,3
Jim Carrey,0,3,3
Johnny Knoxville,0,3,3
Mike Myers,0,3,3
Renée Zellweger,0,3,3
Robin Williams,0,3,3


### Define our performance metrics
In a multi-class prediction problem, evaluating the performance of a model is more straightforward as classes are, by definition, mutually exclusive. Our movie genre classification problem is, by nature, a multi-label problem: genres are not mutually exclusive and having more than one genre label per movie is very frequent. Evaluating the performance of such models is an hard task by itself, but the scientific literature has prior research that can be used as a reference.

Let $E=\{[0, 1], [1, 1]\}$ be the set of 2 expected labels for 2 samples and $P=\{[0, 0], [0, 0]\}$ the set of predicted labels. Here’s how the scoring would look with the following metrics:

* **Subset accuracy** - the ratio of samples with a set of predicted labels exactly matching the corresponding set of expected labels.
  The subset accuracy, in this case, would score 0, as there’s no full overlap for the labels in each sample. Predicting $P=\{[0, 1], [0, 0]\}$ would, instead, produce a score of 0.5. It’s important to note how this performance metric heavily penalizes partial label matching and provides a lower accuracy score compared to other metrics. This is implemented by the [sklearn.metrics.accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn-metrics-accuracy-score) function.
* **Hamming loss** - the average relevance of a sample to a set of labels. This is computed by averaging the number of misclassified labels on all the samples. This value should ideally be 0 for perfect predictions, but in general the lower is this value, the better. In the provided example, the hamming loss would be 0.75. This is implemented by the [sklearn.metrics.hamming_loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html#sklearn-metrics-hamming-loss) function.
* **Zero-one loss** - This the complement of the subset accuracy described above. This is implemented by the [sklearn.metrics.zero_one_loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.zero_one_loss.html#sklearn-metrics-zero-one-loss) function.
* **Jaccard similarity** - this metric measure the average overlapping of the predicted labels and the expected labels for each sample. The overlapping of the label set is defined as the intersection of the predicted labels and the expected labels divided by the size of the union of the two sets. This is implemented by the [sklearn.metrics.jaccard_similarity_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html#sklearn-metrics-jaccard-similarity-score) function.
* **Average per-genre accuracy** - This is the average number of times. We provided our own implementation for this metric in this notebook.
* **Precision** - Is the average ratio of predicted correct labels to the total number of labels. This is reported by the [sklearn.metrics.classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report) function.
* **Recall** - Is the average ratio of predicted correct labels to the number of expected labels. This is reported by the [sklearn.metrics.classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report) function.
* **F1** - The F1 score is the harmonic mean of the precision and recall: the closer this value is to 1.0, the better is the predicted set of labels. This is reported  by the [sklearn.metrics.classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report) function.

The utility function below produces a summary of all the above performance metrics.

In [26]:
def detailed_report(predicted_y, test_y):
    print classification_report(test_y, predicted_y,
                                target_names = label_binarizer.classes_)
    
    print("\n**** Per genre accuracy ****\n\n")

    per_genre_accuracies = []
    for i, label in enumerate(label_binarizer.classes_):
        per_genre_accuracies.append(accuracy_score(predicted_y[:, i], test_y[:,i]))
        print("{} accuracy:\t{:.3f}".format(label, per_genre_accuracies[-1]))

    print("\n**** OTHER METRICS ****\n\n")
    print("Overall accuracy:\t{:.3f}".format(accuracy_score(test_y, predicted_y)))
    print("Average per-genre accuracy:\t{:.3f}".format(np.mean(per_genre_accuracies)))
    print("Hamming loss:\t\t{:.3f}".format(hamming_loss(test_y, predicted_y)))
    print("Zero one loss:\t\t{:.3f}".format(zero_one_loss(test_y, predicted_y)))
    print("Jaccard similarity:\t{:.3f}".format(jaccard_similarity_score(test_y, predicted_y)))

### Create a naive models that randomly assigns genres

In [15]:
y1_test01.shape

(8762L, 20L)

In [16]:
def random_using_freqs(y):
    #assinging randomly a label to each of the genres
    rngenre = np.array(range( len(y)))
    np.random.seed(42)
    for i in range(y.shape[1]):
        rngenre_i = np.random.binomial(1, y[:,i].sum()*1.0/len(y), size=len(y))
        rngenre = np.vstack((rngenre, rngenre_i))

    rngenre = rngenre[1:rngenre.shape[1],:]

    random_genre = rngenre.reshape(rngenre.shape[1],rngenre.shape[0])
    return random_genre

In [17]:
y1_test01random = random_using_freqs(y1_test01)

The following is a detailed report for a model returning random labels for each sample.

In [29]:
detailed_report(y1_test01random, y1_test01)

                 precision    recall  f1-score   support

         Action       0.28      0.15      0.20      2088
      Adventure       0.16      0.13      0.14      1299
      Animation       0.07      0.13      0.09       568
         Comedy       0.31      0.12      0.17      2851
          Crime       0.15      0.13      0.14      1190
    Documentary       0.01      0.11      0.02       128
          Drama       0.50      0.13      0.20      4213
         Family       0.10      0.12      0.11       826
        Fantasy       0.09      0.12      0.10       754
        Foreign       0.00      0.03      0.00        30
        History       0.04      0.12      0.06       312
         Horror       0.12      0.12      0.12      1038
          Music       0.02      0.12      0.04       230
        Mystery       0.07      0.12      0.09       633
        Romance       0.16      0.12      0.14      1489
Science Fiction       0.12      0.14      0.13       908
       TV Movie       0.01    

The following is a detailed report for a model always assigning the same labels for each sample.

In [32]:
detailed_report(np.zeros(y1_test01.shape), y1_test01)

                 precision    recall  f1-score   support

         Action       0.00      0.00      0.00      2088
      Adventure       0.00      0.00      0.00      1299
      Animation       0.00      0.00      0.00       568
         Comedy       0.00      0.00      0.00      2851
          Crime       0.00      0.00      0.00      1190
    Documentary       0.00      0.00      0.00       128
          Drama       0.00      0.00      0.00      4213
         Family       0.00      0.00      0.00       826
        Fantasy       0.00      0.00      0.00       754
        Foreign       0.00      0.00      0.00        30
        History       0.00      0.00      0.00       312
         Horror       0.00      0.00      0.00      1038
          Music       0.00      0.00      0.00       230
        Mystery       0.00      0.00      0.00       633
        Romance       0.00      0.00      0.00      1489
Science Fiction       0.00      0.00      0.00       908
       TV Movie       0.00    

### SVM (training on 10%  of the data)

In [21]:
def tune_svc(xtrain, ytrain, xtest, ytest, ytest_random):
    model_to_set = OneVsRestClassifier(SVC(kernel="linear"), n_jobs=-1)

    tuned_parameters = [
      {'estimator__C': [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 50.0], 'estimator__kernel': ['linear']},
      {'estimator__C': [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 50.0],
       'estimator__gamma':  [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 0.2, 0.3],
       'estimator__kernel': ['rbf']
      },
     ]


    scores = ['precision']
    for score in scores:
        print("# Tuning hyper-parameters for %s" % score)
        print()

        clf = GridSearchCV(model_to_set, 
                           param_grid = tuned_parameters, cv=5,
                           scoring='%s_macro' % score,
                          verbose = 5)
        clf.fit(xtrain, ytrain)
        
        print("Best parameters set found on development set:")
        print()
        print(clf.best_params_)
        print()
        print("Grid scores on development set:")
        print()
        means = clf.cv_results_['mean_test_score']
        stds = clf.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        print()
        print("Detailed classification report:")
        
        y_test_predict = clf.predict(xtest)
        
        detailed_report(y_test_predict, ytest)
        print("\n**** compare this result with random labeling ****\n\n")

        print classification_report(ytest,  ytest_random, target_names =label_binarizer.classes_)
        print ()

    return clf.best_params_, classification_report(ytest,  y_test_predict), y_test_predict

### Bag of words from keywords, tagline, overview plus actors and other features

In [22]:
std_scale = StandardScaler().fit(np.array(X_train01.iloc[:,:-2]))
X_train_std = std_scale.transform(np.array(X_train01.iloc[:,:-2]))
X_test_std = std_scale.transform(np.array(X_test01.iloc[:,:-2]))

param_01b, class_report_01b, y_test_pred_01b =\
    tune_svc(X_train_std, y1_train01, X_test_std, y1_test01, y1_test01random)

# Tuning hyper-parameters for precision
()
Fitting 5 folds for each of 63 candidates, totalling 315 fits
[CV] estimator__kernel=linear, estimator__C=1.0 ......................
[CV]  estimator__kernel=linear, estimator__C=1.0, score=0.323473, total=   2.1s
[CV] estimator__kernel=linear, estimator__C=1.0 ......................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.3s remaining:    0.0s


[CV]  estimator__kernel=linear, estimator__C=1.0, score=0.358270, total=   2.0s
[CV] estimator__kernel=linear, estimator__C=1.0 ......................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    8.6s remaining:    0.0s


[CV]  estimator__kernel=linear, estimator__C=1.0, score=0.241367, total=   1.9s
[CV] estimator__kernel=linear, estimator__C=1.0 ......................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   12.9s remaining:    0.0s


[CV]  estimator__kernel=linear, estimator__C=1.0, score=0.282172, total=   2.1s
[CV] estimator__kernel=linear, estimator__C=1.0 ......................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   17.3s remaining:    0.0s


[CV]  estimator__kernel=linear, estimator__C=1.0, score=0.263081, total=   2.1s
[CV] estimator__kernel=linear, estimator__C=2.0 ......................
[CV]  estimator__kernel=linear, estimator__C=2.0, score=0.323473, total=   2.0s
[CV] estimator__kernel=linear, estimator__C=2.0 ......................
[CV]  estimator__kernel=linear, estimator__C=2.0, score=0.358270, total=   2.0s
[CV] estimator__kernel=linear, estimator__C=2.0 ......................
[CV]  estimator__kernel=linear, estimator__C=2.0, score=0.241367, total=   1.9s
[CV] estimator__kernel=linear, estimator__C=2.0 ......................
[CV]  estimator__kernel=linear, estimator__C=2.0, score=0.282172, total=   2.1s
[CV] estimator__kernel=linear, estimator__C=2.0 ......................
[CV]  estimator__kernel=linear, estimator__C=2.0, score=0.263081, total=   2.2s
[CV] estimator__kernel=linear, estimator__C=4.0 ......................
[CV]  estimator__kernel=linear, estimator__C=4.0, score=0.323473, total=   2.0s
[CV] estimator

[Parallel(n_jobs=1)]: Done 315 out of 315 | elapsed: 44.3min finished


Best parameters set found on development set:
()
{'estimator__kernel': 'rbf', 'estimator__C': 2.0, 'estimator__gamma': 0.001}
()
Grid scores on development set:
()
0.294 (+/-0.084) for {'estimator__kernel': 'linear', 'estimator__C': 1.0}
0.294 (+/-0.084) for {'estimator__kernel': 'linear', 'estimator__C': 2.0}
0.294 (+/-0.084) for {'estimator__kernel': 'linear', 'estimator__C': 4.0}
0.294 (+/-0.084) for {'estimator__kernel': 'linear', 'estimator__C': 8.0}
0.294 (+/-0.084) for {'estimator__kernel': 'linear', 'estimator__C': 16.0}
0.294 (+/-0.084) for {'estimator__kernel': 'linear', 'estimator__C': 32.0}
0.294 (+/-0.084) for {'estimator__kernel': 'linear', 'estimator__C': 50.0}
0.024 (+/-0.045) for {'estimator__kernel': 'rbf', 'estimator__C': 1.0, 'estimator__gamma': 0.0001}
0.311 (+/-0.182) for {'estimator__kernel': 'rbf', 'estimator__C': 1.0, 'estimator__gamma': 0.0005}
0.303 (+/-0.171) for {'estimator__kernel': 'rbf', 'estimator__C': 1.0, 'estimator__gamma': 0.001}
0.036 (+/-0.020) fo

  'precision', 'predicted', average, warn_for)


### Performance evaluation
On a first look, the best-tuned SVM model using poster metadata, didn't have an impressive performance with a *precision* of $0.64$, a recall of $0.35$ and a F1 score of $0.43$, on average. However, as we can see from the per-genre report, the precision and recall are very dependant on the considered genre: *Drama*, also being the dominant genre, has the highest values for all these metrics while *TV Movie* performs the worst, together with *Foreign*. This is dragging down our model, suggesting that we could potentially suppress or aggregate these genres to improve the performances.

It's interesting to note how the value for the *hamming loss*, which is as low as $0.103$, half of the loss compared to the *naive random classifier* and slightly lower than the naive classifier always assigning the same label.

The *jaccard similarity score* gives a better highlight of the performance of this classifier compared to the other, dominating the ranking with a score of $0.331$.

 ### Random forest 

In [23]:
from sklearn.ensemble import RandomForestClassifier

def tune_rf(xtrain, ytrain, xtest, ytest, ytest_random):
    # Set the parameters by cross-validation
    model_to_set = OneVsRestClassifier(RandomForestClassifier(random_state=0), n_jobs=-1)
    tuned_parameters = [
      {"estimator__max_depth": [10,12,14,16,18,20,22], 'estimator__n_estimators': [ 100,150,200, 500]}
    ]

    scores = ['precision']
    
    for score in scores:
        print("# Tuning hyper-parameters for %s" % score)
        print()

        clf = GridSearchCV(model_to_set, 
                           param_grid = tuned_parameters, cv=5,
                           scoring='%s_macro' % score, verbose = 5)
        clf.fit(xtrain, ytrain)
        
        print("Best parameters set found on development set:")
        print()
        print(clf.best_params_)
        print()
        print("Grid scores on development set:")
        print()
        means = clf.cv_results_['mean_test_score']
        stds = clf.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        print()
        print("Detailed classification report:")
        
        y_test_predict = clf.predict(xtest)
        
        detailed_report(y_test_predict, ytest)
        print("\n**** compare this result with random labeling ****\n\n")

        print classification_report(ytest,  ytest_random, target_names =label_binarizer.classes_)
        print ()
   
    return clf.best_params_, classification_report(ytest,  y_test_predict), y_test_predict

In [24]:
print("All, sampling 0.1 of all, normalized")
param_01rf2, class_report_01rf2, y_test_pred_01rf2 =\
    tune_rf(X_train_std, y1_train01, X_test_std, y1_test01, y1_test01random)

All, sampling 0.1 of all, normalized
# Tuning hyper-parameters for precision
()
Fitting 5 folds for each of 28 candidates, totalling 140 fits
[CV] estimator__max_depth=10, estimator__n_estimators=100 ............
[CV]  estimator__max_depth=10, estimator__n_estimators=100, score=0.329840, total=   3.1s
[CV] estimator__max_depth=10, estimator__n_estimators=100 ............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.2s remaining:    0.0s


[CV]  estimator__max_depth=10, estimator__n_estimators=100, score=0.408278, total=   3.4s
[CV] estimator__max_depth=10, estimator__n_estimators=100 ............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    8.7s remaining:    0.0s


[CV]  estimator__max_depth=10, estimator__n_estimators=100, score=0.428671, total=   3.4s
[CV] estimator__max_depth=10, estimator__n_estimators=100 ............


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   13.3s remaining:    0.0s


[CV]  estimator__max_depth=10, estimator__n_estimators=100, score=0.431354, total=   3.3s
[CV] estimator__max_depth=10, estimator__n_estimators=100 ............


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   17.7s remaining:    0.0s


[CV]  estimator__max_depth=10, estimator__n_estimators=100, score=0.305894, total=   3.3s
[CV] estimator__max_depth=10, estimator__n_estimators=150 ............
[CV]  estimator__max_depth=10, estimator__n_estimators=150, score=0.270170, total=   4.3s
[CV] estimator__max_depth=10, estimator__n_estimators=150 ............
[CV]  estimator__max_depth=10, estimator__n_estimators=150, score=0.432100, total=   4.6s
[CV] estimator__max_depth=10, estimator__n_estimators=150 ............
[CV]  estimator__max_depth=10, estimator__n_estimators=150, score=0.429072, total=   4.6s
[CV] estimator__max_depth=10, estimator__n_estimators=150 ............
[CV]  estimator__max_depth=10, estimator__n_estimators=150, score=0.473016, total=   4.7s
[CV] estimator__max_depth=10, estimator__n_estimators=150 ............
[CV]  estimator__max_depth=10, estimator__n_estimators=150, score=0.361818, total=   4.5s
[CV] estimator__max_depth=10, estimator__n_estimators=200 ............
[CV]  estimator__max_depth=10, est

[Parallel(n_jobs=1)]: Done 140 out of 140 | elapsed: 23.3min finished


Best parameters set found on development set:
()
{'estimator__max_depth': 22, 'estimator__n_estimators': 100}
()
Grid scores on development set:
()
0.381 (+/-0.105) for {'estimator__max_depth': 10, 'estimator__n_estimators': 100}
0.393 (+/-0.142) for {'estimator__max_depth': 10, 'estimator__n_estimators': 150}
0.400 (+/-0.110) for {'estimator__max_depth': 10, 'estimator__n_estimators': 200}
0.391 (+/-0.096) for {'estimator__max_depth': 10, 'estimator__n_estimators': 500}
0.396 (+/-0.165) for {'estimator__max_depth': 12, 'estimator__n_estimators': 100}
0.385 (+/-0.141) for {'estimator__max_depth': 12, 'estimator__n_estimators': 150}
0.389 (+/-0.150) for {'estimator__max_depth': 12, 'estimator__n_estimators': 200}
0.396 (+/-0.109) for {'estimator__max_depth': 12, 'estimator__n_estimators': 500}
0.405 (+/-0.124) for {'estimator__max_depth': 14, 'estimator__n_estimators': 100}
0.398 (+/-0.110) for {'estimator__max_depth': 14, 'estimator__n_estimators': 150}
0.391 (+/-0.180) for {'estimator

### Performance evaluation
Compared to the *SVM* classifier, this model provides an higher *precision* and a far lower *recall* average recall value. It means that, given a positive sample, the classifier will fail to detect it more frequently but, given a positive prediction from the classifier, this prediciton will be more likely correct. This is not entirely positive for our use case, as it means that we will more often fail to assign the correct genre to the movies.

This behaviour is reflected also by the values obtained in the *jaccard score* and *F1 value*, which are lower compared to the SVM case. Oddly enough, the *hamming loss* is slightly lower compared to the SVM case.

# Part 2 - Investigate movie posters
We initially downloaded 500 pixels wide, color, posters in order to feed our models with movie poster data. Each poster was 750x500x3 byte, roughly making ~1MB of uncompressed color data per poster. The color data was vectorized, so that the data for each channel was appended to a single vector.

We had to standardize the poster data and apply PCA to reduce the dimensionality of our problem: the PCA found that we could explain 90% of the variance in the data by retaining the first 100 principal components, drastically reducing the dimension of our problem. However, training an SVM model with rbf kernel on a train set of 1000 posters proved to be quite challenging on local machines: memory quickly became our bottleneck.

We decided to reduce the dimension of the poster images to 138x92 pixels, retaining the RGB data. This allowed us to feed our algorithms with 5000 samples without loosing the color information.

## Load the dataset files
As reported in Milestone 1, we built an external library to fetch the data from IMDB and TMDB. We load the dataset here.

In [2]:
num_samples = 5000
TMDB_all = pd.read_pickle("../data/metadata_export")[:num_samples]
TMDB_img = tables.open_file("../data/image_export.h5", "r").get_node("/images")[:num_samples]

## Preprocess image data
We need to make sure all images have the same size, otherwise PCA will fail. Let's investigate the most frequent poster sizes in our corpus.

In [3]:
print TMDB_all.shape
print TMDB_img.shape

(5000, 18)
(5000L, 138L, 92L, 3L)


In [5]:
TMDB_img = TMDB_img.reshape((5000, 138 * 92 * 3))

In [8]:
# Count the occurrences of all the poster sizes.
image_size_dict = {}
for img in TMDB_img:
    image_size_dict[img.shape] = image_size_dict.get(img.shape, 0) + 1

# Pick the most frequent 5, just to have a look at them.
top5_resolutions = sorted(image_size_dict.items(), key=operator.itemgetter(1))[-5:]
# Pick the most frequent resolution
top_resolution = top5_resolutions[-1][0]
top5_resolutions

[((38088L,), 5000)]

## Setup the multi-label classification problem

In [9]:
label_binarizer = MultiLabelBinarizer()
TMDB_binarized_outcome = label_binarizer.fit_transform(TMDB_all["genres"])

In [10]:
label_binarizer.classes_

array([u'Action', u'Adventure', u'Animation', u'Comedy', u'Crime',
       u'Documentary', u'Drama', u'Family', u'Fantasy', u'History',
       u'Horror', u'Music', u'Mystery', u'Romance', u'Science Fiction',
       u'TV Movie', u'Thriller', u'War', u'Western'], dtype=object)

## Build the train, test and validation sets
We build a training set that has the same proportion of labels as the full dataset, so that every genre is represented.

We start sampling from the least frequent class to the most frequent one.

In [11]:
# Compute the frequency for each genre. Please note that the sums can
# be greater than the number of available movies, as the labels are not
# mutually exclusive.
genre_counts = {}
for genre in GENRE_LIST:
    genre_counts[genre] = TMDB_all['genres'].apply(lambda x: genre in x).sum()

# Sort the genres from the least occurring to the most frequent.
num_movies = TMDB_all.shape[0]
sorted_genres = sorted(genre_counts.items(), key=operator.itemgetter(1))

train_set_size = 0.8

# Generate a train/test set.
train_set = pd.DataFrame()
original_set = TMDB_all.copy()
for genre, count in sorted_genres:
    # Check how many samples have this label in the set.
    # The dataframe may be empty on the first run, so account for
    # that.
    already_sampled = 0
    if 'genres' in train_set:
        already_sampled = train_set['genres'].apply(lambda x: genre in x).sum()
        
    # If the test set already contains all the samples we expect
    # for this label, continue to the next label.
    expected_samples = int(math.floor(train_set_size * count))
    if already_sampled >= expected_samples:
        continue

    # If not, then randomly sample |expected - already_there| samples
    num_to_sample = expected_samples - already_sampled
    samples = original_set[original_set['genres'].apply(lambda x: genre in x)]\
                                                 .sample(n=num_to_sample, random_state = 42)
    # Append the random samples to the train_set
    train_set = train_set.append(samples)
    # Remove them from the original set, so we don't sample them again
    # for a different label.
    original_set = original_set.drop(samples.index)
    
# Verify that the genre proportions are kept in the train set.
for genre, count in sorted_genres:
    print("{} occurs {} times in the full set and {} in the train set"\
          .format(genre, count,  train_set['genres'].apply(lambda x: genre in x).sum()))
    
# Build a test set, by difference.
test_set = TMDB_all.drop(train_set.index)

TV Movie occurs 36 times in the full set and 33 in the train set
Documentary occurs 41 times in the full set and 35 in the train set
Western occurs 101 times in the full set and 92 in the train set
Music occurs 136 times in the full set and 130 in the train set
War occurs 172 times in the full set and 162 in the train set
History occurs 199 times in the full set and 177 in the train set
Animation occurs 383 times in the full set and 366 in the train set
Mystery occurs 442 times in the full set and 423 in the train set
Fantasy occurs 537 times in the full set and 509 in the train set
Family occurs 547 times in the full set and 511 in the train set
Horror occurs 591 times in the full set and 525 in the train set
Science Fiction occurs 640 times in the full set and 584 in the train set
Crime occurs 815 times in the full set and 744 in the train set
Romance occurs 835 times in the full set and 754 in the train set
Adventure occurs 931 times in the full set and 831 in the train set
Action o

## Evaluate a SVM classifier on Poster data

In [12]:
# Standardize the posters.
standardized_posters = StandardScaler().fit_transform(TMDB_img.tolist()) 

In [13]:
# Apply PCA
pca = PCA(n_components=100)
pca.fit(standardized_posters[train_set.index])

PCA(copy=True, iterated_power='auto', n_components=100, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [14]:
np.cumsum(pca.explained_variance_ratio_)

array([ 0.30157971,  0.36475684,  0.41017664,  0.44483254,  0.46960256,
        0.49164431,  0.50950769,  0.52392082,  0.53475083,  0.54480542,
        0.55443467,  0.56336389,  0.57115836,  0.57850479,  0.58544517,
        0.5921212 ,  0.59831321,  0.60428034,  0.60961437,  0.61484049,
        0.61987665,  0.62467522,  0.62932932,  0.63345395,  0.63750908,
        0.64142855,  0.64519294,  0.64877207,  0.65221904,  0.6555591 ,
        0.65867536,  0.66173929,  0.66464521,  0.6674424 ,  0.67009831,
        0.67272362,  0.67529259,  0.67781786,  0.68028342,  0.68265952,
        0.68494692,  0.68716383,  0.68934213,  0.69149258,  0.69359327,
        0.69564157,  0.69766309,  0.69961495,  0.70154563,  0.70342696,
        0.70521416,  0.70696227,  0.70868087,  0.7103779 ,  0.7120733 ,
        0.71373523,  0.71533551,  0.71689277,  0.7184294 ,  0.71991512,
        0.72137023,  0.72277609,  0.72417113,  0.7255613 ,  0.72690943,
        0.72823495,  0.72955313,  0.73086973,  0.73214972,  0.73

Project the input data into the PCA space.

In [15]:
train_projected_posters = pca.transform(standardized_posters[train_set.index])
test_projected_posters = pca.transform(standardized_posters[test_set.index])

In [36]:
parameters = {
    'estimator__C': [0.001, 0.01, 1.0, 10.0, 100.0, 1000.0],
    'estimator__gamma': [10e-6, 10e-3, 0.001, 0.01, 0.1]
}
svr = OneVsRestClassifier(SVC(kernel='rbf', class_weight='balanced'), n_jobs=-1)
clf = GridSearchCV(svr, parameters, cv=3, verbose=5)
clf.fit(train_projected_posters, TMDB_binarized_outcome[train_set.index])

Fitting 3 folds for each of 30 candidates, totalling 90 fits
[CV] estimator__C=0.001, estimator__gamma=1e-05 ......................
[CV]  estimator__C=0.001, estimator__gamma=1e-05, score=0.000000, total=   9.5s
[CV] estimator__C=0.001, estimator__gamma=1e-05 ......................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.0s remaining:    0.0s


[CV]  estimator__C=0.001, estimator__gamma=1e-05, score=0.000000, total=  10.6s
[CV] estimator__C=0.001, estimator__gamma=1e-05 ......................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   45.7s remaining:    0.0s


[CV]  estimator__C=0.001, estimator__gamma=1e-05, score=0.000000, total=  10.1s
[CV] estimator__C=0.001, estimator__gamma=0.01 .......................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.1min remaining:    0.0s


[CV]  estimator__C=0.001, estimator__gamma=0.01, score=0.000000, total=  10.2s
[CV] estimator__C=0.001, estimator__gamma=0.01 .......................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.5min remaining:    0.0s


[CV]  estimator__C=0.001, estimator__gamma=0.01, score=0.000000, total=  11.1s
[CV] estimator__C=0.001, estimator__gamma=0.01 .......................
[CV]  estimator__C=0.001, estimator__gamma=0.01, score=0.000000, total=  11.3s
[CV] estimator__C=0.001, estimator__gamma=0.001 ......................
[CV]  estimator__C=0.001, estimator__gamma=0.001, score=0.000000, total=   9.8s
[CV] estimator__C=0.001, estimator__gamma=0.001 ......................
[CV]  estimator__C=0.001, estimator__gamma=0.001, score=0.000000, total=  10.5s
[CV] estimator__C=0.001, estimator__gamma=0.001 ......................
[CV]  estimator__C=0.001, estimator__gamma=0.001, score=0.000000, total=  10.1s
[CV] estimator__C=0.001, estimator__gamma=0.01 .......................
[CV]  estimator__C=0.001, estimator__gamma=0.01, score=0.000000, total=  10.7s
[CV] estimator__C=0.001, estimator__gamma=0.01 .......................
[CV]  estimator__C=0.001, estimator__gamma=0.01, score=0.000000, total=  11.1s
[CV] estimator__C=

[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed: 35.0min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=OneVsRestClassifier(estimator=SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          n_jobs=-1),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'estimator__C': [0.001, 0.01, 1.0, 10.0, 100.0, 1000.0], 'estimator__gamma': [1e-05, 0.01, 0.001, 0.01, 0.1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=5)

In [37]:
clf.score(train_projected_posters, TMDB_binarized_outcome[train_set.index])

0.99556322405718511

In [38]:
clf.score(test_projected_posters, TMDB_binarized_outcome[test_set.index])

0.029692470837751856

In [39]:
detailed_report(clf.predict(test_projected_posters), TMDB_binarized_outcome[test_set.index])

                 precision    recall  f1-score   support

         Action       0.21      0.30      0.25       189
      Adventure       0.15      0.29      0.20       100
      Animation       0.06      0.24      0.09        17
         Comedy       0.45      0.45      0.45       264
          Crime       0.09      0.20      0.12        71
    Documentary       0.00      0.00      0.00         6
          Drama       0.56      0.53      0.54       464
         Family       0.09      0.22      0.12        36
        Fantasy       0.04      0.14      0.06        28
        History       0.00      0.00      0.00        22
         Horror       0.16      0.26      0.20        66
          Music       0.00      0.00      0.00         6
        Mystery       0.04      0.21      0.07        19
        Romance       0.16      0.33      0.21        81
Science Fiction       0.10      0.21      0.13        56
       TV Movie       0.00      0.00      0.00         3
       Thriller       0.35    

### Performance evaluation
Even though we expected this classifier to perform much better, we obtained some disappointing performances: all the performance metrics are worse compared to the metadata-based models. One exception is for the average *recall* value, which is slightly higher for this model (and the one below).

This is possibly related to the choice of featueres for this model, that doesn't really help correlating the image data with the *genre* of a particular movie.

Even though the performances did not live up to our expectations, it seemed to perform better than the naive classifiers trained above.

Let's try a linear kernel. Use LinearSVC, as it scales up better.

In [18]:
from sklearn.svm import LinearSVC

parameters = {
    'estimator__C': [0.001, 0.01, 1.0, 10.0, 50.0, 100.0, 1000.0],
}
svr_linearSVC = OneVsRestClassifier(LinearSVC(class_weight='balanced'), n_jobs=-1)
clf_linearSVC = GridSearchCV(svr_linearSVC, parameters, cv=5, verbose=5)
clf_linearSVC.fit(train_projected_posters, TMDB_binarized_outcome[train_set.index])

Fitting 5 folds for each of 7 candidates, totalling 35 fits
[CV] estimator__C=0.001 ..............................................
[CV] ............... estimator__C=0.001, score=0.000000, total=   7.3s
[CV] estimator__C=0.001 ..............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.3s remaining:    0.0s


[CV] ............... estimator__C=0.001, score=0.001232, total=   7.8s
[CV] estimator__C=0.001 ..............................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   15.2s remaining:    0.0s


[CV] ............... estimator__C=0.001, score=0.000000, total=   7.7s
[CV] estimator__C=0.001 ..............................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   23.0s remaining:    0.0s


[CV] ............... estimator__C=0.001, score=0.000000, total=   7.4s
[CV] estimator__C=0.001 ..............................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   30.5s remaining:    0.0s


[CV] ............... estimator__C=0.001, score=0.000000, total=   7.4s
[CV] estimator__C=0.01 ...............................................
[CV] ................ estimator__C=0.01, score=0.001232, total=   7.0s
[CV] estimator__C=0.01 ...............................................
[CV] ................ estimator__C=0.01, score=0.000000, total=   7.4s
[CV] estimator__C=0.01 ...............................................
[CV] ................ estimator__C=0.01, score=0.000000, total=   7.4s
[CV] estimator__C=0.01 ...............................................
[CV] ................ estimator__C=0.01, score=0.003699, total=   7.7s
[CV] estimator__C=0.01 ...............................................
[CV] ................ estimator__C=0.01, score=0.016030, total=   7.5s
[CV] estimator__C=1.0 ................................................
[CV] ................. estimator__C=1.0, score=0.000000, total=   7.2s
[CV] estimator__C=1.0 ................................................
[CV] .

[Parallel(n_jobs=1)]: Done  35 out of  35 | elapsed:  4.5min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=OneVsRestClassifier(estimator=LinearSVC(C=1.0, class_weight='balanced', dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=-1),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'estimator__C': [0.001, 0.01, 1.0, 10.0, 50.0, 100.0, 1000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=5)

In [21]:
detailed_report(clf_linearSVC.predict(test_projected_posters), TMDB_binarized_outcome[test_set.index])

                 precision    recall  f1-score   support

         Action       0.26      0.51      0.35       189
      Adventure       0.13      0.41      0.20       100
      Animation       0.04      0.29      0.07        17
         Comedy       0.44      0.62      0.51       264
          Crime       0.11      0.42      0.18        71
    Documentary       0.00      0.00      0.00         6
          Drama       0.54      0.53      0.54       464
         Family       0.07      0.39      0.12        36
        Fantasy       0.05      0.25      0.08        28
        History       0.02      0.09      0.03        22
         Horror       0.16      0.68      0.26        66
          Music       0.00      0.00      0.00         6
        Mystery       0.04      0.16      0.06        19
        Romance       0.12      0.40      0.18        81
Science Fiction       0.09      0.39      0.15        56
       TV Movie       0.00      0.00      0.00         3
       Thriller       0.34    

  'precision', 'predicted', average, warn_for)


### Performance evaluation
As with the previous model, this SVM with a *linear* kernel did not yield any surprising result. One thing worth noting, though, is that this model produced an higher *precision*, *recall* and *F1* score compared to the other SVM model, but reported a worse *hamming loss* and *jaccard similarity*.

# Discussion of the differences between the models, their strengths, weaknesses, etc.
We implemented two different strategies for predicting movie genre using:

1. textual data from the movie metadata.
2. image data from the movie posters.

Intuitively, words included in the overview and keywords should be good predictors for a movie's genre. We have seen  that words like “police” and “murder” appear often in "Crime" movie descriptions. Similarly, words such as “evil” and “vampire” are frequent for "Horror" movies. Both *keywords* and *actors* seem to be particularly good features at predicting the movie's genre. SVM and Random Forest models seem to be very well suited for this kind of classification problem and, given the result we obtained, seem to generalize better than other models.
One drawback of using movie metadata to build a model is the potential explosion in the model dimension due to semantically related lems, i.e. words that have similar meaning but different text representation.

Ideally, we wanted the models using the movie posters to exploit the fact that most posters within the same genre have similar color profiles. For example, posters might be brownish for Westerns, very contrasted for Comic movies. However, we found when training our model that they also vary greatly within the genre. This is due to the design trends likely changing over time.
Moreover, our data representation (*PCA* projected RGB data) potentially mitigated these within-genre variations, but still didn't allow our models to perform as better as the metadata ones. Therefore metadata models might generalize better.

Both models suffer by the number of available labels: decreasing the number of predicted labels, by removing samples or assigning the most common label to them, greatly reduces the prediction error.

# What else did we try?

* Random Forest with poster data, but it didn’t perform better than SVM with a RBF kernel.
* Random Forest with stump trees (depth = 2), again with no performance benefit over SVM.
* SVM (rbf) with grayscale poster data, but it only mitigated the memory impact of the data, without bringing any additional benefit to the model, even if it enabled feeding it with 10000 samples.
* SVM (linear) with both grayscale and RGB poster data. While this trained much faster than the RBF kernel, it didn’t provide any model performance benefit over SVM + rbf.