# Opening credits

Movies are cool. They make you think, feel, learn and, above all, be entertained. They are the ultimate hobby. What are your favourite movies? Directors? Genres? Which movies do you really (really) hate? Which movies are your guilty pleasures? Top Gun? Me too.

This is the very first of a series of kernels I am going to dedicate to my passion for movies. Although I will be using data science tools to extract information and display it in particular ways, these are going to be short articles mainly about movies. Through data, yes, but about movies.

In this first kernel I dive into a concept I have always found particularly tricky: genres. Wikipedia says a genre is just a style or category of art. However, I believe movie genres are especially fuzzy beings. Where does horror end and thriller start? Is _The Big Lebowski_ a comedy? A noir film? Do you have a particular favourite genre? How do you know it's your favourite? Maybe you like a lot of movies of that genre, but that's just because there are a lot of movies in it. I mean, it defies statistics that Michael Bay hasn't been able to make a decent Transformers movie. And what is it that defines a genre? A particular aesthetic? Are there similar genres? Comedy and romance? Action and adventure? Are there genres that are more frequent in some decades or countries? Can we reverse engineer some of these questions purely from data analysis?

Will this be too technical for people who just like movies and too boring for people who only like data science? Let's get to it, or as Martin Lawrence's character in the masterpiece _Bad Boys 2_ once said: _shit just got real_.

# First act: the data

## Libraries Loading

We are going to use the usual suspects (movie reference) for data analysis and visualization.

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances
import networkx as nx
import json
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
import matplotlib.pyplot as plt
import altair as alt
import plotly
import os
from math import sqrt

%precision %.2f
pd.options.display.float_format = '{:,.2f}'.format

#print(os.listdir("../input"))

## Data Loading

In this first kernel I will be using the TMDB 5000 Movie Dataset, one of the most popular datasets on Kaggle. I like it because it is relatively small and easy to handle. Maybe in later kernels I will use deeper datasets or combinations of several.

Reading the data

In [None]:
credits = pd.read_csv('../input/tmdb_5000_credits.csv')
movies = pd.read_csv('../input/tmdb_5000_movies.csv')

One of the main sins of data scientists (and there are many) is not looking at the data. Seems fairly straightforward, right? Just look at the data! Develop a feeling for how it looks, get to know it. The dataset is divided into two tables: `movies` and `credits`.

`movies` contains the basic infom

Let's look at the very first movie in the dataset.

In [None]:
movies.head(1)

In [None]:
credits.head(1)

Ok, I am not a big fan of this movie, so there go half of my readers! It's _Avatar_, the James Cameron uberhit that still holds the number 1 spot at the all-time box office with almost 2.8 billion USD (yeah, I said billion). You can actually check that number in the `revenue` field within the `movies` table.

We extract the genres of each movie. Most of the time, movies belong to more than one genre.

In [None]:
genre_list = movies.genres.apply(json.loads).apply(lambda x: [e['name'] for e in x if 'name' in e])
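
To make that one-liner less magical, here is a minimal sketch of what it does, run on a made-up two-row sample instead of the real table. The sample strings (and the ids inside them) are hypothetical; they only mimic the JSON-list shape of the `genres` column.

```python
import json
import pandas as pd

# Hypothetical sample mimicking the `genres` column: each cell is a JSON string
# holding a list of {"id": ..., "name": ...} objects (possibly an empty list).
sample = pd.Series([
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[]',
])

# Step 1: parse each JSON string into a Python list of dicts.
# Step 2: keep only the genre names, skipping any entry without a "name" key.
parsed = sample.apply(json.loads).apply(
    lambda entries: [e["name"] for e in entries if "name" in e]
)
print(parsed.tolist())
```

Note that a movie with no genres simply ends up with an empty list, which the later steps have to tolerate.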

_Avatar_ genres are the following:

In [None]:
pd.DataFrame(genre_list[0])

Now let's see how many different genres there are in this dataset.

In [None]:
unique_genres = set([])
for x in genre_list.values:
    for e in x:
        unique_genres.add(e)
#len(list(unique_genres))

Those are:

In [None]:
pd.DataFrame(list(unique_genres))

Ok... some of these are... questionable to say the least hehe. Family?! As in kid-friendly? Foreign seems like a very US-centered concept: foreign films are just films, with their own genres, but whatever. And TV Movie just sounds like an insult to me hehe.

Let's now just expand the genres into columns and mark the ones that are defined in each movie.

In [None]:
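def build_genre_row(genre_list, all_genres=unique_genres):
    # One 0/1 entry per known genre; sorted so the column order is deterministic
    row_movie_genre = pd.Series(0, index=sorted(all_genres))
    # Flag the genres this particular movie belongs to
    row_movie_genre[genre_list] = 1
    return row_movie_genre

In [None]:
# One row per movie, one column per genre
genres = pd.DataFrame([build_genre_row(e) for e in genre_list])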

For instance, _Avatar_ genres are shown as:

In [None]:
#genres = pd.concat([movies['original_title'], genres], axis = 1)
genres.head(1)
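
As an aside, scikit-learn ships a helper that builds the same kind of 0/1 genre table. The sketch below is not what this kernel uses; it swaps in `MultiLabelBinarizer` and runs on a hypothetical three-movie list (`toy_genre_list`) just to show the shape of the result.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for `genre_list`: one list of genre names per movie.
toy_genre_list = pd.Series([
    ["Action", "Adventure"],
    ["Comedy"],
    [],  # a movie without any genre
])

# fit_transform learns the set of labels and returns one 0/1 column per label.
mlb = MultiLabelBinarizer()
toy_genres = pd.DataFrame(
    mlb.fit_transform(toy_genre_list),
    columns=mlb.classes_,        # labels, in sorted order
    index=toy_genre_list.index,
)
print(toy_genres)
```

The hand-written function above has the advantage of making every step explicit, which is probably why it is worth keeping in a kernel meant to be read.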

# Second act: exploration

# Third act: similarity

## Cosine similarity

Let's see which genres appear together most frequently. Without looking into the dataset I would assume it would be pairs such as Action-Adventure or Romance-Comedy or something like that. In order to do that, let's use the `cosine_similarity` function within sklearn.

Let's see a practical example with Family-Animation:

In [None]:
genres.Family.sum()

In [None]:
genres.Animation.sum()

In [None]:
genres.query('Family == 1 & Animation == 1').shape[0]

In [None]:
(genres.query('Family == 1 & Animation == 1').shape[0])\
/(sqrt(genres.Family.sum())*sqrt(genres.Animation.sum()))
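
The hand calculation above works because, for 0/1 vectors, the dot product is just the number of movies carrying both genres, and each vector's norm is the square root of its own count. A small sketch on two made-up genre vectors (`a` and `b` are hypothetical, six movies each) checks that this matches what `cosine_similarity` returns:

```python
from math import sqrt

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical genres flagged over six movies (1 = movie has the genre).
a = np.array([1, 1, 1, 0, 0, 0])
b = np.array([1, 1, 0, 1, 0, 0])

# Movies that carry both genres: this is the dot product of the two vectors.
both = int(np.sum((a == 1) & (b == 1)))

# Same formula as in the cell above: co-occurrences over the product of the norms.
by_hand = both / (sqrt(a.sum()) * sqrt(b.sum()))

# sklearn expects 2D inputs, one row per vector.
by_sklearn = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(by_hand, by_sklearn)
```

Here two shared movies out of three each gives 2 / (sqrt(3) * sqrt(3)) = 2/3 both ways.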

For every pair of genres we have:

In [None]:
genres_sim = pd.DataFrame(cosine_similarity(genres.T))
genres_sim.columns = genres.columns
genres_sim.index = genres.columns

In [None]:
genres_sim

Ok, now this is starting to get interesting. The previous matrix, stored in the `genres_sim` dataframe, measures how often two genres appear in the same movies. This measure is the `cosine_similarity` and ranges from 0 to 1. One means complete similarity, which only occurs on the main diagonal, that is, between a genre and itself. Zero means no overlap whatsoever. Please note that this is a symmetric matrix, so every off-diagonal value appears twice. Some trends start to appear.

In [None]:
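# Keep only the upper triangle (k=1 skips the diagonal) so each pair is listed once.
# The builtin `bool` is used because the `np.bool` alias was removed from NumPy.
df = genres_sim.where(np.triu(np.ones(genres_sim.shape), 1).astype(bool))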
df = df.stack().reset_index()
df.columns = ['Row','Column','Value']
df.sort_values('Value', ascending = False).head(10)

In [None]:
df.query('Row == "Adventure" | Column == "Adventure"').sort_values('Value', ascending = False).head(10)

## Dendrograms, oh look! pretty pictures!

Ok, it looks like my predictions were not that far off. Action-Adventure is the second most frequent combination, only after Family-Animation. Comedy-Romance is number six because we live in a bleak world now haha. Some combinations are perfectly logical, though they did not occur to me immediately: Thriller-Crime or War-History. Please note that `cosine_similarity` is a normalized dot product, so a higher similarity does not necessarily mean a larger number of movies sharing those genres. It means that the two genres share a larger fraction of their movies, relative to how many movies each of them has. So if two genres are very uncommon but they always appear together, we would see them at the top of the list.
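
To see that normalization effect in isolation, here is a toy sketch with four hypothetical genres over ten made-up movies. The two common genres share more movies in absolute terms, yet the two rare genres that always appear together get the higher similarity.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 0/1 table: ten movies (rows), four made-up genres (columns).
toy = pd.DataFrame({
    "Common1": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Common2": [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "Rare1":   [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "Rare2":   [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Raw co-occurrence counts: entry (i, j) is the number of movies with both genres.
co_occurrence = toy.T @ toy

# Cosine similarity between genre columns, labelled like `genres_sim`.
sim = pd.DataFrame(cosine_similarity(toy.T), index=toy.columns, columns=toy.columns)

print(co_occurrence.loc["Common1", "Common2"], sim.loc["Common1", "Common2"])
print(co_occurrence.loc["Rare1", "Rare2"], sim.loc["Rare1", "Rare2"])
```

The common pair shares six movies but scores 6 / (sqrt(8) * sqrt(8)) = 0.75, while the rare pair shares only two and scores 1.0.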

Plotly offers limited options when displaying a dendrogram: by default, the linkage used for the hierarchical clustering is set to complete.

In [None]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
init_notebook_mode(connected=True) #do not miss this line

names = genres_sim.columns.tolist()
dendro = ff.create_dendrogram(genres_sim, orientation='left', labels=names)
dendro['layout'].update({'width':800, 'height':600, 'margin':go.layout.Margin(
        l=150,
        r=50,
        b=50,
        t=50,
        pad=0
    )})

py.offline.iplot(dendro)

When going directly to scipy and using the `hierarchy.linkage` function, we can choose among many different linkages and distances. Please note that cosine distance = 1 - cosine similarity.

In [None]:
from scipy.cluster import hierarchy
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

Z = hierarchy.linkage(genres.T, 'complete', 'euclidean')
# single, complete, average, weighted, centroid, ward, median
# cosine, euclidean, jaccard, etc.

fig = plt.figure()
fig.set_size_inches(12, 10)
names = genres_sim.columns.tolist()
dn = hierarchy.dendrogram(Z, orientation='right', labels=names)
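
The cell above uses the euclidean distance. If we wanted the dendrogram to follow the cosine relationship mentioned earlier, a sketch along these lines could work. It runs on a hypothetical four-genre toy matrix, and the choice of `average` linkage is just an assumption for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: four genres (rows) flagged over six movies (columns),
# i.e. the same orientation as `genres.T`.
toy = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1],
])

# scipy's "cosine" metric is a distance: 1 - cosine similarity.
cos_dist = squareform(pdist(toy, metric="cosine"))
same = np.allclose(cos_dist, 1 - cosine_similarity(toy))

# Hierarchical clustering directly on cosine distance.
Z = hierarchy.linkage(toy, method="average", metric="cosine")
print(same, Z.shape)
```

One caveat: cosine distance is undefined for an all-zero row, so a genre without a single movie would need to be dropped first.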

## Network stuff, oh man I am rusty : (

In [None]:
G = nx.from_pandas_adjacency(genres_sim)
G.name = 'Graph from pandas adjacency matrix'
print(G)  # summary of nodes and edges; nx.info() was removed in NetworkX 3.0

In [None]:
from networkx.algorithms import community
#G = nx.barbell_graph(5, 1)
communities_generator = community.girvan_newman(G)
top_level_communities = next(communities_generator)
next_level_communities = next(communities_generator)
sorted(map(sorted, next_level_communities))
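
Since I am rusty, a sanity check on a graph where the answer is obvious helps. The sketch below is not the genre graph: it uses a toy barbell graph (two cliques of five nodes joined by a single edge) to show what each `next()` call on the Girvan-Newman generator returns.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph: two 5-node cliques (nodes 0-4 and 5-9) joined by the edge 4-5.
toy_graph = nx.barbell_graph(5, 0)

# girvan_newman repeatedly removes the edge with the highest betweenness;
# each next() yields the communities after one more split.
splits = community.girvan_newman(toy_graph)
first_split = sorted(map(sorted, next(splits)))
print(first_split)
```

As far as I can tell, the default edge ranking does not look at edge weights, so on a graph like `G`, where the similarity lives in the weights, it may be worth passing a custom `most_valuable_edge` function. That is an assumption worth checking against the NetworkX documentation.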

## Proper clustering (YEAH, SCIENCE BITCH!)

### Affinity Propagation

In [None]:
from sklearn.cluster import AffinityPropagation
import numpy as np

clustering = AffinityPropagation().fit(genres_sim)
clustering 
cluster_centers_indices = clustering.cluster_centers_indices_
labels = clustering.labels_

n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)

Seems to work just fine.

In [None]:
aux = pd.DataFrame(clustering.labels_)
aux.index = genres.columns
aux.sort_values(0)

In [None]:
aux3 = pd.DataFrame(clustering.cluster_centers_)
aux3.columns = genres.columns
aux3

### KMeans (ALL THE MEANS!)

In [None]:
from sklearn.cluster import KMeans

wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(genres.T)
    wcss.append(kmeans.inertia_)

plt.plot(range(1,11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()

There is no clear elbow at which the K-Means method finds an optimal partition into clusters: the curve is almost linear. Let's just take five, for instance:

In [None]:
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(genres.T)

In [None]:
aux2 = pd.DataFrame(kmeans.labels_)
aux2.index = genres.columns
aux2.sort_values(0)

The result is not very useful, since the algorithm keeps most of the genres in the same cluster.

In [None]:
pd.DataFrame(kmeans.cluster_centers_)
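
When the elbow plot refuses to bend, one alternative way of picking the number of clusters is the silhouette score. This is a different technique from the one used above, and the sketch runs on hypothetical well-separated blobs rather than on the genre table, so it only illustrates the mechanics.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three tight, well-separated blobs in 2D.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette needs at least two clusters, so the scan starts at k=2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher is better: pick the k with the largest average silhouette.
best_k = max(scores, key=scores.get)
print(best_k)
```

On data as muddy as 0/1 genre flags the scores may well be low for every k, which would itself be a hint that K-Means is not a great fit here.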

### Agglomerative Clustering

This is exactly the same thing as the dendrogram, except that it gives you back one particular snapshot, defined by the affinity, the linkage and the number of clusters.

In [None]:
from sklearn.cluster import AgglomerativeClustering

In [None]:
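# `metric` replaces the older `affinity` argument, which recent scikit-learn releases no longer accept
clustering = AgglomerativeClustering(metric='euclidean', linkage='complete', n_clusters=5, connectivity=genres_sim)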
clustering.fit(genres.T)
clustering

If the connectivity matrix is not provided, the results change slightly. If the linkage is not 'complete', the results are much worse. Other than that, ok.

In [None]:
clustering.labels_

In [None]:
aux2 = pd.DataFrame(clustering.labels_)
aux2.index = genres.columns
aux2.sort_values(0)
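
To back up the claim that agglomerative clustering is just one snapshot of the dendrogram, here is a sketch that cuts a scipy linkage tree into three clusters and compares the result with scikit-learn. It uses hypothetical blobs and leaves out the connectivity matrix, so it is a simplified version of the setup above.

```python
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy data: three tight, well-separated blobs in 2D.
X, _ = make_blobs(
    n_samples=30,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Build the full tree with scipy, then cut it so that three clusters remain.
Z = hierarchy.linkage(X, method="complete", metric="euclidean")
scipy_labels = hierarchy.fcluster(Z, t=3, criterion="maxclust")

# Ask scikit-learn for the same snapshot (euclidean distance is its default).
sk_labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X)

# The label numbers differ, so compare the partitions rather than the raw labels.
ari = adjusted_rand_score(scipy_labels, sk_labels)
print(ari)
```

An adjusted Rand index of 1.0 means both libraries produced the same grouping, only with different label numbers.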