# Clustering Netflix Titles

Markdown practice warm-up:

There's a file named `me_hoy_medoid.png` in this directory. Display the image in this notebook using a markdown cell.

## Load

In [1]:
%reload_ext nb_black


In [2]:
import pandas as pd
import numpy as np

from scipy.spatial.distance import pdist, squareform

# !pip install pyclustering
from pyclustering.cluster.kmedoids import kmedoids

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


url = "https://raw.githubusercontent.com/AdamSpannbauer/flixable_ml_dsi/master/data/movies_2020_01_23_13_15_04.csv"
movie = pd.read_csv(url)

# Drop rows where genre is na
movie = movie.dropna(subset=["Genre"])

# Proceed with sample of rows to make things run faster for class time
movie = movie.sample(2000, random_state=42)

# Subset down to a small feature set
# fmt: off
drop_columns = ['Poster', 'flixable_url', 'Response', 
                'Awards', 'Rated', 'imdbID', 'DVD', 'Website',
                'BoxOffice', 'Released', 'added_to_netflix',
                'Writer', 'Actors', 'Plot',
                'Metascore', 'Production',
                'totalSeasons', 'Runtime', 'Director',
                'Title', 'Ratings', 'Year', 'imdbRating',
                'imdbVotes']
# fmt: on
movie = movie.drop(columns=drop_columns)


In [3]:
movie.head()

Unnamed: 0,Country,Genre,Language,Type,mpaa_rating
3136,Hong Kong,"Action, Comedy","Cantonese, Mandarin",movie,TV-14
1648,Egypt,"Action, Comedy, Drama",Arabic,movie,TV-14
3641,USA,Drama,English,movie,TV-14
4221,India,Comedy,,movie,TV-PG
158,South Korea,"Comedy, Drama, Family",Korean,series,TV-14



In [4]:
for col in movie:
    print(f"\n------- {col} -------------")
    print(movie[col].value_counts())


------- Country -------------
USA                                            676
India                                          308
UK                                              93
Japan                                           54
South Korea                                     49
                                              ... 
Australia, UK, United Arab Emirates, Canada      1
USA, Hungary                                     1
USA, Canada, Germany                             1
Germany, Italy                                   1
Spain, Mexico                                    1
Name: Country, Length: 261, dtype: int64

------- Genre -------------
Comedy                                                    214
Drama                                                     178
Documentary                                               162
Comedy, Drama                                              56
Comedy, Drama, Romance                                     47
                            


## Preprocess

In [5]:
movie.isna().sum()

Country        41
Genre           0
Language       74
Type            0
mpaa_rating     0
dtype: int64


In [6]:
movie = movie.dropna()


Create a copy of the dataframe to preserve this original structure for cluster analysis later.

In [7]:
# Do this once you've filtered to the records you want to cluster
# but before you've filtered to the features you want to cluster
og_movie = movie.copy()


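Use [`pd.Series.str.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html) to dummy encode `'Genre'`, `'Language'`, and `'Country'`. Each of these columns can hold several comma-separated values per row, so we split on `", "` and create one 0/1 column per distinct value.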

In [8]:
genre_dummies = movie["Genre"].str.get_dummies(", ")


In [9]:
language_dummies = movie["Language"].str.get_dummies(", ")


In [10]:
country_dummies = movie["Country"].str.get_dummies(", ")
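
To make the splitting behavior concrete, here is a minimal sketch on a toy series. The values and the names `genres` and `toy_dummies` are made up for illustration and are not part of the Netflix data.

```python
import pandas as pd

# Toy stand-in for the "Genre" column (made-up values)
genres = pd.Series(["Action, Comedy", "Drama", "Comedy, Drama"])

# Split each string on ", " and build one 0/1 column per distinct token
toy_dummies = genres.str.get_dummies(", ")
print(toy_dummies)
```

A row with several genres gets a 1 in several columns, which is what lets multi-genre titles be compared feature by feature later on.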


Combine all 3 dummy dataframes into a single (very wide) dataframe.

In [11]:
str_dummies = pd.concat((genre_dummies, language_dummies, country_dummies), axis=1)
str_dummies.head()

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Thailand,Tunisia,Turkey,UK,USA,Uganda,Ukraine,United Arab Emirates,Uruguay,Zimbabwe
3136,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1648,1,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3641,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
158,0,0,0,0,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3556,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0



* Drop the original `'Genre'`, `'Language'`, and `'Country'` columns from the `movie` dataframe.
* Add the data from `str_dummies` to the `movie` dataframe.

In [12]:
movie = movie.drop(columns=["Genre", "Language", "Country"])
movie = pd.concat((movie, str_dummies), axis=1)
movie.head()

Unnamed: 0,Type,mpaa_rating,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,...,Thailand,Tunisia,Turkey,UK,USA,Uganda,Ukraine,United Arab Emirates,Uruguay,Zimbabwe
3136,movie,TV-14,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1648,movie,TV-14,1,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3641,movie,TV-14,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
158,series,TV-14,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3556,movie,TV-PG,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0



Use [`pd.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to dummy encode `'Type'` and `'mpaa_rating'`.

In [13]:
movie = pd.get_dummies(movie)
print(movie.shape)
movie.head()

(1916, 226)


Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,mpaa_rating_PG,mpaa_rating_PG-13,mpaa_rating_R,mpaa_rating_TV-14,mpaa_rating_TV-G,mpaa_rating_TV-MA,mpaa_rating_TV-PG,mpaa_rating_TV-Y,mpaa_rating_TV-Y7,mpaa_rating_TV-Y7-FV
3136,1,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1648,1,0,0,0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
3641,0,0,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
158,0,0,0,0,1,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,0
3556,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
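
As a small sketch of what `pd.get_dummies()` does to a mixed dataframe: string columns are expanded into prefixed 0/1 columns, while columns that are already numeric pass through unchanged. The toy frame below is made up for illustration. Passing `dtype=int` is an assumption added here so the result holds 0/1 integers on pandas versions that default to booleans.

```python
import pandas as pd

# Toy frame: one string column and one column that is already a 0/1 dummy
toy = pd.DataFrame(
    {"Type": ["movie", "series", "movie"], "Comedy": [1, 0, 1]}
)

# Only the string column is expanded; "Comedy" is left as-is
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```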



## Calculate distances

* Use `pdist` and `squareform` to calculate the distance between each row
    * What distance metric makes the most sense here?

In [14]:
dist = pdist(movie, metric="dice")
dist_mat = squareform(dist)
dist_mat.shape

(1916, 1916)
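
The Dice dissimilarity used above is one reasonable choice for all-binary features, because it only counts features that at least one of the two rows has. Features that both rows lack (shared zeros), which dominate a wide dummy matrix, do not make two titles look more alike. The sketch below computes it by hand for two made-up vectors and compares the result with `pdist`. The names `a`, `b`, `manual_dice`, and `scipy_dice` are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two made-up binary feature vectors
a = np.array([1, 1, 0, 0], dtype=bool)
b = np.array([1, 0, 1, 0], dtype=bool)

both = np.sum(a & b)      # features present in both rows
only_one = np.sum(a ^ b)  # features present in exactly one row

# Dice dissimilarity: mismatches / (2 * shared ones + mismatches)
manual_dice = only_one / (2 * both + only_one)

# The same quantity via scipy, as used in the notebook
scipy_dice = pdist(np.vstack([a, b]), metric="dice")[0]
print(manual_dice, scipy_dice)
```

Jaccard (`metric="jaccard"`) is a closely related alternative that also ignores shared zeros but does not double-weight shared ones.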


In [15]:
dist_mat_df = pd.DataFrame(dist_mat, index=movie.index, columns=movie.index)


## Cluster with K-medoids

We need to initialize the starting 'medoids' for our clusters.  To do this, `pyclustering` wants us to provide the indices of our starting points.

* Generate `k` random indices from our distance matrix

In [16]:
k = 5


In [17]:
np.random.seed(42)

# TODO: randint has issue of possible duplicate medoids
# FIXME: fix by sampling index without replacement instead
nrows = dist_mat.shape[0]
init_medoids = np.random.randint(0, nrows, k)
init_medoids

array([1126, 1459,  860, 1294, 1130])
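
The TODO in the cell above notes that `randint` can return the same index twice. One possible way to address it, sketched here rather than applied to the notebook, is to sample positions without replacement. The names `rng`, `nrows_toy`, `k_toy`, and `init_medoids_alt` are illustrative, and the indices this draws will differ from the ones shown above.

```python
import numpy as np

rng = np.random.default_rng(42)

nrows_toy = 1916  # number of rows in the notebook's distance matrix
k_toy = 5

# replace=False guarantees k distinct starting positions
init_medoids_alt = rng.choice(nrows_toy, size=k_toy, replace=False)
print(init_medoids_alt)
```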


In [18]:
kmed = kmedoids(
    dist_mat, initial_index_medoids=init_medoids, data_type="distance_matrix"
)

kmed.process()

<pyclustering.cluster.kmedoids.kmedoids at 0x1476b6040>


Use the `.get_medoids()` method to find the index for each cluster center.

In [19]:
medoid_idxs = kmed.get_medoids()
medoid_idxs

[1634, 15, 39, 1800, 1314]


Use the `.predict()` method to output the cluster label for each record in a dataset.

In [20]:
labels = kmed.predict(dist_mat)
labels

array([3, 3, 4, ..., 3, 1, 1])
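
Conceptually, labeling from a distance matrix amounts to assigning each row to its nearest medoid. The sketch below shows that idea with plain NumPy on a made-up 4-point matrix. It illustrates the assignment rule and is not a copy of `pyclustering`'s internals.

```python
import numpy as np

# Toy symmetric distance matrix for 4 points (made-up values)
toy_dist = np.array(
    [
        [0.0, 0.2, 0.9, 0.8],
        [0.2, 0.0, 0.7, 0.9],
        [0.9, 0.7, 0.0, 0.1],
        [0.8, 0.9, 0.1, 0.0],
    ]
)

# Positions of the medoids within the matrix
toy_medoids = [0, 3]

# Keep only the medoid columns, then pick the closest medoid per row
toy_labels = np.argmin(toy_dist[:, toy_medoids], axis=1)
print(toy_labels)
```

The label is the position of the medoid in the medoid list, which is why label `0` later corresponds to the first entry of `medoid_idxs`.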


Put these labels into both the `og_movie` and `movie` dataframes.

In [21]:
og_movie["label"] = labels
movie["label"] = labels


## Explore Clusters

Use the `medoid_idxs` to pull out our cluster centers from `og_movie`.

In [22]:
medoid_idxs

[1634, 15, 39, 1800, 1314]


In [23]:
og_movie.iloc[medoid_idxs, :]

Unnamed: 0,Country,Genre,Language,Type,mpaa_rating,label
3699,USA,"Crime, Drama, Thriller",English,movie,TV-MA,0
5824,USA,Comedy,English,movie,TV-MA,1
144,USA,Drama,English,series,TV-MA,2
3393,India,"Comedy, Drama",Hindi,movie,TV-14,3
4258,USA,Documentary,English,movie,TV-14,4
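
Note that `medoid_idxs` holds row positions in the distance matrix, not dataframe index labels, which is why `.iloc` is used above. A minimal sketch of the distinction, using a made-up frame whose index labels mimic the shuffled labels left behind by sampling:

```python
import pandas as pd

# Toy frame with non-sequential index labels (made-up values)
toy = pd.DataFrame(
    {"Genre": ["Drama", "Comedy", "Documentary"]},
    index=[3136, 1648, 3641],
)

# Positions 0 and 2, as a clustering routine would report them
toy_positions = [0, 2]

# .iloc selects by position; .loc would look for index labels 0 and 2,
# which do not exist in this frame
centers = toy.iloc[toy_positions, :]
print(centers)
```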



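Analyze the clusters by averaging every dummy column within each label. Because the columns are 0/1, each mean is the share of titles in that cluster that have the feature.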

In [24]:
cluster_avgs = movie.groupby("label").mean()


In [25]:
# Most defining cluster 0 characteristics
cluster_avgs.T.sort_values(0, ascending=False).head(5)

label,0,1,2,3,4
Type_movie,1.0,1.0,0.228412,0.843648,0.967552
English,0.907514,0.988372,0.665738,0.107492,0.99115
Drama,0.736994,0.151163,0.398329,0.571661,0.079646
USA,0.65896,0.914729,0.406685,0.003257,0.746313
Thriller,0.528902,0.007752,0.167131,0.135179,0.029499
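
The pattern used in these cells (group by label, take the mean, transpose, sort by one cluster's column) can be seen on a toy frame. The data and the names `toy`, `toy_avgs`, and `top` are made up for illustration.

```python
import pandas as pd

# Toy dummy-encoded data with cluster labels (made-up values)
toy = pd.DataFrame(
    {
        "Comedy": [1, 1, 0, 0],
        "Drama": [0, 1, 1, 1],
        "label": [0, 0, 1, 1],
    }
)

# Mean of a 0/1 column = share of the cluster that has the feature
toy_avgs = toy.groupby("label").mean()

# Transpose so features are rows, then rank them for cluster 1
top = toy_avgs.T.sort_values(1, ascending=False)
print(top)
```

A feature that is common in every cluster, such as `Type_movie` in the output above, ranks high everywhere, so it helps to read across the row and compare clusters rather than look at one column alone.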



In [26]:
# Most defining cluster 1 characteristics
cluster_avgs.T.sort_values(1, ascending=False).head(5)

label,0,1,2,3,4
Type_movie,1.0,1.0,0.228412,0.843648,0.967552
English,0.907514,0.988372,0.665738,0.107492,0.99115
USA,0.65896,0.914729,0.406685,0.003257,0.746313
Comedy,0.095376,0.782946,0.144847,0.390879,0.294985
mpaa_rating_TV-MA,0.49422,0.705426,0.548747,0.250814,0.056047



In [27]:
# Most defining cluster 3 characteristics
cluster_avgs.T.sort_values(3, ascending=False).head(8)

label,0,1,2,3,4
Type_movie,1.0,1.0,0.228412,0.843648,0.967552
Drama,0.736994,0.151163,0.398329,0.571661,0.079646
mpaa_rating_TV-14,0.026012,0.0,0.094708,0.568404,0.39823
India,0.034682,0.003876,0.011142,0.495114,0.017699
Comedy,0.095376,0.782946,0.144847,0.390879,0.294985
Hindi,0.031792,0.0,0.011142,0.361564,0.011799
mpaa_rating_TV-MA,0.49422,0.705426,0.548747,0.250814,0.056047
Romance,0.127168,0.120155,0.091922,0.249186,0.056047

