# Clustering Netflix Titles

## Load

In [None]:
%reload_ext nb_black

In [None]:
import pandas as pd
import numpy as np

from scipy.spatial.distance import pdist, squareform

# !pip install pyclustering
from pyclustering.cluster.kmedoids import kmedoids

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


url = "https://raw.githubusercontent.com/AdamSpannbauer/flixable_ml_dsi/master/data/movies_2020_01_23_13_15_04.csv"
movie = pd.read_csv(url)

# Drop rows where Genre is missing
movie = movie.dropna(subset=["Genre"])

# Proceed with sample of rows to make things run faster for class time
movie = movie.sample(2000)

# Subset down to a small feature set
# fmt: off
drop_columns = ['Poster', 'flixable_url', 'Response', 
                'Awards', 'Rated', 'imdbID', 'DVD', 'Website',
                'BoxOffice', 'Released', 'added_to_netflix',
                'Writer', 'Actors', 'Plot',
                'Metascore', 'Production',
                'totalSeasons', 'Runtime', 'Director',
                'Title', 'Ratings', 'Year', 'imdbRating',
                'imdbVotes']
# fmt: on
movie = movie.drop(columns=drop_columns)

In [None]:
movie.head()

## Preprocess

Create a copy of the dataframe to preserve this original structure for cluster analysis later.

In [None]:
og_movie = movie.copy()

Use [`pd.Series.str.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html) to dummy encode `'Genre'`, `'Language'`, and `'Country'`. Each of these columns can hold several values in one string, so pay attention to the `sep` argument.
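As a warm-up, here is a minimal sketch of `str.get_dummies()` on a toy Series. The values and the `", "` separator are assumptions made for illustration; inspect the real columns to confirm how their values are delimited.

```python
import pandas as pd

# Toy stand-in for a multi-valued column such as 'Genre' (values are made up).
toy_genre = pd.Series(["Comedy, Drama", "Drama", "Action, Comedy"])

# sep=", " is an assumption about the delimiter; check the raw data first.
toy_dummies = toy_genre.str.get_dummies(sep=", ")

# One indicator column per distinct value, one row per original record.
print(toy_dummies)
```

Each row can have more than one indicator set, which is what separates this method from plain one-hot encoding.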

In [None]:
genre_dummies = 

In [None]:
language_dummies = 

In [None]:
country_dummies = 

Combine all three dummy dataframes into a single dataframe.
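One way to combine dataframes side by side is `pd.concat` with `axis=1`. The sketch below uses three made-up toy frames rather than the real dummy frames; it assumes the frames share the same index and have no overlapping column names.

```python
import pandas as pd

# Three toy indicator frames sharing the same default index (values are made up).
toy_a = pd.DataFrame({"Comedy": [1, 0], "Drama": [0, 1]})
toy_b = pd.DataFrame({"English": [1, 1]})
toy_c = pd.DataFrame({"USA": [1, 0], "India": [0, 1]})

# axis=1 places the frames next to each other, aligning rows on the index.
toy_combined = pd.concat([toy_a, toy_b, toy_c], axis=1)
print(toy_combined)
```

Because rows are aligned on the index, frames built from the same source dataframe line up without any extra work.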

In [None]:
str_dummies = 
str_dummies.head()

* Drop the original `'Genre'`, `'Language'`, and `'Country'` columns from the `movie` dataframe.
* Add the columns from `str_dummies` to the `movie` dataframe.

In [None]:
movie = movie.drop(columns=["Genre", "Language", "Country"])
movie = 
movie.head()

Use [`pd.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to dummy encode `'Type'` and `'mpaa_rating'`.
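For single-valued categorical columns, `pd.get_dummies()` accepts a `columns` argument naming which columns to encode. A minimal sketch on a toy frame (the category values and the `other` column are made up):

```python
import pandas as pd

# Toy frame with two single-valued categorical columns (values are made up).
toy = pd.DataFrame(
    {
        "Type": ["movie", "series", "movie"],
        "mpaa_rating": ["PG", "R", "PG"],
        "other": [1, 2, 3],
    }
)

# Only the listed columns are encoded; the rest pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["Type", "mpaa_rating"])
print(toy_encoded.columns.tolist())
```

The new columns are prefixed with the original column name, which keeps the source of each indicator visible.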

In [None]:
movie = 
movie.head()

## Calculate distances

* Use `pdist` and `squareform` to calculate the distance between each row
    * What distance metric makes the most sense here?
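To see how `pdist` and `squareform` fit together, here is a sketch on a tiny boolean array. The `"jaccard"` metric is used only as one candidate for binary data, not as the answer to the question above; compare it against other metrics before settling.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three toy rows of binary features (values are made up).
toy_rows = np.array(
    [
        [1, 0, 1],
        [1, 0, 1],
        [0, 1, 0],
    ],
    dtype=bool,
)

# pdist returns the condensed form: one distance per unordered pair of rows.
toy_condensed = pdist(toy_rows, metric="jaccard")

# squareform expands it into a symmetric n x n matrix with a zero diagonal.
toy_square = squareform(toy_condensed)
print(toy_square)
```

Identical rows get a distance of 0, and rows with no shared features get a distance of 1 under this metric.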

In [None]:
dist = 
dist_mat = 
dist_mat.shape

## Cluster with K-medoids

We need to initialize the starting 'medoids' for our clusters.  To do this, `pyclustering` wants us to provide the indices of our starting points.

* Generate `k` random indices from our distance matrix
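A sketch of drawing distinct random row positions with NumPy follows. The names `toy_n_rows` and `toy_k` and the seed are illustrative; sampling without replacement is an assumption made so that no two starting medoids coincide.

```python
import numpy as np

toy_n_rows = 10  # stand-in for the number of rows in the distance matrix
toy_k = 3        # stand-in for the number of clusters

# A seeded generator makes the draw repeatable (the seed value is arbitrary).
rng = np.random.default_rng(42)

# replace=False guarantees distinct positions; tolist() gives plain Python ints.
toy_init = rng.choice(toy_n_rows, size=toy_k, replace=False).tolist()
print(toy_init)
```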

In [None]:
k = 5

In [None]:
init_medoids = 

In [None]:
kmed = kmedoids(
    dist_mat, initial_index_medoids=init_medoids, data_type="distance_matrix"
)

kmed.process()

Use the `.get_medoids()` method to find the index for each cluster center.

In [None]:
medoid_idxs = kmed.get_medoids()

Use the `.predict()` method to output the cluster label for each record in a dataset.
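For intuition about what a cluster label means here, nearest-medoid assignment can be sketched directly with NumPy. This is an illustration of the idea on a made-up distance matrix, not a description of how `pyclustering` computes its output.

```python
import numpy as np

# Toy symmetric distance matrix for four records (values are made up).
toy_dist = np.array(
    [
        [0.0, 0.2, 0.9, 0.8],
        [0.2, 0.0, 0.7, 0.9],
        [0.9, 0.7, 0.0, 0.1],
        [0.8, 0.9, 0.1, 0.0],
    ]
)

# Positions of the toy medoids within the matrix.
toy_medoids = [0, 3]

# Keep only the medoid columns, then pick the closest medoid for each row.
toy_labels = toy_dist[:, toy_medoids].argmin(axis=1)
print(toy_labels)
```

Each label is the position of the nearest medoid within `toy_medoids`, so label `0` refers to the first medoid in the list.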

In [None]:
labels = 

Put these labels into both the `og_movie` and `movie` dataframes.

## Explore Clusters

Use the `medoid_idxs` to pull out our cluster centers from `og_movie`.
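Medoid indices are row positions in the distance matrix, while a sampled dataframe keeps its original, shuffled index labels. The sketch below, on a toy frame with made-up labels, shows why positional selection with `.iloc` is the safer choice under that assumption.

```python
import pandas as pd

# Toy frame whose index labels are not 0..n-1, as happens after sampling rows.
toy_df = pd.DataFrame({"name": ["a", "b", "c", "d"]}, index=[17, 3, 42, 8])

# Toy medoid positions (row positions, not index labels).
toy_medoid_idxs = [0, 3]

# .iloc selects by position; .loc would look for index labels 0 and 3 instead.
toy_centers = toy_df.iloc[toy_medoid_idxs]
print(toy_centers)
```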

What is the most frequently occurring country in cluster `0`?

In cluster `1`, are there more movies or series?

What is the most frequently occurring `mpaa_rating` in cluster `2`?
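Questions like these come down to filtering on the cluster label and counting values. A sketch on a toy frame follows; the `label` column name, the values, and the `", "` delimiter are assumptions. Multi-valued strings are split and exploded first so each value is counted on its own.

```python
import pandas as pd

# Toy frame with a multi-valued column and a cluster label (values are made up).
toy = pd.DataFrame(
    {
        "Country": ["USA, Canada", "USA", "India", "India, USA"],
        "label": [0, 0, 1, 0],
    }
)

# Keep only the rows assigned to cluster 0.
toy_cluster = toy[toy["label"] == 0]

# Split each string into a list, give each value its own row, then count.
toy_counts = toy_cluster["Country"].str.split(", ").explode().value_counts()

# idxmax returns the value with the highest count.
toy_top = toy_counts.idxmax()
print(toy_top)
```

For a single-valued column such as `mpaa_rating`, `value_counts()` on the filtered column is enough and the split and explode steps can be skipped.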

What else should we explore to accurately describe these clusters?

Give each of these clusters a Tinder bio.