# Netflix Recommendations



## Control Flow

1. Load the data into a dataframe.
2. Remove outlier movies and users.
3. Generate a training set (90% of users) and a test set (10% of users).
4. Cluster the training set according to a clustering algorithm.
5. Define the center of the cluster as the ranked list of movies.
6. Test phase for each clustering algorithm:
    i. For each user in the test set:
        I. Get their top 3 favorite movies.
        II. Assign them to a most likely cluster.
        III. Give them the ranked list of movies (defined by the cluster center in part 5) as recommendations.
    ii. Calculate precision, recall, and RMSE for the clustering method.
7. Make data visualizations for the clusters and performance.
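
As a rough illustration of step 6.ii, the sketch below computes precision, recall, and RMSE for a single user on toy data. The helper names (`precision_recall_at_k`, `rmse`), the cutoff `k`, and the idea that "relevant" means movies the user rated highly are assumptions made for this example; the notebook does not specify how these metrics are defined.

```python
import numpy as np

def precision_recall_at_k(recommended, relevant, k=10):
    """Precision and recall of the first k recommendations against a set of relevant movies."""
    top_k = list(recommended)[:k]
    hits = sum(1 for movie in top_k if movie in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Toy data (hypothetical): a ranked recommendation list and the movies the user actually liked.
recommended = [30, 28, 8, 4472, 16]
relevant = {30, 8, 4490}  # assumption: movies the user rated highly
precision, recall = precision_recall_at_k(recommended, relevant, k=5)

# Toy predicted vs. actual ratings for three movies.
error = rmse([3.5, 4.0, 2.0], [3, 5, 2])
print(precision, recall, error)
```

Averaging these per-user values over the whole test set would give one score per clustering method.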

### Importing libraries

In [41]:
import time
import pandas as pd
import numpy as np
import os
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

### Control Flow Steps 1-2
This step writes a Parquet file, `./Data/removedoutliersdf.parquet.gzip`, containing the dataframe of all ratings with outlier movies and users removed.

To keep the run time manageable, we use only the ratings in `./Data/combined_data_1.txt`.

In [42]:
# Loading the ratings data and filtering out the outliers
if not os.path.isfile('./Data/removedoutliersdf.parquet.gzip'):
    # import the rating data as pandas dataframe
    df = pd.read_csv('./Data/combined_data_1.txt', header=None, names=['UserId', 'Rating'], usecols=[0, 1])

    df['Rating'] = df['Rating'].astype(float)  # Rating is temporarily a float

    df.index = np.arange(0, len(df))  # reindex the ratings

    # Adding the MovieId to the data frame
    df_nan = pd.DataFrame(pd.isnull(df.Rating))
    df_nan = df_nan[df_nan['Rating'] == True]
    df_nan = df_nan.reset_index()

    movie_np = []
    movie_id = 1

    for i, j in zip(df_nan['index'][1:], df_nan['index'][:-1]):
        # numpy approach
        temp = np.full((1, i-j-1), movie_id)
        movie_np = np.append(movie_np, temp)
        movie_id += 1

    last_record = np.full((1, len(df) - df_nan.iloc[-1, 0] - 1), movie_id)
    movie_np = np.append(movie_np, last_record)

    df = df[pd.notnull(df['Rating'])].copy()  # copy so the column assignments below do not act on a view

    df['MovieId'] = movie_np.astype(int)
    df['UserId'] = df['UserId'].astype(int)


    # Removing unpopular movies and users with too few reviews:
    # drop movies and users whose rating counts fall below the 0.7 quantile
    f = ['count', 'mean']

    df_movie_summary = df.groupby('MovieId')['Rating'].agg(f)
    df_movie_summary.index = df_movie_summary.index.map(int)
    movie_benchmark = round(df_movie_summary['count'].quantile(0.7), 0)
    drop_movie_list = df_movie_summary[df_movie_summary['count'] < movie_benchmark].index

    print('Movie minimum times of review: {}'.format(movie_benchmark))

    df_cust_summary = df.groupby('UserId')['Rating'].agg(f)
    df_cust_summary.index = df_cust_summary.index.map(int)
    cust_benchmark = round(df_cust_summary['count'].quantile(0.7), 0)
    drop_cust_list = df_cust_summary[df_cust_summary['count'] < cust_benchmark].index

    print('Customer minimum times of review: {}'.format(cust_benchmark))

    print('Original Shape: {}'.format(df.shape))
    df = df[~df['MovieId'].isin(drop_movie_list)]
    df = df[~df['UserId'].isin(drop_cust_list)].copy()  # copy before changing the Rating dtype

    df['Rating'] = df['Rating'].astype(int)

    print('After Trim Shape: {}'.format(df.shape))

    print(df.describe())

    df.to_parquet('./Data/removedoutliersdf.parquet.gzip', compression='gzip')
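
The loop in the cell above assigns a MovieId to every rating row by counting the rows between consecutive header lines. A vectorized alternative can be sketched with a cumulative sum over the header mask. The toy input below is made up for illustration, and the sketch keeps the same assumption as the loop: header lines appear in order and number the movies 1, 2, 3, and so on.

```python
import io
import pandas as pd

# Toy stand-in for the ratings file: "<MovieId>:" header lines followed by "UserId,Rating,Date" rows.
raw = io.StringIO(
    "1:\n"
    "101,3,2005-01-01\n"
    "102,5,2005-01-02\n"
    "2:\n"
    "103,4,2005-01-03\n"
)
toy = pd.read_csv(raw, header=None, names=['UserId', 'Rating'], usecols=[0, 1])

# Header lines have no rating, so they show up as NaN in the Rating column.
is_header = toy['Rating'].isna()

# Each header starts a new movie; the running count of headers is the MovieId of the rows below it.
toy['MovieId'] = is_header.cumsum()

# Drop the header rows and convert the remaining columns to integers.
toy = toy.loc[~is_header].copy()
toy['UserId'] = toy['UserId'].astype(int)
toy['Rating'] = toy['Rating'].astype(int)
print(toy)
```

This avoids growing an array with `np.append` inside a loop, which copies the array on every call.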


Next, we collect the titles of the movies that passed the outlier filter above.
The result is a Parquet file, `./Data/filteredmovietitlesdf.parquet.gzip`, from which the dataframe can be regenerated later if needed.

In [43]:
# Get a list of movie titles that passed the above filter
if not os.path.isfile('./Data/filteredmovietitlesdf.parquet.gzip'):
    df = pd.read_parquet('./Data/removedoutliersdf.parquet.gzip')
    df_title = pd.read_csv('./Data/movie_titles.csv', encoding="ISO-8859-1", header=None, usecols=[0, 2],
                           names=['MovieId', 'Name'])
    # df_title.set_index('MovieId', inplace=True)
    df_title = pd.merge(df, df_title, how='inner', on='MovieId').drop_duplicates(subset=['MovieId'])[['MovieId', 'Name']]
    df_title.to_parquet('./Data/filteredmovietitlesdf.parquet.gzip', compression='gzip')

### Generating the training set
The result is a parquet file titled `./Data/trainingusersdf.parquet.gzip`.
This should be statistically similar to the removedoutliers dataset.

In [44]:
# Make a training set of users
if not os.path.isfile('./Data/trainingusersdf.parquet.gzip'):
    df = pd.read_parquet('./Data/removedoutliersdf.parquet.gzip')
    df2 = df.loc[df['UserId'] % 10 != 0]
    print("Original data set statistics:")
    print(df.describe())
    print("Training data set statistics:")
    print(df2.describe())
    df2.to_parquet('./Data/trainingusersdf.parquet.gzip', compression='gzip')

### Generating the test set
The result is a parquet file titled `./Data/testusersdf.parquet.gzip`.
This should be statistically similar to the removedoutliers dataset.

In [45]:
# Make a test set of users
if not os.path.isfile('./Data/testusersdf.parquet.gzip'):
    df = pd.read_parquet('./Data/removedoutliersdf.parquet.gzip')
    df2 = df.loc[df['UserId'] % 10 == 0]
    print("Original data set statistics:")
    print(df.describe())
    print("Test data set statistics:")
    print(df2.describe())
    df2.to_parquet('./Data/testusersdf.parquet.gzip', compression='gzip')
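
The two cells above split users by the last digit of their `UserId`, so each user lands entirely in one set. A minimal sketch on made-up ratings shows how one might check that the split is disjoint and what share of users ends up in the test set; the toy dataframe and variable names are illustrative only.

```python
import pandas as pd

# Toy ratings table (hypothetical values) with the same columns as the filtered dataframe.
ratings = pd.DataFrame({
    'UserId': [10, 10, 11, 12, 20, 23, 23, 37],
    'MovieId': [3, 8, 3, 16, 8, 3, 16, 8],
    'Rating': [4, 5, 3, 2, 5, 1, 4, 3],
})

# Users whose id is divisible by 10 form the test set; everyone else is training data.
is_test = ratings['UserId'] % 10 == 0
train, test = ratings[~is_test], ratings[is_test]

train_users = set(train['UserId'])
test_users = set(test['UserId'])

# No user should appear in both sets, and no rating should be lost.
assert train_users.isdisjoint(test_users)
assert len(train) + len(test) == len(ratings)

test_share = len(test_users) / ratings['UserId'].nunique()
print('Share of users in the test set: {:.2f}'.format(test_share))
```

On real data the share is close to 10% only if user ids are spread roughly evenly across last digits, which is worth confirming with a check like this.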


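Next, we pivot the training data into a user-item matrix with one row per user and one column per movie. Movies a user has not rated are filled with 0.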

In [46]:
# Pivot the data frame into a user-item matrix
df = pd.read_parquet('./Data/trainingusersdf.parquet.gzip')
df = pd.pivot_table(df, values='Rating', index='UserId', columns='MovieId', fill_value=0)
df.head(10)

| UserId \ MovieId | 3 | 8 | 16 | 17 | 18 | 26 | 28 | 30 | 32 | 33 | ... | 4472 | 4474 | 4478 | 4479 | 4485 | 4488 | 4490 | 4492 | 4493 | 4496 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 5 | 0 | 0 | 0 | 0 | 4 | 5 | 0 | 0 | ... | 3 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 79 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
| 97 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 134 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 169 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 183 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
| 188 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 0 |
| 195 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 199 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Clustering

### kmeans

Below, we use the elbow method to find a suitable number of clusters for kmeans. The search takes about 2-3 hours to run and produces the following figure:
![kmeansclusters](Figure_1.png)

The important DataFrames in this section are:

`dfkmeansclustercenters` - This holds all of the cluster centers after kmeans. Each center also serves as a ranked list of movies for its cluster, which is the recommendation list we give to new users in the test data when they fall close to that cluster center.

`dfkmeanslabels` - This is a table associating a UserId with a cluster number after kmeans.

`dfkmeans` - This is the user-item DataFrame with an extra column associating each UserId with a cluster number.

In [47]:
# # Finding the best number of clusters for kmeans. This takes about 2-3hrs.
# model = KMeans()
# # k is range of number of clusters.
# visualizer = KElbowVisualizer(model, k=[(5 * i) + 2 for i in range(20)], timings=True)
# visualizer.fit(df)        # Fit data to visualizer
# visualizer.show()        # Finalize and render figure
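
For readers who want to see the idea behind the elbow search without the long run time, here is a small sketch on synthetic data. It fits scikit-learn's `KMeans` for several values of k and records `inertia_` (the within-cluster sum of squared distances); the "elbow" is the k after which the score stops dropping sharply. The blob data, the range of k, and the use of inertia as the score are assumptions for this example, not the notebook's exact setup.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the user-item matrix: 300 points drawn around 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit k-means for each candidate k and record the within-cluster sum of squares.
ks = list(range(2, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print('k={}: inertia={:.1f}'.format(k, inertia))
```

Plotting `inertias` against `ks` gives the same kind of curve as the figure above, just on toy data.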

In [48]:
# cluster df using kmeans
time_start = time.time()
kmeans = KMeans(n_clusters=22).fit(df)
print('Clustering with k-means took {} seconds'.format(time.time()-time_start))

Clustering with k-means took 299.63149785995483 seconds


In [49]:
# cluster centers after kmeans
# each cluster is a community of users who like the same movies. The center is our ranked list of movies for the cluster.
dfkmeansclustercenters = pd.DataFrame(kmeans.cluster_centers_, columns=df.columns)
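
To make "the center is our ranked list" concrete, the sketch below sorts one row of a cluster-center table from highest to lowest value and returns the MovieIds in that order. The helper `ranked_movies` and the toy center values are hypothetical; only the layout (one row per cluster, one column per MovieId) mirrors `dfkmeansclustercenters`.

```python
import pandas as pd

# Toy cluster centers (made-up values): rows are clusters, columns are MovieIds.
centers = pd.DataFrame(
    [[0.2, 3.9, 1.1],
     [4.5, 0.3, 2.0]],
    columns=pd.Index([3, 8, 16], name='MovieId'),
)

def ranked_movies(centers, cluster_number, top_n=None):
    """Return the MovieIds of one cluster, ordered from highest to lowest center value."""
    ranked = centers.loc[cluster_number].sort_values(ascending=False)
    return ranked.index.tolist()[:top_n]

print(ranked_movies(centers, 0))
print(ranked_movies(centers, 1, top_n=2))
```

Because unrated movies were filled with 0 before clustering, a high center value reflects both how many users in the cluster rated a movie and how well they rated it.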

In [50]:
# The cluster label given to a UserId
dfkmeanslabels = pd.DataFrame(kmeans.labels_)

In [51]:
# Reindexing the labels to be UserIds and renaming the column to "cluster_number"
dfkmeanslabels.index = df.index
dfkmeanslabels.columns = ["cluster_number"]

In [52]:
# dfkmeans is the user-item matrix with an extra column labeling which cluster the UserId belongs to after kmeans.
# The index holds UserIds; the columns are MovieIds plus cluster_number
dfkmeans = df.join(dfkmeanslabels)
dfkmeans.head(10)

| UserId | 3 | 8 | 16 | 17 | 18 | 26 | 28 | 30 | 32 | 33 | ... | 4474 | 4478 | 4479 | 4485 | 4488 | 4490 | 4492 | 4493 | 4496 | cluster_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 7 | 0 | 5 | 0 | 0 | 0 | 0 | 4 | 5 | 0 | 0 | ... | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 79 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 |
| 97 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| 134 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 |
| 169 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19 |
| 183 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 17 |
| 188 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 12 |
| 195 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 199 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
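
With cluster labels in place, a test user can be matched to a cluster. The sketch below shows one possible way to do it, under stated assumptions: keep only the user's three highest-rated movies, place them in a vector aligned with the training columns (0 elsewhere), and let the fitted model's `predict` pick the nearest center. The helper `assign_cluster`, the toy matrix, and the choice of `predict` as the "most likely cluster" rule are illustrative; the notebook does not show how this assignment is done.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy user-item matrix (made-up ratings): two groups of users with opposite tastes.
movie_ids = pd.Index([3, 8, 16, 17], name='MovieId')
train = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 5, 4],
     [0, 0, 4, 5]],
    index=pd.Index([1, 2, 3, 4], name='UserId'),
    columns=movie_ids,
)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train.to_numpy())

def assign_cluster(user_ratings, model, columns, n_favorites=3):
    """Assign a user to the nearest cluster using only their top-rated movies."""
    favorites = user_ratings.nlargest(n_favorites)
    # Align with the training columns; movies outside the favorites count as unrated (0).
    vector = favorites.reindex(columns, fill_value=0).to_numpy(dtype=float)
    return int(model.predict(vector.reshape(1, -1))[0])

# A new user (hypothetical) who likes movies 16 and 17.
new_user = pd.Series({16: 5, 17: 4, 3: 1})
cluster = assign_cluster(new_user, model, movie_ids)
print('Assigned cluster:', cluster)
```

The new user ends up in the same cluster as training users 3 and 4, whose ratings concentrate on the same two movies.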
