# **Netflix User Recommendations**
How do we suggest movies to a given user based on their watch and like history, as well as other things we might know about these movies? I am using this kernel to explore my thoughts. I'm still learning, so I'm drawing on material from other sources and stitching it together in a way that, I hope, leaves me confident I understand how to build a recommendation engine.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import dask.dataframe as dd # lazy, parallel load and subset, which is useful for sampling a large dataset
from dask.distributed import Client

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", context='paper')
sns.set(rc={'figure.figsize':(18,5)})

from functools import wraps # a ditty decorator

from sklearn.metrics.pairwise import cosine_similarity

import random # random integers
import os # i/o read files
import time # time my work
import gc  # clear ram
print(os.listdir("../input"))
print(os.listdir())

## Data Load & Processing
Each combined file of user ratings runs to some 26 million lines and **needs to be wrangled** to get it into tabular format. It will be useful to write one large CSV to disk for us to **sample** from. Sampling from the combined data also removes any bias that might come from loading just the first file into memory. I'll time each step to keep an eye on how efficient it is. This part of the kernel takes <10 minutes to run. Have a coffee, come back for some fun.
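
To make the wrangling concrete, here is a minimal sketch of the raw layout as I understand it: a line such as `1:` opens a movie block, and each following line holds a customer ID, a rating and a date. The sample rows are invented for illustration, and `parse_ratings` is a hypothetical helper, not the function used in this kernel.

```python
import io
import pandas as pd

# Made-up rows that mimic the layout of combined_data_*.txt
sample_text = """1:
101,3,2005-09-06
102,5,2005-05-13
2:
101,4,2005-10-19
"""

def parse_ratings(lines):
    """Turn 'movie_id:' headers plus 'cust_id,rating,date' rows into a DataFrame."""
    rows, movie_id = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(':'):
            # header line: remember which movie the next rows belong to
            movie_id = int(line[:-1])
        else:
            cust_id, rating, date = line.split(',')
            rows.append((int(cust_id), int(rating), date, movie_id))
    return pd.DataFrame(rows, columns=['cust_id', 'rating', 'date', 'movie_id'])

toy = parse_ratings(io.StringIO(sample_text))
print(toy)
```

Collecting plain tuples and building the dataframe once at the end is usually cheaper than growing a dataframe row by row.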

In [None]:
def time_this(func): 
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print("{} takes {} minutes!".format(func.__name__, round((end-start)/60, 2)))
        gc.collect()
        return result 
    return wrapper

In [None]:
wd = '../input/netflix-prize-data/' # working dir
data_files = [wd + 'combined_data_{}.txt'.format(i) for i in range(1,5)]
print('Data files to be combined and pre-processed: ')
print(data_files)

In [None]:
get_movie_id = lambda line: int(line.replace(':', '').replace('\n', ''))
get_rating = lambda line: [x.replace('\n', '') for x in line.split(',')]
df_it = lambda row: pd.DataFrame(row, index=['cust_id', 'rating', 'date']).T

def append_to_csv(data, fp):
    print('Writing {} rows to {}'.format(data.shape[0], fp))
    if os.path.exists(fp):
        data.to_csv(fp, mode='a', index=False, header=False)
    else:
        data.to_csv(fp, index=False)

@time_this
def get_ratings(fp):
    """
    Parse the text files that have movie id and customer ratings into a usable dataframe.
    @fp: file path (str)
    """
    print('Getting ratings from file: {}'.format(fp))
    agg_data = []
    with open(fp, 'r') as file_:
        for line_number, line in enumerate(file_):
            if (line_number % 10**7) == 0: print('{} million rows..'.format(line_number / (10**6)))
            if ':' in line:
                movie_id = get_movie_id(line)                      
            else:
                rating_row = get_rating(line)
                rating = {str(col): val for col,val in enumerate(rating_row)}
                rating['movie_id'] = movie_id
                agg_data.append(rating)            
    agg_data = pd.DataFrame(agg_data)
    agg_data.rename(columns={'0': 'cust_id', '1': 'rating', '2': 'date'}, inplace=True)
    print('Finished getting ratings from file: {}'.format(fp))
    return agg_data

In [None]:
# Only the first file is processed here; to build the full CSV, loop with: for fp in data_files:
df = get_ratings(data_files[0])
append_to_csv(df, 'ratings.csv')
del df
gc.collect()

In [None]:
os.listdir()

**Dask** dataframes use a **lazy, parallel read**, which lets me interact with the massive CSV I've created and filter it. Since **Dask's API is more limited than Pandas'**, I will only use it to create the **random sample**; then I'll convert to pandas and work from there. I select customers by taking their ID modulo a randomly drawn integer `n` and keeping those with a remainder of zero, which retains roughly one customer in every `n`. I expect this to yield on the order of 26k customers, though the exact count depends on the integer drawn.

I'd like to **sample by customer so that I can retain the entirety of that customer's history of ratings**. When predicting what a customer will like, I would like to have their history available for information.
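
As a small illustration of why filtering on the customer ID keeps whole histories, here is a sketch on a toy table. The IDs and ratings are made up, and `n` is fixed rather than random so the result is reproducible.

```python
import pandas as pd

# Toy ratings table; customer IDs and ratings are made up
toy_ratings = pd.DataFrame({
    'cust_id':  [6, 6, 6, 7, 7, 12, 12, 13],
    'movie_id': [1, 2, 3, 1, 3, 2, 3, 1],
    'rating':   [4, 5, 3, 2, 4, 5, 1, 3],
})

n = 6  # fixed here for reproducibility; the kernel draws it at random
sampled = toy_ratings[toy_ratings.cust_id % n == 0]

# Each sampled customer keeps exactly as many rows as in the full table
full_counts = toy_ratings.groupby('cust_id').size()
sample_counts = sampled.groupby('cust_id').size()
print(sample_counts)
```

Because the filter depends only on `cust_id`, a customer is either kept with every rating or dropped entirely, which a row-level random sample would not guarantee.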

In [None]:
data = dd.read_csv('ratings.csv')
print(data.shape)
print(data.head())

In [None]:
n = random.randint(2, 30)   # start at 2: n=0 would be a modulo by zero and n=1 would keep every customer
data = data[data.cust_id % n == 0]
client = Client()   # initialize the cluster
data = client.persist(data)
data = data.compute()
data['date'] = pd.to_datetime(data['date'])
data['year'] = data['date'].dt.year
print(data.shape)

I'll also be reading in the movies dataset provided to us for additional information. I want to take a quick look at this dataset before we continue. Most of the movies in Netflix's catalog seem to have been **released in the last 25 years.** I will also merge the movies metadata CSV onto this dataset, matching on exact title and release year, so that I can get some additional information for use later. A couple of notes about this movies dataset:
* There are 17,700 movies in the Netflix dataset.
* Only 6,300 of those movies have a match in the metadata file. Since unmatched movies carry no extra information, I'll subset down to the movies that do have metadata.
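
The matching step can be sketched on toy data as a left merge with `indicator=True`, which labels each row by where it was found. The titles and budgets below are invented; the column names follow the ones used in this kernel.

```python
import pandas as pd

# Made-up stand-ins for movie_titles and movie_meta
titles = pd.DataFrame({'movie_id': [1, 2, 3],
                       'name': ['Alpha', 'Beta', 'Gamma'],
                       'year': [1999, 2001, 2003]})
meta = pd.DataFrame({'original_title': ['Alpha', 'Gamma', 'Delta'],
                     'meta_year': [1999, 2003, 2004],
                     'budget': [10, 30, 40]})

# validate='1:1' raises if either side has duplicate keys
merged = titles.merge(meta, how='left',
                      left_on=['name', 'year'],
                      right_on=['original_title', 'meta_year'],
                      validate='1:1', indicator=True)

# '_merge' is 'both' for matched rows and 'left_only' for unmatched ones
matched = merged.query("_merge == 'both'")
print(merged[['name', 'year', '_merge']])
```

An exact match on title and year is strict: a title spelled slightly differently in the two files will not match, which may explain part of the low match rate.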


In [None]:
movie_titles = pd.read_csv(wd + 'movie_titles.csv',
                           encoding = 'ISO-8859-1', # some weird encoding issue
                           header = None, names = ['movie_id', 'year', 'name'])
movie_titles.drop_duplicates(subset=['name', 'year'], inplace=True)
movie_meta = pd.read_csv('../input/the-movies-dataset/movies_metadata.csv', 
                         low_memory=False,
                         usecols=['adult', 'budget', 'original_title',
                                 'release_date', 'popularity', 'revenue', 'genres'])
movie_meta.dropna(how='any', inplace=True)
movie_meta['release_date'] = pd.to_datetime(movie_meta['release_date'])
movie_meta['meta_year'] = movie_meta['release_date'].dt.year
movie_meta.drop_duplicates(subset=['original_title', 'meta_year'], inplace=True)
movie_titles.sample(5)

In [None]:
movie_meta = movie_titles.merge(movie_meta, how='left', left_on=['name', 'year'], right_on=['original_title', 'meta_year'],
                  validate='1:1', indicator=True).sort_values('_merge', ascending=False).query("_merge == 'both'")

In [None]:
sns.set(rc={'figure.figsize':(15,4)})
sns.set_style('whitegrid')
grouped = movie_titles.groupby('year', as_index=False)['movie_id'].count()
ax = sns.lineplot(x='year', y='movie_id', data=grouped, color='red')
ax.set_title('Movies by Release Date')
ax.set_xlabel('Year')
ax.set_ylabel('Movie Count')

In [None]:
data = data.merge(movie_titles, how='left', on='movie_id',validate='m:1', suffixes=['', '_of_release'])
data.sample(5)

In [None]:
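agg = {
    'cust_id': 'nunique',
    'rating': ['min', 'max', 'mean', 'count']   # a list keeps the column order deterministic (a set does not)
}
ratings = data.groupby(['name', 'year_of_release'], as_index=False).agg(agg)
ratings.columns = ['_'.join(col).strip('_') for col in ratings.columns.values]
# the flattened name of the nunique column depends on the pandas version, so handle both spellings
ratings.rename(columns={'cust_id_cust_id': 'unique_customers',
                        'cust_id_nunique': 'unique_customers'}, inplace=True)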
ratings.sample(5)

### Netflix's Movie Selections
After merging the movies dataset into the ratings dataset, we can aggregate to get some **statistics by movie**. Movies **released before the 1980s** tend to have **higher average ratings; for more recent release years, ratings become noisier**.

In [None]:
ax = ratings.query('rating_count > 300').plot(kind='scatter', x='year_of_release', y='rating_mean', 
                                              color='red', s=4)
ax.set_ylabel('Average Rating')
ax.set_xlabel('Year of Movie Release')
plt.annotate('* Subset to Movies with >300 Ratings', (.75,0), (0,-40), xycoords='axes fraction', 
             textcoords='offset points', va='top', fontsize=10)
ax.set_title('Movie Ratings by Year of Release')

In [None]:
top = ratings.query('rating_count > 300').sort_values('rating_mean').tail(10)[['name', 'rating_mean']]
bottom = ratings.query('rating_count > 300').sort_values('rating_mean').head(10)[['name', 'rating_mean']]
fig, axes = plt.subplots(2, figsize=(15,7), sharex=True)
ax1 = plt.subplot(2, 1, 1)
top.set_index('name').plot(kind='barh', color='red', ax=ax1, legend=False)
ax1.set_ylabel('')
ax1.set_xlim(0,5)
ax1.xaxis.set_visible(False)
ax2 = plt.subplot(2, 1, 2)
bottom.set_index('name').plot(kind='barh', color='blue', ax=ax2, legend=False, position=1)
ax2.set_xlabel('User Rating')
ax2.set_ylabel('')
ax2.set_xlim(0,5)
plt.annotate('* Subset to Movies with >300 Ratings', (.75,0), (0,-40), xycoords='axes fraction', 
             textcoords='offset points', va='top', fontsize=10)
fig.suptitle("Highest and Lowest Rated Movies", fontsize=16)

In [None]:
fig, (ax, ax2) = plt.subplots(1, 2, figsize=(20,7))   # one row, two side-by-side panels
fig.suptitle("User Ratings & Volume Over Time", fontsize=16)
grouped = data.groupby('year', as_index=False).agg(
    {'rating': 'mean', 'name': 'nunique', 'cust_id': 'count'})\
.rename(columns={'name': 'number of unique movies', 'cust_id': 'volume of reviews'})
# left panel: mean rating, with review volume on a secondary axis
grouped.plot(x='year', y=['rating','volume of reviews'], 
             secondary_y=['volume of reviews'], color=['red', 'blue'], ax=ax, grid=True)
ax.set_xlabel('')
# right panel: number of distinct movies rated per year
grouped.plot(x='year', y='number of unique movies', ax=ax2, color='red')
ax2.set_xlabel('')

### Recommendation "Engines"
The first thing we need to do is create a user-movie matrix. Each row in this matrix represents a customer and each column represents one of the 17,700 movies in this dataset. We should expect to see a lot of nulls (a sparse matrix) here, because no one is likely to have rated more than a small fraction of the 17,700. The matrix will be useful in the following ways:
* We can try to find similar users (rows) to segment the population.
* We can use this matrix to identify movies a user has not seen, and [fill in potential ratings](https://www.kaggle.com/morrisb/how-to-recommend-anything-deep-recommender/notebook).
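
A toy version of the user-movie matrix shows both the sparsity and how unseen movies fall out of it. The ratings below are made up; `toy_mat` is a stand-in for the real matrix.

```python
import pandas as pd

# Made-up long-format ratings: one row per (customer, movie) pair
toy_ratings = pd.DataFrame({
    'cust_id':  [1, 1, 2, 2, 3],
    'movie_id': [10, 20, 10, 30, 20],
    'rating':   [5, 3, 4, 2, 1],
})

# Rows are customers, columns are movies, missing ratings become NaN
toy_mat = toy_ratings.pivot_table(index='cust_id', columns='movie_id', values='rating')

sparsity = toy_mat.isna().to_numpy().mean()                   # share of empty cells
unseen_by_1 = toy_mat.columns[toy_mat.loc[1].isna()].tolist()  # movies customer 1 has not rated

print(toy_mat)
print('Share of empty cells:', round(sparsity, 2))
print('Movies customer 1 has not rated:', unseen_by_1)
```

The NaN cells are exactly the places where a recommender has to supply a predicted rating.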

In [None]:
min_movie = 1000   # movie has to have been rated more than 1000 times
min_user = 200   # user has to have rated more than 200 movies
users = data.groupby('cust_id')['rating'].count()
users = users.loc[users > min_user].index.values
movies = data.groupby('movie_id')['rating'].count()
movies = movies.loc[movies > min_movie].index.values
filtered = data.loc[data.cust_id.isin(users) & data.movie_id.isin(movies)]
print('Unfiltered: ', data.shape[0])
print('Filtered: ', filtered.shape[0])
print('Kept {}% of data'.format(round(100 * filtered.shape[0] / data.shape[0])))   # round after scaling to avoid float noise

In [None]:
filtered.sample(5)

In [None]:
mat = filtered.pivot_table(index='cust_id', columns='movie_id', values='rating')
print('The User-Movie Matrix')
mat.sample(10)

I came across a **good guide to recommendation engines** [here](https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/) and I'll quote the excerpts that helped me understand what to do. I relied on it to understand the differences between recommendation engines and the math behind them, so I would suggest reading it before looking at how I acted on that information.
>  Consider the example of Netflix. They save all the information related to each user in a vector form. This vector contains the past behavior of the user, i.e. the movies liked/disliked by the user and the ratings given by them. This vector is known as the profile vector. All the information related to movies is stored in another vector called the item vector. Item vector contains the details of each movie, like genre, cast, director, etc.

This is like the matrix we just created. Each row is the behavior of a user (the profile vector) and each column is a movie's item vector (which at this point contains just the ratings given by users).

>The content-based filtering algorithm finds the cosine of the angle between the profile vector and item vector, i.e. cosine similarity. Suppose A is the profile vector and B is the item vector, then the similarity between them can be calculated. Based on the cosine value, which ranges between -1 to 1, the movies are arranged in descending order and one of the two below approaches is used for recommendations:
> * Top-n approach: where the top n movies are recommended (Here n can be decided by the business)
> * Rating scale approach: Where a threshold is set and all the movies above that threshold are recommended
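
The quoted idea can be sketched in a few lines. Everything here is hypothetical: the three-feature space, the profile vector and the item vectors are invented, and the 0.9 cutoff is an arbitrary choice for the example.

```python
import numpy as np

# Hypothetical feature space: [action, comedy, drama]
profile = np.array([0.9, 0.1, 0.4])          # made-up taste vector for one user
items = {
    'Movie A': np.array([1.0, 0.0, 0.2]),
    'Movie B': np.array([0.0, 1.0, 0.1]),
    'Movie C': np.array([0.5, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = {name: cosine(profile, vec) for name, vec in items.items()}
ranked = sorted(scores, key=scores.get, reverse=True)   # descending similarity

top_n = ranked[:2]                                       # top-n approach with n = 2
above_threshold = [m for m in ranked if scores[m] > 0.9] # rating scale (threshold) approach
print(ranked, top_n, above_threshold)
```

The two approaches can return different lists: top-n always yields n movies, while the threshold may yield none or many.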

In the following sections I will create recommendations using the following methodologies:
1. Top N Movies across all users
2. User-User Similarity
3. Movie-Movie Similarity

#### **Top N Movies**
The simplest form of recommendation here would be the **top N movies**. This recommendation is made to all users regardless of their preferences, i.e. *"These are the top rated movies on Netflix right now."*

This is fairly straightforward to do: we compute the mean rating of each movie (in a separate lookup) and impute it onto this matrix wherever a user has not rated the movie. The final output would be a matrix of shape (Ncustomers, 5), where the five columns hold the top 5 ranked movies each customer has not seen yet.

In [None]:
means = filtered.groupby('movie_id')['rating'].mean().to_dict()   # get a lookup table of movie to mean rating
topNrecs = mat.copy(deep=True)
for col in topNrecs:                    # for each movie
    already_rated = topNrecs[col].notnull()    # make note of which users have already rated it
    topNrecs[col] = topNrecs[col].fillna(means[col])    # assign back; in-place fillna on a column selection may not update the frame
    topNrecs.loc[already_rated, col] = np.nan       # remove the information we already have
print('Average User Rating Imputed onto Users Matrix')
topNrecs.sample(10)

In [None]:
recommendations = topNrecs.stack()\
.reset_index()\
.rename(columns={0: 'imputed_rating'})\
.groupby('cust_id')\
.apply(lambda x: x.nlargest(5, columns='imputed_rating'))\
.reset_index(drop=True)\
.sort_values(by=['cust_id', 'imputed_rating'], ascending=[True, False])\
.merge(movie_titles, how='left', on='movie_id', validate='m:1')\
.rename(columns={'name': 'recommended_movie_name',
                 'year': 'year_of_release'})
print('For each user, pick the top 5 movies that they have not seen (to be used to merge in).')
recommendations.head(10)

In [None]:
agg_rec = recommendations.groupby(['recommended_movie_name', 'year_of_release'])['cust_id'].nunique()\
                        .sort_values(ascending=True)
rb_palette = [(x/10.0, x/100.0, x/40.0) for x in range(len(agg_rec.tail(10)))] 
# <-- gradient rgb   (x/10.0, x/20.0, 0.75)
ax = agg_rec.tail(10).plot(kind='barh', x='cust_id', color=rb_palette)
ax.set_title('Most recommended movies to Users')
ax.set_ylabel('')
ax.set_xlabel('Number of times movie made it to Users Top 5 Recommendation')
print('Top Movie Recommendations')

#### **User-User Similarity**
Another slightly more useful method would be to identify users that are similar and source recommendations from other movies similar users have rated highly. The high level questions I'm trying to answer are the following:
* Which users are similar to each other?               *---->   Correlate users*
* How do I determine a cutoff for similarity?           *---->   Establish a threshold*
* What movies have the similar users seen that I can recommend to the user in question?           *---->   Pull in relevant recommendations*

In this example, I use **Pearson's correlation coefficient** to tell me how similar two customers are based on their movie ratings. Pandas allows us to take the **user-movie matrix** that we created before and **transpose** it so that **each column represents a different user's movie ratings** across N movie rows. Then we compute the **pairwise correlation coefficient of each user (column) with every other**, creating a **User-User Matrix**. I use the `min_periods` option to require at least 50 overlapping movie ratings before two users are correlated with one another.

**Note**: This is an inefficient process, and it gets worse as the number of customers grows relative to the number of movies. The way I explain it to myself: each user vector has one dimension per movie, and every user vector has to be compared against every other user vector along those dimensions, so the work scales with N users * N users. There are better alternatives, but I wanted to show how to do this anyway since I'm sure there are situations in which it is useful.
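
A tiny example makes both points visible: how `min_periods` blanks out pairs with too little overlap, and how the number of user pairs grows. The ratings are made up, and `min_periods=3` is used only because the toy matrix is small.

```python
import numpy as np
import pandas as pd

# Rows are users, columns are movies; values are made-up ratings
toy_mat = pd.DataFrame(
    [[5, 4, 1, np.nan],
     [4, 5, 2, 1],
     [1, np.nan, 5, np.nan]],
    index=pd.Index(['u1', 'u2', 'u3'], name='cust_id'),
    columns=pd.Index(['m1', 'm2', 'm3', 'm4'], name='movie_id'),
)

# Transpose so each column is a user, then correlate columns pairwise.
# Pairs sharing fewer than 3 rated movies get NaN instead of a score.
toy_corr = toy_mat.T.corr(min_periods=3)

n_users = toy_mat.shape[0]
n_pairs = n_users * (n_users - 1) // 2   # distinct user pairs, grows quadratically
print(toy_corr.round(2))
print('Distinct user pairs:', n_pairs)
```

Here `u1` and `u2` share three movies and get a score, while any pair involving `u3` shares only two and stays NaN.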

In [None]:
userSim = mat.copy(deep=True)
corr = userSim.T.corr(min_periods=50)    # pairwise pearson correlation coefficient of columns
corr.head(5)

In [None]:
threshold = 0.10
print('Set the threshold similarity between users to be .1 given the distribution of corrs.')
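# take only the strict upper triangle: each pair is counted once, and the zeros np.triu would fill in are left out
upper = corr.values[np.triu_indices_from(corr.values, k=1)]
pd.Series(upper).dropna().describe(percentiles=[x*.1 for x in range(10)]).round(2)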

In [None]:
nearest_users = corr.stack()\
.reset_index(level=1)\
.rename(columns={
    'cust_id': 'cust_id_2',
    0: 'similarity_score'})\
.reset_index()\
.query('similarity_score > {}'.format(threshold))\
.query('cust_id != cust_id_2')\
.groupby('cust_id')\
.apply(lambda x: x.nlargest(5, columns='similarity_score'))\
.reset_index(drop=True)\
.sort_values(by=['cust_id', 'similarity_score'], ascending=[True, False])
print('For each user, get the nearest 5 users (not themselves) that are above the threshold similarity.')
nearest_users.head(10)

In [None]:
top5perUser = mat.stack()\
.reset_index()\
.rename(columns={0: 'rating'})\
.groupby('cust_id')\
.apply(lambda x: x.nlargest(5, columns='rating'))\
.reset_index(drop=True)\
.sort_values(by=['cust_id', 'rating'], ascending=[True, False])\
.merge(movie_titles, how='left', on='movie_id', validate='m:1')\
.rename(columns={'name': 'recommended_movie',
                 'year': 'year_of_release'})
top5perUser.head(10)

In [None]:
top5perUser['rank'] = top5perUser.assign(count=1).groupby('cust_id')['count'].transform('cumsum')
top5recs = top5perUser.drop(['movie_id', 'rating', 'year_of_release'], axis=1)\
.set_index(['cust_id', 'rank'])\
.unstack().reset_index()
top5recs.columns = ['_'.join([str(x) for x in col]).strip('_') for col in top5recs.columns.values]
print('For each user, get the top 5 recommended movies.')
top5recs.sample(5)

In [None]:
userUserRecs = nearest_users.merge(top5recs, how='left', left_on='cust_id_2', right_on='cust_id', suffixes=['', '_'])\
.drop('cust_id_', axis=1)
print('For each customer, merge in the similar users recommended movies')
userUserRecs.head(10)

An **alternate way** to do this would be to **generate predictions of what the user would rate a movie** based on what similar users rated the movie and **use those predictions to generate recommendations**. This will **not limit** us to the **top 5 movies**, since we can rank every predicted movie instead.
1. We will still have to use a threshold here to reduce the chance that dissimilar users sway the recommendation.
2. Then we compute a prediction (a similarity-weighted rating) using the formula P(user, item) = ∑(other user's rating of the item × that user's similarity to the user) / ∑(similarity scores to the user).
3. Then we'll merge in the movie titles for each movie.
4. Finally, we filter to movies the user hasn't seen yet.
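
The weighted-rating formula in step 2 can be worked through on a handful of numbers. The neighbours, their similarity scores and their ratings are made up, and `predict_rating` is a hypothetical helper written only for this illustration.

```python
import numpy as np

# Made-up neighbours of one user: (similarity to the user, their rating of the item)
neighbours = [
    (0.9, 5.0),
    (0.7, 4.0),
    (0.2, 1.0),   # weakly similar user
]
threshold = 0.6   # only neighbours above this similarity contribute

def predict_rating(neighbours, threshold):
    """Similarity-weighted average of the ratings from sufficiently similar users."""
    kept = [(s, r) for s, r in neighbours if s > threshold]
    if not kept:
        return None   # no neighbour is similar enough, so no prediction
    sims = np.array([s for s, _ in kept])
    ratings = np.array([r for _, r in kept])
    return float((sims * ratings).sum() / sims.sum())

prediction = predict_rating(neighbours, threshold)
print(prediction)   # (0.9*5 + 0.7*4) / (0.9 + 0.7)
```

Raising the threshold until no neighbour qualifies returns `None`, which mirrors the trade-off between recommendation quality and the number of users covered.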

In [None]:
seenMovie = mat.stack()\
.reset_index()\
.rename(columns={0: 'user_rating'})\
.assign(customer_seen_movie_flag = 1)
seenMovie.head()

In [None]:
threshold = .6
weightedRatings = mat.stack()\
.reset_index()\
.rename(columns={0: 'rating'})\
.merge(nearest_users, how='right', left_on='cust_id', right_on='cust_id_2', suffixes=['_', ''])\
.drop('cust_id_', axis=1)\
.query('similarity_score > {}'.format(threshold))\
.query('cust_id != cust_id_2')\
.assign(user_rating_weighted_by_similarity = lambda x: (x.rating * x.similarity_score))\
.groupby(['cust_id', 'movie_id'], as_index=False)[['user_rating_weighted_by_similarity', 'similarity_score']].sum()\
.assign(prediction = lambda x: (x.user_rating_weighted_by_similarity / x.similarity_score))\
.sort_values(by=['cust_id', 'prediction'], ascending=[True, False])\
.merge(movie_titles, how='left', on='movie_id', validate='m:1')\
.rename(columns={'name': 'recommended_movie',
                 'year': 'year_of_release'})\
.merge(seenMovie, how='left', on=['cust_id', 'movie_id'])
weightedRatings.sample(10)

In [None]:
print('Threshold pearsons score: {}.'.format(threshold))
print('Increasing the threshold will decrease the number of customers we can provide \n recommendations for but increases the quality of the recommendation.')
print('------------')
print('Recommendations available for {} out of {} users.'.format(weightedRatings.cust_id.nunique(), mat.shape[0]))
weightedRatings.groupby('cust_id', as_index=False).agg({'recommended_movie': 'nunique', 'similarity_score': ['min', 'mean', 'max']})\
.rename(columns={'recommended_movie': 'number_of_recommendations'}).sample(10)

In [None]:
print('Subset to movies customers havent seen.')
weightedRatings.loc[(weightedRatings.customer_seen_movie_flag != 1) & (weightedRatings.prediction > 4)].sample(10)

In [None]:
sub = weightedRatings.loc[(weightedRatings.customer_seen_movie_flag == 1)]
sample_size = round(.2 * sub.shape[0])
sample = sub.sample(sample_size)
from scipy.stats import pearsonr   # public import path; scipy.stats.stats is a deprecated private module
stats = pearsonr(sample.user_rating, sample.prediction)
ax = sns.stripplot(x="user_rating", y="prediction", data=sample, jitter=True, color='red', size=3)
ax.set_title('User Ratings to Predicted Rating')
ax.set_ylabel('Prediction')
ax.set_xlabel('Current User Rating')
print('Lets use the data where the customer has seen the movie to evaluate these results.')
print('---------------')
print('User ratings seem to be overall trending positively with predicted value.')
print('Pearsons Correlation: {}, P-value: {}'.format(stats[0], stats[1]))

#### **As Netflix gets better at predicting what users want, overall user ratings will increase over time.**