## Recommendation Systems

A recommendation system analyses a user's preferences and recommends products that the user is likely to enjoy.
For example, Netflix recommends movies based on the movies you have watched before, and LinkedIn recommends jobs you may be interested in based on your profile.

Types of recommendation systems

|Type|Description|
|-|-|
| Content-based filtering | Recommends products whose attributes or features match those you have liked in the past. For example, if you have liked horror movies before, Netflix recommends more horror movies. |
| Collaborative filtering | Works on the assumption that people who had similar preferences in the past will have similar preferences in the future. |
| Demographic-based recommender system | Uses demographic data, usually collected through market research, to recommend products. |
| Utility-based recommender system | Creates a utility function for the products and recommends products based on the output of that function. The benefit is that non-product attributes such as vendor reliability and product availability can be factored into the utility function. |
| Knowledge-based recommender system | Recommends items by reasoning about how a particular item meets the user's needs. |
| Hybrid recommender system | Combines two or more of the above recommendation systems. Common techniques include weighting the outputs of the individual systems, switching between systems, or showing the recommendations from the different systems together. |
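
As a rough illustration of the weighted hybrid technique listed in the table, the sketch below blends the scores of two recommenders with fixed weights. The score values, the weights, and the function name `weighted_hybrid` are made up for illustration and do not come from the notebook; a real system would usually normalise the scores of each recommender before blending them.

```python
import numpy as np

def weighted_hybrid(scores_a, scores_b, weight_a=0.7, weight_b=0.3):
    """Blend two score vectors (one entry per item) with fixed weights."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    return weight_a * scores_a + weight_b * scores_b

# Hypothetical scores for four items from two recommenders, already scaled to [0, 1].
content_scores = [0.9, 0.1, 0.4, 0.3]
collab_scores = [0.2, 0.8, 0.5, 0.9]

hybrid_scores = weighted_hybrid(content_scores, collab_scores)
# Item indices ordered from the best to the worst blended score.
ranking = np.argsort(-hybrid_scores)
print(hybrid_scores, ranking)
```

The weights control how much each recommender contributes; here the content-based scores dominate, so item 0 stays on top even though the collaborative scores rank it last.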


## Collaborative filtering


### User-user collaborative Filtering

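1. Compute the similarity between users using adjusted cosine similarity. Adjusted cosine similarity reduces the impact of users who consistently rate very high or very low by subtracting each user's mean rating from their ratings.
2. Multiply the user-user similarity matrix with the existing ratings to get a similarity-weighted sum of ratings for every movie. The sum is not normalised, so the resulting scores can exceed the rating scale and are only meaningful for ranking.
3. Exclude the movies the user has already rated.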
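
The three steps above can be sketched on a tiny, made-up ratings matrix. This is a minimal illustration under the assumption that missing ratings are stored as NaN; the data and variable names are hypothetical, and it uses plain NumPy rather than the `pairwise_distances` call used in the cells that follow.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies, NaN means "not rated".
ratings = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 2.0, np.nan],
    [1.0, np.nan, 5.0, 4.0],
])
rated = ~np.isnan(ratings)

# Step 1: mean-centre each user's ratings, then take cosine similarity between users.
user_means = np.nanmean(ratings, axis=1, keepdims=True)
centred = np.where(rated, ratings - user_means, 0.0)
norms = np.linalg.norm(centred, axis=1, keepdims=True)
similarity = (centred @ centred.T) / (norms @ norms.T)
similarity = np.clip(similarity, 0.0, None)  # keep only positive similarity

# Step 2: score every movie as a similarity-weighted sum of the raw ratings.
scores = similarity @ np.where(rated, ratings, 0.0)

# Step 3: zero out the movies each user has already rated.
scores = scores * ~rated

# Highest-scoring unseen movie (column index) for the first user.
best_for_user0 = int(scores[0].argmax())
print(similarity.round(3))
print(best_for_user0)
```

In this toy data the first two users have similar tastes, so the only movie the first user has not rated (column 2) gets its score from the second user's rating.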

In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# load movie lens data
ratings = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/ratings_final.csv' , encoding='latin-1')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [4]:
# Test and Train split of the dataset.
from sklearn.model_selection import train_test_split
train, test = train_test_split(ratings, test_size=0.30, random_state=31)
print(train.shape)
print(test.shape)

(210088, 4)
(90038, 4)


In [5]:
# Pivot the train ratings' dataset into matrix format in which columns are movies and the rows are user IDs.
df_pivot = train.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(0)
df_pivot.head(3)

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
# Copy the train dataset into dummy_train.
# dummy_train is used later as a mask so that only movies the user has not rated are recommended.
# Movies already rated by the user are marked 0, so their predicted score is zeroed out.
# Movies not rated by the user are marked 1 (set by fillna(1) after the pivot below).
dummy_train = train.copy()
# Every row in train is an existing rating, including half-star ratings such as 0.5, so any positive rating is marked 0.
dummy_train['rating'] = dummy_train['rating'].apply(lambda x: 0 if x > 0 else 1)
# Convert the dummy train dataset into matrix format.
dummy_train = dummy_train.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(1)
dummy_train.head(3)

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [7]:
# Step 1: Get Adjusted cosine similarity between user-user
from sklearn.metrics.pairwise import pairwise_distances

# pivot the ratings again, keeping NaN for unrated movies so they do not distort the mean
df_pivot_nans = train.pivot(
    index='userId',
    columns='movieId',
    values='rating'
)

# mean rating per user (rows are users), ignoring NaN
mean = np.nanmean(df_pivot_nans, axis=1)
# subtract each user's mean from that user's ratings
df_subtracted = (df_pivot_nans.T-mean).T
# Create the user similarity matrix using the pairwise_distances function.
user_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
# if user_correlation is NaN, replace it with 0
user_correlation[np.isnan(user_correlation)] = 0
# if user_correlation is negative, replace it with 0
user_correlation[user_correlation<0]=0
user_correlation.shape
user_correlation.shape

(2071, 2071)

In [8]:
# Step 2: Score every movie for every user, whether or not the user has rated it. This is the dot product of the user-user similarity matrix and the existing ratings.
user_predicted_ratings = np.dot(user_correlation, df_pivot.fillna(0))
user_predicted_ratings.shape

(2071, 12911)

In [9]:
# Step 3: Zero out the scores of movies the user has already rated.
user_final_rating = np.multiply(user_predicted_ratings,dummy_train)
user_final_rating.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,205967,206272,206293,206499,206523,206845,206861,207309,208002,208793
userId
1,9.55213,3.294495,1.470954,0.040189,1.020561,5.201704,1.659968,0.054285,0.349317,5.031108,...,0.0,0.003304,0.008125,0.0,0.033861,0.0,0.0,0.0,0.0,0.074576
2,0.0,14.925742,7.108167,0.908528,4.979097,14.034879,7.099211,0.785965,1.635084,19.88008,...,0.02464,0.061503,0.013683,0.0,0.054489,0.0,0.0176,0.113239,0.113239,0.058355
3,66.141241,22.770011,10.806908,1.206505,6.68577,34.141321,9.858038,0.669255,2.676321,32.180909,...,0.171026,0.100944,0.124693,0.080384,0.133438,0.317882,0.122161,0.365239,0.365239,0.234785
4,0.0,10.226084,2.658416,0.299828,2.207704,12.222046,3.007363,0.123162,0.998604,12.896807,...,0.324958,0.172648,0.064566,0.009314,0.211999,0.144026,0.232113,0.346767,0.346767,0.200223
5,0.0,19.867664,17.681491,2.762963,12.826641,32.774931,16.06281,1.801925,3.842096,35.779545,...,0.073628,0.0,0.025125,0.025513,0.044435,0.156123,0.052592,0.060123,0.060123,0.082602


In [10]:
## Recommended movies for a user
userId = 21
recommended_movies = user_final_rating.loc[userId].sort_values(ascending=False)[:3]
#Mapping with Movie Title / Genres 
movie_mapping = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/movies.csv')
final = pd.merge(recommended_movies, movie_mapping, left_on='movieId', right_on='movieId', how='left')
final.head()

Unnamed: 0,movieId,21,title,genres
0,296,63.472242,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
1,356,59.370178,Forrest Gump (1994),Comedy|Drama|Romance|War
2,4993,56.21901,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy


### Evaluation

To evaluate the model, follow these steps:
1. Get the users common to the train and test sets, along with their correlations.
2. Predict ratings for the movies in the test set for which actual ratings are available.
3. Compute the RMSE between the predicted and actual ratings.
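
Step 3 can be illustrated with a small helper that computes the RMSE only over the entries that have an actual rating. This is a sketch on made-up numbers; the function name `masked_rmse` and the arrays are hypothetical and are not part of the notebook.

```python
import numpy as np

def masked_rmse(actual, predicted):
    """RMSE over the entries where an actual rating exists (NaN = missing)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    observed = ~np.isnan(actual)
    squared_errors = (actual[observed] - predicted[observed]) ** 2
    return float(np.sqrt(squared_errors.mean()))

# Hypothetical test ratings and predictions for two users and three movies.
actual = np.array([[4.0, np.nan, 2.0],
                   [np.nan, 5.0, 3.0]])
predicted = np.array([[3.0, 1.0, 2.0],
                      [4.5, 4.0, 5.0]])

rmse = masked_rmse(actual, predicted)
print(rmse)
```

Predictions for pairs without an actual rating are ignored, which is the same idea as masking the predictions with `dummy_test` in the cells below.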

In [11]:
# Step 1 : List of common users
common = test[test.userId.isin(train.userId)]
common_user_based_matrix = common.pivot_table(index='userId', columns='movieId', values='rating')
# Correlation df for all users in train set
user_correlation_df = pd.DataFrame(user_correlation)
# Take users from cosine similarity and set as index for user correlation table 
user_correlation_df['userId'] = df_subtracted.index
user_correlation_df.set_index('userId',inplace=True)
user_correlation_df.head()

,0,1,2,3,4,5,6,7,8,9,...,2061,2062,2063,2064,2065,2066,2067,2068,2069,2070
userId
1,1.0,0.001899,0.016571,0.0,0.0,0.0,0.039761,0.0,0.003991,0.0,...,0.0,0.0,0.000829,0.0,0.053598,0.015322,0.0,0.020284,0.0,0.004454
2,0.001899,1.0,0.024391,0.014429,0.009537,0.034863,0.0,0.0,0.0,0.003106,...,0.0,0.051286,0.036513,0.029803,0.0,0.045524,0.010638,0.034615,0.0,0.0
3,0.016571,0.024391,1.0,0.062999,0.040416,0.009671,0.0,0.017017,0.0,0.036063,...,0.0,0.061911,0.01747,0.083151,0.038958,0.060932,0.028651,0.112607,0.0,0.017203
4,0.0,0.014429,0.062999,1.0,0.0,0.011082,0.0,0.0,0.0,0.02775,...,0.0,0.013295,0.0,0.003963,0.0,0.022046,0.020043,0.045028,0.0,0.004144
5,0.0,0.009537,0.040416,0.0,1.0,0.051212,0.045804,0.088584,0.079276,0.133326,...,0.0,0.033921,0.103921,0.007723,0.0,0.081722,0.106743,0.014523,0.03769,0.009734


In [12]:
# Step 2. Find user ratings for the movies in test set for which the ratings are already available
# common users
list_name = common.userId.tolist()
# only keep correlations of common users
user_correlation_df.columns = df_subtracted.index.tolist()
user_correlation_df1= user_correlation_df[user_correlation_df.index.isin(list_name)]
user_correlation_df2 = user_correlation_df1.T[user_correlation_df1.T.index.isin(list_name)]
user_correlation_df3 = user_correlation_df2.T
user_correlation_df3[user_correlation_df3<0]=0
common_user_predicted_ratings = np.dot(user_correlation_df3, common_user_based_matrix.fillna(0))
common_user_predicted_ratings.shape

(2071, 9529)

In [13]:
# Keep predictions only for the (user, movie) pairs that have an actual rating in the test set
dummy_test = common.copy()
# Any positive rating, including half-star ratings such as 0.5, counts as rated
dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x > 0 else 0)
dummy_test = dummy_test.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
common_user_predicted_ratings = np.multiply(common_user_predicted_ratings,dummy_test)

In [22]:
from sklearn.preprocessing import MinMaxScaler

# Keep only positive predictions; everything else becomes NaN and is ignored below.
X = common_user_predicted_ratings.copy()
X = X[X > 0]
# Rescale predictions to the 1-5 range. MinMaxScaler works column-wise (per movie) and skips NaN.
scaler = MinMaxScaler(feature_range=(1, 5))
y = scaler.fit_transform(X)
# Actual test ratings. Reuse the pivot that produced the predictions so the shapes match;
# re-pivoting `common` here raises a shape-mismatch ValueError if `common` has been
# reassigned by another cell (for example the item-based evaluation).
common_ = common_user_based_matrix
total_non_nan = np.count_nonzero(~np.isnan(y))
# np.nansum skips the pairs that have no prediction or no actual rating.
rmse = (np.nansum((common_.to_numpy() - y) ** 2) / total_non_nan) ** 0.5
print(rmse)

### Item-item collaborative filtering



In this section we recommend movies based on item-item correlation.
The steps for building the item-item correlation matrix are the same as for the user-user case.
1. Find the item-item correlation using adjusted cosine similarity.
2. Predict ratings for the movies that the user has not rated.
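
The two steps can be sketched on a small, made-up matrix. The sketch mirrors this notebook's choice of centring each movie on its own mean rating before taking the cosine similarity; the data and variable names are hypothetical, and other formulations of adjusted cosine similarity centre on the user mean instead.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies, NaN means "not rated".
ratings = np.array([
    [5.0, 4.0, np.nan],
    [4.0, 5.0, 1.0],
    [1.0, np.nan, 5.0],
    [2.0, 1.0, 4.0],
])
rated = ~np.isnan(ratings)

# Step 1: centre each movie (column) on its own mean, then compare movie columns.
movie_means = np.nanmean(ratings, axis=0, keepdims=True)
centred = np.where(rated, ratings - movie_means, 0.0)
norms = np.linalg.norm(centred, axis=0, keepdims=True)
item_similarity = (centred.T @ centred) / (norms.T @ norms)
item_similarity = np.clip(item_similarity, 0.0, None)  # keep only positive similarity

# Step 2: weight each user's ratings by item similarity, keeping unrated movies only.
scores = (np.where(rated, ratings, 0.0) @ item_similarity) * ~rated
print(item_similarity.round(3))
print(scores.round(3))
```

Here the first two movies are rated alike by the same users, so the third user's rating of the first movie carries over as a score for the second movie, which that user has not rated.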

In [15]:
# 1. Find item-item correlation using adjusted cosine similarity
# Pivot the train ratings into a matrix and transpose it so that rows are movies and columns are user IDs.
df_pivot = train.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).T # shape = (12911, 2071)
# mean rating per movie (rows are movies), ignoring NaN
mean = np.nanmean(df_pivot, axis=1) # shape = (12911,)
# subtract each movie's mean from that movie's ratings
df_subtracted = (df_pivot.T-mean).T # shape = (12911, 2071)
# Create the item similarity matrix using the pairwise_distances function.
item_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
# if item_correlation is NaN, replace it with 0
item_correlation[np.isnan(item_correlation)] = 0
# if item_correlation is negative, replace it with 0
item_correlation[item_correlation<0]=0
item_correlation.shape

(12911, 12911)

In [16]:
# 2. Find predicted ratings for the movies that are not rated by the user
predicted_ratings = np.dot(df_pivot.fillna(0).T, item_correlation)
# remove the available ratings
item_final_rating = np.multiply(predicted_ratings, dummy_train)
item_final_rating.shape

(2071, 12911)

In [17]:
## Recommended movies for a user
userId = 21
recommended_movies = item_final_rating.loc[userId].sort_values(ascending=False)[:5]
#Mapping with Movie Title / Genres 
movie_mapping = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/movies.csv')
final = pd.merge(recommended_movies, movie_mapping, left_on='movieId', right_on='movieId', how='left')
final.head()

Unnamed: 0,movieId,21,title,genres
0,76251,39.245261,Kick-Ass (2010),Action|Comedy
1,111362,38.47115,X-Men: Days of Future Past (2014),Action|Adventure|Sci-Fi
2,110102,36.579419,Captain America: The Winter Soldier (2014),Action|Adventure|Sci-Fi|IMAX
3,112175,36.204031,How to Train Your Dragon 2 (2014),Action|Adventure|Animation
4,33794,35.348424,Batman Begins (2005),Action|Crime|IMAX


In [18]:
train_new = pd.merge(train,movie_mapping,left_on='movieId',right_on='movieId',how='left')
train_new[train_new.userId == 21] .head()
# Notice that many of the movies rated by the same user belong to a similar genre: Action

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
687,21,99114,4.5,1456285409,Django Unchained (2012),Action|Drama|Western
706,21,318,5.0,1456284625,"Shawshank Redemption, The (1994)",Crime|Drama
1272,21,5816,4.0,1456284964,Harry Potter and the Chamber of Secrets (2002),Adventure|Fantasy
5778,21,88140,3.5,1456285218,Captain America: The First Avenger (2011),Action|Adventure|Sci-Fi|Thriller|War
11807,21,130634,3.5,1456284567,Furious 7 (2015),Action|Crime|Thriller


### Evaluation

In [19]:
common = test[test.movieId.isin(train.movieId)]
common_item_based_matrix = common.pivot_table(index='userId', columns='movieId', values='rating').T

# create a df. with items only as in test
item_correlation_df = pd.DataFrame(item_correlation)
item_correlation_df['movieId'] = df_subtracted.index
item_correlation_df.set_index('movieId',inplace=True)
list_name = test.movieId.tolist()
item_correlation_df.columns = df_subtracted.index.tolist()
item_correlation_df_1 =  item_correlation_df[item_correlation_df.index.isin(list_name)]
# keep only items in test matrix remove the rest from rows and columns
item_correlation_df_2 = item_correlation_df_1.T[item_correlation_df_1.T.index.isin(list_name)]
item_correlation_df_3 = item_correlation_df_2.T
item_correlation_df_3[item_correlation_df_3<0]=0
# this matrix will have ratings for all the items
common_item_predicted_ratings = np.dot(item_correlation_df_3, common_item_based_matrix.fillna(0))
# multiply with matrix which has ratings for item already available
dummy_test = common.copy()
dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>=1 else 0)
dummy_test = dummy_test.pivot_table(index='userId', columns='movieId', values='rating').T.fillna(0)
common_item_predicted_ratings = np.multiply(common_item_predicted_ratings,dummy_test)
# actual ratings (the same pivot that produced the predictions, so the shapes match)
common_ = common_item_based_matrix
# scale predictions before finding RMSE; non-positive entries become NaN and are ignored
from sklearn.preprocessing import MinMaxScaler
X = common_item_predicted_ratings.copy()
X = X[X > 0]
scaler = MinMaxScaler(feature_range=(1, 5))
y = scaler.fit_transform(X)
total_non_nan = np.count_nonzero(~np.isnan(y))
# np.nansum skips the pairs that have no prediction or no actual rating
rmse = (np.nansum((common_.to_numpy() - y) ** 2) / total_non_nan) ** 0.5
print(rmse)