# data612 - Group Project 2 : Recommender System - movielens
# date: 2019-06-18
# by: Sang Yoon (Andy) Hwang, Santosh Cheruku, Anthony Munoz

# Data Preparation

We are going to use the 100k-ratings dataset from MovieLens.

In [2]:
import pandas as pd
import surprise
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from surprise import Dataset
from surprise import KNNBasic
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

Specifically, we use ml-latest-small from https://grouplens.org/datasets/movielens/.

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/san123i/CUNY/master/Semester2/612-RecommenderSystem/Week2/data_assignment/ml-100k/ratings.csv')

In [4]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Just to make our life a little bit easier, we will change the names of the columns.

In [5]:
df.columns = ['user','item','rating','timestamp']

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
user         100836 non-null int64
item         100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


# Modelling

We will find the best KNN model based on RMSE. 
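As a reminder of the metric, RMSE is the square root of the mean squared difference between actual and predicted ratings, so lower is better. A minimal pure-Python sketch follows; the function name `rmse` and the sample pairs are illustrative assumptions and are not part of Surprise.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) rating pairs."""
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, for illustration only
sample = [(4.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
error = rmse(sample)
```

Because the errors are squared before averaging, one prediction that is off by a full star costs more than two predictions that are each off by half a star.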

In [7]:
reader = Reader(rating_scale=(df.rating.min(), df.rating.max()))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

In [8]:
df_algo = []
list_algo = [surprise.KNNBaseline(), surprise.KNNBasic(), surprise.KNNWithMeans(), surprise.KNNWithZScore()]
# Iterate over all algorithms
for algo in list_algo:
    # Perform 3-fold cross validation
    results = cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

    # Average each metric over the folds, then attach the algorithm's class name
    # (pd.concat replaces Series.append, which was removed in pandas 2.0)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series({'Algorithm': type(algo).__name__})])
    df_algo.append(tmp)

# One row per algorithm, best (lowest) RMSE first
pd.DataFrame(df_algo).set_index('Algorithm').sort_values('test_rmse')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


Algorithm,test_rmse,fit_time,test_time
KNNBaseline,0.882087,0.414546,5.12139
KNNWithZScore,0.90416,0.409221,5.884724
KNNWithMeans,0.906156,0.354582,4.734807
KNNBasic,0.957978,0.271943,4.041651


KNNBaseline has the lowest RMSE, so we will use this model: we fit it on a training set and evaluate its accuracy on a test set.
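The comparison above uses `cv=3`, meaning each algorithm is fit three times and scored each time on the third of the ratings it did not see. The idea can be sketched in pure Python as follows. This is an illustration under our own assumptions (the helper `k_fold_splits`, the seed, and the toy triples are hypothetical), and Surprise's own splitter may shuffle and partition differently.

```python
import random

def k_fold_splits(ratings, k=3, seed=0):
    """Yield (train, test) lists; each rating lands in exactly one test fold."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled ratings round-robin into k folds
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Hypothetical (user, item, rating) triples, for illustration only
ratings = [(u, i, float((u + i) % 5 + 1)) for u in range(4) for i in range(3)]
splits = list(k_fold_splits(ratings, k=3))
```

The per-fold RMSE values are then averaged, which is what the `mean(axis=0)` call in the comparison cell does.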

# KNNBaseline - User_Based 

We will perform an 80/20 train/test split and fit the model on the training portion.

In [9]:
trainset, testset = train_test_split(data, test_size=.20)

# Let's go with cosine similarity
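For intuition, cosine similarity between two users can be computed from the items both have rated. The sketch below is a simplified pure-Python version under that assumption; the function name `cosine_sim` and the two sample users are hypothetical, and library options such as a minimum number of common items are left out.

```python
import math

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity between two users, using only items both have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical users, for illustration only: item id -> rating
alice = {1: 4.0, 3: 4.0, 6: 2.0}
bob = {1: 5.0, 3: 3.0, 47: 5.0}
sim = cosine_sim(alice, bob)
```

Since ratings are all positive, this measure stays between 0 and 1 and tends to be high whenever two users share items, which is worth keeping in mind when interpreting neighbors.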

In [10]:
# We'll use KNNBaseline with cosine similarity (user_based)
sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_cos = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_cos)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8761


0.8760512835713263

Let's create a prediction dataframe for the cosine model so we can inspect each prediction on the testset.

In [11]:
prediction_df = pd.DataFrame(predictions_cos)
prediction_df.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,356,34523,2.0,3.136148,"{'actual_k': 1, 'was_impossible': False}"
1,47,55247,4.0,3.476667,"{'actual_k': 30, 'was_impossible': False}"
2,520,471,5.0,3.915398,"{'actual_k': 35, 'was_impossible': False}"
3,318,43869,4.0,3.513997,"{'actual_k': 1, 'was_impossible': False}"
4,210,77795,3.5,3.881247,{'was_impossible': False}


We can also use an individual uid and iid to check a single prediction. Using the cosine similarity model, let's select one of the entries in the testset.

In [12]:
test_10 = testset[0:10]

In [13]:
# We'll use KNNBaseline with cosine similarity (user_based)
sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_cos = algo.fit(trainset).test(testset)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [14]:
# Predict with the model fit on the training set and compare with the testset rating
uid = test_10[1][0]   # raw user id (as in the ratings file).
iid = test_10[1][1]   # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 47         item: 55247      r_ui = 4.00   est = 3.48   {'actual_k': 30, 'was_impossible': False}
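The `actual_k` field reports how many neighbors contributed to the estimate. As we understand the KNNBaseline approach, the estimate starts from a baseline for the (user, item) pair and adds the similarity-weighted average of how far each neighbor's rating sits from that neighbor's own baseline. The sketch below illustrates that idea; the function `knn_baseline_estimate` and all numbers are hypothetical, and it is not the library's implementation.

```python
def knn_baseline_estimate(baseline_ui, neighbors):
    """Baseline estimate plus similarity-weighted neighbor deviations.

    neighbors: list of (similarity, neighbor_rating, neighbor_baseline).
    """
    weighted_dev = 0.0
    sim_total = 0.0
    for sim, rating, baseline in neighbors:
        if sim > 0:  # ignore non-positive similarities
            weighted_dev += sim * (rating - baseline)
            sim_total += sim
    if sim_total == 0:
        return baseline_ui  # no usable neighbors: fall back to the baseline
    return baseline_ui + weighted_dev / sim_total

# Hypothetical numbers, for illustration only
est = knn_baseline_estimate(3.4, [(0.9, 4.0, 3.5), (0.6, 3.0, 3.2)])
```

When few neighbors are available, as in the rows with `actual_k` of 1, the estimate stays close to the baseline.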


Next, we recommend the top 5 movies by predicted rating for a selected user, using the user-based cosine model. We only consider movies whose ratings the model did not see during training, so the candidate iids come from the testset, not the trainset. To show titles, we merge the movie title dataset into the prediction dataframe.

In [15]:
item_names = pd.read_csv('https://raw.githubusercontent.com/san123i/CUNY/master/Semester2/612-RecommenderSystem/Week2/data_assignment/ml-100k/movies.csv')
item_names = item_names.iloc[:,[0,1]]
item_names.columns = ['iid', 'title']
prediction_df = prediction_df.merge(item_names, on = 'iid', how = 'inner')

In [16]:
def recommender_user(uid):
    top_5 = prediction_df[prediction_df.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 47 are iid: 159817 - Planet Earth (2006) with predicted ratings of 4.088123596140258
top 5 movies for uid: 47 are iid: 166643 - Hidden Figures (2016) with predicted ratings of 3.6227170682060734
top 5 movies for uid: 47 are iid: 3949 - Requiem for a Dream (2000) with predicted ratings of 3.481476889127719
top 5 movies for uid: 47 are iid: 55247 - Into the Wild (2007) with predicted ratings of 3.476667176158484
top 5 movies for uid: 47 are iid: 1097 - E.T. the Extra-Terrestrial (1982) with predicted ratings of 3.447623802880301
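The same top-N selection can be done for every user at once, without pandas, by grouping predictions by user and sorting each group by estimate. The sketch below assumes predictions are available as plain `(uid, iid, est)` triples; the helper `top_n_per_user` and the sample data are hypothetical.

```python
from collections import defaultdict

def top_n_per_user(predictions, n=5):
    """Map each user id to its n highest-estimate (item id, estimate) pairs."""
    by_user = defaultdict(list)
    for uid, iid, est in predictions:
        by_user[uid].append((iid, est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in by_user.items()
    }

# Hypothetical (uid, iid, est) triples, for illustration only
predictions = [(47, 10, 3.1), (47, 11, 4.2), (47, 12, 3.8), (5, 10, 2.5)]
top = top_n_per_user(predictions, n=2)
```

A user with fewer than n candidate items simply gets a shorter list.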


# KNNBaseline - Item_Based 

# Let's go with cosine similarity
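With `user_based` set to `False`, similarities are computed between items rather than between users. One way to picture this, assuming ratings are held as nested dicts (an illustrative layout, not Surprise's internal one), is to invert the mapping so that each item points to the users who rated it. The helper `invert` and the sample data are hypothetical.

```python
from collections import defaultdict

def invert(user_ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    item_ratings = defaultdict(dict)
    for user, items in user_ratings.items():
        for item, rating in items.items():
            item_ratings[item][user] = rating
    return dict(item_ratings)

# Hypothetical ratings, for illustration only
user_ratings = {1: {10: 4.0, 11: 5.0}, 2: {10: 3.0}}
item_ratings = invert(user_ratings)
```

After inversion, any similarity function written for pairs of user rating vectors can be applied unchanged to pairs of item rating vectors.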

In [17]:
# We'll use KNNBaseline with cosine similarity (item_based)
sim_options = {'name': 'cosine',
               'user_based': False  # compute similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_cos_2 = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_cos_2)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8912


0.8911683637019261

We can also use an individual uid and iid to check a single prediction. This time, we are using the item-based cosine model. Let's select the same testset entry as before.

In [18]:
# Predict with the model fit on the training set and compare with the testset rating
uid = test_10[1][0]  # raw user id (as in the ratings file).
iid = test_10[1][1]  # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 47         item: 55247      r_ui = 4.00   est = 3.52   {'actual_k': 40, 'was_impossible': False}


We will now recommend the top 5 movies based on item-item CF using cosine similarity.

In [19]:
# Build the prediction dataframe for the item-based model and attach movie titles
prediction_df_2 = pd.DataFrame(predictions_cos_2)
prediction_df_2 = prediction_df_2.merge(item_names, on = 'iid', how = 'inner')

In [20]:
def recommender_user(uid):
    top_5 = prediction_df_2[prediction_df_2.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 47 are iid: 55247 - Into the Wild (2007) with predicted ratings of 3.523563867410118
top 5 movies for uid: 47 are iid: 1265 - Groundhog Day (1993) with predicted ratings of 3.473530021022578
top 5 movies for uid: 47 are iid: 47 - Seven (a.k.a. Se7en) (1995) with predicted ratings of 3.404412021113434
top 5 movies for uid: 47 are iid: 106782 - Wolf of Wall Street, The (2013) with predicted ratings of 3.3904451672372966
top 5 movies for uid: 47 are iid: 1097 - E.T. the Extra-Terrestrial (1982) with predicted ratings of 3.3891946803031425


# Conclusion - Evaluation