 # STAT542: Project 4 - Recommender Systems
 #### Fall, 2021
 #### Yu-Wei Lai / UIN: 677679455 / NetID: yuwei6
 ----
 ### **Demo of my recommender app with two systems:**
 ##### https://stat542-project4-recommend-sys.herokuapp.com

 ### **Source code of my recommender app (includes front-end and back-end structures):**
 #### https://github.com/yuwei97910/MovieRecommendSystem

 The user-based recommendation models are built with the Surprise package, and the web app for the two systems is built with Dash.
 The models were tested on a MacBook Air (2021) with an M1 chip (8-core CPU, 3.2 GHz) and 8 GB of RAM.
 The app was deployed from GitHub and is hosted on Heroku.

 ----
 ## System 1: The determinant recommendations
 Given a user's favorite genres, recommend the top classic movies and the top trending movies in those genres.
 ### Import required packages

In [1]:
import pandas as pd
import numpy as np
# Only the datetime class is needed (for datetime.now() and datetime.timestamp()).
from datetime import datetime

 ----
 ### Read in the movies datasets

In [2]:
# The '::' separator is longer than one character, so the Python parser engine is used.
rating = pd.read_table('ratings.dat', sep='::', header=None, engine='python')
rating.columns = ['UserID', 'MovieID', 'Rating', 'Timestamp']

users = pd.read_table('users.dat', sep='::', header=None, engine='python')
users.columns = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

movies = pd.read_table('movies.dat', sep='::', header=None, encoding='latin-1', engine='python')
movies.columns = ['MovieID', 'Title', 'Genres']
movies.set_index('MovieID', inplace=True)
# Turn 'Comedy|Drama' into ['Comedy', 'Drama'].
movies['Genres'] = movies['Genres'].apply(lambda x: x.split('|'))

 ----
 - Genres - There are 18 genres: ["Children's", 'Film-Noir', 'Drama', 'Thriller', 'Action', 'War', 'Documentary', 'Sci-Fi', 'Animation', 'Western', 'Mystery', 'Fantasy', 'Musical', 'Comedy', 'Crime', 'Adventure', 'Horror', 'Romance']

In [3]:
genres_list = []
genres_list = list(set([x for sublist in movies['Genres'] for x in sublist]))

 -----------
 System 1 has to provide two sets of recommendations.
 For each genre, I recommend the movies that are the most "classic" and the most "trending".
 To do so, I first define two scores for each movie from the ratings and movies datasets.

 ### 1. The classic movies
 The must-see ones: rated highly by most reviewers and widely talked about.

 This is measured by the total number of ratings and the median rating. I use the median because at least 50% of reviewers gave a rating at least that high, which I consider more informative than the average score.

 #### classic_score = normalized rating count of a movie * median rating
 - Calculate the rating count of each movie and normalize it (z-score)
 - Calculate the median rating of each movie
 - Multiply the two values to get 'classic_score'
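
As a rough illustration of the classic score, the sketch below follows what the notebook's code computes (z-scored rating count multiplied by the median rating) on a handful of made-up ratings. The movie labels and rating values are hypothetical and do not come from the dataset.

```python
from statistics import mean, median, stdev

# Hypothetical toy ratings: movie label -> list of star ratings.
toy_ratings = {
    "A": [5, 5, 4, 5, 4, 5],
    "B": [3, 4],
    "C": [5],
    "D": [2, 3, 3, 2],
}

# Rating count per movie, then its z-score (sample standard deviation,
# which is also the pandas default for .std()).
counts = {movie: len(r) for movie, r in toy_ratings.items()}
cnt_mean = mean(list(counts.values()))
cnt_std = stdev(list(counts.values()))

# classic_score = normalized rating count * median rating
classic_score = {
    movie: (counts[movie] - cnt_mean) / cnt_std * median(r)
    for movie, r in toy_ratings.items()
}

ranking = sorted(classic_score, key=classic_score.get, reverse=True)
print(ranking)  # ['A', 'D', 'B', 'C']
```

In this toy case movie C ends up last even though its only rating is 5: a below-average count gives a negative z-score, and multiplying it by a high median pushes the score further down. That is a property of the product form worth keeping in mind when reading the rankings.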

 ### 2. The trending movies
 The movies people are talking about nowadays.
 * There are 86400 seconds in one day.

 Following Prof. Liang's EDA of the dataset, the rating count has a huge impact on the recommendations. Therefore, to reduce the influence of ratings that are relatively old, only recent ratings contribute to the score in this section.

 #### (1) Calculate the trending value for each rating:
 - Calculate the time difference between each rating and the present, in days, and min-max scale it to [0, 1]
 - For recent ratings (scaled time difference below the mean minus one standard deviation), the trending value is the rating divided by the scaled time difference; all other ratings get 0

 #### (2) Calculate the trending score for each movie:
 - Sum the trending values of each movie and normalize the sum (z-score) to get 'trending_score'
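
The two steps above can be sketched on toy data as follows. All rows, the fixed `now` value, and the small `EPS` offset are assumptions made for this illustration; in particular, `EPS` is not part of the notebook and is only there so that the newest rating (scaled age 0) does not cause a division by zero.

```python
from statistics import mean, stdev

SECONDS_PER_DAY = 86400
EPS = 0.01  # assumption (not in the notebook): avoids dividing by a scaled age of 0

now = 1_700_000_000  # fixed, made-up "present" so the example is reproducible

# Hypothetical (movie, rating, age in days) rows, converted to timestamps.
toy = [
    ("A", 4, 100), ("A", 5, 95), ("A", 5, 97),
    ("B", 3, 90), ("B", 5, 2), ("B", 4, 99),
    ("C", 4, 0), ("C", 2, 92),
]
rows = [(m, r, now - days * SECONDS_PER_DAY) for m, r, days in toy]

# Step 1: age of each rating in days, min-max scaled to [0, 1].
ages = [(now - ts) / SECONDS_PER_DAY for _, _, ts in rows]
oldest, newest = max(ages), min(ages)
scaled = [(a - newest) / (oldest - newest) for a in ages]

# Ratings count as "recent" when their scaled age is below mean - 1 std.
cutoff = mean(scaled) - stdev(scaled)

# Step 2: sum the trending values per movie.
trending = {}
for (movie, rating, _), age in zip(rows, scaled):
    value = rating / (age + EPS) if age < cutoff else 0.0
    trending[movie] = trending.get(movie, 0.0) + value

print(sorted(trending, key=trending.get, reverse=True))  # ['C', 'B', 'A']
```

Only the two ratings that are 0 and 2 days old fall below the cutoff here, so movie A, which has only old ratings, gets a trending value of 0. The notebook additionally z-scores the per-movie sums, which this sketch leaves out.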


In [4]:
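now = float(datetime.timestamp(datetime.now()))
# rating['trending_value'] = rating.apply(lambda x: (x['Rating'] * x['Timestamp'] / (now - x['Timestamp'])), axis=1)
# Age of each rating in days (86400 seconds per day), then min-max scaled to [0, 1].
rating['differ_days'] = [((now - x)/86400) for x in rating['Timestamp']]
rating['differ_days'] = (rating['differ_days'] - rating['differ_days'].min()) / (rating['differ_days'].max() - rating['differ_days'].min())
# Recency cutoff: mean minus one standard deviation (despite the name, not the 75th percentile).
differ_days_75 = (rating['differ_days'].mean() - rating['differ_days'].std())
# Only ratings newer than the cutoff contribute; older ratings get 0.
# Caveat: the newest rating has a scaled age of exactly 0, so this division can yield inf for that row.
rating['trending_value'] = rating.apply(lambda x: (x['Rating']/x['differ_days']) if x['differ_days'] < differ_days_75 else 0 , axis=1)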

In [5]:
f = {'Rating':['size', 'mean', 'median'], 'trending_value':['sum']}
rating_g = rating.groupby('MovieID')

rating_g = rating_g.agg(f)
rating_g.columns = ['rating_cnt', 'rating_mean', 'rating_median', 'trending_value_sum']
rating_g['rating_cnt_normal'] = (rating_g['rating_cnt'] - rating_g['rating_cnt'].mean()) / rating_g['rating_cnt'].std()
rating_g['trending_score'] = (rating_g['trending_value_sum'] - rating_g['trending_value_sum'].mean()) / rating_g['trending_value_sum'].std()

rating_g['classic_score'] = rating_g.apply(lambda x: x['rating_cnt_normal'] * x['rating_median'], axis=1)
rating_g.drop(['rating_mean', 'rating_median', 'rating_cnt_normal', 'trending_value_sum', ], inplace = True, axis=1)

 ----
 ### Concatenate the result with the 'movies' dataframe and impute 0 for movies with no ratings

In [6]:
movies_system_1 = pd.concat([movies, rating_g], axis = 1, join = "outer")
movies_system_1 = movies_system_1.fillna(0)

 ----
 ## Save the pre-labeled dataset

In [7]:
movies_system_1.to_csv('movies_system_1.csv')

 ----
 ## Try the recommendation
 Input the favorite genre: Drama

In [8]:
input_genre = ['Drama']
# Keep a movie if it has at least one of the selected genres (one flag per movie).
sel_row = [any(g in r for g in input_genre) for r in movies_system_1['Genres']]
movies_system_1 = movies_system_1.loc[sel_row, :]
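
The selection logic of System 1 (filter by genre, then take the highest-scoring movies) can be sketched without pandas. The catalogue rows, scores, and the helper name `top_n_for_genres` below are hypothetical and only illustrate the idea, including the case of more than one favorite genre.

```python
# Hypothetical catalogue rows: (title, genres, score); all values are made up.
catalogue = [
    ("Movie A", ["Comedy", "Drama"], 4.1),
    ("Movie B", ["Action"], 9.0),
    ("Movie C", ["Drama", "War"], 7.3),
    ("Movie D", ["Romance"], 2.2),
]


def top_n_for_genres(rows, wanted, n=5):
    """Return the n highest-scoring titles that carry at least one wanted genre."""
    matches = [row for row in rows if any(g in row[1] for g in wanted)]
    matches.sort(key=lambda row: row[2], reverse=True)
    return [title for title, _, _ in matches[:n]]


single = top_n_for_genres(catalogue, ["Drama"])
multi = top_n_for_genres(catalogue, ["Drama", "Romance"], n=2)
print(single)  # ['Movie C', 'Movie A']
print(multi)   # ['Movie C', 'Movie A']
```

Using `any(...)` per movie keeps exactly one boolean per row, so the same filter works for one genre or several.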

 #### Output for classic movies

In [9]:
sel_class_movie = movies_system_1.sort_values(by=['classic_score'], ascending=False).iloc[0:5, :]
sel_class_movie

| MovieID | Title | Genres | rating_cnt | trending_score | classic_score |
|---|---|---|---|---|---|
| 2858 | American Beauty (1999) | [Comedy, Drama] | 3428.0 | -0.007857 | 41.116114 |
| 2028 | Saving Private Ryan (1998) | [Action, Drama, War] | 2653.0 | -0.016647 | 31.026225 |
| 593 | Silence of the Lambs, The (1991) | [Drama, Thriller] | 2578.0 | -0.020437 | 30.049784 |
| 608 | Fargo (1996) | [Crime, Drama, Thriller] | 2513.0 | -0.026076 | 29.203535 |
| 1196 | Star Wars: Episode V - The Empire Strikes Back... | [Action, Adventure, Drama, Sci-Fi, War] | 2990.0 | 0.009131 | 28.330959 |


 #### Output for trending movies

In [10]:
sel_trend_movie = movies_system_1.sort_values(by=['trending_score'], ascending=False).iloc[0:5, :]
sel_trend_movie

| MovieID | Title | Genres | rating_cnt | trending_score | classic_score |
|---|---|---|---|---|---|
| 2453 | Boy Who Could Fly, The (1986) | [Drama, Fantasy] | 176.0 | 5.853676 | -0.733417 |
| 3098 | Natural, The (1984) | [Drama] | 548.0 | 0.086908 | 2.896628 |
| 1196 | Star Wars: Episode V - The Empire Strikes Back... | [Action, Adventure, Drama, Sci-Fi, War] | 2990.0 | 0.009131 | 28.330959 |
| 3363 | American Graffiti (1973) | [Comedy, Drama] | 990.0 | 0.008511 | 7.500221 |
| 3683 | Blood Simple (1984) | [Drama, Film-Noir] | 628.0 | 0.003393 | 3.729857 |


 ----
 ## System 2: The Recommender based on users' preferences
 ### Import packages and datasets

In [11]:
import pandas as pd
import numpy as np

# The '::' separator is longer than one character, so the Python parser engine is used.
rating = pd.read_table('ratings.dat', sep='::', header=None, engine='python')
rating.columns = ['UserID', 'MovieID', 'Rating', 'Timestamp']

users = pd.read_table('users.dat', sep='::', header=None, engine='python')
users.columns = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

movies = pd.read_table('movies.dat', sep='::', header=None, encoding='latin-1', engine='python')
movies.columns = ['MovieID', 'Title', 'Genres']
movies.set_index('MovieID', inplace=True)
movies['Genres'] = movies['Genres'].apply(lambda x: x.split('|'))

genres_list = []
genres_list = list(set([x for sublist in movies['Genres'] for x in sublist]))
##### ["Children's", 'Film-Noir', 'Drama', 'Thriller', 'Action', 'War', 'Documentary', 'Sci-Fi', 'Animation', 'Western', 'Mystery', 'Fantasy', 'Musical', 'Comedy', 'Crime', 'Adventure', 'Horror', 'Romance']

 #### Use the Surprise package for model construction and testing
 Following the Surprise documentation, this function retrieves the top-n recommendations for each user.
 - Ref: https://surprise.readthedocs.io/en/stable/FAQ.html?highlight=predict#how-to-get-the-top-n-recommendations-for-each-user

In [12]:
from surprise import Reader
from surprise import Dataset
from surprise import accuracy

from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate
from surprise.model_selection import KFold

from surprise import KNNBasic
from surprise import SVD
from collections import defaultdict

 ##### Define the output selection for the top-n movies

In [13]:
def get_top_n(predictions, n=10):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
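
To see what `get_top_n` returns, here is a small usage sketch. The function is repeated so the snippet runs on its own, and the prediction tuples are made up; they only mimic the five-field shape (user id, item id, true rating, estimate, details) that the function unpacks.

```python
from collections import defaultdict


def get_top_n(predictions, n=10):
    # Same logic as the function above, repeated so this snippet is self-contained.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n


# Hypothetical predictions: (uid, iid, true rating, estimated rating, details).
fake_predictions = [
    (1, 10, 4.0, 3.9, None),
    (1, 11, 5.0, 4.7, None),
    (1, 12, 2.0, 2.5, None),
    (2, 10, 3.0, 3.2, None),
    (2, 13, 4.0, 4.4, None),
]

top = get_top_n(fake_predictions, n=2)
print(top[1])  # [(11, 4.7), (10, 3.9)]
print(top[2])  # [(13, 4.4), (10, 3.2)]
```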

 -----
 ### Split the dataset into training and testing sets (training : testing = 70% : 30%)

In [14]:
################################################################################
# Data Transformation & Split into training and testing set
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(rating[['UserID', 'MovieID', 'Rating']], reader)
training, testing = train_test_split(data, test_size = .3, random_state=9455)

 -----
 ### Method 1: K-Nearest Neighbors
 I first use the train-test split created above to fit a single model.
 The model is KNNBasic() from Surprise. It predicts a rating as the similarity-weighted average of the ratings given to item $i$ by the $k$ nearest neighbors of user $u$:
 $$
 \hat{r}_{ui} = \frac{\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot r_{vi}}{\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}
 $$

 ##### Parameters:
 - K: The default is 40, which might be a large value. I chose k=10 at this step.
 - Similarity measure:
   - User-based or item-based: The default is the user-based model
       * The similarities are computed between users, based on the ratings they gave to the same movies
       * This is more reasonable for this system, since we learn a new user's rating preferences when we make recommendations for them. Also, the rating records are the only data we can use for the measurement.
   - The algorithm measures similarity by the Mean Squared Difference (MSD): for a pair of users, it only takes the movies rated by both of them ($I_{uv}$) into consideration.
       * The distance between two users is measured as below, and the similarity is $\text{sim}(u, v) = \frac{1}{\text{msd}(u, v) + 1}$:
   $$
   \text{msd}(u, v) = \frac{1}{|I_{uv}|} \cdot\sum\limits_{i \in I_{uv}} (r_{ui} - r_{vi})^2
   $$
 (Surprise' Documentation 2021)
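
The two formulas can be checked by hand on a tiny example. The sketch below is a plain-Python illustration under simplifying assumptions, not the Surprise implementation: the users, items, and ratings are made up, and the library's handling of edge cases (for example, falling back to a default prediction when no neighbor is available) is reduced to returning `None`.

```python
# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "u": {"i1": 5, "i2": 3},
    "v": {"i1": 4, "i2": 3, "i3": 4},
    "w": {"i1": 1, "i2": 5, "i3": 2},
}


def msd_sim(a, b):
    """Similarity 1 / (msd + 1), using only the items rated by both users."""
    common = ratings[a].keys() & ratings[b].keys()
    if not common:
        return 0.0
    msd = sum((ratings[a][i] - ratings[b][i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


def predict(user, item, k=10):
    """Similarity-weighted average over the k most similar users who rated the item."""
    neighbours = [
        (msd_sim(user, other), r[item])
        for other, r in ratings.items()
        if other != user and item in r
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # simplification: no usable neighbor
    return sum(sim * r for sim, r in neighbours) / total


print(round(msd_sim("u", "v"), 4))    # 0.6667
print(round(predict("u", "i3"), 2))   # 3.76
```

User v agrees closely with u on the shared movies (similarity 2/3), while w disagrees (similarity 1/11), so the prediction for i3 lands much closer to v's rating of 4 than to w's rating of 2.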

 ##### Result of the single model:
 The approximate processing time is 12.7 secs (making predictions takes longer).
 The Root Mean Square Error (RMSE) on the testing set is 0.9618, and the Mean Absolute Error (MAE) is 0.7573.
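
For reference, the two error measures used throughout this report can be written out in a few lines. This is a minimal sketch on made-up (true rating, estimate) pairs; the helper names `rmse` and `mae` are local to the example and are not the Surprise functions.

```python
from math import sqrt


def rmse(pairs):
    """Root mean square error over (true rating, estimated rating) pairs."""
    return sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true rating, estimated rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Hypothetical (true, estimated) ratings.
pairs = [(4, 3.5), (5, 4.0), (2, 3.0), (3, 3.0)]
print(rmse(pairs))  # 0.75
print(mae(pairs))   # 0.625
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE; that is why it is the larger of the two numbers here and in the results of this report.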

In [15]:
################################################################################
# Method 1: K-Nearest Neighbors
model_knn = KNNBasic(k=10)
model_knn.fit(training)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7fa7d886e820>

In [16]:
# Make prediction for the testing set
knn_test_pred = model_knn.test(testing)
print(accuracy.rmse(knn_test_pred))
print(accuracy.mae(knn_test_pred))
# Retrieve the recommendation results
# knn_top_n = get_top_n(knn_test_pred, n=3)
# for uid, user_ratings in knn_top_n.items():
#     print(uid, [iid for (iid, _) in user_ratings])

RMSE: 0.9618
0.9617785819774314
MAE:  0.7573
0.757342363244218


 -----
 #### Try the cross-validation method built into Surprise
 (Surprise automatically splits the dataset into 5 folds of training and testing sets)

 Predicting each testing fold takes about one minute, which is very time-consuming for making recommendations.
 Besides, the RMSE over the 5 testing folds was above 0.95 on average, which is not a good result.
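
The idea behind 5-fold cross-validation can be sketched as follows: shuffle the row indices, cut them into 5 folds, and let each fold serve once as the testing set. This is a simplified illustration, not the library's implementation; the function name `k_fold_indices` and the `seed` parameter are assumptions of this example.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(len(train), len(test))  # 8 2 for every fold
```

Every row is used for testing exactly once and for training in the other four folds, which is why the fit and test times reported per fold add up to several minutes for the KNN model.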

In [17]:
# Try cross-validation
knn_cv_result = cross_validate(model_knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9556  0.9571  0.9553  0.9578  0.9565  0.9565  0.0009  
MAE (testset)     0.7526  0.7520  0.7517  0.7549  0.7535  0.7529  0.0011  
Fit time          15.86   16.27   15.65   15.55   15.95   15.86   0.25    
Test time         59.59   58.73   58.73   59.95   60.51   59.50   0.70    


 -----
 ### Method 2: Singular Value Decomposition
 I first use the train-test split created above to fit a single model.
 The model is SVD() from Surprise. With biases (the default), the prediction is:
 $$
 \hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u
 $$
 Without biases, it reduces to:
 $$
 \hat{r}_{ui} = q_i^Tp_u
 $$
 The algorithm minimizes the regularized squared error:
 $$
 \sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui} \right)^2 +\lambda\left(b_i^2 + b_u^2 + ||q_i||^2 + ||p_u||^2\right)
 $$

 ##### Parameters:
 - Learning rate: the default is 0.005
 - Regularization term: the default is 0.02
 - The number of factors is 100 and the number of iterations (epochs) is 20, which are also the defaults
 (Surprise' Documentation 2021)
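
To make the predictor and the objective more concrete, the sketch below computes one biased prediction and applies a single stochastic gradient descent update for one observed rating, using the learning rate and regularization values listed above. It is a hand-written illustration with two made-up factors per user and item, not the Surprise code; all starting values are hypothetical.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased prediction: mu + b_u + b_i + q_i . p_u"""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One gradient step on the regularized squared error for a single rating."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# Hypothetical starting values (2 factors instead of 100 to keep it readable).
mu, b_u, b_i = 3.5, 0.1, -0.2
p_u, q_i = [0.1, 0.2], [0.3, -0.1]
r_ui = 5.0

before = predict(mu, b_u, b_i, p_u, q_i)
b_u2, b_i2, p_u2, q_i2 = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u2, b_i2, p_u2, q_i2)
print(round(before, 2))  # 3.41
print(after > before)    # True: the estimate moved toward the observed rating
```

With a learning rate of 0.005 each step is small, which is why the model runs many passes (epochs) over all training ratings.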

 ##### Result of the single model:
 The approximate processing time is 36.7 secs.
 The Root Mean Square Error (RMSE) on the testing set is 0.88.

In [18]:
################################################################################
# Method 2: Singular Value Decomposition
model_svd = SVD()
model_svd.fit(training)

# Try to make prediction
svd_test_pred = model_svd.test(testing)
print(accuracy.rmse(svd_test_pred))
print(accuracy.mae(svd_test_pred))
# Retrieve the recommendation results
# svd_top_n = get_top_n(svd_test_pred, n=3)
# for uid, user_ratings in svd_top_n.items():
#     print(uid, [iid for (iid, _) in user_ratings])

RMSE: 0.8826
0.8825835198886199
MAE:  0.6932
0.6932297203684125


 -----
 #### Try the cross-validation method built into Surprise
 (Surprise automatically splits the dataset into 5 folds of training and testing sets)
 Fitting each fold takes about 38 secs, and the RMSE is about 0.87.

In [19]:
# Try cross-validation
svd_cv_result = cross_validate(model_svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8742  0.8753  0.8733  0.8726  0.8735  0.8738  0.0009  
MAE (testset)     0.6868  0.6873  0.6856  0.6846  0.6854  0.6859  0.0010  
Fit time          38.53   38.02   38.37   38.08   38.63   38.32   0.24    
Test time         1.50    1.70    1.57    1.58    1.19    1.51    0.17    


 -----
 ### Try the model: Get one user in the testing set

In [20]:
################################################################################
#### Try the model: Get one user in the testing set
uid = 1
test_uid = rating[rating['UserID'] == uid]
test_uid_d = Dataset.load_from_df(test_uid[['UserID', 'MovieID', 'Rating']], reader)
test_uid_d = test_uid_d.build_full_trainset()
test_uid_d = test_uid_d.build_testset()

# Result from KNN
pred_uid = model_knn.test(test_uid_d)
recommend_result = get_top_n(pred_uid, n=5)
print([iid for (iid, _) in recommend_result[uid]])

# Result from SVD
pred_uid = model_svd.test(test_uid_d)
recommend_result = get_top_n(pred_uid, n=5)
print([iid for (iid, _) in recommend_result[uid]])

recommend_result = [iid for (iid, _) in recommend_result[uid]]
recommend_movie = [movies.loc[x, 'Title'] for x in recommend_result]
recommend_movie

[1193, 2355, 1287, 2804, 1270]
[1035, 527, 595, 1207, 260]


['Sound of Music, The (1965)',
 "Schindler's List (1993)",
 'Beauty and the Beast (1991)',
 'To Kill a Mockingbird (1962)',
 'Star Wars: Episode IV - A New Hope (1977)']

 -----
 ## Save the results for further use
 I used Python's pickle module to save the models for building the recommender web app.

In [21]:
################################################################################
# Train the model with all records
data = data.build_full_trainset()
model_svd.fit(data)
model_knn.fit(data)

################################################################################
# Pickle the model results for the next run
import pickle
filename = 'model_knn.pk'
with open(filename, 'wb') as f:
    pickle.dump(model_knn, f)

filename = 'model_svd.pk'
with open(filename, 'wb') as f:
    pickle.dump(model_svd, f)

Computing the msd similarity matrix...
Done computing similarity matrix.
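
On the web app side, the saved files have to be read back before they can serve predictions. The round trip can be sketched as below; a plain dictionary stands in for the fitted model, and the file name `model_toy.pk` and the temporary directory are assumptions of this example, not the app's actual paths.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object.
toy_model = {"name": "toy", "global_mean": 3.58, "n_factors": 2}

# Save the object, mirroring the 'wb' pattern used above.
path = os.path.join(tempfile.mkdtemp(), "model_toy.pk")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Load it back in 'rb' mode, as a separate process would do at start-up.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)  # True
```

Unpickling can execute arbitrary code, so only files that the app itself produced should be loaded this way.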


 ----
 ## Reference
 - STAT542 Project 4 Guidelines on Campuswire
 - Stack Overflow
 - Surprise' Documentation. Welcome to Surprise' documentation! - Surprise 1 documentation. (2021). Retrieved December 12, 2021, from https://surprise.readthedocs.io/en/stable/index.html.
 - Dash documentation & user guide. Plotly. (2021). Retrieved December 12, 2021, from https://dash.plotly.com/.
 - Lasseter, A. (2021, April 28). Deploy a plotly dash app on Heroku. Medium. Retrieved December 13, 2021, from https://austinlasseter.medium.com/deploy-a-plotly-dash-app-on-heroku-4d2c3224230.