---
title: Recommender systems, a celebration of collaborative filtering and content filtering
---

*Group 15 - Weidong Xu, Jiejun Lu*

**Note:** You can view the source code for this project on [GitHub](https://github.com/xuwd11/Recommender_Systems), including the Python package [`recommender`](https://github.com/xuwd11/Recommender_Systems/tree/master/recommender) we wrote for this project.
<br>

## Contents
{:.no_toc}
*  
{: toc}

## Project Statement and Motivation

Recommender systems predict the rating a user would give to an item by learning from users' historical ratings and, when available, from the attributes of users and items. In this project, we constructed a recommender system for restaurants using an ensemble method that combines the predictions of several base estimators, including baseline estimators, collaborative filtering estimators, and content filtering estimators. We benchmarked these base estimators and then explored strategies for building the ensemble estimator. We demonstrated that our recommender system performs robustly on datasets of different sizes.
<br><br>

## Introduction and Description of Data

### Dataset

In this project, we use the "review" (shape: 4736897 × 9), "business" (shape: 156639 × 101) and "user" (shape: 1183362 × 22) tables from the [Yelp academic dataset](https://www.yelp.com/dataset/challenge). Each row in "review" describes a review that a user wrote for a restaurant (or another type of business, such as a barbershop), including the date, the comment ("text") and the rating ("stars"), as well as the number of votes of each type that the review received. "Business" contains information on the businesses appearing in "review", including typical attributes that characterize a restaurant and its average rating. "Business" has many missing values, mostly because attribute descriptions are missing. "User" contains information on users, including profile summaries, social networks on Yelp and average ratings. "Review", "business" and "user" are linked through "user\_id" and "business\_id". A peek at the raw data is available [here](https://github.com/xuwd11/Recommender_Systems/blob/master/01_rawdata_peek.ipynb).

### Data Wrangling

To wrangle the data for EDA and predictive modeling, we first checked for duplicate reviews (a user reviewing the same business more than once). We identified 1 case of duplication involving 2 reviews; we simply dropped one of them, since the two ratings happened to be the same. We then dropped businesses unrelated to restaurants as well as closed restaurants (~16.4% of rows in "business"), and kept the reviews and users associated with the remaining restaurants. Our EDA showed that ratings are correlated with many of the attributes in "business", "user" and "review", which inspired us to propose a content filtering model. We counted the restaurants in each city (there are 980 cities in the remaining dataset) and, for benchmarking, sampled several datasets of different sizes by extracting the data associated with restaurants in cities of different sizes (we chose Champaign, Cleveland, Pittsburgh, Toronto, and Las_Vegas).
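
The duplicate check described above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual cleaning code: the toy rows are invented, the column names are the ones named in this report, and keeping the first review of each duplicated pair is an assumption.

```python
import pandas as pd

# Toy stand-in for the "review" table; the rows are illustrative only.
review = pd.DataFrame({
    "user_id":     ["u1", "u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b1", "b2"],
    "stars":       [4, 4, 5, 2],
})

# A duplicate is a (user, business) pair that appears more than once.
pair = ["user_id", "business_id"]
n_duplicate_rows = int(review.duplicated(subset=pair, keep=False).sum())

# Keep one review per pair (assumption: the first occurrence is kept).
review_clean = (
    review.drop_duplicates(subset=pair, keep="first").reset_index(drop=True)
)
```

Dropping one row of a duplicated pair is only safe without further thought when the ratings agree, as they did here; otherwise one would need a rule such as keeping the most recent review.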

To build a recommender system, we can use collaborative filtering or content filtering. Collaborative filtering only needs the ratings each user gave to restaurants, which we obtain by keeping 3 columns of "review": "user\_id", "business\_id" and "stars". Content filtering requires a profile that characterizes each user or restaurant; we obtain the required data by merging "review" with "user" and "business" through "user\_id" and "business\_id" respectively.
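
As a rough sketch of the two inputs described above, the snippet below builds both from toy tables with pandas. The key columns come from this report; the profile columns `average_stars` and the business-level `stars`, as well as the `_business` suffix, are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the three tables; values are illustrative only.
review = pd.DataFrame({
    "user_id":     ["u1", "u2", "u2"],
    "business_id": ["b1", "b1", "b2"],
    "stars":       [5, 3, 4],
    "text":        ["great", "ok", "good"],
})
user = pd.DataFrame({"user_id": ["u1", "u2"], "average_stars": [4.5, 3.5]})
business = pd.DataFrame({"business_id": ["b1", "b2"], "stars": [4.0, 4.0]})

# Collaborative filtering: only (user, restaurant, rating) triples are needed.
ratings = review[["user_id", "business_id", "stars"]]

# Content filtering: attach the user and business profiles to every review.
# The suffix disambiguates "stars", which exists in both review and business.
features = (
    review
    .merge(user, on="user_id", how="left")
    .merge(business, on="business_id", how="left", suffixes=("", "_business"))
)
```

A left join keeps every review even if a profile row is missing, which leaves missing values to be handled later rather than silently dropping ratings.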
<br><br>

## Literature Review/Related Work

There are primarily [2 strategies](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf) for recommender systems: content filtering and [collaborative filtering](http://files.grouplens.org/papers/FnT%20CF%20Recsys%20Survey.pdf). Content filtering requires a profile for each user or restaurant, which captures its nature, and makes predictions by learning from these profiles. Collaborative filtering makes predictions by analyzing only past user behavior, such as restaurant ratings. Although collaborative filtering suffers from the cold start problem, i.e., it cannot handle new users and restaurants, it is extremely helpful when the user or restaurant profile data required for content filtering are not available, and it is generally more accurate than content filtering methods.

Collaborative filtering approaches include neighborhood methods and latent factor models. As demonstrated by the [Netflix Prize competition](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf), latent factor models based on matrix factorization generally perform better than neighborhood methods.
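
To make the idea of a matrix factorization latent factor model concrete, here is a minimal alternating least squares (ALS) sketch in `numpy`. It is a generic textbook-style illustration under assumed settings (dense toy matrix, rank 2, L2 penalty 0.1, global-mean centering); the function name `als_factorize` is hypothetical and this is not the `ALS1` or `ALS2` implementation from the `recommender` package.

```python
import numpy as np

def als_factorize(R, mask, rank=2, reg=0.1, n_iters=20, seed=0):
    """Factor R ~ P @ Q.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    eye = np.eye(rank)
    for _ in range(n_iters):
        # Fix Q and solve a small ridge regression for each user's factors.
        for u in range(n_users):
            obs = mask[u]
            if obs.any():
                Qo = Q[obs]
                P[u] = np.linalg.solve(Qo.T @ Qo + reg * eye, Qo.T @ R[u, obs])
        # Fix P and solve for each item's factors in the same way.
        for i in range(n_items):
            obs = mask[:, i]
            if obs.any():
                Po = P[obs]
                Q[i] = np.linalg.solve(Po.T @ Po + reg * eye, Po.T @ R[obs, i])
    return P, Q

# Toy 4 x 3 rating matrix; 0 marks a missing rating.
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5],
              [0, 1, 4]], dtype=float)
mask = R > 0
mean = R[mask].mean()

# Factor the mean-centered ratings, then add the mean back when predicting.
P, Q = als_factorize(np.where(mask, R - mean, 0.0), mask)
pred = mean + P @ Q.T
train_rmse = float(np.sqrt(np.mean((pred[mask] - R[mask]) ** 2)))
```

Each inner solve is an exact least squares update for one user or one item, which is why the method alternates between the two factor matrices instead of optimizing them jointly. Real rating data would be stored as a sparse matrix rather than a dense array with a mask.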

We implemented some baseline models and latent factor models from scratch (using the linear algebra toolkits of `numpy` and `scipy` instead of well-established recommender system packages); we implemented the other algorithms by wrapping methods from a Python recommender system package, [`scikit-surprise`](http://surpriselib.com/). Each algorithm we implemented using the [`scikit-surprise`](http://surpriselib.com/) package is marked with a * after its name.

Besides reporting the root mean square error (RMSE) and the $R^2$ score, we found it very helpful to visualize a model's performance across rating levels by rounding the predicted ratings to integers (setting ratings below 1 to 1 and above 5 to 5) and plotting the confusion matrix in the format used in [a related work](https://github.com/kevin11h/YelpDatasetChallengeDataScienceAndMachineLearningUCSD/blob/master/Yelp%20Predictive%20Analytics.ipynb).
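
The evaluation described above can be sketched with plain `numpy`. The helper names below (`rmse`, `r2`, `to_star_labels`, `confusion`) are hypothetical and the toy predictions are invented; the project itself imports the corresponding metrics from `sklearn.metrics`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def to_star_labels(y_pred):
    """Round to the nearest integer, then clip to the 1..5 star range."""
    return np.clip(np.rint(y_pred), 1, 5).astype(int)

def confusion(y_true, y_label, n_classes=5):
    """Rows are true stars, columns are predicted stars (both 1..5)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true - 1, y_label - 1), 1)
    return cm

y_true = np.array([1, 2, 3, 4, 5, 5])
y_pred = np.array([0.2, 2.4, 3.6, 4.1, 5.7, 4.4])

labels = to_star_labels(y_pred)  # 0.2 is clipped up to 1, 5.7 down to 5
cm = confusion(y_true, labels)
```

Note that $R^2$ becomes negative whenever a model's squared error exceeds that of always predicting the mean rating, which is how the negative test $R^2$ values in the summary tables should be read.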

A comprehensive list of references can be found in [References](09_reference.html).
<br><br>

## Result Summary

We benchmarked the base estimators and ensemble estimators on 6 datasets of different sizes. In each dataset, we randomly split the reviews into 3 sets: a training set (60%), a cross-validation set (16%) and a test set (24%). We trained the base estimators on the training set and evaluated them on the test set; the cross-validation set was used to train the ensemble estimators. All experiments were run on a desktop with an Intel Xeon CPU (3.10 GHz) and 256 GB of RAM.
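
The protocol above can be illustrated with a small `numpy` sketch: shuffle the reviews, cut them 60/16/24, and fit a stacking model on the cross-validation part using the base estimators' predictions as features. Everything here is an assumption made for illustration: the synthetic ratings, the two noisy stand-ins for base estimators, the closed-form ridge stacker, and the helper names `split_indices`, `fit_ridge_stacker` and `predict_stacker`. It is not the `RS_ensemble` implementation.

```python
import numpy as np

def split_indices(n, fractions=(0.60, 0.16, 0.24), seed=0):
    """Shuffle 0..n-1 and cut it into train / cross-validation / test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_cv = int(round(fractions[1] * n))
    return order[:n_train], order[n_train:n_train + n_cv], order[n_train + n_cv:]

def fit_ridge_stacker(base_preds, y, alpha=1.0):
    """Closed-form ridge regression on base predictions (intercept unpenalized)."""
    X = np.column_stack([np.ones(len(y)), base_preds])
    penalty = alpha * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def predict_stacker(weights, base_preds):
    """Apply the learned intercept and per-estimator weights."""
    return weights[0] + base_preds @ weights[1:]

# Synthetic ratings and two noisy "base estimators" (illustrative only).
rng = np.random.default_rng(1)
y = rng.integers(1, 6, size=1000).astype(float)
base = np.column_stack([y + rng.normal(0, 1.0, 1000),
                        y + rng.normal(0, 1.5, 1000)])

train_idx, cv_idx, test_idx = split_indices(len(y))

# Real base estimators would be fit on train_idx; the stacker is fit on cv_idx
# so that it learns from predictions the base estimators did not train on.
w = fit_ridge_stacker(base[cv_idx], y[cv_idx])
test_pred = predict_stacker(w, base[test_idx])
test_rmse = float(np.sqrt(np.mean((test_pred - y[test_idx]) ** 2)))
```

Fitting the combiner on held-out predictions matters because base estimators such as KNN fit the training set far better than unseen data, so weights learned on training-set predictions would overstate their reliability.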

Please see [EDA](03_EDA.html), [Collaborative Filtering](04_collaborative_filtering.html) and [Content Filtering](05_content_filtering.html) for modeling approaches and project trajectory.

Please see [Results](07_results.html) for detailed results including confusion matrix visualization (of rounded predictions) for each experiment we list below.

Please see [Conclusions](08_conclusions.html) for conclusions and future work.
<br><br>

```python
import sys
import traceback
import pandas as pd
import numpy as np
import time
from copy import deepcopy

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

from sklearn.linear_model import Ridge, RidgeCV
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

from scipy import sparse

import pickle

from IPython.display import display, HTML, Markdown

import matplotlib
import matplotlib.pyplot as plt

import seaborn as sns
pd.set_option('display.width', 15000)
pd.set_option('display.max_columns', 100)
sns.set_style("whitegrid", {'axes.grid' : False})
sns.set_context('poster')
%matplotlib inline

from surprise import Dataset, Reader
from surprise import NormalPredictor, BaselineOnly, SVD, SVDpp, NMF, \
SlopeOne, CoClustering, KNNBasic, KNNWithMeans, KNNBaseline

from recommender import plot_cm, get_results, show_results, IO, \
show_summaries, get_base_predictions, get_multi_base_predictions
from recommender import ModeClassifier, BaselineMean, BaselineRegression, ALS1, ALS2, RS_surprise, RS_ensemble
```

```python
cities = ['Champaign', 'Cleveland', 'Pittsburgh', 'Toronto', 'Las_Vegas', 'Full']

# One entry per summary table: (results subdirectory, table title, footnote or None).
sections = [
    ('results', 'Collaborative filtering',
     '<sup>(* shows the algorithms we implemented by wrapping around '
     'methods in scikit-surprise python package)</sup>'),
    ('results05', 'Content filtering', None),
    ('results06', 'Ensemble estimators',
     '<sup>(Ensemble1 represents the ensemble of collaborative filtering models; '
     'Ensemble2 represents the ensemble of collaborative filtering and '
     'content filtering models)</sup>'),
]

for city in cities:
    data_dir = 'data/{}/'.format(city)

    # Heading with the dataset size: (reviews, restaurants, users).
    sizes = IO(data_dir + 'sizes.pkl').read_pickle()
    display(Markdown('### {} <sup>({} reviews, {} restaurants, {} users)</sup>'
                     .format(city, sizes[0], sizes[1], sizes[2])))

    for subdir, title, footnote in sections:
        # Load the pickled benchmark results saved for this group of models.
        result_dir = data_dir + subdir + '/'
        model_names = IO(result_dir + 'model_names.pkl').read_pickle()
        results = IO(result_dir + 'results.pkl').read_pickle()
        is_successful = IO(result_dir + 'is_successful.pkl').read_pickle()

        show_summaries(model_names, results, is_successful, title=title)
        if footnote is not None:
            display(Markdown(footnote))

    display(Markdown('<br><br>'))
```

### Champaign <sup>(20571 reviews, 878 restaurants, 8451 users)</sup>

| Collaborative filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Mode estimator | 0.0 | 1.9995 | 2.0258 | -0.9501 | -0.95 |
| Normal predictor\* | 0.087 | 1.8825 | 1.8821 | -0.7286 | -0.6833 |
| Baseline (mean) | 0.019 | 0.9485 | 1.4648 | 0.5612 | -0.0195 |
| Baseline (regression) | 0.035 | 1.0481 | 1.3032 | 0.4642 | 0.193 |
| Baseline (ALS)\* | 0.057 | 1.1981 | 1.32 | 0.2998 | 0.1721 |
| KNN (basic)\* | 0.9841 | 0.4328 | 1.4642 | 0.9086 | -0.0187 |
| KNN (with means)\* | 1.2851 | 0.5898 | 1.531 | 0.8303 | -0.1138 |
| KNN (baseline)\* | 1.0201 | 0.4175 | 1.3718 | 0.915 | 0.1058 |
| SVD-ALS1 | 12.2077 | 0.6747 | 1.3064 | 0.778 | 0.1891 |
| SVD-ALS2 | 12.9087 | 0.6764 | 1.3092 | 0.7768 | 0.1855 |

<sup>(* shows the algorithms we implemented by wrapping around methods in scikit-surprise python package)</sup>

| Content filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ridge regression | 0.069 | 1.0773 | 1.0971 | 0.4339 | 0.428 |
| Random forest | 1.0951 | 1.0262 | 1.0862 | 0.4864 | 0.4394 |


| Ensemble estimators | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ensemble1 (weighted average) | 0.0 | 0.8527 | 1.3071 | 0.6454 | 0.1882 |
| Ensemble1 (Ridge regression) | 0.011 | 1.3268 | 1.3026 | 0.1413 | 0.1937 |
| Ensemble1 (random forest) | 0.22 | 1.0506 | 1.3048 | 0.4617 | 0.191 |
| Ensemble2 (weighted average) | 0.0 | 0.9007 | 1.1591 | 0.6043 | 0.3616 |
| Ensemble2 (Ridge regression) | 0.004 | 1.2721 | 1.083 | 0.2107 | 0.4426 |
| Ensemble2 (random forest) | 0.271 | 1.0678 | 1.0847 | 0.4439 | 0.4409 |

<sup>(Ensemble1 represents the ensemble of collaborative filtering models; Ensemble2 represents the ensemble of collaborative filtering and content filtering models)</sup>

<br><br>

### Cleveland <sup>(75932 reviews, 2500 restaurants, 30131 users)</sup>

| Collaborative filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Mode estimator | 0.0 | 1.8152 | 1.8262 | -0.8226 | -0.8371 |
| Normal predictor\* | 0.225 | 1.7514 | 1.7529 | -0.6968 | -0.6926 |
| Baseline (mean) | 0.055 | 0.8908 | 1.3417 | 0.561 | 0.0084 |
| Baseline (regression) | 0.111 | 0.987 | 1.2051 | 0.4611 | 0.2 |
| Baseline (ALS)\* | 0.279 | 1.1171 | 1.217 | 0.3097 | 0.1841 |
| KNN (basic)\* | 13.4688 | 0.3952 | 1.3484 | 0.9136 | -0.0016 |
| KNN (with means)\* | 14.3168 | 0.56 | 1.402 | 0.8265 | -0.0829 |
| KNN (baseline)\* | 13.0127 | 0.3837 | 1.2612 | 0.9186 | 0.1237 |
| SVD-ALS1 | 41.9954 | 0.5721 | 1.2095 | 0.819 | 0.1941 |
| SVD-ALS2 | 44.2825 | 0.574 | 1.2121 | 0.8177 | 0.1907 |

<sup>(* shows the algorithms we implemented by wrapping around methods in scikit-surprise python package)</sup>

| Content filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ridge regression | 0.286 | 1.0195 | 1.0313 | 0.4251 | 0.4141 |
| Random forest | 4.7943 | 0.9929 | 1.0155 | 0.4546 | 0.432 |


| Ensemble estimators | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ensemble1 (weighted average) | 0.0 | 0.7559 | 1.2085 | 0.6839 | 0.1955 |
| Ensemble1 (Ridge regression) | 0.005 | 0.9587 | 1.204 | 0.4916 | 0.2014 |
| Ensemble1 (random forest) | 0.726 | 0.9431 | 1.2063 | 0.508 | 0.1984 |
| Ensemble2 (weighted average) | 0.0 | 0.8214 | 1.0915 | 0.6268 | 0.3437 |
| Ensemble2 (Ridge regression) | 0.007 | 1.0072 | 1.0141 | 0.4389 | 0.4334 |
| Ensemble2 (random forest) | 0.9471 | 0.9989 | 1.018 | 0.448 | 0.4291 |

<sup>(Ensemble1 represents the ensemble of collaborative filtering models; Ensemble2 represents the ensemble of collaborative filtering and content filtering models)</sup>

<br><br>

### Pittsburgh <sup>(143682 reviews, 4745 restaurants, 46179 users)</sup>

| Collaborative filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Mode estimator | 0.0 | 1.8026 | 1.7988 | -0.8466 | -0.8393 |
| Normal predictor\* | 0.455 | 1.7307 | 1.7303 | -0.7022 | -0.7017 |
| Baseline (mean) | 0.102 | 0.9052 | 1.3198 | 0.5343 | 0.0099 |
| Baseline (regression) | 0.2 | 0.9941 | 1.1878 | 0.4384 | 0.198 |
| Baseline (ALS)\* | 0.578 | 1.1119 | 1.202 | 0.2974 | 0.1788 |
| SVD-ALS1 | 79.5536 | 0.5627 | 1.196 | 0.82 | 0.187 |
| SVD-ALS2 | 82.5507 | 0.5651 | 1.201 | 0.8185 | 0.1801 |
| SVD-SGD\* | 7.5524 | 0.8267 | 1.2046 | 0.6116 | 0.1752 |
| SVD++-SGD\* | 43.6945 | 0.8738 | 1.2025 | 0.5661 | 0.178 |
| NMF-SGD\* | 9.3785 | 0.3666 | 1.3761 | 0.9236 | -0.0765 |

<sup>(* shows the algorithms we implemented by wrapping around methods in scikit-surprise python package)</sup>

| Content filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ridge regression | 0.55 | 1.0158 | 1.0062 | 0.4135 | 0.4245 |
| Random forest | 10.1126 | 0.9938 | 0.9896 | 0.4388 | 0.4434 |


| Ensemble estimators | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ensemble1 (weighted average) | 0.0 | 0.805 | 1.1919 | 0.6317 | 0.1925 |
| Ensemble1 (Ridge regression) | 0.015 | 0.9371 | 1.1872 | 0.501 | 0.1988 |
| Ensemble1 (random forest) | 1.2871 | 0.9703 | 1.1882 | 0.465 | 0.1974 |
| Ensemble2 (weighted average) | 0.0 | 0.8576 | 1.0612 | 0.582 | 0.3598 |
| Ensemble2 (Ridge regression) | 0.017 | 0.9494 | 0.9882 | 0.4878 | 0.4449 |
| Ensemble2 (random forest) | 1.7071 | 0.9996 | 0.9933 | 0.4322 | 0.4391 |

<sup>(Ensemble1 represents the ensemble of collaborative filtering models; Ensemble2 represents the ensemble of collaborative filtering and content filtering models)</sup>

<br><br>

### Toronto <sup>(331407 reviews, 12118 restaurants, 77506 users)</sup>

| Collaborative filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Mode estimator | 0.0 | 1.883 | 1.8801 | -1.1173 | -1.1219 |
| Normal predictor\* | 1.0181 | 1.7034 | 1.7099 | -0.7326 | -0.7552 |
| Baseline (mean) | 0.236 | 0.9293 | 1.2911 | 0.4843 | -0.0006 |
| Baseline (regression) | 0.627 | 0.9918 | 1.1624 | 0.4126 | 0.189 |
| Baseline (ALS)\* | 1.6651 | 1.0916 | 1.173 | 0.2884 | 0.174 |
| SVD-ALS1 | 168.3716 | 0.5614 | 1.1751 | 0.8118 | 0.1711 |
| SVD-ALS2 | 169.0347 | 0.5634 | 1.1795 | 0.8104 | 0.1649 |
| SVD-SGD\* | 17.636 | 0.8222 | 1.1772 | 0.5963 | 0.1681 |
| SVD++-SGD\* | 119.8469 | 0.873 | 1.1763 | 0.5449 | 0.1694 |
| NMF-SGD\* | 22.2953 | 0.4094 | 1.3369 | 0.8999 | -0.0729 |

<sup>(* shows the algorithms we implemented by wrapping around methods in scikit-surprise python package)</sup>

| Content filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ridge regression | 1.1401 | 1.0049 | 1.0035 | 0.397 | 0.3955 |
| Random forest | 27.0035 | 0.9891 | 0.9909 | 0.4158 | 0.4106 |


| Ensemble estimators | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ensemble1 (weighted average) | 0.0 | 0.8044 | 1.1661 | 0.6136 | 0.1837 |
| Ensemble1 (Ridge regression) | 0.029 | 1.0152 | 1.162 | 0.3845 | 0.1895 |
| Ensemble1 (random forest) | 3.2312 | 0.9977 | 1.1637 | 0.4056 | 0.1872 |
| Ensemble2 (weighted average) | 0.0 | 0.8564 | 1.0542 | 0.562 | 0.3329 |
| Ensemble2 (Ridge regression) | 0.039 | 1.0111 | 0.9879 | 0.3895 | 0.4141 |
| Ensemble2 (random forest) | 4.1512 | 0.9948 | 0.9962 | 0.409 | 0.4043 |

<sup>(Ensemble1 represents the ensemble of collaborative filtering models; Ensemble2 represents the ensemble of collaborative filtering and content filtering models)</sup>

<br><br>

### Las_Vegas <sup>(1280896 reviews, 20434 restaurants, 429363 users)</sup>

| Collaborative filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Mode estimator | 0.0 | 1.906 | 1.9073 | -0.7549 | -0.7578 |
| Normal predictor\* | 5.3933 | 1.8565 | 1.8573 | -0.6649 | -0.6667 |
| Baseline (mean) | 1.1371 | 0.999 | 1.4148 | 0.5179 | 0.0329 |
| Baseline (regression) | 5.9963 | 1.0732 | 1.2612 | 0.4436 | 0.2314 |
| Baseline (ALS)\* | 6.9174 | 1.188 | 1.2696 | 0.3182 | 0.2211 |
| SVD-ALS1 | 652.1173 | 0.4264 | 1.2794 | 0.9122 | 0.2091 |
| SVD-ALS2 | 674.9286 | 0.4283 | 1.2862 | 0.9114 | 0.2007 |
| SVD-SGD\* | 70.622 | 0.7758 | 1.2827 | 0.7093 | 0.205 |
| SVD++-SGD\* | 333.8711 | 0.8046 | 1.302 | 0.6873 | 0.1809 |
| NMF-SGD\* | 92.4893 | 0.4178 | 1.4916 | 0.9157 | -0.075 |

<sup>(* shows the algorithms we implemented by wrapping around methods in scikit-surprise python package)</sup>

| Content filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ridge regression | 5.0093 | 1.1216 | 1.1226 | 0.3923 | 0.3911 |
| Random forest | 154.2278 | 1.1008 | 1.1029 | 0.4146 | 0.4122 |


| Ensemble estimators | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ensemble1 (weighted average) | 0.0 | 0.7633 | 1.2648 | 0.7185 | 0.227 |
| Ensemble1 (Ridge regression) | 0.127 | 1.0514 | 1.26 | 0.466 | 0.2329 |
| Ensemble1 (random forest) | 13.2948 | 1.0838 | 1.2617 | 0.4326 | 0.2308 |
| Ensemble2 (weighted average) | 0.0 | 0.8674 | 1.1697 | 0.6366 | 0.3389 |
| Ensemble2 (Ridge regression) | 0.152 | 1.1144 | 1.1016 | 0.4001 | 0.4136 |
| Ensemble2 (random forest) | 16.93 | 1.1061 | 1.1082 | 0.409 | 0.4066 |

<sup>(Ensemble1 represents the ensemble of collaborative filtering models; Ensemble2 represents the ensemble of collaborative filtering and content filtering models)</sup>

<br><br>

### Full <sup>(4166778 reviews, 131025 restaurants, 1117891 users)</sup>

| Collaborative filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Mode estimator | 0.0 | 1.8974 | 1.8985 | -0.7803 | -0.7799 |
| Normal predictor\* | 19.3631 | 1.8394 | 1.8405 | -0.6729 | -0.6727 |
| Baseline (mean) | 4.6053 | 1.0178 | 1.4063 | 0.4878 | 0.0234 |
| Baseline (regression) | 21.4822 | 1.0642 | 1.2529 | 0.44 | 0.2248 |
| Baseline (ALS)\* | 27.2096 | 1.1754 | 1.2659 | 0.3169 | 0.2086 |
| SVD-ALS1 | 2153.7902 | 0.5313 | 1.2691 | 0.8604 | 0.2046 |
| SVD-ALS2 | 2268.9128 | 0.5332 | 1.2756 | 0.8594 | 0.1965 |
| SVD-SGD\* | 242.1499 | 0.8312 | 1.2721 | 0.6584 | 0.2008 |
| SVD++-SGD\* | 1473.2923 | 0.8713 | 1.2784 | 0.6246 | 0.193 |
| NMF-SGD\* | 323.9235 | 0.4277 | 1.4656 | 0.9095 | -0.0607 |

<sup>(* shows the algorithms we implemented by wrapping around methods in scikit-surprise python package)</sup>

| Content filtering | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ridge regression | 17.161 | 1.0857 | 1.0869 | 0.4171 | 0.4167 |
| Random forest | 663.2849 | 1.0639 | 1.0653 | 0.4403 | 0.4396 |


| Ensemble estimators | fitting time (s) | train RMSE | test RMSE | train $R^2$ | test $R^2$ |
|---|---|---|---|---|---|
| Ensemble1 (weighted average) | 0.0 | 0.8161 | 1.2572 | 0.6706 | 0.2195 |
| Ensemble1 (Ridge regression) | 0.416 | 1.055 | 1.252 | 0.4496 | 0.226 |
| Ensemble1 (random forest) | 56.0282 | 1.0652 | 1.2542 | 0.4389 | 0.2233 |
| Ensemble2 (weighted average) | 0.0 | 0.8864 | 1.1434 | 0.6115 | 0.3545 |
| Ensemble2 (Ridge regression) | 0.553 | 1.0865 | 1.0623 | 0.4163 | 0.4427 |
| Ensemble2 (random forest) | 69.695 | 1.0689 | 1.0703 | 0.4351 | 0.4343 |

<sup>(Ensemble1 represents the ensemble of collaborative filtering models; Ensemble2 represents the ensemble of collaborative filtering and content filtering models)</sup>

<br><br>

## Authors

* [Weidong Xu](https://github.com/xuwd11)

* [Jiejun Lu](https://github.com/gwungwun)