A comparison of several algorithms available in the scikit-surprise (Surprise) library: Singular Value Decomposition (SVD), and baseline estimates fitted with Stochastic Gradient Descent (SGD) or Alternating Least Squares (ALS). The dataset is the MovieLens movie ratings dataset from grouplens.org, in its 1M version (about one million ratings).

In [1]:
%matplotlib inline
import numpy 
import pandas

from surprise import Dataset, Reader, accuracy
from surprise import SVD, BaselineOnly, KNNBasic
from surprise.model_selection import cross_validate, train_test_split

In [2]:
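# A multi-character separator such as '::' needs the Python parsing engine;
# naming it explicitly avoids the ParserWarning pandas emits when it falls back.
ratings_data = pandas.read_table('ml-1m/ratings.dat', sep='::', engine='python',
                                 names=['user_id', 'movie_id', 'rating', 'time'])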


In [3]:
print(ratings_data.head())

   user_id  movie_id  rating       time
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
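To make the file layout concrete, the following is a minimal sketch of how one '::'-delimited line maps to the four columns shown above. The sample lines are taken from the first two rows of the output, and the helper name parse_rating_line is hypothetical; the notebook itself relies on pandas for this step.

```python
# Two sample lines in the ratings.dat layout: user_id::movie_id::rating::time
# (values copied from the first two rows printed above).
sample_lines = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
]


def parse_rating_line(line):
    """Split one '::'-delimited line into four integer fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


rows = [parse_rating_line(line) for line in sample_lines]
print(rows[0])
```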


In [4]:
movie_data = ratings_data[['user_id', 'movie_id', 'rating']]
# load_from_df only uses the reader's rating_scale; the ratings here run from 1 to 5.
reader = Reader(rating_scale=(1, 5))
movie_data = Dataset.load_from_df(movie_data, reader=reader)

In [5]:
training_set, testing_set = train_test_split(movie_data, test_size=0.2)
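The hold-out split above is random, so the numbers reported later will vary slightly from run to run unless a seed is fixed (Surprise's train_test_split accepts a random_state argument for that purpose). As a rough illustration of what an 80/20 split does, here is a sketch in plain Python; holdout_split and the toy ratings are hypothetical and not part of Surprise.

```python
import random


def holdout_split(ratings, test_size=0.2, seed=0):
    """Shuffle a list of ratings and cut it into (train, test) parts."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = int(round(test_size * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


# Ten invented (user, item, rating) triples.
toy_ratings = [(u, i, 3) for u in range(5) for i in range(2)]
train_part, test_part = holdout_split(toy_ratings, test_size=0.2, seed=42)
print(len(train_part), len(test_part))
```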

Using SGD (BaselineOnly with baselines fitted by SGD)

In [6]:
baseline_options = {'method' : 'sgd', 'learning_rate': 0.001}
algorithm = BaselineOnly(bsl_options=baseline_options)
cross_validate(algorithm, movie_data, measures=['RMSE', 'MAE'], cv=4, verbose=True)

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Evaluating RMSE, MAE of algorithm BaselineOnly on 4 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Mean    Std     
RMSE (testset)    0.9140  0.9133  0.9144  0.9115  0.9133  0.0011  
MAE (testset)     0.7247  0.7235  0.7247  0.7227  0.7239  0.0009  
Fit time          5.10    5.77    5.79    5.65    5.58    0.28    
Test time         1.60    2.55    2.57    2.55    2.32    0.42    


{'test_rmse': array([0.9140353 , 0.91326707, 0.91437977, 0.9115154 ]),
 'test_mae': array([0.72470027, 0.72352706, 0.72472232, 0.72266963]),
 'fit_time': (5.097360372543335,
  5.770519495010376,
  5.794794321060181,
  5.645847320556641),
 'test_time': (1.599212646484375,
  2.5527093410491943,
  2.5704715251922607,
  2.551476240158081)}
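The Mean and Std columns of the table can be recomputed from the per-fold values in the returned dictionary. The sketch below uses the four 'test_rmse' values printed above; treating Std as the population standard deviation (dividing by the number of folds) is an assumption, but it is the variant that reproduces the printed figure.

```python
import statistics

# Per-fold RMSE values copied from the 'test_rmse' array above.
fold_rmse = [0.9140353, 0.91326707, 0.91437977, 0.9115154]

mean_rmse = statistics.fmean(fold_rmse)
# Population standard deviation: divides by the number of folds, not folds - 1.
std_rmse = statistics.pstdev(fold_rmse)

print(f"{mean_rmse:.4f} {std_rmse:.4f}")
```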

In [7]:
algorithm.fit(training_set)
predictions = algorithm.test(testing_set)
accuracy.rmse(predictions)

Estimating biases using sgd...
RMSE: 0.9119


0.9119431263461393
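For intuition about what the SGD option fits, a baseline model predicts a rating as the global mean plus a user bias and an item bias, and SGD nudges both biases after every observed rating. The following is a minimal sketch under that assumption, not Surprise's actual implementation. The toy ratings are invented, and the learning rate, regularization and epoch count are illustrative values that differ from the ones used in the notebook.

```python
import math

# Toy (user, item, rating) triples, invented for illustration.
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 4),
    ("u2", "i1", 2), ("u2", "i2", 1),
    ("u3", "i1", 4), ("u3", "i2", 3),
]


def fit_baselines_sgd(ratings, learning_rate=0.05, reg=0.02, n_epochs=100):
    """Estimate the global mean and user/item biases by SGD."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_user = {u: 0.0 for u, _, _ in ratings}
    b_item = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            # Error of the current prediction mu + b_u + b_i.
            err = r - (mu + b_user[u] + b_item[i])
            # Gradient step on the regularized squared error.
            b_user[u] += learning_rate * (err - reg * b_user[u])
            b_item[i] += learning_rate * (err - reg * b_item[i])
    return mu, b_user, b_item


def train_rmse(ratings, mu, b_user, b_item):
    """RMSE of the baseline predictions on the given ratings."""
    sq = [(r - (mu + b_user[u] + b_item[i])) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(sq) / len(sq))


mu, b_user, b_item = fit_baselines_sgd(ratings)
rmse_mean_only = train_rmse(ratings, mu, dict.fromkeys(b_user, 0.0),
                            dict.fromkeys(b_item, 0.0))
rmse_fitted = train_rmse(ratings, mu, b_user, b_item)
print(rmse_mean_only, rmse_fitted)
```

On this toy data the fitted biases should bring the training error well below that of predicting the global mean alone, because the ratings were constructed to be almost purely additive in user and item effects.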

Using ALS (KNNBasic with pearson_baseline similarity, baselines fitted by ALS)

In [8]:
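# 'learning_rate' is an SGD-only option; ALS reads 'reg_u', 'reg_i' and 'n_epochs',
# so the value given here is ignored and ALS runs with its default settings.
baseline_options = {'method' : 'als', 'learning_rate': 16}
# KNNBasic uses the baselines only inside the pearson_baseline similarity, not in its predictions.
similarity_options = {'name': 'pearson_baseline'}
algorithm = KNNBasic(bsl_options=baseline_options, sim_options=similarity_options)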
cross_validate(algorithm, movie_data, measures=['RMSE', 'MAE'], cv=4, verbose=True)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 4 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Mean    Std     
RMSE (testset)    0.9307  0.9296  0.9300  0.9296  0.9300  0.0004  
MAE (testset)     0.7403  0.7397  0.7399  0.7386  0.7396  0.0006  
Fit time          64.14   64.73   65.29   65.88   65.01   0.65    
Test time         144.19  143.42  142.36  142.88  143.22  0.68    


{'test_rmse': array([0.93068408, 0.92959469, 0.93001881, 0.9295874 ]),
 'test_mae': array([0.74034448, 0.73972784, 0.73989412, 0.73863138]),
 'fit_time': (64.14145874977112,
  64.73085975646973,
  65.29405641555786,
  65.87912058830261),
 'test_time': (144.1949439048767,
  143.42399096488953,
  142.36009001731873,
  142.88420248031616)}

In [9]:
algorithm.fit(training_set)
predictions = algorithm.test(testing_set)
accuracy.rmse(predictions)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.9260


0.9260466800891254
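The ALS option fits the same kind of user and item biases as SGD, but instead of taking gradient steps it alternates between two closed-form updates: item biases with user biases held fixed, then user biases with item biases held fixed. Below is a minimal sketch of that idea, assuming the usual regularized-average update. The toy ratings and the regularization values are invented for illustration and are not the settings used by the notebook.

```python
from collections import defaultdict

# Toy (user, item, rating) triples, invented for illustration.
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 4),
    ("u2", "i1", 2), ("u2", "i2", 1),
    ("u3", "i1", 4), ("u3", "i2", 3),
]


def fit_baselines_als(ratings, reg_u=1.0, reg_i=1.0, n_epochs=10):
    """Estimate the global mean and user/item biases by alternating updates."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    b_user = {u: 0.0 for u in by_user}
    b_item = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Item step: shrunken average residual, user biases held fixed.
        for i, entries in by_item.items():
            dev = sum(r - mu - b_user[u] for u, r in entries)
            b_item[i] = dev / (reg_i + len(entries))
        # User step: shrunken average residual, item biases held fixed.
        for u, entries in by_user.items():
            dev = sum(r - mu - b_item[i] for i, r in entries)
            b_user[u] = dev / (reg_u + len(entries))
    return mu, b_user, b_item


mu, b_user, b_item = fit_baselines_als(ratings)
print(b_user)
print(b_item)
```

The regularization terms in the denominators pull each bias toward zero, and the pull is stronger for users or items with few ratings.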

Using SVD

In [10]:
algorithm = SVD()
cross_validate(algorithm, movie_data, measures=['RMSE', 'MAE'], cv=4, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 4 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Mean    Std     
RMSE (testset)    0.8771  0.8793  0.8766  0.8776  0.8776  0.0010  
MAE (testset)     0.6890  0.6905  0.6884  0.6891  0.6892  0.0008  
Fit time          49.10   49.14   47.96   48.91   48.78   0.48    
Test time         2.92    2.78    2.79    2.84    2.83    0.06    


{'test_rmse': array([0.87712951, 0.87929278, 0.87660288, 0.87755486]),
 'test_mae': array([0.68900749, 0.690501  , 0.68835565, 0.6890917 ]),
 'fit_time': (49.09730887413025,
  49.140267848968506,
  47.960280895233154,
  48.906163930892944),
 'test_time': (2.921950101852417,
  2.7793517112731934,
  2.789134979248047,
  2.838862895965576)}
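As a rough picture of what the SVD algorithm adds on top of the baselines, the sketch below predicts a rating as the global mean plus user and item biases plus a dot product of learned user and item factor vectors, all trained by SGD. This is a simplified illustration under that assumption, not Surprise's code. The function name fit_svd_sgd, the toy ratings and every hyperparameter value are hypothetical.

```python
import math
import random

# Toy (user, item, rating) triples, invented for illustration.
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 4),
    ("u2", "i1", 2), ("u2", "i2", 1),
    ("u3", "i1", 4), ("u3", "i2", 3),
]


def fit_svd_sgd(ratings, n_factors=2, learning_rate=0.01, reg=0.02,
                n_epochs=200, seed=0):
    """Train biases and latent factors by SGD; return a predict(u, i) function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_user = {u: 0.0 for u in users}
    b_item = {i: 0.0 for i in items}
    # Small random starting factors.
    p = {u: [rng.gauss(0.0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0.0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + b_user[u] + b_item[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            b_user[u] += learning_rate * (err - reg * b_user[u])
            b_item[i] += learning_rate * (err - reg * b_item[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]  # use the pre-update values for both steps
                p[u][f] += learning_rate * (err * qif - reg * puf)
                q[i][f] += learning_rate * (err * puf - reg * qif)
    return mu, predict


mu, predict = fit_svd_sgd(ratings)
rmse_mean_only = math.sqrt(sum((r - mu) ** 2 for _, _, r in ratings) / len(ratings))
rmse_fitted = math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))
print(rmse_mean_only, rmse_fitted)
```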

In [12]:
algorithm.fit(training_set)
predictions = algorithm.test(testing_set)
accuracy.rmse(predictions)

RMSE: 0.8733


0.8732707293213283
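The RMSE and MAE figures reported throughout are standard error measures over (true rating, predicted rating) pairs. A minimal sketch of both, using invented pairs rather than the notebook's predictions:

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


# Invented (true, predicted) pairs with errors 1, -1, 2 and 0.
pairs = [(4, 3.0), (3, 4.0), (5, 3.0), (2, 2.0)]
print(rmse(pairs), mae(pairs))
```

Because RMSE squares each error before averaging, the single error of 2 weighs more heavily in it than in MAE, which is why RMSE is the larger of the two here.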