The [Surprise](https://surprise.readthedocs.io/en/stable/) package

- use the MovieLens 1M data
- any model from the package may be used
- reach an RMSE of 0.87 or lower on the test set
\
\
Comments:
- The 1M dataset may not fit in RAM. The 100K dataset can be used instead.
- I suggest measuring RMSE with cross-validation (5 folds) rather than on a held-out set.
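
As a reference for the metric named in the comments above, here is a minimal pure-Python sketch of RMSE and of dealing indices into 5 disjoint folds. The helper names `rmse` and `k_fold_indices` are hypothetical and are not part of Surprise; the package's own `accuracy.rmse` and `KFold` are what the notebook actually uses.

```python
import math
import random


def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def k_fold_indices(n_items, n_folds=5, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into n_folds disjoint test folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


# Toy check: two predictions with errors of +0.5 and -0.5
toy_rmse = rmse([(4.0, 3.5), (3.0, 3.5)])
folds = k_fold_indices(10, n_folds=5)
```

With cross-validation, each fold serves once as the test set and the reported score is the mean of the per-fold RMSE values.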

In [1]:
# !pip install scikit-surprise  # the package is published on PyPI as scikit-surprise

In [24]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

from surprise import Dataset, Reader, KNNBasic, KNNWithMeans, SVD, SVDpp, accuracy
from surprise.model_selection import KFold, train_test_split, cross_validate, GridSearchCV

In [83]:
movies = pd.read_csv('/Users/aleksandr/Desktop/movies.csv')
ratings = pd.read_csv('/Users/aleksandr/Desktop/ratings.csv')

In [84]:
def preprocessing(movies, ratings):

    data = pd.merge(movies, ratings, on='movieId')  # Merge the two datasets on the 'movieId' key
    data.dropna(inplace=True)  # Drop rows with missing values, if any
    data = pd.DataFrame({'uid': data.userId, 'iid': data.movieId, 'rating': data.rating})  # Keep only the user, item, and rating columns

    return data

In [85]:
data = preprocessing(movies, ratings)
data.head()

   uid  iid  rating
0    1    1     4.0
1    5    1     4.0
2    7    1     4.5
3   15    1     2.5
4   17    1     4.5


In [102]:
def surprise_preprocessing(data, return_dataset=True):
    '''
    Convert the prepared DataFrame into Surprise objects.

    reader            - declares the rating scale from the min and max of the rating column;
    dataset           - a Surprise dataset built from our DataFrame;
    trainset, testset - a train (0.8) / test (0.2) split of the dataset.

    Note: surprise.model_selection.train_test_split and
    sklearn.model_selection.train_test_split are different functions!

    return_dataset=True  returns the dataset (<surprise.dataset.DatasetAutoFolds>);
    return_dataset=False returns trainset (<surprise.trainset.Trainset>) and testset.
    '''
    reader = Reader(rating_scale=(data.rating.min(), data.rating.max()))  # (min, max) == (0.5, 5.0)
    dataset = Dataset.load_from_df(data, reader)

    if return_dataset:
        return dataset

    # Split only when the caller asks for the train/test pair
    return train_test_split(dataset, test_size=0.2)

In [104]:
dataset = surprise_preprocessing(data, return_dataset=True)
trainset, testset = surprise_preprocessing(data, return_dataset=False)
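
The `Reader` above only records the rating scale. Surprise later uses that scale to clip estimates that fall outside it (the default `clip=True` behaviour of `predict`). A minimal sketch of that clipping, assuming the `(0.5, 5.0)` scale noted in the code comment; `clip_to_scale` is a hypothetical helper, not a Surprise function:

```python
def clip_to_scale(est, rating_scale=(0.5, 5.0)):
    """Clamp a raw estimate into the declared rating scale."""
    lower, upper = rating_scale
    return min(upper, max(lower, est))


too_high = clip_to_scale(5.7)   # above the scale, clamped to the upper bound
too_low = clip_to_scale(0.1)    # below the scale, clamped to the lower bound
in_range = clip_to_scale(3.2)   # already valid, returned unchanged
```

Declaring a scale narrower than the real data would therefore cap predictions and can inflate RMSE.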

### [KNNBasic](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBasic)

In [110]:
kfold = KFold(5)
scores_KNNBasic = []

# Fold variables get their own names so the hold-out trainset/testset created earlier are not overwritten
for fold_trainset, fold_testset in tqdm_notebook(kfold.split(dataset)):
    algo = KNNBasic(k=40, sim_options={'name': 'pearson_baseline', 'user_based': False})
    algo.fit(fold_trainset)
    predictions = algo.test(fold_testset)
    scores_KNNBasic.append(accuracy.rmse(predictions))  # prints and returns the fold RMSE
    print()

print('Mean RMSE: {}'.format(np.mean(scores_KNNBasic)))


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.9167

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.9167

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.9212

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.9170

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.9122

Mean RMSE: 0.91676580857083
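
For intuition, an item-based `KNNBasic` estimate is a similarity-weighted average of the ratings the user gave to the nearest neighbours of the target item. The sketch below is a simplified pure-Python illustration, not the library's implementation: `knn_basic_predict` and its dictionary inputs are hypothetical, and the similarities are assumed to be precomputed.

```python
import heapq


def knn_basic_predict(user_ratings, sims_to_target, k=40):
    """Estimate a rating for the target item.

    user_ratings:   {item_id: rating} for items the user has rated
    sims_to_target: {item_id: similarity between that item and the target item}
    """
    neighbors = [(sims_to_target[j], r) for j, r in user_ratings.items()
                 if j in sims_to_target]
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    # Only positively similar neighbours contribute to the weighted average
    num = sum(s * r for s, r in top_k if s > 0)
    den = sum(s for s, _ in top_k if s > 0)
    if den == 0:
        return None  # no usable neighbour; the caller needs a fallback
    return num / den


toy_estimate = knn_basic_predict(
    user_ratings={10: 4.0, 20: 2.0, 30: 5.0},
    sims_to_target={10: 0.8, 20: 0.2, 30: -0.5},
    k=2,
)
```

In the toy call, the two most similar items (similarities 0.8 and 0.2) give (0.8 * 4.0 + 0.2 * 2.0) / 1.0 = 3.6.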


### [KNNWithMeans](https://surprise.readthedocs.io/en/stable/knn_inspired.html)

In [106]:
kfold = KFold(5)
scores = []

# Fold variables get their own names so the hold-out trainset/testset created earlier are not overwritten
for fold_trainset, fold_testset in tqdm_notebook(kfold.split(dataset)):
    algo = KNNWithMeans(k=40, sim_options={'name': 'pearson_baseline', 'user_based': False})
    algo.fit(fold_trainset)
    predictions = algo.test(fold_testset)
    scores.append(accuracy.rmse(predictions))  # prints and returns the fold RMSE
    print()

print('Mean RMSE: {}'.format(np.mean(scores)))


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8818

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8812

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8781

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8891

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8854

Mean RMSE: 0.8831428721183293
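
`KNNWithMeans` differs from `KNNBasic` by averaging deviations from each item's mean rating instead of raw ratings, then adding the target item's mean back. A simplified sketch under the same assumptions as before (precomputed similarities, neighbours already limited to the k nearest); `knn_with_means_predict` is a hypothetical helper:

```python
def knn_with_means_predict(target_mean, neighbors):
    """Estimate a rating for the target item.

    target_mean: mean rating of the target item
    neighbors:   list of (similarity, user_rating, neighbor_item_mean) tuples
    """
    positive = [(s, r, m) for s, r, m in neighbors if s > 0]
    den = sum(s for s, _, _ in positive)
    if den == 0:
        return target_mean  # fall back to the item mean
    # Weighted average of how far the user rated each neighbour from its mean
    return target_mean + sum(s * (r - m) for s, r, m in positive) / den


toy_estimate = knn_with_means_predict(
    target_mean=3.0,
    neighbors=[(0.8, 4.0, 3.5), (0.2, 2.0, 3.0)],
)
```

Centering on item means removes per-item popularity offsets, which is a plausible reason for the drop in mean RMSE from about 0.917 to about 0.883 seen above.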


### [SVD](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html) _and_ [Cross_validate](https://surprise.readthedocs.io/en/stable/model_selection.html)
_(The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. When baselines are not used, this is equivalent to Probabilistic Matrix Factorization [SM08].)_

In [107]:
N_FOLDS = 5
algo = SVD()

CV = cross_validate(algo, dataset, measures=['RMSE'], cv=N_FOLDS, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8663  0.8810  0.8783  0.8725  0.8698  0.8736  0.0054  
Fit time          4.46    4.54    4.86    4.41    4.61    4.58    0.16    
Test time         0.12    0.10    0.15    0.10    0.14    0.12    0.02    
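
The SVD model estimates a rating as the global mean plus user and item biases plus a dot product of latent factors, and it is trained by stochastic gradient descent on the regularized squared error. The sketch below shows one such update in pure Python. The function names are hypothetical, the learning rate and regularization mirror the `lr_all=0.005` and `reg_all=0.02` defaults documented for `SVD`, and this is an illustration rather than the library's code.

```python
def svd_predict(mu, b_u, b_i, p_u, q_i):
    """mu + user bias + item bias + dot(user factors, item factors)."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One SGD update for a single observed rating r_ui."""
    err = r_ui - svd_predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of p_u and q_i
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


mu, r_ui = 3.5, 4.5
b_u, b_i, p_u, q_i = 0.0, 0.0, [0.1, 0.1], [0.1, 0.1]
before = svd_predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
after = svd_predict(mu, b_u, b_i, p_u, q_i)
```

After the step, the estimate moves slightly toward the observed rating; repeating this over all ratings for `n_epochs` passes is what `fit` amounts to conceptually.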


### [SVD++](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html) _and_ [Cross_validate](https://surprise.readthedocs.io/en/stable/model_selection.html)
_(The SVD++ algorithm, an extension of SVD taking into account implicit ratings.)_

In [108]:
N_FOLDS = 5
algo_1 = SVDpp()

CV_1 = cross_validate(algo_1, dataset, measures=['RMSE'], cv=N_FOLDS, verbose=True)

Evaluating RMSE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8637  0.8671  0.8674  0.8592  0.8541  0.8623  0.0051  
Fit time          516.64  520.05  510.64  528.83  522.43  519.72  6.04    
Test time         8.77    8.41    9.14    8.64    8.66    8.72    0.24    
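
SVD++ extends the SVD estimate by adding, to the user's factor vector, a normalized sum of implicit-feedback factors `y_j`, one for every item the user has rated (regardless of the rating value). A minimal sketch of the prediction rule only, with a hypothetical function name and toy numbers:

```python
import math


def svdpp_predict(mu, b_u, b_i, p_u, q_i, implicit_factors):
    """SVD++ estimate.

    implicit_factors: list of y_j vectors, one per item the user has rated
    """
    n_rated = len(implicit_factors)
    norm = 1.0 / math.sqrt(n_rated) if n_rated else 0.0
    # User representation = explicit factors + normalized sum of implicit factors
    user_vec = [p + norm * sum(y[f] for y in implicit_factors)
                for f, p in enumerate(p_u)]
    return mu + b_u + b_i + sum(u * q for u, q in zip(user_vec, q_i))


toy_estimate = svdpp_predict(
    mu=3.5, b_u=0.1, b_i=-0.2,
    p_u=[0.5, 0.0], q_i=[1.0, 2.0],
    implicit_factors=[[0.2, 0.1]] * 4,
)
```

Because every estimate and every update touches the `y_j` vectors of all items the user rated, training is much heavier than for plain SVD. That is consistent with the fit times above (about 520 s per fold versus about 4.6 s), although the exact ratio depends on the implementation and hardware.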


### [SVD++](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html) _and_ [GridSearchCV](https://surprise.readthedocs.io/en/stable/model_selection.html#surprise.model_selection.search.GridSearchCV)

In [112]:
model_params = {
    'n_epochs': [20, 25], 
    'lr_all': [0.007, 0.009],
    'reg_all': [0.2, 0.4]
}

svdpp_gscv = GridSearchCV(SVDpp, model_params, measures=['rmse'], cv=5, n_jobs=-1)
svdpp_gscv.fit(dataset)

In [114]:
print(svdpp_gscv.best_params)
print(svdpp_gscv.best_score)
print(svdpp_gscv.best_estimator)

{'rmse': {'n_epochs': 25, 'lr_all': 0.009, 'reg_all': 0.2}}
{'rmse': 0.8697521904610671}
{'rmse': <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x1234d2780>}
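
To make the cost of the search concrete, the grid above can be enumerated with the standard library. This sketch only counts parameter combinations and model fits; it does not call Surprise.

```python
from itertools import product

model_params = {
    'n_epochs': [20, 25],
    'lr_all': [0.007, 0.009],
    'reg_all': [0.2, 0.4],
}

names = list(model_params)
# Cartesian product of all candidate values, one dict per combination
combos = [dict(zip(names, values))
          for values in product(*(model_params[name] for name in names))]

n_folds = 5
n_fits = len(combos) * n_folds  # each combination is fitted once per fold
```

With 2 x 2 x 2 = 8 combinations and 5 folds, the search trains 40 SVD++ models, which explains why `n_jobs=-1` is worth using here.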


### [SVD++](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html) _and_ [best_params](https://surprise.readthedocs.io/en/stable/model_selection.html#surprise.model_selection.search.GridSearchCV.best_params)

In [115]:
algo_svdpp = SVDpp(
    n_epochs = 25, 
    lr_all = 0.009,
    reg_all = 0.2
)

algo_svdpp.fit(trainset)
test_pred = algo_svdpp.test(testset)
accuracy.rmse(test_pred, verbose=True)

RMSE: 0.8687


0.8687003148250274
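
The scores reported in this notebook can be collected and checked against the 0.87 target. The values below are copied (rounded to four decimals) from the outputs above; the labels are only descriptive.

```python
TARGET_RMSE = 0.87

results = {
    'KNNBasic (5-fold mean)': 0.9168,
    'KNNWithMeans (5-fold mean)': 0.8831,
    'SVD (5-fold mean)': 0.8736,
    'SVD++ defaults (5-fold mean)': 0.8623,
    'SVD++ grid-search best (5-fold mean)': 0.8698,
    'SVD++ tuned (hold-out)': 0.8687,
}

# Models whose RMSE is at or below the target
meets_target = sorted(name for name, score in results.items()
                      if score <= TARGET_RMSE)
best_name = min(results, key=results.get)
```

Only the SVD++ variants reach the target. Note that, by these numbers, the default SVD++ configuration scored better in cross-validation than the best point of the searched grid.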
