Homework: "Content-based recommendations"
The SURPRISE package:

use the MovieLens 1M data,
any models from the package may be used,
obtain an RMSE of 0.87 or lower on the test set.

* Load the data and assemble the dataset (movie-rating)
* Use the SURPRISE tools to convert the pandas DataFrame into the required format
* Select the SURPRISE algorithm(s) to train
* During training, validate on 5 folds and evaluate RMSE
* Select the best algorithm and tune it if necessary
https://surpriselib.com/

Algorithms: https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html
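
The target metric, RMSE, is the square root of the mean squared difference between true and predicted ratings. A minimal pure-Python sketch follows; the helper name `rmse_from_pairs` and the sample ratings are hypothetical and are not part of Surprise.

```python
import math

def rmse_from_pairs(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (true, predicted) pair")
    squared_errors = [(true - pred) ** 2 for true, pred in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, made up for illustration only.
example_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]
example_rmse = rmse_from_pairs(example_pairs)
print(example_rmse)
```

Because errors are squared before averaging, a few large misses raise the score more than many small ones.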

0. Loading data and importing libraries

In [None]:
%pip install scikit-surprise

In [23]:
import numpy as np
import pandas as pd
import zipfile
from surprise import Dataset, Trainset
from surprise.accuracy import rmse
from surprise.similarities import cosine, msd, pearson, pearson_baseline
from surprise import Reader
from surprise.prediction_algorithms.algo_base import AlgoBase 
from surprise.prediction_algorithms.baseline_only import BaselineOnly
from surprise.prediction_algorithms.knns import KNNWithMeans,\
KNNWithZScore, KNNBaseline
from surprise.model_selection.search import GridSearchCV
from surprise.prediction_algorithms.matrix_factorization import SVD, SVDpp, NMF
from surprise.model_selection import train_test_split
from surprise.model_selection.split import KFold
from surprise.model_selection.validation import cross_validate
from surprise import accuracy
from sklearn.metrics import mean_squared_error

1. User-defined functions

In [31]:
def make_dataset() -> Dataset:
  """
  Download MovieLens (ml-latest-small) and prepare a Surprise Dataset
  """
  # download the archive (IPython shell escape; works only inside a notebook)
  !wget  "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"  --no-check-certificate
  # unzip data from zip
  with zipfile.ZipFile('ml-latest-small.zip', 'r') as zip_ref:
    zip_ref.extractall('data')
  # read tables (forward slashes work on Windows, Linux and macOS)
  movies = pd.read_csv('data/ml-latest-small/movies.csv')
  ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
  ratings = ratings.drop(columns=['timestamp'])
  # attach movie titles to the ratings
  df = ratings.join(movies.set_index('movieId'), on='movieId', how='left')
  # Surprise expects the columns in this order: user id, item id, rating
  dataset = pd.DataFrame({
    'uid': df.userId,
    'iid': df.title,
    'rating': df.rating
  })
  # avoid shadowing the built-ins min() and max()
  min_rating = dataset.rating.min()
  max_rating = dataset.rating.max()
  reader = Reader(rating_scale=(min_rating, max_rating))
  data = Dataset.load_from_df(dataset, reader)
  return data

In [5]:
def get_algo(algo_name: str) -> AlgoBase:
  """
  Build a dictionary of Surprise models and return the requested one
  Args:
    algo_name: str - name of a Surprise model
  Return:
    algos[algo_name]: AlgoBase - the chosen model
  """
  algos = dict()
  algos['BaselineOnly'] = BaselineOnly()
  algos['KNNWithMeans'] = KNNWithMeans()
  algos['KNNWithZScore'] = KNNWithZScore()
  algos['KNNBaseline'] = KNNBaseline()
  algos['SVD'] = SVD()
  algos['SVDpp'] = SVDpp()
  algos['NMF'] = NMF()
  return algos[algo_name]

In [6]:
def build_features(data: Dataset, split_ratio: float, random_seed: int) -> tuple[Trainset, list, Trainset]:
  """Make datasets to fit, test and cross-validate a model
  Args:
    data: Dataset - whole dataset
    split_ratio: float - share of ratings held out for the test set
    random_seed: int - seed for a reproducible split
  Return:
    trainset: Trainset,
    testset: list - (uid, iid, rating) tuples
    traintestfull: Trainset - built from all ratings
  """
  trainset, testset = train_test_split(data, test_size=split_ratio, random_state=random_seed)
  traintestfull = data.build_full_trainset()
  return trainset, testset, traintestfull
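
What a hold-out split does can be sketched in plain Python: shuffle the ratings with a fixed seed and set a fraction aside. This is an illustration only, not Surprise's implementation; `holdout_split` and the toy ratings are hypothetical, and Surprise may shuffle and round differently.

```python
import random

def holdout_split(ratings, test_size, random_seed):
    """Shuffle (user, item, rating) tuples and hold out a fraction for testing."""
    shuffled = list(ratings)
    random.Random(random_seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Toy data: 4 users x 5 items, all rated 3.0 (made up for illustration).
toy_ratings = [(u, i, 3.0) for u in range(4) for i in range(5)]
train_rows, test_rows = holdout_split(toy_ratings, test_size=0.15, random_seed=42)
print(len(train_rows), len(test_rows))
```

With the same seed the split is identical on every run, which keeps the hold-out scores of different algorithms comparable.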

In [7]:
def train_model(trainset: Trainset, testset: list, algo_name: str) -> np.float64:
  """
  Get an algorithm by its name,
  fit the model, predict on the test set
  & return RMSE
  Used for the initial selection of algorithms
  Args:
    trainset: Trainset - train dataset
    testset: list - held-out (uid, iid, rating) tuples
    algo_name: str
  Return: rmse: np.float64
  """
  # get algorithm
  algo = get_algo(algo_name)
  # fit algorithm
  algo.fit(trainset)
  predictions = algo.test(testset)
  return accuracy.rmse(predictions, verbose=False)

In [32]:
data = make_dataset()

--2023-07-06 19:11:10--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: 'ml-latest-small.zip'


In [33]:
trainset, testset, trainsetfull = build_features(data, 0.15, 42)

In [34]:
# keep both the inner and the raw item ids
trainset_iids = list(trainset.all_items()) # inner item ids (integers assigned by Surprise)
iid_converter = lambda x: trainset.to_raw_iid(x)
trainset_raw_iids = list(map(iid_converter, trainset_iids)) # raw item ids, here movie titles
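
The cell above relies on two id spaces: raw ids (here, movie titles) and inner ids (integers used inside the trainset). The sketch below shows one way such a mapping can be built, assuming inner ids are handed out in order of first appearance; `build_id_maps` and the sample titles are hypothetical and only illustrate the idea.

```python
def build_id_maps(raw_ids):
    """Map each distinct raw id to a consecutive integer and back."""
    raw_to_inner = {}
    for raw in raw_ids:
        if raw not in raw_to_inner:
            raw_to_inner[raw] = len(raw_to_inner)  # next free integer
    inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}
    return raw_to_inner, inner_to_raw

# Sample titles for illustration; a repeated title keeps its first inner id.
titles = ["Toy Story (1995)", "Heat (1995)", "Toy Story (1995)", "Jumanji (1995)"]
raw_to_inner, inner_to_raw = build_id_maps(titles)
print(raw_to_inner)
```

Because titles serve as raw item ids here, two different movies sharing a title would collapse into one item.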

In [35]:
algos = [
 'BaselineOnly',
 'KNNWithMeans',         
 'KNNWithZScore',
 'KNNBaseline',
 'SVD',
 'SVDpp',
 'NMF'
]

In [36]:
results = dict()
for algo in algos:
  results[algo] = train_model(trainset, testset, algo)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


In [37]:
results

{'BaselineOnly': 0.8824900429621578,
 'KNNWithMeans': 0.9041790317046055,
 'KNNWithZScore': 0.9024747395939957,
 'KNNBaseline': 0.8828909395729079,
 'SVD': 0.8854328980760703,
 'SVDpp': 0.8703005899674594,
 'NMF': 0.9329089587256932}

Validation on a held-out set helps pick the most promising algorithms for further study. We select KNNBaseline, SVD and SVDpp and run cross-validation for them with default parameters.
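
To make the hold-out comparison easier to read, the scores printed above can be ranked from lowest to highest RMSE. The values are copied from the output above; the variable names are illustrative.

```python
# Hold-out RMSE values copied from the notebook output above.
holdout_rmse = {
    "BaselineOnly": 0.8824900429621578,
    "KNNWithMeans": 0.9041790317046055,
    "KNNWithZScore": 0.9024747395939957,
    "KNNBaseline": 0.8828909395729079,
    "SVD": 0.8854328980760703,
    "SVDpp": 0.8703005899674594,
    "NMF": 0.9329089587256932,
}

# Lower RMSE is better, so sort in ascending order.
ranked = sorted(holdout_rmse.items(), key=lambda kv: kv[1])
for name, score in ranked:
    print(f"{name:<14}{score:.4f}")
```

Several candidates differ only in the third decimal place on a single split, which is one reason to follow up with cross-validation.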

In [95]:
results=dict()

For SVD

In [101]:
algo=SVD()
cv_results = cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, return_train_measures=True, verbose=False)
cv_results['test_rmse'].mean()

0.8724466491164524

In [102]:
results['SVD'] = cv_results['test_rmse'].mean()
results['SVD'] 

0.8724466491164524

In [40]:
kf = KFold(n_splits=5)

algo = SVD()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 0.8745
RMSE: 0.8686
RMSE: 0.8860
RMSE: 0.8775
RMSE: 0.8657
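
A sketch of the idea behind K-fold splitting, in plain Python: every sample lands in exactly one test fold. This is not Surprise's `KFold` implementation, and `kfold_indices` is a hypothetical helper. The last lines average the five SVD fold scores printed above.

```python
import random

def kfold_indices(n_samples, n_splits, random_seed):
    """Yield (train_idx, test_idx); each index appears in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(random_seed).shuffle(indices)
    for k in range(n_splits):
        test_idx = indices[k::n_splits]  # every n_splits-th shuffled index
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx

folds = list(kfold_indices(n_samples=10, n_splits=5, random_seed=0))

# Per-fold RMSE values copied from the SVD output above.
svd_fold_rmse = [0.8745, 0.8686, 0.8860, 0.8775, 0.8657]
svd_mean_rmse = sum(svd_fold_rmse) / len(svd_fold_rmse)
print(round(svd_mean_rmse, 5))
```

This mean need not equal the `cross_validate` result, since neither the fold split nor the SVD initialisation is seeded in the cells above.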


In [None]:
param_grid = {"n_epochs": [61, 100], "lr_all": [0.006, 0.01], "reg_all": [0.4]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])

0.8765257814053132
{'n_epochs': 61, 'lr_all': 0.006, 'reg_all': 0.4}
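
A grid search evaluates every combination of the listed values, so its cost grows with the product of the list lengths. The sketch below expands the same `param_grid` with `itertools.product` to show how many fits the SVD search above implies; it only illustrates the enumeration and is not how `GridSearchCV` is implemented.

```python
from itertools import product

# Same grid as in the SVD grid-search cell above.
param_grid = {"n_epochs": [61, 100], "lr_all": [0.006, 0.01], "reg_all": [0.4]}
cv_folds = 5

names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]
n_fits = len(combinations) * cv_folds  # one fit per combination per fold

for combo in combinations:
    print(combo)
print("total fits:", n_fits)
```

Adding a second value for `reg_all` would double the number of fits, which matters for slower models.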


For KNNBaseline

In [98]:
algo=KNNBaseline()
cv_results = cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, return_train_measures=True, verbose=False)
cv_results['test_rmse'].mean()
results['KNNBaseline'] = cv_results['test_rmse'].mean()

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


In [99]:
kf = KFold(n_splits=5)

algo = KNNBaseline()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8863
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8684
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8709
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8738
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8778


For SVDpp

In [97]:
algo=SVDpp()
cv_results = cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, return_train_measures=True, verbose=False)
cv_results['test_rmse'].mean()
results['SVDpp'] = cv_results['test_rmse'].mean()

In [104]:
results['SVDpp']

0.8631228176464131

SVDpp already reaches the required result under 5-fold cross-validation: 0.8631228176464131 < 0.87. Let us look at the RMSE for each fold.

In [93]:
kf = KFold(n_splits=5)

algo = SVDpp()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 0.8689
RMSE: 0.8542
RMSE: 0.8590
RMSE: 0.8630
RMSE: 0.8585


On every fold the RMSE is below the stated threshold. Let us try to improve the model further by tuning hyperparameters.
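
The per-fold claim can be checked directly from the printed values. A small sketch follows; the fold scores are copied from the output above and the variable names are illustrative.

```python
# Per-fold RMSE values copied from the SVDpp output above.
svdpp_fold_rmse = [0.8689, 0.8542, 0.8590, 0.8630, 0.8585]
threshold = 0.87  # target from the assignment

all_below = all(score < threshold for score in svdpp_fold_rmse)
worst_fold = max(svdpp_fold_rmse)
margin = threshold - worst_fold  # how close the worst fold is to the limit

print(all_below, worst_fold, round(margin, 4))
```

The worst fold clears the threshold by only about 0.001, so a different random split could plausibly push a single fold above 0.87.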

In [94]:
param_grid = {"n_epochs": [95], "lr_all": [0.006], "reg_all": [0.4],  "verbose": [True]}
gs3 = GridSearchCV(SVDpp, param_grid, measures=["rmse"], cv=5)

gs3.fit(data)

# best RMSE score
print(gs3.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(gs3.best_params["rmse"])


In [103]:
results

{'SVDpp': 0.8631228176464131,
 'KNNBaseline': 0.874301593874022,
 'SVD': 0.8724466491164524}

Hyperparameter tuning did not improve the model; the default parameters work best.
Thus, SVDpp with default parameters proved the most effective algorithm.
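
The conclusion can be restated with the numbers shown in this notebook. The sketch compares the default and grid-searched SVD scores and picks the best default model. The tuned SVDpp score is not part of the captured output, so it is left out here.

```python
# Cross-validated RMSE with default parameters, copied from the results above.
cv_rmse = {
    "SVDpp": 0.8631228176464131,
    "KNNBaseline": 0.874301593874022,
    "SVD": 0.8724466491164524,
}
# Best grid-search score reported above for SVD.
svd_tuned_rmse = 0.8765257814053132

svd_delta = svd_tuned_rmse - cv_rmse["SVD"]  # positive means tuning made it worse
best_name = min(cv_rmse, key=cv_rmse.get)    # lowest RMSE wins

print(f"SVD tuned minus default: {svd_delta:+.4f}")
print("best default model:", best_name, round(cv_rmse[best_name], 4))
```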