Задание по теме «Коллаборативная фильтрация»
Пакет SURPRISE:

используйте данные MovieLens 1M,
можно использовать любые модели из пакета,
получите RMSE на тестовом сете 0,87 и ниже.

* Загружаем данные и собираем датасет(фильм-рейтинг)
* Используем средства SURPRISE для перевода pandas датафрейма в нужный формат
* Отбираем алгоритм/алгоритмы из SURPRISE, которые будем обучать
* В процессе обучения выполняем проверку на 5 фолдах, оцениваем RMSE
* Отбираем лучший алгоритм, при необходимости тьюним его
https://surpriselib.com/

Алгоритмы https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

0. Загрузка данных и импорт библиотек

In [2]:
%pip install scikit-surprise

Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy as np
import pandas as pd
import zipfile
from surprise import Dataset, Trainset
from surprise.accuracy import rmse
from surprise.similarities import cosine, msd, pearson, pearson_baseline
from surprise import Reader
from surprise.prediction_algorithms.algo_base import AlgoBase 
from surprise.prediction_algorithms.baseline_only import BaselineOnly
from surprise.prediction_algorithms.knns import KNNWithMeans,\
KNNWithZScore, KNNBaseline
from surprise.model_selection.search import GridSearchCV
from surprise.model_selection import train_test_split
from surprise.model_selection.split import KFold
from surprise.model_selection.validation import cross_validate
from surprise import accuracy
from sklearn.metrics import mean_squared_error

1. Пользовательские функции

In [4]:
def make_dataset() -> Dataset:
  """
  Prepare dataset for Surprise
  """
  # load dataset
  !wget  "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"  --no-check-certificate
  # unzip data from zip
  with zipfile.ZipFile('ml-latest-small.zip', 'r') as zip_ref:
    zip_ref.extractall('data')
  # read tables
  movies = pd.read_csv('data\\ml-latest-small\\movies.csv')
  ratings = pd.read_csv('data\\ml-latest-small\\ratings.csv')
  ratings.drop(columns=['timestamp'], inplace=True)
  # join tables
  df = ratings.join(movies.set_index('movieId'), on='movieId', how='left')
  dataset = pd.DataFrame({
    'uid': df.userId,
    'iid': df.title,
    'rating': df.rating
})
  min = dataset.rating.min()
  max = dataset.rating.max()
  reader = Reader(rating_scale=(min, max)) 
  data = Dataset.load_from_df(dataset, reader)
  return data

In [5]:
def get_algo(algo_name) -> AlgoBase():
  """
  Make a dictionary with surprise models
  Args:
    algo_name:str - a name of surprise model
  Return:
    algos[algo_name]: AlgoBase - choosed model 
  """
  algos = dict()
  algos['BaselineOnly'] = BaselineOnly()
  algos['KNNWithMeans'] = KNNWithMeans()
  algos['KNNWithZScore'] = KNNWithZScore()
  algos['KNNBaseline'] = KNNBaseline()
  return algos[algo_name]

In [6]:
def build_features(data: Dataset, split_ratio: float, random_seed: int) -> tuple[Trainset, Trainset, Trainset]:
  """Make datasets for fit, predict, cross_validate model
  Args:
    data: Dataset - whole dataset
  Return:
    trainset: Trainset,
    testset: Trainset
    traintestfull: Trainset
  """
  trainset, testset = train_test_split(data, test_size=split_ratio, random_state=random_seed)
  traintestfull = data.build_full_trainset()
  return  trainset, testset, traintestfull

In [7]:
def train_model( trainset: Trainset, testest: Trainset, algo_name: str) -> np.float64:
  """
  Get algoritm by its name,
  fit model, predict on test dataset
  & return rmse
  Will be used for primary algoritm's selecton
  Args:
    trainset: Trainset - train dataset
    algo_name: str
  Return: rmse: np.float64
  """
  # get algoritm
  algo = get_algo(algo_name)
  # fit algoritm
  algo.fit(trainset)
  predictions = algo.test(testset)
  return accuracy.rmse(predictions, verbose=False)

In [42]:
def validate_model(data, algo_name):
  results = dict()
  algo = get_algo(algo_name)
  results = cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, return_train_measures=True)
  kf = KFold(n_splits=5)
  acc = []
  for trainset, testset in kf.split(data):

      # train and test algorithm.
      algo.fit(trainset)
      predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
      acc.append(accuracy.rmse(predictions, verbose=True))
  return results['test_rmse'].mean(), acc

In [8]:
data = make_dataset()

--2023-07-07 13:12:44--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: 'ml-latest-small.zip'

     0K .......... .......... .......... .......... ..........  5%  143K 6s
    50K .......... .......... .......... .......... .......... 10%  296K 4s
   100K .......... .......... .......... .......... .......... 15% 5.10M 3s
   150K .......... .......... .......... .......... .......... 20%  313K 3s
   200K .......... .......... .......... .......... .......... 26% 2.95M 2s
   250K .......... .......... .......... .......... .......... 31% 24.8M 2s
   300K .......... .......... .......... .......... .......... 36% 14.2M 1s
   350K .......... .......... .......... ...

In [9]:
trainset, testset, trainsetfull = build_features(data, 0.15, 42)

In [10]:
# to save list of raw & inner item's id's
trainset_iids = list(trainset.all_items()) # It is moviesId
iid_converter = lambda x: trainset.to_raw_iid(x)
trainset_raw_iids = list(map(iid_converter, trainset_iids)) # it is movies titles

In [11]:
algos = [
 'BaselineOnly',
 'KNNWithMeans',         
 'KNNWithZScore',
 'KNNBaseline',
]

In [43]:
cv_results = dict()
cv_5 = dict()
results = dict()
for algo in algos:
  results[algo] = train_model(trainset, testset, algo)
  cv_results[algo], cv_5[algo] =  validate_model(data, algo)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
RMSE: 0.8717
Estimating biases using als...
RMSE: 0.8782
Estimating biases using als...
RMSE: 0.8694
Estimating biases using als...
RMSE: 0.8696
Estimating biases using als...
RMSE: 0.8738
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8996
Computing the msd similarity matrix...
Done computing similarity matri

In [44]:
results

{'BaselineOnly': 0.8737114003217862,
 'KNNWithMeans': 0.8955378893124751,
 'KNNWithZScore': 0.894175350623354,
 'KNNBaseline': 0.874123767263706}

In [45]:
cv_results

{'BaselineOnly': 0.8729676746099791,
 'KNNWithMeans': 0.8972381676383165,
 'KNNWithZScore': 0.896553334056852,
 'KNNBaseline': 0.8748385951293189}

In [46]:
cv_5

{'BaselineOnly': [0.8717052600991458,
  0.8781714700898999,
  0.8694439501324589,
  0.8695839273199019,
  0.8737733352714818],
 'KNNWithMeans': [0.8996062986589268,
  0.9078242950699351,
  0.8980946844683133,
  0.8940173688340429,
  0.8955001153258797],
 'KNNWithZScore': [0.9078839559939339,
  0.896548969144564,
  0.8891393349904706,
  0.8962744468675109,
  0.8957736612716851],
 'KNNBaseline': [0.8782825675094662,
  0.8763859530246967,
  0.8722111110555015,
  0.867523079263711,
  0.8738614987283699]}

Валидация на отложенном датасете помогает выбрать наиболее перспективные для дальнейшего изучения алгоритмы. ВЫберем KNNBaseline.Выполним для этих алгоритмов кросс-валидацию с параметрами по умолчанию

In [29]:
all_cv_results

{'KNNWithMeans': 0.8964511992710749,
 'KNNBaseline': 0.8751849954725367,
 'KNNWithZScore': 0.896420945106889}

In [31]:
cv_5

{'KNNWithMeans': [0.9065331672252935,
  0.9013035508937861,
  0.8927859633903186,
  0.8920246477594777,
  0.8963935644444989],
 'KNNBaseline': [0.8774524064622571,
  0.8768064833762967,
  0.8752693714850209,
  0.8755628135662583,
  0.8720178246797206],
 'KNNWithZScore': [0.89720809095683,
  0.8938095377372172,
  0.8971338392777446,
  0.9001861910596451,
  0.894175350623354]}

Лучшие результаты показывает KNNBaseline, но эти результаты >0.87. Попробуем улучшить модель

In [40]:
param_grid = {"k": [30], "min_k": [5],"verbose": [True],   "name": ["pearson"],   "user_based": [True, False], 'n_epochs':[10, 20]}
gs3 = GridSearchCV(KNNBaseline, param_grid, measures=["rmse"], cv=5)

gs3.fit(data)

# best RMSE score
print(gs3.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(gs3.best_params["rmse"])

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matr

Результат достигнут!