<a href="https://colab.research.google.com/github/solobala/ABD26/blob/main/RMSL9_DZ2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Homework: "Content-based recommendations"
The SURPRISE package:

use the MovieLens 1M data,
any model from the package may be used,
reach an RMSE of 0.87 or lower on the test set.

* Load the data and assemble the dataset (movie-rating)
* Use SURPRISE tools to convert the pandas DataFrame into the required format
* Select the SURPRISE algorithm(s) to train
* During training, validate on 5 folds and evaluate RMSE
* Pick the best algorithm and tune it if necessary
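
The target metric, RMSE, is the square root of the mean squared difference between true and predicted ratings. A minimal sketch in plain Python follows; the rating pairs are made up for illustration and the helper name `rmse` is ours, not part of SURPRISE (the notebook itself relies on `surprise.accuracy.rmse`).

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, for illustration only (not taken from MovieLens).
example_pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
example_rmse = rmse(example_pairs)
print(round(example_rmse, 4))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.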

# 0. Loading data and importing libraries

https://surpriselib.com/

Algorithms: https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

In [1]:
!pip install surprise # prediction algorithms available for recommendation

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3097650 sha256=75e31a5220ced27d4e8f2fb3740a113874a1f8e2c5c5a245edddcce134892d46
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1


In [2]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from surprise import Dataset, Trainset
from surprise.accuracy import rmse
from surprise.similarities import cosine, msd, pearson, pearson_baseline
from surprise import Reader
from surprise.prediction_algorithms.algo_base import AlgoBase
from surprise.prediction_algorithms.baseline_only import BaselineOnly
from surprise.prediction_algorithms.knns import KNNWithMeans,\
KNNWithZScore, KNNBaseline
from surprise.prediction_algorithms.matrix_factorization import SVD, SVDpp, NMF
from surprise.prediction_algorithms.slope_one import SlopeOne
from surprise.prediction_algorithms.co_clustering import CoClustering
from surprise.model_selection import train_test_split
from surprise.model_selection.split import KFold
# from surprise.model_selection import PredefinedKFold
from surprise.model_selection.validation import cross_validate
from surprise.model_selection.search import GridSearchCV
# from surprise.prediction_algorithms.predictions import Prediction, PredictionImpossible
from surprise import accuracy
from sklearn.metrics import mean_squared_error

# 1. User-defined functions

In [3]:
def make_dataset() -> Dataset:
  """
  Prepare dataset for Surprise
  """
  # download the archive of the chosen dataset (ml-latest-small)
  !wget  "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
  # unzip data from zip
  !unzip ml-latest-small.zip
  # read tables
  movies = pd.read_csv('/content/ml-latest-small/movies.csv')
  ratings = pd.read_csv('/content/ml-latest-small/ratings.csv')
  ratings.drop(columns=['timestamp'], inplace=True)
  # join tables
  df = ratings.join(movies.set_index('movieId'), on='movieId', how='left')
  # Surprise expects the columns in the order: user id, item id, rating
  dataset = pd.DataFrame({
    'uid': df.userId,
    'iid': df.title,
    'rating': df.rating
  })
  # avoid shadowing the built-ins min() and max()
  min_rating = dataset.rating.min()
  max_rating = dataset.rating.max()
  reader = Reader(rating_scale=(min_rating, max_rating))
  data = Dataset.load_from_df(dataset, reader)
  return data

In [4]:
def get_algo(algo_name: str) -> AlgoBase:
  """
  Look up a Surprise model by its name
  Args:
    algo_name: str - the name of a Surprise model
  Return:
    AlgoBase - a new instance of the chosen model with default parameters
  """
  # map names to classes so that only the requested model is instantiated
  algos = {
    'BaselineOnly': BaselineOnly,
    'KNNWithMeans': KNNWithMeans,
    'KNNWithZScore': KNNWithZScore,
    'KNNBaseline': KNNBaseline,
    'SVD': SVD,
    'SVDpp': SVDpp,
    'NMF': NMF,
    'SlopeOne': SlopeOne,
    'CoClustering': CoClustering,
  }
  return algos[algo_name]()

In [5]:
def build_features(data: Dataset, split_ratio: float, random_seed: int) -> tuple[Trainset, list, Trainset]:
  """Make datasets for fit, predict, cross_validate model
  Args:
    data: Dataset - whole dataset
    split_ratio: float - share of ratings held out for testing
    random_seed: int - seed that makes the split reproducible
  Return:
    trainset: Trainset,
    testset: list - (uid, iid, rating) tuples
    traintestfull: Trainset - built from all ratings
  """
  trainset, testset = train_test_split(data, test_size=split_ratio, random_state=random_seed)
  traintestfull = data.build_full_trainset()
  return trainset, testset, traintestfull
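
To make the hold-out idea behind `build_features` concrete, here is a minimal sketch in plain Python. It is not how SURPRISE implements `train_test_split`; the helper `holdout_split` and the toy triples are hypothetical and only illustrate shuffling with a fixed seed and reserving a share of the ratings for testing.

```python
import random

def holdout_split(ratings, test_size, seed):
    """Shuffle a copy of `ratings` and hold out a `test_size` fraction for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    # The first n_test shuffled items form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (user, item, rating) triples, for illustration only.
toy_ratings = [(u, f"movie_{i}", 3.0) for u in range(4) for i in range(5)]
toy_train, toy_test = holdout_split(toy_ratings, test_size=0.15, seed=42)
print(len(toy_train), len(toy_test))
```

Fixing the seed keeps the split identical between runs, so RMSE values of different algorithms stay comparable.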

In [6]:
def train_model(trainset: Trainset, algo_name: str) -> np.float64:
  """
  Get an algorithm by its name,
  fit the model, predict on the test dataset
  & return rmse
  Will be used for the primary algorithm selection
  NOTE: evaluates on the module-level `testset` returned by build_features
  Args:
    trainset: Trainset - train dataset
    algo_name: str
  Return: rmse: np.float64
  """
  # get algorithm
  algo = get_algo(algo_name)
  # fit algorithm
  algo.fit(trainset)
  # `testset` is a global variable here, not an argument
  predictions = algo.test(testset)
  return accuracy.rmse(predictions, verbose=False)


In [7]:
def validate_model(data, algo_name):
  algo = get_algo(algo_name)
  results = cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, return_train_measures=True)
  return results['test_rmse'].mean()
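
`validate_model` averages the test RMSE over 5 folds. As a rough illustration of what a k-fold split does, the sketch below partitions sample indices so that every sample lands in exactly one test fold. The function `kfold_indices` is hypothetical and, unlike SURPRISE's `KFold` (which shuffles by default), it skips shuffling to keep the example short.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first folds so sizes differ by at most one.
        size = n_samples // n_splits + (1 if fold < n_samples % n_splits else 0)
        stop = start + size
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop

folds = list(kfold_indices(n_samples=12, n_splits=5))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)
```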

In [8]:
data = make_dataset()

--2023-07-06 11:01:20--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2023-07-06 11:01:20 (8.12 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [9]:
trainset, testset, trainsetfull = build_features(data, 0.15, 42)

In [10]:
# save the lists of inner & raw item ids
trainset_iids = list(trainset.all_items()) # inner item ids
iid_converter = lambda x: trainset.to_raw_iid(x)
trainset_raw_iids = list(map(iid_converter, trainset_iids)) # raw item ids, i.e. movie titles

In [11]:
algos = [
  'BaselineOnly',
  'KNNWithMeans',
  'KNNWithZScore',
  'KNNBaseline',
  'SVD',
  'SVDpp',
  'NMF',
  'SlopeOne',
  'CoClustering'
]

In [12]:
results = dict()
for algo in algos:
  results[algo] = train_model(trainset, algo)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


In [13]:
results

{'BaselineOnly': 0.8824900429621578,
 'KNNWithMeans': 0.9041790317046055,
 'KNNWithZScore': 0.9024747395939957,
 'KNNBaseline': 0.8828909395729079,
 'SVD': 0.8810904905144165,
 'SVDpp': 0.8737186435529336,
 'NMF': 0.9299851823491933,
 'SlopeOne': 0.9086010702465205,
 'CoClustering': 0.9527419328431517}

With default parameters, SVDpp gives the best result on the test set. From here on we work with this algorithm.
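
The same choice can be made programmatically. The sketch below copies the RMSE values from the `results` output, rounded to four decimals, into a hypothetical dictionary `default_rmse` and picks the key with the smallest value.

```python
# RMSE values copied from the `results` output, rounded to four decimals.
default_rmse = {
    'BaselineOnly': 0.8825,
    'KNNWithMeans': 0.9042,
    'KNNWithZScore': 0.9025,
    'KNNBaseline': 0.8829,
    'SVD': 0.8811,
    'SVDpp': 0.8737,
    'NMF': 0.9300,
    'SlopeOne': 0.9086,
    'CoClustering': 0.9527,
}

# Lower RMSE is better, so the best algorithm is the key with the minimum value.
best_algo = min(default_rmse, key=default_rmse.get)
ranking = sorted(default_rmse.items(), key=lambda item: item[1])
print(best_algo)
```

In the notebook itself, `min(results, key=results.get)` would do the same on the unrounded values.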

In [14]:
algo=SVDpp()
cv_results = cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, return_train_measures=True, verbose=False)


In [15]:
cv_results['test_rmse'].mean()

0.8614040035255691

For SVDpp the required result is already reached with 5-fold cross-validation: 0.8614 < 0.87. Let us look at the RMSE on each fold.

In [16]:
kf = KFold(n_splits=5)

algo = SVDpp()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 0.8628
RMSE: 0.8647
RMSE: 0.8563
RMSE: 0.8574
RMSE: 0.8637
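
The per-fold scores printed above can be checked against the target in a couple of lines. This is a small sketch: the list `fold_rmse` simply repeats the printed values and `target` is the 0.87 threshold from the assignment.

```python
# Per-fold RMSE values as printed by accuracy.rmse above.
fold_rmse = [0.8628, 0.8647, 0.8563, 0.8574, 0.8637]
target = 0.87  # threshold from the assignment

mean_rmse = sum(fold_rmse) / len(fold_rmse)
all_below = all(score < target for score in fold_rmse)
worst_fold = max(fold_rmse)
print(round(mean_rmse, 4), all_below, worst_fold)
```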


On every fold the RMSE is below the stated threshold. Let us try to improve the model further by tuning its hyperparameters.

In [None]:
param_grid = {"n_epochs": [10,20], "lr_all": [0.005, 0.01], "reg_all": [0.4, 0.5]}
gs = GridSearchCV(SVDpp, param_grid, measures=["rmse"], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])
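
A grid search tries every combination of the listed values. As a rough sketch of the cost, the snippet below expands the same `param_grid` with `itertools.product`; the names `combinations` and `n_fits` are ours, and the fit count assumes one model fit per combination and fold.

```python
from itertools import product

param_grid = {"n_epochs": [10, 20], "lr_all": [0.005, 0.01], "reg_all": [0.4, 0.5]}

# Build one dict per point of the grid (Cartesian product of all value lists).
names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 5
n_fits = len(combinations) * n_folds  # assumed: one fit per combination per fold
print(len(combinations), n_fits)
```

With two values per parameter the grid has 8 points, so the 5-fold search trains SVDpp 40 times, which explains why this cell is slow.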

## Recommendations

In [None]:
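def generate_recommendation(uid, model, dataset, thresh=4.5, amount=5):
    """
    Return up to `amount` random unseen titles whose predicted rating is >= thresh
    Args:
      uid - raw user id
      model - a fitted Surprise algorithm
      dataset - pandas DataFrame with 'uid' and 'iid' columns
    """
    all_titles = list(dataset['iid'].values)
    users_seen_titles = dataset[dataset['uid'] == uid]['iid']
    titles = np.array(list(set(all_titles) - set(users_seen_titles)))

    np.random.shuffle(titles)

    rec_list = []
    for title in titles:
        review_prediction = model.predict(uid=uid, iid=title)
        rating = review_prediction.est

        if rating >= thresh:
            rec_list.append((title, round(rating, 2)))

            if len(rec_list) >= amount:
                break

    # also return a shorter list when fewer than `amount` titles pass the threshold
    return rec_list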
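
`generate_recommendation` returns the first few randomly ordered titles that pass the threshold. A deterministic alternative, shown here only as a sketch of a different technique, ranks all candidates by predicted rating and keeps the top N. The helper `top_n` and the toy scores are hypothetical; in the notebook the `predict` argument would wrap `model.predict(uid, iid).est`.

```python
import heapq

def top_n(candidates, predict, n=5):
    """Return the n candidates with the highest predicted rating, best first."""
    scored = ((predict(item), item) for item in candidates)
    return [(item, round(score, 2)) for score, item in heapq.nlargest(n, scored)]

# Hypothetical predicted ratings standing in for model.predict(uid, iid).est.
toy_scores = {"A": 4.9, "B": 3.2, "C": 4.6, "D": 4.95, "E": 2.0}
toy_top = top_n(toy_scores, toy_scores.get, n=3)
print(toy_top)
```

Ranking needs a prediction for every unseen title, whereas the threshold version can stop early, so the choice trades speed for a reproducible result.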

In [None]:
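best_params = gs.best_params["rmse"]  # best_params is keyed by measure name
algo = SVDpp(n_epochs=best_params["n_epochs"],
             lr_all=best_params["lr_all"],
             reg_all=best_params["reg_all"])
algo.fit(trainsetfull)  # the model must be fitted before predict() is called
generate_recommendation(2, algo, data.df, thresh=4.8)  # data.df is the DataFrame passed to load_from_df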