Take the [movielens](https://surprise.readthedocs.io/en/stable/dataset.html) dataset and build a matrix factorization model. In this library it is called SVD. Tune the parameters with cross-validation, experiment with other algorithms (SVD++, NMF) as well, and choose the one that performs best.
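As background for the task: Surprise's `SVD` predicts a rating as the global mean plus a user bias, an item bias and the dot product of user and item factor vectors, and fits these by stochastic gradient descent. The sketch below illustrates that idea in plain Python on made-up ratings. The names `fit_biased_mf`, `toy_ratings` and the hyperparameter values are assumptions for illustration; this is not the library's implementation.

```python
import random

def fit_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=42):
    """Fit r_hat = mu + b_u + b_i + q_i . p_u by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient step with L2 regularization (lr ~ lr_all, reg ~ reg_all).
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

def rmse(predict, ratings):
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Made-up toy data: (user, item, rating)
toy_ratings = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4),
               ("b", "z", 1), ("c", "y", 2), ("c", "z", 1)]
toy_mean = sum(r for _, _, r in toy_ratings) / len(toy_ratings)

def baseline(u, i):
    return toy_mean  # always predict the mean rating

model = fit_biased_mf(toy_ratings)
print("baseline RMSE:", rmse(baseline, toy_ratings))
print("model RMSE:   ", rmse(model, toy_ratings))
```

In this sketch `lr` and `reg` play the role that `lr_all` and `reg_all` play in the grid search further down.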

In [1]:
import pandas as pd
from surprise import Dataset, SVD, SVDpp, NMF, accuracy
from surprise.model_selection import GridSearchCV, cross_validate

In [2]:
data = Dataset.load_builtin('ml-100k')

## Matrix factorization model

In [3]:
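# Build the training set from all ratings. The test set is derived from the same
# ratings, so errors measured on it are training errors, not held-out estimates.
trainset = data.build_full_trainset()
testset = trainset.build_testset()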

### SVD

In [4]:
algo_svd = SVD(random_state=42)

algo_svd.fit(trainset)
predict = algo_svd.test(testset)

accuracy.rmse(predictions=predict)

RMSE: 0.6752


0.6751519444923186
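The value above is measured on ratings the model was fitted on, which is why it is much lower than the cross-validated scores reported later. The metric itself is straightforward; here is a minimal sketch, assuming predictions are given as (true rating, estimated rating) pairs. The `pairs` values are made up for illustration.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Made-up (true, estimated) ratings
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
print(f"RMSE: {rmse(pairs):.4f}")
print(f"MAE: {mae(pairs):.4f}")
```

RMSE penalizes large misses more heavily than MAE, so it is never smaller than MAE on the same predictions.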

In [5]:
cross_validate_result = cross_validate(algo_svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

print("RMSE: {:.4f}".format(cross_validate_result['test_rmse'].mean()))
print("MAE: {:.4f}".format(cross_validate_result['test_mae'].mean()))

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9409  0.9299  0.9303  0.9331  0.9454  0.9359  0.0062  
MAE (testset)     0.7415  0.7341  0.7324  0.7377  0.7450  0.7381  0.0046  
Fit time          0.86    0.88    0.88    0.88    0.87    0.87    0.01    
Test time         0.24    0.14    0.14    0.25    0.14    0.18    0.05    
RMSE: 0.9359
MAE: 0.7381
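With `cv=5` the ratings are split into five folds; each fold is held out once while the model is trained on the other four, and the `Mean` column averages the five scores. The sketch below shows one way such a split can be produced in plain Python. The helper `k_fold_indices` is hypothetical and is not how Surprise implements its splitter.

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_splits):
        test_idx = indices[k::n_splits]      # every n_splits-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

folds = list(k_fold_indices(10, n_splits=5))
for train_idx, test_idx in folds:
    print("test:", sorted(test_idx), "train size:", len(train_idx))
```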


### SVD++

In [6]:
algo_svdpp = SVDpp(random_state=42)

algo_svdpp.fit(trainset)
predict = algo_svdpp.test(testset)

accuracy.rmse(predictions=predict)

RMSE: 0.7574


0.7574410731286018

In [7]:
cross_validate_result = cross_validate(algo_svdpp, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

print("RMSE: {:.4f}".format(cross_validate_result['test_rmse'].mean()))
print("MAE: {:.4f}".format(cross_validate_result['test_mae'].mean()))

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9255  0.9137  0.9195  0.9117  0.9196  0.9180  0.0049  
MAE (testset)     0.7256  0.7161  0.7209  0.7168  0.7210  0.7201  0.0034  
Fit time          13.87   21.22   14.05   13.97   22.57   17.14   3.91    
Test time         4.44    4.37    4.47    4.46    4.42    4.43    0.04    
RMSE: 0.9180
MAE: 0.7201


### NMF

In [8]:
algo_nmf = NMF(random_state=42)

algo_nmf.fit(trainset)
predict = algo_nmf.test(testset)

accuracy.rmse(predictions=predict)

RMSE: 0.8221


0.8221107626437083

In [9]:
cross_validate_result = cross_validate(algo_nmf, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

print("RMSE: {:.4f}".format(cross_validate_result['test_rmse'].mean()))
print("MAE: {:.4f}".format(cross_validate_result['test_mae'].mean()))

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9569  0.9708  0.9549  0.9680  0.9625  0.9626  0.0061  
MAE (testset)     0.7508  0.7599  0.7502  0.7633  0.7580  0.7564  0.0051  
Fit time          1.58    1.62    1.60    1.60    1.60    1.60    0.01    
Test time         0.11    0.11    0.11    0.11    0.26    0.14    0.06    
RMSE: 0.9626
MAE: 0.7564


## Best algorithm (cross-validation RMSE)

1. SVDpp (0.9180)
2. SVD (0.9359)
3. NMF (0.9626)

## Best parameters

In [12]:
param_grid = {"random_state": [42],
              "n_epochs": [15, 20], "lr_all": [0.007], "reg_all": [0.02, 0.2]
              }
gs = GridSearchCV(SVDpp, param_grid, measures=["rmse"], cv=5, n_jobs=-1)

gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

0.9173757466265855
{'random_state': 42, 'n_epochs': 15, 'lr_all': 0.007, 'reg_all': 0.02}
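`GridSearchCV` evaluates every combination of the values in `param_grid` with cross-validation and keeps the one with the best mean score. To see how much work that is for the grid used here, the combinations can be enumerated in plain Python. This is a sketch of the enumeration only; the helper `expand_grid` is hypothetical and no model is fitted.

```python
from itertools import product

param_grid = {
    "random_state": [42],
    "n_epochs": [15, 20],
    "lr_all": [0.007],
    "reg_all": [0.02, 0.2],
}

def expand_grid(grid):
    """Return a list of dicts, one per combination of parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

combinations = expand_grid(param_grid)
for params in combinations:
    print(params)

n_folds = 5  # matches cv=5 in the grid search above
print(len(combinations), "combinations x", n_folds, "folds =", len(combinations) * n_folds, "fits")
```

Only `n_epochs` and `reg_all` actually vary, so the search compares four configurations.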


## Recommendation

In [42]:
movie_list = pd.read_csv("content/data/ml-latest-small/movies.csv")
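Note that the ratings come from `ml-100k`, while `movies.csv` belongs to `ml-latest-small`, a different MovieLens release with its own movie ids, so titles looked up in it may not correspond to the rated items. The `ml-100k` archive ships its own `u.item` file with pipe-separated `id|title|...` lines. The sketch below parses lines in that layout into an id-to-title mapping. The two sample lines are abbreviated for illustration and the helper `parse_item_titles` is hypothetical; the location of `u.item` on disk is left out because it depends on the local setup.

```python
def parse_item_titles(lines):
    """Map raw item id (string) -> title from pipe-separated `id|title|...` lines."""
    titles = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        item_id, title = line.split("|")[:2]
        titles[item_id] = title
    return titles

# Abbreviated sample lines in the u.item layout (remaining fields omitted)
sample_lines = [
    "1|Toy Story (1995)|01-Jan-1995|",
    "50|Star Wars (1977)|01-Jan-1977|",
]
titles = parse_item_titles(sample_lines)
print(titles.get("50", "<unknown>"))
```

Keeping ids as strings matches the raw ids that Surprise returns in its predictions.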

In [13]:
trainset = data.build_full_trainset()
testset = trainset.build_testset()

In [16]:
algo = gs.best_estimator["rmse"]

algo.fit(trainset)
predict = algo.test(testset)
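Because `testset` is built from `trainset`, `predict` contains estimates only for movies each user has already rated. For recommending new movies, the usual input is the set of user-item pairs without a rating, which Surprise can generate with `trainset.build_anti_testset()`. The idea behind it, sketched in plain Python with made-up ids (the helper `build_unseen_pairs` is hypothetical):

```python
def build_unseen_pairs(ratings):
    """Return every (user, item) pair that has no rating in `ratings`."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    seen = {(u, i) for u, i, _ in ratings}
    return [(u, i) for u in users for i in items if (u, i) not in seen]

# Made-up ratings: (user, item, rating)
toy_ratings = [("u1", "i1", 3.0), ("u1", "i2", 4.0), ("u2", "i3", 3.0)]
unseen = build_unseen_pairs(toy_ratings)
print(unseen)
```

On the full dataset the number of unseen pairs is close to users times items, so predicting all of them takes noticeably longer than predicting the rated pairs.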

In [55]:
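def get_top_n_recommendation(predictions, n=10):
    """Return {user id: [(item id, estimated rating), ...]} with the n highest estimates per user."""
    # Group the estimated ratings by user.
    top_n = dict()
    for uid, iid, true_r, est, _ in predictions:
        top_n.setdefault(uid, [])
        top_n[uid].append((iid, est))

    # Sort each user's items by estimated rating, highest first, and keep the first n.
    for uid, user_ratings in top_n.items():
        top_n[uid] = sorted(user_ratings, key=lambda x: x[1], reverse=True)[:n]

    return top_n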
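To check the behavior of this function without fitting a model, it can be run on a few hand-written predictions. The block below is a self-contained copy of the function with toy input; the tuples mimic the `(uid, iid, true rating, estimate, details)` layout that the function unpacks, and all values are made up.

```python
def get_top_n_recommendation(predictions, n=10):
    """Return {user id: [(item id, estimate), ...]} with the n highest estimates per user."""
    top_n = dict()
    for uid, iid, true_r, est, _ in predictions:
        top_n.setdefault(uid, [])
        top_n[uid].append((iid, est))

    for uid, user_ratings in top_n.items():
        top_n[uid] = sorted(user_ratings, key=lambda x: x[1], reverse=True)[:n]

    return top_n

# Made-up predictions: (uid, iid, true rating, estimate, details)
toy_predictions = [
    ("u1", "i1", 4.0, 3.9, {}),
    ("u1", "i2", 2.0, 4.4, {}),
    ("u1", "i3", 5.0, 4.1, {}),
    ("u2", "i1", 3.0, 2.5, {}),
]
top = get_top_n_recommendation(toy_predictions, n=2)
print(top)
```

The ranking uses the estimate, not the true rating, so `i2` comes first for `u1` even though the user actually rated it lowest.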

In [57]:
count_recommendation = 10
top_n_recommendation = get_top_n_recommendation(predictions=predict, n=count_recommendation)

In [70]:
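for uid in list(top_n_recommendation)[:20]:
    print(f"Top {count_recommendation} movie for user `{uid}` :")
    for item_id, est in top_n_recommendation[uid]:
        # NOTE: .get() looks the title up by row label in movies.csv (ml-latest-small),
        # which is not guaranteed to match the ml-100k item id.
        print("-", item_id, movie_list['title'].get(int(item_id)), f"{est:2f}")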


Top 10 movie for user `196` :
- 251 Once Were Warriors (1994) 4.158705
- 8 Sudden Death (1995) 4.071695
- 1007 Female Perversions (1996) 4.042307
- 242 Murder in the First (1995) 4.029599
- 285 Tank Girl (1995) 4.006705
- 269 Rent-a-Kid (1995) 3.907000
- 655 Spitfire Grill, The (1996) 3.886761
- 663 Kansas City (1996) 3.883407
- 286 Tales from the Crypt Presents: Demon Knight (1995) 3.875516
- 116 Birdcage, The (1996) 3.853859
Top 10 movie for user `186` :
- 300 Muriel's Wedding (1994) 4.227949
- 71 Juror, The (1996) 4.206991
- 79 Black Sheep (1996) 4.078892
- 742 African Queen, The (1951) 4.025933
- 385 Demolition Man (1993) 3.978472
- 117 Brothers McMullen, The (1995) 3.957155
- 470 Sirens (1994) 3.940768
- 684 Breakfast at Tiffany's (1961) 3.928978
- 95 Muppet Treasure Island (1996) 3.921227
- 588 Great White Hype, The (1996) 3.918295
Top 10 movie for user `22` :
- 172 Under Siege 2: Dark Territory (1995) 5.000000
- 50 Georgia (1995) 5.000000
- 174 Unzipped (1995) 5.000000
- 511 Sno