In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [3]:
import pandas as pd

pd.options.mode.chained_assignment = None  # Disable the warning

import pickle
import numpy as np

from metrics import RmseCalculator, TestMetricsCalculator
from rating import get_explicit_rating, get_implicit_rating_out_of_positive_ratings_df, split_matrix_csr, \
    sanity_check_implicit_rating, sanity_check_explicit_split, sanity_check_explicit_matrix
from tuning import GridSearchSvdPP

In [54]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)

# Feature selection

The only dataset needed for our purposes is the **review** dataset, because:
- it contains the information about explicit ratings (the mean of the **stars** field for each user-item pair; see the chapter **Feature engineering** for more details)
- it contains the information for the implicit rating (see the chapter **Feature engineering** for more details)
- it already contains only those users who wrote at least one review and those items that received at least one review

In [4]:
PATH = '../../eda/dataset_samples/df_yelp_review_open_health_10.parquet'

There are two ways of working with the dataset, which contains more than **900 000** rows:

*The standard flow:*
1. Download the dataset from Kaggle
2. Drop the reviews for closed businesses (keep only `is_open == True`) and select a particular business category (see the chosen category in `/eda/yelp.ipynb`)
3. Drop the reviews written by users who rated fewer than **threshold** items (see the threshold definition in `/eda/yelp.ipynb`)
4. Use the whole dataset for analysis, training and testing

*The "sampled" flow* (provided in `/eda/yelp.ipynb`):
1. Download the dataset from Kaggle
2. Drop the reviews for closed businesses (keep only `is_open == True`) and select a particular business category (see the chosen category in `/eda/yelp.ipynb`)
3. Drop the reviews written by users who rated fewer than **threshold** items (see the threshold definition in `/eda/yelp.ipynb`)
4. Sample **10%** randomly
5. Compare the statistical moments (`.describe()`) and the class balance (`.value_counts()`) of the sample with those of the main dataset (the initial Yelp data)

To save time and because of limited computing power, **the "sampled" flow** was chosen during EDA, and the sampled dataset was saved under **`/eda/dataset_samples`**
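Steps 4 and 5 of the "sampled" flow can be sketched on toy data as follows. This is only an illustration: the `reviews` frame, its values and the fixed seed are made up here, and the real check lives in `/eda/yelp.ipynb`.

```python
import pandas as pd

# Toy stand-in for the filtered review table (hypothetical data).
reviews = pd.DataFrame({
    "user_id": [f"u{i % 50}" for i in range(1000)],
    "stars": [1 + (i * 7) % 5 for i in range(1000)],
})

# Step 4: draw a 10% random sample; the seed is fixed only for reproducibility.
sample = reviews.sample(frac=0.10, random_state=42)

# Step 5: put the moments and the class balance of both tables side by side.
moments = pd.concat(
    {"full": reviews["stars"].describe(), "sample": sample["stars"].describe()},
    axis=1,
)
balance = pd.concat(
    {
        "full": reviews["stars"].value_counts(normalize=True),
        "sample": sample["stars"].value_counts(normalize=True),
    },
    axis=1,
).sort_index()

print(moments)
print(balance)
```

If the two columns of `moments` and `balance` are close, the sample can reasonably stand in for the full dataset.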

Review features:
- `review_id` | `user_id` | `business_id` - the ID of the review and the foreign keys (one user can leave several reviews for one item)
- `stars` - the **explicit rating** given by the user to a particular item at a particular moment
- `useful` | `funny` | `cool` - reaction flags that (presumably) refer to the review itself. We keep these features at this stage because we first need evidence that this assumption about their nature is right, that is, to check **whether they can be used for the implicit rating**
- `text` - the content of the review (can be useful for potential sentiment analysis)
- `date` - the timestamp of the review

In [5]:
review_df = pd.read_parquet(PATH)
review_df

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,year,review_length
0,A8JR77dMCrnymyMhPQE7Qw,tL2pS5UOmN6aAOi3Z-qFGg,kEC675O6YwRH30ImVxBCCA,4,2017-02-16,I'm new to the area and stopped in here a few ...,0,0,0,2017,96
1,6Gd8tlhutYDMifKEzCvHyw,tL2pS5UOmN6aAOi3Z-qFGg,07F0eE_JHkH4Op0vpR_v4A,2,2013-05-09,I lived in this neighborhood a number of years...,2,0,0,2013,104
2,ZDMj4LnkO26QtM-CDtV94w,tL2pS5UOmN6aAOi3Z-qFGg,m-_IOYAreUyy_uyK9U3niQ,5,2013-02-03,It's been a few years since I've hiked this tr...,3,1,1,2013,109
3,8yfmu2iAagShVXJUGVTELw,tL2pS5UOmN6aAOi3Z-qFGg,6zry3kyGHiplbQ4rdqxbaQ,4,2013-07-17,This is a challenging hike to one of the most ...,0,0,0,2013,46
4,HaIBF7a1HjFT-gOIlmSIdQ,tL2pS5UOmN6aAOi3Z-qFGg,b3z314J6wktaNVblxumiug,5,2014-07-18,This is one if my favorite trails in the Redro...,0,0,0,2014,42
...,...,...,...,...,...,...,...,...,...,...,...
31736,JZJts6Y7gG5Uog6zyiXo5w,IqIpCfg0qDhIkaUJGKzlyw,KlLCJN_KUP9xFQBJYrhgVg,5,2012-10-21,pretty hospital great staff!! I had a great p...,0,0,0,2012,31
31737,Lp4ZHWDrDoXM2HHkZTh90g,IqIpCfg0qDhIkaUJGKzlyw,U-9uOCu4tG4idBAnMPmZTw,4,2012-08-15,I like this place because I live by it!! Mini ...,1,1,1,2012,57
31738,oQP82Wz-gIlV2qG53EacGA,IqIpCfg0qDhIkaUJGKzlyw,wghDrzcZ0VloAtaIZ7GEBg,4,2013-06-07,There's parking thank goodness. I hiked here f...,2,0,0,2013,41
31739,KZEEqRkPNDVTCZCBb93iYA,IqIpCfg0qDhIkaUJGKzlyw,4xkjmpgUNJdwQo8FKIYp6Q,3,2012-07-28,I bought a groupon yay at $13 and it was so wo...,5,1,1,2012,166


Reasons for dropping features:
- the `useful | funny | cool` features describe how other users reacted to this particular review, not to the item, so they can't be used to calculate the **explicit** or **implicit** ratings (our assumption is that these features are reactions given to a review, which in theory describe user-to-user relations)
- `review_id` won't be dropped; it will be used as the index since it's a unique field
- the `text` of the review won't be used for the implicit or explicit ratings, so this feature can also be dropped

In [6]:
REMAINED_FEATURES = ['review_id', 'user_id', 'business_id', 'stars', 'date']

# Select the remaining columns and index by the unique review ID in one step
filtered_review_df = review_df[REMAINED_FEATURES].set_index('review_id')
filtered_review_df

review_id,user_id,business_id,stars,date
A8JR77dMCrnymyMhPQE7Qw,tL2pS5UOmN6aAOi3Z-qFGg,kEC675O6YwRH30ImVxBCCA,4,2017-02-16
6Gd8tlhutYDMifKEzCvHyw,tL2pS5UOmN6aAOi3Z-qFGg,07F0eE_JHkH4Op0vpR_v4A,2,2013-05-09
ZDMj4LnkO26QtM-CDtV94w,tL2pS5UOmN6aAOi3Z-qFGg,m-_IOYAreUyy_uyK9U3niQ,5,2013-02-03
8yfmu2iAagShVXJUGVTELw,tL2pS5UOmN6aAOi3Z-qFGg,6zry3kyGHiplbQ4rdqxbaQ,4,2013-07-17
HaIBF7a1HjFT-gOIlmSIdQ,tL2pS5UOmN6aAOi3Z-qFGg,b3z314J6wktaNVblxumiug,5,2014-07-18
...,...,...,...,...
JZJts6Y7gG5Uog6zyiXo5w,IqIpCfg0qDhIkaUJGKzlyw,KlLCJN_KUP9xFQBJYrhgVg,5,2012-10-21
Lp4ZHWDrDoXM2HHkZTh90g,IqIpCfg0qDhIkaUJGKzlyw,U-9uOCu4tG4idBAnMPmZTw,4,2012-08-15
oQP82Wz-gIlV2qG53EacGA,IqIpCfg0qDhIkaUJGKzlyw,wghDrzcZ0VloAtaIZ7GEBg,4,2013-06-07
KZEEqRkPNDVTCZCBb93iYA,IqIpCfg0qDhIkaUJGKzlyw,4xkjmpgUNJdwQo8FKIYp6Q,3,2012-07-28


# Explicit rating extracting

Convert the "date" column in the filtered dataset to a UNIX timestamp (in seconds)

In [7]:
filtered_review_df["date"] = pd.to_datetime(filtered_review_df["date"]).astype(np.int64) // 10 ** 9
filtered_review_df

review_id,user_id,business_id,stars,date
A8JR77dMCrnymyMhPQE7Qw,tL2pS5UOmN6aAOi3Z-qFGg,kEC675O6YwRH30ImVxBCCA,4,1487203200
6Gd8tlhutYDMifKEzCvHyw,tL2pS5UOmN6aAOi3Z-qFGg,07F0eE_JHkH4Op0vpR_v4A,2,1368057600
ZDMj4LnkO26QtM-CDtV94w,tL2pS5UOmN6aAOi3Z-qFGg,m-_IOYAreUyy_uyK9U3niQ,5,1359849600
8yfmu2iAagShVXJUGVTELw,tL2pS5UOmN6aAOi3Z-qFGg,6zry3kyGHiplbQ4rdqxbaQ,4,1374019200
HaIBF7a1HjFT-gOIlmSIdQ,tL2pS5UOmN6aAOi3Z-qFGg,b3z314J6wktaNVblxumiug,5,1405641600
...,...,...,...,...
JZJts6Y7gG5Uog6zyiXo5w,IqIpCfg0qDhIkaUJGKzlyw,KlLCJN_KUP9xFQBJYrhgVg,5,1350777600
Lp4ZHWDrDoXM2HHkZTh90g,IqIpCfg0qDhIkaUJGKzlyw,U-9uOCu4tG4idBAnMPmZTw,4,1344988800
oQP82Wz-gIlV2qG53EacGA,IqIpCfg0qDhIkaUJGKzlyw,wghDrzcZ0VloAtaIZ7GEBg,4,1370563200
KZEEqRkPNDVTCZCBb93iYA,IqIpCfg0qDhIkaUJGKzlyw,4xkjmpgUNJdwQo8FKIYp6Q,3,1343433600


Calculating the **explicit rating** for the filtered dataset. 

The output consists of two CSR matrices with identical structure: the first matrix contains **the mean review rating** given by user *u_i* to business *b_i*, and the second matrix stores **the timestamp of the latest review** at the same positions. 

Additionally, two utility dictionaries are provided, containing mappings **between IDs and matrix indices** (and vice versa).
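To make the structure of these four outputs concrete, here is a minimal sketch of how such matrices could be built. `build_explicit_matrices` is a hypothetical stand-in, not the actual `get_explicit_rating` from `rating.py`; the `id_to_idx` key is also an assumption (only `idx_to_id` is used later in this notebook).

```python
import pandas as pd
from scipy.sparse import csr_matrix


def build_explicit_matrices(df, user_field, item_field, rating_field, date_field):
    """Hypothetical sketch: mean rating and latest timestamp per (user, item) pair."""
    # Collapse repeated reviews of the same pair into one row.
    agg = (
        df.groupby([user_field, item_field])
        .agg(rating=(rating_field, "mean"), last_date=(date_field, "max"))
        .reset_index()
    )
    # Map raw IDs to consecutive matrix indices, and back.
    user_ids = sorted(agg[user_field].unique())
    item_ids = sorted(agg[item_field].unique())
    user_mapping = {"id_to_idx": {u: i for i, u in enumerate(user_ids)},
                    "idx_to_id": dict(enumerate(user_ids))}
    item_mapping = {"id_to_idx": {b: j for j, b in enumerate(item_ids)},
                    "idx_to_id": dict(enumerate(item_ids))}
    rows = agg[user_field].map(user_mapping["id_to_idx"]).to_numpy()
    cols = agg[item_field].map(item_mapping["id_to_idx"]).to_numpy()
    shape = (len(user_ids), len(item_ids))
    ratings = csr_matrix((agg["rating"].to_numpy(dtype=float), (rows, cols)), shape=shape)
    dates = csr_matrix((agg["last_date"].to_numpy(dtype="int64"), (rows, cols)), shape=shape)
    return ratings, dates, user_mapping, item_mapping


# Toy data: user u1 reviewed business b1 twice.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "business_id": ["b1", "b1", "b2"],
    "stars": [4, 2, 5],
    "date": [100, 200, 150],
})
ratings, dates, user_map, item_map = build_explicit_matrices(
    toy, "user_id", "business_id", "stars", "date"
)
print(ratings.toarray())
print(dates.toarray())
```

Both matrices share the same sparsity pattern, so a position that holds a mean rating in the first matrix holds the timestamp of the latest review in the second.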

In [8]:
explicit_ratings, last_dates, user_mapping, item_mapping = get_explicit_rating(filtered_review_df, "user_id",
                                                                               "business_id", "stars", "date")

explicit_ratings.toarray(), last_dates.toarray()

(array([[4., 2., 5., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 4., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array([[1487203200, 1368057600, 1359849600, ...,          0,          0,
                  0],
        [         0,          0,          0, ...,          0,          0,
                  0],
        [         0,          0,          0, ...,          0,          0,
                  0],
        ...,
        [         0,          0,          0, ...,          0,          0,
                  0],
        [         0,          0,          0, ..., 1486771200,          0,
                  0],
        [         0,          0,          0, ...,          0,          0,
                  0]]))

**Sanity check**:
* the number of filled cells in the sparse matrices (`.nnz`) must equal **the number of unique pairs** of users and items
* the numbers are **the same** (31740 in all three cases)

In [12]:
sanity_check_explicit_matrix(explicit_ratings=explicit_ratings, last_dates=last_dates, review_df=filtered_review_df,
                             user_field="user_id", item_field="business_id")

Unnamed: 0,Source,Calculated metrics,Value
0,Explicit ratings matrix,Non-zero entries,31740
1,Last dates matrix,Non-zero entries,31740
2,Filtered review DataFrame,"Unique (user_id, business_id) pairs",31740


# Train / validation / test split

Define the split proportions of the initial matrix (in the order **test / validation / train**, according to the documentation of the split function)

In [9]:
DIVISIONS = [0.1, 0.2, 0.7]

Split the matrix in the proportions `0.1, 0.2, 0.7` for the **test**, **validation** and **train** sets.
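Since the split function receives the matrix of last review dates, one plausible reading is a split by recency. The sketch below is written under that assumption: `split_by_recency` is a hypothetical helper, not the real `split_matrix_csr`, whose exact rules (for example, whether it splits per user) are not shown in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix


def split_by_recency(ratings, last_dates, divisions):
    """Hypothetical sketch: newest interactions go to test, then validation, rest to train."""
    coo = ratings.tocoo()
    # Timestamp of every stored rating, looked up at the same (row, col) position.
    timestamps = np.asarray(last_dates.tocsr()[coo.row, coo.col]).ravel()
    order = np.argsort(-timestamps, kind="stable")  # newest first
    test_frac, val_frac, _ = divisions
    n_test = int(round(coo.nnz * test_frac))
    n_val = int(round(coo.nnz * val_frac))
    parts = np.split(order, [n_test, n_test + n_val])

    def take(idx):
        return csr_matrix((coo.data[idx], (coo.row[idx], coo.col[idx])), shape=coo.shape)

    return tuple(take(idx) for idx in parts)


# Toy data: 10 interactions with timestamps 1..10.
toy_ratings = csr_matrix(np.array([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]], dtype=float))
toy_dates = csr_matrix(np.arange(1, 11).reshape(2, 5))
test_m, val_m, train_m = split_by_recency(toy_ratings, toy_dates, [0.1, 0.2, 0.7])
print(test_m.nnz, val_m.nnz, train_m.nnz)
```

The three parts are disjoint and together contain every original interaction, which is the property the sanity check after the split verifies.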



In [10]:
test_matrix, validation_matrix, train_matrix = split_matrix_csr(explicit_ratings, last_dates, DIVISIONS)
train_matrix.toarray(), validation_matrix.toarray(), test_matrix.toarray()

(array([[0., 2., 5., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 4., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array([[4., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]))

**Sanity check** (verify that the explicit matrix has been correctly split into **train, validation, and test** subsets):
* The total number of interactions (`nnz`) across the splits matches the original explicit matrix, so no interactions are lost during the split.
* The proportions of data in each split (Train, Validation, Test) **almost** align with the intended ratios (71.04% / 19.33% / 9.63% against the intended 70% / 20% / 10%).

In [37]:
sanity_check_explicit_split(train_matrix=train_matrix, validation_matrix=validation_matrix, test_matrix=test_matrix,
                            explicit_matrix=explicit_ratings)

Unnamed: 0,Split,Number of interactions,Part of factual interactions
0,Train,22548,71.04%
1,Validation,6136,19.33%
2,Test,3056,9.63%
3,Explicit total,31740,100.0%
4,Factual total,31740,100%


# Implicit rating extraction for train dataset

**The threshold** for the implicit rating calculation (only positive ratings are considered; explicit ratings range `from 1 to 5`)

In [16]:
IMPLICIT_THRESHOLD = 4

Calculate the **implicit rating** in the following way:
* count the reviews from `u_i` to `b_i` whose number of stars is at or above `IMPLICIT_THRESHOLD`

Final artifact:
* dict in the format `{<user_id>: {<item_id>: <number of positive reviews>} }`
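The counting rule can be sketched in plain Python. `count_positive_reviews` is a hypothetical helper, not the real `get_implicit_rating_out_of_positive_ratings_df`; it treats a review as positive when `stars >= threshold`, which is the comparison shown in the sanity check table for this step.

```python
from collections import defaultdict


def count_positive_reviews(reviews, threshold):
    """Hypothetical sketch. reviews: iterable of (user_id, item_id, stars) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for user_id, item_id, stars in reviews:
        if stars >= threshold:  # a review is "positive" at or above the threshold
            counts[user_id][item_id] += 1
    # Convert to plain dicts: {user_id: {item_id: number of positive reviews}}
    return {user: dict(items) for user, items in counts.items()}


toy_reviews = [
    ("u1", "b1", 5),
    ("u1", "b1", 4),
    ("u1", "b2", 3),
    ("u2", "b1", 2),
]
toy_implicit = count_positive_reviews(toy_reviews, threshold=4)
print(toy_implicit)
```

In this sketch a user without any positive review simply does not appear in the result; whether the real function handles such users differently is not visible from this notebook.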

In [17]:
implicit_ratings = get_implicit_rating_out_of_positive_ratings_df(df=filtered_review_df, user_field='user_id',
                                                                  item_field='business_id', rating_field='stars',
                                                                  implicit_threshold=IMPLICIT_THRESHOLD)

len(implicit_ratings.keys()), filtered_review_df['user_id'].nunique()

(1696, 1696)

**Sanity check** (verify that `implicit_ratings` was built correctly from the filtered review DataFrame using the specified `IMPLICIT_THRESHOLD`):

Metrics:
* Total number of reviews in the original dataset that meet or exceed the implicit threshold: confirms that all qualifying reviews were included in the final implicit ratings.
* Number of distinct users present in the filtered dataset before conversion: ensures that no user information was lost during the transformation.
* Total number of businesses reviewed in the original dataset: confirms that all relevant business interactions are retained.

Results:
* All user and business counts match between the initial and processed datasets.
* The number of implicit ratings equals the number of qualifying reviews, which indicates a correct threshold-based transformation.

In [16]:
sanity_check_implicit_rating(initial_df=filtered_review_df, implicit_ratings=implicit_ratings,
                             implicit_threshold=IMPLICIT_THRESHOLD, user_field='user_id', item_field='business_id',
                             rating_field='stars')

Unnamed: 0,Metric,Value
0,Number of reviews (stars >= threshold),22636
1,Number of reviews in implicit_ratings,22636
2,Unique users in initial reviews,1696
3,Unique users in implicit_ratings,1696
4,Unique businesses in initial reviews,8406
5,Unique businesses in implicit_ratings,8406


# Hyperparameters tuning

Define the candidate hyperparameter values for the **SVD++** implementation

In [17]:
svd_pp_param_grid = {
    'n_factors': [20, 40, 70, 100, 150],  # Number of latent factors
    'n_epochs': [10, 20, 30, 50],  # Number of training iterations
    'lr_all': [0.002, 0.005, 0.007, 0.01],  # Learning rate
    'reg_all': [0.02, 0.05, 0.1, 0.2]  # Regularization strength
}

Using the **train and validation** sets, run a grid search on the **RMSE** metric and extract the hyperparameters that give the best value of this metric on the validation set.

Best hyperparameters (based on the validation matrix):
* **learning rate**: 0.007
* **number of epochs**: 50
* **number of latent factors**: 40
* **regularization term**: 0.2

Best RMSE: **1.111** (roughly speaking, the model's rating estimates are off by about 1.111 points)
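The grid above has 5 × 4 × 4 × 4 = 320 combinations, and each one requires a full training run. The general shape of such an exhaustive search can be sketched as follows; `grid_search` and `toy_score` are hypothetical names, and the scoring function is a made-up stand-in for "train SVD++ and compute the validation RMSE", not the logic of `GridSearchSvdPP`.

```python
from itertools import product


def grid_search(param_grid, fit_and_score):
    """Try every combination and keep the one with the lowest score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)  # in the notebook this would be the validation RMSE
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Made-up objective with a known minimum, used only to exercise the loop.
    return abs(params["n_factors"] - 40) + abs(params["lr_all"] - 0.007)


toy_grid = {"n_factors": [20, 40], "lr_all": [0.002, 0.007]}
toy_best_params, toy_best_score = grid_search(toy_grid, toy_score)
print(toy_best_params, toy_best_score)
```

Because the search is exhaustive, its cost grows multiplicatively with every value added to the grid.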

In [18]:
grid_search_svd_pp = GridSearchSvdPP(train_matrix=train_matrix, val_matrix=validation_matrix,
                                     implicit_rating=implicit_ratings, user_mapping=user_mapping,
                                     item_mapping=item_mapping, param_grid=svd_pp_param_grid)

best_params, best_score, best_svdpp_model = grid_search_svd_pp.run()

print(f"Best params: {best_params}")
print(f"Best RMSE: {best_score}")

INFO:root:Try number: 1
INFO:root:Train with params: {'lr_all': 0.002, 'n_epochs': 10, 'n_factors': 20, 'reg_all': 0.02}
INFO:root:Epoch: 1
INFO:root:Epoch: 2
INFO:root:Epoch: 3
INFO:root:Epoch: 4
INFO:root:Epoch: 5
INFO:root:Epoch: 6
INFO:root:Epoch: 7
INFO:root:Epoch: 8
INFO:root:Epoch: 9
INFO:root:Epoch: 10
INFO:root:Current common score: 1.1608847379240155
INFO:root:Try number: 2
INFO:root:Train with params: {'lr_all': 0.002, 'n_epochs': 10, 'n_factors': 20, 'reg_all': 0.05}
INFO:root:Epoch: 1
INFO:root:Epoch: 2
INFO:root:Epoch: 3
INFO:root:Epoch: 4
INFO:root:Epoch: 5
INFO:root:Epoch: 6
INFO:root:Epoch: 7
INFO:root:Epoch: 8
INFO:root:Epoch: 9
INFO:root:Epoch: 10
INFO:root:Current common score: 1.1613874308488443
INFO:root:Try number: 3
INFO:root:Train with params: {'lr_all': 0.002, 'n_epochs': 10, 'n_factors': 20, 'reg_all': 0.1}
INFO:root:Epoch: 1
INFO:root:Epoch: 2
INFO:root:Epoch: 3
INFO:root:Epoch: 4
INFO:root:Epoch: 5
INFO:root:Epoch: 6
INFO:root:Epoch: 7
INFO:root:Epoch: 8
...

Best params: {'lr_all': 0.007, 'n_epochs': 50, 'n_factors': 40, 'reg_all': 0.2}
Best RMSE: 1.1113174596624342


The following code saves the resulting object so that **the trained model** can be reused in the service

In [21]:
with open("../../models/svd_pp_yelp.pkl", "wb") as f:
    pickle.dump(best_svdpp_model, f)

# Model testing

Load the saved model back from disk

In [11]:
with open("../../models/svd_pp_yelp.pkl", 'rb') as f:
    test_model = pickle.load(f)

test_model

<svdpp.SVDpp at 0x14ed25a90>

Create a metrics evaluator

In [38]:
metrics_calculator = TestMetricsCalculator(test_matrix=test_matrix, model=test_model,
                                           idx_to_user_id=user_mapping['idx_to_id'],
                                           idx_to_item_id=item_mapping['idx_to_id'])

INFO:root:Create top-10 recommendations' list
INFO:root:User: 0 -- top 10 list -- [(4059, 4.822973146458476), (661, 4.773654694302385), (56, 4.768638517032372), (2570, 4.749373595300882), (1985, 4.747255121617102), (2468, 4.743219493650465), (4835, 4.73297144342073), (2667, 4.726983795599641), (15, 4.708776007700817), (253, 4.69547095145208)]
INFO:root:User: 1 -- top 10 list -- [(661, 5.164569215075879), (2667, 5.152540502311157), (1367, 5.137254876611186), (253, 5.134788011457491), (20, 5.134069811720325), (56, 5.131850795036773), (1985, 5.12557827617598), (2554, 5.120673624576117), (4059, 5.11937558521036), (1048, 5.108466026248805)]
INFO:root:User: 2 -- top 10 list -- [(2554, 4.671056949491202), (2667, 4.657125403581349), (2246, 4.6490770308496305), (3577, 4.632151509366688), (253, 4.620593567301276), (39, 4.610692025939996), (4063, 4.600853485149394), (6891, 4.577547551679746), (3348, 4.574189771117739), (1230, 4.567654880470862)]
INFO:root:User: 3 -- top 10 list -- [(253, 4.929938

As we can see, our `test set` puts us in a **cold start** state, because:
- the maximum popularity is less than **0.005** (the largest share of users who consider a particular item **relevant** is less than **0.5%**)
- the percentage of filled `user - item` pairs (a filled pair is treated as the flag for **relevance**) is less than **0.02%**

This means that we'll most probably get a **novel** and **relevant** system, but one with poor **diversity** and **coverage** (in the cases where these are linked to the `relevance flag`)
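The two indicators used here can be reproduced on a toy matrix. This is a sketch under the assumption that an item's popularity is the share of users with a filled cell for that item; the actual `get_test_set_statistic` implementation is not shown in this notebook.

```python
import numpy as np

# Toy test matrix: 4 users x 5 items, a non-zero cell is a rating.
toy_test = np.array([
    [4, 0, 0, 0, 0],
    [0, 0, 5, 0, 0],
    [3, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

n_users, n_items = toy_test.shape
filled = toy_test > 0

# Popularity of an item: share of users who rated it.
popularity = filled.sum(axis=0) / n_users

# Density: percentage of filled user-item pairs among all possible pairs.
density_pct = 100 * filled.sum() / filled.size

print("Mean popularity:", popularity.mean())
print("Max popularity:", popularity.max())
print("% of non-null pairs:", density_pct)
```

The lower both numbers are, the fewer relevant user-item pairs a top-N list can possibly hit.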

In [53]:
metrics_calculator.get_test_set_statistic()

Mean popularity                0.000203
Max popularity                 0.004717
Min popularity                 0.000000
Number of pairs         17212704.000000
Non-null pairs (u-i)        3056.000000
% of non-null pairs            0.017754
Relevant pairs (u-i)        3056.000000
% of relevant pairs            0.017754
dtype: object

## Relevance metrics

RMSE on the `test dataset`: **1.179** (roughly speaking, the model's rating estimates are off by about 1.179 points)

We assume that this is the RMSE on real data (the newest reviews)
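For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. Because the errors are squared before averaging, large misses weigh more, so RMSE is never smaller than the mean absolute error. A minimal sketch on made-up numbers (the `rmse` helper below is illustrative, not the `RmseCalculator` class):

```python
import math


def rmse(pairs):
    """pairs: list of (predicted, actual) ratings."""
    squared_errors = [(predicted - actual) ** 2 for predicted, actual in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(pairs):
    """Mean absolute error, shown only for comparison."""
    return sum(abs(predicted - actual) for predicted, actual in pairs) / len(pairs)


toy_pairs = [(4.0, 5.0), (3.0, 3.0), (2.0, 4.0)]
print("RMSE:", rmse(toy_pairs))
print("MAE:", mae(toy_pairs))
```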

In [40]:
rmse_calculator = RmseCalculator(matrix=test_matrix, model=test_model, idx_to_user_id=user_mapping['idx_to_id'],
                                 idx_to_item_id=item_mapping['idx_to_id'])

rmse_calculator.calculate_rmse()

1.1789339324295245

**Recovery** checks how close the relevant* items are to the top of the recommendation lists (RLs).

However, since we're in the **cold start** position with our test dataset, **not a single** relevant* item appeared in any of the RLs

\* **relevant item `j`** - an item that has a rating from user `i` in the test dataset
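One way to express such a metric, consistent with the `[0, 0.9]` range reported for top-10 lists in the summary table, is the mean relative position of the relevant items. The exact formula inside `calculate_recovery` is not shown here, so the function below is an assumption-based sketch.

```python
def recovery(recommendation_lists, relevant_items):
    """Hypothetical sketch: mean of (0-based position / list length) over relevant hits.

    recommendation_lists: {user: [item, ...]} ranked from best to worst
    relevant_items: {user: set of items rated by the user in the test set}
    """
    positions = []
    for user, ranked_items in recommendation_lists.items():
        relevant = relevant_items.get(user, set())
        n = len(ranked_items)
        positions.extend(
            idx / n for idx, item in enumerate(ranked_items) if item in relevant
        )
    if not positions:
        return None  # no relevant item in any list: the metric is undefined
    return sum(positions) / len(positions)


toy_lists = {"u1": list(range(10))}
print(recovery(toy_lists, {"u1": {0}}))  # relevant item at the very top
print(recovery(toy_lists, {"u1": {9}}))  # relevant item at the very bottom
print(recovery(toy_lists, {}))           # no relevant items at all
```

The last call corresponds to the situation in this notebook: with no relevant item in any list there is nothing to average.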

In [41]:
metrics_calculator.calculate_recovery()

There is no one relevant item in the top-10 recommendation list => Recovery can't be calculated


## Diversity metric
As a diversity metric, **(normalized) aggregate diversity** (AggDiv) was chosen. This metric can serve both purposes, **inter-user diversity** and **coverage**, since it starts from the number of unique items among all the RLs:
- we normalize by the `number of recommendations` to get the level of **diversity** (what percentage of recommendations is unique)
- we normalize by the `number of available items` to get the level of **coverage** (what percentage of all items appeared in the RLs)

Final result:
- **0.044** => our system mostly recommends **the same items** across the lists (the value can't reach **1** in the current setup, because the size of the catalog < the number of recommendations)
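Both normalizations can be sketched in a few lines. `aggregate_diversity` is an illustrative helper with a made-up `catalog_size` parameter, not the signature of `calculate_agg_div`.

```python
def aggregate_diversity(recommendation_lists, catalog_size=None):
    """Number of unique recommended items, normalized in one of two ways.

    catalog_size=None -> divide by the total number of recommendations (diversity)
    catalog_size=int  -> divide by the number of available items (coverage)
    """
    unique_items = set().union(*recommendation_lists)
    if catalog_size is None:
        total_recommendations = sum(len(items) for items in recommendation_lists)
        return len(unique_items) / total_recommendations
    return len(unique_items) / catalog_size


# Three users, top-3 lists, 5 unique items in total.
toy_rls = [[1, 2, 3], [1, 2, 4], [1, 2, 5]]
print("Diversity:", aggregate_diversity(toy_rls))
print("Coverage:", aggregate_diversity(toy_rls, catalog_size=20))
```

The number of unique items can never exceed the catalog size, so when the catalog is smaller than the total number of recommendations the diversity variant stays strictly below 1.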

In [42]:
metrics_calculator.calculate_agg_div()

0.04392688679245283

## Coverage metric
As a coverage metric, **(normalized) Item space coverage** was used. This metric serves 2 purposes:
- checking how many unique items appear in the RLs (which can also be considered **coverage**)
- checking how uniformly the items are distributed across the RLs

We can't directly conclude how our model behaves from this metric alone; it serves for comparison between models

In [32]:
metrics_calculator.calculate_item_space_coverage()

21.38273102119177

Apart from that, as already mentioned, AggDiv can be used as a **coverage metric**

Its result is **0.073**, which shows that most of the catalog **wasn't used** in the RLs

In [49]:
metrics_calculator.calculate_agg_div(is_coverage=True)

0.07340624692087891

## Novelty metric
As a novelty metric, **(normalized) Item degree-based Novelty** is used: it shows the level of **unpopular** items in the RLs.

Our system has a fairly good level of novelty, **0.654**, which means it doesn't avoid unpopular items
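The exact formula behind `calculate_normalized_item_deg` is not shown in this notebook. As an illustration only, the sketch below uses one possible log-scaled inverse-popularity score that maps to `[0, 1]`: an item scores 0 when it has the highest degree (number of users who rated it) and 1 when nobody rated it. The formula and the name `degree_novelty` are assumptions.

```python
import math


def degree_novelty(recommendation_lists, item_degree):
    """Hypothetical sketch: mean of 1 - log(1 + deg) / log(1 + max_deg) over all recommendations."""
    max_degree = max(item_degree.values())
    if max_degree == 0:
        return 1.0  # nobody rated anything, every item is maximally novel
    scores = [
        1 - math.log1p(item_degree[item]) / math.log1p(max_degree)
        for items in recommendation_lists
        for item in items
    ]
    return sum(scores) / len(scores)


# Degrees are made-up counts of users who rated each item.
toy_degree = {"a": 9, "b": 0, "c": 2}
print(degree_novelty([["a", "b"], ["b", "c"]], toy_degree))
print(degree_novelty([["a"], ["a"]], toy_degree))  # only the most popular item
print(degree_novelty([["b"], ["b"]], toy_degree))  # only an unrated item
```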

In [43]:
metrics_calculator.calculate_normalized_item_deg()

0.6538109869318396

## Serendipity metric
Two metrics are used:
- **Unexpectedness** (`False` flag) == share of recommended items, across all users, whose popularity is below the `mean popularity`
- **Serendipity** == share of recommended items, across all users, whose popularity is below the `mean popularity` and that are relevant at the same time

1. Since the test set contains almost no relevant items (**cold start**), serendipity is **0** == there are no items that are below the mean popularity and relevant at the same time
2. However, the system shows a good level of `unexpectedness` - **61.47%**
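Both variants can be sketched with one function. It follows the definition given in the metrics summary table, where an item counts as unexpected when it is less popular than average; `unexpected_share` and its arguments are illustrative names, not the signature of `calculate_serendipity`.

```python
def unexpected_share(recommendation_lists, popularity, relevant_items=None):
    """Share of recommendations that are less popular than average.

    relevant_items=None        -> unexpectedness (relevance is ignored)
    relevant_items={user: set} -> serendipity (the item must also be relevant)
    """
    mean_popularity = sum(popularity.values()) / len(popularity)
    hits = total = 0
    for user, items in recommendation_lists.items():
        for item in items:
            total += 1
            unexpected = popularity[item] < mean_popularity
            relevant = relevant_items is None or item in relevant_items.get(user, set())
            hits += unexpected and relevant
    return hits / total


# Popularity is given as made-up rating counts; the mean is 2.
toy_popularity = {"a": 5, "b": 1, "c": 0, "d": 2}
toy_recs = {"u1": ["a", "b"], "u2": ["c", "d"]}

print("Unexpectedness:", unexpected_share(toy_recs, toy_popularity))
print("Serendipity:", unexpected_share(toy_recs, toy_popularity, {"u1": {"b"}}))
print("Serendipity, no relevant items:", unexpected_share(toy_recs, toy_popularity, {}))
```

The last call mirrors the cold start case: without relevant items in the lists, serendipity is 0 regardless of how unexpected the recommendations are.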

In [44]:
metrics_calculator.calculate_serendipity()

0.0

In [45]:
metrics_calculator.calculate_serendipity(False)

0.6147405660377387

## Key takeaways

Final conclusions about **SVD++** on **Yelp**:
- Since the test dataset is in the **cold start** state, most of the items in the set are **irrelevant** => the system doesn't produce relevant or serendipitous recommendations at all
- The accuracy of the system remains debatable, since the deviation is more than **1 point** of explicit rating (RMSE > 1)
- The system also doesn't cover most of the items in the catalog and mostly doesn't recommend unique lists
- However, the system recommends **unpopular** items and doesn't focus only on the most popular set

In [55]:
metrics_calculator.generate_metrics_summary_df(rmse_calculator.calculate_rmse())

There is no one relevant item in the top-10 recommendation list => Recovery can't be calculated


Unnamed: 0,Metric,Area,Value,Value Range,Meaning
0,Recovery,Relevance,,"[0, 0.9]",How early relevant items appear in top-N recommendations
1,Normalized AggDiv (diversity),Inter-user diversity,0.043927,"[0, 1]",Proportion of unique items recommended across all users divided by the amount of recommendations
2,Normalized AggDiv (coverage),Coverage,0.073406,"[0, 1]",Proportion of unique items recommended across all users divided by the size of catalog
3,Item Space Coverage,Coverage,21.383,"[0, Not defined]",Shows how many unique items and how often appears in the RLs (ideally a lot of different items recommended uniformly)
4,Normalized ItemDeg,Novelty,0.654,"[0, 1]",Novelty of recommended items based on inverse (log) item popularity
5,Unexpectedness (no relevance),Serendipity,0.615,"[0, 1]",Proportion of items that are unexpected (less popular than average)
6,Serendipity (with relevance),Serendipity,0.0,"[0, 1]",Proportion of unexpected and relevant items in top-N recommendations
7,RMSE,Relevance,1.179,"[0, 6]",Root Mean Square Error between predicted and actual ratings


The meanings of the metrics and their ranges

In [56]:
metrics_calculator.get_range_of_metrics()

Unnamed: 0,Metric,Min,Max,Explanation
0,Item space coverage,0,Not defined,"small - recommendations focuses on several item only or aren't balanced, big - recommendations are distributed uniformly across a lot of items"
1,Recovery,0,0.9,"0 - all the relevant items on the top of the list, 0.9 - all relevant items in the bottom of the list, None - no relevant items in the RLs"
2,Normalized AggDiv (diversity),0,1,"0 - only 1 item was recommended for everyone, 1 - all recommendations are different"
3,Normalized AggDiv (coverage),0,1,"0 - only 1 item was recommended, 1 - all the items from catalog were recommended"
4,Unexpectedness (with_relevance=False),0,1,"0 - there is no unexpected item (popularity below the average) in all RLs, 1 - all the items are unexpected"
5,Serendipity (with_relevance=True),0,1,"0 - there is no serendipitous item (popularity below the average + relevant) in all RLs, 1 - all the items are serendipitous"
6,Normalized ItemDeg,0,1,"0 - the most popular items are used (no novelty), 1 - all items are the most unpopular (the best novelty)"
