In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!pip install -r ../../../requirements.txt



In [3]:
import os
import sys

library_path = os.path.abspath("../library")
if library_path not in sys.path:
    sys.path.append(library_path)

In [4]:
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [5]:
import pandas as pd
pd.options.mode.chained_assignment = None  # Disable the warning

import pickle
import numpy as np

from metrics import SvdMetricsCalculator
from rating import get_explicit_rating, get_implicit_rating_out_of_positive_ratings_csr, split_matrix_csr
from tuning import GridSearchSvdPP

INFO:numexpr.utils:Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.


# Feature selection

The only dataset we need is the **review** dataset, because:
- it contains the explicit ratings (the mean of the **stars** field for each user-item pair; see the **Feature engineering** chapter for details)
- it contains the data needed for implicit ratings (see the **Feature engineering** chapter for details)
- it already contains only users who wrote at least one review and items that received at least one rating
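
As an illustration of the first point, here is a minimal sketch of how a mean-of-stars rating per user-item pair could be turned into a sparse matrix. It is not the code of `get_explicit_rating` (which lives in the project library and is not shown here); the toy data and the names `pairs`, `user_codes`, and `item_codes` are assumptions made for this example.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy reviews (hypothetical data): user "u1" reviewed item "b1" twice.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "business_id": ["b1", "b1", "b1", "b2"],
    "stars": [4, 2, 5, 3],
    "date": [100, 200, 150, 300],  # Unix seconds
})

# One row per (user, item): mean of stars and timestamp of the latest review.
pairs = (
    reviews.groupby(["user_id", "business_id"], as_index=False)
    .agg(stars=("stars", "mean"), last_date=("date", "max"))
)

# Map string ids to consecutive matrix indices.
user_codes, user_ids = pd.factorize(pairs["user_id"])
item_codes, item_ids = pd.factorize(pairs["business_id"])

# Rows are users, columns are items, values are mean ratings.
explicit = csr_matrix(
    (pairs["stars"].to_numpy(dtype=float), (user_codes, item_codes)),
    shape=(len(user_ids), len(item_ids)),
)
print(explicit.toarray())
```

A matrix of latest review timestamps can be built the same way from `pairs["last_date"]`.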

In [6]:
PATH = '../../../eda/dataset_samples/sampled_yelp_review.csv'

There are two ways of working with the dataset, which contains more than **900 000** rows:

*The standard flow:*
1. Download dataset from Kaggle
2. Drop the reviews of closed businesses (keep only `is_open == True`) and select a particular business category (see the chosen category in `/eda/yelp.ipynb`)
3. Drop the reviews written by users who rated fewer than **threshold** items (see the threshold definition in `/eda/yelp.ipynb`)
4. Use the whole dataset for analysis, training, and testing

*The "sampled" flow* (provided in `/eda/yelp.ipynb`):
1. Download dataset from Kaggle
2. Drop the reviews of closed businesses (keep only `is_open == True`) and select a particular business category (see the chosen category in `/eda/yelp.ipynb`)
3. Drop the reviews written by users who rated fewer than **threshold** items (see the threshold definition in `/eda/yelp.ipynb`)
4. Draw a random **10%** sample
5. Check that the statistical moments (`.describe()`) and the class balance (`.value_counts()`) are similar to those of the main dataset (the initial Yelp data)

For time efficiency and because of limited computing power, **the "sampled" flow** was chosen during EDA, and the sampled dataset was saved to **`/eda/dataset_samples`**.
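
The last two steps of the "sampled" flow can be sketched as follows. This is only an illustration on synthetic data, not the code from `/eda/yelp.ipynb`; the frame `full_df`, the seed, and the use of `stars` as the compared column are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the filtered review table (hypothetical data).
rng = np.random.default_rng(0)
full_df = pd.DataFrame({"stars": rng.integers(1, 6, size=10_000)})

# Draw a 10% random sample (fixed seed for reproducibility).
sample_df = full_df.sample(frac=0.1, random_state=42)

# Compare summary statistics of the full data and the sample side by side.
moments = pd.concat(
    {"full": full_df["stars"].describe(), "sample": sample_df["stars"].describe()},
    axis=1,
)

# Compare the relative frequency of each rating value.
balance = pd.concat(
    {
        "full": full_df["stars"].value_counts(normalize=True),
        "sample": sample_df["stars"].value_counts(normalize=True),
    },
    axis=1,
).sort_index()

print(moments)
print(balance)
```

If the two columns of each table are close, the sample can reasonably stand in for the full data in the later steps.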

Review features:
- `review_id` | `user_id` | `business_id` - the id of the review and the foreign keys (one user can leave several reviews for one item)
- `stars` - the **explicit rating** given by the user to a particular item at a particular moment
- `useful` | `funny` | `cool` - flags that (presumably) describe the review itself. We keep these features at this stage because we need evidence that this interpretation is right, that is, we **check whether they can be used for implicit rating**
- `text` - the content of the review (potentially useful for sentiment analysis)
- `date` - the timestamp of the review

In [7]:
review_df = pd.read_csv(PATH)
review_df

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,J5Q1gH4ACCj6CtQG7Yom7g,56gL9KEJNHiSDUoyjk2o3Q,8yR12PNSMo6FBYx1u5KPlw,2,1,0,0,Went for lunch and found that my burger was me...,2018-04-04 21:09:53
1,HlXP79ecTquSVXmjM10QxQ,bAt9OUFX9ZRgGLCXG22UmA,pBNucviUkNsiqhJv5IFpjg,5,0,0,0,I needed a new tires for my wife's car. They h...,2020-05-24 12:22:14
2,JBBULrjyGx6vHto2osk_CQ,NRHPcLq2vGWqgqwVugSgnQ,8sf9kv6O4GgEb0j1o22N1g,5,0,0,0,Jim Woltman who works at Goleta Honda is 5 sta...,2019-02-14 03:47:48
3,U9-43s8YUl6GWBFCpxUGEw,PAxc0qpqt5c2kA0rjDFFAg,XwepyB7KjJ-XGJf0vKc6Vg,4,0,0,0,Been here a few times to get some shrimp. The...,2013-04-27 01:55:49
4,8T8EGa_4Cj12M6w8vRgUsQ,BqPR1Dp5Rb_QYs9_fz9RiA,prm5wvpp0OHJBlrvTj9uOg,5,0,0,0,This is one fantastic place to eat whether you...,2019-05-15 18:29:25
...,...,...,...,...,...,...,...,...,...
699023,35n-AfizcW5N5KoKkYBSiQ,0PIrTmRx2EXHjAJOxxWvLA,IM0gPEyNRUoZVmYZhwuBpQ,5,0,0,0,Today was our first time to go here and the st...,2019-04-15 22:55:39
699024,b8_VWg1UF-Xwbg2PBiYDuA,X3fYkqlfQFNI3vP2sS-JWw,l_slvEnh4v3W8BXF1gYlcQ,5,0,0,0,Sharkeez still the best place on State Street ...,2016-01-08 01:11:29
699025,ZRNDyqZMoBvfDn1WeKzI3w,ZEcFnCWWT7wGEbPZkPQb0Q,Mr7Aov2n7wPCpwaUxk8lCw,4,0,0,0,Good vegetarian selections. Excellent sangria....,2013-05-13 12:23:06
699026,7-uxIDn4CkcBJedNm6IdTw,nWMwtH1xGvn1Utm11Ufxow,v9jNsSprsNpERYvyTOAzqw,5,2,1,1,Normally I wouldn't review a buffet but I feel...,2019-05-27 18:29:40


Reasons for dropping features:
- the `useful | funny | cool` features describe other users' reactions to a particular review, not to an item, so they can't be used to calculate **explicit** or **implicit** ratings (our assumption is that these are reactions a user can give to a review, which in theory describe user-to-user relations)
- `review_id` is not dropped; it is used as the index since it is a unique field
- the review `text` is not used for implicit or explicit ratings, so it can be dropped as well

In [8]:
REMAINED_FEATURES = ['review_id', 'user_id', 'business_id', 'stars', 'date']

# Keep only the needed columns; set_index returns a new frame, so later edits don't touch review_df
filtered_review_df = review_df[REMAINED_FEATURES].set_index('review_id')
filtered_review_df

review_id,user_id,business_id,stars,date
J5Q1gH4ACCj6CtQG7Yom7g,56gL9KEJNHiSDUoyjk2o3Q,8yR12PNSMo6FBYx1u5KPlw,2,2018-04-04 21:09:53
HlXP79ecTquSVXmjM10QxQ,bAt9OUFX9ZRgGLCXG22UmA,pBNucviUkNsiqhJv5IFpjg,5,2020-05-24 12:22:14
JBBULrjyGx6vHto2osk_CQ,NRHPcLq2vGWqgqwVugSgnQ,8sf9kv6O4GgEb0j1o22N1g,5,2019-02-14 03:47:48
U9-43s8YUl6GWBFCpxUGEw,PAxc0qpqt5c2kA0rjDFFAg,XwepyB7KjJ-XGJf0vKc6Vg,4,2013-04-27 01:55:49
8T8EGa_4Cj12M6w8vRgUsQ,BqPR1Dp5Rb_QYs9_fz9RiA,prm5wvpp0OHJBlrvTj9uOg,5,2019-05-15 18:29:25
...,...,...,...,...
35n-AfizcW5N5KoKkYBSiQ,0PIrTmRx2EXHjAJOxxWvLA,IM0gPEyNRUoZVmYZhwuBpQ,5,2019-04-15 22:55:39
b8_VWg1UF-Xwbg2PBiYDuA,X3fYkqlfQFNI3vP2sS-JWw,l_slvEnh4v3W8BXF1gYlcQ,5,2016-01-08 01:11:29
ZRNDyqZMoBvfDn1WeKzI3w,ZEcFnCWWT7wGEbPZkPQb0Q,Mr7Aov2n7wPCpwaUxk8lCw,4,2013-05-13 12:23:06
7-uxIDn4CkcBJedNm6IdTw,nWMwtH1xGvn1Utm11Ufxow,v9jNsSprsNpERYvyTOAzqw,5,2019-05-27 18:29:40


# Rating extracting

Predefined constants for this section:

In [9]:
# Threshold for implicit rating calculation (only positive ratings are considered; explicit ratings range from 1 to 5)
IMPLICIT_THRESHOLD = 4
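
To make the role of this threshold concrete, here is a minimal sketch of deriving implicit feedback from positive explicit ratings. It is not the code of `get_implicit_rating_out_of_positive_ratings_csr` (not shown in this notebook); the toy matrix, the id mappings, and the choice of `>=` rather than `>` for the comparison are assumptions.

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import csr_matrix

IMPLICIT_THRESHOLD = 4  # same value as in the notebook

# Toy explicit-rating matrix: rows are users, columns are items (hypothetical data).
ratings = csr_matrix(np.array([
    [5.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
]))
idx_to_user_id = {0: "u1", 1: "u2"}
idx_to_item_id = {0: "b1", 1: "b2", 2: "b3"}

implicit = defaultdict(dict)
coo = ratings.tocoo()  # iterate over stored (non-zero) entries only
for user_idx, item_idx, stars in zip(coo.row, coo.col, coo.data):
    # Assumption: a rating at or above the threshold counts as positive feedback.
    if stars >= IMPLICIT_THRESHOLD:
        implicit[idx_to_user_id[int(user_idx)]][idx_to_item_id[int(item_idx)]] = 1

print(dict(implicit))
```

Ratings below the threshold (the 3-star rating of `b3` here) leave no implicit signal at all, which matches the "only positive ratings are considered" note in the comment above.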

In [10]:
# Convert review timestamps to Unix epoch seconds (nanoseconds // 10**9)
filtered_review_df["date"] = pd.to_datetime(filtered_review_df["date"]).astype(np.int64) // 10**9
filtered_review_df

review_id,user_id,business_id,stars,date
J5Q1gH4ACCj6CtQG7Yom7g,56gL9KEJNHiSDUoyjk2o3Q,8yR12PNSMo6FBYx1u5KPlw,2,1522876193
HlXP79ecTquSVXmjM10QxQ,bAt9OUFX9ZRgGLCXG22UmA,pBNucviUkNsiqhJv5IFpjg,5,1590322934
JBBULrjyGx6vHto2osk_CQ,NRHPcLq2vGWqgqwVugSgnQ,8sf9kv6O4GgEb0j1o22N1g,5,1550116068
U9-43s8YUl6GWBFCpxUGEw,PAxc0qpqt5c2kA0rjDFFAg,XwepyB7KjJ-XGJf0vKc6Vg,4,1367027749
8T8EGa_4Cj12M6w8vRgUsQ,BqPR1Dp5Rb_QYs9_fz9RiA,prm5wvpp0OHJBlrvTj9uOg,5,1557944965
...,...,...,...,...
35n-AfizcW5N5KoKkYBSiQ,0PIrTmRx2EXHjAJOxxWvLA,IM0gPEyNRUoZVmYZhwuBpQ,5,1555368939
b8_VWg1UF-Xwbg2PBiYDuA,X3fYkqlfQFNI3vP2sS-JWw,l_slvEnh4v3W8BXF1gYlcQ,5,1452215489
ZRNDyqZMoBvfDn1WeKzI3w,ZEcFnCWWT7wGEbPZkPQb0Q,Mr7Aov2n7wPCpwaUxk8lCw,4,1368447786
7-uxIDn4CkcBJedNm6IdTw,nWMwtH1xGvn1Utm11Ufxow,v9jNsSprsNpERYvyTOAzqw,5,1558981780


## Explicit rating 

In [11]:
explicit_ratings, last_dates, user_mapping, item_mapping = get_explicit_rating(filtered_review_df, "user_id", "business_id", "stars", "date")

explicit_ratings.toarray(), last_dates.toarray()

(array([[2., 0., 0., ..., 0., 0., 0.],
        [0., 5., 0., ..., 0., 0., 0.],
        [0., 0., 5., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array([[1522876193,          0,          0, ...,          0,          0,
                  0],
        [         0, 1590322934,          0, ...,          0,          0,
                  0],
        [         0,          0, 1550116068, ...,          0,          0,
                  0],
        ...,
        [         0,          0,          0, ...,          0,          0,
                  0],
        [         0,          0,          0, ...,          0,          0,
                  0],
        [         0,          0,          0, ...,          0,          0,
                  0]]))

## Implicit rating

# Train / validation / test split

In [12]:
DIVISIONS = [0.7, 0.2, 0.1]

train_matrix, validation_matrix, test_matrix = split_matrix_csr(explicit_ratings, last_dates, DIVISIONS)
train_matrix.toarray(), validation_matrix.toarray(), test_matrix.toarray()

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]))
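
The internals of `split_matrix_csr` are not shown in this notebook. Since it receives the matrix of last review dates, one plausible reading is a chronological split, where the oldest ratings go to training and the newest to testing. The sketch below illustrates that idea only; the helper `chronological_split` and the toy data are hypothetical and may differ from the library's behavior.

```python
import numpy as np
from scipy.sparse import csr_matrix


def chronological_split(ratings, dates, fractions):
    """Split stored ratings into consecutive, time-ordered parts (hypothetical helper)."""
    coo = ratings.tocoo()
    # Look up the timestamp of every stored rating.
    timestamps = np.asarray(dates.tocsr()[coo.row, coo.col]).ravel()
    order = np.argsort(timestamps, kind="stable")  # oldest first
    # Cut points between parts, e.g. [7, 9] for 10 ratings and [0.7, 0.2, 0.1].
    bounds = np.cumsum([int(round(f * len(order))) for f in fractions])[:-1]
    parts = []
    for idx in np.split(order, bounds):
        parts.append(
            csr_matrix((coo.data[idx], (coo.row[idx], coo.col[idx])), shape=ratings.shape)
        )
    return parts


# Toy data: 2 users x 5 items, every cell rated, timestamps 100..1000.
ratings = csr_matrix(np.array([[5, 4, 3, 2, 1], [1, 2, 3, 4, 5]], dtype=float))
dates = csr_matrix(np.arange(1, 11).reshape(2, 5) * 100)

train, validation, test = chronological_split(ratings, dates, [0.7, 0.2, 0.1])
print(train.nnz, validation.nnz, test.nnz)
```

Each part keeps the shape of the original matrix, which is consistent with the mostly zero arrays printed above.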

In [13]:
train_df = pd.DataFrame.sparse.from_spmatrix(train_matrix)
implicit_ratings = get_implicit_rating_out_of_positive_ratings_csr(matrix=train_matrix, implicit_threshold=IMPLICIT_THRESHOLD, idx_to_user_id=user_mapping['idx_to_id'], idx_to_item_id=item_mapping['idx_to_id'])

implicit_ratings

defaultdict(dict,
            {'56gL9KEJNHiSDUoyjk2o3Q': {'oEwmCZknUHgHfEBdKA2SZA': 1},
             'bnDZpsii_if2_wpn8oPcig': {'GKohndn_sMk2nWCh1jkAfg': 1,
              '3gVSrS4kffGGZT8oXHsIcw': 1},
             '8H183Gq4be1PqKBW7jbIiA': {'KHl171eshtTPrGyBWGEHQQ': 1},
             'ZLKpeCqbCMWfNeT6yU8wUQ': {'zT2OzXDWKK1abapHs2RUrQ': 1,
              'H1Azz4BpHYC8BlUAdMlfxw': 1},
             '7ihD-NrnECBbm9qnV_V38w': {'mXqcL-AQDLDXiT_dsksXGQ': 1},
             'AT_p7NkLqd50ugp3wjFg2Q': {'gqmQA9TIdmKz3tCnz6DqFA': 1},
             'CEZMiWrgtF67m0GUm19ZJA': {'I5uFcL0xshPJvhKQBW10Wg': 1,
              'nKpWUL3kMt4cnNQhye2WqA': 1},
             'XKh39FTs6Brg_cmQt-1hkw': {'JATnHDL8fLenqqhjJ5NTsA': 1},
             '6PujZU6irRIC0BhqrMu1Lg': {'_f3JQU6IXpGmTLaSqGy79g': 1,
              'bUW3qAQr_nf-DTz4B85o7A': 1,
              'sDi3UPW6imj4wGHUg7CgXg': 1},
             'J4a0YR8lfdd91TlqQtC0HQ': {'G6_zdYOVXUtEuW50ymodcA': 1,
              'vwAQNbXRm7Y6_f8REg7L7Q': 1,
              'xkTjLbBC7u

# Hyperparameters tuning

In [14]:
grid_search_svd_pp = GridSearchSvdPP(train_matrix=train_matrix, val_matrix=validation_matrix, train_implicit_rating=implicit_ratings, user_mapping=user_mapping, item_mapping=item_mapping)
 
best_params, best_score, best_svdpp_model = grid_search_svd_pp.run(explicit_rating_max=5)
best_params, best_score

INFO:venv:Try number: 1
INFO:venv:Train with params: {'lr_all': 0.005, 'n_epochs': 10, 'n_factors': 10, 'reg_all': 0.1}
INFO:venv:Epoch: 1
INFO:venv:Epoch: 2
INFO:venv:Epoch: 3
INFO:venv:Epoch: 4
INFO:venv:Epoch: 5
INFO:venv:Epoch: 6
INFO:venv:Epoch: 7
INFO:venv:Epoch: 8
INFO:venv:Epoch: 9
INFO:venv:Epoch: 10
INFO:venv:Current common score: 0.22113838065050526
INFO:venv:Try number: 2
INFO:venv:Train with params: {'lr_all': 0.005, 'n_epochs': 10, 'n_factors': 20, 'reg_all': 0.1}
INFO:venv:Epoch: 1
INFO:venv:Epoch: 2
INFO:venv:Epoch: 3
INFO:venv:Epoch: 4
INFO:venv:Epoch: 5
INFO:venv:Epoch: 6
INFO:venv:Epoch: 7
INFO:venv:Epoch: 8
INFO:venv:Epoch: 9
INFO:venv:Epoch: 10
INFO:venv:Current common score: 0.2212533994085407
INFO:venv:Try number: 3
INFO:venv:Train with params: {'lr_all': 0.005, 'n_epochs': 10, 'n_factors': 30, 'reg_all': 0.1}
INFO:venv:Epoch: 1
INFO:venv:Epoch: 2
INFO:venv:Epoch: 3
INFO:venv:Epoch: 4
INFO:venv:Epoch: 5
INFO:venv:Epoch: 6
INFO:venv:Epoch: 7
INFO:venv:Epoch: 8
INF

({'lr_all': 0.005, 'n_epochs': 20, 'n_factors': 10, 'reg_all': 0.1},
 0.21916578848747775)

# Model testing

In [16]:
metrics_calculator = SvdMetricsCalculator(test_matrix=test_matrix, model=best_svdpp_model, idx_to_user_id=user_mapping['idx_to_id'], idx_to_item_id=item_mapping['idx_to_id'])

metrics_calculator.calculate_common_metric(5)

0.21247450612685462

In [17]:
metrics_calculator.calculate_rmse()

1.062372530634273
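
For reference, root-mean-square error over the observed test ratings is conventionally computed as in the sketch below. The code of `calculate_rmse` is not shown in this notebook, so this is not its implementation; the function `rmse_on_observed` and the constant predictor are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def rmse_on_observed(test_matrix, predict):
    """RMSE over stored (non-zero) test ratings; `predict(user_idx, item_idx)` returns a float."""
    coo = test_matrix.tocoo()
    predictions = np.array([predict(u, i) for u, i in zip(coo.row, coo.col)])
    return float(np.sqrt(np.mean((predictions - coo.data) ** 2)))


# Toy test matrix with two observed ratings (hypothetical data).
toy_test = csr_matrix(np.array([[5.0, 0.0], [0.0, 3.0]]))

# A trivial predictor that always answers 4 stars.
toy_rmse = rmse_on_observed(toy_test, lambda user_idx, item_idx: 4.0)
print(toy_rmse)
```

On a 1 to 5 scale, an RMSE of about 1.06 means predictions are off by roughly one star on average in the root-mean-square sense.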

# Model saving 

The following code saves the resulting object so that **the trained model** can be reused in the service.

In [15]:
os.makedirs("./models", exist_ok=True)  # make sure the target directory exists
with open("./models/svd_pp.pkl", "wb") as f:
    pickle.dump(best_svdpp_model, f)
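
On the service side, the saved object would be restored with `pickle.load`. The round trip below is a minimal sketch that uses a plain dictionary as a stand-in for the trained model and a temporary directory instead of `./models`; both are assumptions made so the example is self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model (hypothetical; the real object is best_svdpp_model).
stand_in_model = {"n_factors": 10, "n_epochs": 20}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "models", "svd_pp.pkl")
    os.makedirs(os.path.dirname(model_path), exist_ok=True)

    # Save the object, as in the notebook cell.
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load it back, as the service would.
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

print(restored_model)
```

When loading the real model, the module that defines its class must be importable in the service, because pickle stores a reference to the class rather than its code. Only unpickle files from a trusted source.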