# Important

Run `make data/processed/ratings-train.csv` before running any notebook cell.

# Imports

In [None]:
import pandas as pd

# Grid search

The search for optimal parameter values was conducted using the grid search method. The whole process is based only on the training dataset, to avoid introducing bias into the test procedure.
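
To make the idea concrete, here is a minimal sketch of how a parameter grid expands into the candidate combinations that a grid search evaluates. The `expand_grid` helper and the example grid are hypothetical and only illustrate the enumeration; this is not the library's own implementation.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield every combination of a grid whose leaf values are lists.

    Nested dicts (for example an option group) are expanded recursively.
    """
    keys = list(param_grid)
    value_lists = []
    for key in keys:
        value = param_grid[key]
        if isinstance(value, dict):
            # A nested option group contributes its own list of combinations.
            value_lists.append(list(expand_grid(value)))
        else:
            value_lists.append(list(value))
    for combo in product(*value_lists):
        yield dict(zip(keys, combo))


# Hypothetical grid, for illustration only.
example_grid = {
    'k': [20, 30, 40],
    'sim_options': {'name': ['msd', 'pearson_baseline'], 'user_based': [False]},
}
candidates = list(expand_grid(example_grid))
print(len(candidates))  # 3 values of k times 2 similarity names = 6
```

Each candidate is then scored with cross-validation on the training data only, and the results are ranked by the chosen error measure.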

Helper methods to prepare data and compare results:

In [None]:
import ast

def flatten_dicts(df):
    # Option dicts are stored in the CSV as strings; parse each one safely
    # and expand its keys into separate columns.
    for col in ('param_sim_options', 'param_bsl_options'):
        if col in df.columns:
            df_opts = df[col].apply(ast.literal_eval).apply(pd.Series)
            df = pd.concat([df, df_opts], axis=1).drop(col, axis=1)
    return df

def select_cols(df, cols):
    df = flatten_dicts(df)
    cols = ['mean_test_rmse'] + cols + ['mean_fit_time', 'mean_test_time']
    return df[cols]

def compare(df, col):
    return df.groupby(col)[['mean_test_rmse', 'mean_fit_time', 'mean_test_time']
                       ].median().sort_values('mean_test_rmse')
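
The option columns are assumed to hold each dict as its string representation. The sketch below shows what flattening one such cell amounts to; the cell value and the `rank` field are hypothetical, not taken from the result files.

```python
import ast

# Hypothetical cell value, as it might appear in a results CSV.
raw_cell = "{'name': 'pearson_baseline', 'user_based': False}"

# literal_eval parses Python literals only, so it never executes code.
parsed = ast.literal_eval(raw_cell)

# Merging the parsed keys into the row turns one dict column into
# several plain columns ('rank' is a made-up placeholder field).
flat_row = {'rank': 1, **parsed}
print(flat_row)
```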

## KNN

### Choosing similarity metric

Available metrics are:
- `Cosine`
- `Mean Squared Difference`
- `Pearson`
- `Pearson with baseline`

Because the number of users is much higher than the number of items, we use item-based similarity.

A shrinkage parameter can be specified for Pearson baseline to avoid overfitting when only a few ratings are available. In our dataset there are always at least 8 ratings per book, so there is no need to tune this parameter.
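
As a rough illustration of what shrinkage does, the sketch below scales a correlation toward 0 depending on how many common ratings support it. The function name is hypothetical, and the formula is the commonly documented shrunk form, assumed here rather than taken from this notebook.

```python
def shrunk_similarity(rho, n_common, shrinkage=100):
    """Shrink the correlation rho toward 0 when few common ratings exist.

    n_common is the number of ratings the two items have in common.
    """
    return (n_common - 1) / (n_common - 1 + shrinkage) * rho


# With shrinkage=0 the correlation is left unchanged.
print(shrunk_similarity(0.9, 50, shrinkage=0))
# The more common ratings there are, the less the value is shrunk.
print(shrunk_similarity(1.0, 101))
```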

In [None]:
df_sim_metric = pd.read_csv("../results/knn-parameters-search-sim_metric.csv", index_col='rank_test_rmse')
df_sim_metric = select_cols(df_sim_metric, ['name'])
compare(df_sim_metric, 'name')

The Pearson baseline method achieves the best results because it takes the baselines into account. This is further explained in section 2.1 of "Factor in the Neighbors: Scalable and Accurate Collaborative Filtering" by Koren.
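
A minimal sketch of a baseline-centered correlation between two items follows. It assumes ratings and baselines are given as plain dicts keyed by user; the function name and the toy numbers are hypothetical.

```python
from math import sqrt


def pearson_baseline_sim(ratings_i, ratings_j, baselines_i, baselines_j):
    """Correlate two items over their common users.

    Each rating is centered on its baseline estimate rather than on
    the item's mean rating.
    """
    common = ratings_i.keys() & ratings_j.keys()
    num = sq_i = sq_j = 0.0
    for user in common:
        dev_i = ratings_i[user] - baselines_i[user]
        dev_j = ratings_j[user] - baselines_j[user]
        num += dev_i * dev_j
        sq_i += dev_i * dev_i
        sq_j += dev_j * dev_j
    denom = sqrt(sq_i * sq_j)
    return num / denom if denom else 0.0


# Toy data: both items deviate from their baselines in the same direction.
sim = pearson_baseline_sim(
    {'a': 5, 'b': 3}, {'a': 4, 'b': 2},
    {'a': 4, 'b': 4}, {'a': 3, 'b': 3},
)
print(sim)
```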

### Choosing baseline estimates method

Available methods are:
- `SGD`: Stochastic Gradient Descent
- `ALS`: Alternating Least Squares

In [None]:
df_baselines = pd.read_csv("../results/knn-parameters-search-baselines.csv", index_col='rank_test_rmse')
df_baselines = select_cols(df_baselines, ['method', 'n_epochs', 'reg', 'learning_rate', 'reg_i', 'reg_u'])
compare(df_baselines, 'method')

ALS and SGD achieve comparable results with default parameters, but ALS trains faster. Therefore, we choose ALS for tuning.
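
For intuition, here is a minimal sketch of estimating baselines with alternating updates: item biases are refitted with user biases held fixed, then the reverse. It follows the usual regularized form and is an assumption for illustration, not the library's actual code; `als_baselines` and the toy ratings are hypothetical.

```python
from collections import defaultdict


def als_baselines(ratings, n_epochs=10, reg_i=10, reg_u=15):
    """Estimate mu, user biases and item biases from (user, item, rating)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for user, item, r in ratings:
        by_user[user].append((item, r))
        by_item[item].append((user, r))
    b_u = {user: 0.0 for user in by_user}
    b_i = {item: 0.0 for item in by_item}
    for _ in range(n_epochs):
        # Refit item biases while user biases are held fixed.
        for item, rs in by_item.items():
            b_i[item] = sum(r - mu - b_u[u] for u, r in rs) / (reg_i + len(rs))
        # Refit user biases while item biases are held fixed.
        for user, rs in by_user.items():
            b_u[user] = sum(r - mu - b_i[i] for i, r in rs) / (reg_u + len(rs))
    return mu, b_u, b_i


# Toy data: item 'a' is rated high and item 'b' low by both users.
toy = [('u1', 'a', 5), ('u2', 'a', 5), ('u1', 'b', 1), ('u2', 'b', 1)]
mu, b_u, b_i = als_baselines(toy)
print(mu, b_i)
```

The baseline estimate for a pair is then `mu + b_u[user] + b_i[item]`; larger `reg_i` and `reg_u` pull the biases toward 0.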

### Choosing neighbors count

Available parameters are:
- `k`: maximal number of neighbors to take into account; default value 40
- `min_support`: minimum number of common users between a neighbor and the current item required to calculate the similarity instead of returning 0; default value 1
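
The two parameters can be sketched as follows: `min_support` gates whether a similarity is used at all, and `k` caps how many neighbors contribute. Both helper functions and the toy candidates are hypothetical illustrations.

```python
import heapq


def gated_similarity(raw_sim, n_common, min_support=1):
    """Return raw_sim only if enough common users exist, else 0."""
    return raw_sim if n_common >= min_support else 0.0


def top_k_neighbors(candidates, k=40):
    """Keep the k most similar (neighbor_id, similarity) pairs."""
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Toy candidates: (neighbor id, similarity).
neighbors = top_k_neighbors([('a', 0.1), ('b', 0.9), ('c', 0.5)], k=2)
print(neighbors)
print(gated_similarity(0.7, n_common=0))
```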

In [None]:
df_neighbors = pd.read_csv("../results/knn-parameters-search-neighbors.csv", index_col='rank_test_rmse')
df_neighbors = select_cols(df_neighbors, ['param_k', 'min_support'])
df_neighbors.head(10)

In [None]:
compare(df_neighbors, ['param_k'])

Taking a smaller number of neighbors into consideration seems to benefit the model's accuracy.

In [None]:
compare(df_neighbors, ['min_support'])

As expected, any computed similarity score is better than a fallback of 0.

### Choosing regularization parameters

In [None]:
df_knn_reg = pd.read_csv("../results/knn-parameters-search-reg.csv", index_col='rank_test_rmse')
df_knn_reg = select_cols(df_knn_reg, ['n_epochs', 'reg_i', 'reg_u'])
df_knn_reg.head()

In [None]:
compare(df_knn_reg, ['n_epochs', 'reg_i', 'reg_u'])

There are no large differences between the obtained results, so we stick to the defaults.

### Final settings

In [None]:
knn_params = {'bsl_options': {'method': ['als'],
                          'reg_i': [10],
                          'reg_u': [15],
                          'n_epochs': [10]},
          'k': [30],
          'sim_options': {'name': ['pearson_baseline'],
                          'min_support': [1],
                          'user_based': [False],
                          'shrinkage': [100]},
          'verbose': [False]}

## SVD

We opted for SVD instead of SVD++. The latter requires far more time for the training phase (5 vs 150 minutes) and scores better by only 0.01 points (0.82 vs 0.81).

### Choosing factors number

In [None]:
df_factors = pd.read_csv("../results/svd-parameters-search-factors.csv", index_col='rank_test_rmse')
df_factors = select_cols(df_factors, ['param_n_factors'])
compare(df_factors, 'param_n_factors')

Surprisingly, a higher number of factors does not result in better accuracy.

### Choosing regularization parameters

In [None]:
df_init = pd.read_csv("../results/svd-parameters-search-init.csv", index_col='rank_test_rmse')
df_init = select_cols(df_init, ['param_init_mean', 'param_init_std_dev'])
df_init.head()

As the ratings mean is much greater than the middle point of the scale (3.9 vs 2.5), we can assume that starting from a mean closer to the real one would yield better results. That turned out to be true.
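
To show what the two initialization parameters control, the sketch below draws factor vectors from a normal distribution, using the values from the final settings in this notebook (mean 0.1, standard deviation 0.05, 100 factors, seed 44). The `init_factors` helper is hypothetical, and Python's `random` module stands in for whatever generator the library uses.

```python
import random


def init_factors(n_factors, init_mean, init_std_dev, rng):
    """Draw one latent factor vector from a normal distribution."""
    return [rng.gauss(init_mean, init_std_dev) for _ in range(n_factors)]


rng = random.Random(44)  # fixed seed so the sketch is repeatable
p_u = init_factors(100, 0.1, 0.05, rng)  # one user's factors
q_i = init_factors(100, 0.1, 0.05, rng)  # one item's factors

# Starting value of the factor term in the prediction, before any training.
initial_dot = sum(p * q for p, q in zip(p_u, q_i))
print(round(sum(p_u) / len(p_u), 3), round(initial_dot, 3))
```

A non-zero `init_mean` makes the factor term start away from 0, so the first predictions are shifted before any learning happens.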

In [None]:
df_svd_reg = pd.read_csv("../results/svd-parameters-search-reg.csv", index_col='rank_test_rmse')
df_svd_reg = select_cols(df_svd_reg, ['param_n_epochs', 'param_lr_all', 'param_reg_all'])
df_svd_reg.head()

The default parameters of the Surprise library were already adjusted for a 1-5 rating scale. Therefore, different parameter values give worse results.
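
For reference, here is a minimal sketch of a single stochastic gradient step for a biased matrix factorization model, showing where the learning rate and the regularization term enter. It assumes the standard regularized squared-error update; `sgd_step` and the toy values are hypothetical and this is not the library's code.

```python
def sgd_step(r, mu, b_u, b_i, p_u, q_i, lr_all=0.005, reg_all=0.02):
    """Apply one update for a single observed rating r.

    Returns the updated biases, the updated factor vectors and the error.
    """
    pred = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    err = r - pred
    new_b_u = b_u + lr_all * (err - reg_all * b_u)
    new_b_i = b_i + lr_all * (err - reg_all * b_i)
    # Both factor vectors are updated from their old values.
    new_p_u = [p + lr_all * (err * q - reg_all * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr_all * (err * p - reg_all * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i, err


# Toy example: a rating of 5 against a global mean of 3.9.
new_b_u, new_b_i, new_p_u, new_q_i, err = sgd_step(
    r=5, mu=3.9, b_u=0.0, b_i=0.0, p_u=[0.1, 0.1], q_i=[0.1, 0.1]
)
print(err, new_b_u, new_p_u)
```

A larger `lr_all` takes bigger steps per rating, while a larger `reg_all` pulls biases and factors back toward 0 more strongly.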

### Final settings

In [None]:
svd_params = {'n_factors': [100],
          'biased': [True],
          'init_mean': [0.1],
          'init_std_dev': [0.05],
          'n_epochs': [25],
          'lr_all': [0.005],
          'reg_all': [0.02],
          'random_state': [44]}