# Comics Rx
## [A comic book recommendation system](https://github.com/MangrobanGit/comics_rx)
<img src="https://images.unsplash.com/photo-1514329926535-7f6dbfbfb114?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2850&q=80" width="400" align='left'>

---

# Reduced Data: Grid Search + Cross-Validation

This time, following up on the EDA notebook, let's remove customers whose purchase counts are too low or too high to inform the model in the intended way.

Examples:
- Too few: customers who have bought only 1 comic series.
- Too many: customers with > 1000 series (for example, an account number that rolls all eBay customers into one).
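To make the filtering rule concrete, here is a minimal sketch in plain Python. It is not the code that produced the filtered dataset: the function name `filter_by_series_count` and the default thresholds (`min_series=2`, `max_series=1000`) are assumptions taken from the two examples above, and the real filtering was presumably done on the full transactions table.

```python
from collections import defaultdict


def filter_by_series_count(rows, min_series=2, max_series=1000):
    """Keep rows for accounts with min_series..max_series distinct series.

    rows: list of (account_id, comic_id) pairs. The default thresholds mirror
    the examples above and are assumptions, not the project's actual cut-offs.
    """
    series_per_account = defaultdict(set)
    for account_id, comic_id in rows:
        series_per_account[account_id].add(comic_id)

    keep = {
        account_id
        for account_id, series in series_per_account.items()
        if min_series <= len(series) <= max_series
    }
    return [(a, c) for a, c in rows if a in keep]


# Toy data: account 1 has one series, account 2 has two, account 3 has three.
rows = [(1, 60), (2, 60), (2, 80), (3, 110), (3, 200), (3, 240)]
filtered = filter_by_series_count(rows, min_series=2, max_series=2)
print(filtered)
```

With both bounds set to 2, only account 2 survives in the toy data.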

# Libraries

In [44]:
%matplotlib inline
%load_ext autoreload
# %autoreload 1 #would be where you need to specify the files
# %aimport comic_recs

import pandas as pd # dataframes
import os
import pickle
import numpy as np

# Data storage
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

# import Spark methods
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
# from pyspark.sql.types import (StructType, StructField, IntegerType
#                                ,FloatType, LongType, StringType)
from pyspark.sql.types import *

import pyspark.sql.functions as F
from pyspark.sql.functions import col, explode, lit, isnan, when, count
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.ml.tuning import (CrossValidator, ParamGridBuilder, 
                               TrainValidationSplit)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql import DataFrame
from pyspark import SparkConf
from pyspark.sql import SQLContext

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
import sys

In [3]:
sys.path.append('..')

In [4]:
# Custom
import data_fcns as dfc
import keys  # Custom keys lib
import comic_recs as cr

import time
import itertools
from functools import reduce

### Configure Spark

In [5]:
conf = SparkConf()

# Laptop config
conf = (conf.setMaster('local[*]')
#         .set('spark.executor.memory', '1G') #https://stackoverflow.com/questions/48523629/spark-pyspark-an-error-occurred-while-trying-to-connect-to-the-java-server-127
        .set('spark.driver.memory', '7G')
        .set('spark.driver.maxResultSize', '2G'))
#         .set('spark.executor.memory', '1G')
#         .set('spark.driver.memory', '10G')
#         .set('spark.driver.maxResultSize', '5G'))

# AWS instance config
#conf = (conf.setMaster('local[*]')
#         .set('spark.driver.memory', '25G')
#         .set('spark.driver.maxResultSize', '4G'))
#         .set('spark.executor.memory', '1G') #https://stackoverflow.com/questions/48523629/spark-pyspark-an-error-occurred-while-trying-to-connect-to-the-java-server-127
#         .set('spark.driver.memory', '10G')
#         .set('spark.driver.maxResultSize', '5G'))

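# Call getOrCreate on the class so that `conf` is actually applied;
# SparkContext() would first start a context with the default settings.
sc = pyspark.SparkContext.getOrCreate(conf=conf)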

sql_context = SQLContext(sc)

sc.setCheckpointDir('./checkpoints')
# spark.sparkContext.setCheckpointDir("hdfs://datalake/check_point_directory/als")

## Import Data

We previously saved the dataset to a `json` file.

In [6]:
# We have previously created a version of the transactions table 
# and filtered it down.
sold = sql_context.read.json('raw_data/als_input_filtered_190915.json')

In [7]:
# Persist the data
sold.persist()

DataFrame[account_id: bigint, bought: bigint, comic_id: bigint]

In [8]:
sold.count()

846090

### ALS Model

Let's start with a train/test split.

In [9]:
random_seed = 41916

In [10]:
# Split data into training and test set
(train, test) = sold.randomSplit([.75, .25], seed=random_seed)

In [11]:
train.count()

634635

Make sure the shapes make sense.

In [12]:
print(train.count(), len(train.columns))

634635 3


In [13]:
print(test.count(), len(test.columns))

211455 3
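A quick arithmetic check on the counts above: the two splits should add up to the full dataset, and the train share should sit close to the requested 0.75. Since `randomSplit` samples rows at random, the weights are targets rather than exact proportions, so a small deviation is expected. The numbers are copied from the cell outputs; the variable names are just for this sketch.

```python
# Row counts reported by the cells above.
total_rows = 846090
train_rows = 634635
test_rows = 211455

# Every row should land in exactly one of the two splits.
assert train_rows + test_rows == total_rows

train_share = train_rows / total_rows
test_share = test_rows / total_rows
print(f"train share: {train_share:.4f}, test share: {test_share:.4f}")
```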


## New Grid Search

In [14]:
# # hyper-param config
# max_iters = [20, 25, 30]
# ranks = [5, 10, 20, 25]
# reg_params = [.05, .10, .15]
# alphas = [5, 25, 40]

In [15]:
# # hyper-param config
# max_iters = [30, 40]
# ranks = [25, 30]
# reg_params = [.15, .25]
# alphas = [5]

In [16]:
# # hyper-param config
# max_iters = [30]
# ranks = [5, 10]
# reg_params = [.25, .30]
# alphas = [5]

In [17]:
# hyper-param config
max_iters = [40]
ranks = [30]
reg_params = [.25]
alphas = [5]

In [18]:
# hyper-param config
max_iters = [20, 30, 40]
ranks = [5, 10, 20, 30, 40]
reg_params = [.05, .10, .25, .50]
alphas = [5, 25, 40, 100]

These will be the user, item, and rating columns, respectively, for our dataset:
- `account_id`
- `comic_id`
- `bought`

In [19]:
model_list = cr.create_als_models_list(userCol="account_id", itemCol="comic_id"
                                    ,ratingCol="bought"
                                    ,ranks=ranks
                                    ,max_iters=max_iters
                                    ,reg_params=reg_params
                                    ,alphas=alphas
                                    ,seed=random_seed
                                   )

In [20]:
model_list

[ALS_79970e79f6cb,
 ALS_36fb60fe80ff,
 ALS_4c53f475ea6d,
 ALS_cb761e2de513,
 ALS_f329e3e61c85,
 ALS_d6224b2689ab,
 ALS_44f7779cd04d,
 ALS_4e9c3f2f923d,
 ALS_459cd49bfb27,
 ALS_9cd396141376,
 ALS_499143b3f40a,
 ALS_e3bacf56e376,
 ALS_e7b9cc3808d7,
 ALS_c6995e303e61,
 ALS_765102bce857,
 ALS_d370b59dba4a,
 ALS_51e15e7eec31,
 ALS_a9ba7b084d0b,
 ALS_500f4e5ad35d,
 ALS_5f0d08e4e572,
 ALS_6b491b2bc51d,
 ALS_9514f6fc5eeb,
 ALS_b52dee8b4a91,
 ALS_48ba9ac9a032,
 ALS_de3d55cf13a4,
 ALS_fcaadd240aad,
 ALS_4e9f85414c64,
 ALS_d25f5e7ab87d,
 ALS_23f8293b4735,
 ALS_3338019de999,
 ALS_1701253e9f27,
 ALS_8ca308eecc68,
 ALS_9c64703b2d41,
 ALS_aa13c41fdac8,
 ALS_819b19bb30fa,
 ALS_70ecb63eb816,
 ALS_22bc63bd0810,
 ALS_2314a58ba707,
 ALS_fa84db690a83,
 ALS_484469a8c4af,
 ALS_ddb6c3207234,
 ALS_758207d8d05f,
 ALS_5fe45e276b3a,
 ALS_c88b9cf7c0eb,
 ALS_d9a27a3d4e2f,
 ALS_0dd8e357f9d0,
 ALS_9b87d7b09620,
 ALS_066d71ef1872,
 ALS_6fda87ef82e5,
 ALS_e6c1302dbbfa,
 ALS_d0874540e6c8,
 ALS_be11e95b3bf1,
 ALS_9239293

Check that the model list has the expected number of combinations.

In [21]:
print(len(model_list) == len(max_iters) * len(ranks) * len(reg_params) * len(alphas))

True
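The expected count is the size of the Cartesian product of the four hyper-parameter lists. A minimal sketch with `itertools.product` is below. The nesting order of the lists is an assumption, since the internals of `cr.create_als_models_list` are not shown in this notebook.

```python
import itertools

max_iters = [20, 30, 40]
ranks = [5, 10, 20, 30, 40]
reg_params = [.05, .10, .25, .50]
alphas = [5, 25, 40, 100]

# One tuple per candidate model. The nesting order here is an assumption;
# cr.create_als_models_list may enumerate the grid differently.
param_grid = list(itertools.product(ranks, max_iters, reg_params, alphas))

print(len(param_grid))  # 5 * 3 * 4 * 4 = 240
```

Every extra value in any list multiplies the number of models to fit, which is worth keeping in mind before widening the grid.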


### Manual Cross-Validation

## Metrics

In [22]:
ROEMs = cr.get_ROEMs(sql_context, model_list, train, test, 'account_id'
                     ,'bought')

Validation ROEM #1: 0.19134561177973972
Validation ROEM #2: 0.1876827427285817
Validation ROEM #3: 0.19027737320136165
Validation ROEM #4: 0.1991636965010482
Validation ROEM #5: 0.1911200435745735
Validation ROEM #6: 0.1875977122338758
Validation ROEM #7: 0.1901276534870648
Validation ROEM #8: 0.19864277974489786
Validation ROEM #9: 0.19036813296278882
Validation ROEM #10: 0.18731592423295831
Validation ROEM #11: 0.18964788669034244
Validation ROEM #12: 0.19760961970279622
Validation ROEM #13: 0.18963968337402312
Validation ROEM #14: 0.18642287740761168
Validation ROEM #15: 0.18877627951442777
Validation ROEM #16: 0.196313141311891
Validation ROEM #17: 0.1915670227116951
Validation ROEM #18: 0.18737775309120064
Validation ROEM #19: 0.18893069369280546
Validation ROEM #20: 0.19861632820624128
Validation ROEM #21: 0.19123698354570431
Validation ROEM #22: 0.1872404442235986
Validation ROEM #23: 0.18872956698659393
Validation ROEM #24: 0.19820594332349661
Validation ROEM #25: 0.19064365444

Validation ROEM #201: 0.15052866369042966
Validation ROEM #202: 0.16008203235813376
Validation ROEM #203: 0.16377216999624922
Validation ROEM #204: 0.17746391506687387
Validation ROEM #205: 0.14580436253465356
Validation ROEM #206: 0.15380954461969978
Validation ROEM #207: 0.1595373829676186
Validation ROEM #208: 0.1725142245742679
Validation ROEM #209: 0.15605785602901012
Validation ROEM #210: 0.16353772415423032
Validation ROEM #211: 0.16703357023573942
Validation ROEM #212: 0.17870261668808135
Validation ROEM #213: 0.15233017289579692
Validation ROEM #214: 0.1628771427669606
Validation ROEM #215: 0.16636002621602572
Validation ROEM #216: 0.17746494129771298
Validation ROEM #217: 0.1479837168570861
Validation ROEM #218: 0.15680033439118243
Validation ROEM #219: 0.1614193672355894
Validation ROEM #220: 0.1754934940541338
Validation ROEM #221: 0.1445386306765583
Validation ROEM #222: 0.15135169042916002
Validation ROEM #223: 0.15727258162213203
Validation ROEM #224: 0.1709135699029346
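For reference, a sketch of how a rank ordering error metric (ROEM) is commonly computed for implicit feedback. Within each user, items are ranked by predicted score, each item gets a percentile rank (0 for the top item, 1 for the bottom), and the metric is the feedback-weighted average of those percentile ranks. Lower is better, and a random ordering would score about 0.5. How `cr.calculate_ROEM` computes it is not shown in this notebook, so treat the function below (and its name, `roem`) as an assumption rather than the project's implementation.

```python
from collections import defaultdict


def roem(rows):
    """rows: iterable of (user, rating, prediction) triples."""
    by_user = defaultdict(list)
    for user, rating, prediction in rows:
        by_user[user].append((prediction, rating))

    weighted_rank_sum = 0.0
    rating_sum = 0.0
    for items in by_user.values():
        # Highest prediction first, so the top item gets percentile rank 0.
        items.sort(key=lambda pair: pair[0], reverse=True)
        n = len(items)
        for position, (_, rating) in enumerate(items):
            pct_rank = position / (n - 1) if n > 1 else 0.0
            weighted_rank_sum += rating * pct_rank
            rating_sum += rating
    return weighted_rank_sum / rating_sum if rating_sum else float("nan")


# The bought item (rating 1) is ranked first in `good` and last in `bad`.
good = [("a", 1, 0.9), ("a", 0, 0.5), ("a", 0, 0.1)]
bad = [("a", 0, 0.9), ("a", 0, 0.5), ("a", 1, 0.1)]
print(roem(good), roem(bad))
```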


In [23]:
best_model = cr.get_best_model(ROEMs, model_list)

Index of smallest error: 236
Smallest error:  0.14376224135570556


In [24]:
best_model

ALS_19b8fd7da8d9

## Extracting Parameters

In [25]:
# Extract the best model's parameters
print('alpha: {}'.format(best_model.getAlpha()))
print('reg param: {}'.format(best_model.getRegParam()))
print('max iter: {}'.format(best_model.getMaxIter()))
print('rank: {}'.format(best_model.getRank()))

alpha: 5.0
reg param: 0.5
max iter: 40
rank: 40
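As a sanity check, the reported index can be decoded back into one position per hyper-parameter list. The sketch assumes the index is zero-based and that the model list was built with `ranks` varying slowest and `alphas` fastest. Both are assumptions about `cr.create_als_models_list` and `cr.get_best_model`, and `unravel` is a helper name made up for this example.

```python
def unravel(index, sizes):
    """Decode a flat, zero-based index into one position per grid axis."""
    positions = []
    for size in reversed(sizes):
        index, position = divmod(index, size)
        positions.append(position)
    return tuple(reversed(positions))


ranks = [5, 10, 20, 30, 40]
max_iters = [20, 30, 40]
reg_params = [.05, .10, .25, .50]
alphas = [5, 25, 40, 100]

# Assumed axis order (slowest to fastest): ranks, max_iters, reg_params, alphas.
axes = [ranks, max_iters, reg_params, alphas]
positions = unravel(236, [len(axis) for axis in axes])
decoded = tuple(axis[p] for axis, p in zip(axes, positions))
print(decoded)
```

Under those assumptions, index 236 decodes to rank 40, max iter 40, reg param 0.5 and alpha 5, which agrees with the getters above.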


Inspecting the ROEM history, the best model lies on the edge of our grid: its rank of 40 is the largest rank we tried, so higher ranks might do better. Let's try rank values beyond 40.

In [51]:
# hyper-param config
max_iters = [40]
ranks = [50, 60, 70,80,90,100]
reg_params = [.50]
alphas = [5]

In [52]:
model_list_2 = cr.create_als_models_list(userCol="account_id", itemCol="comic_id"
                                    ,ratingCol="bought"
                                    ,ranks=ranks
                                    ,max_iters=max_iters
                                    ,reg_params=reg_params
                                    ,alphas=alphas
                                    ,seed=random_seed
                                   )

In [53]:
model_list_2

[ALS_ad358955846a,
 ALS_ee84f3b4c2e4,
 ALS_4df92cfa740f,
 ALS_a8a7288967a9,
 ALS_d36ce52fbb3a,
 ALS_0d06e48eb652]

In [54]:
ROEMs_2 = cr.get_ROEMs(sql_context, model_list_2, train, test, 'account_id'
                     ,'bought')

Validation ROEM #1: 0.14222735436507208
Validation ROEM #2: 0.1395159364006137
Validation ROEM #3: 0.13762273722934018
Validation ROEM #4: 0.1378879958535668
Validation ROEM #5: 0.1385916411905225
Validation ROEM #6: 0.1358331508405915
Total Runtime: 520.75 seconds
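To read the second round at a glance, the sketch below pairs each rank with its ROEM and picks the lowest. The values are copied from the output above. It assumes the models were evaluated in the same order as the `ranks` list, which seems likely since rank is the only parameter that varies.

```python
ranks = [50, 60, 70, 80, 90, 100]
roems = [
    0.14222735436507208,
    0.1395159364006137,
    0.13762273722934018,
    0.1378879958535668,
    0.1385916411905225,
    0.1358331508405915,
]
first_round_best = 0.14376224135570556  # smallest error from the first grid

# Assumes the models were evaluated in the same order as `ranks`.
best_rank, best_roem = min(zip(ranks, roems), key=lambda pair: pair[1])
improvement = first_round_best - best_roem
print(best_rank, best_roem, round(improvement, 4))
```

All six ranks beat the first-round best. The lowest error is at rank 100, which is again the largest value tried, though the gain is under 0.01 and not monotonic in rank. Note that `best_model` is not reassigned by these cells, so the steps that follow still use the rank-40 model from the first grid.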


In [26]:
!rm -rf models/best_model_use_20190921/

In [27]:
best_model.save('models/best_model_use_20190921')

In [28]:
# with open('support_data/best_model_20190916a.pkl', 'wb') as f:
#     pickle.dump(best_model, f)
    
# # Example - load pickle
# # pickle_in = open("support_data/params_errs_rd1.pkl","rb")
# # pe1 = pickle.load(pickle_in)

#### Use this to reload the Grid Search results

# pickle_in = open('support_data/params_errs_20190912d.pkl', 'rb')
# params_errs_4 = pickle.load(pickle_in)
                         

## Cross-Validation

Perform 5-fold cross-validation on the training set.

In [29]:
errors = []

In [30]:
foldlist = cr.create_spark_5_fold_set(train, seed=random_seed)

In [31]:
errors = cr.perform_cv(foldlist, best_model, sql_context, 
                       'account_id', 'bought')
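For readers unfamiliar with k-fold splitting, here is a minimal sketch over row indices in plain Python. Each fold serves once as the validation set while the other four are used for training. This only illustrates the idea: `k_fold_splits` is a made-up name, and `cr.create_spark_5_fold_set` and `cr.perform_cv` work on Spark DataFrames and may split the data differently.

```python
import random


def k_fold_splits(n_rows, k=5, seed=41916):
    """Yield (train_idx, validation_idx) pairs over row indices 0..n_rows-1."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation_idx = folds[held_out]
        train_idx = [
            idx
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for idx in fold
        ]
        yield train_idx, validation_idx


splits = list(k_fold_splits(10))
for train_idx, validation_idx in splits:
    print(len(train_idx), len(validation_idx))
```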

In [32]:
errors

[0.15581960057027952,
 0.1637829108738509,
 0.1602362185414189,
 0.16010871159309842,
 0.15467930302872307]

In [33]:
print("CV errors range: %0.2f (+/- %0.2f)" % (np.mean(errors), 
                                              np.std(errors) * 2))

CV errors range: 0.16 (+/- 0.01)
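Two decimals hide most of the spread here, so it can help to print one more digit. The sketch below repeats the calculation with the standard library. `statistics.pstdev` is the population standard deviation, which is what `np.std` computes by default. The fold errors are copied from the output above.

```python
import statistics

errors = [
    0.15581960057027952,
    0.1637829108738509,
    0.1602362185414189,
    0.16010871159309842,
    0.15467930302872307,
]

mean_error = statistics.mean(errors)
# pstdev is the population standard deviation, matching np.std's default.
spread = 2 * statistics.pstdev(errors)
print("CV errors range: %0.3f (+/- %0.3f)" % (mean_error, spread))
```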


The fold errors are pretty stable, which helps alleviate some overfitting concerns.

## Test the Candidate Model

Test against our holdout set.

In [34]:
model_fitted = best_model.fit(train)

In [35]:
# get predictions on test
test_preds = model_fitted.transform(test)

# Evaluate test
test_roem = cr.calculate_ROEM(sql_context, test_preds, 'account_id', 'bought')
test_roem

0.14376224135570556

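Odd that the test error (0.144) is lower than the CV errors on the training set (about 0.16), but it's not too far off. One possible contributor is that each CV fold trains on only part of the training set, while this model is fit on all of it. Let's run with it for now.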

## Get Factors

#### Save the item factors for future use!

In [36]:
item_factors = model_fitted.itemFactors.toPandas()

In [37]:
item_factors.shape

(790, 2)

In [38]:
item_factors.to_pickle("support_data/item_factors_20190921.pkl")

In [39]:
# Show full column contents; None replaces the deprecated -1 in newer pandas
pd.set_option('display.max_colwidth', None)

In [40]:
item_factors.head()

Unnamed: 0,id,features
0,60,"[0.0, 0.09691153466701508, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9247387647628784, 0.22059233486652374, 0.19497373700141907, 0.0, 0.042765721678733826, 0.03034142404794693, 0.029857859015464783, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.031502947211265564, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
1,80,"[0.19948233664035797, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011016706004738808, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7027317881584167, 0.5134339928627014, 0.07132630050182343, 0.0, 0.05733124539256096, 0.0, 0.0, 0.019320594146847725, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.044546954333782196, 0.029083142057061195, 0.0, 0.0, 0.0]"
2,110,"[0.0480114184319973, 0.31068646907806396, 0.02289436198771, 0.09501194953918457, 0.0, 0.0, 0.031020918861031532, 0.010718177072703838, 0.12014016509056091, 0.0, 0.09404737502336502, 0.0, 0.1620231568813324, 0.0, 0.0, 0.17322424054145813, 0.00905819982290268, 0.05470167472958565, 0.029735198244452477, 0.19797469675540924, 0.0, 0.0, 0.9482471346855164, 0.0, 0.06473582237958908, 0.0, 0.0, 0.09959202259778976, 0.014015444554388523, 0.0922638326883316, 0.22466379404067993, 0.0, 0.0, 0.004506054800003767, 0.0, 0.0, 0.008835907094180584, 0.07381134480237961, 0.0, 0.0]"
3,200,"[0.3273434340953827, 0.0, 0.060207005590200424, 0.0, 0.07319775968790054, 0.0, 0.08091767132282257, 0.0, 0.0638691782951355, 0.18519440293312073, 0.1623028814792633, 0.054778750985860825, 0.040024399757385254, 0.15281657874584198, 0.0, 0.006157767493277788, 0.0, 0.0, 0.11726079136133194, 0.0, 0.0, 0.0, 0.05401604622602463, 0.2168269157409668, 0.030285200104117393, 0.0, 0.005972058512270451, 0.0, 0.03901968523859978, 0.0, 0.0065006474032998085, 0.0, 0.11510273814201355, 0.0, 0.10229922831058502, 0.008200216107070446, 0.8732324242591858, 0.0, 0.1593346893787384, 0.1595969945192337]"
4,240,"[0.5348109602928162, 0.0, 0.017858237028121948, 0.015814248472452164, 0.0, 0.0, 0.0, 0.4094351530075073, 0.0, 0.09783218801021576, 0.0, 0.0, 0.0, 0.12842048704624176, 0.2638329863548279, 0.0, 0.26992252469062805, 0.0, 0.0, 0.02826196327805519, 0.0, 0.0, 0.0, 0.0, 0.09313179552555084, 0.0, 0.0, 0.04362291097640991, 0.0, 0.0, 0.0, 0.0, 0.0, 0.07889299094676971, 0.0, 0.0, 0.0, 0.0, 0.43136391043663025, 0.0]"


### Test Retrieving the Factors

In [41]:
unpickled_items = pd.read_pickle('support_data/item_factors_20190921.pkl')

In [42]:
unpickled_items.head()

Unnamed: 0,id,features
0,60,"[0.0, 0.09691153466701508, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9247387647628784, 0.22059233486652374, 0.19497373700141907, 0.0, 0.042765721678733826, 0.03034142404794693, 0.029857859015464783, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.031502947211265564, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
1,80,"[0.19948233664035797, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011016706004738808, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7027317881584167, 0.5134339928627014, 0.07132630050182343, 0.0, 0.05733124539256096, 0.0, 0.0, 0.019320594146847725, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.044546954333782196, 0.029083142057061195, 0.0, 0.0, 0.0]"
2,110,"[0.0480114184319973, 0.31068646907806396, 0.02289436198771, 0.09501194953918457, 0.0, 0.0, 0.031020918861031532, 0.010718177072703838, 0.12014016509056091, 0.0, 0.09404737502336502, 0.0, 0.1620231568813324, 0.0, 0.0, 0.17322424054145813, 0.00905819982290268, 0.05470167472958565, 0.029735198244452477, 0.19797469675540924, 0.0, 0.0, 0.9482471346855164, 0.0, 0.06473582237958908, 0.0, 0.0, 0.09959202259778976, 0.014015444554388523, 0.0922638326883316, 0.22466379404067993, 0.0, 0.0, 0.004506054800003767, 0.0, 0.0, 0.008835907094180584, 0.07381134480237961, 0.0, 0.0]"
3,200,"[0.3273434340953827, 0.0, 0.060207005590200424, 0.0, 0.07319775968790054, 0.0, 0.08091767132282257, 0.0, 0.0638691782951355, 0.18519440293312073, 0.1623028814792633, 0.054778750985860825, 0.040024399757385254, 0.15281657874584198, 0.0, 0.006157767493277788, 0.0, 0.0, 0.11726079136133194, 0.0, 0.0, 0.0, 0.05401604622602463, 0.2168269157409668, 0.030285200104117393, 0.0, 0.005972058512270451, 0.0, 0.03901968523859978, 0.0, 0.0065006474032998085, 0.0, 0.11510273814201355, 0.0, 0.10229922831058502, 0.008200216107070446, 0.8732324242591858, 0.0, 0.1593346893787384, 0.1595969945192337]"
4,240,"[0.5348109602928162, 0.0, 0.017858237028121948, 0.015814248472452164, 0.0, 0.0, 0.0, 0.4094351530075073, 0.0, 0.09783218801021576, 0.0, 0.0, 0.0, 0.12842048704624176, 0.2638329863548279, 0.0, 0.26992252469062805, 0.0, 0.0, 0.02826196327805519, 0.0, 0.0, 0.0, 0.0, 0.09313179552555084, 0.0, 0.0, 0.04362291097640991, 0.0, 0.0, 0.0, 0.0, 0.0, 0.07889299094676971, 0.0, 0.0, 0.0, 0.0, 0.43136391043663025, 0.0]"
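Comparing two `head()` outputs by eye only covers the first five rows. A stricter check compares the reloaded object with the original and confirms that every feature vector has the same length. The sketch below uses a plain dict and the standard `pickle` module as a stand-in for the pandas DataFrame, with a temporary file instead of `support_data/`. The toy values are illustrative, not the real factors.

```python
import os
import pickle
import tempfile

# Toy stand-in for the item factors: comic id -> latent feature vector.
item_factors = {60: [0.0, 0.097, 0.925], 80: [0.199, 0.0, 0.703]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "item_factors.pkl")
    with open(path, "wb") as f:
        pickle.dump(item_factors, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# The round trip should preserve every id and every feature value.
round_trip_ok = reloaded == item_factors
# Every vector should have the same length (the model's rank).
vector_lengths = {len(features) for features in reloaded.values()}
print(round_trip_ok, vector_lengths)
```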


---