# Comics Rx
## [A comic book recommendation system](https://github.com/MangrobanGit/comics_rx)
<img src="https://images.unsplash.com/photo-1514329926535-7f6dbfbfb114?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2850&q=80" width="400" align='left'>

---

# Reduced Data: Grid Search + Cross-Validation

This time, as explored in the EDA notebook, let's remove customers whose purchase counts are too low or too high to influence the model in the intended way.

Examples:
- Too few - Customers who have only bought 1 comic (series).
- Too many - Customers with > 1000 series (for example, all eBay sales rolled into a single account number).
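
The cutoffs above could be applied with a filter along these lines. This is a minimal pandas sketch, not the code used to build the filtered input file; the helper name `filter_by_series_count` and the toy data are hypothetical, and the `account_id` / `comic_id` column names are taken from the schema shown later in this notebook.

```python
import pandas as pd


def filter_by_series_count(df, min_series=2, max_series=1000):
    """Keep rows for accounts whose distinct-series count is within bounds.

    Hypothetical helper: the thresholds mirror the examples above
    (drop accounts with only 1 series, or with more than 1000).
    """
    # Number of distinct comic series per account, aligned to each row
    n_series = df.groupby("account_id")["comic_id"].transform("nunique")
    keep = (n_series >= min_series) & (n_series <= max_series)
    return df.loc[keep].reset_index(drop=True)


# Toy data: account 1 bought one series, account 2 bought three
toy = pd.DataFrame({
    "account_id": [1, 2, 2, 2],
    "comic_id": [10, 10, 11, 12],
    "bought": [1, 1, 1, 1],
})
filtered = filter_by_series_count(toy)
print(filtered)
```

Using `transform("nunique")` keeps the per-account count aligned with the original rows, so the filter is a single boolean mask rather than a join.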

# Libraries

In [1]:
import findspark

In [2]:
findspark.init()

In [3]:
%matplotlib inline
%load_ext autoreload
# %autoreload 1 #would be where you need to specify the files
# %aimport comic_recs

import pandas as pd # dataframes
import os
import pickle

In [4]:
# Data storage
from sqlalchemy import create_engine # SQL helper
#import psycopg2 as psql #PostgreSQL DBs

In [5]:
# import necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
# from pyspark.sql.types import (StructType, StructField, IntegerType
#                                ,FloatType, LongType, StringType)
from pyspark.sql.types import *

import pyspark.sql.functions as F
from pyspark.sql.functions import col, explode, lit, isnan, when, count
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.ml.tuning import (CrossValidator, ParamGridBuilder, 
                               TrainValidationSplit)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql import DataFrame

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

In [6]:
import sys

In [7]:
sys.path.append('..')

In [8]:
# Custom
import data_fcns as dfc
import keys  # Custom keys lib
import comic_recs as cr

import time
import itertools
from functools import reduce
import numpy as np

In [9]:
from pyspark import SparkConf

conf = SparkConf()

In [10]:
conf = (conf.setMaster('local[*]')
#         .set('spark.executor.memory', '1G') #https://stackoverflow.com/questions/48523629/spark-pyspark-an-error-occurred-while-trying-to-connect-to-the-java-server-127
        .set('spark.driver.memory', '4G')
        .set('spark.driver.maxResultSize', '1G'))
#         .set('spark.executor.memory', '1G')
#         .set('spark.driver.memory', '10G')
#         .set('spark.driver.maxResultSize', '5G'))

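# Call getOrCreate on the class so `conf` is applied; SparkContext() would
# first start a context with default settings and then ignore `conf`.
sc = pyspark.SparkContext.getOrCreate(conf=conf)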

In [11]:
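from pyspark.sql import SQLContext
sql_context = SQLContext(sc)
# `spark` is passed to cr.get_top_n_new_recs later on; reuse the running context
spark = SparkSession.builder.getOrCreate()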

In [12]:
sc.setCheckpointDir('./checkpoints')

# spark.sparkContext.setCheckpointDir("hdfs://datalake/check_point_directory/als")

In [13]:
# # spark config
# spark = pyspark.sql.SparkSession \
#     .builder \
#     .appName("comic recs") \
#     .config("spark.driver.maxResultSize", "8g") \
#     .config("spark.driver.memory", "8g") \
#     .config("spark.executor.memory", "8g") \
#     .config("spark.master", "local[*]") \
#     .getOrCreate()

In [14]:
# # instantiate SparkSession object
# spark = pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate()
# # spark = SparkSession.builder.master("local").getOrCreate()

In [15]:
# # spark config
# spark = pyspark.sql.SparkSession \
#     .builder \
#     .appName("movie recommendation") \
#     .config("spark.driver.maxResultSize", "1g") \
#     .config("spark.driver.memory", "1g") \
#     .config("spark.executor.memory", "20g") \
#     .config("spark.master", "local[*]") \
#     .getOrCreate()

## Import Data

We've previously set aside the dataset into a `json` file.

In [16]:
# !ls

In [17]:
# We have previously created a version of the transactions table 
# and filtered it down.
# sold = spark.read.json('raw_data/als_input_filtered.json')
sold = sql_context.read.json('raw_data/als_input_filtered.json')

In [18]:
# Persist the data
sold.persist()

DataFrame[account_id: bigint, bought: bigint, comic_id: bigint]

In [19]:
sold.count()

61871

### ALS Model

Let's start with a train/test split.

In [20]:
random_seed = 1234

In [21]:
# Split data into training and test set
(train, test) = sold.randomSplit([.75, .25], seed=random_seed)

Make sure shapes make sense.

In [22]:
print(train.count(), len(train.columns))

46417 3


In [23]:
print(test.count(), len(test.columns))

15454 3


In [24]:
# Evaluate the model by computing the RMSE on the test data
eval_reg = RegressionEvaluator(metricName="rmse"
                               , labelCol="bought"
                               , predictionCol="prediction")

### Grid Search

In [31]:
# hyper-param config
# num_iterations = [10, 20, 25, 30]
num_iterations = [20,25,30,35]
ranks = [5]
# ranks = [5, 10]
# reg_params = [0.01, 0.1]
reg_params = [0.1]
alphas = [1000]
# alphas = [40, 500, 1000, 2000]

Let's further subset the training data into training and validation sets.

In [32]:
# Split data into training and validation sets
(gs_train, gs_val) = train.randomSplit([(1-(1/3)), (1/3)], seed=random_seed)

In [33]:
print(gs_train.count(), len(gs_train.columns))

30854 3


In [34]:
print(gs_val.count(), len(gs_val.columns))

15563 3


In [35]:
# num_iter = 25
# rank = 5
# reg = 0.1
# alpha = 1000

# als = ALS(maxIter=num_iter,
#               rank=rank,
#               userCol='account_id',
#               itemCol='comic_id',
#               ratingCol='bought',
#               implicitPrefs=True,
#               regParam=reg,
#               alpha=alpha,
#               coldStartStrategy='drop',  # Just for CV
#               seed=41916)

# model = als.fit(gs_train)

# # Generate predictions on Test
# predictions = model.transform(gs_val)

In [36]:
# grid search and select best model
start_time = time.time()
final_model, params_errs = cr.train_ALS(gs_train, gs_val, eval_reg, 
                                        num_iterations, reg_params, 
                                        ranks, alphas)

print('Total Runtime: {:.2f} seconds'.format(time.time() - start_time))

20 iterations, 5 latent factors, regularization=0.1, and alpha @ 1000 : validation error is 0.3612
25 iterations, 5 latent factors, regularization=0.1, and alpha @ 1000 : validation error is 0.3557
30 iterations, 5 latent factors, regularization=0.1, and alpha @ 1000 : validation error is 0.3509
35 iterations, 5 latent factors, regularization=0.1, and alpha @ 1000 : validation error is 0.3469
Total Runtime: 69.99 seconds
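
`cr.train_ALS` lives in `comic_recs.py` and is not shown here. A grid search of this kind typically loops over every hyperparameter combination, fits on the training split, scores on the validation split, and keeps the best. The sketch below shows that pattern in plain Python; `grid_search`, `fit_fn`, `score_fn`, and the toy stand-ins are hypothetical, not the actual implementation.

```python
import itertools


def grid_search(fit_fn, score_fn, num_iterations, reg_params, ranks, alphas):
    """Try every combination and return (best_model, list of result tuples).

    fit_fn(max_iter, reg, rank, alpha) -> model   (hypothetical callable)
    score_fn(model) -> validation error, lower is better
    """
    best_model, best_err = None, float("inf")
    results = []
    for max_iter, reg, rank, alpha in itertools.product(
            num_iterations, reg_params, ranks, alphas):
        model = fit_fn(max_iter, reg, rank, alpha)
        err = score_fn(model)
        results.append((max_iter, reg, rank, alpha, err))
        if err < best_err:
            best_model, best_err = model, err
    return best_model, results


# Toy stand-ins so the sketch runs without Spark: the "model" is just its
# parameter tuple and the "error" shrinks as iterations grow.
toy_fit = lambda max_iter, reg, rank, alpha: (max_iter, reg, rank, alpha)
toy_score = lambda model: 1.0 / model[0]

best, results = grid_search(toy_fit, toy_score,
                            [20, 25, 30, 35], [0.1], [5], [1000])
print(best, len(results))
```

With Spark, `fit_fn` would wrap something like `ALS(...).fit(gs_train)` and `score_fn` would wrap `eval_reg.evaluate(model.transform(gs_val))`.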


Save the descriptive results

In [None]:
param_errs_rd_1 = params_errs

In [None]:
with open('support_data/params_errs_rd1_24seed.pkl', 'wb') as f:
    pickle.dump(param_errs_rd_1, f)
    
# Example - load pickle
# pickle_in = open("support_data/params_errs_rd1.pkl","rb")
# pe1 = pickle.load(pickle_in)

In [None]:
!ls support_data

#### Use this to reload the Grid Search results

In [None]:
# Use a context manager so the file handle is closed after loading
with open('support_data/params_errs_rd1_24seed.pkl', 'rb') as pickle_in:
    params_errs = pickle.load(pickle_in)
                         

Hmmm. Let's put `params_errs` into a dataframe and find the model with the lowest error!

In [None]:
gs_cols = ['max_iters', 'reg', 'rank', 'alpha', 'rmse']

In [None]:
gs_df = pd.DataFrame(params_errs, columns=gs_cols)

In [None]:
gs_df.head()

In [None]:
min_err = gs_df.rmse.min()

In [None]:
min_df = gs_df.loc[gs_df['rmse']==min_err]

In [None]:
min_df

In [None]:
best_max_iter = min_df['max_iters'].iloc[0]
best_reg = min_df['reg'].iloc[0]
best_rank = min_df['rank'].iloc[0]
best_alpha = min_df['alpha'].iloc[0]

Let's do some visual comparisons.

In [None]:
gs_rank_match = (gs_df['rank']==best_rank)
gs_reg_match = (gs_df['reg']==best_reg)
gs_iter_match = (gs_df['max_iters']==best_max_iter)
gs_alpha_match = (gs_df['alpha']==best_alpha)

In [None]:
gs_vary_rank = gs_df.loc[(gs_reg_match & gs_iter_match & gs_alpha_match),:]

In [None]:
gs_vary_rank

In [None]:
gs_vary_alpha = gs_df.loc[(gs_reg_match & gs_iter_match & gs_rank_match),:]

In [None]:
gs_vary_alpha

In [None]:
gs_vary_reg = gs_df.loc[(gs_alpha_match & gs_iter_match & gs_rank_match),:]

In [None]:
gs_vary_reg

In [None]:
gs_vary_iter = gs_df.loc[(gs_alpha_match & gs_reg_match & gs_rank_match),:]

In [None]:
gs_vary_iter

After a quick inspection of these, let's:
- Keep `rank` = 5.
- Set `alpha` to `1000`. Compared with all the other combos, differences in `alpha` above 500 don't really move the needle.
- Keep `maxIter` at `20`. Experience to date with my resources suggests 20 is the most they can handle before technical difficulties arise.
- Keep `regParam` at the default `0.1`. As with `alpha`, the marginal change in error from changing `reg` is really small.

So, that means we are done selecting! We may really be pushing overfitting.
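
For context on what `alpha` does: in the implicit-feedback ALS formulation that Spark follows (Hu, Koren, and Volinsky), each observed value r is turned into a confidence weight c = 1 + alpha * r, while unobserved pairs keep a confidence of 1. The NumPy sketch below only illustrates that weighting; the `bought` values are made up.

```python
import numpy as np

alphas = np.array([40, 500, 1000, 2000])
bought = np.array([0, 1, 3])  # made-up interaction counts

# Confidence c = 1 + alpha * r for each (alpha, r) pair
confidence = 1 + alphas[:, None] * bought[None, :]
print(confidence)

# Ratio of confidence in a single observed purchase vs. an unobserved pair
ratio = confidence[:, 1] / confidence[:, 0]
print(ratio)
```

Once `alpha` is large, observed pairs already dominate the weighting, which may be one reason the error barely changes above 500.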

One last thing: let's chart the change in RMSE as alpha changes.

In [None]:
alpha_graph_df = gs_vary_alpha.copy()

In [None]:
alpha_graph_df['params_desc'] = (
                                '\u03B1=' + alpha_graph_df['alpha'].map(str) 
                                )
                                 

In [None]:
alpha_graph_df

In [None]:
sns.set(style="whitegrid")
sns.set(font_scale=2)

fig, ax = plt.subplots(figsize=(10, 10))

# Plot RMSE
sns.set_color_codes("pastel")

values = alpha_graph_df['params_desc'].tolist()

clrs = ['salmon' if (y == '\u03B1=1000') else 'steelblue' for y in values ]

s = sns.barplot(x="rmse", y="params_desc", data=alpha_graph_df,
                label="RMSE",
                palette=clrs)

# Add a legend and informative axis label
ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(ylabel="",
       xlabel="Max Iterations: 20 | Latent Factors: 5")
ax.set_title("Change in Error over Alpha")

sns.despine(left=True, bottom=True)
fig = s.get_figure()
fig.savefig('support_data/alphas.png') 

**OK**. Let's call it good. 

## Results 
Looks like the best parameters we could find are:
- `maxIter` = 20
- `rank` = 5
- `regParam` = 0.1 (default)
- `alpha` = 1000

Let's cross-validate this candidate model.

## Cross Validation

Let's cross-validate because we didn't actually do it in the grid search. We want to make sure that the selected model is not overfitting.

The built-in cross validator in `Spark` keeps breaking when I try to use it, so let's build our own function.
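
`cr.get_spark_k_folds` and `cr.get_cv_errors` are defined in `comic_recs.py` and not shown here. One common way to build such helpers is to shuffle the row indices, cut them into k roughly equal parts, and let each part serve once as the validation set. The NumPy sketch below shows that idea; `make_k_folds` and `cv_errors` are hypothetical names, and the real functions operate on Spark DataFrames and may work differently.

```python
import numpy as np


def make_k_folds(n_rows, k, random_seed):
    """Return k (train_idx, val_idx) pairs covering all row indices."""
    rng = np.random.default_rng(random_seed)
    parts = np.array_split(rng.permutation(n_rows), k)
    folds = []
    for i in range(k):
        val_idx = parts[i]
        train_idx = np.concatenate([p for j, p in enumerate(parts) if j != i])
        folds.append((train_idx, val_idx))
    return folds


def cv_errors(folds, fit_and_score):
    """Call fit_and_score(train_idx, val_idx) once per fold."""
    return [fit_and_score(tr, va) for tr, va in folds]


folds = make_k_folds(n_rows=103, k=10, random_seed=1234)
# Toy scorer: just report the validation fraction for each fold
errors = cv_errors(folds, lambda tr, va: len(va) / (len(tr) + len(va)))
print(len(errors), round(float(np.mean(errors)), 3))
```

Each row lands in exactly one validation fold, so every observation is scored once by a model that never saw it during fitting.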

In [None]:
k = 10

In [None]:
folds = cr.get_spark_k_folds(train, k=k, random_seed=random_seed)

In [None]:
# Create ALS instance for cv with our chosen parameters
als_cv = ALS(maxIter=best_max_iter,
          rank=best_rank,
          userCol='account_id',
          itemCol='comic_id',
          ratingCol='bought',
          implicitPrefs=True,
          regParam=best_reg,
          alpha=best_alpha,
          coldStartStrategy='drop', # we want to drop so can get through CV
          seed=random_seed)

In [None]:
errors = cr.get_cv_errors(folds, als_cv, eval_reg)

In [None]:
# Make sure that # of errors = k
k == len(errors)

In [None]:
# The evaluator reports RMSE (lower is better), not accuracy
print("RMSE: %0.4f (+/- %0.4f)" % (np.mean(errors), np.std(errors) * 2))

Looks stable. Let's go with it.

## Test the Candidate Model

Test against our holdout set.

In [None]:
best_max_iter = 20
best_reg = 0.1
best_rank = 5
best_alpha = 1000

In [None]:
# Create ALS instance and fit model
als = ALS(maxIter=best_max_iter,
          rank=best_rank,
          userCol='account_id',
          itemCol='comic_id',
          ratingCol='bought',
          implicitPrefs=True,
          regParam=best_reg,
          alpha=best_alpha,
          coldStartStrategy='drop', # To get our eval
          seed=random_seed)
model_use = als.fit(train)

In [None]:
# get predictions on test
test_preds = model_use.transform(test)

# Evaluate test
test_rmse = eval_reg.evaluate(test_preds)
test_rmse

Well, this is unexpected. Test error noticeably lower than train error usually indicates an unknown fit. Since we trained on the `train` data, we would expect test error to be at least as high as, and _probably_ a little higher than, train error. Not lower.

It's not THAT much better, but we need to make note of it. For now we need to move on.

In [None]:
# Refit on the training set for the model we will save and use
als = ALS(maxIter=best_max_iter,
          rank=best_rank,
          userCol='account_id',
          itemCol='comic_id',
          ratingCol='bought',
          implicitPrefs=True,
          regParam=best_reg,
          alpha=best_alpha,
          coldStartStrategy='nan', # Spark's default: unseen users/items get NaN predictions
          seed=random_seed)
model_use = als.fit(train)

#### Save the item factors for future use!

In [None]:
item_factors = model_use.itemFactors.toPandas()

In [None]:
item_factors.shape

In [None]:
!ls

In [None]:
item_factors.to_pickle("support_data/item_factors.pkl")

In [None]:
# None means "no truncation"; -1 was deprecated in pandas 1.0
pd.set_option('display.max_colwidth', None)

In [None]:
item_factors.head()

Test unpickle

In [None]:
unpickled_items = pd.read_pickle('support_data/item_factors.pkl')

### Get Top N recommendations for Single User

Let's make a reference list of `account_id`s, for testing purposes.

In [None]:
n_to_test = 2

users = (sold.select(als.getUserCol())
                          .sample(False
                                  ,n_to_test/sold.count()
                                  )
        )
users.persist()
users.show(2)

We wrote this functionality out to a function in `comic_recs.py`.

### Testing function!

- Assign the function's result to a pandas dataframe.
- The function will ask for an `account_id`.
- It returns the top n recommendations, with n set by the `topn` parameter.

In [None]:
top_n_df = cr.get_top_n_new_recs(spark=spark, model=model_use, topn=5)
top_n_df

In [None]:
top_n_df = cr.get_top_n_new_recs(spark=spark, model=model_use, topn=5)
top_n_df

In [None]:
top_n_df = cr.get_top_n_new_recs(spark=spark, model=model_use, topn=10)
top_n_df

## Conclusions
- Seems realistic? Only three tests, but the results seem 'individualized' in the sense that there is no overlap between the sets (albeit small samples).

## Save the Model!

In [None]:
model_use.save('models/als_use')

## Retrieving Saved Model

In [None]:
comic_rec_model = ALSModel.load('models/als_use')

In [None]:
top_n_df = cr.get_top_n_new_recs(spark=spark, model=comic_rec_model, topn=10)
top_n_df