# Comics Rx
## [A comic book recommendation system](https://github.com/MangrobanGit/comics_rx)
<img src="https://images.unsplash.com/photo-1514329926535-7f6dbfbfb114?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2850&q=80" width="400" align='left'>

---

# B. ALS with PySpark

---

# Libraries

In [1]:
%matplotlib inline
%load_ext autoreload
# Mode 2 reloads all modules before each cell; mode 1 reloads only those named with %aimport
%autoreload 2
#%aimport data_fcns

import pandas as pd # dataframes
import os
# import gspread_pandas
# from gspread_pandas import Spread, Client # gsheets interaction

# Data storage
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

# import necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
# from pyspark.sql.types import (StructType, StructField, IntegerType
#                                ,FloatType, LongType, StringType)
from pyspark.sql.types import *

import pyspark.sql.functions as F
from pyspark.sql.functions import col, explode, lit, isnan, when, count
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Custom
import data_fcns as dfc
import keys  # Custom keys lib
import comic_recs as cr

import time

import numpy as np

In [2]:
# instantiate SparkSession object
spark = pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate()
# spark = SparkSession.builder.master("local").getOrCreate()

## Import the data

In a previous notebook we already scrubbed the data into the format ALS expects and saved it as JSON, which makes it easy to re-ingest here.

In [7]:
comics_sold = spark.read.json('raw_data/als_input_filtered.json')

In [8]:
comics_sold.persist()

DataFrame[account_id: bigint, bought: bigint, comic_id: bigint]

In [9]:
comics_sold.show(2)

+----------+------+--------+
|account_id|bought|comic_id|
+----------+------+--------+
|      2247|     1|     995|
|       487|     1|    1102|
+----------+------+--------+
only showing top 2 rows



### ALS Model

Let's start with a train/test split.

In [10]:
# Split data into training and test set
(train, test) = comics_sold.randomSplit([.8, .2], seed=41916)

Make sure the shapes make sense.

In [11]:
print(train.count(), len(train.columns))

44197 3


In [12]:
print(test.count(), len(test.columns))

11145 3
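
As a quick sanity check, the two counts above should land close to the requested 80/20 split. `randomSplit` assigns rows randomly, so the proportions are approximate rather than exact. A minimal sketch in plain Python, using only the counts printed above (the helper name `split_fractions` is hypothetical):

```python
def split_fractions(train_rows, test_rows):
    """Return the (train, test) share of the combined row count."""
    total = train_rows + test_rows
    return train_rows / total, test_rows / total

# Counts printed by the two cells above
train_frac, test_frac = split_fractions(44197, 11145)
print("train: {:.1%}, test: {:.1%}".format(train_frac, test_frac))
# prints: train: 79.9%, test: 20.1%
```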


#### 3rd Run @ ALS with filtered dataset

- Minimum number of titles bought by account = 5
- Maximum number of titles bought by account = 250

In [13]:
now = time.ctime(int(time.time()))
print("Started on {}.".format(now))

Started on Thu Jun 27 17:22:02 2019.


In [14]:
# Create ALS instance and fit model
als = ALS(maxIter=20,
          rank=10,
          userCol='account_id',
          itemCol='comic_id',
          ratingCol='bought',
          alpha=.01,
          implicitPrefs=True,
          seed=41916)
model = als.fit(train)

In [15]:
now = time.ctime(int(time.time()))
print("Completed on {}.".format(now))

Completed on Thu Jun 27 17:22:14 2019.
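
The two timestamps above put the fit at roughly 12 seconds. If the elapsed time itself is what matters, a small timer avoids subtracting timestamps by eye. This is only a sketch: `timed` is a hypothetical helper, and the workload below is a stand-in for `als.fit(train)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print("{} took {:.2f} s".format(label, elapsed))

# Stand-in workload; in the notebook this would wrap `als.fit(train)`
with timed("fit"):
    total = sum(range(100000))
```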


### Evaluation on Test

In [12]:
# Generate predictions on TEST
predictions = model.transform(test)
predictions.persist()

DataFrame[account_id: bigint, bought: bigint, comic_id: bigint, prediction: float]

In [13]:
predictions.show(10)

+----------+------+--------+-------------+
|account_id|bought|comic_id|   prediction|
+----------+------+--------+-------------+
|       336|     1|     471|   0.08173462|
|      1110|     1|     471|   0.07119954|
|      1842|     1|     471|  0.046360236|
|       691|     1|    1088|  0.012279962|
|       125|     1|    1342|  0.009202151|
|       296|     1|    1959|  0.009963108|
|       731|     1|    1959|  0.094873644|
|      2223|     1|    1959|  0.056293443|
|        49|     1|    1959|   0.00999463|
|       708|     1|    1959|-0.0026608421|
+----------+------+--------+-------------+
only showing top 10 rows



`BinaryClassificationEvaluator` expects `rawPredictionCol` to be a double (or a vector), but ALS outputs a float, so cast it.

In [14]:
predictions = predictions.withColumn("prediction", predictions["prediction"].cast(DoubleType()))

In [15]:
predictions.show(10)

+----------+------+--------+--------------------+
|account_id|bought|comic_id|          prediction|
+----------+------+--------+--------------------+
|       336|     1|     471| 0.08173462003469467|
|      1110|     1|     471| 0.07119953632354736|
|      1842|     1|     471|0.046360235661268234|
|       691|     1|    1088|0.012279962189495564|
|       125|     1|    1342|0.009202150627970695|
|       296|     1|    1959|0.009963108226656914|
|       731|     1|    1959| 0.09487364441156387|
|      2223|     1|    1959| 0.05629344284534454|
|        49|     1|    1959| 0.00999462977051735|
|       708|     1|    1959|-0.00266084214672...|
+----------+------+--------+--------------------+
only showing top 10 rows



### Initial Evaluation

Based on our first swing of the bat:
- `maxIter` = 20
- `rank` = 10

In [16]:
evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC'
                                          ,labelCol='bought'
                                          ,rawPredictionCol='prediction'
                                          )

In [17]:
auc = evaluator.evaluate(predictions)

In [18]:
print("Area Under the Curve = " + str(auc))

Area Under the Curve = 1.0


#### **REALLY?** 
Smells weird, especially when we look at the actual `prediction` values. How can the AUC be `1.0` when even the small sample above shows predictions that are basically `0`?
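
One thing worth checking here, offered as a hypothesis rather than a confirmed diagnosis: every row shown above has `bought` = 1, and if that holds for the whole test set there is no negative class for the ROC curve to separate. AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, so with zero negatives it is undefined, and whatever value an evaluator reports says nothing about ranking quality. The sketch below is plain Python, not Spark's implementation, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None when either class is missing,
    because there are no pairs to compare.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Both classes present: 3 of the 4 (positive, negative) pairs are ordered correctly
print(pairwise_auc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.5]))  # 0.75

# Only positives, like the rows shown above: AUC is undefined
print(pairwise_auc([1, 1, 1], [0.08, 0.07, 0.05]))  # None
```

A more informative AUC would need negative examples in the evaluation set, for instance account/comic pairs that were not bought.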

## Check NaNs

In [19]:
# Convert to pandas dataframe
pred_df = predictions.select('*').toPandas()

# Check nulls
pred_num_nulls = pred_df['prediction'].isna().sum()

# Num rows
preds_attempted = pred_df.shape[0]

In [20]:
print("There are {} nulls out of {}.".format(pred_num_nulls, preds_attempted))

There are 374 nulls out of 11145.


In [21]:
# {:.2%} multiplies the fraction by 100 before formatting
print("So {:.2%} are nulls.".format(pred_num_nulls/preds_attempted))

So 3.36% are nulls.


Since they're a small portion of the test set (374 of 11,145 rows), let's remove them for now.
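
A likely source of these NaNs is ALS's cold-start behavior: with the default `coldStartStrategy='nan'`, `transform` returns NaN for any row whose user or item was absent from the training split. Setting `coldStartStrategy='drop'` on the `ALS` instance drops those rows instead. The sketch below uses plain Python tuples rather than Spark, with made-up ids, to show which test rows would be affected; `cold_start_rows` is a hypothetical helper.

```python
def cold_start_rows(train, test):
    """Return test (user, item) pairs whose user or item never appears in train."""
    seen_users = {user for user, _ in train}
    seen_items = {item for _, item in train}
    return [(u, i) for u, i in test
            if u not in seen_users or i not in seen_items]

# Made-up (account_id, comic_id) pairs for illustration
train_pairs = [(1, 10), (1, 11), (2, 10)]
test_pairs = [(1, 10), (3, 10), (2, 12)]

print(cold_start_rows(train_pairs, test_pairs))  # [(3, 10), (2, 12)]
```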

In [22]:
predictions.count()

11145

In [23]:
# Convert back to spark dataframe
predictions = spark.createDataFrame(pred_df)

In [24]:
predictions.select([count(when(isnan(c), c)).alias(c) for c in predictions.columns]).show()

+----------+------+--------+----------+
|account_id|bought|comic_id|prediction|
+----------+------+--------+----------+
|         0|     0|       0|       374|
+----------+------+--------+----------+



In [25]:
pred_no_na = predictions.na.drop()

In [26]:
pred_no_na.persist()

DataFrame[account_id: bigint, bought: bigint, comic_id: bigint, prediction: double]

In [27]:
pred_no_na.count()

10771

In [28]:
# Evaluate the model by computing the AUC on the test data
auc2 = evaluator.evaluate(pred_no_na)

print("AUC = " + str(auc2))

AUC = 1.0


Still the same: AUC = 1.0. That makes sense given what we did (we only removed some rows). But how did we get an AUC earlier when there were NaNs in the predictions?

Ok, so that gives us a baseline. Let's switch gears and do a little grid search action.

Before we do that: some testing shows that I can't get `BinaryClassificationEvaluator` to work inside a grid search, because the `prediction` column needs to be cast to double manually, and there is no place to do that cast while the model is wrapped inside `CrossValidator`. So we'll use RMSE _for now_.

In [29]:
# Evaluate the model by computing the RMSE on the test data
eval_reg = RegressionEvaluator(metricName="rmse", labelCol="bought",
                                predictionCol="prediction")
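
For reference, RMSE is the square root of the mean squared difference between label and prediction. Every `bought` label shown so far is 1 and the implicit-feedback scores shown are close to 0, so RMSE here mostly measures how far the scores sit from 1, which is one reason to treat it as a stopgap. A small plain-Python sketch (the helper `rmse` is hypothetical and is not Spark's implementation):

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

# Labels of 1 with scores near 0 give an RMSE close to 1
print(round(rmse([1, 1, 1], [0.08, 0.07, 0.05]), 3))
```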

### Grid Search

In [30]:
# Create a parameter grid
params = (ParamGridBuilder()
          .addGrid(als.regParam, [1, 0.01, 0.001, 0.1])
          .addGrid(als.maxIter, [5, 10, 20])
          .addGrid(als.rank, [4, 10, 50])
          .addGrid(als.alpha, [.01, 0.1, 1, 10, 40])).build()

cv = TrainValidationSplit(estimator=als
                          , evaluator=eval_reg
                          , estimatorParamMaps=params
                          , trainRatio=0.8)
#cv = CrossValidator(estimator=als_implicit, estimatorParamMaps=paramGrid, evaluator=RegressionEvaluator())
model_implicit = cv.fit(train)

In [42]:
### Cross validate for best hyperparameters
cv = CrossValidator(estimator=als
                    ,estimatorParamMaps=params
                    ,evaluator=eval_reg
                    ,parallelism=4
                   )

model_implicit = cv.fit(train)

KeyboardInterrupt: 
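
The interrupted run makes more sense once the fits are counted. The grid above has 4 * 3 * 3 * 5 = 180 parameter combinations. `TrainValidationSplit` fits each combination once, while `CrossValidator` fits each one per fold (its default `numFolds` is 3), and both refit the best combination on the full training data at the end. A quick count in plain Python, with the grid values copied from the cell above:

```python
from itertools import product

# Grid values copied from the ParamGridBuilder cell
grid = {
    "regParam": [1, 0.01, 0.001, 0.1],
    "maxIter": [5, 10, 20],
    "rank": [4, 10, 50],
    "alpha": [0.01, 0.1, 1, 10, 40],
}

combos = list(product(*grid.values()))
num_folds = 3  # CrossValidator's default numFolds

tvs_fits = len(combos) + 1              # one fit per combo, plus the final refit
cv_fits = len(combos) * num_folds + 1   # one fit per combo per fold, plus the refit

print(len(combos), tvs_fits, cv_fits)  # 180 181 541
```

Trimming the grid, for example by dropping the largest `rank` or `maxIter` values, is the most direct way to bring the run time down.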

In [None]:
now = time.ctime(int(time.time()))
print("Completed on {}.".format(now))

# Fit and store the model (fit on train, not test, so the test set stays held out)
best_model = cv.fit(train)

In [None]:
als_model = model_implicit.bestModel

In [None]:
now = time.ctime(int(time.time()))
print("Completed on {}.".format(now))

# CV: Using RMSE. Bleh.

### Get Top N recommendations for Single User

Let's make a reference list of `account_id`'s, for testing purposes.

In [35]:
comics_sold.count()

55342

In [40]:
n_to_test = 5

users = (comics_sold.select(als.getUserCol())
                          .sample(False
                                  ,n_to_test/comics_sold.count()
                                  )
        )
users.persist()
users.show()

+----------+
|account_id|
+----------+
|      2976|
|      1847|
|       810|
|      1331|
|       132|
|       715|
|      2504|
+----------+
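
The sample above returned 7 rows even though `n_to_test` is 5. That is expected: `sample` with a fraction keeps each row independently with that probability, so the row count only averages out to the target. Because the rows are purchases rather than distinct accounts, the same `account_id` can also appear more than once. A plain-Python sketch of the same kind of per-row sampling (`bernoulli_sample` is a hypothetical helper, not Spark's sampler):

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

n_rows = 55342              # comics_sold.count() from the cell above
fraction = 5 / n_rows       # same fraction the notebook passes to sample()

# The sample size varies from seed to seed around the expected value of 5
sizes = [len(bernoulli_sample(range(n_rows), fraction, seed=s)) for s in range(5)]
print(sizes)
```

If exactly n distinct accounts are needed, one option would be to de-duplicate first and then take a fixed number of rows, for example `distinct()`, a random ordering such as `orderBy(F.rand())`, and `limit(n)`.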



We developed this functionality and wrote it out as a function in `comic_recs.py`.

### Testing the function!

- Assign the function's result to a pandas dataframe.
- The function will ask for an `account_id`.
- It returns the top n recommendations, with n set by the `topn` parameter.

In [41]:
top_n_df = cr.get_top_n_recs_for_user(spark=spark, model=model, topn=5)
top_n_df

2976


AnalysisException: 'Path does not exist: file:/home/ubuntu/projects/comics_rx/raw_data/comics.json;'

In [None]:
top_n_df = cr.get_top_n_recs_for_user(spark=spark, model=model, topn=5)
top_n_df

In [None]:
top_n_df = cr.get_top_n_recs_for_user(spark=spark, model=model, topn=5)
top_n_df

## Conclusions
- OK, this seems more realistic(?). It's only three tests, but it seems more plausible that two users would get completely different recommendations than the results I've seen to date, where 2 or 3 recommendations would overlap.

---