# Comics Rx
## [A comic book recommendation system](https://github.com/MangrobanGit/comics_rx)
<img src="https://images.unsplash.com/photo-1514329926535-7f6dbfbfb114?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2850&q=80" width="400" align='left'>

---

# 3 - ALS Model - All transactions data

# Libraries

In [2]:
%matplotlib inline
%load_ext autoreload
# Mode 2 reloads all modules; mode 1 reloads only those named via %aimport
%autoreload 2
#%aimport data_fcns

import pandas as pd # dataframes
import os
import time
import numpy as np

# Data storage
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

# Spark
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col, explode, lit, isnan, when, count
from pyspark.sql.types import *

# Spark ML: model, tuning, evaluation
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator, BinaryClassificationEvaluator

# Custom
import lib.data_fcns as dfc
import lib.keys  # Custom keys lib
import lib.comic_recs as cr


In [3]:
# instantiate SparkSession object
spark = pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate()
# spark = SparkSession.builder.master("local").getOrCreate()

## Import the data

There is a way to hit PostgreSQL directly through JDBC, but I don't know how to do that yet. As a workaround, I saved the candidate dataset to JSON and will use that as the input to Spark.
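For reference, a direct read is possible with Spark's built-in `jdbc` data source. The sketch below only assembles the reader options; the host, database, table, and credentials are placeholders (assumptions, not values from this project), and the PostgreSQL JDBC driver jar would need to be on Spark's classpath.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Build the option dict for Spark's built-in JDBC data source.

    All argument values are caller-supplied placeholders.
    """
    return {
        "url": "jdbc:postgresql://{}:{}/{}".format(host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


# Hypothetical connection details, for illustration only
opts = postgres_jdbc_options("localhost", 5432, "comics_db", "trans",
                             "some_user", "some_password")
print(opts["url"])

# With a live SparkSession and the driver jar available, the read would be:
# trans = spark.read.format("jdbc").options(**opts).load()
```

The JSON export used in this notebook remains the simpler path when the database is not reachable from the Spark machine.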

In [3]:
trans = spark.read.json('raw_data/trans.json')

In [4]:
# Persist the data
trans.persist()

DataFrame[account_num: string, comic_title: string, date_sold: bigint, item_id: string, publisher: string, qty_sold: bigint, title_and_num: string]

In [5]:
print(trans.count(), len(trans.columns))

494703 7


In [6]:
# check schema
trans.printSchema()

root
 |-- account_num: string (nullable = true)
 |-- comic_title: string (nullable = true)
 |-- date_sold: long (nullable = true)
 |-- item_id: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- qty_sold: long (nullable = true)
 |-- title_and_num: string (nullable = true)



### More exploration/testing

We won't be using pandas dataframes in the matrix factorization through Spark, but let's cast to one anyway as it will be easier to work with for EDA.

In [7]:
# cast to Pandas dataframe to turn timestamp data to datetime and check nulls. 
trans_df = trans.select('*').toPandas()
trans_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494703 entries, 0 to 494702
Data columns (total 7 columns):
account_num      494703 non-null object
comic_title      494703 non-null object
date_sold        494703 non-null int64
item_id          494703 non-null object
publisher        494703 non-null object
qty_sold         494703 non-null int64
title_and_num    494703 non-null object
dtypes: int64(2), object(5)
memory usage: 26.4+ MB


In [8]:
# Let's double check the data is how we expect it
trans_df.head()

| | account_num | comic_title | date_sold | item_id | publisher | qty_sold | title_and_num |
|---|---|---|---|---|---|---|---|
| 0 | 174 | Filler Bunny (SLG) | 1313344863000 | DCD151935 | Amaze Ink Slave Labor Graphics | 1 | Filler Bunny #2 |
| 1 | 593 | Gargoyles (SLG) | 1340374297000 | DCD341726 | Amaze Ink Slave Labor Graphics | 1 | Gargoyles #6 |
| 2 | 226 | Royal Historian of Oz (SLG) | 1279720987000 | DCD416182 | Amaze Ink Slave Labor Graphics | 1 | Royal Historian of Oz #1 |
| 3 | 399 | Royal Historian of Oz (SLG) | 1279136980000 | DCD416182 | Amaze Ink Slave Labor Graphics | 1 | Royal Historian of Oz #1 |
| 4 | 237 | Royal Historian of Oz (SLG) | 1279535944000 | DCD416182 | Amaze Ink Slave Labor Graphics | 1 | Royal Historian of Oz #1 |


In [9]:
trans_df['dt'] = pd.to_datetime(trans_df['date_sold'], unit='ms')

Yes. Cross-checking against the original transactions dataframe in the other notebook confirms that these datetimes are correct.
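As a quick independent check, the standard library can decode one of the raw values. `date_sold` is in epoch milliseconds, so it is divided by 1000 first; the value below is the first row of the `head()` output, and reading it as UTC is an assumption.

```python
from datetime import datetime, timezone

raw_ms = 1313344863000  # date_sold of the first row shown above
sold_at = datetime.fromtimestamp(raw_ms / 1000, tz=timezone.utc)
print(sold_at.isoformat())
```

A sale date in 2011 is plausible for this dataset, which is what the `unit='ms'` argument to `pd.to_datetime` relies on.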

### Data Prep for ALS

Let's aggregate the data down to the three columns we need:
- `account_num` - The identifier for individual customers.
- `comic_title` - The comic. Represents individual volumes/runs of a comic.
- `score` - We need to decide what acts as a `score`. If these were Amazon items, review scores would be a natural fit, but we don't have those. We could use a binary `bought`/`not bought` flag, or we could use `qty_sold`, which might capture some interesting behavior from comic 'collectors/speculators'. Since this is a first pass, I'm curious as to what `qty_sold` might do!

So from the transactions we only care about `account_num`, `comic_title` and `qty_sold`.
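To make the two scoring options concrete, here is a small pure-Python sketch (toy rows, not the real data) that collapses transactions to one row per customer and title and derives both candidate scores.

```python
from collections import defaultdict

# Toy transactions: (account_num, comic_title, qty_sold); made-up values
transactions = [
    ("00226", "Royal Historian of Oz (SLG)", 1),
    ("00226", "Royal Historian of Oz (SLG)", 2),
    ("00593", "Gargoyles (SLG)", 1),
]

# Option 1: total quantity per (customer, title) pair
total_qty = defaultdict(int)
for account, title, qty in transactions:
    total_qty[(account, title)] += qty

# Option 2: binary flag, 1 for every pair that appears at all
bought = {pair: 1 for pair in total_qty}

for pair in sorted(total_qty):
    print(pair, total_qty[pair], bought[pair])
```

The Spark cells that follow do the same thing at scale with `groupBy` and `lit(1)`.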

In [10]:
comics_sold = trans[['account_num', 'comic_title', 'qty_sold']]
comics_sold.persist()

DataFrame[account_num: string, comic_title: string, qty_sold: bigint]

In [11]:
comics_sold = comics_sold.withColumn('bought', lit(1))

In [12]:
comics_sold.show(10)

+-----------+--------------------+--------+------+
|account_num|         comic_title|qty_sold|bought|
+-----------+--------------------+--------+------+
|      00174|  Filler Bunny (SLG)|       1|     1|
|      00593|     Gargoyles (SLG)|       1|     1|
|      00226|Royal Historian o...|       1|     1|
|      00399|Royal Historian o...|       1|     1|
|      00237|Royal Historian o...|       1|     1|
|      00327|Royal Historian o...|       1|     1|
|      00226|Royal Historian o...|       1|     1|
|      00327|Royal Historian o...|       1|     1|
|      00226|Royal Historian o...|       1|     1|
|      00226|Royal Historian o...|       1|     1|
+-----------+--------------------+--------+------+
only showing top 10 rows



In [13]:
comics_sold = trans[['account_num', 'comic_title', 'qty_sold']]
comics_sold.persist()

DataFrame[account_num: string, comic_title: string, qty_sold: bigint]

In [14]:
total_comics_sold = comics_sold.groupBy(['account_num', 'comic_title']).agg({'qty_sold':'sum'})
total_comics_sold.persist()

DataFrame[account_num: string, comic_title: string, sum(qty_sold): bigint]

Ok, let's take a look at the results.

In [15]:
total_comics_sold.show(10)

+-----------+--------------------+-------------+
|account_num|         comic_title|sum(qty_sold)|
+-----------+--------------------+-------------+
|      01858|Afterlife With Ar...|            5|
|      02247|Bubblegun VOL 2 (...|            1|
|      00191|    Caliban (Avatar)|            7|
|      00487|Captain Swing (Av...|            2|
|      00029|God Is Dead (Avatar)|            7|
|      01260| Providence (Avatar)|            1|
|      00172|   Supergod (Avatar)|            3|
|      01132|Futurama Annual (...|            3|
|      02493|       Abbott (Boom)|            3|
|      00298|Adventure Time (B...|            2|
+-----------+--------------------+-------------+
only showing top 10 rows



In [16]:
print(total_comics_sold.count(), len(total_comics_sold.columns))

92098 3


In [17]:
total_comics_sold = total_comics_sold.withColumn('bought', lit(1))

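I don't like that default `sum(qty_sold)` column name. For this pass, though, the model will use the binary `bought` flag as its score, so rather than renaming the column we drop it below and keep only `bought`.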

In [18]:
total_comics_sold.show(10)

+-----------+--------------------+-------------+------+
|account_num|         comic_title|sum(qty_sold)|bought|
+-----------+--------------------+-------------+------+
|      01858|Afterlife With Ar...|            5|     1|
|      02247|Bubblegun VOL 2 (...|            1|     1|
|      00191|    Caliban (Avatar)|            7|     1|
|      00487|Captain Swing (Av...|            2|     1|
|      00029|God Is Dead (Avatar)|            7|     1|
|      01260| Providence (Avatar)|            1|     1|
|      00172|   Supergod (Avatar)|            3|     1|
|      01132|Futurama Annual (...|            3|     1|
|      02493|       Abbott (Boom)|            3|     1|
|      00298|Adventure Time (B...|            2|     1|
+-----------+--------------------+-------------+------+
only showing top 10 rows



In [19]:
total_comics_sold = total_comics_sold[['account_num', 'comic_title', 'bought']]

In [20]:
print(total_comics_sold.count(), len(total_comics_sold.columns))

92098 3


### Formatting

Sooooooo, I forgot that ALS needs numeric user and item ids. So we need to fix that.

#### Convert `account_num` to an integer `account_id`

In [21]:
to_int_udf = F.udf(dfc.make_int, IntegerType())

In [22]:
account_num_col = total_comics_sold['account_num']

In [23]:
total_comics_sold = total_comics_sold.withColumn('account_id'
                                        ,to_int_udf(account_num_col))
total_comics_sold.persist()

DataFrame[account_num: string, comic_title: string, bought: int, account_id: int]

In [24]:
total_comics_sold.show(10)

+-----------+--------------------+------+----------+
|account_num|         comic_title|bought|account_id|
+-----------+--------------------+------+----------+
|      01858|Afterlife With Ar...|     1|      1858|
|      02247|Bubblegun VOL 2 (...|     1|      2247|
|      00191|    Caliban (Avatar)|     1|       191|
|      00487|Captain Swing (Av...|     1|       487|
|      00029|God Is Dead (Avatar)|     1|        29|
|      01260| Providence (Avatar)|     1|      1260|
|      00172|   Supergod (Avatar)|     1|       172|
|      01132|Futurama Annual (...|     1|      1132|
|      02493|       Abbott (Boom)|     1|      2493|
|      00298|Adventure Time (B...|     1|       298|
+-----------+--------------------+------+----------+
only showing top 10 rows



In [25]:
print(total_comics_sold.count(), len(total_comics_sold.columns))

92098 4


Now I need a way to give ids to `comic_title`. It's kind of clunky, but the big table already lives in PostgreSQL, so I can build an ID table there as the source of truth. I could do it on the PySpark side, but I would want to save the result somewhere (e.g. the DB) anyway, so I might as well do it at the source.
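Whichever side builds it, the ID table is just a stable mapping from each distinct title to an integer. A minimal sketch of the idea in plain Python, assuming ids are handed out in alphabetical order starting at 1 (the real table was built in PostgreSQL, and its exact rule is not shown here):

```python
def build_title_ids(titles):
    """Map each distinct title to an integer id, starting at 1."""
    return {title: i for i, title in enumerate(sorted(set(titles)), start=1)}


# Toy titles, including a duplicate
titles = ["Saga (Image)", "Batman (DC)", "Saga (Image)", "Chew (Image)"]
title_ids = build_title_ids(titles)
print(title_ids)
```

Persisting the mapping matters more than how it is built, since the same ids are needed later to turn recommendations back into titles.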

#### Get `comic_id`

In [26]:
comics = spark.read.json('raw_data/comics.json')
comics.persist()

DataFrame[comic_id: bigint, comic_title: string]

In [27]:
comics.count()

7202

In [28]:
comics.show(10)

+--------+--------------------+
|comic_id|         comic_title|
+--------+--------------------+
|       1|0Secret Wars (Mar...|
|       2|100 Bullets Broth...|
|       3|100 Penny Press L...|
|       4|100 Penny Press S...|
|       5|100 Penny Press T...|
|       6|100 Penny Press T...|
|       7|100th Anniversary...|
|       8|12 Reasons To Die...|
|       9|    13 Coins (Other)|
|      10|13th Artifact One...|
+--------+--------------------+
only showing top 10 rows



In [29]:
print(comics.count(), len(comics.columns))

7202 2


Now we need to join this back into `total_comics_sold`.

In [30]:
# Set aliases
tot = total_comics_sold.alias('tot')
com = comics.alias('com')

In [31]:
tot_sold_ids_only = tot.join(com.select('comic_id','comic_title')
                      ,tot.comic_title==com.comic_title).select('account_id'
                                                                , 'comic_id'
                                                                , 'bought')
tot_sold_ids_only.persist()
tot_sold_ids_only.show(10)

+----------+--------+------+
|account_id|comic_id|bought|
+----------+--------+------+
|      1858|     128|     1|
|      2247|     995|     1|
|       191|    1039|     1|
|       487|    1102|     1|
|        29|    2680|     1|
|      1260|    4870|     1|
|       172|    6023|     1|
|      1132|    2413|     1|
|      2493|      66|     1|
|       298|     110|     1|
+----------+--------+------+
only showing top 10 rows



In [32]:
tot_sold_ids_only.printSchema()

root
 |-- account_id: integer (nullable = true)
 |-- comic_id: long (nullable = true)
 |-- bought: integer (nullable = false)



In [33]:
print(tot_sold_ids_only.count(), len(tot_sold_ids_only.columns))

92098 3


## Save this intermediate table.

To save work, if needed.

In [34]:
als_input_df = tot_sold_ids_only.toPandas()

In [35]:
als_input_df.shape

(92098, 3)

In [36]:
als_input_df.to_json('raw_data/als_input_all.json', orient='records'
                     ,lines=True)

In [None]:
!head raw_data/als_input_all.json

### ALS Model

Let's start with a train/test split.

In [38]:
# Split data into training and test set
(train, test) = tot_sold_ids_only.randomSplit([.8, .2], seed=41916)

Make sure shapes make sense.

In [39]:
print(train.count(), len(train.columns))

73697 3


In [40]:
print(test.count(), len(test.columns))

18401 3


In [41]:
# Create ALS instance and fit model
als = ALS(maxIter=20,
          rank=10,
          userCol='account_id',
          itemCol='comic_id',
          ratingCol='bought',
          implicitPrefs=True,
          alpha=40,
          seed=41916)
model = als.fit(train)

In [42]:
now = time.ctime(int(time.time()))
print("Completed on {}.".format(now))

Completed on Fri Jun 28 04:46:16 2019.


### Evaluation on Test

In [43]:
# Generate predictions on TEST
predictions = model.transform(test)
predictions.persist()

DataFrame[account_id: int, comic_id: bigint, bought: int, prediction: float]

In [44]:
predictions.show(10)

+----------+--------+------+-----------+
|account_id|comic_id|bought| prediction|
+----------+--------+------+-----------+
|       217|     463|     1|0.121892005|
|      1489|     496|     1|-0.36663285|
|       690|     496|     1| 0.06409974|
|      1695|     833|     1|  0.7557971|
|      2573|     833|     1|  1.1742297|
|        62|    1088|     1|-0.01460916|
|      1781|    1238|     1|0.008104252|
|       226|    1342|     1|   0.533931|
|       191|    1591|     1| 0.97786826|
|        24|    1645|     1| 0.13036774|
+----------+--------+------+-----------+
only showing top 10 rows



`BinaryClassificationEvaluator` only likes doubles for `rawPredictionCol`, so cast it.

In [45]:
predictions = predictions.withColumn("prediction", predictions["prediction"].cast(DoubleType()))

In [46]:
predictions.show(10)

+----------+--------+------+--------------------+
|account_id|comic_id|bought|          prediction|
+----------+--------+------+--------------------+
|       217|     463|     1| 0.12189200520515442|
|      1489|     496|     1| -0.3666328489780426|
|       690|     496|     1|  0.0640997365117073|
|      1695|     833|     1|  0.7557970881462097|
|      2573|     833|     1|  1.1742297410964966|
|        62|    1088|     1|-0.01460915990173...|
|      1781|    1238|     1|0.008104251697659492|
|       226|    1342|     1|  0.5339310169219971|
|       191|    1591|     1|  0.9778682589530945|
|        24|    1645|     1| 0.13036774098873138|
+----------+--------+------+--------------------+
only showing top 10 rows



### Initial Evaluation

Based on our first swing of the bat:
- `maxIter` = 20
- `rank` = 10

In [47]:
# Evaluate the model by computing the RMSE on the test data
eval_reg = RegressionEvaluator(metricName="rmse", labelCol="bought",
                                predictionCol="prediction")

In [48]:
rmse = eval_reg.evaluate(predictions)

In [49]:
print("RMSE= " + str(rmse))

RMSE= nan


Oh boy, the RMSE came back as NaN, so some predictions must be NaN. Better remove those (for now).

evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC'
                                          ,labelCol='bought'
                                          ,rawPredictionCol='prediction'
                                          )

auc = evaluator.evaluate(predictions)

print("Area Under the Curve = " + str(auc))
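One caveat worth checking before relying on AUC here: it is defined over pairs of one positive and one negative example, and every row in this test set has `bought` = 1. The pure-Python sketch below (a hand-rolled pairwise AUC, not Spark's implementation) shows that the metric has nothing to compare when a class is missing.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None when either class is absent.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


mixed_auc = pairwise_auc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
all_positive_auc = pairwise_auc([1, 1, 1], [0.9, 0.2, 0.4])
print(mixed_auc, all_positive_auc)
```

Getting a meaningful AUC would likely require adding some unobserved (account, comic) pairs to the test set as negatives.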

## Check nans

In [50]:
# Convert to pandas dataframe
pred_df = predictions.select('*').toPandas()

# Check nulls
pred_num_nulls = pred_df['prediction'].isna().sum()

# Num rows
preds_attempted = pred_df.shape[0]

In [51]:
print("There are {} nulls out of {}.".format(pred_num_nulls, preds_attempted))

There are 409 nulls out of 18401.


In [52]:
print("So {:.2%} are nulls.".format(pred_num_nulls/preds_attempted))

So 2.22% are nulls.


Since they're such a small portion of the test set (about 2%), let's remove the nulls for now.
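The NaNs most likely come from cold-start rows: ALS can only score a pair if both the account and the comic appeared in the training split. A pure-Python sketch of that check on toy pairs (made-up ids):

```python
def cold_start_pairs(train_pairs, test_pairs):
    """Return test (user, item) pairs whose user or item never occurs in train."""
    seen_users = {u for u, _ in train_pairs}
    seen_items = {i for _, i in train_pairs}
    return [(u, i) for u, i in test_pairs
            if u not in seen_users or i not in seen_items]


train_pairs = [(1, 10), (1, 11), (2, 10)]
test_pairs = [(2, 11), (3, 10), (1, 12)]
unscorable = cold_start_pairs(train_pairs, test_pairs)
print(unscorable)
```

In Spark 2.2 and later, `ALS` also accepts `coldStartStrategy='drop'`, which removes such rows from the output of `transform` so the evaluator never sees a NaN.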

In [53]:
predictions.count()

18401

In [54]:
# Convert back to spark dataframe
predictions = spark.createDataFrame(pred_df)

In [55]:
predictions.select([count(when(isnan(c), c)).alias(c) for c in predictions.columns]).show()

+----------+--------+------+----------+
|account_id|comic_id|bought|prediction|
+----------+--------+------+----------+
|         0|       0|     0|       409|
+----------+--------+------+----------+



In [56]:
pred_no_na = predictions.na.drop()

In [57]:
pred_no_na.persist()

DataFrame[account_id: bigint, comic_id: bigint, bought: bigint, prediction: double]

In [58]:
pred_no_na.count()

17992

In [59]:
# Evaluate the model by computing the rmse on the test data
rmse2 = eval_reg.evaluate(pred_no_na)

print("RMSE = " + str(rmse2))

RMSE = 0.4390995770366977


Ok, so that gives us a baseline. Before trying a little grid search action, let's sanity-check some actual recommendations.

### Get Top N recommendations for Single User

Let's make a reference list of `account_id`'s, for testing purposes.

In [62]:
n_to_test = 5

users = (tot_sold_ids_only.select(als.getUserCol())
                          .sample(False
                                  ,n_to_test/tot_sold_ids_only.count()
                                  ,41916)
        )
users.persist()
users.show()

+----------+
|account_id|
+----------+
|       593|
|       315|
|      1165|
|      1621|
|       121|
|       770|
+----------+
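
Note that six ids came back even though `n_to_test` is 5. `sample` takes a fraction, not a count, and keeps each row independently with that probability, so the result size only averages out to the target. A small simulation in plain Python (the row count matches this table, the seeds are arbitrary):

```python
import random


def bernoulli_sample_size(n_rows, fraction, seed):
    """Count rows kept when each row is kept independently with `fraction`."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_rows) if rng.random() < fraction)


n_rows, target = 92098, 5
sizes = [bernoulli_sample_size(n_rows, target / n_rows, seed)
         for seed in range(20)]
print(sizes)
```

Because the sample is drawn from (account, comic) rows rather than from distinct accounts, accounts with many titles are more likely to be picked; selecting `distinct()` accounts first would be one way to avoid that.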



We developed this functionality and wrote it out to a function in `comic_recs.py`.

###  Testing function!

- The function asks for an `account_id`.
- It returns the top n recommendations as a pandas dataframe, with n set by the `topn` parameter.
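
The internals of `get_top_n_recs_for_user` are not shown in this notebook. As a rough illustration of what any top-n step over a factorization model does, the sketch below scores every comic for one account by the dot product of their latent factors and keeps the n best. The factor values and the option to skip already-bought titles are assumptions for the example.

```python
import heapq


def top_n_items(user_vec, item_factors, n, exclude=()):
    """Rank items by dot product with the user's factor vector."""
    scores = {
        item: sum(u * v for u, v in zip(user_vec, vec))
        for item, vec in item_factors.items()
        if item not in exclude
    }
    return heapq.nlargest(n, scores, key=scores.get)


# Made-up 2-dimensional factors
item_factors = {
    "Saga (Image)": [0.9, 0.1],
    "Batman (DC)": [0.2, 0.8],
    "Chew (Image)": [0.7, 0.3],
}
user_vec = [1.0, 0.0]
print(top_n_items(user_vec, item_factors, n=2))
```

In Spark itself, `ALSModel.recommendForUserSubset` (Spark 2.3 and later) produces this kind of ranked list directly from a fitted model.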

In [64]:
top_n_df = cr.get_top_n_recs_for_user(spark=spark, model=model, topn=5)
top_n_df

593


| | comic_title |
|---|---|
| 1 | Star Wars (Marvel) |
| 2 | X-Men Grand Design Second Gen (Marvel) |
| 3 | Thor (Marvel) |
| 4 | 100th Anniversary Special (Marvel) |
| 5 | Fantastic Four (Marvel) |


In [65]:
top_n_df = cr.get_top_n_recs_for_user(spark=spark, model=model, topn=5)
top_n_df

315


| | comic_title |
|---|---|
| 1 | Saga (Image) |
| 2 | Walking Dead (Image) |
| 3 | Chew (Image) |
| 4 | East of West (Image) |
| 5 | Batman (DC) |


In [66]:
top_n_df = cr.get_top_n_recs_for_user(spark=spark, model=model, topn=5)
top_n_df

161


| | comic_title |
|---|---|
| 1 | Isola (Image) |
| 2 | Adventure Time 2013 Spoooktac (Boom) |
| 3 | Umbrella Academy Hotel Oblivi (Dark Horse) |
| 4 | Sandman Universe (Vertigo) |
| 5 | Criminal (Image) |


## Conclusions after single run
- Seems realistic? Only three tests, but the results seem 'individualized' in the sense that there is no overlap between the sets (albeit small samples).

---

## Grid Search

In [79]:
# Create a parameter grid
params = (ParamGridBuilder()
          .addGrid(als.regParam, [1, 0.1])
          .addGrid(als.maxIter, [20, 30])
          .addGrid(als.rank, [10, 20])
          .addGrid(als.alpha, [30, 40])).build()

cv = TrainValidationSplit(estimator=als
                          , evaluator=eval_reg
                          , estimatorParamMaps=params
                          , trainRatio=0.8)
#cv = CrossValidator(estimator=als_implicit, estimatorParamMaps=paramGrid, evaluator=RegressionEvaluator())
model_implicit = cv.fit(train)

In [80]:
# Cross-validate for best hyperparameters
cv = CrossValidator(estimator=als
                    ,estimatorParamMaps=params
                    ,evaluator=eval_reg
                    ,parallelism=4
                   )
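
It helps to size the search before launching it. The grid has four parameters with two values each, and `CrossValidator` uses 3 folds unless `numFolds` is set, so the count below is the number of ALS fits to expect (plus one final refit on the full input).

```python
from itertools import product

# Same values as the ParamGridBuilder above
grid = {
    "regParam": [1, 0.1],
    "maxIter": [20, 30],
    "rank": [10, 20],
    "alpha": [30, 40],
}

combos = list(product(*grid.values()))
num_folds = 3  # CrossValidator's default numFolds

print(len(combos), "parameter combinations")
print(len(combos) * num_folds, "fits during cross-validation")
```

With up to 30 iterations per fit, trimming the grid or starting with `TrainValidationSplit` (one split instead of three folds) keeps the first pass cheaper.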

In [81]:
now = time.ctime(int(time.time()))
print("Completed on {}.".format(now))

Completed on Fri Jun 28 05:43:55 2019.


In [None]:
# Fit on the training set (not the test set) and store the model
best_model = cv.fit(train)

In [4]:
als_model = best_model.bestModel

NameError: name 'best_model' is not defined

In [None]:
now = time.ctime(int(time.time()))
print("Completed on {}.".format(now))