# Comics Rx
## [A comic book recommendation system](https://github.com/MangrobanGit/comics_rx)
<img src="https://images.unsplash.com/photo-1514329926535-7f6dbfbfb114?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2850&q=80" width="400" align='left'>

---

# B. ALS with PySpark

---

# Libraries

In [36]:
%matplotlib inline
%load_ext autoreload
# 2 reloads all modules before each cell runs; 1 reloads only modules marked with %aimport
%autoreload 2
#%aimport data_fcns

import pandas as pd # dataframes
import os
import gspread_pandas
from gspread_pandas import Spread, Client # gsheets interaction

# Data storage
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

# import necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.types import (StructType, StructField, IntegerType
                               ,FloatType, LongType )
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.ml.recommendation import ALS, ALSModel

# Custom
import data_fcns as dfc
import keys  # Custom keys lib

import time

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
# instantiate SparkSession object
spark = pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate()
# spark = SparkSession.builder.master("local").getOrCreate()

## Import the data

There is a way to hit PostgreSQL directly through JDBC, but I don't know how to do that yet. So I worked around it by saving the candidate dataset to JSON, which I'll use as the input to Spark.

In [3]:
trans = spark.read.json('raw_data/trans.json')
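
For comparison, the JDBC route mentioned above might look roughly like the sketch below. The helper name `postgres_jdbc_options`, the host, database, table, and credentials are all hypothetical placeholders, and the actual read (left commented out) assumes the PostgreSQL JDBC driver jar is available to Spark.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Build the option dict for Spark's JDBC reader (hypothetical helper)."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


# Placeholder connection details, not the real database.
opts = postgres_jdbc_options("localhost", 5432, "comics", "trans", "me", "secret")
print(opts["url"])

# With a SparkSession named `spark` and the driver jar on the classpath,
# the read itself would be something like:
# trans = spark.read.format("jdbc").options(**opts).load()
```

The JSON workaround avoids the driver setup entirely, which is a reasonable trade for a first pass.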

In [4]:
# Persist the data
trans.persist()

DataFrame[account_num: string, comic_title: string, date_sold: bigint, item_id: string, publisher: string, qty_sold: bigint, title_and_num: string]

In [5]:
print(trans.count(), len(trans.columns))

494703 7


In [6]:
# check schema
trans.printSchema()

root
 |-- account_num: string (nullable = true)
 |-- comic_title: string (nullable = true)
 |-- date_sold: long (nullable = true)
 |-- item_id: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- qty_sold: long (nullable = true)
 |-- title_and_num: string (nullable = true)



### More exploration/testing

We won't be using pandas dataframes in the matrix factorization through Spark, but let's cast to one anyway as it will be easier to work with for EDA.

In [None]:
# cast to Pandas dataframe to turn timestamp data to datetime and check nulls. 
trans_df = trans.select('*').toPandas()
trans_df.info()

In [None]:
# Let's double check the data is how we expect it
trans_df.head()

In [None]:
trans_df['dt'] = pd.to_datetime(trans_df['date_sold'], unit='ms')

Yes. I cross-checked against the original transactions dataframe in the other notebook, and this datetime is correct.
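
A quick way to sanity check a single `date_sold` value by hand is to convert it with the standard library. This is only a sketch: the helper name `ms_to_datetime` and the sample value are made up for illustration, and it assumes the column holds milliseconds since the Unix epoch in UTC, which is what `unit='ms'` implies.

```python
from datetime import datetime, timezone


def ms_to_datetime(ms):
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Hypothetical value, not taken from the dataset.
example_ms = 1_561_472_796_000
print(ms_to_datetime(example_ms))
```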

### Data Prep for ALS

Let's aggregate the data to the three columns we need:
- `account_num` - This is the identifier for individual customers.
- `comic_title` - The comic. Represents individual volumes/runs of a comic.
- `score` - We need to figure out what to use as a `score`. If these were Amazon items, review scores would be a natural fit, but we don't have those. We could use a binary `bought`/`not bought` flag, or we could use `qty_sold`, which might capture some interesting behavior from comic 'collectors/speculators'. Since this is a first pass, I'm curious what `qty_sold` might do!


We only care about `account_num`, `comic_title` and `qty_sold`.
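
To make the two candidate scores concrete, here is a small pure-Python sketch with made-up transactions (the account numbers and titles are placeholders, not rows from the dataset). It builds both the summed `qty_sold` score and the binary `bought` flag for each account/title pair.

```python
from collections import defaultdict

# Hypothetical transactions: (account_num, comic_title, qty_sold)
transactions = [
    ("00001", "Title A", 1),
    ("00001", "Title A", 2),
    ("00001", "Title B", 1),
    ("00002", "Title A", 5),
]

# Option 1: total quantity bought per (account, title) pair.
qty_score = defaultdict(int)
for account, title, qty in transactions:
    qty_score[(account, title)] += qty

# Option 2: binary flag, 1 for every pair that appears at all.
bought_score = {pair: 1 for pair in qty_score}

for pair in sorted(qty_score):
    print(pair, qty_score[pair], bought_score[pair])
```

The binary flag throws away the collector/speculator signal, while the summed quantity keeps it at the cost of a much noisier target.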

In [7]:
comics_sold = trans[['account_num', 'comic_title', 'qty_sold']]
comics_sold.persist()

DataFrame[account_num: string, comic_title: string, qty_sold: bigint]

In [8]:
total_comics_sold = comics_sold.groupBy(['account_num', 'comic_title']).agg({'qty_sold':'sum'})
total_comics_sold.persist()

DataFrame[account_num: string, comic_title: string, sum(qty_sold): bigint]

Ok, let's take a look at the results.

In [9]:
total_comics_sold.show(10)

+-----------+--------------------+-------------+
|account_num|         comic_title|sum(qty_sold)|
+-----------+--------------------+-------------+
|      01858|Afterlife With Ar...|            5|
|      02247|Bubblegun VOL 2 (...|            1|
|      00191|    Caliban (Avatar)|            7|
|      00487|Captain Swing (Av...|            2|
|      00029|God Is Dead (Avatar)|            7|
|      01260| Providence (Avatar)|            1|
|      00172|   Supergod (Avatar)|            3|
|      01132|Futurama Annual (...|            3|
|      02493|       Abbott (Boom)|            3|
|      00298|Adventure Time (B...|            2|
+-----------+--------------------+-------------+
only showing top 10 rows



I don't like that default column name. Let's fix that to be `qty_sold` again.

In [10]:
total_comics_sold = total_comics_sold.withColumnRenamed('sum(qty_sold)',
                                                        'qty_sold')
total_comics_sold.persist()

DataFrame[account_num: string, comic_title: string, qty_sold: bigint]

In [11]:
total_comics_sold.show(10)

+-----------+--------------------+--------+
|account_num|         comic_title|qty_sold|
+-----------+--------------------+--------+
|      01858|Afterlife With Ar...|       5|
|      02247|Bubblegun VOL 2 (...|       1|
|      00191|    Caliban (Avatar)|       7|
|      00487|Captain Swing (Av...|       2|
|      00029|God Is Dead (Avatar)|       7|
|      01260| Providence (Avatar)|       1|
|      00172|   Supergod (Avatar)|       3|
|      01132|Futurama Annual (...|       3|
|      02493|       Abbott (Boom)|       3|
|      00298|Adventure Time (B...|       2|
+-----------+--------------------+--------+
only showing top 10 rows



In [12]:
print(total_comics_sold.count(), len(total_comics_sold.columns))

92098 3


Nice!

### Formatting

Sooooooo, I forgot that ALS needs numeric user and item IDs, and both of ours are strings. So I need to fix that.

#### Convert `account_num` to integer `account_id`

In [13]:
to_int_udf = F.udf(dfc.make_int, IntegerType())

In [14]:
account_num_col = total_comics_sold['account_num']

In [15]:
total_comics_sold = total_comics_sold.withColumn('account_id'
                                        ,to_int_udf(account_num_col))
total_comics_sold.persist()

DataFrame[account_num: string, comic_title: string, qty_sold: bigint, account_id: int]

In [16]:
total_comics_sold.show(10)

+-----------+--------------------+--------+----------+
|account_num|         comic_title|qty_sold|account_id|
+-----------+--------------------+--------+----------+
|      01858|Afterlife With Ar...|       5|      1858|
|      02247|Bubblegun VOL 2 (...|       1|      2247|
|      00191|    Caliban (Avatar)|       7|       191|
|      00487|Captain Swing (Av...|       2|       487|
|      00029|God Is Dead (Avatar)|       7|        29|
|      01260| Providence (Avatar)|       1|      1260|
|      00172|   Supergod (Avatar)|       3|       172|
|      01132|Futurama Annual (...|       3|      1132|
|      02493|       Abbott (Boom)|       3|      2493|
|      00298|Adventure Time (B...|       2|       298|
+-----------+--------------------+--------+----------+
only showing top 10 rows



In [17]:
print(total_comics_sold.count(), len(total_comics_sold.columns))

92098 4


Now I need a way to give IDs to the `comic_title` values. It's kind of clunky, but I do have a version of the big table in PostgreSQL, so I can build an ID table there as the source of truth. I could do something on the PySpark side, but I think I'd want to save it somewhere (e.g. the DB) anyway, so I might as well do it from the top.
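
The idea of an ID table can be sketched in plain Python: take the distinct titles, sort them, and number them from 1. The helper `build_id_table` is hypothetical, and the IDs it produces here are illustrative only; they will not match the real table built in PostgreSQL.

```python
def build_id_table(titles):
    """Map each distinct title to a 1-based integer ID, in sorted order."""
    return {title: i for i, title in enumerate(sorted(set(titles)), start=1)}


# A few titles with a deliberate duplicate.
titles = [
    "Caliban (Avatar)",
    "Abbott (Boom)",
    "Caliban (Avatar)",
    "Supergod (Avatar)",
]
comic_ids = build_id_table(titles)
print(comic_ids)
```

On the Spark side, `pyspark.ml.feature.StringIndexer` is one option for the same job, but keeping the mapping in the database means every notebook sees the same IDs.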

#### Get `comic_id`

In [18]:
comics = spark.read.json('raw_data/comics.json')
comics.persist()

DataFrame[comic_id: bigint, comic_title: string]

In [19]:
comics.count()

7202

In [20]:
comics.show(10)

+--------+--------------------+
|comic_id|         comic_title|
+--------+--------------------+
|       1|0Secret Wars (Mar...|
|       2|100 Bullets Broth...|
|       3|100 Penny Press L...|
|       4|100 Penny Press S...|
|       5|100 Penny Press T...|
|       6|100 Penny Press T...|
|       7|100th Anniversary...|
|       8|12 Reasons To Die...|
|       9|    13 Coins (Other)|
|      10|13th Artifact One...|
+--------+--------------------+
only showing top 10 rows



In [21]:
print(comics.count(), len(comics.columns))

7202 2


Now we need to join this back into `total_comics_sold`.

In [22]:
# Set aliases
tot = total_comics_sold.alias('tot')
com = comics.alias('com')

In [23]:
tot_sold_ids_only = tot.join(com.select('comic_id','comic_title')
                      ,tot.comic_title==com.comic_title).select('account_id'
                                                                , 'comic_id'
                                                                , 'qty_sold')
tot_sold_ids_only.persist()
tot_sold_ids_only.show(10)

+----------+--------+--------+
|account_id|comic_id|qty_sold|
+----------+--------+--------+
|      1858|     128|       5|
|      2247|     995|       1|
|       191|    1039|       7|
|       487|    1102|       2|
|        29|    2680|       7|
|      1260|    4870|       1|
|       172|    6023|       3|
|      1132|    2413|       3|
|      2493|      66|       3|
|       298|     110|       2|
+----------+--------+--------+
only showing top 10 rows



In [24]:
tot_sold_ids_only.printSchema()

root
 |-- account_id: integer (nullable = true)
 |-- comic_id: long (nullable = true)
 |-- qty_sold: long (nullable = true)



In [25]:
print(tot_sold_ids_only.count(), len(tot_sold_ids_only.columns))

92098 3
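
The row count is still 92098 after the join, which suggests every title found a matching `comic_id`. An inner join silently drops rows with no match, so it is worth checking. Below is a pure-Python sketch of that check with made-up rows; `inner_join_ids` is a hypothetical helper, not part of `data_fcns`.

```python
def inner_join_ids(sales, comic_ids):
    """Swap titles for IDs; return joined rows plus titles with no ID."""
    joined, unmatched = [], []
    for account_id, title, qty in sales:
        if title in comic_ids:
            joined.append((account_id, comic_ids[title], qty))
        else:
            unmatched.append(title)
    return joined, unmatched


# Hypothetical rows and ID table; "Title C" has no ID on purpose.
sales = [(1858, "Title A", 5), (2247, "Title B", 1), (191, "Title C", 7)]
comic_ids = {"Title A": 1, "Title B": 2}

joined, unmatched = inner_join_ids(sales, comic_ids)
print(len(sales), len(joined), unmatched)
```

In Spark, a join with `how='left_anti'` against the comics table would list the unmatched rows directly.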


### ALS Model

Let's start with a train/test split.

In [26]:
# Split data into training and test set
(train, test) = tot_sold_ids_only.randomSplit([.8, .2])

Make sure shapes make sense.

In [27]:
print(train.count(), len(train.columns))

73806 3


In [28]:
print(test.count(), len(test.columns))

18292 3
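
The counts are close to 80/20 but not exact, because each row is assigned at random. The split above is also unseeded, so rerunning the notebook gives different train and test sets. A minimal sketch of a reproducible per-row split, in plain Python, looks like this; `split_rows` is a hypothetical helper and the seed simply reuses the value from the ALS cell.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=41916):
    """Assign each row to train or test with a seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for the real rows
train_a, test_a = split_rows(rows)
train_b, test_b = split_rows(rows)

# Same seed, same split.
print(train_a == train_b, len(train_a), len(test_a))
```

`randomSplit` accepts a `seed` argument as well, so the Spark version of this is a one-argument change.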


#### Inaugural ALS AKA FSM

Let's do this.

In [29]:
now = time.ctime(int(time.time()))
print("Started on {}.".format(now))

Started on Tue Jun 25 14:26:36 2019.


In [30]:
# Create ALS instance and fit model
als = ALS(maxIter=20,
          rank=10,
          userCol='account_id',
          itemCol='comic_id',
          ratingCol='qty_sold',
          seed=41916)

model = als.fit(train)

In [31]:
now = time.ctime(int(time.time()))
print("Completed on {}.".format(now))

Completed on Tue Jun 25 14:27:04 2019.


In [32]:
# Generate predictions
predictions = model.transform(test)
predictions.persist()

DataFrame[account_id: int, comic_id: bigint, qty_sold: bigint, prediction: float]

In [33]:
type(predictions)

pyspark.sql.dataframe.DataFrame

In [34]:
predictions.show(10)

+----------+--------+--------+----------+
|account_id|comic_id|qty_sold|prediction|
+----------+--------+--------+----------+
|      1621|     471|       1| 0.9409931|
|       224|     471|       1|  3.225736|
|      1110|     471|       1| 1.0456091|
|       566|     471|       1| 3.3903987|
|      2920|     833|       1|0.67359966|
|      1089|     833|       2|  4.801197|
|        37|     833|       2| 3.1951754|
|      2123|     833|       5| 1.1017541|
|      2573|     833|       3| 1.9102834|
|      2373|     833|       1|0.06598443|
+----------+--------+--------+----------+
only showing top 10 rows



Wow. Upon inspection? Horrid: accounts that bought a single copy are getting predictions above 3. And now I'm not sure `qty_sold` is good for anything.

### Initial Evaluation

Based on our first swing of the bat:
- `maxIter` = 20
- `rank` = 10

In [35]:
# Evaluate the model by computing the RMSE on the test data
evaluator = RegressionEvaluator(metricName="rmse", labelCol="qty_sold",
                                predictionCol="prediction")

rmse = evaluator.evaluate(predictions)

print("Root-mean-square error = " + str(rmse))

Root-mean-square error = nan


Bummer. I'm guessing it's because some predictions are NaN: ALS can't score an account or comic that only shows up in the test split. Let's look!
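
A single NaN prediction is enough to turn the whole RMSE into NaN. The sketch below shows that with made-up (actual, predicted) pairs, then recomputes the metric after dropping the NaN rows. The `rmse` helper and the numbers are illustrative assumptions, not output from the model.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


# Hypothetical pairs; the last prediction is NaN.
pairs = [(1, 0.94), (2, 3.2), (1, float("nan"))]
print(rmse(pairs))

# Drop rows with a NaN prediction, then score what is left.
scored = [(a, p) for a, p in pairs if not math.isnan(p)]
print(len(scored), rmse(scored))
```

In PySpark the equivalent options would be `predictions.na.drop()` before evaluating, or setting `coldStartStrategy="drop"` on the `ALS` instance so those rows never reach the evaluator.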

---

# <font color="red">Cleaning Tasks I will ignore as I race to get my FSM</font>
- Remove outlier users (eBay, one-timers, etc)