# Comics Rx
## [A comic book recommendation system](https://github.com/MangrobanGit/comics_rx)
<img src="https://images.unsplash.com/photo-1514329926535-7f6dbfbfb114?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2850&q=80" width="400" align='left'>

---

# B. ALS with PySpark

---

# Libraries

In [9]:
%matplotlib inline
%load_ext autoreload
# autoreload mode 2 reloads all modules; mode 1 reloads only those marked with %aimport
%autoreload 2
#%aimport data_fcns

import pandas as pd # dataframes
import os
import gspread_pandas
from gspread_pandas import Spread, Client # gsheets interaction

# Data storage
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

# import necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.sql.types import (StructType, StructField, IntegerType
                               ,FloatType, LongType )
import pyspark.sql.functions as F
from pyspark.sql.functions import col

# Custom
import data_fcns as dfc
import keys  # Custom keys lib


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
# instantiate SparkSession object
spark = pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate()
# spark = SparkSession.builder.master("local").getOrCreate()

schema = StructType(
    [
                    StructField('userID', IntegerType())
                    ,StructField('movieID', IntegerType())
                    ,StructField('rating', FloatType())
                    ,StructField('timestamp', LongType())
    ]
)

## Import the data

There is a way to read PostgreSQL directly through JDBC, but I don't know how to do that yet. As a workaround, I saved the candidate dataset to JSON and will use that as the input to Spark.
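For reference, here is a rough sketch of what the JDBC route could look like. It is not what this notebook does. The host, database, table, and credentials are all hypothetical placeholders, and it assumes the PostgreSQL JDBC driver jar is available to Spark. The helper names `postgres_jdbc_options` and `read_postgres_table` are made up for illustration.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Build the option dict for Spark's built-in JDBC data source."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def read_postgres_table(spark, options):
    """Load one table as a Spark DataFrame.

    Assumes the PostgreSQL JDBC jar is on Spark's classpath.
    """
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
opts = postgres_jdbc_options("localhost", 5432, "comics_db", "trans", "me", "secret")
print(opts["url"])
```

With a live session, `read_postgres_table(spark, opts)` would stand in for the `spark.read.json(...)` call, so the JSON export step could be skipped.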

In [11]:
comics = spark.read.json('raw_data/trans.json')

In [12]:
# Persist the data
comics.persist()

DataFrame[account_num: string, comic_title: string, date_sold: bigint, item_id: string, publisher: string, qty_sold: bigint, title_and_num: string]

###### read in the dataset into pyspark DataFrame
testing = spark.read.csv('./data/ratings.csv'
                               , inferSchema=False
                               , schema=schema
                               , header=True)

In [13]:
# check schema
comics.printSchema()

root
 |-- account_num: string (nullable = true)
 |-- comic_title: string (nullable = true)
 |-- date_sold: long (nullable = true)
 |-- item_id: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- qty_sold: long (nullable = true)
 |-- title_and_num: string (nullable = true)



### More exploration/testing

We won't use pandas dataframes for the matrix factorization in Spark, but let's convert to one anyway since it is easier to work with for EDA.

In [14]:
# convert to a pandas dataframe to turn timestamp data into datetimes and check nulls
comics_df = comics.toPandas()
comics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494703 entries, 0 to 494702
Data columns (total 7 columns):
account_num      494703 non-null object
comic_title      494703 non-null object
date_sold        494703 non-null int64
item_id          494703 non-null object
publisher        494703 non-null object
qty_sold         494703 non-null int64
title_and_num    494703 non-null object
dtypes: int64(2), object(5)
memory usage: 26.4+ MB


In [15]:
# Let's double check the data is how we expect it
comics_df.head()

| | account_num | comic_title | date_sold | item_id | publisher | qty_sold | title_and_num |
|---|---|---|---|---|---|---|---|
| 0 | 174 | Filler Bunny (SLG) | 1313344863000 | DCD151935 | Amaze Ink Slave Labor Graphics | 1 | Filler Bunny #2 |
| 1 | 593 | Gargoyles (SLG) | 1340374297000 | DCD341726 | Amaze Ink Slave Labor Graphics | 1 | Gargoyles #6 |
| 2 | 226 | Royal Historian of Oz (SLG) | 1279720987000 | DCD416182 | Amaze Ink Slave Labor Graphics | 1 | Royal Historian of Oz #1 |
| 3 | 399 | Royal Historian of Oz (SLG) | 1279136980000 | DCD416182 | Amaze Ink Slave Labor Graphics | 1 | Royal Historian of Oz #1 |
| 4 | 237 | Royal Historian of Oz (SLG) | 1279535944000 | DCD416182 | Amaze Ink Slave Labor Graphics | 1 | Royal Historian of Oz #1 |


In [16]:
comics_df['dt'] = pd.to_datetime(comics_df['date_sold'], unit='ms')

Yes. I cross-checked against the original transactions dataframe in the other notebook, and these datetimes are correct.
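As an independent sanity check on the `unit='ms'` choice, the same conversion can be sketched with only the standard library. This takes the `date_sold` value from the first row shown by `comics_df.head()` and assumes the epoch values are in UTC, which the notebook does not state explicitly.

```python
from datetime import datetime, timezone

# date_sold from the first row of comics_df.head(), in epoch milliseconds
date_sold_ms = 1313344863000

# Divide by 1000 to get seconds; assuming the epoch value is in UTC
dt = datetime.fromtimestamp(date_sold_ms / 1000, tz=timezone.utc)
print(dt.isoformat())
```

A date in 2011 is plausible for a comic sale. Reading the same number as seconds instead of milliseconds would land tens of thousands of years in the future.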

### Data Prep for ALS

Let's aggregate the data down to the three columns we need:
- `account_num` - The identifier for individual customers.


- `comic_title` - The comic. Represents individual volumes/runs of a comic.


- `score` - We need to decide what to use as a `score`. If these were Amazon items, review scores would be a natural fit, but we don't have those. We could use a binary `bought`/`not bought` flag, or we could use `qty_sold`, which might capture some interesting behavior from comic collectors/speculators. Since this is a first pass, I'm curious what `qty_sold` might do!


So we only care about `account_num`, `comic_title` and `qty_sold`.
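To make the two candidate scores concrete, here is a tiny plain-Python sketch, separate from the Spark pipeline. The transactions are made up for illustration (they are not real rows from the dataset). It shows how a summed `qty_sold` score and a binary `bought` flag would differ for the same customer/title pairs.

```python
from collections import defaultdict

# Made-up transactions for illustration: (account_num, comic_title, qty_sold)
transactions = [
    ("00191", "Caliban (Avatar)", 4),
    ("00191", "Caliban (Avatar)", 3),
    ("00172", "Supergod (Avatar)", 3),
]

# Option 1: total quantity bought per (customer, title) pair
qty_score = defaultdict(int)
for account, title, qty in transactions:
    qty_score[(account, title)] += qty

# Option 2: binary flag, 1 for every pair that appears at all
bought_score = {pair: 1 for pair in qty_score}

print(dict(qty_score))
print(bought_score)
```

The quantity version keeps the signal from repeat or multi-copy buyers, while the binary version treats every purchase the same.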

In [17]:
comics_sold = comics[['account_num', 'comic_title', 'qty_sold']]
comics_sold.persist()

DataFrame[account_num: string, comic_title: string, qty_sold: bigint]

In [18]:
total_comics_sold = comics_sold.groupBy(['account_num', 'comic_title']).agg({'qty_sold':'sum'})
total_comics_sold.persist()

DataFrame[account_num: string, comic_title: string, sum(qty_sold): bigint]

Ok, let's take a look at the results.

In [19]:
total_comics_sold.show(10)

+-----------+--------------------+-------------+
|account_num|         comic_title|sum(qty_sold)|
+-----------+--------------------+-------------+
|      01858|Afterlife With Ar...|            5|
|      02247|Bubblegun VOL 2 (...|            1|
|      00191|    Caliban (Avatar)|            7|
|      00487|Captain Swing (Av...|            2|
|      00029|God Is Dead (Avatar)|            7|
|      01260| Providence (Avatar)|            1|
|      00172|   Supergod (Avatar)|            3|
|      01132|Futurama Annual (...|            3|
|      02493|       Abbott (Boom)|            3|
|      00298|Adventure Time (B...|            2|
+-----------+--------------------+-------------+
only showing top 10 rows



I don't like that default column name. Let's fix that to be `qty_sold` again.

In [20]:
total_comics_sold = total_comics_sold.withColumnRenamed('sum(qty_sold)', 'qty_sold')
total_comics_sold.persist()

DataFrame[account_num: string, comic_title: string, qty_sold: bigint]

In [21]:
total_comics_sold.show(10)

+-----------+--------------------+--------+
|account_num|         comic_title|qty_sold|
+-----------+--------------------+--------+
|      01858|Afterlife With Ar...|       5|
|      02247|Bubblegun VOL 2 (...|       1|
|      00191|    Caliban (Avatar)|       7|
|      00487|Captain Swing (Av...|       2|
|      00029|God Is Dead (Avatar)|       7|
|      01260| Providence (Avatar)|       1|
|      00172|   Supergod (Avatar)|       3|
|      01132|Futurama Annual (...|       3|
|      02493|       Abbott (Boom)|       3|
|      00298|Adventure Time (B...|       2|
+-----------+--------------------+--------+
only showing top 10 rows



Nice!

#### Formatting

So, I forgot that ALS needs numeric user and item ids. Let's fix that.

First, let's convert `account_num` to an integer.

In [22]:
to_int_udf = F.udf(dfc.make_int, IntegerType())

In [23]:
account_num_col = total_comics_sold['account_num']

In [25]:
test_df = total_comics_sold.withColumn('account_id'
                                        ,to_int_udf(account_num_col))

In [26]:
test_df.show(10)

+-----------+--------------------+--------+----------+
|account_num|         comic_title|qty_sold|account_id|
+-----------+--------------------+--------+----------+
|      01858|Afterlife With Ar...|       5|      1858|
|      02247|Bubblegun VOL 2 (...|       1|      2247|
|      00191|    Caliban (Avatar)|       7|       191|
|      00487|Captain Swing (Av...|       2|       487|
|      00029|God Is Dead (Avatar)|       7|        29|
|      01260| Providence (Avatar)|       1|      1260|
|      00172|   Supergod (Avatar)|       3|       172|
|      01132|Futurama Annual (...|       3|      1132|
|      02493|       Abbott (Boom)|       3|      2493|
|      00298|Adventure Time (B...|       2|       298|
+-----------+--------------------+--------+----------+
only showing top 10 rows
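The `dfc.make_int` helper wrapped by the UDF lives in the custom `data_fcns` module and is not shown in this notebook. A hypothetical stand-in, assuming it simply parses the zero-padded string and tolerates bad values, might look like this:

```python
def make_int(value):
    """Hypothetical stand-in for dfc.make_int: parse a zero-padded id string.

    Returns None for missing or non-numeric input; a Python UDF that
    returns None produces a null in the resulting Spark column.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


print(make_int("00191"))
```

If the ids are always clean digit strings, Spark's built-in column cast (`col('account_num').cast('int')`) would likely do the same job without a Python UDF.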



Now I need a way to give ids to `comic_title`. It's kind of clunky, but the big table already lives in PostgreSQL, so I can build an ID table there as the source of truth. I could do something on the PySpark side, but I think I would want to save the result somewhere (e.g. the DB) anyway, so I might as well do it at the source.
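Wherever the ID table ends up living, the idea is just a stable mapping from each distinct title to an integer. Here is a minimal plain-Python sketch of that mapping. The function name `build_title_ids` is made up for illustration, and sorting the titles first is an assumption made so the ids come out the same on every run.

```python
def build_title_ids(titles):
    """Assign an integer id to each distinct title.

    Titles are sorted first so the ids are deterministic.
    """
    distinct = sorted(set(titles))
    return {title: idx for idx, title in enumerate(distinct, start=1)}


# A few titles from the output above, with one repeated on purpose
titles = ["Caliban (Avatar)", "Supergod (Avatar)", "Caliban (Avatar)", "Abbott (Boom)"]
title_ids = build_title_ids(titles)
print(title_ids)
```

On the PySpark side, `StringIndexer` from `pyspark.ml.feature` does a similar job, though the mapping would still need to be saved somewhere to stay consistent across runs.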

### ALS Model

Let's start with a train/test split.

In [None]:
# Split data into training and test set
(train, test) = total_comics_sold.randomSplit([.8, .2])

Instantiate and fit the model.

In [None]:
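# Create ALS instance and fit model
# NOTE: ALS needs numeric user and item ids. 'account_num' and 'comic_title'
# are still strings here, so swap in the numeric id columns once they exist.
als = ALS(maxIter=10,
          rank=10,
          userCol='account_num',
          itemCol='comic_title',
          ratingCol='qty_sold',
          seed=41916)

model = als.fit(train)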

---

# <font color="red">Cleaning Tasks I will ignore as I race to get my FSM</font>
- Remove outlier users (eBay, one-timers, etc)