# Comics Rx
## [A comic book recommendation system](https://github.com/MangrobanGit/comics_rx)
<img src="https://images.unsplash.com/photo-1514329926535-7f6dbfbfb114?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2850&q=80" width="400" align='left'>

---

# ALS Model - Reduced Data - EDA, Prep

This time, following the exploration in the EDA notebook, let's remove customers whose purchase counts are too low or too high to inform the model in the way we intend.

Examples:
- Too few - Customers who have only bought 1 comic (series).
- Too many - Customers with > 1000 series (for example, I think all eBay customers are rolled into one account number).
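
As a rough illustration of the idea, here is a minimal plain-Python sketch of keeping only accounts whose number of distinct series falls inside a range. The function name `accounts_in_range`, the toy account labels, and the bounds are hypothetical; the notebook itself works from a pre-filtered transactions file.

```python
from collections import Counter

def accounts_in_range(pairs, min_series, max_series):
    """Return accounts whose distinct-series count lies in [min_series, max_series].

    `pairs` is an iterable of (account, series) tuples; duplicates are ignored.
    """
    series_per_account = Counter(account for account, _ in set(pairs))
    return {
        account
        for account, n_series in series_per_account.items()
        if min_series <= n_series <= max_series
    }

# Toy data (hypothetical): A bought 1 series, B bought 2, C bought 4.
pairs = [
    ("A", "s1"),
    ("B", "s1"), ("B", "s2"),
    ("C", "s1"), ("C", "s2"), ("C", "s3"), ("C", "s4"),
]
kept = accounts_in_range(pairs, min_series=2, max_series=3)
print(kept)
```

With these toy bounds, only account `B` survives: `A` is dropped for too few series and `C` for too many.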

# Libraries

In [1]:
%matplotlib inline
%load_ext autoreload
# autoreload mode 2 reloads all modules; mode 1 reloads only those named with %aimport
%autoreload 2
#%aimport data_fcns

import pandas as pd  # dataframes
import os
import time
import numpy as np

# Data storage
from sqlalchemy import create_engine  # SQL helper
import psycopg2 as psql  #PostgreSQL DBs

# import necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
# from pyspark.sql.types import (StructType, StructField, IntegerType
#                                ,FloatType, LongType, StringType)
from pyspark.sql.types import *

import pyspark.sql.functions as F
from pyspark.sql.functions import col, explode, lit, isnan, when, count
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.ml.tuning import (CrossValidator, ParamGridBuilder, 
                               TrainValidationSplit)
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [2]:
import sys

In [3]:
sys.path.append('..')

In [4]:
# Custom
import data_fcns as dfc
import keys  # Custom keys lib
import comic_recs as cr

In [5]:
# # instantiate SparkSession object
# spark = pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate()
# # spark = SparkSession.builder.master("local").getOrCreate()

In [6]:
from pyspark import SparkConf

conf = SparkConf()

conf = (conf.setMaster('local[*]')
#         .set('spark.executor.memory', '1G') #https://stackoverflow.com/questions/48523629/spark-pyspark-an-error-occurred-while-trying-to-connect-to-the-java-server-127
        .set('spark.driver.memory', '7G')
        .set('spark.driver.maxResultSize', '4G'))
#         .set('spark.executor.memory', '1G')
#         .set('spark.driver.memory', '10G')
#         .set('spark.driver.maxResultSize', '5G'))

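# Call getOrCreate on the class so that `conf` is actually applied;
# SparkContext() would first start a context with default settings.
sc = pyspark.SparkContext.getOrCreate(conf=conf)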

from pyspark.sql import SQLContext
sql_context = SQLContext(sc)

sc.setCheckpointDir('./checkpoints')

# spark.sparkContext.setCheckpointDir("hdfs://datalake/check_point_directory/als")

## Import the data

There is a way to hit PostgreSQL directly through JDBC, but I don't know how to do that yet. As a workaround, I saved the candidate dataset to JSON and will use that as input to Spark.
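
For reference, a minimal sketch of what the JDBC route might look like. It only builds the option dictionary for Spark's generic `jdbc` data source; the helper name `jdbc_read_options` and the host, database, table, and credential values are hypothetical placeholders, and the actual read (shown as a comment) assumes the PostgreSQL JDBC driver jar is available to Spark. This was not run as part of this notebook.

```python
def jdbc_read_options(host, port, database, table, user, password):
    """Build options for Spark's JDBC data source against PostgreSQL."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }

# Placeholder connection details (hypothetical, not from this project).
opts = jdbc_read_options("localhost", 5432, "comics", "trans_filtered",
                         "some_user", "some_password")
print(opts["url"])

# With the driver jar on Spark's classpath, the read would look something like:
# trans = sql_context.read.format("jdbc").options(**opts).load()
```

Until that is set up, the JSON export used below is the simpler path.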


In [7]:
# We have previously created a version of the transactions table 
# and filtered it down.
trans = sql_context.read.json('raw_data/trans_filtered.json')

In [8]:
# Persist the data
trans.persist()

DataFrame[account_num: string, comic_title: string, date_sold: bigint, item_id: string, publisher: string, qty_sold: bigint, title_and_num: string]

In [9]:
print(trans.count(), len(trans.columns))

327839 7


In [10]:
# check schema
trans.printSchema()

root
 |-- account_num: string (nullable = true)
 |-- comic_title: string (nullable = true)
 |-- date_sold: long (nullable = true)
 |-- item_id: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- qty_sold: long (nullable = true)
 |-- title_and_num: string (nullable = true)



### More exploration/testing

We won't be using pandas dataframes in the matrix factorization through Spark, but let's cast to one anyway as it will be easier to work with for EDA.

In [11]:
# cast to Pandas dataframe to turn timestamp data to datetime and check nulls. 
trans_df = trans.select('*').toPandas()
trans_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327839 entries, 0 to 327838
Data columns (total 7 columns):
account_num      327839 non-null object
comic_title      327839 non-null object
date_sold        327839 non-null int64
item_id          327839 non-null object
publisher        327839 non-null object
qty_sold         327839 non-null int64
title_and_num    327839 non-null object
dtypes: int64(2), object(5)
memory usage: 17.5+ MB


In [12]:
# Let's double check the data is how we expect it
trans_df.head()

Unnamed: 0,account_num,comic_title,date_sold,item_id,publisher,qty_sold,title_and_num
0,399,Royal Historian of Oz (SLG),1279136980000,DCD416182,Amaze Ink Slave Labor Graphics,1,Royal Historian of Oz #1
1,327,Royal Historian of Oz (SLG),1288543119000,DCD416182,Amaze Ink Slave Labor Graphics,1,Royal Historian of Oz #1
2,327,Royal Historian of Oz (SLG),1288543119000,DCD423794,Amaze Ink Slave Labor Graphics,1,Royal Historian of Oz #2
3,1065,Warlord of Io & Other Storie (SLG),1412166247000,DCD390709,Amaze Ink Slave Labor Graphics,1,Warlord of Io & Other Stories
4,1033,Afterlife With Archie (Archie),1390505789000,DCD630105,Archie Comics,1,Afterlife With Archie #1 2nd P


In [13]:
trans_df['dt'] = pd.to_datetime(trans_df['date_sold'], unit='ms')

Yes. I cross-checked these datetimes against the original transactions dataframe in the other notebook, and they are correct.
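
The conversion can also be sanity-checked without pandas. The sketch below uses only the standard library and one `date_sold` value from this dataset; the helper name `ms_to_utc` is made up for illustration.

```python
from datetime import datetime, timezone

def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# A date_sold value for 'Afterlife With Archie #1 2nd P' in this dataset.
ts = ms_to_utc(1390505789000)
formatted = ts.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2014-01-23 19:36:29
```

This agrees with the `dt` value that `pd.to_datetime(..., unit='ms')` produces for the same row.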

### Data Prep for ALS

Let's aggregate the data to the three columns we need:
- `account_num` - This is the identifier for individual customers.


- `comic_title` - The comic. Represents individual volumes/runs of a comic.


- `score` - We need to decide what acts as a `score`. If these were Amazon items, review scores would be a natural fit, but we don't have those. We could use a binary `bought`/`not bought` flag, or we could use `qty_sold`, which might capture some interesting behavior from comic 'collectors/speculators'. Since this is a first pass, I'm curious what `qty_sold` might do!


We only care about `account_num`, `comic_title` and `qty_sold`.
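
To make the two candidate scores concrete, here is a small plain-Python sketch that aggregates transaction rows to one entry per account-comic pair, carrying both the summed `qty_sold` and a binary `bought` flag. It illustrates the logic only and is not the Spark code used below; the three rows mirror the first rows of the transactions table.

```python
from collections import defaultdict

# (account_num, comic_title, qty_sold) rows, as in the head of the transactions table.
rows = [
    ("00399", "Royal Historian of Oz (SLG)", 1),
    ("00327", "Royal Historian of Oz (SLG)", 1),
    ("00327", "Royal Historian of Oz (SLG)", 1),
]

# Sum quantities per (account, comic) pair.
qty_by_pair = defaultdict(int)
for account_num, comic_title, qty_sold in rows:
    qty_by_pair[(account_num, comic_title)] += qty_sold

# Any pair that appears at all gets bought = 1.
scores = {
    pair: {"qty_sold": qty, "bought": 1}
    for pair, qty in qty_by_pair.items()
}
print(scores[("00327", "Royal Historian of Oz (SLG)")])
```

Account `00327` bought two issues of the same series, so its `qty_sold` score is 2 while its `bought` flag is still 1.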

In [14]:
comics_sold = trans[['account_num', 'comic_title', 'qty_sold']]
comics_sold.persist()

DataFrame[account_num: string, comic_title: string, qty_sold: bigint]

In [15]:
comics_sold = comics_sold.withColumn('bought', lit(1))

In [16]:
comics_sold.show(10)

+-----------+--------------------+--------+------+
|account_num|         comic_title|qty_sold|bought|
+-----------+--------------------+--------+------+
|      00399|Royal Historian o...|       1|     1|
|      00327|Royal Historian o...|       1|     1|
|      00327|Royal Historian o...|       1|     1|
|      01065|Warlord of Io & O...|       1|     1|
|      01033|Afterlife With Ar...|       1|     1|
|      01333|Afterlife With Ar...|       1|     1|
|      00946|Afterlife With Ar...|       1|     1|
|      01278|Afterlife With Ar...|       1|     1|
|      01212|Afterlife With Ar...|       1|     1|
|      00877|Afterlife With Ar...|       1|     1|
+-----------+--------------------+--------+------+
only showing top 10 rows



In [17]:
comics_sold = trans[['account_num', 'comic_title', 'qty_sold']]
comics_sold.persist()

DataFrame[account_num: string, comic_title: string, qty_sold: bigint]

In [18]:
total_comics_sold = ( comics_sold.groupBy(['account_num', 'comic_title'])
                               .agg({'qty_sold':'sum'})
                    )
total_comics_sold.persist()

DataFrame[account_num: string, comic_title: string, sum(qty_sold): bigint]

Ok, let's take a look at the results.

In [19]:
total_comics_sold.show(10)

+-----------+--------------------+-------------+
|account_num|         comic_title|sum(qty_sold)|
+-----------+--------------------+-------------+
|      02247|Bubblegun VOL 2 (...|            1|
|      00487|Captain Swing (Av...|            2|
|      00029|God Is Dead (Avatar)|            7|
|      01260| Providence (Avatar)|            1|
|      00172|   Supergod (Avatar)|            3|
|      02493|       Abbott (Boom)|            3|
|      00052|Adventure Time Ma...|            6|
|      00032|Big Trouble In Li...|           11|
|      01149| Broken World (Boom)|            2|
|      01489|Jim Henson Labyri...|            1|
+-----------+--------------------+-------------+
only showing top 10 rows



In [20]:
print(total_comics_sold.count(), len(total_comics_sold.columns))

61871 3


In [21]:
total_comics_sold = total_comics_sold.withColumn('bought', lit(1))

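I don't like that default column name, `sum(qty_sold)`. Since only the binary `bought` flag is carried forward in this pass, the column is dropped below rather than renamed back to `qty_sold`.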

In [22]:
total_comics_sold.show(10)

+-----------+--------------------+-------------+------+
|account_num|         comic_title|sum(qty_sold)|bought|
+-----------+--------------------+-------------+------+
|      02247|Bubblegun VOL 2 (...|            1|     1|
|      00487|Captain Swing (Av...|            2|     1|
|      00029|God Is Dead (Avatar)|            7|     1|
|      01260| Providence (Avatar)|            1|     1|
|      00172|   Supergod (Avatar)|            3|     1|
|      02493|       Abbott (Boom)|            3|     1|
|      00052|Adventure Time Ma...|            6|     1|
|      00032|Big Trouble In Li...|           11|     1|
|      01149| Broken World (Boom)|            2|     1|
|      01489|Jim Henson Labyri...|            1|     1|
+-----------+--------------------+-------------+------+
only showing top 10 rows



In [23]:
cols = ['account_num', 'comic_title', 'bought']
total_comics_sold = total_comics_sold[cols]

In [24]:
print(total_comics_sold.count(), len(total_comics_sold.columns))

61871 3


### Formatting

So, I forgot that ALS needs numeric ids for both users and items. Let's fix that.

#### Convert `account_num` to an integer `account_id`
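
The implementation of `dfc.make_int` is not shown in this notebook. The sketch below is a hypothetical stand-in that assumes it simply parses the zero-padded account string; the `None` handling is also an assumption, added because Spark passes nulls to Python UDFs as `None`.

```python
def make_int(text):
    """Hypothetical stand-in for dfc.make_int: parse a zero-padded id string."""
    if text is None:
        return None
    return int(text)

print(make_int("02247"))  # 2247
print(make_int("00029"))  # 29
```

If that assumption holds, Spark's built-in `Column.cast('int')` on `account_num` would likely give the same result without a Python UDF.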

In [25]:
to_int_udf = F.udf(dfc.make_int, IntegerType())

In [26]:
account_num_col = total_comics_sold['account_num']

In [27]:
total_comics_sold = total_comics_sold.withColumn('account_id'
                                        ,to_int_udf(account_num_col))
total_comics_sold.persist()

DataFrame[account_num: string, comic_title: string, bought: int, account_id: int]

In [28]:
total_comics_sold.show(10)

+-----------+--------------------+------+----------+
|account_num|         comic_title|bought|account_id|
+-----------+--------------------+------+----------+
|      02247|Bubblegun VOL 2 (...|     1|      2247|
|      00487|Captain Swing (Av...|     1|       487|
|      00029|God Is Dead (Avatar)|     1|        29|
|      01260| Providence (Avatar)|     1|      1260|
|      00172|   Supergod (Avatar)|     1|       172|
|      02493|       Abbott (Boom)|     1|      2493|
|      00052|Adventure Time Ma...|     1|        52|
|      00032|Big Trouble In Li...|     1|        32|
|      01149| Broken World (Boom)|     1|      1149|
|      01489|Jim Henson Labyri...|     1|      1489|
+-----------+--------------------+------+----------+
only showing top 10 rows



In [29]:
print(total_comics_sold.count(), len(total_comics_sold.columns))

61871 4


Now I need a way to assign ids to `comic_title`. It's kind of clunky, but I do have the big table in PostgreSQL, so I can build an ID table there as the source of truth. I could do something on the PySpark side, but I think I would want to save the result somewhere (e.g. the DB) anyway, so I might as well do it from the top.
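
For illustration only, this is roughly what an in-memory version of such an ID table could look like: sort the distinct titles and number them from 1. The resulting ids are toy values and do not match the real `comic_id` values loaded below.

```python
# A few titles with a duplicate, as they might appear in the transactions.
titles = [
    "13 Coins (Other)",
    "100 Bullets Brother Lono (DC)",
    "13 Coins (Other)",
]

# Deduplicate, sort, and assign consecutive ids starting at 1.
comic_id_by_title = {
    title: comic_id
    for comic_id, title in enumerate(sorted(set(titles)), start=1)
}
print(comic_id_by_title)
```

On the Spark side, `StringIndexer` from `pyspark.ml.feature` is the usual tool for this kind of mapping, though the mapping would still need to be persisted somewhere to stay stable across runs.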

#### Get `comic_id`

In [30]:
comics = sql_context.read.json('raw_data/comics.json')
comics.persist()

DataFrame[comic_id: bigint, comic_title: string]

In [31]:
comics.count()

7202

In [32]:
comics.show(10)

+--------+--------------------+
|comic_id|         comic_title|
+--------+--------------------+
|       1|0Secret Wars (Mar...|
|       2|100 Bullets Broth...|
|       3|100 Penny Press L...|
|       4|100 Penny Press S...|
|       5|100 Penny Press T...|
|       6|100 Penny Press T...|
|       7|100th Anniversary...|
|       8|12 Reasons To Die...|
|       9|    13 Coins (Other)|
|      10|13th Artifact One...|
+--------+--------------------+
only showing top 10 rows



In [33]:
print(comics.count(), len(comics.columns))

7202 2


Now we need to join this back into `total_comics_sold`.
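
Conceptually, the join is a lookup of each title in the ID table, keeping only rows whose title is found (an inner join). A plain-Python sketch of that logic, using two title-to-id pairs that appear in this notebook plus one made-up unmatched title:

```python
# Title -> comic_id lookup; both pairs appear in this notebook's outputs.
comic_id_by_title = {
    "Captain Swing (Avatar)": 1102,
    "Afterlife With Archie (Archie)": 128,
}

# (account_id, comic_title, bought); the last title is hypothetical and has no id.
purchases = [
    (487, "Captain Swing (Avatar)", 1),
    (1033, "Afterlife With Archie (Archie)", 1),
    (29, "Unlisted Title (Other)", 1),
]

# Inner join: rows without a matching title are dropped.
joined = [
    (account_id, comic_id_by_title[title], bought)
    for account_id, title, bought in purchases
    if title in comic_id_by_title
]
print(joined)
```

Comparing row counts before and after the join, as done below, is a quick way to confirm that no titles went unmatched.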

In [34]:
# Set aliases
tot = total_comics_sold.alias('tot')
com = comics.alias('com')

In [35]:
tot_sold_ids_only = (
    tot.join(com.select('comic_id', 'comic_title'),
             tot.comic_title == com.comic_title)
       .select('account_id', 'comic_id', 'bought')
)
tot_sold_ids_only.persist()
tot_sold_ids_only.show(10)

+----------+--------+------+
|account_id|comic_id|bought|
+----------+--------+------+
|      2247|     995|     1|
|       487|    1102|     1|
|        29|    2680|     1|
|      1260|    4870|     1|
|       172|    6023|     1|
|      2493|      66|     1|
|        52|     116|     1|
|        32|     755|     1|
|      1149|     971|     1|
|      1489|    3503|     1|
+----------+--------+------+
only showing top 10 rows



In [36]:
tot_sold_ids_only.printSchema()

root
 |-- account_id: integer (nullable = true)
 |-- comic_id: long (nullable = true)
 |-- bought: integer (nullable = false)



In [37]:
print(tot_sold_ids_only.count(), len(tot_sold_ids_only.columns))

61871 3


## Create table with zeros

In [38]:
# Get all accounts
acct_ids = tot_sold_ids_only.select("account_id").distinct().persist()
acct_ids.show()

+----------+
|account_id|
+----------+
|       148|
|      2659|
|      1645|
|      2142|
|       833|
|      1088|
|      2866|
|       897|
|       243|
|      1896|
|      2811|
|      2235|
|      1025|
|      1395|
|      2563|
|      1507|
|      1522|
|      1460|
|      2393|
|      1352|
+----------+
only showing top 20 rows



In [39]:
# Get just comic_ids
comic_ids = comics.select("comic_id").distinct().persist()
comic_ids.show()

+--------+
|comic_id|
+--------+
|      26|
|      29|
|     474|
|     964|
|    1677|
|    1697|
|    1806|
|    1950|
|    2040|
|    2214|
|    2250|
|    2453|
|    2509|
|    2529|
|    2927|
|    3091|
|    3506|
|    3764|
|    4590|
|    4823|
+--------+
only showing top 20 rows



In [40]:
comic_ids.count()

7202

In [41]:
acct_ids.count()

1071

In [43]:
tot_sold_ids_only.show()

+----------+--------+------+
|account_id|comic_id|bought|
+----------+--------+------+
|      2247|     995|     1|
|       487|    1102|     1|
|        29|    2680|     1|
|      1260|    4870|     1|
|       172|    6023|     1|
|      2493|      66|     1|
|        52|     116|     1|
|        32|     755|     1|
|      1149|     971|     1|
|      1489|    3503|     1|
|      1234|    3979|     1|
|       560|    4221|     1|
|       130|    5497|     1|
|       742|    6717|     1|
|      1212|    5001|     1|
|      1287|    7018|     1|
|      1406|     136|     1|
|       291|     260|     1|
|        32|     525|     1|
|       742|     762|     1|
+----------+--------+------+
only showing top 20 rows



### Limit comic Ids to model

I think keeping comics with only a handful of sales will be a little noisy. And to a more pragmatic point, the fewer comics, the less resource-intensive it will be, because the matrix will be smaller.

Arbitrarily going to pick >= 20 sales for now. Each row of `tot_sold_ids_only` is one account-comic pair, so in practice this means comics bought by at least 20 accounts.
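
The filter amounts to counting distinct buyers per comic and keeping comics at or above a floor. A small plain-Python sketch of that logic, with a toy floor of 3 instead of 20; the helper name `comics_with_min_buyers` is made up, and the pairs are taken from outputs in this notebook.

```python
from collections import Counter

# (account_id, comic_id) pairs that appear in this notebook's outputs.
pairs = [(487, 1102), (47, 1102), (215, 1102), (2247, 995)]

def comics_with_min_buyers(pairs, min_buyers):
    """Return comic ids bought by at least `min_buyers` distinct accounts."""
    buyers_per_comic = Counter(comic_id for _, comic_id in set(pairs))
    return {
        comic_id
        for comic_id, n_buyers in buyers_per_comic.items()
        if n_buyers >= min_buyers
    }

kept = comics_with_min_buyers(pairs, min_buyers=3)
print(kept)
```

Here comic 1102 has three buyers and is kept, while comic 995 has one and is dropped.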

In [44]:
comic_ids = ( tot_sold_ids_only.groupBy("comic_id").count().
             filter(col('count') >= 20).select("comic_id")
            )

In [45]:
comic_ids.show()

+--------+
|comic_id|
+--------+
|    2250|
|    6721|
|    3009|
|    3015|
|    7130|
|    1010|
|    2173|
|    1840|
|    4551|
|    7121|
|    1642|
|     938|
|     243|
|     720|
|     278|
|    6287|
|     705|
|    2797|
|    1217|
|    1280|
+--------+
only showing top 20 rows



In [46]:
comic_ids.count()

790

#### Save to pandas

In [47]:
comic_ids_df = comic_ids.toPandas()

#### How many records are there after we limit to comics with a minimum number of sales?

In [48]:
trans_df.shape

(327839, 8)

In [49]:
comic_ids_df.shape

(790, 1)

In [50]:
trans_df.columns

Index(['account_num', 'comic_title', 'date_sold', 'item_id', 'publisher',
       'qty_sold', 'title_and_num', 'dt'],
      dtype='object')

Make a pandas df of `comics`

In [51]:
comics_df = comics.toPandas()

In [52]:
comics_df.head()

Unnamed: 0,comic_id,comic_title
0,1,0Secret Wars (Marvel)
1,2,100 Bullets Brother Lono (DC)
2,3,100 Penny Press Locke & Key (IDW)
3,4,100 Penny Press Star Trek (IDW)
4,5,100 Penny Press Thunder Agent (IDW)


In [53]:
comics_df_filtered = comics_df.merge(comic_ids_df, on="comic_id", how="inner")

In [54]:
comics_df_filtered.shape

(790, 2)

In [55]:
comics_df_filtered.head()

Unnamed: 0,comic_id,comic_title
0,7,100th Anniversary Special (Marvel)
1,59,8house Arclight (Image)
2,60,8house (Image)
3,68,Abe Sapien (Dark Horse)
4,69,Abe Sapien Devil Does Not Jes (Dark Horse)


In [56]:
comics_df_filtered.columns

Index(['comic_id', 'comic_title'], dtype='object')

In [57]:
tot_sold_ids_only.count()

61871

In [58]:
sold_ids_df = tot_sold_ids_only.toPandas()

In [59]:
sold_ids_df.head()

Unnamed: 0,account_id,comic_id,bought
0,2247,995,1
1,487,1102,1
2,29,2680,1
3,1260,4870,1
4,172,6023,1


In [60]:
sold_df_floored = sold_ids_df.merge(comics_df_filtered, on="comic_id",
                                    how="inner")

In [61]:
sold_df_floored.head()

Unnamed: 0,account_id,comic_id,bought,comic_title
0,487,1102,1,Captain Swing (Avatar)
1,47,1102,1,Captain Swing (Avatar)
2,215,1102,1,Captain Swing (Avatar)
3,628,1102,1,Captain Swing (Avatar)
4,39,1102,1,Captain Swing (Avatar)


#### How many account-comic combos are there after filtering?

In [62]:
sold_df_floored.shape[0]

33138

In [63]:
trans_floored = trans_df.merge(comics_df_filtered, on="comic_title",
                               how="inner")

In [64]:
trans_floored.head()

Unnamed: 0,account_num,comic_title,date_sold,item_id,publisher,qty_sold,title_and_num,dt,comic_id
0,1033,Afterlife With Archie (Archie),1390505789000,DCD630105,Archie Comics,1,Afterlife With Archie #1 2nd P,2014-01-23 19:36:29,128
1,1333,Afterlife With Archie (Archie),1398947834000,DCD630105,Archie Comics,1,Afterlife With Archie #1 2nd P,2014-05-01 12:37:14,128
2,946,Afterlife With Archie (Archie),1399904156000,DCD630105,Archie Comics,1,Afterlife With Archie #1 2nd P,2014-05-12 14:15:56,128
3,1278,Afterlife With Archie (Archie),1407954335000,DCD630105,Archie Comics,1,Afterlife With Archie #1 2nd P,2014-08-13 18:25:35,128
4,1212,Afterlife With Archie (Archie),1383751270000,DCD630105,Archie Comics,1,Afterlife With Archie #1 2nd P,2013-11-06 15:21:10,128


In [65]:
len(trans_floored['account_num'].unique())

1051

#### Q: How many transactions after all filters?
- Accounts with >= 5 transactions and <= 300 transactions
- Comics that have been bought by >= 20 accounts

In [66]:
trans_floored.shape[0]

242997

#### Q: Number of comic - account combos?

In [68]:
sold_df_floored.shape

(33138, 4)

#### Q: Number of unique accounts before filtering?

In [69]:
len(sold_ids_df['account_id'].unique())

1071

#### Q: Number of unique accounts after filtering?

In [70]:
len(sold_df_floored['account_id'].unique())

1051

In [74]:
# comic_ids_df.to_json('raw_data/comic_ids.json', orient='records'
#                      ,lines=True)

comic_ids_df.to_json('support_data/comic_ids.json', orient='records'
                     ,lines=True)

In [75]:
comic_ids_df.head()

Unnamed: 0,comic_id
0,2250
1,6721
2,3009
3,3015
4,7130


In [76]:
acct_ids.count()

1071

In [77]:
total_combos = comic_ids.count() * acct_ids.count()
total_combos

846090

In [81]:
# Join together
all_combos = comic_ids.crossJoin(acct_ids).persist()

all_combos.count()

846090

In [82]:
sold = tot_sold_ids_only.alias("sold")

In [83]:
tot_sold_ids_only.columns

['account_id', 'comic_id', 'bought']

In [100]:
final_combos = all_combos.join(sold, [sold.comic_id == all_combos.comic_id
                                      ,sold.account_id == all_combos.account_id], 
                              "left").select(all_combos.comic_id
                                             ,all_combos.account_id
                                             ,sold.bought).fillna(0).persist()

In [85]:
final_combos.show()

+--------+----------+------+
|comic_id|account_id|bought|
+--------+----------+------+
|    2250|       148|     0|
|    2250|      2659|     0|
|    2250|      1645|     0|
|    2250|      2142|     0|
|    2250|       833|     0|
|    2250|      1088|     0|
|    2250|      2866|     0|
|    2250|       897|     0|
|    2250|       243|     0|
|    2250|      1896|     0|
|    2250|      2811|     0|
|    2250|      2235|     0|
|    2250|      1025|     0|
|    2250|      1395|     0|
|    2250|      2563|     0|
|    2250|      1507|     0|
|    2250|      1522|     0|
|    2250|      1460|     0|
|    2250|      2393|     0|
|    2250|      1352|     0|
+--------+----------+------+
only showing top 20 rows



In [87]:
final_combos.count()

846090

We have about 850K potential `account`, `comic` combinations.
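
The table above was built in two steps: a cross join of every kept comic with every account, then a left join against the actual purchases with missing values filled with 0. A tiny plain-Python sketch of the same idea; the single purchase in `bought_pairs` is hypothetical.

```python
from itertools import product

comic_ids = [2250, 6721]
account_ids = [148, 2659, 1645]
bought_pairs = {(2250, 1645)}  # hypothetical purchase, for illustration

# Cross join, then mark each (comic, account) cell as 1 if bought, else 0.
final = [
    (comic_id, account_id, 1 if (comic_id, account_id) in bought_pairs else 0)
    for comic_id, account_id in product(comic_ids, account_ids)
]
print(len(final), sum(flag for _, _, flag in final))

# The same arithmetic at full scale: 790 comics x 1071 accounts.
print(790 * 1071)  # 846090
```

Every comic-account cell gets a row, so the row count is always the product of the two id counts.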

Let's take a look at the sparsity of the matrix.

In [88]:
sparse_numerator = sold.count()
sparse_denominator = final_combos.count()
sparsity = 1 - (sparse_numerator/sparse_denominator)
sparsity

0.9268742095994515

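So about 7.3% populated by this calculation. Not bad. Note, though, that `sold` still includes comics below the 20-sale floor, so the numerator overstates the filled cells: only the 33,138 account-comic pairs counted earlier survive the filter, which puts the populated share closer to 3.9%.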
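
The arithmetic can be checked in plain Python from counts reported in this notebook: 790 comics, 1,071 accounts, 61,871 account-comic pairs before the comic filter, and 33,138 after it. The second figure is an alternative numerator restricted to the comics actually kept in the matrix.

```python
total_cells = 790 * 1071   # comics x accounts = 846090
all_pairs = 61871          # account-comic pairs before the >= 20 floor
kept_pairs = 33138         # account-comic pairs after the floor

sparsity_all = 1 - all_pairs / total_cells
sparsity_kept = 1 - kept_pairs / total_cells

print(round(sparsity_all, 4))   # 0.9269
print(round(sparsity_kept, 4))  # 0.9608
```

Using only the pairs that survive the filter, roughly 3.9% of the cells are populated rather than 7.3%.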

In [89]:
#df2.coalesce(1).write.format('json').save('/path/file_name.json')
#final_combos.write.format('json').save('raw_data/als_input_filtered_190915.json')

## Save this intermediate table.

To save work, if needed.

In [98]:
!rm -r raw_data/als_input_filtered_190916.pkl

rm: raw_data/als_input_filtered_190916.pkl: No such file or directory


In [99]:
final_combos.rdd.saveAsPickleFile('raw_data/als_input_filtered_190916.pkl')

Test reconstituting the pickle

In [None]:
#pickleRdd = sc.pickleFile('raw_data/als_input_filtered_190915.pkl').collect()
#df2 = sql_context.createDataFrame(pickleRdd)

In [None]:
als_data = final_combos.toPandas()

In [None]:
als_data.to_json('raw_data/als_input_filtered_190915.json', orient='records'
                     ,lines=True)

Test loading the item-factors pickle

In [None]:
unpickled_items = pd.read_pickle('support_data/item_factors_20190916.pkl')

In [None]:
comics_df.head()

In [None]:
unpickled_items.head()

In [None]:
ddd = unpickled_items.merge(comics_df, left_on='id', right_on='comic_id', how="inner")

In [None]:
ddd.head(20)