# Comics Rx
## [A comic book recommendation system](https://github.com/MangrobanGit/comics_rx)
<img src="https://images.unsplash.com/photo-1514329926535-7f6dbfbfb114?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2850&q=80" width="400" align='left'>

---

# 5 - ALS Model - 'Pseudo' Deployment

This notebook explores and develops 'deploying' recommendations from a previously saved ALS model.

# Libraries

In [1]:
%matplotlib inline
%load_ext autoreload
# Mode 2 reloads all modules; mode 1 reloads only modules marked with %aimport
%autoreload 2
#%aimport data_fcns

import pandas as pd # dataframes
import os
import time
import numpy as np

# Data storage
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

# import necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
# from pyspark.sql.types import (StructType, StructField, IntegerType
#                                ,FloatType, LongType, StringType)
from pyspark.sql.types import *

import pyspark.sql.functions as F
from pyspark.sql.functions import col, explode, lit, isnan, when, count
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Custom
import lib.data_fcns as dfc
import lib.keys as keys  # Custom keys lib
import lib.comic_recs as cr

In [2]:
# instantiate SparkSession object
spark = pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate()
# spark = SparkSession.builder.master("local").getOrCreate()

## Retrieving Saved Model

In [3]:
comic_rec_model = ALSModel.load('als_filtered')

In [4]:
top_n_df = cr.get_top_n_recs_for_user(spark=spark, model=comic_rec_model, topn=50)
top_n_df

161


|    | comic_title |
|---:|:---|
| 1 | Criminal (Image) |
| 2 | Bitch Planet (Image) |
| 3 | Royal City (Image) |
| 4 | Black Widow (Marvel) |
| 5 | All New Hawkeye (Marvel) |
| 6 | Shipwreck (Other) |
| 7 | Sex Criminals (Image) |
| 8 | Neil Gaiman American Gods Sha (Dark Horse) |
| 9 | Spider-Gwen (Marvel) |
| 10 | Sweet Tooth (Vertigo) |


I'm testing on myself, and I'm pretty sure I've bought a few of the titles above. This could be a failure in how I aggregated on series, and there is some evidence of that failing. One example is *Gideon Falls*: there should be only one volume of it. Maybe it's the graphic novels? That shouldn't be an issue (no pun intended), though, because I believe the original dataset contains only individual comic books.

Let's test versus the original dataset!

#### Set aside some test series.

- Paper Girls (Image)
- Saga (Other)
- Fade Out (Image)

These I know **for sure** I've bought, if not subscribed.

## Set up connection to AWS RDS

In [157]:
# Define path to secret
secret_path_aws = os.path.join(os.environ['HOME'], '.secret', 
                           'aws_ps_flatiron.json')
secret_path_aws

'/Users/werlindo/.secret/aws_ps_flatiron.json'

In [158]:
aws_keys = keys.get_keys(secret_path_aws)
user = aws_keys['user']
ps = aws_keys['password']
host = aws_keys['host']
db = aws_keys['db_name']

aws_ps_engine = ('postgresql://' + user + ':' + ps + '@' + host + '/' + db)
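The connection string above is built by plain concatenation, which can break if the password contains characters such as `@` or `/`. A minimal sketch of percent-encoding the credentials with the standard library follows; the helper name `build_pg_url` and the sample credentials are hypothetical and not part of the project.

```python
from urllib.parse import quote


def build_pg_url(user, password, host, db_name, port=5432):
    """Build a PostgreSQL URL, percent-encoding the user and password."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{db_name}"


# Made-up credentials, for illustration only
example_url = build_pg_url("reader", "p@ss/word", "example-host", "comics")
print(example_url)
```

With the notebook's variables this would be called as `build_pg_url(user, ps, host, db)`.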

In [159]:
# Setup PSQL connection
conn = psql.connect(
    database=db,
    user=user,
    password=ps,
    host=host,
    port='5432'
)

In [160]:
# Instantiate cursor
cur = conn.cursor()

In [161]:
# Pull all transactions for the test account
query = """
    SELECT
       *
    FROM 
        comic_trans 
    WHERE
        account_num = '00161'
    ;
"""

In [162]:
conn.rollback()

In [163]:
# Execute the query
cur.execute(query)

In [164]:
# Check results
temp_df = pd.DataFrame(cur.fetchall())
temp_df.columns = [col.name for col in cur.description]

In [165]:
temp_df.head()

|    | index | publisher | item_id | title_and_num | qty_sold | date_sold | account_num | comic_title |
|---:|---:|:---|:---|:---|---:|:---|---:|:---|
| 0 | 33 | Archie Comics | DCD617897 | Afterlife With Archie #1 Franc | 1 | 2013-10-30 15:14:23 | 161 | Afterlife With Archie (Archie) |
| 1 | 43 | Archie Comics | DCD617564 | Afterlife With Archie #1 Reg C | 1 | 2013-10-30 15:14:23 | 161 | Afterlife With Archie (Archie) |
| 2 | 54 | Archie Comics | DCDL012758 | Afterlife With Archie #10 Cvr | 1 | 2016-09-04 11:08:02 | 161 | Afterlife With Archie (Archie) |
| 3 | 104 | Archie Comics | DCD622673 | Afterlife With Archie #3 Reg F | 1 | 2014-01-24 11:42:27 | 161 | Afterlife With Archie (Archie) |
| 4 | 124 | Archie Comics | DCD625043 | Afterlife With Archie #4 Reg F | 1 | 2014-03-12 18:12:06 | 161 | Afterlife With Archie (Archie) |


In [167]:
# Make a list of test comic_title
already_bought = ['Paper Girls (Image)', 'Saga (Other)', 'Fade Out (Image)',
                  'Sweet Tooth (Vertigo)']

In [173]:
temp_df.loc[temp_df['comic_title'].isin(already_bought), 'comic_title'].unique()

array(['Fade Out (Image)', 'Paper Girls (Image)', 'Saga (Other)'],
      dtype=object)

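OK, I already knew this was the case, but I wanted to confirm: the three test series show up in my purchase history, while *Sweet Tooth* (one of the recommendations) does not.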

## Don't repeat already bought comics

Let's filter out comics that were already bought. To support this, I want a `json` file to use as a pseudo-database for looking up existing account-to-comic relationships.

In [61]:
# Distinct account-comic pairs, i.e. what each account has already bought
query = """
    SELECT
        DISTINCT 
        CAST(account_num AS INT) as account_id
        ,c.comic_id
    FROM 
        comic_trans ct
        inner join comics c on ct.comic_title = c.comic_title
    ;
"""

In [62]:
# conn.rollback()

In [63]:
# Execute the query
cur.execute(query)

In [64]:
# Check results
temp_df = pd.DataFrame(cur.fetchall())
temp_df.columns = [col.name for col in cur.description]

In [65]:
temp_df.head()

|    | account_id | comic_id |
|---:|---:|---:|
| 0 | 2 | 198 |
| 1 | 2 | 223 |
| 2 | 2 | 224 |
| 3 | 2 | 312 |
| 4 | 2 | 392 |
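To see what the `DISTINCT` join above does without touching the real database, here is a small sketch that runs the same query shape against an in-memory SQLite stand-in. The rows are made up, the table layouts are reduced to the columns the query uses, and SQLite is only an assumption-level substitute for PostgreSQL here.

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL tables (made-up rows)
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE comic_trans (account_num TEXT, comic_title TEXT);
    CREATE TABLE comics (comic_id INTEGER, comic_title TEXT);
    INSERT INTO comics VALUES (1, 'Saga (Other)'), (2, 'Paper Girls (Image)');
    INSERT INTO comic_trans VALUES
        ('00161', 'Saga (Other)'),
        ('00161', 'Saga (Other)'),
        ('00161', 'Paper Girls (Image)'),
        ('00002', 'Saga (Other)');
""")

# Same shape as the notebook query: one row per (account, comic) pair,
# with the zero-padded account number cast to an integer
pairs = conn_demo.execute("""
    SELECT DISTINCT
        CAST(ct.account_num AS INT) AS account_id
        ,c.comic_id
    FROM comic_trans ct
        INNER JOIN comics c ON ct.comic_title = c.comic_title
    ORDER BY account_id, c.comic_id
""").fetchall()
conn_demo.close()

print(pairs)  # [(2, 1), (161, 1), (161, 2)]
```

The repeated Saga purchase for account `00161` collapses into a single pair, which is what the lookup file needs.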


In [66]:
temp_df.to_json('support_data/acct_comics.json', orient='records'
                     ,lines=True)

In [67]:
!head support_data/acct_comics.json

{"account_id":2,"comic_id":198}
{"account_id":2,"comic_id":223}
{"account_id":2,"comic_id":224}
{"account_id":2,"comic_id":312}
{"account_id":2,"comic_id":392}
{"account_id":2,"comic_id":455}
{"account_id":2,"comic_id":481}
{"account_id":2,"comic_id":482}
{"account_id":2,"comic_id":828}
{"account_id":2,"comic_id":841}
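The JSON-lines file above can also be read back without Spark. The following is a minimal sketch, assuming one record per line with `account_id` and `comic_id` keys as shown in the `head` output; the helper name `load_bought_lookup` is hypothetical.

```python
import json
from collections import defaultdict


def load_bought_lookup(lines):
    """Map account_id -> set of comic_ids from JSON-lines records."""
    bought = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        bought[rec["account_id"]].add(rec["comic_id"])
    return dict(bought)


# Two records copied from the file preview above
sample_lines = [
    '{"account_id":2,"comic_id":198}',
    '{"account_id":2,"comic_id":223}',
]
lookup = load_bought_lookup(sample_lines)
print(223 in lookup.get(2, set()))  # True
```

With the real file, an open file handle can be passed in directly, e.g. `with open('support_data/acct_comics.json') as f: lookup = load_bought_lookup(f)`.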


## Recommendation function 2.0

Let's pick the development of the recommendation function back up:

In [36]:
def get_top_n_recs_for_user(spark, model, topn=10):
    """
    Given requested n and ALS model, returns top n recommended comics
    """
    tgt_acct_id = input()

    # Create spark df manually
    a_schema = StructType([StructField("account_id", LongType())])

    # Init lists
    tgt_list = []
    acct_list = []
    tgt_list.append(int(tgt_acct_id))
    acct_list.append(tgt_list)

    # Create one-row spark df
    tgt_accts = spark.createDataFrame(acct_list, schema=a_schema) 

    # Get recommendations for user
    userSubsetRecs = model.recommendForUserSubset(tgt_accts, topn)
    userSubsetRecs.persist()

    # Flatten the recs list
    top_n_exploded = (userSubsetRecs.withColumn('tmp',explode('recommendations'))
            .select('account_id', col("tmp.comic_id"), col("tmp.rating")))
    top_n_exploded.persist()

    # Get comics titles
    comics = spark.read.json('raw_data/comics.json')
    comics.persist()
    
    # shorten with alias
    top_n = top_n_exploded.alias('topn')
    com = comics.alias('com')

    # Clean up the spark df to list of titles
    top_n_titles = (top_n.join(com.select('comic_id','comic_title')
                          ,top_n.comic_id==com.comic_id)
                 .select('comic_title'))
    top_n_titles.persist()

    # Cast to pandas df and return it
    top_n_df = top_n_titles.select('*').toPandas()
    top_n_df.index += 1
    
    return top_n_df

In [142]:
top_n_req = 10

In [143]:
top_n_df = get_top_n_recs_for_user(spark=spark, model=comic_rec_model, topn=top_n_req)
top_n_df

161


|    | comic_title |
|---:|:---|
| 1 | Criminal (Image) |
| 2 | Bitch Planet (Image) |
| 3 | Royal City (Image) |
| 4 | Black Widow (Marvel) |
| 5 | All New Hawkeye (Marvel) |
| 6 | Shipwreck (Other) |
| 7 | Sex Criminals (Image) |
| 8 | Neil Gaiman American Gods Sha (Dark Horse) |
| 9 | Spider-Gwen (Marvel) |
| 10 | Sweet Tooth (Vertigo) |


In [38]:
tgt_acct_id = input()

161


In [39]:
tgt_acct_id

'161'
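`input()` returns a string, while the model's `account_id` column is a `LongType` and the database stores zero-padded values such as `'00161'`. A small sketch of validating and converting the typed value before it reaches Spark follows; `parse_account_id` is a hypothetical helper, not something in `lib.comic_recs`.

```python
def parse_account_id(raw):
    """Convert user-typed text such as '161' or '00161' to an int, else None."""
    text = raw.strip()
    # Accept plain ASCII digits only; anything else is treated as invalid
    if not (text.isascii() and text.isdigit()):
        return None
    return int(text)


print(parse_account_id("00161"))  # 161
print(parse_account_id("abc"))    # None
```

Returning `None` for bad input lets the caller decide whether to re-prompt or stop, instead of failing inside `int()`.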

In [40]:
# Create spark df manually
a_schema = StructType([StructField("account_id", LongType())])

# Init lists
tgt_list = []
acct_list = []
tgt_list.append(int(tgt_acct_id))
acct_list.append(tgt_list)

# Create one-row spark df
tgt_accts = spark.createDataFrame(acct_list, schema=a_schema) 
tgt_accts.show()

+----------+
|account_id|
+----------+
|       161|
+----------+



In [144]:
# Get recommendations for user
userSubsetRecs = comic_rec_model.recommendForUserSubset(tgt_accts, 3 * top_n_req)
userSubsetRecs.persist()
userSubsetRecs.show()

+----------+--------------------+
|account_id|     recommendations|
+----------+--------------------+
|       161|[[1401, 1.3040708...|
+----------+--------------------+



In [146]:
# Flatten the recs list
top_n_exploded = (userSubsetRecs.withColumn('tmp',explode('recommendations'))
        .select('account_id', col("tmp.comic_id"), col("tmp.rating")))
top_n_exploded.persist()
top_n_exploded.show(top_n_req*3)

+----------+--------+---------+
|account_id|comic_id|   rating|
+----------+--------+---------+
|       161|    1401|1.3040708|
|       161|     773|1.2929938|
|       161|    5177|1.2767146|
|       161|     843|1.2509787|
|       161|     184|1.2063307|
|       161|    5429|1.1961331|
|       161|    5348| 1.163892|
|       161|    4441|1.1624106|
|       161|    5638|1.1442554|
|       161|    6105|1.1323442|
|       161|    4701|1.1222775|
|       161|    5199| 1.121197|
|       161|    2173|1.1172707|
|       161|     821|1.0972763|
|       161|    4677|1.0927831|
|       161|     542|1.0840032|
|       161|    6254|1.0827938|
|       161|    1759|1.0721922|
|       161|    3700|1.0625639|
|       161|     700|1.0597908|
|       161|    4248| 1.054276|
|       161|    2677|1.0526923|
|       161|    1181|1.0446754|
|       161|    3882|1.0403907|
|       161|    2286| 1.032107|
|       161|    4523|1.0250592|
|       161|    3123|1.0216975|
|       161|    1099|1.0135539|
(output truncated)

In [147]:
# Get comics titles
comics = spark.read.json('support_data/comics.json')
comics.persist()
comics.show(10)

+--------+--------------------+
|comic_id|         comic_title|
+--------+--------------------+
|       1|0Secret Wars (Mar...|
|       2|100 Bullets Broth...|
|       3|100 Penny Press L...|
|       4|100 Penny Press S...|
|       5|100 Penny Press T...|
|       6|100 Penny Press T...|
|       7|100th Anniversary...|
|       8|12 Reasons To Die...|
|       9|    13 Coins (Other)|
|      10|13th Artifact One...|
+--------+--------------------+
only showing top 10 rows



In [148]:
# Get account to comics xwalk
acct_comics = spark.read.json('support_data/acct_comics.json')
# acct_comics = acct_comics.select('account_id')
acct_comics = (
                acct_comics.withColumnRenamed('account_id','acct_id')
                .withColumnRenamed('comic_id', 'cmc_id')
              )
acct_comics.persist()
acct_comics.show(10)

+-------+------+
|acct_id|cmc_id|
+-------+------+
|      2|   198|
|      2|   223|
|      2|   224|
|      2|   312|
|      2|   392|
|      2|   455|
|      2|   481|
|      2|   482|
|      2|   828|
|      2|   841|
+-------+------+
only showing top 10 rows



In [150]:
# shorten with alias
top_n = top_n_exploded.alias('topn')
com = comics.alias('com')
ac = acct_comics.alias('ac')

In [140]:
# Recs NOT already bought: left join to the bought list, keep rows with no match
recs_prev_bought = (
                    top_n.join(ac, [top_n.account_id==ac.acct_id,
                                    top_n.comic_id==ac.cmc_id], 'left')
                    .filter('ac.acct_id is null') 
                    .select('account_id', 'comic_id')
)
recs_prev_bought.persist()
recs_prev_bought.show()

+----------+--------+
|account_id|comic_id|
+----------+--------+
|       161|     773|
|       161|     843|
|       161|     184|
|       161|    5429|
|       161|    5348|
|       161|    4441|
|       161|    5638|
|       161|    6105|
|       161|     821|
|       161|    4677|
|       161|    6254|
|       161|    1759|
|       161|     700|
+----------+--------+



In [178]:
# Clean up the spark df to list of titles
top_n_titles = (
                top_n.join(com.select('comic_id','comic_title')
                          ,top_n.comic_id==com.comic_id, "left")
                     .join(ac, [top_n.account_id==ac.acct_id,
                                    top_n.comic_id==ac.cmc_id], 'left')
                     .filter('ac.acct_id is null') 
                     .select('comic_title')
               )
top_n_titles.persist()
top_n_titles.show(top_n_req)


+--------------------+
|         comic_title|
+--------------------+
|Bitch Planet (Image)|
|Black Widow (Marvel)|
|All New Hawkeye (...|
|   Shipwreck (Other)|
|Sex Criminals (Im...|
|Neil Gaiman Ameri...|
|Spider-Gwen (Marvel)|
|Sweet Tooth (Vert...|
|Black Monday Murd...|
|Outcast By Kirkma...|
+--------------------+
only showing top 10 rows



In [179]:
# Cast to pandas df and return it
top_n_df = top_n_titles.select('*').toPandas()
top_n_df = top_n_df.head(top_n_req)
top_n_df.index += 1

In [180]:
top_n_df

|    | comic_title |
|---:|:---|
| 1 | Bitch Planet (Image) |
| 2 | Black Widow (Marvel) |
| 3 | All New Hawkeye (Marvel) |
| 4 | Shipwreck (Other) |
| 5 | Sex Criminals (Image) |
| 6 | Neil Gaiman American Gods Sha (Dark Horse) |
| 7 | Spider-Gwen (Marvel) |
| 8 | Sweet Tooth (Vertigo) |
| 9 | Black Monday Murders (Image) |
| 10 | Outcast By Kirkman & Azaceta (Image) |


Ok this seems like it could work. Let's roll it into the new function.
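The idea behind the steps above is to over-fetch recommendations, drop the ones already bought, and keep the first `topn` that remain. Here is a plain-Python sketch of that logic, independent of Spark. The ids come from the exploded output earlier; treating 1401 and 5177 as already bought is inferred from their absence in the filtered output, and `filter_new_recs` is a hypothetical helper.

```python
def filter_new_recs(ranked_comic_ids, bought_ids, topn=10):
    """Keep the first `topn` recommended ids that are not in `bought_ids`."""
    bought = set(bought_ids)
    # The list comprehension preserves the incoming rank order
    new_recs = [cid for cid in ranked_comic_ids if cid not in bought]
    return new_recs[:topn]


ranked = [1401, 773, 5177, 843, 184]  # highest predicted rating first
already_bought_ids = {1401, 5177}
print(filter_new_recs(ranked, already_bought_ids, topn=3))  # [773, 843, 184]
```

One caveat worth checking in the Spark version: row order after a join is not guaranteed unless the result is sorted explicitly, so ordering by `rating` before taking the head may be safer.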

In [181]:
def get_top_n_new_recs(spark, model, topn=10):
    """
    Given requested n and ALS model, returns top n recommended comics
    that the account has not already bought
    """
    
    # Multiplicative buffer
    # Request buffer x topn recs, because previously bought comics get screened out
    buffer = 3
    
    # Get account number from user
    tgt_acct_id = input()

    # To 'save' the account number, will put it into a spark dataframe
    # Create spark df manually
    a_schema = StructType([StructField("account_id", LongType())])

    # Init lists
    tgt_list = []
    acct_list = []
    tgt_list.append(int(tgt_acct_id))
    acct_list.append(tgt_list)

    # Create one-row spark df
    tgt_accts = spark.createDataFrame(acct_list, schema=a_schema) 
    
    # Get recommendations for user
    userSubsetRecs = model.recommendForUserSubset(tgt_accts, (topn*buffer))
    userSubsetRecs.persist()

    # Flatten the recs list
    top_n_exploded = (userSubsetRecs.withColumn('tmp',explode('recommendations'))
            .select('account_id', col("tmp.comic_id"), col("tmp.rating")))
    top_n_exploded.persist()

    # Get comics titles
    comics = spark.read.json('raw_data/comics.json')
    comics.persist()
    
    # Get account-comics summary (already bought)
    acct_comics = spark.read.json('support_data/acct_comics.json')
    acct_comics = (
                    acct_comics.withColumnRenamed('account_id','acct_id')
                    .withColumnRenamed('comic_id', 'cmc_id')
                  )
    acct_comics.persist()

    # shorten with alias
    top_n = top_n_exploded.alias('topn')
    com = comics.alias('com')
    ac = acct_comics.alias('ac')
    

    # Clean up the spark df to list of titles, and only include those
    # that are NOT on the bought list
    top_n_titles = (
                    top_n.join(com.select('comic_id','comic_title')
                              ,top_n.comic_id==com.comic_id, "left")
                         .join(ac, [top_n.account_id==ac.acct_id,
                                        top_n.comic_id==ac.cmc_id], 'left')
                         .filter('ac.acct_id is null') 
                         .select('comic_title')
                   )
    top_n_titles.persist()
    top_n_titles.show(topn)

    # Cast to pandas df and return it
    top_n_df = top_n_titles.select('*').toPandas()
    top_n_df = top_n_df.head(topn)
    top_n_df.index += 1
    
    return top_n_df

Let's test it!

**Version 1:**

In [10]:
top_n_req = 10

In [6]:
top_n_df = cr.get_top_n_recs_for_user(spark=spark, model=comic_rec_model
                                   , topn=top_n_req)
top_n_df

161


|    | comic_title |
|---:|:---|
| 1 | Criminal (Image) |
| 2 | Bitch Planet (Image) |
| 3 | Royal City (Image) |
| 4 | Black Widow (Marvel) |
| 5 | All New Hawkeye (Marvel) |
| 6 | Shipwreck (Other) |
| 7 | Sex Criminals (Image) |
| 8 | Neil Gaiman American Gods Sha (Dark Horse) |
| 9 | Spider-Gwen (Marvel) |
| 10 | Sweet Tooth (Vertigo) |


**Version 2:**

In [8]:
top_n_df = cr.get_top_n_new_recs(spark=spark, model=comic_rec_model
                                 , topn=top_n_req)
top_n_df

161


|    | comic_title |
|---:|:---|
| 1 | Bitch Planet (Image) |
| 2 | Black Widow (Marvel) |
| 3 | All New Hawkeye (Marvel) |
| 4 | Shipwreck (Other) |
| 5 | Sex Criminals (Image) |
| 6 | Neil Gaiman American Gods Sha (Dark Horse) |
| 7 | Spider-Gwen (Marvel) |
| 8 | Sweet Tooth (Vertigo) |
| 9 | Black Monday Murders (Image) |
| 10 | Outcast By Kirkman & Azaceta (Image) |


# YES

How about someone new?

In [11]:
newbie_df = cr.get_top_n_new_recs(spark=spark, model=comic_rec_model
                                 , topn=top_n_req)
newbie_df

9999


|    | comic_title |
|---:|:---|


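Nope, nothing comes back. But that's what we expected: account 9999 was not in the training data, so the model has no user factors to score it with.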
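One possible way to handle an account that gets no rows back is to fall back to the most widely bought comics. The sketch below is an assumption about how that could look, not something the notebook implements; both function names are hypothetical, and `demo_pairs` is made-up data in the same `(account_id, comic_id)` shape as the lookup file.

```python
from collections import Counter


def most_popular_comics(acct_comic_pairs, topn=10):
    """Rank comic_ids by how many distinct accounts bought them."""
    counts = Counter(comic_id for _, comic_id in set(acct_comic_pairs))
    # Sort by count descending, then comic_id ascending for a stable result
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [comic_id for comic_id, _ in ranked[:topn]]


def recs_with_fallback(model_recs, acct_comic_pairs, topn=10):
    """Use model recommendations when present, else fall back to popularity."""
    if model_recs:
        return list(model_recs)[:topn]
    return most_popular_comics(acct_comic_pairs, topn)


demo_pairs = [(2, 198), (2, 223), (161, 198), (7, 198), (7, 223), (7, 312)]
print(recs_with_fallback([], demo_pairs, topn=2))  # [198, 223]
```

Popularity is only one choice of fallback; it keeps the function from returning an empty frame for new accounts.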