#### NOTE: Should we depreciate? I don't think it's relevant anymore with the prefiltered data

## File 02 - Month to Month SQL Comparison

In this file, we look into comparing users and their characteristics over two months.
NOTE: This is currently October 2019 and November 2019. In case we want to use this for anything in the future, we should change the months to unify with the months we are actually using. 

### Set up Spark session and data schema

We can specify more options in the SparkSession creator, but currently the options are at the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

schema = "`event_time` TIMESTAMP,`event_type` STRING,`product_id` INT,`category_id` BIGINT,`category_code` STRING,`brand` STRING,`price` FLOAT,`user_id` INT,`user_session` STRING"
ddl_schema = T._parse_datatype_string(schema)

CPU times: user 216 ms, sys: 145 ms, total: 361 ms
Wall time: 5.22 s


### Read in dataframes for two months

In [2]:
%%time
df10 = spark.read.option("header","true") \
        .schema(ddl_schema) \
        .csv("./processed_data/month_01_filtered.parquet")

df11 = spark.read.option("header","true") \
        .schema(ddl_schema) \
        .csv("./processed_data/month_02_filtered.parquet")

CPU times: user 4.43 ms, sys: 1.56 ms, total: 5.99 ms
Wall time: 1.75 s


### Limit number of records in dataframes

We can limit each dataframe to a smaller subset. Notably, the dataframe is arranged by time, so this is how the subset will be biased.

In [3]:
# df10=df10.limit(10000)
df10.createOrReplaceTempView("r10")

# df11=df11.limit(10000)
df11.createOrReplaceTempView("r11")

### See how many users are the same

##### Small sample

- Over the first 100,000 records from each month: 340 are the same users
- Over the first 1,000,000 records from each month: 11,891 are the same users

##### Full dataset

- Over all records from each month: 1,401,758 are the same users
- This is out of 3,022,290 users in October and 3,696,117 users in November
- So about 2/5 to 1/2 of users are the same from month to month

In [4]:
%%time
spark.sql("SELECT DISTINCT r10.user_id FROM r10 INNER JOIN r11 on r10.user_id=r11.user_id").count()

CPU times: user 2.06 ms, sys: 1.16 ms, total: 3.22 ms
Wall time: 4.47 s


0

In [5]:
%%time
spark.sql("SELECT DISTINCT r10.user_id FROM r10").count()

CPU times: user 1.21 ms, sys: 1.21 ms, total: 2.43 ms
Wall time: 1.29 s


1

In [6]:
%%time
spark.sql("SELECT DISTINCT r11.user_id FROM r11").count()

CPU times: user 1.3 ms, sys: 1.3 ms, total: 2.59 ms
Wall time: 1.13 s


1

### See how many users made purchases in both months

##### Full dataset

- Over all records from each month: 91,286 are the same purchasers
- This is out of 347,118 purchasers in October and 441,638 purchasers in November
- This is out of 3,022,290 users in October and 3,696,117 users in November
- So about 20%-24% of purchasers are the same from month to month
- And about 2%-3% of purchasing users are the same from month to month

This means that given a set of purchasing and non-purchasing users, we want to predict:
- (a) which purchasers in October do and do not go on to purchase again and
- (b) which non-purchasers in October do and do not go on to purchase

In [7]:
%%time
spark.sql("""SELECT DISTINCT r10.user_id FROM r10 INNER JOIN r11 on r10.user_id=r11.user_id WHERE r10.event_type="purchase" and r11.event_type="purchase" """).count()

CPU times: user 2.04 ms, sys: 1.03 ms, total: 3.07 ms
Wall time: 2.08 s


0

In [8]:
%%time
spark.sql("""SELECT DISTINCT r10.user_id FROM r10 WHERE r10.event_type="purchase" """).count()

CPU times: user 2.79 ms, sys: 0 ns, total: 2.79 ms
Wall time: 958 ms


0

In [9]:
%%time
spark.sql("""SELECT DISTINCT r11.user_id FROM r11 WHERE r11.event_type="purchase" """).count()

CPU times: user 1.95 ms, sys: 381 µs, total: 2.33 ms
Wall time: 683 ms


0

### See some similar user behavior

Let's look at the similarity of products purchased between users in each month. Takes about 1m30s to run.

We can see that many products purchased in Month 10 are in the same category as products purchased in Month 11. Lots of nulls tend to clog up the dataset, however.

In [10]:
%%time
spark.sql("""SELECT uid, "10" AS month, category_code, event_type FROM (
             SELECT DISTINCT r10.user_id AS uid FROM r10 INNER JOIN r11 ON r10.user_id=r11.user_id WHERE r10.event_type="purchase" and r11.event_type="purchase"
              ) LEFT JOIN r10 ON uid=r10.user_id WHERE r10.event_type="purchase"
              
              UNION ALL
              
              SELECT uid, "11" AS month, category_code, event_type FROM (
              SELECT DISTINCT r10.user_id AS uid FROM r10 INNER JOIN r11 ON r10.user_id=r11.user_id WHERE r10.event_type="purchase" and r11.event_type="purchase"
              ) LEFT JOIN r11 ON uid=r11.user_id WHERE r11.event_type="purchase"
              
              ORDER BY uid, month ASC
              
              """).show(1000,False)
# spark.sql("SELECT DISTINCT r10.user_id FROM r10 INNER JOIN r11 on r10.user_id=r11.user_id").count()

+---+-----+-------------+----------+
|uid|month|category_code|event_type|
+---+-----+-------------+----------+
+---+-----+-------------+----------+

CPU times: user 142 µs, sys: 3.2 ms, total: 3.35 ms
Wall time: 3.45 s


In [54]:
# %%time

# spark.sql("DROP TABLE IF EXISTS r_all")
# spark.sql("CREATE TABLE r_all LIKE r10").count()
# spark.sql("INSERT INTO r_all TABLE r10")
# spark.sql("INSERT INTO r_all TABLE r11")
# spark.sql("SELECT * FROM r_all").count()

CPU times: user 1.51 ms, sys: 1.61 ms, total: 3.12 ms
Wall time: 3.83 s


20000