## File 04 - Month to Month SQL Comparison
##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd

In this file, we compare users and their characteristics across two months: January 2020 (month 1) and February 2020 (month 2). The goal is to predict spend in month 2 from characteristics observed in month 1.

### Set up Spark session and data schema

More options can be passed to the `SparkSession` builder; for now we keep the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

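# Explicit schema as a DDL string, so Spark does not have to infer column types from the CSVs
schema = "`event_time` TIMESTAMP,`event_type` STRING,`product_id` INT,`category_id` BIGINT,`category_code` STRING,`brand` STRING,`price` FLOAT,`user_id` INT,`user_session` STRING"
# _parse_datatype_string is a private PySpark helper; spark.read.schema() also accepts the DDL string directly
ddl_schema = T._parse_datatype_string(schema)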

CPU times: user 168 ms, sys: 178 ms, total: 346 ms
Wall time: 4.24 s


### Read in dataframes for two months

In [2]:
%%time
df1 = spark.read.schema(ddl_schema).csv("/project/ds5559/group12/raw_data/2020-01.csv")
df2 = spark.read.schema(ddl_schema).csv("/project/ds5559/group12/raw_data/2020-02.csv")

CPU times: user 3.49 ms, sys: 99 µs, total: 3.59 ms
Wall time: 1.24 s


### Limit number of records in dataframes

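We can limit each dataframe to a smaller subset for faster experimentation. Note that the data is ordered by time, so a subset taken with `limit` is biased toward the earliest events in each month. The `limit` calls are commented out below, so the rest of this file uses the full dataset.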

In [3]:
# df1=df1.limit(10000)
df1.createOrReplaceTempView("r1")

# df2=df2.limit(10000)
df2.createOrReplaceTempView("r2")
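
To illustrate the bias described above, here is a minimal sketch in plain Python on made-up toy data (not the project data). It assumes that a `limit`-style subset returns the first rows of the time-ordered file, as the text describes. The `events` list and the subset size `N` are hypothetical. In Spark, `DataFrame.sample` would be the analogous way to draw a subset spread over the whole month.

```python
import random

# Toy stand-in for a time-ordered month: 100 events on each of 31 days.
events = [day for day in range(1, 32) for _ in range(100)]
N = 310  # hypothetical subset size (10% of the toy data)

# "limit"-style subset: the first N rows of the time-ordered data.
head_subset = events[:N]

# Random subset of the same size, with a fixed seed for reproducibility.
rng = random.Random(0)
random_subset = rng.sample(events, N)

head_days = sorted(set(head_subset))
random_days = sorted(set(random_subset))

print("days covered by first-N subset:", head_days)
print("number of days covered by random subset:", len(random_days))
```

The first-N subset only sees the opening days of the month, while the random subset touches most of the month.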

### See how many users are the same

##### Full dataset

- Over all interactions from each month: 1,702,723 users appear in both months
- This is out of 4,385,986 users in month 1 and 4,233,207 users in month 2
- So about 39% of month 1 users and 40% of month 2 users appear in both months

In [4]:
%%time
spark.sql("SELECT DISTINCT r1.user_id FROM r1 INNER JOIN r2 on r1.user_id=r2.user_id").count()

CPU times: user 12.3 ms, sys: 15.6 ms, total: 28 ms
Wall time: 2min 40s


1702723
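
The join above pairs every month 1 event of a user with every month 2 event of that user before `DISTINCT` collapses the result, which is probably why it takes far longer than the two single-table counts. One possible alternative is `INTERSECT`, which compares only the sets of user ids. The sketch below checks on toy data that both queries return the same users. It uses Python's built-in `sqlite3` instead of Spark so that it runs anywhere, and the toy tables are made up. The same `INTERSECT` text should be accepted by `spark.sql`, but it has not been timed on the full dataset here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r1 (user_id INTEGER, event_type TEXT)")
conn.execute("CREATE TABLE r2 (user_id INTEGER, event_type TEXT)")

# Toy events: users 1 and 2 appear in both months, 3 only in month 1, 4 only in month 2.
conn.executemany("INSERT INTO r1 VALUES (?, ?)",
                 [(1, "view"), (1, "cart"), (1, "purchase"), (2, "view"), (3, "view")])
conn.executemany("INSERT INTO r2 VALUES (?, ?)",
                 [(1, "view"), (1, "purchase"), (2, "view"), (2, "view"), (4, "view")])

join_sql = "SELECT DISTINCT r1.user_id FROM r1 INNER JOIN r2 ON r1.user_id = r2.user_id"
intersect_sql = "SELECT user_id FROM r1 INTERSECT SELECT user_id FROM r2"

# Rows the join produces before DISTINCT: one per (month 1 event, month 2 event) pair.
pairs_before_distinct = conn.execute(
    "SELECT COUNT(*) FROM r1 INNER JOIN r2 ON r1.user_id = r2.user_id").fetchone()[0]

join_users = sorted(row[0] for row in conn.execute(join_sql))
intersect_users = sorted(row[0] for row in conn.execute(intersect_sql))

print("joined rows before DISTINCT:", pairs_before_distinct)
print("users from join:", join_users)
print("users from INTERSECT:", intersect_users)
```

Even on this tiny example the join materialises more rows than there are events in either table, and that growth is per user on the real data.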

In [5]:
%%time
spark.sql("SELECT DISTINCT r1.user_id FROM r1").count()

CPU times: user 0 ns, sys: 27.8 ms, total: 27.8 ms
Wall time: 15.3 s


4385986

In [6]:
%%time
spark.sql("SELECT DISTINCT r2.user_id FROM r2").count()

CPU times: user 0 ns, sys: 33.8 ms, total: 33.8 ms
Wall time: 14.6 s


4233207

### See how many users made purchases in both months

##### Full dataset

- Over all interactions from each month: 93,209 users made purchases in both months
- This is out of 359,105 purchasers in month 1 and 392,356 purchasers in month 2
- This is out of 4,385,986 users in month 1 and 4,233,207 users in month 2
- So about 26% of month 1 purchasers and 24% of month 2 purchasers bought in both months
- And about 2% of all users purchase in both months
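
The percentages in this file can be reproduced from the query counts with a few lines of plain Python. This is only a sketch of the arithmetic: the helper name `pct` is our own, and the numbers are the counts returned by the queries in this notebook.

```python
# Counts returned by the queries in this notebook.
users = {"month 1": 4_385_986, "month 2": 4_233_207}
purchasers = {"month 1": 359_105, "month 2": 392_356}
shared_users = 1_702_723
shared_purchasers = 93_209

def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

for month in ("month 1", "month 2"):
    print(month,
          "| users seen in both months:", pct(shared_users, users[month]), "%",
          "| purchasers buying in both months:", pct(shared_purchasers, purchasers[month]), "%",
          "| both-month purchasers among all users:", pct(shared_purchasers, users[month]), "%")
```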


In [7]:
%%time
spark.sql("""SELECT DISTINCT r1.user_id FROM r1 INNER JOIN r2 on r1.user_id=r2.user_id WHERE r1.event_type="purchase" and r2.event_type="purchase" """).count()

CPU times: user 1.05 ms, sys: 33.4 ms, total: 34.5 ms
Wall time: 24.3 s


93209

In [8]:
%%time
spark.sql("""SELECT DISTINCT r1.user_id FROM r1 WHERE r1.event_type="purchase" """).count()

CPU times: user 537 µs, sys: 34.6 ms, total: 35.2 ms
Wall time: 11.8 s


359105

In [9]:
%%time
spark.sql("""SELECT DISTINCT r2.user_id FROM r2 WHERE r2.event_type="purchase" """).count()

CPU times: user 0 ns, sys: 34.3 ms, total: 34.3 ms
Wall time: 11.7 s


392356

### See some similar user behavior

Let's look at the similarity of products purchased by the same users in each month. The query takes on the order of a minute to run (about 47 s in the run recorded below).

We can see that many products purchased in month 1 are in the same category as products purchased in month 2. Nulls in `category_code` tend to clutter the dataset, however.

In [12]:
%%time
spark.sql("""SELECT uid, "1" AS month, category_code, event_type FROM (
             SELECT DISTINCT r1.user_id AS uid FROM r1 INNER JOIN r2 ON r1.user_id=r2.user_id WHERE r1.event_type="purchase" and r2.event_type="purchase"
              ) LEFT JOIN r1 ON uid=r1.user_id WHERE r1.event_type="purchase"
              
              UNION ALL
              
              SELECT uid, "2" AS month, category_code, event_type FROM (
              SELECT DISTINCT r1.user_id AS uid FROM r1 INNER JOIN r2 ON r1.user_id=r2.user_id WHERE r1.event_type="purchase" and r2.event_type="purchase"
              ) LEFT JOIN r2 ON uid=r2.user_id WHERE r2.event_type="purchase"
              
              ORDER BY uid, month ASC
              
              """).show(10,False)
# spark.sql("SELECT DISTINCT r1.user_id FROM r1 INNER JOIN r2 on r1.user_id=r2.user_id").count()

+---------+-----+-----------------------------+----------+
|uid      |month|category_code                |event_type|
+---------+-----+-----------------------------+----------+
|378879891|1    |construction.tools.light     |purchase  |
|378879891|1    |sport.bicycle                |purchase  |
|378879891|2    |furniture.kitchen.table      |purchase  |
|378879891|2    |apparel.scarf                |purchase  |
|378879891|2    |furniture.living_room.cabinet|purchase  |
|383787337|1    |construction.tools.light     |purchase  |
|383787337|2    |construction.tools.light     |purchase  |
|393237889|1    |construction.tools.light     |purchase  |
|393237889|2    |construction.tools.light     |purchase  |
|404851685|1    |electronics.video.tv         |purchase  |
+---------+-----+-----------------------------+----------+
only showing top 10 rows

CPU times: user 3.09 ms, sys: 30.1 ms, total: 33.2 ms
Wall time: 47.1 s
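
One way to quantify the "same category in both months" observation is to compare each user's set of purchased categories across the two months. The sketch below does this in plain Python for the ten rows shown above only, so it says nothing about the full dataset. The names `cats`, `shared`, and `repeaters` are our own, and null categories are skipped as an assumption about how they should be handled. User 404851685 has only a month 1 row here because the output is truncated to ten rows.

```python
from collections import defaultdict

# The ten rows shown in the output above: (uid, month, category_code).
rows = [
    (378879891, "1", "construction.tools.light"),
    (378879891, "1", "sport.bicycle"),
    (378879891, "2", "furniture.kitchen.table"),
    (378879891, "2", "apparel.scarf"),
    (378879891, "2", "furniture.living_room.cabinet"),
    (383787337, "1", "construction.tools.light"),
    (383787337, "2", "construction.tools.light"),
    (393237889, "1", "construction.tools.light"),
    (393237889, "2", "construction.tools.light"),
    (404851685, "1", "electronics.video.tv"),
]

# Collect each user's purchased categories per month, skipping null categories.
cats = defaultdict(lambda: {"1": set(), "2": set()})
for uid, month, category in rows:
    if category is not None:
        cats[uid][month].add(category)

# Categories a user bought in both months.
shared = {uid: sorted(months["1"] & months["2"]) for uid, months in cats.items()}

# Users with at least one category purchased in both months.
repeaters = sorted(uid for uid, common in shared.items() if common)

print("shared categories per user:", shared)
print("users repeating a category:", repeaters)
```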


In [11]:
# %%time

# spark.sql("DROP TABLE IF EXISTS r_all")
# spark.sql("CREATE TABLE r_all LIKE r1").count()
# spark.sql("INSERT INTO r_all TABLE r1")
# spark.sql("INSERT INTO r_all TABLE r2")
# spark.sql("SELECT * FROM r_all").count()