## File 03 - Basic Preprocessed Data Output

In this file, we create a small preprocessed data table.

### Set up Spark session and data schema

We can pass more options to the `SparkSession` builder, but for now every option is left at its default setting.

In [25]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

schema = "`event_time` TIMESTAMP,`event_type` STRING,`product_id` INT,`category_id` BIGINT,`category_code` STRING,`brand` STRING,`price` DECIMAL(7,2),`user_id` INT,`user_session` STRING"
ddl_schema = T._parse_datatype_string(schema)

CPU times: user 1.01 ms, sys: 948 µs, total: 1.96 ms
Wall time: 8.71 ms


### Read in test & train dataframes

In [107]:
%%time
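# Month 1 and month 2 raw event logs, read with the explicit schema defined above.
# These reads were previously disabled inside a string literal, which left
# df1 and df2 undefined for the cells below.
df1 = spark.read.option("header","true") \
        .schema(ddl_schema) \
        .csv("/project/ds5559/group12/raw_data/2020-01.csv")

df2 = spark.read.option("header","true") \
        .schema(ddl_schema) \
        .csv("/project/ds5559/group12/raw_data/2020-02.csv")

# Unfinished: no parquet path is given, so this only binds the reader method.
train = spark.read.parquet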

CPU times: user 2.99 ms, sys: 0 ns, total: 2.99 ms
Wall time: 58.7 ms


### Limit number of records in dataframes

We can limit each dataframe to a smaller subset. Note that each dataframe is ordered by event time, so a subset taken with `limit` will tend to contain only the earliest events of the month and is biased accordingly.
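
To illustrate this bias, here is a minimal sketch in plain Python (no Spark involved). The `events` list and its hourly timestamps are made up for illustration and are not taken from the project data. It compares the time span covered by the first `n` rows of a time-ordered log with the span covered by a random sample of the same size.

```python
import random
from datetime import datetime, timedelta

# Hypothetical time-ordered event log: one event per hour for 30 days.
start = datetime(2020, 1, 1)
events = [start + timedelta(hours=h) for h in range(30 * 24)]

n = 100

# Analogue of limiting a time-ordered dataframe: keep the first n rows.
head = events[:n]

# Analogue of drawing a random sample of n rows instead.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = rng.sample(events, n)

# Number of whole days between the earliest and latest event in each subset.
head_span_days = (max(head) - min(head)).days
sample_span_days = (max(sample) - min(sample)).days

print("first-n span (days):", head_span_days)
print("random-sample span (days):", sample_span_days)
```

The first `n` rows cover only the first few days of the month, while the random sample spans most of it. In PySpark, `DataFrame.sample` would be one way to draw such a sample if a less biased subset is wanted.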

In [108]:
# df1=df1.limit(100000)
df1.createOrReplaceTempView("m1")

# df2=df2.limit(100000)
df2.createOrReplaceTempView("m2")

### Create a transformed table containing elements of interest for our model

We propose a basic table format (see https://docs.google.com/document/d/1NG4KGticBXn0D3PL5_zMxLV2Pr7A8PQtLcasxCOd1nA/edit).

Each row represents a user (`user_id`) who appears in month 1 (M1) and may or may not appear in month 2 (M2).
Columns include:
- `user_id` (ID)
- `m2_total_spend` (sum over all purchase events in month 2; this is the output variable. It is 0 if the user does not appear in month 2 or makes no purchases)
- `m1_total_spend` (sum over all purchase events in month 1)
- `m1_user_sessions` (count of distinct user sessions, i.e. browsing sessions)
- `m1_purchase_events` (count of purchase events)

Additional columns may have been added below.


In [121]:
%%time
df = spark.sql("""SELECT

               /* ID */
               m1.user_id AS user_id,
               
               /* Output: Total spend in month 2; will be 0 if the user does not exist in month 2 or makes no purchase events */
               IFNULL(SUM(m2.price),0) AS m2_total_spend,
               
               
               /* Total spend in month 1: we sum the price of any 'purchase' events */
               SUM(CAST(m1.event_type=='purchase' AS INT) * m1.price) AS m1_total_spend,
               
               /* Total events in month 1: includes all event types (multiple per session) */
               COUNT(m1.event_type) AS m1_total_events,
               
               /* Total purchase events in month 1: we sum the occurrences of any 'purchase' events */
               SUM(CAST(m1.event_type=='purchase' AS INT)) AS m1_purchase_events,
               
               /* Total user sessions in month 1: we count all distinct user sessions */
               COUNT(DISTINCT m1.user_session) AS m1_user_sessions
               
               FROM m1
               
            LEFT JOIN 
            (
               SELECT * FROM m2
               WHERE event_type='purchase'
            ) m2
            
            ON m1.user_id=m2.user_id
            
            GROUP BY m1.user_id ORDER BY m1_purchase_events DESC""")
df.show(50)

+---------+--------------+--------------+---------------+------------------+----------------+
|  user_id|m2_total_spend|m1_total_spend|m1_total_events|m1_purchase_events|m1_user_sessions|
+---------+--------------+--------------+---------------+------------------+----------------+
|568782581|  238858570.17|   58217352.48|         926856|            227808|             532|
|563599039|  238022530.48|   68869280.34|         442758|            133743|             108|
|549109608|  182246880.00|   57237229.77|         343000|             88151|             132|
|513230794|  205879787.63|   41713336.30|         412115|             85845|              33|
|582826305|  196941012.36|   17104022.28|         605784|             61404|             123|
|598224683|  157533358.56|   29562396.50|         270300|             58650|             156|
|572994775|   81753726.18|   15971794.65|         278478|             52731|             169|
|513320236|  115081686.24|   22369278.60|         217890|   
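
One thing to check in the query above: it joins event-level rows of `m1` to event-level purchase rows of `m2` on `user_id`, so each month-1 event is paired with every month-2 purchase by the same user. Sums and counts taken after such a join are multiplied by the number of matching rows on the other side. This may explain the very large values in the output. The sketch below reproduces the effect on a tiny invented dataset. It uses Python's built-in `sqlite3` as a stand-in for Spark SQL so that it runs anywhere. The toy rows, the reduced column set, and the pre-aggregated alternative are assumptions for illustration, not the project's actual method.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE m1 (user_id INT, event_type TEXT, price REAL, user_session TEXT);
CREATE TABLE m2 (user_id INT, event_type TEXT, price REAL, user_session TEXT);
-- User 1: three month-1 events (one purchase of 10), two month-2 purchases of 5.
INSERT INTO m1 VALUES (1,'view',10,'a'),(1,'cart',10,'a'),(1,'purchase',10,'b');
INSERT INTO m2 VALUES (1,'purchase',5,'c'),(1,'purchase',5,'d'),(1,'view',5,'d');
-- User 2: a single month-1 view and no month-2 activity.
INSERT INTO m1 VALUES (2,'view',7,'e');
""")

# Same shape as the notebook query: event rows joined directly to event rows.
joined = con.execute("""
SELECT m1.user_id,
       IFNULL(SUM(m2.price), 0) AS m2_total_spend,
       SUM((m1.event_type = 'purchase') * m1.price) AS m1_total_spend,
       COUNT(m1.event_type) AS m1_total_events
FROM m1
LEFT JOIN (SELECT * FROM m2 WHERE event_type = 'purchase') m2
  ON m1.user_id = m2.user_id
GROUP BY m1.user_id
ORDER BY m1.user_id
""").fetchall()

# Alternative: aggregate each month per user first, then join one row to one row.
pre_aggregated = con.execute("""
SELECT a.user_id,
       IFNULL(b.m2_total_spend, 0) AS m2_total_spend,
       a.m1_total_spend,
       a.m1_total_events
FROM (SELECT user_id,
             SUM((event_type = 'purchase') * price) AS m1_total_spend,
             COUNT(event_type) AS m1_total_events
      FROM m1
      GROUP BY user_id) a
LEFT JOIN (SELECT user_id, SUM(price) AS m2_total_spend
           FROM m2
           WHERE event_type = 'purchase'
           GROUP BY user_id) b
  ON a.user_id = b.user_id
ORDER BY a.user_id
""").fetchall()

print("joined:        ", joined)
print("pre-aggregated:", pre_aggregated)
```

For user 1, the direct join reports a month-2 spend of 30, a month-1 spend of 20, and 6 events, whereas the true values (returned by the pre-aggregated form) are 10, 10, and 3. `COUNT(DISTINCT ...)` columns such as `m1_user_sessions` are not affected by the duplication.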

In [117]:
%%time
df.write.mode("overwrite").parquet("/project/ds5559/group12/processed_data/output1.parquet")

CPU times: user 18.6 ms, sys: 33.6 ms, total: 52.2 ms
Wall time: 3min 33s
