## File 01 - Preprocessed Data Output (Step 1)
##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd

In this file, we build the scaffolding of the preprocessed user-level data from the event-level data and add a few features. In the event-level data, each record is one interaction; in the user-level data, each record is one user. Each row in the output represents one user who appears in the first month.

Outputs:
- The preprocessed data file is output to `processed_data/preprocessed_01.parquet`.
- Additionally, the raw month 1 data, filtered to the user IDs kept in the preprocessed data, is output to `processed_data/month_01_filtered.parquet`.

### Set up Spark session and data schema

We could specify more options in the SparkSession builder, but for now all options are left at their default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql.functions import col

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

schema = "`event_time` TIMESTAMP,`event_type` STRING,`product_id` INT,`category_id` BIGINT,`category_code` STRING,`brand` STRING,`price` FLOAT,`user_id` INT,`user_session` STRING"
#ddl_schema = T._parse_datatype_string(schema)

CPU times: user 188 ms, sys: 157 ms, total: 345 ms
Wall time: 4.72 s


### Read in dataframes

In [2]:
%%time
# Be sure that you have downloaded the data from either
# - https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store
# - https://drive.google.com/drive/folders/1Nan8X33H8xrXS5XhCKZmSpClFTCJsSpE
# We use the January and February 2020 files.

df1 = spark.read.schema(schema).csv("/project/ds5559/group12/raw_data/2020-01.csv")
df2 = spark.read.schema(schema).csv("/project/ds5559/group12/raw_data/2020-02.csv")

CPU times: user 3.1 ms, sys: 1.36 ms, total: 4.46 ms
Wall time: 1.66 s


### Create temp views

In [3]:
# For Spark SQL
df1.createOrReplaceTempView("m1")
df2.createOrReplaceTempView("m2")

### Create a transformed table containing elements of interest for our model

We propose a basic table format (see https://docs.google.com/document/d/1NG4KGticBXn0D3PL5_zMxLV2Pr7A8PQtLcasxCOd1nA/edit).

Each row is a `user_id` that appears in month 1 (M1) and may or may not appear in month 2 (M2).
Columns include:
- `user_id` (ID)
- `total_spend` (sum among all purchase events in month 1)
- `total_events` (count of all user events during month 1, of any event type)
- `total_sessions` (count of distinct user sessions/browsing sessions during month 1)
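
To make the column definitions concrete, here is a minimal sketch of the same per-user aggregation on a handful of made-up events. It uses Python's built-in `sqlite3` rather than Spark so that it runs without a cluster; the table name `events` and every row in `toy_events` are hypothetical, and we only assume that SQLite and Spark SQL agree on these simple aggregates.

```python
import sqlite3

# Hypothetical toy events: (event_type, price, user_id, user_session)
toy_events = [
    ("view",     10.0, 1,    "s1"),
    ("cart",     10.0, 1,    "s1"),
    ("purchase", 10.0, 1,    "s1"),
    ("purchase",  5.5, 1,    "s2"),
    ("view",     99.0, 2,    "s3"),
    ("view",     20.0, None, "s4"),  # missing user_id, dropped by the WHERE clause
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(event_type TEXT, price REAL, user_id INTEGER, user_session TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", toy_events)

# One output row per user: spend counts only 'purchase' events,
# events counts every row, sessions counts distinct session ids.
user_level = conn.execute("""
    SELECT user_id,
           SUM(CASE WHEN event_type = 'purchase' THEN price ELSE 0.0 END) AS total_spend,
           COUNT(event_type)            AS total_events,
           COUNT(DISTINCT user_session) AS total_sessions
    FROM events
    WHERE user_id IS NOT NULL
    GROUP BY user_id
    ORDER BY total_events DESC
""").fetchall()
conn.close()

for row in user_level:
    print(row)
```

In this toy data, user 1 has four events across two sessions, of which only the two purchases contribute to `total_spend`; user 2 only views, so their spend is zero.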





In [4]:
%%time
df = spark.sql("""SELECT

               /* ID */
               m1.user_id AS user_id,
               
               /* Total spend in month 1: we sum the price of any 'purchase' events */
               SUM(CAST(m1.event_type=='purchase' AS INT) * m1.price) AS total_spend,
               
               /* Total events in month 1: includes all event types (multiple per session) */
               COUNT(m1.event_type) AS total_events,
               
               /* Total user sessions in month 1: we count all distinct user sessions */
               COUNT(DISTINCT m1.user_session) AS total_sessions
               
               FROM m1
               
            /* Note: This is a left join, so every month 1 user is kept whether or not they purchase in month 2.
               The month 2 aggregate is not used in the SELECT; the response variable is built separately below. */
            LEFT JOIN
            (
               SELECT
                 user_id,
                 SUM(m2.price) AS price FROM m2
                 
               WHERE event_type='purchase'
               GROUP BY m2.user_id
            ) m2
           
            ON m1.user_id=m2.user_id
           
            /* Prevent adding bad data where user_id is null */
            WHERE ISNULL(m1.user_id)<>1

            GROUP BY m1.user_id ORDER BY total_events DESC
           
            """)
df.show(5)
df.count()

+---------+-----------------+------------+--------------+
|  user_id|      total_spend|total_events|total_sessions|
+---------+-----------------+------------+--------------+
|597644399|              0.0|       41280|         40188|
|569335945|              0.0|       23058|         23057|
|594718064|              0.0|       16347|         15890|
|597514055|              0.0|       13717|         12952|
|568804062|172.9600067138672|       11479|          9149|
+---------+-----------------+------------+--------------+
only showing top 5 rows

CPU times: user 7.55 ms, sys: 70.5 ms, total: 78 ms
Wall time: 1min 24s


4385985

#### Remove customers who did not make a purchase in month 1.
We remove them because it is unlikely that we can meaningfully influence customers who make no purchase in month 1.

In [5]:
df = df.filter(col('total_spend') > 0)

In [6]:
df.count()

359105

#### Create the individual-level response variable, total spend in month 2 (`T_total_spend`), join it with the month 1 data, and fill nulls with 0

In [7]:
# Sum the price of all month 2 purchase events per user
total_spend_response = (
    df2.filter(col('event_type') == "purchase")
       .groupBy(col('user_id'))
       .sum('price')
       .withColumnRenamed('sum(price)', "T_total_spend")
)

In [8]:
%%time
df = df.join(total_spend_response, df.user_id == total_spend_response.user_id, 'leftouter').drop(total_spend_response.user_id)

#df.show()
#df.count()

CPU times: user 567 µs, sys: 1.83 ms, total: 2.4 ms
Wall time: 42.4 ms


In [9]:
df = df.fillna(0.0, "T_total_spend")
#df.show()
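
The join-and-fill logic above can be sketched on toy data in plain Python (no Spark). All values below, and the names `month1_users`, `month2_purchases`, `response`, and `joined`, are hypothetical and exist only for illustration.

```python
# Hypothetical month 1 user rows (already filtered to purchasers)
month1_users = [
    {"user_id": 1, "total_spend": 15.5},
    {"user_id": 2, "total_spend": 40.0},
]

# Hypothetical month 2 purchase events
month2_purchases = [
    {"user_id": 1, "price": 7.0},
    {"user_id": 1, "price": 3.0},
    {"user_id": 3, "price": 50.0},  # not in month 1, so dropped by the left join
]

# Response variable: total month 2 purchase spend per user
response = {}
for event in month2_purchases:
    uid = event["user_id"]
    response[uid] = response.get(uid, 0.0) + event["price"]

# Left outer join on user_id: keep every month 1 user and
# fill a missing month 2 total with 0.0 (the fillna step)
joined = [
    {**user, "T_total_spend": response.get(user["user_id"], 0.0)}
    for user in month1_users
]

for row in joined:
    print(row)
```

User 2 never purchases in month 2, so their `T_total_spend` is filled with 0.0 rather than left null; user 3 never appears because the join keeps only month 1 users.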

In [10]:
df.write.mode("overwrite").parquet("./processed_data/preprocessed_01.parquet")
#print(df.count())
#df.show(5)

#### Write to CSV file

In [11]:
%%time
# Not necessary at this time, but the preprocessed data can also be written to CSV if desired
# df.coalesce(1).write.option("header", "true").csv("./processed_data/temp_preprocessed_01.csv")

CPU times: user 1e+03 ns, sys: 2 µs, total: 3 µs
Wall time: 6.44 µs


#### Write the raw month 1 data, filtered to the retained user IDs, to a file.

In [12]:
%%time
# Left semi join: keep only the month 1 events whose user_id appears in df,
# without adding any of df's columns
month_01_filtered = df1.join(df,'user_id','leftsemi')
#print(month_01_filtered.count())
#month_01_filtered.show(5)

CPU times: user 0 ns, sys: 904 µs, total: 904 µs
Wall time: 12.7 ms


In [13]:
month_01_filtered.count()

15923973
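
For readers unfamiliar with left semi joins, the filtering step behaves roughly like the plain-Python sketch below (no Spark). The events in `raw_events` and the IDs in `kept_user_ids` are hypothetical.

```python
# Hypothetical raw month 1 events
raw_events = [
    {"user_id": 1, "event_type": "view"},
    {"user_id": 1, "event_type": "purchase"},
    {"user_id": 2, "event_type": "view"},
    {"user_id": 3, "event_type": "cart"},
]

# Hypothetical set of user ids retained in the preprocessed data
kept_user_ids = {1, 3}

# A left semi join keeps left-side rows that have a match on the right,
# returns each matching row once, and adds no right-side columns.
filtered_events = [e for e in raw_events if e["user_id"] in kept_user_ids]

for event in filtered_events:
    print(event)
```

The result contains every event of the retained users, which is why the filtered row count is far larger than the number of users.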

In [14]:
%%time
month_01_filtered.write.mode("overwrite").parquet("./processed_data/month_01_filtered.parquet")

CPU times: user 10.3 ms, sys: 45.5 ms, total: 55.7 ms
Wall time: 3min 9s
