## File 04 - Exploratory Data Analysis of Interaction-level Data

In this file, we load the interaction-level ecommerce data, which is now stored as a Parquet file, and run some Spark SQL queries on it.

### Set up Spark session and data schema

We can specify more options in the `SparkSession` builder, but for now everything except the application name is left at its default setting.
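As a minimal sketch of what specifying more options could look like, the helper below passes extra key/value settings to the builder through `.config(...)`. The helper name `build_session`, the `EXTRA_OPTIONS` dictionary, and the two values in it are assumptions for illustration, not settings used in this project.

```python
# Hypothetical extra settings; the values are placeholders, not tuned recommendations.
EXTRA_OPTIONS = {
    "spark.sql.shuffle.partitions": "64",  # number of partitions used after shuffles
    "spark.sql.session.timeZone": "UTC",   # session time zone for timestamp handling
}


def build_session(app_name="project", options=None):
    """Create (or reuse) a SparkSession, applying optional extra settings."""
    # Imported lazily so the helper can be defined where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in (options or {}).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session("project", EXTRA_OPTIONS)` would stand in for the plain builder chain in the next cell. Note that `getOrCreate()` returns an already running session if one exists.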

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

# In the case of needing to specify a schema (recommended behavior), we can use this code to help design it rapidly
#   by using the default Spark settings and then making our own modifications.

# To infer schema without using all data:
# df = spark.read.option("header","true") \
#         .option("samplingRatio",0.001) \
#         .csv("/project/ds5559/group12/raw_data/2019-10.csv")
# To print the basic schema as a DDL:
# print(df._jdf.schema().toDDL())

# We have completed the above step (this is our more accurate schema)
# Defining the schema is unnecessary now that we are reading in parquets. 
#schema = "`event_time` TIMESTAMP,`event_type` STRING,`product_id` INT,`category_id` BIGINT,`category_code` STRING,`brand` STRING,`price` FLOAT,`user_id` INT,`user_session` STRING"
#ddl_schema = T._parse_datatype_string(schema)

CPU times: user 215 ms, sys: 138 ms, total: 353 ms
Wall time: 5.36 s
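If the raw CSV ever needs to be read again, the commented-out DDL string above can be passed straight to the CSV reader so that no inference pass is needed. This is a sketch only: `read_events_csv` is a hypothetical helper, and it assumes the CSV has a header row and the column layout described by the DDL string. Depending on how the timestamps are written, a `timestampFormat` option may also be needed.

```python
# DDL string copied from the commented-out schema in the cell above.
SCHEMA_DDL = (
    "`event_time` TIMESTAMP,`event_type` STRING,`product_id` INT,"
    "`category_id` BIGINT,`category_code` STRING,`brand` STRING,"
    "`price` FLOAT,`user_id` INT,`user_session` STRING"
)


def read_events_csv(spark, path, schema=SCHEMA_DDL):
    """Read a raw CSV with an explicit schema instead of inferring one."""
    # DataFrameReader.csv accepts either a StructType or a DDL-formatted string.
    return spark.read.csv(path, header=True, schema=schema)
```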


### Read in dataframe

In [9]:
%%time
df = spark.read.parquet("./processed_data/month_01_filtered.parquet")

CPU times: user 1.09 ms, sys: 2.17 ms, total: 3.27 ms
Wall time: 500 ms


### Print number of records in dataframe

N.B.: counting every record can take **>20 s**, even though it is perhaps one of the simplest possible operations (the run recorded below finished in 366 ms). So we should be careful to take exploratory actions only on subsets of the data.

In [10]:
%%time
df.printSchema()
print(f"{df.count()} records")

root
 |-- user_id: integer (nullable = true)
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- category_id: long (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: float (nullable = true)
 |-- user_session: string (nullable = true)

2807167 records
CPU times: user 3.53 ms, sys: 0 ns, total: 3.53 ms
Wall time: 366 ms


### Limit number of records in dataframe

We can drop the `category_id` column because it is a direct transformation of `category_code` (the mapping could be moved into a separate table). Dropping it also makes the printed output narrower and easier to read.
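The separate table mentioned above could be sketched as follows. `category_lookup` is a hypothetical helper, and it has to be called before `category_id` is dropped from the dataframe.

```python
def category_lookup(df):
    """Return one row per distinct (category_id, category_code) pair."""
    # select keeps only the two mapping columns; dropDuplicates removes repeats.
    return df.select("category_id", "category_code").dropDuplicates()
```

The result could be registered with `createOrReplaceTempView` and joined back on `category_code` whenever the numeric ID is needed.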

Next, we limit the dataframe to a 5,000-record subset. Notably, the dataframe is arranged by time, so the subset is biased toward whichever records come first in that ordering.

In [11]:
df=df.drop("category_id")
df=df.limit(5000)
df.createOrReplaceTempView("records")
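One way to avoid the ordering bias of `limit` is to draw a random sample instead. The sketch below is an alternative, not what this notebook does: `sample_fraction` and `random_subset` are hypothetical helpers, and the total row count is assumed to be known already (for example from the `count()` above).

```python
def sample_fraction(target_rows, total_rows):
    """Fraction of rows to keep so that roughly target_rows remain."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(1.0, target_rows / total_rows)


def random_subset(df, target_rows, total_rows, seed=0):
    """Return a random sample of df with about target_rows rows."""
    fraction = sample_fraction(target_rows, total_rows)
    # DataFrame.sample draws rows independently, so the row count is approximate.
    return df.sample(withReplacement=False, fraction=fraction, seed=seed)
```

For this dataset the call would be `random_subset(df, 5000, 2807167)`, applied to the full dataframe before `limit` is used. The sample will contain roughly, not exactly, 5,000 rows.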

#### Print dataframe contents

In [12]:
%%time
spark.sql("SELECT * FROM records").show()

+---------+-------------------+----------+----------+--------------------+---------+------+--------------------+
|  user_id|         event_time|event_type|product_id|       category_code|    brand| price|        user_session|
+---------+-------------------+----------+----------+--------------------+---------+------+--------------------+
|416793411|2020-01-12 13:16:29|      view|   1004836|construction.tool...|  samsung|231.38|315c1383-b002-4c3...|
|465783976|2020-01-04 10:36:20|      view|  13901213|construction.comp...|   blanco|218.65|75f6bddc-41f8-497...|
|465783976|2020-01-04 10:37:14|      view|  13902800|   computers.desktop|   blanco|140.35|75f6bddc-41f8-497...|
|465783976|2020-01-04 10:38:07|      view|  13902800|   computers.desktop|   blanco|140.35|75f6bddc-41f8-497...|
|465783976|2020-01-04 10:40:08|      view|  13902647|   computers.desktop| omoikiri|185.13|75f6bddc-41f8-497...|
|465783976|2020-01-04 10:41:03|      view|  13902813|   computers.desktop|   daniel| 93.92|75f6b

#### Top prices among all event types

In [13]:
%%time
spark.sql("SELECT * FROM records ORDER BY price DESC").show(20)

+---------+-------------------+----------+----------+--------------------+---------+-------+--------------------+
|  user_id|         event_time|event_type|product_id|       category_code|    brand|  price|        user_session|
+---------+-------------------+----------+----------+--------------------+---------+-------+--------------------+
|513455983|2020-01-07 05:21:49|      view| 100004522|electronics.audio...|pinskdrev|2548.33|75a92e9b-1a04-4ce...|
|515683996|2020-01-24 07:11:38|      view| 100067576|electronics.audio...|    apple|2548.33|bc391c72-dded-4c7...|
|512802367|2020-01-02 19:39:33|      view| 100015658|construction.tool...|  samsung|2548.07|bbf03942-4c87-45c...|
|512802367|2020-01-02 19:40:44|      view|   1005284|construction.tool...|  samsung|2548.07|bbf03942-4c87-45c...|
|512802367|2020-01-02 19:41:09|      view|   1005284|construction.tool...|  samsung|2548.07|bbf03942-4c87-45c...|
|512802367|2020-01-02 19:42:32|      view|   1005284|construction.tool...|  samsung|2548

#### Who has bought how much

Note: the total amount spent does not simply track the number of purchases made. In the output below, users with three purchases each have totals ranging from about 29 to about 1,675. This could lead to an interesting feature that distinguishes multiple items bought at once from items bought one at a time over a long period.

In [14]:
%%time
spark.sql("SELECT COUNT(*) AS purchases_made, SUM(price), user_id FROM records WHERE event_type = 'purchase' GROUP BY user_id ORDER BY COUNT(*) DESC").show(10, False)

+--------------+------------------+---------+
|purchases_made|sum(price)        |user_id  |
+--------------+------------------+---------+
|31            |7801.34001159668  |514500288|
|9             |2432.460029602051 |514362144|
|4             |698.0399780273438 |513631466|
|4             |3495.22998046875  |535172466|
|3             |1674.8999633789062|519283088|
|3             |1057.430019378662 |531584564|
|3             |1455.5000228881836|512505115|
|3             |396.15000915527344|523670929|
|3             |28.949999809265137|529323523|
|2             |235.67999267578125|516069879|
+--------------+------------------+---------+
only showing top 10 rows

CPU times: user 1.81 ms, sys: 817 µs, total: 2.63 ms
Wall time: 537 ms
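To make the note about purchase counts and totals concrete, the sketch below computes the average spend per purchase for the ten users shown above. It runs on values copied by hand from that output (totals rounded to two decimals), not on the Spark dataframe, and the variable names are illustrative.

```python
# (purchases_made, total_spent, user_id) copied from the ten rows shown above,
# with totals rounded to two decimals.
rows = [
    (31, 7801.34, 514500288),
    (9, 2432.46, 514362144),
    (4, 698.04, 513631466),
    (4, 3495.23, 535172466),
    (3, 1674.90, 519283088),
    (3, 1057.43, 531584564),
    (3, 1455.50, 512505115),
    (3, 396.15, 523670929),
    (3, 28.95, 529323523),
    (2, 235.68, 516069879),
]

# Average spend per purchase for each user.
avg_spend = {user_id: total / n for n, total, user_id in rows}

# Print users from highest to lowest average spend per purchase.
for user_id, avg in sorted(avg_spend.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{user_id}: {avg:8.2f} per purchase")
```

Among these ten users the average ranges from under 10 to over 870 per purchase, so a high purchase count does not by itself imply a high spend per purchase.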


#### Distribution of actions

In [15]:
%%time
spark.sql("SELECT COUNT(*), event_type FROM records GROUP BY event_type ORDER BY COUNT(*) DESC").show()

+--------+----------+
|count(1)|event_type|
+--------+----------+
|    4622|      view|
|     292|      cart|
|      86|  purchase|
+--------+----------+

CPU times: user 2.34 ms, sys: 439 µs, total: 2.78 ms
Wall time: 272 ms
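The counts above can be turned into shares of all events with a few lines of plain Python. The numbers are copied from the output of the previous cell. These are shares of events in the 5,000-record subset, not of users or sessions.

```python
# Event counts copied from the output above.
counts = {"view": 4622, "cart": 292, "purchase": 86}

total = sum(counts.values())
shares = {event: n / total for event, n in counts.items()}

for event, share in shares.items():
    print(f"{event:>8}: {share:.2%}")
```

Views make up about 92% of the subset, while purchases account for under 2%.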
