## File 02 - Feature Creation

In this file, we create new features from our interaction-level dataset, handle obvious errors/outliers, and perform PCA. 

### Set up Spark session and data schema

The SparkSession builder accepts further configuration options, but for now we set only the application name and leave everything else at its default.

In [1]:
%%time
import copy
import datetime as dt
import sys

import matplotlib.pyplot as plt
from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler
from pyspark.ml.feature import PCA as PCAml
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql.types import DoubleType, IntegerType
# NOTE: the wildcard import shadows Python's built-in max, min, sum, etc.
# The cells below rely on the Spark versions of these functions.
from pyspark.sql.functions import *
from pyspark.sql.functions import col, translate, udf

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

#schema = "`event_time` TIMESTAMP,`event_type` STRING,`product_id` INT,`category_id` BIGINT,`category_code` STRING,`brand` STRING,`price` FLOAT,`user_id` INT,`user_session` STRING"
#ddl_schema = T._parse_datatype_string(schema)

CPU times: user 556 ms, sys: 334 ms, total: 890 ms
Wall time: 5.68 s


See https://docs.google.com/document/d/1NG4KGticBXn0D3PL5_zMxLV2Pr7A8PQtLcasxCOd1nA/edit for the table of features.

### Read in data

In [2]:
%%time
full = spark.read.parquet("./processed_data/preprocessed_01.parquet")
m1 = spark.read.parquet("./processed_data/month_01_filtered.parquet") # This brings in the data we can create additional features from

CPU times: user 3.12 ms, sys: 1.11 ms, total: 4.23 ms
Wall time: 3.04 s


In [3]:
print(full.count())
full.show(5)

36228
+---------+------------------+------------------+------------+--------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|
+---------+------------------+------------------+------------+--------------+
|416898622|               0.0|1197.9500122070312|          14|             2|
|512431250|               0.0|  964.760009765625|          22|             7|
|512432602|               0.0| 1253.659984588623|          19|             2|
|512440297|               0.0| 65.86000061035156|          29|             9|
|512444835|12648.860717773438| 46.04999923706055|          98|            20|
+---------+------------------+------------------+------------+--------------+
only showing top 5 rows



In [4]:
print(m1.count())
m1.show(5)

1598123
+---------+-------------------+----------+----------+-------------------+-------------+-----+------+--------------------+
|  user_id|         event_time|event_type|product_id|        category_id|category_code|brand| price|        user_session|
+---------+-------------------+----------+----------+-------------------+-------------+-----+------+--------------------+
|512435762|2020-01-02 03:25:07|      view|   1500440|2232732071460078545|    kids.toys|epson|174.73|f044f5eb-15b6-458...|
|512435762|2020-01-02 03:25:41|      view|   1500440|2232732071460078545|    kids.toys|epson|174.73|f044f5eb-15b6-458...|
|512435762|2020-01-02 03:25:49|      view|   1500021|2232732071460078545|    kids.toys|epson|117.84|f044f5eb-15b6-458...|
|512435762|2020-01-02 03:26:15|      view|   1500021|2232732071460078545|    kids.toys|epson|117.84|f044f5eb-15b6-458...|
|512435762|2020-01-13 14:51:59|      view|  54900006|2232732128041238700|apparel.shirt| null| 51.48|e38bf463-f852-4b0...|
+---------+-----

## Begin Creating Features
### Create each feature at the user level, then join it to `full`
##### NOTE: Rename every feature column so that its name contains no parentheses; such names cannot be saved to Parquet

_________________

#### Average Session Duration (avg_session_length)

In [5]:
session_ends = m1.groupBy('user_id', 'user_session').agg(max('event_time'), min('event_time'))

In [6]:
session_ends.show(5)

+---------+--------------------+-------------------+-------------------+
|  user_id|        user_session|    max(event_time)|    min(event_time)|
+---------+--------------------+-------------------+-------------------+
|526912795|eca36803-522d-448...|2020-01-21 19:09:26|2020-01-21 19:06:49|
|539924996|37d03579-4eed-4cc...|2020-01-06 08:06:32|2020-01-06 08:04:14|
|541839658|4524b88b-41c5-4d1...|2020-01-15 14:59:12|2020-01-15 13:05:52|
|554474132|c77d39b2-b1fa-4cc...|2020-01-08 12:08:16|2020-01-08 12:08:16|
|561060528|d0b3c766-9c8a-4c4...|2020-01-09 06:15:54|2020-01-09 06:11:02|
+---------+--------------------+-------------------+-------------------+
only showing top 5 rows



In [7]:
session_ends = session_ends.withColumn('session_length', (col("max(event_time)").cast('long') - col("min(event_time)").cast('long')))

In [8]:
session_ends.orderBy(col("session_length").desc()).show(5)
# NOTE: Many of these sessions are unreasonably long; the longest (2,653,250 s) spans about 30.7 days

+---------+--------------------+-------------------+-------------------+--------------+
|  user_id|        user_session|    max(event_time)|    min(event_time)|session_length|
+---------+--------------------+-------------------+-------------------+--------------+
|594413646|dc9d1ec0-6440-467...|2020-01-31 19:31:07|2020-01-01 02:30:17|       2653250|
|563182846|f5943245-7eb9-410...|2020-01-31 20:14:40|2020-01-01 05:04:00|       2646640|
|519038485|810387f0-d371-425...|2020-01-31 17:44:11|2020-01-01 06:12:07|       2633524|
|584249812|1940c594-f169-48c...|2020-01-31 15:09:53|2020-01-01 04:19:19|       2631034|
|584249812|c1735f57-dc30-405...|2020-01-31 15:09:53|2020-01-01 04:22:54|       2630819|
+---------+--------------------+-------------------+-------------------+--------------+
only showing top 5 rows
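
The subtraction in cell 7 works because casting a timestamp to `long` yields whole seconds since the Unix epoch, so `session_length` is measured in seconds. As a rough illustration, the plain-Python sketch below (standard library only, not Spark) recomputes the top row of the table above. The helper name `session_length_seconds` is made up for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def session_length_seconds(first_event: str, last_event: str) -> int:
    """Seconds between the first and last event of a session."""
    delta = datetime.strptime(last_event, FMT) - datetime.strptime(first_event, FMT)
    return int(delta.total_seconds())

# Longest session in the table above (user 594413646).
longest = session_length_seconds("2020-01-01 02:30:17", "2020-01-31 19:31:07")
print(longest)                    # 2653250
print(round(longest / 86400, 1))  # 30.7 (days)
```

A session whose first and last events share a timestamp, such as the single-event session of user 554474132, gets a length of 0.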



In [9]:
avg_sess = session_ends.groupBy('user_id').avg('session_length').withColumnRenamed('avg(session_length)', "avg_session_length")

In [10]:
avg_sess.show(5)

+---------+------------------+
|  user_id|avg_session_length|
+---------+------------------+
|598364094|  555.972972972973|
|518756087|             100.0|
|539305139|              84.2|
|555510639| 6160.772727272727|
|566350946|             185.5|
+---------+------------------+
only showing top 5 rows



In [11]:
full = full.join(avg_sess, full.user_id == avg_sess.user_id).drop(avg_sess.user_id)
print(full.count())
full.show(5)

36228
+---------+-----------------+------------------+------------+--------------+------------------+
|  user_id|    T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|
+---------+-----------------+------------------+------------+--------------+------------------+
|598364094|              0.0| 8519.010177612305|         418|            37|  555.972972972973|
|518756087|              0.0| 8.210000038146973|          29|             9|             100.0|
|539305139|              0.0|10653.269989013672|         111|            30|              84.2|
|555510639|68302.43673706055| 394.3600082397461|         972|            44| 6160.772727272727|
|566350946|              0.0|25.459999084472656|           9|             2|             185.5|
+---------+-----------------+------------------+------------+--------------+------------------+
only showing top 5 rows



#### Standard deviation of session duration per user (sd_session_length)

In [12]:
session_ends.show(5)

+---------+--------------------+-------------------+-------------------+--------------+
|  user_id|        user_session|    max(event_time)|    min(event_time)|session_length|
+---------+--------------------+-------------------+-------------------+--------------+
|526912795|eca36803-522d-448...|2020-01-21 19:09:26|2020-01-21 19:06:49|           157|
|539924996|37d03579-4eed-4cc...|2020-01-06 08:06:32|2020-01-06 08:04:14|           138|
|541839658|4524b88b-41c5-4d1...|2020-01-15 14:59:12|2020-01-15 13:05:52|          6800|
|554474132|c77d39b2-b1fa-4cc...|2020-01-08 12:08:16|2020-01-08 12:08:16|             0|
|561060528|d0b3c766-9c8a-4c4...|2020-01-09 06:15:54|2020-01-09 06:11:02|           292|
+---------+--------------------+-------------------+-------------------+--------------+
only showing top 5 rows



In [13]:
sd_session_length = session_ends.groupBy('user_id') \
                                 .agg(stddev('session_length')) \
                                 .withColumnRenamed("stddev_samp(session_length)", 'sd_session_length')

In [14]:
sd_session_length.show(5)

+---------+------------------+
|  user_id| sd_session_length|
+---------+------------------+
|598364094|1207.6052631019259|
|518756087|247.70193782043773|
|539305139| 88.46327895877435|
|555510639| 28985.71549547044|
|566350946|249.60869375885127|
+---------+------------------+
only showing top 5 rows
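
Spark's `stddev` is the sample standard deviation (note the `stddev_samp(...)` column name being renamed), which divides by n - 1 and is therefore undefined for a user with a single session; Spark reports such cases as null or NaN depending on the version. The sketch below illustrates this with the standard library rather than Spark. The two session lengths, 362 s and 9 s, are hypothetical: they were chosen because they reproduce the mean and standard deviation shown for user 566350946, whose actual per-session values do not appear in this notebook.

```python
import statistics

# Hypothetical session lengths (seconds) for a user with two sessions.
lengths = [362, 9]
print(statistics.mean(lengths))   # 185.5
print(statistics.stdev(lengths))  # sample standard deviation (divides by n - 1)

# With a single session the sample standard deviation is undefined.
single = [100]
try:
    statistics.stdev(single)
except statistics.StatisticsError as err:
    print("undefined:", err)

# The population standard deviation of a single value is 0.
print(statistics.pstdev(single))  # 0.0
```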



In [15]:
full = full.join(sd_session_length, full.user_id == sd_session_length.user_id).drop(sd_session_length.user_id)

#### Average number of interactions per session (avg_interactions_per_session)

In [16]:
interactions_per_session = m1.groupBy('user_id', 'user_session').agg(count('event_type'))

In [17]:
interactions_per_session.show(5)

+---------+--------------------+-----------------+
|  user_id|        user_session|count(event_type)|
+---------+--------------------+-----------------+
|526912795|eca36803-522d-448...|                2|
|539924996|37d03579-4eed-4cc...|                6|
|541839658|4524b88b-41c5-4d1...|               55|
|554474132|c77d39b2-b1fa-4cc...|                1|
|561060528|d0b3c766-9c8a-4c4...|                6|
+---------+--------------------+-----------------+
only showing top 5 rows
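
The per-session counts above feed three user-level features (average, standard deviation, and maximum), so the aggregation happens in two stages: first per session, then per user. A minimal standard-library sketch of the second stage follows; the rows and the identifiers `u1`, `u2`, `s1`, ... are hypothetical and not taken from the data.

```python
from collections import defaultdict
import statistics

# Hypothetical (user_id, user_session, event count) rows; not real data.
rows = [
    ("u1", "s1", 7),
    ("u1", "s2", 2),
    ("u2", "s3", 4),
]

# Stage 2: collect the per-session counts of each user.
counts_by_user = defaultdict(list)
for user_id, _session, n_events in rows:
    counts_by_user[user_id].append(n_events)

features = {}
for user_id, counts in counts_by_user.items():
    features[user_id] = {
        "avg_interactions_per_session": statistics.mean(counts),
        # The sample standard deviation needs at least two sessions.
        "sd_interactions_per_session": (
            statistics.stdev(counts) if len(counts) > 1 else None
        ),
        "max_interactions_per_session": max(counts),
    }

print(features["u1"])
print(features["u2"])
```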



In [18]:
avg_interactions_per_session = interactions_per_session.groupBy('user_id').avg('count(event_type)')

In [19]:
avg_interactions_per_session = avg_interactions_per_session.withColumnRenamed('avg(count(event_type))', "avg_interactions_per_session")

In [20]:
full = full.join(avg_interactions_per_session, full.user_id == avg_interactions_per_session.user_id).drop(avg_interactions_per_session.user_id)
full.show(5)

+---------+-----------------+------------------+------------+--------------+------------------+------------------+----------------------------+
|  user_id|    T_total_spend|       total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|
+---------+-----------------+------------------+------------+--------------+------------------+------------------+----------------------------+
|598364094|              0.0| 8519.010177612305|         418|            37|  555.972972972973|1207.6052631019259|          11.297297297297296|
|518756087|              0.0| 8.210000038146973|          29|             9|             100.0|247.70193782043773|          3.2222222222222223|
|539305139|              0.0|10653.269989013672|         111|            30|              84.2| 88.46327895877435|                         3.7|
|555510639|68302.43673706055| 394.3600082397461|         972|            44| 6160.772727272727| 28985.71549547044|          11.045454545

#### Standard deviation of number of interactions per session per user (sd_interactions_per_session)

In [21]:
std_interactions_per_session = interactions_per_session.groupBy('user_id') \
                                                       .agg(stddev('count(event_type)')) \
                                                       .withColumnRenamed("stddev_samp(count(event_type))", 'sd_interactions_per_session')
std_interactions_per_session.show(5)

+---------+---------------------------+
|  user_id|sd_interactions_per_session|
+---------+---------------------------+
|598364094|         26.770490454213935|
|518756087|          4.867693955503411|
|539305139|         1.9145540580979452|
|555510639|         12.279600299257046|
|566350946|         3.5355339059327378|
+---------+---------------------------+
only showing top 5 rows



In [22]:
full = full.join(std_interactions_per_session, full.user_id == std_interactions_per_session.user_id).drop(std_interactions_per_session.user_id)
full.show(5)

+---------+-----------------+------------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+
|  user_id|    T_total_spend|       total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|
+---------+-----------------+------------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+
|598364094|              0.0| 8519.010177612305|         418|            37|  555.972972972973|1207.6052631019259|          11.297297297297296|         26.770490454213935|
|518756087|              0.0| 8.210000038146973|          29|             9|             100.0|247.70193782043773|          3.2222222222222223|          4.867693955503411|
|539305139|              0.0|10653.269989013672|         111|            30|              84.2| 88.46327895877435|                         3

#### Max number of interactions within one session (max_interactions_per_session)

In [23]:
max_interactions_per_session = interactions_per_session.groupBy('user_id').max('count(event_type)')

In [24]:
max_interactions_per_session = max_interactions_per_session.withColumnRenamed('max(count(event_type))', "max_interactions_per_session")

In [25]:
max_interactions_per_session.show(1)

+---------+----------------------------+
|  user_id|max_interactions_per_session|
+---------+----------------------------+
|598364094|                         150|
+---------+----------------------------+
only showing top 1 row



In [26]:
full = full.join(max_interactions_per_session, full.user_id == max_interactions_per_session.user_id).drop(max_interactions_per_session.user_id)
full.show(5)

+---------+-----------------+------------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+
|  user_id|    T_total_spend|       total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|
+---------+-----------------+------------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+
|598364094|              0.0| 8519.010177612305|         418|            37|  555.972972972973|1207.6052631019259|          11.297297297297296|         26.770490454213935|                         150|
|518756087|              0.0| 8.210000038146973|          29|             9|             100.0|247.70193782043773|          3.2222222222222223|          4.867693955503411|                         

#### Fraction of total events of each type: purchase, cart, view (purchase_pct_of_total_events, cart_pct_of_total_events, view_pct_of_total_events)

In [27]:
event_counts = m1.groupBy('user_id', 'user_session').pivot('event_type').agg(count('event_type'))
# Pivot event_type into one count column per type (cart, purchase, view) for later tabulation

In [28]:
event_counts = event_counts.fillna(0) #replace nulls with 0 for math
event_counts.show(5)

+---------+--------------------+----+--------+----+
|  user_id|        user_session|cart|purchase|view|
+---------+--------------------+----+--------+----+
|605451879|35dce21a-47dd-45f...|   2|       1|   2|
|571736229|3d568215-c626-462...|   0|       0|   1|
|596589270|5dbba8f0-9904-498...|   3|       1|   7|
|606099979|83be3ab8-b47c-409...|   0|       0|   2|
|563099954|f718a878-5191-402...|   1|       0|   1|
+---------+--------------------+----+--------+----+
only showing top 5 rows



In [29]:
events_per_session = event_counts.withColumn('events_per_session_total', col('cart') + col('purchase') + col('view')) 
# Get total number of events per session

In [30]:
events_per_session.show(5)

+---------+--------------------+----+--------+----+------------------------+
|  user_id|        user_session|cart|purchase|view|events_per_session_total|
+---------+--------------------+----+--------+----+------------------------+
|605451879|35dce21a-47dd-45f...|   2|       1|   2|                       5|
|571736229|3d568215-c626-462...|   0|       0|   1|                       1|
|596589270|5dbba8f0-9904-498...|   3|       1|   7|                      11|
|606099979|83be3ab8-b47c-409...|   0|       0|   2|                       2|
|563099954|f718a878-5191-402...|   1|       0|   1|                       2|
+---------+--------------------+----+--------+----+------------------------+
only showing top 5 rows



In [31]:
pct_events = events_per_session.groupBy('user_id').sum()

In [32]:
pct_totalevents = pct_events.withColumn('purchase_pct_of_total_events', col('sum(purchase)')/col('sum(events_per_session_total)')) \
                  .withColumn('view_pct_of_total_events', col('sum(view)')/col('sum(events_per_session_total)')) \
                  .withColumn('cart_pct_of_total_events', col('sum(cart)')/col('sum(events_per_session_total)'))
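
Despite the `pct` in their names, these three columns hold fractions between 0 and 1, and together they sum to 1 for every user. The standard-library sketch below shows the same arithmetic for user 598364094, using the monthly totals reported further down in this notebook (13 cart, 10 purchase, 395 view events). The function name `event_shares` is an assumption made for illustration.

```python
def event_shares(cart: int, purchase: int, view: int) -> dict:
    """Fraction of a user's events that fall into each event type."""
    total = cart + purchase + view
    return {
        "purchase_pct_of_total_events": purchase / total,
        "view_pct_of_total_events": view / total,
        "cart_pct_of_total_events": cart / total,
    }

shares = event_shares(cart=13, purchase=10, view=395)
for name, value in shares.items():
    print(name, value)

# The three shares add up to 1 (up to floating-point rounding).
print(sum(shares.values()))
```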

In [33]:
merge_me = pct_totalevents.select('user_id', 'purchase_pct_of_total_events', 'view_pct_of_total_events', 'cart_pct_of_total_events')

In [34]:
full = full.join(merge_me, full.user_id == merge_me.user_id).drop(merge_me.user_id)
full.show(5)

+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|
+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+
|512479812|50881.730712890625|166.19000244140625|          73|             9| 533.1111111111111|649.9266197895814|            8.11111111111111|          7.37299

#### Average number of purchases per session (avg_purchases_per_session)

In [35]:
avg_purchases_per_session = events_per_session.groupBy('user_id').avg('purchase').withColumnRenamed('avg(purchase)', "avg_purchases_per_session")

In [36]:
avg_purchases_per_session.show(5)

+---------+-------------------------+
|  user_id|avg_purchases_per_session|
+---------+-------------------------+
|598364094|       0.2702702702702703|
|518756087|       0.1111111111111111|
|555510639|      0.09090909090909091|
|539305139|       0.5333333333333333|
|566350946|                      0.5|
+---------+-------------------------+
only showing top 5 rows



In [37]:
full = full.join(avg_purchases_per_session, full.user_id == avg_purchases_per_session.user_id).drop(avg_purchases_per_session.user_id)
full.show(5)

+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|
+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+
|512479812|50881.730712890625|166.19000244140625|          73|             9| 533.

#### Standard deviation of number of purchases per session per user (sd_purchases_per_session)

In [38]:
std_purchases_per_session = events_per_session.groupBy('user_id') \
                                              .agg(stddev('purchase')) \
                                              .withColumnRenamed('stddev_samp(purchase)', "sd_purchases_per_session")
std_purchases_per_session.show(5)

+---------+------------------------+
|  user_id|sd_purchases_per_session|
+---------+------------------------+
|598364094|      0.6518626580230877|
|518756087|      0.3333333333333333|
|555510639|     0.29080336345115265|
|539305139|        0.68144538746106|
|566350946|      0.7071067811865476|
+---------+------------------------+
only showing top 5 rows



In [39]:
full = full.join(std_purchases_per_session, full.user_id == std_purchases_per_session.user_id).drop(std_purchases_per_session.user_id)
full.show(5)

+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|
+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+
|512479

#### Total number of each type of event over the whole month (cart_events, purchase_events, view_events)

In [40]:
event_counts_month = event_counts.groupBy('user_id').sum('cart', 'purchase', 'view')\
                     .withColumnRenamed('sum(cart)', 'cart_events') \
                     .withColumnRenamed('sum(purchase)', 'purchase_events') \
                     .withColumnRenamed('sum(view)', 'view_events')

In [41]:
event_counts_month.show(5)

+---------+-----------+---------------+-----------+
|  user_id|cart_events|purchase_events|view_events|
+---------+-----------+---------------+-----------+
|598364094|         13|             10|        395|
|518756087|          3|              1|         25|
|555510639|         13|              4|        469|
|539305139|         20|             16|         75|
|566350946|          1|              1|          7|
+---------+-----------+---------------+-----------+
only showing top 5 rows
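
One possible sanity check is to compare the sum of the three monthly totals with the `total_events` column that `full` already carries. The sketch below does this in plain Python for the five users shown above, pairing each with the `total_events` value printed for the same user earlier in this notebook. Four of the five agree, while user 555510639 shows 972 in `total_events` against 486 here; the cause is not visible in this notebook, so it may be worth checking upstream.

```python
# (user_id, cart_events, purchase_events, view_events, total_events)
# The first four values come from the table above, the last from earlier output.
rows = [
    (598364094, 13, 10, 395, 418),
    (518756087, 3, 1, 25, 29),
    (555510639, 13, 4, 469, 972),
    (539305139, 20, 16, 75, 111),
    (566350946, 1, 1, 7, 9),
]

# Keep only the users whose per-type totals do not add up to total_events.
mismatches = [
    (user_id, cart + purchase + view, total)
    for user_id, cart, purchase, view, total in rows
    if cart + purchase + view != total
]
print(mismatches)  # [(555510639, 486, 972)]
```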



In [42]:
full = full.join(event_counts_month, full.user_id == event_counts_month.user_id).drop(event_counts_month.user_id)
full.show(5)

+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|
+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+----

#### Number of sessions over the whole month that contain each event type (sessions_with_purchase, sessions_with_cart, sessions_with_view)

In [43]:
events_over_month = events_per_session.withColumn('purchase_events', when(col('purchase') == 0, 0).otherwise(1)) \
                                      .withColumn('cart_events', when(col('cart')==0, 0).otherwise(1)) \
                                      .withColumn('view_events', when(col('view')==0, 0).otherwise(1))

In [44]:
num_sesh_containing_event = events_over_month.groupBy('user_id').sum('purchase_events', "cart_events", "view_events") \
                            .withColumnRenamed("sum(purchase_events)", "sessions_with_purchase") \
                            .withColumnRenamed("sum(cart_events)", "sessions_with_cart") \
                            .withColumnRenamed("sum(view_events)", "sessions_with_view")

In [45]:
num_sesh_containing_event.show(5)

+---------+----------------------+------------------+------------------+
|  user_id|sessions_with_purchase|sessions_with_cart|sessions_with_view|
+---------+----------------------+------------------+------------------+
|598364094|                     7|                 7|                37|
|518756087|                     1|                 1|                 9|
|555510639|                     4|                 7|                44|
|539305139|                    14|                17|                30|
|566350946|                     1|                 1|                 2|
+---------+----------------------+------------------+------------------+
only showing top 5 rows



In [46]:
full = full.join(num_sesh_containing_event, full.user_id == num_sesh_containing_event.user_id).drop(num_sesh_containing_event.user_id)
full.show(5)

+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_with_cart|sessions_with_view|
+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------

#### Share of a user's sessions that end in a purchase or in a cart (pct_sessions_end_purchase, pct_sessions_end_cart)

In [47]:
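# Flag each session by the events it contains: end_purchase = at least one purchase;
# end_cart = at least one cart event but no purchase. Event order is not checked.
session_ends2 = event_counts.withColumn('end_purchase', \
                                when(col('purchase') != 0, 1) \
                                .otherwise(0)) \
                            .withColumn('end_cart', \
                                when((col("purchase") == 0) & (col("cart") != 0), 1) \
                                .otherwise(0))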
session_ends2.show(5)

+---------+--------------------+----+--------+----+------------+--------+
|  user_id|        user_session|cart|purchase|view|end_purchase|end_cart|
+---------+--------------------+----+--------+----+------------+--------+
|605451879|35dce21a-47dd-45f...|   2|       1|   2|           1|       0|
|571736229|3d568215-c626-462...|   0|       0|   1|           0|       0|
|596589270|5dbba8f0-9904-498...|   3|       1|   7|           1|       0|
|606099979|83be3ab8-b47c-409...|   0|       0|   2|           0|       0|
|563099954|f718a878-5191-402...|   1|       0|   1|           0|       1|
+---------+--------------------+----+--------+----+------------+--------+
only showing top 5 rows
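
Note that both flags are derived from per-session counts alone, so "ends in" here is a proxy: a session counts as ending in a purchase if it contains any purchase, and as ending in a cart if it contains a cart event but no purchase. The plain-Python sketch below applies the same rule to the five sessions shown above; `classify_session` is a name invented for this example.

```python
def classify_session(cart: int, purchase: int) -> tuple:
    """Return (end_purchase, end_cart) flags from per-session event counts."""
    end_purchase = 1 if purchase != 0 else 0
    end_cart = 1 if (purchase == 0 and cart != 0) else 0
    return end_purchase, end_cart

# (cart, purchase) counts of the five sessions in the table above.
sessions = [(2, 1), (0, 0), (3, 1), (0, 0), (1, 0)]
flags = [classify_session(cart, purchase) for cart, purchase in sessions]
print(flags)  # [(1, 0), (0, 0), (1, 0), (0, 0), (0, 1)]
```

Because of the `purchase == 0` condition, the two flags are never both 1 for the same session.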



In [48]:
session_sum = session_ends2.groupBy('user_id').agg(count('user_session'), sum('end_purchase'), sum('end_cart'))
session_sum.show(5)

+---------+-------------------+-----------------+-------------+
|  user_id|count(user_session)|sum(end_purchase)|sum(end_cart)|
+---------+-------------------+-----------------+-------------+
|598364094|                 37|                7|            0|
|518756087|                  9|                1|            0|
|555510639|                 44|                4|            3|
|539305139|                 30|               14|            3|
|566350946|                  2|                1|            0|
+---------+-------------------+-----------------+-------------+
only showing top 5 rows



In [49]:
session_sum = session_sum.withColumn('pct_sessions_end_purchase', col('sum(end_purchase)')/col('count(user_session)')) \
                         .withColumn('pct_sessions_end_cart', col('sum(end_cart)')/col('count(user_session)'))
session_sum.show(5)

+---------+-------------------+-----------------+-------------+-------------------------+---------------------+
|  user_id|count(user_session)|sum(end_purchase)|sum(end_cart)|pct_sessions_end_purchase|pct_sessions_end_cart|
+---------+-------------------+-----------------+-------------+-------------------------+---------------------+
|598364094|                 37|                7|            0|       0.1891891891891892|                  0.0|
|518756087|                  9|                1|            0|       0.1111111111111111|                  0.0|
|555510639|                 44|                4|            3|      0.09090909090909091|  0.06818181818181818|
|539305139|                 30|               14|            3|       0.4666666666666667|                  0.1|
|566350946|                  2|                1|            0|                      0.5|                  0.0|
+---------+-------------------+-----------------+-------------+-------------------------+---------------

In [50]:
temp = session_sum.select('user_id', "pct_sessions_end_purchase", "pct_sessions_end_cart")
temp.show(5)

+---------+-------------------------+---------------------+
|  user_id|pct_sessions_end_purchase|pct_sessions_end_cart|
+---------+-------------------------+---------------------+
|598364094|       0.1891891891891892|                  0.0|
|518756087|       0.1111111111111111|                  0.0|
|555510639|      0.09090909090909091|  0.06818181818181818|
|539305139|       0.4666666666666667|                  0.1|
|566350946|                      0.5|                  0.0|
+---------+-------------------------+---------------------+
only showing top 5 rows
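
Each share is simply the flagged-session count divided by the user's number of sessions. As a quick check with the standard library, the sketch below recomputes the row for user 555510639 (44 sessions, 4 with a purchase, 3 with a cart but no purchase); the function name `session_end_shares` is hypothetical.

```python
def session_end_shares(n_sessions: int, n_end_purchase: int, n_end_cart: int) -> dict:
    """Fraction of sessions flagged as ending in a purchase or in a cart."""
    return {
        "pct_sessions_end_purchase": n_end_purchase / n_sessions,
        "pct_sessions_end_cart": n_end_cart / n_sessions,
    }

# Counts for user 555510639 from the session_sum output above.
row = session_end_shares(n_sessions=44, n_end_purchase=4, n_end_cart=3)
print(row)
```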



In [51]:
full = full.join(temp, full.user_id == temp.user_id).drop(temp.user_id)
full.show(5)

+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_with_cart|sessions_with_view|pct_sessions_end_purchase|pct_sessions_end_cart|
+---------+------------------+----------

### Preview full dataframe

In [52]:
full.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- T_total_spend: double (nullable = true)
 |-- total_spend: double (nullable = true)
 |-- total_events: long (nullable = true)
 |-- total_sessions: long (nullable = true)
 |-- avg_session_length: double (nullable = true)
 |-- sd_session_length: double (nullable = true)
 |-- avg_interactions_per_session: double (nullable = true)
 |-- sd_interactions_per_session: double (nullable = true)
 |-- max_interactions_per_session: long (nullable = true)
 |-- purchase_pct_of_total_events: double (nullable = true)
 |-- view_pct_of_total_events: double (nullable = true)
 |-- cart_pct_of_total_events: double (nullable = true)
 |-- avg_purchases_per_session: double (nullable = true)
 |-- sd_purchases_per_session: double (nullable = true)
 |-- cart_events: long (nullable = true)
 |-- purchase_events: long (nullable = true)
 |-- view_events: long (nullable = true)
 |-- sessions_with_purchase: long (nullable = true)
 |-- sessions_with_cart: long (nullable =

In [53]:
full.show(1)

+---------+------------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_with_cart|sessions_with_view|pct_sessions_end_purchase|pct_sessions_end_cart|
+---------+------------------+----------

In [54]:
full.count()

36228

#### Coerce NAs in standard deviation columns to 0 (with only one observation there is no spread to measure, so we treat the standard deviation as 0)

In [55]:
full = full.fillna(0, subset=['sd_session_length', 'sd_interactions_per_session', 'sd_purchases_per_session'])
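
The missing values arise because the sample standard deviation divides by n - 1, which makes it undefined for a user with a single session. A minimal sketch in plain Python of the same fallback; the `sd_or_zero` helper and the session lengths are hypothetical and only mirror the intent of the `fillna(0, ...)` call above:

```python
import statistics

def sd_or_zero(values):
    """Sample standard deviation, falling back to 0.0 when it is undefined."""
    # statistics.stdev needs at least two data points (it divides by n - 1)
    if len(values) < 2:
        return 0.0
    return statistics.stdev(values)

single_session = [340.0]        # one session length, in seconds (invented)
two_sessions = [100.0, 300.0]   # two session lengths, in seconds (invented)

print(sd_or_zero(single_session))  # falls back to 0.0
print(sd_or_zero(two_sessions))    # ordinary sample standard deviation
```

Filling with 0 reads "no observed variation" into a case where variation could not be measured, which is a modelling choice rather than a statistical fact.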

### Remove errors/outliers

In [56]:
# Look at abnormally large number of sessions
full.select('total_sessions').sort(desc('total_sessions')).show(30)

+--------------+
|total_sessions|
+--------------+
|           532|
|           514|
|           317|
|           310|
|           305|
|           239|
|           229|
|           226|
|           219|
|           199|
|           172|
|           149|
|           142|
|           139|
|           137|
|           137|
|           136|
|           135|
|           135|
|           133|
|           132|
|           127|
|           127|
|           124|
|           123|
|           123|
|           122|
|           121|
|           120|
|           116|
+--------------+
only showing top 30 rows



In [57]:
# More than 300 sessions in a month (10 a day) is likely an error or a bot. Removing those rows.
full = full.filter(col('total_sessions') <= 300)
full.select('total_sessions').sort(desc('total_sessions')).show(5)
full.count()

+--------------+
|total_sessions|
+--------------+
|           239|
|           229|
|           226|
|           219|
|           199|
+--------------+
only showing top 5 rows



36223

In [58]:
# Look at abnormally long sessions
# An average session length greater than 8 hours, or 28800 seconds, is almost certainly an error or a bot.
full.select('avg_session_length').sort(desc('avg_session_length')).show(10)

+------------------+
|avg_session_length|
+------------------+
|         1907547.0|
|1634483.3333333333|
|1576443.6666666667|
|        1388163.75|
|1380737.3333333333|
|         1363813.0|
|        1281766.75|
|        1209332.75|
|         1125089.0|
|         1102053.5|
+------------------+
only showing top 10 rows



In [59]:
# Remove individuals with average session length greater than 8 hours (28800 seconds)
full = full.filter(col('avg_session_length') <= 28800)
full.select('avg_session_length').sort(desc('avg_session_length')).show(5)
full.count()

+------------------+
|avg_session_length|
+------------------+
|28758.166666666668|
|28700.155172413793|
|28679.555555555555|
|          28589.75|
| 28549.22033898305|
+------------------+
only showing top 5 rows



35029

In [60]:
# Look at outliers by total spend. 
full.select('total_spend', 'purchase_events', 'T_total_spend').sort(desc('total_spend')).show(50)
full.select('total_spend').summary().show()

+------------------+---------------+--------------------+
|       total_spend|purchase_events|       T_total_spend|
+------------------+---------------+--------------------+
| 5659847.821716309|             91|2.9127026746307373E7|
|4188340.5565185547|             39| 2.177730318988037E7|
| 4129082.121032715|            105| 2.901806042225647E7|
|2105782.1861572266|             60|   6999343.371139526|
|2076557.8867492676|             41|   8400942.108535767|
|1998014.7842254639|            101|     8270191.4427948|
|1979694.9905753136|            153| 2.891635742661667E7|
|1746181.7509765625|             69|   7444496.798171997|
|1630467.5068426132|             51|   7747082.560165405|
|1517930.4067382812|             26|   6770653.144775391|
|1516287.3457641602|             40|   5667982.841827393|
| 1239730.245552063|             76|  6686890.5963897705|
|1150882.8065872192|             55|   8165252.666542053|
|1105710.3974609375|             49|  6337675.5344696045|
| 1091065.0909

In [61]:
# Remove those that spend > 100k a month... they probably aren't our 'usual' customer.  
full = full.filter(col('total_spend') <= 100000)
full.select('total_spend').sort(desc('total_spend')).show(5)
full.count()

+-----------------+
|      total_spend|
+-----------------+
|99683.20916748047|
|99633.82038879395|
|98847.98069000244|
|98261.04052734375|
|97949.04107666016|
+-----------------+
only showing top 5 rows



34859

In [62]:
# Look at abnormally large numbers of events
# Greater than 3000 events is almost certainly a mistake. That's more than 100 events every day of the month
full.select('total_events').sort(desc('total_events')).show(50)
full.select('total_events').summary().show()

+------------+
|total_events|
+------------+
|       25095|
|       19958|
|       19770|
|       15515|
|       13680|
|       13552|
|       12870|
|       11250|
|       11125|
|       11000|
|       10989|
|       10803|
|       10200|
|        9724|
|        9720|
|        9450|
|        8825|
|        8484|
|        8437|
|        8154|
|        7994|
|        7975|
|        7672|
|        7545|
|        7220|
|        7140|
|        6399|
|        6372|
|        6356|
|        6355|
|        6330|
|        6190|
|        6036|
|        5577|
|        5460|
|        5382|
|        5348|
|        5340|
|        5300|
|        5268|
|        5130|
|        5100|
|        4964|
|        4914|
|        4784|
|        4725|
|        4653|
|        4640|
|        4620|
|        4564|
+------------+
only showing top 50 rows

+-------+-----------------+
|summary|     total_events|
+-------+-----------------+
|  count|            34859|
|   mean|91.43084999569695|
| stddev|426.20159393263

In [63]:
# Remove users with more than 3000 events in the month
full = full.filter(col('total_events') <= 3000)
full.select('total_events').sort(desc('total_events')).show(5)
full.count()

+------------+
|total_events|
+------------+
|        2928|
|        2928|
|        2928|
|        2916|
|        2900|
+------------+
only showing top 5 rows



34769
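
The four cut-offs used in this section all come from simple arithmetic on a 30-day month. The sketch below, in plain Python rather than Spark, collects them in one place and applies them to a few toy rows. The `THRESHOLDS` dictionary, the `keep_row` helper, and the rows are hypothetical and for illustration only. Combining the rules with a logical AND keeps the same rows as applying the filters one after another, as the cells above do.

```python
# Hypothetical thresholds mirroring the filters above (30-day month assumed)
DAYS_IN_MONTH = 30
THRESHOLDS = {
    "total_sessions": 10 * DAYS_IN_MONTH,   # 10 sessions a day -> 300
    "avg_session_length": 8 * 60 * 60,      # 8 hours in seconds -> 28800
    "total_spend": 100_000,                 # monthly spend cap
    "total_events": 100 * DAYS_IN_MONTH,    # 100 events a day -> 3000
}

def keep_row(row, thresholds=THRESHOLDS):
    """Return True when every thresholded feature is at or below its cap."""
    return all(row[name] <= cap for name, cap in thresholds.items())

# Toy rows, invented for illustration only
rows = [
    {"user_id": 1, "total_sessions": 12, "avg_session_length": 640.0, "total_spend": 310.5, "total_events": 95},
    {"user_id": 2, "total_sessions": 532, "avg_session_length": 900.0, "total_spend": 1200.0, "total_events": 2500},
    {"user_id": 3, "total_sessions": 40, "avg_session_length": 1907547.0, "total_spend": 50.0, "total_events": 400},
    {"user_id": 4, "total_sessions": 25, "avg_session_length": 700.0, "total_spend": 5659847.82, "total_events": 800},
    {"user_id": 5, "total_sessions": 200, "avg_session_length": 300.0, "total_spend": 900.0, "total_events": 25095},
    {"user_id": 6, "total_sessions": 300, "avg_session_length": 28800.0, "total_spend": 100000.0, "total_events": 3000},
]

kept = [row["user_id"] for row in rows if keep_row(row)]
print(kept)
```

User 6 sits exactly on every cap and is kept, because the filters use `<=` rather than `<`.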

#### Save as parquet (if saving in the project group12 folder, make sure to change permissions in bash using `chmod 777 filename`)

In [64]:
%%time
full.write.mode("overwrite").parquet("./processed_data/engineered_features.parquet")

CPU times: user 3.34 ms, sys: 2.17 ms, total: 5.52 ms
Wall time: 24.9 s


In [65]:
%%time
train, test = full.randomSplit([.8, .2], seed=42)

CPU times: user 1.96 ms, sys: 677 µs, total: 2.64 ms
Wall time: 18.6 ms


#### Purchased items in month 1, converted to PCA (pca_purchases)

Note: Unlike the rest of the preprocessing, the PCA model must be fit on the training set only and then applied to the test set; otherwise information from the test set would leak into the features. For this reason it comes after the train/test split.
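
The fit-then-apply pattern can be sketched on a toy two-dimensional example in plain Python. Everything here is hypothetical (`fit_pca_2d`, `transform_pca_2d`, and the data points), and the sketch centers with the training mean, which is a common convention and not a claim about what `pyspark.ml.feature.PCA` does internally. The point is only that the mean and the component are computed from the training rows and then reused unchanged on the test rows.

```python
import math

def fit_pca_2d(train_rows):
    """Fit a one-component PCA on 2-D training rows; returns (mean, component)."""
    n = len(train_rows)
    mean_x = sum(x for x, _ in train_rows) / n
    mean_y = sum(y for _, y in train_rows) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mean_x) ** 2 for x, _ in train_rows) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in train_rows) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in train_rows) / (n - 1)
    # Angle of the direction of largest variance for a symmetric 2x2 matrix
    theta = 0.5 * math.atan2(2 * b, a - c)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))

def transform_pca_2d(rows, mean, component):
    """Project rows onto the fitted component, centering with the training mean."""
    return [(x - mean[0]) * component[0] + (y - mean[1]) * component[1]
            for x, y in rows]

train_rows = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # invented training data
test_rows = [(3.0, 3.0)]                           # invented test data

# Fit on the training rows only, then reuse the fitted values on both sets
mean, component = fit_pca_2d(train_rows)
train_scores = transform_pca_2d(train_rows, mean, component)
test_scores = transform_pca_2d(test_rows, mean, component)
print(component, test_scores)
```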

In [66]:
%%time

# Create a function that prepares a dataset for PCA.

def pca_prepare_on_subset(subset_df, limited_columns=None):
    # Keep only the month-1 interactions of users in this subset (training or test set)
    m1_subset = m1.join(subset_df, 'user_id', 'leftsemi')

    # Replace the periods in category_code with dashes. Spark reads a period in a column
    #  name as nested-field access unless the name is quoted with backticks
    m1_stripped = m1_subset.withColumn('category_code_s', translate('category_code', '.', '-'))

    # Pivot so that each category of purchase becomes a column
    cats = m1_stripped.filter(col('event_type') == "purchase").groupBy('user_id').pivot('category_code_s').count().na.fill(0)

    if limited_columns is None:
        # Training set: record the category columns (plus user_id) so the test set can reuse them
        pca_input_cols = [c for c in cats.columns if c != 'user_id' and c != 'null']
        limited_columns = pca_input_cols + ['user_id']
    else:
        # Test set: use exactly the training columns, zero-filling categories not purchased in this subset
        pca_input_cols = [c for c in limited_columns if c != 'user_id']
        for c in pca_input_cols:
            if c not in cats.columns:
                cats = cats.withColumn(c, lit(0))
        cats = cats.select(*limited_columns)

    # Assemble the category columns into a single vector column (prepare for PCA)
    assembler = VectorAssembler(
        inputCols=pca_input_cols,
        outputCol="to_pca_columns")

    # Create the vector column
    pca_df = assembler.transform(cats)
    return limited_columns, pca_df
    

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 6.91 µs
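
The `limited_columns` hand-off exists so that the training and test vectors share the same columns in the same order; otherwise position 5 of a test vector could mean a different category than position 5 of a training vector. A minimal plain-Python sketch of that alignment, in which the `align_to_training_columns` helper, the user ids, and the category names are all invented for illustration:

```python
def align_to_training_columns(row_counts, training_columns):
    """Order one user's purchase counts like training_columns.

    Categories the user never purchased become 0; categories that do not
    appear in training_columns are dropped.
    """
    return [row_counts.get(name, 0) for name in training_columns]

# Purchase counts per user and category (toy data)
train_counts = {
    101: {"electronics-smartphone": 2, "appliances-kitchen-kettle": 1},
    102: {"electronics-smartphone": 1},
}
test_counts = {
    201: {"electronics-smartphone": 1, "furniture-bedroom-bed": 3},
}

# The column list is derived from the training set only
training_columns = sorted({name for counts in train_counts.values() for name in counts})

train_vectors = {user: align_to_training_columns(counts, training_columns)
                 for user, counts in train_counts.items()}
test_vectors = {user: align_to_training_columns(counts, training_columns)
                for user, counts in test_counts.items()}

print(training_columns)
print(train_vectors)
print(test_vectors)
```

The test user's `furniture-bedroom-bed` purchases are dropped because that category was never seen in training, so the fitted PCA model has no dimension for it.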


In [67]:
# Get columns from training set and get training df
limited_columns, train_pre_pca = pca_prepare_on_subset(train)
# Limit to these columns on the test set and get test df
_, test_pre_pca = pca_prepare_on_subset(test, limited_columns=limited_columns)

# Visualize what this looks like
train_pre_pca.select(["user_id","to_pca_columns"]).show(2, truncate=False)

# Create new PCA instance
pca = PCAml(k=30, inputCol="to_pca_columns", outputCol="pca_purchases")
# Fit on training data
model = pca.fit(train_pre_pca)

# Transform training and test sets
train_with_pca = model.transform(train_pre_pca)
test_with_pca = model.transform(test_pre_pca)


+---------+--------------+
|user_id  |to_pca_columns|
+---------+--------------+
|513240274|(132,[],[])   |
|513343186|(132,[],[])   |
+---------+--------------+
only showing top 2 rows



In [68]:
# Merge PCA df back into full training set
join_train_df = train_with_pca.select(["user_id","pca_purchases"])
train = train.join(join_train_df, train.user_id == join_train_df.user_id).drop(join_train_df.user_id)

# Merge PCA df back into full test set
join_test_df = test_with_pca.select(["user_id","pca_purchases"])
test = test.join(join_test_df, test.user_id == join_test_df.user_id).drop(join_test_df.user_id)

In [69]:
train.show(5, truncate=False)
test.show(5, truncate=False)

+---------+------------------+------------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

#### Write train and test

In [70]:
%%time
train.write.mode("overwrite").parquet("./processed_data/train.parquet")
test.write.mode("overwrite").parquet("./processed_data/test.parquet")

CPU times: user 7.25 ms, sys: 3.61 ms, total: 10.9 ms
Wall time: 58.3 s
