## File 02 - Feature Creation

In this file, we create new features from our interaction-level dataset, handle obvious errors/outliers, and perform PCA. 

### Set up Spark session and data schema

We can specify more options in the SparkSession builder; for now everything except the application name is left at its default setting.

In [172]:
%%time
import copy
import datetime as dt
import sys

import matplotlib.pyplot as plt
from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler
from pyspark.ml.feature import PCA as PCAml
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql.types import DoubleType, IntegerType
# NOTE: the wildcard import replaces Python's built-in max, min, sum (and others)
# with the Spark column functions of the same name; later cells rely on this.
from pyspark.sql.functions import *

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

#schema = "`event_time` TIMESTAMP,`event_type` STRING,`product_id` INT,`category_id` BIGINT,`category_code` STRING,`brand` STRING,`price` FLOAT,`user_id` INT,`user_session` STRING"
#ddl_schema = T._parse_datatype_string(schema)

CPU times: user 831 µs, sys: 485 µs, total: 1.32 ms
Wall time: 942 µs


See https://docs.google.com/document/d/1NG4KGticBXn0D3PL5_zMxLV2Pr7A8PQtLcasxCOd1nA/edit for the table of features.

### Read in data

In [173]:
%%time
full = spark.read.parquet("./processed_data/preprocessed_01.parquet")
m1 = spark.read.parquet("./processed_data/month_01_filtered.parquet") # This brings in the data we can create additional features from

CPU times: user 3.7 ms, sys: 168 µs, total: 3.87 ms
Wall time: 519 ms


In [174]:
print(full.count())
full.show(5)

219080
+---------+--------------+------------------+------------+--------------+
|  user_id| T_total_spend|       total_spend|total_events|total_sessions|
+---------+--------------+------------------+------------+--------------+
|413580824|           0.0|               0.0|           3|             2|
|428963270|           0.0|               0.0|          68|             8|
|501050566|           0.0|               0.0|           2|             1|
|512378217|           0.0|               0.0|          25|             6|
|512394054|39439.12109375|236.82000732421875|          56|            16|
+---------+--------------+------------------+------------+--------------+
only showing top 5 rows



In [175]:
print(m1.count())
m1.show(5)

2807167
+---------+-------------------+----------+----------+-------------------+--------------------+--------+------+--------------------+
|  user_id|         event_time|event_type|product_id|        category_id|       category_code|   brand| price|        user_session|
+---------+-------------------+----------+----------+-------------------+--------------------+--------+------+--------------------+
|416793411|2020-01-12 13:16:29|      view|   1004836|2232732093077520756|construction.tool...| samsung|231.38|315c1383-b002-4c3...|
|465783976|2020-01-04 10:36:20|      view|  13901213|2053013557343158789|construction.comp...|  blanco|218.65|75f6bddc-41f8-497...|
|465783976|2020-01-04 10:37:14|      view|  13902800|2053013561092866779|   computers.desktop|  blanco|140.35|75f6bddc-41f8-497...|
|465783976|2020-01-04 10:38:07|      view|  13902800|2053013561092866779|   computers.desktop|  blanco|140.35|75f6bddc-41f8-497...|
|465783976|2020-01-04 10:40:08|      view|  13902647|205301356109286

## Begin Creating Features
### Create each feature at the individual (user) level, then join it to full
##### NOTE: Rename all features so that their names do not contain parentheses, which are not compatible with saving to parquet
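
The renaming rule in the note above can be sketched in plain Python. The helper name `safe_column_name` and its rule (collapse every run of characters other than letters, digits, or underscores into one underscore) are assumptions for illustration only; the cells in this notebook rename each aggregate by hand with `withColumnRenamed`.

```python
import re

def safe_column_name(name: str) -> str:
    """Hypothetical helper: turn an aggregate name such as 'avg(session_length)'
    into a name without parentheses."""
    # Replace each run of characters other than letters, digits, or underscore
    # with a single underscore, then trim underscores left at either end.
    return re.sub(r"[^0-9A-Za-z_]+", "_", name).strip("_")

for raw in ["avg(session_length)", "max(event_time)", "stddev_samp(count(event_type))"]:
    print(raw, "->", safe_column_name(raw))
```

The generated names will not always match the shorter hand-picked ones used below (for example `sd_session_length`), so this is only a fallback for catching stray parentheses before writing to parquet.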

_________________

#### Average Session Duration (avg_session_length)

In [176]:
session_ends = m1.groupBy('user_id', 'user_session').agg(max('event_time'), min('event_time'))

In [177]:
session_ends.show(5)

+---------+--------------------+-------------------+-------------------+
|  user_id|        user_session|    max(event_time)|    min(event_time)|
+---------+--------------------+-------------------+-------------------+
|514283509|ba808472-24ca-4a5...|2020-01-19 12:30:36|2020-01-19 12:30:36|
|524368563|c00a9199-98cb-412...|2020-01-08 11:43:08|2020-01-08 11:41:06|
|554064616|04d50e0d-b80c-44e...|2020-01-08 10:02:28|2020-01-08 10:02:28|
|554426664|a6aebb81-e68d-4fc...|2020-01-20 11:57:27|2020-01-20 11:48:18|
|560605592|535c77ff-60a7-432...|2020-01-21 13:03:42|2020-01-21 12:53:09|
+---------+--------------------+-------------------+-------------------+
only showing top 5 rows



In [178]:
session_ends = session_ends.withColumn('session_length', (col("max(event_time)").cast('long') - col("min(event_time)").cast('long')))
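
As a sanity check on the subtraction above, the same session length can be computed in plain Python from the first and last event time of a session. This is a minimal sketch, not the Spark code path; the function name `session_length_seconds` is hypothetical, and the timestamps are copied from the `session_ends` preview above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def session_length_seconds(first_event: str, last_event: str) -> int:
    """Seconds between the first and last event of one session."""
    start = datetime.strptime(first_event, FMT)
    end = datetime.strptime(last_event, FMT)
    return int((end - start).total_seconds())

# (min(event_time), max(event_time)) pairs copied from the preview above.
print(session_length_seconds("2020-01-08 11:41:06", "2020-01-08 11:43:08"))  # 122
print(session_length_seconds("2020-01-20 11:48:18", "2020-01-20 11:57:27"))  # 549
print(session_length_seconds("2020-01-19 12:30:36", "2020-01-19 12:30:36"))  # 0
```

A session with a single event has identical first and last timestamps, so its length is 0 seconds.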

In [179]:
session_ends.orderBy(col("session_length").desc()).show(5)
# NOTE: Many of these sessions are unreasonably long; the longest span almost the entire month (about 30 days)

+---------+--------------------+-------------------+-------------------+--------------+
|  user_id|        user_session|    max(event_time)|    min(event_time)|session_length|
+---------+--------------------+-------------------+-------------------+--------------+
|550527121|9124b2c1-02e4-4cc...|2020-01-31 22:39:05|2020-01-01 07:10:21|       2647724|
|593313269|bcaf86f2-1c1d-420...|2020-01-31 16:40:45|2020-01-01 05:36:13|       2631872|
|516733273|2b0fc08b-bd1d-439...|2020-01-31 14:24:46|2020-01-01 07:09:16|       2618130|
|566985224|9d0368d8-c6ac-42d...|2020-01-31 14:45:42|2020-01-01 07:41:22|       2617460|
|542394994|1b9f919f-f044-4b5...|2020-01-31 14:14:10|2020-01-01 08:02:12|       2614318|
+---------+--------------------+-------------------+-------------------+--------------+
only showing top 5 rows
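
To put the numbers above in perspective, the sketch below converts session lengths from seconds to days and flags sessions that exceed a cutoff. The one-day cutoff `MAX_REASONABLE_SECONDS` is a hypothetical value chosen for illustration, not a threshold used in this notebook.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MAX_REASONABLE_SECONDS = SECONDS_PER_DAY  # hypothetical cutoff: one day

def describe_session(length_seconds: int) -> tuple:
    """Return (length in days, whether the session exceeds the cutoff)."""
    return length_seconds / SECONDS_PER_DAY, length_seconds > MAX_REASONABLE_SECONDS

# Two session lengths copied from the output above, plus one ordinary session.
for length in [2647724, 2631872, 633]:
    days, too_long = describe_session(length)
    print(f"{length:>8} s = {days:6.2f} days, flagged: {too_long}")
```

The longest session works out to roughly 30.6 days, which suggests a reused session ID rather than one continuous visit.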



In [180]:
avg_sess = session_ends.groupBy('user_id').avg('session_length').withColumnRenamed('avg(session_length)', "avg_session_length")

In [181]:
avg_sess.show(5)

+---------+------------------+
|  user_id|avg_session_length|
+---------+------------------+
|512700240| 336.2857142857143|
|514013554|         44762.375|
|514338009| 161.1595744680851|
|514747635|               0.0|
|516092497|108.26666666666667|
+---------+------------------+
only showing top 5 rows



In [182]:
full = full.join(avg_sess, full.user_id == avg_sess.user_id).drop(avg_sess.user_id)
print(full.count())
full.show(5)

219080
+---------+------------------+------------------+------------+--------------+------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|
+---------+------------------+------------------+------------+--------------+------------------+
|512700240|14504.489959716797|1436.1799926757812|          66|             7| 336.2857142857143|
|514013554| 7735.519714355469|               0.0|          52|             8|         44762.375|
|514338009| 2786062.823135376| 269863.6780014038|       23472|            94| 161.1595744680851|
|514747635|               0.0|               0.0|           1|             1|               0.0|
|516092497|               0.0|               0.0|          38|            15|108.26666666666667|
+---------+------------------+------------------+------------+--------------+------------------+
only showing top 5 rows
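
Printing the row count after the join checks that the inner join did not drop any users: the count is still 219080. The sketch below shows why this matters, using a plain-Python inner join over hypothetical toy tables keyed by `user_id`.

```python
def inner_join(left: dict, right: dict) -> dict:
    """Inner join of two {user_id: row_dict} tables on user_id."""
    return {uid: {**left[uid], **right[uid]} for uid in left if uid in right}

# Hypothetical toy tables keyed by user_id.
full_rows = {1: {"total_events": 3}, 2: {"total_events": 68}, 3: {"total_events": 2}}
feature_ok = {1: {"avg_session_length": 0.0},
              2: {"avg_session_length": 12.5},
              3: {"avg_session_length": 4.0}}
feature_missing = {1: {"avg_session_length": 0.0}}  # users 2 and 3 are absent

print(len(inner_join(full_rows, feature_ok)))       # 3: every user is kept
print(len(inner_join(full_rows, feature_missing)))  # 1: unmatched users are dropped
```

If a feature table ever lacks some users, a left join from `full` would keep them with null feature values instead of silently removing them.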



#### Std Deviation of session duration by person (sd_session_length)

In [183]:
session_ends.show(5)

+---------+--------------------+-------------------+-------------------+--------------+
|  user_id|        user_session|    max(event_time)|    min(event_time)|session_length|
+---------+--------------------+-------------------+-------------------+--------------+
|514283509|ba808472-24ca-4a5...|2020-01-19 12:30:36|2020-01-19 12:30:36|             0|
|524368563|c00a9199-98cb-412...|2020-01-08 11:43:08|2020-01-08 11:41:06|           122|
|554064616|04d50e0d-b80c-44e...|2020-01-08 10:02:28|2020-01-08 10:02:28|             0|
|554426664|a6aebb81-e68d-4fc...|2020-01-20 11:57:27|2020-01-20 11:48:18|           549|
|560605592|535c77ff-60a7-432...|2020-01-21 13:03:42|2020-01-21 12:53:09|           633|
+---------+--------------------+-------------------+-------------------+--------------+
only showing top 5 rows



In [184]:
sd_session_length = session_ends.groupBy('user_id') \
                                 .agg(stddev('session_length')) \
                                 .withColumnRenamed("stddev_samp(session_length)", 'sd_session_length')

In [185]:
sd_session_length.show(5)

+---------+------------------+
|  user_id| sd_session_length|
+---------+------------------+
|512700240|358.21209466167494|
|514013554|126444.70285909568|
|514338009| 255.1480827963551|
|514747635|              null|
|516092497|201.15149467384978|
+---------+------------------+
only showing top 5 rows



In [186]:
full = full.join(sd_session_length, full.user_id == sd_session_length.user_id).drop(sd_session_length.user_id)
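
The `null` for user 514747635 above is expected: the aggregate is a sample standard deviation (`stddev_samp`), which divides by n - 1 and is therefore undefined for a user with a single session. A minimal sketch with the standard library; the helper name `sample_sd` is hypothetical.

```python
import statistics
from typing import Optional, Sequence

def sample_sd(values: Sequence[float]) -> Optional[float]:
    """Sample standard deviation (n - 1 denominator); None for fewer than two values."""
    if len(values) < 2:
        return None  # mirrors the null shown for single-session users
    return statistics.stdev(values)

print(sample_sd([0]))         # None: one session, so no spread can be estimated
print(sample_sd([0, 0]))      # 0.0: two sessions of equal length
print(sample_sd([122, 549]))  # two sessions of different lengths
```

Whether to leave these nulls in place or fill them before PCA is a modeling choice that this sketch does not make.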

#### UNFINISHED: Time from last interaction to end of month (seconds)


In [187]:
# new_month = dt.datetime(2020,2,1,0,0).timestamp() # Epoch seconds of Feb 1, 2020 at midnight in the machine's local time zone (the datetime is naive)
# new_month

In [188]:
# last_interaction = m1.groupBy('user_id').agg(max('event_time'))
# last_interaction.show()

In [189]:
## last_int_dist = last_interaction.withColumn('time_from_end_month', timestamp(1580533200) - col('max(event_time)'))
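
The unfinished cells above aim at the number of seconds between a user's last interaction and the start of February. The sketch below shows the arithmetic in plain Python under two assumptions that the notebook does not confirm: event times are treated as being in the same time zone as the month boundary, and the helper name `seconds_to_month_end` is hypothetical. It also shows that the constant 1580533200 in the commented cell is midnight on Feb 1, 2020 at UTC-5, while midnight UTC is 1580515200.

```python
from datetime import datetime, timedelta, timezone

def seconds_to_month_end(last_event: str, month_end: datetime) -> int:
    """Seconds from a user's last event to the month boundary.
    Assumes last_event is 'YYYY-MM-DD HH:MM:SS' in the same zone as month_end."""
    last = datetime.strptime(last_event, "%Y-%m-%d %H:%M:%S")
    last = last.replace(tzinfo=month_end.tzinfo)
    return int((month_end - last).total_seconds())

feb1_utc = datetime(2020, 2, 1, tzinfo=timezone.utc)
feb1_utc_minus_5 = datetime(2020, 2, 1, tzinfo=timezone(timedelta(hours=-5)))

print(int(feb1_utc.timestamp()))          # 1580515200
print(int(feb1_utc_minus_5.timestamp()))  # 1580533200, the constant in the cell above
print(seconds_to_month_end("2020-01-31 22:39:05", feb1_utc))  # 4855
```

Using an explicit time zone avoids the result depending on where the notebook happens to run.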

#### Average number of interactions per session (avg_interactions_per_session)

In [190]:
interactions_per_session = m1.groupBy('user_id', 'user_session').agg(count('event_type'))

In [191]:
interactions_per_session.show(5)

+---------+--------------------+-----------------+
|  user_id|        user_session|count(event_type)|
+---------+--------------------+-----------------+
|514283509|ba808472-24ca-4a5...|                1|
|524368563|c00a9199-98cb-412...|                4|
|554064616|04d50e0d-b80c-44e...|                1|
|554426664|a6aebb81-e68d-4fc...|                9|
|560605592|535c77ff-60a7-432...|                6|
+---------+--------------------+-----------------+
only showing top 5 rows



In [192]:
avg_interactions_per_session = interactions_per_session.groupBy('user_id').avg('count(event_type)')

In [193]:
avg_interactions_per_session = avg_interactions_per_session.withColumnRenamed('avg(count(event_type))', "avg_interactions_per_session")
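
Averaging the per-session counts for a user is the same as dividing that user's total events by their number of sessions. A minimal plain-Python sketch with hypothetical per-session counts; the function name `avg_interactions` is also hypothetical.

```python
from statistics import mean

def avg_interactions(session_counts):
    """Mean number of events per session for one user."""
    return mean(session_counts)

# Hypothetical per-session event counts for one user.
counts = [1, 4, 1, 9, 6]
print(avg_interactions(counts))   # 4.2
print(sum(counts) / len(counts))  # 4.2: total events divided by number of sessions
```

The same identity holds in this notebook's output: user 501980918 has 2141 events over 40 sessions, and 2141 / 40 = 53.525.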

In [194]:
full = full.join(avg_interactions_per_session, full.user_id == avg_interactions_per_session.user_id).drop(avg_interactions_per_session.user_id)
full.show(5)

+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|
+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+
|405614124|          0.0|        0.0|           2|             2|               0.0|               0.0|                         1.0|
|485991194|          0.0|        0.0|           3|             1|             398.0|              null|                         3.0|
|496765250|          0.0|        0.0|           1|             1|               0.0|              null|                         1.0|
|501980918|          0.0|        0.0|        2141|            40|          1955.925|3099.7057626077335|                      53.525|
|502621333|          0.0|        0.0|           5|             3|221.

#### Std Deviation of number of interactions per session per person (sd_interactions_per_session)

In [195]:
std_interactions_per_session = interactions_per_session.groupBy('user_id') \
                                                       .agg(stddev('count(event_type)')) \
                                                       .withColumnRenamed("stddev_samp(count(event_type))", 'sd_interactions_per_session')
std_interactions_per_session.show(5)

+---------+---------------------------+
|  user_id|sd_interactions_per_session|
+---------+---------------------------+
|512700240|          2.138089935299395|
|514013554|         3.4121631178560534|
|514338009|           6.08820433161764|
|514747635|                       null|
|516092497|         1.8464895909600494|
+---------+---------------------------+
only showing top 5 rows



In [196]:
full = full.join(std_interactions_per_session, full.user_id == std_interactions_per_session.user_id).drop(std_interactions_per_session.user_id)
full.show(5)

+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|
+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+
|405614124|          0.0|        0.0|           2|             2|               0.0|               0.0|                         1.0|                        0.0|
|485991194|          0.0|        0.0|           3|             1|             398.0|              null|                         3.0|                       null|
|496765250|          0.0|        0.0|           1|             1|               0.0|              null|                         1.0|                       null|
|501980918|          0.0|        0

#### Max number of interactions within one session (max_interactions_per_session)

In [197]:
max_interactions_per_session = interactions_per_session.groupBy('user_id').max('count(event_type)')

In [198]:
max_interactions_per_session = max_interactions_per_session.withColumnRenamed('max(count(event_type))', "max_interactions_per_session")

In [199]:
max_interactions_per_session.show(1)

+---------+----------------------------+
|  user_id|max_interactions_per_session|
+---------+----------------------------+
|512700240|                           7|
+---------+----------------------------+
only showing top 1 row



In [200]:
full = full.join(max_interactions_per_session, full.user_id == max_interactions_per_session.user_id).drop(max_interactions_per_session.user_id)
full.show(5)

+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|
+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+
|405614124|          0.0|        0.0|           2|             2|               0.0|               0.0|                         1.0|                        0.0|                           1|
|485991194|          0.0|        0.0|           3|             1|             398.0|              null|                         3.0|                       null|                           3|
|496765250|          0.0|        0.0|           1|

#### Share of total events by type (purchase, cart, view), stored as fractions (purchase_pct_of_total_events, cart_pct_of_total_events, view_pct_of_total_events)

In [201]:
event_counts = m1.groupBy('user_id', 'user_session').pivot('event_type').agg(count('event_type'))
# Here the three types of event count are pivoted out for later tabulation

In [202]:
event_counts = event_counts.fillna(0) # Replace nulls with 0 so the sums below work
event_counts.show(5)

+---------+--------------------+----+--------+----+
|  user_id|        user_session|cart|purchase|view|
+---------+--------------------+----+--------+----+
|602491865|7ec0e1e6-94e0-493...|   0|       0|   4|
|552639168|d335b338-322b-4eb...|   1|       1|   2|
|581273021|847d49fa-06a5-438...|   0|       0|   3|
|515047041|591cd0ea-f290-47c...|   0|       0|   1|
|591332625|1f8c24dd-9574-47c...|   1|       0|  14|
+---------+--------------------+----+--------+----+
only showing top 5 rows
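
The pivot and the `fillna(0)` above can be pictured with a small plain-Python equivalent: count events per (user, session, type) and report 0 for any type that never occurs in a session. This is a sketch on hypothetical raw events, not a replacement for the Spark code; the function name `pivot_event_counts` is hypothetical.

```python
from collections import Counter, defaultdict

EVENT_TYPES = ("cart", "purchase", "view")

def pivot_event_counts(events):
    """events: iterable of (user_id, user_session, event_type) tuples.
    Returns {(user_id, user_session): {event_type: count}} with zeros filled in."""
    counts = defaultdict(Counter)
    for user_id, session, event_type in events:
        counts[(user_id, session)][event_type] += 1
    # A Counter returns 0 for missing keys, which plays the role of fillna(0).
    return {key: {t: c[t] for t in EVENT_TYPES} for key, c in counts.items()}

# Hypothetical raw events for two sessions.
events = [
    (1, "a", "view"), (1, "a", "view"), (1, "a", "cart"), (1, "a", "purchase"),
    (2, "b", "view"),
]
print(pivot_event_counts(events))
```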



In [203]:
events_per_session = event_counts.withColumn('events_per_session_total', col('cart') + col('purchase') + col('view')) 
# Get total number of events per session

In [204]:
events_per_session.show(5)

+---------+--------------------+----+--------+----+------------------------+
|  user_id|        user_session|cart|purchase|view|events_per_session_total|
+---------+--------------------+----+--------+----+------------------------+
|602491865|7ec0e1e6-94e0-493...|   0|       0|   4|                       4|
|552639168|d335b338-322b-4eb...|   1|       1|   2|                       4|
|581273021|847d49fa-06a5-438...|   0|       0|   3|                       3|
|515047041|591cd0ea-f290-47c...|   0|       0|   1|                       1|
|591332625|1f8c24dd-9574-47c...|   1|       0|  14|                      15|
+---------+--------------------+----+--------+----+------------------------+
only showing top 5 rows



In [205]:
pct_events = events_per_session.groupBy('user_id').sum()

In [206]:
pct_totalevents = pct_events.withColumn('purchase_pct_of_total_events', col('sum(purchase)')/col('sum(events_per_session_total)')) \
                  .withColumn('view_pct_of_total_events', col('sum(view)')/col('sum(events_per_session_total)')) \
                  .withColumn('cart_pct_of_total_events', col('sum(cart)')/col('sum(events_per_session_total)'))

In [207]:
merge_me = pct_totalevents.select('user_id', 'purchase_pct_of_total_events', 'view_pct_of_total_events', 'cart_pct_of_total_events')
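
The three ratios computed above are fractions between 0 and 1 (not values between 0 and 100) and sum to 1 for every user. A minimal plain-Python sketch; the function name `event_type_shares` is hypothetical, and the input counts are the monthly totals this notebook reports for user 512700240.

```python
def event_type_shares(cart: int, purchase: int, view: int) -> dict:
    """Fraction of a user's events of each type.
    Assumes at least one event, which holds for every user in the data."""
    total = cart + purchase + view
    return {
        "purchase_pct_of_total_events": purchase / total,
        "view_pct_of_total_events": view / total,
        "cart_pct_of_total_events": cart / total,
    }

# Monthly totals for user 512700240: 6 cart, 4 purchase, 23 view events.
shares = event_type_shares(cart=6, purchase=4, view=23)
print(shares)
print(sum(shares.values()))
```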

In [208]:
full = full.join(merge_me, full.user_id == merge_me.user_id).drop(merge_me.user_id)
full.show(5)

+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|
+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+
|405614124|          0.0|        0.0|           2|             2|               0.0|               0.0|                         1.0|                        0.0|                           1|    

#### Average number of purchases per session (avg_purchases_per_session)

In [209]:
avg_purchases_per_session = events_per_session.groupBy('user_id').avg('purchase').withColumnRenamed('avg(purchase)', "avg_purchases_per_session")

In [210]:
avg_purchases_per_session.show(5)

+---------+-------------------------+
|  user_id|avg_purchases_per_session|
+---------+-------------------------+
|512700240|       0.5714285714285714|
|539141084|                     0.04|
|606753158|                      0.0|
|575813444|                      0.0|
|581503547|                      0.0|
+---------+-------------------------+
only showing top 5 rows



In [211]:
full = full.join(avg_purchases_per_session, full.user_id == avg_purchases_per_session.user_id).drop(avg_purchases_per_session.user_id)
full.show(5)

+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|
+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+
|405614124|          0.0|        0.0|           2|             2|               0.0|               0.0|            

#### Std Deviation of number of purchases per session per person (sd_purchases_per_session)

In [212]:
std_purchases_per_session = events_per_session.groupBy('user_id') \
                                              .agg(stddev('purchase')) \
                                              .withColumnRenamed('stddev_samp(purchase)', "sd_purchases_per_session")
std_purchases_per_session.show(5)

+---------+------------------------+
|  user_id|sd_purchases_per_session|
+---------+------------------------+
|512700240|      0.5345224838248488|
|539141084|                     0.2|
|606753158|                     0.0|
|575813444|                     0.0|
|581503547|                     0.0|
+---------+------------------------+
only showing top 5 rows
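
The mean and standard deviation for user 512700240 can be reproduced by hand. The per-session layout below is inferred from other outputs in this notebook (7 sessions, 4 purchases, 4 sessions containing a purchase), so treat it as an assumption rather than the raw data.

```python
from statistics import mean, stdev

# Inferred layout: 4 sessions with exactly one purchase each, 3 sessions with none.
purchases_per_session = [1, 1, 1, 1, 0, 0, 0]

avg_purchases = mean(purchases_per_session)  # 4 / 7
sd_purchases = stdev(purchases_per_session)  # sample SD, equal to sqrt(2 / 7)

print(avg_purchases)
print(sd_purchases)
```

Both values agree with the 0.5714... and 0.5345... shown in the outputs for this user, which confirms that the aggregate is the sample (n - 1) standard deviation.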



In [213]:
full = full.join(std_purchases_per_session, full.user_id == std_purchases_per_session.user_id).drop(std_purchases_per_session.user_id)
full.show(5)

+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|
+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+
|405614124|          0.0|        0.0|   

#### Total number of each type of event over the whole month (cart_events, purchase_events, view_events)

In [214]:
event_counts_month = event_counts.groupBy('user_id').sum('cart', 'purchase', 'view')\
                     .withColumnRenamed('sum(cart)', 'cart_events') \
                     .withColumnRenamed('sum(purchase)', 'purchase_events') \
                     .withColumnRenamed('sum(view)', 'view_events')

In [215]:
event_counts_month.show(5)

+---------+-----------+---------------+-----------+
|  user_id|cart_events|purchase_events|view_events|
+---------+-----------+---------------+-----------+
|512700240|          6|              4|         23|
|539141084|          2|              1|        184|
|606753158|          0|              0|          5|
|575813444|          0|              0|          9|
|581503547|          2|              0|         57|
+---------+-----------+---------------+-----------+
only showing top 5 rows



In [216]:
full = full.join(event_counts_month, full.user_id == event_counts_month.user_id).drop(event_counts_month.user_id)
full.show(5)

+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|
+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+------------

#### Total number of sessions that contain each event type over the whole month (sessions_with_purchase, sessions_with_cart, sessions_with_view)

In [217]:
# Turn per-session counts into 0/1 flags: 1 if the session has at least one event of that type
events_over_month = events_per_session.withColumn('purchase_events', when(col('purchase') == 0, 0).otherwise(1)) \
                                      .withColumn('cart_events', when(col('cart')==0, 0).otherwise(1)) \
                                      .withColumn('view_events', when(col('view')==0, 0).otherwise(1))

In [218]:
num_sesh_containing_event = events_over_month.groupBy('user_id').sum('purchase_events', "cart_events", "view_events") \
                            .withColumnRenamed("sum(purchase_events)", "sessions_with_purchase") \
                            .withColumnRenamed("sum(cart_events)", "sessions_with_cart") \
                            .withColumnRenamed("sum(view_events)", "sessions_with_view")

In [219]:
num_sesh_containing_event.show(5)

+---------+----------------------+------------------+------------------+
|  user_id|sessions_with_purchase|sessions_with_cart|sessions_with_view|
+---------+----------------------+------------------+------------------+
|512700240|                     4|                 4|                 7|
|539141084|                     1|                 1|                25|
|606753158|                     0|                 0|                 2|
|575813444|                     0|                 0|                 7|
|581503547|                     0|                 2|                10|
+---------+----------------------+------------------+------------------+
only showing top 5 rows
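
Summing the 0/1 flags per user counts the sessions that contain at least one event of a given type. A plain-Python sketch on hypothetical per-session counts; the function name `sessions_containing` is hypothetical.

```python
def sessions_containing(session_counts, event_type):
    """Number of sessions with at least one event of the given type.
    session_counts: list of {event_type: count} dicts, one per session."""
    return sum(1 for counts in session_counts if counts[event_type] > 0)

# Hypothetical per-session counts for one user.
sessions = [
    {"cart": 1, "purchase": 1, "view": 2},
    {"cart": 0, "purchase": 0, "view": 3},
    {"cart": 2, "purchase": 0, "view": 14},
]
print(sessions_containing(sessions, "purchase"))  # 1
print(sessions_containing(sessions, "cart"))      # 2
print(sessions_containing(sessions, "view"))      # 3
```

Note that a session with two cart events still counts once, which is what separates these features from the monthly event totals.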



In [220]:
full = full.join(num_sesh_containing_event, full.user_id == num_sesh_containing_event.user_id).drop(num_sesh_containing_event.user_id)
full.show(5)

+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_with_cart|sessions_with_view|
+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+--------------------------

#### Share of an individual's sessions that end in purchase/cart (pct_sessions_end_purchase, pct_sessions_end_cart)

In [221]:
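# end_purchase = 1 if the session contains any purchase; end_cart = 1 if it contains
# a cart event but no purchase. Event order within the session is not checked.
session_ends2 = event_counts.withColumn('end_purchase', \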
                                when(col('purchase') != 0, 1) \
                                .otherwise(0)) \
                            .withColumn('end_cart', \
                                when((col("purchase") == 0) & (col("cart") != 0), 1) \
                                .otherwise(0))
session_ends2.show(5)

+---------+--------------------+----+--------+----+------------+--------+
|  user_id|        user_session|cart|purchase|view|end_purchase|end_cart|
+---------+--------------------+----+--------+----+------------+--------+
|602491865|7ec0e1e6-94e0-493...|   0|       0|   4|           0|       0|
|552639168|d335b338-322b-4eb...|   1|       1|   2|           1|       0|
|581273021|847d49fa-06a5-438...|   0|       0|   3|           0|       0|
|515047041|591cd0ea-f290-47c...|   0|       0|   1|           0|       0|
|591332625|1f8c24dd-9574-47c...|   1|       0|  14|           0|       1|
+---------+--------------------+----+--------+----+------------+--------+
only showing top 5 rows



In [222]:
session_sum = session_ends2.groupBy('user_id').agg(count('user_session'), sum('end_purchase'), sum('end_cart'))
session_sum.show(5)

+---------+-------------------+-----------------+-------------+
|  user_id|count(user_session)|sum(end_purchase)|sum(end_cart)|
+---------+-------------------+-----------------+-------------+
|512700240|                  7|                4|            0|
|539141084|                 25|                1|            0|
|606753158|                  2|                0|            0|
|575813444|                  7|                0|            0|
|581503547|                 10|                0|            2|
+---------+-------------------+-----------------+-------------+
only showing top 5 rows



In [223]:
session_sum = session_sum.withColumn('pct_sessions_end_purchase', col('sum(end_purchase)')/col('count(user_session)')) \
                         .withColumn('pct_sessions_end_cart', col('sum(end_cart)')/col('count(user_session)'))
session_sum.show(5)

+---------+-------------------+-----------------+-------------+-------------------------+---------------------+
|  user_id|count(user_session)|sum(end_purchase)|sum(end_cart)|pct_sessions_end_purchase|pct_sessions_end_cart|
+---------+-------------------+-----------------+-------------+-------------------------+---------------------+
|512700240|                  7|                4|            0|       0.5714285714285714|                  0.0|
|539141084|                 25|                1|            0|                     0.04|                  0.0|
|606753158|                  2|                0|            0|                      0.0|                  0.0|
|575813444|                  7|                0|            0|                      0.0|                  0.0|
|581503547|                 10|                0|            2|                      0.0|                  0.2|
+---------+-------------------+-----------------+-------------+-------------------------+---------------------+
only showing top 5 rows

In [224]:
temp = session_sum.select('user_id', "pct_sessions_end_purchase", "pct_sessions_end_cart")
temp.show(5)

+---------+-------------------------+---------------------+
|  user_id|pct_sessions_end_purchase|pct_sessions_end_cart|
+---------+-------------------------+---------------------+
|512700240|       0.5714285714285714|                  0.0|
|539141084|                     0.04|                  0.0|
|606753158|                      0.0|                  0.0|
|575813444|                      0.0|                  0.0|
|581503547|                      0.0|                  0.2|
+---------+-------------------------+---------------------+
only showing top 5 rows
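
The classification behind these two features can be restated in plain Python: a session counts as ending in a purchase if it contains any purchase, and as ending in the cart if it contains a cart event but no purchase. The function names and the ten-session user below are hypothetical.

```python
def classify_session(cart: int, purchase: int) -> tuple:
    """Return (end_purchase, end_cart) flags:
    any purchase -> (1, 0); cart events but no purchase -> (0, 1); else (0, 0)."""
    end_purchase = 1 if purchase != 0 else 0
    end_cart = 1 if (purchase == 0 and cart != 0) else 0
    return end_purchase, end_cart

def session_end_shares(sessions):
    """sessions: list of (cart, purchase) counts, one pair per session.
    Returns (share ending in purchase, share ending in cart)."""
    flags = [classify_session(cart, purchase) for cart, purchase in sessions]
    n = len(flags)
    return sum(f[0] for f in flags) / n, sum(f[1] for f in flags) / n

# Hypothetical user: 2 sessions with cart events and no purchase, 8 view-only sessions.
sessions = [(1, 0), (2, 0)] + [(0, 0)] * 8
print(session_end_shares(sessions))  # (0.0, 0.2)
```

Because the flags depend only on which event types a session contains, a session where a cart event follows the purchase is still counted as ending in a purchase.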



In [225]:
full = full.join(temp, full.user_id == temp.user_id).drop(temp.user_id)
full.show(5)

+---------+-------------+-----------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_with_cart|sessions_with_view|pct_sessions_end_purchase|pct_sessions_end_cart|
+---------+-------------+-----------+------------+------------

### Preview full dataframe

In [226]:
full.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- T_total_spend: double (nullable = true)
 |-- total_spend: double (nullable = true)
 |-- total_events: long (nullable = true)
 |-- total_sessions: long (nullable = true)
 |-- avg_session_length: double (nullable = true)
 |-- sd_session_length: double (nullable = true)
 |-- avg_interactions_per_session: double (nullable = true)
 |-- sd_interactions_per_session: double (nullable = true)
 |-- max_interactions_per_session: long (nullable = true)
 |-- purchase_pct_of_total_events: double (nullable = true)
 |-- view_pct_of_total_events: double (nullable = true)
 |-- cart_pct_of_total_events: double (nullable = true)
 |-- avg_purchases_per_session: double (nullable = true)
 |-- sd_purchases_per_session: double (nullable = true)
 |-- cart_events: long (nullable = true)
 |-- purchase_events: long (nullable = true)
 |-- view_events: long (nullable = true)
 |-- sessions_with_purchase: long (nullable = true)
 |-- sessions_with_cart: long (nullable =

In [227]:
full.show(1)

+---------+-------------+-----------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+
|  user_id|T_total_spend|total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_with_cart|sessions_with_view|pct_sessions_end_purchase|pct_sessions_end_cart|
+---------+-------------+-----------+------------+--------------

In [228]:
full.count()

219080

### Remove errors/outliers

In [229]:
# Look at abnormally large number of sessions
full.select('total_sessions').sort(desc('total_sessions')).show(30)

+--------------+
|total_sessions|
+--------------+
|          4642|
|           938|
|           854|
|           738|
|           532|
|           505|
|           396|
|           347|
|           331|
|           317|
|           313|
|           301|
|           290|
|           278|
|           271|
|           271|
|           253|
|           242|
|           239|
|           221|
|           195|
|           192|
|           190|
|           173|
|           171|
|           171|
|           171|
|           158|
|           154|
|           152|
+--------------+
only showing top 30 rows



In [230]:
# More than 300 sessions in a month (10 per day, every day) is likely an error, so remove those rows.
full = full.filter(col('total_sessions') <= 300)
full.select('total_sessions').sort(desc('total_sessions')).show(5)
full.count()

+--------------+
|total_sessions|
+--------------+
|           290|
|           278|
|           271|
|           271|
|           253|
+--------------+
only showing top 5 rows



219068

In [231]:
# Look at abnormally long sessions
# An average session length greater than 8 hours (28800 seconds) is almost certainly an error
full.select('avg_session_length').sort(desc('avg_session_length')).show(10)

+------------------+
|avg_session_length|
+------------------+
|         2585613.0|
|         2536592.0|
|         2501178.0|
|         2374330.0|
|         2370803.0|
|         2352540.0|
|         2339932.0|
|         2290105.0|
|         2265817.0|
|         2239533.0|
+------------------+
only showing top 10 rows



In [232]:
# Remove individuals with average session length greater than 8 hours (28800 seconds)
full = full.filter(col('avg_session_length') <= 28800)
full.select('avg_session_length').sort(desc('avg_session_length')).show(5)
full.count()

+------------------+
|avg_session_length|
+------------------+
|           28780.5|
|           28740.0|
|          28720.25|
|           28684.0|
|           28657.5|
+------------------+
only showing top 5 rows



215443

In [233]:
# Look at outliers by total spend. 
full.select('total_spend', 'purchase_events', 'T_total_spend').sort(desc('total_spend')).show(50)
full.select('total_spend').summary().show()

+------------------+---------------+--------------------+
|       total_spend|purchase_events|       T_total_spend|
+------------------+---------------+--------------------+
|2044030.4928588867|             37|   6177837.887897491|
|1612291.2746582031|             62|   6665424.844734192|
|1510384.8189868927|             86|   7986746.659622192|
|1194616.5629119873|             50|     6517077.0703125|
| 1185409.803451538|             35|1.2214709955963135E7|
|1130508.6236572266|             51|1.6676558014251709E7|
| 872235.1594161987|             34|    4496551.80267334|
| 779262.9273910522|             38|  3791775.5050964355|
| 683402.0625114441|             47|   9966037.290367126|
| 642396.4184399843|             82|   7555947.780554295|
| 619132.0810317993|             16|    3823697.43548584|
|504365.63328552246|             43|   1629816.219696045|
| 478818.6016845703|             27|   4265970.681861877|
|451705.52406311035|             78|   890662.3084888458|
|  437716.4042

In [234]:
# Remove extreme outliers who spend more than 100k a month
full = full.filter(col('total_spend') <= 100000)
full.select('total_spend').sort(desc('total_spend')).show(5)
full.count()

+-----------------+
|      total_spend|
+-----------------+
|97969.59899902344|
|95540.28037261963|
|94285.19989013672|
|94272.89993286133|
| 94056.2700805664|
+-----------------+
only showing top 5 rows



215367

In [243]:
# Look at abnormally large numbers of events
# More than 3000 events is almost certainly a mistake: that is 100 events every day of a 30-day month
full.select('total_events').sort(desc('total_events')).show(50)
full.select('total_events').summary().show()

+------------+
|total_events|
+------------+
|        3000|
|        2982|
|        2925|
|        2886|
|        2871|
|        2844|
|        2790|
|        2723|
|        2720|
|        2716|
|        2682|
|        2632|
|        2628|
|        2584|
|        2565|
|        2560|
|        2544|
|        2538|
|        2520|
|        2516|
|        2470|
|        2470|
|        2436|
|        2425|
|        2416|
|        2394|
|        2390|
|        2385|
|        2384|
|        2365|
|        2324|
|        2254|
|        2222|
|        2208|
|        2196|
|        2190|
|        2160|
|        2160|
|        2142|
|        2141|
|        2132|
|        2080|
|        2058|
|        2025|
|        2001|
|        2000|
|        1995|
|        1980|
|        1975|
|        1950|
+------------+
only showing top 50 rows

+-------+------------------+
|summary|      total_events|
+-------+------------------+
|  count|            215313|
|   mean|15.916674794369127|
| stddev| 65.836427

In [244]:
# Remove users with more than 3000 events in a month
full = full.filter(col('total_events') <= 3000)
full.select('total_events').sort(desc('total_events')).show(5)
full.count()

+------------+
|total_events|
+------------+
|        3000|
|        2982|
|        2925|
|        2886|
|        2871|
+------------+
only showing top 5 rows



215313
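
The four cutoffs used in the cells above each follow from simple arithmetic. As a sanity check, here is a minimal sketch in plain Python (no Spark needed) that derives them in one place. `THRESHOLDS`, `is_plausible`, and `example_user` are hypothetical names introduced for illustration, and the 30-day month is an assumption taken from the comments above.

```python
SECONDS_PER_HOUR = 3600
DAYS_IN_MONTH = 30  # assumption: a 30-day month, as in the cell comments

# Upper bounds copied from the filter cells above
THRESHOLDS = {
    "total_sessions": 10 * DAYS_IN_MONTH,        # 10 sessions per day -> 300
    "avg_session_length": 8 * SECONDS_PER_HOUR,  # 8 hours -> 28800 seconds
    "total_spend": 100_000,                      # spend cutoff from the notebook
    "total_events": 100 * DAYS_IN_MONTH,         # 100 events per day -> 3000
}

def is_plausible(user):
    """Return True when every thresholded feature is at or below its cutoff."""
    return all(user[name] <= limit for name, limit in THRESHOLDS.items())

# Hypothetical user record, for illustration only
example_user = {
    "total_sessions": 12,
    "avg_session_length": 540.0,
    "total_spend": 250.0,
    "total_events": 40,
}
print(is_plausible(example_user))
```

The same dictionary could drive the Spark filters in a loop, applying `full.filter(col(name) <= limit)` for each entry, which keeps every cutoff in a single place.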

#### Save as parquet (if saving in the project group12 folder, make sure to change permissions in bash using `chmod 777 filename`)

In [245]:
%%time
full.write.mode("overwrite").parquet("./processed_data/engineered_features.parquet")

CPU times: user 3.97 ms, sys: 3.25 ms, total: 7.22 ms
Wall time: 33 s
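
Spark writes a parquet "file" as a directory of part files, so the permission change has to reach the files inside it (the equivalent of `chmod -R`). Below is a minimal sketch using only the standard library. `make_world_accessible` is a hypothetical helper, and mode 777 simply mirrors the note above; a tighter mode may be preferable on a shared system.

```python
import os
import stat
import tempfile

def make_world_accessible(path, mode=0o777):
    """Apply `mode` to `path` and everything beneath it (like `chmod -R`)."""
    os.chmod(path, mode)
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            os.chmod(os.path.join(root, name), mode)

# Demonstration on a throwaway directory standing in for a parquet output folder
with tempfile.TemporaryDirectory() as tmp:
    part_file = os.path.join(tmp, "part-00000.parquet")  # hypothetical file name
    open(part_file, "w").close()
    make_world_accessible(tmp)
    part_mode = stat.S_IMODE(os.stat(part_file).st_mode)
```

In the notebook, the call would presumably be `make_world_accessible("./processed_data/engineered_features.parquet")` after the write completes.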


In [246]:
%%time
train, test = full.randomSplit([.8, .2], seed=42)

CPU times: user 1.2 ms, sys: 1.17 ms, total: 2.37 ms
Wall time: 11.4 ms


#### Purchased items in month 1, converted to PCA (pca_purchases)

Note: Unlike the other preprocessing steps, PCA must be fit on the training set only and then applied to the test set. For this reason it comes after the train/test split.
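
To make the fit-on-train, apply-to-test principle concrete, here is a minimal NumPy sketch. It illustrates the idea only and is not a description of how Spark's `PCA` works internally. `fit_pca`, `apply_pca`, and the random count matrices are hypothetical stand-ins for the purchase-category data.

```python
import numpy as np

def fit_pca(train, k):
    """Learn the mean and top-k principal directions from training rows only."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions of the centered training data
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k].T

def apply_pca(data, mean, components):
    """Project rows using parameters learned elsewhere (no refitting)."""
    return (data - mean) @ components

rng = np.random.default_rng(0)
train_counts = rng.poisson(1.0, size=(80, 6)).astype(float)  # stand-in for train
test_counts = rng.poisson(1.0, size=(20, 6)).astype(float)   # stand-in for test

mean, components = fit_pca(train_counts, k=3)
train_proj = apply_pca(train_counts, mean, components)
test_proj = apply_pca(test_counts, mean, components)  # test rows never influence the fit
```

Because `mean` and `components` are computed from `train_counts` alone, nothing about the test rows leaks into the learned projection.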

In [247]:
%%time

# Create a function that prepares a dataset for PCA.

# Inputs: the dataframe of users to prepare (train or test) and, optionally, the list of columns
#       kept from the training set, so the test set is never assembled from categories unseen in training
# Returns: the list of columns to reuse on the test set, and the dataframe ready for PCA
def pca_prepare_on_subset(subset_df, limited_columns=None):
    
    # Only keep month-1 events for users in the given subset (training or test set)
    m1_subset = m1.join(subset_df, 'user_id', 'leftsemi')

    # Replace the periods in category_code with dashes. Spark treats a period in a column
    #  name as nested-field access unless the name is quoted with backticks
    m1_stripped = m1_subset.withColumn('category_code_s', translate('category_code', '.', '-'))

    # Pivot so that each category of purchase becomes a column
    # This table only contains user_id and the categories that a user purchases
    # This table is very sparse
    cats = m1_stripped.filter(col('event_type') == "purchase").groupBy('user_id').pivot('category_code_s').count().na.fill(0)

    if limited_columns is None:
        # Training set: use every category column (skip the id and the null-category column)
        pca_input_cols = [c for c in cats.columns if c != 'user_id' and c != 'null']
        limited_columns = pca_input_cols + ['user_id']
    else:
        # Test set: use exactly the training columns, in the same order
        pca_input_cols = [c for c in limited_columns if c != 'user_id']
        # Training categories never purchased in this subset become all-zero columns
        for c in pca_input_cols:
            if c not in cats.columns:
                cats = cats.withColumn(c, lit(0))
        cats = cats.select(*limited_columns)

    # Assemble the category counts into a single vector column (the PCA input)
    assembler = VectorAssembler(
        inputCols=pca_input_cols,
        outputCol="to_pca_columns")
    
    # Add the assembled vector column to the dataframe
    pca_df = assembler.transform(cats)
    return limited_columns, pca_df
    

CPU times: user 8 µs, sys: 4 µs, total: 12 µs
Wall time: 15.7 µs
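
The column-alignment idea behind `limited_columns` can be shown without Spark. Below is a minimal plain-Python sketch in which each user is a dictionary of purchase counts. `align_to_train_columns` and the category names are hypothetical, chosen only for illustration.

```python
def align_to_train_columns(row, train_columns):
    """Order a user's counts like the training columns.

    Categories unseen in training are dropped; training categories the
    user never purchased are filled with 0.
    """
    return [row.get(column, 0) for column in train_columns]

# Hypothetical category columns learned from the training set
train_columns = ["electronics-smartphone", "appliances-kitchen"]

# Hypothetical test user who bought from a category absent in training
test_row = {"electronics-smartphone": 2, "furniture-bedroom": 1}

aligned = align_to_train_columns(test_row, train_columns)
print(aligned)  # [2, 0]
```

Keeping the same columns in the same order matters because the PCA model fit on the training vectors expects test vectors of identical length and layout.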


In [1]:
# Get columns from training set and get training df
limited_columns, train_pre_pca = pca_prepare_on_subset(train)
# Limit to these columns on the test set and get test df
_, test_pre_pca = pca_prepare_on_subset(test, limited_columns=limited_columns)

# Visualize what this looks like
train_pre_pca.select(["user_id","to_pca_columns"]).show(2, truncate=False)

# Create new PCA instance
pca = PCAml(k=10, inputCol="to_pca_columns", outputCol="pca_purchases")
# Fit on training data
model = pca.fit(train_pre_pca)

# Transform training and test sets
train_with_pca = model.transform(train_pre_pca)
test_with_pca = model.transform(test_pre_pca)


NameError: name 'pca_prepare_on_subset' is not defined

In [249]:
# Merge PCA df back into full training set
join_train_df = train_with_pca.select(["user_id","pca_purchases"])
train = train.join(join_train_df, train.user_id == join_train_df.user_id).drop(join_train_df.user_id)

# Merge PCA df back into full test set
join_test_df = test_with_pca.select(["user_id","pca_purchases"])
test = test.join(join_test_df, test.user_id == join_test_df.user_id).drop(join_test_df.user_id)

In [250]:
train.show(5, truncate=False)
test.show(5, truncate=False)

+---------+------------------+------------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|user_id  |T_total_spend     |total_spend       |total_events|total_sessions|avg_session_length|sd_session_length |avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_sess

#### Write train and test

In [251]:
%%time
train.write.mode("overwrite").parquet("./processed_data/train.parquet")
test.write.mode("overwrite").parquet("./processed_data/test.parquet")

CPU times: user 9.54 ms, sys: 6.63 ms, total: 16.2 ms
Wall time: 1min 12s
