 # Rolling Window Features

The following notebook showcases an example workflow for creating rolling window features and building a model to predict which customers will buy in the next 4 weeks.

This notebook uses dummy sales data, but the same idea can be applied to actual sales data and extended to other available data sources such as click-stream data, call center data, email contacts data, etc.

***

<b>Spark 3.1.2</b> (with Python 3.8) has been used for this notebook.<br>
Refer to the [Spark documentation](https://spark.apache.org/docs/3.1.2/api/sql/index.html) for help with <b>data ops functions</b>.<br>
Refer to [this article](https://medium.com/analytics-vidhya/installing-and-using-pyspark-on-windows-machine-59c2d64af76e) to <b>install and use PySpark on a Windows machine</b>.

### Building a Spark session
To create a SparkSession, use the following builder pattern:
 
`spark = SparkSession\
    .builder\
    .master("local")\
    .appName("Word Count")\
    .config("spark.some.config.option", "some-value")\
    .getOrCreate()`

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.types import FloatType, BooleanType
import pandas as pd
import re
import datetime
import pytz

In [2]:
spark = SparkSession\
    .builder\
    .appName("rolling_window")\
    .config("spark.executor.memory", "1536m")\
    .config("spark.driver.memory", "2g")\
    .getOrCreate()

In [3]:
spark

## Data prep

We will be using window functions to compute relative features for all dates. We will first aggregate the data to customer x week level so it is easier to handle.

<mark>The week level date that we create will serve as the 'reference date' from which everything will be relative.</mark>

All the required dimension tables have to be joined with the sales table prior to aggregation so that we can create all required features.
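
The week-level reference date described above can be sketched in plain Python. This is only an illustration, not part of the pipeline: `week_ending_saturday` is a hypothetical helper that maps any date to the Saturday that closes its week, which is what the pandas `W-SAT` alias denotes.

```python
import datetime


def week_ending_saturday(d: datetime.date) -> datetime.date:
    """Return the Saturday on or after `d` (hypothetical helper for illustration)."""
    # date.weekday(): Monday=0 ... Saturday=5, Sunday=6
    days_ahead = (5 - d.weekday()) % 7
    return d + datetime.timedelta(days=days_ahead)


# Every date from Sunday 2022-06-12 to Saturday 2022-06-18 shares one reference date
week = [datetime.date(2022, 6, 12) + datetime.timedelta(days=i) for i in range(7)]
reference_dates = {week_ending_saturday(d) for d in week}
```

Because all seven days collapse to a single reference date, aggregating by that date yields one row per customer per week.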

### Read input datasets

In [4]:
class ABT():
    
    # TODO: categorical handling https://github.com/GoogleCloudPlatform/cloud-for-marketing/tree/main/marketing-analytics/predicting/ml-data-windowing-pipeline
    
    def __init__(self, source, user_col, date_col):
        """
        IMPORTANT: Please make sure the date_col is loaded as date in the first place
        """
        self.source = source
        self.user_col = user_col
        self.date_col = date_col
    
    def set_snapshot_freq(self, freq):
        """
        1. Add column '_snapshot_date' to self.source to indicate aggregation level,
            e.g. if freq = 'W-SAT' then each date is converted to the Saturday that ends its week.
            The column name is also stored as self.snapshot_col
        2. Also add self.date_range for subsequent densify operation
        
        See acceptable arguments here https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases
        """ 
        
        self.freq = freq
        self.snapshot_col = '_snapshot_date'
        
        if freq == 'W-SAT':
            self.source = self.source.withColumn(self.snapshot_col, F.col(self.date_col) + 7 - F.dayofweek(self.date_col))
        
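        elif freq == 'D':
            self.source = self.source.withColumn(self.snapshot_col, F.col(self.date_col))
        
        else:
            # other pandas offset aliases are not implemented in this class; fail early
            raise ValueError(f"Unsupported freq {freq!r}: only 'W-SAT' and 'D' are implemented")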
        
        # get global date range
        global_min_date, global_max_date = self.source.groupBy().agg(
            F.min(self.snapshot_col).alias('global_min_date'),
            F.max(self.snapshot_col).alias('global_max_date')
        ).collect()[0]
       
        # densify the global date range and create a Spark DataFrame from it
        dt_rng = pd.date_range(start=global_min_date, end=global_max_date, freq=self.freq)
        df_date = pd.DataFrame(dt_rng, columns=[self.snapshot_col])
        self.date_range = spark.createDataFrame(df_date)
        # pandas timestamps arrive as Spark timestamps; cast back to date so joins on the snapshot column match
        self.date_range = self.date_range.withColumn(self.snapshot_col, F.to_date(self.snapshot_col))
        
        # write an output containing user col and snapshot col only
        self.output = self.source.select(self.user_col, self.snapshot_col).dropDuplicates()
    
    def simple_agg(self, *aggs):
        """
        Aggregate values WITHOUT categorical breakdown e.g. total sales as columns
        """
        temp_agg = self.source.groupBy(self.user_col, self.snapshot_col).agg(*aggs)
        self.output = self.output.join(temp_agg, on=[self.user_col, self.snapshot_col], how='left')
    
    def pivot_agg(self, dimension, agg, formatting):
        """
        Aggregate values with categorical breakdown e.g. adding sales by product category as columns
        """
        prefix, suffix = formatting.split('%')
        
        # Add prefix & suffix to value e.g. cat_A_salesAmt, cat_D_salesAmt, etc., and also replace space with underscore
        temp_agg = self.source.withColumn(dimension, F.concat(
            F.lit(prefix),
            F.regexp_replace(F.col(dimension),' ','_'),
            F.lit(suffix),
        ))
        temp_agg = temp_agg.groupBy([self.user_col, self.snapshot_col]).pivot(dimension).agg(agg)
        self.output = self.output.join(temp_agg, on=[self.user_col, self.snapshot_col], how='left')
    
    def densify(self):
        """
        Filling in the missing dates/weeks
        """
        
        # get the start and end of available data, by user
        df_cust = self.source.groupBy(self.user_col).agg(
            F.min(self.snapshot_col).alias('user_min_date'),
            F.max(self.snapshot_col).alias('user_max_date')
        )
    
        # cross join the customers with global date range
        df_base = df_cust.crossJoin(F.broadcast(self.date_range))
        
        # keep only snapshot dates on or after each customer's first snapshot
        df_base = df_base.where(F.col(self.snapshot_col) >= F.col('user_min_date'))
        
        # drop the per-user min/max dates, which are no longer needed. Now df_base contains only user_col and the snapshot column
        df_base = df_base.drop("user_min_date", "user_max_date")

        self.output = df_base.join(self.output, on=[self.user_col, self.snapshot_col], how='left')
        
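    def y_windowing(self, target, prediction_window, mode):
        """
        Build the target '_y_variable' from the next `prediction_window` snapshots (current row excluded).
        mode: 'classification' (any positive target value) or 'regression' (sum of target)
        """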
        self.prediction_window = prediction_window
        window = Window.partitionBy(self.user_col).orderBy(self.snapshot_col).rowsBetween(1, self.prediction_window)

        if mode == 'classification':
            # working field to turn the target to 0/1
            self.output = self.output.withColumn('_binary_flag', F.when(F.col(target)>0,1).otherwise(0))

            # window to aggregate the flag over next n periods
            self.output = self.output.withColumn('_y_variable', F.max('_binary_flag').over(window).cast(BooleanType()))            
            
            # remove the working field
            self.output = self.output.drop("_binary_flag")
            
        elif mode == 'regression':
            # window to aggregate the metric over next n periods
            self.output = self.output.withColumn('_y_variable', F.sum(target).over(window))
    
    # TODO: MAX should not be applied on an aggregated basis
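    def x_windowing(self, columns, function, lookback, suffix):
        """
        For each lookback n, aggregate each column over the current snapshot and the n-1 snapshots before it.
        New columns are named '<column>_<n><suffix>'; `lookback` is a list of ints.
        """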
        self.lookback = lookback
    
        # perform aggregation
        for one_lookback in lookback:
            window = Window.partitionBy(self.user_col).orderBy(self.snapshot_col).rowsBetween(1 - one_lookback, Window.currentRow)
            for one_column in columns:
                self.output = self.output.withColumn(one_column + f'_{one_lookback}{suffix}', function(F.col(one_column)).over(window))
    
    def trim(self):
        """
        Trim dataset based on prediction window and the maximum of lookback.
        Can only be performed after y_windowing and x_windowing.
        
        Idea: To also remove rows where all X are NULL, but maybe a bad idea:
            e.g. customers who have not been active in the last 7 days might be interested to log in
        """
        
        # collect the global snapshot dates once, in chronological order
        snapshots = [row[0] for row in self.date_range.orderBy(self.snapshot_col).collect()]
        valid_start = snapshots[max(self.lookback) - 1]
        valid_end = snapshots[-self.prediction_window - 1]
        self.output = self.output.where((F.col(self.snapshot_col) >= valid_start) & (F.col(self.snapshot_col) <= valid_end))

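The two window frames used in `y_windowing` and `x_windowing` can be checked on a small list without Spark. The sketch below is a plain-Python imitation under stated assumptions: `forward_flag` and `trailing_sum` are hypothetical helpers that mimic `rowsBetween(1, n)` and `rowsBetween(1 - n, Window.currentRow)` for one user's rows, already ordered by snapshot date.

```python
from typing import List, Optional


def forward_flag(values: List[Optional[float]], n: int) -> List[Optional[bool]]:
    """Mimic max(flag) over rowsBetween(1, n): was any of the next n rows > 0?"""
    out = []
    for i in range(len(values)):
        future = values[i + 1 : i + 1 + n]
        if not future:
            out.append(None)  # an empty frame aggregates to NULL
        else:
            out.append(any(v is not None and v > 0 for v in future))
    return out


def trailing_sum(values: List[Optional[float]], n: int) -> List[Optional[float]]:
    """Mimic sum(col) over rowsBetween(1 - n, currentRow): current row plus the n-1 before it."""
    out = []
    for i in range(len(values)):
        frame = [v for v in values[max(0, i - n + 1) : i + 1] if v is not None]
        out.append(sum(frame) if frame else None)  # NULLs are skipped by sum
    return out


# One user's revenue per snapshot; None marks a densified row with no activity
revenue = [0, None, 5.0, 0, 0, 0]
labels = forward_flag(revenue, 3)
features = trailing_sum(revenue, 3)
```

The label never looks at the current row and the feature never looks past it, so the two frames do not overlap and the target cannot leak into the features.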
In [5]:
ga4 = spark.read.format('bigquery').option('table', "adroit-hall-301111.demo.ga4_abt").load()

Aggregate data per user and snapshot date

In [6]:
user_col, date_col = 'user_pseudo_id', 'event_date'

abt2 = ABT(ga4, user_col, date_col)
abt2.set_snapshot_freq('D')
abt2.simple_agg(
    F.sum('item_revenue').alias('items_revenue'),
    F.max('purchase_revenue').alias('max_revenue'),
)

abt2.pivot_agg('event_name', F.countDistinct('event_id'), '%_count')
abt2.pivot_agg('category', F.countDistinct(F.when((F.col('event_name') == 'session_start'), F.col('event_id'))), '%_sessions')

# TODO: Remove columns that have too few users and thus might not be worthwhile for prediction;
# remember to include the y variable in the "not in" clause
# for aggregation in [i for i in abt2.output.columns if i not in (user_col, date_col)]:
#    if F.countDistinct(F.when((F.col('event_name') == 'session_start')
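
The `formatting` argument passed to `pivot_agg` above (for example `'%_count'`) is a template in which `%` stands for the dimension value. The plain-Python sketch below shows how the resulting column names are formed; `pivot_column_name` is a hypothetical helper that repeats the split, space replacement and concatenation done inside `pivot_agg`.

```python
def pivot_column_name(value: str, formatting: str) -> str:
    """Build the pivoted column name for one dimension value (illustrative helper)."""
    # the template must contain exactly one '%', otherwise the unpacking fails
    prefix, suffix = formatting.split('%')
    return prefix + value.replace(' ', '_') + suffix


event_columns = [pivot_column_name(v, '%_count') for v in ['login', 'screen view']]
category_column = pivot_column_name('A', 'cat_%_salesAmt')
```

Here `event_columns` is `['login_count', 'screen_view_count']` and `category_column` is `'cat_A_salesAmt'`, matching the naming mentioned in the `pivot_agg` comment.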


In [None]:
abt2.output = abt2.output.drop("null") # In case some dimension values are null 

Densify and apply windowing

In [7]:
abt2.densify()
abt2.y_windowing('items_revenue', 3, 'classification')

x_to_agg = [i for i in abt2.output.columns if re.match('.*count|.*revenue|.*sessions', i)]
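# snapshots are daily here, so use a 3-day lookback; this yields columns such as screen_view_count_3d_sum
abt2.x_windowing(x_to_agg, F.sum, [3], 'd_sum')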

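Densifying matters because row-based windows count rows, not calendar time: without the filled-in rows, "the previous 3 rows" could span weeks. A minimal plain-Python sketch of the idea follows, assuming a hypothetical `densify` function over a dict keyed by `(user, date)`; it mirrors the cross join, the per-user start filter and the left join in `ABT.densify`.

```python
import datetime


def densify(observed, all_dates):
    """Return (user, date, value-or-None) for every date on or after the user's first date."""
    # first snapshot date seen for each user
    first_seen = {}
    for user, day in observed:
        if user not in first_seen or day < first_seen[user]:
            first_seen[user] = day
    dense = []
    for user in sorted(first_seen):
        for day in all_dates:
            if day >= first_seen[user]:
                # a missing key plays the role of the NULL left by the left join
                dense.append((user, day, observed.get((user, day))))
    return dense


def june(n):
    return datetime.date(2022, 6, n)


observed = {("a", june(1)): 10.0, ("a", june(3)): 4.0, ("b", june(2)): 7.0}
dense = densify(observed, [june(1), june(2), june(3)])
```

User `a` gains an empty row for 2 June, while user `b` gets no row before the first observed date of 2 June.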

Truncation

In [None]:
abt2.trim()
abt2.output = abt2.output.where((F.col('screen_view_count_3d_sum')>0))
for i in [i for i in abt2.output.columns if re.match(f'.*count$', i)]:
    abt2.output = abt2.output.drop(i)
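
The trimming rule can be illustrated with plain Python. In this sketch, `valid_range` is a hypothetical function that applies the same indexing as `ABT.trim` to a list of snapshot dates: the first valid snapshot has a full lookback behind it, and the last valid snapshot has a full prediction window ahead of it.

```python
import datetime


def valid_range(snapshots, lookbacks, prediction_window):
    """Return the first and last snapshot dates that have complete X and y windows."""
    ordered = sorted(snapshots)
    valid_start = ordered[max(lookbacks) - 1]      # needs max(lookbacks) rows including itself
    valid_end = ordered[-prediction_window - 1]    # needs prediction_window rows after it
    return valid_start, valid_end


# Ten daily snapshots, 1 to 10 June 2022
snapshots = [datetime.date(2022, 6, n) for n in range(1, 11)]
start, end = valid_range(snapshots, lookbacks=[3], prediction_window=3)
```

With a 3-day lookback and a 3-day prediction window, only 3 June through 7 June survive the trim.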

Get proportion of features

In [None]:
divide_list = ['login','screen_view','session_start']
suffix_list = ['3d_sum'] #,'7d_sum','14d_sum','28d_sum'

for event in divide_list:
    for suffix in suffix_list:
        x_to_divide = [i for i in abt2.output.columns if re.match(f'.*_{event}_*{suffix}$', i)]
        for variant in x_to_divide:
            abt2.output = abt2.output.withColumn(f'{variant}_pct', (F.col(variant) / F.col(f"{event}_count_{suffix}")))

abt2.output = abt2.output.fillna(0)
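
The reason for the final `fillna(0)` is that, with Spark's default non-ANSI settings, dividing by zero or by NULL yields NULL rather than an error. The plain-Python sketch below imitates that behaviour; `share` and `fill_null` are hypothetical helpers, and the column names in `rows` are made up for illustration.

```python
def share(part, total):
    """Return part / total, or None when the ratio is undefined (imitates a SQL NULL)."""
    if part is None or total is None or total == 0:
        return None
    return part / total


def fill_null(value, default=0):
    """Replace None with a default, like fillna(0) on a numeric column."""
    return default if value is None else value


# hypothetical rows: a windowed variant and the matching event total
rows = [
    {"variant_3d_sum": 2, "login_count_3d_sum": 8},
    {"variant_3d_sum": 0, "login_count_3d_sum": 0},
    {"variant_3d_sum": None, "login_count_3d_sum": 5},
]
pct = [fill_null(share(r["variant_3d_sum"], r["login_count_3d_sum"])) for r in rows]
```

Filling with 0 treats "no activity" and "undefined share" alike, which is a modelling choice rather than a requirement.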

In [8]:
output_date = datetime.datetime.now(pytz.timezone('Asia/Hong_Kong')).strftime('%Y%m%d')

# save the ABT output to BigQuery
abt2.output.write \
  .format("bigquery") \
  .option("temporaryGcsBucket","dataproc-staging-us-central1-712368347106-boh5iflc/temp") \
  .save(f"demo.ga4abt_{output_date}")

22/06/13 10:17:53 WARN org.apache.spark.sql.catalyst.util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                