<table><tr><td><img src="https://spark.apache.org/images/spark-logo-trademark.png"></td><td><img src="https://bento.cdn.pbs.org/hostedbento-prod/blog/20170114_200556_794501_pk-channel-16x9.jpeg"></td></tr></table>

# Introduction

This is the second part of my attempt at the Kaggle PBS Kids prediction competition in PySpark, focused on feature engineering. I've not attempted much innovation in approach; rather, I've adapted another public notebook: '890 features' www.kaggle.com/braquino/890-features by Bruno Aquino https://www.kaggle.com/braquino.

As per Bruno's approach, the feature engineering will preserve the session data for the sessions involving an assessment, plus record 'year to date' performance across all sessions including those without an assessment. 
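
As a rough illustration of what 'year to date' means here, the pure-Python sketch below keeps one row per assessment session and attaches the running total of events from all earlier sessions of the same installation. The toy rows and the names `sessions` and `lagged_running_totals` are hypothetical and are not part of the notebook, which does the equivalent with Spark window functions.

```python
from collections import defaultdict

# Hypothetical toy sessions: (installation_id, start_order, had_assessment, n_events)
sessions = [
    ("a", 1, False, 10),
    ("a", 2, True, 4),
    ("a", 3, False, 7),
    ("a", 4, True, 5),
    ("b", 1, True, 3),
]

def lagged_running_totals(rows):
    """Attach to each session the event total of all EARLIER sessions
    of the same installation (the current session is excluded)."""
    running = defaultdict(int)
    out = []
    for inst, order, assessed, n_events in sorted(rows, key=lambda r: (r[0], r[1])):
        out.append({
            "installation_id": inst,
            "order": order,
            "assessed": assessed,
            "events": n_events,
            "events_to_date": running[inst],  # total before this session
        })
        running[inst] += n_events  # update only after recording the lagged value
    return out

# Keep only the sessions that contain an assessment; the running totals
# still reflect every earlier session, assessed or not.
features = [r for r in lagged_running_totals(sessions) if r["assessed"]]
for row in features:
    print(row)
```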

Regarding speed, PySpark has been much faster for feature engineering than for the investigation in Part 1, as there is much less instantiation of intermediate results, which suits Spark's lazy evaluation approach. It was difficult to compare timings against the mainstream Python approach used by Bruno, as I didn't follow his approach entirely. But manually creating features in Python as per Bruno's workbook took 4.5 minutes, versus less than a minute in PySpark, even running on a single laptop. It's worth noting, though, that Bruno's approach uses 'for' loops for things like summation across a window and one-hot encoding, which I suspect is slower than the built-in windowing and encoding functions in libraries such as scikit-learn. I've similarly tried to use PySpark's built-in functions for these tasks.

One challenge I found in using Spark here was avoiding a stack overflow when applying window functions to calculate 'year to date' performance: window functions really don't like high-dimensional dataframes. I've avoided this by building the window expressions with list comprehensions and passing them to a single select, rather than adding columns one at a time in 'for' loops, which keeps the execution plan shallow.


In [None]:
#you will need to install pyspark as it isn't part of the standard kaggle environment.  Make sure you set internet on for this workbook
!pip install pyspark
!pip install spark_sklearn

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import psutil
        
# Any results you write to the current directory are saved as output.
from scipy.stats import skew,norm
from scipy import stats


from pyspark.sql import SparkSession, Window 
from pyspark.sql.functions import col,when,unix_timestamp,to_date,min,max,isnull,count,concat_ws,lit,sum,\
instr,datediff
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType,TimestampType,\
BooleanType,LongType,FloatType,DoubleType,ArrayType
from pyspark.ml.feature import OneHotEncoder, StringIndexer,OneHotEncoderEstimator,VectorAssembler,MinMaxScaler\
,PCA
from pyspark.sql.functions import udf,pandas_udf, PandasUDFType,to_date

from pyspark.ml import Pipeline
import json
import pyarrow as pa
import pyarrow.parquet as pq
import spark_sklearn
from collections import Counter

import pyspark.sql.functions as F



pd.set_option('display.max_columns', 1000)
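pd.set_option('mode.use_inf_as_na', True) #option_context only takes effect inside a 'with' block; note this option is deprecated in recent pandas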

%pylab inline

In [None]:
#Initialise the Spark context
os.environ["PYSPARK_PYTHON"]="python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"

#NumCores=psutil.cpu_count(logical=False) #Not necessary to manually set number of workers, just for clarity

Spark = SparkSession\
.builder\
.master('local[4]')\
.appName("PBS_Kids_Spark")\
.config("spark.executor.memory", "4g") \
.config("spark.driver.memory", "14g") \
.config("spark.memory.offHeap.enabled",True)\
.config("spark.memory.offHeap.size","10g")\
.config("spark.driver.maxResultSize",0)\
.config("spark.sql.execution.arrow.enabled",True)\
.getOrCreate()

In [None]:
%%time
#Load data to DataFrames

TrainDf=Spark.read.csv('../input/data-science-bowl-2019/train.csv',header=True,quote='"',escape='"') #quote and escape options required to parse double quotes
TrainlabelsDf=Spark.read.csv('../input/data-science-bowl-2019/train_labels.csv',header=True,quote='"',escape='"')
TestDf=Spark.read.csv('../input/data-science-bowl-2019/test.csv',quote='"',header=True,escape='"')

#Load smaller files as panda Dfs
SpecsDf=pd.read_csv('../input/data-science-bowl-2019/specs.csv')
sample_submissionDf=pd.read_csv('../input/data-science-bowl-2019/sample_submission.csv')

In [None]:
%%time
#getting rid of the installation_ids that never took an assessment.  We saw these in Part 1
TrainDf.createOrReplaceTempView("Train")
keepidDf=Spark.sql(f'SELECT installation_id from Train WHERE type="Assessment"').dropDuplicates()
keepidDf.createOrReplaceTempView("keepid")
Columns=','.join(['Train.'+a for a in TrainDf.columns])
TrainDf=Spark.sql(f'SELECT {Columns} from Train INNER JOIN keepid ON Train.installation_id=keepid.installation_id')

In [None]:
#drop rows with na
TrainDf=TrainDf.na.drop()

In [None]:
''' I've limited the size of the dataframe, as Kaggle machines don't really have enough disk storage (<5GB) to support Spark as a head node,
especially when I use PyArrow to save the feature dataframe to Parquet. Hopefully you have access to a more powerful PC and can remove this limit. '''
TrainDf=TrainDf.limit(1000000)

## Create flags

In [None]:
%%time
#Add identifying column to Test and Train dfs
TestDf=TestDf.withColumn('TestOrTrain',lit("Test"))
TrainDf=TrainDf.withColumn('TestOrTrain',lit("Train"))

In [None]:
%%time
#identify test records
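#use max rather than last: the ISO-8601 timestamp strings sort chronologically, whereas F.last depends on row order after a groupBy
TestRecordsDf=TestDf.groupBy('installation_id').agg(F.max('timestamp').alias('timestamp'))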
TestRecordsDf=TestRecordsDf.withColumn('TestFlag',lit(1))
TrainDf=TrainDf.withColumn('TestFlag',lit(0))

TestDf.createOrReplaceTempView("Test")
TestRecordsDf.createOrReplaceTempView("TestRecords")

TestDf=Spark.sql(f'SELECT *, \
Test.installation_id as id1,Test.timestamp as ts1,\
TestRecords.installation_id,TestRecords.timestamp \
from Test LEFT JOIN TestRecords \
ON Test.installation_id=TestRecords.installation_id \
AND Test.timestamp=TestRecords.timestamp')\
.drop('installation_id','timestamp')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('ts1','timestamp')

In [None]:
%%time
#stack the test and train dataframes for combined operations
TestDf=TestDf.select(TrainDf.columns) #ensure same column order
CombinedDf=TrainDf.union(TestDf).repartition(3) 

#concatenate the events and codes
CombinedDf = CombinedDf.withColumn('title_event_code',concat_ws('_',CombinedDf.title,CombinedDf.event_code))

#concatenate the world and event type
CombinedDf = CombinedDf.withColumn('world_type',concat_ws('_',CombinedDf.world,CombinedDf.type))

#String encode a number of fields via a pipeline
# ColumnsToEncode=['title','world','type','event_code','event_id']
# indexers = [StringIndexer(inputCol=column, outputCol=column+"_index",handleInvalid='skip')\
#              for column in ColumnsToEncode ]
# EncodePipeline = Pipeline(stages=indexers)
# CombinedDf = EncodePipeline.fit(CombinedDf).transform(CombinedDf)


#Flag the assessment tasks
CombinedDf=CombinedDf.withColumn('win_code',when(\
                     ((col('event_code')=='4100')\
                      & (F.instr(CombinedDf['title'],'(Assessment)')>0)\
                      &(col('title')!='Bird Measurer (Assessment)')   )\
                      |
                     ((col('event_code')=='4110') & (col('title')=='Bird Measurer (Assessment)'))\
                      |
                      (col('TestFlag')==1)                           
                    ,1).otherwise(0))

#For assessment tasks, indicate if pass or fail
CombinedDf=CombinedDf.withColumn('true_attempts',when(\
                     (col('win_code')==1)&(F.instr(CombinedDf['event_data'],'true')>0)\
                    ,1).otherwise(0))
CombinedDf=CombinedDf.withColumn('false_attempts',when(\
                     (col('win_code')==1)&(F.instr(CombinedDf['event_data'],'false')>0)\
                    ,1).otherwise(0))

#convert timestamp from string
CombinedDf=CombinedDf.withColumn('timestamp',unix_timestamp(col('timestamp')\
                                                             , "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").cast("timestamp"))

#Flag if session type changes for each installation
windowval = Window.partitionBy('installation_id').orderBy('timestamp')
CombinedDf=CombinedDf.withColumn("ChangeSession", when(\
                                                       (col('type')!=F.lag(col('type'), 1, 0).over(windowval)),True
                                                      )\
                                                    .otherwise(False))

#Show changed session type
CombinedDf=CombinedDf.withColumn('typeChange',when(\
                     (col('ChangeSession')==True),col('type')).otherwise('Unchanged'))

#Ensure key variables are the proper type
CombinedDf=CombinedDf.withColumn('game_time',col('game_time').cast(LongType()))
#Cache as we'll be using CombinedDf a lot
CombinedDf.cache() 


CombinedDf.createOrReplaceTempView("Combined")


#create frame of the 5 assessment titles
list_of_assess_titlesDf=Spark.sql(f'SELECT title from Combined WHERE type="Assessment"').dropDuplicates()

In [None]:
%%time
#get game data

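@udf('int')
def json_attribute(data,attribute='misses'):
    '''Return the value stored under `attribute` in the JSON string, or -1 if it cannot be read.'''
    try:
        result =json.loads(data)[attribute]
    except (TypeError,ValueError,KeyError):
        #null or invalid JSON, or the attribute is absent
        result =-1
    
    return result

@udf('int')
def json_conditional_attribute(data,attribute='round'):
    '''Same lookup as json_attribute, used for attributes that only some events carry.'''
    try:
        result =json.loads(data)[attribute]
    except (TypeError,ValueError,KeyError):
        #null or invalid JSON, or the attribute is absent
        result =-1
    
    return result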


gameDf=CombinedDf.where(col('type')=='Game').where(col('event_code')=='2030')\
    .select('installation_id','timestamp','event_data')
gameDf=gameDf.withColumn('misses_cnt',json_attribute(col('event_data'),lit('misses')))
gameDf=gameDf.withColumn('game_round',json_conditional_attribute(col('event_data'),lit('round')))
gameDf=gameDf.withColumn('game_level',json_conditional_attribute(col('event_data'),lit('level')))
gameDf=gameDf.drop('event_data')

CombinedDf.createOrReplaceTempView("Combined")
gameDf.createOrReplaceTempView("game")

CombinedDf=Spark.sql(f'SELECT *, \
Combined.installation_id as id1,Combined.timestamp as ts1,\
game.installation_id,game.timestamp \
from Combined LEFT JOIN game \
ON Combined.installation_id=game.installation_id \
AND Combined.timestamp=game.timestamp')\
.drop('installation_id','timestamp')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('ts1','timestamp')


CombinedDf=CombinedDf.fillna(0, subset=['misses_cnt'])
CombinedDf=CombinedDf.fillna(-1, subset=['game_round','game_level'])

## Convert the raw data into processed features
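The `groupBy(...).pivot(...).count()` calls in this section turn each distinct category value into its own column holding a per-session count. A minimal pure-Python sketch of the same idea follows, assuming a handful of made-up event rows; `events` and `pivot_counts` are illustrative names only and do not appear in the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical event rows: (installation_id, game_session, title, event_code)
events = [
    ("a", "s1", "Chow Time", "2030"),
    ("a", "s1", "Chow Time", "2030"),
    ("a", "s1", "Chow Time", "4020"),
    ("a", "s2", "Bird Measurer (Assessment)", "4110"),
]

def pivot_counts(rows):
    """Count title_event_code occurrences per (installation, session) and
    lay them out as a wide table with one column per distinct value."""
    counts = defaultdict(Counter)
    for inst, session, title, code in rows:
        counts[(inst, session)][f"{title}_{code}"] += 1
    # Every distinct value becomes a column; absent combinations are filled with 0,
    # mirroring the .na.fill(0) applied after the Spark pivot.
    columns = sorted({name for ctr in counts.values() for name in ctr})
    table = {key: [ctr.get(name, 0) for name in columns] for key, ctr in counts.items()}
    return columns, table

columns, table = pivot_counts(events)
print(columns)
for key, row in table.items():
    print(key, row)
```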

In [None]:
%%time
#Record number of Title-event code pair (essentially one-hot encoding)
AllSessionTimingTEDf=CombinedDf.groupBy("installation_id","game_session").pivot("title_event_code").count().na.fill(0)

#Record number of type-world pair
AllSessionTimingTWDf=CombinedDf.groupBy("installation_id","game_session").pivot("world_type").count().na.fill(0)

#Record number of event codes
AllSessionTimingECDf=CombinedDf.groupBy("installation_id","game_session").pivot("event_code").count().na.fill(0)

#Record number of titles
AllSessionTimingTtlDf=CombinedDf.groupBy("installation_id","game_session").pivot("title").count().na.fill(0)

#Record  number of  event id
AllSessionTimingEIDf=CombinedDf.groupBy("installation_id","game_session").pivot("event_id").count().na.fill(0)

#Record  number of  world
AllSessionTimingWDf=CombinedDf.groupBy("installation_id","game_session").pivot("world").count().na.fill(0)

#Record  number of  types
AllSessionTimingTypDf=CombinedDf.groupBy("installation_id","game_session").pivot("type").count().na.fill(0)

#Record  number of  type changes
AllSessionTimingTypCDf=CombinedDf.groupBy("installation_id","game_session").pivot("typeChange").count().na.fill(0)

ColumnNames= ['Activity', 'Assessment', 'Clip', 'Game']  #rename to avoid confusion with AllSessionTimingTypDf
NewColumnNames={'Activity':'ActivityC', 'Assessment':'AssessmentC', 'Clip':'ClipC', 'Game':'GameC'}
for Column in ColumnNames:
    AllSessionTimingTypCDf=AllSessionTimingTypCDf.withColumnRenamed(Column,NewColumnNames[Column])

In [None]:
%%time
#Record all assessment attempts and results by assessment title
Assess_Titles=list_of_assess_titlesDf.toPandas()[['title']].values.tolist()
Assess_Titles=[x[0] for x in Assess_Titles]

AllAssessmentDf=CombinedDf.select("installation_id","game_session","title",\
                                "true_attempts","false_attempts","timestamp","win_code")
for Assess_Title in Assess_Titles:
    AllAssessmentDf=AllAssessmentDf.withColumn(Assess_Title+'True',\
                                    when(col('title')==Assess_Title,col('true_attempts'))\
                                                 .otherwise(0).cast(IntegerType()))
    AllAssessmentDf=AllAssessmentDf.withColumn(Assess_Title+'False',\
                                    when(col('title')==Assess_Title,col('false_attempts'))\
                                                 .otherwise(0).cast(IntegerType()))
    AllAssessmentDf=AllAssessmentDf.withColumn(Assess_Title+'AllAttemps',\
                                    when(col('title')==Assess_Title,col('true_attempts')+col('false_attempts'))\
                                                 .otherwise(0).cast(IntegerType()))

## Sum occurrences and assessment results by installation id and session

In [None]:
%%time
#First get some miscellaneous session information

MiscSessionInfoDf=CombinedDf.groupBy("installation_id","game_session")\
.agg(count('event_id').alias('NumEvents')\
    ,F.sum('win_code').alias('NumAssessmentAttempts')
    ,min('timestamp').alias('StartTime')\
    ,max('timestamp').alias('EndTime')\
#     ,F.first(col("title_index")).alias('session_title')\
    ,F.first(col("title")).alias('r_session_title')\
    ,F.first(col("TestFlag")).alias('TestFlag')\
    ,F.first(col("TestOrTrain")).alias('TestOrTrain')\
     ,F.first(col("game_round")).alias('game_round')\
     ,F.first(col("game_level")).alias('game_level')\
     ,F.first(col("misses_cnt")).alias('misses_cnt')\
     ,F.first(col("type")).alias('Type')\
     ,F.first(col("World")).alias('World')\
    ,(F.unix_timestamp(F.max('timestamp'))-F.unix_timestamp(F.min('timestamp'))).alias('SessionDuration'))
MiscSessionInfoDf=MiscSessionInfoDf.withColumn('hour',F.hour(col('StartTime')))


In [None]:
%%time
#insert clip durations
clip_time = {'Welcome to Lost Lagoon!':19,'Tree Top City - Level 1':17,'Ordering Spheres':61, 'Costume Box':61,
        '12 Monkeys':109,'Tree Top City - Level 2':25, 'Pirate\'s Tale':80, 'Treasure Map':156,'Tree Top City - Level 3':26,
        'Rulers':126, 'Magma Peak - Level 1':20, 'Slop Problem':60, 'Magma Peak - Level 2':22, 'Crystal Caves - Level 1':18,
        'Balancing Act':72, 'Lifting Heavy Things':118,'Crystal Caves - Level 2':24, 'Honey Cake':142, 'Crystal Caves - Level 3':19,
        'Heavy, Heavier, Heaviest':61}

@udf('int')
def insert_clip_duration(Type,session_title,SessionDuration):
    if Type=='Clip':
        return clip_time[session_title]
    else:
        return SessionDuration
    

MiscSessionInfoDf=MiscSessionInfoDf.withColumn('SessionDuration',\
                                    insert_clip_duration(col('Type'),col('r_session_title'),col('SessionDuration'))\
                                               .cast(IntegerType()))

In [None]:
%%time
#sum occurrences for each title-event code pair across each session 
SessionTimingTEDf=AllSessionTimingTEDf.groupby("installation_id","game_session").sum()
#sum occurrences for each world-type pair across each session 
SessionTimingTWDf=AllSessionTimingTWDf.groupby("installation_id","game_session").sum()
#Same for title
SessionTimingTtlDf=AllSessionTimingTtlDf.groupby("installation_id","game_session").sum()
#Same for event id
SessionTimingEIDf=AllSessionTimingEIDf.groupby("installation_id","game_session").sum()
#Same for world
SessionTimingWDf=AllSessionTimingWDf.groupby("installation_id","game_session").sum()
#Same for type
SessionTimingTypDf=AllSessionTimingTypDf.groupby("installation_id","game_session").sum()
#Same for type change
SessionTimingTypCDf=AllSessionTimingTypCDf.groupby("installation_id","game_session").sum()
#Same for event code
SessionTimingECDf=AllSessionTimingECDf.groupby("installation_id","game_session").sum()

#Join the 8 occurrence dataframes and the miscellaneous session info
SessionTimingTEDf.createOrReplaceTempView("SessionTimingTE")
SessionTimingTWDf.createOrReplaceTempView("SessionTimingTW")
SessionTimingTtlDf.createOrReplaceTempView("SessionTimingTtl")
SessionTimingEIDf.createOrReplaceTempView("SessionTimingEI")
SessionTimingWDf.createOrReplaceTempView("SessionTimingW")
SessionTimingTypDf.createOrReplaceTempView("SessionTimingTyp")
SessionTimingTypCDf.createOrReplaceTempView("SessionTimingTypC")
SessionTimingECDf.createOrReplaceTempView("SessionTimingEC")
MiscSessionInfoDf.createOrReplaceTempView("MiscSessionInfo")

SessionTimingDf=Spark.sql(f'SELECT *, \
SessionTimingTE.installation_id as id1,SessionTimingTE.game_session as gs1,\
SessionTimingTtl.installation_id,SessionTimingTtl.game_session \
from SessionTimingTE INNER JOIN SessionTimingTtl \
ON SessionTimingTE.installation_id=SessionTimingTtl.installation_id \
AND SessionTimingTE.game_session=SessionTimingTtl.game_session')\
.drop('installation_id','game_session')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
SessionTimingDf.createOrReplaceTempView("SessionTiming")

SessionTimingDf=Spark.sql(f'SELECT *, \
SessionTiming.installation_id as id1,SessionTiming.game_session as gs1,\
SessionTimingTW.installation_id,SessionTimingTW.game_session \
from SessionTiming INNER JOIN SessionTimingTW \
ON SessionTiming.installation_id=SessionTimingTW.installation_id \
AND SessionTiming.game_session=SessionTimingTW.game_session')\
.drop('installation_id','game_session')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
SessionTimingDf.createOrReplaceTempView("SessionTiming")

SessionTimingDf=Spark.sql(f'SELECT *, \
SessionTiming.installation_id as id1,SessionTiming.game_session as gs1,\
SessionTimingEI.installation_id,SessionTimingEI.game_session \
from SessionTiming INNER JOIN SessionTimingEI \
ON SessionTiming.installation_id=SessionTimingEI.installation_id \
AND SessionTiming.game_session=SessionTimingEI.game_session')\
.drop('installation_id','game_session')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
SessionTimingDf.createOrReplaceTempView("SessionTiming")

SessionTimingDf=Spark.sql(f'SELECT *, \
SessionTiming.installation_id as id1,SessionTiming.game_session as gs1,\
SessionTimingW.installation_id,SessionTimingW.game_session \
from SessionTiming INNER JOIN SessionTimingW \
ON SessionTiming.installation_id=SessionTimingW.installation_id \
AND SessionTiming.game_session=SessionTimingW.game_session')\
.drop('installation_id','game_session')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
SessionTimingDf.createOrReplaceTempView("SessionTiming")

SessionTimingDf=Spark.sql(f'SELECT *, \
SessionTiming.installation_id as id1,SessionTiming.game_session as gs1,\
SessionTimingTyp.installation_id,SessionTimingTyp.game_session \
from SessionTiming INNER JOIN SessionTimingTyp \
ON SessionTiming.installation_id=SessionTimingTyp.installation_id \
AND SessionTiming.game_session=SessionTimingTyp.game_session')\
.drop('installation_id','game_session','agame_session','aid')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
SessionTimingDf.createOrReplaceTempView("SessionTiming")

SessionTimingDf=Spark.sql(f'SELECT *, \
SessionTiming.installation_id as id1,SessionTiming.game_session as gs1,\
SessionTimingTypC.installation_id,SessionTimingTypC.game_session \
from SessionTiming INNER JOIN SessionTimingTypC \
ON SessionTiming.installation_id=SessionTimingTypC.installation_id \
AND SessionTiming.game_session=SessionTimingTypC.game_session')\
.drop('installation_id','game_session','agame_session','aid')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
SessionTimingDf.createOrReplaceTempView("SessionTiming")

SessionTimingDf=Spark.sql(f'SELECT *, \
SessionTiming.installation_id as id1,SessionTiming.game_session as gs1,\
SessionTimingEC.installation_id,SessionTimingEC.game_session \
from SessionTiming INNER JOIN SessionTimingEC \
ON SessionTiming.installation_id=SessionTimingEC.installation_id \
AND SessionTiming.game_session=SessionTimingEC.game_session')\
.drop('installation_id','game_session','agame_session','aid')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
SessionTimingDf.createOrReplaceTempView("SessionTiming")

SessionTimingDf=Spark.sql(f'SELECT *, \
SessionTiming.installation_id as id1,SessionTiming.game_session as gs1,\
MiscSessionInfo.installation_id,MiscSessionInfo.game_session \
from SessionTiming INNER JOIN MiscSessionInfo \
ON SessionTiming.installation_id=MiscSessionInfo.installation_id \
AND SessionTiming.game_session=MiscSessionInfo.game_session')\
.drop('installation_id','game_session','agame_session','aid')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
SessionTimingDf.createOrReplaceTempView("SessionTiming")

## Get running total of duration and assessments across installations and sessions
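The lagged cumulative accuracy built in this section divides the running count of correct attempts by the running count of all attempts, both taken over earlier sessions only, and falls back to -1 when there is nothing to divide by. The sketch below shows that arithmetic for a single installation in plain Python; the function name and the sample history are hypothetical.

```python
def lagged_cumulative_accuracy(attempts):
    """attempts: list of (true_attempts, false_attempts) per session, in time
    order for one installation. Returns, for each session, the accuracy over
    all EARLIER sessions, or -1 when there were no earlier attempts."""
    out = []
    cum_true = 0
    cum_all = 0
    for n_true, n_false in attempts:
        # Use the totals accumulated before this session (the 'lag').
        out.append(cum_true / cum_all if cum_all else -1)
        cum_true += n_true
        cum_all += n_true + n_false
    return out

# Hypothetical history: no assessment, then 1 right / 1 wrong, then 2 wrong, then 1 right
history = [(0, 0), (1, 1), (0, 2), (1, 0)]
result = lagged_cumulative_accuracy(history)
print(result)
```

Lagging by one session keeps the current session's outcome out of its own features, which is the same reason the notebook applies `F.lag` before summing.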

In [None]:
%%time
windowval = Window.partitionBy('installation_id').orderBy('StartTime')

# get cumulative lag timings
ColumnsToSum=['SessionDuration']+SessionTimingDf.columns[:-14]+['misses_cnt']#Don't include key columns in summations
lag_summed_cols = [F.sum(F.lag(col(Column), 1).over(windowval)).over(windowval).alias(Column+'LagCum') \
                     for Column in ColumnsToSum]#lag one session to avoid double counting cumulative and current session

session_title_cols=list(SessionTimingTtlDf.columns)[2:]
missing_cols = [x for x in SessionTimingDf.columns if x not in ColumnsToSum+session_title_cols]\
                        +['SessionDuration','misses_cnt']


SessionTimingCumDf=SessionTimingDf.select(missing_cols+lag_summed_cols+session_title_cols).na.fill(0)

#Get the start time of previous session:
SessionTimingCumDf=SessionTimingCumDf\
.withColumn('PreviousSessStart',F.lag(col('StartTime'), 1).over(windowval))

# get time since last session
SessionTimingCumDf=SessionTimingCumDf\
.withColumn('TimeSinceLastSess'\
           ,F.unix_timestamp(col('StartTime'))-F.unix_timestamp(col('PreviousSessStart'))).na.fill(100000)

# get cumulative count of sessions up to and including the current session
SessionTimingCumDf=SessionTimingCumDf\
.withColumn("NumSessionsLagCum", F.count(col('game_session')).over(windowval))



#get rolling average duration
SessionTimingCumDf=SessionTimingCumDf.withColumn('duration_lag_mean',\
                                    (col('SessionDurationLagCum')/col('NumSessionsLagCum')))

#get rolling average #events
SessionTimingCumDf=SessionTimingCumDf.withColumn('numevents_lag_mean',\
                                    (col('NumEventsLagCum')/col('NumSessionsLagCum')))

In [None]:
%%time
#get some data on assessment sessions only
AssessmentsDf=SessionTimingCumDf.filter((col('NumAssessmentAttempts')>0))
AssessmentsDf=AssessmentsDf.withColumn('AssessmentDurationLag',F.lag(col('SessionDuration'), 1).over(windowval))\
            .na.fill(-1)

AssessmentsDf=AssessmentsDf.select('installation_id','game_session','AssessmentDurationLag')

#join the dataframes
SessionTimingCumDf.createOrReplaceTempView("SessionTimingCum")
AssessmentsDf.createOrReplaceTempView("Assessments")

SessionTimingCumDf=Spark.sql(f'SELECT *, \
SessionTimingCum.installation_id as id1,SessionTimingCum.game_session as gs1,\
Assessments.installation_id,Assessments.game_session \
from SessionTimingCum LEFT JOIN Assessments \
ON SessionTimingCum.installation_id=Assessments.installation_id \
AND SessionTimingCum.game_session=Assessments.game_session')\
.drop('installation_id','game_session','agame_session','aid')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')

In [None]:
%%time
#sum duration by type
Types=['Activity','Assessment','Clip','Game']
for Type in Types:
    SessionTimingCumDf=SessionTimingCumDf.withColumn(Type+'Dur',\
                                when(col('Type')==Type,col('SessionDuration'))\
                                             .otherwise(0).cast(IntegerType()))
#Get cum lag duration    
ColumnsToSum=[Type+'Dur' for Type in Types]#Don't include key columns in summations
lag_summed_cols = [F.sum(F.lag(col(Column), 1).over(windowval)).over(windowval).alias(Column+'LagCum') \
                     for Column in ColumnsToSum]

#get lag cum mean
lag_mean_cols = [F.avg(F.lag(col(Column), 1).over(windowval)).over(windowval).alias(Column+'LagMean') \
                     for Column in ColumnsToSum]

#get lag stddev
lag_std_cols = [F.stddev(F.lag(col(Column), 1).over(windowval)).over(windowval).alias(Column+'LagStd') \
                     for Column in ColumnsToSum]

#get lag max
lag_max_cols = [F.max(F.lag(col(Column), 1).over(windowval)).over(windowval).alias(Column+'Lagmax') \
                     for Column in ColumnsToSum]

SessionTimingCumDf=SessionTimingCumDf.select(SessionTimingCumDf.columns\
            +lag_summed_cols+lag_mean_cols+lag_std_cols+lag_max_cols).na.fill(0)    

In [None]:
%%time
#get game data
SessionTimingCumDf=SessionTimingCumDf.withColumn('game_missMeanLag',F.avg(F.lag(col('misses_cnt'), 1)\
                                    .over(windowval)).over(windowval)).na.fill(0)   
SessionTimingCumDf=SessionTimingCumDf.withColumn('game_missStdLag',F.stddev(F.lag(col('misses_cnt'), 1)\
                                    .over(windowval)).over(windowval)).na.fill(0)    

ColumnsToLag=['game_round', 'game_level','misses_cnt']
lag_cols = [F.lag(col(Column), 1).over(windowval).alias(Column+'Lag') \
                     for Column in ColumnsToLag]
SessionTimingCumDf=SessionTimingCumDf.select(SessionTimingCumDf.columns+lag_cols)
LaggedCols=[i+'Lag' for i in ColumnsToLag]
SessionTimingCumDf=SessionTimingCumDf.fillna(-1, subset=LaggedCols)


In [None]:
%%time
#do the same for assessment data
AssessmentDf=AllAssessmentDf.groupby("installation_id","game_session").sum()
MiscAssessmentInfoDf=MiscSessionInfoDf.select("installation_id","game_session"\
                                              ,'StartTime','NumAssessmentAttempts')

AssessmentDf.createOrReplaceTempView("Assessment")
MiscAssessmentInfoDf.createOrReplaceTempView("MiscAssessmentInfo")
SessionAssessmentDf=Spark.sql(f'SELECT *, \
Assessment.installation_id as id1,Assessment.game_session as gs1,\
MiscAssessmentInfo.installation_id as aid,MiscAssessmentInfo.game_session as agame_session \
from Assessment INNER JOIN MiscAssessmentInfo \
ON Assessment.installation_id=MiscAssessmentInfo.installation_id \
AND Assessment.game_session=MiscAssessmentInfo.game_session')\
.drop('installation_id','game_session','agame_session','aid')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
SessionAssessmentDf.createOrReplaceTempView("SessionAssessment")

In [None]:
%%time
#get lagged assessments
ColumnsToAdd=list(SessionAssessmentDf.columns)[:-4]+['NumAssessmentAttempts']#Don't include key columns in summation
lag_summed_cols = [F.sum(F.lag(col(Column), 1, 0).over(windowval)).over(windowval).alias(Column+'LagCum') \
                     for Column in ColumnsToAdd]#lag one session to avoid double counting cumulative and current session
missing_cols = [i for i in SessionAssessmentDf.columns if i not in ColumnsToAdd]


SessionAccuracyDf=SessionAssessmentDf.select(missing_cols+ColumnsToAdd+lag_summed_cols).na.fill(-1)

@pandas_udf('int', PandasUDFType.SCALAR)
def Count_Trues(Unq,World):
    return Unq+World

In [None]:
%%time
#Get time since last assessment
#Get the start time of previous session:
SessionAccuracyDf=SessionAccuracyDf\
.withColumn('PreviousSessStart',F.lag(col('StartTime'), 1).over(windowval))

# get time since last session
SessionAccuracyDf=SessionAccuracyDf\
.withColumn('TimeSinceLastAssess'\
           ,F.unix_timestamp(col('StartTime'))-F.unix_timestamp(col('PreviousSessStart'))).na.fill(100000)

In [None]:
%%time
#Get current accuracy
for Assess_Title in Assess_Titles:
    SessionAccuracyDf=SessionAccuracyDf.withColumn(Assess_Title+"Accy",\
                                   (col(f'sum({Assess_Title}True)')\
                                     /col(f'sum({Assess_Title}AllAttemps)'))).na.fill(-1)
    
SessionAccuracyDf=SessionAccuracyDf.withColumn('AllAssessmentAccy',\
                                   (col(f'sum(true_attempts)')\
                                     /(col(f'sum(true_attempts)')+col(f'sum(false_attempts)')))).na.fill(-1)

# Get cumulative lagged accuracy
for Assess_Title in Assess_Titles:
    SessionAccuracyDf=SessionAccuracyDf.withColumn(Assess_Title+'AccyLagCum',\
                                   col(f'sum({Assess_Title}True)LagCum')\
                                     /col(f'sum({Assess_Title}AllAttemps)LagCum')).na.fill(-1)
    
SessionAccuracyDf=SessionAccuracyDf.withColumn('AllAssessmentAccyLagCum',\
                    col('sum(true_attempts)LagCum')\
                    /(col('sum(true_attempts)LagCum')+col('sum(false_attempts)LagCum'))\
                    ).na.fill(-1)

In [None]:
%%time
#Get lagged features
ColumnsToLag=['Cart Balancer (Assessment)Accy', 'Cauldron Filler (Assessment)Accy', 'Bird Measurer (Assessment)Accy',
 'Mushroom Sorter (Assessment)Accy', 'Chest Sorter (Assessment)Accy', 'AllAssessmentAccy']
lag_cols = [F.lag(col(Column), 1).over(windowval).alias(Column+'Lag') \
                     for Column in ColumnsToLag]
SessionAccuracyDf=SessionAccuracyDf.select(SessionAccuracyDf.columns+lag_cols)
LaggedCols=[i+'Lag' for i in ColumnsToLag]
SessionAccuracyDf=SessionAccuracyDf.fillna(-1, subset=LaggedCols)
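
The accuracy groups assigned in the next cell follow the competition convention: 3 for a correct answer on the first attempt, 2 for success on the second attempt, 1 for success after three or more attempts, and 0 when the assessment was never solved. A small plain-Python sketch of the same decision order, with 'NoAssess' for sessions without any attempt as in the notebook, might look like this (the function name `accuracy_group` is illustrative):

```python
def accuracy_group(true_attempts, false_attempts):
    """Map per-session counts of correct and incorrect attempts to a group label,
    checking the conditions in the same order as the nested when/otherwise."""
    if true_attempts == 0 and false_attempts == 0:
        return "NoAssess"   # session contained no assessment attempt
    if true_attempts == 1 and false_attempts == 0:
        return "3"          # solved on the first attempt
    if true_attempts == 1 and false_attempts == 1:
        return "2"          # solved on the second attempt
    if true_attempts == 0:
        return "0"          # attempted but never solved
    return "1"              # solved after three or more attempts

for counts in [(0, 0), (1, 0), (1, 1), (1, 3), (0, 4)]:
    print(counts, accuracy_group(*counts))
```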

In [None]:
%%time
#Add accuracy_group for each assessment title across each session
Assess_Titles=list_of_assess_titlesDf.toPandas()[['title']].values.tolist()
for Assess_Title in Assess_Titles:
    #Create accuracy group by assessment title
    true_col=col(f'sum({Assess_Title[0]}True)')
    false_col=col(f'sum({Assess_Title[0]}False)')
    SessionAccuracyDf=SessionAccuracyDf.withColumn(Assess_Title[0]+'_accuracy_group',\
         when((true_col==0)&(false_col==0),'NoAssess')\
         .when((true_col==1)&(false_col==0),'3')\
         .when((true_col==1)&(false_col==1),'2')\
         .when(true_col==0,'0')\
         .otherwise('1'))
#Create overall accuracy group
SessionAccuracyDf=SessionAccuracyDf.withColumn('all_accuracy_group',\
     when((col('sum(true_attempts)')==0)&(col('sum(false_attempts)')==0),'NoAssess')\
     .when((col('sum(true_attempts)')==1)&(col('sum(false_attempts)')==0),'3')\
     .when((col('sum(true_attempts)')==1)&(col('sum(false_attempts)')==1),'2')\
     .when(col('sum(true_attempts)')==0,'0')\
     .otherwise('1'))
ColumnsToLag=[Title[0]+'_accuracy_group' for Title in Assess_Titles]


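#Add lagged accuracies: take the previous session's accuracy only if that session included an assessment,
#then carry the last known value forward over later sessions
Condition=F.lag(col('all_accuracy_group'),1).over(windowval)!='NoAssess'
SessionAccuracyDf=SessionAccuracyDf.withColumn("Lag_all_accuracy", F.when(Condition\
                                    , F.lag(col('AllAssessmentAccy'),1).over(windowval)))
#NB: na.fill(-1) without a subset fills nulls in every numeric column, not just the new one
SessionAccuracyDf=SessionAccuracyDf.withColumn("Lag_all_accuracy"\
                            ,F.last(col('Lag_all_accuracy'),ignorenulls=True).over(windowval))\
                            .na.fill(-1)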

for Assess_Title in Assess_Titles:
    Condition=F.lag(col(f'{Assess_Title[0]}_accuracy_group'),1).over(windowval)!='NoAssess'
    SessionAccuracyDf=SessionAccuracyDf.withColumn(f'Lag_{Assess_Title[0]}_accuracy', F.when(Condition\
                                    , F.lag(col(f'{Assess_Title[0]}Accy'),1).over(windowval)))
    SessionAccuracyDf=SessionAccuracyDf.withColumn(f'Lag_{Assess_Title[0]}_accuracy'\
                            ,F.last(col(f'Lag_{Assess_Title[0]}_accuracy'),ignorenulls=True).over(windowval))\
                            .na.fill(-1)

    
#Create lagged accuracy group 
SessionAccuracyDf=SessionAccuracyDf.withColumn('Lag_all_accuracy_group',\
     when((col('Lag_all_accuracy')==0),0)\
            .otherwise(\
                   when((col('Lag_all_accuracy')==1),3\
                 ).otherwise(\
                        when((col('Lag_all_accuracy')==0.5),2\
                                ).otherwise(\
                                             when((col('Lag_all_accuracy')==-1),-1\
                                                  ).otherwise(1)\
                                ))))
ColumnsToLag=[Title[0]+'_accuracy_group' for Title in Assess_Titles]
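
The nested `when`/`otherwise` chains above are hard to read, so here is a plain-Python sketch of the same rules for one installation. It assumes that `windowval` orders sessions within an installation and that its frame runs from the first session up to the current one, which I have not restated here. The function names and the sample data are hypothetical.

```python
def accuracy_group(true_attempts, false_attempts):
    """Same branch order as the when/otherwise chain used for the accuracy groups."""
    if true_attempts == 0 and false_attempts == 0:
        return 'NoAssess'   # no assessment attempt in this session
    if true_attempts == 1 and false_attempts == 0:
        return '3'          # solved on the first attempt
    if true_attempts == 1 and false_attempts == 1:
        return '2'          # solved on the second attempt
    if true_attempts == 0:
        return '0'          # attempted but never solved
    return '1'              # every other case


def lagged_accuracy(groups, accuracies, default=-1):
    """Accuracy of the most recent earlier session that included an assessment."""
    result, last_seen = [], None
    for i in range(len(groups)):
        # Only a previous session with an assessment updates the carried value.
        if i > 0 and groups[i - 1] != 'NoAssess':
            last_seen = accuracies[i - 1]
        result.append(default if last_seen is None else last_seen)
    return result


def lagged_accuracy_group(lag_accuracy):
    """Map a lagged accuracy to its group; -1 means no earlier assessment."""
    return {0: 0, 1: 3, 0.5: 2, -1: -1}.get(lag_accuracy, 1)


# Hypothetical sessions for one installation, in time order.
groups = ['NoAssess', '3', 'NoAssess', '0', 'NoAssess']
accuracies = [0, 1.0, 0, 0.0, 0]
lags = lagged_accuracy(groups, accuracies)
lag_groups = [lagged_accuracy_group(a) for a in lags]
print(lags, lag_groups)
```

Sessions without an assessment never overwrite the carried value, which is what the `F.when(Condition, ...)` followed by `F.last(..., ignorenulls=True)` pair is meant to achieve.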

## Collate the various dataframes

In [None]:
SessionAccuracyDf=SessionAccuracyDf.drop('StartTime','NumAssessmentAttemptsLagCum'\
                                        ,'PreviousSessStart','TimeSinceLastSess')

In [None]:
%%time
#Collate all details of assessment sessions - we will filter non-assessment sessions at the end.  

#Join the frame with current session assessments
SessionAccuracyDf.createOrReplaceTempView("SessionAccuracy")
SessionTimingCumDf.createOrReplaceTempView("SessionTimingCum")
reduce_CombinedDf=Spark.sql(f'SELECT *,\
SessionTimingCum.installation_id as id1,SessionTimingCum.game_session as gs1 \
from SessionTimingCum LEFT JOIN SessionAccuracy \
ON SessionTimingCum.installation_id=SessionAccuracy.installation_id \
AND SessionTimingCum.game_session=SessionAccuracy.game_session').drop('installation_id','game_session')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')


reduce_CombinedDf=reduce_CombinedDf.na.drop()

reduce_CombinedDf.createOrReplaceTempView("reduce_Combined")


#Remove non-assessment sessions: their data is captured in the 'Cum' fields, so we keep only assessment sessions
reduce_CombinedDf=reduce_CombinedDf.filter(reduce_CombinedDf['NumAssessmentAttempts']>0)


In [None]:
%%time
#one-hot encode title 
SessionTitleDf=SessionTimingCumDf.groupBy("installation_id","game_session").pivot("r_session_title").count().na.fill(0)
SessionTitleDf.createOrReplaceTempView("SessionTitle")
reduce_CombinedDf.createOrReplaceTempView("reduce_Combined")
reduce_CombinedDf=Spark.sql(f'SELECT *,\
reduce_Combined.installation_id as id1,reduce_Combined.game_session as gs1 \
from reduce_Combined INNER JOIN SessionTitle \
ON reduce_Combined.installation_id=SessionTitle.installation_id \
AND reduce_Combined.game_session=SessionTitle.game_session').drop('installation_id','game_session')\
.withColumnRenamed('id1','installation_id').withColumnRenamed('gs1','game_session')
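
As a rough illustration of the `groupBy(...).pivot(...).count().na.fill(0)` pattern used above, this plain-Python sketch builds the same kind of table from a few made-up rows. It only yields a true 0/1 encoding if each session appears once per title, which I am assuming holds for `SessionTimingCumDf`. The function name and the sample rows are hypothetical.

```python
from collections import Counter


def one_hot_titles(rows):
    """rows: iterable of (installation_id, game_session, title) tuples."""
    rows = list(rows)
    # Every distinct title becomes one output column, as pivot() does.
    titles = sorted({title for _, _, title in rows})
    # Number of rows per (installation_id, game_session, title), as count() does.
    counts = Counter(rows)
    sessions = sorted({(inst, sess) for inst, sess, _ in rows})
    # Missing combinations default to 0, mirroring na.fill(0).
    return {
        (inst, sess): {t: counts.get((inst, sess, t), 0) for t in titles}
        for inst, sess in sessions
    }


# Hypothetical sessions.
sample_rows = [
    ('a', 's1', 'Chow Time'),
    ('a', 's2', 'Bubble Bath'),
    ('b', 's3', 'Chow Time'),
]
encoded = one_hot_titles(sample_rows)
print(encoded[('a', 's1')])
```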

In [None]:
reduce_CombinedDf=reduce_CombinedDf.na.drop()

In [None]:
#Remove '(', ')', ',' and spaces, as parquet doesn't accept them in field names
unwanted_chars=str.maketrans('', '', '(), ')
reduce_CombinedDf=reduce_CombinedDf.toDF(*(c.translate(unwanted_chars) for c in reduce_CombinedDf.columns))
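
The renaming above can be tried out on plain strings before touching the dataframe. The sketch below uses a hypothetical helper, `sanitise_column_name`, and also checks that no two columns collapse to the same name once the characters are removed, which would otherwise produce ambiguous columns.

```python
def sanitise_column_name(name, unwanted='(), '):
    """Delete every character in `unwanted` from a column name."""
    return name.translate(str.maketrans('', '', unwanted))


# Column names in the style produced by the earlier aggregation steps.
raw_names = [
    'sum(Cart Balancer (Assessment)True)',
    'Bird Measurer (Assessment)_accuracy_group',
    "sum(Pirate's Tale)",
]
clean_names = [sanitise_column_name(n) for n in raw_names]

# Guard against two different columns ending up with the same cleaned name.
assert len(set(clean_names)) == len(clean_names)
print(clean_names)
```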

## Identify the features

In [None]:
KeyColumns =['installation_id', 'game_session', 'StartTime', 'PreviousSessStart','TestOrTrain','Type','TestFlag','EndTime',
 'r_session_title','World']

In [None]:
AccuracyGroupColumns =['CartBalancerAssessment_accuracy_group', 'CauldronFillerAssessment_accuracy_group'\
, 'BirdMeasurerAssessment_accuracy_group', 'MushroomSorterAssessment_accuracy_group'\
, 'ChestSorterAssessment_accuracy_group','all_accuracy_group']

In [None]:
CurrentAssessmentColumns=['accuracy_group',
 'NumAssessmentAttempts',
 'sumtrue_attempts',
 'sumfalse_attempts',
 'sumCartBalancerAssessmentTrue',
 'sumCartBalancerAssessmentFalse',
 'sumCartBalancerAssessmentAllAttemps',
 'sumCauldronFillerAssessmentTrue',
 'sumCauldronFillerAssessmentFalse',
 'sumCauldronFillerAssessmentAllAttemps',
 'sumBirdMeasurerAssessmentTrue',
 'sumBirdMeasurerAssessmentFalse',
 'sumBirdMeasurerAssessmentAllAttemps',
 'sumMushroomSorterAssessmentTrue',
 'sumMushroomSorterAssessmentFalse',
 'sumMushroomSorterAssessmentAllAttemps',
 'sumChestSorterAssessmentTrue',
 'sumChestSorterAssessmentFalse',
 'sumChestSorterAssessmentAllAttemps',
 'CartBalancerAssessmentAccy',
 'CauldronFillerAssessmentAccy',
 'BirdMeasurerAssessmentAccy',
 'MushroomSorterAssessmentAccy',
 'ChestSorterAssessmentAccy',
 'AllAssessmentAccy',
  'NumEvents','misses_cnt', 'game_round',
 'game_level',
 'SessionDuration','sum12Monkeys',
 'sumAirShow',
 'sumAllStarSorting',
 'sumBalancingAct',
 'sumBirdMeasurerAssessment',
 'sumBottleFillerActivity',
 'sumBubbleBath',
 'sumBugMeasurerActivity',
 'sumCartBalancerAssessment',
 'sumCauldronFillerAssessment',
 'sumChestSorterAssessment',
 'sumChickenBalancerActivity',
 'sumChowTime',
 'sumCostumeBox',
 'sumCrystalCaves-Level1',
 'sumCrystalCaves-Level2',
 'sumCrystalCaves-Level3',
 'sumCrystalsRule',
 'sumDinoDive',
 'sumDinoDrink',
 'sumEggDropperActivity',
 'sumFireworksActivity',
 'sumFlowerWatererActivity',
 'sumHappyCamel',
 'sumHeavyHeavierHeaviest',
 'sumHoneyCake',
 'sumLeafLeader',
 'sumLiftingHeavyThings',
 'sumMagmaPeak-Level1',
 'sumMagmaPeak-Level2',
 'sumMushroomSorterAssessment',
 'sumOrderingSpheres',
 'sumPanBalance',
 "sumPirate'sTale",
 'sumRulers',
 'sumSandcastleBuilderActivity',
 'sumScrub-A-Dub',
 'sumSlopProblem',
 'sumTreasureMap',
 'sumTreeTopCity-Level1',
 'sumTreeTopCity-Level2',
 'sumTreeTopCity-Level3',
 'sumWateringHoleActivity',
 'sumWelcometoLostLagoon!', 'ActivityDur',
 'AssessmentDur',
 'ClipDur',
 'GameDur',
 'sumwin_code'
]      

In [None]:
feature_cols=[x for x in reduce_CombinedDf.columns if (x not in AccuracyGroupColumns) and (x not in KeyColumns) \
              and (x not in CurrentAssessmentColumns) ]

In [None]:
#Sanity check: list the features that are not lagged values
[x for x in feature_cols if 'Lag' not in x] 
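
The feature list is defined by exclusion: anything that is not a key, a target label or a current-assessment field is treated as a feature. A small sketch of that idea, using a hypothetical helper and made-up column names, looks like this:

```python
def select_feature_columns(all_columns, *excluded_groups):
    """Keep the columns, in their original order, that appear in none of the excluded groups."""
    excluded = set().union(*excluded_groups)
    return [c for c in all_columns if c not in excluded]


# Made-up column names standing in for reduce_CombinedDf.columns.
all_columns = ['installation_id', 'accuracy_group', 'AllAssessmentAccyLag', 'GameDurCum']
key_columns = ['installation_id', 'game_session']
current_columns = ['accuracy_group']

features = select_feature_columns(all_columns, key_columns, current_columns)
print(features)
```

The catch with selecting by exclusion is that any new column added upstream silently becomes a feature, including one that leaks the current assessment, so it is worth eyeballing the resulting list.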

In [None]:
# vectorise the features
vectorAssembler = VectorAssembler(inputCols=feature_cols,outputCol='features',handleInvalid="skip")
assembledDf = vectorAssembler.transform(reduce_CombinedDf).drop(*feature_cols)

In [None]:
#Convert target labels to integers ('NoAssess' is not numeric, so a non-ANSI cast turns it into null)
cols_to_convert = [col(Column).cast(IntegerType()).alias(Column) for Column in AccuracyGroupColumns]

#All remaining columns, kept unchanged
missing_cols = [i for i in assembledDf.columns if i not in AccuracyGroupColumns]

assembledDf=assembledDf.select(missing_cols+cols_to_convert)
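
One thing to keep in mind with the cast above is what happens to the 'NoAssess' labels. The plain-Python sketch below mimics the behaviour I would expect from a non-ANSI Spark cast of these string labels, where unparseable values become null. The helper name is hypothetical and this is an analogy rather than Spark's own code.

```python
def to_int_or_none(label):
    """Parse a label as an integer, returning None when it is not numeric."""
    try:
        return int(label)
    except (TypeError, ValueError):
        return None


# Hypothetical per-title accuracy groups for four sessions.
labels = ['3', '0', 'NoAssess', '1']
converted = [to_int_or_none(x) for x in labels]
print(converted)
```

So a per-title label is only usable as a target on rows where that title was actually assessed; the other rows carry a missing value.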

And we have finally finished the feature engineering. Wow, that took forever! Much of the time was spent learning how to do things in PySpark, though to paraphrase Thomas Edison, data science is 90% data wrangling and 10% applying algorithms. Hopefully innovations like automated feature engineering will help here: https://towardsdatascience.com/why-automated-feature-engineering-will-change-the-way-you-do-machine-learning-5c15bf188b96

In [None]:
%%time
#This is the step that takes the longest, as it triggers the whole lazy execution plan, and it might test the Kaggle machine's disk
assembledDf.write.mode("overwrite").parquet('assembledDf.parquet')