# Ecommerce Churn Assignment

The aim of this assignment is to build a model that predicts whether a customer purchases an item after adding it to the cart. Since this is a classification problem, you are expected to apply all three models covered so far, select the most robust one, and use it to predict churn in the most suitable manner.

For this assignment, you are provided with data from an e-commerce company for the month of October 2019. Your task is to first analyse the data and then carry out the steps of the model-building process.

The broad tasks are:
- Data Exploration
- Feature Engineering
- Model Selection
- Model Inference

### Data description

The dataset stores information about customer sessions on the e-commerce platform. Each record captures one activity and its associated parameters.

- **event_time**: Date and time when user accesses the platform
- **event_type**: Action performed by the customer
    - View
    - Cart
    - Purchase
    - Remove from cart
- **product_id**: Unique number to identify the product in the event
- **category_id**: Unique number to identify the category of the product
- **category_code**: Stores primary and secondary categories of the product
- **brand**: Brand associated with the product
- **price**: Price of the product
- **user_id**: Unique ID for a customer
- **user_session**: Session ID for a user


### Initialising the SparkSession

The dataset provided is 5 GB in size, so you are expected to increase the driver memory. The m4.xlarge instance has 16 GB of memory, of which 14 GB is allocated to the driver.

In [1]:
# initialising the session with 14 GB driver memory
MAX_MEMORY = "14g"
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('DT').config("spark.driver.memory", MAX_MEMORY).getOrCreate()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
5,application_1595531932238_0006,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
# importing required libraries
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.mllib.evaluation import MulticlassMetrics

In [3]:
# loading the dataset from the parquet file
df = spark.read.parquet("s3://asr-aiml-2020/processed.parquet")

In [4]:
df = df.withColumnRenamed("is_purchased","label")

In [5]:
from pyspark.sql.types import DoubleType

df = df.withColumn("label", df["label"].cast(DoubleType()))

In [6]:
# Check if the dataframe is correctly loaded
df.show(5)

+-------+------+------------+-----------+-------------+-----------+--------------------+---------------------+---------+-----+-----------------+---------+------------+---------------+--------------+--------------+---------------+--------------------+
|  brand| price|    Category|SubCategory|ActivityCount|ProductView|SubCategoryViewCount|AvgSpendingOnCategoty|EventHour|label|EventHour_Buckets|brand_idx|Category_idx|SubCategory_idx|     brand_Enc|  Category_enc|SubCategory_enc|            features|
+-------+------+------------+-----------+-------------+-----------+--------------------+---------------------+---------+-----+-----------------+---------+------------+---------------+--------------+--------------+---------------+--------------------+
|samsung|172.14| electronics| smartphone|            5|         10|                  76|   464.33546803887987|       22|  1.0|                3|      0.0|         0.0|            0.0|(20,[0],[1.0])|(12,[0],[1.0])| (37,[0],[1.0])|(75,[0,20,32,69,7.

In [7]:
# exploring the dataframe - schema
df.schema

StructType(List(StructField(brand,StringType,true),StructField(price,DoubleType,true),StructField(Category,StringType,true),StructField(SubCategory,StringType,true),StructField(ActivityCount,IntegerType,true),StructField(ProductView,IntegerType,true),StructField(SubCategoryViewCount,IntegerType,true),StructField(AvgSpendingOnCategoty,DoubleType,true),StructField(EventHour,IntegerType,true),StructField(label,DoubleType,true),StructField(EventHour_Buckets,IntegerType,true),StructField(brand_idx,DoubleType,true),StructField(Category_idx,DoubleType,true),StructField(SubCategory_idx,DoubleType,true),StructField(brand_Enc,VectorUDT,true),StructField(Category_enc,VectorUDT,true),StructField(SubCategory_enc,VectorUDT,true),StructField(features,VectorUDT,true)))
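
The `features` column is a 75-slot vector, and the widths shown by `df.show(5)` suggest how it is laid out. The sketch below recomputes the starting index of each block, assuming the encoded columns were concatenated in the order `brand_Enc`, `Category_enc`, `SubCategory_enc`, followed by one slot per numeric column (an assumption about the preprocessing, which is not shown here; the numeric order is the one reported by the feature-importance table at the end of this notebook). `feature_offsets` is a hypothetical helper, not a Spark function.

```python
# Widths of the one-hot vectors, read from the df.show(5) output:
# brand_Enc is (20,...), Category_enc is (12,...), SubCategory_enc is (37,...).
encoded_widths = [("brand_Enc", 20), ("Category_enc", 12), ("SubCategory_enc", 37)]

# Assumption: the remaining slots hold these numeric columns, one slot each.
numeric_columns = ["price", "ActivityCount", "ProductView",
                   "SubCategoryViewCount", "EventHour_Buckets",
                   "AvgSpendingOnCategoty"]


def feature_offsets(encoded, numeric):
    """Return ({column: first index in the assembled vector}, total size)."""
    offsets, position = {}, 0
    for name, width in encoded:
        offsets[name] = position
        position += width
    for name in numeric:
        offsets[name] = position
        position += 1
    return offsets, position


offsets, total = feature_offsets(encoded_widths, numeric_columns)
for name, start in offsets.items():
    print(f"{name}: starts at index {start}")
print("vector size:", total)
```

Under these assumptions the blocks start at 0, 20, 32 and 69, which is consistent with the leading indices `[0,20,32,69,...` in the sample row and with the vector size of 75.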

In [8]:
# Number of rows and columns in the dataset
print(df.count(), len(df.columns))

784361 18

<hr>

## Task 3: Model Selection
Three models are considered for classification:
- Logistic Regression
- Decision Tree
- Random Forest

### Model 2: Decision Trees

In [9]:
# Splitting the data into train and test sets (remember, you are expected to compare the models later)
train, test = df.randomSplit([0.7,0.3], seed=5043)

In [10]:
# Number of rows in train and test data
print(train.count(), test.count())

549630 234731
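
`randomSplit` assigns rows by random sampling, so the resulting sizes only approximate the requested 70/30 weights. A minimal check of the realised proportions, using the counts printed in this notebook (the variable names are illustrative):

```python
# Row counts printed by the notebook
total_rows = 784361
train_rows, test_rows = 549630, 234731

# The two splits should partition the full dataset
assert train_rows + test_rows == total_rows

train_fraction = train_rows / total_rows
test_fraction = test_rows / total_rows
print(f"train: {train_fraction:.4f}, test: {test_fraction:.4f}")
```

The realised fractions are close to, but not exactly, 0.7 and 0.3; fixing `seed` makes the split reproducible, not exact.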

#### Model Fitting

In [None]:
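# Use the Spark ML estimator (not scikit-learn): the cell below passes labelCol and featuresCol
from pyspark.ml.classification import DecisionTreeClassifier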

In [11]:
# Building the model with hyperparameter tuning
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
dtevaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
# Create ParamGrid for Cross Validation
dtparamGrid = (ParamGridBuilder() 
             .addGrid(dt.maxDepth, [10, 20, 30])
             .addGrid(dt.maxBins, [10, 30, 60, 90])
             .build())


In [12]:
# Run cross-validation steps
dtcv = CrossValidator(estimator = dt,
                      estimatorParamMaps = dtparamGrid,
                      evaluator = dtevaluator,
                      numFolds = 4)
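
Before fitting, it can help to estimate how many models the cross-validation will train. The sketch below enumerates the same grid values in plain Python; it assumes `CrossValidator` fits every parameter combination once per fold and then refits the best combination on the full training set. The variable names are illustrative.

```python
from itertools import product

# Grid values copied from the ParamGridBuilder cell
max_depth_values = [10, 20, 30]
max_bins_values = [10, 30, 60, 90]
num_folds = 4

# Every (maxDepth, maxBins) pair is one candidate model configuration
grid = list(product(max_depth_values, max_bins_values))

cv_fits = len(grid) * num_folds   # one fit per candidate per fold
total_fits = cv_fits + 1          # plus the final refit of the best candidate

print(f"{len(grid)} candidates x {num_folds} folds = {cv_fits} fits "
      f"(+1 refit = {total_fits})")
```

With trees up to depth 30 on roughly 550,000 training rows, this many fits can take a while, which is worth keeping in mind when widening the grid.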

In [13]:
# Fitting the cross-validated models on the training data
dtcvModel = dtcv.fit(train)


In [14]:
# Best model from the results of cross-validation
bestModel = dtcvModel.bestModel

#### Model Analysis

Required Steps:
- Predict on test data
- Performance analysis
    - Appropriate Metric with reasoning

In [15]:
dtpredictions = bestModel.transform(test)

In [16]:
# BinaryClassificationEvaluator reports areaUnderROC by default, not accuracy
print('Area under ROC:', dtevaluator.evaluate(dtpredictions))

Area under ROC: 0.7059376478626083

In [17]:
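# Note: BinaryClassificationMetrics expects (score, label) pairs; this passes (label, prediction),
# i.e. hard 0/1 predictions with the columns swapped, so the value is not a standard ROC AUC
print('AUC:', BinaryClassificationMetrics(dtpredictions['label','prediction'].rdd).areaUnderROC)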

AUC: 0.7189282691738145
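
One way to sanity-check this AUC is to recompute it from the test-set confusion counts reported in this notebook. The sketch below assumes that, for hard 0/1 scores, the ROC curve is the polyline through a single operating point, and that the confusion matrix rows are actual labels with columns as predictions. `single_point_auc` is a hypothetical helper, not part of Spark.

```python
# Test-set counts taken from the confusion matrix printed in this notebook
# (assumed layout: rows = actual label, columns = predicted label).
tn, fp = 79002, 33971
fn, tp = 31919, 89839


def single_point_auc(true_positive_rate, false_positive_rate):
    """Area under the ROC polyline (0,0) -> (FPR,TPR) -> (1,1)."""
    return (true_positive_rate + 1.0 - false_positive_rate) / 2.0


# Intended ordering: score = prediction, label = actual label.
auc_intended = single_point_auc(tp / (tp + fn), fp / (fp + tn))

# Ordering used by the cell: score = actual label, label = prediction.
# The "rates" are then conditioned on the prediction instead of the label.
auc_swapped = single_point_auc(tp / (tp + fp), fn / (fn + tn))

print(f"intended ordering: {auc_intended:.6f}")
print(f"swapped ordering:  {auc_swapped:.6f}")
```

Under these assumptions the swapped ordering reproduces the printed value of about 0.7189, which suggests the cell passes its columns as (label, prediction) rather than (score, label). Either way, an AUC built from hard predictions summarises a single threshold, so the evaluator's value computed from `rawPrediction` is the more informative ranking metric.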

In [18]:
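def confusion_matrix(pred_df):
    # MulticlassMetrics expects (prediction, label) pairs;
    # in the result, rows are actual labels and columns are predicted labels
    rdd = pred_df.select(['prediction', 'label']).rdd.map(tuple)
    metrics = MulticlassMetrics(rdd)
    return metrics.confusionMatrix().toArray()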

In [19]:
print(confusion_matrix(dtpredictions))

[[79002. 33971.]
 [31919. 89839.]]
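
The confusion matrix is enough to derive the usual threshold-based metrics, which helps when choosing an "appropriate metric with reasoning". A minimal sketch in plain Python, assuming rows are actual labels and columns are predicted labels, with label 0 first:

```python
# Confusion matrix as printed above: [[TN, FP], [FN, TP]] under the
# assumption that rows are actual labels and columns are predictions.
confusion = [[79002.0, 33971.0],
             [31919.0, 89839.0]]
(tn, fp), (fn, tp) = confusion

total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all test rows classified correctly
precision = tp / (tp + fp)        # predicted purchases that were real purchases
recall = tp / (tp + fn)           # real purchases that the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The four cells sum to the 234731 test rows, and the two classes are of similar size, so accuracy is not badly distorted by imbalance here.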

#### Summary of the best Decision Tree model

In [20]:
sc.install_pypi_package("pandas")

Collecting pandas
  Using cached https://files.pythonhosted.org/packages/af/f3/683bf2547a3eaeec15b39cef86f61e921b3b187f250fcd2b5c5fb4386369/pandas-1.0.5-cp37-cp37m-manylinux1_x86_64.whl
Collecting python-dateutil>=2.6.1 (from pandas)
  Using cached https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl
Installing collected packages: python-dateutil, pandas
Successfully installed pandas-1.0.5 python-dateutil-2.8.1

In [21]:
import pandas as pd
def ExtractFeatureImp(featureImp, dataset, featuresCol):
    """Map feature importances to the feature names stored in the vector column's metadata."""
    list_extract = []
    # "attrs" groups the features by attribute type (e.g. numeric, binary)
    attrs = dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]
    for i in attrs:
        list_extract = list_extract + attrs[i]
    varlist = pd.DataFrame(list_extract)
    # Look up each feature's importance by its index in the feature vector
    varlist['score'] = varlist['idx'].apply(lambda x: featureImp[x])
    return varlist.sort_values('score', ascending=False)

In [22]:
ExtractFeatureImp(bestModel.featureImportances, df, "features").head(10)

    idx                        name     score
0    69                       price  0.232203
1    70               ActivityCount  0.228910
3    72        SubCategoryViewCount  0.201989
2    71                 ProductView  0.155227
4    73           EventHour_Buckets  0.057929
6     0           brand_Enc_samsung  0.023280
38   32  SubCategory_enc_smartphone  0.010880
8     2            brand_Enc_xiaomi  0.008114
9     3            brand_Enc_others  0.007908
5    74       AvgSpendingOnCategoty  0.006387
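
Because one-hot encoding spreads a categorical column over many slots, it can be easier to read the table after summing scores per source column. The sketch below does this for the ten rows shown above only, so the totals for the encoded columns are lower bounds (the remaining 65 features are not listed). `source_column` is a hypothetical helper.

```python
from collections import defaultdict

# (name, score) pairs copied from the top-10 table above
top_features = [
    ("price", 0.232203),
    ("ActivityCount", 0.228910),
    ("SubCategoryViewCount", 0.201989),
    ("ProductView", 0.155227),
    ("EventHour_Buckets", 0.057929),
    ("brand_Enc_samsung", 0.023280),
    ("SubCategory_enc_smartphone", 0.010880),
    ("brand_Enc_xiaomi", 0.008114),
    ("brand_Enc_others", 0.007908),
    ("AvgSpendingOnCategoty", 0.006387),
]

ENCODED_PREFIXES = ("brand_Enc", "Category_enc", "SubCategory_enc")


def source_column(feature_name):
    """Strip the one-hot suffix so 'brand_Enc_samsung' maps to 'brand_Enc'."""
    for prefix in ENCODED_PREFIXES:
        if feature_name.startswith(prefix + "_"):
            return prefix
    return feature_name


grouped = defaultdict(float)
for name, score in top_features:
    grouped[source_column(name)] += score

top10_total = sum(score for _, score in top_features)

for column, score in sorted(grouped.items(), key=lambda kv: -kv[1]):
    print(f"{column:22s} {score:.6f}")
print(f"top-10 share of total importance: {top10_total:.4f}")
```

Among the listed rows, the four behavioural and price features dominate, while the listed brand and sub-category indicators contribute only a few percent in total.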