In [2]:
# Setting the environment variables
import os
import sys
import pandas as pd
import numpy as np
os.environ["PYSPARK_PYTHON"]="/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"]="/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"]="notebook --no-browser"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
os.environ["SPARK_HOME"] = "/home/ec2-user/spark-2.4.4-bin-hadoop2.7"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

# Ecommerce Churn Assignment

The aim of the assignment is to build a model that predicts whether a person purchases an item after it has been added to the cart. Since this is a classification problem, you are expected to apply your understanding of all three models covered so far. You must select the most robust model and provide a solution that predicts churn as reliably as possible.

For this assignment, you are provided with data from an e-commerce company for the month of October 2019. Your task is to first analyse the data and then carry out the steps of the model building process.

The broad tasks are:
- Data Exploration
- Feature Engineering
- Model Selection
- Model Inference

### Data description

The dataset stores information about customer sessions on the e-commerce platform. It records each activity and its associated parameters.

- **event_time**: Date and time when the user accesses the platform
- **event_type**: Action performed by the customer
    - View
    - Cart
    - Purchase
    - Remove from cart
- **product_id**: Unique number to identify the product in the event
- **category_id**: Unique number to identify the category of the product
- **category_code**: Stores primary and secondary categories of the product
- **brand**: Brand associated with the product
- **price**: Price of the product
- **user_id**: Unique ID for a customer
- **user_session**: Session ID for a user


### Initialising the SparkSession

The dataset provided is 5 GB in size, so you are expected to increase the driver memory above the default. You can refer to notebook 1 for the steps involved.

In [3]:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, col, avg, mean, isnan, when, count, to_timestamp
from pyspark.sql.types import IntegerType
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
from datetime import timedelta
import pyspark.sql.functions as f

In [4]:
# initialising the session with 14 GB driver memory
MAX_MEMORY = "14G"

spark = SparkSession \
    .builder \
    .appName("demo") \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()

spark

In [5]:
# Loading the clean data
df_clean = spark.read.csv('/home/ec2-user/df_updated2', header=False, inferSchema=True)

<hr>

## Task 4: Model Inference

- Feature Importance
- Model Inference
- Feature exploration

## Logistic Regression

- Features considered

categorical_features=['brand_new','category2', 'day_of_week']
continuous_features=['price', 'product_view_counts', 'category2_view_counts', 'average_shopping_expense', 'session_counts', 'activity_countval']


- Model Inference

The model produced the following evaluation metrics:
- Accuracy: 81%
- Recall: 85%
- Precision: 81%
- fMeasure: 83%
- Area under ROC: 88%
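
As a quick consistency check on the numbers above, the fMeasure can be recomputed from the reported precision and recall. This is a minimal sketch that assumes fMeasure here means the F1 score (the harmonic mean with beta = 1). The helper name `f_measure` is introduced only for illustration and is not part of the notebook.

```python
def f_measure(precision, recall, beta=1.0):
    """Return the F-beta score; beta=1 gives the F1 score (harmonic mean)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall reported for Logistic Regression above
lr_f1 = f_measure(precision=0.81, recall=0.85)
print(round(lr_f1, 2))  # 0.83
```

The result rounds to 83%, which agrees with the reported fMeasure.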

## Decision Tree

- Features considered

all_categorical_features = ['day_of_week', 'category1', 'category2', 'binnedhour', 'brand_new']
all_continuous_features = ['price', 'activity_countval', 'product_view_counts', 'category2_view_counts', 'session_counts', 'average_shopping_expense']

- Model Inference

The model produced the following evaluation metrics:
- Accuracy: 96%
- Recall: 96%
- Precision: 96%
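
Accuracy, recall, and precision are all derived from the confusion-matrix counts of a binary classifier. The sketch below shows how the three relate. The function name `classification_metrics` and the counts passed to it are hypothetical, chosen only to illustrate the arithmetic, and are not taken from the notebook's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted positives, how many are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual positives, how many are found
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts for illustration only (not the model's actual output)
metrics = classification_metrics(tp=96, fp=4, fn=4, tn=96)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

When recall and precision are both high, the model misses few actual purchases and raises few false alarms.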

### Final observation

- Comparing the two models, the Decision Tree achieves higher accuracy, recall, and precision (96% each) than Logistic Regression (81%, 85%, and 81% respectively). It therefore appears to be the better model for identifying whether a product will move from cart to order (i.e., whether the person will eventually buy the product).