# Goal

Segment customers into meaningful groups using:
1. K-means clustering
    * 1a.  All event data 
    * 1b. RFI clustering 
2. RFM analysis
    * Analysis of RFI clustering, RFM manual buckets using quantiles and business knowledge
3. Cohort analysis

## Setup config

In [1]:
import findspark
findspark.init()

import pyspark
#get_ipython().profile_dir.startup_dir

In [19]:
# PYTHON MODULES

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import time
import datetime as dt
import functools
from pyspark.sql import functions as f
from pandas.plotting import parallel_coordinates

In [3]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import *

In [4]:
spark = SparkSession.builder.master('local').appName('test').config('spark.driver.memory', '5G').getOrCreate()

In [5]:
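# These calls only set options on the Builder (see the output below); the running
# session is not reconfigured. Pass the options before getOrCreate() instead.
spark.builder.config('spark.executor.memory', '16G')
spark.builder.config("spark.executor.cores", "4")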

<pyspark.sql.session.SparkSession.Builder at 0x1117af208>

### Data Import

In [6]:
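# NOTE: the years print as 0010 below, which suggests the CSV stores two-digit years;
# a pattern such as "M/d/yy H:mm" is probably what this file needs.
rfm_df = spark.read.csv(
    "/Users/spurushe/Documents/data-science-world/input_data/Online_Retail.csv",
    inferSchema=True,
    header=True,
    timestampFormat="MM/dd/yyyy hh:mm",
)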

In [7]:
rfm_df.show(4)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|0010-12-01 08:26:00|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|0010-12-01 08:26:00|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|0010-12-01 08:26:00|     2.75|     17850|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|0010-12-01 08:26:00|     3.39|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 4 rows



In [8]:
rfm_df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



### Recency Frequency Monetary Value metrics
We treat this data as if it were recent and calculate a hypothetical 'today'.  
Using this hypothetical 'today', let's calculate the RFM metrics for the last 12 months.

Metrics used:
* Recency -- time elapsed since the customer's last transaction (the lower the better)
* Frequency -- number of transactions in the chosen time window (the higher the better)
* Monetary Value -- total spend by the customer in the chosen time window (the higher the better)

The unit of time is chosen according to (1) the business model and (2) the product and customer lifecycle.

The RFM values are then segmented, which can be done by:
1. Percentiles or quantiles 
2. Pareto split i.e. 80/20 rule 
3. Based on predefined thresholds decided through business knowledge
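
As an illustration of option 1, here is a minimal sketch of quantile-based scoring with `pd.qcut`. The toy table, the column names `R`, `F`, `M`, `RFM_Score`, and the choice of four buckets are all assumptions made for the example, not values from the dataset.

```python
import pandas as pd

# Toy RFM table (hypothetical values, not taken from the dataset).
rfm = pd.DataFrame(
    {
        "Recency": [5, 40, 120, 300, 15, 200, 60, 90],
        "Frequency": [20, 8, 3, 1, 15, 2, 6, 4],
        "MonetaryValue": [900.0, 350.0, 80.0, 20.0, 700.0, 45.0, 260.0, 150.0],
    },
    index=pd.Index([101, 102, 103, 104, 105, 106, 107, 108], name="CustomerID"),
)

# Lower recency is better, so the label order is reversed for R.
rfm["R"] = pd.qcut(rfm["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)
# Higher frequency and spend are better, so labels increase with the value.
rfm["F"] = pd.qcut(rfm["Frequency"], q=4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["MonetaryValue"], q=4, labels=[1, 2, 3, 4]).astype(int)

# A simple combined score: the sum of the three quartile labels (3 to 12).
rfm["RFM_Score"] = rfm[["R", "F", "M"]].sum(axis=1)
print(rfm.sort_values("RFM_Score", ascending=False))
```

Summing the labels weights the three metrics equally; thresholds based on business knowledge (option 3) would replace the `pd.qcut` calls with fixed bin edges.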

**(A) The data should be very recent, i.e. from today or yesterday. Otherwise we need to create a hypothetical 'today' that mimics a recent snapshot of the data.  
We will use this hypothetical 'today' to calculate recency.**

In [21]:
#Calculating the min and max transaction dates for the invoice
#snapshot_date = rfm_df.select(['InvoiceDate']).groupby().agg({'InvoiceDate':'max'})

In [26]:
# f.max() only builds a lazy Column expression; no date is computed at this point
snapshot_date = f.max(rfm_df.InvoiceDate)

In [27]:
snapshot_date

Column<b'max(InvoiceDate)'>
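
The `Column<...>` output above is an unevaluated expression, not a date. In Spark, something like `rfm_df.agg(f.max("InvoiceDate")).first()[0]` should materialize the value. The sketch below shows the whole idea on a small pandas frame instead, so that it runs without a Spark session: it derives a hypothetical 'today', restricts to the last 12 months, and computes the three metrics. The toy rows and the names `TotalSum`, `LastPurchase`, and `window_start` are assumptions for illustration.

```python
import datetime as dt
import pandas as pd

# Toy transactions (hypothetical rows that reuse the dataset's column names).
tx = pd.DataFrame(
    {
        "CustomerID": [1, 1, 2, 2, 2, 3, 3],
        "InvoiceNo": ["A1", "A2", "B1", "B2", "B3", "C0", "C1"],
        "InvoiceDate": pd.to_datetime(
            ["2011-11-01", "2011-12-05", "2011-06-10", "2011-09-15",
             "2011-12-09", "2010-06-01", "2011-01-20"]
        ),
        "Quantity": [2, 1, 5, 3, 4, 1, 10],
        "UnitPrice": [10.0, 20.0, 2.0, 4.0, 1.5, 100.0, 3.0],
    }
)
tx["TotalSum"] = tx["Quantity"] * tx["UnitPrice"]  # assumed name for line spend

# Hypothetical 'today': one day after the most recent invoice.
snapshot_date = tx["InvoiceDate"].max() + dt.timedelta(days=1)

# Keep only transactions from the 12 months before the snapshot.
window_start = snapshot_date - pd.DateOffset(months=12)
recent = tx[tx["InvoiceDate"] >= window_start]

# One row per customer: last purchase date, invoice count, total spend.
rfm = recent.groupby("CustomerID").agg(
    LastPurchase=("InvoiceDate", "max"),
    Frequency=("InvoiceNo", "nunique"),
    MonetaryValue=("TotalSum", "sum"),
)
# Recency in days, measured from the hypothetical 'today'.
rfm["Recency"] = (snapshot_date - rfm["LastPurchase"]).dt.days
print(rfm[["Recency", "Frequency", "MonetaryValue"]])
```

Adding one day to the latest invoice date keeps every recency value at 1 or more, which is convenient if a log transformation is applied later.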


## 1. Segmentation using KMeans clustering

Assumptions of K-means clustering:
1. The distribution of each variable should be symmetrical -- check for skewness
2. Each variable should have the same variance
3. Each variable should have the same mean, so that all variables contribute equally to the clustering

Therefore the pipeline should be:
1. Check for skewness -- fix with a log transformation
2. Check centering and scale -- fix with z-score normalization (subtract the mean and divide by the standard deviation)
3. Run K-means
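
The three pipeline steps can be sketched in plain NumPy as follows. This is an illustrative toy under stated assumptions, not the notebook's actual implementation: the helper names `skewness`, `preprocess`, and `kmeans`, the optional `init` argument, and the six-row RFM matrix are all hypothetical, and a library implementation (for example Spark ML or scikit-learn) would normally replace the hand-written Lloyd's loop.

```python
import numpy as np

def skewness(X):
    """Per-column skewness (third standardized moment)."""
    centered = X - X.mean(axis=0)
    return (centered ** 3).mean(axis=0) / X.std(axis=0) ** 3

def preprocess(X):
    """Log-transform to reduce right skew, then z-score each column."""
    X_log = np.log1p(X)  # log(1 + x) tolerates zeros
    return (X_log - X_log.mean(axis=0)) / X_log.std(axis=0)

def kmeans(X, k, n_iter=100, seed=0, init=None):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    if init is None:
        # Default start: k distinct rows picked at random.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every centroid, then nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy matrix with columns (Recency, Frequency, MonetaryValue); values are made up.
rfm = np.array([
    [5, 20, 900], [10, 18, 800], [8, 25, 1000],
    [200, 2, 40], [250, 1, 30], [300, 3, 50],
], dtype=float)

print("skewness before:", skewness(rfm))
Z = preprocess(rfm)
labels, centroids = kmeans(Z, k=2)
print("labels:", labels)
```

After `preprocess`, every column has mean 0 and standard deviation 1, so no single metric dominates the Euclidean distances that K-means relies on.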

In [28]:
spark.stop()