# RFM Segmentation
https://clevertap.com/blog/rfm-analysis/

## 1. Data Acquisition & Analysis

## 2. Modeling - Data Preparation

In [2]:
import org.apache.spark.sql.types._

val schema = StructType(
                List(
                    StructField("customer_id", StringType, false),
                    StructField("purchase_amount", DoubleType, false),
                    StructField("date_of_purchase", DateType, false)
                )
            )
val data = spark.read
                .option("sep", "\t")
                .option("mode","FAILFAST")
                .option("dateFormat","yyyy-MM-dd") // lowercase yyyy = calendar year; YYYY means week-based year
                .schema(schema)
                .csv("../../data/foundation-marketing-analytics/purchases.txt")
                .toDF

schema = StructType(StructField(customer_id,StringType,false), StructField(purchase_amount,DoubleType,false), StructField(date_of_purchase,DateType,false))
data = [customer_id: string, purchase_amount: double ... 1 more field]


[customer_id: string, purchase_amount: double ... 1 more field]

### Feature Creation

#### Derive a days_since column for the recency calculation

In [3]:
import org.apache.spark.sql.functions._

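// end_date is the reference date for measuring recency. The summary statistics
// further down report a minimum recency of 1 day, so 2016-01-01 is the day
// after the most recent purchase in the data.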

val enriched1 = data
                .withColumn("end_date", lit("2016-01-01"))
                .withColumn("days_since", datediff($"end_date", $"date_of_purchase"))
enriched1.show(5)

// Verify that datediff produced no null values
val nullCount = enriched1.filter(isnull($"days_since")).count()
assert(nullCount == 0)

+-----------+---------------+----------------+----------+----------+
|customer_id|purchase_amount|date_of_purchase|  end_date|days_since|
+-----------+---------------+----------------+----------+----------+
|        760|           25.0|      2009-11-06|2016-01-01|      2247|
|        860|           50.0|      2012-09-28|2016-01-01|      1190|
|       1200|          100.0|      2005-10-25|2016-01-01|      3720|
|       1420|           50.0|      2009-07-09|2016-01-01|      2367|
|       1940|           70.0|      2013-01-25|2016-01-01|      1071|
+-----------+---------------+----------------+----------+----------+
only showing top 5 rows



enriched1 = [customer_id: string, purchase_amount: double ... 3 more fields]
nullCount = 0


0
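
As a cross-check on the `days_since` values, here is a minimal sketch of the same day-difference calculation in plain Scala with `java.time`, outside Spark. The helper name `daysSince` is hypothetical, and the sketch assumes ISO-formatted (`yyyy-MM-dd`) date strings.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical helper: whole days from a purchase date to the reference date.
// Intended to mirror datediff(end_date, date_of_purchase) for ISO dates.
def daysSince(purchaseDate: String, endDate: String = "2016-01-01"): Long =
  ChronoUnit.DAYS.between(LocalDate.parse(purchaseDate), LocalDate.parse(endDate))

val d760 = daysSince("2009-11-06") // 2247, as in the first row shown above
val d860 = daysSince("2012-09-28") // 1190, as in the second row shown above
```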

#### Create features: recency (in days), frequency, and monetary value

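1. Recency: How recently a customer has made a purchase. Here it is the number of days since the customer's most recent purchase.
2. Frequency: How often a customer makes a purchase. Here it is the customer's total number of purchases.
3. Monetary Value: How much money a customer spends on purchases. Here it is the customer's average purchase amount.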
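
The three definitions above can be sketched with plain Scala collections, independent of Spark. The case classes `Purchase` and `Rfm`, the function `rfm`, and the sample values are all hypothetical and made up for illustration; they are not taken from the dataset.

```scala
// Hypothetical record types for this sketch.
final case class Purchase(customerId: String, amount: Double, daysSince: Int)
final case class Rfm(recency: Int, frequency: Int, amount: Double)

// Group purchases by customer and derive the three features.
def rfm(purchases: Seq[Purchase]): Map[String, Rfm] =
  purchases.groupBy(_.customerId).map { case (id, ps) =>
    id -> Rfm(
      recency = ps.map(_.daysSince).min,      // days since the most recent purchase
      frequency = ps.size,                    // number of purchases
      amount = ps.map(_.amount).sum / ps.size // average amount per purchase
    )
  }

// Made-up sample data.
val sample = Seq(
  Purchase("a", 20.0, 400),
  Purchase("a", 40.0, 30),
  Purchase("b", 15.0, 900)
)
val result = rfm(sample)
// result("a") == Rfm(30, 2, 30.0); result("b") == Rfm(900, 1, 15.0)
```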

In [4]:
val enriched2 = enriched1
                .groupBy($"customer_id")
                .agg(
                    min($"days_since").alias("recency"),
                    count($"customer_id").alias("frequency"),
                    avg($"purchase_amount").alias("amount"))
enriched2.filter($"customer_id".isin("10", "90")).show(5)

+-----------+-------+---------+------+
|customer_id|recency|frequency|amount|
+-----------+-------+---------+------+
|         90|    758|       10| 115.8|
|         10|   3829|        1|  30.0|
+-----------+-------+---------+------+



enriched2 = [customer_id: string, recency: int ... 2 more fields]


[customer_id: string, recency: int ... 2 more fields]

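**Let us look at some summary statistics.** Because *customer_id* is a string, its mean and standard deviation are not meaningful, and its min and max are lexicographic.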

In [5]:
enriched2.describe().show()

+-------+------------------+------------------+------------------+------------------+
|summary|       customer_id|           recency|         frequency|            amount|
+-------+------------------+------------------+------------------+------------------+
|  count|             18417|             18417|             18417|             18417|
|   mean|137573.51088668077|  1253.03789976652|2.7823749796383774|57.792985101815624|
| stddev|  69504.5998805089|1081.4378683668397| 2.936888270392829|154.36010930845674|
|    min|                10|                 1|                 1|               5.0|
|    max|             99990|              4014|                45|            4500.0|
+-------+------------------+------------------+------------------+------------------+



In [6]:
enriched2.createOrReplaceTempView("customers")
spark.sql("select * from customers").show()

+-----------+-------+---------+------------------+
|customer_id|recency|frequency|            amount|
+-----------+-------+---------+------------------+
|       6240|   3005|        3| 76.66666666666667|
|      52800|   3320|        1|              15.0|
|     100140|     13|        4|             51.25|
|     109180|     30|        8|             48.75|
|     131450|    205|        8|            103.75|
|      45300|    234|        6|29.166666666666668|
|      69460|     15|        9| 28.88888888888889|
|      86180|      2|        9| 21.11111111111111|
|     161110|   1528|        1|              30.0|
|      60070|   2074|        3|51.666666666666664|
|      13610|   1307|        8|           3043.75|
|     100010|    413|        7|27.857142857142858|
|     107930|    150|        5|              79.0|
|     132610|     30|        7|28.571428571428573|
|     154770|    427|        1|              45.0|
|      49290|    371|        5|              24.0|
|     229650|    419|        1|

### Feature Scaling

We need to address the following before we can train the model:

1. Data dispersion: The *amount* feature is heavily skewed. We take its **log** to reduce the skew.
2. Scale: The features are measured in different units (see table). They need to be standardized so that they contribute equally to the result.


|Feature|Scale|
|-------|-----|
|recency|days|
|frequency|purchase occasions|
|amount|dollars|

**1. Handle data dispersion**

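The histogram below shows that the vast majority of customers (18,126 of 18,417) have an average purchase amount between 5 and about 305 dollars. The distribution is right-skewed, with a long tail reaching 4,500 dollars.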

In [16]:
val (startValues, counts) = enriched2.select($"amount").map(v=>v.getDouble(0)).rdd.histogram(15)

startValues = Array(5.0, 304.6666666666667, 604.3333333333334, 904.0, 1203.6666666666667, 1503.3333333333333, 1803.0, 2102.6666666666665, 2402.3333333333335, 2702.0, 3001.6666666666665, 3301.3333333333335, 3601.0, 3900.6666666666665, 4200.333333333333, 4500.0)
counts = Array(18126, 144, 28, 56, 16, 8, 19, 4, 3, 5, 1, 0, 0, 5, 2)


Array(18126, 144, 28, 56, 16, 8, 19, 4, 3, 5, 1, 0, 0, 5, 2)
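
To help read the `startValues` output, here is a small sketch of equal-width binning in plain Scala. It assumes, as the output above suggests, that `histogram(15)` spaces its bucket boundaries evenly between the minimum and maximum and that the last bucket includes the maximum. The helper names `binEdges` and `binIndex` are hypothetical.

```scala
// Boundaries of `buckets` equal-width bins spanning [min, max].
def binEdges(min: Double, max: Double, buckets: Int): Vector[Double] =
  Vector.tabulate(buckets + 1)(i => min + i * (max - min) / buckets)

// Zero-based index of the bin holding x; the last bin also includes max.
def binIndex(x: Double, min: Double, max: Double, buckets: Int): Int =
  math.min(((x - min) / (max - min) * buckets).toInt, buckets - 1)

val edges = binEdges(5.0, 4500.0, 15)           // 16 boundaries, width of about 299.67
val lowBin = binIndex(30.0, 5.0, 4500.0, 15)    // first bin (index 0)
val highBin = binIndex(4500.0, 5.0, 4500.0, 15) // last bin (index 14)
```

With a bin width of roughly 300 dollars, almost every customer lands in the first bin, which is why the raw histogram says little about the bulk of the data.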

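Let us derive a new column, *log_amount*, the natural log of *amount*, and recompute the histogram. The counts are now spread far more evenly across the bins.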

In [24]:
val enriched3 = enriched2.withColumn("log_amount", log($"amount"))
val (startValues, counts) = enriched3.select($"log_amount").map(v=>v.getDouble(0)).rdd.histogram(15)
enriched3.createOrReplaceTempView("customers")
spark.sql("select * from customers").show(5)

+-----------+-------+---------+-----------------+------------------+
|customer_id|recency|frequency|           amount|        log_amount|
+-----------+-------+---------+-----------------+------------------+
|       6240|   3005|        3|76.66666666666667| 4.339467020255086|
|      52800|   3320|        1|             15.0|  2.70805020110221|
|     100140|     13|        4|            51.25|3.9367156180185177|
|     109180|     30|        8|            48.75| 3.886705197443856|
|     131450|    205|        8|           103.75| 4.641984159110808|
+-----------+-------+---------+-----------------+------------------+
only showing top 5 rows



enriched3 = [customer_id: string, recency: int ... 3 more fields]
startValues = Array(1.6094379124341003, 2.062930896655721, 2.516423880877342, 2.9699168650989627, 3.423409849320583, 3.8769028335422036, 4.330395817763825, 4.783888801985445, 5.237381786207066, 5.690874770428687, 6.1443677546503075, 6.597860738871929, 7.051353723093549, 7.50484670731517, 7.958339691536791, 8.411832675758411)
counts = Array(109, 987, 1567, 6989, 3004, 3291, 1453, 487, 217, 95, 81, 73, 26, 26, 12)


Array(109, 987, 1567, 6989, 3004, 3291, 1453, 487, 217, 95, 81, 73, 26, 26, 12)

**2. Scale**

Distance-based algorithms such as K-means (clustering) are sensitive to the range of each feature because they measure similarity by the distance between data points. Without scaling, features with larger magnitudes (here, recency in days) would dominate the distance and effectively receive a higher weight.

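**Standardization** is a technique used to scale our features so that all the features **contribute equally to the result**. Logic: (value - mean) / std-dev, computed per feature. Each standardized feature then has mean 0 and standard deviation 1; for roughly bell-shaped data, most values fall between -2 and 2.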
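
The formula can be illustrated with a minimal plain-Scala sketch on a single feature. The function name `standardize` and the input values are made up for illustration. The sketch uses the sample standard deviation (dividing by n - 1), on the assumption that this matches how Spark's StandardScaler computes it.

```scala
// Rescale values to zero mean and unit (sample) standard deviation.
def standardize(xs: Seq[Double]): Seq[Double] = {
  val mean = xs.sum / xs.size
  val variance = xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1)
  val std = math.sqrt(variance)
  xs.map(x => (x - mean) / std)
}

// Made-up recency values in days.
val z = standardize(Seq(10.0, 20.0, 30.0))
// mean = 20, sample std-dev = 10, so z is Seq(-1.0, 0.0, 1.0)
```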

We will use [StandardScaler](https://spark.apache.org/docs/1.4.1/ml-features.html#standardscaler) from Spark.

In [29]:
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
                    .setInputCols(Array("recency", "frequency", "log_amount"))
                    .setOutputCol("features")

val features = assembler.transform(enriched3)
features.show(3)


val scaler = new StandardScaler()
                .setInputCol("features")
                .setOutputCol("std_features")
                .setWithStd(true)
                .setWithMean(true)
// Fit the StandardScaler to compute each feature's mean and standard deviation
val scalerModel = scaler.fit(features)
// Standardize each feature to zero mean and unit standard deviation
val scaledFeatures = scalerModel.transform(features)
scaledFeatures.show(3, truncate=false)

+-----------+-------+---------+-----------------+------------------+--------------------+
|customer_id|recency|frequency|           amount|        log_amount|            features|
+-----------+-------+---------+-----------------+------------------+--------------------+
|       6240|   3005|        3|76.66666666666667| 4.339467020255086|[3005.0,3.0,4.339...|
|      52800|   3320|        1|             15.0|  2.70805020110221|[3320.0,1.0,2.708...|
|     100140|     13|        4|            51.25|3.9367156180185177|[13.0,4.0,3.93671...|
+-----------+-------+---------+-----------------+------------------+--------------------+
only showing top 3 rows

+-----------+-------+---------+-----------------+------------------+------------------------------+-----------------------------------------------------------+
|customer_id|recency|frequency|amount           |log_amount        |features                      |std_features                                               |
+-----------+-------+----

assembler = vecAssembler_e1c75f813c76
features = [customer_id: string, recency: int ... 4 more fields]
scaler = stdScal_2bbbe005b11a
scalerModel = stdScal_2bbbe005b11a
scaledFeatures = [customer_id: string, recency: int ... 5 more fields]


[customer_id: string, recency: int ... 5 more fields]