# RFM Managerial Segmentation

Before answering the Module 2 questions, complete the following exercise: take the code in the file named module2.R and modify it so that the managerial segment "new active" is split into two sub-segments, **new active high** (average purchase amount >= 100 dollars) and **new active low** (average purchase amount < 100 dollars).

> Tip: apply that modification to both the 2015 and 2014 segmentations, and update the code that re-orders the factor "segment" accordingly.

You'll be asked to answer these questions:

1. How many "new active low" customers were there in 2015?
2. The number of "new active high" customers has increased between 2014 and 2015. What is the rate of that increase?
3. Regarding the customers who belonged to the "new warm" segment in 2014, what was their expected revenue, all things considered, in 2015?
4. In terms of expected revenue, which segment groups the least profitable customers?
5. Looking at segment description, what is the average purchase amount of a customer who belongs to the "new active high" segment?

![Rules](assignment-2-rfm-seg-rules.png)
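
The segmentation rules can be sketched without Spark. The following is a minimal plain-Scala illustration, assuming the thresholds used by the Spark code later in this notebook (365-day years and a 100-dollar cutoff on the average purchase amount). `Rfm` and `managerialSegment` are hypothetical names introduced only for this sketch, and the labels follow the notebook's naming ("active new high value" stands for "new active high").

```scala
// Plain-Scala sketch of the segmentation rules (assumed thresholds, no Spark required)
final case class Rfm(recency: Int, firstPurchase: Int, amount: Double)

def managerialSegment(c: Rfm): String = {
  val oneYear = 365
  val twoYears = 2 * oneYear
  val threeYears = 3 * oneYear
  val value = if (c.amount >= 100) "high" else "low"

  if (c.recency > threeYears) "inactive"
  else if (c.recency > twoYears) "cold"
  else if (c.recency > oneYear) {
    // "new" is checked before the high/low split, as in the Spark code
    if (c.firstPurchase <= twoYears) "warm new" else s"warm $value value"
  } else {
    if (c.firstPurchase <= oneYear) s"active new $value value" else s"active $value value"
  }
}

// Customer 90 from the sample output below: recency 758, first purchase 3783, amount 115.8
println(managerialSegment(Rfm(758, 3783, 115.8)))  // cold
// A made-up customer whose only purchase was 30 days ago
println(managerialSegment(Rfm(30, 30, 250.0)))     // active new high value
```

Only the two "active new" labels are new in this assignment; every other branch is the original managerial segmentation.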

In [1]:
import java.util.concurrent.TimeUnit
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.Column
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

In [2]:
val schema = StructType(
                List(
                    StructField("customer_id", StringType, false),
                    StructField("purchase_amount", DoubleType, false),
                    StructField("date_of_purchase", DateType, false)
                )
            )

val OneYear = 365
val TwoYears = OneYear * 2
val ThreeYears = OneYear * 3

def enrich(in:DataFrame, dataBaseInvoiceDate: Column) : DataFrame = {
    in
        .withColumn("end_date", dataBaseInvoiceDate)
        .withColumn("year_of_purchase", year($"date_of_purchase"))
        .withColumn("days_since", datediff($"end_date", $"date_of_purchase"))
}

def calcRFM(in:DataFrame) : DataFrame = {
    in
        .groupBy($"customer_id")
        .agg(
            max($"days_since").alias("first_purchase"),
            min($"days_since").alias("recency"),
            count($"*").alias("frequency"),
            avg($"purchase_amount").alias("amount"))
}

def firstLevelSegmentation(in:DataFrame):DataFrame = {
    in
        .withColumn("segment1", 
                        when($"recency" > ThreeYears, "inactive")
                        .when($"recency" > TwoYears && $"recency" <= ThreeYears, "cold")
                        .when($"recency" > OneYear && $"recency" <= TwoYears, "warm")
                        .otherwise("active"))
}

/*
* The conditions for "warm new" and "active new" must come earlier than the other conditions for the
* same first-level category, because when() returns the first matching branch.
*
* This assignment requires two new segments: "active new high value" and "active new low value"
*/
def secondLevelSegmentation(in:DataFrame) :DataFrame = {
    in
        .withColumn("segment2",
                        when($"segment1" === lit("warm") && $"first_purchase" <= TwoYears, "warm new")
                        .when($"segment1" === lit("warm") && $"amount" >= 100, "warm high value")
                        .when($"segment1" === lit("warm") && $"amount" < 100, "warm low value")
                        .when($"segment1" === lit("active") && $"first_purchase" <= OneYear && $"amount" >= 100, "active new high value")
                        .when($"segment1" === lit("active") && $"first_purchase" <= OneYear && $"amount" < 100, "active new low value")
                        .when($"segment1" === lit("active") && $"amount" >= 100, "active high value")
                        .when($"segment1" === lit("active") && $"amount" < 100, "active low value"))
}

def segmentation(segment1Level:DataFrame, segment2Level:DataFrame) :DataFrame = {
    segment1Level
        .join(segment2Level, segment1Level("customer_id") === segment2Level("customer_id"), "inner")
            .select(segment1Level("customer_id"),
                    segment1Level("first_purchase"),
                    segment1Level("recency"),
                    segment1Level("frequency"),
                    segment1Level("amount"),
                    segment1Level("segment1"),
                    segment2Level("segment2"))
            .withColumn("segment", when(segment2Level("segment2").isNotNull, $"segment2").otherwise(segment1Level("segment1")))
            .orderBy("segment")
        
}

def segmentProfile(segmented: DataFrame, segColName: String) :DataFrame = {
    segmented
        .groupBy(col(segColName))
        .agg(
                round(avg($"recency"),2).alias("avg_r"),
                round(avg($"frequency"),2).alias("avg_f"),
                round(avg($"amount"),2).alias("avg_a"))
        .orderBy(col(segColName))
}

def sumSegmentCounts(collectedArr: Array[Row]): Long = {
    // Column 0 holds the segment label (null for customers without a second-level segment) and is
    // not needed here; column 1 holds the count produced by groupBy(...).count()
    val count = collectedArr.map(row => row.getLong(1)).sum

    println("Total customer count: "+ count)
    
    count
}

schema = StructType(StructField(customer_id,StringType,false), StructField(purchase_amount,DoubleType,false), StructField(date_of_purchase,DateType,false))
OneYear = 365
TwoYears = 730
ThreeYears = 1095


enrich: (in: org.apache.spark.sql.DataFrame, dataBaseInvoiceDate: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
calcRFM: (in: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
firstLevelSegmentation: (in: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
secondLevelSegmentation: (in: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
segmentation: (segment1Level: org.apache.spark.sql.DataFrame, segment2Level: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
segmentProfile: (...


1095

In [3]:
val data = spark.read
                .option("sep", "\t")
                .option("mode","FAILFAST")
                .option("dateFormat","yyyy-MM-dd") //lowercase yyyy = calendar year; uppercase YYYY means week-based year
                .schema(schema)
                .csv("../../data/foundation-marketing-analytics/purchases.txt")
                .toDF

val enriched1 = enrich(data, lit("2016-01-01"))
val enriched2 = calcRFM(enriched1)  

enriched2.filter($"customer_id".isin("10", "90")).show(5)

+-----------+--------------+-------+---------+------+
|customer_id|first_purchase|recency|frequency|amount|
+-----------+--------------+-------+---------+------+
|         90|          3783|    758|       10| 115.8|
|         10|          3829|   3829|        1|  30.0|
+-----------+--------------+-------+---------+------+



data = [customer_id: string, purchase_amount: double ... 1 more field]
enriched1 = [customer_id: string, purchase_amount: double ... 4 more fields]
enriched2 = [customer_id: string, first_purchase: int ... 3 more fields]


[customer_id: string, first_purchase: int ... 3 more fields]

**Total purchases**

In [4]:
println("Total purchases: "+ data.count())

Total purchases: 51243


**Distinct customers**

In [5]:
println("Total distinct customers: "+enriched2.count())//enriched2 is already grouped by customer_id

Total distinct customers: 18417


**Number of purchases by year**

In [6]:
enriched1.groupBy($"year_of_purchase").count().orderBy($"year_of_purchase".desc).show(20)

+----------------+-----+
|year_of_purchase|count|
+----------------+-----+
|            2015| 6197|
|            2014| 5739|
|            2013| 5912|
|            2012| 5960|
|            2011| 4785|
|            2010| 4939|
|            2009| 5054|
|            2008| 4331|
|            2007| 4674|
|            2006| 2182|
|            2005| 1470|
+----------------+-----+



## 2015 Segmentation

In [7]:
val segment1Level = firstLevelSegmentation(enriched2)
println("First level segmentation")
segment1Level.groupBy($"segment1").count().orderBy($"segment1").show()

val segment2Level = secondLevelSegmentation(segment1Level)
println("Second level segmentation")
segment2Level.groupBy($"segment2").count().orderBy($"segment2").show(10, truncate=false)

//Create the final segmentation for all customers. It is stored in the segmented2015 dataframe
//Cache to speed up subsequent calculations
val segmented2015 = segmentation(segment1Level, segment2Level).cache()
println("2015 final segmentation")
segmented2015.groupBy($"segment").count().orderBy($"segment").show(10, truncate=false)

First level segmentation
+--------+-----+
|segment1|count|
+--------+-----+
|  active| 5398|
|    cold| 1903|
|inactive| 9158|
|    warm| 1958|
+--------+-----+

Second level segmentation
+---------------------+-----+
|segment2             |count|
+---------------------+-----+
|null                 |11061|
|active high value    |573  |
|active low value     |3313 |
|active new high value|263  |
|active new low value |1249 |
|warm high value      |119  |
|warm low value       |901  |
|warm new             |938  |
+---------------------+-----+

2015 final segmentation
+---------------------+-----+
|segment              |count|
+---------------------+-----+
|active high value    |573  |
|active low value     |3313 |
|active new high value|263  |
|active new low value |1249 |
|cold                 |1903 |
|inactive             |9158 |
|warm high value      |119  |
|warm low value       |901  |
|warm new             |938  |
+---------------------+-----+



segment1Level = [customer_id: string, first_purchase: int ... 4 more fields]
segment2Level = [customer_id: string, first_purchase: int ... 5 more fields]
segmented2015 = [customer_id: string, first_purchase: int ... 6 more fields]


[customer_id: string, first_purchase: int ... 6 more fields]

In [8]:
//Verify segmentation logic. Total distinct customer count shouldn't change no matter how segmentation is done

val collect1Level: Array[Row] = segment1Level.groupBy($"segment1").count().collect()
assert(18417 == sumSegmentCounts(collect1Level))
println("1st level segmentation is accurate")

val collect2Level: Array[Row] = segment2Level.groupBy($"segment2").count().collect()
assert(18417 == sumSegmentCounts(collect2Level))
println("2nd level segmentation is accurate")

val collect2015: Array[Row] = segmented2015.groupBy($"segment").count().collect()
assert(18417 == sumSegmentCounts(collect2015))
println("2015 final segmentation is accurate")

Total customer count: 18417
1st level segmentation is accurate
Total customer count: 18417
2nd level segmentation is accurate
Total customer count: 18417
2015 final segmentation is accurate


collect1Level = Array([warm,1958], [active,5398], [cold,1903], [inactive,9158])
collect2Level = Array([active new low value,1249], [warm high value,119], [active high value,573], [null,11061], [warm new,938], [active low value,3313], [active new high value,263], [warm low value,901])
collect2015 = Array([active high value,573], [active low value,3313], [active new high value,263], [active new low value,1249], [cold,1903], [inactive,9158], [warm high value,119], [warm low value,901], [warm new,938])


Array([active high value,573], [active low value,3313], [active new high value,263], [active new low value,1249], [cold,1903], [inactive,9158], [warm high value,119], [warm low value,901], [warm new,938])

### 1. How many "new active low" customers were there in 2015?

Choices:

1. 1234
2. 1249
3. 1349
4. 1437
5. 1512

In [9]:
val newActiveLowCount = collect2015
            .filter(row => row.get(0) == "active new low value")(0)
            .getLong(1)
println("Count of customers in 'new active low' segment: "+newActiveLowCount)

Count of customers in 'new active low' segment: 1249


newActiveLowCount = 1249


1249

### 2015 - Segment Profile

In [10]:
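segmentProfile(segmented2015, "segment").show(10, truncate=false)

+---------------------+-------+-----+------+
|segment              |avg_r  |avg_f|avg_a |
+---------------------+-------+-----+------+
|active high value    |88.82  |5.89 |240.05|
|active low value     |108.36 |5.94 |40.72 |
|active new high value|82.37  |1.02 |283.38|
|active new low value |85.54  |1.05 |33.7  |
|cold                 |857.78 |2.3  |51.74 |
|inactive             |2178.11|1.81 |48.11 |
|warm high value      |455.13 |4.71 |327.41|
|warm low value       |474.38 |4.53 |38.59 |
|warm new             |509.3  |1.04 |66.6  |
+---------------------+-------+-----+------+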

## 2014 Segmentation
This is the segmentation of the database as it would have looked **a year ago, i.e. a retrospective segmentation**.

**How did it work?**

The first thing to do is to pretend that we are a year in the past: whatever data we take into account, anything that happened over the last 365 days must be discarded.

We go back in time and assume that the data generated over the last year does not exist. We adapt how recency, frequency and monetary value are computed accordingly, and then apply everything we applied before: the same segmentation, the same transformations, the same analyses and the same tables.
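
To make the "go back in time" idea concrete, here is a small sketch on in-memory data rather than Spark DataFrames. `Purchase`, `RfmRow`, `rfmAsOf` and the toy purchases are hypothetical and exist only for this illustration; the notebook itself achieves the same effect by filtering on the purchase year and passing a different end date to `enrich`.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical in-memory purchase record (the notebook reads these from purchases.txt)
final case class Purchase(customerId: String, amount: Double, date: LocalDate)
final case class RfmRow(firstPurchase: Long, recency: Long, frequency: Int, amount: Double)

// Compute RFM per customer as if "today" were asOf: purchases on or after asOf are ignored
def rfmAsOf(purchases: Seq[Purchase], asOf: LocalDate): Map[String, RfmRow] =
  purchases
    .filter(_.date.isBefore(asOf))
    .groupBy(_.customerId)
    .map { case (id, ps) =>
      val daysSince = ps.map(p => ChronoUnit.DAYS.between(p.date, asOf))
      id -> RfmRow(daysSince.max, daysSince.min, ps.size, ps.map(_.amount).sum / ps.size)
    }

// Made-up purchases, not taken from the dataset
val toyPurchases = Seq(
  Purchase("a", 50.0, LocalDate.of(2013, 6, 1)),
  Purchase("a", 150.0, LocalDate.of(2015, 3, 1)),
  Purchase("b", 20.0, LocalDate.of(2015, 7, 15))
)

val rfmToday = rfmAsOf(toyPurchases, LocalDate.of(2016, 1, 1))    // customer "a": recency 306, frequency 2
val rfmLastYear = rfmAsOf(toyPurchases, LocalDate.of(2015, 1, 1)) // customer "a": recency 579, frequency 1
```

Customer "b" only appears in `rfmToday`: a year earlier that customer had not bought anything yet, which is why the 2014 segmentation covers fewer customers than the 2015 one.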

**Why do we need to segment retrospectively?**

From a managerial point of view, it is extremely useful to see not only to what extent each segment contributes to today's (2015) revenues, but also to what extent each of today's segments would likely contribute to tomorrow's revenues.

In [11]:
val enriched1_2014 = enrich(data.filter(year($"date_of_purchase") <= 2014), lit("2015-01-01"))
val enriched2_2014 = calcRFM(enriched1_2014)

println("Number of purchases by year")
enriched1_2014.groupBy($"year_of_purchase").count().orderBy($"year_of_purchase".desc).show(20)

val segment1Level_2014 = firstLevelSegmentation(enriched2_2014)
println("First level segmentation")
segment1Level_2014.groupBy($"segment1").count().orderBy($"segment1").show(10, truncate=false)

val segment2Level_2014 = secondLevelSegmentation(segment1Level_2014)
println("Second level segmentation")
segment2Level_2014.groupBy($"segment2").count().orderBy($"segment2").show(10, truncate=false)

//Cache to speedup subsequent calculations
val segmented2014 = segmentation(segment1Level_2014, segment2Level_2014).cache()
println("2014 final segmentation")
segmented2014.groupBy($"segment").count().orderBy($"segment").show(10, truncate=false)

println("# of customers 2015: "+ enriched2.count())
println("# of customers 2014: "+ enriched2_2014.count())

Number of purchases by year
+----------------+-----+
|year_of_purchase|count|
+----------------+-----+
|            2014| 5739|
|            2013| 5912|
|            2012| 5960|
|            2011| 4785|
|            2010| 4939|
|            2009| 5054|
|            2008| 4331|
|            2007| 4674|
|            2006| 2182|
|            2005| 1470|
+----------------+-----+

First level segmentation
+--------+-----+
|segment1|count|
+--------+-----+
|active  |4923 |
|cold    |2153 |
|inactive|7512 |
|warm    |2317 |
+--------+-----+

Second level segmentation
+---------------------+-----+
|segment2             |count|
+---------------------+-----+
|null                 |9665 |
|active high value    |475  |
|active low value     |3011 |
|active new high value|203  |
|active new low value |1234 |
|warm high value      |111  |
|warm low value       |956  |
|warm new             |1250 |
+---------------------+-----+

2014 final segmentation
+---------------------+-----+
|segment          

enriched1_2014 = [customer_id: string, purchase_amount: double ... 4 more fields]
enriched2_2014 = [customer_id: string, first_purchase: int ... 3 more fields]
segment1Level_2014 = [customer_id: string, first_purchase: int ... 4 more fields]
segment2Level_2014 = [customer_id: string, first_purchase: int ... 5 more fields]
segmented2014 = [customer_id: string, first_purchase: int ... 6 more fields]


[customer_id: string, first_purchase: int ... 6 more fields]

In [12]:
//Verify segmentation logic. Total distinct customer count shouldn't change no matter how segmentation is done

val collect1Level_2014: Array[Row] = segment1Level_2014.groupBy($"segment1").count().collect()
assert(16905 == sumSegmentCounts(collect1Level_2014))
println("1st level segmentation is accurate")

val collect2Level_2014: Array[Row] = segment2Level_2014.groupBy($"segment2").count().collect()
assert(16905 == sumSegmentCounts(collect2Level_2014))
println("2nd level segmentation is accurate")

val collect2014: Array[Row] = segmented2014.groupBy($"segment").count().collect()
assert(16905 == sumSegmentCounts(collect2014))
println("2014 final segmentation is accurate")

Total customer count: 16905
1st level segmentation is accurate
Total customer count: 16905
2nd level segmentation is accurate
Total customer count: 16905
2014 final segmentation is accurate


collect1Level_2014 = Array([warm,2317], [active,4923], [cold,2153], [inactive,7512])
collect2Level_2014 = Array([active new low value,1234], [warm high value,111], [active high value,475], [null,9665], [warm new,1250], [active low value,3011], [active new high value,203], [warm low value,956])
collect2014 = Array([active high value,475], [active low value,3011], [active new high value,203], [active new low value,1234], [cold,2153], [inactive,7512], [warm high value,111], [warm low value,956], [warm new,1250])


Array([active high value,475], [active low value,3011], [active new high value,203], [active new low value,1234], [cold,2153], [inactive,7512], [warm high value,111], [warm low value,956], [warm new,1250])

### 2014 - Segment Profile

In [13]:
segmentProfile(segmented2014, "segment").show(10, truncate=false)

+---------------------+-------+-----+------+
|segment              |avg_r  |avg_f|avg_a |
+---------------------+-------+-----+------+
|active high value    |85.34  |5.7  |261.9 |
|active low value     |98.09  |5.63 |40.46 |
|active new high value|94.62  |1.05 |284.67|
|active new low value |138.25 |1.07 |34.36 |
|cold                 |866.62 |2.25 |51.11 |
|inactive             |2058.44|1.73 |48.11 |
|warm high value      |461.2  |4.41 |187.85|
|warm low value       |470.66 |4.36 |37.38 |
|warm new             |497.32 |1.06 |51.37 |
+---------------------+-------+-----+------+



### 2. The number of "new active high" customers has increased between 2014 and 2015. What is the rate of that increase?

Choices:

1. 29.5%
2. 63.0%
3. 129.5%
4. 90%
5. 263%

In [14]:
println("2015 Segments")
segmented2015.groupBy($"segment").count().orderBy($"segment").show(10, truncate=false)

println("2014 Segments")
segmented2014.groupBy($"segment").count().orderBy($"segment").show(10, truncate=false)

def segmentCount(segmented: DataFrame, colName: String): Long = {
    segmented
        .groupBy($"segment")
        .count()
        .filter($"segment" === lit(colName))
        .take(1)(0)
        .getLong(1)
}
val newActiveHigh2014 = segmentCount(segmented2014, "active new high value")
val newActiveHigh2015 = segmentCount(segmented2015, "active new high value")

val increase: Double = newActiveHigh2015 - newActiveHigh2014
val rateOfIncrease: Double = (increase / newActiveHigh2014) * 100.0
println("Rate of increase from 2014 to 2015: "+ rateOfIncrease)

/*
For verification, the values have been copied from successful assignment submission on coursera

2015 segment counts
-------------------
active high value = 573 
active low value  = 3313  
new active high   = 263   
new active low    = 1249
cold              = 1903
inactive          = 9158
new warm          = 938 
warm high value   = 119
warm low value    = 901

2014 segment counts
-------------------
inactive                   = 7512
cold                       = 2153
warm high value            = 111
warm low value             = 956   
new warm                   = 1250
active high value          = 475
active low value           = 3011
new active high            = 203
new active low             = 1234
*/

2015 Segments
+---------------------+-----+
|segment              |count|
+---------------------+-----+
|active high value    |573  |
|active low value     |3313 |
|active new high value|263  |
|active new low value |1249 |
|cold                 |1903 |
|inactive             |9158 |
|warm high value      |119  |
|warm low value       |901  |
|warm new             |938  |
+---------------------+-----+

2014 Segments
+---------------------+-----+
|segment              |count|
+---------------------+-----+
|active high value    |475  |
|active low value     |3011 |
|active new high value|203  |
|active new low value |1234 |
|cold                 |2153 |
|inactive             |7512 |
|warm high value      |111  |
|warm low value       |956  |
|warm new             |1250 |
+---------------------+-----+

Rate of increase from 2014 to 2015: 29.55665024630542


newActiveHigh2014 = 203
newActiveHigh2015 = 263
increase = 60.0
rateOfIncrease = 29.55665024630542


segmentCount: (segmented: org.apache.spark.sql.DataFrame, colName: String)Long


29.55665024630542
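
The rate printed above is a plain relative change. Writing $N_{2014}$ and $N_{2015}$ for the "active new high value" counts taken from the two segment tables:

$$
\text{rate of increase} = \frac{N_{2015} - N_{2014}}{N_{2014}} \times 100 = \frac{263 - 203}{203} \times 100 \approx 29.56\%
$$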

## 2015 - Revenue Generation Per Segment

In [15]:
//Compute how much revenue is generated by each segment in 2015
//Notice that people with no revenue in 2015 do NOT appear
//i.e. we select only active customers
val revenue2015 = enriched1
                    .filter($"year_of_purchase" === 2015)
                    .groupBy($"customer_id")
                    .agg(sum($"purchase_amount").alias("revenue_2015"))
revenue2015.describe("revenue_2015").show()

//revenue2015.show()
val total2015Revenue = revenue2015
                        .select(sum($"revenue_2015"))
                        .collect()(0) //will return 1 row
                        .getDouble(0) //will return 1 column
println("Total 2015 revenue: "+total2015Revenue)

//A left join keeps the customers who generated no revenue in 2015,
//i.e. who made no purchases in 2015
val actuals = segmented2015
                .join(revenue2015, Seq("customer_id"), "left")
                .na
                .fill(0.0, Seq("revenue_2015"))
println("No of rows: "+actuals.count())

actuals.describe("revenue_2015").show()

//Verify accurate calculation of actuals
val actualsTotal2015Revenue = actuals
                        .select(sum($"revenue_2015"))
                        .collect()(0) //will return 1 row
                        .getDouble(0) //will return 1 column
println("Total 2015 revenue (Actuals): "+actualsTotal2015Revenue)
assert(math.abs(total2015Revenue - actualsTotal2015Revenue) < 0.01) //compare doubles with a tolerance

actuals
    .groupBy($"segment")
    .agg(round(avg($"revenue_2015"),2).alias("avg_revenue_2015"))
    .orderBy($"segment")
    .show(10,truncate=false)

+-------+------------------+
|summary|      revenue_2015|
+-------+------------------+
|  count|              5398|
|   mean| 88.62432938125232|
| stddev|224.35689735796478|
|    min|               5.0|
|    max|            4500.0|
+-------+------------------+

Total 2015 revenue: 478394.13
No of rows: 18417
+-------+------------------+
|summary|      revenue_2015|
+-------+------------------+
|  count|             18417|
|   mean|25.975681707118422|
| stddev| 127.9801632917415|
|    min|               0.0|
|    max|            4500.0|
+-------+------------------+

Total 2015 revenue (Actuals): 478394.13
+---------------------+----------------+
|segment              |avg_revenue_2015|
+---------------------+----------------+
|active high value    |323.57          |
|active low value     |52.31           |
|active new high value|287.56          |
|active new low value |35.28           |
|cold                 |0.0             |
|inactive             |0.0             |
|warm high value   

revenue2015 = [customer_id: string, revenue_2015: double]
total2015Revenue = 478394.13
actuals = [customer_id: string, first_purchase: int ... 7 more fields]
actualsTotal2015Revenue = 478394.13


478394.13

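From the table above, the cold and inactive segments show an average 2015 revenue of 0.0, since their customers made no purchases that year. Among the active segments, **active new low value** generated the lowest average revenue in 2015 (35.28).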

### Average 2015 revenue per customer, by 2014 segment (forward looking)
How much revenue can you expect next year from the customers you have today (today in this data set is 2015)? We don't know the future, but one thing we can do is go back to the past (2014) and look at how much revenue the customers of each 2014 segment went on to generate in 2015. That is the next step of this analysis. We merge the revenue generated in 2015, as before, but this time with the 2014 customer list, so that we see how much revenue each customer generated based on the segment they were in a year earlier.

That is why we call it forward looking: the segment a customer was in during 2014 tells us how much revenue that customer generated in 2015.
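
The merge described above can be illustrated on a few made-up rows before running it on the real data. This is only a sketch in plain Scala collections: the `toy...` names and all values are hypothetical, and the Spark cell that follows is what actually produces the results.

```scala
// Made-up inputs: segment of each 2014 customer, and revenue of each customer who bought in 2015
val toySegments2014 = Map("c1" -> "active high value", "c2" -> "active high value", "c3" -> "warm new")
val toyRevenue2015 = Map("c1" -> 300.0, "c4" -> 80.0) // c4 first bought in 2015, so has no 2014 segment

// Left join from the 2014 customer list: missing revenue becomes 0.0 and c4 is dropped
val toyForward: Seq[(String, Double)] =
  toySegments2014.toSeq.map { case (id, seg) => seg -> toyRevenue2015.getOrElse(id, 0.0) }

// Average 2015 revenue per 2014 segment, counting the customers who spent nothing
val toyAvgBySegment: Map[String, Double] =
  toyForward.groupBy(_._1).map { case (seg, rows) => seg -> rows.map(_._2).sum / rows.size }

println(toyAvgBySegment("active high value")) // 150.0
println(toyAvgBySegment("warm new"))          // 0.0
```

Keeping the zero-revenue customers in the average is the point of the left join: it turns the figure into an expected revenue per customer of the segment rather than an average over buyers only.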

In [16]:
//Merge 2014 customers with 2015 revenue
val forward = segmented2014
                .join(revenue2015, Seq("customer_id"), "left")
                .na
                .fill(0.0, Seq("revenue_2015"))
forward.describe("revenue_2015").show()

val forwardRevenue = forward
                        .groupBy($"segment")
                        .agg(round(avg($"revenue_2015"),2).alias("avg_revenue_2015"))
                        .orderBy($"avg_revenue_2015".desc)

forwardRevenue.show(10, truncate=false)

/*
For verification, the values have been copied from successful assignment submission on coursera
Ref: https://www.coursera.org/learn/foundations-marketing-analytics/discussions/weeks/3/threads/X7TaXUc3EeaS0w6RXgoWAw

6 active high value 254.077895
3   warm high value 114.459459
8   new active high 109.729064
7  active low value  41.896556
9    new active low  18.102917
4    warm low value  13.494770
2              cold   6.108221
5          new warm   5.064000
1          inactive   2.949466

*/

+-------+------------------+
|summary|      revenue_2015|
+-------+------------------+
|  count|             16905|
|   mean|21.218273883466434|
| stddev|111.24529944791601|
|    min|               0.0|
|    max|            4500.0|
+-------+------------------+

+---------------------+----------------+
|segment              |avg_revenue_2015|
+---------------------+----------------+
|active high value    |254.08          |
|warm high value      |114.46          |
|active new high value|109.73          |
|active low value     |41.9            |
|active new low value |18.1            |
|warm low value       |13.49           |
|cold                 |6.11            |
|warm new             |5.06            |
|inactive             |2.95            |
+---------------------+----------------+



forward = [customer_id: string, first_purchase: int ... 7 more fields]
forwardRevenue = [segment: string, avg_revenue_2015: double]


[segment: string, avg_revenue_2015: double]

### 3. Regarding the customers who belonged to the "new warm" segment in 2014, what was their expected revenue, all things considered, in 2015?

Choices:

1. \$2.94
2. \$5.06
3. \$6.10
4. \$18.10
5. \$41.89

We need to use the **forward** looking dataframe to answer this question. Find the average revenue generated in 2015 by each segment and pick the value for the "warm new" segment.

In [17]:
val newWarmAvgRevenue = forwardRevenue
                            .collect() 
                            .filter(row => row.get(0) == "warm new")(0)
                            .getDouble(1)
println("Expected revenue of 'new warm' segment: "+newWarmAvgRevenue)

Expected revenue of 'new warm' segment: 5.06


newWarmAvgRevenue = 5.06


5.06

### 4. In terms of expected revenue, which segment groups the least profitable customers?

Choices:

1. Warm high value
2. Warm low value
3. New active high
4. New active low
5. Active low value

We need to use the **forward** looking dataframe to answer this question. Among the segments listed in the choices, pick the one with the lowest average revenue generated in 2015.

In [18]:
val answer4Temp = forwardRevenue
    .filter($"segment".isin("warm high value", "warm low value", "active new high value", "active new low value", "active low value"))
    .orderBy($"avg_revenue_2015".asc)

answer4Temp.show(10, truncate=false)

val answer4 = answer4Temp
                    .take(1)(0) //pick the first 1 as the DF is sorted in low to high revenue
                    .get(0) //Pick first column "segment"

println("Least profitable segment: "+answer4)

+---------------------+----------------+
|segment              |avg_revenue_2015|
+---------------------+----------------+
|warm low value       |13.49           |
|active new low value |18.1            |
|active low value     |41.9            |
|active new high value|109.73          |
|warm high value      |114.46          |
+---------------------+----------------+

Least profitable segment: warm low value


answer4Temp = [segment: string, avg_revenue_2015: double]
answer4 = warm low value


warm low value

### 5. Looking at segment description, what is the average purchase amount of a customer who belongs to the "new active high" segment?

Choices:

1. \$82.33
2. \$84.09
3. \$85.50
4. \$91.21
5. \$283.38

We need to use the **segmented2015** dataframe to answer this question.

In [19]:
segmentProfile(segmented2015, "segment").show(10, truncate=false)

val answer5 = segmentProfile(segmented2015, "segment")
                        .filter($"segment" === lit("active new high value")) //new active high segment
                        .collect()(0) //will return 1 row
                        .getDouble(3) //will return 4 cols, we need "avg_a" col

println("Avg purchase amt for 'new active high' segment: "+ answer5)

+---------------------+-------+-----+------+
|segment              |avg_r  |avg_f|avg_a |
+---------------------+-------+-----+------+
|active high value    |88.82  |5.89 |240.05|
|active low value     |108.36 |5.94 |40.72 |
|active new high value|82.37  |1.02 |283.38|
|active new low value |85.54  |1.05 |33.7  |
|cold                 |857.78 |2.3  |51.74 |
|inactive             |2178.11|1.81 |48.11 |
|warm high value      |455.13 |4.71 |327.41|
|warm low value       |474.38 |4.53 |38.59 |
|warm new             |509.3  |1.04 |66.6  |
+---------------------+-------+-----+------+

Avg purchase amt for 'new active high' segment: 283.38


answer5 = 283.38


283.38