**OVERVIEW**

**Objective**

The objective of this project is to increase the click-through rate (CTR) of a marketing email campaign.

**Dataset**

The dataset consists of 99,950 emails, with fields covering:

* email metadata
* user metadata
* whether the email was clicked or not

**SETUP**

**Import**

In [None]:
install.packages('randomForest')
install.packages('kableExtra')
library(randomForest)
set.seed(4321)  # fix the seed so the sampling and the forest are reproducible

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.



**Read data**

In [None]:
data = read.csv("https://drive.google.com/uc?export=download&id=1PXjbqSMu__d_ppEv92i_Gnx3kKgfvhFk")

**BUILD MODEL**

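**Convert the label to a factor and drop email_id**, which is only an identifier and carries no predictive information

In [None]:
# Treat the label as categorical so randomForest fits a classifier
data$clicked = as.factor(data$clicked)

# Drop the identifier column
data$email_id = NULL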

**Bin data**

Bin the continuous variables into a few categories to speed up execution.

Bin data by hour

The dataset is binned as follows:

* morning: 6 <= hour < 14
* afternoon: 14 <= hour < 22
* night: all remaining hours (hour >= 22 or hour < 6)

In [None]:
data$hour_binned = ifelse(data$hour >= 6 & data$hour < 14, "morning",
                          ifelse(data$hour >= 14 & data$hour < 22, "afternoon", "night"))
data$hour_binned = as.factor(data$hour_binned)
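
As a quick check of the bin boundaries, the same rule can be wrapped in a small helper and applied to a few edge values. This is only a sketch: `bin_hour` and the sample hours are hypothetical names and values introduced here, not part of the original notebook.

```r
# Hypothetical helper that mirrors the ifelse() rule used in the cell above
bin_hour <- function(hour) {
  ifelse(hour >= 6 & hour < 14, "morning",
         ifelse(hour >= 14 & hour < 22, "afternoon", "night"))
}

# Illustrative hours chosen to sit on either side of each boundary
hours <- c(5, 6, 13, 14, 21, 22)
print(data.frame(hour = hours, hour_binned = bin_hour(hours)))
```

Checking the values on both sides of 6, 14 and 22 makes it easy to confirm that each boundary hour lands in the intended bin.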

Bin data by purchases

The dataset is binned as follows:

* None: [0, 1) purchases
* Low: [1, 4) purchases
* Medium: [4, 8) purchases
* High: [8, 23) purchases


In [None]:
data$purchase_binned = ifelse(data$user_past_purchases == 0, "None",
                              ifelse(data$user_past_purchases < 4, "Low",
                                     ifelse(data$user_past_purchases < 8, "Medium", "High")))
data$purchase_binned = as.factor(data$purchase_binned)
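
The same half-open intervals could also be expressed with base R's `cut()`, which some readers may find easier to audit than nested `ifelse()` calls. This is an alternative sketch, not what the notebook runs: `bin_purchases` and the sample counts are hypothetical, and unlike `as.factor()` (alphabetical levels) it keeps the levels in the order None, Low, Medium, High.

```r
# Hypothetical alternative to the nested ifelse(): left-closed intervals
# [0,1), [1,4), [4,8), [8,Inf) mapped to the four labels
bin_purchases <- function(n) {
  cut(n,
      breaks = c(0, 1, 4, 8, Inf),
      right  = FALSE,
      labels = c("None", "Low", "Medium", "High"))
}

# Illustrative purchase counts on either side of each boundary
purchases <- c(0, 1, 3, 4, 7, 8, 22)
print(data.frame(purchases = purchases, purchase_binned = bin_purchases(purchases)))
```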

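Reorder the dataset so the label comes last, and drop the continuous variables now replaced by their binned versions

In [None]:
# Keep user_country, purchase_binned, email_text, email_version, weekday, hour_binned, clicked
data = data[,c(5,9,1,2,4,8,7)] 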

**Prepare training and test set**

In [None]:
train_indices <- sample(nrow(data), size = nrow(data)*0.66)
train <- data[train_indices,]
test <-  data[-train_indices,]
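
Because clicks are rare, a purely random split can leave the two sets with slightly different click rates. One possible variation, shown here on a toy data frame and not used in this notebook, is to sample 66% within each class. The `toy` data frame and its column `x` are hypothetical.

```r
set.seed(4321)

# Toy stand-in for the email data: 980 non-clicks and 20 clicks
toy <- data.frame(
  x       = rnorm(1000),
  clicked = factor(rep(c(0, 1), times = c(980, 20)))
)

# Sample 66% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$clicked)
toy_train_idx <- unlist(
  lapply(rows_by_class, function(idx) sample(idx, size = floor(length(idx) * 0.66))),
  use.names = FALSE
)

toy_train <- toy[toy_train_idx, ]
toy_test  <- toy[-toy_train_idx, ]

# Both sets now hold close to the same share of clicks
print(prop.table(table(toy_train$clicked)))
print(prop.table(table(toy_test$clicked)))
```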

**Build RF**

In [None]:
rf_model = randomForest(
  x = train[, -ncol(train)], y = train$clicked,    # label is the last column
  xtest = test[, -ncol(test)], ytest = test$clicked,
  classwt = c(2, 1), ntree = 50,
  keep.forest = TRUE    # needed for predict() when xtest is supplied
)

In [None]:
rf_model


Call:
 randomForest(x = train[, -ncol(train)], y = train$clicked, xtest = test[,      -ncol(test)], ytest = test$clicked, ntree = 50, classwt = c(2,      1), keep.forest = TRUE) 
               Type of random forest: classification
                     Number of trees: 50
No. of variables tried at each split: 2

        OOB estimate of  error rate: 11.88%
Confusion matrix:
      0    1 class.error
0 57722 6898   0.1067471
1   937  410   0.6956199
                Test set error rate: 12.22%
Confusion matrix:
      0    1 class.error
0 29581 3680   0.1106401
1   473  249   0.6551247
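
To read the out-of-bag confusion matrix above in terms of the click class, the counts can be turned into recall, precision and accuracy. The sketch below retypes the printed counts by hand; the variable names are introduced here for illustration.

```r
# Out-of-bag counts copied from the printed model summary
# (rows = actual class, columns = predicted class)
oob <- matrix(c(57722, 6898,
                  937,  410),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

recall_1    <- oob["1", "1"] / sum(oob["1", ])   # share of real clicks that are caught
precision_1 <- oob["1", "1"] / sum(oob[, "1"])   # share of predicted clicks that are real
accuracy    <- sum(diag(oob)) / sum(oob)         # 1 minus the OOB error rate

print(round(c(recall = recall_1, precision = precision_1, accuracy = accuracy), 4))
```

Recall for class 1 comes out at roughly 0.30 and precision at roughly 0.06, so the headline error rate of about 12% mostly reflects the majority class.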

_______

**Predict click-through-rate for each segment**

Remove classification label and drop duplicates

In [None]:
data_unique = data[, -ncol(data)] 
data_unique = data_unique[!duplicated(data_unique),]

Predict & add to dataset

In [None]:
prediction = predict(rf_model, data_unique, type="prob")[,2]
data_unique$prediction = prediction
knitr::kable(data_unique[1:5,])



|user_country |purchase_binned |email_text  |email_version |weekday   |hour_binned | prediction|
|:------------|:---------------|:-----------|:-------------|:---------|:-----------|----------:|
|US           |Low             |short_email |generic       |Thursday  |morning     |       0.00|
|US           |None            |long_email  |personalized  |Monday    |morning     |       0.00|
|US           |Low             |short_email |generic       |Tuesday   |afternoon   |       0.00|
|US           |High            |long_email  |personalized  |Thursday  |morning     |       0.86|
|UK           |Low             |short_email |generic       |Wednesday |morning     |       0.08|

______

**Identify the best email characteristics for each user**

1. Sort the records by descending probability of clicking
2. Keep only the first record for each (user_country, purchase_binned) group, which gives the email characteristics with the highest P(click) for that user segment
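
The two steps above can be sketched in base R on a toy data frame before applying them to the real predictions. The `toy` rows and their probabilities are made up for illustration only.

```r
# Toy predictions: several email variants per (country, purchase bin) segment
toy <- data.frame(
  user_country    = c("US", "US", "UK", "UK", "US"),
  purchase_binned = c("High", "High", "High", "High", "Low"),
  email_version   = c("generic", "personalized", "generic", "personalized", "generic"),
  prediction      = c(0.40, 0.90, 0.70, 0.20, 0.10)
)

# Step 1: sort by descending predicted probability
sorted <- toy[order(toy$prediction, decreasing = TRUE), ]

# Step 2: keep the first (highest-probability) row of each segment
best <- sorted[!duplicated(sorted[, c("user_country", "purchase_binned")]), ]
print(best)
```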


In [None]:
library(dplyr)

# Within each user segment, keep the row with the highest predicted P(click)
best_segment = data_unique %>% 
               group_by(user_country, purchase_binned) %>% 
               arrange(desc(prediction)) %>% 
               filter(row_number()==1)
knitr::kable(best_segment)



|user_country |purchase_binned |email_text  |email_version |weekday   |hour_binned | prediction|
|:------------|:---------------|:-----------|:-------------|:---------|:-----------|----------:|
|US           |High            |short_email |personalized  |Wednesday |morning     |       0.98|
|UK           |High            |short_email |personalized  |Sunday    |morning     |       0.98|
|UK           |Medium          |short_email |personalized  |Monday    |morning     |       0.76|
|US           |Medium          |short_email |personalized  |Sunday    |morning     |       0.72|
|FR           |High            |short_email |personalized  |Sunday    |night       |       0.60|
|ES           |High            |short_email |personalized  |Tuesday   |morning     |       0.56|
|ES           |Medium          |short_email |personalized  |Tuesday   |afternoon   |       0.48|
|US           |Low             |short_email |personalized  |Wednesday |afternoon   |       0.36|
|FR           |Medium       

* We now have a model that returns the best email strategy for each combination of user characteristics.

* Caveat: even the best email strategy yields a low click probability for users with no past purchases, regardless of country.

* Emails alone are unlikely to convert those users.

* Reaching them may require product changes. The marketing opportunities lie in the segments that already respond well.

_________

**Estimate A/B test gains**

* We have a model for sending personalized emails.

* We now have to test it.
* To test it, we would apply the model to a randomized fraction of users.
* Then we would compare the results with the current email strategy.

* To run the test, though, the product manager has to be convinced that it is worthwhile from an opportunity-cost standpoint.
* The best way to do that is to give them an estimate of how much we could potentially increase the click rate.
* That way they can judge whether the test makes sense.

* Since we know the predicted probability for each group, we can estimate the overall click rate as the weighted average across groups.

* Caveat: the raw predicted probability from the model is not sufficient.
* Reason: the model has a high class 1 error (about 70% on the out-of-bag sample).

* Hence, we need to adjust the predicted probabilities to account for the model's expected error.
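
Written out, the adjustment amounts to the weighted average below. Here $s$ indexes the (country, purchase bin) segments, $w_s$ is the segment's share of all emails, and $p_s$ is the best predicted click probability for that segment. PPV and FOR are the model's positive predictive value and false omission rate. Treating $p_s$ as the chance that the model flags an email in segment $s$ as a click is an assumption of this formulation.

$$
\widehat{\mathrm{CTR}} = \sum_{s} w_s \left[ p_s \cdot \mathrm{PPV} + (1 - p_s) \cdot \mathrm{FOR} \right],
\qquad
\mathrm{PPV} = \frac{TP}{TP + FP},
\qquad
\mathrm{FOR} = \frac{FN}{FN + TN}
$$

With the out-of-bag confusion matrix printed earlier, PPV = 410 / (410 + 6898) ≈ 0.056 and FOR = 937 / (937 + 57722) ≈ 0.016.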

In [None]:
# Share of all emails that falls in each user segment
count_segment = data %>% 
                group_by(user_country, purchase_binned) %>% 
                summarize(weight = n()/nrow(data), .groups = "drop")
best_segment = merge(best_segment, count_segment)
knitr::kable(best_segment[order(best_segment$prediction, decreasing = TRUE),])




|   |user_country |purchase_binned |    weight|email_text  |email_version |weekday   |hour_binned | prediction|
|:--|:------------|:---------------|---------:|:-----------|:-------------|:---------|:-----------|----------:|
|9  |UK           |High            | 0.0271336|short_email |personalized  |Sunday    |morning     |       0.98|
|13 |US           |High            | 0.0832916|short_email |personalized  |Wednesday |morning     |       0.98|
|11 |UK           |Medium          | 0.0662531|short_email |personalized  |Monday    |morning     |       0.76|
|15 |US           |Medium          | 0.2001801|short_email |personalized  |Sunday    |morning     |       0.72|
|5  |FR           |High            | 0.0144472|short_email |personalized  |Sunday    |night       |       0.60|
|1  |ES           |High            | 0.0142271|short_email |personalized  |Tuesday   |morning     |       0.56|
|3  |ES           |Medium          | 0.0339070|short_email |personalized  |Tuesday   |afternoon   |   

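**Calculate CTR** after adjusting the predictions for the model's class 0 and class 1 errors

In [None]:
# Positive predictive value: share of predicted clicks that are real clicks (OOB)
ppv = rf_model$confusion[2,2]/sum(rf_model$confusion[,2])

# False omission rate: share of predicted non-clicks that are actually clicks (OOB)
forate = rf_model$confusion[2,1]/sum(rf_model$confusion[,1])

best_segment$adjusted_prediction = best_segment$prediction * ppv + (1-best_segment$prediction) * forate

# Weighted average over segments vs. the click rate observed in the data
data.frame( predicted_click_rate = sum(best_segment$adjusted_prediction*best_segment$weight),
                  old_click_rate = mean(as.numeric(as.character(data$clicked))))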

| predicted_click_rate| old_click_rate|
|--------------------:|--------------:|
|           0.03450747|     0.02070035|
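
To turn these two rates into something a product manager can act on, one option is to report the relative lift and a rough sample size for the A/B test. The sketch below uses `power.prop.test()` from base R's `stats` package. The 5% significance level and 80% power are assumptions chosen for illustration, not values taken from this analysis.

```r
# Click rates copied from the output above
old_ctr <- 0.02070035
new_ctr <- 0.03450747

# Relative improvement of the predicted click rate over the current one
relative_lift <- new_ctr / old_ctr - 1

# Users needed per arm to detect that difference in a two-sample test of
# proportions (assumed: 5% two-sided significance, 80% power)
test_plan   <- power.prop.test(p1 = old_ctr, p2 = new_ctr,
                               sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(test_plan$n)

print(round(relative_lift, 3))
print(n_per_group)
```

The relative lift works out to about 67%. The required sample size depends heavily on the assumed significance level and power, so it should be read as an order-of-magnitude guide only.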
