# Pentathlon-III: Next Product to Buy Models

* Team-lead GitLab userid: juespino
* Group name: Team17
* Team member names: Julian Espinoza-Martinez, Praveeen Kumar Basker, Haritha Parupudi & William Quinn

In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyrsm as rsm
import statsmodels.formula.api as smf
from sklearn import preprocessing
from statsmodels.genmod.families import Binomial
from statsmodels.genmod.families.links import logit

# increase plot resolution
# mpl.rcParams["figure.dpi"] = 150

In [2]:
## loading the data - this dataset must NOT be changed
pentathlon_nptb = pd.read_pickle("data/pentathlon_nptb.pkl")
pentathlon_nptb["buyer_yes"] = (pentathlon_nptb["buyer"] == "yes").astype(int)
#pentathlon_nptb.head()

In [4]:
# run python code from another notebook
%run ./q1_8.ipynb

AUC training data for intial model: 88.16%
AUC testing data for intial model: 88.17%



# The Analysis

All the estimations were done using the training sample, and performances of the models were assessed using the testing sample. The calculations for the following questions were for each of the 100,000 customers in the representative sample. 

### Q1: Determine the message predicted to lead to the highest probability of purchase.

#### Approach

To determine the message most likely to lead to the highest probability of purchase, we used the following pipeline: 

> 1) Build a logistic regression with interactions, so that it is flexible enough to allow for customization of messages.    
> 2) Predict the probability of purchase for each message for each customer.    
> 3) Determine the message with the highest predicted probability of purchase, and store that message *(as offer)* and the value *(as target)*.

To account for customization of messages, we recognize that different offers may work better for different customers. In other words, we want to customize messages because we think that there might be an interaction between (1) who the customer is and (2) how effective the offered message is. Hence, we need to interact `message type` with the variables that describe customer characteristics. 
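Steps 2 and 3 of the pipeline can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `toy_predict` is a hypothetical stand-in for the fitted logistic regression (made-up coefficients, including a message × age interaction), and the `age` column is an assumed customer characteristic.

```python
import numpy as np
import pandas as pd

MESSAGES = ["endurance", "strength", "water", "team",
            "backcountry", "racquet", "winter"]


def toy_predict(df: pd.DataFrame) -> np.ndarray:
    """Hypothetical stand-in for a fitted model's predict method.

    The coefficients are invented. The message x age interaction is what
    lets the best message differ from one customer to the next.
    """
    base = {m: -2.0 + 0.1 * i for i, m in enumerate(MESSAGES)}
    slope = {m: 0.02 * (3 - i) for i, m in enumerate(MESSAGES)}
    logit = df["message"].map(base) + df["message"].map(slope) * (df["age"] - 40)
    return 1.0 / (1.0 + np.exp(-logit.to_numpy(dtype=float)))


def best_message(customers: pd.DataFrame, predict) -> pd.DataFrame:
    """Score every customer under each message and keep the best one."""
    # Counterfactual scoring: pretend every customer received `msg`.
    probs = pd.DataFrame(
        {msg: predict(customers.assign(message=msg)) for msg in MESSAGES},
        index=customers.index,
    )
    return pd.DataFrame({
        "custid": customers["custid"],
        "to_offer": probs.idxmax(axis=1),  # message with the highest probability
        "target": probs.max(axis=1),       # that highest probability
    })


customers = pd.DataFrame({"custid": ["A", "B"], "age": [25, 60]})
result = best_message(customers, toy_predict)
print(result)
```

Without the interaction term, every customer would share the same ranking of messages and the customization would be pointless.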

This was how we built the logistic regression model, which had an AUC of `88.17%` on the testing set and `88.16%` on the training set. The predictions consist of 7 probabilities per customer, one for each of the 7 messages. From these probabilities, we identified the message with the highest value for each customer. The table below shows these results for the first 5 customers. 

In [5]:
q1.head()

Unnamed: 0,custid,to_offer_i
0,U45198803,water
1,U22197752,water
2,U19423462,endurance
3,U23888305,strength
4,U16954857,endurance


#### Alternative Models: Neural Networks & XGBoost

We also attempted to improve on the initial model by building a neural network and an XGBoost model.

#### Neural Network (scikit-learn)

In [6]:
%run ./neural_network.ipynb

AUC training data for neural network model: 88.46%
AUC testing data for neural network model: 88.45%



While the AUC results for the neural network look slightly higher than those for the logistic regression, the model resulted in poor predictions. 

With the neural network, `strength` was predicted to be the message with the highest probability of purchase for every customer. We also ran the model on the original dataset of 5 million customers and obtained similar results. This is unrealistic, so we concluded that the neural network would not be an accurate model for predicting which message leads to the highest probability of purchase. 

To obtain predictions that differ by message, we set each message dummy variable to 1 in turn and the rest to 0, making 7 predictions per customer (this has been commented out in the code blocks above). This yielded poor results, so we made another attempt, fitting the model on the training and testing data and predicting on the representative data. This yielded similar results, with one message type dominating across the board.

In [7]:
nn_q2

Unnamed: 0,customers (%)
strength,100.0


This model predicts that `strength` maximizes the probability of purchase for 100% of the customers in the representative data. Applying the model to the 5 million dataset yielded similar results, with one message dominating the rest.

Because of these results, we decided not to continue with this model and instead focused our efforts on model types that would give us better predictions. The neural network did not give us the predictions we were hoping for, likely because of the type of data at hand.

#### XGBoost

In [8]:
%run ./xgboost.ipynb

AUC training data for the XGBoost model: 90.77%
AUC testing data for the XGBoost model: 88.86%



The AUC results for the XGBoost model were the highest compared to the other two models, which was a good indication. Next, we predicted the message that would lead to the highest probability of purchase using this model, and got the following results.

In [9]:
q2_xg

Unnamed: 0,customers (%)
endurance,48.763
strength,21.756
water,11.372
racquet,8.512
backcountry,5.364
team,2.513
winter,1.72


Compared to the neural network model above, these results looked promising. 

The XGBoost model predicted that `endurance` would maximize the probability of purchase for the largest share of customers in the representative data, followed by `strength` and `water`, with `winter` having the lowest share. 

For the next part, the expected profit analysis, case weights could not be accounted for with this model, so we chose to work with the 5 million sample. This process led to results that were fairly close to those of the logistic regression model built above. Since the logistic regression model offers greater flexibility for customization, we will use the **logistic regression** model for the rest of the analysis.

### Q2: For each message, the percentage of customers for whom the message maximizes their *probability of purchase*.

Using the predicted purchase probabilities above, we took the column containing the offer with the highest purchase probability for each customer and computed, for each message type, the percentage of customers for whom it was the best offer. The results are summarized in the table below.

In [10]:
q2

Unnamed: 0,customers (%)
endurance,40.616
strength,20.492
water,18.55
team,10.945
backcountry,4.354
racquet,3.215
winter,1.828
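A minimal sketch of how such percentages can be computed from a column of best offers. The tiny `best` series is invented for illustration, and the sketch ignores any case weights.

```python
import pandas as pd


def message_shares(best: pd.Series) -> pd.Series:
    """Percentage of customers for whom each message is the best offer."""
    return (best.value_counts(normalize=True) * 100).round(3)


# Hypothetical best offers for four customers
best = pd.Series(["endurance", "endurance", "water", "strength"])
shares = message_shares(best)
print(shares)
```

`value_counts` sorts by frequency, so the output is ordered from the most to the least common best offer, as in the table above.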


### Q3: Determine the message predicted to lead to the highest expected profit (with a COGS of 60%)

#### Approach

To predict the expected profit, we know that the COGS is 60%, but we need the order size for each customer. To predict the order size, we built a linear regression model which considered all the customers who have bought within the training set and used `total_os` as the response variable. The best model was then used to predict the order size for each of the 7 message types. 
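The buyers-only regression can be sketched as below. This is an illustration under assumptions: it uses plain least squares from NumPy rather than the notebook's modelling library, and `income` is a hypothetical explanatory variable; only `buyer` and `total_os` are column names taken from the text.

```python
import numpy as np
import pandas as pd


def design_matrix(df: pd.DataFrame, features: list) -> np.ndarray:
    """Intercept column followed by the chosen feature columns."""
    return np.column_stack([np.ones(len(df)), df[features].to_numpy(dtype=float)])


def fit_order_size(train: pd.DataFrame, features: list) -> np.ndarray:
    """Ordinary least squares of total_os on `features`, using buyers only."""
    buyers = train[train["buyer"] == "yes"]
    y = buyers["total_os"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(design_matrix(buyers, features), y, rcond=None)
    return coef


# Invented training data; non-buyers have an order size of 0 and are excluded.
train = pd.DataFrame({
    "buyer": ["yes", "no", "yes", "yes", "no"],
    "income": [40.0, 55.0, 60.0, 80.0, 30.0],
    "total_os": [50.0, 0.0, 70.0, 90.0, 0.0],
})
coef = fit_order_size(train, ["income"])

# Predict the order size, conditional on purchase, for every customer.
pred_os = design_matrix(train, ["income"]) @ coef
```

Fitting on buyers only means the prediction is an order size conditional on a purchase. That is the quantity needed when it is later multiplied by the purchase probability.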

With these predicted order sizes for each message type, the predicted probability of purchase of each message type per customer, and the COGS, we were able to predict the expected profit for each customer for each of the 7 message types. 
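As a sketch, the calculation for each customer and message is: probability of purchase × predicted order size × margin, where margin = 1 − COGS = 0.4. The numbers below are invented, and the sketch assumes there is no per-e-mail cost to subtract.

```python
import pandas as pd

COGS = 0.60        # cost of goods sold, as stated in Q3
MARGIN = 1 - COGS  # share of the order size kept as profit


def expected_profit(p_buy: pd.DataFrame, order_size: pd.DataFrame) -> pd.DataFrame:
    """Expected profit per customer and message: P(buy) x order size x margin."""
    return (p_buy * order_size * MARGIN).add_prefix("ep_")


# Hypothetical predictions for two customers and two messages
p_buy = pd.DataFrame({"endurance": [0.02, 0.01], "water": [0.01, 0.03]})
order_size = pd.DataFrame({"endurance": [50.0, 100.0], "water": [120.0, 30.0]})

ep = expected_profit(p_buy, order_size)
to_offer_ep = ep.idxmax(axis=1).str.replace("ep_", "", regex=False)
print(ep.assign(to_offer_ep=to_offer_ep))
```

In this toy data the first customer is most likely to buy after the `endurance` message, but `water` has the higher expected profit because its predicted order size is larger. This is why the best message can differ between Q1 and Q3.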

Using the expected profits for each of the 7 messages, the message with the highest expected profit was set into a new column, and these messages were predicted to lead to the highest expected profit. These results have been summarized for the first 5 customers below. 

In [11]:
q3.head()

Unnamed: 0,custid,to_offer_ep
0,U45198803,water
1,U22197752,water
2,U19423462,endurance
3,U23888305,racquet
4,U16954857,racquet


### Q4: For each message, the percentage of customers for whom the message maximizes *expected profit*.

Using the predicted expected profits above, we took the column containing the offer with the highest expected profit for each customer and computed, for each message type, the percentage of customers for whom it was the most profitable offer. The results are summarized in the table below.

In [12]:
q4

Unnamed: 0,customers (%)
water,23.387
endurance,18.465
backcountry,16.02
racquet,15.122
winter,11.074
team,9.46
strength,6.472


### Q5: Expected profit, on average, per e-mailed customer if the messages were customized.

We have the expected profit per customer under the customized message that will be sent. Averaging these values gives the expected profit per e-mailed customer. 

In [13]:
print(f"The expected profit, on average, per e-mailed customer if the messages were customized: {q5[0]}")

The expected profit, on average, per e-mailed customer if the messages were customized: €0.338


### Q6: Expected profit, on average, per e-mailed customer if the customers received the same message [for each of the 7 messages]

If all customers were sent the same message, the average expected profit would be the average of the expected profits for that message type. From Q3, we have the predicted expected profit of each message for every customer. Thus, averaging each of the 7 expected profit columns gives the average profit for each of the 7 message types. The results are summarized in the table below. 

In [14]:
q6

Unnamed: 0,Avg. EP
ep_endurance,€0.27
ep_strength,€0.26
ep_winter,€0.25
ep_water,€0.24
ep_backcountry,€0.23
ep_racquet,€0.23
ep_team,€0.22
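The averages for Q5 and Q6 can be sketched with two pandas calls. The `ep` values are invented, and only two of the seven `ep_*` columns are shown.

```python
import pandas as pd

# Hypothetical expected profits per customer (rows) and message (columns)
ep = pd.DataFrame({
    "ep_endurance": [0.40, 0.50],
    "ep_water": [0.48, 0.36],
})

# Q6: everyone gets the same message, so average each column.
avg_same = ep.mean().sort_values(ascending=False)

# Q5: each customer gets their own best message, so average the row-wise maximum.
avg_custom = ep.max(axis=1).mean()

print(avg_same)
print(f"customized: {avg_custom:.2f}")
```

The customized average can never be lower than the best single-message average, because each customer's maximum is at least as large as their value for any one message.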


### Q7: Expected profit, on average, per e-mailed customer if every customer is assigned randomly to one of the 7 messages.

#### Approach

Until this point, we looked at customizing the message sent to customers, first using the maximum probability of purchase and then using the maximum expected profit as the metric. Now, we want the average expected profit if messages are assigned to customers at random. To determine this, we randomly assigned each customer a message type, stored in a new `to_offer_rnd` column. Taking this assigned message and the expected profit for that message type, we created a new `target_rnd` column with the expected profit from the randomized message. The average expected profit was then calculated as in Q5 above.

In [15]:
print(f"The expected profit, on average, per e-mailed customer if the messages were randomized: {q7[0]}")

The expected profit, on average, per e-mailed customer if the messages were randomized: €0.243
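A minimal sketch of the random assignment, assuming a uniform draw over the messages and a DataFrame `ep` holding one expected profit column per message. The seed and the toy values are arbitrary. Only the `to_offer_rnd` and `target_rnd` names come from the text.

```python
import numpy as np
import pandas as pd


def assign_random_offers(ep: pd.DataFrame, seed: int = 1234) -> pd.DataFrame:
    """Assign each customer one message uniformly at random and look up its expected profit."""
    rng = np.random.default_rng(seed)
    col_idx = rng.integers(0, ep.shape[1], size=len(ep))  # one column index per customer
    return pd.DataFrame({
        "to_offer_rnd": ep.columns.to_numpy()[col_idx],
        "target_rnd": ep.to_numpy()[np.arange(len(ep)), col_idx],
    }, index=ep.index)


# Toy data: every customer has the same expected profit for a given message
ep = pd.DataFrame({
    "ep_endurance": [0.2] * 1000,
    "ep_water": [0.3] * 1000,
    "ep_team": [0.4] * 1000,
})
rnd = assign_random_offers(ep)
avg_rnd = rnd["target_rnd"].mean()
```

With a uniform draw, the result should be close to the simple mean of the per-message averages. This is consistent with the Q6 table, whose seven rounded values average to roughly €0.243.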


### Q8: For the 5 million customer e-mail blast, the improvement, *in percent and total Euros*, Pentathlon could achieve by customizing messages.

Since the e-mail blast campaign was for 5 million customers, the calculations above, as done on the upsampled data, can be scaled to the entire dataset. The summarized results are printed below.

In [16]:
print(f"Expected profit from offering customization: {q5[0]}.")
print(f"Expected profit from randomized messaging: {q7[0]}.")
print(f"Expected profit improvement from offering customization: €{q8_profit.round(3):,}.\n")

print(f"Scaled expected profit from offering customization: €{ep_target.round(3):,}.")
print(f"Scaled expected profit from randomized messaging: €{ep_rnd.round(3):,}.")
print(f"Scaled expected profit improvement from offering customization: €{q8_profit_sc.round(2):,}.\n")

print(f"Scaled expected profit difference (as a percentage) from offering customization: {q8_perc.round(3):,}%.") 

Expected profit from offering customization: €0.338.
Expected profit from randomized messaging: €0.243.
Expected profit improvement from offering customization: €0.095.

Scaled expected profit from offering customization: €1,688,500.942.
Scaled expected profit from randomized messaging: €1,217,251.462.
Scaled expected profit improvement from offering customization: €471,249.48.

Scaled expected profit difference (as a percentage) from offering customization: 39.095%.
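The arithmetic behind the improvement can be sketched from the rounded per-customer figures printed above. The variable names are illustrative and are not the notebook's own.

```python
N_CUSTOMERS = 5_000_000  # size of the e-mail blast

ep_custom = 0.338  # rounded average expected profit, customized messages
ep_random = 0.243  # rounded average expected profit, random messages

gain_per_customer = ep_custom - ep_random
gain_pct = gain_per_customer / ep_random * 100
gain_total = gain_per_customer * N_CUSTOMERS

print(f"Improvement per customer: €{gain_per_customer:.3f}")
print(f"Improvement in percent: {gain_pct:.3f}%")
print(f"Improvement for the blast: €{gain_total:,.0f}")
```

With these rounded inputs the percentage comes out at 39.095%, matching the printed value, while the scaled total is about €475,000 rather than €471,249.48. The printed scaled totals appear to be based on unrounded per-customer averages, which would explain the small gap.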


# A New Policy Proposal

#### Strengths we see in this new policy:

> The policy will work well in targeting customers who frequent Pentathlon to buy sporting equipment from one or two departments, such as youth sports coaches and multi-sport high school athletes.

#### Weaknesses we see in this new policy:

> This type of targeted mailing will likely be ineffective in allocating promotional e-mails across the various departments, and will likely result in customer fatigue for those receiving similar e-mails over the course of three weeks.

> In addition, people who shop at a sporting goods store are more likely to make one big purchase from a department and not need to return to that section for some time. For example, someone might purchase a surfboard from the water sports department, or a new baseball bat and glove from the team section for the new season. It is unlikely that these consumers, who we feel would be the majority, would return to the same department week after week. Targeted messages would be redundant if they have already visited the department for what they needed.

#### Our suggestions for improvement:

> We would recommend that instead of doing the analysis monthly, it be done bi-weekly (every other week of the month), so that customers are not irritated by receiving targeted messages from departments they interacted with a month ago.    

> Our second recommendation is that if a customer has already purchased a product in the department of the message that would maximize their probability of purchase, the next message that month be replaced with the message that has the second or third highest probability. In doing so, the customer will not be sent a message from a department that they have just purchased an item from. Effectively, in a given month, once a customer has purchased in response to a message type, they will not see another message from that department. Instead, we suggest sending them a message from a department that they have not yet purchased from but are likely to. We see this as a way for Pentathlon to build the customer's interest in a different department and encourage them to return.
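The substitution rule in the second recommendation can be sketched as follows. This is a hypothetical illustration: the function name, the probabilities, and the fallback when every department has been purchased from are all assumptions.

```python
import pandas as pd


def next_offer(probs: pd.Series, purchased: set) -> str:
    """Best-scoring message from a department not yet purchased from this month."""
    eligible = probs.drop(labels=list(purchased), errors="ignore")
    if eligible.empty:
        # Assumed fallback: nothing left to exclude, so use the overall best message.
        return probs.idxmax()
    return eligible.idxmax()


# Hypothetical purchase probabilities for one customer
probs = pd.Series({"endurance": 0.030, "water": 0.025, "team": 0.010})

print(next_offer(probs, set()))                 # no purchases yet
print(next_offer(probs, {"endurance"}))         # skip the department just bought from
print(next_offer(probs, {"endurance", "water"}))
```

The same rule could be applied to expected profit instead of purchase probability by passing a series of `ep_*` values.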
