
# Causal ML for Marketing Campaigns

A retailer wants to make its discount campaigns more effective. It distributes promotions across several channels and wants to refine its marketing strategy using data on user demographics, campaign and coupon details, product information, and previous transactions. The original dataset is available on [Kaggle](https://www.kaggle.com/datasets/vasudeva009/predicting-coupon-redemption), and the sample used here comes from [this source](https://doi.org/10.7910/DVN/2P8AY0).

**Data dictionary:**

- dailyspending: daily spending of the customer
- coupons: whether the customer received a coupon
- coupons_preperiod: whether the customer received a coupon in the previous period
- dailyspending_preperiod: daily spending of the customer in the previous period
- income_bracket: income bracket from 1 to 12
- age_range: age range from 1 to 6
- married: whether the customer is married
- rented: whether the customer rents a house
- family_size: number of people in the customer's household

In [1]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import linear_model, ensemble

import warnings
warnings.simplefilter('ignore')

## Check the data

In [2]:
# Read data
path_data = 'https://github.com/pabloestradac/experimentation-notebooks/raw/main/data/'
df = pd.read_csv(path_data + 'coupon.csv')
df.head()

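| | dailyspending | coupons | coupons_preperiod | dailyspending_preperiod | income_bracket | age_range | married | rented | family_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 411.624 | 0 | 0 | 0.0 | 4 | 6 | 1 | 0 | 2 |
| 1 | 253.574444 | 0 | 0 | 411.624 | 4 | 6 | 1 | 0 | 2 |
| 2 | 261.673684 | 1 | 0 | 253.574444 | 4 | 6 | 1 | 0 | 2 |
| 3 | 0.0 | 1 | 1 | 0.0 | 5 | 4 | 1 | 0 | 2 |
| 4 | 0.0 | 1 | 1 | 0.0 | 5 | 4 | 1 | 0 | 2 |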


In [3]:
# Descriptive Statistics
df.describe().round(2)

| | dailyspending | coupons | coupons_preperiod | dailyspending_preperiod | income_bracket | age_range | married | rented | family_size |
|---|---|---|---|---|---|---|---|---|---|
| count | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 |
| mean | 291.45 | 0.24 | 0.18 | 269.47 | 5.01 | 3.57 | 0.74 | 0.08 | 2.54 |
| std | 310.26 | 0.43 | 0.39 | 380.83 | 2.35 | 1.3 | 0.44 | 0.27 | 1.19 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 56.09 | 0.0 | 0.0 | 0.0 | 4.0 | 3.0 | 0.0 | 0.0 | 2.0 |
| 50% | 210.57 | 0.0 | 0.0 | 123.42 | 5.0 | 4.0 | 1.0 | 0.0 | 2.0 |
| 75% | 427.36 | 0.0 | 0.0 | 395.34 | 6.0 | 4.0 | 1.0 | 0.0 | 3.0 |
| max | 1975.75 | 1.0 | 1.0 | 3565.34 | 12.0 | 6.0 | 1.0 | 1.0 | 5.0 |


## Regression

What is the effect of sending coupons on the daily spending of the customer?

$$
\text{dailyspending} = \beta_0 + \beta_1 \text{coupons} + e
$$

In [5]:
# OLS no controls
model_base = smf.ols(formula='dailyspending ~ coupons', data=df).fit(cov_type='HC1')
base = model_base.summary()
print(base)
results_ols = model_base


                            OLS Regression Results                            
Dep. Variable:          dailyspending   R-squared:                       0.017
Model:                            OLS   Adj. R-squared:                  0.016
Method:                 Least Squares   F-statistic:                     19.20
Date:                Fri, 06 Dec 2024   Prob (F-statistic):           1.27e-05
Time:                        15:13:38   Log-Likelihood:                -9241.5
No. Observations:                1293   AIC:                         1.849e+04
Df Residuals:                    1291   BIC:                         1.850e+04
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    268.7191      9.405     28.572      0.0
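
Because `coupons` is binary, the slope in the regression without controls is the difference in mean spending between customers with and without a coupon. The sketch below checks this identity on a small made-up sample (the numbers are illustrative and are not taken from `coupon.csv`), using only the standard library; `ols_slope` is a hypothetical helper.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x for a single regressor."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Illustrative data: three customers without a coupon, two with one
coupons = [0, 0, 0, 1, 1]
spending = [100.0, 200.0, 300.0, 250.0, 350.0]

slope = ols_slope(coupons, spending)
treated = [s for c, s in zip(coupons, spending) if c == 1]
control = [s for c, s in zip(coupons, spending) if c == 0]
diff_in_means = mean(treated) - mean(control)

print(round(slope, 6), round(diff_in_means, 6))  # both equal 100.0
```

The raw comparison is only a causal effect if coupon assignment is unrelated to other drivers of spending, which is why the next models add controls.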

Let's add pre-treatment covariates to the model:

$$
\text{dailyspending} = \beta_0 + \beta_1 \text{coupons} + \beta_2' X + e
$$

In [8]:
model_controls = smf.ols(formula='dailyspending ~ coupons + coupons_preperiod + dailyspending_preperiod + (C(income_bracket) + C(age_range) + married + C(family_size))', data=df).fit(cov_type='HC1')

controls = model_controls.summary()
print(controls)

                            OLS Regression Results                            
Dep. Variable:          dailyspending   R-squared:                       0.094
Model:                            OLS   Adj. R-squared:                  0.077
Method:                 Least Squares   F-statistic:                     5.046
Date:                Fri, 06 Dec 2024   Prob (F-statistic):           3.79e-14
Time:                        15:21:30   Log-Likelihood:                -9188.9
No. Observations:                1293   AIC:                         1.843e+04
Df Residuals:                    1268   BIC:                         1.856e+04
Df Model:                          24                                         
Covariance Type:                  HC1                                         
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                 

In [6]:
# OLS with additive controls
#model_controls = dailyspending ~ coupons + (income_bracket + age_range + married + family_size | coupons)
X = df[['income_bracket', 'age_range', 'married', 'family_size']]
Y = df['dailyspending']
model_add = sm.OLS(Y, sm.add_constant(pd.concat([df['coupons'], X], axis=1))).fit(cov_type='HC1')
add = model_add.summary()
print(add)
results_ols_add = model_add


                            OLS Regression Results                            
Dep. Variable:          dailyspending   R-squared:                       0.054
Model:                            OLS   Adj. R-squared:                  0.051
Method:                 Least Squares   F-statistic:                     13.41
Date:                Fri, 06 Dec 2024   Prob (F-statistic):           9.18e-13
Time:                        15:15:55   Log-Likelihood:                -9216.5
No. Observations:                1293   AIC:                         1.845e+04
Df Residuals:                    1287   BIC:                         1.848e+04
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const            197.3882     39.563      4.

In [9]:
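# OLS with interacted controls
X = df[['income_bracket', 'age_range', 'married', 'family_size']]
Y = df['dailyspending']
# Multiply row by row: df['coupons'] * X aligns the Series index with X's column labels,
# which fills the product with NaNs and raises MissingDataError in sm.OLS
X_int = X.mul(df['coupons'], axis=0).add_prefix('coupons_x_')
model_int = sm.OLS(Y, sm.add_constant(pd.concat([df['coupons'], X_int], axis=1))).fit(cov_type='HC1')
print(model_int.summary())
results_ols_int = model_int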

## Double Machine Learning

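Instead of assuming that the covariates enter the outcome and treatment equations linearly, we can use machine learning models to estimate the nuisance functions $g(X)$ and $m(X)$ flexibly. The effect of the treatment itself, $\beta_1$, stays linear and constant in this partially linear model: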

$$
\begin{gathered}
\text{dailyspending} = \beta_1 \text{coupons} + g(X) + u \\
\text{coupons} = m(X) + v
\end{gathered}
$$
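
One way to read the two equations: predict both `dailyspending` and `coupons` from $X$ on held-out folds, then regress the outcome residuals on the treatment residuals. The sketch below illustrates this partialling-out idea with two-fold cross-fitting. It is a simplified stand-in and not the DoubleML implementation: the helpers `fit_line` and `cross_fit_plr` are hypothetical, the nuisance models are one-variable least-squares lines rather than machine learners, and the data are synthetic with a true effect of 2.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b


def cross_fit_plr(y, d, x, n_folds=2):
    """Residual-on-residual estimate of the treatment coefficient with cross-fitting."""
    n = len(y)
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]
    u_hat, v_hat = [0.0] * n, [0.0] * n
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        # Fit both nuisance models on the training fold only
        a_l, b_l = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        a_m, b_m = fit_line([x[i] for i in train_idx], [d[i] for i in train_idx])
        # Residualize the held-out observations
        for i in test_idx:
            u_hat[i] = y[i] - (a_l + b_l * x[i])
            v_hat[i] = d[i] - (a_m + b_m * x[i])
    return sum(u * v for u, v in zip(u_hat, v_hat)) / sum(v * v for v in v_hat)


# Synthetic, noise-free data: y = 2*d + 3*x + 1
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
d = [0, 1, 0, 0, 1, 1, 0, 1]
y = [2.0 * di + 3.0 * xi + 1.0 for di, xi in zip(d, x)]

theta_hat = cross_fit_plr(y, d, x)
print(round(theta_hat, 6))  # recovers the true effect of 2.0
```

Fitting the nuisance models on one fold and residualizing on another keeps overfitting in the learners from leaking into the effect estimate.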

In [10]:
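import subprocess
import sys

# Install doubleml when running on Google Colab
try:
    import google.colab  # noqa: F401
    # Use the running interpreter so the package lands in the active environment
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'doubleml'], check=True)
except ImportError:
    pass

import doubleml as dml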

In [11]:
# DML with linear and logistic regression
splits = 5
covariates = list(df.drop(['dailyspending', 'coupons'], axis=1).columns)
dml_data = dml.DoubleMLData(df, y_col='dailyspending', d_cols='coupons', x_cols=covariates)
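ml_g = linear_model.LinearRegression()  # outcome learner, reported as ml_l: predicts dailyspending from X
ml_m = linear_model.LogisticRegression() # treatment learner: predicts the probability of receiving a coupon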
results_dml_linear = dml.DoubleMLPLR(dml_data, ml_g, ml_m, n_folds=splits).fit()
print(results_dml_linear)



------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: LinearRegression()
Learner ml_m: LogisticRegression()
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[301.61445498]]
Classification:
Learner ml_m Log Loss: [[0.42511191]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
              coef    std err         t     P>|t|      2.5 %      97.5 %
coupons  76.176917  23.127471  3.293785  0.000988  30.847908  121.505927
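
The fit summary can be reproduced by hand from the coefficient and its standard error. The sketch below assumes that the test statistic, p-value, and 95% interval rely on a normal approximation; the two inputs are copied from the output above.

```python
from statistics import NormalDist

coef, se = 76.176917, 23.127471  # copied from the fit summary above

z = NormalDist()  # standard normal distribution
t_stat = coef / se
p_value = 2 * (1 - z.cdf(abs(t_stat)))  # two-sided p-value
crit = z.inv_cdf(0.975)  # about 1.96 for a 95% interval
ci_low, ci_high = coef - crit * se, coef + crit * se

print(round(t_stat, 3), round(p_value, 6), round(ci_low, 2), round(ci_high, 2))
```

The results agree with the reported row up to rounding of the inputs.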


In [16]:
# DML with lasso
cv = 5
ml_g = linear_model.LassoCV(cv=cv)
ml_m = linear_model.LogisticRegressionCV(penalty='l1', solver='saga', cv=cv)

results_dml_lasso = dml.DoubleMLPLR(dml_data, ml_g, ml_m, n_folds=splits).fit()
print(results_dml_lasso)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: LassoCV(cv=5)
Learner ml_m: LogisticRegressionCV(cv=5, penalty='l1', solver='saga')
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[301.65302248]]
Classification:
Learner ml_m Log Loss: [[0.66560481]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
             coef    std err         t     P>|t|      2.5 %     97.5 %
coupons  56.52858  17.191694  3.288133  0.001009  22.833478  90.223681


In [18]:
# DML with random forest
ml_g = ensemble.RandomForestRegressor(max_features='sqrt')
ml_m = ensemble.RandomForestClassifier()
results_dml_rf = dml.DoubleMLPLR(dml_data, ml_g, ml_m, n_folds=splits).fit()
print(results_dml_rf)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_features='sqrt')
Learner ml_m: RandomForestClassifier()
Out-of-sample Performance:
Regression:
Learner ml_l RMSE: [[308.58545506]]
Classification:
Learner ml_m Log Loss: [[0.79872265]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1

------------------ Fit summary       ------------------
              coef    std err         t     P>|t|    2.5 %     97.5 %
coupons  45.953391  23.095813  1.989685  0.046626  0.68643  91.220352


We can also use a nonlinear, fully interacted regression model for the outcome equation:

$$
\begin{gathered}
\text{dailyspending} = g(\text{coupons}, X) + u \\
\text{coupons} = m(X) + v
\end{gathered}
$$
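
In the interactive model, the average treatment effect is typically estimated with a doubly robust (AIPW) score that combines the outcome predictions $g(0, X)$ and $g(1, X)$ with the propensity $m(X)$. The sketch below evaluates that score on hand-made predictions. The function name `aipw_ate` and all numbers are hypothetical, and cross-fitting and propensity trimming are left out.

```python
def aipw_ate(y, d, g0, g1, m):
    """Average of the doubly robust score for the average treatment effect."""
    scores = [
        g1_i - g0_i
        + d_i * (y_i - g1_i) / m_i
        - (1 - d_i) * (y_i - g0_i) / (1 - m_i)
        for y_i, d_i, g0_i, g1_i, m_i in zip(y, d, g0, g1, m)
    ]
    return sum(scores) / len(scores)


# Illustrative inputs: observed outcomes, treatment, and model predictions
y = [10.0, 12.0, 20.0, 25.0]
d = [0, 0, 1, 1]
g0 = [11.0, 12.0, 14.0, 15.0]  # predicted spending without a coupon
g1 = [18.0, 19.0, 20.0, 23.0]  # predicted spending with a coupon
m = [0.5, 0.5, 0.5, 0.5]       # predicted probability of receiving a coupon

ate = aipw_ate(y, d, g0, g1, m)
print(ate)  # 8.5
```

The first term is the model-based effect; the two weighted residual terms correct it wherever the outcome model misses the observed spending.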

In [None]:
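# DML with interacted regression and lasso
ml_g = linear_model.LassoCV(cv=cv)  # outcome model g(coupons, X)
ml_m = linear_model.LogisticRegressionCV(penalty='l1', solver='saga', cv=cv)  # propensity model m(X)
results_dml_int = dml.DoubleMLIRM(dml_data, ml_g, ml_m, n_folds=splits).fit()
print(results_dml_int)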

In [None]:
# Group average treatment effects (GATE) by age range
groups = df[['age_range']].astype('str')
gate_age = results_dml_int.gate(groups=groups)
print(gate_age)
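
A group average treatment effect can be thought of as the average of the per-customer doubly robust scores within each group. The sketch below assumes mutually exclusive groups, in which case projecting the scores on group indicators reduces to group means. The helper `group_average_effects` and the scores are hypothetical, and standard errors are not computed.

```python
from collections import defaultdict


def group_average_effects(scores, groups):
    """Mean score per group label, for mutually exclusive groups."""
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return {g: sum(v) / len(v) for g, v in sorted(by_group.items())}


# Illustrative per-customer scores and age-range labels stored as strings
scores = [9.0, 7.0, 6.0, 12.0]
age_range = ['1', '1', '2', '2']

gates = group_average_effects(scores, age_range)
print(gates)  # {'1': 8.0, '2': 9.0}
```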

## Summary

In [None]:
results = pd.DataFrame(columns=['Estimate', 'SE', 't-stat', 'p-value', 'CI_low', 'CI_high'],
                       index=['OLS', 'OLS_add', 'OLS_int', 'DML_linear', 'DML_lasso', 'DML_rf', 'DML_int'])

for i, res in enumerate([results_ols, results_ols_add, results_ols_int]):
    results.iloc[i, 0] = res.params['coupons']
    results.iloc[i, 1] = res.bse['coupons']
    results.iloc[i, 2] = res.tvalues['coupons']
    results.iloc[i, 3] = res.pvalues['coupons']
    results.iloc[i, 4] = res.conf_int().loc['coupons', 0]
    results.iloc[i, 5] = res.conf_int().loc['coupons', 1]


for i, res in enumerate([results_dml_linear, results_dml_lasso, results_dml_rf, results_dml_int]):
    results.iloc[i+3, 0] = res.coef[0]
    results.iloc[i+3, 1] = res.se[0]
    results.iloc[i+3, 2] = res.t_stat[0]
    results.iloc[i+3, 3] = res.pval[0]
    results.iloc[i+3, 4] = res.confint().iloc[0, 0]
    results.iloc[i+3, 5] = res.confint().iloc[0, 1]

results.astype('float').round(2)