In [3]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from scipy import stats
from matplotlib import style
import seaborn as sns
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf
import graphviz as gr
style.use("fivethirtyeight")

In [4]:
data = pd.read_csv("collections_email.csv")
data.head()

Unnamed: 0,payments,email,opened,agreement,credit_limit,risk_score
0,740,1,1.0,0.0,2348.49526,0.666752
1,580,1,1.0,1.0,334.111969,0.207395
2,600,1,1.0,1.0,1360.660722,0.550479
3,770,0,0.0,0.0,1531.828576,0.560488
4,660,0,0.0,0.0,979.855647,0.45514


`opened` is a dummy variable indicating whether the customer opened the email. `agreement` is another dummy marking whether the customer contacted the collections department to negotiate their debt after having received the email. We start by regressing payments on all of these variables.

In [5]:
email_1 = smf.ols('payments ~ email + credit_limit + risk_score + opened + agreement', data=data).fit()
email_1.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,488.4416,9.716,50.272,0.000,469.394,507.489
email,-1.6095,2.724,-0.591,0.555,-6.949,3.730
credit_limit,0.1507,0.008,18.809,0.000,0.135,0.166
risk_score,-2.0929,38.375,-0.055,0.957,-77.325,73.139
opened,3.9808,3.914,1.017,0.309,-3.692,11.654
agreement,11.7093,4.166,2.811,0.005,3.542,19.876


`opened` and `agreement` are surely correlated with `email`. After all, a customer can't open an email they didn't receive, and `agreement` only counts renegotiations that happened after the email was sent. But they don't cause `email`. Instead, they are caused by it.

In [None]:
g = gr.Digraph()

# Treatment, the two intermediate steps (opened, agreement) and the outcome
g.edge("email", "payments")
g.edge("email", "opened")
g.edge("email", "agreement")
g.edge("opened", "payments")
g.edge("opened", "agreement")
g.edge("agreement", "payments")

# Covariates that affect the outcome and both intermediate steps
g.edge("credit_limit", "payments")
g.edge("credit_limit", "opened")
g.edge("credit_limit", "agreement")
g.edge("risk_score", "payments")
g.edge("risk_score", "opened")
g.edge("risk_score", "agreement")

g

We know that credit limit and risk cause payments, and we also think that email causes payments. As for `opened`, we think it causes payments too: intuitively, people who opened the collection email are more willing to negotiate and pay their debt. We think `opened` causes `agreement` for the same reason. Moreover, we know `opened` is caused by email, and we believe that customers with different risk scores and credit limits open emails at different rates, so credit limit and risk also cause `opened`. Putting these pieces together, we get the following funnel:
$
email \rightarrow opened \rightarrow agreement \rightarrow payments
$

Customers with different risk scores and credit limits also have different propensities to make an agreement. As for email and agreement, some people only read the subject line of the email, and that alone makes them more likely to make an agreement. So email could also cause `agreement` without passing through `opened`.

From the graph we can see that `opened` and `agreement` both lie on the causal path from email to payments. So, if we control for them in the regression, the email coefficient measures the effect of email while holding `opened` and `agreement` fixed. However, both are part of the causal effect of the email, so we don't want to hold them fixed. Email increases payments precisely because it boosts the agreement rate. If we fix those variables, we strip away part of the true effect of the email.
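To make this concrete, here is a minimal simulation sketch. It does not use the collections data: the data-generating process, the effect sizes and the `ols_coefs` helper are all assumptions made up for illustration, and it uses plain NumPy least squares rather than `statsmodels` so that it runs on its own. In this toy world, email affects payments only by raising the agreement rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (not the collections data):
# email is randomized, raises the agreement rate from 20% to 50%,
# and affects payments only through agreement.
email = rng.integers(0, 2, size=n)
agreement = (rng.random(n) < 0.2 + 0.3 * email).astype(float)
payments = 500 + 100 * agreement + rng.normal(0, 50, size=n)


def ols_coefs(y, *regressors):
    """Least-squares coefficients of y on an intercept plus the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Without the mediator: recovers the total effect, about 0.3 * 100 = 30.
total_effect = ols_coefs(payments, email)[1]

# With the mediator held fixed: the email coefficient drops to about 0.
controlled_effect = ols_coefs(payments, email, agreement)[1]

print(f"email coefficient, no mediator:   {total_effect:.2f}")
print(f"email coefficient, with mediator: {controlled_effect:.2f}")
```

Under these assumptions, the regression that controls for `agreement` reports almost no effect of email, even though email raises payments by about 30 on average.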

With potential outcome notation, we can say that, due to randomization $E[Y_0|T=0] = E[Y_0|T=1]$. However, even with randomization, when we control for agreement, treatment and control are no longer comparable. In fact, with some intuitive thinking, we can even guess how they are different:


$
E[Y_0|T=0, Agreement=0] > E[Y_0|T=1, Agreement=0]
$

$
E[Y_0|T=0, Agreement=1] > E[Y_0|T=1, Agreement=1]
$

The first inequality says that those without the email and without the agreement are better than those with the email and without the agreement. That is because, if the treatment has a positive effect, those who didn't make an agreement even after receiving the email are probably worse in terms of payments than those who also didn't make an agreement but never got the extra incentive of the email. As for the second inequality, those who made an agreement even without receiving the treatment are probably better than those who made an agreement but had the extra incentive of the email.
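Both inequalities can be checked in a small simulation. The sketch below is an assumption-laden toy, not the collections data: `willingness` is a made-up unobserved trait that raises both the chance of an agreement and the untreated outcome $Y_0$, and `mean_y0` is a hypothetical helper. Because the data are simulated, we can look at $Y_0$ for everyone, which is impossible with real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical setup: `willingness` is an unobserved trait that raises both
# the chance of an agreement and the untreated outcome Y0.
willingness = rng.normal(size=n)
email = rng.integers(0, 2, size=n)  # randomized treatment T
agreement = willingness + 1.0 * email + rng.normal(size=n) > 0.5
y0 = 500 + 50 * willingness + rng.normal(0, 20, size=n)


def mean_y0(t, a=None):
    """Average Y0 among units with T == t and, optionally, Agreement == a."""
    mask = email == t
    if a is not None:
        mask &= agreement == bool(a)
    return y0[mask].mean()


# Randomization: the two groups are comparable before conditioning.
print(f"E[Y0|T=0] = {mean_y0(0):.1f}   E[Y0|T=1] = {mean_y0(1):.1f}")

# Conditioning on agreement breaks that comparability in both strata.
for a in (0, 1):
    print(f"Agreement={a}: E[Y0|T=0] = {mean_y0(0, a):.1f}   "
          f"E[Y0|T=1] = {mean_y0(1, a):.1f}")
```

In this toy setup the unconditional means are nearly equal, while within each agreement stratum the untreated group has the higher $Y_0$, matching the direction of the two inequalities.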


This sort of bias is called selection bias. Selection bias arises when we control for a common effect or for a variable on the path from cause to effect. As a rule of thumb, always include confounders and variables that are good predictors of $Y$ in your model. Always exclude variables that are good predictors of only $T$, mediators between the treatment and the outcome, and common effects of the treatment and the outcome.
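One reason to include good predictors of $Y$ is precision. The sketch below is a hypothetical simulation, not the collections data: `x` is a made-up covariate that strongly predicts the outcome and is unrelated to the randomized treatment `t`, and `ols` is a small helper that computes classical standard errors with NumPy.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical data: `x` strongly predicts the outcome but is unrelated to
# the randomized treatment `t`, whose true effect is 2.
t = rng.integers(0, 2, size=n)
x = rng.normal(size=n)
y = 10 + 2 * t + 5 * x + rng.normal(size=n)


def ols(y, *regressors):
    """Return OLS coefficients and classical standard errors (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coefs, se


coefs_short, se_short = ols(y, t)
coefs_long, se_long = ols(y, t, x)

print(f"without x: effect = {coefs_short[1]:.2f}, std err = {se_short[1]:.3f}")
print(f"with x:    effect = {coefs_long[1]:.2f}, std err = {se_long[1]:.3f}")
```

Both regressions are unbiased for the treatment effect here, since `t` is randomized, but the one that includes `x` should report a much smaller standard error.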
