# Introduction to Causality
I just want to say that causality is the most misunderstood concept when it comes to data analysis.

# Table of Contents
<ul>
    <li> <a href="#Difference"> Difference between correlation and causality </a> </li>
    <li> <a href="#Problems"> Difficulties in estimating causal relations </a> </li>
    <li> <a href="#Bias"> Bias</a> </li> 
    <li> <a href="#Linear regression"> Linear regression</a> </li> 
</ul>

<a id='Difference'></a>
## Difference between correlation and causality
* **Correlation** is the presence of a relationship between two or more variables.
  * The variables tend to move together.
  * It is very useful in predictive models.
  * The new wave of artificial intelligence does not bring us intelligence itself, but a critical component of intelligence called prediction (recognizing faces, translating from one language to another).
  * However, prediction is not always the solution. For example, in the hotel industry, prices are low outside the tourist season and high when demand peaks and hotels are full. Given that data, a naive prediction might suggest that increasing the price would lead to more rooms sold.
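
The hotel example can be made concrete with a small simulation. This is only a sketch under assumed numbers: the `demand` index, every coefficient, and the true price effect of -0.5 rooms per unit of price are hypothetical values chosen for illustration, not estimates from real hotel data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical seasonal demand index: it drives both the price and the rooms sold.
demand = rng.normal(size=n)
price = 100 + 30 * demand + rng.normal(scale=5, size=n)
# By construction, raising the price by 1 REDUCES rooms sold by 0.5.
rooms = 200 + 40 * demand - 0.5 * price + rng.normal(scale=5, size=n)

# Naive prediction: regress rooms sold on price alone.
naive_slope = np.polyfit(price, rooms, 1)[0]

# Holding demand fixed: regress rooms sold on price and demand together.
X = np.column_stack([np.ones(n), price, demand])
coef = np.linalg.lstsq(X, rooms, rcond=None)[0]
adjusted_slope = coef[1]

print("naive slope:", naive_slope)
print("slope holding demand fixed:", adjusted_slope)
```

In this setup the naive slope comes out positive even though the price effect built into the simulation is negative. Once demand is held fixed, the estimate moves back toward -0.5.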
  
--------------------------------------------------------------------------------------------------

* **Causality** is harder to define. It generally involves a "what if" question.
  * If I teach students with tablets, will they learn more?
  * If I follow a low-fat diet, can I decrease my risk of a heart attack?
  

<a id='Problems'></a>
## Difficulties in estimating causal relations
* **Counterfactuals**
  * We would need a parallel universe under exactly the same conditions, where the only difference is the condition whose effect we are testing.
  * We can never observe the same subject both with and without treatment.
  * This is where well-designed randomized experiments (A/B testing) come in: randomly assigned groups are comparable on average, so the control group can stand in for the missing counterfactual.
* **Heterogeneity**
  * Humans are intrinsically different, conditioned by their society, environment, education and so on.
  * They react differently to different triggers, which makes it difficult to estimate the effect of a cause.
  
* **Confounders**
  * A confounding variable is an “extra” variable that you didn’t account for and that influences both the supposed cause and the outcome.
  * An increase in ice cream sales is correlated with an increase in air conditioner sales.
  * What does this mean? What is the confounding variable here? Any guesses?
  
* **Selection effect**
  * Because of selection bias we may over- or underestimate a causal effect when we just take the difference in average outcomes across treated and control groups.
  * Observed Difference in Means = CausalEffect + SelectionBias
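
To see selection bias and the value of randomization at work, here is a minimal simulation sketch. All numbers are assumptions made up for illustration: a baseline outcome around 600, an average individual effect of 20, and a hypothetical self-selection rule in which units with a higher baseline are more likely to take the treatment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Potential outcomes for every unit (never both observable in real data).
y0 = rng.normal(loc=600, scale=50, size=n)      # outcome without treatment
effect = rng.normal(loc=20, scale=10, size=n)   # heterogeneous individual effects
y1 = y0 + effect                                # outcome with treatment

def diff_in_means(t):
    """Observed difference in average outcomes between treated and control."""
    y = np.where(t == 1, y1, y0)                # only one potential outcome is seen
    return y[t == 1].mean() - y[t == 0].mean()

# Self-selection: a higher baseline makes treatment more likely.
t_selected = (y0 + rng.normal(scale=50, size=n) > 600).astype(int)
# Randomized experiment: a coin flip decides who is treated.
t_random = rng.integers(0, 2, size=n)

true_ate = effect.mean()
diff_selected = diff_in_means(t_selected)
diff_random = diff_in_means(t_random)

print("true average effect:", true_ate)
print("difference in means, self-selected:", diff_selected)
print("difference in means, randomized:", diff_random)
```

With self-selection the observed difference overstates the effect, because the treated group already had a higher baseline. With random assignment the difference in means lands close to the true average effect.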

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

# fix a seed for our random number generator and set the number of observations to simulate
np.random.seed(422)
nobs = 1000
# our third variable (temperature, the confounder) will be standard normal
temp = np.random.randn(nobs,1)
# let's say that temp --> x and temp --> y
# Notice that x and y are not directly related!
x = 0.5 + 0.4*temp + 0.1*np.random.randn(nobs,1)
y = 1.5 + 0.1*temp + 0.01*np.random.randn(nobs,1)

plt.plot(x,y, marker='o', linestyle='')
plt.xlabel('ice cream sales')
plt.ylabel('AC sales')
plt.show()
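
The pattern in the scatter plot can also be quantified. The sketch below re-simulates the same setup with 1-D arrays and a different random generator, so it does not reproduce the exact draws of the cell above. The helper `residualize` is a hypothetical name introduced here: it removes the part of a variable that a straight-line fit on `temp` explains.

```python
import numpy as np

rng = np.random.default_rng(422)
nobs = 1000

# Same structure as above: temp drives both x and y, x and y never touch each other.
temp = rng.standard_normal(nobs)
x = 0.5 + 0.4 * temp + 0.1 * rng.standard_normal(nobs)    # ice cream sales
y = 1.5 + 0.1 * temp + 0.01 * rng.standard_normal(nobs)   # AC sales

# Raw correlation between the two sales series.
raw_corr = np.corrcoef(x, y)[0, 1]

def residualize(v, z):
    """Return what is left of v after subtracting a linear fit on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Correlation once the common driver has been removed from both series.
partial_corr = np.corrcoef(residualize(x, temp), residualize(y, temp))[0, 1]

print("raw correlation:", raw_corr)
print("correlation after removing temp:", partial_corr)
```

The raw correlation is strong, while the correlation between the residuals is close to zero, which is what we expect when the confounder is the only link between the two variables.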

<a id='Bias'></a>
## Concept of Bias

In [None]:
### Let's discuss an ideal situation (causality situation)

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(dict(
    i= [1,2,3,4],
    y0=[500,600,800,700],  # potential school performance without treatment
    y1=[450,600,600,750],  # potential school performance with treatment
    t= [0,0,1,1],
    y= [500,600,600,750],
    te=[-50,0,-200,50],  # effect of treatment on each school
))
df


In [None]:
print("The mean effect on the schools that were treated is {}, which means that the tablets reduce academic performance by 75 points".format(df.iloc[2:].te.mean()))

In [None]:
### Let's discuss a realistic situation (association situation)

In [None]:
df2 = pd.DataFrame(dict(
    i= [1,2,3,4],
    y0=[500,600,np.nan,np.nan],
    y1=[np.nan,np.nan,600,750],
    t= [0,0,1,1],
    y= [500,600,600,750],
    te=[np.nan,np.nan,np.nan,np.nan],
))

df2

In [None]:
# Here we take the mean of the treated and the mean of the untreated, and use their difference to estimate the effect
print("Mean treatment effect is {}, which suggests that the tablets \
increase academic performance by 125 points".format(df2.iloc[2:]['y'].mean() - df2.iloc[0:2]['y'].mean()))



In [None]:
# Compare the results of the two scenarios: why are they so different?

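$Y_{0}$ is the outcome the individual would have had without receiving the treatment.

$Y_{1}$ is the outcome the individual would have had with the treatment.

Below, $E[\cdot | 1]$ and $E[\cdot | 0]$ denote averages over the treated and the untreated groups, respectively.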



$$E[Y | 1] - E[Y | 0] = E[Y_{1} | 1] - E[Y_{0} | 0]$$


Now add and subtract the following term on the right-hand side:

$$E[Y_{0} | 1]$$

This is the counterfactual outcome: it tells us what the outcome of the treated would have been, had they not received the treatment.

$$E[Y | 1] - E[Y | 0] = E[Y_{1} | 1] - E[Y_{0} | 0] + E[Y_{0} | 1] - E[Y_{0} | 1]$$


$$\underbrace{E[Y | 1] - E[Y | 0]}_\text{association} = \underbrace{E[Y_{1} - Y_{0} | 1]}_\text{cause} + \underbrace{E[Y_{0} | 1] - E[Y_{0} | 0]}_\text{Bias}$$

The bias is given by how the treated and control groups differ before the treatment, that is, if neither of them had received the treatment.
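
The decomposition can be checked by hand on the four schools from the ideal table, where both potential outcomes are known. This is a small sketch that re-enters those numbers as plain arrays. The variable names `association`, `att` and `bias` are introduced here to mirror the three terms of the equation.

```python
import numpy as np

# Potential outcomes and treatment assignment of the four schools.
y0 = np.array([500, 600, 800, 700])   # performance without tablets
y1 = np.array([450, 600, 600, 750])   # performance with tablets
t = np.array([0, 0, 1, 1])            # 1 = school received tablets

# Observed outcome: y1 for the treated, y0 for the untreated.
y = np.where(t == 1, y1, y0)

association = y[t == 1].mean() - y[t == 0].mean()   # E[Y|1] - E[Y|0]
att = (y1 - y0)[t == 1].mean()                      # E[Y1 - Y0 | 1]
bias = y0[t == 1].mean() - y0[t == 0].mean()        # E[Y0|1] - E[Y0|0]

print(association, att, bias)   # 125.0 -75.0 200.0
```

The treated schools would have scored 200 points higher than the control schools even without tablets. That baseline gap turns a true effect of -75 on the treated into an observed difference of +125.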






<a id='Linear regression'></a>
## Linear regression

In [None]:
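# Regress the observed outcome y on the treatment indicator t with a degree-1 polynomial fit
x = np.array(df2.t.to_list())
y = np.array(df2.y.to_list())
slope, intercept = np.polyfit(x,y,1)

fig, axes = plt.subplots(figsize=(6,6))
# Draw the fitted line between t=0 and t=1
plt.plot(x, x*slope+intercept, label='Regression')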

plt.legend()

plt.xlabel('X [arbitrary units]')
plt.ylabel('Y [arbitrary units]')
plt.xlim(0,1)
plt.ylim(400,700)

Title = "Intercept = {:.1f}, slope = {:.1f}".format(intercept,slope)
plt.title(Title, fontsize=15, fontweight='bold')
plt.show()
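
When the only regressor is a 0/1 treatment indicator, the fitted line has a simple reading: the intercept is the mean outcome of the untreated group and the slope is the difference in means between the groups. The sketch below checks this on the four observed outcomes, using `np.linalg.lstsq` rather than `np.polyfit` purely as an alternative way to run the same least-squares fit.

```python
import numpy as np

t = np.array([0.0, 0.0, 1.0, 1.0])           # treatment indicator
y = np.array([500.0, 600.0, 600.0, 750.0])   # observed outcomes

# Design matrix with a column of ones (intercept) and the treatment indicator.
X = np.column_stack([np.ones_like(t), t])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
intercept, slope = coef[0], coef[1]

control_mean = y[t == 0].mean()
diff_in_means = y[t == 1].mean() - y[t == 0].mean()

print("intercept:", intercept, "control mean:", control_mean)
print("slope:", slope, "difference in means:", diff_in_means)
```

The regression slope therefore reproduces the association of 125 points, so it inherits the same bias as the simple comparison of group means.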