# Working with models

# Pitfalls of statistics

## Spurious correlation

<img width="90%" src="img/spurious.png">

**Reference** http://tylervigen.com/spurious-correlations

## Spurious correlation

<img width="90%" src="img/spurious2.png">

**Reference** http://tylervigen.com/spurious-correlations

## Data dredging, aka p-hacking

<br>
<img width="80%" src="img/p_hacking.png">

**Reference** https://projects.fivethirtyeight.com/p-hacking/

## Data dredging, aka p-hacking

<br>
<img width="65%" src="img/republicans.png">

**Reference** https://projects.fivethirtyeight.com/p-hacking/

## Data dredging, aka p-hacking

<br>
<img width="65%" src="img/democrats.png">

**Reference** https://projects.fivethirtyeight.com/p-hacking/

## Data dredging, aka p-hacking

Remedies 

- randomized out-of-sample tests.
- statistical tests designed for confirmatory analysis, not exploratory analysis!
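
A minimal sketch of the out-of-sample remedy, assuming pure-noise data and a hypothetical helper `abs_corr` (neither comes from the referenced article). Dredging 1000 noise features for the one that best "explains" the target finds a seemingly strong correlation, which typically vanishes on the held-out half.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 1000

# Pure noise: no feature has any real relationship with the target.
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Split once, at random, before looking at the data.
idx = rng.permutation(n_samples)
train, test = idx[:50], idx[50:]


def abs_corr(features, target):
    """Absolute Pearson correlation of each column of `features` with `target`."""
    f = features - features.mean(axis=0)
    t = target - target.mean()
    return np.abs(f.T @ t) / (np.linalg.norm(f, axis=0) * np.linalg.norm(t))


# Dredging: keep the feature that looks best on the exploratory half.
in_sample = abs_corr(X[train], y[train])
best = int(np.argmax(in_sample))

# Confirmation: evaluate that single, pre-selected feature on the other half.
out_of_sample = abs_corr(X[test], y[test])[best]

print(f"best feature, in-sample |r|     = {in_sample[best]:.2f}")
print(f"same feature, out-of-sample |r| = {out_of_sample:.2f}")
```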

## Correlation does not imply causation

For a statistical association $X \not\perp Y$, the underlying causal mechanism can be:
- $X \rightarrow Y$
- $X \leftarrow Y$
- $X \leftarrow Z \rightarrow Y$

# Causal inference

## Confounders 

- $X \leftarrow Z \rightarrow Y$ implies $X\not\perp Y$ and $X \perp Y \mid Z$
- statistical association created by a **common cause**
- Sleeping with one's shoes on is strongly correlated with waking up with a headache.
- As ice cream sales increase, the rate of drowning deaths increases sharply.
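
A small simulation of the common-cause pattern, assuming a linear Gaussian model with made-up coefficients. `residual` is a hypothetical helper that removes the linear effect of $Z$; correlating the residuals is a simple stand-in for testing $X \perp Y \mid Z$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

z = rng.normal(size=n)              # common cause Z
x = 2.0 * z + rng.normal(size=n)    # X <- Z
y = -1.5 * z + rng.normal(size=n)   # Z -> Y (X has no effect on Y)


def residual(a, b):
    """Part of `a` that a straight-line fit on `b` does not explain."""
    slope, intercept = np.polyfit(b, a, deg=1)
    return a - (slope * b + intercept)


# Marginal association: X and Y are strongly correlated.
marginal = np.corrcoef(x, y)[0, 1]

# Conditional association: once Z is accounted for, nothing is left.
partial = np.corrcoef(residual(x, z), residual(y, z))[0, 1]

print(f"corr(X, Y)     = {marginal:.3f}")
print(f"corr(X, Y | Z) = {partial:.3f}")
```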

## Simpson's paradox

Kidney stone treatment: is A (surgery) more effective than B (puncture)?

<img width="50%" src="img/simpson_paradox.png" style="display:inline"> <img width="40%" src="img/stone_treatment_dag.png" style="display:inline" >
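
The reversal can be checked numerically. The sketch below uses the success counts commonly quoted for this kidney stone study (Charig et al., 1986); they are assumed here and may differ from the figure. It also assumes stone size is the common cause of treatment choice and outcome, and adjusts for it by averaging the per-size success rates over the overall size distribution.

```python
# (successes, patients) per treatment and stone size. Assumed figures.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}
sizes = ("small", "large")

# Success rate within each stone-size group.
by_size = {
    t: {s: counts[t][s][0] / counts[t][s][1] for s in sizes} for t in counts
}

# Success rate when the groups are pooled.
overall = {
    t: sum(counts[t][s][0] for s in sizes) / sum(counts[t][s][1] for s in sizes)
    for t in counts
}

# Distribution of stone size over all patients.
total = sum(counts[t][s][1] for t in counts for s in sizes)
p_size = {s: sum(counts[t][s][1] for t in counts) / total for s in sizes}

# Adjusted rate: average the per-size rates with the same weights for A and B.
adjusted = {t: sum(by_size[t][s] * p_size[s] for s in sizes) for t in counts}

for t in counts:
    print(t, {s: round(by_size[t][s], 3) for s in sizes},
          "overall", round(overall[t], 3), "adjusted", round(adjusted[t], 3))
```

With these counts, A has the higher success rate for both small and large stones, B looks better in the pooled data, and the adjusted comparison favours A again.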

## Explaining away

- $X \rightarrow Z \leftarrow Y$ implies $X\perp Y$ and $X \not\perp Y \mid Z$
- statistical association created by **conditioning on a common effect**


## Berkson's paradox

Hospital in-patient population
- negative correlation between flu and diabetes

General population
- actually no association between flu and diabetes

## Berkson's paradox

Example of selection bias (here hospital patients):
- flu $\rightarrow$ hospital $\leftarrow$ diabetes
- flu $\perp$ diabetes but flu $\not\perp$ diabetes $\mid$ hospital
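
A sketch of this selection effect, assuming made-up prevalences and a hypothetical admission model in which either condition raises the chance of being hospitalized. Flu and diabetes are drawn independently, yet they become negatively correlated once only in-patients are kept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Independent conditions in the general population (assumed 10% each).
flu = rng.random(n) < 0.10
diabetes = rng.random(n) < 0.10

# Hypothetical admission model: flu -> hospital <- diabetes.
p_admit = 0.01 + 0.30 * flu + 0.30 * diabetes
hospital = rng.random(n) < p_admit


def corr(a, b):
    """Pearson correlation between two boolean arrays."""
    return np.corrcoef(a.astype(float), b.astype(float))[0, 1]


general = corr(flu, diabetes)
inpatients = corr(flu[hospital], diabetes[hospital])

print(f"general population: corr = {general:.3f}")
print(f"hospital patients:  corr = {inpatients:.3f}")
```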

## Graphical causal model

<img width="30%" src="img/original_dag.png" style="align:center"> 

**Causal graph** $P(ZYXCW) = P(Z|XY)P(Y|XC)P(X|WC)P(W)P(C)$  

## Intervention


<img width="30%" src="img/do_X_dag.png" style="align:center"> 

**Manipulated graph** $P(ZYCW|do(X)) = P(Z|XY)P(Y|XC)P(W)P(C)$

## Predicting the results of actions

Manipulated do(X) graph

$$P(ZYCW | do(X)) = P(Z|XY)P(Y|XC)P(W)P(C)$$


Causal effect of $X$ on $Y$  

$$P(Y\mid do(X)) = \sum_{C}  P(Y|XC) P(C)$$ 
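
The causal effect follows from the manipulated factorization by summing out $Z$, $C$ and $W$. This is a sketch of the intermediate steps, using only the factors above:

$$\begin{aligned}
P(Y\mid do(X)) &= \sum_{Z,C,W} P(Z|XY)P(Y|XC)P(W)P(C) \\
&= \sum_{C} P(Y|XC)P(C) \Big(\sum_{W} P(W)\Big) \Big(\sum_{Z} P(Z|XY)\Big) \\
&= \sum_{C} P(Y|XC)P(C)
\end{aligned}$$

The last step holds because $\sum_{W} P(W) = 1$ and $\sum_{Z} P(Z|XY) = 1$.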

## Where does machine learning fit in?

- unsupervised learning: estimate $P(C)$
- supervised learning: estimate $P(Y\mid XC)$

With causal calculus: estimate $P(Y\mid do(X)) = \sum_{C}  P(Y|XC) P(C)$

## Passive observation vs active intervention

<img width="80%" src="img/proba_vs_causal.png" style="align:center"> 

In general $P(Y\mid X) \neq P(Y\mid do(X))$
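
A simulation sketch of the gap, assuming a reduced graph with only $C \rightarrow X$, $C \rightarrow Y$ and $X \rightarrow Y$ and made-up probabilities. `sample` is a hypothetical helper: called without arguments it observes the system passively, and with `do_x` it forces $X$ regardless of $C$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000


def sample(do_x=None):
    """Draw (C, X, Y); if `do_x` is given, X is set by intervention."""
    c = rng.random(n) < 0.5
    if do_x is None:
        x = rng.random(n) < np.where(c, 0.8, 0.2)   # X depends on C
    else:
        x = np.full(n, bool(do_x))                  # arrow C -> X is cut
    y = rng.random(n) < 0.2 + 0.1 * x + 0.5 * c     # Y depends on X and C
    return c, x, y


# Passive observation: condition on X = 1.
c, x, y = sample()
conditional = y[x].mean()

# Adjustment formula on the same observational data: sum_C P(Y|X,C) P(C).
adjusted = sum(y[x & (c == v)].mean() * (c == v).mean() for v in (False, True))

# Active intervention: set X = 1 for everyone and look at Y.
_, _, y_do = sample(do_x=1)
interventional = y_do.mean()

print(f"P(Y=1 | X=1)              ~ {conditional:.3f}")
print(f"sum_C P(Y=1|X=1,C) P(C)   ~ {adjusted:.3f}")
print(f"P(Y=1 | do(X=1))          ~ {interventional:.3f}")
```

Under these assumed probabilities the exact values are $P(Y=1\mid X=1) = 0.70$ and $P(Y=1\mid do(X=1)) = 0.55$; the adjusted estimate should recover the latter from observational data alone.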

## Infer causal structure from data

<img width="70%" src="img/CI_MAP.png" style="align:center"> 

## Causal discovery

PC algorithm (Peter Spirtes, Clark Glymour)
- list all conditional independencies in data
- find all consistent graphical models
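
A heavily simplified sketch in the spirit of these two steps, not the full PC algorithm. It assumes linear Gaussian data generated from a hypothetical collider $X \rightarrow Z \leftarrow Y$, replaces a proper statistical test with a crude threshold on partial correlation, and searches all conditioning sets rather than only neighbours.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical ground truth: X -> Z <- Y.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + 0.5 * rng.normal(size=n)
data = {"X": x, "Y": y, "Z": z}
names = list(data)


def partial_corr(a, b, given):
    """Correlation of a and b after linearly regressing out `given`."""
    design = np.column_stack([np.ones(n)] + [data[g] for g in given])
    res_a = data[a] - design @ np.linalg.lstsq(design, data[a], rcond=None)[0]
    res_b = data[b] - design @ np.linalg.lstsq(design, data[b], rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]


def find_sepset(a, b, threshold=0.05):
    """Smallest conditioning set that makes a and b look independent, or None."""
    others = [v for v in names if v not in (a, b)]
    for size in range(len(others) + 1):
        for given in itertools.combinations(others, size):
            if abs(partial_corr(a, b, given)) < threshold:
                return set(given)
    return None


# Step 1: start from the complete graph, drop edges between separable pairs.
edges, sepsets = set(), {}
for a, b in itertools.combinations(names, 2):
    sep = find_sepset(a, b)
    if sep is None:
        edges.add(frozenset((a, b)))
    else:
        sepsets[frozenset((a, b))] = sep

# Step 2: orient a - m - b as a collider when m is not in the separating set.
colliders = []
for pair, sep in sepsets.items():
    a, b = sorted(pair)
    for m in names:
        linked = frozenset((a, m)) in edges and frozenset((b, m)) in edges
        if m not in pair and linked and m not in sep:
            colliders.append((a, m, b))

print("edges:", [tuple(sorted(e)) for e in edges])
print("colliders:", colliders)
```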

# Pitfalls of ML

## Model bias / overfitting

- Reduce bias 
    - more features
    - increase model complexity, reduce regularization 
- Reduce variance 
    - more data
    - decrease model complexity, increase regularization 
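
A sketch of the tradeoff, assuming a toy regression problem (noisy $\sin(3x)$, 15 training points) and polynomial degree as the measure of model complexity. Training error can only go down as the degree grows, while the test error of the most flexible fit typically goes back up.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples of an assumed ground-truth function sin(3x)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=n)
    return x, y


x_train, y_train = make_data(15)
x_test, y_test = make_data(1000)


def mse(coeffs, x, y):
    """Mean squared error of the polynomial `coeffs` on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


errors = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in errors.items():
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```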

## Validation

- **golden rule**: never use the test set during training
- it is often tricky to build good test and validation sets, e.g. for time series
- validation is harder for unsupervised models
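
For time series, one common approach is to keep every validation block strictly after the data used for training. A minimal sketch, where `expanding_window_splits` is a hypothetical helper written for illustration (scikit-learn's `TimeSeriesSplit` provides a similar scheme):

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train, test) index lists; each test block follows its training block."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train, test in splits:
    print("train:", train, "test:", test)
```

A random shuffle would instead mix future observations into the training set, which makes the validation score optimistic.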

## Target leakage

If your model is too good, you should be suspicious!

- information about the target leaked into the features
- present in the training data but not available at test time

**Example** fraud detection, feature = last service that emailed the client  

## Covariate shift

- predict $y$ from $x$ 
- but $P_{test}(x)\neq P_{train}(x)$

**Example** image classification, ImageNet differs from today's images (Instagram filters, selfies...)

## Covariate shift

Remedies
- reweight samples according to test/train likelihood ratio
- can approximate this ratio by Bayes rule + test/train classifier
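
A sketch of both remedies on a toy problem, assuming one-dimensional Gaussian inputs and a hand-written logistic regression as the test/train classifier (this is an illustration, not the estimators from the references below). With equal sample sizes, Bayes rule gives $P_{test}(x)/P_{train}(x) = p(x)/(1-p(x))$, where $p(x)$ is the classifier's probability that $x$ comes from the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Assumed shift: training inputs ~ N(0, 1), test inputs ~ N(1, 1).
x_train = rng.normal(0.0, 1.0, size=n)
x_test = rng.normal(1.0, 1.0, size=n)

# Test/train classifier: logistic regression on x, label 1 = "from test set".
x_all = np.concatenate([x_train, x_test])
is_test = np.concatenate([np.zeros(n), np.ones(n)])
design = np.column_stack([x_all, np.ones_like(x_all)])

theta = np.zeros(2)
for _ in range(25):  # Newton iterations on the logistic log-likelihood
    p = 1.0 / (1.0 + np.exp(-design @ theta))
    gradient = design.T @ (p - is_test)
    hessian = design.T @ (design * (p * (1 - p))[:, None])
    theta -= np.linalg.solve(hessian, gradient)

# Likelihood ratio for training points: p / (1 - p) = exp(theta . [x, 1]).
weights = np.exp(theta[0] * x_train + theta[1])

# Estimate a test-set quantity, here E[x^2], from training samples only.
naive = np.mean(x_train ** 2)
reweighted = np.sum(weights * x_train ** 2) / np.sum(weights)
target = np.mean(x_test ** 2)

print(f"naive = {naive:.2f}, reweighted = {reweighted:.2f}, test = {target:.2f}")
```

The same weights can be passed as sample weights when fitting the actual predictive model, so that training errors are counted in proportion to how likely each point is under the test distribution.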

**References** [Sugiyama et al, 2007](http://www.jmlr.org/papers/volume8/sugiyama07a/sugiyama07a.pdf) [Bickel et al, 2009](http://www.jmlr.org/papers/volume10/bickel09a/bickel09a.pdf)

## Covariate shift

<img width="45%" src="img/covariate_shift.png" style="display:inline"> <img width="45%" src="img/covariate_shift2.png" style="display:inline">

**Reference** [Sugiyama et al, 2007](http://www.jmlr.org/papers/volume8/sugiyama07a/sugiyama07a.pdf)

## Adversarial examples

<img width="80%" src="img/adversarial.png">

**Reference** [Goodfellow et al, 2015](https://arxiv.org/abs/1412.6572)

## Importance of interpretation

- predict pneumonia risk from patient attributes
- should asthmatic patients really be considered at lower risk?
<img width="100%" src="img/asthma.png">

**Reference** [Caruana et al, 2015](http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf)

## Interpretation / performance tradeoff

<br>
<img width="50%" src="img/interpretation.png">

## Understanding ConvNets

Retrieve images that maximally activate a neuron [Girshick et al, 2014](https://arxiv.org/abs/1311.2524)
<img width="80%" src="img/understand_convnet.png">

## Explanations using a local linear approximation

<img width="50%" src="img/lime.png">

**Reference** [Ribeiro et al, 2016](https://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf)

## References

Causal inference
- [Advanced data analysis from an elementary point of view](https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/), C. Shalizi

Interpretation of models
- http://cs231n.github.io/understanding-cnn/
- https://blog.datadive.net/interpreting-random-forests/