## Introduction
This notebook is a demonstration of, and an introduction to, propensity score matching. It uses the Kaggle Titanic dataset (https://www.kaggle.com/c/titanic). The main goal is to estimate the effect of a treatment (here, the passenger having a cabin) on passenger survival.

The dataset helps illustrate how we could assess the impact of a treatment in cases where we cannot run an RCT (randomised controlled trial) on the subjects.

## Key points
Before applying PSM (propensity score matching), keep the following key points in mind:
- Matching is used to create an artificial control group, which is then used to estimate the impact of the treatment.
- Dimensions:
    - X are the underlying characteristics/features available.
    - T is the treatment; can be either 1 or 0. In this notebook the presence of a cabin is considered as T=1 (i.e. the passenger got treated).
    - Y is the outcome variable i.e. survived or not.
- The propensity score is the estimated probability that a subject/passenger is treated given certain observable characteristics X. In probability notation this is P(T=1|X). It compresses the many dimensions of X into a single number, which mitigates the curse of dimensionality when matching, at the cost of some loss of information.
- The propensity score is calculated (usually) by logistic regression having T (treatment) as the outcome variable.
- There is a cost in not doing a proper RCT (randomised controlled trial). The treatment groups might not fully overlap (lack of common support), or some characteristics X (e.g. age, fare) might not be equally balanced across the treatment groups.

Key assumptions:
- Unconfoundedness assumption: Selection into treatment (or not) should be based solely on observable characteristics (i.e. X), so that there is no selection bias from unobserved characteristics. The validity of this assumption cannot be proven from the data.
- Common support: Observations with similar characteristics X are present in both the treatment and control groups.
- Conditional independence assumption: Once we have controlled for the observable characteristics X, there are no unobserved differences correlated with the potential outcomes.
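
The common support assumption above can be inspected roughly once propensity scores are available. The following is a minimal sketch, assuming made-up scores drawn from beta distributions; the names `ps_treated` and `ps_control` are hypothetical and do not come from the Titanic data.

```python
import numpy as np

# Hypothetical propensity scores for illustration only (not from the Titanic data).
rng = np.random.default_rng(0)
ps_treated = rng.beta(4, 2, size=200)   # treated units skew towards high scores
ps_control = rng.beta(2, 4, size=300)   # control units skew towards low scores

# Common support region: the range of scores covered by both groups.
low = max(ps_treated.min(), ps_control.min())
high = min(ps_treated.max(), ps_control.max())

# Flag units whose score falls inside the overlapping range.
treated_in_support = (ps_treated >= low) & (ps_treated <= high)
control_in_support = (ps_control >= low) & (ps_control <= high)

print(f"Common support: [{low:.3f}, {high:.3f}]")
print(f"Treated inside support: {treated_in_support.mean():.1%}")
print(f"Control inside support: {control_in_support.mean():.1%}")
```

A min/max range is only a coarse check; plotting the two score distributions side by side usually gives a clearer picture of where overlap is thin.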


## Approach
1. Estimate the propensity score, i.e. the probability (from a logistic regression) that an observation is treated. Then convert it to its logit value.
2. Perform matching. For each treated sample, identify an untreated sample with a similar logit propensity score. The matching is 1-to-1 with replacement: when there are not enough untreated samples, the same one can be re-used. The matching takes place using the treated samples as the source.
3. Once matching is performed, review the balance of the X variables between the matched groups.
4. Estimate the impact of treatment.

## Data Preparation

In [11]:
from sklearn.linear_model import LogisticRegression as lr

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn import metrics

In [12]:
# Enabled to remove warnings for demo purposes.
import warnings
warnings.filterwarnings('ignore')

In [13]:
#from functions import *
import math
import numpy as np
import pandas as pd
# import scipy.stats as stats
import statsmodels.api as sm

import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline

import seaborn as sns
sns.set(rc={'figure.figsize':(16,10)}, font_scale=1.3)

In [14]:
df = pd.read_csv('train.csv')
# Rows with missing Age or Embarked values are dropped for simplicity.
df = df[~df.Age.isna()]
df = df[~df.Embarked.isna()]
df = df.reset_index()
y = df[['Survived']]
df = df.drop(columns = ['Survived'])

Create an artificial treatment indicator. It is based on the condition that a passenger has a cabin (1) or not (0). The 'hasCabin' function is defined in the following cell.

In [15]:
def hasCabin(x):
    if pd.isna(x):
        return 0
    else:
        return 1

df['treatment'] = df.Cabin.apply(hasCabin)

There is a high correlation between the treatment (i.e. hasCabin) and passenger class (Pclass).
This is desirable in this case, as class plays the role of the systematic factor affecting treatment assignment.
In a different context, the treatment could be a landing page on a site that only specific visitors see.

In [16]:
df_data = df[['treatment','Sex','Age','SibSp','Parch','Embarked', 'Pclass', 'Fare']]

## Here you will implement the logistic regression to predict the probability of receiving the treatment. P(T=1|X).

Remember to deal with the categorical variables (i.e. Sex, Embarked, Pclass)
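
One common way to handle the categorical variables is one-hot encoding. The sketch below assumes `pd.get_dummies` is an acceptable choice and uses a small made-up frame (`toy`) with the same column names as `df_data`; the values are invented for illustration.

```python
import pandas as pd

# Toy rows using the same column names as df_data; the values are made up.
toy = pd.DataFrame({
    "treatment": [1, 0, 0, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", "C", "S", "Q"],
    "Pclass": [3, 1, 3, 1],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# One-hot encode the categorical columns. drop_first=True removes one level per
# variable to avoid perfectly collinear dummy columns.
encoded = pd.get_dummies(
    toy, columns=["Sex", "Embarked", "Pclass"], drop_first=True, dtype=float
)

X = encoded.drop(columns=["treatment"])  # features
T = encoded["treatment"]                 # treatment flag used as the target
print(X.columns.tolist())
```

Pclass is numeric in the raw data, so it has to be listed explicitly in `columns` to be treated as categorical.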

Report accuracy, F1-score and the confusion matrix of your logistic regression model.
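
A minimal sketch of the model and the requested metrics is shown here on synthetic data, so that it runs on its own. The arrays `X` and `T` are stand-ins for the encoded Titanic features and the treatment flag, and the coefficients used to generate them are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Synthetic stand-in for the encoded features X and the treatment flag T.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
true_logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
T = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Scale the features, then fit a logistic regression with T as the target.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
model.fit(X, T)

T_pred = model.predict(X)
propensity = model.predict_proba(X)[:, 1]  # estimated P(T=1|X)

# In-sample metrics: the model is evaluated on the data it was fitted on.
print("Accuracy:", metrics.accuracy_score(T, T_pred))
print("F1-score:", metrics.f1_score(T, T_pred))
print("Confusion matrix:\n", metrics.confusion_matrix(T, T_pred))
```

The metrics here are computed in-sample, since the fitted probabilities are needed for every observation that will later be matched.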

Convert the probability to its logit (based on the suggestion at https://youtu.be/gaUgW7NWai8?t=981)
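
The logit of a probability p is log(p / (1 - p)). The helper below is a small sketch; the function name `logit` and the clipping threshold `eps` are assumptions added so that probabilities of exactly 0 or 1 do not produce infinities.

```python
import numpy as np

def logit(p, eps=1e-6):
    """Map probabilities to log-odds, clipping to avoid infinities at 0 or 1."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

propensity = np.array([0.05, 0.5, 0.9])  # illustrative values
print(logit(propensity))
```

Matching on the logit rather than the raw probability spreads out scores that are close to 0 or 1, where small differences in probability correspond to large differences in odds.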

## Matching Implementation
Use Nearest Neighbors to identify matching candidates. Then perform 1-to-1 matching by isolating/identifying groups of (T=1,T=0).
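
One possible way to implement 1-to-1 matching with replacement is to fit `NearestNeighbors` on the control group only and query it with the treated group. The sketch below uses randomly generated logit scores; the frame `data` and the column name `ps_logit` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical logit propensity scores; in the notebook these would come from the model.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "treatment": rng.binomial(1, 0.4, size=200),
    "ps_logit": rng.normal(size=200),
})

treated = data[data.treatment == 1]
control = data[data.treatment == 0]

# Fit the index on control units only, so every neighbour returned is untreated.
nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[["ps_logit"]].to_numpy())

# For each treated unit, find the closest control unit on the logit score.
# Each query is independent, so a control can be picked more than once
# (matching with replacement).
distances, positions = nn.kneighbors(treated[["ps_logit"]].to_numpy())

matches = pd.DataFrame({
    "treated_idx": treated.index.to_numpy(),
    "control_idx": control.index.to_numpy()[positions.ravel()],
    "distance": distances.ravel(),
})
print(matches.head())
```

Because the treated units are the source of the queries, every treated unit receives exactly one match, while some control units may remain unused.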


## Matching Review

## Here you will evaluate the quality of your matching.

Plot the feature distributions before and after performing the matching.

PS: Ideally you also need to run a statistical test to show that there is no difference in the feature distributions between the control and treatment groups.
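
Two numeric balance checks that are often used alongside the plots are the standardized mean difference (SMD) and a two-sample Kolmogorov-Smirnov test. The sketch below is an illustration under assumptions: the Age values are simulated, and `standardized_mean_diff` is a hypothetical helper rather than a library function.

```python
import numpy as np
import pandas as pd
from scipy import stats

def standardized_mean_diff(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

# Made-up Age values: the unmatched control group is younger, the matched one is not.
rng = np.random.default_rng(7)
age_treated = pd.Series(rng.normal(38, 12, size=150))
age_control_before = pd.Series(rng.normal(28, 12, size=400))
age_control_after = pd.Series(rng.normal(38, 12, size=150))

for label, ctrl in [("before", age_control_before), ("after", age_control_after)]:
    smd = standardized_mean_diff(age_treated, ctrl)
    ks_stat, p_value = stats.ks_2samp(age_treated, ctrl)
    print(f"{label}: SMD={smd:.3f}, KS p-value={p_value:.3f}")
```

A frequently quoted rule of thumb treats an absolute SMD below roughly 0.1 as adequately balanced; this threshold is a convention, not a guarantee.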

## Here you will compute the Average Treatment Effect on the Treated (ATT): the expected impact of the treatment on the outcome variable compared to the counterfactual outcome.

ATT = mean outcome of the treated - mean outcome of the matched controls

or

ATT = E[Y(1) − Y(0) | T = 1]
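
With matched pairs in hand, the ATT estimate reduces to a difference of means. The sketch below assumes a hypothetical frame `pairs` that holds the outcome (Survived) of each treated passenger next to that of its matched control; the values are invented.

```python
import pandas as pd

# Hypothetical matched pairs: each row holds the outcome (Survived) of a treated
# passenger and of its matched control. Values are made up.
pairs = pd.DataFrame({
    "y_treated": [1, 1, 0, 1, 0, 1],
    "y_control": [0, 1, 0, 0, 0, 1],
})

# ATT estimate: mean outcome of the treated minus mean outcome of their matches.
att = pairs.y_treated.mean() - pairs.y_control.mean()

# Equivalent view: the average of the pair-wise differences.
att_pairwise = (pairs.y_treated - pairs.y_control).mean()

print(f"Estimated ATT: {att:.3f}")
```

The matched control outcomes stand in for the unobservable counterfactual Y(0) of the treated passengers, which is why the estimate is only as credible as the matching and the assumptions behind it.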

Reference: https://www.youtube.com/watch?v=CEikQRj5n_A&t=940s&ab_channel=PEP