# Matching

Let's take a look at a simple technique known as matching, which we can use to estimate certain causal effects.

Model: a very simple toy example about vitamin C.

There are two types of people: those who are likely to get colds and those who are not. In this simplistic universe, susceptibility is a binary feature.

People can also choose to take vitamin C or not.

We assume that vitamin C reduces the chance of getting a cold.

However, those who are likely to get a cold are also more likely to take the vitamin.

So measuring the correlation between taking the vitamin and having a cold is likely to give a misleading result.

Let's first create some fake data for this.

Let's model the following scenario: our population is everyone who was not taking vitamin C in 2019. Some of them had colds that year.

In 2020, some of them decide to start taking vitamin C; some now have colds, some do not.

We know whether people started taking the vitamin, whether they had a cold in 2019, and whether they have a cold now in 2020.

There are two types of people. Those who are susceptible to colds get a cold in the winter with probability p_s. Those who are relatively immune (although not completely) get colds with probability p_i.

Taking vitamin C reduces your chance of getting a cold by a fraction e (the effect), so your subsequent chance of getting a cold is p_s(1-e) or p_i(1-e).

We don't know which group everyone is in, but we do know whether they previously had a cold.

Furthermore, susceptible people are more likely to take the vitamin. They take it with probability t_s, and the immune people take it with probability t_i.

The goal is to estimate e from a given set of data.
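Before simulating anything, the size of the confounding problem can be worked out by hand. The sketch below computes the cold rate among pill takers and non-takers that the model implies. It assumes the illustrative parameter values used in the simulation further down and an even split between susceptible and immune people; the name `share_s` is introduced here purely for illustration.

```python
# Illustrative parameter values (assumed; the same ones the simulation uses).
p_s, p_i = 0.6, 0.3   # cold probability: susceptible / immune
t_s, t_i = 0.7, 0.2   # probability of taking the vitamin: susceptible / immune
e = 0.1               # relative reduction in cold risk from the vitamin
share_s = 0.5         # assumed fraction of susceptible people

# Probability of taking the pill at all.
took = share_s * t_s + (1 - share_s) * t_i

# Joint probabilities of (took pill, has cold) and (no pill, has cold).
cold_and_took = (share_s * t_s * p_s + (1 - share_s) * t_i * p_i) * (1 - e)
cold_and_not = share_s * (1 - t_s) * p_s + (1 - share_s) * (1 - t_i) * p_i

# Cold rates conditional on the treatment.
p_cold_given_took = cold_and_took / took
p_cold_given_not = cold_and_not / (1 - took)

print(f"P(cold | took pill) = {p_cold_given_took:.3f}")
print(f"P(cold | no pill)   = {p_cold_given_not:.3f}")
```

With these values the pill takers have the higher cold rate (0.48 versus roughly 0.38), even though the pill lowers everyone's individual risk by 10%.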

In [8]:
from random import random
import pandas as pd

In [71]:
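p_s = 0.6  # probability that a susceptible person gets a cold
p_i = 0.3  # probability that a (relatively) immune person gets a cold
t_s = 0.7  # probability that a susceptible person takes the vitamin
t_i = 0.2  # probability that an immune person takes the vitamin
e = 0.1    # relative reduction in cold risk from taking the vitamin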

population_size = 10000

# Randomly split people into the two groups (1 = susceptible, 0 = immune).
susceptible = [int(random() < 0.5) for _ in range(population_size)]

# Now we decide whether these people had colds in 2019
had_cold = [int(random() < p_s) if s == 1 else int(random() < p_i) for s in susceptible]

# Now we decide if they took the pills, similarly to above.
took_pill = [int(random() < t_s) if s == 1 else int(random() < t_i) for s in susceptible]

# Determine their probability of getting a cold, recall that this depends on whether they
# are susceptible, and if they take the pill
p_cold = [p_s*(1-e*tp) if s == 1 else p_i*(1-e*tp) for s, tp in zip(susceptible, took_pill)]

# Finally, decide whether they got a cold or not ...
has_cold = [int(random() < p) for p in p_cold]

# Collect the data into a dataframe
df = pd.DataFrame(
    {
        'susceptible': susceptible,
        'had_cold': had_cold,
        'took_pill': took_pill,
        'p_cold': p_cold,
        'has_cold': has_cold
    }
)

In [72]:
df.head()

index,susceptible,had_cold,took_pill,p_cold,has_cold
0,1,0,1,0.54,1
1,1,0,0,0.6,1
2,1,0,1,0.54,1
3,1,1,0,0.6,1
4,0,0,0,0.3,0


This is essentially a sanity check: we should get back (roughly) the numbers we put in above.

In [73]:
df.groupby('susceptible').agg({'had_cold': 'mean',
                               'took_pill': 'mean'
                              })

susceptible,had_cold,took_pill
0,0.304922,0.20068
1,0.59936,0.69932


And a second sanity check: there are four types of people, each with a different probability of getting a cold. What are these probabilities? Within each row, the relative drop from the took_pill = 0 column to the took_pill = 1 column is essentially our answer, but remember that in reality the susceptible variable would be hidden from us.

In [74]:
df.groupby(['susceptible', 'took_pill']).p_cold.mean().unstack(-1)

susceptible,took_pill=0,took_pill=1
0,0.3,0.27
1,0.6,0.54


Now calculate the effect the naive way: simply assume that there are no confounders and that correlation is causation.

In [75]:
naive = df.groupby('has_cold').took_pill.mean()
naive[1] - naive[0]

0.09569462353385699

Those who have a cold are about 9.6 percentage points more likely to have taken the pill than those who do not! So you shouldn't take the pill, right?
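The comparison can also be run in the other direction, conditioning on the treatment rather than on the outcome. The following is a minimal sketch, not one of the notebook's own cells: `simulate` is a hypothetical helper that mirrors the data generation above with NumPy (with a fixed seed and a larger population to reduce noise), and `naive_e` is simply one minus the ratio of the two observed cold rates.

```python
import numpy as np
import pandas as pd

def simulate(n=100_000, seed=0, p_s=0.6, p_i=0.3, t_s=0.7, t_i=0.2, e=0.1):
    """Hypothetical helper: regenerate the toy data with a fixed seed."""
    rng = np.random.default_rng(seed)
    s = rng.random(n) < 0.5                                  # susceptible or not
    had_cold = rng.random(n) < np.where(s, p_s, p_i)         # cold in 2019
    took_pill = rng.random(n) < np.where(s, t_s, t_i)        # started the vitamin
    has_cold = rng.random(n) < np.where(s, p_s, p_i) * (1 - e * took_pill)
    return pd.DataFrame({
        "susceptible": s.astype(int),
        "had_cold": had_cold.astype(int),
        "took_pill": took_pill.astype(int),
        "has_cold": has_cold.astype(int),
    })

df = simulate()

# Observed cold rate among non-takers (0) and takers (1), ignoring confounders.
risk = df.groupby("took_pill")["has_cold"].mean()

naive_diff = risk[1] - risk[0]      # naive risk difference
naive_e = 1 - risk[1] / risk[0]     # naive estimate of the relative effect e

print(f"naive risk difference: {naive_diff:.3f}")
print(f"naive estimate of e:   {naive_e:.3f}")
```

A negative `naive_e` would suggest that the pill makes colds more likely, which is the opposite of what the model assumes.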

### Simple Matching

We can use simple matching to try to estimate the real effect of taking the pill.

Our treatment is the vitamin C pill, and our observed confounder is whether they had a cold before, which stands in for the hidden susceptibility.

So for everyone who takes the pill, find someone who does not but who is the same in the confounding variable, in this case whether they had a cold last year.
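One possible way to write this out explicitly is sketched below. Both `simulate` and `match_att` are hypothetical helpers, not part of the notebook: `simulate` regenerates the toy data with NumPy, and `match_att` pairs each pill taker with a randomly drawn non-taker (with replacement) who has the same covariate value, then averages the outcome differences. The result is a risk difference for those who took the pill. Because `had_cold` is only a noisy proxy for `susceptible`, the estimate need not equal the true effect; matching on `susceptible`, which is only possible in a simulation, is included as a reference.

```python
import numpy as np
import pandas as pd

def simulate(n=100_000, seed=0, p_s=0.6, p_i=0.3, t_s=0.7, t_i=0.2, e=0.1):
    """Hypothetical helper: regenerate the toy data with a fixed seed."""
    rng = np.random.default_rng(seed)
    s = rng.random(n) < 0.5
    had_cold = rng.random(n) < np.where(s, p_s, p_i)
    took_pill = rng.random(n) < np.where(s, t_s, t_i)
    has_cold = rng.random(n) < np.where(s, p_s, p_i) * (1 - e * took_pill)
    return pd.DataFrame({
        "susceptible": s.astype(int),
        "had_cold": had_cold.astype(int),
        "took_pill": took_pill.astype(int),
        "has_cold": has_cold.astype(int),
    })

def match_att(df, covariate="had_cold", seed=0):
    """Exact matching on one covariate; returns the mean outcome difference
    between pill takers and their matched non-takers."""
    rng = np.random.default_rng(seed)
    treated = df[df["took_pill"] == 1]
    controls = df[df["took_pill"] == 0]
    diffs = []
    for value, group in treated.groupby(covariate):
        # Outcomes of all non-takers with the same covariate value.
        # Assumes this pool is non-empty, which holds for the toy data.
        pool = controls.loc[controls[covariate] == value, "has_cold"].to_numpy()
        # One randomly chosen match per pill taker, drawn with replacement.
        matched = rng.choice(pool, size=len(group), replace=True)
        diffs.append(group["has_cold"].to_numpy() - matched)
    return float(np.concatenate(diffs).mean())

df = simulate()
att_proxy = match_att(df, "had_cold")       # what we can do with observed data
att_oracle = match_att(df, "susceptible")   # reference using the hidden variable

print(f"matched on had_cold:    {att_proxy:+.3f}")
print(f"matched on susceptible: {att_oracle:+.3f}")
```

Comparing the two numbers gives a feel for how much confounding is left over when only the proxy is available.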

### GroupBy Method

As this is such a simple example, you should be able to find the answer with groupbys alone, rather than explicitly matching every individual.
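A minimal sketch of the groupby approach follows, under the same assumptions as the rest of the notebook. `simulate` and `stratified_effect` are hypothetical helpers: the latter computes the difference in cold rates between takers and non-takers within each covariate stratum and averages those differences, weighted by the size of each stratum. As with explicit matching, stratifying on `had_cold` only adjusts for a proxy of the hidden `susceptible` variable, so the version that uses `susceptible` directly is shown for comparison.

```python
import numpy as np
import pandas as pd

def simulate(n=100_000, seed=0, p_s=0.6, p_i=0.3, t_s=0.7, t_i=0.2, e=0.1):
    """Hypothetical helper: regenerate the toy data with a fixed seed."""
    rng = np.random.default_rng(seed)
    s = rng.random(n) < 0.5
    had_cold = rng.random(n) < np.where(s, p_s, p_i)
    took_pill = rng.random(n) < np.where(s, t_s, t_i)
    has_cold = rng.random(n) < np.where(s, p_s, p_i) * (1 - e * took_pill)
    return pd.DataFrame({
        "susceptible": s.astype(int),
        "had_cold": had_cold.astype(int),
        "took_pill": took_pill.astype(int),
        "has_cold": has_cold.astype(int),
    })

def stratified_effect(df, covariate="had_cold"):
    """Stratum-size-weighted average of the within-stratum risk differences."""
    # Rows: covariate value; columns: took_pill (0 or 1); cells: cold rate.
    rates = df.groupby([covariate, "took_pill"])["has_cold"].mean().unstack("took_pill")
    diff = rates[1] - rates[0]                             # per-stratum difference
    weights = df[covariate].value_counts(normalize=True)   # share of each stratum
    return float((diff * weights).sum())                   # aligned on stratum label

df = simulate()
effect_proxy = stratified_effect(df, "had_cold")
effect_oracle = stratified_effect(df, "susceptible")

print(f"stratified on had_cold:    {effect_proxy:+.3f}")
print(f"stratified on susceptible: {effect_oracle:+.3f}")
```

Weighting by stratum size targets the whole population; weighting by the number of pill takers in each stratum instead would correspond more closely to the one-match-per-taker procedure described above.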

### Regression Method

This is conceptually similar: what happens if you try a (logistic?) regression in this case? Do you get the same answer?
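As a rough sketch of the regression route, the block below fits a logistic regression of `has_cold` on `took_pill` plus one adjustment covariate. To keep dependencies minimal it uses a small hand-rolled Newton solver (`fit_logistic`, a hypothetical helper) rather than a statistics library; `simulate` is likewise a hypothetical stand-in for the notebook's data generation. Note that the coefficient on `took_pill` is a log-odds ratio, so it is not directly comparable to e or to a difference in cold rates; its sign is the main thing to look at.

```python
import numpy as np
import pandas as pd

def simulate(n=100_000, seed=0, p_s=0.6, p_i=0.3, t_s=0.7, t_i=0.2, e=0.1):
    """Hypothetical helper: regenerate the toy data with a fixed seed."""
    rng = np.random.default_rng(seed)
    s = rng.random(n) < 0.5
    had_cold = rng.random(n) < np.where(s, p_s, p_i)
    took_pill = rng.random(n) < np.where(s, t_s, t_i)
    has_cold = rng.random(n) < np.where(s, p_s, p_i) * (1 - e * took_pill)
    return pd.DataFrame({
        "susceptible": s.astype(int),
        "had_cold": had_cold.astype(int),
        "took_pill": took_pill.astype(int),
        "has_cold": has_cold.astype(int),
    })

def fit_logistic(X, y, n_iter=25):
    """Logistic regression via Newton's method; returns [intercept, *coefficients]."""
    X = np.column_stack([np.ones(len(X)), X])    # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # predicted probabilities
        grad = X.T @ (y - p)                     # gradient of the log-likelihood
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)      # Newton update
    return beta

df = simulate()
y = df["has_cold"].to_numpy(dtype=float)

# Adjust for the observed proxy, then (for reference) for the hidden variable.
coef_proxy = fit_logistic(df[["took_pill", "had_cold"]].to_numpy(dtype=float), y)
coef_oracle = fit_logistic(df[["took_pill", "susceptible"]].to_numpy(dtype=float), y)

print(f"took_pill log-odds, adjusting for had_cold:    {coef_proxy[1]:+.3f}")
print(f"took_pill log-odds, adjusting for susceptible: {coef_oracle[1]:+.3f}")
```

In a real analysis a library such as statsmodels or scikit-learn would be the more usual choice, and would also report standard errors.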

### Using the module

Python already has a module for this (causalinference); see if your hand-made answer is the same.

In [76]:
from causalinference import CausalModel
import numpy as np

In [77]:
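cm = CausalModel(
    Y=np.array(has_cold),   # outcome: has a cold in 2020
    D=np.array(took_pill),  # treatment: took the vitamin C pill
    X=np.array(had_cold),   # covariate to match on: had a cold in 2019
)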

In [None]:
cm.est_via_matching()
cm.estimates