# Week 8 Overview
Matching is a powerful alternative to regression for estimating treatment effects when linear assumptions may not hold or when the relationship between variables is complex. This content introduces the concept of matching as a way to “close back doors” by pairing treated and untreated units with similar covariates, enabling more accurate estimation of causal effects. You’ll explore different matching techniques, including distance-based matching, propensity score matching, inverse probability weighting, and kernel-based weighting. The content also covers how to match using one or multiple variables, how to assess balance and common support, and how to handle practical challenges like bias-variance tradeoffs and the curse of dimensionality. By the end, you’ll understand how matching can be used on its own or combined with regression to support valid causal inference. 

## Learning Objectives 
At the end of this week, you will be able to: 
- Explain matching and its application to the average treatment effect (ATE), the average treatment effect on the treated (ATT), and the average treatment effect on the untreated (ATUT)
- Perform inverse probability weighting and propensity score matching  
- Explain matching variants such as the Mahalanobis distance and the Epanechnikov kernel 

## Topic Overview: Matching Fundamentals and Strategies
This section introduces matching as a flexible alternative to regression for closing back doors when the relationship between variables is complex or nonlinear. 

Instead of modeling the functional form directly, **matching** pairs treated and untreated observations with similar values of a confounding variable, $Z$. 

This allows estimation of treatment effects, like ATT, even without knowing the true form of the relationship between treatment and outcome. 

The section also covers **weighted matching** as a way to account for distributional imbalances and introduces key concepts like distance matching and propensity score matching, which help formalize similarity between observations using one or more covariates. 

### Learning Objectives 
- Explain matching and its application to ATE, ATT, and ATUT 
- Perform inverse probability weighting and propensity score matching 

### 1.1 Lesson: Another Way to Close Back Doors
What if the relationship between treatment and outcome isn’t something we can easily model with a straight line? In this video, we’ll explore how **matching** can help us estimate treatment effects—even when the relationship between variables is complex, messy, or nonlinear. 

#### Matching
With regression, we close back doors by finding a linear relationship between the outcome and the treatment, as well as with any confounders. But what if there isn’t a linear relationship, and we’re not sure how to model the situation? 

For example, what if:

$$ Y = 2X + \frac{X^2}{Z} + \varepsilon(X, Z) $$

Then, we *could* model this linearly by including the term $\frac{X^2}{Z}$ as an additional covariate, but what if we don’t think to do this? It’s a pretty weird relationship, and we’re unlikely to think of it unless our domain knowledge hints at it.

One solution would be matching: 

1. For each item with $X = 1$, we note its $Z$ value.
2. Then we find another item with approximately the same $Z$ value, and match them. 

Suppose the treated ($X = 1$) items have $Z = 1, 2$, while the untreated ($X = 0$) items have $Z = 1, 2, 3$. We throw out the untreated $Z = 3$ item (it doesn't match any treated item), and we get:

$Y(X = 0, Z = 1) = 0$

$Y(X = 0, Z = 2) = 0$

$Y(X = 1, Z = 1) = 3$

$Y(X = 1, Z = 2) = 2.5$

In this scenario, the estimated effect is the difference between the average $X = 1$ and $X = 0$ values, or

$$ \frac{3 + 2.5}{2} - \frac{0 + 0}{2} \; = \; 2.75 $$

This is the average treatment effect on the treated, since we are including counterfactuals (matches) for all treated items, but we are not including counterfactuals for all untreated items ($Z = 3$ has no counterfactual). Matching works even though we do not know the true relationship between $X$, $Y$, and $Z$. All we need to know is that $Z$ is the only confounder. 
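
The matching steps above can be sketched in a few lines of pandas. This is a minimal illustration rather than a general implementation: it uses the noise-free $Y$ values listed above, takes the reading in which the unmatched $Z = 3$ item is untreated (with $Y = 0$, an assumption), and the column names and the helper `att_exact_match` are hypothetical.

```python
import pandas as pd

# Toy data from the example above (noise term ignored).
# Assumption: the unmatched Z = 3 item is untreated and has Y = 0.
toy = pd.DataFrame({
    "X": [0, 0, 0, 1, 1],
    "Z": [1, 2, 3, 1, 2],
    "Y": [0.0, 0.0, 0.0, 3.0, 2.5],
})

def att_exact_match(data: pd.DataFrame) -> float:
    """ATT from exact matching on Z; treated rows without a match are dropped."""
    treated = data[data["X"] == 1]
    untreated = data[data["X"] == 0]
    # Mean untreated outcome at each Z value serves as the counterfactual.
    counterfactual = untreated.groupby("Z")["Y"].mean().rename("Y0")
    # The inner join keeps only treated rows whose Z also appears among the untreated.
    matched = treated.join(counterfactual, on="Z", how="inner")
    return float((matched["Y"] - matched["Y0"]).mean())

att = att_exact_match(toy)
print("Estimated ATT:", att)
```

Note that the function never uses the formula for $Y$; it only needs the matching variable.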

### 1.2 Lesson: Weighted Averages
We can also match by **weighting**.

Suppose the untreated group has $Z = 1, 2, 3$ while the treated group has $Z = 1, 2$. Then we could weight the untreated values according to how close they are to the treated values:
- For the untreated items, maybe the $Z = 1$ item would have a weight of $0.4$,
- $Z = 2$ has $0.4$, and
- $Z = 3$ has $0.2$ because $Z = 3$ is farther away from the treated $Z$ values.

The exact weight is somewhat arbitrary. Of the untreated items, $Z = 3$ should get a lower weight than $Z = 2$ because it’s farther from the treated items’ $Z$ values, but what weight, exactly, should it get? There are many ways of calculating this, but in the end, it’s up to you. There is no one right answer.

- For the treated items, $Z = 1$ and $Z = 2$ both get an equal weight of $0.5$.
- Then, to compute the effect, we weight each $Y$ value by the weight associated with its $Z$ value.

Another example could be if $Z$ is categorical: 

There are men and women. 
- If there are more men in the untreated group than in the treated group, we should downweight the untreated men (say, before computing the mean of $Y$ or another statistic) according to the ratio of the counts.
- If there are also more women in the untreated group, we would downweight the untreated women, too, but likely by a different amount.
- If the treated group has 8 men and 2 women, while the untreated group has 4 men and 6 women, then we could compute the ATT by weighting the untreated group so that men get a total weight of 0.8 and women 0.2.
    - This would involve weighting untreated men at $0.8 / 4 = 0.2$ each and untreated women at $0.2 / 6 \approx 0.0333$ each.
    - The treated group would weight everyone at 0.1.
    - In this way, we effectively have one counterfactual for each man and woman in the treated group.

Taken together, these weights would allow us to compute the mean or other statistic of the untreated group as if it had the same distribution of men and women as the treated group. 
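
The weights in the men/women example can be reproduced with a short calculation. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
import pandas as pd

# Counts from the example above: 8 men / 2 women treated, 4 men / 6 women untreated.
treated_counts = pd.Series({"men": 8, "women": 2})
untreated_counts = pd.Series({"men": 4, "women": 6})

# Target: the untreated group should mirror the treated shares (0.8 men, 0.2 women).
treated_share = treated_counts / treated_counts.sum()

# Per-person untreated weight = target share / number of untreated people of that type.
untreated_weight = treated_share / untreated_counts

# Every treated person gets the same weight, so the treated weights also sum to 1.
treated_weight = 1 / treated_counts.sum()

print(untreated_weight)
print("Treated weight per person:", treated_weight)
```

Because both sets of weights sum to 1, a weighted sum of $Y$ in either group is directly a weighted mean.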

#### A Single Matching Variable
As noted above, there’s a significant problem with trying to match the “nearest” $Z$ value between treated and untreated groups: 
- We need to know which $Z$ value is closest (and there may be multiple ways of measuring that). 
- When we are weighting, we need to know what weights to assign based on how close the $Z$ value is to other $Z$ values.

The simplest approach is **distance matching:**
- Distance matching matches each treated item to the closest untreated item(s), according to the proximity of the matching variables (i.e., confounders like $Z$).
- If there is only one matching variable, this is easy; but if there are several, we have to define a distance between samples across multiple covariates ($Z_1, Z_2, Z_3,$ etc.).
- One way would be to standardize them, then take the Euclidean distance, like $\sqrt{(Z_{1,T} - Z_{1,U})^2 + (Z_{2,T} - Z_{2,U})^2 + (Z_{3,T} - Z_{3,U})^2}$.
    - Standardization is important because if the differences for $Z_1$ are on the order of 100, but for $Z_2$ they are on the order of 1, then the $Z_1$ differences will completely outweigh the $Z_2$ differences.
    - This is especially true with Euclidean distances: $\sqrt{100^2 + 1^2} \approx 100.005$, so the $1$ hardly matters because of the squaring.
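
To see how much standardization can matter, here is a small sketch with made-up numbers (the data and variable names are hypothetical). It compares which untreated candidate is nearest to a treated item before and after standardizing two covariates on very different scales.

```python
import numpy as np

# Hypothetical covariates: Z1 varies on the order of 100, Z2 on the order of 1.
Z = np.array([
    [100.0, 1.0],   # row 0: treated item
    [110.0, 5.0],   # row 1: untreated candidate A (close in Z1, far in Z2)
    [140.0, 1.2],   # row 2: untreated candidate B (far in Z1, close in Z2)
    [300.0, 3.0],   # rows 3-4: other observations, included so the
    [250.0, 4.0],   #           standard deviations are meaningful
])

def euclidean(a, b):
    """Euclidean distance between two covariate vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances are dominated by Z1.
raw_A = euclidean(Z[0], Z[1])
raw_B = euclidean(Z[0], Z[2])

# Standardize each column to mean 0 and standard deviation 1, then measure again.
Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0)
std_A = euclidean(Z_std[0], Z_std[1])
std_B = euclidean(Z_std[0], Z_std[2])

print("Raw:          A =", round(raw_A, 3), " B =", round(raw_B, 3))
print("Standardized: A =", round(std_A, 3), " B =", round(std_B, 3))
```

With these numbers the nearest candidate changes once both covariates are on a comparable scale, which is the point of standardizing first.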

Perhaps the most common approach is **propensity score matching:**
- In propensity score matching, observations are similar if they were equally likely to be treated.
- For example:
    - If the matching variable is income, and the likelihood of treatment is normally distributed with mean \$50,000, then the values \$40,000 and \$60,000 are “similar” because they are equally likely to be treated.
    - So, we can match a treated observation with income \$40,000 to an untreated observation with \$40,000 or with \$60,000. This works to close the back door, although this may not be obvious.
    
To see why, suppose we have:

```python
import pandas as pd
import numpy as np

# Example data: income is the matching variable
data = {
    'Group': ['Treated', 'Treated', 'Treated', 'Treated', 'Untreated', 'Untreated', 'Untreated', 'Untreated', 'Untreated', 'Untreated'],
    'Income': [40000, 50000, 50000, 60000, 40000, 40000, 50000, 50000, 60000, 60000],
    'Target value': [12, 18, 18, 0, 12, 12, 18, 18, 0, 0]
}
df = pd.DataFrame(data)
df.head()
```

|   | Group | Income | Target value |
|---|---|---|---|
| 0 | Treated | 40000 | 12 |
| 1 | Treated | 50000 | 18 |
| 2 | Treated | 50000 | 18 |
| 3 | Treated | 60000 | 0 |
| 4 | Untreated | 40000 | 12 |

Then, taking the overall means results in the treated target value mean=12 and untreated=10:

```python
df.groupby('Group').mean()
```

| Group | Income | Target value |
|---|---|---|
| Treated | 50000.0 | 12.0 |
| Untreated | 50000.0 | 10.0 |


But the income is a *confounder*; the two groups are actually the same when controlling for income. (This must be the case: the incomes’ targets match exactly across both groups.) 

If we match incomes using distance matching, with distance = 0, the matched untreated group gets target values 12, 18, 18, 0, which gives the same target mean as the treated group.

This would involve the treated item with income \$40,000 matching to the average of the two untreated items with income \$40,000 (all three of these have target = 12).

```python
# Create a new dataframe with a numeric treatment indicator
df_numeric = df.copy()
df_numeric['Treatment'] = (df_numeric['Group'] == 'Treated').astype(int)

# Separate the data into treated and untreated groups
treated = df_numeric[df_numeric['Treatment'] == 1]
untreated = df_numeric[df_numeric['Treatment'] == 0]

# Collect one averaged untreated match per treated observation.
# (Building the dataframe once from a list avoids concatenating onto an
# empty dataframe, which triggers a pandas FutureWarning.)
matched_treated = treated.copy()
matched_rows = []

# For each treated observation, find matching untreated observations
for _, treated_row in treated.iterrows():
    # Find untreated observations with the same income (distance = 0)
    matches = untreated[untreated['Income'] == treated_row['Income']]

    # If we found matches, average them into a single counterfactual row
    if not matches.empty:
        matched_rows.append({
            'Group': 'Untreated (Matched)',
            'Income': matches['Income'].mean(),
            'Target value': matches['Target value'].mean(),
            'Treatment': 0,
        })

matched_untreated = pd.DataFrame(matched_rows, columns=df_numeric.columns)

# Combine the matched treated and untreated data
matched_data = pd.concat([matched_treated, matched_untreated], ignore_index=True)

# Display results
print("Original data means by treatment status:")
print(df.groupby('Group')['Target value'].mean())

print("\nMatched data means:")
print("Treated mean:", matched_treated['Target value'].mean())
print("Untreated (matched) mean:", matched_untreated['Target value'].mean())
print("Treatment effect:", matched_treated['Target value'].mean() - matched_untreated['Target value'].mean())
```

```
Original data means by treatment status:
Group
Treated      12.0
Untreated    10.0
Name: Target value, dtype: float64

Matched data means:
Treated mean: 12.0
Untreated (matched) mean: 12.0
Treatment effect: 0.0
```




But what about propensity score matching? 

This would mean that we can randomly cross-match untreated-\$40,000 and untreated-\$60,000 to treated-\$40,000 and treated-\$60,000 if we like, because they have the same propensity to be treated (one-third of each are treated, compared with one-half for \$50,000).

This will produce the same average target value in the matched untreated group as distance matching does (in this case, 12).

The propensity score approach involves treated-\$40,000 matching to untreated-\$40,000 or untreated-\$60,000, and treated-\$60,000 matching to untreated-\$40,000 or untreated-\$60,000. In other words, the two untreated values are equally likely to be chosen.

The distance approach involves treated-\$40,000 matching to untreated-\$40,000 and treated-\$60,000 matching to untreated-\$60,000. So again, the two untreated values are equally likely to be chosen (they are chosen once each).

Then, having a 50% likelihood of picking \$40,000 or \$60,000 for the match gives the same average target value as always picking exactly one \$40,000 and one \$60,000.

An alternative perspective is that the confounder (income) influences the propensity score (which is 33% for the high and low incomes and 50% for the medium one), which in turn influences the treatment. If we think of this in terms of directed acyclic graphs, we’ll see that the path through income toward the outcome ($Y$) must pass through the propensity score. Therefore, controlling for the propensity score suffices to control for this path.
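
The equivalence can be checked numerically. The sketch below rebuilds the ten observations (with the target column renamed `Y` and a 0/1 `Treated` column, both naming choices of this sketch), computes the propensity score by counting, and compares matching on income with matching on the score. Instead of drawing random matches, it averages over all untreated items with the same key, which is the expected value of a random match.

```python
import pandas as pd

df_ps = pd.DataFrame({
    "Treated": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Income": [40000, 50000, 50000, 60000, 40000, 40000, 50000, 50000, 60000, 60000],
    "Y": [12, 18, 18, 0, 12, 12, 18, 18, 0, 0],
})

# Propensity score by counting: share treated at each income level.
df_ps["pscore"] = df_ps.groupby("Income")["Treated"].transform("mean")

treated_ps = df_ps[df_ps["Treated"] == 1]
untreated_ps = df_ps[df_ps["Treated"] == 0]

def matched_untreated_mean(key: str) -> float:
    """Average, over treated rows, of the mean untreated Y that shares the same `key`."""
    counterfactual = untreated_ps.groupby(key)["Y"].mean()
    return float(treated_ps[key].map(counterfactual).mean())

by_income = matched_untreated_mean("Income")   # distance matching with distance = 0
by_pscore = matched_untreated_mean("pscore")   # propensity score matching

print("Matched untreated mean, by income:", by_income)
print("Matched untreated mean, by propensity score:", by_pscore)
```

Individual counterfactuals differ between the two approaches, but the matched untreated mean is the same.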

## Topic 2: Practical Considerations and Advanced Techniques
This section explains how to calculate propensity scores and use them effectively in matching and weighting strategies for causal inference. 
- You'll learn how logistic regression or binning can estimate treatment probabilities and explore practical decisions like how many matches to select, whether to match with or without replacement, and how to assign weights based on distance using kernels like Epanechnikov. 
- The section also introduces inverse probability weighting (IPW), considerations for caliper selection, matching in high-dimensional spaces, and advanced techniques like Mahalanobis distance, coarsened exact matching, and entropy balancing. 
- Finally, you'll see how to validate your propensity score models through stratification tests and how machine learning can enhance score estimation. 

### Learning Objectives 
- Perform inverse probability weighting and propensity score matching  
- Explain matching variants such as the Mahalanobis distance and the Epanechnikov kernel.

### 2.1 Lesson: Propensity Scores and Matching Strategies
When people receive a treatment, it’s often not random — it’s influenced by their characteristics. In this video, you’ll learn how propensity scores help us account for those differences and how we can use them — along with a method called inverse probability weighting — to estimate causal effects more accurately.

#### How to Calculate Propensity Scores
With one covariate (such as income in the example above), we’d often just match on the value of the covariate itself. But if we wanted to use a propensity score, we could compute it in one of two ways:
- Perform logistic regression to model the probability of treatment. Above, we'd want to fit a function that is quadratic in income, where:
$$\text{Prob} (40000) = \frac{1}{3}, \quad \text{Prob} (50000) = \frac{1}{2}, \quad \text{Prob} (60000) = \frac{1}{3}$$
    - It has to be quadratic, including a $Z^2$ term, because we want it to be small for both high and low incomes and larger for the middle incomes.
- Bin the data (in this case, perhaps into bins of 10,000) and compute the probability of treatment in each bin.
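
Both approaches can be tried on the ten observations from the income example. The sketch below is illustrative only: to stay self-contained it fits the logistic regression with a hand-rolled Newton-Raphson loop in NumPy (in practice you would more likely use a statistics library), and it rescales income before squaring it, which is a numerical convenience and not part of the method.

```python
import numpy as np

income = np.array([40000, 50000, 50000, 60000,
                   40000, 40000, 50000, 50000, 60000, 60000], dtype=float)
treated = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)

# Rescale income so the quadratic term is numerically well behaved.
z = (income - 50000) / 10000
X = np.column_stack([np.ones_like(z), z, z ** 2])

# Logistic regression: logit P(treated) = b0 + b1*z + b2*z^2, fit by Newton-Raphson.
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    gradient = X.T @ (treated - p)
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    beta = beta + np.linalg.solve(hessian, gradient)

p_logit = 1 / (1 + np.exp(-X @ beta))

# Binning: share treated within each income level.
p_binned = {level: treated[income == level].mean() for level in np.unique(income)}

for level in np.unique(income):
    print(int(level), "logit:", round(float(p_logit[income == level][0]), 4),
          "binned:", round(float(p_binned[level]), 4))
```

With three income levels and three coefficients, the quadratic model is flexible enough to reproduce the binned proportions, so the two estimates should agree here. With more income levels they generally would not.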

#### Selecting Matches or Constructing a Matched Weighted Sample?  
Selecting matches means selecting one or more untreated observations for each treated observation (or vice-versa). 

In a matched weighted sample, each untreated observation is assigned a numerical “weight” that determines how important it is. 

The “selecting matches” effect can vary more with small changes to the dataset as the matched observations can be “in” or “out” according to small differences in $Z$ values. 

Suppose a treated item has $Z = 0.55$, and the two nearest untreated items have $Z = 1$ and $Z = 0$. 

This treated item is closer to $Z = 1$ than to $Z = 0$. But a small change to $Z = 0.45$ can make it closer to $Z = 0$.

#### If We’re Selecting Matches, How Many? 

Suppose we compute the ATT so that we find untreated matches (counterfactuals) for each treated item. There might be multiple good matches for each treated item. What do we do in that case? There are several options. 

1. Pick the best match. 
2. Pick the top k best matches (k-nearest-neighbor matching). We can then average their $Y$ values in order to find the effect. 
3. Pick all the matches within a given distance (radius matching). We can find all untreated $Z$ values within a radius of a given treated $Z$ value. Again, we average their $Y$ values. Radius matching can assign a varying number of matches to each treated item, so we’d want to weight these matches so that the effective number of matches to each treated item is 1. If one treated item has two matches and another has three, we’d weight the former by 0.5 each and the latter by 0.333 each. 
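
Options 2 and 3 can be sketched as follows. The data, the choice of $k = 2$, and the radius of 0.25 are all hypothetical, and both helpers match with replacement. Taking the plain mean over a treated item's matches is the same as weighting each of its matches by one over the number of matches.

```python
import numpy as np

# Hypothetical data: matching variable Z and outcome Y for each group.
z_treated = np.array([1.0, 2.0])
y_treated = np.array([5.0, 7.0])
z_untreated = np.array([0.9, 1.1, 1.8, 2.1, 2.2, 4.0])
y_untreated = np.array([3.0, 4.0, 5.0, 6.0, 7.0, 9.0])

def knn_counterfactual(z, k):
    """Mean Y of the k untreated items closest to z."""
    nearest = np.argsort(np.abs(z_untreated - z))[:k]
    return y_untreated[nearest].mean()

def radius_counterfactual(z, radius):
    """Mean Y of all untreated items within `radius` of z; NaN if none qualify."""
    inside = np.abs(z_untreated - z) <= radius
    return y_untreated[inside].mean() if inside.any() else np.nan

# ATT: average gap between each treated Y and its matched counterfactual.
att_knn = np.mean([y - knn_counterfactual(z, k=2)
                   for z, y in zip(z_treated, y_treated)])
att_radius = np.nanmean([y - radius_counterfactual(z, radius=0.25)
                         for z, y in zip(z_treated, y_treated)])

print("ATT (2 nearest neighbors):", att_knn)
print("ATT (radius 0.25):", att_radius)
```

In the radius version, the first treated item finds two matches and the second finds three, so the matches are effectively weighted 0.5 and 0.333 each.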

We can also match “with replacement,” meaning that a given untreated observation can match to multiple treated observations, or “without replacement,” meaning that each untreated observation can only match to one treated observation. 

One suggestion: If your untreated group contains many more observations than your treated group, it will make more sense to assign multiple untreated observations to each treated one. If your untreated group contains about the same number of observations as your treated group, you might not have enough observations to do that. 

#### If We’re Constructing a Matched Weighted Sample, How Will Weights Decay with Distance?
In an example earlier, I said that the farther-away $Z = 3$ value would get a weight of $0.2$, while the closer ones got a weight of $0.4$ each. That was just a random guess — is there any more systematic way to assign weights based on distance?

One approach is called the **Epanechnikov kernel**, which assigns a weight given the distance $x$:

$K (x) \; = \; \frac{3}{4} (1 - x^2)$ for $|x| \le 1$, and $K(x) = 0$ otherwise.

In this case, since $Z = 3$ was a distance of 1 away from $Z = 2$, we'd have:

$K(3 - 2) \; = \; \frac{3}{4} (1 - (3-2)^2)$

$ = \; \frac{3}{4} (1 - 1) \; = \; 0$ (So the $Z = 3$ value would be too far away to be included). 

If it were $Z = 2.5$ instead, we'd have $K(2.5 - 2) \; = \; \frac{3}{4} (1 - (0.5)^2) \; = \; 0.5625$ as its weight. We might want to rescale the kernel if our typical distances are too large or too small for this kernel to be meaningful.

The Epanechnikov kernel is similar to doing radius matching with a radius of 1, except that the weight falls continuously to zero as the distance approaches this radius rather than taking a discrete value of either 0 or 1.
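
A small helper makes the kernel concrete. The `bandwidth` argument is an assumption of this sketch: it is one simple way to do the rescaling mentioned above, by dividing the distance by a chosen bandwidth before applying the kernel.

```python
import numpy as np

def epanechnikov(distance, bandwidth=1.0):
    """Epanechnikov weight 3/4 * (1 - u^2) for |u| <= 1, else 0, with u = distance / bandwidth."""
    u = np.asarray(distance, dtype=float) / bandwidth
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

w_far = float(epanechnikov(3 - 2))                      # Z = 3 versus Z = 2
w_near = float(epanechnikov(2.5 - 2))                   # Z = 2.5 versus Z = 2
w_rescaled = float(epanechnikov(3 - 2, bandwidth=2.0))  # same distance, wider kernel

print(w_far, w_near, w_rescaled)
```

Doubling the bandwidth lets the $Z = 3$ item back in with a positive weight, which shows how sensitive the weights are to this choice.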

### 2.2 Lesson: Advanced Matching and Weighting Techniques, Part I

#### Inverse Probability Weighting (IPW)
This approach is used with propensity scores.
- The weight for a given observation is equal to one divided by the probability of the treatment value it actually received.
- For a treated observation, we use the probability that it would be treated.
- For an untreated observation, we use the probability that it would be untreated.
- This probability could be calculated using either (a) logistic regression based on one or more $Z$ values or (b), if $Z$ is categorical, simply counting up the share treated for the given $Z$.

**“Treatment value”** means **“treated”** or **“untreated”**:
- In other words, if items with income 40,000 have a propensity score of 1/3 (one-third of them are treated, and two-thirds are untreated),
- then they are weighted by $\frac{1}{1/3} = 3$ in the treatment group and $\frac{1}{2/3} = 1.5$ in the control group.
- The result is that the total weight in the treated group and in the untreated group is, on average, the same: each equals the number of items $N$, so together they total two times the number of items.
- That's because for (on average) $Np$ items in the treated group and $N(1 - p)$ items in the untreated group, the total weight is $Np \cdot \frac{1}{p} = N$ in the treated group and $N(1 - p) \cdot \frac{1}{1 - p} = N$ in the untreated group.

In effect, if we have $N$ items of a certain type (around income \$40,000, or having a similar propensity score to that), then inverse probability weighting assigns weights so that the total weight of these items is $2N$, i.e., the average weight of each item is 2.

What’s meant here is not that the items’ weights must total exactly $2N$ but that, on average, the total will be $2N$ if:
- (1) a number of samples were taken from the population and
- (2) the model for computing the propensity is well-specified, with the correct formula.



This means that we are computing the average treatment effect (each item counts for 2, whether it is treated or untreated) instead of the average treatment effect on the treated (each treated item counts for a fixed amount, and untreated items count according to their treated match).

Let’s consider the example above. If items with income 50,000 have a propensity score of 50%, they are weighted by $\frac{1}{0.5} = 2$ in the treatment group and $\frac{1}{1 - 0.5} = 2$ in the control group. In the above example, applying IPW to each sample, this results in a $Y$ value of 

$$\frac {12 \times 3 + 18 \times 2 + 18 \times 2 + 0 \times 3}{10} = 10.8$$

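That is the weighted $Y$ value for the treated group. The untreated group, with weights of 1.5 at incomes 40,000 and 60,000 and 2 at income 50,000, gives the same value, $\frac{12 \times 1.5 + 12 \times 1.5 + 18 \times 2 + 18 \times 2 + 0 \times 1.5 + 0 \times 1.5}{10} = 10.8$, so the estimated effect is zero.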
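
The same calculation can be done in pandas. This sketch rebuilds the ten observations (with the target column renamed `Y` and a 0/1 `Treated` column, both naming choices of this sketch) and estimates the propensity score by counting within each income level.

```python
import pandas as pd

ipw = pd.DataFrame({
    "Treated": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Income": [40000, 50000, 50000, 60000, 40000, 40000, 50000, 50000, 60000, 60000],
    "Y": [12, 18, 18, 0, 12, 12, 18, 18, 0, 0],
})

# Propensity score by counting: share treated at each income level.
ipw["pscore"] = ipw.groupby("Income")["Treated"].transform("mean")

# Weight = 1 / P(the treatment value the row actually received).
prob_received = ipw["pscore"].where(ipw["Treated"] == 1, 1 - ipw["pscore"])
ipw["weight"] = 1 / prob_received

# Weighted mean of Y within each group, then the difference (the ATE estimate).
weighted_sum = (ipw["Y"] * ipw["weight"]).groupby(ipw["Treated"]).sum()
total_weight = ipw["weight"].groupby(ipw["Treated"]).sum()
weighted_mean = weighted_sum / total_weight
ate = weighted_mean[1] - weighted_mean[0]

print(weighted_mean)
print("Estimated ATE:", ate)
```

Here each group's weights happen to sum to exactly 10, the total number of observations, which is why dividing by 10 in the formula above works.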

Let’s consider this special case where the treated and untreated items have the same target value given the same propensity score. There are $N_T$ of a given matching value in treated, which all have target value $T$, and $N_U$ in untreated, which also have target value $T$, then (assuming the statistical distribution of the data exactly reflects the


#### What Is the Worst Acceptable Match?
What if one treated observation is totally different from all untreated observations? 
- We define a “caliper” or “bandwidth,” which is a number of standard deviations (either of the propensity score or of the matched value, whichever you are using to match; in the latter case, I am assuming that the matched covariate is a single number $Z$ so that it has a single standard deviation). Candidate matches farther away than the caliper are rejected.
- You can use the Epanechnikov kernel, which automatically has a caliper of 1. 
- You can use exact matching, which automatically has a caliper of 0. (This is unsuitable for continuous data values that rarely or never take on the same value.) 
- You can use coarsened exact matching, where you bin the data and then require that the two matches (treated and untreated) be in the same bin. (This is suitable for continuous data values.) 

Recall the **bias-variance tradeoff:** 
- High bias means that our prediction is systematically wrong — even if we took an infinite number of samples and averaged the results, it would still be wrong. 
- High variance means that our prediction varies dramatically when we take different samples.
- The wider the bandwidth, the more bias, because the $Y$ values in the large bin may differ systematically from each other.
- However, the wider the bandwidth, the less variance, because we sample more possible matches. In other words:
    - If we only take a single $Y$ value from a small bin, it should be correct on average but might be wrong by chance.
    - If we take many matching $Y$ values from a large bin, the average will be stable, but that average might differ systematically from the correct value.
    - Another way of putting it is that if the matches are at a distance of 1, 2, 3, 4, and 5 away (in terms of the matching value), then using just the distance 1 value has low bias (because we’re using the best estimator available) but high variance (we’d need to average multiple estimates to reduce the variance).

The bias-variance tradeoff is generally important when performing matching! 



### 2.3 Lesson: Advanced Matching and Weighting Techniques, Part II
As noted above, distance matching becomes trickier with **multiple variables**. The solution mentioned was to use the Euclidean distance. 

Another option is to use the **Mahalanobis distance** between two observations, which is the square root of the following inner product: the displacement vector between the two $d$-dimensional matching variable vectors, multiplied by the inverse $S^{-1}$ of the covariance matrix for all the matching variables, multiplied again by the same displacement vector. In symbols, for observations $a$ and $b$, this is $\sqrt{(a - b)^\top S^{-1} (a - b)}$.

The advantage of this approach (over the Euclidean distance) is that it handles correlated variables well: 
- Suppose $Z_1$ and $Z_2$ are highly correlated confounders; they are almost equal. 
- Then, when two samples differ by the same amount in $Z_1$ as in $Z_2$ (i.e., in line with the usual correlation), that difference is counted less.
- When a sample is higher in $Z_1$ but lower in $Z_2$ (i.e., they differ in a way opposite to the usual correlation), then that difference is counted more. 

That way, differences that are contrary to the correlation (which are likely smaller) have a chance to matter. 

Another point is that the inclusion of the $S^{-1}$ term will tend to reduce the importance of features that are correlated to other features. 
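
The effect of $S^{-1}$ is easy to see with two correlated variables. In this sketch the covariance matrix (unit variances, correlation 0.9) is a hypothetical choice; in practice $S$ would be estimated from the matching variables.

```python
import numpy as np

# Hypothetical covariance matrix: Z1 and Z2 have unit variance and correlation 0.9.
S = np.array([[1.0, 0.9],
              [0.9, 1.0]])
S_inv = np.linalg.inv(S)

def mahalanobis(a, b):
    """Mahalanobis distance sqrt((a - b)^T S^{-1} (a - b))."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(d @ S_inv @ d))

origin = [0.0, 0.0]
# Both displacements have the same Euclidean length, sqrt(2).
with_corr = mahalanobis([1.0, 1.0], origin)      # differs in line with the correlation
against_corr = mahalanobis([1.0, -1.0], origin)  # differs against the correlation

print("Along the correlation:  ", round(with_corr, 3))
print("Against the correlation:", round(against_corr, 3))
```

The two displacements are equally long in Euclidean terms, but the one that runs against the correlation is counted as several times farther away.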

The **curse of dimensionality** may make it harder to find matches: 
- In high-dimensional spaces, almost all points are approximately the same distance from all other points. This makes matching almost meaningless. 
- In effect, if there are many $Z$ values, then it’s unlikely that all of them will be particularly large or small. On average, they will be about average. 
- When we combine the various $Z$ values, the Euclidean distance will likely be around the average distance, so we will have little basis for matching. 
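
A quick simulation illustrates the point. The setup is an assumption chosen for illustration: points drawn uniformly at random, with the gap between the farthest and nearest neighbor measured relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_spread(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random point to all the others."""
    points = rng.uniform(size=(n_points, n_dims))
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return float((distances.max() - distances.min()) / distances.min())

spread_low = relative_spread(n_dims=2)
spread_high = relative_spread(n_dims=1000)

print("Relative spread in 2 dimensions:   ", round(spread_low, 2))
print("Relative spread in 1000 dimensions:", round(spread_high, 2))
```

In two dimensions the nearest neighbor is far closer than the farthest one. In a thousand dimensions the two are nearly the same distance away, so "nearest" carries little information.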

Here’s how to counter the curse of dimensionality: 
1. Leave some of the matching variables ($Z$) out, thereby reducing the number of dimensions. However, in that case, these particular confounders will not be controlled for. 
2. Include more observations, assuming you have some more to include. In an $N$-dimensional space, you need exponentially many samples ($2^N$) to make it likely that one sample is suitably close to another. (Imagine a cube divided up into eight cubes of half the width. The number eight comes from $2^3 = 8$.) 
3. Have a larger caliper/bandwidth, thus finding more matches. It is possible to choose the bandwidth needed to get the number of matches you want. However, this does not really solve the curse of dimensionality; the number of matches may be very sensitive to the chosen bandwidth. That is, you may get almost no matches for small bandwidths and almost everything for larger ones. 

Alternatively, we can do coarsened exact matching, where we put all variables into bins, and matches must have the same bins in all of the matching variables. This requires a really big sample size. But then, multiple matching variables always tend to require a large sample size unless you just increase the bandwidth and accept the really large number of matches.
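
Here is a minimal sketch of coarsened exact matching with two matching variables. The data and the bin edges are hypothetical; choosing the bins is a modelling decision.

```python
import pandas as pd

# Hypothetical data with two continuous matching variables.
cem = pd.DataFrame({
    "Treated": [1, 1, 1, 0, 0, 0, 0],
    "Z1": [0.12, 0.55, 0.91, 0.18, 0.52, 0.58, 0.30],
    "Z2": [10.0, 42.0, 77.0, 14.0, 47.0, 44.0, 90.0],
    "Y": [5.0, 8.0, 9.0, 3.0, 6.0, 7.0, 4.0],
})

# Coarsen each variable into bins (labels=False returns integer bin codes).
cem["Z1_bin"] = pd.cut(cem["Z1"], bins=[0, 0.25, 0.5, 0.75, 1.0], labels=False)
cem["Z2_bin"] = pd.cut(cem["Z2"], bins=[0, 25, 50, 75, 100], labels=False)

cem_treated = cem[cem["Treated"] == 1]
cem_untreated = cem[cem["Treated"] == 0]

# Counterfactual: mean untreated Y within the same (Z1_bin, Z2_bin) cell.
cell_means = (cem_untreated.groupby(["Z1_bin", "Z2_bin"])["Y"]
              .mean().rename("Y0").reset_index())

# The inner merge drops treated rows whose cell contains no untreated rows.
matched = cem_treated.merge(cell_means, on=["Z1_bin", "Z2_bin"], how="inner")
att_cem = float((matched["Y"] - matched["Y0"]).mean())
n_dropped = len(cem_treated) - len(matched)

print("ATT:", att_cem, "| treated rows dropped for lack of a match:", n_dropped)
```

Even in this tiny example one treated row finds no untreated row in its cell, which hints at why the approach needs large samples as the number of variables grows.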

#### Entropy Balancing
This means enforcing a restriction like “no difference in the mean of matching variable $Z$ between treatment and control.” We just need a set of weights that satisfies all restrictions. We can, in general, impose this on any moment (recall that mean, variance, skew, and kurtosis are examples of moments). Thus, we could require no difference in variance, etc.

#### Propensity Score Weighting with Multiple Matching Variables
As mentioned above, you can estimate propensity scores by **regression**, such as logit or probit: 
- If the matching variable $Z$ can be thought of as influencing treatment via treatment propensity (which it really ought to be, since that’s what treatment propensity is), then controlling for the propensity score should close all doors related to the matching variable.
- To check this, within each bin or interval of the propensity score, you can check that the matching variable is unrelated to treatment.
- For example, in the above case, since \$40,000 and \$60,000 are in the same propensity score bin (score = 1/3), we require that they have the same likelihood of treatment.
    - If there is still a relationship, it might mean that the matching variable is not included in the regression correctly. (The propensity score is calculated incorrectly.) Perhaps you need more polynomial or interaction terms. This is called a **stratification test**.

The point of the stratification test is that for a given interval of propensity score, the covariates used for matching ($Z$) should not be correlated with treatment. Otherwise, it might mean that we miscalculated the propensity score. 
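
A rough version of this check can be run on the income example. The sketch compares two candidate propensity scores: one computed by counting within each income level, and a deliberately misspecified one from an intercept-only model, which assigns everyone the overall treated share of 0.4. The helper `stratification_gap` is hypothetical, and because the scores here take only a few distinct values, each distinct score is treated as its own stratum instead of binning into intervals.

```python
import pandas as pd

strat = pd.DataFrame({
    "Treated": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Income": [40000, 50000, 50000, 60000, 40000, 40000, 50000, 50000, 60000, 60000],
})

# Candidate 1: propensity score by counting within each income level.
strat["pscore_counted"] = strat.groupby("Income")["Treated"].transform("mean")
# Candidate 2: intercept-only model, i.e. the overall treated share for everyone.
strat["pscore_flat"] = strat["Treated"].mean()

def stratification_gap(score: str) -> float:
    """Largest within-stratum difference in treatment rate across income levels."""
    rates = strat.groupby([score, "Income"])["Treated"].mean()
    gaps = rates.groupby(level=0).agg(lambda r: r.max() - r.min())
    return float(gaps.max())

gap_counted = stratification_gap("pscore_counted")
gap_flat = stratification_gap("pscore_flat")

print("Gap with counted score:", gap_counted)
print("Gap with flat score:   ", round(gap_flat, 4))
```

A gap near zero means income no longer predicts treatment once the score is held fixed. The flat score leaves a clear gap, signalling that income enters the propensity model incorrectly.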

You can also use machine learning to estimate the propensity score, including regularized regression and boosted regression. Boosted regression is a version of logistic regression that works similarly to gradient boosting. The “boosting” means that it runs itself repeatedly, fixing its own remaining errors each time it runs.

### 2.4 Lesson: Practical Considerations and Advanced Techniques

#### Assumptions for Matching

**Conditional Independence Assumption:**
This is just the assumption that all back doors are closed. The set of matching variables you’ve chosen (covariates like $Z$) is enough to close all back doors. Any remaining relationship between treatment and outcome is a desired (front door) relationship.

**Common Support:**
There should be enough control observations to match with. This is called **“common support.”** 
- You can imagine this could fail quite easily. 
- For example, imagine that in the treatment group, there are $Z$ values of 1, 5, and 10, while in the untreated group there are $Z$ values of 1, 11, and 12. We can easily find matches for the treated items with $Z$ = 1 and $Z$ = 10 but not for the one with $Z$ = 5.

**How to Check for Common Support**
1. Look at the probability distribution of a variable for the treated and untreated groups. If one distribution is zero (or very small) where the other is nonzero, then we can’t use matching or weights to compare one to the other in that region. In the example above, this would be like $Z$ = 5, where the treatment distribution is nonzero while the untreated distribution is zero. However, to really talk about a probability distribution, it would be nice to have a denser array of samples. In the above data, we really cannot say whether $Z$ = 1, 5, and 10 means that we should be able to find $Z$ values ranging from 1 to 10 or whether those specific values are special.
2. Another option is to look within a certain caliper or radius of each treated item and see if it finds support within that radius.
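
The second check is simple to code. This sketch uses the $Z$ values from the example above; the caliper of 1 is a hypothetical choice.

```python
import numpy as np

z_treated = np.array([1.0, 5.0, 10.0])
z_untreated = np.array([1.0, 11.0, 12.0])
caliper = 1.0  # hypothetical radius, in the units of Z

# Distance from each treated item to its nearest untreated item.
nearest_gap = np.abs(z_treated[:, None] - z_untreated[None, :]).min(axis=1)

# A treated item has support if some untreated item lies within the caliper.
has_support = nearest_gap <= caliper

for z, gap, ok in zip(z_treated, nearest_gap, has_support):
    print(f"treated Z = {z:g}: nearest untreated is {gap:g} away, support = {bool(ok)}")
```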

Note that some approaches will automatically drop items lacking support, such as with calipers. However, propensity score matching with inverse probability weighting won't do this since an item with no support will simply have $P = 1$ (it was certain to be treated) and will be given a weight $\frac{1}{P} = \frac{1}{1} = 1$. That's smaller than other weights, but it is still not zero.

If we don’t have any matches, then we aren’t really doing matching, even though we are technically following the instructions. When this happens, you may want to drop the items without matches. In fact, you might want to require more than one match: perhaps there should be a certain number of untreated observations within each propensity score bin; otherwise, you skip that bin.

Commonly, the bins you skip will have very high or very low propensity scores because these will be the scores that are least likely to have a match. 

In any case, the only way to fix common support issues is to drop the treated samples that find no match and/or the untreated samples that are present in low enough numbers that the match isn’t good enough. If you have to drop too much data, then you aren’t really computing an ATE or ATT anymore. At some point, you have diverged enough from your goal that matching may not work for you.

#### Balance
Balance is the assumption that we have matched correctly: we have common support, and we are using covariates $Z$ that close all back doors. Whereas conditional independence means that we could, in principle, match correctly, balance means that we have matched correctly. If there is not common support, we may not be able to match correctly in practice.

Common support is harder to achieve when covariates must be matched jointly (for example, when interaction terms matter), so balance is more likely to fail. For instance, if the treated group contains ($X$, $Z$) = (1, 0) and (0, 1) while the untreated group contains (0, 0) and (1, 1), then $X$ and $Z$ each have support individually (a treated $X$ = 1 can be matched to an untreated $X$ = 1) but not jointly (the treated (1, 0) has no untreated (1, 0) to match).
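
The difference between individual and joint support in this example can be checked directly. The sketch below uses the four ($X$, $Z$) pairs from the text; the helper names `marginal_support` and `joint_support` are hypothetical.

```python
# (X, Z) pairs from the example in the text.
treated = [(1, 0), (0, 1)]
untreated = [(0, 0), (1, 1)]

def marginal_support(treated, untreated, index):
    """True if every treated value of one covariate also appears among the untreated."""
    return {row[index] for row in treated} <= {row[index] for row in untreated}

def joint_support(treated, untreated):
    """True if every treated (X, Z) combination also appears among the untreated."""
    return set(treated) <= set(untreated)

print(marginal_support(treated, untreated, 0))  # True  (X alone has support)
print(marginal_support(treated, untreated, 1))  # True  (Z alone has support)
print(joint_support(treated, untreated))        # False (no exact joint matches)
```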

A balance table displays moments such as the mean and standard deviation of each covariate by group. If the means are similar for the treated and untreated groups, balance is likely good. If the means are quite different, that may indicate imbalance or a lack of common support. After matching, the balance table should show similar means and similar standard deviations in the two groups.
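
One row of a balance table might be computed as in the sketch below. The ages and matching weights are made up, and the standardized mean difference (difference in means divided by the treated-group standard deviation) is a common summary that is assumed here rather than taken from the text.

```python
from statistics import mean, stdev

# Hypothetical covariate values (say, age) for each group.
treated_age = [30, 35, 40, 45]
untreated_age = [20, 25, 30, 35, 40, 45]
# Hypothetical matching weights for the untreated group (0 = unmatched).
match_weights = [0, 0, 1, 1, 1, 1]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def std_mean_diff(treated, untreated, weights=None):
    """Difference in means, scaled by the treated-group standard deviation."""
    if weights is None:
        weights = [1] * len(untreated)
    return (mean(treated) - weighted_mean(untreated, weights)) / stdev(treated)

before = std_mean_diff(treated_age, untreated_age)
after = std_mean_diff(treated_age, untreated_age, match_weights)
print(f"mean treated:             {mean(treated_age):.2f}")
print(f"mean untreated (raw):     {mean(untreated_age):.2f}")
print(f"mean untreated (matched): {weighted_mean(untreated_age, match_weights):.2f}")
print(f"standardized difference before: {before:.2f}, after: {after:.2f}")
```

A full balance table would repeat this for every covariate and would usually report standard deviations alongside the means.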

#### Estimating Mean Differences
We have discussed how to estimate the effect: compare the mean $Y$ values of the matched treated and untreated observations, weighted appropriately if weights are used. But what about standard errors?

The two problems here are: 
- The weights (or propensity scores) are themselves estimated, and their estimation error contributes to the overall error.
- If we dropped some observations, the choice of which observations to drop adds uncertainty that a standard formula does not capture.

A common way to compute the standard error in this case is to use a bootstrap simulation. If you do this, the match selection process must be automated and placed inside the bootstrap, so that the whole matching procedure is repeated on every resample.
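
The mechanics of repeating the matching step inside each resample can be sketched as follows. This is an illustration under assumptions: the data are made up, the matching rule is one-nearest-neighbor on a single variable $Z$ with replacement, and the names `att_nearest_neighbor` and `bootstrap_se` are hypothetical. Whether the bootstrap is statistically appropriate depends on the matching estimator, so treat this as a sketch of the procedure rather than a recommendation.

```python
import random
from statistics import mean, stdev

# Hypothetical rows: (treated flag, z, y). The effect of treatment is 2 by construction.
rows = ([(1, z, z + 2.0) for z in (2, 3, 4, 5, 6)] +
        [(0, z, float(z)) for z in (1, 2, 3, 4, 5, 6, 7)])

def att_nearest_neighbor(rows):
    """ATT from one-nearest-neighbor matching on z, with replacement."""
    treated = [r for r in rows if r[0] == 1]
    untreated = [r for r in rows if r[0] == 0]
    diffs = []
    for _, z, y in treated:
        match = min(untreated, key=lambda u: abs(u[1] - z))
        diffs.append(y - match[2])
    return mean(diffs)

def bootstrap_se(rows, n_boot=500, seed=0):
    """Resample rows, redo the matching each time, and return the spread of estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(rows, k=len(rows))
        if {r[0] for r in sample} != {0, 1}:
            continue  # skip resamples that lost an entire group
        estimates.append(att_nearest_neighbor(sample))
    return stdev(estimates)

print(att_nearest_neighbor(rows))  # 2.0
print(bootstrap_se(rows))          # bootstrap standard error of the ATT estimate
```

The key design choice is that `att_nearest_neighbor` is called inside the loop, so the uncertainty from choosing matches is part of what the bootstrap measures.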

#### Combining Matching and Regression 
We can combine matching with regression using the following techniques: 
- Regression adjustment 
- Doubly robust estimation, which remains valid as long as either the regression (outcome) model or the matching (treatment) model is correctly specified, even if the other one is not.
    - E.g., AIPWE (augmented inverse probability weighted estimator) 
    - E.g., entropy balancing
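
The doubly robust property can be illustrated with a small AIPW-style calculation. The sketch below uses a standard form of the estimator, the mean of $m_1 - m_0 + \frac{D(Y - m_1)}{p} - \frac{(1 - D)(Y - m_0)}{1 - p}$, on made-up, noise-free data with a true effect of 1. The outcome model is deliberately misspecified (it ignores $Z$) while the propensity model is correct; all names and data are illustrative assumptions.

```python
from statistics import mean

# Hypothetical noise-free rows (d, z, y) with y = 1*d + 2*z, so the true effect is 1.
# Treatment is more likely when z = 1, which opens a back door through z.
rows = ([(1, 0, 1.0)] + [(0, 0, 0.0)] * 3 +
        [(1, 1, 3.0)] * 3 + [(0, 1, 2.0)])

def propensity(z):
    """Treatment model: share treated among rows with this z (correctly specified)."""
    return mean([d for d, zi, _ in rows if zi == z])

def outcome_model(d):
    """Deliberately misspecified outcome model: mean y by treatment, ignoring z."""
    return mean([y for di, _, y in rows if di == d])

def aipw_ate(rows):
    """Average of the outcome-model contrast plus inverse-probability-weighted residuals."""
    m1, m0 = outcome_model(1), outcome_model(0)
    terms = []
    for d, z, y in rows:
        p = propensity(z)
        terms.append(m1 - m0 + d * (y - m1) / p - (1 - d) * (y - m0) / (1 - p))
    return mean(terms)

naive = outcome_model(1) - outcome_model(0)
print(naive)           # 2.0, biased because z is ignored
print(aipw_ate(rows))  # close to 1.0, the true effect
```

Here the weighting terms correct the error of the misspecified outcome model, which is the sense in which the estimator gets two chances to be right.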

#### Matching and Treatment Effects
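Regression with a single treatment coefficient typically gives about the same estimate whether you are after the ATE, ATT, or ATUT, because that one coefficient (a particular $\beta$) is the estimated effect. Matching, in contrast, can give you different treatment effects depending on the target. To find the ATE, use inverse probability weighting: weight treated observations by $\frac{1}{p}$ and untreated observations by $\frac{1}{1 - p}$, where $p$ is the propensity score.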

To find the ATT, weight all treated observations the same or (if you are selecting matches) find a set of untreated matches for each treated observation.

With inverse probability weighting for the ATT, the treated group gets a weight of 1 and the untreated group gets $\frac{p}{1 - p}$ instead of $\frac{1}{1 - p}$.

To find the ATUT, find a set of treated matches for each untreated observation. With weighting, the untreated group gets a weight of 1 and the treated group gets $\frac{1 - p}{p}$ instead of $\frac{1}{p}$.

You can also recover the ATE as the weighted average $\pi \times \text{ATT} + (1 - \pi) \times \text{ATUT}$, where $\pi$ is the proportion of observations that are treated (not the individual propensity score $p$ used in the weights).
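
The three weighting schemes, and the identity linking them, can be checked numerically. The sketch below assumes known propensity scores and made-up outcomes in which the effect is 1 for low-propensity units and 3 for high-propensity units; the names `WEIGHTS`, `ipw_effect`, and `share_treated` are hypothetical.

```python
# Hypothetical rows: (d, p, y) = treatment flag, propensity score, outcome.
rows = ([(1, 0.25, 1.0)] + [(0, 0.25, 0.0)] * 3 +
        [(1, 0.75, 5.0)] * 3 + [(0, 0.75, 2.0)])

# For each estimand: (weight for treated units, weight for untreated units).
WEIGHTS = {
    "ATE":  (lambda p: 1 / p,       lambda p: 1 / (1 - p)),
    "ATT":  (lambda p: 1.0,         lambda p: p / (1 - p)),
    "ATUT": (lambda p: (1 - p) / p, lambda p: 1.0),
}

def weighted_mean(pairs):
    return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)

def ipw_effect(rows, estimand):
    """Weighted mean outcome of the treated minus that of the untreated."""
    w_treated, w_untreated = WEIGHTS[estimand]
    treated = [(w_treated(p), y) for d, p, y in rows if d == 1]
    untreated = [(w_untreated(p), y) for d, p, y in rows if d == 0]
    return weighted_mean(treated) - weighted_mean(untreated)

ate, att, atut = (ipw_effect(rows, name) for name in ("ATE", "ATT", "ATUT"))
share_treated = sum(d for d, _, _ in rows) / len(rows)
print(ate, att, atut)  # approximately 2.0, 2.5, 1.5
print(share_treated * att + (1 - share_treated) * atut)  # approximately 2.0
```

The ATT exceeds the ATUT here because the units with the larger effect are also the ones more likely to be treated.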



### Key Terms and Definitions
1. **Matching:** A method for estimating causal effects by pairing treated and untreated units with similar values of confounding variables to simulate a randomized experiment.
2. **Average Treatment Effect on the Treated (ATT):** The average causal effect of a treatment on those who actually received the treatment. 
3. **Weighted Matching:** A method where untreated observations are weighted to match the distribution of confounders in the treated group rather than selecting exact matches.
4. **Distance Matching:** A technique that matches treated and untreated observations based on the smallest distance (e.g., Euclidean) between their covariate values. 
5. **Propensity Score Matching:** Matching based on the probability of receiving treatment given covariates; units with similar scores are matched to control for confounding.
6. **Propensity Score:** The probability of receiving the treatment, estimated using observed covariates (e.g., via logistic regression). 
7. **Epanechnikov Kernel:** A weighting function used in kernel-based matching that assigns weights based on the distance between units, favoring closer matches. 
8. **Inverse Probability Weighting (IPW):** A method where each observation is weighted by the inverse of the probability of receiving the treatment (or control) it actually received, to estimate ATE.
9. **Caliper / Bandwidth:** A threshold distance used to restrict acceptable matches; matches outside this range are discarded to improve quality. 
10. **Coarsened Exact Matching:** A matching approach where covariates are binned into categories, and matches are made within the same bins across groups. 
11. **Bias-Variance Tradeoff:** A fundamental concept where tighter matching (low bias) may increase variability, while looser matching (low variance) may introduce bias. 
12. **Mahalanobis Distance:** A distance metric that accounts for correlations among covariates when matching, improving match quality in multivariate settings. 
13. **Curse of Dimensionality:** A challenge in high-dimensional matching where most points are approximately equidistant, making meaningful matching difficult. 
14. **Entropy Balancing:** A reweighting method that ensures exact balance of specified covariate moments (e.g., means) between treated and control groups. 
15. **Common Support:** The condition that there is sufficient overlap in covariate distributions between treated and untreated groups to allow meaningful matching. 
16. **Balance:** The degree to which covariate distributions are similar between treated and control groups after matching. 
17. **Balance Table:** A table comparing means (and often standard deviations) of covariates across groups to assess the success of matching. 
18. **Bootstrap Simulation:** A resampling technique used to estimate standard errors of treatment effects in matching procedures. 
19. **Regression Adjustment:** Using regression after matching to control for residual covariate differences and improve precision. 
20. **Double Robust Estimation:** A method that combines matching and regression; valid if either model is correctly specified. 
21. **Augmented Inverse Probability Weighted Estimator (AIPWE):** A specific double robust estimator that combines outcome modeling and propensity weighting. 
22. **Treatment Effects (ATE, ATT, ATUT):**
    - ATE: Average treatment effect across the entire population
    - ATT: Effect on the treated group
    - ATUT: Effect on the untreated group
23. **Stratification Test:** A check to ensure that within strata (bins) of the propensity score, covariates are uncorrelated with treatment assignment. 
24. **Selecting Matches:** The process of choosing one or more untreated units for each treated unit (or vice versa), often using nearest neighbor, radius, or caliper methods. 
25. **Matching With/Without Replacement:** Whether an untreated unit can be matched to multiple treated units (with replacement) or only one (without replacement). 