# Matching Tutorial

Matching is a statistical technique used to remove bias caused by differences in the baseline characteristics of the groups being compared. Simply put, matching helps ensure that the observed results of an experiment are caused by the effect under study rather than by external factors.

Matching is most often used when a standard A/B test is not possible.

In [1]:
from hypex.dataset import Dataset, ExperimentData, InfoRole, TreatmentRole, TargetRole 
from hypex.experiments.twin_search import MATCHING



## Creating a new test dataset with synthetic data

It is important to mark the data fields by assigning the appropriate roles:
- FeatureRole: a role for columns that contain features or predictor variables. Matching is based on them. Applied by default if no role is specified for a column.
- TreatmentRole: a role for columns that show the treatment or intervention.
- TargetRole: a role for columns that show the target or outcome variable.
- InfoRole: a role for columns that contain information about the data, such as user IDs. 

In [2]:
data = Dataset(
    roles={
        "user_id": InfoRole(int),
        "treat": TreatmentRole(int),
        "post_spends": TargetRole(float),
    },
    data="data.csv",
)
data

      user_id  signup_month  treat  pre_spends  post_spends   age gender  \
0           0             0      0       488.0   414.444444   NaN      M   
1           1             8      1       512.5   462.222222  26.0    NaN   
2           2             7      1       483.0   479.444444  25.0      M   
3           3             0      0       501.5   424.333333  39.0      M   
4           4             1      1       543.0   514.555556  18.0      F   
...       ...           ...    ...         ...          ...   ...    ...   
9995     9995            10      1       538.5   450.444444  42.0      M   
9996     9996             0      0       500.5   430.888889  26.0      F   
9997     9997             3      1       473.0   534.111111  22.0      F   
9998     9998             2      1       495.0   523.222222  67.0      F   
9999     9999             7      1       508.0   475.888889  38.0      F   

        industry  
0     E-commerce  
1     E-commerce  
2      Logistics  
3     E-com

In [3]:
data.roles

{'user_id': Info(<class 'int'>),
 'treat': Treatment(<class 'int'>),
 'post_spends': Target(<class 'float'>),
 'signup_month': Feature(<class 'int'>),
 'pre_spends': Feature(<class 'float'>),
 'age': Feature(<class 'float'>),
 'gender': Feature(<class 'str'>),
 'industry': Feature(<class 'str'>)}

## Simple matching
Before execution, we wrap the prepared dataset in `ExperimentData` so that experiments can be run on it.
Then we execute a pipeline; here we use one of the pre-assembled pipelines, `MATCHING`. A pipeline can also be assembled from custom executors to suit your needs.

In [5]:
test = MATCHING
result = test.execute(ExperimentData(data))

## Matching results 

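ATT (average treatment effect on the treated) shows the estimated effect in the treated group.  
ATC (average treatment effect on the control) shows the estimated effect in the untreated group.  
ATE (average treatment effect) is the weighted average of ATT and ATC.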

In [6]:
result.variables

{"MatchingMetrics┴┴['post_spends', 'FaissNearestNeighbors||post_spends_matched']": {'ATT': 275.68659110904105,
  'ATC': 309.32492819349966,
  'ATE': 291.19931400844945},
 "SMD┴┴['post_spends', 'FaissNearestNeighbors||post_spends_matched']": {0: 76.00524756368492,
  1: 23.666228822777718}}
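
To make the three estimands above concrete, here is a minimal NumPy sketch, independent of HypEx, that performs 1-nearest-neighbor matching on a single covariate and derives ATT, ATC, and ATE from the matched pairs. The function name `matching_effects`, the toy data, and the group-size weighting used for ATE are assumptions made for illustration; the computation inside HypEx may differ.

```python
import numpy as np


def matching_effects(x, treat, y):
    """Estimate ATT, ATC and ATE with 1-nearest-neighbor matching on one covariate.

    Hypothetical helper for illustration only; not part of HypEx.
    """
    x = np.asarray(x, dtype=float)
    treat = np.asarray(treat, dtype=bool)
    y = np.asarray(y, dtype=float)

    x_t, y_t = x[treat], y[treat]
    x_c, y_c = x[~treat], y[~treat]

    # For every treated unit, index of the closest control unit (and vice versa).
    nearest_c = np.abs(x_t[:, None] - x_c[None, :]).argmin(axis=1)
    nearest_t = np.abs(x_c[:, None] - x_t[None, :]).argmin(axis=1)

    # ATT: observed treated outcome minus matched control outcome.
    att = np.mean(y_t - y_c[nearest_c])
    # ATC: matched treated outcome minus observed control outcome.
    atc = np.mean(y_t[nearest_t] - y_c)
    # ATE: ATT and ATC averaged with group-size weights (assumed weighting).
    n_t, n_c = len(y_t), len(y_c)
    ate = (att * n_t + atc * n_c) / (n_t + n_c)
    return {"ATT": att, "ATC": atc, "ATE": ate}


# Toy data: the outcome equals the covariate plus a constant effect of 10 for treated units.
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
treat = np.array([1, 1, 1, 0, 0, 0])
y = x + 10.0 * treat
effects = matching_effects(x, treat, y)
print(effects)
```

Because every treated unit in the toy data has an exact control twin, all three estimates recover the constant effect built into the data. With real data the matches are imperfect, so ATT and ATC generally differ, as they do in the output above.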

You can find the Mahalanobis distances in `result.groups` and the matched treatment column in `result.additional_fields`.

In [7]:
result.groups

{"MahalanobisDistance┴┴['signup_month', 'pre_spends', 'age']": {'control':         0          1         2
  0     NaN        NaN       NaN
  3     0.0  36.929974  3.254733
  12    0.0  34.757623  3.442269
  13    0.0  37.445447  4.191827
  14    0.0  36.598598  5.111848
  ...   ...        ...       ...
  9988  0.0  36.929974  2.467274
  9990  NaN        NaN       NaN
  9991  0.0  35.530832  3.523547
  9992  0.0  36.193584  2.529632
  9996  0.0  36.856335  2.323176
  
  [4294 rows x 3 columns],
  'test':              0          1         2
  1     4.293461  37.909827  2.303881
  2     3.756778  35.716248  2.208865
  5     3.220095  35.952757  3.576048
  8     2.146730  34.363881  5.139179
  9     2.146730  34.695257  4.284285
  ...        ...        ...       ...
  9981  3.220095  36.983703  4.090079
  9993  2.683413  34.127372  5.060565
  9997  1.610048  34.894946  2.000059
  9998  1.073365  36.493776  5.245584
  9999  3.756778  37.557224  3.162571
  
  [3675 rows x 3 columns]}}
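
As background on the metric itself, the following NumPy sketch shows one common way to work with Mahalanobis distance: whiten the features with a factor of the inverse covariance matrix, after which ordinary Euclidean distance in the transformed space equals Mahalanobis distance in the original space, so a standard nearest-neighbor search can be used. The output above has one column per feature, which may indicate a transformation of this kind, but that is an assumption rather than something the source confirms. `mahalanobis_transform` and the synthetic data are hypothetical and not part of HypEx.

```python
import numpy as np


def mahalanobis_transform(features):
    """Whiten features so Euclidean distance equals Mahalanobis distance.

    Hypothetical helper for illustration only; not part of HypEx.
    """
    features = np.asarray(features, dtype=float)
    cov = np.cov(features, rowvar=False)
    # Lower-triangular Cholesky factor L with inv(cov) = L @ L.T
    chol = np.linalg.cholesky(np.linalg.inv(cov))
    return features @ chol


rng = np.random.default_rng(0)
# Synthetic correlated features (stand-ins for numeric columns such as pre_spends and age).
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
raw = rng.normal(size=(200, 3)) @ mixing
white = mahalanobis_transform(raw)

# Mahalanobis distance between the first two rows, computed directly ...
diff = raw[0] - raw[1]
direct = np.sqrt(diff @ np.linalg.inv(np.cov(raw, rowvar=False)) @ diff)
# ... and as a plain Euclidean distance in the transformed space.
euclid = np.linalg.norm(white[0] - white[1])
print(np.isclose(direct, euclid))
```

The design choice here is to pay the cost of the covariance inversion once, up front, so that every later distance computation is a cheap Euclidean one.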