# Matching pipeline

In [5]:
from hypex.dataset import Dataset, ExperimentData, InfoRole, TreatmentRole, TargetRole 
from hypex.experiments.matching import Matching

## Data preparation 

It is important to mark the data fields by assigning the appropriate roles:

* FeatureRole: a role for columns that contain features or predictor variables. Matching is based on them. Applied by default if no role is specified for a column.
* TreatmentRole: a role for columns that show the treatment or intervention.
* TargetRole: a role for columns that show the target or outcome variable.
* InfoRole: a role for columns that contain information about the data, such as user IDs.

In [6]:
data = Dataset(
    roles={
        "user_id": InfoRole(int),
        "treat": TreatmentRole(int), 
        "post_spends": TargetRole(float)
    }, data="data.csv",
)
data

      user_id  signup_month  treat  pre_spends  post_spends   age gender  \
0           0             0      0       488.0   414.444444   NaN      M   
1           1             8      1       512.5   462.222222  26.0    NaN   
2           2             7      1       483.0   479.444444  25.0      M   
3           3             0      0       501.5   424.333333  39.0      M   
4           4             1      1       543.0   514.555556  18.0      F   
...       ...           ...    ...         ...          ...   ...    ...   
9995     9995            10      1       538.5   450.444444  42.0      M   
9996     9996             0      0       500.5   430.888889  26.0      F   
9997     9997             3      1       473.0   534.111111  22.0      F   
9998     9998             2      1       495.0   523.222222  67.0      F   
9999     9999             7      1       508.0   475.888889  38.0      F   

        industry  
0     E-commerce  
1     E-commerce  
2      Logistics  
3     E-com

In [7]:
data.roles

{'user_id': Info(<class 'int'>),
 'treat': Treatment(<class 'int'>),
 'post_spends': Target(<class 'float'>),
 'signup_month': Feature(<class 'int'>),
 'pre_spends': Feature(<class 'float'>),
 'age': Feature(<class 'float'>),
 'gender': Feature(<class 'str'>),
 'industry': Feature(<class 'str'>)}

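## Matching process

Matching currently consists of three steps:

1. Compute the Mahalanobis distance.
2. Search for pairs with Faiss (pairs can optionally be found for the test group only).
3. Estimate ATT, ATC, and ATE (the average treatment effect on the treated, on the controls, and overall).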
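
The three steps above can be sketched with NumPy alone. This is an illustration of the idea, not HypEx's implementation: the function names `mahalanobis_match` and `estimate_effects` are hypothetical, the Faiss search is replaced by a brute-force nearest-neighbour search, no bias correction is applied, and the toy data is built so that every unit has an exact twin in the other group and the true effect is 10.

```python
import numpy as np


def mahalanobis_match(features, treat):
    """For every row, return the index of its nearest neighbour in the opposite group."""
    features = np.asarray(features, dtype=float)
    treat = np.asarray(treat, dtype=int)

    # Step 1: whiten the features so that Euclidean distance in the new
    # space equals Mahalanobis distance in the original space.
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    chol = np.linalg.cholesky(np.linalg.inv(cov))  # inv(cov) = chol @ chol.T
    whitened = features @ chol

    # Step 2: brute-force nearest-neighbour search across groups
    # (stand-in for an index-based search such as Faiss on large data).
    matched = np.empty(len(treat), dtype=int)
    for group in (0, 1):
        own = np.flatnonzero(treat == group)
        other = np.flatnonzero(treat != group)
        diffs = whitened[own, None, :] - whitened[None, other, :]
        dists = np.linalg.norm(diffs, axis=2)
        matched[own] = other[dists.argmin(axis=1)]
    return matched


def estimate_effects(outcome, treat, matched):
    """Step 3: average the within-pair outcome differences (treated minus control)."""
    outcome = np.asarray(outcome, dtype=float)
    treat = np.asarray(treat, dtype=int)
    sign = np.where(treat == 1, 1.0, -1.0)
    effect = sign * (outcome - outcome[matched])
    return {
        "ATT": effect[treat == 1].mean(),  # averaged over treated units
        "ATC": effect[treat == 0].mean(),  # averaged over control units
        "ATE": effect.mean(),              # averaged over all units
    }


# Toy data (assumption): four controls followed by four treated twins.
points = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
features = np.vstack([points, points])
treat = np.array([0, 0, 0, 0, 1, 1, 1, 1])
outcome = features.sum(axis=1) + 10.0 * treat  # true effect is 10 by construction

matched = mahalanobis_match(features, treat)
effects = estimate_effects(outcome, treat, matched)
print(matched)
print(effects)
```

Because every unit here has an identical twin in the other group, all three estimates equal 10. On real data the groups differ, which is why ATT, ATC, and ATE generally take different values.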

**Be careful: the pipeline does not include any data preprocessing.**
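
Since the data above contains missing values (`age`, `gender`) and string columns (`gender`, `industry`), some cleaning before matching is usually worthwhile. The following is a minimal pandas sketch on a toy frame, not a HypEx feature: the median imputation, the `"unknown"` category, and the one-hot encoding are illustrative assumptions, and the right choices depend on the data.

```python
import pandas as pd

# Toy frame (assumption) that mimics the kinds of gaps seen in the data above.
raw = pd.DataFrame(
    {
        "age": [float("nan"), 26.0, 25.0, 39.0],
        "gender": ["M", None, "M", "F"],
        "pre_spends": [488.0, 512.5, 483.0, 501.5],
    }
)

clean = raw.copy()

# Fill missing numeric values with the column median.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Give missing categories an explicit label, then one-hot encode them
# so that every feature is numeric before distances are computed.
clean["gender"] = clean["gender"].fillna("unknown")
clean = pd.get_dummies(clean, columns=["gender"], dtype=float)

print(clean)
```

The cleaned frame could then be saved to a CSV file and loaded into a `Dataset` in the same way as `data.csv` above.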

In [8]:
test = Matching()
result = test.execute(data)

user_id user_id_matched


In [9]:
result.full_data

      user_id  signup_month  treat  pre_spends  post_spends   age gender  \
0           0             0      0       488.0   414.444444   NaN      M   
1           0             0      0       488.0   414.444444   NaN      M   
2           0             0      0       488.0   414.444444   NaN      M   
3           0             0      0       488.0   414.444444   NaN      M   
4           0             0      0       488.0   414.444444   NaN      M   
...       ...           ...    ...         ...          ...   ...    ...   
9995     5053            10      1       497.5   440.888889  65.0      F   
9996     5053            10      1       497.5   440.888889  65.0      F   
9997     5053            10      1       497.5   440.888889  65.0      F   
9998     5062             0      0       463.0   425.777778  29.0      M   
9999     5062             0      0       463.0   425.777778  29.0      M   

      industry  user_id_matched  signup_month_matched  treat_matched  \
0         True 

In [10]:
result.resume

{'ATT': 209.90982351881865,
 'ATC': 268.1156310338775,
 'ATE': 239.38524444444445}