# Matching pipeline

Matching is a statistical method used to remove bias caused by differences in the baseline characteristics of the groups being compared. Put simply, matching helps ensure that the observed result is caused by the effect under study rather than by external factors.

Matching is most often used when a standard A/B test is not possible.

In [1]:
from hypex.dataset import Dataset, InfoRole, TreatmentRole, TargetRole, FeatureRole, GroupingRole
from hypex.experiments.matching import Matching


## Data preparation 

It is important to mark the data fields by assigning the appropriate roles:

* FeatureRole: a role for columns that contain features or predictor variables. Pairs are matched on these columns. Applied by default if no role is specified for a column.
* TreatmentRole: a role for columns that show the treatment or intervention.
* TargetRole: a role for columns that show the target or outcome variable.
* InfoRole: a role for columns that contain information about the data, such as user IDs.
* GroupingRole: a role for columns used to split the data into groups, such as `industry` in the example below.

In [2]:
data = Dataset(
    roles={
        "user_id": InfoRole(int),
        "treat": TreatmentRole(int), 
        "post_spends": TargetRole(float), 
        "industry": GroupingRole()
    },
    data="data.csv",
    default_role=FeatureRole(),
)
data

      user_id  signup_month  treat  pre_spends  post_spends   age gender  \
0           0             0      0       488.0   414.444444   NaN      M   
1           1             8      1       512.5   462.222222  26.0    NaN   
2           2             7      1       483.0   479.444444  25.0      M   
3           3             0      0       501.5   424.333333  39.0      M   
4           4             1      1       543.0   514.555556  18.0      F   
...       ...           ...    ...         ...          ...   ...    ...   
9995     9995            10      1       538.5   450.444444  42.0      M   
9996     9996             0      0       500.5   430.888889  26.0      F   
9997     9997             3      1       473.0   534.111111  22.0      F   
9998     9998             2      1       495.0   523.222222  67.0      F   
9999     9999             7      1       508.0   475.888889  38.0      F   

        industry  
0     E-commerce  
1     E-commerce  
2      Logistics  
3     E-com

In [3]:
data.roles

{'user_id': Info(<class 'int'>),
 'treat': Treatment(<class 'int'>),
 'post_spends': Target(<class 'float'>),
 'industry': Grouping(<class 'str'>),
 'signup_month': Feature(<class 'int'>),
 'pre_spends': Feature(<class 'float'>),
 'age': Feature(<class 'float'>),
 'gender': Feature(<class 'str'>)}

## Simple Matching  
Matching currently consists of 4 steps: 
1. Dummy encoding of categorical features 
2. Computing the Mahalanobis distance 
3. Two-sided pair search with faiss 
4. Estimating the metrics (ATT, ATC, ATE), depending on your data 
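
The first three steps can be sketched outside of HypEx as follows. This is a minimal illustration with NumPy, not the library's implementation: the toy data are made up, the helper `nearest` is hypothetical, and a brute-force search stands in for faiss.

```python
import numpy as np

# Toy data, made up for illustration: one numeric and one categorical feature.
age = np.array([25.0, 40.0, 31.0, 58.0, 27.0, 39.0, 33.0, 60.0])
gender = np.array(["M", "F", "F", "M", "M", "F", "M", "F"])
treat = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Step 1: dummy-encode the categorical column (one level is dropped).
gender_f = (gender == "F").astype(float)
X = np.column_stack([age, gender_f])

# Step 2: Mahalanobis transform. With cov = L @ L.T, the Euclidean distance
# between rows of L^-1 @ x equals the Mahalanobis distance between rows of x.
cov = np.cov(X, rowvar=False)
L = np.linalg.cholesky(cov)
X_white = np.linalg.solve(L, X.T).T


def nearest(queries, candidates):
    """Return, for every query row, the position of the closest candidate row."""
    diff = queries[:, None, :] - candidates[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist.argmin(axis=1)


# Step 3: search pairs in both directions (brute force instead of faiss).
treated_idx = np.flatnonzero(treat == 1)
control_idx = np.flatnonzero(treat == 0)
match_for_treated = control_idx[nearest(X_white[treated_idx], X_white[control_idx])]
match_for_control = treated_idx[nearest(X_white[control_idx], X_white[treated_idx])]

print("control match for each treated row:", match_for_treated)
print("treated match for each control row:", match_for_control)
```

Transforming the features once and then using plain Euclidean search is one common way to apply the Mahalanobis distance, because it lets any L2 nearest-neighbour index do the search.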

In [4]:
data = data.fillna(method="bfill")


In [5]:
test = Matching()
result = test.execute(data)

**ATT** (average treatment effect on the treated) shows the effect for the treated group.   
**ATC** (average treatment effect on the controls) shows the effect for the untreated group.   
**ATE** (average treatment effect) is the weighted average of ATT and ATC.  
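
As a rough illustration of how the three metrics relate, the sketch below computes them from a handful of matched pairs. The outcome values are hypothetical, and weighting ATE by group size is the usual textbook definition, assumed here rather than taken from the HypEx source.

```python
# Hypothetical matched outcomes, made up for illustration.
# Each treated unit: (own outcome, outcome of its matched control).
treated_pairs = [(460.0, 400.0), (480.0, 410.0), (515.0, 455.0)]
# Each control unit: (own outcome, outcome of its matched treated unit).
control_pairs = [(415.0, 505.0), (425.0, 520.0)]

# ATT: mean of (treated outcome - matched control outcome) over treated units.
att = sum(y_t - y_c for y_t, y_c in treated_pairs) / len(treated_pairs)

# ATC: mean of (matched treated outcome - control outcome) over control units.
atc = sum(y_t - y_c for y_c, y_t in control_pairs) / len(control_pairs)

# ATE: average of ATT and ATC, weighted by the size of each group.
n_t, n_c = len(treated_pairs), len(control_pairs)
ate = (att * n_t + atc * n_c) / (n_t + n_c)

print(f"ATT={att:.2f}  ATC={atc:.2f}  ATE={ate:.2f}")
```

Because ATE is a weighted average, it always lies between ATT and ATC, which is also the case in the result table below.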

In [6]:
result.resume

      CI Lower   CI Upper  Effect Size  P-value  Standard Error
ATC  93.412912  99.572333    96.492623      0.0        1.571281
ATE  77.319387  82.968763    80.144075      0.0        1.441167
ATT  58.575547  68.167610    63.371579      0.0        2.446955

In [7]:
result.indexes

      indexes
0        9433
1        5438
2        5165
3        1735
4         539
...       ...
9995     5893
9996     7731
9997     7066
9998     1885
9999     5748

[10000 rows x 1 columns]

In [8]:
result.full_data

      user_id  signup_month  treat  pre_spends  post_spends   age gender  \
0           0             0      0       488.0   414.444444  26.0      M   
1           1             8      1       512.5   462.222222  26.0      M   
2           2             7      1       483.0   479.444444  25.0      M   
3           3             0      0       501.5   424.333333  39.0      M   
4           4             1      1       543.0   514.555556  18.0      F   
...       ...           ...    ...         ...          ...   ...    ...   
9995     9995            10      1       538.5   450.444444  42.0      M   
9996     9996             0      0       500.5   430.888889  26.0      F   
9997     9997             3      1       473.0   534.111111  22.0      F   
9998     9998             2      1       495.0   523.222222  67.0      F   
9999     9999             7      1       508.0   475.888889  38.0      F   

        industry  user_id_matched  signup_month_matched  treat_matched  \
0     E-comme

We can change the **metric** and run the estimation again. In this case, matching is performed in one direction only.

In [9]:
test = Matching(metric="atc")
result = test.execute(data)

In [10]:
result.resume

      CI Lower  CI Upper  Effect Size  P-value  Standard Error
ATC  96.226285  96.75896    96.492623      0.0        0.135886

In [11]:
result.indexes

      indexes
0        9433
1          -1
2          -1
3        1735
4          -1
...       ...
9995       -1
9996     7731
9997       -1
9998       -1
9999       -1

[10000 rows x 1 columns]
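
In the output above, `-1` appears to mark rows for which no pair was searched: with `metric="atc"` only control rows (`treat == 0`) receive a match. The sketch below filters such rows using plain Python lists copied from the first five rows of the output; how to convert `result.indexes` to a list is not shown here, so treat the input as an assumption.

```python
# First five values of `indexes` and `treat`, copied from the outputs above.
indexes = [9433, -1, -1, 1735, -1]
treat = [0, 1, 1, 0, 1]

# Keep only rows with a real match; -1 is assumed to mean "no pair".
matched = {row: idx for row, idx in enumerate(indexes) if idx != -1}

for row, idx in matched.items():
    print(f"row {row} (treat={treat[row]}) is paired with row {idx}")
```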

In [12]:
result.full_data

      user_id  signup_month  treat  pre_spends  post_spends   age gender  \
0           0             0      0       488.0   414.444444  26.0      M   
1           1             8      1       512.5   462.222222  26.0      M   
2           2             7      1       483.0   479.444444  25.0      M   
3           3             0      0       501.5   424.333333  39.0      M   
4           4             1      1       543.0   514.555556  18.0      F   
...       ...           ...    ...         ...          ...   ...    ...   
9995     9995            10      1       538.5   450.444444  42.0      M   
9996     9996             0      0       500.5   430.888889  26.0      F   
9997     9997             3      1       473.0   534.111111  22.0      F   
9998     9998             2      1       495.0   523.222222  67.0      F   
9999     9999             7      1       508.0   475.888889  38.0      F   

        industry  user_id_matched  signup_month_matched  treat_matched  \
0     E-comme

Finally, we can search for pairs using the L2 distance instead of the Mahalanobis distance. 
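
To see how the two distances differ, the sketch below computes both from the first row to each of the first five rows, using the `pre_spends` and `age` values shown above. It uses NumPy directly rather than HypEx, and estimating the covariance from only five rows is purely for illustration.

```python
import numpy as np

# pre_spends and age for the first five rows of the filled dataset.
X = np.array([
    [488.0, 26.0],
    [512.5, 26.0],
    [483.0, 25.0],
    [501.5, 39.0],
    [543.0, 18.0],
])

diff = X - X[0]

# L2 (Euclidean) distance: every feature contributes in its raw units,
# so a column with a large scale dominates the result.
l2 = np.sqrt((diff ** 2).sum(axis=1))

# Mahalanobis distance: differences are rescaled by the inverse covariance,
# which accounts for feature scale and correlation.
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

print("L2:         ", np.round(l2, 2))
print("Mahalanobis:", np.round(mahalanobis, 2))
```

With L2, the ordering of nearest neighbours depends on the units of each column, so the matched pairs can differ from those found with the Mahalanobis distance.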

In [13]:
test = Matching(distance="l2", metric="att")
result = test.execute(data)

In [14]:
result.resume

     CI Lower   CI Upper  Effect Size  P-value  Standard Error
ATT  62.46076  64.273983    63.367372      0.0        0.462557

In [15]:
result.indexes

      indexes
0          -1
1        2490
2        5493
3          -1
4         321
...       ...
9995     5893
9996       -1
9997     8670
9998      507
9999     7155

[10000 rows x 1 columns]

In [16]:
result.full_data

      user_id  signup_month  treat  pre_spends  post_spends   age gender  \
0           0             0      0       488.0   414.444444  26.0      M   
1           1             8      1       512.5   462.222222  26.0      M   
2           2             7      1       483.0   479.444444  25.0      M   
3           3             0      0       501.5   424.333333  39.0      M   
4           4             1      1       543.0   514.555556  18.0      F   
...       ...           ...    ...         ...          ...   ...    ...   
9995     9995            10      1       538.5   450.444444  42.0      M   
9996     9996             0      0       500.5   430.888889  26.0      F   
9997     9997             3      1       473.0   534.111111  22.0      F   
9998     9998             2      1       495.0   523.222222  67.0      F   
9999     9999             7      1       508.0   475.888889  38.0      F   

        industry  user_id_matched  signup_month_matched  treat_matched  \
0     E-comme

Matching can also be run separately within each group defined by the `GroupingRole` column by setting **group_match** to `True`. The results are then reported per group.

In [17]:
test = Matching(group_match=True)
result = test.execute(data)


In [18]:
result.resume

     E-commerce CI Lower  E-commerce CI Upper  E-commerce Effect Size  \
ATC            98.815121           105.567979              102.191550   
ATE            81.227356            85.244009               83.235682   
ATT            61.390855            66.570894               63.980875   

     E-commerce P-value  E-commerce Standard Error  Logistics CI Lower  \
ATC                 0.0                   1.722668           92.626791   
ATE                 0.0                   1.024656           77.135410   
ATT                 0.0                   1.321439           58.272601   

     Logistics CI Upper  Logistics Effect Size  Logistics P-value  \
ATC          100.959770              96.793280                0.0   
ATE           83.523810              80.329610                0.0   
ATT           68.269456              63.271029                0.0   

     Logistics Standard Error  
ATC                  2.125760  
ATE                  1.629694  
ATT                  2.550218  

In [19]:
result.indexes

      indexes
0        5490
1        1897
2        3222
3        3987
4         539
...       ...
9995     5517
9996     7731
9997     4237
9998     1908
9999     1756

[10000 rows x 1 columns]

In [20]:
result.full_data

      user_id  signup_month  treat  pre_spends  post_spends   age gender  \
0           0             0      0       488.0   414.444444  26.0      M   
1           1             8      1       512.5   462.222222  26.0      M   
2           2             7      1       483.0   479.444444  25.0      M   
3           3             0      0       501.5   424.333333  39.0      M   
4           4             1      1       543.0   514.555556  18.0      F   
...       ...           ...    ...         ...          ...   ...    ...   
9995     9995            10      1       538.5   450.444444  42.0      M   
9996     9996             0      0       500.5   430.888889  26.0      F   
9997     9997             3      1       473.0   534.111111  22.0      F   
9998     9998             2      1       495.0   523.222222  67.0      F   
9999     9999             7      1       508.0   475.888889  38.0      F   

        industry  user_id_matched  signup_month_matched  treat_matched  \
0     E-comme