# How to match? 

## 0. Import libraries 

In [1]:
import pandas as pd
import warnings
import numpy as np
from lightautoml.addons.matcher import Matcher

warnings.filterwarnings('ignore')
%config Completer.use_jedi = False

## 1. Create or upload your dataset  
In this example we create a random dataset with a known effect size.  
If you have your own dataset, skip to part 2.


In [2]:
# Simulating dataset with known effect size
num_users = 10000
num_months = 12

signup_months = np.random.choice(np.arange(1, num_months), num_users) * np.random.randint(0,2, size=num_users) # signup_months == 0 means customer did not sign up
df = pd.DataFrame({
    'user_id': np.repeat(np.arange(num_users), num_months),
    'signup_month': np.repeat(signup_months, num_months), # signup month == 0 means customer did not sign up
    'month': np.tile(np.arange(1, num_months+1), num_users), # months are from 1 to 12
    'spend': np.random.poisson(500, num_users*num_months)  # Poisson-distributed spend, centered at 500
})
# A customer is in the treatment group if and only if they signed up
df["treat"] = df["signup_month"]>0
# Simulating an effect of month (monotonically decreasing--customers buy less later in the year)
df["spend"] = df["spend"] - df["month"]*10
# Simulating a simple treatment effect of 100
after_signup = (df["signup_month"] < df["month"]) & (df["treat"])
df.loc[after_signup, "spend"] += 100
df.head()

Unnamed: 0,user_id,signup_month,month,spend,treat
0,0,0,1,489,False
1,0,0,2,469,False
2,0,0,3,492,False
3,0,0,4,500,False
4,0,0,5,437,False


In [3]:
# Setting the signup month (for ease of analysis)
i = 3
df_i_signupmonth = (
    df[df.signup_month.isin([0, i])]
    .groupby(["user_id", "signup_month", "treat"])
    .apply(
        lambda x: pd.Series(
            {
                "pre_spends": x.loc[x.month < i, "spend"].mean(),
                "post_spends": x.loc[x.month > i, "spend"].mean(),
            }
        )
    )
    .reset_index()
)
df_i_signupmonth

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends
0,0,0,False,479.0,422.888889
1,1,0,False,487.5,422.222222
2,5,3,True,500.0,525.333333
3,6,0,False,499.5,414.777778
4,7,0,False,522.5,424.000000
...,...,...,...,...,...
5427,9985,0,False,487.0,432.333333
5428,9986,0,False,481.5,416.555556
5429,9987,0,False,480.0,428.777778
5430,9988,0,False,492.0,412.555556


In [4]:
# Additional category features
gender = np.random.choice(a=[0,1], size=df_i_signupmonth.user_id.nunique())
age = np.random.choice(a=range(18, 70), size=df_i_signupmonth.user_id.nunique())
industry = np.random.choice(a=range(1, 3), size=df_i_signupmonth.user_id.nunique())
df_i_signupmonth['age'] = age
df_i_signupmonth['is_male'] =  gender
df_i_signupmonth['industry'] =  industry
df_i_signupmonth['industry'] = df_i_signupmonth['industry'].astype('str')
df_i_signupmonth['treat'] = df_i_signupmonth['treat'].astype(int)
df_i_signupmonth.head()

Unnamed: 0,user_id,signup_month,treat,pre_spends,post_spends,age,is_male,industry
0,0,0,0,479.0,422.888889,27,1,1
1,1,0,0,487.5,422.222222,60,1,2
2,5,3,1,500.0,525.333333,59,0,2
3,6,0,0,499.5,414.777778,65,0,2
4,7,0,0,522.5,424.0,20,0,2


In [5]:
df_i_signupmonth.columns

Index(['user_id', 'signup_month', 'treat', 'pre_spends', 'post_spends', 'age',
       'is_male', 'industry'],
      dtype='object')

## 2. Matching  
### 2.0 Init params
info_col lists informative columns, such as user_id, that should not take part in matching.  
These columns are still kept in the output table, so you can compare matched rows directly after the computation.

In [6]:
info_col = ['user_id', 'signup_month']

outcome = 'post_spends'
treatment = 'treat'
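
To make the role of info_col concrete, the sketch below lists which columns remain as candidate matching features once the informational, outcome, and treatment columns are set aside. It assumes that Matcher treats every remaining column as a feature, which is an inference from this notebook's outputs rather than documented behavior; `candidate_features` is a hypothetical name.

```python
# Column names as printed by df_i_signupmonth.columns
columns = ['user_id', 'signup_month', 'treat', 'pre_spends',
           'post_spends', 'age', 'is_male', 'industry']

info_col = ['user_id', 'signup_month']
outcome = 'post_spends'
treatment = 'treat'

# Assumption: anything that is not informational, outcome, or treatment
# is a candidate matching feature.
excluded = set(info_col) | {outcome, treatment}
candidate_features = [c for c in columns if c not in excluded]
print(candidate_features)
```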

### 2.1 Simple matching
This is the simplest way to set up matching and compute its metrics.  
Use it when you understand every feature and have no additional conditions, such as strict equality on certain features.

In [7]:
# Standard model with base parameters
model = Matcher(df=df_i_signupmonth, outcome=outcome, treatment=treatment, info_col=info_col)
results, quality_results = model.estimate()

[04.04.2023 11:16:15 | matcher | INFO]: Applying matching
[04.04.2023 11:16:15 | Faiss matcher | INFO]: Calculated the number of times each subject has appeared as a match: [28. 11. 44.  0. 12.  8.  0. 22. 26. 12.  0.  8.  1.  4. 54.  3.  0.  0.
  2.  0.  0.  0. 26. 36.  0.  0. 12. 23.  3. 24.  2. 12.  0.  0. 15.  1.
  0. 12.  0.  0.  8.  0.  0. 24.  4. 13.  0. 11.  2. 28.  5.  1.  6. 21.
  4.  0.  1.  0.  0.  0.  0. 33. 23. 18. 32. 12.  0. 15.  2.  0. 13.  0.
  2. 20.  5. 20. 17.  0. 77.  8.  9.  8. 28.  8.  1. 17.  0. 14. 13.  3.
 17.  0. 31.  5.  0. 18.  4. 32. 30. 21.  0.  0.  4.  0.  0.  6.  4. 81.
 22. 28.  0.  5.  7.  0.  4.  0. 28.  2. 25.  1.  7. 44.  0. 13.  5. 11.
  8.  2.  1.  0. 26.  3.  2.  9. 18.  1.  2.  2.  1. 10.  0.  8.  1.  8.
  4. 26.  2. 16. 14.  2.  9.  4.  5.  0. 32.  3. 32.  0. 20.  5. 20. 21.
 46. 39. 37. 43.  9. 17.  8.  0. 11. 14. 15.  0. 16.  0. 22.  5.  1.  0.
  7. 37.  0. 19. 44.  1.  1. 14.  0.  7.  4. 10. 17.  0.  3. 40.  0.  5.
  0.  2. 10.  4.  4.  4.

Finding index
Finding index


[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 8e-06
[04.04.2023 11:16:15 | psi_pandas | INFO]: PSI values: 0.0
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.0
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.00055
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.000566
[04.04.2023 11:16:15 | psi_pandas | INFO]: PSI values: 0.0
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.0
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.005259
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.012908
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.000114
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.003524
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.000914
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.005065
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.001335
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.00418
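
The log mentions a Faiss matcher, which suggests that pairs are found by nearest-neighbor search over the features. As a rough illustration of that idea only, here is a brute-force one-nearest-neighbor match on a single covariate, using toy data with a built-in effect of 100. This is a sketch under those assumptions, not the library's implementation, and it applies no bias correction.

```python
import random

random.seed(0)

# Toy data: one covariate (pre-period spend) and an outcome with a
# known treatment effect of 100, mirroring the simulated dataset.
units = []
for _ in range(400):
    pre = random.gauss(500, 20)
    treated = random.random() < 0.3
    post = pre - 70 + random.gauss(0, 5) + (100 if treated else 0)
    units.append({"pre": pre, "treated": treated, "post": post})

treated_units = [u for u in units if u["treated"]]
control_units = [u for u in units if not u["treated"]]

def nearest(unit, pool):
    """Return the pool member whose covariate is closest to unit's."""
    return min(pool, key=lambda other: abs(other["pre"] - unit["pre"]))

# ATT: compare each treated unit with its nearest control.
# Controls can be reused (matching with replacement).
diffs = [u["post"] - nearest(u, control_units)["post"] for u in treated_units]
att = sum(diffs) / len(diffs)
print(round(att, 1))
```

Because controls can be reused, some of them serve as a match many times, which is what the "number of times each subject has appeared as a match" log line reports.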


### 2.2 Matching with a fixed variable  
Use this when you have categorical feature(s) that must match exactly.  
group_col specifies the column used for strict comparison.  
In our case there is only one such feature.  
If there are several, combine them into a single column and use that.

In [8]:
# Column compared by strict equality during matching

group_col = "industry"
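
When several categorical attributes must all match exactly, one way to merge them into a single group column is to concatenate their values into one key. The sketch below uses a small hypothetical frame; the `region` column and the `industry_region` name are invented for illustration and are not part of this dataset.

```python
import pandas as pd

# Hypothetical frame with two categorical attributes that must both
# match exactly; "region" is an invented column.
demo = pd.DataFrame({
    "industry": ["1", "2", "1", "2"],
    "region": ["north", "north", "south", "south"],
})

# Concatenate the attributes into one key and use it as the group column
demo["industry_region"] = demo["industry"] + "_" + demo["region"]
combined_group_col = "industry_region"
print(demo[combined_group_col].tolist())
```

Both columns need to be strings for the concatenation to work, which is one reason to cast such columns with `astype('str')` first.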

In [9]:
model = Matcher(df=df_i_signupmonth, outcome=outcome, treatment=treatment, 
                info_col=info_col, group_col=group_col)
results, quality_results = model.estimate()

[04.04.2023 11:16:15 | matcher | INFO]: Applying matching


Finding index
Finding index
Finding index
Finding index


[04.04.2023 11:16:15 | Faiss matcher | INFO]: Calculated the number of times each subject has appeared as a match: [ 8.  5.  0.  0.  0. 21.  0. 25. 29. 35.  0.  0.  0.  0. 43.  0.  0.  0.
  8.  0. 35.  0.  8. 32. 34.  8.  0. 44.  0.  0.  2.  0.  5. 14.  7.  6.
 13.  0.  0.  0.  0. 26.  0.  7.  0. 50.  0. 24. 21.  0.  0.  5. 25.  0.
  0.  0.  8.  1. 16.  0. 18.  1. 10. 40. 12. 16.  0.  0. 36.  5.  1.  4.
  5. 26.  8.  0.  0. 10. 12.  0. 24. 13. 22.  3.  0.  0.  7.  1.  0. 25.
 18.  0. 46.  0. 17. 35.  0.  0.  6.  0.  0.  0.  1.  9. 32. 32.  0. 16.
  0.  0. 32. 32.  0.  7. 19.  5.  2. 26.  1.  0. 11.  3.  7.  7. 50. 12.
 23.  0. 19.  1. 45. 42.  0.  5.  0.  3. 13.  0.  9.  0. 13.  3. 23.  1.
 18. 15. 36.  2. 27.  8. 22.  0. 19. 17.  0. 32.  0.  0.  0.  2. 21.  0.
  1. 20. 47. 30. 10.  0.  0.  0.  7.  4.  0. 36.  0. 30.  0.  0.  0.  3.
  3.  0. 24.  7.  0.  0.  0. 16.  1. 13.  0. 20.  0. 30. 30. 23.  7. 50.
  0.  0.  2. 19. 19. 39.  0. 23. 10.  0. 46.  1.  5. 18.  0. 12. 12.  0.
  0. 22. 

[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.015847
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.000346
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.000176
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.013067
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.012106
[04.04.2023 11:16:15 | psi_pandas | INFO]: sub_psi value is 0.00399
[04.04.2023 11:16:15 | psi_pandas | INFO]: PSI values: 0.06
[04.04.2023 11:16:15 | metrics | INFO]: Fraction of duplicated indexes: 0.87
[04.04.2023 11:16:15 | metrics | INFO]: Fraction of duplicated indexes: 0.06
[04.04.2023 11:16:15 | Faiss matcher | INFO]: PSI info: 
                 column  anomaly_score check_result                 column  \
0      is_male_treated           0.02           OK      is_male_untreated   
1   pre_spends_treated           0.10           OK   pre_spends_untreated   
2     industry_treated           0.00           OK     industry_untreated   
3  post

### 2.3 Matching when you don't know which features to use  
Use this method when you want to select the most important features and match on them. 

In [10]:
model = Matcher(df=df_i_signupmonth, outcome=outcome, treatment=treatment, 
                info_col=info_col, group_col=group_col)

In [11]:
features_importance = model.lama_feature_select()
features_importance

[04.04.2023 11:16:15 | matcher | INFO]: Counting feature importance
[04.04.2023 11:16:15 | lama_feature_selector | INFO]: Getting feature scores


Unnamed: 0,Feature,Importance
0,pre_spends,266275.948853
1,age,265787.909546
2,is_male,28719.850342


In [12]:
features = features_importance['Feature'].to_list()

In [13]:
# Either variant below works; the simplest is to pass a list of the feature names to match on

#results, quality_results = model.estimate(features=features_importance[:3])
results, quality_results = model.estimate(features=features[:3])

[04.04.2023 11:16:20 | matcher | INFO]: Applying matching
[04.04.2023 11:16:20 | Faiss matcher | INFO]: Calculated the number of times each subject has appeared as a match: [ 6.  3.  0.  0.  0. 20.  0. 24. 31. 26.  6.  0.  0.  0. 62.  0.  0.  0.
  4.  0. 18.  0. 13. 38. 39. 14.  0. 36.  0.  0.  1.  0.  5. 14. 12.  5.
 13.  0.  0.  0.  0. 25.  0. 10.  0. 44.  0. 23. 22.  0.  0.  5. 24.  0.
  0.  0.  5.  3. 24.  0. 16.  0. 15. 40.  6. 19.  0.  0. 48. 10.  0.  3.
  5. 28.  4.  0.  0.  9. 16.  0. 15. 15. 20.  4.  0.  0.  1.  1.  0. 31.
 26.  0. 24.  0. 21. 47.  0.  2.  2.  0.  0.  0.  0.  7. 39. 30.  0. 14.
  0.  0. 32. 25.  0. 11. 16.  5.  1. 22.  2.  0.  6.  4.  7.  6. 58.  9.
 27.  0. 30.  1. 36. 33.  0. 18.  0.  6. 15.  0.  5.  0.  4.  6. 18.  1.
 17. 24. 40.  1. 16.  4. 21.  0. 33. 16.  0. 38.  1.  0.  0.  3. 28.  0.
  0. 27. 41. 28.  6.  0.  0.  0.  5. 15.  0. 42.  0. 31.  3.  4.  0.  4.
  4.  0. 22.  3.  0.  0.  0. 23.  4. 13.  0. 22.  0. 19. 20. 27.  7. 52.
  0.  0.  4. 19. 20. 26.

Finding index
Finding index
Finding index
Finding index


[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.002669
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.004241
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.003019
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.010741
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.025663
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.005057
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.00402
[04.04.2023 11:16:20 | psi_pandas | INFO]: PSI values: 0.12
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.0
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.001288
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.001417
[04.04.2023 11:16:20 | psi_pandas | INFO]: PSI values: 0.0
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 9.209419
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is 0.686663
[04.04.2023 11:16:20 | psi_pandas | INFO]: sub_psi value is

In [14]:
model.matcher.df_matched

Unnamed: 0,index,user_id,signup_month,pre_spends,age,is_male,industry,index_matched,user_id_matched,signup_month_matched,pre_spends_matched,age_matched,is_male_matched,industry_matched,post_spends,post_spends_matched,post_spends_matched_bias,treat,treat_matched
0,3058,5594,3,453.5,22,0,1,5056,9293,0,457.0,68,1,1,422.888889,520.222222,520.008062,0,1
1,3096,5654,3,483.0,26,0,1,2310,4249,0,463.5,54,0,1,426.444444,529.111111,529.325338,0,1
2,3006,5487,3,479.5,28,0,1,2856,5208,0,465.5,56,0,1,412.888889,513.888889,513.928434,0,1
3,3030,5524,3,480.5,46,0,1,2695,4932,0,484.5,59,0,1,439.444444,521.777778,521.952170,0,1
4,3048,5567,3,476.0,27,0,1,4415,8087,0,465.0,58,0,1,422.111111,511.222222,510.952318,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5427,3237,5948,0,516.5,25,0,2,2803,5118,3,494.5,68,1,2,525.444444,412.888889,412.925532,1,0
5428,3241,5953,0,497.0,19,0,2,3221,5925,3,486.5,69,1,2,521.111111,415.888889,415.876600,1,0
5429,1143,2124,0,514.0,20,1,2,1838,3363,3,520.5,50,1,2,518.444444,425.777778,425.746865,1,0
5430,1152,2140,0,458.0,21,0,2,1375,2533,3,469.0,68,0,2,518.555556,417.333333,417.331503,1,0


## 3. Results  
### 3.1 ATE, ATT, ATC

In [15]:
# model.matcher.results
results

Unnamed: 0,effect_size,std_err,p-val,ci_lower,ci_upper
ATE,99.548293,0.776481,0.0,98.02639,101.070196
ATC,99.524498,0.815659,0.0,97.925806,101.123189
ATT,99.804269,0.759612,0.0,98.31543,101.293109
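
The interval bounds in this table are consistent with a normal approximation, effect_size ± 1.96 × std_err. That reading is inferred from the numbers, not taken from the library's documentation. The arithmetic can be checked directly:

```python
# Values copied from the results table
results = {
    "ATE": (99.548293, 0.776481),
    "ATC": (99.524498, 0.815659),
    "ATT": (99.804269, 0.759612),
}

Z_95 = 1.96  # two-sided 95% normal quantile (assumed)

intervals = {}
for name, (effect, std_err) in results.items():
    lower = effect - Z_95 * std_err
    upper = effect + Z_95 * std_err
    intervals[name] = (lower, upper)
    print(f"{name}: [{lower:.4f}, {upper:.4f}]")
```

All three intervals contain the simulated effect of 100.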


### 3.2 SMD, PSI, KS-test, repeats

In [16]:
# matching quality results - full dictionary (PSI, KS-test, SMD, ...)
model.quality_result

{'psi':                 column  anomaly_score check_result                 column  \
 0   pre_spends_treated           0.07           OK   pre_spends_untreated   
 1          age_treated           0.12           OK          age_untreated   
 2      is_male_treated           0.00           OK      is_male_untreated   
 3  post_spends_treated          16.11          NOK  post_spends_untreated   
 
    anomaly_score check_result  
 0           0.05           OK  
 1           0.05           OK  
 2           0.00           OK  
 3           8.25          NOK  ,
 'ks_test':             match_control_to_treat  match_treat_to_control
 pre_spends                0.030223            5.174888e-12
 age                       0.020454            3.134317e-07
 is_male                   0.997735            8.044524e-01,
 'smd':             match_control_to_treat  match_treat_to_control
 pre_spends                0.145243                0.052805
 age                       0.165797                0.019
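
For orientation, a common definition of the standardized mean difference (SMD) divides the difference in group means by a pooled standard deviation. The sketch below uses that textbook form on made-up ages; the exact formula used by the library is not shown in this notebook, so treat this as an assumption.

```python
from math import sqrt
from statistics import mean, variance

def smd(treated, control):
    """Absolute difference in means, scaled by the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return abs(mean(treated) - mean(control)) / pooled_sd

# Made-up ages for two small, similar groups
treated_age = [25, 31, 40, 52, 61]
control_age = [27, 30, 41, 50, 63]
smd_age = smd(treated_age, control_age)
print(round(smd_age, 3))
```

A common rule of thumb treats values below about 0.1 as well balanced.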

In [17]:
# matching quality result - PSI
model.quality_result['psi']

Unnamed: 0,column,anomaly_score,check_result,column.1,anomaly_score.1,check_result.1
0,pre_spends_treated,0.07,OK,pre_spends_untreated,0.05,OK
1,age_treated,0.12,OK,age_untreated,0.05,OK
2,is_male_treated,0.0,OK,is_male_untreated,0.0,OK
3,post_spends_treated,16.11,NOK,post_spends_untreated,8.25,NOK
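
The anomaly_score column reports a population stability index (PSI). As a reference point, the standard PSI sums (a - e) · ln(a / e) over bins, where e and a are the shares of the two samples in each bin. The sketch below implements that definition on made-up ages. The bin edges are arbitrary, and the library's own binning and OK/NOK thresholds are not shown here, so its numbers may differ.

```python
from bisect import bisect_right
from math import log

def bin_shares(values, edges):
    """Share of values per bin; values are assumed to lie within the edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Half-open bins [edges[i], edges[i+1]); the top edge goes to the last bin
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    # A small floor avoids log(0) for empty bins
    return [max(c / len(values), 1e-6) for c in counts]

def psi(expected, actual, edges):
    """Population stability index between two samples."""
    e_shares = bin_shares(expected, edges)
    a_shares = bin_shares(actual, edges)
    return sum((a - e) * log(a / e) for e, a in zip(e_shares, a_shares))

edges = [18, 30, 45, 60, 70]
treated_age = [19, 25, 33, 41, 48, 52, 63, 66]
control_age = [22, 28, 29, 35, 47, 58, 61, 69]
psi_age = psi(treated_age, control_age, edges)
print(round(psi_age, 4))
```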


In [18]:
# matching quality result - KS-test

model.quality_result['ks_test']

Unnamed: 0,match_control_to_treat,match_treat_to_control
pre_spends,0.030223,5.174888e-12
age,0.020454,3.134317e-07
is_male,0.997735,0.8044524


In [19]:
# matching quality result - repeats
model.quality_result['repeats']

{'match_control_to_treat': 0.85, 'match_treat_to_control': 0.06}

### 3.3 Validation
Validate the result with a placebo treatment.  
With the placebo, the estimated effect should be close to zero; here it is about -0.47, compared with roughly 100 for the real treatment.

In [20]:
 
model.validate_result()
 

[04.04.2023 11:16:20 | matcher | INFO]: Applying validation of result
[04.04.2023 11:16:20 | Faiss matcher | INFO]: Calculated the number of times each subject has appeared as a match: [ 13.  55.   1.   0.   0.   0.   0.   0.  11.   2.   0.  24.  75.   0.
   0.  33.   2.   2.  21.   0.   0.  52.   1.   0.   0.   0.   0.   0.
 110.   3.   0.   0.   0.   0.   0.   4.   0.   3.   0.  14.  11.   0.
   0.   0.   2.   0.   0.   0.   0.   0.   0.   0.  19.  14.  18.   0.
   0.   4.  24.  55.   0.   2.   0.  33.  18.   7.   0.  15.   0.   0.
  12.   0.   3.  37.  55.   0.   0.   0.   0.  63.   0.   0.  40.   4.
   0.  29.   0.   0.  72.   0.   0.  11.   0.   0.  52.   0.   0.  29.
  97.  17.   9.   0.   0.   0.   0.  43.  15.  18.  44.   0.   0.   2.
   0.   0.  38.  20.   0.   6.  22.   0.  62.   0.   0.   0.   0.   0.
   1.   0.   0.   0.  32.   2.  32.   5.   4.   0.  14.   0.   3.   0.
   0.   0.   0.   8.   0.   0.   0.  19.   0.   0.   0.   0.   1.  54.
   0.   7.   0.   0.   0.   0.  23

Finding index
Finding index
Finding index
Finding index
Finding index
Finding index
Finding index
Finding index


[04.04.2023 11:16:21 | Faiss matcher | INFO]: Calculated the number of times each subject has appeared as a match: [ 8.  1. 19. 19. 33. 20. 19. 10.  2.  0.  6.  3.  4. 19.  8.  0.  6.  6.
  8.  0.  3. 52. 30.  7. 24.  8.  3.  2.  3.  0. 11. 14. 10.  4.  3.  4.
  7.  5. 14.  4.  6.  2.  6. 14.  3. 11. 27. 11.  5. 13.  0.  3.  4.  0.
  2.  0.  0.  0. 15.  0.  1. 15. 15. 23. 20. 22. 33. 11. 21. 15. 11. 10.
 15. 18.  0.  1.  5.  3.  9.  5.  8.  0. 15. 21.  0.  7.  9. 26. 15. 27.
  9.  0.  8.  5.  8.  0. 20. 30. 13.  6.  2.  6. 17. 12. 31. 25.  2. 21.
  9.  0. 25. 34.  2.  7. 18. 10. 22. 12. 14. 21.  6.  2.  5. 10.  3.  4.
 30.  9. 17.  7. 17.  3.  9. 14. 12.  0.  9. 13.  5.  7. 19.  2. 16.  8.
 44. 11.  4.  9. 22. 13. 15. 31. 18.  2.  8. 19.  8. 13.  1.  0.  5.  0.
  2. 27. 12.  7. 18. 14. 11.  7.  4.  0.  0.  3. 26.  2. 36.  5.  0. 12.
  5.  6. 22. 11.  6. 14. 64. 14. 15.  4. 15.  6. 19.  5.  0. 12.  1.  8.
  9.  0. 14.  8. 17.  0. 77. 19.  8.  5. 11.  6.  8.  1.  3. 21. 14.  6.
  0. 22. 

Finding index
Finding index
Finding index
Finding index


[04.04.2023 11:16:21 | Faiss matcher | INFO]: Calculated the number of times each subject has appeared as a match: [11. 18. 26.  5.  0. 13.  8.  6. 17. 12. 22.  8. 12.  7. 34. 44.  4.  3.
 52. 11. 19.  6.  8.  1. 20. 14.  0.  3.  1. 14.  7.  4. 10. 27. 14.  6.
 25.  3. 39.  6.  0. 13. 11.  6. 14.  0.  8.  4.  8.  3.  0. 14.  0.  7.
 17. 10. 12.  0. 16. 31. 10. 16. 27.  0. 10.  2.  9. 22.  8.  0. 16.  0.
  1. 27.  5.  0. 11.  2. 26.  3. 20.  8.  0. 33.  4. 25.  0. 12. 13.  3.
 13. 13. 18. 11.  2.  1.  5.  9.  4.  0.  0. 42.  2.  2. 52. 22. 14.  0.
 17. 40.  5.  6. 31.  0. 14. 11.  6. 18. 30.  2.  0.  6. 35.  3.  3. 44.
  6. 21. 10. 36. 28. 20. 12.  2.  7.  0. 48.  2. 34. 21.  0.  6.  7.  3.
  2. 11.  0.  0.  0. 41. 18. 19.  0.  2. 15. 16.  4.  5.  1.  3.  5. 30.
 51. 19. 12.  4.  0.  8.  1. 17.  0. 13.  0. 22.  0.  4.  0. 11. 38.  8.
 24. 19. 11.  3. 32.  1.  0. 12. 30. 19. 35.  0.  9.  0. 31.  1.  3.  6.
  0. 28. 19. 25.  6.  0. 26.  0.  8.  5. 13. 12.  8.  8.  2.  5.  2.  5.
  1.  0. 

Finding index
Finding index
Finding index
Finding index
Finding index
Finding index
Finding index
Finding index


[04.04.2023 11:16:21 | Faiss matcher | INFO]: Calculated the number of times each subject has appeared as a match: [ 53.   0.  45.   5.  37.  10.   0.  42.  22.   1.   4.   0.   0.   0.
  10.   9.  16.   7.  49.  24.   0.   0.   0.   0.  27.   0.  41.   3.
  31.  17. 142.   7.  24.   0.   6.   0.   0.   1.   0.   0.  60.   0.
   0.  25.   4.  41.   0.   9.   0.   0.   0.  14.   0.  15.  19.   0.
   0.   4.  15.  31.   0.   0.   0.   0.  17.   0.   2.   0.  24.  29.
   0.  26.  32.   0.   0.   0.   0.  21.   0.   0.  28.  39.   0.   0.
   0.   0.  17.   0.   0.   0.   0.  16.   0.   0.   0.   3.   0.   0.
   0.   0.  10.  23.   0.  22.   0.  31.   0.   5.  40.  22.  20.   0.
   0.  47.   0.   0.   0.   0.   0.   0.   0.  13.   0.   5.   8.  23.
   0.   0.   0.   0.   0.   4.   0.   0.   0.   0.   1.   0.  40.   4.
  14.   0.  72.   0.   0.  21.   0.   0.  41.  52.   6.   0.   0.  29.
   0.  60.   8.   0.   0.   5.  36.   1.   0.  32.   0.  71.  13.  10.
   0.  48.  48.   0.   0.   0.   

Finding index
Finding index
Finding index
Finding index
Finding index
Finding index
Finding index
Finding index


[04.04.2023 11:16:21 | Faiss matcher | INFO]: Calculated the number of times each subject has appeared as a match: [  0.  10.  26.  25.  32.   6.   5.  27.   6.   9.  23.   9.   4.   0.
   9.   8.   7.  23.   3.  19.   0.   7.   6.   5.   0.   5.  29.   3.
  24.  29.   0.  15.   9.   2.   7.  18.  45.  30.   0.  11.   8.  13.
   5.  29.   8.  22.  26.  11.   0.   0.  22.   1.   1.  68.   0.  30.
  16.   0.   5.   5.  32.   8.  13.   7.  12.  15.   0.   0.  30.  12.
   8.  11.   0.  31.  10.   6.   0.  11.   3.   2.   6.   3.  10.  15.
  23.   0.   5.  10.   3.  12.   0.  20.  14.  25.   0.  10.  10.  12.
  11.  20.  23.   6.  25.  16.  24.  48.   6.  22.   0.  19.  26.  18.
  19.  16.  27.   3.   5.  13.  35.   2.   3.   7.  24.  11.  23.   6.
   7.   4.   7.   0.   6.  42.   0.   0.   8.   8.  12.  22.  14.   6.
  10.  37.   9.   5.   1.  15.  20.   5.   3.   9.   9.   0.  14.   0.
  25.  14.  34.   1.  17.  65.   8.   8.   3.   8.   0.   4.  10.   1.
   8.  23.   1.  16.  28.  23.   

Finding index
Finding index
Finding index
Finding index
Finding index
Finding index
Finding index
Finding index


[04.04.2023 11:16:22 | Faiss matcher | INFO]: Calculated the number of times each subject has appeared as a match: [13.  2.  1.  4.  5. 16. 20.  5.  4. 29.  3. 24.  5. 19. 19. 11.  7.  5.
 12. 10. 24. 21. 10. 12. 39.  7. 22. 23.  1. 14.  6. 12. 19.  3.  3.  8.
 12. 10.  7.  1. 14. 29. 14. 14. 12. 12.  6.  6.  4.  2. 20. 12. 12. 14.
  6. 22.  8. 12. 10.  9. 18. 15. 13.  0.  7.  2.  7. 24. 13. 23. 13. 24.
  1.  2.  8. 20. 12. 19.  5. 22. 14. 26. 12. 16.  9.  7.  7. 12. 24. 13.
 22.  7. 13. 12.  1. 13.  3. 11. 37. 10.  5. 10. 21. 16.  2. 12.  0. 11.
  5. 13. 15. 12. 14. 22.  6.  7.  5. 10. 13.  2. 17.  9. 12.  8. 24. 10.
 17. 10.  0. 14.  6.  2.  1.  2. 25. 14. 11. 25. 17.  5.  2.  8.  3.  0.
  8.  9. 15.  9. 10.  3. 18.  9. 20. 24.  2. 26.  4.  9.  6.  4. 12.  8.
  3. 13. 19. 27.  6.  2.  4. 21. 17.  0. 12. 13. 15.  3.  4. 22. 17. 34.
 18.  3. 14. 17. 18. 15. 12. 10.  7.  6. 14.  8.  2.  3.  3.  3. 12.  7.
  0.  9.  5.  3. 13. 12. 14. 22. 23. 14.  8.  2. 11. 30. 30. 19. 17. 11.
 15.  2. 

Finding index
Finding index
Finding index
Finding index


{'post_spends': -0.4746583435226654}
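
The idea behind a placebo check can be sketched without the library: shuffle the treatment labels so they carry no information about the outcome, re-estimate the effect, and confirm that it collapses toward zero. The example below uses toy data and a plain difference in means instead of matching, so it illustrates the principle only. How `validate_result` works internally is not described in this notebook.

```python
import random
from statistics import mean

random.seed(42)

# Toy outcomes with a true effect of 100 for treated units
n = 2000
treat = [random.random() < 0.5 for _ in range(n)]
outcome = [420 + random.gauss(0, 10) + (100 if t else 0) for t in treat]

def diff_in_means(labels, values):
    """Mean outcome of labelled units minus mean outcome of the rest."""
    on = [v for flag, v in zip(labels, values) if flag]
    off = [v for flag, v in zip(labels, values) if not flag]
    return mean(on) - mean(off)

real_effect = diff_in_means(treat, outcome)

# Placebo: shuffled labels are unrelated to the outcome by construction
placebo_effects = []
for _ in range(20):
    placebo = treat[:]
    random.shuffle(placebo)
    placebo_effects.append(diff_in_means(placebo, outcome))
placebo_effect = mean(placebo_effects)

print(round(real_effect, 1), round(placebo_effect, 1))
```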

In [21]:
model.matcher.df_matched

Unnamed: 0,index,user_id,signup_month,pre_spends,age,is_male,industry,index_matched,user_id_matched,signup_month_matched,pre_spends_matched,age_matched,is_male_matched,industry_matched,post_spends,post_spends_matched,post_spends_matched_bias,treat,treat_matched
0,3022,5514,0,461.0,54,0,1,5358,9842,0,503.0,32,0,1,422.888889,438.333333,438.491543,0,1
1,3041,5549,0,474.0,67,1,1,4231,7763,0,496.0,66,0,1,426.444444,426.666667,426.502520,0,1
2,3106,5673,0,471.5,36,0,1,127,232,0,487.5,24,0,1,527.444444,425.777778,428.563040,0,1
3,3114,5686,0,471.0,62,1,1,3361,6165,0,491.0,61,0,1,412.888889,419.777778,419.609104,0,1
4,3137,5739,0,491.5,42,0,1,1047,1925,0,493.0,31,1,1,439.444444,427.222222,428.337132,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5427,3237,5948,0,516.5,25,0,2,2009,3691,0,467.0,39,0,2,417.777778,413.555556,413.412597,1,0
5428,3241,5953,0,497.0,19,0,2,598,1102,0,467.5,26,0,2,429.111111,434.000000,433.299426,1,0
5429,1143,2124,0,514.0,20,1,2,1057,1947,0,496.0,46,0,2,440.777778,424.111111,423.899203,1,0
5430,1152,2140,0,458.0,21,0,2,2309,4248,0,476.5,19,1,2,424.888889,516.333333,516.855355,1,0
