# Homogeneity test tutorial 

A homogeneity test checks whether the groups in a sample come from the same general population, that is, whether they show no significant differences from one another. This is important for several reasons:
1. **Accuracy of results**: A homogeneous sample allows you to obtain more accurate and reliable analysis results.
2. **Correct conclusions**: If the sample is heterogeneous, the conclusions drawn from it may be incorrect or misleading.
3. **Validity of statistical tests**: Many statistical tests assume that the sample is homogeneous. Violating this assumption can lead to Type I or Type II errors.
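
To make the idea concrete, here is a minimal sketch that is independent of HypEx. It assumes NumPy and SciPy are installed and uses made-up normal distributions (the means, spread, and sample sizes are illustrative assumptions). It compares a control group against one group from the same population and one from a shifted population, using a t-test for the means and a Kolmogorov-Smirnov test for the distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Two groups drawn from the same population (homogeneous case).
control = rng.normal(loc=500.0, scale=20.0, size=5000)
test_same = rng.normal(loc=500.0, scale=20.0, size=5000)

# A group drawn from a shifted population (heterogeneous case).
test_shifted = rng.normal(loc=510.0, scale=20.0, size=5000)


def compare(a, b):
    """Return p-values of a t-test (means) and a KS test (distributions)."""
    t_p = stats.ttest_ind(a, b).pvalue
    ks_p = stats.ks_2samp(a, b).pvalue
    return t_p, ks_p


same_t_p, same_ks_p = compare(control, test_same)
shift_t_p, shift_ks_p = compare(control, test_shifted)

print(f"same population:    t-test p={same_t_p:.3g}, KS p={same_ks_p:.3g}")
print(f"shifted population: t-test p={shift_t_p:.3g}, KS p={shift_ks_p:.3g}")
```

A very small p-value for the shifted group signals that the two groups are unlikely to come from the same population, which is the situation a homogeneity test is meant to detect.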

In [1]:
from hypex.experiments.homogeneity import HOMOGENEITY_TEST
from hypex.dataset import Dataset, ExperimentData, InfoRole, TreatmentRole, TargetRole


## Creating a test dataset from synthetic data

It is important to mark the data fields by assigning the appropriate roles:
- FeatureRole: a role for columns that contain features or predictor variables. The split is based on these columns. This role is applied by default when no role is specified for a column.
- TreatmentRole: a role for columns that show the treatment or intervention.
- TargetRole: a role for columns that show the target or outcome variable.
- InfoRole: a role for columns that contain information about the data, such as user IDs. 
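
The tutorial loads its data from a CSV file whose contents are not shown in full. As a rough sketch, a table with the same column names as the printed output could be generated with pandas as follows. The sample size, distributions, treatment effect, and share of missing values are all hypothetical and do not reproduce the original file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 1000  # hypothetical sample size

treat = rng.integers(0, 2, size=n)
synthetic = pd.DataFrame(
    {
        "user_id": np.arange(n),
        "signup_month": rng.integers(0, 12, size=n),
        "treat": treat,
        "pre_spends": rng.normal(490.0, 20.0, size=n).round(1),
        # Hypothetical treatment effect added to the post-period spend.
        "post_spends": rng.normal(420.0, 20.0, size=n) + 60.0 * treat,
        "age": rng.integers(18, 70, size=n).astype(float),
        "gender": rng.choice(["M", "F"], size=n),
        "industry": rng.choice(["E-commerce", "Logistics"], size=n),
    }
)

# Blank out 5% of the ages to mimic the NaN entries seen in real data.
missing = rng.choice(n, size=n // 20, replace=False)
synthetic.loc[missing, "age"] = np.nan

print(synthetic.head())
```

Saving the frame with `synthetic.to_csv("data.csv", index=False)` should give a file with the same layout as the one the tutorial reads.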

In [2]:
data = Dataset(
    roles={
        "user_id": InfoRole(int),  # informational column (user IDs)
        "treat": TreatmentRole(int),  # treatment indicator
        "pre_spends": TargetRole(float),
        "post_spends": TargetRole(float),
        "gender": TargetRole(str),
    },
    data="data.csv",
)
data

      user_id  signup_month  treat  pre_spends  post_spends   age gender  \
0           0             0      0       488.0   414.444444   NaN      M   
1           1             8      1       512.5   462.222222  26.0    NaN   
2           2             7      1       483.0   479.444444  25.0      M   
3           3             0      0       501.5   424.333333  39.0      M   
4           4             1      1       543.0   514.555556  18.0      F   
...       ...           ...    ...         ...          ...   ...    ...   
9995     9995            10      1       538.5   450.444444  42.0      M   
9996     9996             0      0       500.5   430.888889  26.0      F   
9997     9997             3      1       473.0   534.111111  22.0      F   
9998     9998             2      1       495.0   523.222222  67.0      F   
9999     9999             7      1       508.0   475.888889  38.0      F   

        industry  
0     E-commerce  
1     E-commerce  
2      Logistics  
3     E-com

## Homogeneity test
Before running the test, we wrap the prepared dataset in ExperimentData so that experiments can be run on it.
Then we execute a pipeline. Here we use one of the pre-assembled pipelines, HOMOGENEITY_TEST. A pipeline can also be assembled from custom executors to suit your needs.

In [3]:
test = HOMOGENEITY_TEST
ed = ExperimentData(data)
result = test.execute(ed)

The results of the experiment are available directly through the analysis_tables property of ExperimentData. It holds the result of each executor: the key is the executor state ID and the value is that executor's result. This can be useful for debugging or interpretation.

In [4]:
result.analysis_tables

{'GroupSizes┴┴pre_spends':    control size  test size  control size %  test size %
 0          4936       5064           49.36        50.64,
 'GroupDifference┴┴pre_spends':    control mean   test mean  difference  difference %
 0    484.911973  489.220379    4.308406      0.888492,
 'TTest┴┴pre_spends':         p-value  statistic  pass
 0  2.315047e-30 -11.489293  True,
 'KSTest┴┴pre_spends':         p-value  statistic  pass
 0  1.559150e-13   0.077573  True,
 'GroupSizes┴┴post_spends':    control size  test size  control size %  test size %
 0          4936       5064           49.36        50.64,
 'GroupDifference┴┴post_spends':    control mean   test mean  difference  difference %
 0    420.046619  483.470664   63.424045     15.099287,
 'TTest┴┴post_spends':    p-value   statistic  pass
 0      0.0 -135.560001  True,
 'KSTest┴┴post_spends':    p-value  statistic  pass
 0      0.0     0.8959  True,
 'GroupSizes┴┴gender':    control size  test size  control size %  test size %
 0     
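
Because the keys combine the executor name and the field name, they can be split to build a compact overview. The sketch below does not call HypEx: it uses a plain dict of pandas DataFrames as a stand-in for `result.analysis_tables`, filled with two entries copied from the output above. It assumes that every key consists of three parts separated by `┴`, with an empty middle part, as in the keys shown here.

```python
import pandas as pd

# Stand-in for result.analysis_tables (values copied from the output above).
analysis_tables = {
    "TTest┴┴pre_spends": pd.DataFrame(
        {"p-value": [2.315047e-30], "statistic": [-11.489293], "pass": [True]}
    ),
    "KSTest┴┴pre_spends": pd.DataFrame(
        {"p-value": [1.559150e-13], "statistic": [0.077573], "pass": [True]}
    ),
}

rows = []
for key, table in analysis_tables.items():
    # Assumed key layout: "<executor>┴<middle part>┴<field>".
    executor, _, field = key.split("┴")
    rows.append(
        {
            "executor": executor,
            "field": field,
            "p-value": table["p-value"].iloc[0],
        }
    )

summary = pd.DataFrame(rows)
print(summary)
```

With the real object, the values may be HypEx datasets rather than pandas DataFrames, so the column access would need to be adapted accordingly.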

### Experiment results
To show a summary report of the test, we call the report method of the reporter associated with the respective test type.

In [5]:
from hypex.reporters.homo import HomoDictReporter
from hypex.reporters.homo import HomoDatasetReporter

HomoDictReporter().report(result)

{'pre_spends GroupDifference control mean 0': 484.91197325769855,
 'pre_spends GroupDifference test mean 0': 489.2203791469194,
 'pre_spends GroupDifference difference 0': 4.3084058892208645,
 'pre_spends GroupDifference difference % 0': 0.8884923711568682,
 'post_spends GroupDifference control mean 0': 420.04661894471457,
 'post_spends GroupDifference test mean 0': 483.470664384764,
 'post_spends GroupDifference difference 0': 63.42404544004944,
 'post_spends GroupDifference difference % 0': 15.099287217068902,
 'pre_spends GroupSizes control size 0': 4936,
 'pre_spends GroupSizes test size 0': 5064,
 'pre_spends GroupSizes control size % 0': 49.36,
 'pre_spends GroupSizes test size % 0': 50.63999999999999,
 'pre_spends TTest p-value 0': 2.3150474758856975e-30,
 'pre_spends TTest pass 0': 'NOT OK',
 'post_spends TTest p-value 0': 0.0,
 'post_spends TTest pass 0': 'NOT OK',
 'pre_spends KSTest p-value 0': 1.5591501637482305e-13,
 'pre_spends KSTest pass 0': 'NOT OK',
 'post_spends KSTe

In [6]:
HomoDatasetReporter().report(result)

       feature group TTest pass  TTest p-value KSTest pass  KSTest p-value  \
0   pre_spends     0     NOT OK   2.315047e-30      NOT OK    1.559150e-13   
1  post_spends     0     NOT OK   0.000000e+00      NOT OK    0.000000e+00   
2       gender     0        NaN            NaN         NaN             NaN   

  Chi2Test pass  Chi2Test p-value  
0           NaN               NaN  
1           NaN               NaN  
2            OK          0.351553  
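
In the table above, the categorical column gender is checked with a chi-square test rather than a t-test or KS test. How HypEx computes it internally is not shown here. As a generic illustration, the following sketch runs a chi-square test of independence between a treatment flag and a categorical column using SciPy, on randomly generated data (the sample size and category labels are assumptions).

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(seed=1)
n = 2000  # hypothetical sample size

df = pd.DataFrame(
    {
        "treat": rng.integers(0, 2, size=n),
        "gender": rng.choice(["M", "F"], size=n),
    }
)

# Contingency table: rows are groups, columns are categories.
table = pd.crosstab(df["treat"], df["gender"])

# Test whether the category distribution depends on the group.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

A large p-value here means there is no evidence that the category shares differ between the groups.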