Code for reproducing the key results in the paper "CCMI: Classifier based Conditional Mutual Information Estimation" by Sudipto Mukherjee, Himanshu Asnani and Sreeram Kannan. If you use the code, please cite our paper. The code can be used for mutual information estimation, conditional mutual information estimation, and conditional independence testing.
The code has been tested with the following package versions:
- Python 3.6.5
- TensorFlow 1.11.0
- xgboost 0.80 (optional; needed only to run the CCIT baseline for conditional independence testing)
First, cd to the folder containing the CCMI code and start Python:
$ cd CIT
$ python
Then run CCMI as shown in the example below (the same snippet can also be saved and run as a Python script):
>> from CCMI import CCMI
>> import numpy as np
>> X = np.random.randn(5000, 1)
>> Y = np.random.randn(5000, 1)
>> Z = np.random.randn(5000, 1)
>> model_indp = CCMI(X, Y, Z, tester = 'Classifier', metric = 'donsker_varadhan', num_boot_iter = 10, h_dim = 64, max_ep = 20)
>> cmi_est_indp = model_indp.get_cmi_est()
>> print(cmi_est_indp)
-0.0003
>> Y = 0.5*X + np.random.normal(loc = 0, scale = 0.2, size = (X.shape))
>> model_dep = CCMI(X, Y, Z, tester = 'Classifier', metric = 'donsker_varadhan', num_boot_iter = 10, h_dim = 64, max_ep = 20)
>> cmi_est_dep = model_dep.get_cmi_est()
>> print(cmi_est_dep)
0.9707
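As a sanity check on the dependent case above, the ground-truth value can be worked out in closed form. Because Z is drawn independently of (X, Y), I(X;Y|Z) = I(X;Y), and for Y = a*X + noise with independent Gaussian X and noise, I(X;Y) = 0.5 * log(1 + a^2 * var(X) / var(noise)). The sketch below assumes the estimate is reported in nats (natural logarithm), which this README does not state; the function name gaussian_channel_mi is illustrative and not part of the CCMI code.

```python
import math

def gaussian_channel_mi(a, var_x, var_noise):
    """I(X;Y) in nats for Y = a*X + noise, with X and noise independent Gaussians."""
    snr = (a ** 2) * var_x / var_noise
    return 0.5 * math.log(1.0 + snr)

# Values from the example above: X ~ N(0, 1), Y = 0.5*X + N(0, 0.2**2).
true_cmi = gaussian_channel_mi(a=0.5, var_x=1.0, var_noise=0.2 ** 2)
print(round(true_cmi, 4))  # 0.9905
```

Under that assumption, the estimate of 0.9707 printed above is close to the analytic value, and in the independent case the true CMI is 0.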
./data/gen_cmi_data.py - Contains several categories of synthetic data generators with known ground-truth CMI values. In all models X and Y are 1-dimensional, while the dimension of Z can scale. Model-I in the paper corresponds to 'Category F' and Model-II to 'Category G'. To generate data from a particular category (say category F) with a given dimension (say dz = 20) and number of samples (say N = 5000, which the command below passes as --num_th 5), run the following from inside the 'data' folder:
$ PYTHONPATH='..' python gen_cmi_data.py --cat F --num_th 5 --dz 20
(Note: PYTHONPATH='..' is required because NPEET code is in the parent folder, but gen_cmi_data.py is run from ./data/)
For ease of use, we have provided a bash script './data/gen_synthetic_data_bash.sh' which will generate all the data-sets used for linear and non-linear CMI estimation experiments in the paper. So, alternatively, to generate all data-sets used in the paper, just run
$ ./gen_synthetic_data_bash.sh
(Note: Due to random functions used to simulate data and different random seeds, the exact values of true and estimated CMI for generated data-sets will be different from those in the paper.)
To run the Generator+Classifier estimators, first cd to CMI_Est and then run:
$ cd CMI_Est
$ python main_CMI_Est.py --mimic cgan --tester Classifier --metric donsker_varadhan --cat F --num_th 5 --dz 20
Similarly, for the other generators (cvae and knn):
$ python main_CMI_Est.py --mimic cvae --tester Classifier --metric donsker_varadhan --cat F --num_th 5 --dz 20
$ python main_CMI_Est.py --mimic knn --tester Classifier --metric donsker_varadhan --cat F --num_th 5 --dz 20
For difference-based CMI estimates, run the following (Classifier-MI and f-MINE, respectively):
$ python main_CMI_Est.py --mimic mi_diff --tester Classifier --metric donsker_varadhan --cat F --num_th 5 --dz 20
$ python main_CMI_Est.py --mimic mi_diff --tester Neural --metric f_divergence --cat F --num_th 5 --dz 20
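The five invocations above differ only in the --mimic, --tester and --metric flags, which a small helper can make explicit when scripting runs. This is a sketch and not part of the repository: the dictionary labels and the function name build_cmd are illustrative, and only flag values that appear in the commands above are used.

```python
# Flag combinations copied from the commands above; the labels are illustrative.
ESTIMATORS = {
    "cgan_classifier": ("cgan", "Classifier", "donsker_varadhan"),
    "cvae_classifier": ("cvae", "Classifier", "donsker_varadhan"),
    "knn_classifier": ("knn", "Classifier", "donsker_varadhan"),
    "classifier_mi": ("mi_diff", "Classifier", "donsker_varadhan"),
    "f_mine": ("mi_diff", "Neural", "f_divergence"),
}

def build_cmd(name, cat="F", num_th=5, dz=20):
    """Return the argument list for one estimator run (nothing is executed)."""
    mimic, tester, metric = ESTIMATORS[name]
    return [
        "python", "main_CMI_Est.py",
        "--mimic", mimic,
        "--tester", tester,
        "--metric", metric,
        "--cat", cat,
        "--num_th", str(num_th),
        "--dz", str(dz),
    ]

# Print the commands so they can be checked against the ones listed above.
for name in ESTIMATORS:
    print(" ".join(build_cmd(name)))
```

To execute a run, each list could be passed to subprocess.run(cmd, check=True) from inside the CMI_Est folder.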
For ease of use, we have provided bash scripts './CMI_Est/run_<estimator>_mimic.sh', each of which runs the corresponding estimator on all linear and non-linear CMI estimation experiments in the paper. For example, to obtain estimates from CGAN+Classifier, run the following:
$ ./run_cgan_mimic.sh
Similarly, use run_cvae_mimic.sh, run_knn_mimic.sh, run_mi_diff_mimic.sh, run_mi_diff_mimic_neural.sh and run_ksg_baseline.sh for the other estimators.
(Note: Make sure to first create the data-sets using './data/gen_synthetic_data_bash.sh' before running the estimation scripts.)
To generate the post-nonlinear cosine data-sets, do the following from the repository root:
$ cd data
$ python gen_cit_postNonLin_data.py
To run CCMI for conditional independence testing on synthetic data, run the following from the repository root:
$ cd CIT
$ python main_CCMI_postNonLin.py
For the real flow-cytometry data-sets:
$ python main_CCMI_flowCyto.py
For comparison with the state-of-the-art CI tester (CCIT), we have also provided code to run it on the synthetic and real data-sets:
$ python main_CCIT_postNonLin.py --dz 1
Run it similarly for the other dimensions, with --dz in {5, 10, 20, 50, 100}.
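The sweep over dimensions can also be scripted. A minimal sketch, assuming it is run from inside the CIT folder and that main_CCIT_postNonLin.py accepts each listed --dz value in the same way as the command above; the names DIMS, ccit_commands and run_all are illustrative and not part of the repository. The commands are only printed here; call run_all() to actually execute them.

```python
import subprocess

# Dimensions of Z listed in this README.
DIMS = [1, 5, 10, 20, 50, 100]

def ccit_commands(dims=DIMS):
    """Build one argument list per dimension (nothing is executed)."""
    return [["python", "main_CCIT_postNonLin.py", "--dz", str(dz)] for dz in dims]

def run_all(dims=DIMS):
    """Run the CCIT baseline once per dimension, stopping at the first failure."""
    for cmd in ccit_commands(dims):
        subprocess.run(cmd, check=True)

# Dry run: show the commands that run_all() would execute.
for cmd in ccit_commands():
    print(" ".join(cmd))
```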
For the real flow-cytometry data-sets:
$ python main_CCIT_flowCyto.py