Classifier Conditional Independence Test: A CI test that uses a binary classifier (XGBoost) for CI testing
This is an implementation of the paper: https://arxiv.org/abs/1709.06138
Please cite the above paper if this package is used in any publication.
Usage for pip install
    pip install CCIT==0.3
or
    sudo -H pip install CCIT==0.3
2(a). Now in your python script:
    from CCIT import CCIT
    from CCIT import DataGen

    pvalue = CCIT.CCIT(X, Y, Z)  # without bootstrap
    # with 30 bootstrap iterations and 20 threads in parallel
    pvalue = CCIT.CCIT(X, Y, Z, num_iter=30, bootstrap=True, nthread=20)
2(b). If you want to test using the included DataGen module:
    from CCIT import CCIT
    from CCIT import DataGen

    # non-CI dataset, pvalue should be low
    data = DataGen.generate_samples_cos(dx=1, dy=1, dz=20, sType='NI')
    X = data[:, 0:1]
    Y = data[:, 1:2]
    Z = data[:, 2:]
    pvalue = CCIT.CCIT(X, Y, Z)  # without bootstrap
    # with 30 bootstrap iterations and 20 threads in parallel
    pvalue = CCIT.CCIT(X, Y, Z, num_iter=30, bootstrap=True, nthread=20)
For best performance, we suggest normalizing each column of the data, either by standard normalization or by rescaling all values in each column to the range [0,1].
Note that when Z is None, CCIT produces a pvalue for the independence test between X and Y.
It is recommended to rescale all columns of the data by their standard deviation.
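The two normalization options mentioned above can be sketched with plain NumPy. The helper names `standardize_columns` and `minmax_columns` are hypothetical and not part of the CCIT package. Treating constant columns as unscaled is an assumption made here to avoid division by zero.

```python
import numpy as np

def standardize_columns(data):
    """Rescale each column to zero mean and unit standard deviation."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant columns unscaled
    return (data - mean) / std

def minmax_columns(data):
    """Rescale each column to the range [0, 1]."""
    data = np.asarray(data, dtype=float)
    low = data.min(axis=0)
    span = data.max(axis=0) - low
    span[span == 0] = 1.0  # assumption: leave constant columns unscaled
    return (data - low) / span

# Example with a small 3 x 2 table
Z = np.array([[1.5, 2.5], [0.5, 0.6], [1.2, 3.6]])
Z_std = standardize_columns(Z)
Z_01 = minmax_columns(Z)
```

Either result can then be passed to the test in place of the raw table.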
Usage for pip install from github repo
clone the repo.
pip install .
(Optional) from the root directory of the package, run the command
This is a comprehensive test and may take some time to run.
- Now in your python script:
    from CCIT import CCIT

    pvalue = CCIT.CCIT(X, Y, Z)
There may be some trouble in installing the xgboost dependency. In that case it is recommended to follow the steps in https://github.com/dmlc/xgboost/blob/master/python-package/build_trouble_shooting.md for installing xgboost first. Then install CCIT from pip.
Main function to generate the pval of the CI test. If pval is low, CI is rejected; if it is high, we fail to reject CI.
X: Input X table
Y: Input Y table
Z: Input Z table. If None, the test reverts to an independence test between X and Y.
Optional Arguments:
    max_depths: e.g. [6,10,13]. List of tree depths for xgboost, used for tuning.
    n_estimators: e.g. [100,200,300]. List of numbers of estimators for xgboost, used for tuning.
    colsample_bytrees: e.g. [0.8] (recommended). List of colsample_bytree values for xgboost, used for tuning.
    nfold: n-fold cross validation.
    feature_selection: default 0 recommended.
    train_samp: -1 recommended. Number of examples out of the total to be used for training.
    threshold: default recommended.
    num_iter: Number of bootstrap iterations. Default 20. Recommended 30.
    nthread: Number of parallel threads for running XGB. Recommended: the number of cores in the CPU. Default 8.
    bootstrap: True or False. If False, num_iter is set to 1 and one deterministic pval is output without averaging. If True, results are averaged over num_iter bootstraps and can have randomness; num_iter in this case has to be >= 20.
Output: pvalue of the test.
If the dimensions of X, Y, and Z are 1, 1, 2 respectively, and the first three i.i.d. samples are as follows:
    |  X  |  Y  |    Z    |
    | 1.0 | 1.0 | 1.5 2.5 |
    | 0.5 | 1.2 | 0.5 0.6 |
    | 0.1 | 4.5 | 1.2 3.6 |
then the input is:
    import numpy as np
    from CCIT import CCIT

    X = np.array([[1.0], [0.5], [0.1]])
    Y = np.array([[1.0], [1.2], [4.5]])
    Z = np.array([[1.5, 2.5], [0.5, 0.6], [1.2, 3.6]])
    pval = CCIT.CCIT(X, Y, Z)
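Once a pvalue is returned, the decision rule from the function description (a low pvalue rejects CI) might be applied as in the sketch below. The helper `ci_decision` and the significance level `alpha=0.05` are assumptions for illustration only. The package does not prescribe a particular cutoff here.

```python
def ci_decision(pvalue, alpha=0.05):
    """Interpret a CI-test pvalue at significance level alpha (hypothetical helper)."""
    if pvalue < alpha:
        return "reject CI"
    return "fail to reject CI"

low = ci_decision(0.01)   # a low pvalue
high = ci_decision(0.60)  # a high pvalue
print(low)
print(high)
```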
Generate test and train sets for converting CI testing into binary classification.
Arguments:
    X_in: Samples of r.v. X (np.array)
    Y_in: Samples of r.v. Y (np.array)
    Z_in: Samples of r.v. Z (np.array)
    train_len: Length of the training set; must be less than the number of samples
    k: k-nearest neighbor to be used. Always set k = 1.
Output:
    Xtrain: Features for training the classifier
    Ytrain: Train labels
    Xtest: Features for the test set
    Ytest: Test labels
    CI_data: Developer use only
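To illustrate how CI testing can become a binary classification problem, here is a minimal NumPy sketch of a nearest-neighbor bootstrap in the spirit of the cited paper. It is not the package's implementation: the function name `nn_bootstrap_sketch`, the three-way split, and the brute-force distance computation are assumptions chosen for brevity. Original samples get label 1; samples whose Y is swapped with the Y of their nearest neighbor in Z (k = 1) get label 0.

```python
import numpy as np

def nn_bootstrap_sketch(X, Y, Z, seed=0):
    """Build a labeled data set from samples of (X, Y, Z) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0] // 3
    perm = rng.permutation(X.shape[0])
    keep, target, donor = perm[:m], perm[m:2 * m], perm[2 * m:3 * m]

    # Label 1: untouched samples from the joint distribution of (X, Y, Z)
    original = np.hstack([X[keep], Y[keep], Z[keep]])

    # Label 0: for each target row, find its nearest donor row in Z (k = 1)
    diffs = Z[target][:, None, :] - Z[donor][None, :, :]
    nearest = np.argmin((diffs ** 2).sum(axis=2), axis=1)
    # Keep X and Z of the target row, but take Y from the nearest donor
    swapped = np.hstack([X[target], Y[donor][nearest], Z[target]])

    features = np.vstack([original, swapped])
    labels = np.concatenate([np.ones(m), np.zeros(m)])
    return features, labels

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 1))
Y = rng.standard_normal((30, 1))
Z = rng.standard_normal((30, 2))
features, labels = nn_bootstrap_sketch(X, Y, Z)
```

If X and Y are conditionally independent given Z, the two classes should be hard to tell apart, so a classifier trained on `features` and `labels` should perform close to chance.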
Generate CI, I, or NI post-nonlinear samples:
1. Z is independent Gaussian.
2. X = cos(<a,Z> + b + noise) and Y = cos(<c,Z> + d + noise) in the CI case.
Arguments:
    size: Number of samples
    sType: CI, I, or NI
    dx: Dimension of X
    dy: Dimension of Y
    dz: Dimension of Z
    nstd: Noise standard deviation
    freq: Frequency of the cosine function
Output:
    allsamples: Complete data-set
Note that:
    X = first dx coordinates of allsamples (each row is an i.i.d. sample)
    Y = [dx:dx+dy] coordinates of allsamples
    Z = [dx+dy:dx+dy+dz] coordinates of allsamples
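The CI case of this generative model can be sketched in NumPy for dx = dy = 1. This is an illustration under stated assumptions, not the DataGen code: the function name `cos_ci_samples_sketch`, the way `a`, `b`, `c`, `d` are drawn, and the placement of `freq` inside the cosine are all guesses made for the example.

```python
import numpy as np

def cos_ci_samples_sketch(size=1000, dz=20, nstd=0.5, freq=1.0, seed=0):
    """Draw samples where X and Y are CI given Z (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((size, dz))  # Z is independent Gaussian

    # Assumption: random unit-norm directions a, c and random offsets b, d
    a = rng.standard_normal(dz)
    a /= np.linalg.norm(a)
    c = rng.standard_normal(dz)
    c /= np.linalg.norm(c)
    b, d = rng.uniform(0.0, 2.0 * np.pi, size=2)

    # Independent noise terms make X and Y depend on each other only through Z
    X = np.cos(freq * (Z @ a + b + nstd * rng.standard_normal(size)))
    Y = np.cos(freq * (Z @ c + d + nstd * rng.standard_normal(size)))

    # Column layout: [X | Y | Z], matching the slicing convention noted above
    return np.column_stack([X, Y, Z])

data = cos_ci_samples_sketch(size=200, dz=20)
X = data[:, 0:1]
Y = data[:, 1:2]
Z = data[:, 2:]
```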
Function to create several data-sets of post-nonlinear cos transforms, half of which are CI and half of which are NI, along with the correct labels. The data-sets are stored under a given folder path. The path must already exist; for example, create the folder ../data/dim20 first.
Arguments:
    nsamples: Number of i.i.d. samples in each data-set
    dx, dy, dz: Dimensions of X, Y, Z
    nstd: Noise standard deviation
    freq: Frequency of the cos function
    filetype: Path to filenames. If filetype = '../data/dim20/datafile', then the files are stored in '.npy' format in folder './dim20' and are named datafile0_20.npy ..... datafile50_20.npy
    num_data: Number of data files
    num_proc: Number of processes to run in parallel
Output:
    num_data datafiles stored in the given folder.
    A datafile.npy file that contains an array holding the correct labels. If the first label is '1' then 'datafile20_0.npy' contains a 'CI' dataset.