Skip to content
Classifier Conditional Independence Test: A CI test that uses a binary classifier (XGBoost) for CI testing
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
CCIT python3 support added Mar 21, 2019
build/lib/CCIT CCIT==0.3 Readme added Mar 21, 2019
data DataGen files added. Function to create post-nonlinear CI or NI data. Jun 13, 2017
dist CCIT==0.3 Readme added Mar 21, 2019
.DS_Store More assertations Jul 3, 2017
.gitignore pip package 3 - alpha Jul 12, 2017
ApacheLicense2.txt more packaging Jul 3, 2017
MANIFEST.in more packaging Jul 3, 2017
README.md CCIT==0.3 Readme addedv4 Mar 21, 2019
setup.py CCIT==0.3 Readme added Mar 21, 2019

README.md

CCIT

Classifier Conditional Independence Test: A CI test that uses a binary classifier (XGBoost) for CI testing

This is an implementation of the paper: https://arxiv.org/abs/1709.06138

Please cite the above paper if this package is used in any publication.

Usage for pip install

  1. pip install CCIT==0.3 or sudo -H pip install CCIT==0.3.

2(a). Now in your python script:

from CCIT import CCIT
from CCIT import DataGen

pvalue = CCIT.CCIT(X,Y,Z)    #without bootstrap

pvalue = CCIT.CCIT(X,Y,Z,num_iter = 30, bootstrap = True, nthread = 20)  #with 30 bootstrap iterations and 20 threads in parallel. 

2(b). If you want to test using the included DataGen module:

from CCIT import CCIT
from CCIT import DataGen

data = DataGen.generate_samples_cos(dx=1,dy=1,dz=20,sType='NI')  #non-CI dataset, pvalue should be low

X = data[:,0:1]
Y = data[:,1:2]
Z = data[:,2::]

pvalue = CCIT.CCIT(X,Y,Z)    #without bootstrap

pvalue = CCIT.CCIT(X,Y,Z,num_iter = 30, bootstrap = True, nthread = 20)  #with 30 bootstrap iterations and 20 threads in parallel. 

We suggest normalizing each column of the data either standard normalization or bringing all values in each column in the range [0,1], for the best performance

Note that when Z is None , it produces a pvalue for independence test between X and Y.

It is recommended to recale all columns of the data by standard deviation

Usage for pip install from github repo

  1. clone the repo.

  2. cd CCIT

  3. pip install .

  4. (Optional) from the root directory of the package, run the command

nosetests

This is a comprehensive test and may take some time to run.

  1. Now in your python script:
from CCIT import CCIT

pvalue = CCIT.CCIT(X,Y,Z)

There may be some trouble in installing the xgboost dependency. In that case it is recommended to follow the steps in https://github.com/dmlc/xgboost/blob/master/python-package/build_trouble_shooting.md for installing xgboost first. Then install CCIT from pip.

CI Tester

Functions:

  1. CCIT()
Main function to generate pval of the CI test. If pval is low CI is rejected if its high we fail to reject CI.
        X: Input X table
        Y: Input Y table
        Z: Input Z table. If None then it reverts back to Independence test between X and Y. 
        Optional Arguments:
        max_depths : eg. [6,10,13] list of parameters for depth of tree in xgb for tuning
        n_estimators: eg. [100,200,300] list of parameters for number of estimators for xgboost for tuning
        colsample_bytrees: eg. recommended [0.8] list of parameters for colsample_bytree for xgboost for tuning
        nfold: n-fold cross validation 
        feature_selection : default 0 recommended
        train_samp: -1 recommended. Number of examples out of total to be used for training. 
        threshold: defualt recommended
        num_iter: Number of Bootstrap Iterations. Default 20. Recommended 30. 
        nthread: Number of parallel thread for running XGB. Recommended number of cores in the CPU. Default 8. 
	bootstrap : True or False. If False, then num_iter is set to 1. One deterministic pval is outputted without averaging. If True, results are averaged over num_iter bootstraps and can have randomness. num_iter in this case has to be >= 20.   
        Output: 
        pvalue of the test. 
     

tl;dr version

If the dimensions of X, Y, and Z are 1,1,2 respectively and if the first three i.i.d samples are as follows:

|  X  | Y   |    Z    |
| 1.0 | 1.0 | 1.5 2.5 |
| 0.5 | 1.2 | 0.5 0.6 |
| 0.1 | 4.5 | 1.2 3.6 |

then the input is: 
X = np.array([[1.0],[0.5],[0.1]])
Y = np.array([[1.0],[1.2],[4.5]])
Z = np.array([[1.5,2.5],[0.5,0.6],[1.2,3.6]])
pval = CCIT(X,Y,Z)
  1. CI_sampler_conditional_kNN()
Generate Test and Train set for converting CI testing into Binary Classification
    Arguments:
    	X_in: Samples of r.v. X (np.array)
    	Y_in: Samples of r.v. Y (np.array)
    	Z_in: Samples of r.v. Z (np.array)
    	train_len: length of training set, must be less than number of samples 
    	k: k-nearest neighbor to be used: Always set k = 1. 
    Output:
    	Xtrain: Features for training the classifier
    	Ytrain: Train Labels
    	Xtest: Features for test set
    	Ytest: Test Labels
    	CI_data: Developer Use only

DataGen Module

Functions:

  1. generate_samples_cos()
Generate CI,I or NI post-nonlinear samples:
    
    1. Z is independent Gaussian 
    
    2. X = cos(<a,Z> + b + noise) and Y = cos(<c,Z> + d + noise) in case of CI
    Arguments:    
        size : number of samples
        sType: CI,I, or NI
        dx: Dimension of X 
        dy: Dimension of Y 
        dz: Dimension of Z 
        nstd: noise standard deviation
        freq: Freq of cosine function
    
    Output:
    	allsamples --> complete data-set
    Note that: 	
    [X = first dx coordinates of allsamples each row is an i.i.d samples]
    [Y = [dx:dx + dy] coordinates of allsamples]
    [Z = [dx+dy:dx+dy+dz] coordinates of all samples]
  1. parallel_cos_sample_gen()
Function to create several many data-sets of post-nonlinear cos transform half of which are CI and half of which are NI, along with the correct labels. The data-sets are stored under a given folder path:

	############## The path should exist#####################
	For example create a folder ../data/dim20 first. 


	Arguments:
	nsamples: Number of i.i.d samples in each data-set
	dx, dy, dz : Dimension of X, Y, Z
	nstd: Noise Standard Deviation 
	freq: Freq. of cos function 
	filetype: Path to filenames. if filetype = '../data/dim20/datafile', then the files are stored as '.npy' format in folder './dim20' 
	and the files are named datafile0_20.npy .....datafile50_20.npy
	num_data: number of data files 
	num_proc: number of processes to run in parallel 
	
	Output:
	num_data number of datafiles stored in the given folder. 
	datafile.npy files that constains an array that has the correct label. If the first label is '1' then  'datafile20_0.npy' constains a 'CI' dataset. 
You can’t perform that action at this time.