The performance gap between KCI_UInd and KCI_CInd under a similar setting #56

@cogito233

This issue is based on the code in pull request #55.

Here is a weird problem with the performance gap between KCI_UInd and KCI_CInd. Intuitively, testing $X\bot Y$ and testing $X\bot Y|Z=1$ (where Z is a constant) should perform similarly, or the latter test (which uses KCI_CInd) should perform somewhat worse because it handles a more general case. However, when I ran the code, the result was not what I expected.
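To make the intuition above concrete, here is a minimal NumPy-only sketch. It is not causal-learn code: `rbf_gram` and `center` are hypothetical helpers written for illustration, and the kernel width is an arbitrary assumption. It shows that the centered Gram matrix of a constant column is numerically zero, so in a kernel view a constant conditioning variable carries no information. It says nothing about how KCI_CInd preprocesses its inputs, which may be where the gap comes from.

```python
import numpy as np


def rbf_gram(v, width=1.0):
    """Gaussian (RBF) Gram matrix of the samples in v (hypothetical helper)."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    sq_dist = np.sum((v[:, None, :] - v[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * width ** 2))


def center(K):
    """Double-center a Gram matrix: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


n = 50
Z_const = np.ones(n)               # the constant conditioning variable
Kz = center(rbf_gram(Z_const))     # Gram matrix is all ones before centering
max_abs_centered_kz = float(np.abs(Kz).max())
print(max_abs_centered_kz)         # zero up to floating-point rounding
```

If the centered kernel of Z vanishes, any kernel regression on Z has nothing to remove from X or Y, which is why one would expect the conditional test to behave like the unconditional one here.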

image

I tested the code on a random collider dataset, in which $X\bot Z$ holds while X and Y are dependent ($X\not\bot Y$); I also visualized the test statistic, mean, and variance for easier debugging. The results show similar p-values for $X\bot Z$ and $X\bot Y$, but different p-values for $X\bot Z | 1$ and $X\bot Y | 1$.

Here is my test code:

from icecream import ic
from causallearn.utils.cit import CIT
from tqdm import trange
import numpy as np


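def generate_single_sample(kind, dim):
    """Draw one sample as a flat list: X, Y, Z (each `dim` values), then a constant 1."""
    if kind == 'chain':
        # X -> Y -> Z
        X = np.random.random(dim)
        Y = np.random.random(dim) + X
        Z = np.random.random(dim) + Y
    elif kind == 'collider':
        # X -> Y <- Z
        X = np.random.random(dim)
        Z = np.random.random(dim)
        Y = np.random.random(dim) + X + Z
    else:
        raise ValueError(f"unknown graph type: {kind!r}")
    # Y = np.zeros(dim) + np.average(Y)
    # For dim=10 this gives 31 columns: X: 0..9, Y: 10..19, Z: 20..29, constant 1: 30
    return list(X) + list(Y) + list(Z) + [1]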

def generate_dataset(dim, size):
    dataset = []
    for i in range(size):
        datapoint = generate_single_sample('collider', dim)
        dataset.append(datapoint)
    dataset = np.array(dataset)
    return dataset


if __name__ == '__main__':
    dataset = generate_dataset(10, 1000)
    cit_tester = CIT(dataset, method='kci')
    # ic(cit_tester.kci(0, 20, []))
    # The original version cannot pass the next test because feature 30 has the same value in every sample
    # ic(cit_tester.kci(0, 20, [30]))
    # The following comes from one of my recent requirements: using CIT to test high-dimensional variables.
    # Testing high-dimensional variables is not supported by the current CIT class, contrary to the documentation,
    # so I also implemented this in the last commit.
    # I will raise a separate issue about "CIT for high-dimensional variables" later.
    ic(cit_tester.kci(range(10), range(20,30), range(10,20)))
    ic(cit_tester.kci(range(10), range(20,30), []))
    ic(cit_tester.kci(range(10), range(10,20), []))
    ic(cit_tester.kci(range(10), range(20,30), [30]))
    ic(cit_tester.kci(range(10), range(10,20), [30]))
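For anyone reproducing this, the collider data can also be generated in one vectorized step. The sketch below uses only NumPy; `generate_collider_dataset` and its `seed` parameter are hypothetical additions (not part of the test code or of causal-learn), included so that repeated runs see the same data.

```python
import numpy as np


def generate_collider_dataset(dim, size, seed=None):
    """Vectorized collider data X -> Y <- Z plus a constant column (hypothetical helper).

    Column layout matches the test code: X, then Y, then Z (each `dim` wide),
    followed by one column of ones.
    """
    rng = np.random.default_rng(seed)
    X = rng.random((size, dim))
    Z = rng.random((size, dim))
    Y = rng.random((size, dim)) + X + Z
    const = np.ones((size, 1))
    return np.hstack([X, Y, Z, const])


dataset = generate_collider_dataset(10, 1000, seed=0)
print(dataset.shape)
```

Fixing the seed makes it easier to compare the p-values of the conditional and unconditional tests across runs, since the gap is then not confounded by sampling noise.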
