
Fix NaN problem while normalizing the data #55

Merged · 8 commits merged on Jul 12, 2022

Conversation

cogito233 (Contributor):

There is a problem when data_x/data_y/data_z all have the same value, e.g. $Z = [1,1,1,\cdots,1]$: the result of stats.zscore(data) is NaN, which causes an error in the subsequent eigh step.
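A minimal standalone reproduction of the failure (assuming only numpy and scipy are installed):

    import numpy as np
    from scipy import stats

    # A constant column, e.g. Z = [1, 1, ..., 1]
    z = np.ones(5)

    # zscore divides by the standard deviation, which is 0 here,
    # so every entry becomes NaN.
    print(stats.zscore(z))  # -> [nan nan nan nan nan]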

@tofuwen (Contributor) left a comment:

Thanks for the great work! We really appreciate your help in making causal-learn better!

  1. Could you please add a test plan section to your PR describing what you did to make sure the code does what's expected? It can be as simple as showing the output of the old code (which failed) and confirming that the new code works after your changes.

  2. I suggested some super-nit changes; our code base follows a standard code style (https://google.github.io/styleguide/pyguide.html).

  3. I am not an expert in KCI, so I will let the owner review whether the changes make sense logically. :)

@@ -145,8 +145,16 @@ def kernel_matrix(self, data_x, data_y):
else:
raise Exception('Undefined kernel function')

data_x = stats.zscore(data_x, axis=0)
data_y = stats.zscore(data_y, axis=0)
if (np.var(data_x)==0):

Suggested change
if (np.var(data_x)==0):
if np.var(data_x) == 0:

causallearn/utils/KCI/KCI.py (outdated, resolved)
else:
data_x = stats.zscore(data_x, axis=0)

if (np.var(data_y)==0):

Suggested change
if (np.var(data_y)==0):
if np.var(data_y) == 0:

causallearn/utils/KCI/KCI.py (outdated, resolved)
data_x = stats.zscore(data_x, axis=0)
data_y = stats.zscore(data_y, axis=0)
data_z = stats.zscore(data_z, axis=0)
if (np.var(data_x)==0):

Same nit in this file.

cogito233 and others added 3 commits July 11, 2022 12:41
@cogito233 (Contributor, Author):

Here is the code to test this commit. Should I put this code into the folder causal-learn/tests?

By the way, there is also a problem in the cit class: the [origin document](https://causal-learn.readthedocs.io/en/latest/independence_tests_index/kci.html) says X, Y, and condition_set are column indices of data, but in the cit class X and Y can only be a single int variable. So I also wrote some code to make it support multiple variables without changing the original usage (see the last commit).

from icecream import ic
from causallearn.utils.cit import CIT
import numpy as np


def generate_single_sample(graph_type, dim):
    if graph_type == 'chain':
        # X -> Y -> Z
        X = np.random.random(dim)
        Y = np.random.random(dim) + X
        Z = np.random.random(dim) + Y
    elif graph_type == 'collider':
        # X -> Y <- Z
        X = np.random.random(dim)
        Z = np.random.random(dim)
        Y = np.random.random(dim) + X + Z
    # 3*dim+1 columns; with dim=10: X: 0..9, Y: 10..19, Z: 20..29, constant 1: 30
    return list(X) + list(Y) + list(Z) + [1]


def generate_dataset(dim, size):
    dataset = [generate_single_sample('collider', dim) for _ in range(size)]
    return np.array(dataset)


if __name__ == '__main__':
    dataset = generate_dataset(10, 1000)
    cit_tester = CIT(dataset, method='kci')
    ic(cit_tester.kci(0, 20, []))

    # The original version cannot pass this test because feature 30 is constant.
    ic(cit_tester.kci(0, 20, [30]))

    # The following comes from one of my recent requirements: using CIT to test
    # high-dimensional variables. This is not supported by the current cit class,
    # contrary to the documentation, so I also implemented it in the last commit.
    # I will raise a related issue about high-dimensional CIT later.
    ic(cit_tester.kci(range(10), range(20, 30), range(10, 20)))
    ic(cit_tester.kci(range(10), range(20, 30), []))
    ic(cit_tester.kci(range(10), range(10, 20), []))
    ic(cit_tester.kci(range(10), range(20, 30), [30]))
    ic(cit_tester.kci(range(10), range(10, 20), [30]))

if len(condition_set) == 0:
return self.kci_ui.compute_pvalue(self.data[:, [X]], self.data[:, [Y]])[0]
return self.kci_ci.compute_pvalue(self.data[:, [X]], self.data[:, [Y]], self.data[:, list(condition_set)])[0]
return self.kci_ui.compute_pvalue(self.data[:, X], self.data[:, Y])[0]
Contributor:

Great, thanks!

QQ:

  1. What's the reason the old code only supports one variable? Is there some special design? @MarkDana

Collaborator:

@tofuwen There is no special design in the old code for supporting only one variable. I just aligned it with the other tests (used in constraint-based methods) and forgot that KCI can take multivariate unconditional variables.

I was just about to fix it, so thanks so much for your work @cogito233!

Collaborator:

Also, for the cache key in https://github.com/cmu-phil/causal-learn/blob/ffe75f95c4003fa7e9d7d5f3bbec4ace90ed3a41/causallearn/utils/cit.py#L339, we'll need to handle X < Y for iterable X, Y, and also use frozenset(i) as the hashable key.
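One hypothetical way to build such a key (illustrative only; the function name and exact shape are not the actual cit.py code): normalize int / iterable X, Y into frozensets and order the pair canonically so the key is symmetric in X and Y.

    def cache_key(X, Y, condition_set):
        # Illustrative sketch, not the real cit.py implementation.
        # Ints and iterables are both normalized to frozensets (hashable),
        # and (Xs, Ys) is ordered canonically so key(X, Y) == key(Y, X).
        Xs = frozenset([X]) if isinstance(X, int) else frozenset(X)
        Ys = frozenset([Y]) if isinstance(Y, int) else frozenset(Y)
        Xs, Ys = sorted((Xs, Ys), key=sorted)
        return (Xs, Ys, frozenset(condition_set))

    print(cache_key(0, 20, [30]) == cache_key(20, 0, [30]))  # -> True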

Contributor (Author):

> For https://github.com/cmu-phil/causal-learn/blob/ffe75f95c4003fa7e9d7d5f3bbec4ace90ed3a41/causallearn/utils/cit.py#L338, we'll need to handle the case for iterable X, Y.

I have no idea how to change the assert for the kernel CIT case (where the whole or part of X is in the conditional set). The other problems are already fixed.

Maybe $X \bot Y | X$ is also a valid expression?

Collaborator:

Wow, cool. You're so productive! Thanks for all of this!

I see your point. This line was intended as a correctness check for constraint-based methods.

As for the CI test itself, is X;Y|X valid? I don't actually know, but the result is expected to always be "independent" (consider X|X as a degenerate constant).

How about we enforce X not in condition_set for X: int, and len(set(condition_set).intersection(X)) == 0 for X: Iterable?
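An illustrative sketch of that check (not the merged code; the helper name is made up):

    def assert_disjoint(X, condition_set):
        # Illustrative sketch: X (int or iterable of ints) must not
        # overlap the conditioning set.
        if isinstance(X, int):
            assert X not in condition_set
        else:
            assert len(set(condition_set).intersection(X)) == 0

    assert_disjoint(0, [30])          # passes
    assert_disjoint(range(10), [30])  # passes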

Contributor (Author):

Done.

Comment on lines 62 to 63
if condition_set == None:
condition_set = []

Comment on lines +64 to +65
if type(X) == int:
X = [X]
Collaborator:

Better with an elif to ensure X is some iterable.

And then X = list(X); otherwise self.data[:, X] does not support X being, e.g., a set, or a tuple with only one element.
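A small standalone check of why list(X) helps with NumPy column indexing (illustrative, not project code):

    import numpy as np

    data = np.arange(12).reshape(3, 4)

    # A set is not a valid NumPy index on its own; converting to a list
    # first makes column selection work uniformly for sets, tuples,
    # ranges, etc.
    for X in [{2}, (2,), range(2, 3)]:
        cols = data[:, list(X)]
        print(cols.shape)  # -> (3, 1) each time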

@MarkDana (Collaborator):

Hi @cogito233, big thanks for your work. About enabling CIT to support the multivariate case (for KCI), I just added some comments. Would you be available to address them? Or I could handle them after your PR is merged :)

Comment on lines 148 to 151
if np.var(data_x) == 0:
data_x -= np.average(data_x)
else:
data_x = stats.zscore(data_x, axis=0)
@MarkDana (Collaborator) commented on Jul 12, 2022:

  • The current np.var(data_x) does not support multi-dimensional data_x where some dims are constant (while others are not), or where all dims are constant but with different values:
In [7]: data_x = np.random.randn(100, 1)

In [8]: data_y = np.ones_like(data_x) # constant

In [9]: data_xy = np.hstack([data_x, data_y])

In [10]: stats.zscore(data_xy, axis=0)
Out[10]: 
array([[ 4.15513535e-01,             nan],
       [-1.71903423e+00,             nan],
       [ 7.59493517e-01,             nan],
       [-1.34182046e+00,             nan],
       ...
  • For numpy floating point, how can we reliably identify a constant array? var == 0 is not enough:
In [16]: arr1 = np.array([1., 1., 1.])

In [17]: np.var(arr1)
Out[17]: 0.0

In [18]: stats.zscore(arr1)
Out[18]: array([nan, nan, nan])

#########

In [19]: arr2 = np.array([-0.087, -0.087, -0.087])

In [20]: np.var(arr2)
Out[20]: 1.925929944387236e-34

In [21]: np.var(arr2) == 0
Out[21]: False

In [22]: stats.zscore(arr2)
Out[22]: array([nan, nan, nan]) # though np.var != 0, here it still runs to stats.zscore and returns nan
  • Based on the above, can we just mask NaN values to zero after stats.zscore?
data_x = stats.zscore(data_x, axis=0)
data_x[np.isnan(data_x)] = 0.

This operation is safe: since we would (expect to) check that any raw data_x (before KCI) contains no NaN values, a NaN in the normalized array can only come from a constant column (not from original NaN values).
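A standalone check of the masking idea on the mixed constant/non-constant case above (assuming only numpy and scipy):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = np.hstack([rng.standard_normal((100, 1)),  # non-constant column
                      np.ones((100, 1))])             # constant column

    z = stats.zscore(data, axis=0)
    z[np.isnan(z)] = 0.  # constant columns become all-zero instead of NaN

    print(np.isnan(z).any())     # -> False
    print((z[:, 1] == 0).all())  # -> True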

Contributor:

> This operation is safe: since we would (expect to) check that any raw data_x (before KCI) contains no NaN values, a NaN in the normalized array can only come from a constant column (not from original NaN values).

Do we ensure this in code, i.e. ensure raw data_x doesn't contain NaN values?

Collaborator:

> > This operation is safe: since we would (expect to) check that any raw data_x (before KCI) contains no NaN values, a NaN in the normalized array can only come from a constant column (not from original NaN values).
>
> Do we ensure this in code, i.e. ensure raw data_x doesn't contain NaN values?

Not yet. I'll do this soon after this PR is merged.

Collaborator:

Thanks so much for your awesome contributions, @cogito233, @MarkDana and @tofuwen! @cogito233, please let us know if you think the current PR is ready to go, so we can solve the remaining issues in a new PR. Or we could include them in this PR if you would like to. :-)

Contributor (Author):

The problem is already fixed. I guess we can merge now~

Many thanks to all of you (@kunwuz, @MarkDana, @tofuwen) for your help~ This is my first time contributing to a community codebase and I lack experience.

Collaborator:

Thanks so much! @tofuwen @MarkDana When you have time, could you please make a final check on the current PR and let me know when it's ready to be merged? Many thanks!

@tofuwen (Contributor) commented on Jul 12, 2022:

Thanks @cogito233 for your great work! And congrats on your first PR improving a public codebase! We really appreciate your help in making causal-learn better! :)

I'll defer to @MarkDana for the final review. I guess @MarkDana will then push another PR based on this one, right?

@@ -59,9 +60,18 @@ def _unique(column):
}

def kci(self, X, Y, condition_set):
Contributor:

@MarkDana

Currently X and Y here can be int / Iterable? This doesn't sound like a good design; if possible, we'd better make every variable type-checked.

Why not enforce X and Y here to be a list of ints? That's the most general option, right? Then we wouldn't need those lines just for type-checking; it's usually a good idea to keep code concise.

But I think this can be changed in your later PR, @MarkDana

Collaborator:

Yes, let me handle this in the later PR, or in the new KCI subclass.

@MarkDana (Collaborator):

Thanks so much @cogito233 for what you did! Awesome work! You made causal-learn better.
@kunwuz I think this PR is ready to be merged. Thanks!

@kunwuz kunwuz merged commit 8badb41 into py-why:main Jul 12, 2022