
Fix NaN problem while normalizing the data #55

Merged · 8 commits merged on Jul 12, 2022

Conversation

cogito233 (Contributor):

There is a problem when data_x/data_y/data_z all have the same value, e.g. $Z = [1,1,1,\cdots,1]$: the result of stats.zscore(data) is NaN, which causes an error in the subsequent eigh step.
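A minimal standalone reproduction of the failure (assuming only numpy and scipy are installed):

    import numpy as np
    from scipy import stats

    # A constant column, e.g. Z = [1, 1, ..., 1]
    z = np.ones(5)

    # zscore divides by the standard deviation, which is 0 here,
    # so every entry becomes NaN.
    print(stats.zscore(z))  # -> [nan nan nan nan nan]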

@tofuwen (Contributor) left a comment:

Thanks for the great work! We really appreciate your help in making causal-learn better!

  1. Could you please add a test plan section to your PR describing what you did to make sure the code does what's expected? It can be as simple as showing the output of the old code (which failed) and confirming that the new code works after your changes.

  2. I suggested some super-nit changes; our code base follows a standard code style (https://google.github.io/styleguide/pyguide.html).

  3. I am not an expert in KCI, so I will let the owner review whether the changes make sense logically. :)

@@ -145,8 +145,16 @@ def kernel_matrix(self, data_x, data_y):
else:
raise Exception('Undefined kernel function')

data_x = stats.zscore(data_x, axis=0)
data_y = stats.zscore(data_y, axis=0)
if (np.var(data_x)==0):

Suggested change
if (np.var(data_x)==0):
if np.var(data_x) == 0:

causallearn/utils/KCI/KCI.py (outdated, resolved)
else:
data_x = stats.zscore(data_x, axis=0)

if (np.var(data_y)==0):

Suggested change
if (np.var(data_y)==0):
if np.var(data_y) == 0:

causallearn/utils/KCI/KCI.py (outdated, resolved)
data_x = stats.zscore(data_x, axis=0)
data_y = stats.zscore(data_y, axis=0)
data_z = stats.zscore(data_z, axis=0)
if (np.var(data_x)==0):

Same nit in this file.

cogito233 and others added 3 commits July 11, 2022 12:41
@cogito233 (Contributor, Author):

Here is the code to test this commit. Should I put this code into the folder causal-learn/tests?

By the way, there is also a problem in the cit class: the [origin document](https://causal-learn.readthedocs.io/en/latest/independence_tests_index/kci.html) says X, Y, and condition_set are column indices of data, but in the cit class X and Y can only be a single int variable. So I also wrote some code to make it support multiple variables without changing the original usage (see the last commit).

from icecream import ic
from causallearn.utils.cit import CIT
import numpy as np


def generate_single_sample(graph_type, dim):
    if graph_type == 'chain':
        # X -> Y -> Z
        X = np.random.random(dim)
        Y = np.random.random(dim) + X
        Z = np.random.random(dim) + Y
    elif graph_type == 'collider':
        # X -> Y <- Z
        X = np.random.random(dim)
        Z = np.random.random(dim)
        Y = np.random.random(dim) + X + Z
    # 3*dim+1 columns; with dim=10: X: 0..9, Y: 10..19, Z: 20..29, constant 1: 30
    return list(X) + list(Y) + list(Z) + [1]


def generate_dataset(dim, size):
    dataset = [generate_single_sample('collider', dim) for _ in range(size)]
    return np.array(dataset)


if __name__ == '__main__':
    dataset = generate_dataset(10, 1000)
    cit_tester = CIT(dataset, method='kci')
    ic(cit_tester.kci(0, 20, []))

    # The original version cannot pass this test because feature 30 is constant.
    ic(cit_tester.kci(0, 20, [30]))

    # The following comes from one of my recent requirements: using CIT to test
    # high-dimensional variables. This is not supported by the current cit class,
    # contrary to the documentation, so I also implemented it in the last commit.
    # I will raise a related issue about high-dimensional CIT later.
    ic(cit_tester.kci(range(10), range(20, 30), range(10, 20)))
    ic(cit_tester.kci(range(10), range(20, 30), []))
    ic(cit_tester.kci(range(10), range(10, 20), []))
    ic(cit_tester.kci(range(10), range(20, 30), [30]))
    ic(cit_tester.kci(range(10), range(10, 20), [30]))

if len(condition_set) == 0:
return self.kci_ui.compute_pvalue(self.data[:, [X]], self.data[:, [Y]])[0]
return self.kci_ci.compute_pvalue(self.data[:, [X]], self.data[:, [Y]], self.data[:, list(condition_set)])[0]
return self.kci_ui.compute_pvalue(self.data[:, X], self.data[:, Y])[0]
Contributor:

Great, thanks!

QQ:

  1. What's the reason the old code only supports one variable? Is there some special design? @MarkDana

Collaborator:

@tofuwen There is no special design in the old code for supporting only one variable. I just aligned it with the other tests (used in constraint-based methods) and forgot that KCI can take multivariate unconditional variables.

I was just about to fix it, so thanks so much for your work @cogito233!

Collaborator:

Also, for the cache key in https://github.com/cmu-phil/causal-learn/blob/ffe75f95c4003fa7e9d7d5f3bbec4ace90ed3a41/causallearn/utils/cit.py#L339, we'll need to handle X < Y for iterable X, Y, and also use frozenset(i) as the hashable key.
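One hypothetical way to build such a key (illustrative only; the function name and exact shape are not the actual cit.py code): normalize int / iterable X, Y into frozensets and order the pair canonically so the key is symmetric in X and Y.

    def cache_key(X, Y, condition_set):
        # Illustrative sketch, not the real cit.py implementation.
        # Ints and iterables are both normalized to frozensets (hashable),
        # and (Xs, Ys) is ordered canonically so key(X, Y) == key(Y, X).
        Xs = frozenset([X]) if isinstance(X, int) else frozenset(X)
        Ys = frozenset([Y]) if isinstance(Y, int) else frozenset(Y)
        Xs, Ys = sorted((Xs, Ys), key=sorted)
        return (Xs, Ys, frozenset(condition_set))

    print(cache_key(0, 20, [30]) == cache_key(20, 0, [30]))  # -> True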

Contributor (Author):

> For https://github.com/cmu-phil/causal-learn/blob/ffe75f95c4003fa7e9d7d5f3bbec4ace90ed3a41/causallearn/utils/cit.py#L338, we'll need to handle the case for iterable X, Y.

I have no idea how to change the assert for the kernel CIT case (where the whole or part of X is in the conditional set). The other problems are already fixed.

Maybe $X \bot Y | X$ is also a valid expression?

Collaborator:

Wow, cool. You're so productive! Thanks for all of this!

I see your point. This line was intended as a correctness check for constraint-based methods.

As for the CI test itself, is X;Y|X valid? I don't actually know, but the result is expected to always be "independent" (consider X|X as a degenerate constant).

How about we enforce X not in condition_set for X: int, and len(set(condition_set).intersection(X)) == 0 for X: Iterable?
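An illustrative sketch of that check (not the merged code; the helper name is made up):

    def assert_disjoint(X, condition_set):
        # Illustrative sketch: X (int or iterable of ints) must not
        # overlap the conditioning set.
        if isinstance(X, int):
            assert X not in condition_set
        else:
            assert len(set(condition_set).intersection(X)) == 0

    assert_disjoint(0, [30])          # passes
    assert_disjoint(range(10), [30])  # passes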

Contributor (Author):

Done.

Comment on lines 62 to 63
if condition_set == None:
condition_set = []

Comment on lines +64 to +65
if type(X) == int:
X = [X]
Collaborator:

Better with an elif to ensure X is some iterable.

And then X = list(X); otherwise self.data[:, X] does not support X being, e.g., a set, or a tuple with only one element.
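A small standalone check of why list(X) helps with NumPy column indexing (illustrative, not project code):

    import numpy as np

    data = np.arange(12).reshape(3, 4)

    # A set is not a valid NumPy index on its own; converting to a list
    # first makes column selection work uniformly for sets, tuples,
    # ranges, etc.
    for X in [{2}, (2,), range(2, 3)]:
        cols = data[:, list(X)]
        print(cols.shape)  # -> (3, 1) each time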

@MarkDana (Collaborator):

Hi @cogito233, big thanks for your work. About enabling CIT to support the multivariate case (for KCI), I just added some comments. Would you be available to address them? Or I could handle them after your PR is merged :)

Comment on lines 148 to 151
if np.var(data_x) == 0:
data_x -= np.average(data_x)
else:
data_x = stats.zscore(data_x, axis=0)
@MarkDana (Collaborator) commented on Jul 12, 2022:

  • The current np.var(data_x) does not support multi-dimensional data_x where some dims are constant (while others are not), or where all dims are constant but with different values:
In [7]: data_x = np.random.randn(100, 1)

In [8]: data_y = np.ones_like(data_x) # constant

In [9]: data_xy = np.hstack([data_x, data_y])

In [10]: stats.zscore(data_xy, axis=0)
Out[10]: 
array([[ 4.15513535e-01,             nan],
       [-1.71903423e+00,             nan],
       [ 7.59493517e-01,             nan],
       [-1.34182046e+00,             nan],
       ...
  • For numpy floating point, how can we reliably identify a constant array? var == 0 is not enough:
In [16]: arr1 = np.array([1., 1., 1.])

In [17]: np.var(arr1)
Out[17]: 0.0

In [18]: stats.zscore(arr1)
Out[18]: array([nan, nan, nan])

#########

In [19]: arr2 = np.array([-0.087, -0.087, -0.087])

In [20]: np.var(arr2)
Out[20]: 1.925929944387236e-34

In [21]: np.var(arr2) == 0
Out[21]: False

In [22]: stats.zscore(arr2)
Out[22]: array([nan, nan, nan]) # though np.var != 0, here it still runs to stats.zscore and returns nan
  • Based on the above, can we just mask NaN values to zero after stats.zscore?
data_x = stats.zscore(data_x, axis=0)
data_x[np.isnan(data_x)] = 0.

This operation is safe: since we would (expect to) check that any raw data_x (before KCI) contains no NaN values, a NaN in the normalized array can only come from a constant column (not from original NaN values).
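A standalone check of the masking idea on the mixed constant/non-constant case above (assuming only numpy and scipy):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = np.hstack([rng.standard_normal((100, 1)),  # non-constant column
                      np.ones((100, 1))])             # constant column

    z = stats.zscore(data, axis=0)
    z[np.isnan(z)] = 0.  # constant columns become all-zero instead of NaN

    print(np.isnan(z).any())     # -> False
    print((z[:, 1] == 0).all())  # -> True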

Contributor:

> This operation is safe: since we would (expect to) check that any raw data_x (before KCI) contains no NaN values, a NaN in the normalized array can only come from a constant column (not from original NaN values).

Do we ensure this in code, i.e. ensure raw data_x doesn't contain NaN values?

Collaborator:

> > This operation is safe: since we would (expect to) check that any raw data_x (before KCI) contains no NaN values, a NaN in the normalized array can only come from a constant column (not from original NaN values).
>
> Do we ensure this in code, i.e. ensure raw data_x doesn't contain NaN values?

Not yet. I'll do this soon after this PR is merged.

Collaborator:

Thanks so much for your awesome contributions, @cogito233, @MarkDana and @tofuwen! @cogito233, please let us know if you think the current PR is ready to go, so we can solve the remaining issues in a new PR. Or we could include them in this PR if you would like to. :-)

Contributor (Author):

The problem is already fixed. I guess we can merge now~

Many thanks to all of you (@kunwuz, @MarkDana, @tofuwen) for your help~ This is my first time contributing to a community codebase and I lack experience.

Collaborator:

Thanks so much! @tofuwen @MarkDana When you have time, could you please make a final check on the current PR and let me know when it's ready to be merged? Many thanks!

@tofuwen (Contributor) commented on Jul 12, 2022:

Thanks @cogito233 for your great work! And congrats on your first PR improving a public codebase! We really appreciate your help in making causal-learn better! :)

I'll defer to @MarkDana for the final review. I guess @MarkDana will then push another PR based on this one, right?

@@ -59,9 +60,18 @@ def _unique(column):
}

def kci(self, X, Y, condition_set):
Contributor:

@MarkDana

Currently X and Y here can be int / Iterable? This doesn't sound like a good design; if possible, we'd better make every variable type-checked.

Why not enforce X and Y here to be a list of ints? That's the most general option, right? Then we wouldn't need those lines just for type-checking; it's usually a good idea to keep code concise.

But I think this can be changed in your later PR, @MarkDana

Collaborator:

Yes, let me handle this in the later PR, or in the new KCI subclass.

@MarkDana (Collaborator):

Thanks so much @cogito233 for what you did! Awesome work! You made causal-learn better.
@kunwuz I think this PR is ready to be merged. Thanks!

@kunwuz kunwuz merged commit 8badb41 into py-why:main Jul 12, 2022