Fix NaN problem while normalizing the data #55
Conversation
Thanks for the great work! We really appreciate your help to make causal-learn better!

- Could you please add a test plan section in your PR to tell us what you did to make sure the code does what's expected? It can be as simple as "the output of the old code (that failed), and the output after your new changes (that works)".
- I suggested some super-nit changes --- our code base follows a standard code style like this: https://google.github.io/styleguide/pyguide.html.
- I am not an expert in KCI, so I will let the owner review whether the changes make logical sense. :)
causallearn/utils/KCI/KCI.py
Outdated

```diff
@@ -145,8 +145,16 @@ def kernel_matrix(self, data_x, data_y):
         else:
             raise Exception('Undefined kernel function')

         data_x = stats.zscore(data_x, axis=0)
         data_y = stats.zscore(data_y, axis=0)
         if (np.var(data_x)==0):
```
```suggestion
if np.var(data_x) == 0:
```
causallearn/utils/KCI/KCI.py
Outdated

```python
        else:
            data_x = stats.zscore(data_x, axis=0)

        if (np.var(data_y)==0):
```

```suggestion
if np.var(data_y) == 0:
```
causallearn/utils/KCI/KCI.py
Outdated

```python
        data_x = stats.zscore(data_x, axis=0)
        data_y = stats.zscore(data_y, axis=0)
        data_z = stats.zscore(data_z, axis=0)
        if (np.var(data_x)==0):
```

same in this file
Co-authored-by: Yewen Fan <tofuwen@users.noreply.github.com>
Here is the code to test this commit. Should I put this code into the folder? By the way, there is also a problem in the class:

```python
from icecream import ic
from causallearn.utils.cit import CIT
import numpy as np


def generate_single_sample(type, dim):
    if type == 'chain':
        # X -> Y -> Z
        X = np.random.random(dim)
        Y = np.random.random(dim) + X
        Z = np.random.random(dim) + Y
    elif type == 'collider':
        # X -> Y <- Z
        X = np.random.random(dim)
        Z = np.random.random(dim)
        Y = np.random.random(dim) + X + Z
        # Y = np.zeros(dim) + np.average(Y)
    return list(X) + list(Y) + list(Z) + [1]  # 31 dims; X: 0..9, Y: 10..19, Z: 20..29, constant 1: 30


def generate_dataset(dim, size):
    dataset = []
    for i in range(size):
        datapoint = generate_single_sample('collider', dim)
        dataset.append(datapoint)
    return np.array(dataset)


if __name__ == '__main__':
    dataset = generate_dataset(10, 1000)
    cit_tester = CIT(dataset, method='kci')
    ic(cit_tester.kci(0, 20, []))
    # The original version cannot pass this, because feature 30 has a constant value
    ic(cit_tester.kci(0, 20, [30]))
    # The following is from one of my recent requirements: using CIT to test high-dim variables.
    # Testing high-dim variables is not supported by the current cit class, which differs from the
    # documentation, so I also implemented this function in the last commit.
    # A related issue about "CIT of high-dim variables" will be put forward later.
    ic(cit_tester.kci(range(10), range(20, 30), range(10, 20)))
    ic(cit_tester.kci(range(10), range(20, 30), []))
    ic(cit_tester.kci(range(10), range(10, 20), []))
    ic(cit_tester.kci(range(10), range(20, 30), [30]))
    ic(cit_tester.kci(range(10), range(10, 20), [30]))
```
```diff
         if len(condition_set) == 0:
-            return self.kci_ui.compute_pvalue(self.data[:, [X]], self.data[:, [Y]])[0]
+            return self.kci_ui.compute_pvalue(self.data[:, X], self.data[:, Y])[0]
         return self.kci_ci.compute_pvalue(self.data[:, [X]], self.data[:, [Y]], self.data[:, list(condition_set)])[0]
```
great, thanks!

QQ:
- What's the reason that the old code only supports one variable? Is there some special design? @MarkDana
@tofuwen There is no special design in the old code for supporting only one variable. I just aligned with other tests (used in constraint-based methods), and forgot that KCI can take in multivariate unconditional variables.
I was just about to fix it. So thanks so much for your work @cogito233 !
we'll need to handle the case for iterable `X`, `Y`.
Also for the cache key in https://github.com/cmu-phil/causal-learn/blob/ffe75f95c4003fa7e9d7d5f3bbec4ace90ed3a41/causallearn/utils/cit.py#L339, we'll need to handle `X < Y` for iterable `X`, `Y`, and also `frozenset(i)` as a hashable key.
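The cache-key handling discussed here could be sketched as follows. This is a hypothetical `cache_key` helper for illustration, not the actual code in `cit.py`: normalizing int/iterable inputs to sorted tuples and ordering the `X`/`Y` pair makes the key symmetric and hashable.

```python
def cache_key(X, Y, condition_set):
    # Hypothetical sketch: normalize int / iterable X, Y into sorted tuples...
    Xs = (X,) if isinstance(X, int) else tuple(sorted(X))
    Ys = (Y,) if isinstance(Y, int) else tuple(sorted(Y))
    # ...and order the pair so that cache_key(X, Y) == cache_key(Y, X)
    if Xs > Ys:
        Xs, Ys = Ys, Xs
    # frozenset makes the conditioning set hashable regardless of input order
    return (Xs, Ys, frozenset(condition_set))
```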
For

> we'll need to handle the case for iterable `X`, `Y`.

I have no idea about how to change the assert in the case of kernel CIT (concerning whether the whole or part of `X` is in the conditional set). Other problems are already fixed.

Maybe
Wow cool. You're so productive! Thanks for all these!

I see your point. This line was intended for correctness checks in constraint-based methods.

As for the CI test itself, is X;Y|X valid? I don't know actually - but the result is expected to always be "independent" (consider X|X as a degenerate constant).

How about we force `X not in condition_set` for `X: int`, and `len(set(condition_set).intersection(X)) == 0` for `X: Iterable`?
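The proposed guard might look like the following hypothetical sketch (names are illustrative, not from the codebase):

```python
def check_disjoint(X, condition_set):
    # Hypothetical sketch of the proposed assert: reject degenerate tests
    # like X _||_ Y | X, where X overlaps the conditioning set.
    Xs = {X} if isinstance(X, int) else set(X)
    assert len(Xs.intersection(condition_set)) == 0, \
        "X must be disjoint from the conditioning set"
```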
Done.
causallearn/utils/cit.py
Outdated

```python
        if condition_set == None:
            condition_set = []
```

This is not necessary. A None condition_set is already handled here:
https://github.com/cmu-phil/causal-learn/blob/ffe75f95c4003fa7e9d7d5f3bbec4ace90ed3a41/causallearn/utils/cit.py#L337
```python
        if type(X) == int:
            X = [X]
```

Better with an `elif` to ensure `X` is some iterable, and then `X = list(X)` - otherwise `self.data[:, X]` does not support `X` as e.g. a `set`, or a `tuple` with only one element.
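Putting the suggestion together, the normalization might read like this sketch (not the merged code; the helper name is made up):

```python
def as_index_list(X):
    # Sketch of the suggested check: int -> [int], any other iterable -> list,
    # so that numpy fancy indexing like self.data[:, X] also works for
    # sets and one-element tuples.
    if isinstance(X, int):
        return [X]
    elif hasattr(X, '__iter__'):
        return list(X)
    raise TypeError("X must be an int or an iterable of ints")
```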
Hi @cogito233 Big thanks for your work. About enabling
causallearn/utils/KCI/KCI.py
Outdated

```python
        if np.var(data_x) == 0:
            data_x -= np.average(data_x)
        else:
            data_x = stats.zscore(data_x, axis=0)
```
- The current `np.var(data_x)` does not support multi-dim `data_x` with some dims being constant (while others are not), or all dims constant but with different values:

```python
In [7]: data_x = np.random.randn(100, 1)
In [8]: data_y = np.ones_like(data_x)  # constant
In [9]: data_xy = np.hstack([data_x, data_y])
In [10]: stats.zscore(data_xy, axis=0)
Out[10]:
array([[ 4.15513535e-01,            nan],
       [-1.71903423e+00,            nan],
       [ 7.59493517e-01,            nan],
       [-1.34182046e+00,            nan],
       ...
```

- For numpy floating points, how can we reliably identify a constant array? `var == 0` is not enough:

```python
In [16]: arr1 = np.array([1., 1., 1.])
In [17]: np.var(arr1)
Out[17]: 0.0
In [18]: stats.zscore(arr1)
Out[18]: array([nan, nan, nan])

#########

In [19]: arr2 = np.array([-0.087, -0.087, -0.087])
In [20]: np.var(arr2)
Out[20]: 1.925929944387236e-34
In [21]: np.var(arr2) == 0
Out[21]: False
In [22]: stats.zscore(arr2)
Out[22]: array([nan, nan, nan])  # though np.var != 0, it still runs stats.zscore and returns nan
```

- Based on the above, can we just mask nan values to zero after `stats.zscore`?

```python
data_x = stats.zscore(data_x, axis=0)
data_x[np.isnan(data_x)] = 0.
```

This operation is safe, since we would (expect to) check any raw `data_x` (before kci) so that it does not contain nan values; the nan returned in the normalized array is only due to constants (not original nan values).
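As a quick standalone check of this masking approach (a sketch using only numpy/scipy, not the code in this PR): after zeroing NaNs, constant dims become all-zero while the non-constant dim keeps its z-scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data_x = np.hstack([rng.standard_normal((100, 1)),  # non-constant dim
                    np.ones((100, 1))])             # constant dim -> nan after zscore
data_x = stats.zscore(data_x, axis=0)
data_x[np.isnan(data_x)] = 0.  # mask nans from constant dims to zero
```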
> This operation is safe, since we would (expect to) check any raw data_x (before kci) so that it does not contain nan values; the nan returned in the normalized array is only due to constants (not original nan values).

Do we ensure this in code, i.e. ensure that the raw data_x doesn't contain nan values?
> Do we ensure this in code, i.e. ensure that the raw data_x doesn't contain nan values?

Not yet. I'll do this soon after this PR is merged.
Thanks so much for your awesome contributions, @cogito233 , @MarkDana and @tofuwen ! @cogito233 Please let us know if you think the current PR is ready to go so we will solve the remaining issues in a new PR. Or we could include these in this PR if you would like to. :-)
The problem is already fixed. I guess maybe we can merge now~
Many thanks for all of your (@kunwuz, @MarkDana, @tofuwen) help, since this is my first time contributing to a community codebase and I lack experience.
Thanks @cogito233 for your great work! And congrats on your first PR to improve a public codebase! We really appreciate your help on making causal-learn better! :) I'll defer to @MarkDana for the final review. I guess @MarkDana will then push another PR based on this PR, right?
```diff
@@ -59,9 +60,18 @@ def _unique(column):
         }

     def kci(self, X, Y, condition_set):
```
Currently X and Y here can be int / Iterable? This doesn't sound like a good design --- if possible, we'd better make every variable type-checked.

Why not enforce X and Y here to be a list of int? That is the most general one, right? Then we don't need those lines just to do type-checking --- it's usually a good idea to keep code concise.

But I think this can be changed in your later PR, @MarkDana
Yes, let me handle this in the later PR, or in the new KCI subclass.

Thanks so much @cogito233 for what you did!! Awesome work!! You made causal-learn better.
Here is a problem: when `data_x`/`data_y`/`data_z` have the same value, e.g. $Z = [1,1,1,\cdots,1]$, the result of `stats.zscore(data)` would be `NaN`, which caused the error in the subsequent `eigh` section.
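A minimal reproduction of this failure mode, independent of causal-learn (assuming only numpy and scipy): z-scoring a constant column divides by a zero standard deviation, yielding an all-NaN array that then poisons the kernel matrix passed to `eigh`.

```python
import numpy as np
from scipy import stats

Z = np.ones((5, 1))           # Z = [1, 1, 1, 1, 1]
Zn = stats.zscore(Z, axis=0)  # std is 0, so every entry becomes nan
assert np.isnan(Zn).all()     # these nans later break the eigh step
```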