Rewrite CITests as a class && re-use covariance matrix for fisherz #46

MarkDana · 2022-07-02T12:13:13Z

Updated files:

cit.py: main change: integrate CITests functions to a class
PC.py, FCI.py, CDNOD.py, Fas.py, SkeletonDiscovery.py: adjust accordingly for the new CIT class.
GraphClass.py: remove unnecessary attributes (e.g., data, cache) in the CausalGraph class.
TestPC.py: there is randomness in mvpc, and thus we need two lines of test: w/ or wo/ randomness.
TestFCI.py, TestMVPC_mv_fisherz_test.py: minor issues (e.g., file directory).

Why this update is needed:

1. Speedup. In current pc and fisherz, the correlation matrix over a same dataset data is re-computed at every call on fisherz, which wastes lots of time. Thanks @kunwuz for raising this speedup!
However, if we simply add a if indep_test == fisherz line and pass the pre-computed corr_mat to fisherz at every possible usage, code will be super lengthy and ugly: indep_test is used everywhere in constraint-based methods, and parameters passing will be complicated.
An example of above: for speedup, we added CITest cache and cardinalities of gsq/chisq tests before. However, to achieve this, these parameters are passed all across modules, e.g. here, here, and here.
2. To make further changes on CITests easier. Now all attributes of a specific CItests is stored in its CIT object, including data, cache, and: corr_mat for fisherz, cardinalities for gsq/chisq, user-specified parameters for kci, etc.

How to use the CIT class

For users, any algorithms can be run end-to-end in a same way as before, e.g.,

from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.cit import fisherz

cg = pc(data, 0.05, fisherz)

though they may notice that now causallearn.utils.cit.fisherz is a string "fisherz", instead of the function before.

For developers, now we should write code as:

from causallearn.utils.cit import CIT

fisherz_obj = CIT(data, "fisherz") # construct a CIT instance with data and method name
pValue = fisherz_obj(X, Y, S) # a simple call is ok. no need to consider cache/corr_mat etc. by yourself.

while before we write code as:

from causallearn.utils.cit import fisherz

# some complex if..then.. considerations before calculation, e.g., corr_mat, cache.
pValue = fisherz(data, X, Y, S)

Test plan:

To ensure the new CIT class is correct and does not change the logic of the original code (as of commit 94d1536):

python -m unittest TestPC    # should pass
python -m unittest TestFCI    # should pass
python -m unittest TestCDNOD    # should pass
python -m unittest TestMVPC    # should pass
python -m unittest TestMVPC_mv_fisherz_test.py    # should pass
python -m unittest TestMVPC_mv_fisherz.py    # should pass except for test_pc_with_mv_fisherz_MCAR_data_assertion, which cannot pass in the original code.

Speedup gain:

E.g., for TestPC.TestPC.test_pc_with_fisher_z, it takes 24.812s in a run of the original code. Now it takes 3.748s.

Todo:

cit.py: L113: np.corr numerical issues with nan values.
To affect as less code as possible, now we deleted unnecessary attributes e.g., data, cache in the CausalGraph class. While we still have test attributes (a CIT object), to adapt to cg.ci_test methods in the original codes. But, for simplicity, should test be part of a graph class, eventually?

…rence

tofuwen · 2022-07-02T14:27:04Z

I have some general design comments about this PR, how about we find some time to discuss?

It would be better to split the diff into smaller chunks (actually each commit should be a single PR to make the review easier --- this PR have about 1300 significant lines, and we should target at <200 lines in each PR). In industry, we can use things called "stacked diff", i.e. it allows you to send out new PR based on your previous PR (instead of current code trunk), so splitting PR wouldn't block you. If you split PRs, when you old PR get accepted, you can push them first.
For huge PR like this, it would be good to go through the design first then implement. Otherwise, we may still need some huge changes. :)

tofuwen

I didn't spend time reviewing cit.py --- I guess it's mainly copy and paste and we have cit tests already?

I think the PR is generally good to push --- only some nits comments.

Thanks the great work! It not only fix the old issues, but also significantly increased the code quality. :)

tofuwen · 2022-07-02T14:58:40Z

causallearn/graph/GraphClass.py

-        self.data = None  # store the data
-        self.test = None  # store the name of the conditional independence test
-        self.corr_mat = None  # store the correlation matrix of the data


for the attributes you deleted, did you make sure that no code referenced it anymore? :)

I verified this by PyCharm -> Find Usages and there is no other usages in the project.

Is this a reliable way to find usages or do you have any other recommendations?

I don't know. Maybe try it using other variables to see whether it did a great job? Normally these kind of things should rely on simple tests (if we have 100% test coverage hhh) && smart IDE.

tofuwen · 2022-07-02T14:58:56Z

causallearn/graph/GraphClass.py

        self.labels = {}
        self.prt_m = {}  # store the parents of missingness indicators
-        self.mvpc = False
-        self.cardinalities = None  # only works when self.data is discrete, i.e. self.test is chisq or gsq


same as above

tofuwen · 2022-07-02T14:59:59Z

causallearn/search/ConstraintBased/CDNOD.py



-def cdnod(data: ndarray, c_indx: ndarray, alpha: float = 0.05, indep_test=fisherz, stable: bool = True,
+def cdnod(data: ndarray, c_indx: ndarray, alpha: float = 0.05, indep_test: str = fisherz, stable: bool = True,


standard coding style here:

no spaces between = in function declaration

check examples in https://google.github.io/styleguide/pyguide.html

Cool! Fixed.

tofuwen · 2022-07-02T15:03:28Z

causallearn/search/ConstraintBased/PC.py

    if mvpc:  # missing value PC
-        if indep_test == fisherz:
-            indep_test = mv_fisherz
+        if indep_test == fisherz: indep_test = mv_fisherz


hmmm, should have two lines here

tofuwen · 2022-07-02T15:06:41Z

tests/TestPC.py

 import unittest
 import hashlib
 import numpy as np
+np.random.seed(42)


how about we have two tests here?

Also please do not fix random seed for whole file here.

How about we have one test with seed fixed, one test without seed fixed? And only fix seed in the fix-seed function?

tofuwen · 2022-07-02T15:30:52Z

Great, after fixing all the nits above, I think this PR is ready to go!

MarkDana added 8 commits July 2, 2022 20:04

main change: integrate CITests functions to a class

11e2d90

Apply changes of cit to PC and MVPC

5794310

Apply changes of cit to FCI

a803c0e

Apply changes of cit to CDNOD

8da1655

data, cache etc. are unnecessary for graph class. removed

a809b9f

remove unnecessary cache from fas

df9a2b4

apply changes of cit to SkeletonDiscovery

f653504

respective tests to ensure modification on cit makes no logical diffe…

9ea7945

…rence

tofuwen reviewed Jul 2, 2022

View reviewed changes

MarkDana added 2 commits July 2, 2022 23:32

nits: two coding style issues fixed

5edd04a

random seed: two tests w/ or wo/ randomness

d9e87b0

MarkDana changed the title ~~Rewrite CITests as a class~~ Rewrite CITests as a class && re-use covariance matrix for fisherz Jul 2, 2022

kunwuz merged commit 24687d9 into py-why:main Jul 2, 2022

tofuwen mentioned this pull request Jul 6, 2022

KCI-tests: Remove TestKCI and add assertions to TestCIT_KCI #51

Merged

MarkDana mentioned this pull request Jul 19, 2022

Refactor CITs in oop way #62

Merged



		def cdnod(data: ndarray, c_indx: ndarray, alpha: float = 0.05, indep_test=fisherz, stable: bool = True,
		def cdnod(data: ndarray, c_indx: ndarray, alpha: float = 0.05, indep_test: str = fisherz, stable: bool = True,

Rewrite CITests as a class && re-use covariance matrix for fisherz #46

Rewrite CITests as a class && re-use covariance matrix for fisherz #46

Uh oh!

Conversation

MarkDana commented Jul 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Updated files:

Why this update is needed:

How to use the CIT class

Test plan:

Speedup gain:

Todo:

Uh oh!

tofuwen commented Jul 2, 2022

Uh oh!

tofuwen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tofuwen commented Jul 2, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

MarkDana commented Jul 2, 2022 •

edited

Loading