Skip to content

Conversation

@MarkDana
Copy link
Collaborator

@MarkDana MarkDana commented Jul 2, 2022

Updated files:

  • cit.py: main change: integrate CITests functions to a class
  • PC.py, FCI.py, CDNOD.py, Fas.py, SkeletonDiscovery.py: adjust accordingly for the new CIT class.
  • GraphClass.py: remove unnecessary attributes (e.g., data, cache) in the CausalGraph class.
  • TestPC.py: there is randomness in mvpc, and thus we need two lines of test: w/ or wo/ randomness.
  • TestFCI.py, TestMVPC_mv_fisherz_test.py: minor issues (e.g., file directory).

Why this update is needed:

  • 1. Speedup. In current pc and fisherz, the correlation matrix over a same dataset data is re-computed at every call on fisherz, which wastes lots of time. Thanks @kunwuz for raising this speedup!
  • However, if we simply add a if indep_test == fisherz line and pass the pre-computed corr_mat to fisherz at every possible usage, code will be super lengthy and ugly: indep_test is used everywhere in constraint-based methods, and parameters passing will be complicated.
  • An example of above: for speedup, we added CITest cache and cardinalities of gsq/chisq tests before. However, to achieve this, these parameters are passed all across modules, e.g. here, here, and here.
  • 2. To make further changes on CITests easier. Now all attributes of a specific CItests is stored in its CIT object, including data, cache, and: corr_mat for fisherz, cardinalities for gsq/chisq, user-specified parameters for kci, etc.

How to use the CIT class

  • For users, any algorithms can be run end-to-end in a same way as before, e.g.,
from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.cit import fisherz

cg = pc(data, 0.05, fisherz)

though they may notice that now causallearn.utils.cit.fisherz is a string "fisherz", instead of the function before.

  • For developers, now we should write code as:
from causallearn.utils.cit import CIT

fisherz_obj = CIT(data, "fisherz") # construct a CIT instance with data and method name
pValue = fisherz_obj(X, Y, S) # a simple call is ok. no need to consider cache/corr_mat etc. by yourself.

while before we write code as:

from causallearn.utils.cit import fisherz

# some complex if..then.. considerations before calculation, e.g., corr_mat, cache.
pValue = fisherz(data, X, Y, S)

Test plan:

To ensure the new CIT class is correct and does not change the logic of the original code (as of commit 94d1536):

python -m unittest TestPC    # should pass
python -m unittest TestFCI    # should pass
python -m unittest TestCDNOD    # should pass
python -m unittest TestMVPC    # should pass
python -m unittest TestMVPC_mv_fisherz_test.py    # should pass
python -m unittest TestMVPC_mv_fisherz.py    # should pass except for test_pc_with_mv_fisherz_MCAR_data_assertion, which cannot pass in the original code.

Speedup gain:

E.g., for TestPC.TestPC.test_pc_with_fisher_z, it takes 24.812s in a run of the original code. Now it takes 3.748s.

Todo:

  • cit.py: L113: np.corr numerical issues with nan values.
  • To affect as less code as possible, now we deleted unnecessary attributes e.g., data, cache in the CausalGraph class. While we still have test attributes (a CIT object), to adapt to cg.ci_test methods in the original codes. But, for simplicity, should test be part of a graph class, eventually?

@tofuwen
Copy link
Contributor

tofuwen commented Jul 2, 2022

I have some general design comments about this PR, how about we find some time to discuss?

  1. It would be better to split the diff into smaller chunks (actually each commit should be a single PR to make the review easier --- this PR have about 1300 significant lines, and we should target at <200 lines in each PR). In industry, we can use things called "stacked diff", i.e. it allows you to send out new PR based on your previous PR (instead of current code trunk), so splitting PR wouldn't block you. If you split PRs, when you old PR get accepted, you can push them first.

  2. For huge PR like this, it would be good to go through the design first then implement. Otherwise, we may still need some huge changes. :)

Copy link
Contributor

@tofuwen tofuwen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't spend time reviewing cit.py --- I guess it's mainly copy and paste and we have cit tests already?

I think the PR is generally good to push --- only some nits comments.

Thanks the great work! It not only fix the old issues, but also significantly increased the code quality. :)

Comment on lines -38 to -40
self.data = None # store the data
self.test = None # store the name of the conditional independence test
self.corr_mat = None # store the correlation matrix of the data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the attributes you deleted, did you make sure that no code referenced it anymore? :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this by PyCharm -> Find Usages and there is no other usages in the project.

Is this a reliable way to find usages or do you have any other recommendations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. Maybe try it using other variables to see whether it did a great job? Normally these kind of things should rely on simple tests (if we have 100% test coverage hhh) && smart IDE.

self.labels = {}
self.prt_m = {} # store the parents of missingness indicators
self.mvpc = False
self.cardinalities = None # only works when self.data is discrete, i.e. self.test is chisq or gsq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above



def cdnod(data: ndarray, c_indx: ndarray, alpha: float = 0.05, indep_test=fisherz, stable: bool = True,
def cdnod(data: ndarray, c_indx: ndarray, alpha: float = 0.05, indep_test: str = fisherz, stable: bool = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

standard coding style here:

no spaces between = in function declaration

check examples in https://google.github.io/styleguide/pyguide.html

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Fixed.

if mvpc: # missing value PC
if indep_test == fisherz:
indep_test = mv_fisherz
if indep_test == fisherz: indep_test = mv_fisherz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, should have two lines here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/TestPC.py Outdated
import unittest
import hashlib
import numpy as np
np.random.seed(42)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we have two tests here?

Also please do not fix random seed for whole file here.

How about we have one test with seed fixed, one test without seed fixed? And only fix seed in the fix-seed function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tofuwen
Copy link
Contributor

tofuwen commented Jul 2, 2022

Great, after fixing all the nits above, I think this PR is ready to go!

@MarkDana MarkDana changed the title Rewrite CITests as a class Rewrite CITests as a class && re-use covariance matrix for fisherz Jul 2, 2022
@kunwuz kunwuz merged commit 24687d9 into py-why:main Jul 2, 2022
@MarkDana MarkDana mentioned this pull request Jul 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants