-
Notifications
You must be signed in to change notification settings - Fork 231
Use d-separation as CIT in tests, to ensure PC's correctness #65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Local cache resume example
| return virtual_cit(0, 1, tuple(cond_set_bgn_0)) | ||
|
|
||
|
|
||
| class D_Separation(CIT_Base): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really hacky... Normally we shouldn't write hacky code like this.
D_separation is not a Conditional Independence Test, right? We probably shouldn't inherit CIT_base class.
You can think about this OOP design --- normally we need to follow the logic, if D-Separation and CIT has something in common (like here, data is the same, and you call want to return some value), you can add another layer of abstraction under CIT_base.
We need to strictly follow the logical structure in code whenever possible. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much. I see your point! (Sorry I just saw your message..
My current codes are mainly for convenience - so that we can call d-separation just as if we call a citest (same as fisherz or kci). D-separation indeed has many things in common with citest (e.g., i/o), though yes, logically it is not a citest.
By "another layer of abstraction under CIT_base", are you suggesting something like CIT_base -> D_separation_base -> D_separation? Then this looks almost the same as CIT_base -> D_separation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OOP is not just chain, it should be a DAG.
For example, you can design things like:
Data_base -> CIT_base -> ....
Data_base -> D_separation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see!
A bit confused: then maybe we'll have to move all of our functionalities in CIT_base (e.g., input check, cache, etc) to Data_base - while they are not attributes about data, and D_separation requires no data?
To me, here D_separation is more like a duck type? Though in definition, it is NOT a citest (not a statistical one but a graphical one), in our context (to test the algorithm's correctness), we call it, use it and evaluate it all like a citest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
I really didn't think much --- yeah, you are right, D-separation requires no data.
My point is just: in OOP, just think about what needs to be abstracted and shared.
So what's common between D_separation and CIT_base? The cache, and other related utils. Then probably make a base named Cache_base, and maybe you can design things like:
Cache_base -> CIT_base -> ....
Cache_base -> D_separation
Usually don't inherit things that you don't need and don't make OOP design not consistent with the logical structure (hacky like this will usually create troubles in the future.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx @tofuwen. Cool! Cache_base -> CIT_base -> .... and Cache_base -> D_separation now looks logically reasonable. Though practically I still have this concern:
What is the difference set CIT_base\Cache_base? In other words, what is something shared by FisherZ and Chisq, but not used in D_separation? There is only one thing, data.
Therefore, Cache_base should contain cache-related utilities and input/output checks, and CIT_base\Cache_base should be only about data. However, if we do so, some problems arise:
- Cache-related utilities and data are relatively coupled (e.g., data hash check, parameters check). It may not be clean or easy to decouple the two.
Cache_base- or naming it more accurately, e.g.,Cache_for_constraint_base- is it something deserving a base class treatment, or just some utility functions belonging to the CITs? It's natural to understandKCIas a child class ofCIT_base, but it seems weird to seeCIT_baseas a child class ofCache_base. Even without cache, CIT is still CIT.- After all, d-separation is here implemented only to check the algorithms' correctness in tests. It is not some key function for users. So, is it really necessary to do the above refactor (separate out a
Cache_base) only for d-separation - while sacrificing all the other main-function parts (mentioned above in the 2nd point)? - And, can d-separation be seen as a citest? Maybe it depends. On how general we see them. For example,
d-separation is a criterion for deciding, from a given a causal graph, whether a set X of variables is independent of another set Y, given a third set Z.
(http://bayes.cs.ucla.edu/BOOK-2K/d-sep.html)
I will think more about how to put d-separation in our package in a both logically reasonable and functionally clean way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with you for the most part. I think you convinced me: I agree that my suggestions seem to add lots of extra work to make the (not very necessary design) better, which I don't think justify the increasing complexity here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Thanks so much for this! The separated class for d-separation that you suggested would still be the perfect one, as long as we had enough time - maybe in a future refactor on citest.
| print('test_pc_load_bnlearn_discrete_datasets passed!\n') | ||
|
|
||
| # Test the usage of local cache checkpoint (check speed). | ||
| def test_pc_with_citest_local_checkpoint(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the tests here!
Cheers! Now we can guarantee our PC algorithm is indeed written correctly. Great work!! :)
tofuwen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work! We are almost done!
|
|
||
| data = np.zeros((100, len(truth_dag.nodes))) # just a placeholder | ||
| cg = pc(data, 0.05, d_separation, True, 0, -1, true_dag=true_dag_netx) | ||
| shd = SHD(truth_cpdag, cg.G) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one final questions: why we don't have assert here?
Should we assert shd = 0 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, we should! Thanks for pointing this out!
tofuwen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the awesome work! cc @kunwuz to merge
Note first:
This pr is based off of #64 to prevent conflict on
TestPC.py. So please review #64 first.Updated files:
cit.py: addclass D_Separation(CIT_Base).TestPC.py: addtest_pc_load_bnlearn_graphs_with_d_separation.Test plan:
It should pass (but in long time (around 3hrs), since in big graphs: 1) more tests are needed, and 2) d-separation check takes more time to traverse paths). Here is an example output:
Good news from above:
By this integration test, we know that our current implementation of pc (or more specifically, at least
pc(stable=True, uc_rule=0, uc_priority=-1)) is correct, i.e., it will return the true CPDAG with asymptotic CI tests.