Skip to content

Conversation

@MarkDana
Copy link
Collaborator

Note first:

This pr is based off of #64 to prevent conflict on TestPC.py. So please review #64 first.

Updated files:

  • cit.py: add class D_Separation(CIT_Base).
  • TestPC.py: add test_pc_load_bnlearn_graphs_with_d_separation.

Test plan:

python -m unittest TestPC.TestPC.test_pc_load_bnlearn_graphs_with_d_separation

It should pass (but in long time (around 3hrs), since in big graphs: 1) more tests are needed, and 2) d-separation check takes more time to traverse paths). Here is an example output:

Depth=3, working on node 7: 100%|██████████| 8/8 [00:00<00:00, 6425.59it/s]
asia (8 nodes/8 edges): used 0.03562s, SHD: 0
Depth=3, working on node 4: 100%|██████████| 5/5 [00:00<00:00, 5866.16it/s]
cancer (5 nodes/4 edges): used 0.00914s, SHD: 0
Depth=3, working on node 4: 100%|██████████| 5/5 [00:00<00:00, 6878.16it/s]
earthquake (5 nodes/4 edges): used 0.00925s, SHD: 0
Depth=6, working on node 10: 100%|██████████| 11/11 [00:00<00:00, 5876.62it/s]
sachs (11 nodes/17 edges): used 0.13420s, SHD: 0
Depth=3, working on node 5: 100%|██████████| 6/6 [00:00<00:00, 7246.13it/s]
survey (6 nodes/6 edges): used 0.01338s, SHD: 0
Depth=5, working on node 36: 100%|██████████| 37/37 [00:00<00:00, 8497.93it/s]
alarm (37 nodes/46 edges): used 7.56482s, SHD: 0
Depth=7, working on node 47: 100%|██████████| 48/48 [00:00<00:00, 3848.28it/s]
barley (48 nodes/84 edges): used 701.48529s, SHD: 0
Depth=7, working on node 19: 100%|██████████| 20/20 [00:00<00:00, 7347.47it/s]
child (20 nodes/25 edges): used 0.97726s, SHD: 0
Depth=8, working on node 26: 100%|██████████| 27/27 [00:00<00:00, 7440.13it/s]
insurance (27 nodes/52 edges): used 21.79384s, SHD: 0
Depth=7, working on node 31: 100%|██████████| 32/32 [00:00<00:00, 4703.78it/s]
water (32 nodes/66 edges): used 122.41981s, SHD: 0
Depth=16, working on node 55: 100%|██████████| 56/56 [00:00<00:00, 6383.16it/s]
hailfinder (56 nodes/66 edges): used 365.14324s, SHD: 0
Depth=18, working on node 69: 100%|██████████| 70/70 [00:00<00:00, 3397.89it/s]
hepar2 (70 nodes/123 edges): used 7126.62941s, SHD: 0
Depth=9, working on node 75: 100%|██████████| 76/76 [00:00<00:00, 7891.84it/s]
win95pts (76 nodes/112 edges): used 167.08235s, SHD: 0
test_pc_load_bnlearn_graphs_with_d_separation passed!

Ran 1 test in 8514.537s

Good news from above:

By this integration test, we know that our current implementation of pc (or more specifically, at least pc(stable=True, uc_rule=0, uc_priority=-1)) is correct, i.e., it will return the true CPDAG with asymptotic CI tests.

return virtual_cit(0, 1, tuple(cond_set_bgn_0))


class D_Separation(CIT_Base):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really hacky... Normally we shouldn't write hacky code like this.

D_separation is not a Conditional Independence Test, right? We probably shouldn't inherit CIT_base class.

You can think about this OOP design --- normally we need to follow the logic, if D-Separation and CIT has something in common (like here, data is the same, and you call want to return some value), you can add another layer of abstraction under CIT_base.

We need to strictly follow the logical structure in code whenever possible. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much. I see your point! (Sorry I just saw your message..

My current codes are mainly for convenience - so that we can call d-separation just as if we call a citest (same as fisherz or kci). D-separation indeed has many things in common with citest (e.g., i/o), though yes, logically it is not a citest.

By "another layer of abstraction under CIT_base", are you suggesting something like CIT_base -> D_separation_base -> D_separation? Then this looks almost the same as CIT_base -> D_separation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOP is not just chain, it should be a DAG.

For example, you can design things like:

Data_base -> CIT_base -> ....
Data_base -> D_separation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see!

A bit confused: then maybe we'll have to move all of our functionalities in CIT_base (e.g., input check, cache, etc) to Data_base - while they are not attributes about data, and D_separation requires no data?

To me, here D_separation is more like a duck type? Though in definition, it is NOT a citest (not a statistical one but a graphical one), in our context (to test the algorithm's correctness), we call it, use it and evaluate it all like a citest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

I really didn't think much --- yeah, you are right, D-separation requires no data.

My point is just: in OOP, just think about what needs to be abstracted and shared.

So what's common between D_separation and CIT_base? The cache, and other related utils. Then probably make a base named Cache_base, and maybe you can design things like:

Cache_base -> CIT_base -> ....
Cache_base -> D_separation

Usually don't inherit things that you don't need and don't make OOP design not consistent with the logical structure (hacky like this will usually create troubles in the future.).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @tofuwen. Cool! Cache_base -> CIT_base -> .... and Cache_base -> D_separation now looks logically reasonable. Though practically I still have this concern:

What is the difference set CIT_base\Cache_base? In other words, what is something shared by FisherZ and Chisq, but not used in D_separation? There is only one thing, data.

Therefore, Cache_base should contain cache-related utilities and input/output checks, and CIT_base\Cache_base should be only about data. However, if we do so, some problems arise:

  • Cache-related utilities and data are relatively coupled (e.g., data hash check, parameters check). It may not be clean or easy to decouple the two.
  • Cache_base - or naming it more accurately, e.g., Cache_for_constraint_base - is it something deserving a base class treatment, or just some utility functions belonging to the CITs? It's natural to understand KCI as a child class of CIT_base, but it seems weird to see CIT_base as a child class of Cache_base. Even without cache, CIT is still CIT.
  • After all, d-separation is here implemented only to check the algorithms' correctness in tests. It is not some key function for users. So, is it really necessary to do the above refactor (separate out a Cache_base) only for d-separation - while sacrificing all the other main-function parts (mentioned above in the 2nd point)?
  • And, can d-separation be seen as a citest? Maybe it depends. On how general we see them. For example,

    d-separation is a criterion for deciding, from a given a causal graph, whether a set X of variables is independent of another set Y, given a third set Z.
    (http://bayes.cs.ucla.edu/BOOK-2K/d-sep.html)

I will think more about how to put d-separation in our package in a both logically reasonable and functionally clean way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with you for the most part. I think you convinced me: I agree that my suggestions seem to add lots of extra work to make the (not very necessary design) better, which I don't think justify the increasing complexity here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Thanks so much for this! The separated class for d-separation that you suggested would still be the perfect one, as long as we had enough time - maybe in a future refactor on citest.

print('test_pc_load_bnlearn_discrete_datasets passed!\n')

# Test the usage of local cache checkpoint (check speed).
def test_pc_with_citest_local_checkpoint(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the tests here!

Cheers! Now we can guarantee our PC algorithm is indeed written correctly. Great work!! :)

Copy link
Contributor

@tofuwen tofuwen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work! We are almost done!


data = np.zeros((100, len(truth_dag.nodes))) # just a placeholder
cg = pc(data, 0.05, d_separation, True, 0, -1, true_dag=true_dag_netx)
shd = SHD(truth_cpdag, cg.G)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one final questions: why we don't have assert here?

Should we assert shd = 0 here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, we should! Thanks for pointing this out!

Copy link
Contributor

@tofuwen tofuwen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the awesome work! cc @kunwuz to merge

@kunwuz kunwuz merged commit d7ad488 into py-why:main Aug 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants