Use d-separation as CIT in tests, to ensure PC's correctness #65

MarkDana · 2022-07-26T14:26:54Z

Note first:

This PR is based on #64 to avoid conflicts in TestPC.py, so please review #64 first.

Updated files:

cit.py: add class D_Separation(CIT_Base).
TestPC.py: add test_pc_load_bnlearn_graphs_with_d_separation.
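
For readers unfamiliar with the idea, here is a minimal, self-contained sketch of how a d-separation check can stand in for a CI test. It is not the code added to cit.py in this PR: the function names (`ancestors_of`, `d_separated`, `oracle_p_value`) and the parent-dictionary graph format are assumptions made for illustration. It uses the standard moralized-ancestral-graph criterion and returns a p-value of 1.0 or 0.0 so that it can be thresholded like any other test.

```python
from itertools import combinations


def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(parents, x, y, cond):
    """True if x and y are d-separated given cond.

    `parents` maps each node to a list of its parents (hypothetical format).
    Assumes x and y are distinct and neither is in cond.
    """
    cond = set(cond)
    keep = ancestors_of(parents, {x, y} | cond)
    # Moralize the ancestral subgraph: make parent-child edges undirected
    # and connect every pair of parents that share a child.
    adj = {n: set() for n in keep}
    for child in keep:
        pas = list(parents.get(child, ()))
        for p in pas:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pas, 2):
            adj[p].add(q)
            adj[q].add(p)
    # x and y are d-separated iff cond blocks every path in the moral graph.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        for nb in adj[node]:
            if nb == y:
                return False
            if nb not in seen and nb not in cond:
                seen.add(nb)
                stack.append(nb)
    return True


def oracle_p_value(parents, x, y, cond):
    """Mimic a CI test: p-value 1.0 if d-separated, else 0.0."""
    return 1.0 if d_separated(parents, x, y, cond) else 0.0


# Collider with a descendant: A -> C <- B, C -> D
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
print(oracle_p_value(dag, "A", "B", []))     # 1.0: the collider blocks the path
print(oracle_p_value(dag, "A", "B", ["D"]))  # 0.0: conditioning on a descendant opens it
```

Because the oracle never looks at data, any data array passed alongside it is only a placeholder, which is why the test can feed PC an array of zeros.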

Test plan:

python -m unittest TestPC.TestPC.test_pc_load_bnlearn_graphs_with_d_separation

It should pass, but it takes a long time (around 3 hours), because on big graphs 1) more tests are needed, and 2) each d-separation check takes longer to traverse paths. Here is an example output:

Depth=3, working on node 7: 100%|██████████| 8/8 [00:00<00:00, 6425.59it/s]
asia (8 nodes/8 edges): used 0.03562s, SHD: 0
Depth=3, working on node 4: 100%|██████████| 5/5 [00:00<00:00, 5866.16it/s]
cancer (5 nodes/4 edges): used 0.00914s, SHD: 0
Depth=3, working on node 4: 100%|██████████| 5/5 [00:00<00:00, 6878.16it/s]
earthquake (5 nodes/4 edges): used 0.00925s, SHD: 0
Depth=6, working on node 10: 100%|██████████| 11/11 [00:00<00:00, 5876.62it/s]
sachs (11 nodes/17 edges): used 0.13420s, SHD: 0
Depth=3, working on node 5: 100%|██████████| 6/6 [00:00<00:00, 7246.13it/s]
survey (6 nodes/6 edges): used 0.01338s, SHD: 0
Depth=5, working on node 36: 100%|██████████| 37/37 [00:00<00:00, 8497.93it/s]
alarm (37 nodes/46 edges): used 7.56482s, SHD: 0
Depth=7, working on node 47: 100%|██████████| 48/48 [00:00<00:00, 3848.28it/s]
barley (48 nodes/84 edges): used 701.48529s, SHD: 0
Depth=7, working on node 19: 100%|██████████| 20/20 [00:00<00:00, 7347.47it/s]
child (20 nodes/25 edges): used 0.97726s, SHD: 0
Depth=8, working on node 26: 100%|██████████| 27/27 [00:00<00:00, 7440.13it/s]
insurance (27 nodes/52 edges): used 21.79384s, SHD: 0
Depth=7, working on node 31: 100%|██████████| 32/32 [00:00<00:00, 4703.78it/s]
water (32 nodes/66 edges): used 122.41981s, SHD: 0
Depth=16, working on node 55: 100%|██████████| 56/56 [00:00<00:00, 6383.16it/s]
hailfinder (56 nodes/66 edges): used 365.14324s, SHD: 0
Depth=18, working on node 69: 100%|██████████| 70/70 [00:00<00:00, 3397.89it/s]
hepar2 (70 nodes/123 edges): used 7126.62941s, SHD: 0
Depth=9, working on node 75: 100%|██████████| 76/76 [00:00<00:00, 7891.84it/s]
win95pts (76 nodes/112 edges): used 167.08235s, SHD: 0
test_pc_load_bnlearn_graphs_with_d_separation passed!

Ran 1 test in 8514.537s

Good news from above:

This integration test shows that our current implementation of pc (or more specifically, at least pc(stable=True, uc_rule=0, uc_priority=-1)) is correct on these 13 graphs, i.e., it returns the true CPDAG when the CI tests are asymptotically correct.


tofuwen · 2022-07-29T06:12:26Z

causallearn/utils/cit.py

        return virtual_cit(0, 1, tuple(cond_set_bgn_0))
+
+
+class D_Separation(CIT_Base):


This is really hacky... Normally we shouldn't write hacky code like this.

D_separation is not a Conditional Independence Test, right? We probably shouldn't inherit from the CIT_base class.

Think about the OOP design here: normally we need to follow the logic. If D-Separation and CIT have something in common (as here, where the data is the same and you want to return some value), you can add another layer of abstraction under CIT_base.

We need to strictly follow the logical structure in code whenever possible. :)

Thanks so much. I see your point! (Sorry, I just saw your message.)

My current code is mainly for convenience, so that we can call d-separation just as we call a citest (same as fisherz or kci). D-separation indeed has many things in common with a citest (e.g., I/O), though yes, logically it is not a citest.

By "another layer of abstraction under CIT_base", are you suggesting something like CIT_base -> D_separation_base -> D_separation? Then this looks almost the same as CIT_base -> D_separation?

OOP is not just a chain; it should be a DAG.

For example, you can design things like:

Data_base -> CIT_base -> ....
Data_base -> D_separation

I see!

A bit confused: then maybe we'd have to move all of the functionality in CIT_base (e.g., input checks, cache, etc.) to Data_base, even though these are not attributes of the data, and D_separation requires no data?

To me, D_separation here is more like a duck type. Though by definition it is NOT a citest (it is a graphical test, not a statistical one), in our context (testing the algorithm's correctness) we call it, use it, and evaluate it just like a citest.
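
The duck-typing view can be illustrated with a small sketch. Nothing here is causal-learn's API: `is_independent`, `ChainOracle`, and `always_dependent` are hypothetical names, and the only contract assumed is "callable with (x, y, cond) that returns a p-value".

```python
from typing import Callable, Iterable

# Assumed contract: any callable taking (x, y, cond) and returning a p-value.
CITest = Callable[[int, int, Iterable[int]], float]


def is_independent(test: CITest, x: int, y: int, cond, alpha: float = 0.05) -> bool:
    """Decide independence by thresholding the p-value, whatever produced it."""
    return test(x, y, cond) > alpha


class ChainOracle:
    """Graphical oracle for the chain 0 -> 1 -> 2; inherits from no CIT class."""

    def __call__(self, x, y, cond):
        separated = {x, y} == {0, 2} and 1 in set(cond)
        return 1.0 if separated else 0.0


def always_dependent(x, y, cond):
    """A plain function satisfies the same contract."""
    return 0.0


print(is_independent(ChainOracle(), 0, 2, [1]))   # True
print(is_independent(ChainOracle(), 0, 2, []))    # False
print(is_independent(always_dependent, 0, 2, [1]))  # False
```

The caller never checks the type of `test`, so whether the oracle inherits from a CIT base class is a question of code organization rather than of behavior.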

I see.

I really didn't think it through. Yeah, you are right, D-separation requires no data.

My point is just: in OOP, just think about what needs to be abstracted and shared.

So what's common between D_separation and CIT_base? The cache, and other related utils. Then probably make a base named Cache_base, and maybe you can design things like:

Cache_base -> CIT_base -> ....
Cache_base -> D_separation

Usually, don't inherit things that you don't need, and don't make the OOP design inconsistent with the logical structure (hacks like this will usually create trouble in the future).
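
A rough sketch of the suggested hierarchy, to make the proposal concrete. All class and method names below are hypothetical and do not exist in cit.py; the oracle is reduced to a lookup in a precomputed set of true independences so that the example stays short.

```python
class CacheBase:
    """Hypothetical base class: only result caching and key normalization."""

    def __init__(self):
        self.cache = {}

    def _key(self, x, y, cond):
        # Independence is symmetric in x and y, and cond is unordered.
        lo, hi = sorted((x, y))
        return (lo, hi, tuple(sorted(cond)))

    def __call__(self, x, y, cond=()):
        key = self._key(x, y, cond)
        if key not in self.cache:
            self.cache[key] = self._compute(x, y, cond)
        return self.cache[key]

    def _compute(self, x, y, cond):
        raise NotImplementedError


class CITBase(CacheBase):
    """Hypothetical base for statistical tests: the only addition is the data."""

    def __init__(self, data):
        super().__init__()
        self.data = data


class DSeparation(CacheBase):
    """Graph oracle: shares the cache but holds no data."""

    def __init__(self, independences):
        super().__init__()
        # Assumed input: a set of normalized (x, y, cond) keys that hold in the true DAG.
        self.independences = independences

    def _compute(self, x, y, cond):
        return 1.0 if self._key(x, y, cond) in self.independences else 0.0


oracle = DSeparation({(0, 2, (1,))})  # chain 0 -> 1 -> 2
print(oracle(2, 0, [1]))  # 1.0
print(oracle(0, 2))       # 0.0
```

In this layout DSeparation and CITBase are siblings, so the oracle picks up caching without inheriting a data attribute it never uses.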

Thx @tofuwen. Cool! Cache_base -> CIT_base -> .... and Cache_base -> D_separation now looks logically reasonable. Though practically I still have this concern:

What is the difference set CIT_base\Cache_base? In other words, what is something shared by FisherZ and Chisq, but not used in D_separation? There is only one thing, data.

Therefore, Cache_base should contain cache-related utilities and input/output checks, and CIT_base\Cache_base should be only about data. However, if we do so, some problems arise:

1. Cache-related utilities and data are fairly tightly coupled (e.g., data hash check, parameter check). It may not be clean or easy to decouple the two.

2. Does Cache_base (or, to name it more accurately, e.g., Cache_for_constraint_base) deserve base-class treatment, or is it just a set of utility functions belonging to the CITs? It's natural to understand KCI as a child class of CIT_base, but it seems weird to see CIT_base as a child class of Cache_base. Even without a cache, a CIT is still a CIT.

After all, d-separation is implemented here only to check the algorithms' correctness in tests. It is not a key function for users. So, is it really necessary to do the above refactor (separating out a Cache_base) only for d-separation, while sacrificing all the other main-function parts (mentioned above in the 2nd point)?

And can d-separation be seen as a citest? Maybe it depends on how generally we view them. For example,

d-separation is a criterion for deciding, from a given a causal graph, whether a set X of variables is independent of another set Y, given a third set Z.
(http://bayes.cs.ucla.edu/BOOK-2K/d-sep.html)

I will think more about how to put d-separation in our package in a way that is both logically reasonable and functionally clean.

Yeah, I agree with you for the most part. I think you convinced me: my suggestions would add a lot of extra work to improve a design that is not really necessary, and I don't think that justifies the added complexity here.

Cool. Thanks so much for this! The separate class for d-separation that you suggested would still be the ideal solution if we had enough time; maybe in a future refactor of citest.

tofuwen · 2022-07-29T06:13:25Z

tests/TestPC.py

        print('test_pc_load_bnlearn_discrete_datasets passed!\n')
+
+    # Test the usage of local cache checkpoint (check speed).
+    def test_pc_with_citest_local_checkpoint(self):


I really like the tests here!

Cheers! Now we can guarantee our PC algorithm is indeed written correctly. Great work!! :)

tofuwen

Thanks for the great work! We are almost done!

tofuwen · 2022-08-01T13:43:12Z

tests/TestPC.py

+
+            data = np.zeros((100, len(truth_dag.nodes)))  # just a placeholder
+            cg = pc(data, 0.05, d_separation, True, 0, -1, true_dag=true_dag_netx)
+            shd = SHD(truth_cpdag, cg.G)


One final question: why don't we have an assert here?

Should we assert shd == 0 here?

Oh yes, we should! Thanks for pointing this out!
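
As an illustration of what such an assertion checks, here is a small sketch of a structural Hamming distance between two graphs. It is not causal-learn's SHD class: the edge-set encoding and the `shd` function are assumptions, and it follows one common convention in which each node pair with a different edge type counts as one error.

```python
def edge_type(edges, u, v):
    """Describe the edge between u and v as (u -> v present, v -> u present)."""
    return ((u, v) in edges, (v, u) in edges)


def shd(edges_a, edges_b):
    """Count node pairs whose edge type differs between the two graphs.

    Each graph is a set of directed pairs; an undirected edge u - v is
    stored as both (u, v) and (v, u). Self-loops are assumed absent.
    """
    pairs = {frozenset(e) for e in edges_a | edges_b}
    return sum(
        edge_type(edges_a, *sorted(p)) != edge_type(edges_b, *sorted(p))
        for p in pairs
    )


# True CPDAG of the collider A -> C <- B.
truth = {("A", "C"), ("B", "C")}
estimated = {("A", "C"), ("B", "C")}
# Same skeleton, but A - C left undirected.
wrong = {("A", "C"), ("C", "A"), ("B", "C")}

print(shd(truth, estimated))  # 0
print(shd(truth, wrong))      # 1
assert shd(truth, estimated) == 0, "PC with an oracle CI test should recover the true CPDAG"
```

Without the assertion the test would only print the SHD, so a regression that produced a nonzero distance would still pass.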

tofuwen

Thanks for the awesome work! cc @kunwuz to merge

MarkDana added 6 commits July 26, 2022 18:35

Add kwargs in PC to pass additional parameters to cit

1068ffe

Add test_pc_with_citest_local_checkpoint

9716b59

Added kwargs also in CDNOD and FCI to pass params to CIT

d75b215

Merge pull request #3 from MarkDana/local_cache_resume_example

aa7f3c2

Local cache resume example

Add d-separation as a CIT method in cit.py

9ca56c9

Add test_pc_load_bnlearn_graphs_with_d_separation in TestPC

25bdf4e

tofuwen reviewed Jul 29, 2022


tofuwen reviewed Aug 1, 2022


assert shd==0 for PC with d-separation

2e1c265

tofuwen approved these changes Aug 2, 2022


kunwuz merged commit d7ad488 into py-why:main Aug 2, 2022

3 participants