@MarkDana commented Dec 18, 2021

optimize speed for PC and FCI

change overview

I mainly did two things:

  • Add a cache to avoid repeated CI tests, which are especially common and waste much time in FCI.
    • This applies to both PC and FCI, and to all CI tests (see the cache sketch below).
  • Use matrix operations to replace the for-loops in the conditioning-set subsetting of discrete CI tests.
    • This applies to all constraint-based methods, and to the Chi2 and G2 tests on discrete data (see the vectorization sketch below).

These two optimizations are faithful to the original code (same calculation, same procedure, same result, no approximation); the only difference is speed. And the larger the data, the greater the speedup (the number of conditioning subsets grows exponentially with the number of nodes).
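For concreteness, here are minimal sketches of the two changes. The names below (`CachedCITest`, `ci_test`, `contingency_tables`, `cardinalities`) are illustrative placeholders, not the exact classes/functions added in this PR.

The cache memoizes p-values keyed by the (order-normalized) variable pair and the frozen conditioning set, so a test that PC/FCI requests more than once is computed only once:

```python
import numpy as np


class CachedCITest:
    """Memoization wrapper around any CI test -- a sketch, not the PR's exact code."""

    def __init__(self, ci_test, data):
        self.ci_test = ci_test      # any callable: (data, x, y, condition_set) -> p-value
        self.data = data
        self.pvalue_cache = {}

    def __call__(self, x, y, condition_set=()):
        # The test is symmetric in (x, y) and order-insensitive in the conditioning
        # set, so normalize the key before the lookup.
        key = (min(x, y), max(x, y), frozenset(condition_set))
        if key not in self.pvalue_cache:
            self.pvalue_cache[key] = self.ci_test(self.data, x, y, condition_set)
        return self.pvalue_cache[key]
```

The matrix-operation change removes the per-row Python loop that splits the samples by every configuration of the conditioning set S: each row's configuration of S is encoded as one integer via a single matrix product (mixed-radix encoding), and all contingency tables are then built at once with `np.bincount`:

```python
import numpy as np


def contingency_tables(data, x, y, S, cardinalities):
    """Build one |X| x |Y| count table per configuration of S, without for-loops.

    data: (n_samples, n_vars) integer-coded array; S: list of column indices;
    cardinalities: integer array of per-variable category counts. Illustrative only.
    """
    cx, cy = int(cardinalities[x]), int(cardinalities[y])
    if len(S) == 0:
        codes, n_configs = np.zeros(data.shape[0], dtype=np.int64), 1
    else:
        cards = cardinalities[S]
        radix = np.concatenate(([1], np.cumprod(cards[:-1]))).astype(np.int64)  # mixed-radix weights
        codes, n_configs = data[:, S] @ radix, int(np.prod(cards))
    flat = (codes * cx + data[:, x]) * cy + data[:, y]   # one joint code per sample
    counts = np.bincount(flat, minlength=n_configs * cx * cy)
    return counts.reshape(n_configs, cx, cy)             # fed into the Chi2 / G2 statistic
```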

result overview

  1. Make sure the new CITest is exactly the same as the old one:
python -m unittest TestPC.TestPC.test_new_old_gsq_chisq_equivalent
  2. Test PC speed on discrete datasets:
  • Run:
python -m unittest TestPC.TestPC.test_bnlearn_discrete_datasets
  • Data: the discrete datasets in the original tests/ folder are not large enough, so I used some datasets from bnlearn with sample size 10,000. See ./TestData/bnlearn_discrete_10000.
  • Competitors:
    • Tetrad (Java version), used via the Python interface py-causal.
    • pcalg (R).
    • causal-learn-old: the latest version as of 12/13/2021, commit 9b8f06f.
    • causal-learn-new (this pull request).
  • Note:
    • To test causal-learn-old, please check out the current main branch, copy the bnlearn data folder and TestPC.py over, and run the same command.
    • This is just a rough run. I'm sure that causal-learn-old and causal-learn-new run exactly the same procedure, but I'm not sure whether the internal parameters/operations are the same in Tetrad and pcalg.
  • Result (produced on my M1max):
| data (#nodes/#edges) | pcalg-R (sec) | Tetrad-java (sec) | causal-learn-old (sec) | causal-learn-new (sec) | ~x times faster than before |
| --- | --- | --- | --- | --- | --- |
| cancer (5/4) | 1.540 | 0.327 | 0.037 | 0.011 | 3 |
| earthquake (5/4) | 1.583 | 0.326 | 0.043 | 0.011 | 4 |
| survey (6/6) | 2.970 | 0.334 | 0.075 | 0.013 | 6 |
| asia (8/8) | 2.999 | 0.678 | 0.134 | 0.023 | 6 |
| sachs (11/17) | 3.096 | 2.225 | 4.495 | 0.142 | 32 |
| child (20/25) | 56.050 | 18.118 | 14.298 | 0.619 | 23 |
| insurance (27/52) | 118.203 | 29.115 | 25.032 | 1.377 | 18 |
| water (32/66) | 1.553 | 2.839 | 4.276 | 0.316 | 14 |
| alarm (37/46) | 110.337 | 7.908 | 14.123 | 0.857 | 16 |
| barley (48/84) | 493.209 | 766.548 | 97.113 | 3.430 | 28 |
| hailfinder (56/66) | 757.147 | 5.843 | 18.300 | 0.875 | 21 |
| hepar2 (70/123) | / | 83.793 | 282.980 | 9.508 | 30 |
| win95pts (76/112) | / | 17.492 | 73.937 | 3.395 | 22 |
| munin1 (186/273) | / | 2258.087 | 8580.979 | 145.942 | 59 |
| andes (223/338) | / | 191.619 | 1456.823 | 27.463 | 53 |
  3. Test FCI speed on discrete datasets:
  • Similar to the above:
python -m unittest TestFCI.TestFCI.test_bnlearn_discrete_datasets
  • Result:
| data (#nodes/#edges) | pcalg-R (sec) | Tetrad-java (sec) | causal-learn-old (sec) | causal-learn-new (sec) | ~x times faster than before |
| --- | --- | --- | --- | --- | --- |
| cancer (5/4) | 1.942 | 0.514 | 0.054 | 0.002 | 23 |
| earthquake (5/4) | 1.717 | 0.339 | 0.055 | 0.002 | 32 |
| survey (6/6) | 2.528 | 0.244 | 0.071 | 0.003 | 21 |
| asia (8/8) | 3.165 | 0.600 | 0.153 | 0.031 | 5 |
| sachs (11/17) | 44.944 | 2.090 | 6.014 | 0.085 | 71 |
| child (20/25) | 88.366 | 4.191 | 19.557 | 0.687 | 28 |
| insurance (27/52) | 219.912 | 7.686 | 56.823 | 1.764 | 32 |
| water (32/66) | 1.376 | 1.513 | 6.619 | 0.364 | 18 |
| alarm (37/46) | 169.848 | 3.253 | 23.864 | 0.980 | 24 |
| barley (48/84) | 665.854 | 152.248 | 275.160 | 4.902 | 56 |
| hailfinder (56/66) | / | / | 38.710 | 1.704 | 23 |
| hepar2 (70/123) | / | 64.398 | 597.187 | 10.733 | 56 |
| win95pts (76/112) | / | 9.538 | 138.113 | 4.093 | 34 |
| munin1 (186/273) | / | 611.075 | >6 hrs | 278.110 | >78 |
| andes (223/338) | / | 86.151 | 3083.325 | 39.021 | 79 |
  4. Test FCI on a continuous dataset:
python -m unittest TestFCI.TestFCI.test_large_continuous_dataset

On ./data_linear_10.txt, the old FCI takes 4.41656 s and the new FCI takes 1.87639 s. This difference comes from the cache alone, since the matrix-operation change only affects discrete CI tests.

Also check out

Thanks to Wei for the earlier optimization on FCI at commit 9b8f06f. Comparing release 0.1.2.0 with 0.1.1.9, FCI is also dozens of times faster.

@kunwuz merged commit 1e6aa46 into py-why:main Dec 19, 2021