
[ENH] Classifier CMI test #85

Merged: 10 commits, Jan 17, 2023

Conversation

adam2392 (Collaborator) commented Jan 9, 2023

Signed-off-by: Adam Li adam2392@gmail.com

Addresses the CMI part of #17
Closes: #61

Changes proposed in this pull request:

  • Implements a classifier-based approach for estimating CMI, which in turn can be used for CI testing
  • Refactors kernel and Monte Carlo utility functions out of their original location in utils.py
  • Refactors the construction of training/testing datasets for CI testing into i) a kNN generation function, ii) a nearest-neighbor permutation function, and iii) a dataset partition function. These can be used independently in different algorithms (e.g. CMIKNN, CCIT, CCMI).
  • Reformatted GIN test files with black

Note: in the publication, they also support using a neural network approach, but that is outside the scope of this PR.
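As a concrete illustration of the classifier-based idea (a minimal sketch with illustrative names, not the dodiscover API): build a second dataset in which X is permuted within kNN neighborhoods of Z, label original rows 1 and permuted rows 0, and train a classifier. Held-out accuracy near chance (0.5) is evidence for X ⊥ Y | Z.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def knn_permute(x, z, k=5, rng=None):
    """Swap each row of x with a random one of its k nearest neighbors
    in z-space, approximately preserving the distribution of X given Z."""
    rng = np.random.default_rng(rng)
    _, idx = NearestNeighbors(n_neighbors=k).fit(z).kneighbors(z)
    choice = idx[np.arange(len(z)), rng.integers(0, k, size=len(z))]
    return x[choice]

def classifier_ci_test(x, y, z, k=5, seed=0):
    """Held-out accuracy of a classifier separating the original joint
    (x, y, z) from the conditionally permuted one."""
    rng = np.random.default_rng(seed)
    orig = np.hstack([x, y, z])
    perm = np.hstack([knn_permute(x, z, k=k, rng=rng), y, z])
    data = np.vstack([orig, perm])
    labels = np.r_[np.ones(len(orig)), np.zeros(len(perm))]
    tr_X, te_X, tr_y, te_y = train_test_split(
        data, labels, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(random_state=seed).fit(tr_X, tr_y)
    return clf.score(te_X, te_y)  # near 0.5 suggests X ⊥ Y | Z

# toy data where X and Y are conditionally independent given Z
rng = np.random.default_rng(0)
z = rng.normal(size=(600, 1))
x = z + 0.1 * rng.normal(size=(600, 1))
y = z + 0.1 * rng.normal(size=(600, 1))
acc = classifier_ci_test(x, y, z)
```

The permutation step is the same kNN trick the refactored helpers expose; any scikit-learn classifier could stand in for the random forest here.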

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

@adam2392 adam2392 marked this pull request as draft January 9, 2023 19:54
codecov-commenter commented Jan 9, 2023

Codecov Report

Merging #85 (1e0729b) into main (da36d83) will increase coverage by 0.10%.
The diff coverage is 92.65%.

@@            Coverage Diff             @@
##             main      #85      +/-   ##
==========================================
+ Coverage   88.24%   88.35%   +0.10%     
==========================================
  Files          26       28       +2     
  Lines        1650     1760     +110     
  Branches      267      276       +9     
==========================================
+ Hits         1456     1555      +99     
- Misses        114      123       +9     
- Partials       80       82       +2     
Impacted Files                   Coverage Δ
dodiscover/ci/ccmi_test.py       88.15% <88.15%> (ø)
dodiscover/ci/monte_carlo.py     89.65% <89.65%> (ø)
dodiscover/ci/base.py            92.68% <94.44%> (+9.34%) ⬆️
dodiscover/cd/bregman.py        100.00% <100.00%> (ø)
dodiscover/cd/kernel_test.py     97.50% <100.00%> (ø)
dodiscover/ci/clf_test.py        90.47% <100.00%> (-1.13%) ⬇️
dodiscover/ci/cmi_test.py        94.79% <100.00%> (-1.08%) ⬇️
dodiscover/ci/kernel_test.py     91.12% <100.00%> (ø)
dodiscover/ci/kernel_utils.py    87.32% <100.00%> (ø)


robertness
robertness previously approved these changes Jan 11, 2023
@robertness robertness left a comment

The binary p-value is weird. The reference paper says that the CMI estimate itself is a substitute for the p-value. Perhaps return the CMI estimate? It also says you could do a type of bootstrap estimate of the p-value by applying this method to many permuted datasets. Maybe explain this in the docstrings?

Other than that, LGTM.

adam2392 (Collaborator, Author) commented Jan 11, 2023

The binary p-value is weird. The reference paper says that the CMI estimate itself is a substitute for the p-value. Perhaps return the CMI estimate? It also says you could do a type of bootstrap estimate of the p-value by applying this method to many permuted datasets. Maybe explain this in the docstrings?

Yeah, sorry, this is still a draft, so it's a work in progress. I'll ping you when this is ready for a more in-depth review.

I need to finish the actual implementation and add some unit tests.

@adam2392 adam2392 mentioned this pull request Jan 11, 2023
@adam2392 adam2392 changed the title [ENH, wip] Classifier CMI test [DRAFT ENH] Classifier CMI test Jan 12, 2023
@adam2392 adam2392 changed the title [DRAFT ENH] Classifier CMI test [ENH] Classifier CMI test Jan 12, 2023
adam2392 (Collaborator, Author) commented:

Now I just need to add a unit test.

adam2392 (Collaborator, Author) commented:
The binary p-value is weird. The reference paper says that the CMI estimate itself is a substitute for the p-value. Perhaps return the CMI estimate? It also says you could do a type of bootstrap estimate of the p-value by applying this method to many permuted datasets. Maybe explain this in the docstrings?

The CMI value isn't bounded to [0, 1], so they binarize it using a threshold. Alternatively, similar to CCIT, we can compute a null distribution. However, I found this isn't very robust, so I am noting that in the docs.
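The null-distribution alternative discussed here can be sketched generically (this is a recipe, not the code in this PR; the estimator and permutation scheme are plug-in assumptions): re-estimate the statistic on many datasets where X is shuffled consistently with Z, and take the p-value as the fraction of null estimates at least as large as the observed one.

```python
import numpy as np

def permutation_pvalue(estimator, x, y, z, permute_fn, n_perms=200, seed=0):
    """Generic permutation p-value for a CMI-style test statistic.

    `estimator(x, y, z)` returns the observed statistic;
    `permute_fn(x, z, rng)` returns x shuffled consistently with z
    (e.g. within kNN neighborhoods), so the null enforces X ⊥ Y | Z.
    """
    rng = np.random.default_rng(seed)
    observed = estimator(x, y, z)
    null = np.array([
        estimator(permute_fn(x, z, rng), y, z) for _ in range(n_perms)
    ])
    # add-one smoothing keeps the p-value away from exactly 0
    return (1 + np.sum(null >= observed)) / (1 + n_perms)

# toy check with hypothetical plug-in pieces: a correlation "statistic"
# and a plain marginal permutation (a real test would permute within Z)
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
y = x + 0.01 * rng.normal(size=(200, 1))   # strong X-Y dependence
z = rng.normal(size=(200, 1))
est = lambda x, y, z: abs(np.corrcoef(x.ravel(), y.ravel())[0, 1])
perm = lambda x, z, r: x[r.permutation(len(x))]
p = permutation_pvalue(est, x, y, z, perm, n_perms=50)
```

With strongly dependent X and Y the observed statistic should dominate every permuted one, giving the smallest attainable p-value, 1/(n_perms + 1).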

@robertness robertness left a comment
LGTM. One comment: the CMI value is calculated as follows:

val = hxyz - (hxz + hyz - hz).mean()

Doesn't this have an asymptotic chi-squared distribution under the null hypothesis? If so, should there be an option to calculate the p-value that way?
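For discrete variables this intuition can be made concrete: the likelihood-ratio statistic G = 2·N·CMI (with CMI in nats) is asymptotically χ²-distributed under the null with (|X|−1)(|Y|−1)·|Z| degrees of freedom. The sketch below illustrates that route on a discretized plug-in estimate; it is not the kNN-entropy estimator in this PR, where the asymptotics are less clear-cut.

```python
import numpy as np
from scipy.stats import chi2

def discrete_cmi(x, y, z):
    """Plug-in estimate of I(X;Y|Z) in nats for integer-coded 1-D arrays."""
    cmi = 0.0
    for zv in np.unique(z):
        m = z == zv
        pz = m.mean()
        xs, ys = x[m], y[m]
        for xv in np.unique(xs):
            for yv in np.unique(ys):
                pxy = np.mean((xs == xv) & (ys == yv))
                if pxy == 0.0:
                    continue
                px, py = np.mean(xs == xv), np.mean(ys == yv)
                cmi += pz * pxy * np.log(pxy / (px * py))
    return cmi

def chi2_pvalue(x, y, z):
    """Asymptotic p-value for H0: X ⊥ Y | Z via 2*N*CMI ~ chi2(dof)."""
    n = len(x)
    dof = (len(np.unique(x)) - 1) * (len(np.unique(y)) - 1) * len(np.unique(z))
    g_stat = 2.0 * n * discrete_cmi(x, y, z)
    return chi2.sf(g_stat, dof)

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 1000)
x = rng.integers(0, 2, 1000)
p_dep = chi2_pvalue(x, x.copy(), z)                 # Y = X: strong dependence
p_ind = chi2_pvalue(rng.integers(0, 2, 1000),
                    rng.integers(0, 2, 1000), z)    # independent draws
```

For the kNN-based estimator an option like this would presumably need either discretization or a permutation calibration rather than the χ² reference distribution directly.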
