
[ENH] Classifier CMI test #85

Merged: 10 commits, Jan 17, 2023

Conversation

adam2392 (Collaborator) commented Jan 9, 2023

Signed-off-by: Adam Li adam2392@gmail.com

Addresses the CMI part of #17
Closes: #61

Changes proposed in this pull request:

  • Implements a classifier-based approach for estimating CMI, which in turn can be used for CI testing
  • Refactors kernel and Monte Carlo utility functions out of their original location in utils.py
  • Refactors the construction of training/testing datasets for CI testing into i) a kNN generation function, ii) a nearest-neighbor permutation function, and iii) a dataset partition function. These can be used independently in different algorithms (e.g. CMIKNN, CCIT, CCMI).
  • Reformatted GIN test files with black

Note: in the publication, they also support using a neural network approach, but that is outside the scope of this PR.
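As a concrete illustration of the classifier-based idea (a minimal sketch with illustrative names, not the dodiscover API): build a second dataset in which X is permuted within kNN neighborhoods of Z, label original rows 1 and permuted rows 0, and train a classifier. Held-out accuracy near chance (0.5) is evidence for X ⊥ Y | Z.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def knn_permute(x, z, k=5, rng=None):
    """Swap each row of x with a random one of its k nearest neighbors
    in z-space, approximately preserving the distribution of X given Z."""
    rng = np.random.default_rng(rng)
    _, idx = NearestNeighbors(n_neighbors=k).fit(z).kneighbors(z)
    choice = idx[np.arange(len(z)), rng.integers(0, k, size=len(z))]
    return x[choice]

def classifier_ci_test(x, y, z, k=5, seed=0):
    """Held-out accuracy of a classifier separating the original joint
    (x, y, z) from the conditionally permuted one."""
    rng = np.random.default_rng(seed)
    orig = np.hstack([x, y, z])
    perm = np.hstack([knn_permute(x, z, k=k, rng=rng), y, z])
    data = np.vstack([orig, perm])
    labels = np.r_[np.ones(len(orig)), np.zeros(len(perm))]
    tr_X, te_X, tr_y, te_y = train_test_split(
        data, labels, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(random_state=seed).fit(tr_X, tr_y)
    return clf.score(te_X, te_y)  # near 0.5 suggests X ⊥ Y | Z

# toy data where X and Y are conditionally independent given Z
rng = np.random.default_rng(0)
z = rng.normal(size=(600, 1))
x = z + 0.1 * rng.normal(size=(600, 1))
y = z + 0.1 * rng.normal(size=(600, 1))
acc = classifier_ci_test(x, y, z)
```

The permutation step is the same kNN trick the refactored helpers expose; any scikit-learn classifier could stand in for the random forest here.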

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

@adam2392 adam2392 marked this pull request as draft January 9, 2023 19:54
codecov-commenter commented Jan 9, 2023

Codecov Report

Merging #85 (1e0729b) into main (da36d83) will increase coverage by 0.10%.
The diff coverage is 92.65%.

@@            Coverage Diff             @@
##             main      #85      +/-   ##
==========================================
+ Coverage   88.24%   88.35%   +0.10%     
==========================================
  Files          26       28       +2     
  Lines        1650     1760     +110     
  Branches      267      276       +9     
==========================================
+ Hits         1456     1555      +99     
- Misses        114      123       +9     
- Partials       80       82       +2     
Impacted Files                   Coverage Δ
dodiscover/ci/ccmi_test.py       88.15% <88.15%> (ø)
dodiscover/ci/monte_carlo.py     89.65% <89.65%> (ø)
dodiscover/ci/base.py            92.68% <94.44%> (+9.34%) ⬆️
dodiscover/cd/bregman.py        100.00% <100.00%> (ø)
dodiscover/cd/kernel_test.py     97.50% <100.00%> (ø)
dodiscover/ci/clf_test.py        90.47% <100.00%> (-1.13%) ⬇️
dodiscover/ci/cmi_test.py        94.79% <100.00%> (-1.08%) ⬇️
dodiscover/ci/kernel_test.py     91.12% <100.00%> (ø)
dodiscover/ci/kernel_utils.py    87.32% <100.00%> (ø)


robertness
robertness previously approved these changes Jan 11, 2023
@robertness robertness left a comment

The binary p-value is weird. The reference paper says that the CMI estimate itself is a substitute for the p-value. Perhaps return the CMI estimate? It also says you could do a type of bootstrap estimate of the p-value by applying this method to many permuted datasets. Maybe explain this in the docstrings?

Other than that, LGTM.

adam2392 (Collaborator, Author) commented Jan 11, 2023

The binary p-value is weird. The reference paper says that the CMI estimate itself is a substitute for the p-value. Perhaps return the CMI estimate? It also says you could do a type of bootstrap estimate of the p-value by applying this method to many permuted datasets. Maybe explain this in the docstrings?

Yeah, sorry, this is still a draft, so it's a work in progress. I'll ping you when this is ready for a more in-depth review.

I need to finish the actual implementation and add some unit tests.

@adam2392 adam2392 mentioned this pull request Jan 11, 2023
@adam2392 adam2392 changed the title [ENH, wip] Classifier CMI test [DRAFT ENH] Classifier CMI test Jan 12, 2023
@adam2392 adam2392 changed the title [DRAFT ENH] Classifier CMI test [ENH] Classifier CMI test Jan 12, 2023
adam2392 (Collaborator, Author) commented:

Now I just need to add a unit test.

adam2392 (Collaborator, Author) commented:
The binary p-value is weird. The reference paper says that the CMI estimate itself is a substitute for the p-value. Perhaps return the CMI estimate? It also says you could do a type of bootstrap estimate of the p-value by applying this method to many permuted datasets. Maybe explain this in the docstrings?

The CMI value isn't bounded to [0, 1], so they binarize it using a threshold. Alternatively, similar to CCIT, we can compute a null distribution. However, I found this isn't very robust, so I am noting that in the docs.
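The null-distribution alternative discussed here can be sketched generically (this is a recipe, not the code in this PR; the estimator and permutation scheme are plug-in assumptions): re-estimate the statistic on many datasets where X is shuffled consistently with Z, and take the p-value as the fraction of null estimates at least as large as the observed one.

```python
import numpy as np

def permutation_pvalue(estimator, x, y, z, permute_fn, n_perms=200, seed=0):
    """Generic permutation p-value for a CMI-style test statistic.

    `estimator(x, y, z)` returns the observed statistic;
    `permute_fn(x, z, rng)` returns x shuffled consistently with z
    (e.g. within kNN neighborhoods), so the null enforces X ⊥ Y | Z.
    """
    rng = np.random.default_rng(seed)
    observed = estimator(x, y, z)
    null = np.array([
        estimator(permute_fn(x, z, rng), y, z) for _ in range(n_perms)
    ])
    # add-one smoothing keeps the p-value away from exactly 0
    return (1 + np.sum(null >= observed)) / (1 + n_perms)

# toy check with hypothetical plug-in pieces: a correlation "statistic"
# and a plain marginal permutation (a real test would permute within Z)
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
y = x + 0.01 * rng.normal(size=(200, 1))   # strong X-Y dependence
z = rng.normal(size=(200, 1))
est = lambda x, y, z: abs(np.corrcoef(x.ravel(), y.ravel())[0, 1])
perm = lambda x, z, r: x[r.permutation(len(x))]
p = permutation_pvalue(est, x, y, z, perm, n_perms=50)
```

With strongly dependent X and Y the observed statistic should dominate every permuted one, giving the smallest attainable p-value, 1/(n_perms + 1).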

@robertness robertness left a comment
LGTM. One comment: the CMI value is calculated as follows:

val = hxyz - (hxz + hyz - hz).mean()

Doesn't this have an asymptotic chi-squared distribution under the null hypothesis? If so, should there be an option to calculate the p-value that way?
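For discrete variables this intuition can be made concrete: the likelihood-ratio statistic G = 2·N·CMI (with CMI in nats) is asymptotically χ²-distributed under the null with (|X|−1)(|Y|−1)·|Z| degrees of freedom. The sketch below illustrates that route on a discretized plug-in estimate; it is not the kNN-entropy estimator in this PR, where the asymptotics are less clear-cut.

```python
import numpy as np
from scipy.stats import chi2

def discrete_cmi(x, y, z):
    """Plug-in estimate of I(X;Y|Z) in nats for integer-coded 1-D arrays."""
    cmi = 0.0
    for zv in np.unique(z):
        m = z == zv
        pz = m.mean()
        xs, ys = x[m], y[m]
        for xv in np.unique(xs):
            for yv in np.unique(ys):
                pxy = np.mean((xs == xv) & (ys == yv))
                if pxy == 0.0:
                    continue
                px, py = np.mean(xs == xv), np.mean(ys == yv)
                cmi += pz * pxy * np.log(pxy / (px * py))
    return cmi

def chi2_pvalue(x, y, z):
    """Asymptotic p-value for H0: X ⊥ Y | Z via 2*N*CMI ~ chi2(dof)."""
    n = len(x)
    dof = (len(np.unique(x)) - 1) * (len(np.unique(y)) - 1) * len(np.unique(z))
    g_stat = 2.0 * n * discrete_cmi(x, y, z)
    return chi2.sf(g_stat, dof)

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 1000)
x = rng.integers(0, 2, 1000)
p_dep = chi2_pvalue(x, x.copy(), z)                 # Y = X: strong dependence
p_ind = chi2_pvalue(rng.integers(0, 2, 1000),
                    rng.integers(0, 2, 1000), z)    # independent draws
```

For the kNN-based estimator an option like this would presumably need either discretization or a permutation calibration rather than the χ² reference distribution directly.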
