
[ENH] Add the classifier CI test #28

Merged

@adam2392 merged 21 commits into py-why:main from ccit on Sep 22, 2022
Conversation

@adam2392 (Collaborator) commented Aug 24, 2022

Addresses part of #17

Changes proposed in this pull request:

  • Implements the proposed CCIT test, a nonparametric conditional independence test that allows multivariate X, Y, and Z
  • Adds a simulation function from the CCIT paper
  • The nearest-neighbor sampling procedure proposed to generate samples from the null hypothesis $X \perp Y | Z$ (sketched below) is possibly of independent interest beyond just the CCIT function. I remember seeing the same algorithm in the CCMI paper too, which I am interested in for my projects. I will probably PR that sometime soon too.
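For intuition, a minimal sketch of that nearest-neighbor null-sampling idea, assuming the 1-NN variant from the CCIT paper (illustrative only, not this PR's implementation): keep each (X_i, Z_i) pair but swap in the Y value from the row whose Z is closest to Z_i, which approximately breaks any dependence between X and Y beyond what Z explains.

    from sklearn.neighbors import NearestNeighbors

    def sample_null(X, Y, Z):
        """Approximate samples from the null X _||_ Y | Z via 1-NN in Z."""
        # Ask for 2 neighbors: the closest neighbor of each point is itself.
        _, idx = NearestNeighbors(n_neighbors=2).fit(Z).kneighbors(Z)
        nn_idx = idx[:, 1]  # nearest neighbor other than the point itself
        return X, Y[nn_idx], Z  # rows are (X_i, Y_{nn(i)}, Z_i)

A classifier that cannot distinguish the original rows from these swapped rows (accuracy near 0.5) is then evidence for conditional independence.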

A note on class/function design for CI tests:

  • I'm more of a fan of a class-based API for CI tests vs. a function-based one, and this PR helps demonstrate why: an sklearn classifier has so many parameters, and ideally a user can instantiate it separately from the CI estimator parameters.

Before submitting

  • I've read and followed all steps in the Making a pull request section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

@codecov-commenter commented Aug 24, 2022

Codecov Report

Merging #28 (7a56182) into main (5b7465b) will increase coverage by 3.43%.
The diff coverage is 85.14%.

@@            Coverage Diff             @@
##             main      #28      +/-   ##
==========================================
+ Coverage   68.04%   71.47%   +3.43%     
==========================================
  Files          13       16       +3     
  Lines         679      852     +173     
  Branches      126      142      +16     
==========================================
+ Hits          462      609     +147     
- Misses        175      193      +18     
- Partials       42       50       +8     
Impacted Files                 Coverage Δ
dodiscover/ci/typing.py        75.00% <75.00%> (ø)
dodiscover/ci/clf_test.py      83.20% <83.20%> (ø)
dodiscover/ci/simulate.py      93.75% <93.75%> (ø)
dodiscover/ci/kernel_test.py   90.34% <100.00%> (+0.13%) ⬆️


@bloebp (Member) commented Aug 31, 2022

I'm more of a fan of a class-based API for CI tests vs. a function-based one, and this PR helps demonstrate why: an sklearn classifier has so many parameters, and ideally a user can instantiate it separately from the CI estimator parameters.

There might be a misconception about the functional form: there is no need to ask for model parameters in the CI function. For instance:

def my_ci_test(X, Y, Z, ml_model):
    ...
    ml_model.fit(...)
    ...

# The model is configured once, outside the CI test function.
mdl = MyComplexMLModel(**millions_of_parameters)

p_value = my_ci_test(X, Y, Z, mdl)

You can also accept a factory that generates a user-specific ML model, instead of expecting the model parameters as inputs to the function.

Ultimately, you would only gain something from the class-based approach when you plan to reuse the object and want to save some parameter passing (which is, of course, a valid argument). However, we should then make sure that these tests are stateless.
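For concreteness, a minimal sketch of that factory variant, reusing the hypothetical names from the example above:

    def my_ci_test(X, Y, Z, make_model):
        # A fresh, fully configured model per call keeps the test stateless.
        ml_model = make_model()
        ml_model.fit(...)
        ...

    p_value = my_ci_test(X, Y, Z, lambda: MyComplexMLModel(**millions_of_parameters))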

@petergtz (Member) commented

Ultimately, you would only gain something from the class-based approach when you plan to reuse the object and want to save some parameter passing (which is, of course, a valid argument). However, we should then make sure that these tests are stateless.

Agree with @bloebp here. Classes make sense when something requires a longer life-span, e.g. because you want to pass it around and it gets invoked somewhere other than where it is constructed. This is typically also the case in bigger systems where you want to separate the wiring of objects from invoking them. Another case is when something requires multiple steps to build, such as a graph object.

But in scenarios where we instantiate a class and then immediately invoke the object, as in the unit test:

    ci_estimator = ClassifierCITest(clf, random_state=rng)

    _, pvalue = ci_estimator.test(df, {"x"}, {"x1"})
    assert pvalue > 0.05
    _, pvalue = ci_estimator.test(df, {"x"}, {"z"})
    assert pvalue < 0.05

the benefit of a class seems really questionable. It seems like this could be written as:

    _, pvalue = classifier_ci_test(df, {"x"}, {"x1"}, clf)
    assert pvalue > 0.05
    _, pvalue = classifier_ci_test(df, {"x"}, {"z"}, clf)
    assert pvalue < 0.05

And this would save a lot of bookkeeping code in the implementation too.

What do you think?

@adam2392 (Collaborator, Author) commented Aug 31, 2022

And this would save a lot of bookkeeping code in the implementation too.

What do you think?

Okay, this makes a lot of sense! Let me open another issue to track refactoring the CI tests into functions. Also tagging #26, since we discussed moving these altogether from dodiscover + dowhy to another repo.

Is it okay if we leave them as-is in this repo, so I can convert them all at once?

If so, I will fix some of the documentation issues that were raised in the review, and then refactor the classes into functions in the next PR.

@robertness (Collaborator) left a comment

Could we expand this method to work with a neural-net-based classifier using Keras or PyTorch? It may feel like overkill, but it would be a first step toward telling a story to the broader community that this repo uses cutting-edge "deep" methods. We could also workshop an example with multidimensional variables later (e.g. pixels), though not in this PR.
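One possible route, assuming the test accepts any classifier exposing sklearn's fit/predict interface, would be to wrap a PyTorch module with skorch; a rough sketch (not part of this PR, layer sizes are placeholders):

    import torch.nn as nn
    from skorch import NeuralNetClassifier

    class MLP(nn.Module):
        def __init__(self, n_features=10, n_hidden=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, n_hidden),
                nn.ReLU(),
                nn.Linear(n_hidden, 2),  # two classes: original vs. null sample
            )

        def forward(self, X):
            return self.net(X)

    # skorch exposes sklearn's fit/predict API around the torch module, so the
    # wrapped network could plug in wherever an sklearn-style classifier is expected.
    clf = NeuralNetClassifier(MLP, criterion=nn.CrossEntropyLoss, max_epochs=20, lr=0.01)

A Keras model could presumably be wrapped in a similar way with scikeras.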

@adam2392 (Collaborator, Author) commented Sep 20, 2022

I've addressed the comments and now just need a PR approval to move forward. Just a note that the poetry.lock update accounts for over 500 LOC of the diff, from adding flaky; the actual CCIT implementation and unit tests are only around 300-400 LOC.

robertness previously approved these changes Sep 21, 2022

@robertness (Collaborator) left a comment

I think this is in a good place and we can get it merged. If we need to make changes later, we can.

@adam2392 (Collaborator, Author) commented
@robertness can you re-approve this PR? I had to merge in changes from main and resolve conflicts, which dismissed your approval.

Not sure why, though... it seems slightly redundant, since one fairly often needs to rebase/merge in changes from main. I guess it's there to protect against conflicting changes.

@robertness (Collaborator) left a comment

LGTM

@adam2392 merged commit 0771188 into py-why:main on Sep 22, 2022
@adam2392 deleted the ccit branch on September 22, 2022, 20:24
darthtrevino pushed a commit that referenced this pull request Sep 27, 2022
* Add classifier CI test (CCIT) and unit test
* Add a simulation module for simulating nonlinear additive noise models
* Add flaky to the CCIT unit test suite
* Address typing issues with regard to adding sklearn and pytorch NN modules as the "classifier"

Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Chris Trevino <darthtrevino@gmail.com>