-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REFACTOR] Fix types for CI tests and refactor types of inputs for CI tests #25
Conversation
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
This should address some issues raised in the OG PR #18. Lmk wdyt. |
Codecov Report
@@ Coverage Diff @@
## main #25 +/- ##
===========================================
+ Coverage 60.00% 76.83% +16.83%
===========================================
Files 1 8 +7
Lines 5 367 +362
Branches 0 59 +59
===========================================
+ Hits 3 282 +279
- Misses 2 66 +64
- Partials 0 19 +19
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Signed-off-by: Adam Li <adam2392@gmail.com>
dodiscover/ci/base.py
Outdated
self, df: pd.DataFrame, x_var: Any, y_var: Any, z_covariates: Any = None | ||
self, | ||
df: pd.DataFrame, | ||
x_var: Column, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, I just realized you expect the column name of the df
here and not the data itself (was thinking of the API as here https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/kernel.py#L25). But then you would need to allow multiple names, similar to z_covariates: Set[Column]
, since your test can involve multivariate inputs (the independence test itself then would need to throw an error if it doesn't support it. But most of them do, like HSIC or KCI).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now and simplicity sake, I left the X/Y variables as a univariate variable column in the dataframe. We can definitely extend later and involve multivariate inputs as you said (and error otw).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would already limit it for now? For instance, this would not allow to verify a local Markov condition where you would need to check independence between a target and multiple non-descendant nodes given its parents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that be possible by just running each test multiple times?
But yeah I guess I'm just not 100% sure how to extend these tests for handling multivariate input slash which ones are capable. That was my main bottleneck tbh. Was hoping someone with more expertise in the community there can help.
E.g. Does FisherZ, G test handle multivariate, or should we just raise an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that be possible by just running each test multiple times?
No, testing variables pairwise vs jointly is not the same. For instance, in case of an XOR Z := X ⊕ Y
, the independence test would indicate that Z
and X
or Z
and Y
are independent, but it would correctly indicate that Z
and (X, Y)
are dependent. See for instance this code snippet:
import numpy as np
from dowhy.gcm import independence_test
X = np.random.choice(2, 500)
Y = np.random.choice(2, 500)
Z = np.bitwise_xor(X, Y)
print(independence_test(X, Z))
>>> 0.65
print(independence_test(np.column_stack([X, Y]), Z))
>>> 0.0
Therefore, you (unfortunately) need to add all variables.
I am more familiar with the kernel tests (where multivariate inputs are not a problem), see for instance:
- HSIC: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/kernel.py#L338
- KCI: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/kernel.py#L232
- RCIT: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/kernel.py#L493
- RIT: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/kernel.py#L431
- Regression based: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/independence_test/regression.py#L14
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah that's true. Good example! Hmm okay how about I enforce an error currently for multivariate input for these Fisher/G^2, but allow it in general because I think my implementation of KCI is essentially the same. I'll modify the typing now. In that case, x_vars
and y_vars
will have type Set[Column]
.
^ Now that you bring up those tests, I think it might also make sense to refactor a copy into dodiscover (for now). With the goal of eventually offloading them in general. Just so dodiscover can run structure learning w/o needing to import everything in dowhy. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead of a creating a copy, we can already discuss the creation of a separate package. I would expect that we get more tests over time and independence testing is a crucial feature for various functionalities (not only causal discovery). Here, I would also prefer to utilize the HSIC and KCI implementations from the causal-learn package, since they are well maintained and optimized. I already prepared a wrapper for that previously, but couldn't use it due to the license. Now, this is possible.
Lets for now keep it as it is and pick this up in our next meeting. If we all agree, I think we can create the new package rather quickly and move the parts over. This also gives a chance to revisit the API (somehow, I missed the previous discussion on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Adam Li <adam2392@gmail.com>
I will merge this for the sake of consolidating the skeleton PR. Since we might move the CI tests anyways, these can easily be changed when we move them to a new PR. Moreover, if there's any high level design issues, then I can refactor them then. |
… tests (py-why#25) - Improve type checking for CI tests - add fingerprint for circleCI Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li adam2392@gmail.com
Just adds the necessary types for each CI test.
Changes proposed in this pull request:
Before submitting
section of the
CONTRIBUTING
docs.Writing docstrings section of the
CONTRIBUTING
docs.After submitting