PC tutorial using ASIA data #67

Merged
merged 8 commits into from
Jan 3, 2023

Collaborator

@robertness robertness commented Dec 8, 2022

Fixes issue 66.

Changes proposed in this pull request:
Adds a tutorial for the PC algo on a public dataset.

How to review this PR

Here is the learned CPDAG on the ASIA network.

image

Here is the comparable result from pc.stable in the bnlearn package.

image

So our performance is similar to bnlearn's implementation. But it doesn't reconstruct the graph very well. Here is the ground truth network for reference:

image

So it is not doing as well on this data, and therefore the tutorial is not telling a compelling story. I suspect we could improve the causal discovery and the narrative if we add some constraints. Any suggestions?

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

@adam2392
Collaborator

adam2392 commented Dec 8, 2022

Oh interesting, so our PC algo incorrectly orients L -> S <- B, whereas it should be the other way around. This means that the algorithm incorrectly found that $L \perp B$ and $L \not\perp B | S$.
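To make the link between separating sets and the orientation concrete, here is a minimal sketch of the collider rule on a toy three-node skeleton. The function `orient_colliders` and the dictionary layout are hypothetical and written only for illustration; this is not dodiscover's implementation.

```python
from itertools import combinations


def orient_colliders(skeleton, sepsets):
    """Return directed edges (tail, head) implied by the collider rule.

    skeleton: dict mapping each node to the set of its neighbours.
    sepsets: dict mapping frozenset({a, b}) to the set that separated a and b.
    """
    directed = set()
    for mid, nbrs in skeleton.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in skeleton[a]:
                continue  # shielded triple: a and b are adjacent, rule does not apply
            sep = sepsets.get(frozenset((a, b)))
            # Orient a -> mid <- b only when mid is NOT in the separating set
            if sep is not None and mid not in sep:
                directed.add((a, mid))
                directed.add((b, mid))
    return directed


# Toy skeleton containing only the triple L - S - B
skeleton = {"L": {"S"}, "S": {"L", "B"}, "B": {"S"}}

# If the tests find L independent of B given S, nothing is oriented here
correct = orient_colliders(skeleton, {frozenset(("L", "B")): {"S"}})

# If the tests wrongly find L independent of B marginally, S becomes a collider
wrong = orient_colliders(skeleton, {frozenset(("L", "B")): set()})

print(correct)
print(wrong)
```

So an empty separating set for (L, B) is enough to produce L -> S <- B, which is why the stored separating sets are worth inspecting.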

Are the alpha levels and CI tests the same used in dodiscover and R's pcstable?

Re constraints:

  • We can impose the constraint that S must cause L and B because smoking causes lung cancer and not the other way around. We should note that the constraint may make the resulting graph not a valid CPDAG.
  • Another thing we could do is just note this as an imperfection of causal discovery algorithms when data is unfaithful. I think IIRC, the ASIA dataset is unfaithful. This is seen because it is very hard to detect the edge between E and X.
  • Another thing we can do is also run the ConservativePC to demonstrate that the CPDAG there is more "robust"
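For the last bullet, a rough sketch of the conservative orientation check may help: instead of trusting the single separating set found during skeleton learning, it looks at every separating set among subsets of the neighbours and only orients when they all agree. The names `classify_triple` and `make_oracle` are hypothetical, the independence oracle stands in for a real CI test, and this is not the ConservativePC code in dodiscover.

```python
from itertools import chain, combinations


def powerset(items):
    """All subsets of items, as tuples, smallest first."""
    items = sorted(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def classify_triple(a, mid, b, skeleton, is_independent):
    """Classify the unshielded triple a - mid - b.

    Returns 'collider', 'non-collider' or 'ambiguous'.
    """
    separating = set()
    for node in (a, b):
        # Candidate conditioning sets are subsets of the neighbours of a or of b
        for subset in powerset(skeleton[node] - {a, b}):
            if is_independent(a, b, frozenset(subset)):
                separating.add(frozenset(subset))
    if not separating:
        return "ambiguous"
    contains_mid = [mid in s for s in separating]
    if not any(contains_mid):
        return "collider"
    if all(contains_mid):
        return "non-collider"
    return "ambiguous"  # separating sets disagree, so leave the triple unoriented


def make_oracle(separating_sets):
    """Toy independence oracle: independent exactly for the listed conditioning sets."""
    allowed = {frozenset(s) for s in separating_sets}
    return lambda a, b, cond: cond in allowed


skeleton = {"L": {"S"}, "S": {"L", "B"}, "B": {"S"}}
as_chain = classify_triple("L", "S", "B", skeleton, make_oracle([{"S"}]))
as_collider = classify_triple("L", "S", "B", skeleton, make_oracle([set()]))
conflicting = classify_triple("L", "S", "B", skeleton, make_oracle([set(), {"S"}]))
print(as_chain, as_collider, conflicting)
```

When the tests give conflicting answers the triple is marked ambiguous rather than forced into a collider, which is the sense in which the resulting graph is more "robust".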

WDYT?

@robertness robertness changed the title from "FCI tutorial using ASIA data" to "PC tutorial using ASIA data" Dec 8, 2022
@robertness
Collaborator Author

Are the alpha levels and CI tests the same used in dodiscover and R's pcstable?

I used .05 in bnlearn's pcstable. What's the default here?

@robertness
Collaborator Author

We can impose the constraint that S must cause L and B because smoking causes lung cancer and not the other way around. We should note that the constraint may make the resulting graph not a valid CPDAG.

So I once created an algorithm that would modify the CPDAG to account for edges fixed by interventions, constraints, and graph priors. Do you think we could use something like that here?

@robertness
Collaborator Author

robertness commented Dec 8, 2022

  • Another thing we could do is just note this as an imperfection of causal discovery algorithms when data is unfaithful. I think IIRC, the ASIA dataset is unfaithful. This is seen because it is very hard to detect the edge between E and X.

How about the ALARM network? I think that was the first successful use case of the PC algo.

@robertness
Collaborator Author

  • Another thing we can do is also run the ConservativePC to demonstrate that the CPDAG there is more "robust"

Can you elaborate? How would this change things?

@adam2392
Collaborator

adam2392 commented Dec 8, 2022

How about the ALARM network? I think that was the first successful use case of the PC algo.

+1

I used .05 in bnlearn's pcstable. What's the default here?

Also 0.05. Hmm perhaps there is a bug in the implementation of our CI test and/or the PC algo itself. Are you able to check through the separating sets? Cuz the skeleton learned looks the same in both, so this is good. Therefore the error must be in what the separating sets are or the orientation phase of the PC algo itself.
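As a sanity check on the CI-test side, here is a small standalone sketch of an unconditional G-square test for two binary variables, showing where the 0.05 threshold enters. It assumes both variables take exactly two observed levels (so there is 1 degree of freedom), and `g_square_binary` is a hypothetical helper, not dodiscover's GSquareCITest.

```python
import math
from collections import Counter


def g_square_binary(xs, ys):
    """G-square statistic and p-value for two binary variables (1 degree of freedom)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    g = 0.0
    for (x, y), observed in joint.items():
        expected = count_x[x] * count_y[y] / n
        g += 2.0 * observed * math.log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative values from rounding
    # Survival function of a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value


alpha = 0.05

# Perfectly balanced, independent toy data
xs = [0, 0, 1, 1] * 25
ys = [0, 1, 0, 1] * 25
g_indep, p_indep = g_square_binary(xs, ys)

# Fully dependent toy data: ys is a copy of xs
g_dep, p_dep = g_square_binary(xs, xs)

print(p_indep > alpha)  # independence is not rejected
print(p_dep < alpha)    # independence is rejected
```

Running the same toy data through both implementations would be one way to check whether the CI tests agree before looking at the orientation phase.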

So I once created an algorithm that would modify the CPDAG to account for edges fixed by interventions, constraints, and graph priors. Do you think we could use something like that here?

What does this do?

Can you elaborate? How would this change things?

If we add background knowledge, the returned graph is no longer necessarily a Markov equivalence class of the DAG. It is an esoteric point for now, but it's something to note I would say.

For example, say you get the true CPDAG:

X - Y - Z

and $X \not\perp Z, X \perp Z | Y$, then you apply prior knowledge to say $X \rightarrow Y$, then $X \rightarrow Y - Z$ is not a CPDAG for the CI statements written.

In this simple setup, assuming the conditional independences learned were correct, then you would automatically have $X \rightarrow Y \rightarrow Z$, since a collider is not possible. But in general, I think this problem is open on how to systematically combine prior knowledge w/ causal orientation rules.
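The propagation step in this small example corresponds to the first Meek orientation rule: if a -> b, b - c, and a and c are not adjacent, then orient b -> c. A minimal sketch, with a hypothetical helper `apply_meek_rule1` and edges stored as plain sets:

```python
def apply_meek_rule1(directed, undirected):
    """Repeatedly orient b - c as b -> c when a -> b and a, c are non-adjacent.

    directed: iterable of (tail, head) tuples.
    undirected: iterable of 2-tuples for undirected edges.
    """
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}

    def adjacent(u, v):
        return (u, v) in directed or (v, u) in directed or frozenset((u, v)) in undirected

    changed = True
    while changed:
        changed = False
        for a, b in list(directed):
            for edge in list(undirected):
                if b not in edge:
                    continue
                (c,) = edge - {b}
                if c != a and not adjacent(a, c):
                    # Orienting c -> b would create a new collider at b, so b -> c
                    undirected.remove(edge)
                    directed.add((b, c))
                    changed = True
    return directed, undirected


# Prior knowledge X -> Y applied to the CPDAG X - Y - Z
directed, undirected = apply_meek_rule1({("X", "Y")}, [("Y", "Z")])
print(directed)
print(undirected)
```

This only covers the one rule needed for the chain example; it says nothing about the general problem of combining prior knowledge with the full set of orientation rules.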

@robertness
Collaborator Author

What does this do?

So for score-based algorithms, one learns a DAG and then converts it to a PDAG. In bnlearn, you convert to a PDAG using cpdag. I notice now the algo has a wlbl argument which allows you to apply constraints that would force some edges to stay oriented. This is what I was thinking of. My algorithm extended from constraints to causal priors and interventions as well, but those extensions don't matter quite yet.
I suppose in a constraint-based algo like PC, we would have the constraints force certain edges to be oriented when otherwise they'd be undirected. But we're already doing this via the constraints in the context object, correct?

@robertness
Collaborator Author

robertness commented Dec 9, 2022

We can impose the constraint that S must cause L and B because smoking causes lung cancer and not the other way around. We should note that the constraint may make the resulting graph not a valid CPDAG.

Ok, I tried imposing the constraint and it doesn't seem to work. Is this a bug or did I make an error?

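# Imports were omitted in the original snippet; these paths are assumed
import networkx as nx
from dodiscover import PC, make_context
from dodiscover.ci import GSquareCITest
from pywhy_graphs.viz import draw

# `data` is the ASIA DataFrame with "yes"/"no" string values
# Background knowledge: smoking (S) causes lung cancer (L) and bronchitis (B)
included_edges = nx.DiGraph([('S', 'L'), ('S', 'B')])
context = make_context().variables(data=data).edges(include=included_edges).build()

ci_estimator = GSquareCITest(data_type="discrete")
pc = PC(ci_estimator=ci_estimator)

def convert_to_int(df):
    # Encode "yes" as 1 and anything else as 0; modifies df in place
    for var in df.columns:
        df[var] = [1 if x == "yes" else 0 for x in df[var]]
    return df
data_mod = convert_to_int(data)

pc.fit(data_mod, context)
graph = pc.graph_

draw(graph)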

image

@robertness
Collaborator Author

Created an issue to address the need to convert characters to ints in the data: #69

@adam2392
Collaborator

adam2392 commented Dec 9, 2022

We can impose the constraint that S must cause L and B because smoking causes lung cancer and not the other way around. We should note that the constraint may make the resulting graph not a valid CPDAG.

Ok, I tried imposing the constraint and it doesn't seem to work. Is this a bug or did I make an error?

I did not thoroughly test adding constraints. However, I think this is a bug in the implementation, or perhaps just a miscommunication of how the constraints are applied. This is related to #46, which we should probably revisit.

I think the issue could be that inside the skeleton learning, we have the following function:

                    # ignore fixed edges
                    if (x_var, y_var) in self.context.included_edges.edges:
                        continue

@robertness
Collaborator Author

robertness commented Dec 10, 2022

I did not thoroughly test adding constraints. However, I think this is a bug in the implementation, or perhaps just a miscommunication of how the constraints are applied. This is related to #46, which we should probably revisit.

I think the issue could be that inside the skeleton learning, we have the following function:

I'm going to create a bug issue since we have reproducibility with my above code, and link to issue 46.

@robertness
Collaborator Author

robertness commented Dec 10, 2022

@adam2392 I cleaned up the narrative to discuss the less than ideal results. I created an issue to do another notebook that demonstrates the use of constraints, once that issue is fixed. Can you approve?

@adam2392
Collaborator

@adam2392 I cleaned up the narrative to discuss the less than ideal results. I created an issue to do another notebook that demonstrates the use of constraints, once that issue is fixed. Can you approve?

Hi @robertness I will try to get to this before EOY. So my hypothesis is that there is a runtime-issue that is created when you assume edge-constraints before the skeleton is discovered.

@emrekiciman
Member

Approved. Chatting with Robert, it looks like the notebook itself is working correctly, even though it is uncovering issues in the library.

emrekiciman
emrekiciman previously approved these changes Dec 18, 2022
Member

@emrekiciman emrekiciman left a comment

Approved.

Collaborator

@adam2392 adam2392 left a comment

@robertness I have a few comments that would just make the overall maintenance easier:

  1. Is it possible to use bnlearn's API to pull asia.csv, so that it is not merged in with the source code?
# Load the ASIA example dataset through bnlearn
import bnlearn as bn

df = bn.import_example(data='asia')

Ref: https://erdogant.github.io/bnlearn/pages/html/Plot.html?highlight=asia

You would just delete the asia.csv file and then add bnlearn to the doc dev list of dependencies: [tool.poetry.group.docs.dependencies] inside the pyproject.toml file.

  2. I fixed the CI, so now there are some spelling issues caught in the notebook:
examples/notebooks/example-pc-algo.ipynb:155: distinquish ==> distinguish
examples/notebooks/example-pc-algo.ipynb:178: implemention ==> implementation
  3. If you move the notebook to doc/tutorial/markovian, then this will cleanly separate simple example scripts and more involved Jupyter notebook tutorials. I would classify this notebook as a tutorial.

Lmk if you have any questions, or think something could be changed.

@adam2392
Collaborator

Approved. Chatting with Robert, it looks like the notebook itself is working correctly, even though it is uncovering issues in the library.

Okay, sounds good to me. I think the notebook itself is fine then. I left some minor comments to make sure the tutorial sections are relatively lightweight/clean. They should be easily resolved and then I'll approve and merge!

I'll work on debugging moving the edge constraints to post-skeleton-discovery.
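Roughly, applying the edge constraints after skeleton discovery could look like the sketch below: learn the undirected skeleton first, then force the included edges in and orient them. This is only a guess at the shape of the fix, with a hypothetical helper `orient_included_edges` and a made-up partial skeleton; it is not the code in dodiscover.

```python
def orient_included_edges(skeleton_edges, included_edges):
    """Apply included (directed) edges on top of an already learned skeleton.

    skeleton_edges: iterable of 2-tuples for undirected skeleton edges.
    included_edges: iterable of (tail, head) tuples from background knowledge.
    Returns (directed, undirected) edge sets.
    """
    undirected = {frozenset(e) for e in skeleton_edges}
    directed = set()
    for tail, head in included_edges:
        # The edge is kept even if the CI tests removed it from the skeleton
        undirected.discard(frozenset((tail, head)))
        directed.add((tail, head))
    return directed, undirected


# Made-up partial skeleton over a few ASIA variables
skeleton_edges = [("S", "L"), ("S", "B"), ("L", "E")]
included_edges = [("S", "L"), ("S", "B")]

directed, undirected = orient_included_edges(skeleton_edges, included_edges)
print(directed)
print(undirected)
```

The usual collider and propagation rules would then run on the remaining undirected edges, starting from these fixed orientations.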

@codecov-commenter

codecov-commenter commented Dec 19, 2022

Codecov Report

Merging #67 (45eeb1a) into main (4d9a788) will not change coverage.
The diff coverage is 60.00%.

❗ Current head 45eeb1a differs from pull request most recent head f34981d. Consider uploading reports for the commit f34981d to get more accurate results

@@           Coverage Diff           @@
##             main      #67   +/-   ##
=======================================
  Coverage   82.13%   82.13%           
=======================================
  Files          20       20           
  Lines        1304     1304           
  Branches      228      229    +1     
=======================================
  Hits         1071     1071           
  Misses        152      152           
  Partials       81       81           
Impacted Files Coverage Δ
dodiscover/constraint/pcalg.py 79.80% <60.00%> (ø)


@robertness
Collaborator Author

@adam2392 I moved to docs, removed the asia data, and added bnlearn as a dependency.

@adam2392 adam2392 mentioned this pull request Dec 29, 2022
5 tasks
@robertness robertness force-pushed the tutorial branch 3 times, most recently from 4c8a83b to ac070eb Compare December 31, 2022 03:16
@robertness robertness force-pushed the tutorial branch 5 times, most recently from 965cee8 to fb03346 Compare December 31, 2022 04:04
Signed-off-by: Robert Ness <robertness@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
adam2392
adam2392 previously approved these changes Jan 3, 2023
Signed-off-by: Adam Li <adam2392@gmail.com>
Collaborator

@adam2392 adam2392 left a comment

This now works. A few notes:

  1. I fixed the CI. There were some installation issues that just needed to be iterated on based on the error messages that were provided.
  2. I updated the notebook and in turn found a minor bug to fix in the PC algorithm orientation phase.
  3. In the future, @robertness can you push to a branch on your fork and then start a PR from the fork rather than a branch on the main repo?

Unfortunately, I'm unsure why the Windows build is failing. It seems that poetry has difficulty with Windows in general... @darthtrevino any ideas?

@adam2392 adam2392 merged commit e94e038 into main Jan 3, 2023
@adam2392 adam2392 deleted the tutorial branch January 3, 2023 04:21
@robertness
Collaborator Author

Will use my own fork in future. Thanks @adam2392

adam2392 added a commit to adam2392/dodiscover that referenced this pull request Jan 6, 2023
* Create PC algo tutorial
* Add updated poetry lock file
* Fix notebook and update docs for CI. Fix code spell

Signed-off-by: Robert Ness <robertness@gmail.com>
Co-authored-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Successfully merging this pull request may close these issues.

Add notebook tutorial for PC algorithm
4 participants