Build an EFO term "precision" classification pipeline #2

Open
1 task
eric-czech opened this issue Jul 13, 2023 · 7 comments

Comments

@eric-czech

eric-czech commented Jul 13, 2023

There are several EFO term classifications that could be useful. I propose we start with trying to assign a certain precision to terms based on the following definitions:

  • high: High precision terms have the greatest ontological specificity, sometimes (but not necessarily) correspond to small groups of relatively homogeneous patients, often have greater diagnostic certainty and typically represent the forefront of clinical practice, i.e. they're closest to precision/personalized medicine. Examples:
  • medium: Medium precision terms are the ontological ancestors of high precision terms (if any are known), often include indications in later stage clinical trials and generally reflect groups of patients assumed to be suffering from a condition with a shared, or at least similar, physiological or environmental origin. Examples:
  • low: Low precision terms are the ontological ancestors of both medium and high precision terms, group collections of diseases with some, but often not many, shared characteristics, may be named in early stage clinical trials and typically connote a relatively heterogeneous patient population. They are also often terms used within the ontology for organizational purposes or completeness. Examples:

note: More examples like this are given in #3.

I like this description of the task and these names/definitions more than the disease "subtype", "root" and "area" idea we had used before internally, and it better captures what I was initially trying to accomplish with that work anyhow. I'm certainly open to discussing it more though.

We can use some of the labels we already have to bootstrap this effort and I would say the next steps are:

  1. Include an LLM-derived classification of this label as a feature to be used in conjunction with ontology features to make a final classification
  2. Explore options for modeling in this task beyond applying a GBM

I'll add some more details on those steps in related issues.
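
As a rough illustration of step 1, here is a minimal sketch of how an LLM-derived label could sit next to ontology features in one feature row. Everything in it is an assumption for illustration: the term IDs, the `PARENTS` map, the `LLM_LABELS` values and the feature names (`depth`, `n_descendants`, `llm_*`) are hypothetical and not the features used in this project.

```python
from collections import defaultdict

PRECISION_LEVELS = ("low", "medium", "high")

# Toy ontology as child -> parents; IDs are made up for illustration.
PARENTS = {
    "X:root": [],
    "X:group": ["X:root"],
    "X:disease": ["X:group"],
    "X:subtype": ["X:disease"],
}

# Hypothetical LLM-derived precision labels, one per term.
LLM_LABELS = {
    "X:root": "low",
    "X:group": "low",
    "X:disease": "medium",
    "X:subtype": "high",
}


def depth(term, parents):
    """Length of the longest path from the term up to a root."""
    ps = parents[term]
    return 0 if not ps else 1 + max(depth(p, parents) for p in ps)


def n_descendants(term, parents):
    """Number of distinct terms below the given term."""
    children = defaultdict(set)
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)
    seen, stack = set(), [term]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return len(seen)


def features(term, parents, llm_labels):
    """Combine ontology features with a one-hot encoding of the LLM label."""
    row = {
        "depth": depth(term, parents),
        "n_descendants": n_descendants(term, parents),
    }
    for level in PRECISION_LEVELS:
        row[f"llm_{level}"] = int(llm_labels.get(term) == level)
    return row


row = features("X:disease", PARENTS, LLM_LABELS)
```

Rows like this could then be fed to a GBM or any other classifier; the one-hot encoding keeps the LLM label as just another input rather than the final answer.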

TODO:

  • Add an issue with thoughts on other modeling approaches
@dhimmel
Member

dhimmel commented Jul 25, 2023

I also prefer the proposed disease precision scale of low/medium/high to the former scale of area/root/subtype, since it avoids any confusion with the term "root". The proposed definitions are helpful, and as noted some examples would enrich the definitions, as would review of some terms that do not fit cleanly within a single category.

Since our use case at Related Sciences is ultimately for drug development, these definitions are anchored to clinical characteristics and specificity. I imagine there might be some more non-clinical characteristics of each precision level that could enrich and solidify the definitions to support broader applications.

Looping in @DnlRKorn @matentzn @zoependlington @nicolevasilevsky @d0choa in case you have feedback on whether classifying diseases based on precision would be useful and whether the low/medium/high scale and definitions make sense. This classification is something we're initially planning to perform on EFO but could extend to MONDO as well.

@matentzn

@dhimmel I think from our perspective, the most significant distinction is between "true diagnosable disease" and "disease grouping"! This is sort of related to the "precision" mentioned here, but maybe needs other kinds of evidence, like "mentions in PubMed" etc.

@dhimmel
Member

dhimmel commented Jul 27, 2023

the most significant distinction is between "true diagnosable disease" and "disease grouping"

Thanks @matentzn for weighing in. I think a true diagnosable disease might be the union of the medium and high precision buckets, while disease grouping would be low. We could also create an additional 2-class outcome, predicted from the same feature set we create, which should include features like publication mentions.
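
A minimal sketch of that mapping, assuming the 3-class precision labels are plain strings; the 2-class names `diagnosable_disease` and `disease_grouping` are placeholders chosen here, not an agreed vocabulary.

```python
# Collapse the 3-class precision scale into the 2-class outcome:
# medium and high together form "diagnosable disease", low is "disease grouping".
PRECISION_TO_BINARY = {
    "low": "disease_grouping",
    "medium": "diagnosable_disease",
    "high": "diagnosable_disease",
}


def to_binary(precision):
    """Map a precision label to the hypothetical 2-class outcome."""
    return PRECISION_TO_BINARY[precision]


labels = [to_binary(p) for p in ("low", "medium", "high")]
```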

Linking a related issue at monarch-initiative/mondo#685.

Also, I notice EFO:0000574 / lymphoma has two subsets according to the EBI OLS browser: disease_grouping, ordo_group_of_disorders. @zoependlington or others: where are these subsets defined, how are they assigned, and is there more documentation on them?

@matentzn

Most of these come from Mondo and are a consequence of the metamodelling of ontologies aligned with Mondo. For example, there is a group of ORDO classes explicitly defined as groupings in ORDO, which make up that subset. For the more general disease_grouping subset, I think this was a fairly incomplete attempt to manually curate disease groupings. @nicolevasilevsky would know best!

@eric-czech
Author

Linking #5, which added the labeled data from EFO v3.43.0

@eric-czech
Author

Related to #13, I wanted to add two visuals that clearly communicate something about why we did this originally:

  1. A list of terms by precision level

[Image: Screen Shot 2023-09-20 at 4 09 38 PM]

Code: random_efo_term_samples.ipynb

  2. What happens when you use this to remove the most general terms in EFO
[Image: Screen Shot 2023-09-16 at 3 30 58 PM]

As a graph, EFO is very difficult to work with. Removing the most general terms (i.e. precision=low) greatly improves our ability to segregate out more clinically relevant disease hierarchies/subgraphs.

Note: "post-processing" in this case means creating a new graph with all nodes and edges for low precision terms removed.
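
For anyone wanting to try the same post-processing, here is a minimal sketch under stated assumptions: the graph is a list of (child, parent) edges, precision labels are a plain dict, and the term IDs are made up. It is not the code behind the figure above.

```python
from collections import defaultdict

# Toy graph as (child, parent) edges; IDs are made up for illustration.
EDGES = [
    ("X:a", "X:root"),
    ("X:a1", "X:a"),
    ("X:b", "X:root"),
    ("X:b1", "X:b"),
]
PRECISION = {
    "X:root": "low",
    "X:a": "medium",
    "X:a1": "high",
    "X:b": "medium",
    "X:b1": "high",
}


def drop_low_precision(edges, precision):
    """Build a new graph without low precision nodes or any edge touching them."""
    nodes = {n for edge in edges for n in edge}
    keep = {n for n in nodes if precision.get(n) != "low"}
    kept_edges = [(c, p) for c, p in edges if c in keep and p in keep]
    return keep, kept_edges


def components(nodes, edges):
    """Connected components, treating edges as undirected."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(neighbors[n] - comp)
        seen |= comp
        result.append(comp)
    return result


all_nodes = {n for edge in EDGES for n in edge}
before = components(all_nodes, EDGES)
after = components(*drop_low_precision(EDGES, PRECISION))
```

In this toy case the single connected graph splits into two separate subgraphs once the low precision root is removed, which is the effect described above.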

@dhimmel
Member

dhimmel commented Sep 22, 2023

A list of terms by precision level

Nice, very helpful. What do you think about recreating the classification example table so that each row contains a paired low, medium, high triplet, such that the medium term is a descendant of the low term and the high precision term is a descendant of the medium term? This would exclude some terms from being selected, for example medium terms with no low ancestors.
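
A minimal sketch of how such triplets could be enumerated, assuming a child -> parents map and a dict of precision labels; the IDs and the helper names `ancestors` and `precision_triplets` are hypothetical.

```python
# Toy ontology as child -> parents; IDs are made up for illustration.
PARENTS = {
    "X:root": [],
    "X:a": ["X:root"],
    "X:a1": ["X:a"],
    "X:m": [],        # medium term with no low ancestor
    "X:m1": ["X:m"],
}
PRECISION = {
    "X:root": "low",
    "X:a": "medium",
    "X:a1": "high",
    "X:m": "medium",
    "X:m1": "high",
}


def ancestors(term, parents):
    """All terms reachable by following parent links upward."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen


def precision_triplets(parents, precision):
    """(low, medium, high) rows where each term descends from the one before it."""
    triplets = []
    for high in sorted(t for t, p in precision.items() if p == "high"):
        mediums = sorted(a for a in ancestors(high, parents) if precision.get(a) == "medium")
        for medium in mediums:
            lows = sorted(a for a in ancestors(medium, parents) if precision.get(a) == "low")
            for low in lows:
                triplets.append((low, medium, high))
    return triplets


triplets = precision_triplets(PARENTS, PRECISION)
```

Here `X:m` and `X:m1` drop out because `X:m` has no low ancestor, which matches the exclusion mentioned above.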
