Build an EFO term "precision" classification pipeline #2

eric-czech · 2023-07-13T16:27:44Z

There are several EFO term classifications that could be useful. I propose we start with trying to assign a certain precision to terms based on the following definitions:

high: High precision terms have the greatest ontological specificity, sometimes (but not necessarily) correspond to small groups of relatively homogeneous patients, often have greater diagnostic certainty and typically represent the forefront of clinical practice, i.e. they're closest to precision/personalized medicine. Examples:
medium: Medium precision terms are the ontological ancestors of high precision terms (if any are known), often include indications in later stage clinical trials and generally reflect groups of patients assumed to be suffering from a condition with a shared, or at least similar, physiological or environmental origin. Examples:
low: Low precision terms are the ontological ancestors of both medium and high precision terms, group collections of diseases with some, but often not many, shared characteristics, may be named in early stage clinical trials and typically connote a relatively heterogeneous patient population. They are also often terms used within the ontology for organizational purposes or completeness. Examples:

note: More examples like this are given in #3.

I like this description of the task and these names/definitions more than the disease "subtype", "root" and "area" idea we had used before internally, and it better captures what I was initially trying to accomplish with that work anyhow. I'm certainly open to discussing it more though.

We can use some of the labels we already have to bootstrap this effort and I would say the next steps are:

Include an LLM-derived classification of this label as a feature to be used in conjunction with ontology features to make a final classification
Explore options for modeling in this task beyond applying a GBM
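
As a rough illustration of the first step, the sketch below one-hot encodes an LLM-assigned precision label and appends it to a few ontology features, giving one feature row that a downstream classifier such as a GBM could consume. The feature names (`depth`, `n_descendants`), the example values, and the helpers `one_hot_precision` and `build_feature_row` are all assumptions made for illustration, not part of any existing pipeline.

```python
from typing import Dict, List, Optional

# The three precision levels defined above, in a fixed column order.
PRECISION_LEVELS = ("low", "medium", "high")


def one_hot_precision(label: Optional[str]) -> List[float]:
    """One-hot encode an LLM-assigned label; a missing or unknown label maps to all zeros."""
    return [1.0 if label == level else 0.0 for level in PRECISION_LEVELS]


def build_feature_row(ontology_features: Dict[str, float], llm_label: Optional[str]) -> List[float]:
    """Concatenate ontology features (hypothetical names) with the encoded LLM label."""
    # Sort feature names so the column order is stable across terms.
    ontology_part = [float(ontology_features[name]) for name in sorted(ontology_features)]
    return ontology_part + one_hot_precision(llm_label)


# Hypothetical term: depth 4 in the ontology, 12 descendants, LLM says "medium".
row = build_feature_row({"depth": 4, "n_descendants": 12}, "medium")
print(row)
```

Keeping the LLM output as an input feature, rather than as the final answer, lets the model weigh it against structural evidence from the ontology.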

I'll add some more details on those steps in related issues.

TODO:

Add an issue with thoughts on other modeling approaches


dhimmel · 2023-07-25T14:53:37Z

I also prefer the proposed disease precision scale of low/medium/high to the former scale of area/root/subtype, since it avoids any confusion with the term "root". The proposed definitions are helpful, and as noted some examples would enrich the definitions, as would review of some terms that do not fit cleanly within a single category.

Since our use case at Related Sciences is ultimately for drug development, these definitions are anchored to clinical characteristics and specificity. I imagine there might be some more non-clinical characteristics of each precision level that could enrich and solidify the definitions to support broader applications.

Looping in @DnlRKorn @matentzn @zoependlington @nicolevasilevsky @d0choa in case you have feedback on whether classifying diseases based on precision would be useful and whether the low/medium/high scale and definitions make sense. This classification is something we're initially planning to perform on EFO but could extend to MONDO as well.

matentzn · 2023-07-27T08:11:28Z

@dhimmel I think from our perspective, the most significant distinction is between "true diagnosable disease" and "disease grouping"! This is sort of related to the "precision" mentioned here, but maybe needs other kinds of evidence, like "mentions in PubMed" etc.

dhimmel · 2023-07-27T16:21:09Z

the most significant distinction is between "true diagnosable disease" and "disease grouping"

Thanks @matentzn for weighing in. I think a true diagnosable disease might be the union of the medium and high precision buckets, while disease grouping would be low. We could also create an additional 2-class outcome, predicted from the same feature set we create, which should include features like publication mentions.
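
A minimal sketch of that 2-class outcome, assuming the mapping suggested above (low becomes a disease grouping, medium and high become a diagnosable disease). The class names, the term identifiers, and the `to_binary_outcome` helper are hypothetical.

```python
from typing import Dict

# Assumed mapping from the three precision levels to a 2-class outcome.
PRECISION_TO_BINARY: Dict[str, str] = {
    "low": "disease_grouping",
    "medium": "diagnosable_disease",
    "high": "diagnosable_disease",
}


def to_binary_outcome(precision: str) -> str:
    """Collapse a low/medium/high precision label into the 2-class outcome."""
    return PRECISION_TO_BINARY[precision]


# Placeholder term IDs with made-up labels, for illustration only.
precision_labels = {"TERM:1": "low", "TERM:2": "medium", "TERM:3": "high"}
binary_labels = {term: to_binary_outcome(p) for term, p in precision_labels.items()}
print(binary_labels)
```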

Linking a related issue at monarch-initiative/mondo#685.

Also, I notice EFO:0000574 / lymphoma has two subsets according to the EBI OLS browser: disease_grouping, ordo_group_of_disorders. @zoependlington or others: where are these subsets defined, how are they assigned, and is there more documentation on them?

matentzn · 2023-07-27T16:24:59Z

Most of these come from Mondo, and are a consequence of the metamodelling of ontologies aligned with Mondo. For example, there is a group of ORDO classes explicitly defined as groupings in ORDO, which make up that subset. For the more general disease_grouping subset, I think this was a fairly incomplete attempt to manually curate disease groupings. @nicolevasilevsky would know best!

refs related-sciences/nxontology-ml#2 (comment)

eric-czech · 2023-09-20T12:13:30Z

Linking #5, which added the labeled data from EFO v3.43.0

eric-czech · 2023-09-20T20:21:28Z

Related to #13, I wanted to add two visuals that clearly communicate something about why we did this originally:

A list of terms by precision level

Code: random_efo_term_samples.ipynb

What happens when you use this to remove the most general terms in EFO

As a graph, EFO is very difficult to work with. Removing the most general terms (i.e. precision=low) greatly improves our ability to segregate out more clinically relevant disease hierarchies/subgraphs.

Note: "post-processing" in this case means creating a new graph with all nodes and edges for low precision terms removed.
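
The post-processing step described in the note can be sketched roughly as follows. This assumes the graph is held as a plain parent-to-children mapping and that each term already has a precision label; the node names and the `remove_low_precision` helper are made up for illustration and are not the code behind the figures.

```python
from typing import Dict, Set


def remove_low_precision(
    children: Dict[str, Set[str]], precision: Dict[str, str]
) -> Dict[str, Set[str]]:
    """Return a new parent -> children mapping without low precision nodes or their edges."""
    # Collect every node, including leaves that only appear as children.
    nodes = set(children) | {c for kids in children.values() for c in kids}
    keep = {n for n in nodes if precision.get(n) != "low"}
    # Rebuild the mapping from scratch so the input graph is left untouched.
    return {n: {c for c in children.get(n, set()) if c in keep} for n in keep}


# Toy hierarchy with hypothetical terms and labels.
children = {
    "root_term": {"grouping_term"},
    "grouping_term": {"disease_a", "disease_b"},
    "disease_a": {"subtype_a1"},
}
precision = {
    "root_term": "low",
    "grouping_term": "low",
    "disease_a": "medium",
    "disease_b": "medium",
    "subtype_a1": "high",
}
pruned = remove_low_precision(children, precision)
print(pruned)
```

In the toy example, `disease_a` and `disease_b` are no longer joined through a shared general ancestor, so they fall into separate subgraphs. With a networkx graph, the equivalent would be copying the graph and calling `remove_nodes_from` with the low precision nodes.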

dhimmel · 2023-09-22T00:05:21Z

A list of terms by precision level

Nice, very helpful. What do you think about recreating the classification example table so that each row contains a paired low, medium, high triplet, such that the medium term is a descendant of the low term and the high precision term is a descendant of the medium term? This would exclude some terms from being selected, for example medium terms with no low ancestors.
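
To make the triplet idea concrete, here is a minimal sketch that enumerates (low, medium, high) rows from a parent-to-children mapping. The graph representation, the term names, and the `descendants` and `precision_triplets` helpers are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple


def descendants(children: Dict[str, Set[str]], node: str) -> Set[str]:
    """All nodes reachable from `node` by following child edges (excluding `node`)."""
    seen: Set[str] = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def precision_triplets(
    children: Dict[str, Set[str]], precision: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """List (low, medium, high) triplets where each term is a descendant of the previous one."""
    triplets = []
    for low in sorted(n for n, p in precision.items() if p == "low"):
        mediums = sorted(d for d in descendants(children, low) if precision.get(d) == "medium")
        for medium in mediums:
            highs = sorted(d for d in descendants(children, medium) if precision.get(d) == "high")
            triplets.extend((low, medium, high) for high in highs)
    return triplets


# Toy hierarchy with hypothetical terms; "disease_b" has no low ancestor.
children = {
    "grouping_term": {"disease_a"},
    "disease_a": {"subtype_a1", "subtype_a2"},
    "disease_b": {"subtype_b1"},
}
precision = {
    "grouping_term": "low",
    "disease_a": "medium",
    "disease_b": "medium",
    "subtype_a1": "high",
    "subtype_a2": "high",
    "subtype_b1": "high",
}
triplets = precision_triplets(children, precision)
print(triplets)
```

As expected, `disease_b` and its subtype never appear in a row because no low precision ancestor exists for them. A table for display could then be built by sampling rows from `triplets`.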

eric-czech mentioned this issue Jul 13, 2023

Experiment with LLM-derived EFO term classifications #3

Closed

eric-czech mentioned this issue Aug 8, 2023

Compare few-shot GPT4 features to embedding features for EFO term precision classification #8

Closed

dhimmel added a commit to related-sciences/nxontology-data that referenced this issue Sep 6, 2023

EFO: extract subsets

650c09f

refs related-sciences/nxontology-ml#2 (comment)

dhimmel mentioned this issue Sep 6, 2023

Prepare the MONDO outreach presentation materials #13

Closed

eric-czech mentioned this issue Sep 21, 2023

Add EFO TA features to the term precision classification model #33

Closed
