Update to: Core Types to Support the Discovery of Life Sciences Resources #2711

RichardWallis · 2020-09-17T10:36:08Z

This is a replacement for PR #2699 which required some work to get it to pass CI tests.
See that PR for details.

danbri · 2020-09-17T20:15:33Z

See also @Tpt's Yago variant, https://github.com/yago-naga/yago4/blob/master/src/data/bioschemas.ttl

AlasdairGray · 2020-10-20T10:35:54Z

@danbri any update on the progress of the inclusion of these types?

github-actions · 2020-12-20T02:07:26Z

This pull request is being tagged as Stale due to inactivity.

stain · 2021-02-22T16:28:15Z

The inactivity seems to be on the schema.org side; any idea why?

danbri · 2021-04-01T14:01:52Z

Short version: My sense is that we should get these terms into Pending, with a view towards them becoming part of core schema.org as evidence of data-consuming applications is collected. Based on the experience of the last few years, we should also expand our notion of "data-consuming applications" to cover developer-facing and data-scientist-facing applications, such as public open data knowledge graphs. I believe the bioschemas schemas have great potential, but we have work to do yet to determine quite what level of detail is going to prove appropriate for this kind of vocabulary.

Next steps: I've asked @RichardWallis to take a look at some minor fixes to the PR, to mark these terms as part of the Pending area of schema.org, and remove any conflicts (e.g. SchemaExamples/schemaexamples.py needs removing).

Status and Context and expectation setting

When the Bioschemas activity was first suggested we (Schema.org leads) were initially wary of bringing Schema.org into an area where there were a great number of existing scientific and research data ontologies, unless there was a serious prospect of the schemas being used in substantive user-benefitting applications that could guide our decision making. For general consumer topics (reviews, ratings, photos, etc.) Schema.org as a unifying vocabulary made clear sense and was guided by user-facing applications. As we touched on deeper scientific topics where many levels of detail are potentially applicable, the territory felt different.

I spoke about this at the Elixir 2016 All Hands, and in particular emphasized that it could be counterproductive to add this kind of vocabulary with the expectation of it primarily being used in general web search engine product features. We didn't want life-science site publishers to be disappointed if they added the markup to their sites and did not subsequently feel they were benefitting from having done so (e.g. in the Google case, by the markup being used by one of the features in Google Search's list of structured data features). And I didn't want to run into people at conferences a few years later and be told "we added all this markup to our site and it hasn't done us any good at all!".

Although these considerations apply to all schema.org additions, Bioschemas was an effort to move Schema.org towards covering scientific concepts and data structures in more detail than we had approached before. Schema.org has always focussed on schemas that are used, in the sense of consumed/interpreted by products, in user-facing features and applications. Without this, it is difficult to judge appropriate levels of detail, and it can be difficult for publishers to justify the effort of adding the markup.

The expectation originally was that the bioschemas project would work equally on the data-publishing and the data-consumption sides of making these schemas part of a healthy ecosystem. I think what we've seen is a lot more success on the former side than on the latter (and that is no fault of any individual or group who has been part of the bioschemas effort).

Pending

By bringing these terms into schema.org's Pending area, schema.org (per our standard documentation) sets the following expectations:

The Pending Section is a staging area for work-in-progress terms which have yet to be accepted into the core vocabulary. Pending terms are subject to change and should be used with caution.
Implementors and publishers are cautioned that terms in the pending extension may lack consensus and that terminology and definitions could still change significantly after community and steering group review. Consumers of schema.org data who encourage use of such terms are strongly encouraged to update implementations and documentation to track any evolving changes, and to share early implementation feedback with the wider community.

This is loosely analogous to language W3C uses for Working Drafts, and I highlight it here because it is important to acknowledge that the bioschemas vocabulary has been the product of a significant and expert-informed process over the last few years, and in particular it has been created, amended and developed in collaboration with many authoritative publishers of bioinformatics / lifesciences data.

It may be that the vocabulary in its schema.org incarnation will evolve further, but readers arriving here without knowledge of its origins should know that there have been substantial and long-running, expert-led collaborations leading to these designs.

Our challenge now will be to address any technical and usability integration issues between these schemas and the rest of Schema.org, and to move the focus towards data-consuming applications, so that we can understand whether the level of detail, definitions, and properties proposed here are sufficient to meet the needs of user-facing applications.

The Bioschemas project provides some supporting tooling, and there are other opensource tools (e.g. Gleaner.io, Schemarama) that may be helpful to those developing applications.

Schema.org for Knowledge Graph Exchange

As we look to support the use of schema.org data in new and interesting areas, we should also take care to be open-minded about what counts as "using" Schema.org in a data-consuming application.

For example, at Google we made some investigations into whether Schema.org extended with Bioschemas is sufficiently expressive to capture a useful "knowledge graph for lifesciences" subset extracted from Wikidata.org. Would such a database be a user-facing use of the data, or a workflow / infrastructural step towards an environment where user-facing applications could eventually be created? It is a little of both. While we can declare developers to be a kind of user we care about, these kinds of generic application do not always provide guidance that can help scope and shape schema design.

Such "knowledge graph exchange" scenarios for using Schema.org-based data are part of a larger trend. For example:

Yago, which converts Wikidata to use Schema.org vocabulary.
Ozymandias, "a biodiversity knowledge graph of Australian taxa and taxonomic publications".
Springer Nature's SciGraph, "collates information from across the research landscape, i.e. the things, documents, people, places and relations of importance to the science and scholarly domain."
DataCommons.org, "Datacommons.org is an open knowledge repository hosted by Google that provides a unified view across multiple public datasets, combining economic, scientific and other open datasets into an integrated data graph." (wikipedia, github).

I believe we should as a project explicitly declare these kinds of open data sharing, "knowledge graph exchange" initiatives as being amongst the kinds of data-consuming application that justify additions and changes to Schema.org. They are very much in the spirit of the project, but some thought is needed on how to operationalize this.

This doesn't mean that just spinning up an RDF database with some test data in would be sufficient; rather that we would be acknowledging data scientists, developers and others who work with data as being important user constituencies. Just as schema.org serves non-technical search engine end-users who are looking for jobs, recipes, reviews, events, datasets or fact checks on the various search engines, it can also support developers and data scientists who work with aggregations of schema.org data. As the DataCommons.org site says,

We cleaned and processed the data so you don't have to. Data about particular entities are aggregated from different sources for a unified view.

This kind of service (provided also by Wikidata et al.) can add huge value and help others meet the needs of their users.

The clarification to be made here is that our exit criteria for moving terms out of "Pending" status into the Schema.org core vocabulary should consider public, opendata knowledge graph use (SPARQL/RDF, Property Graphs, etc.) as important evidence towards demonstrating the usefulness of schema.org schema designs.

To @stain's point, it is true that we have been a little blocked at Schema.org in terms of knowing how to handle the Bioschemas proposals, since they do make significant amounts of great data accessible via schema.org markup, even if the data-consuming applications we collectively anticipated back in 2016 have yet to emerge.

Schema.org in the past has suffered from "build it and they'll come" optimism, and contains a number of schema designs which lack substantive data-consuming implementations. This is why we introduced the notion of "pending", so that there is an opportunity to surface potentially valuable schema designs, while also flagging up that we believe there may be possible tweaks ahead as data-consuming implementations surface.

If we clarify "user-facing, data-consuming application" to include open data-sharing "knowledge graph" systems like Wikidata, Yago, SN SciGraph, Ozymandias, Data Commons, I believe this opens up a roadmap for bringing Bioschemas (and similar proposals) into Schema.org, without setting unrealistic expectations about the schema details being used. In particular it gives us a new focal point for articulating questions about the user needs being met by schema designs; we can ask about the kinds of queries supported by the combination of these schemas with opendata that uses the schemas.
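
As a rough illustration of what "asking about the kinds of queries supported" might look like, here is a minimal Python sketch over a tiny in-memory set of triples. Everything in it is an assumption made for illustration: the `https://example.org/` identifiers, the sample statements, and the helper names `match` and `genes_expressed_in` are hypothetical, and the `Gene`, `Protein` and `expressedIn` terms are borrowed from the proposals in this PR without claiming they are final. A real knowledge graph would sit in a SPARQL or property-graph store rather than a Python set.

```python
from typing import Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

SCHEMA = "https://schema.org/"
EX = "https://example.org/"  # hypothetical namespace for the sample entities

# Hypothetical sample statements; not taken from any real dataset.
TRIPLES: Set[Triple] = {
    (EX + "gene/1", "rdf:type", SCHEMA + "Gene"),
    (EX + "gene/1", SCHEMA + "name", "Example gene A"),
    (EX + "gene/1", SCHEMA + "expressedIn", EX + "tissue/liver"),
    (EX + "gene/2", "rdf:type", SCHEMA + "Gene"),
    (EX + "gene/2", SCHEMA + "name", "Example gene B"),
    (EX + "gene/2", SCHEMA + "expressedIn", EX + "tissue/heart"),
    (EX + "protein/1", "rdf:type", SCHEMA + "Protein"),
    (EX + "protein/1", SCHEMA + "name", "Example protein"),
}


def match(
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the given pattern (None = wildcard)."""
    for triple in TRIPLES:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


def genes_expressed_in(tissue: str) -> List[str]:
    """Return the sorted names of Gene entities that are expressedIn `tissue`."""
    names: List[str] = []
    for subject, _, _ in match(p=SCHEMA + "expressedIn", o=tissue):
        # Keep only subjects that are explicitly typed as Gene.
        if (subject, "rdf:type", SCHEMA + "Gene") in TRIPLES:
            names.extend(obj for _, _, obj in match(s=subject, p=SCHEMA + "name"))
    return sorted(names)


if __name__ == "__main__":
    print(genes_expressed_in(EX + "tissue/liver"))  # prints ['Example gene A']
```

A question such as "which genes are expressed in this tissue?" can only be answered if the schema provides both the type and the linking property, which is the sort of evidence a knowledge-graph use case could feed back into schema design.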

Framed in this way I'm a lot more comfortable bringing these schemas into Pending, as it gives a plausible path for progressing things further. @AlasdairGray et al., does that work for you?

ljgarcia and others added 30 commits November 16, 2018 11:07

Create BiologicalEntity class.

f40e6a8

Not as an extension yet...

Started drafting BioChemEntity type

e56fdf5

Merge branch 'master' into BioChemEntity

a827c82

Add BioChemEntity properties

79fbaec

Add ChemicalSubstance

47165f4

Add DNA

653b61e

Add gene

d59b1ea

Fix typos

3c0ea6b

Add enzyme type

342bc02

Add molecular entity

f2c4a10

Add dot at the end of comments

0221a15

And fix hasSequence domains and ranges

Add protein type and properties

8a09726

Add RNA

0cad8c2

Add sequence annotation

35c7ea4

and missing property sources

Add sequence range

2fa4128

Add sequence range

deae246

change number to toext for temperature

e7c1cbc

so the unit can be included as well

Add temperature comment

ac1a68a

Change molecular weigth to text

e570ab5

so the unit can be specified

Specializes some Number ranges

3e693f3

Add phenotype

3c5718d

Add range to expressedIn

7c3e1d1

Add protein structure

91f9349

adjusting domains and ranges when needed. And updating temperature range from Text to Quantity

Add lab protocol

40c8bd3

Add interpro entries (sequence matching model)

893a642

and add inverseOf statements and links

Enabled bio extension

0c2039b

Commented out classes that fail to work

Removed old definition of BiologicalEntity

b5ff137

Fixed issues with Enzyme

3e01d3e

- Tweaks to wording of terms - Fixes to make valid HTML - Omitting keywords property

Fixes in HTML syntax to make MolecularEntity display

f44d179

Fixed html syntax

1123735

Alasdair Gray and others added 19 commits April 15, 2020 09:39

Clarified smiles representations

b517d99

Merge branch 'main' into Bioschemas-release-1

f0373fc

Converted BioChemEntity to turtle representation

9d37deb

Converted to turtle

9cd7034

Converted to turtle

b81541a

Converted to turtle

4efb623

Converted to turtle

74c7be4

Converted to turtle

44b2dc7

Converted to turtle

a3d97f6

Moved coding notes to Bioschemas/specifications repo

5ea9233

Updated links to coding notes

0e0cc5f

Updated description of deployments

afdee9d

Merge branch 'main' into Bioschemas-release-1

03f74af

Merge branch 'main' into Bioschemas-release-1

ac85537

Removed files for types not being submitted

39de40b

These files are all available on BioChemEntity branch

Removed BioSample from first submission

32f9adb

Merge commit '314c190ae37e97da4d0d3c695a849f272ce589bc' into Bioschem…

7668fb0

…as-release-1

Merge branch 'main' into Bioschemas-release-1

fd01e14

Tweaks to pass tests and remove not needed old files

fd4a8a4

RichardWallis mentioned this pull request Sep 17, 2020

Core Types to Support the Discovery of Life Sciences Resources #2699

Closed

github-actions bot added the no-pr-activity label Dec 20, 2020

This was referenced Apr 1, 2021

Bring proposed Bioschemas terms into Pending area #2862

Open

Introduce BioSchemas terms into pending #2863

Merged

MatthiasWiesmann closed this Sep 16, 2024

MatthiasWiesmann deleted the bioPR branch September 16, 2024 08:36
