Introducing Bioschemas: promoting schema.org in the life sciences #1028

Open
martin-nc opened this Issue Mar 16, 2016 · 7 comments

@martin-nc

The life science research community comprises a large number of diverse organisations consuming and/or producing data on the web. The community is very active in adopting standards and common APIs for specific types of data, but there isn’t a standard lightweight format that these organisations use to publish all their information, and many don't have the resources or expertise to create APIs for others to access their data.

Bioschemas is a project to promote the use of Schema.org markup in life sciences, as a way to address this. We are hoping to encourage life science organisations to adopt Schema.org markup, since it doesn’t require programming skills, it is widely adopted and well documented, and it makes sense anyway for SEO. We could then scrape their web pages to access information that is now consistently formatted.

Organisations involved in Bioschemas include ELIXIR, Pistoia Alliance, GOBLET, TeSS, BioSharing and BBMRI. (I work for ELIXIR, an organisation that is funded by European governments to build a sustainable infrastructure for life science information. It is one of the founders of Bioschemas.)

Bioschemas aims to create specifications for each type needed in the life sciences. Each specification will contain:

  1. The Schema.org properties to be used to describe each type (Event, Organization etc). This may contain new properties to be proposed to the Schema.org community.
  2. Recommendations on how to use the Schema.org properties. We include features such as controlled vocabularies, cardinality (is one value expected or many?), and minimum fields. Notice that these features are not supported by Schema.org, so will not be discussed here. It is an extra layer of detail that we are asking life scientists to add. We are not asking the Schema.org community to support the concepts.
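To make the "extra layer" in point 2 concrete, here is a minimal Python sketch of how minimum fields, cardinality and a controlled vocabulary could be checked against a marked-up record. The property names (`name`, `startDate`, `location`, `keywords`) exist in Schema.org, but the `EVENT_SPEC` rules, the vocabulary terms and the `check_record` helper are hypothetical and are not taken from any Bioschemas specification.

```python
# Hypothetical specification layer for the Schema.org Event type.
# The rules below (minimum fields, cardinality, vocabulary) are made up
# for illustration and are not an actual Bioschemas specification.
EVENT_SPEC = {
    "name": {"required": True, "many": False},
    "startDate": {"required": True, "many": False},
    "location": {"required": True, "many": False},
    "keywords": {
        "required": False,
        "many": True,
        "vocabulary": {"metagenomics", "proteomics", "genomics"},
    },
}


def check_record(record, spec):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    for prop, rule in spec.items():
        if prop not in record:
            if rule["required"]:
                problems.append(f"missing required property: {prop}")
            continue
        value = record[prop]
        # Treat a single value and a list of values uniformly.
        values = value if isinstance(value, list) else [value]
        if not rule["many"] and len(values) > 1:
            problems.append(f"expected one value for {prop}, got {len(values)}")
        allowed = rule.get("vocabulary")
        if allowed is not None:
            for v in values:
                if v not in allowed:
                    problems.append(f"{prop} value not in vocabulary: {v}")
    return problems
```

A record that passes returns an empty list, so a harvester could use the result to decide whether to accept a page or report back to the publisher.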

Here is an example: the specification for events.

Our general approach is to:

  • Be inductive and empirical. Discover what information people are already producing in life science and match these to existing types and properties in Schema.org.
  • Use the existing Schema.org properties and types as much as possible.
  • Where that’s not possible, then keep the required new property or type as abstract as possible (i.e. not specific to life science) and propose a new property or type to the Schema.org community. For example, if there was a demand for a property ‘PhD supervisor’ on Person, then we could abstract that to ‘mentor’ or ‘teacher’ so that it is applicable and useful beyond life science.
  • Be open throughout the process. Any new type or property proposal will obviously be discussed here, but anyone who is interested in applying Schema.org to life science is also welcome to join us on Github, comment on the specifications and join our mailing lists.
  • Use what we create. See the use case below.

Example use case: A small marine metagenomics research group publishes its events on its website. These get limited publicity because the website isn’t well used. They don’t have the time or expertise to create an API and haven’t got an iCal feed.

Then they code their events with Schema.org markup through a plugin for a popular open source CMS (Wordpress, Joomla, Drupal) or through an online Schema.org markup generator. We write a script to scrape their site and add their events to a database of other life science events (an events portal).
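As a rough illustration of the scraping step, here is a minimal Python sketch that pulls Schema.org `Event` objects out of a page. It assumes the markup is embedded as JSON-LD in `<script type="application/ld+json">` elements (microdata or RDFa would need a different parser), and it only handles a plain string `@type`. The class and function names are hypothetical, and fetching the page and writing to the events database are left out.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_events(html):
    """Return all top-level JSON-LD objects in the page whose @type is "Event"."""
    collector = JsonLdCollector()
    collector.feed(html)
    events = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup rather than failing the whole page
        items = data if isinstance(data, list) else [data]
        events.extend(
            item for item in items
            if isinstance(item, dict) and item.get("@type") == "Event"
        )
    return events
```

Each returned item is an ordinary dictionary, so it can be validated and stored without further parsing.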

We write a Javascript widget that can query this database, and can be easily embedded into the institute’s website. The institute copy and paste the widget code into their site and set a simple configuration option to show only marine metagenomics events. Now their own events are publicised across the life science community via the events portal, and they get to embed other events of interest to their specialist audience on their own site.
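The widget itself would be JavaScript, but the query behind its "simple configuration option" can be sketched in a few lines of Python. This assumes each stored event is a dictionary with a `keywords` field (a list or a comma-separated string) and a `startDate` in a consistent ISO 8601 format. The `select_events` function and those storage details are hypothetical.

```python
def select_events(events, topic):
    """Return events tagged with the topic (case-insensitive), ordered by start date."""
    wanted = topic.strip().lower()
    matches = []
    for event in events:
        keywords = event.get("keywords", [])
        if isinstance(keywords, str):
            # Accept a single comma-separated string as well as a list.
            keywords = keywords.split(",")
        if wanted in (k.strip().lower() for k in keywords):
            matches.append(event)
    # ISO 8601 dates in the same format sort correctly as plain strings.
    return sorted(matches, key=lambda e: e.get("startDate", ""))
```

With this shape, the topic string is the only thing the embedding site has to configure.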

Sorry for the long post, but we’ll be posting to this community in the next few weeks, so I thought I’d give some background! Anyway do let us know if you have any thoughts on the project.

@jaygray0919

I note in your email the plan to use BioJS. I took a quick peek and see some overlap with d3.js. Many of us have significant investment in d3.js education and libraries. You may wish to consider complementing d3.js rather than taking a potentially orthogonal approach.

@rajido
rajido commented Mar 16, 2016

BioJS is technology agnostic, so you can use any JavaScript library you like, such as d3. If you look at the BioJS components, some of them are built using d3. BioJS is meant to facilitate discovery, build community and introduce best practices for JavaScript visualisation in life sciences. So we are definitely looking for a complementary approach.

@martin-nc

@jaygray0919 Sorry for the late reply. The goal of Bioschemas is to encourage the use of schema.org markup in life sciences, at least for some types of content (e.g. events, organisations, people). Other than that it is entirely technology agnostic. It's not in its scope to nudge people towards using a particular technology for harvesting, analysing or visualising the marked up information. The javascript example I gave was simply for illustration, to show how the information could be re-purposed if it was coded consistently.

Sorry if it looked like I was pushing any technology here. We absolutely don't want to re-invent the wheel and ignore the work other people have done!

@joncison

So I think it's natural we have a new group for "Tool": it's always been in the ELIXIR / bio.tools plan to extend schema.org in a bio.tools-compatible way. How do I make a start with this? @rajido - what practical steps should I take?

@martin-nc

@joncison (If you don’t mind me jumping in here) I’ll email you about your question, but in general if anyone is interested in having a new class/type in Bioschemas then they can just email all@bioschemas.org, or open a new Github issue.

This is the process we've been using so far (adapted to apply to tools):

  1. Look at how other people describe tools. This involves visiting sites that host or list tools and note the fields/properties used by each site to describe the tools. For example, here is a spreadsheet we used for the Events specification.
  2. Look for common properties used across websites and come up with the most important properties shared by most sites. You can also think if there are other properties missing, and consult with the Tools community to see if these missing properties would be useful.
  3. Look at schema.org and see if there is a type that fits what you want. You might want to browse the list of ‘types’ on schema.org to see which might be possibilities. I'd imagine it'd come under CreativeWork. There is WebApplication, for example.
  4. See if you can match the properties of this type to the properties you need. We can always ask the schema.org community to adopt new properties if they don’t already exist. We’ve got more chance of schema.org adoption if we keep to what’s there already, and any new suggestions are kept as generic as possible, so they’re useful across domains.
  5. Compile a specification like the Events one and open it for comments from people who you think might be interested, and from the schema.org community.
  6. Start using the specification in the real world!
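Step 2 is essentially a counting exercise, and could be sketched in Python as below. The survey data, the `common_properties` helper and the "more than half of the sites" threshold are all assumptions made for illustration. A real survey would come from a spreadsheet like the Events one.

```python
from collections import Counter


def common_properties(site_properties, min_share=0.5):
    """Return properties used by more than min_share of the surveyed sites, most common first."""
    counts = Counter()
    for props in site_properties.values():
        counts.update(set(props))  # count each property once per site
    threshold = min_share * len(site_properties)
    return [prop for prop, n in counts.most_common() if n > threshold]


# Hypothetical survey: which fields each site uses to describe a tool.
survey = {
    "site_a": ["name", "version", "license"],
    "site_b": ["name", "license", "homepage"],
    "site_c": ["name", "topic"],
}
shared = common_properties(survey)
```

The properties that survive the threshold are the candidates to match against an existing schema.org type in steps 3 and 4.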

You are welcome to lead this process for tools or delegate to whoever you see fit, and I can help too. We hope to have instructions on our website soon.

@joncison

Thanks a lot! Really helpful. The good news is that 1. and 2. are done and resulted in https://github.com/bio-tools/biotoolsxsd, which leaves 3 - 5. As for 6, we'll need a new registration mechanism in bio.tools to cope.

You could help a lot by prodding me, in case this thread goes cold.

@danbri
Contributor
danbri commented Jul 15, 2016

On the datasets side of things, you might care to look at these proposed changes I have just merged into our draft next release:

#1247 -> http://webschemas.org/docs/releases.html#g1083

These largely come from considerations around improving the usability of our Dataset and its integration with the rest of schema.org, in particular aiming for adoption by scientific dataset publishers including lifesciences. Feedback welcomed here or in #1083.
