FairData

Philippe Rocca-Serra edited this page Apr 7, 2020 · 30 revisions

Fair Data

As an initial remark about data sharing: SARS-CoV-2 genomes are sequenced by a variety of institutions, which submit their results to GISAID.org. From there, the data are accessible only after creating a user account and then clicking through the UI to reach the record you want. Simply fetching all genomes (there are only a few hundred, at about 30k bases each, so it is not a huge set) is currently not possible at all, let alone via an API.

Strategy

FAIRification Strategy Strawman for Monday 0900 CEST meeting

Communication

We have set up a Web location where you can deposit data that should be "FAIRified". Please submit your data to that folder as follows:

  1. Create a ZIP file containing:
  • your data
  • a metadata file explaining license and citation at a minimum (more is better!)
  • a data dictionary to explain e.g. what the column headers mean
  • contact info for you so that we can ask questions
  2. Upload it wherever you wish on the Web. If you want to put it into our repository (as long as it isn't HUGE!), you can do so as follows:

curl -v -L -X PUT -H "Accept: text/turtle" -H "Content-type: application/zip" -u hackathon:b**h******n --data-binary @sampledata.zip http://ldp.cbgp.upm.es:8890/DAV/coronavirus/To_Be_FAIRified/sampledata.zip

  3. Indicate in the "To Be FAIRified" section at the bottom of this page:
  • your name
  • your Slack ID
  • the URL to your ZIP file
  4. We will add it to the FAIR transformation queue!
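
As a rough illustration of step 1, the sketch below assembles such a ZIP in memory with Python's standard library. The file names inside the archive (METADATA.txt, DATA_DICTIONARY.csv, CONTACT.txt), the function name, and all sample values are assumptions made for this example; the page does not prescribe any particular layout.

```python
import csv
import io
import zipfile


def build_submission_zip(data_files, license_text, citation, data_dictionary, contact):
    """Return ZIP bytes bundling data files with the requested metadata.

    The archive layout used here is hypothetical, not a project convention.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The data itself, kept under a data/ folder.
        for name, content in data_files.items():
            archive.writestr(f"data/{name}", content)
        # Minimum metadata: license and citation.
        archive.writestr(
            "METADATA.txt", f"License: {license_text}\nCitation: {citation}\n"
        )
        # Data dictionary: one row per column header and its meaning.
        dictionary = io.StringIO()
        writer = csv.writer(dictionary)
        writer.writerow(["column", "meaning"])
        writer.writerows(data_dictionary.items())
        archive.writestr("DATA_DICTIONARY.csv", dictionary.getvalue())
        # Contact details so that questions can be asked.
        archive.writestr("CONTACT.txt", contact + "\n")
    return buffer.getvalue()


# Placeholder values only; replace them with your real data and metadata.
payload = build_submission_zip(
    data_files={"example.csv": "sample_id,value\nS1,0\n"},
    license_text="CC-BY-4.0",
    citation="(placeholder citation)",
    data_dictionary={"sample_id": "local sample identifier", "value": "measured value"},
    contact="Your Name, your Slack ID, your e-mail",
)
```

Writing `payload` to a file such as sampledata.zip (`open("sampledata.zip", "wb").write(payload)`) gives an archive that can be uploaded with the curl command from step 2.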

Participants

  • Mark Wilkinson (coordinator)
  • Michel Dumontier
  • Stian Soiland-Reyes
  • Philippe Rocca-Serra
  • Evangelos Pafilis
  • Susanna-Assunta Sansone
  • Lynn Schriml

Ideas

Repackage SARS-CoV-2 sequences

FAIRify (add metadata, identifiers, etc) reproducible research

For instance, describe/package the research as an RO-Crate. (MDW: note that I have spoken with the RO-Crate team, and they think using LDP as the container system for Crates would be a good idea. That is what I plan to do.)
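
To make the RO-Crate idea more concrete, here is a minimal sketch of a crate metadata document built in Python. It assumes the RO-Crate 1.1 layout (a `ro-crate-metadata.json` descriptor plus a root dataset `./`); the function name and all sample values are hypothetical, and the crates actually produced by this project may follow a different version or profile.

```python
import json


def build_ro_crate_metadata(name, description, date_published, license_url, file_names):
    """Return a dict shaped like an RO-Crate 1.1 metadata document (assumed version)."""
    # Metadata descriptor: points at the root dataset and names the spec version.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # Root dataset: describes the crate as a whole and lists its files.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_url},
        "hasPart": [{"@id": file_name} for file_name in file_names],
    }
    # One File entity per payload file.
    files = [{"@id": file_name, "@type": "File"} for file_name in file_names]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root] + files,
    }


# Placeholder values for illustration only.
crate = build_ro_crate_metadata(
    name="Example crate",
    description="Placeholder description of a repackaged dataset",
    date_published="2020-04-07",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    file_names=["data/example.csv"],
)
crate_json = json.dumps(crate, indent=2)
```

The resulting `crate_json` string would be saved next to the data files as the crate's metadata file.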

Details on the Linked Data Platform server

NOTE: I am looking for someone who knows how to configure Virtuoso for HTTPS... thanks!

I have created a Linked Data Platform endpoint on my institutional server in Madrid for us to use for back-end storage. It uses Virtuoso's LDP implementation (so we get SPARQL over the Linked Data submitted to that server):

(See below for the curl command to push data into that Container. Please choose a unique identifier for your crate, make it an LDP container, and then push the crate into it.)

There is NOTHING on that server that is in any way valuable. It is used entirely for FAIR training, so we can make as many mistakes as we need to, and I can wipe the DB and start again if necessary. Alternatively, you can download the image linked above and run it on localhost for your tests.

Please be "good citizens" and start by creating a sub-container inside of the /coronavirus/ container where you can store your information. Please remember that LDP Containers have a trailing slash! I believe that the Virtuoso implementation of LDP can ingest both Turtle and JSON-LD for the purposes of SPARQL, but I have only ever tried Turtle so I cannot promise the latter. The SPARQL endpoint is: https://w3id.org/FAIR_Training_LDP/sparql
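
For those who prefer scripting to a browser, the following sketch builds a SPARQL query request against the endpoint named above, using only Python's standard library. It follows the standard SPARQL protocol (GET with a `query` parameter); the query itself, and the assumption that submitted containers are visible to it as `ldp:Container` instances, are illustrative guesses rather than tested behaviour of this server.

```python
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://w3id.org/FAIR_Training_LDP/sparql"

# Hypothetical query: list up to ten resources typed as LDP containers.
LIST_CONTAINERS = """
PREFIX ldp: <http://www.w3.org/ns/ldp#>
SELECT ?container WHERE { ?container a ldp:Container } LIMIT 10
"""


def build_sparql_request(query, endpoint=SPARQL_ENDPOINT):
    """Build (but do not send) a SPARQL protocol GET request asking for JSON results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


request = build_sparql_request(LIST_CONTAINERS)

# To actually run the query (requires network access):
#   import json
#   with urllib.request.urlopen(request) as response:
#       results = json.load(response)
```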

Typical Interaction:

To create your "home" or "unique crate" Container:
Create a file "container.ttl" that contains a small piece of turtle:

@prefix ldp: <http://www.w3.org/ns/ldp#> .
<> a ldp:Container .

To upload this to the server:

curl -v -H "Accept: text/turtle" -H "Content-type: text/turtle" -u hackathon:********** --data-binary @container.ttl -H "Slug: myCrateName" http://ldp.cbgp.upm.es:8890/DAV/coronavirus/ro-crates/

(Note that the trailing slash is required for containers! If you omit it, you will get a 301 redirect.)
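
The same container-creation call can be sketched in Python for use in scripts. The example below mirrors the curl command (a POST of the Turtle body with a Slug header and HTTP Basic authentication) and refuses URLs that lack the trailing slash. The function name and the password placeholder are assumptions for illustration; the request is only built, not sent.

```python
import base64
import urllib.request

PARENT_CONTAINER = "http://ldp.cbgp.upm.es:8890/DAV/coronavirus/ro-crates/"

# Same content as container.ttl above.
CONTAINER_TURTLE = """@prefix ldp: <http://www.w3.org/ns/ldp#> .
<> a ldp:Container .
"""


def build_create_container_request(parent_url, slug, user, password):
    """Build (but do not send) the POST that asks the server for a new container."""
    if not parent_url.endswith("/"):
        # Containers need the trailing slash; without it the server redirects (301).
        raise ValueError("LDP container URLs need a trailing slash")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        parent_url,
        data=CONTAINER_TURTLE.encode("utf-8"),
        method="POST",
        headers={
            "Accept": "text/turtle",
            "Content-Type": "text/turtle",
            "Slug": slug,  # suggested name for the new container
            "Authorization": f"Basic {token}",
        },
    )


# "PASSWORD-PLACEHOLDER" stands in for the real hackathon password.
request = build_create_container_request(
    PARENT_CONTAINER, "myCrateName", "hackathon", "PASSWORD-PLACEHOLDER"
)

# To send it (requires network access and the real password):
#   with urllib.request.urlopen(request) as response:
#       print(response.status, response.headers.get("Location"))
```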

To create an ldp:Resource, the RDF should have the rdf:type ldp:Resource .

For more complex interactions, see the options in the HTTP headers.

Accumulate missing properties

A page where people can deposit any properties/classes that are currently missing from existing ontologies. https://docs.google.com/document/d/1HWp2EvTRCn-lNSoN5RF_XLcbbT9j8IrGkMQXcgjdbTI/edit#heading=h.rbnwes4ofzsi (shared with the Ontology team)

Workflow Hub - registering COVID-19 workflows as FAIR

Working with the ELIXIR effort, this project proposes to set up an early pre-production instance of the EOSC-Life Workflow Hub, covid19.workflowhub.eu, as a registry that gathers the COVID-19 workflows and their metadata. Part of the task is also to curate the existing workflows and help make them interoperable, reusable and reproducible.

The curated metadata will be in a FAIR format based on RO-Crate and Bioschemas annotations and, where possible, contributed back to the workflows' original GitHub repositories.

For details, tasks and participants, see sub-topic Workflow Hub.


Currently Being FAIRified

Mark Wilkinson @Mark Wilkinson Ministerio de Sanidad sobre coronavirus España (link to up-to-date data in bottom left)

Mark Wilkinson @Mark Wilkinson SARS-CoV-2 sequences GenBank

Philippe Rocca-Serra SARS-CoV-2 exposed CACO-2 cell - protein profiling - proteome analysis

  • available from dedicated GitHub repository
  • original data available from PRIDE with accession number PXD107710
  • reannotated dataset available from Zenodo DOI
    • metadata available in ISA format (ISA-Tab and ISA-JSON)
    • raw data available in mzML format, converted from raw MS files
    • derived data available as an R-ready CSV file in long-table layout, ready for consumption by the ggplot2 R library
    • bundled as a bdbag archive
    • released via Zenodo

TO BE FAIRified