scope of Schema.org for research data #2059

Open
vsoch opened this Issue Sep 12, 2018 · 57 comments

vsoch commented Sep 12, 2018

hey schema.org team!

We are putting together an organizational and data movement strategy for research computing and the library at Stanford, and I wanted to ask how the schemas fit into the domain of research data. I will describe the ideas we are discussing first to give you some context.

  • a researcher will start a new study and create a definition of data. Let's say some set of images and annotations. We will want them to be matched to a particular data format (e.g., DICOM images) that has a set of metadata (e.g., some subset of header fields, or Radlex terms).
  • Ideally, this image format will have a particular organization and metadata (something schema.org can represent?) and this will drive tools / software to move it around, and perhaps first put it where it will be used by the researcher (Google Cloud).
  • On Google Cloud, you can imagine it will be in Object Storage, with object-level metadata and, if needed, something like BigQuery to handle scaled queries.
  • Then the data will be moved to a library archive, more of a filesystem setup, where it will be accessible by URL.

As it moves around, the organizational schema will help to guide interaction. It will help with validation and query and integration with tools built around it. For many of these organizations, they will come from the research domains themselves. For example, the brain imaging data structure BIDS is already being widely used across the neuroimaging community and software.

For the definition of the organization and data, I'm wondering how schema.org can fit in. I saw that Natasha (previously at Stanford!), who is at Google working on Google Datasets (see this article), mentioned schema.org, and it definitely seems relevant for web page content and making it searchable. Since we want our strategy to be easy to sync with what the larger community is doing, I wanted to ask about research data. How can we work together and leverage the resources here so that our datasets can eventually integrate into tools provided by schema.org and Google Datasets, and be searchable by our researchers after archiving? How can we contribute templates and other tooling here to help toward this? Thanks for your help!

Contributor

rvguha commented Sep 12, 2018

thadguidry commented Sep 12, 2018

damn it @rvguha can't we at least TRY to keep the conversation public , until it doesn't have to be ???
If you cannot, please at least summarize the public bits of the conversation you have, back into this issue for the benefit of all. Thanks man ! :)

akuckartz commented Sep 13, 2018

@vsoch The "larger community" does not only consist of Google.

vsoch commented Sep 13, 2018

That’s not what I meant :*(

vsoch commented Sep 13, 2018

@akuckartz would you care to support that statement with a description of what the larger community is doing, per your thoughts?

Contributor

rvguha commented Sep 13, 2018

vsoch commented Sep 13, 2018

yeah! @thadguidry and @akuckartz I hope we can continue our conversation here, can I tell you how excited I am to be starting work on this project? I understand your concern, and let's keep discussion going here! One small note - please take note I'm about to be hit by a hurricane (east coast) so if there is a bit of delay in my response, I'm probably just away from power or internet. 🚢 ⛵️

Contributor

rvguha commented Sep 13, 2018

Contributor

danbri commented Sep 13, 2018

@thadguidry - I appreciate your enthusiasm to collaborate but there's nothing wrong with @rvguha (or anyone else) expressing an interest in directly meeting up with other members of this community, especially given that he's in the Stanford area etc etc.

Saying "damn it" in Github comes across way more snarkily than it might in real life amongst people who know each other better. While you might mean it with a smile it's not a great example to set.

From https://schema.org/docs/howwework.html

Participants in community group and Github discussions are expected to respect the W3C code of ethics and professional conduct, as well as each other.

While I don't take use of "damn it" as breaking those rules, it needlessly nudges things towards being a more hostile and critical environment. Anyway, FWIW I'd also be happy to see more discussion of the original ideas here, but since the original post also had several Google mentions, maybe those Google-specific aspects are better explored elsewhere.

(As an aside -- I've been working with Guha on this RDF stuff since before Google even existed as a company, I'd hope our commitment to the bigger picture might be clear by now...)

akuckartz commented Sep 13, 2018

@vsoch In addition to schema.org there exist several other parallel activities with overlapping (but certainly not identical) requirements and stakeholders. One of them is the W3C Dataset Exchange Working Group (DXWG), which is creating a revised version of DCAT. See https://w3c.github.io/dxwg/

This is not an "either or" but I suggest that you also look at what DXWG is doing. In Europe DCAT is used frequently by public administrations and in Germany there is a new legal requirement for all branches of the public administration to describe Open Data using a DCAT profile. I suppose that will also have some influence in research communities.

And yes, stay safe!

Contributor

danbri commented Sep 13, 2018

@akuckartz yes, we have the same "backbone structure" in Schema.org as DCAT is based on. This is thanks to Jim Hendler and his group's advocacy for us to adopt that design some years ago.

The basic structure is

(my DataCatalog) ---dataset(inverse=includedInDataCatalog)  ---> (my Dataset) ----distribution--> (my DataDownload)

... which shares strengths and weaknesses with DCAT, e.g. there is scope for documenting patterns for time series-based collections, etc. The similarity means that if you have basic DCAT it is pretty easy to generate Schema.org Dataset markup, and vice-versa.
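As a rough illustration of the backbone above, here is a minimal Python sketch that builds the DataCatalog, Dataset and DataDownload chain as JSON-LD. The type and property names come from schema.org; the dataset name and URL are made-up placeholders, and this is only one plausible way to lay the markup out.

```python
import json

# Placeholder values for illustration only; the name and URL are hypothetical.
download = {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.org/data/my-dataset.csv",
}

dataset = {
    "@type": "Dataset",
    "name": "My Dataset",
    # Dataset --distribution--> DataDownload
    "distribution": [download],
}

catalog = {
    "@context": "https://schema.org/",
    "@type": "DataCatalog",
    "name": "My DataCatalog",
    # DataCatalog --dataset--> Dataset (the inverse property is includedInDataCatalog)
    "dataset": [dataset],
}

print(json.dumps(catalog, indent=2))
```

The same graph could be written starting from the Dataset and pointing back to the catalog with includedInDataCatalog; which direction to use depends on which page carries the markup.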

I am a member of W3C DXWG representing Google, and liaising to Schema.org. There are some notes from the last f2f meeting that I made, towards using JSON-LD's @context feature to integrate DCAT / Linked Data approaches and schema.org here: https://docs.google.com/document/d/16c_STDu8Dzj-ioRNuGS2tlIFJamlx0-vRKBaPA5Wzfc/edit
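To make the DCAT / Schema.org similarity concrete, a small Python sketch of a one-way conversion might look like the following. It assumes the DCAT record is already a plain dictionary with prefixed keys such as dct:title; the property correspondences are hand-picked for illustration and are not an official mapping, and the sample record is invented.

```python
# Hand-picked correspondences, assumed for illustration; not an official mapping.
DATASET_MAP = {
    "dct:title": "name",
    "dct:description": "description",
    "dcat:keyword": "keywords",
}
DISTRIBUTION_MAP = {
    "dcat:downloadURL": "contentUrl",
    "dcat:mediaType": "encodingFormat",
}


def _rename(record, mapping):
    """Copy the keys listed in mapping, renamed to their schema.org property."""
    return {new: record[old] for old, new in mapping.items() if old in record}


def dcat_to_schema_org(dcat_dataset):
    """Convert a simple DCAT dataset dictionary into schema.org Dataset markup."""
    out = {"@context": "https://schema.org/", "@type": "Dataset"}
    out.update(_rename(dcat_dataset, DATASET_MAP))
    distributions = [
        {"@type": "DataDownload", **_rename(dist, DISTRIBUTION_MAP)}
        for dist in dcat_dataset.get("dcat:distribution", [])
    ]
    if distributions:
        out["distribution"] = distributions
    return out


# Invented sample record.
example = {
    "dct:title": "Brain scans",
    "dct:description": "A small set of example images.",
    "dcat:keyword": ["mri", "neuroimaging"],
    "dcat:distribution": [
        {"dcat:downloadURL": "https://example.org/scans.zip",
         "dcat:mediaType": "application/zip"}
    ],
}
converted = dcat_to_schema_org(example)
```

A JSON-LD @context could express the same renaming declaratively; the dictionary approach here is just the shortest way to show how close the two structures are.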

Beyond this high level DCAT / Schema.org/Dataset approach (which is barely a change from classical 1990s Dublin Core), there are lots of other aspects to dataset description opening up, and lots of questions about how different standards plug together in practice, even just looking at W3C stuff like CSVW and Data Cube. I've recently been spending a lot of time around Fact Checking initiatives, in the context of misinformation and Schema.org's Claim and ClaimReview markup. In that context there are some DXWG discussions on representing caveats and footnotes from statistical data at https://lists.w3.org/Archives/Public/public-dxwg-wg/2017Jul/0041.html which may be of interest here. There are also efforts like Bioschemas who are starting to crawl this data and elaborate a few schema.org additions that help bridge the cross-domain dataset descriptions with domain-specific identifiers and ontologies.

DmPo commented Sep 13, 2018

thadguidry commented Sep 13, 2018

@danbri Noted about use of "damn it". Sorry, bad day at Ericsson. My apologies. Thanks for noting you also would like to see as much discussion as can be had in this issue as well.

vsoch commented Sep 13, 2018

hey everyone this information is really fantastic! What I'm doing is starting my exploration from the point of version control - any schema that we use and then tooling to move the data it describes is going to start with Github (and I'm realizing, Git Annex). So my plan is the following:

which may be of interest here. There are also efforts like Bioschemas who are starting to crawl this data and elaborate a few schema.org additions that help bridge the cross-domain dataset descriptions with domain-specific identifiers and ontologies.

Having templates and entrypoints for say, a biologist to easily plug into the right schema, and then use it to move data from a local place to Google Cloud (I don't mean to preference a vendor but the Stanford hospitals are heavily invested and this will be a first use case!) and then to the library archive when "live work" is done is my desired goal.

I'm also really glad to hear that the various technologies are related - it makes life much easier, but also says good things about the people and communities involved.

Early Goals

So here is what I'm setting out to do first - and of course this will change as I learn more! I think in my first "dummy test" I am going to try and see how far I can get doing the following:

Use Case Driving Goal

  1. Start with dataset locally, some file organization and metadata
  2. Choose an organization from a nice repository / web interface
  3. Use provided validators to put data into chosen file organization, and add files
  4. Use datalad (git-annex) to move files to a Google Cloud Storage. Do magic.
  5. Use datalad again to move to some (fake) archive, basically another server

Additional Tooling will Mean

  • Search of library datasets (likely driven by datalad)
  • Tools to easily generate and share the metadata templates
  • Validators of file organizations / metadata for some dataset
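A validator for a file organization could start out very small. The Python sketch below checks that a dataset folder contains a set of required entries; the required names are hypothetical stand-ins for whatever a chosen organization would actually specify, and real validators (for example for BIDS) check far more than file presence.

```python
from pathlib import Path
import tempfile

# Hypothetical organization: entries that must exist under the dataset root.
REQUIRED = ["dataset_description.json", "images", "annotations"]


def validate_organization(root, required=REQUIRED):
    """Return the required entries that are missing under root."""
    root = Path(root)
    return [name for name in required if not (root / name).exists()]


# Build a throwaway dataset that is missing one required folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "images").mkdir()
    (Path(tmp) / "dataset_description.json").write_text("{}")
    missing = validate_organization(tmp)

print(missing)
```

Returning the list of missing entries, rather than a bare True/False, makes it easier to report to the researcher what needs fixing before the data is moved.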

I will link this issue in my notes so I remember to post updates along the way! I really like Github issues for this kind of discussion, and to be honest get a little lost in Google Doc comment chains. If it's okay with you, we might keep this issue open to have further discussion in the coming weeks.

Contributor

danbri commented Sep 18, 2018

@DmPo interesting- do you have a pointer to the details? Is the newish Claim type of any use for your approach?

DmPo commented Sep 18, 2018

Hi @danbri, I will have a pointer later this week - launching a new site right now. Yes, sameAs is a good idea - it can be used for linking reposts of a fake to the "original" fake. This makes easier debunking of the reposts of the same fake.

vsoch commented Sep 19, 2018

hey everyone! A quick update after some weekend work. I will start with the use case and walk through the steps I've taken so far (and where noted, where I have a question or two). Just as a note I sent this out to a few of you via old school email too :)

  1. I started with a dataset. It describes Container recipes. This is my dummy case - I want to use them on Google Cloud and then archive them somewhere.
  2. I realized that there is no description in schema.org to describe a Container. So I am creating tooling that:
    a. starts with Google Drive files, and asks the user to write the specification and export to .tsv
    b. has the user fork a template repository and just add their files to a folder! Once connected to continuous integration, it is ready to generate the specification using a Docker container. The version-controlled finished bundles are sent back to Github pages (and they could be rendered pretty here, but I haven't made a template yet)
    c. The next step would be that the user does a PR with the generated files to (wherever the schema.org specifications are submitted?). **I would want some guidance here about "what files do I need to submit, and where/how, for schema.org?"** From what I can tell I need to generate some JSON-LD and examples, and provide a web interface, and I can do this (just need a bit of time!)

For all of the above, I'm incapable of moving forward without having a nice web interface to describe the process (beyond the Github READMEs) and I always want a "specifications" repo for a user to be able to submit a specification output to (for one cohesive specifications repo) so I'm going to work on this next. Once I have this web interface and Container specification list (the discussion I hope to start here) I'd like to submit it proper to schema.org (as an extension?) and then go back to describing my Dockerfile dataset! I am thinking of using Datalad to move things around, and notably for the git-annex functionality.

And I'd really like to bring those interested here into working on this tooling under openschemas! Just let me know you are interested and I can add you. I realize that some of the tech / other open science related things don't fit well under bioschemas (where I was originally contributing) so I made this organization, and it falls nicely alongside openbases, which provides templates for doing reproducible things for open science, generally.

Full circle, cue Lion King music :) That's the plan for now! I don't think I need any help or have questions beyond the bolded above. Whatever information you relay to me I can also write clearly into the web interface I'm making, so it will be good use of time. I'm going to work on that later today - likely taking a quick break to go for a run 🏃

vsoch commented Sep 20, 2018

Quick update! @ricardoaat has taught me the updated specification format to feed into the web interface (and schema.org) so I'm updating map2model to generate that (see the quick links in this issue), and I've prepared the web interface I was talking about to serve them (a sibling of the beautiful bioschemas.org!). I'll thus:

  • update map2model
  • generate the Container specifications using it
  • then better understand how to contribute / submit the specifications for schema.org, and ultimately use to describe the Dockerfiles dataset.

Note that the repos / site are pretty bare bones, I'll be working heavily on them this week.

vsoch commented Sep 21, 2018

Quick update:

  • spec-container is now mostly good to be an example of a "template builder" - meaning I export tsv files from a template, dump them in a folder, and then connect to continuous integration to generate the files, a nice web UI to show the drafts, and then the content to submit as a specification (see links / images in README).
  • specifications is then where the user can contribute a <NAME>.html specification recipe when it's ready! (The finished file generated from the builder above). Although it's a separate repository, a la magic of Github pages (and similar templating!) it renders cleanly into the specifications URL
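The "export tsv, then generate files" step can be pictured with a short Python sketch. The column names and rows below are assumptions made up for illustration; they are not the actual map2model input format, and the real builder produces richer output than this dictionary.

```python
import csv
import io
import json

# Hypothetical TSV export; the columns are assumptions, not the map2model format.
TSV = (
    "property\texpected_type\tdescription\n"
    "name\tText\tName of the container recipe\n"
    "keywords\tText\tTags describing the software inside\n"
)


def tsv_to_spec(tsv_text, spec_name):
    """Turn a tab-separated property table into a simple specification dict."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {"name": spec_name, "properties": [dict(row) for row in reader]}


spec = tsv_to_spec(TSV, "ContainerRecipe")
print(json.dumps(spec, indent=2))
```

In a continuous integration job, the same function would read each exported file from the folder and write one output file per specification.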

I'm working on testing for the submissions to specifications next (in map2spec) and then I'll finish up the Container* family of new specifications (hopefully this weekend?) and then (finally) try using the definitions to describe the dinosaur Dockerfile dataset (ContainerRecipe)

vsoch commented Sep 22, 2018

Small distraction - I converted the spec-container fully into a spec-template that is now added as an openbases template. This means adding a new badge for specification builders ("spec") that uses the same red from schema.org (the darker one) as a fun easter egg :) The full template falls within openbases because the user just needs to fork, add some file content, and then build on circle and they get artifacts / ghpages artifacts, and a web interface for their drafts.

Same plan mentioned before for next steps (testing then Container definitions!) @rajido also reached out to me today and we are going to talk about labeling the biocontainers, which will be wicked!

vsoch commented Sep 25, 2018

<update> I started ContainerRecipe draft. The delay is because I set up the entire validation library in openschemas-python, and then integrated it into the specifications repo, so now the specifications repo is ready to have the files (like ContainerRecipe.html) submitted and tested properly with PR (see the second to last tab here). I'll be doing the Image/Distribution and others soon, and then submitting to the specifications repository for more feedback (and testing of course), along with more robust docs for openschemas-python </update>

vsoch commented Sep 26, 2018

Another update. We now have ContainerImage! I'm hoping for discussion / feedback from OCI, and then to generate some kind of official submission for a schema.org extension. Is there a documentation base for how to do that? Discussion with OCI (I hope) will happen here --> opencontainers/image-spec#751 (comment) and that's also where you can find links, if interested.

My next steps would be to:

  • get feedback
  • submit to schema.org as extensions
  • tag container recipes (Dockerfiles) in a repository
  • test moving around from local --> Google Storage --> archive (maybe using datalad?),
  • move on to larger databases like Biocontainers and Singularity Hub, of course

And in there somewhere I'll clean up the docs and write some "hey you can do it too!" material to help researchers with weird datatypes that warrant a specification to contribute.

Contributor

RichardWallis commented Sep 26, 2018

Developing / testing / proposing updates, additions, extension, etc. to Schema.org - background reading:

vsoch commented Oct 1, 2018

okay so I'm trying to follow some of the documentation and from what I can tell:

  • I'm supposed to open an issue to track progress of a suggested addition, an example is this one that serves this proposal for Legislation
  • there is supposed to be some webby place to describe it.

@satra advised me that I shouldn't have new properties so I just removed any new ones (with parent "Container") from the proposed one that I made, but I'm a bit confused about this because the example above definitely has a bunch that are categorized under "Legislation" (and that is the proposal).

I had started one here mimicking Bioschemas, but it seems like the suggested thing is to use the app that is provided at this repo. Should I blow up what I've done and start again? As a newcomer I find this entire process and the discussion really confusing, for what it's worth. There is no clear checklist or set of steps for going from scratch to a submission, other than long verbose pages across many places, and I'm doing my best but struggling quite a bit. Guidance (specific, stepwise things) would be appreciated! Thank you!

vsoch commented Oct 1, 2018

And just a quick comment, this is a huge issue, from the view of a developer:

We expect collaboration will focus more on schemas and examples than on our supporting software.

If an expert with a schema or domain does not find it easy to contribute, that is a failure state I think. I want to suggest that we can achieve both good software and specifications, and they can live together in harmony.

Contributor

rvguha commented Oct 2, 2018

vsoch commented Oct 2, 2018

Thank you for this feedback @rvguha ! I had just cloned this repository, and was going to take a crack at creating a Dockerized local template that could be run to produce a (local generation) of the example site. But it sounds like this isn't a priority, at least as long as there is some method to share a specification suggestion? In this light, does the template that I created, perhaps with better description of the classes, suffice for discussion? I was aiming to make it easier for contribution - so far the workflow is:

  • generate your original descriptions and specifications on Google Drive Spreadsheets
  • download and add to a template repository that builds the page(s) I linked above, for one or more specifications.

The template generates the static files of the specifications in yaml, and I would want this template to also generate (whatever format of file) is needed for a "real / final / etc." contribution to schema.org (json-ld?).

The goal of the steps above is that any domain expert can generate a version controlled, and easy to have discussion over web interface and "final submission" files without knowing anything about software engineering / version control etc. This is a very easy path to contribution, but if it's the case that the templates I've made thus far aren't producing the kind of content that could be discussed, I need to step back and re-work them. If they are in the right direction, then I would suggest that I:

  1. put more time into the human elements (describing the properties / examples) and
  2. then adding the data structure format that would be needed for a final submission to schema.org

And although it's not a priority, is there interest in generating a dockerized local version of the app engine deployment as a (possibly better) alternative to preview / discuss a specification contribution? As the Legislation group did, it looks really sharp. I notice that the template here has a nice switcher at the bottom for looking through different data structures of the specifications, and minimally I'd like to add that to the template that I am working on, because I'm guessing one or more of those structures is the final file(s) to be submitted.

As for rationale - holy cow! It's so badly needed it almost feels silly to restate, but I can briefly comment now. Containers are sort of hugely important for anything and everything! A container is the currency of reproducible deployments, and with technology like Singularity even of scientific compute (because we can run on shared cluster resources). Without a specification that describes the components of images, recipes, and distributions (registries) we are living in a messy universe where the best I can do to find a "tensorflow" container is do a search on Docker Hub and pray that the random container I choose (based on tags or a name) might be what I'm looking for. The way we solve this problem is by way of getting Google's help, meaning having containers join into Dataset search. In that schema.org specifications are the driver of this, we simply must have these definitions. They must follow the OCI specifications, and move in parallel with them so that we aren't inventing new things and making it harder. This (in my mind) seems like such an easy problem to solve, the slow and hard parts are just putting these pieces together. This is what I'm trying to do - and since I've found this challenging I've been trying to make a template and more programmatic way for (the next person who wants to contribute) to make it easier.

Some quick links:

  • The Container Advisory Group started by Nvidia and now including representatives from all over industry and academia has discovery as one of its primary challenges.
  • OCI is where the specifications can already come from. But in that OCI doesn't feed into Google Dataset search, we need to capture the equivalent of an Abstract class of OCI to serve from here.
  • BOF (birds of a feather) at Supercomputing, PEARC, always has big group meetings about Container Curation / organization. Everyone talks a lot and I don't see anyone ever work hard on fixing this.
  • I've also had calls with other supercomputing centers that want to just talk about the problem of "Container Curation." The strategy now seems to be to get API endpoints of all the registries and make a massive list. Biocontainers worked well doing this for a while, but it looks like it became really hard to do this. I'm not sure of the status currently!

In a nutshell, we need a scaled way to not only provide a list of containers, but better information about them that can be searched. I think this goes beyond the abilities of what each tiny maintainer can do, but a massive search engine like Google can do fairly easily. It would be trivial for the maintainers of:

  • Docker Hub
  • Quay.io
  • Biocontainers
  • Singularity Hub

and other registries to be able to add metadata tags to the container registry pages, and then have a beautiful way for a scientist to not find just a tensorflow container, but the tensorflow container for the purpose that he/she needs! Without this, the universe of containers is just a mess. So much awesome development and tooling will come, and an ability to better compare methods and software when we have this labeling.
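To show what "adding metadata tags to a registry page" might involve, here is a hedged Python sketch that renders a JSON-LD script element for a container listing. It uses the existing schema.org SoftwareApplication type as a stand-in, since the Container types discussed in this thread are only proposals; the version, keywords and registry URL are invented.

```python
import json


def jsonld_script_tag(metadata):
    """Render metadata as a JSON-LD script element for embedding in a page."""
    # Escape "</" so a string value cannot close the script element early.
    payload = json.dumps(metadata, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Invented example values; SoftwareApplication is a stand-in type here.
container = {
    "@context": "https://schema.org/",
    "@type": "SoftwareApplication",
    "name": "tensorflow",
    "softwareVersion": "1.0",
    "keywords": "tensorflow, gpu, python3",
    "url": "https://registry.example.org/r/tensorflow",
}

tag = jsonld_script_tag(container)
print(tag)
```

A registry would generate one such element per container page from data it already holds, which is why the per-registry effort is small compared with building a cross-registry search.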

Please let me know your feedback! I don't want to just talk about these things, I want to make them happen.

Contributor

danbri commented Oct 2, 2018

These investigations sound very interesting but I'm having trouble keeping the various levels of abstraction clear in my head, to be honest. Am I right that we're talking both about the specific tooling for Schema.org collaboration ("I had just cloned this repository, and was going to give a crack at creating a Dockerized local template that could be run") as well as working towards new schema.org schemas to describe things around containerized software, services and datasets?

If that's correct I'd suggest breaking out the Schema.org project tooling aspects into another issue. @RichardWallis has been working to minimize some of our dependency on AppEngine; today's commits bring us closer to a static file-based generation and serving model. For those looking at dockerizing schema.org's current tooling, the Travis-CI config might be useful.

On the vocabulary front there is a grey area in between http://schema.org/Dataset and the more software-oriented types, where container technology is increasingly central. This came up a bunch when I was talking to folk like bioschemas. It would be useful to understand how container descriptions could be integrated into both schema.org-based Dataset description, and also into W3C DXWG WG's DCAT efforts. The structures are very similar. Perhaps @agbeltran can offer some advice, as she has a link in both those efforts.

vsoch commented Oct 2, 2018

Am I right that we're talking both about the specific tooling for Schema.org collaboration ... as well as working towards new schema.org schemas to describe things around containerized software, services and datasets?

Correct! I had only intended to do the second, but I found the first so hard that I decided to try and help as I was doing the second. I'm mostly done with the goals I had outlined for the first, and am very happy to discontinue working on it in favor of the second thing (my original goal).

If that's correct I'd suggest breaking out the Schema.org project tooling aspects into another issue. @RichardWallis has been working to minimize some of our dependency on AppEngine; today's commits bring us closer to a static file-based generation and serving model. For those looking at dockerizing schema.org's current tooling, the Travis-CI config might be useful.

This sounds great! I am a big fan of continuous integration :) @RichardWallis I won't throw any more wrenches into the mix, I'll stick with the openschemas templates I'm using now because I'm mostly done, but please reach out to me if I can be of help.

On the vocabulary front there is a grey area in between http://schema.org/Dataset and the more software-oriented types, where container technology is increasingly central.

Exactly. This is why I created openschemas. Take a look there at the link in the first lines of the opening paragraph, which I wrote some time ago now. It makes this exact point. :)

This came up a bunch when I was talking to folk like bioschemas. It would be useful to understand how container descriptions could be integrated into both schema.org-based Dataset description, and also into W3C DXWG WG's DCAT efforts.

I don't think containers belong with Datasets, or with Biology related things. A container is definitely not the right place for a dataset (although it interacts with them) and there is definitely no exclusive tie to bioschemas / biology or even a scientific domain. It's an open source technology that is useful for many things, and deserves its own sort of bucket (or minimally shouldn't be forced artificially into a bucket it doesn't belong).

The structures are very similar. Perhaps @agbeltran can offer some advice, as she has a link in both those efforts.

That would be great! Here is the full set of specifications: recipes (e.g., a Dockerfile), images (e.g., the actual container binary, which might have some data but is more likely to be software that interacts with data), and the base of those things, the more abstract Container.

https://openschemas.github.io/specifications/

Happy Hacktoberfest everyone!! 🎃

thadguidry commented Oct 2, 2018

Hi @vsoch

OCI already has good overlap in its Container Image image-spec annotations I just noticed, with our Schema.org/CreativeWork and this is a great starting point.
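One way to picture that overlap is a small Python sketch that renames OCI image-spec annotation keys to CreativeWork properties. The annotation keys are the predefined ones from the OCI image-spec; the pairing with schema.org properties is an assumption for illustration and not an agreed mapping, and the sample values are invented.

```python
# Assumed pairing of OCI annotation keys with schema.org CreativeWork
# properties; this is an illustration, not an agreed mapping.
OCI_TO_SCHEMA = {
    "org.opencontainers.image.title": "name",
    "org.opencontainers.image.description": "description",
    "org.opencontainers.image.url": "url",
    "org.opencontainers.image.version": "version",
    "org.opencontainers.image.created": "dateCreated",
    "org.opencontainers.image.licenses": "license",
    "org.opencontainers.image.authors": "author",
}


def annotations_to_creative_work(annotations):
    """Build CreativeWork markup from the annotations that have a known pairing."""
    work = {"@context": "https://schema.org/", "@type": "CreativeWork"}
    for key, prop in OCI_TO_SCHEMA.items():
        if key in annotations:
            work[prop] = annotations[key]
    return work


# Invented sample annotations.
sample_annotations = {
    "org.opencontainers.image.title": "example-image",
    "org.opencontainers.image.licenses": "MIT",
    "org.opencontainers.image.revision": "abc123",  # no pairing above, so skipped
}
work = annotations_to_creative_work(sample_annotations)
```

Annotations without a pairing are dropped in this sketch; deciding where they should go is exactly the kind of question the thread raises.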

For datasets that are provided sometimes within Container Images... how is the metadata typically captured for those datasets ? What metadata standards are used commonly in Scientific realms besides those listed here if you know ?
(Myself, Dan, and others are aware that there are other standards used to capture metadata about datasets, such as https://frictionlessdata.io/specs/data-package/ )

Contributor

danbri commented Oct 2, 2018

@vsoch - thanks, I have a clearer picture now. Just replying quickly on one point

I don't think containers belong with Datasets, or with Biology related things. A container is definitely not the right place for a dataset (although it interacts with them) and there is definitely no exclusive tie to bioschemas / biology or even a scientific domain.

I entirely agree. It's rather that the usecase arises there (and elsewhere). Organizations, events, scholarly articles etc also have identifiable and nameable relationships with data and datasets, but are fundamentally different. My thinking was just that we may want to seek out sanity checks from those working under the "dataset metadata" and lifescience/bio banners, to help work out which schemas meet which needs.

vsoch commented Oct 2, 2018

@thadguidry oh interesting! I think an annotation for a container coincides with a LABEL (e.g., it's called LABEL in the Dockerfile, or %labels for a Singularity container), so the field that I have for annotations would perhaps be a new property classified as a kind of creative work?

For containers, it's typically bad practice to try and use them to provide datasets, at least any significant ones in terms of size. For container metadata, however, the standard is to provide it via an inspect command (e.g., docker inspect <container>, or the similar inspect I implemented back in the day for Singularity, singularity inspect <container>). The metadata usually consists of:

  • labels (annotations as we are discussing)
  • environment
  • runscript (or more generally, entrypoints)
  • the original recipe (in singularity, the flag is actually for a "deffile" so it's -d)
  • help sections (Singularity only)

I'm not super experienced with metadata for datasets, but I'd expect to see minimal things like versions and software versions. Another important bit for containers (not listed above) is build-time needs / host dependencies (e.g., nvidia-docker or similar). You can see an example of fields that a user wants to tag for a database of dockerfiles here --> vsoch/dockerfiles#4. It's primarily tags based on "what software is inside here?" and "what vendors are relevant?"
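As a rough illustration of the metadata listed above, here is a minimal sketch that pulls labels, environment, and entrypoint out of inspect-style JSON. The sample document is made up and trimmed; it assumes the `Config.Labels`, `Config.Env`, `Config.Entrypoint`, and `Config.Cmd` layout that `docker inspect` reports, and `summarize_inspect` is a hypothetical helper name, not part of any tool.

```python
import json

# Trimmed, made-up sample shaped like `docker inspect <image>` output,
# which is a JSON array with one object per inspected item.
SAMPLE_INSPECT = """
[
  {
    "Id": "sha256:0000",
    "Config": {
      "Labels": {"maintainer": "someone@example.org",
                 "org.example.domain": "neuroimaging"},
      "Env": ["PATH=/usr/local/bin:/usr/bin", "LC_ALL=C"],
      "Entrypoint": ["/bin/sh", "-c"],
      "Cmd": ["echo hello"]
    }
  }
]
"""


def summarize_inspect(raw):
    """Pull labels, environment, and entrypoint out of inspect-style JSON."""
    record = json.loads(raw)[0]
    config = record.get("Config") or {}
    # Env arrives as "KEY=value" strings; split on the first "=" only.
    env = dict(item.split("=", 1) for item in config.get("Env") or [])
    return {
        "labels": config.get("Labels") or {},
        "environment": env,
        "entrypoint": (config.get("Entrypoint") or []) + (config.get("Cmd") or []),
    }


summary = summarize_inspect(SAMPLE_INSPECT)
print(json.dumps(summary, indent=2))
```

In practice the raw string would come from the inspect command itself rather than a hard-coded sample, and Singularity's inspect output is organized differently, so it would need its own parsing.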

@thadguidry

thadguidry commented Oct 2, 2018

@vsoch Just so you know... We have an equivalent Tags convention in Schema.org with our https://schema.org/keywords property.

You might also already be aware of this, but... Images and Containers are 2 different things... I was previously talking about Images, where you would get output about metadata of what an image and its possible data might contain with the output of docker image inspect MyImageName

@vsoch

vsoch commented Oct 2, 2018

ah this is very good! I'll see if I can integrate these things (tags and annotations discussed above) into the container specification(s), and also provide a "Where does it fit?" simple diagram to share here. Likely tomorrow - need to eat some dinner. 🍽

@thadguidry

thadguidry commented Oct 2, 2018

@vsoch no problem. Also we have a "generic" Key:Value system that can be used when there is no Schema.org Property already created yet ... https://schema.org/PropertyValue

So right now, in my opinion, I would say Schema.org has 100% of what you need to describe structured data about Containers and Images and their metadata. (The existing properties we have might not be the best fitting, but they can fit and be understandable for most search engines and structured data parsers) If you don't find a property within Schema.org to hold your structured data about Containers and Images and Datasets...then let's talk about those, and we can gladly point to possible candidates. (incidentally, a few weeks ago, I did look at your use case with BIDS data and specifically around how PyBids handles the metadata around it within its functions here: https://github.com/bids-standard/pybids/blob/master/bids/variables/variables.py and I didn't see anything that Schema.org couldn't handle currently in some fashion, again, perhaps not always the best fitting, but it could be expressed with Schema.org's current Types and Properties ) So when metadata doesn't seem to fit well, we can talk about those.
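To make the keywords and PropertyValue suggestions concrete, here is a minimal sketch that maps image tags and labels onto existing schema.org terms. It is only one possible mapping: typing the image as SoftwareApplication is an assumption, and so is attaching the labels through `additionalProperty`, which schema.org defines for types such as Product and Place rather than for software. `describe_image` and the sample values are hypothetical.

```python
import json


def describe_image(name, tags, labels):
    """Build a JSON-LD dict for a container image from tags and labels."""
    return {
        "@context": "https://schema.org",
        # Assumption: SoftwareApplication is the closest existing type.
        "@type": "SoftwareApplication",
        "name": name,
        # schema.org/keywords accepts a comma-delimited text value.
        "keywords": ", ".join(sorted(tags)),
        # Assumption: additionalProperty is borrowed as the holder for labels.
        "additionalProperty": [
            {"@type": "PropertyValue", "name": key, "value": value}
            for key, value in sorted(labels.items())
        ],
    }


doc = describe_image(
    name="example/analysis-image",
    tags={"tensorflow", "python"},
    labels={"org.example.domain": "neuroimaging", "maintainer": "someone@example.org"},
)
print(json.dumps(doc, indent=2))
```

A structured data parser would read each label as a generic name/value pair, which is the "not always the best fitting, but understandable" trade-off described above.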

@ptsefton

ptsefton commented Oct 2, 2018

Hi all, I'd like to point you to an effort we have been working on that uses Schema.org for packaging data (which is not containers or images). This uses almost 100% pure schema.org to describe data, and seems to be compatible with Google's dataset search. https://github.com/UTS-eResearch/datacrate/tree/master/spec/1.0.

See the parts about file provenance using schema:CreateAction.

Does any of this help?

@charlesvardeman

charlesvardeman commented Oct 2, 2018

Not certain if you will find this useful. As part of the DASPOS project, we developed a "Computational Environment" ontology design pattern (http://ceur-ws.org/Vol-2043/paper-03.pdf) in collaboration with CERN to capture the provenance of the environment where HEP calculations are performed. As part of the process, we looked at both VMware and Dockerfile vocabulary to inform the pattern, so that it captures a broad set of the vocabulary. The pattern can be populated via a script with instances from Wikipedia (and Wikidata). The OWL for the pattern is in https://github.com/Vocamp/computationalEnvironmentODP. There is a matching "Computational Activity" pattern (https://github.com/Vocamp/ComputationalActivity) that captures the provenance around a computational execution and links to the Computational Environment pattern. See the concept map: https://github.com/Vocamp/ComputationalActivity/blob/master/concept-map/computationalActivity.pdf. We didn't get to the (general) patterns to describe the underlying data sets used in a particular computational activity.

I had a student work on a proof of concept "smart containers" (http://linkedscience.org/wp-content/uploads/2015/04/paper2.pdf) tool that wrapped the docker command line tool to capture the provenance of the Docker operations using the ODPs and attach it as a label to the docker image. The code for this somewhat functional prototype is in the smartcontainers repo: https://github.com/crcresearch/smartcontainers.

@vsoch

vsoch commented Oct 3, 2018

hey everyone! I did a reverse of plan - I realized that I needed to bring up discussion for "Where does it fit" before updating the specification, so let's start with that :)

I'll start with (texty) discussion here about where I think each component fits into the currently existing schema.org. After discussion here, I'll update the specifications files to reflect what we discuss. I apologize in advance for probably not using terms / descriptors / properties correctly - please feel free to correct where I'm off.

Previous Art

There is an ontology that describes virtualization
but I think the detail might be too much for the goals of schema.org. However, this led me to
step back and approach this by asking the simple question of how much do we need to represent to
achieve the current goals?

What are your goals, dinosaur?

  • I want to describe and then programmatically find software and data containers:
  • for a particular domain of research
  • with specific software or libraries, down to the detail of versions
  • with support to run on some specific host

This will organize our container universe, and be essential not just for academia, but for industry and all domains it touches.

What level of abstraction is ideal?

Representation of containers that is too detailed is actually just as bad (I think) as not having enough representation, period. It might be useful for knowing the version of a kernel if I want to know if I can use a Singularity Container there (for example, Centos 6, no overlayfs, ruhroh) but there are a lot of intricate details that might be useful in only 1% of cases. And having all the extra support for those 1% actually makes the specifications really complex and confusing. So for this first go, I would suggest we try to hit the core needs of the top 80%, and favor simplicity with the mindset that if additional need is there, the community will step in and express it.

Where do these Container specifications fit in?

The original specifications I had in mind were:

  • Container
  • ContainerRecipe
  • ContainerImage
  • ContainerRuntime
  • ContainerDistribution

but now I realize there's a bit more to it than that. Let's talk about this, and I'll address them one by one via questions, explaining my thought process.

Question 1: Where does container fit in?

I started with a very simple question.

Is a linux container a kind of software?

Meaning SoftwareApplication. In that it's a binary, I think that we could fit it under SoftwareApplication. That would look like this:

Thing > CreativeWork > SoftwareApplication > Container

That is the easy answer, because it fits into existing specifications in schema.org.
But more correctly, if we are to also eventually model virtual machines, then we
really need hypervisors too. And hmm, I don't think a virtual machine is a kind of SoftwareApplication per se, it's something else. It has software applications! So let's move it up, and I don't even think it belongs under CreativeWork to be grouped with poetry and books and what not. It doesn't have a good parent in the base. We would need something like:

Thing > Virtualization > Container
Thing > Virtualization > Hypervisor

But at some point, we are going to care about operating systems, hardware, and hosts. The
hosts are the machines with hardware, and the operating systems are what the virtualization
deploys. So we need something like:

Thing > Hardware
Thing > OperatingSystem
Thing > SoftwareHost
        - runsOn Hardware
        - supports Virtualization
        - has OperatingSystem

And our virtualization then relates to those things:

Thing > Virtualization
        - isSupportedBy SoftwareHost
        - has OperatingSystem

and now adding containers and hypervisors, they can inherit through this graph

Thing > Virtualization > Hypervisor
Thing > Virtualization > Container
                         - has OperatingSystem
                         - buildFrom ContainerRecipe (sometimes)
                         - has SoftwareApplication (sometimes)
                         - has annotations

So while I don't think we want a super detailed organization of hardware and virtualization,
I think it should be represented on a high level because it's going to be the case that
these are important parts of describing containers. For the above - I am modeling the level
of Container instead of just ContainerImage because I'm not sure if we can have a Container
that isn't associated with an image. It could be that the properties above should just belong with
a ContainerImage that is the child of Virtualization.
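One way to sanity check a proposed hierarchy like this is to write the parent links down and walk them. The sketch below does that for the types discussed in Question 1; only Thing, CreativeWork, and SoftwareApplication are existing schema.org types, the rest are the proposals from this comment, and `lineage` and `is_a` are hypothetical helper names.

```python
# Child -> parent edges. Only Thing, CreativeWork, and SoftwareApplication
# exist in schema.org; every other name is a proposal from this thread.
PARENTS = {
    "CreativeWork": "Thing",
    "SoftwareApplication": "CreativeWork",
    "Hardware": "Thing",
    "OperatingSystem": "Thing",
    "SoftwareHost": "Thing",
    "Virtualization": "Thing",
    "Hypervisor": "Virtualization",
    "Container": "Virtualization",
}


def lineage(type_name):
    """Return the path from Thing down to type_name, e.g. 'Thing > A > B'."""
    chain = [type_name]
    while chain[-1] in PARENTS:
        chain.append(PARENTS[chain[-1]])
    return " > ".join(reversed(chain))


def is_a(type_name, ancestor):
    """True if ancestor appears on the path from type_name up to Thing."""
    return ancestor in lineage(type_name).split(" > ")


print(lineage("Container"))
print(is_a("Container", "CreativeWork"))
```

With these edges, Container sits under Virtualization and is not a CreativeWork, which is the distinction argued for above.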

Question 2: Where does container recipe fit in?

A container recipe refers to a set of build steps for a container, which is a binary that has an operating system and associated libraries. Examples include Dockerfiles and Singularity recipes.

If we want a quick and dirty solution, in that it's a template or script, a ContainerRecipe (sort of) fits under the category of SoftwareSourceCode. But I'm not sure I would call a container itself software, and then we are walking the fine line of not properly distinguishing a Singularity or Docker container from, for example, software that runs them (e.g., docker or singularity). But it does fit the spirit, so maybe it can be a kind of SoftwareSourceCode?

Thing > CreativeWork > SoftwareSourceCode > ContainerRecipe

If we consider a container to be a kind of software (it is a binary...) then that fits pretty cleanly. Does anyone else have thoughts about this?

Question 3: Where does container image fit in?

The container image isn't the running instance, but the binary (for Singularity the actual file) that generates the instance. It's weird, yeah. If we consider Container to be the more abstract thing, if it could be the case that there are kinds of containers that don't require images, then it would be a child:

Thing > Virtualization > Container > ContainerImage

And then the instance I think would coincide with the container runtime, so we would have this:

Thing > Virtualization > Container > ContainerImage > ContainerRuntime

The OCI has specifications for images and for runtimes, so this is logical to model both. The ContainerRuntime is the instance generated from the ContainerImage, which is a type of Container, a kind of Virtualization technology.

Question 4: Where does container distribution fit in?

A container distribution is a container registry (Docker Hub, Singularity Hub, Quay.io, Biocontainers, etc.) Would it be a subtype of a collection?

Thing > CreativeWork > Collection > ContainerDistribution
                                    has Container (many)

So basically, it's a collection of Container, or ContainerImage. The OCI also has a specification for registries (generally the information they serve and manifests) so we would represent that here.

@ptsefton data crate looks really cool and I definitely think it could be useful once we have these definitions (I've added an issue to give it a try along with datalad for my test dataset!) and @charlesvardeman this is also very useful - have you thought about bringing up these descriptors with the OCI maintainers so it can be linked to a container runtime? I would say we would want to model them separately in schema.org (the idea of the Computational Environment) and then say something like ContainerRuntime needs ComputationalEnvironment ... but since the strategy is to go by the properties of OCI (and not roll our own, to the best that we can) I think the most logical avenue is to try and integrate there first, OR contribute an independent specification of a Computational Environment here, and then convince OCI to embrace it too?

Wait... what about this computational environment?

Actually, this is a very good point, because a computational environment might describe one or more hosts, and this is the kind of thing you would call any sort of cluster. But again I would challenge us to reduce the complexity to a level that can be extended, but doesn't over-complicate based on the goals of having it. In my little framework I'm describing here, based on looking at your chart, it would seem that you are suggesting a ComputationalEnvironment is parent to all these things? Something like:

Thing > Hardware
Thing > OperatingSystem
Thing > ComputationalEnvironment
Thing > ComputationalEnvironment > SoftwareHost
                                   - runsOn Hardware
                                   - supports Virtualization
                                   - has OperatingSystem
  • Is hardware by itself a kind of computational environment? I don't think so.
  • Is a software host by itself a kind of computational environment? I think so.
  • Is an operating system a kind of computational environment? I don't think so, because without a host the operating system is just... data bytes? This is getting very fun and weird for my head to wrap around :)

Here is where it gets kind of cool! I find this interesting because a computational environment could moreso refer to a collection of hosts and hardware (e.g., Kubernetes, SLURM / SGE) - meaning multiple SoftwareHosts OR for the humans among us, just a single SoftwareHost. So instead of SoftwareHost being a kind of ComputationalEnvironment, it becomes a link / (properties?) instead:

Thing > Hardware
Thing > OperatingSystem
Thing > SoftwareHost
Thing > ComputationalEnvironment 
          - has SoftwareHost (many) each of which...
            - runsOn Hardware
            - supports Virtualization
            - has OperatingSystem

I like that better :) Let's put this all together to look at in one place, and I'll leave it open for discussion! I want to suggest that I can take charge of creating specifications for review for the Thing > Virtualization hierarchy, and perhaps @charlesvardeman your group has developed the Computational Environment specifications, and could define a subset to fit into schema.org?

Thing > Hardware
Thing > OperatingSystem
Thing > SoftwareHost
Thing > ComputationalEnvironment 
          - has SoftwareHost (many) each of which...
            - runsOn Hardware
            - supports Virtualization
            - has OperatingSystem

Thing > Virtualization
        - isSupportedBy SoftwareHost

Thing > Virtualization > Hypervisor
Thing > Virtualization > Container
                         - has OperatingSystem
                         - buildFrom ContainerRecipe (sometimes)
                         - has SoftwareApplication (sometimes)
                         - has annotations

Thing > Virtualization > Container > ContainerImage > ContainerRuntime
Thing > CreativeWork > Collection > ContainerDistribution
                                    has Container (many)

Let's circle back to our original points - first the goals. The above would allow for nice labeling of containers with software and data, for containers served in registries, that then could be indexed by Google and the properties exposed for not just discovery, but for "grid type" analyses to answer questions like "What is the optimal computational environment to run Container X?"
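To show what such a question could look like once the relationships are recorded, here is a minimal sketch of the ComputationalEnvironment / SoftwareHost / Container links as Python dataclasses. Everything in it is an assumption made for illustration: the field names, the reduction of Hardware and OperatingSystem to plain strings, and the matching rule in `compatible_hosts`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SoftwareHost:
    name: str
    hardware: str          # runsOn Hardware (simplified to a string)
    operating_system: str  # has OperatingSystem
    supports: frozenset    # supports Virtualization (technology names)


@dataclass
class ComputationalEnvironment:
    name: str
    hosts: list = field(default_factory=list)  # has SoftwareHost (many)


@dataclass(frozen=True)
class Container:
    name: str
    technology: str           # e.g. "docker" or "singularity"
    needs_hardware: str = ""  # e.g. "gpu"; empty means no requirement


def compatible_hosts(env, container):
    """Names of hosts that support the container's technology and hardware."""
    return [
        host.name
        for host in env.hosts
        if container.technology in host.supports
        and container.needs_hardware in ("", host.hardware)
    ]


cluster = ComputationalEnvironment(
    name="example-cluster",
    hosts=[
        SoftwareHost("login1", "cpu", "CentOS 7", frozenset({"singularity"})),
        SoftwareHost("gpu1", "gpu", "CentOS 7", frozenset({"singularity", "docker"})),
    ],
)
image = Container("tensorflow-example", "docker", needs_hardware="gpu")
print(compatible_hosts(cluster, image))
```

A real answer to "optimal" would need more than a yes/no match, but even this level of detail is enough to filter hosts, which fits the keep-it-simple mindset.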

Second a mindset of simplicity - for many of these, we can leave them to be very simple / general (like shells) and have the community come in and make contributions for the details. I think it's our job to set up the skeleton / framework, and not to try and get the entire detailed thing perfectly.

That was a lot more than 0.02, so I'll say there is my 2 dollar pancake. 🥞 :)

@charlesvardeman

charlesvardeman commented Oct 4, 2018

@vsoch I would be happy to help. I’m traveling over the next couple of days and need some time to digest your suggestions and write some comments. One other comment that may give food for thought. We started developing a pattern called ComputationalObservation that was akin to O&M or SOSA observation for a computational result. Here is a brief talk that I gave to ontolog:
http://ontolog.cim3.net/file/work/OntologySummit2015/2015-03-05_OntologySummit2015_Beyond-Semantic-Sensor-Network-Ontologies-2/Track-B_OntologySummit2015_CharlesVardemann_2015-03-05.pdf. Slide 18 shows how computational mode, algorithm, SoftwareAsCode, Library and Execution tie together for a computational observation.

@vsoch

vsoch commented Oct 4, 2018

Have a safe trip, and looking forward to your thoughts! ✈️

If the container is the box of pancake mix, the cabinet and then house are the computational environment and hardware, respectively, this is one level deeper - the computational observation is actually everything that goes into creation of the pancake mix (the algorithm to grind the flour, the kitchen it was done in, the amounts, etc.). I think this is important too, and probably should be represented independently even from the small hierarchy we are discussing. For example, you can easily have a component of a ComputationalObservation without any Hardware or a Container. For an algorithm, well couldn't that even be in my head? 🤔

Anyhoo, have a safe trip and let's talk about all of the above when you have some time! I want to also look more closely at how schema.org is generating the final specifications because given the need for json-ld and similar, I'm now not totally happy with just having yaml. But I don't think there is rush for this development because as @rvguha pointed out, the important first thing to do is have discussion (and we can do that right without any special tools :) )

@HughP

HughP commented Oct 4, 2018

@vsoch

I'm just getting caught up on this thread, and I was reading:

Thing > ComputationalEnvironment > SoftwareHost
- runsOn Hardware
- supports Virtualization
- has OperatingSystem

And that sounds a bit like some of the features in the DOAP ontology. Are you aware of that ontology? And is it useful for what you are doing here? I'm suggesting that it might be, and there is already some use of it on PyPI and several other software repositories. Here is the github link for the project: https://github.com/ewilderj/doap . Some years ago there was a paper about DOAP presented at the DCMI meeting: DCMI-Tools: Ontologies for Digital Application Description.

@vsoch

vsoch commented Oct 4, 2018

Thanks @HughP.

To all in the discussion - to be clear, my goals right now aren’t to delve into describing software, repos, or projects. Some of the extended discussion above was just a suggestion for how additional (more detailed) descriptions of software or projects could fit into what I’m thinking of. My primary goal is to describe the levels needed up to having a container. Indeed in containers there is software that might be described by these additional ontologies, and down the line I would definitely enjoy helping add these to schema.org, but right now they are out of scope for this discussion.

@thadguidry

thadguidry commented Oct 5, 2018

@vsoch The world moves quickly in regards to Container standardization (and around metadata). I don't want to fill up a book in this comment to you, but I'd suggest a few other resources for further study, which I went through myself last weekend: https://docs.ansible.com/ansible-container/ and https://galaxy.ansible.com/docs/contributing/creating_role.html#role-metadata

For us in Ericsson, we are fully embracing Ansible Container along with OCI image-spec.
For us in Ericsson, the sharing of metadata of Containers will be handled in Container registries themselves.
If search engines want/need to crawl public Container registries then that's fine.
But my personal opinion upon quite a bit of reflection over the past weekend, is that I think OCI should be the primary movers in this space, rather than Schema.org.
And you should contribute to that effort directly HERE -> https://github.com/opencontainers/image-spec/blob/master/schema/content-descriptor.json

@vsoch

vsoch commented Oct 5, 2018

I definitely agree that OCI is leading in the space, and this is why I am mirroring the specs. I can't comment on Ericsson, I don't know anything about that company other than maybe a phone company?

I am definitely in support of the idea to develop where the community is moving and thriving. The missing component of going directly to that effort THERE is that (and please correct me if I'm wrong) there is no connection between that initiative and plugging into a super-power search engine like Google, for having containers indexed by Google Datasets. You can make the greatest of standards or other but when push comes to shove, if you don't set it up with the right plumbing it's not going to be useful to the graduate student sitting in his dorm room trying to do an efficient search for something. My understanding is that there is no direct feed from the work of OCI into such a global tool, beyond a small strategy of having individual registries providing the result of the effort through their individual APIs. Is this incorrect?

I'm rather marginal / not opinionated about which standard to embrace, so I don't have opinion on "the standard that is best" but I do want to choose the deployment infrastructure that can best and most efficiently distribute the search. That (in my mind) looks more like Google than Docker Hub or similar. This is the reason I've taken this route - schema.org is the gateway to that amazing resource.

@vsoch

vsoch commented Oct 5, 2018

Anyway - @charlesvardeman when you get back, I started with a very basic template for Hardware --> https://openschemas.github.io/spec-hardware/. This is the skeleton level of representation that I would want to have for Hardware, and Virtualization, and then I can develop container (and others with expertise in the previously mentioned would work on those).

The idea is that you can easily make changes, open a PR to test, and then merge will update the web interface. Then you can ask others for review discussion, and "publication" at https://openschemas.github.io/specifications is just moving a file and then another pull request. And everyone here - this is where I'd want to ask for advice, help, etc. on converting my front matter (yml) to json-ld and "whatever format is needed for schemaorg submission." But as it was pointed out, we should have discussion about the specifications here before that.

Back to @charlesvardeman (and others interested!) my understanding is that schema.org can embrace other ontology definitions, so if you would be interested to contribute to these templates, what I have done so far is downloading the tsv files from the Google Sheet mentioned in the README (but there isn't any reason you can't open them, with tab separation, on your computer). Right now, there isn't anything special there - it's just inheriting properties of "Thing." You can make changes to your heart's content, then PR and bring in others for discussion.

Anyway - I'm still looking forward to feedback on the above. In the meantime I'll start skeleton templates for the hierarchy I defined above, and can try to figure out how to make json-ld if nobody has a script or example.
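For the json-ld question, one possible starting point is sketched below. It turns an already-parsed specification into the kind of `rdfs:Class` / `rdf:Property` definitions that schema.org publishes for its own terms. The input layout (`name`, `parent`, `description`, `properties`, `expected_type`) is an assumption and is not the actual openschemas front matter; `spec_to_jsonld` is a hypothetical helper, and the YAML would first need to be loaded into a dict (for example with PyYAML's `yaml.safe_load`).

```python
import json


def spec_to_jsonld(spec):
    """Turn a parsed specification dict into schema.org-style JSON-LD definitions."""
    type_id = "schema:" + spec["name"]
    graph = [
        {
            "@id": type_id,
            "@type": "rdfs:Class",
            "rdfs:label": spec["name"],
            "rdfs:comment": spec["description"],
            "rdfs:subClassOf": {"@id": "schema:" + spec["parent"]},
        }
    ]
    for prop in spec.get("properties", []):
        graph.append(
            {
                "@id": "schema:" + prop["name"],
                "@type": "rdf:Property",
                "rdfs:label": prop["name"],
                "rdfs:comment": prop["description"],
                "schema:domainIncludes": {"@id": type_id},
                "schema:rangeIncludes": {"@id": "schema:" + prop["expected_type"]},
            }
        )
    return {
        "@context": {
            "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "schema": "https://schema.org/",
        },
        "@graph": graph,
    }


# Hypothetical spec, shaped like what a parsed front matter might contain.
hardware_spec = {
    "name": "Hardware",
    "parent": "Thing",
    "description": "A physical machine that a SoftwareHost runs on.",
    "properties": [
        {
            "name": "architecture",
            "description": "The processor architecture, e.g. x86_64.",
            "expected_type": "Text",
        }
    ],
}

definitions = spec_to_jsonld(hardware_spec)
print(json.dumps(definitions, indent=2))
```

Whether proposed terms should live in the `schema:` namespace before they are accepted is a separate question; a project-specific prefix could be swapped in for drafts.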

@vsoch

vsoch commented Oct 5, 2018

and I want to ask @rvguha how can the work that @thadguidry linked at opencontainers be linked to Google Dataset search? If we can do that, and the development / community is more active there, it might be best to pursue that connection instead.

@vsoch

vsoch commented Oct 5, 2018

And @thadguidry I think we need to figure out how to work together. The expertise for containers indeed comes from opencontainers, but the expertise for everything else that might be modeled (e.g., look at the list of things in schema.org --> https://schema.org/docs/full.html) comes from there.

We must be able to have the work that is being done by opencontainers represented in that larger graph, otherwise it's a limited view of a small domain with no understanding of how it fits into a big picture.

@ptsefton

ptsefton commented Oct 5, 2018

@thadguidry

thadguidry commented Oct 5, 2018

@vsoch If data is not in Container Images...then I don't see why Google Datasets would bother indexing them in some fashion? But if Google Datasets is open to the idea that Datasets could be found in ANY format or package as so happens A LOT in Science domains (including Container Images as a package format)... then that would be @rvguha to talk to...not me or Ericsson my employer. I can only speak to what technologies we at Ericsson internally use to see what data/software might be lurking inside Container Images. You are on the right track Vanessa to solving your discoverability problems.

NOTE: Careful with just the casual use of the term "Container" (which are ephemeral) versus the more appropriate term for your use case of "Container Image".

@vsoch

vsoch commented Oct 5, 2018

Definitely something I am aware of! Albeit there is a lot more I'm not aware of, I'm doing my best :) I'll let @rvguha chime in on how the two worlds can work together - it would be perfect if the experts in a domain can easily "plug in" their work to the larger graph.

@danbri

Contributor

danbri commented Oct 5, 2018

I can speak specifically for the Google Dataset Search work if you have questions relevant to Schema.org discussions, but please do remember this forum is not really about Google products. At Google we have a wide variety of data-related activities going on (as you'd expect), with Schema.org as a common element in several. As far as Google Dataset Search's definition of "Dataset", it is based on schema.org/Dataset and the definitions are intentionally inclusive (we gave several examples but also say "Anything that looks like a dataset to you"). As far as all that relates to Schema.org and containers, I'll just repeat my observation above that software and data and containers are intimately inter-related in practice but there's little to be gained from trying to define exactly where the one stops and the other begins.

@thadguidry

thadguidry commented Oct 5, 2018

I agree with Dan on those points.

@vsoch

vsoch commented Oct 5, 2018

software and data and containers are intimately inter-related in practice

Yes, but that doesn't mean that Container == software, or that Container == dataset. They are very different. It's like saying that Refrigerator == food, or Refrigerator == Mom's leftovers. Mom's leftovers and the food are found in the refrigerator, but both can exist without the Refrigerator (and vice versa). Datasets are entirely different beasts from containers, and to clump a container under the "Dataset" tag under the guise that "Containers interact with datasets" is far too abstract.

but there's little to be gained from trying to define exactly where the one stops and the other begins.

There is absolutely everything to gain, because the Container must work and be optimized for a computational environment. As soon as you remove the idea of a "Container" from the hierarchy then you are forced to deal with groupings of software (many), environments (many), and metadata (also many) and you lose the vehicle to easily group those things (the container). So yes, the relationship is strong between the groups, but schema.org needs something that looks like a Container. I really like the idea of being able to connect OCI with schema.org, so let's talk about that.

@jaygray0919

jaygray0919 commented Oct 5, 2018

My impression here is that the objective of the proposal is to provide a 'meta-data-means' to define Docker and Docker parts. A related analogy is a meta-data-means to define the components of a Blockchain. We have some experience with this and have pushed the specification of an item in a Docker container and a Blockchain to the lowest possible level. Then we use conventional mereology (hasPart, isPartOf and more specific mereology relationships) to define the container. Several simple reasoning applications demonstrate that this approach works. Bottom line: success is based on the quality of the @Class hierarchy. Once that is done, one can use @Property to declare the specific mereology relationships. The result is valid in GSDTT (the harvester we are most concerned with at this time) and can be reasoned using HermiT.

@vsoch

vsoch commented Oct 5, 2018

@thadguidry with respect to opencontainers, I did open an issue on that exact repo over 2 weeks ago, but nobody from the community there responded. opencontainers/image-spec#751.

@vsoch

vsoch commented Oct 24, 2018

hey everyone,

I was able to have good conversation with OCI, and minimally was able to come up with a quasi reasonable proposal (see summary and links from link above). The next step was to meet with @rvguha, which was supposed to be Friday, but now it's moved to sometime in January. This is out of my hands at this point, and to be honest I'm feeling down about it. We are going to be over 4 months by the time anything can even be talked about, and I find this really depressing. Thanks again for the good discussion here, and have a good holiday season.

@vsoch

vsoch commented Nov 6, 2018

I decided waiting for a meeting was not going to accomplish much very quickly. There is much I can do before that. A verbose writeup is here:

https://vsoch.github.io/2018/schemaorg/

TLDR: I created a schemaorg python module that makes it possible for developers to interact with the specifications, and spit out templates that can be put in webby places (and then parsed by the Google bots). This is different than the python you have in this repository that serves the schema.org website. An example of the output is here and view the source to see the embedded json-ld. More examples are served at the repository, and note that the extraction of metadata can go much further than what I did here. We can do some really cool stuff to parse the files list from container-diff, for example.
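For readers who have not seen embedded json-ld before, the general idea can be sketched with the standard library alone. This is not the schemaorg module's API, only an illustration of what "a template with embedded json-ld" means; `embed_jsonld` and the sample metadata are hypothetical.

```python
import html
import json

SCRIPT_OPEN = '<script type="application/ld+json">'


def embed_jsonld(metadata, title):
    """Return a small HTML page carrying metadata as embedded JSON-LD."""
    # Escape "</" so the payload cannot close the script tag early;
    # "\/" is still a valid JSON escape for "/".
    payload = json.dumps(metadata, indent=2).replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f"<title>{html.escape(title)}</title>\n"
        f"{SCRIPT_OPEN}\n"
        f"{payload}\n"
        "</script>\n"
        "</head>\n<body></body>\n</html>\n"
    )


# Hypothetical metadata; the type choice is an assumption for illustration.
metadata = {
    "@context": "https://schema.org",
    "@type": "SoftwareSourceCode",
    "name": "example-container-recipe",
    "keywords": "python, example",
}

page = embed_jsonld(metadata, "Example container recipe")
print(page)
```

A crawler that understands structured data reads the script block and ignores the rest of the page, which is why the same metadata can ride along on any static site.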

If you want to share the above, here is a link: https://twitter.com/vsoch/status/1059611828765052928
It's really important that we keep moving to tackle this problem of container discoverability. Thanks everyone, and still have a good holiday season, whatever one(s) you celebrate. I just tend to eat all the things :) 🦃 🍪
