scope of Schema.org for research data #114
Comments
Hi Vanessa,
Would love to talk to you about this. Can you send me an email at guha@google.com?
guha
…On Wed, Sep 12, 2018 at 3:30 AM, Vanessa Sochat ***@***.***> wrote:
hey schema.org team!
We are putting together an organizational and data movement strategy for
research computing and the library at Stanford, and I wanted to ask how the
schemas fit into the domain of research data. I will describe the ideas we
are discussing first to give you some context.
- a researcher will start a new study and create a definition of data.
Let's say some set of images and annotations. We will want them to be
matched to a particular data format (e.g., DICOM images) that has a set of
metadata (e.g., some subset of header fields, or Radlex terms).
- Ideally, this image format will have a particular organization and
metadata (something schema.org can represent?) and this will drive
tools / software to move it around, and perhaps first put it where it will
be used by the researcher (Google Cloud).
- on Google Cloud,you can imagine it will be in Object Storage, with
object level metadata and if needed, something like BigQuery to handle
scaled queries.
- Then the data will be moved to library archive, more of a filesystem
setup, where it will be accessible by URL.
As it moves around, the organizational schema will help to guide
interaction. It will help with validation and query and integration with
tools built around it. For many of these organizations, they will come from
the research domains themselves. For example, the brain imaging data
structure BIDS <http://bids.neuroimaging.io> is already being widely used
across the neuroimaging community and software.
For the definition of the organization and data, I'm wondering how
schema.org can fit in. I saw that Natasha (previously at Stanford!) at
Google for Google Datasets (see this article
<https://www.blog.google/products/search/making-it-easier-discover-datasets/>)
mentioned schema.org, and it definitely seems relevant for web page
content and making it searchable. In that we want our strategy to be easy
to sync with what the larger community is doing, I wanted to ask about
research data? How can we work together and leverage the resources here so
that our datasets can eventually integrate too into tools provided by
schema.org, Google Datasets, and be useful for searching for our
researchers after archive? How can we contribute templates and other
tooling here to help toward this? Thanks for your help!
@cmh2166 <https://github.com/cmh2166> @hannahfrost
<https://github.com/hannahfrost> @rmarinshaw
<https://github.com/rmarinshaw>
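As a rough illustration of the question above, a dataset of the kind described (DICOM images plus annotations, reachable by URL once archived) could be summarized with schema.org's existing Dataset and DataDownload types, and checked for a few fields before it is moved. This is only a sketch: the dataset name, URLs, keywords, and the choice of "required" fields are made-up placeholders, not anything from the Stanford project.

```python
import json


def describe_dataset(name, description, archive_url, download_url, keywords):
    """Build a schema.org Dataset description as a plain dict (JSON-LD)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": archive_url,
        "keywords": list(keywords),
        "distribution": [
            {
                "@type": "DataDownload",
                # MIME type for DICOM files
                "encodingFormat": "application/dicom",
                "contentUrl": download_url,
            }
        ],
    }


def missing_fields(dataset, required=("name", "description", "url")):
    """Return the required keys that are absent or empty.

    The 'required' tuple is a hypothetical local policy, not a schema.org rule.
    """
    return [key for key in required if not dataset.get(key)]


# Placeholder values, for illustration only.
example = describe_dataset(
    name="Example imaging study",
    description="Placeholder description of DICOM images with annotations.",
    archive_url="https://example.org/archive/imaging-study",
    download_url="https://example.org/archive/imaging-study/series-001.dcm",
    keywords=["DICOM", "annotations"],
)

print(json.dumps(example, indent=2))
print("missing:", missing_fields(example))
```

The same dict could travel with the data as object-level metadata and later be embedded in the archive's landing page.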
|
damn it @rvguha can't we at least TRY to keep the conversation public , until it doesn't have to be ??? |
@vsoch The "larger community" does not only consist of Google. |
That’s not what I meant :*( |
@akuckartz would you care to support that statement with a description of what the larger community is doing, per your thoughts? |
Thad, Andreas,
I have been wanting to get hold of someone from the team that Vanessa is
part of, for many months, in the context of a different project. I do
sincerely apologize for not making that clear.
Please do go ahead and continue your discussion here about the original
topic.
guha
|
yeah! @thadguidry and @akuckartz I hope we can continue our conversation here, can I tell you how excited I am to be starting work on this project? I understand your concern, and let's keep discussion going here! One small note - please take note I'm about to be hit by a hurricane (east coast) so if there is a bit of delay in my response, I'm probably just away from power or internet. 🚢 ⛵ |
Stay safe!
|
@thadguidry - I appreciate your enthusiasm to collaborate but there's nothing wrong with @rvguha (or anyone else) expressing an interest in directly meeting up with other members of this community, especially given that he's in the Stanford area etc etc. Saying "damn it" in Github comes across way more snarkily than it might in real life amongst people who know each other better. While you might mean it with a smile it's not a great example to set. From https://schema.org/docs/howwework.html
I don't take the use of "damn it" as breaking those rules, but it needlessly nudges things towards a more hostile and critical environment. Anyway, FWIW I'd also be happy to see more discussion of the original ideas here, but since the original post also had several Google mentions, maybe those Google-specific aspects are better explored elsewhere. (As an aside -- I've been working with Guha on this RDF stuff since before Google even existed as a company, I'd hope our commitment to the bigger picture might be clear by now...) |
@vsoch In addition to schema.org there exist several other parallel activities with overlapping (but certainly not identical) requirements and stakeholders. One of them is the W3C Dataset Exchange Working Group (DXWG) which is creating a revised Version of DCAT. See https://w3c.github.io/dxwg/ This is not an "either or" but I suggest that you also look at what DXWG is doing. In Europe DCAT is used frequently by public administrations and in Germany there is a new legal requirement for all branches of the public administration to describe Open Data using a DCAT profile. I suppose that will also have some influence in research communities. And yes, stay safe! |
@akuckartz yes, we have the same "backbone structure" in Schema.org as DCAT is based on. This is thanks to Jim Hendler and his group's advocacy for us to adopt that design some years ago. The basic structure is
... which shares strengths and weaknesses with DCAT, e.g. there is scope for documenting patterns for time series-based collections, etc. The similarity means that if you have basic DCAT it is pretty easy to generate Schema.org Dataset markup, and vice-versa.
I am a member of the W3C DXWG WG representing Google, and liaising to Schema.org. There are some notes from the last f2f meeting that I made, towards using JSON-LD's @context feature to integrate DCAT / Linked Data approaches and schema.org, here: https://docs.google.com/document/d/16c_STDu8Dzj-ioRNuGS2tlIFJamlx0-vRKBaPA5Wzfc/edit
Beyond this high level DCAT / Schema.org/Dataset approach (which is barely a change from classical 1990s Dublin Core), there are lots of other aspects to dataset description opening up, and lots of questions about how different standards plug together in practice, even just looking at W3C stuff like CSVW and Data Cube.
I've recently been spending a lot of time around Fact Checking initiatives, in the context of misinformation and Schema.org's Claim and ClaimReview markup. In that context there are some DXWG discussions on representing caveats and footnotes from statistical data at https://lists.w3.org/Archives/Public/public-dxwg-wg/2017Jul/0041.html which may be of interest here.
There are also efforts like Bioschemas who are starting to crawl this data and elaborate a few schema.org additions that help bridge the cross-domain dataset descriptions with domain-specific identifiers and ontologies. |
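To make the "pretty easy to generate Schema.org Dataset markup" point concrete, here is a minimal sketch of a DCAT-to-schema.org translation. It assumes the DCAT record has already been loaded into a plain dict keyed by prefixed term names (`dct:title`, `dcat:distribution`, and so on); that input shape, the two mapping tables, and the example record are illustrative assumptions, not an official crosswalk.

```python
import json

# Assumed term-to-property mappings; a real crosswalk covers more terms.
DATASET_MAP = {
    "dct:title": "name",
    "dct:description": "description",
    "dcat:keyword": "keywords",
    "dct:license": "license",
}
DISTRIBUTION_MAP = {
    "dcat:downloadURL": "contentUrl",
    "dcat:mediaType": "encodingFormat",
}


def dcat_to_schemaorg(dcat):
    """Translate a simplified DCAT dataset dict into schema.org JSON-LD."""
    dataset = {"@context": "https://schema.org/", "@type": "Dataset"}
    for dcat_key, schema_key in DATASET_MAP.items():
        if dcat_key in dcat:
            dataset[schema_key] = dcat[dcat_key]

    downloads = []
    for dist in dcat.get("dcat:distribution", []):
        # dcat:Distribution lines up with schema.org DataDownload
        download = {"@type": "DataDownload"}
        for dcat_key, schema_key in DISTRIBUTION_MAP.items():
            if dcat_key in dist:
                download[schema_key] = dist[dcat_key]
        downloads.append(download)
    if downloads:
        dataset["distribution"] = downloads
    return dataset


# Placeholder record, for illustration only.
record = {
    "dct:title": "Example open data table",
    "dcat:keyword": ["example", "open data"],
    "dcat:distribution": [
        {
            "dcat:downloadURL": "https://example.org/data.csv",
            "dcat:mediaType": "text/csv",
        }
    ],
}

converted = dcat_to_schemaorg(record)
print(json.dumps(converted, indent=2))
```

Going the other way would mean inverting the two tables, which is why the shared structure makes round-tripping cheap for the basic fields.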
Hi Vanessa,
Good question!
My (Stanford-CDDRL-backed) counter-disinfo project is using an extended
version of schema.org/ClaimReview on the basic level and it would be great
to have Stanford library resources easily integrated.
So please keep me in the loop.
Thank you!
Best regards,
Dmytro Potekhin
Founder & CEO
FakesRadar.org <https://fakesradar.org/>
|
@danbri Noted about use of "damn it". Sorry, bad day at Ericsson. My apologies. Thanks for noting you also would like to see as much discussion as can be had in this issue as well. |
hey everyone this information is really fantastic! What I'm doing is starting my exploration from the point of version control - any schema that we use and then tooling to move the data it describes is going to start with Github (and I'm realizing, Git Annex). So my plan is the following:
which may be of interest here. There are also efforts like Bioschemas who are starting to crawl this data and elaborate a few schema.org additions that help bridge the cross-domain dataset descriptions with domain-specific identifiers and ontologies.
Having templates and entrypoints for, say, a biologist to easily plug into the right schema, and then use it to move data from a local place to Google Cloud (I don't mean to preference a vendor but the Stanford hospitals are heavily invested and this will be a first use case!) and then to the library archive when "live work" is done is my desired goal. I'm also really glad to hear that the various technologies are related - it makes life much easier, but also says good things about the people and communities involved.
Early Goals
So here is what I'm setting out to do first - and of course this will change as I learn more! I think in my first "dummy test" I am going to try and see how far I can get doing the following:
Use Case Driving Goal
Additional Tooling will Mean
I will link this issue in my notes so I remember to post updates along the way! I really like Github issues for this kind of discussion, and to be honest get a little lost in Google Doc comment chains. If it's okay with you, we might keep this issue open to have further discussion in the coming weeks. |
@DmPo interesting- do you have a pointer to the details? Is the newish Claim type of any use for your approach? |
Hi @danbri, I will have a pointer later this week - launching a new site right now. Yes, sameAs is a good idea - it can be used for linking reposts of a fake to the "original" fake. This makes it easier to debunk reposts of the same fake. |
hey everyone! A quick update after some weekend work. I will start with the use case and walk through the steps I've taken so far (and where noted, where I have a question or two). Just as a note I sent this out to a few of you via old school email too :)
For all of the above, I'm incapable of moving forward without having a nice web interface to describe the process (beyond the Github READMEs) and I always want a "specifications" repo for a user to be able to submit a specification output to (for one cohesive specifications repo) so I'm going to work on this next. Once I have this web interface and Container specification list (the discussion I hope to start here) I'd like to submit it properly to schema.org (as an extension?) and then go back to describing my Dockerfile dataset! I am thinking of using Datalad to move things around, notably for the git-annex functionality.
And I'd really like to bring those interested here into working on this tooling under openschemas! Just let me know you are interested and I can add you. I realize that some of the tech / other open science related things don't fit well under bioschemas (where I was originally contributing) so I made this organization, and it falls nicely alongside openbases, which provides templates for doing reproducible things for open science, generally. Full circle, cue Lion King music :)
That's the plan for now! I don't think I need any help or have questions beyond the bolded above. Whatever information you relay to me I can also write clearly into the web interface I'm making, so it will be a good use of time. I'm going to work on that later today - likely taking a quick break to go for a run 🏃♂️ |
Quick update! @ricardoaat has taught me the updated specification format to feed into the web interface (and schema.org) so I'm updating map2model to generate that (see the quick links in this issue), and I've prepared the web interface I was talking about to serve them (a sibling of the beautiful bioschemas.org!). I'll thus:
Note that the repos / site are pretty bare bones, I'll be working heavily on them this week. |
Quick update:
I'm working on testing for the submissions to specifications next (in map2spec) and then I'll finish up the Container* family of new specifications (hopefully this weekend?) and then (finally) try using the definitions to describe the dinosaur Dockerfile dataset ( |
Small distraction - I converted the spec-container fully into a spec-template that is now added as an openbases template. This means adding a new badge for specification builders ("spec") that uses the same red from schema.org (the darker one) as a fun easter egg :) The full template falls within openbases because the user just needs to fork, add some file content, and then build on circle and they get artifacts / ghpages artifacts, and a web interface for their drafts. Same plan mentioned before for next steps (testing then Container definitions!) @rajido also reached out to me today and we are going to talk about labeling the biocontainers, which will be wicked! |
|
Another update. We now have ContainerImage! I'm hoping for discussion / feedback from OCI, and then to generate some kind of official submission for a schema.org extension. Is there a documentation base for how to do that? Discussion with OCI (I hope) will happen here --> opencontainers/image-spec#751 (comment) and that's also where you can find links, if interested. My next steps would be to:
And in there somewhere I'll clean up the docs and write some "hey you can do it too!" material to help researchers with weird datatypes that warrant a specification to contribute. |
Developing / testing / proposing updates, additions, extension, etc. to Schema.org - background reading: |
okay so I'm trying to follow some of the documentation and from what I can tell:
@satra advised me that I shouldn't have new properties so I just removed any new ones (with parent "Container") from the proposed one that I made, but I'm a bit confused about this because the example above definitely has a bunch that are categorized under "Legislation" (and that is the proposal). I had started one here mimicking Bioschemas, but it seems like the suggested thing is to use the app that is provided at this repo. Should I blow up what I've done and start again? As a newcomer I find this entire process and the discussion really confusing, for what it's worth. There is no clear checklist or set of steps for going from scratch to a submission, only long verbose pages across many places, and I'm doing my best but struggling quite a bit. Guidance (specific, stepwise things) would be appreciated! Thank you! |
And just a quick comment, this is a huge issue, from the view of a developer:
We expect collaboration will focus more on schemas and examples than on our supporting software.
If an expert with a schema or domain does not find it easy to contribute, that is a failure state I think. I want to suggest that we can achieve both good software and specifications, and they can live together in harmony. |
I am not sure why you shouldn't suggest new properties and classes. After
all that is the point, right?
And yes, we do need to evolve so that we can start collecting and
redistributing software. Part of the problem is that this git mailing list
is really the mailing list for the appengine code that runs the schema.org
site, and having to bring up a new one is probably not the path of least
resistance for making a contribution.
My personal preference (and I am sure this is just a reflection of my
cruftiness) is a simple English description of the terms and classes you
propose to add. On this list.
The most important part of the proposal is who you think will use it and
why the community will benefit if everyone uses it. Too many proposals are
from 'professional ontologists' (like I used to be), who would like to see
a schema for their favourite topic.
guha
…On Mon, Oct 1, 2018 at 4:45 PM Vanessa Sochat ***@***.***> wrote:
And just a quick comment, this is a huge issue, from the view of a
developer:
We expect collaboration will focus more on schemas and examples than on
our supporting software.
If an expert with a schema or domain does not find it easy to contribute,
that is a failure state I think. I want to suggest that we can achieve both
good software and specifications, and they can live together in harmony.
|
Thank you for this feedback @rvguha ! I had just cloned this repository, and was going to give a crack at creating a Dockerized local template that could be run to produce a (local generation) of the example site. But it sounds like this isn't priority, at least as long as there is some method to share a specification suggestion? In this light, does the template that I created, perhaps with better description of the classes, suffice for discussion? I was aiming to make it easier for contribution - so far the workflow is:
The template generates the static files of the specifications in yaml, and I would want this template to also generate (whatever format of file) is needed for a "real / final / etc." contribution to schema.org (json-ld?). The goal of the steps above is that any domain expert can generate a version controlled, and easy to have discussion over web interface and "final submission" files without knowing anything about software engineering / version control etc. This is a very easy path to contribution, but if it's the case that the templates I've made thus far aren't producing the kind of content that could be discussed, I need to step back and re-work them. If they are in the right direction, then I would suggest that I:
And although it's not priority, is there interest in generating a dockerized local version of the app engine deployment as a (possibly better) alternative to preview / discuss a specification contribution? As the Legislature group did, it looks really sharp. I notice that the template here has a nice switcher at the bottom for looking through different data structures of the specifications, and minimally I'd like to add that to the template that I am working on, because I'm guessing one or more of those structures is the final file(s) to be submitted.
As for rationale - holy cow! It's so badly needed it almost feels silly to restate, but I can briefly comment now. Containers are sort of hugely important for anything and everything! A container is the currency of reproducible deployments, and with technology like Singularity even of scientific compute (because we can run on shared cluster resources). Without a specification that describes the components of images, recipes, and distributions (registries) we are living in a messy universe where the best I can do to find a "tensorflow" container is do a search on Docker Hub and pray that the random container I choose (based on tags or a name) might be what I'm looking for.
The way we solve this problem is by way of getting Google's help, meaning having containers join into Dataset search. In that schema.org specifications are the driver of this, we simply must have these definitions. They must follow the OCI specifications, and move in parallel with them so that we aren't inventing new things and making it harder. This (in my mind) seems like such an easy problem to solve, the slow and hard parts are just putting these pieces together. This is what I'm trying to do - and since I've found this challenging I've been trying to make a template and more programmatic way for (the next person who wants to contribute) to make it easier. Some quick links:
In a nutshell, we need a scaled way to not only provide a list of containers, but better information about them that can be searched. I think this goes beyond the abilities of what each tiny maintainer can do, but a massive search engine like Google can do fairly easily. It would be trivial for the maintainers of:
and other registries to be able to add metadata tags to the container registry pages, and then have a beautiful way for a scientist to not find just a tensorflow container, but the tensorflow container for the purpose that he/she needs! Without this, the universe of containers is just a mess. So much awesome development and tooling will come, and an ability to better compare methods and software when we have this labeling. Please let me know your feedback! I don't want to just talk about these things, I want to make them happen. |
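The "metadata tags on registry pages" idea above usually means embedding a JSON-LD block in the page's HTML, which is how Dataset markup is typically published for crawlers. A minimal sketch of that step follows. It uses the existing schema.org SoftwareApplication type as a stand-in, since the Container types discussed in this thread are proposals; the image name, URL, and keywords are placeholders.

```python
import json


def jsonld_script_tag(data):
    """Render a dict as a JSON-LD <script> element for an HTML page."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so a value cannot close the script element early.
    # "\/" is a legal JSON escape, so parsers still read it as "/".
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Placeholder description of a container listing, for illustration only.
listing = {
    "@context": "https://schema.org/",
    "@type": "SoftwareApplication",
    "name": "example/tensorflow-notebook",
    "description": "Placeholder description of what the container is for.",
    "url": "https://registry.example.org/example/tensorflow-notebook",
    "keywords": "tensorflow, jupyter, gpu",
}

print(jsonld_script_tag(listing))
```

A registry would emit this tag once per container page; what goes into `description` and `keywords` is the part that makes "the tensorflow container for the purpose he/she needs" searchable.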
These investigations sound very interesting but I'm having trouble keeping the various levels of abstraction clear in my head, to be honest. Am I right that we're talking both about the specific tooling for Schema.org collaboration ("I had just cloned this repository, and was going to give a crack at creating a Dockerized local template that could be run") as well as working towards new schema.org schemas to describe things around containerized software, services and datasets?
If that's correct I'd suggest breaking out the Schema.org project tooling aspects into another issue. @RichardWallis has been working to minimize some of our dependency on AppEngine; today's commits bring us closer to a static file-based generation and serving model. For those looking at dockerizing schema.org's current tooling, the Travis-CI config might be useful.
On the vocabulary front there is a grey area in between http://schema.org/Dataset and the more software-oriented types, where container technology is increasingly central. This came up a bunch when I was talking to folk like bioschemas. It would be useful to understand how container descriptions could be integrated into both schema.org-based Dataset description, and also into W3C DXWG WG's DCAT efforts. The structures are very similar. Perhaps @agbeltran can offer some advice, as she has a link in both those efforts. |
Correct! I had only intended to do the second, but I found the first so hard that I decided to try and help as I was doing the second. I'm mostly done with the goals I had outlined for the first, and am very happy to discontinue working on it in favor of the second thing (my original goal).
This sounds great! I am a big fan of continuous integration :) @RichardWallis I won't throw any more wrenches into the mix, I'll stick with the openschemas templates I'm using now because I'm mostly done, but please reach out to me if I can be of help.
Exactly. This is why I created openschemas. Take a look there at the link in the first lines of the opening paragraph, which I wrote some time ago now. It makes this exact point. :)
I don't think containers belong with Datasets, or with Biology related things. A container is definitely not the right place for a dataset (although it interacts with them) and there is definitely no exclusive tie to bioschemas / biology or even a scientific domain. It's an open source technology that is useful for many things, and deserves its own sort of bucket (or minimally shouldn't be forced artificially into a bucket it doesn't belong).
That would be great! Here is the full set of specifications: recipes (e.g., a Dockerfile), images (e.g., the actual container binary, which might have some data but is more likely to be software to interact with data), and then the base of those things, the more abstract Container. https://openschemas.github.io/specifications/ Happy Hacktoberfest everyone!! 🎃 |
Hi @vsoch, I just noticed that OCI's Container Image image-spec annotations already overlap well with our Schema.org/CreativeWork, and this is a great starting point. For datasets that are sometimes provided within Container Images... how is the metadata typically captured for those datasets? What metadata standards are commonly used in scientific realms besides those listed here, if you know? |
@vsoch Just so you know... We have an equivalent Tags convention in Schema.org with our https://schema.org/keywords property. You might also already be aware of this, but... Images and Containers are 2 different things... I was previously talking about Images, where you would get output about metadata of what an image and its possible data might contain with the output of |
ah this is very good! I'll see if I can integrate these things (tags and annotations discussed above) into the container specification(s), and also provide a "Where does it fit?" simple diagram to share here. Likely tomorrow - need to eat some dinner. 🍽️ |
@vsoch no problem. Also we have a "generic" Key:Value system that can be used when there is no Schema.org Property already created yet ... https://schema.org/PropertyValue So right now, in my opinion, I would say Schema.org has 100% of what you need to describe structured data about Containers and Images and their metadata. (The existing properties we have might not be the best fitting, but they can fit and be understandable for most search engines and structured data parsers) If you don't find a property within Schema.org to hold your structured data about Containers and Images and Datasets...then let's talk about those, and we can gladly point to possible candidates. (incidentally, a few weeks ago, I did look at your use case with BIDS data and specifically around how PyBids handles the metadata around it within its functions here: https://github.com/bids-standard/pybids/blob/master/bids/variables/variables.py and I didn't see anything that Schema.org couldn't handle currently in some fashion, again, perhaps not always the best fitting, but it could be expressed with Schema.org's current Types and Properties ) So when metadata doesn't seem to fit well, we can talk about those. |
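The overlap described in the comments above (OCI image annotations against CreativeWork properties, tags as `keywords`, and `PropertyValue` as the fallback) can be sketched as a small translation function. The annotation keys are the pre-defined `org.opencontainers.image.*` names from the OCI image-spec; the choice of which schema.org property each maps to is an assumption made for illustration, not an agreed mapping. Attaching the leftovers via `additionalProperty` is also an assumption, since schema.org formally scopes that property to a limited set of types.

```python
import json

# Assumed mapping from OCI pre-defined annotation keys to CreativeWork properties.
OCI_TO_SCHEMA = {
    "org.opencontainers.image.title": "name",
    "org.opencontainers.image.description": "description",
    "org.opencontainers.image.url": "url",
    "org.opencontainers.image.version": "version",
    "org.opencontainers.image.licenses": "license",
    "org.opencontainers.image.created": "dateCreated",
}


def describe_image(annotations, tags=()):
    """Express image annotations and tags with existing schema.org terms."""
    doc = {"@context": "https://schema.org/", "@type": "CreativeWork"}
    extras = []
    for key, value in sorted(annotations.items()):
        if key in OCI_TO_SCHEMA:
            doc[OCI_TO_SCHEMA[key]] = value
        else:
            # No obvious property: fall back to a generic key/value pair.
            extras.append({"@type": "PropertyValue", "name": key, "value": value})
    if tags:
        doc["keywords"] = ", ".join(tags)
    if extras:
        # Assumption: additionalProperty is used here only as a carrier.
        doc["additionalProperty"] = extras
    return doc


# Placeholder annotations; "org.example.gpu" is a made-up custom key.
image_doc = describe_image(
    {
        "org.opencontainers.image.title": "example-analysis",
        "org.opencontainers.image.version": "1.0.0",
        "org.example.gpu": "required",
    },
    tags=["latest", "1.0.0"],
)

print(json.dumps(image_doc, indent=2))
```

Anything that keeps landing in the fallback list is a natural candidate for the "let's talk about those" discussion of properties that do not fit well.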
Hi all, I'd like to point you to an effort we have been working on that uses Schema.org for packaging data (which is not containers or images). This uses almost 100% pure schema.org to describe data, and seems to be compatible with Google's dataset search. https://github.com/UTS-eResearch/datacrate/tree/master/spec/1.0. See the parts about file provenance using schema:CreateAction. Does any of this help? |
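For readers unfamiliar with the provenance pattern mentioned above, here is a minimal sketch of a schema.org CreateAction that records which tool and person produced a file from which inputs. It uses only standard Action properties (`agent`, `instrument`, `object`, `result`, `endTime`); the file names, tool name, and overall layout are placeholders and are not taken from the DataCrate specification, which should be consulted for its actual conventions.

```python
import json


def create_action(result_file, source_files, tool_name, agent_name, end_time):
    """Describe how a file was produced, using schema.org CreateAction."""
    return {
        "@context": "https://schema.org/",
        "@type": "CreateAction",
        "agent": {"@type": "Person", "name": agent_name},
        "instrument": {"@type": "SoftwareApplication", "name": tool_name},
        # Inputs the action operated on
        "object": [{"@type": "MediaObject", "name": path} for path in source_files],
        # Output the action produced
        "result": {"@type": "MediaObject", "name": result_file},
        "endTime": end_time,
    }


# Placeholder values, for illustration only.
action = create_action(
    result_file="derived/summary.csv",
    source_files=["raw/scan-001.dcm", "raw/scan-002.dcm"],
    tool_name="example-converter",
    agent_name="A. Researcher",
    end_time="2018-10-05T12:00:00Z",
)

print(json.dumps(action, indent=2))
```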
Not certain if you will find this useful. As part of the DASPOS project, we developed a "Computational Environment" ontology design pattern (http://ceur-ws.org/Vol-2043/paper-03.pdf) in collaboration with CERN to capture the provenance of the environment where HEP calculations are performed. As part of the process, we looked at both VMWare and Dockerfile vocabulary to inform the pattern, so that it captures a broad set of the vocabulary. The pattern can be populated via a script with instances from Wikipedia (and Wikidata). The OWL for the pattern is in (https://github.com/Vocamp/computationalEnvironmentODP). There is a matching "Computational Activity" pattern (https://github.com/Vocamp/ComputationalActivity) that captures the provenance around a computational execution and links to the Computational Environment pattern. See the concept map: https://github.com/Vocamp/ComputationalActivity/blob/master/concept-map/computationalActivity.pdf. We didn't get to the (general) patterns to describe the underlying data sets used in a particular computational activity. I had a student work on a proof of concept "smart containers" (http://linkedscience.org/wp-content/uploads/2015/04/paper2.pdf) tool that wrapped the docker command line tool to capture the provenance of the Docker operations using the ODPs and attach them as a label to the docker image. The code for this somewhat functional prototype is in the smartcontainers repo: https://github.com/crcresearch/smartcontainers |
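The "attach them as a label to the docker image" idea can be sketched in a few lines. This is not the smartcontainers implementation; it only shows one way to serialize a metadata dict into the argument list for `docker build --label`, and to read it back. The label key `org.example.provenance`, the image tag, and the metadata content are hypothetical.

```python
import json

LABEL_KEY = "org.example.provenance"  # hypothetical label key


def label_args(metadata, key=LABEL_KEY):
    """Return the --label arguments that carry metadata as compact JSON."""
    value = json.dumps(metadata, separators=(",", ":"), sort_keys=True)
    return ["--label", f"{key}={value}"]


def read_label(arg, key=LABEL_KEY):
    """Recover the metadata dict from a 'key=value' label argument."""
    name, _, value = arg.partition("=")
    if name != key:
        raise ValueError(f"unexpected label key: {name}")
    return json.loads(value)


# Placeholder metadata, for illustration only.
metadata = {"@type": "CreateAction", "instrument": "docker build"}

# Passing a list (not a shell string) to something like subprocess.run
# avoids shell quoting problems with the JSON value. Nothing is executed here.
command = ["docker", "build", *label_args(metadata), "-t", "example/image:latest", "."]
print(command)
```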
hey everyone! I did a reverse of plan - I realized that I needed to bring up discussion for "Where does it fit" before updating the specification, so let's start with that :) I'll start with (texty) discussion here about where I think each component fits into the currently existing schema.org. After discussion here, I'll update the specifications files to reflect what we discuss. I apologize in advance for probably not using terms / descriptors / properties correctly - please feel free to correct where I'm off.
Previous Art
There is an ontology that describes virtualization
This will organize our container universe, and be essential not just for academia, but for industry and all domains it touches.
Representation of containers that is too detailed is actually just as bad (I think) as not having enough representation, period. It might be useful for knowing the version of a kernel if I want to know if I can use a Singularity Container there (for example, Centos 6, no overlayfs, ruhroh) but there are a lot of intricate details that might be useful in only 1% of cases. And having all the extra support for those 1% actually makes the specifications really complex and confusing. So for this first go, I would suggest we try to hit the core needs of the top 80%, and favor simplicity with the mindset that if additional need is there, the community will step in and express it.
The original specifications I had in mind were:
but now I realize there's a bit more to it. Let's talk about this, and I'll address them one by one via questions, explaining my thought process.
Question 1: Where does container fit in?
I started with a very simple question.
Meaning
That is the easy answer, because it fits into existing specifications in schema.org.
But at some point, we are going to care about operating systems, hardware, and hosts. The
And our virtualization then relates to those things:
and now adding containers and hypervisors, they can inherit through this graph
So while I don't think we want a super detailed organization of hardware and virtualization,
Question 2: Where does container recipe fit in?
A container recipe refers to a set of build steps for a container, which is a binary that has an operating system and associated libraries. Examples include Dockerfiles and Singularity recipes. If we want a quick and dirty solution, in that it's a template or script, a
If we consider a container to be a kind of software (it is a binary...) then that fits pretty cleanly. Does anyone else have thoughts about this?
Question 3: Where does container image fit in?
The container image isn't the running instance, but the binary (for Singularity the actual file) that generates the instance. It's weird, yeah. If we consider Container to be the more abstract thing, if it could be the case that there are kinds of containers that don't require images, then it would be a child:
And then the instance I think would coincide with the container runtime, so we would have this:
The OCI has specifications for images and for runtimes, so it is logical to model both. The
Question 4: Where does container distribution fit in?
A container distribution is a container registry (Docker Hub, Singularity Hub, Quay.io, Biocontainers, etc.) Would it be a subtype of a collection?
So basically, it's a collection of
@ptsefton data crate looks really cool and I definitely think it could be useful once we have these definitions (I've added an issue to give it a try along with datalad for my test dataset!) and @charlesvardeman this is also very useful - have you thought about bringing up these descriptors with the OCI maintainers so it can be linked to a container runtime? I would say we would want to model them separately in schema.org (the idea of the Computational Environment) and then say something like
Wait... what about this computational environment?
Actually, this is a very good point, because a computational environment might describe one or more hosts, and this is the kind of thing you would call any sort of cluster. But again I would challenge us to reduce the complexity to a level that can be extended, but doesn't over-complicate based on the goals of having it. In my little framework I'm describing here, based on looking at your chart, it would seem that you are suggesting a
Here is where it gets kind of cool! I find this interesting because a computational environment could moreso refer to a collection of hosts and hardware (e.g., Kubernetes, SLURM / SGE) - meaning multiple SoftwareHosts OR for the humans among us, just a single SoftwareHost. So instead of SoftwareHost being a kind of ComputationalEnvironment, it becomes a link / (properties?) instead:
I like that better :) Let's put this all together to look at in one place, and I'll leave it open for discussion! I want to suggest that I can take charge of creating specifications for review for the
Let's circle back to our original points - first the goals. The above would allow for nice labeling of containers with software and data, for containers served in registries, that then could be indexed by Google and the properties exposed for not just discovery, but for "grid type" analyses to answer questions like "What is the optimal computational environment to run Container X?" Second a mindset of simplicity - for many of these, we can leave them to be very simple / general (like shells) and have the community come in and make contributions for the details. I think it's our job to set up the skeleton / framework, and not to try and get the entire detailed thing perfectly. That was a lot more than 0.02, so I'll say there is my 2 dollar pancake. 🥞 :) |
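One way to reason about "where does it fit" is to write the proposed types down as a parent map and walk it. This is only a sketch: `SoftwareApplication`, `SoftwareSourceCode`, `DataCatalog`, `CreativeWork` and `Thing` are existing schema.org types, while the four container types and every parent assigned to them below are assumptions for illustration, not an agreed hierarchy.

```python
# Existing schema.org types and their real parents.
PARENTS = {
    "CreativeWork": "Thing",
    "SoftwareApplication": "CreativeWork",
    "SoftwareSourceCode": "CreativeWork",
    "DataCatalog": "CreativeWork",
}

# Hypothetical placements for the proposed types (assumptions only).
PARENTS.update({
    "Container": "SoftwareApplication",
    "ContainerImage": "Container",
    "ContainerRecipe": "SoftwareSourceCode",
    "ContainerDistribution": "DataCatalog",
})


def ancestors(type_name):
    """Return the chain of parent types, nearest first."""
    chain = []
    while type_name in PARENTS:
        type_name = PARENTS[type_name]
        chain.append(type_name)
    return chain


for proposed in ("ContainerImage", "ContainerRecipe", "ContainerDistribution"):
    print(proposed, "->", " -> ".join(ancestors(proposed)))
```

Keeping the map this small makes it cheap to try an alternative placement (for example, moving `Container` elsewhere) and see which properties each type would inherit.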
@vsoch I would be happy to help. I’m traveling over the next couple of days and need some time to digest your suggestions and write some comments. One other comment that may give food for thought. We started developing a pattern called ComputationalObservation that was akin to O&M or SOSA observation for a computational result. Here is a brief talk that I gave to ontolog. |
Have a safe trip, and looking forward to your thoughts! If the container is the box of pancake mix, the cabinet and then house are the computational environment and hardware, respectively, this is one level deeper - the computational observation is actually everything that goes into creation of the pancake mix (the algorithm to grind the flour, the kitchen it was done in, the amounts, etc.). I think this is important too, and probably should be represented independently even from the small hierarchy we are discussing. For example, you can easily have a component of a ComputationalObservation without any Hardware or a Container. For an algorithm, well couldn't that even be in my head? 🤔 Anyhoo, have a safe trip and let's talk about all of the above when you have some time! I want to also look more closely at how schema.org is generating the final specifications because given the need for json-ld and similar, I'm now not totally happy with just having yaml. But I don't think there is rush for this development because as @rvguha pointed out, the important first thing to do is have discussion (and we can do that right without any special tools :) ) |
I'm just getting caught up on this thread, and I was reading:
And that sounds a bit like some of the features in the DOAP ontology. Are you aware of that ontology? And is it useful for what you are doing here? I'm suggesting that it might be, and there is already some use of it on PyPI and several other software repositories. Here is the github link for the project: https://github.com/ewilderj/doap . Some years ago there was a paper about DOAP presented at the DCMI meeting: DCMI-Tools: Ontologies for Digital Application Description. |
Thanks @HughP. To all in the discussion - to be clear, my goals right now aren’t to delve into describing software, repos, or projects. Some of the extended discussion above was just a suggestion for how additional (more detailed) descriptions of software or projects could fit into what I’m thinking of. My primary goal is to describe the levels needed up to having a container. Indeed in containers there is software that might be described by these additional ontologies and down the line I would definitely enjoy helping add these to schema.org, but right now they are out of scope for this discussion. |
@vsoch The world moves quickly with regard to Container standardization (and around metadata). I don't want to fill up a book in this comment to you, but I suggest a few other resources for further study, which I did myself last weekend: https://docs.ansible.com/ansible-container/ and https://galaxy.ansible.com/docs/contributing/creating_role.html#role-metadata For us in Ericsson, we are fully embracing Ansible Container along with OCI image-spec. |
I definitely agree that OCI is leading in the space, and this is why I am mirroring the specs. I can't comment on Ericsson, I don't know anything about that company other than maybe a phone company? I am definitely in support of the idea to develop where the community is moving and thriving. The missing component of going directly to that effort THERE is that (and please correct me if I'm wrong) there is no connection between that initiative and plugging into a super-power search engine like Google, for having containers indexed by Google Datasets. You can make the greatest of standards or other but when push comes to shove, if you don't set it up with the right plumbing it's not going to be useful to the graduate student sitting in his dorm room trying to do an efficient search for something. My understanding is that there is no direct feed from the work of OCI into such a global tool, beyond a small strategy of having individual registries providing the result of the effort through their individual APIs. Is this incorrect? I'm rather marginal / not opinionated about which standard to embrace, so I don't have opinion on "the standard that is best" but I do want to choose the deployment infrastructure that can best and most efficiently distribute the search. That (in my mind) looks more like Google than Docker Hub or similar. This is the reason I've taken this route - schema.org is the gateway to that amazing resource. |
Anyway - @charlesvardeman when you get back, I started with a very basic template for Hardware --> https://openschemas.github.io/spec-hardware/. This is the skeleton level of representation that I would want to have for Hardware, and Virtualization, and then I can develop container (and others with expertise in the previously mentioned would work on those). The idea is that you can easily make changes, open a PR to test, and then merge will update the web interface. Then you can ask others for review discussion, and "publication" at https://openschemas.github.io/specifications is just moving a file and then another pull request. And everyone here - this is where I'd want to ask for advice, help, etc. on converting my front matter (yml) to json-ld and "whatever format is needed for schemaorg submission." But as it was pointed out, we should have discussion about the specifications here before that. Back to @charlesvardeman (and others interested!) my understanding is that schema.org can embrace other ontology definitions, so if you would be interested to contribute to these templates, what I have done so far is downloading the tsv files from the Google Sheet mentioned in the README (but there isn't any reason you can't open them, with tab separation, on your computer). Right now, there isn't anything special there - it's just inheriting properties of "Thing." You can make changes to your heart's content, then PR and bring in others for discussion. Anyway - I'm still looking forward to feedback on the above. In the meantime I'll start skeleton templates for the hierarchy I defined above, and can try to figure out how to make json-ld if nobody has a script or example. |
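On the "how to make json-ld" question, here is a minimal sketch of turning a tab-separated property table into a JSON-LD vocabulary fragment, in the `rdfs:Class` / `rdf:Property` shape that schema.org uses for its own published definitions. The column names (`property`, `expected_type`, `description`) and the `cpuArchitecture` row are hypothetical, and whether this is the format a schema.org submission needs is an assumption.

```python
import csv
import io
import json

# Hypothetical TSV content; real column names in the templates may differ.
TSV = "\n".join([
    "property\texpected_type\tdescription",
    "name\tText\tThe name of the item.",
    "cpuArchitecture\tText\tHypothetical property: CPU architecture.",
])

CONTEXT = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "https://schema.org/",
}


def tsv_to_jsonld(tsv_text, type_name, parent="Thing"):
    """Convert a TSV property table into a JSON-LD class plus properties."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    graph = [{
        "@id": f"schema:{type_name}",
        "@type": "rdfs:Class",
        "rdfs:label": type_name,
        "rdfs:subClassOf": {"@id": f"schema:{parent}"},
    }]
    for row in rows:
        graph.append({
            "@id": f"schema:{row['property']}",
            "@type": "rdf:Property",
            "rdfs:label": row["property"],
            "rdfs:comment": row["description"],
            "schema:domainIncludes": {"@id": f"schema:{type_name}"},
            "schema:rangeIncludes": {"@id": f"schema:{row['expected_type']}"},
        })
    return {"@context": CONTEXT, "@graph": graph}


print(json.dumps(tsv_to_jsonld(TSV, "Hardware"), indent=2))
```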
and I want to ask @rvguha how can the work that @thadguidry linked at opencontainers be linked to Google Dataset search? If we can do that, and the development / community is more active there, it might be best to pursue that connection instead. |
And @thadguidry I think we need to figure out how to work together. The expertise for containers indeed comes from opencontainers, but the expertise for everything else that might be modeled (e.g., look at the list of things in schema.org --> https://schema.org/docs/full.html) comes from there. We must be able to have the work that is being done by opencontainers represented in that larger graph, otherwise it's a limited view of a small domain with no understanding of how it fits into a big picture. |
I am wondering if Hardware and Container are too broad - this is likely to overlap with more general uses of the terms. Maybe ComputationalContainer and ComputationalHardware?
--
Peter Sefton +61410326955 pt@ptsefton.com http://ptsefton.com
Gmail, Twitter & Skype name: ptsefton
|
@vsoch If data is not in Container Images...then I don't see why Google Datasets would bother indexing them in some fashion? But if Google Datasets is open to the idea that Datasets could be found in ANY format or package as so happens A LOT in Science domains (including Container Images as a package format)... then that would be @rvguha to talk to...not me or Ericsson my employer. I can only speak to what technologies we at Ericsson internally use to see what data/software might be lurking inside Container Images. You are on the right track Vanessa to solving your discoverability problems. NOTE: Careful with just the casual use of the term "Container" (which are ephemeral) versus the more appropriate term for your use case of "Container Image". |
Definitely something I am aware of! Albeit there is a lot more I'm not aware of, I'm doing my best :) I'll let @rvguha chime in on how the two worlds can work together - it would be perfect if the experts in a domain can easily "plug in" their work to the larger graph. |
I can speak specifically for the Google Dataset Search work if you have questions relevant to Schema.org discussions, but please do remember this forum is not really about Google products. At Google we have a wide variety of data-related activities going on (as you'd expect), with Schema.org as a common element in several. As far as Google Dataset Search's definition of "Dataset", it is based on schema.org/Dataset and the definitions are intentionally inclusive (we gave several examples but also say "Anything that looks like a dataset to you"). As far as all that relates to Schema.org and containers, I'll just repeat my observation above that software and data and containers are intimately inter-related in practice but there's little to be gained from trying to define exactly where the one stops and the other begins. |
I agree with Dan on those points. |
Yes, but that doesn't mean that Container == software, or that Container == dataset. They are very different. It's like saying that Refrigerator == food, or Refrigerator == Mom's leftovers. Mom's leftovers and the food are found in the refrigerator, but both can exist without the Refrigerator (and vice versa). Datasets are entirely different beasts from containers, and to clump a container under the "Dataset" tag under the guise that "Containers interact with datasets" is far too abstract.
There is absolutely everything to gain, because the Container must work and be optimized for a computational environment. As soon as you remove the idea of a "Container" from the hierarchy then you are forced to deal with groupings of software (many), environments (many), and metadata (also many) and you lose the vehicle to easily group those things (the container). So yes, the relationship is strong between the groups, but schema.org needs something that looks like a Container. I really like the idea of being able to connect OCI with schema.org, so let's talk about that. |
my impression here is that the objective of the proposal is to provide a 'meta-data-means' to define Docker and Docker parts. A related analogy is a meta-data-means to define the components of a Blockchain. We have some experience with this and have pushed the specification of an item in a Docker container and a Blockchain to the lowest possible level. Then we use conventional mereology (hasPart, isPartOf and more specific mereology relationships) to define the container. Several simple reasoning applications demonstrate that this approach works. Bottom line: success is based on the quality of the |
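The mereology approach mentioned here can be sketched with nothing more than nested `hasPart` links and a small traversal. The sketch assumes a container image is described as a `SoftwareApplication` whose parts are other creative works; the names are made up, and this is not the commenter's reasoning application.

```python
def all_parts(node):
    """Yield the name of every part reachable through nested hasPart links."""
    for part in node.get("hasPart", []):
        yield part["name"]
        yield from all_parts(part)


# Hypothetical description of an image and what it contains.
image = {
    "@type": "SoftwareApplication",
    "name": "example-image",
    "hasPart": [
        {
            "@type": "SoftwareApplication",
            "name": "python",
            "hasPart": [{"@type": "SoftwareSourceCode", "name": "numpy"}],
        },
        {"@type": "Dataset", "name": "reference-atlas"},
    ],
}

print(list(all_parts(image)))
```

Because `hasPart` is treated as transitive here, a question like "does this image contain numpy?" becomes a simple membership check over the traversal.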
@thadguidry with respect to opencontainers, I did open an issue on that exact repo over 2 weeks ago, but nobody from the community there responded. opencontainers/image-spec#751. |
hey everyone, I was able to have good conversation with OCI, and minimally was able to come up with a quasi reasonable proposal (see summary and links from link above). The next step was to meet with @rvguha, which was supposed to be Friday, but now it's moved to sometime in January. This is out of my hands at this point, and to be honest I'm feeling down about it. We are going to be over 4 months by the time anything can even be talked about, and I find this really depressing. Thanks again for the good discussion here, and have a good holiday season. |
I decided waiting for a meeting was not going to accomplish much very quickly. There is much I can do before that. A verbose writeup is here: https://vsoch.github.io/2018/schemaorg/ TLDR: I created a schemaorg python module that makes it possible for developers to interact with the specifications, and spit out templates that can be put in webby places (and then parsed by the Google bots). This is different than the python you have in this repository that serves the schema.org website. An example of the output is here and view the source to see the embedded json-ld. More examples are served at the repository, and note that the extraction of metadata can go much further than what I did here. We can do some really cool stuff to parse the files list from container-diff, for example. If you want to share the above, here is a link: https://twitter.com/vsoch/status/1059611828765052928 |
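For readers who want to see what "templates with embedded json-ld" means in practice, here is a minimal standard-library sketch that wraps a metadata record in a `<script type="application/ld+json">` tag. It does not use the schemaorg module's API; the function name and the example record are assumptions for illustration.

```python
import html
import json


def render_page(metadata, title):
    """Return a bare HTML page with the metadata embedded as JSON-LD."""
    # Escape "</" so the payload cannot close the script element early;
    # "\/" is a valid JSON escape, so the data still parses unchanged.
    payload = json.dumps(metadata, indent=2).replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"<title>{html.escape(title)}</title>\n"
        '<script type="application/ld+json">\n'
        f"{payload}\n"
        "</script>\n</head>\n<body></body>\n</html>\n"
    )


# Hypothetical record describing a container recipe.
metadata = {
    "@context": "https://schema.org",
    "@type": "SoftwareSourceCode",
    "name": "example-recipe",
    "programmingLanguage": "Dockerfile",
}

page = render_page(metadata, "example-recipe")
print(page)
```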
FYI, Research Site is now being proposed on Wikidata with good examples and discussion. |
Oh cool! @thadguidry where do wikidata definitions get used? It looks like you can add entries for datasets, is that correct? And of course it's open to all, wiki style :) I found Categorization and the (rest in peace) browser but I'm still not totally sure what is going on. @thadguidry can you point me in the right direction maybe, or give some quick rundown of how it works? |
@vsoch Wikidata self-training :) is available by going to the main site https://www.wikidata.org/wiki/Wikidata:Main_Page and you'll see there's Project chat and IRC chat as well as a mailing list. This will take you a while to get acquainted with Wikidata and its mission/purpose/usage. I.E. don't use this issue for Q/A of Wikidata. :-) Thanks! See you on the "other side". |
See issue #7 for the context of the move from the main Schema.org issue tracker to this repository. |
hey schema.org team!
We are putting together an organizational and data movement strategy for research computing and the library at Stanford, and I wanted to ask how the schemas fit into the domain of research data. I will describe the ideas we are discussing first to give you some context.
As it moves around, the organizational schema will help to guide interaction. It will help with validation and query and integration with tools built around it. For many of these organizations, they will come from the research domains themselves. For example, the brain imaging data structure BIDS is already being widely used across the neuroimaging community and software.
For the definition of the organization and data, I'm wondering how schema.org can fit in. I saw that Natasha (previously at Stanford!) at Google for Google Datasets (see this article) mentioned schema.org, and it definitely seems relevant for web page content and making it searchable. In that we want our strategy to be easy to sync with what the larger community is doing, I wanted to ask about research data? How can we work together and leverage the resources here so that our datasets can eventually integrate too into tools provided by schema.org, Google Datasets, and be useful for searching for our researchers after archive? How can we contribute templates and other tooling here to help toward this? Thanks for your help!