Initial version of provenance schema. #583
If "environment" and "parameters" were made optional, and if "software" were split out into a "software" string and a "version" string (both at the top level), it would match what SLiM already produces, which would be nice. The point is not so much saving work on my end (changing this stuff is trivial) as allowing SLiM 3.0 files to be schema-compliant, and not changing the format between versions unnecessarily (which just causes confusion all around). If there's a compelling reason driving those changes, then that's fine and we can just say SLiM 3.0 files are non-compliant, and so it goes; but offhand I don't see any particularly compelling reasons for those alterations, so maybe we can preserve backward compatibility? The other thing I would note is that requiring "parameters" to be a key-value dictionary (which I take it is what is meant in JSON by "object"?) is perhaps unnecessarily restrictive and undesirable. Right now SLiM generates an array, instead, for "parameters", composed of all of the command-line parameters that were supplied (straight from argv/argc), and that feels rather natural and simple. Some command-line parameters take no argument (so they would have a key with no value), others might allow more than one argument (multiple values per key). It seems like it is more flexible and natural to just let the generating program represent its parameters as it sees fit. Is there a need to impose the key-value structure as a requirement here?
Thanks for the feedback @bhaller, comments below.
I don't think we should worry too much about backward compatibility here, to be honest. Tskit isn't going to be checking these provenance strings for compliance, and I have no plans for adding APIs to do anything with the values at the moment. Users never see these provenance values right now, so there's nothing to be confused by there. When we do add APIs, they'll have to be reasonably robust to provenance information that doesn't conform (i.e., maybe just print out the string as-is rather than nicely formatting it, say). To me this exercise is about avoiding these issues in the future when we do start using the values. So, I would rather get this right and try to get a nice, clean and extensible schema. Old msprime files will also have provenance information that isn't compliant, btw.
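To illustrate the "print out the string as-is" fallback mentioned above, here is a minimal sketch. The function name `describe_provenance` is hypothetical and is not part of tskit or msprime; it only assumes that a conforming record has a "software" object with "name" and "version" keys.

```python
import json


def describe_provenance(record_text):
    """Summarise a provenance string, or return it unchanged if it does not conform.

    Hypothetical helper for illustration only; not an existing tskit API.
    """
    try:
        record = json.loads(record_text)
        software = record["software"]
        return f"{software['name']} {software['version']}"
    except (ValueError, KeyError, TypeError):
        # Not JSON, or JSON that lacks the expected structure
        # (for example an older record where "software" is a plain string):
        # fall back to showing the raw string.
        return record_text
```

With this approach an older record such as one with a top-level "software" string is still displayed, just without any nice formatting.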
That's a good point. I think for recording CLI parameters the right way to do it would be something like:
...
"parameters": {"command": ["-X", "1", "--something"]}
...
Having parameters as a mapping gives the flexibility to cover both CLI invocations as well as API calls. If we use a list, then it's quite limiting, as API calls would have to be either a list of (parameter, value) pairs or just a straight list of positional parameters.
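A rough sketch of how one mapping can cover both cases. The helper names and the layout chosen for API calls (function name under "command", keyword arguments alongside it) are assumptions for illustration; the schema itself only asks for an object.

```python
import json


def cli_parameters(argv):
    # CLI invocation: keep the raw argument list under "command",
    # so flags without values and flags with several values need no special handling.
    return {"command": list(argv)}


def api_parameters(function_name, **kwargs):
    # API call: record each keyword argument by name.
    # Storing the function name under "command" is an assumed convention.
    return {"command": function_name, **kwargs}


cli = cli_parameters(["-X", "1", "--something"])
api = api_parameters("simulate", sample_size=10, random_seed=1)
print(json.dumps({"parameters": cli}))
print(json.dumps({"parameters": api}, sort_keys=True))
```

Both results are plain JSON objects, so a reader can treat "parameters" uniformly without knowing how the software was invoked.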
OK, that all makes sense. I'm on board with this proposal. :->
I've added some documentation and I think this is ready to go now. I'll merge it in a couple of days if I don't hear any objections.
I notice you're writing out a "schema_version" entry but no "schema" entry, whereas previous versions of this that I saw from you had a "schema" tag with value "tskit-provenance". Did this get removed deliberately, or should it be added back in?
I took it out because it doesn't seem to be the done thing in JSON schema. Best practices indicate that you should include a $id URI to uniquely identify the schema (in the definition), but there's no guidance on making the actual documents self-identifying. Since JSON schema isn't worrying about it and this is such a limited and specialised application, I figured it was overboard to include the schema details in the actual documents. I left out the $id field in the schema definition because I'm not sure what it should be and can't be bothered setting up a stable URL where it could be hosted. This can always be added in later if we change our minds. Adding a version number did seem prudent though, since this schema is likely to evolve over time and it'll be good to have SemVer semantics so that clients can deal with this. It'd be good to hear your thoughts on this (if you have any!)...
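As a sketch of how a client might use SemVer semantics on "schema_version": the policy below (accept any record whose major version matches the one the client was written for) is an assumption, as are the names `SUPPORTED_MAJOR` and `can_read`.

```python
SUPPORTED_MAJOR = 1  # assumed: the major schema version this client understands


def parse_version(text):
    # Expect exactly "MAJOR.MINOR.PATCH" with integer parts;
    # anything else raises ValueError.
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def can_read(record):
    """Return True if the record's schema_version has a compatible major version."""
    version = record.get("schema_version")
    if not isinstance(version, str):
        return False
    try:
        major, _minor, _patch = parse_version(version)
    except ValueError:
        return False
    return major == SUPPORTED_MAJOR
```

Under this policy a record at "1.2.3" is accepted and one at "2.0.0" is rejected, which relies on minor and patch releases only adding things that older clients can ignore.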
OK, if this is standard practice then no worries. Re: "stable URL where it could be hosted", I have a stable domain (benhaller.com) where I'm happy to host (not-huge) stuff if you ever want that. We already host the downloads for SLiM there.
Here is a sample SLiM provenance table now (@petrelharp):
The parameters under "command" are just an empty array because this was run in SLiMgui. I believe it follows the schema. SLiM-specific cruft has been put under a "slim" top-level tag. This will require an update to pyslim, presumably. Note that file_version was 0.1 for SLiM 3.0 (and was a top-level tag), and now it is 0.2 (and is under "slim"). The previous post-3.0 file format also called itself 0.2, but I think we should just pretend that it never existed, as it was only a thing briefly in GitHub, never in release.
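For a reader that has to cope with both layouts described above, a lookup along these lines might do. This is only a sketch: `slim_file_version` is a hypothetical name, not a pyslim function, and it assumes the record has already been parsed into a dict.

```python
def slim_file_version(record):
    """Return SLiM's file_version from either layout, or None if absent.

    Hypothetical helper; checks the newer location first.
    """
    slim_section = record.get("slim")
    if isinstance(slim_section, dict) and "file_version" in slim_section:
        # Newer layout: file_version lives under the "slim" top-level tag.
        return slim_section["file_version"]
    # Older layout (SLiM 3.0): file_version is a top-level tag.
    return record.get("file_version")
```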
Thanks for the offer @bhaller --- it's not so much the web hosting, it just feels like the long-term stable URI for a schema like this should be something associated with the project. Something like Having said that, tskit.org is free. Maybe I should just register it...
The SLiM provenance looks great @bhaller. I have a minor suggestion but feel free to ignore:
{
"environment": {
"os": {
"machine": "x86_64",
"node": "darwin-447.local",
"release": "17.6.0",
"system": "Darwin",
"version": "Darwin Kernel Version 17.6.0: Tue May 8 15:22:16 PDT 2018; root:xnu-4570.61.1~1/RELEASE_X86_64"
}
},
"parameters": {
"command": [],
"model": "initialize() {\n\tinitializeTreeSeq();\n\tinitializeMutationRate(1e-7);\n\tinitializeMutationType(\"m1\", 0.5, \"f\", 0.0);\n\tinitializeGenomicElementType(\"g1\", m1, 1.0);\n\tinitializeGenomicElement(g1, 0, 99999);\n\tinitializeRecombinationRate(1e-8);\n}\n1 {\n\tsim.addSubpop(\"p1\", 500);\n}\n2000 late() { sim.treeSeqOutput(\"~/Desktop/junk.trees\"); }\n",
"model_type": "WF",
"seed": 1722162964723
},
"schema_version": "1.0.0",
"slim": {
"file_version": "0.2",
"remembered_node_count": 0,
"generation": 2000
},
"software": {
"name": "SLiM",
"version": "3.0"
}
}
The idea is that
I like your proposed shift, I'll do that this morning. Thanks for the feedback.
The documentation is up here --- should explain the rationale for things.
Great, I'll link to that in my documentation, thanks. @petrelharp, SLiM's file_version 0.2 provenance format is now as Jerome proposed immediately above. Done and committed.
Excellent. Probably best to link to the stable version: https://msprime.readthedocs.io/en/stable/provenance.html Doesn't exist right now, but will later today hopefully.
First pass at defining the provenance schema using JSON schema. Closes #556
The purpose of provenance is reproducibility: given a tree sequence that has this provenance information associated with it, you should be able to reproduce it. This won't always be possible for various reasons (parameters too big to encode as JSON, referencing files that don't exist, numerical precision of parameters, ...), but this is the basic spirit of the idea.
UPDATED to include schema_version
Here's the current schema:
This is what we end up with from msprime:
This is slightly different to what we had before, but in the same spirit. The basic idea is that we have three things we need to document for provenance: the software used, the parameters provided to this software and the environment within which this software was run.
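As an illustration of recording those three things, here is a minimal sketch that assembles a record in Python. The function name `build_provenance` is hypothetical, and the "os" layout simply mirrors the field names in the SLiM sample earlier in this thread, filled from the standard library's `platform.uname()`.

```python
import json
import platform


def build_provenance(name, version, parameters):
    """Assemble a provenance record: software, parameters and environment.

    Hypothetical helper; the "os" layout is an assumption based on the
    SLiM sample in this thread.
    """
    uname = platform.uname()
    return {
        "schema_version": "1.0.0",
        "software": {"name": name, "version": version},
        "parameters": parameters,
        "environment": {
            "os": {
                "system": uname.system,
                "node": uname.node,
                "release": uname.release,
                "version": uname.version,
                "machine": uname.machine,
            }
        },
    }


record = build_provenance("example-tool", "0.0.1", {"command": ["-X", "1"]})
print(json.dumps(record, indent=2, sort_keys=True))
```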
The schema is pretty light-touch, as there's no point in over-specifying this stuff. The minimal compliant instance is:
Basically, you have to have these three keys, and you have to give non-empty strings for the name and version keys, and specify the schema_version. You can put anything extra in there that you want.
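The rules in the paragraph above can be sketched as a hand-written check. This is not the JSON schema itself, just an approximation under stated assumptions: it only tests that "parameters" and "environment" are present (not what they contain), and the name `is_compliant` is hypothetical.

```python
def is_compliant(record):
    """Approximate the minimal requirements described above (a sketch, not the schema)."""
    if not isinstance(record, dict):
        return False
    required = ("schema_version", "software", "parameters", "environment")
    if any(key not in record for key in required):
        return False
    software = record["software"]
    if not isinstance(software, dict):
        return False
    # "name" and "version" must be non-empty strings.
    for key in ("name", "version"):
        value = software.get(key)
        if not isinstance(value, str) or not value:
            return False
    # schema_version must be specified; assumed here to be a non-empty string.
    schema_version = record["schema_version"]
    return isinstance(schema_version, str) and bool(schema_version)
```

Extra keys (such as a tool-specific top-level section) do not affect the result, in keeping with the light-touch intent.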
TODO
Pinging @petrelharp, @molpopgen and @bhaller for comments!