Initial version of provenance schema. #583

Merged Aug 24, 2018 (3 commits)

jeromekelleher
Member

@jeromekelleher jeromekelleher commented Aug 21, 2018

First pass at defining the provenance schema using JSON schema. Closes #556

The purpose of provenance is reproducibility: given a tree sequence that has this provenance information associated with it, you should be able to reproduce it. This won't always be possible for various reasons (parameters too big to encode as JSON, referencing files that don't exist, numerical precision of parameters, ...), but this is the basic spirit of the idea.

UPDATED to include schema_version

Here's the current schema:

{
  "schema": "http://json-schema.org/draft-07/schema#",
  "version": "1.0.0",
  "title": "tskit provenance",
  "description": "The combination of software, parameters and environment that produced a tree sequence",
  "type": "object",
  "required": ["schema_version", "software", "parameters", "environment"],
  "properties": {
    "schema_version": {
      "description": "The version of this schema used.",
      "type": "string",
      "minLength": 1
    },
    "software": {
      "description": "The primary software used to produce the tree sequence.",
      "type": "object",
      "required": ["name", "version"],
      "properties": {
        "name": {
          "description": "The name of the primary software.",
          "type": "string",
          "minLength": 1
        },
        "version": {
          "description": "The version of primary software.",
          "type": "string",
          "minLength": 1
        }
      }
    },
    "parameters": {
      "description": "The parameters used to produce the tree sequence.",
      "type": "object"
    },
    "environment": {
      "description": "The computational environment within which the primary software ran.",
      "type": "object",
      "properties": {
        "os": {
          "description": "Operating system.",
          "type": "object"
        },
        "libraries": {
          "description": "Details of libraries the primary software linked against.",
          "type": "object"
        }
      }
    }
  }
} 

This is what we end up with from msprime:

{
  "schema_version": "1.0.0",
  "environment": {
    "libraries": {
      "gsl": {
        "version": "2.1"
      },
      "kastore": {
        "version": "0.1.0"
      }
    },
    "python": {
      "version": "3.5.2",
      "implementation": "CPython"
    },
    "os": {
      "system": "Linux",
      "node": "powderfinger",
      "release": "4.15.0-29-generic",
      "version": "#31~16.04.1-Ubuntu SMP Wed Jul 18 08:54:04 UTC 2018",
      "machine": "x86_64"
    }
  },
  "parameters": {
    "sample_size": 5,
    "command": "simulate",
    "TODO": "add other simulation parameters"
  },
  "software": {
    "name": "msprime",
    "version": "0.6.1.dev123+ga252341.d20180820"
  }
}

This is slightly different to what we had before, but in the same spirit. The basic idea is that we have three things we need to document for provenance: the software used, the parameters provided to this software and the environment within which this software was run.
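As a rough illustration of how a tool might assemble these three pieces, here is a minimal Python sketch. It is not msprime's actual code: `build_provenance` and the `my-simulator` name are hypothetical, and the `libraries` section is left out because it depends on what the software links against. The `os` keys mirror the fields of the standard library's `platform.uname()`.

```python
import json
import platform


def build_provenance(name, version, parameters):
    """Assemble a provenance document (hypothetical helper, not msprime's API)."""
    uname = platform.uname()
    return {
        "schema_version": "1.0.0",
        # The software used.
        "software": {"name": name, "version": version},
        # The parameters provided to this software.
        "parameters": dict(parameters),
        # The environment within which the software was run.
        "environment": {
            "os": {
                "system": uname.system,
                "node": uname.node,
                "release": uname.release,
                "version": uname.version,
                "machine": uname.machine,
            },
            "python": {
                "implementation": platform.python_implementation(),
                "version": platform.python_version(),
            },
        },
    }


# Provenance is stored as a JSON string, so serialise the result.
record = json.dumps(
    build_provenance("my-simulator", "0.1.0", {"command": "simulate", "sample_size": 5}),
    sort_keys=True,
)
```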

The schema is pretty light-touch, as there's no point in over-specifying this stuff. The minimal compliant instance is:

    {
        "schema_version": "1",
        "software": {
            "name": "x",
            "version": "x"
        },
        "environment": {},
        "parameters": {}
    }

Basically, you have to have these three keys, and you have to give non-empty strings for the name and version keys, and specify the schema_version. You can put anything extra in there that you want.
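To make those rules concrete, here is a hand-written Python check using only the standard library. It is a sketch that approximates the required-key and non-empty-string rules described above; it is not a JSON Schema validator, and `check_provenance` is a hypothetical name. In practice a JSON Schema library would normally do this job against the schema itself.

```python
import json


def check_provenance(doc):
    """Return a list of problems; an empty list means the basic rules hold."""
    if not isinstance(doc, dict):
        return ["document must be a JSON object"]
    errors = []
    for key in ("schema_version", "software", "parameters", "environment"):
        if key not in doc:
            errors.append(f"missing required key: {key}")
    if "schema_version" in doc:
        version = doc["schema_version"]
        if not (isinstance(version, str) and version):
            errors.append("schema_version must be a non-empty string")
    for key in ("parameters", "environment"):
        if key in doc and not isinstance(doc[key], dict):
            errors.append(f"{key} must be an object")
    if "software" in doc:
        software = doc["software"]
        if not isinstance(software, dict):
            errors.append("software must be an object")
        else:
            for key in ("name", "version"):
                value = software.get(key)
                if not (isinstance(value, str) and value):
                    errors.append(f"software.{key} must be a non-empty string")
    return errors


minimal = json.loads(
    '{"schema_version": "1", "software": {"name": "x", "version": "x"},'
    ' "environment": {}, "parameters": {}}'
)
problems = check_provenance(minimal)  # no problems expected for the minimal instance
```

Extra keys are deliberately ignored, matching the "anything extra" allowance.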

TODO

  • High-level documentation explaining what this is for.

Pinging @petrelharp, @molpopgen and @bhaller for comments!

@bhaller

bhaller commented Aug 21, 2018

If "environment" and "parameters" were made optional, and if "software" were split out into a "software" string and a "version" string (both at the top level), it would match what SLiM already produces, which would be nice. The point is not so much saving work on my end (changing this stuff is trivial) as allowing SLiM 3.0 files to be schema-compliant, and not changing the format between versions unnecessarily (which just causes confusion all around). If there's a compelling reason driving those changes, then that's fine and we can just say SLiM 3.0 files are non-compliant, and so it goes; but offhand I don't see any particularly compelling reasons for those alterations, so maybe we can preserve backward compatibility?

The other thing I would note is that requiring "parameters" to be a key-value dictionary (which I take it is what is meant in JSON by "object"?) is perhaps unnecessarily restrictive and undesirable. Right now SLiM generates an array, instead, for "parameters", composed of all of the command-line parameters that were supplied (straight from argv/argc), and that feels rather natural and simple. Some command-line parameters take no argument (so they would have a key with no value), others might allow more than one argument (multiple values per key). It seems like it is more flexible and natural to just let the generating program represent its parameters as it sees fit. Is there a need to impose the key-value structure as a requirement here?

@jeromekelleher
Member Author

Thanks for the feedback @bhaller, comments below.

> If "environment" and "parameters" were made optional, and if "software" were split out into a "software" string and a "version" string (both at the top level), it would match what SLiM already produces, which would be nice. The point is not so much saving work on my end (changing this stuff is trivial) as allowing SLiM 3.0 files to be schema-compliant, and not changing the format between versions unnecessarily (which just causes confusion all around). If there's a compelling reason driving those changes, then that's fine and we can just say SLiM 3.0 files are non-compliant, and so it goes; but offhand I don't see any particularly compelling reasons for those alterations, so maybe we can preserve backward compatibility?

I don't think we should worry too much about backward compatibility here to be honest. Tskit isn't going to be checking these provenance strings for compliance, and I have no plans for adding APIs to do anything with the values at the moment. Users never see these provenance values right now, so there's nothing to be confused by there. When we do add APIs, they'll have to be reasonably robust to provenance information that doesn't conform (i.e., maybe just print out the string as-is rather than nicely formatting, say). To me this exercise is about avoiding these issues in the future when we do start using the values. So, I would rather get this right and try to get a nice, clean and extensible schema.
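The "robust to non-conforming provenance" idea could look something like the following Python sketch. No such API exists in tskit at the time of this discussion; `describe_provenance` is a hypothetical name, and the fallback rule (return the raw string whenever it is not a JSON object) is an assumption.

```python
import json


def describe_provenance(record):
    """Pretty-print a provenance record, falling back to the raw string."""
    try:
        doc = json.loads(record)
    except (TypeError, ValueError):
        # Not JSON at all: show it as-is rather than failing.
        return record
    if not isinstance(doc, dict):
        # Valid JSON but not an object, so not something we know how to format.
        return record
    return json.dumps(doc, indent=2, sort_keys=True)
```

Older msprime or SLiM records would then still be displayed, just without the nicer formatting.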

Old msprime files will also have provenance information that isn't compliant, btw.

> The other thing I would note is that requiring "parameters" to be a key-value dictionary (which I take it is what is meant in JSON by "object"?) is perhaps unnecessarily restrictive and undesirable. Right now SLiM generates an array, instead, for "parameters", composed of all of the command-line parameters that were supplied (straight from argv/argc), and that feels rather natural and simple. Some command-line parameters take no argument (so they would have a key with no value), others might allow more than one argument (multiple values per key). It seems like it is more flexible and natural to just let the generating program represent its parameters as it sees fit. Is there a need to impose the key-value structure as a requirement here?

That's a good point. I think for recording CLI parameters the right way to do it would be something like:

...
   "parameters": {"command": ["-X", "1", "--something"]}
...

Having parameters as a mapping gives the flexibility to cover both CLI invocations as well as API calls. If we use a list, then it's quite limiting, as API calls would have to be either a list of (parameter, value) pairs or just a straight list of positional parameters.
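A small Python sketch of how one mapping can cover both cases. The helper names `cli_parameters` and `api_parameters` are hypothetical; the output shapes follow the `"command"` examples in this thread.

```python
def cli_parameters(argv):
    """Record a command-line invocation: the raw argument list under "command"."""
    return {"command": list(argv)}


def api_parameters(command, **kwargs):
    """Record an API call: the function name plus its keyword arguments."""
    params = {"command": command}
    params.update(kwargs)
    return params


cli = cli_parameters(["-X", "1", "--something"])
api = api_parameters("simulate", sample_size=5)
```

Flags without values and options with several values need no special handling on the CLI side, because the argument list is stored untouched.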

@bhaller

bhaller commented Aug 21, 2018

OK, that all makes sense. I'm on board with this proposal. :->

@jeromekelleher jeromekelleher force-pushed the provenance-schema branch 2 times, most recently from be69d8b to 6230b56 Compare August 22, 2018 12:40
@jeromekelleher
Member Author

I've added some documentation and I think this is ready to go now. I'll merge it in a couple of days if I don't hear any objections.

@bhaller

bhaller commented Aug 23, 2018

I notice you're writing out a "schema_version" entry but no "schema" entry, whereas previous versions of this that I saw from you had a "schema" tag with value "tskit-provenance". Did this get removed deliberately, or should it be added back in?

@jeromekelleher
Member Author

> I notice you're writing out a "schema_version" entry but no "schema" entry, whereas previous versions of this that I saw from you had a "schema" tag with value "tskit-provenance". Did this get removed deliberately, or should it be added back in?

I took it out because it doesn't seem to be the done thing in JSON schema. Best practices indicate that you should include a $id URI to uniquely identify the schema (in the definition), but there's no guidance on making the actual documents self-identifying. Since JSON schema isn't worrying about it and this is such a limited and specialised application, I figured it was overboard to include the schema details in the actual documents.

I left out the $id field in the schema definition because I'm not sure what it should be and can't be bothered setting up a stable URL where it could be hosted. This can always be added in later if we change our minds.

Adding a version number did seem prudent though, since this schema is likely to evolve over time and it'll be good to have SemVer semantics so that clients can deal with this.
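One way a client might use the version is to accept any document whose major version it understands. The Python sketch below is an assumption about how that could work, not an agreed policy: `parse_schema_version` and `is_supported` are hypothetical names, only plain numeric versions are handled (no pre-release tags), and missing components are treated as zero so that a value like "1" still parses.

```python
def parse_schema_version(text):
    """Split "MAJOR.MINOR.PATCH" into a tuple of ints, padding with zeros."""
    numbers = [int(part) for part in text.split(".")[:3]]
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def is_supported(doc, supported_major=1):
    """True if the document's major schema version matches the one we support."""
    try:
        major, _, _ = parse_schema_version(doc["schema_version"])
    except (KeyError, ValueError, AttributeError, TypeError):
        # Missing or malformed version: treat the document as unsupported.
        return False
    return major == supported_major
```

Under SemVer, minor and patch bumps are backward compatible, so only the major component is compared.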

It'd be good to hear your thoughts on this (if you have any!)...

@bhaller

bhaller commented Aug 23, 2018

OK, if this is standard practice then no worries. Re: "stable URL where it could be hosted", I have a stable domain (benhaller.com) where I'm happy to host (not-huge) stuff if you ever want that. We already host the downloads for SLiM there.

@bhaller

bhaller commented Aug 23, 2018

Here is a sample SLiM provenance table now (@petrelharp):

{
    "environment": {
        "os": {
            "machine": "x86_64",
            "node": "darwin-447.local",
            "release": "17.6.0",
            "system": "Darwin",
            "version": "Darwin Kernel Version 17.6.0: Tue May  8 15:22:16 PDT 2018; root:xnu-4570.61.1~1/RELEASE_X86_64"
        }
    },
    "parameters": {
        "command": []
    },
    "schema_version": "1.0.0",
    "slim": {
        "file_version": "0.2",
        "generation": 2000,
        "model": "initialize() {\n\tinitializeTreeSeq();\n\tinitializeMutationRate(1e-7);\n\tinitializeMutationType(\"m1\", 0.5, \"f\", 0.0);\n\tinitializeGenomicElementType(\"g1\", m1, 1.0);\n\tinitializeGenomicElement(g1, 0, 99999);\n\tinitializeRecombinationRate(1e-8);\n}\n1 {\n\tsim.addSubpop(\"p1\", 500);\n}\n2000 late() { sim.treeSeqOutput(\"~/Desktop/junk.trees\"); }\n",
        "model_type": "WF",
        "remembered_node_count": 0,
        "seed": 1722162964723
    },
    "software": {
        "name": "SLiM",
        "version": "3.0"
    }
}

The parameters under "command" are just an empty array because this was run in SLiMgui. I believe it follows the schema. SLiM-specific cruft has been put under a "slim" top-level tag. This will require an update to pyslim, presumably. Note that file_version was 0.1 for SLiM 3.0 (and was a top-level tag), and now it is 0.2 (and is under "slim"). The previous post-3.0 file format also called itself 0.2, but I think we should just pretend that it never existed, as it was only a thing briefly in GitHub, never in release.
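For a reader such as pyslim, the move of file_version could be handled roughly as in this Python sketch. It is an assumption about how a client might cope with both layouts, not pyslim's actual code, and `slim_file_version` is a hypothetical name.

```python
def slim_file_version(doc):
    """Return SLiM's file_version from either the new or the old location."""
    slim = doc.get("slim")
    if isinstance(slim, dict) and "file_version" in slim:
        # Current layout: file_version lives under the "slim" tag.
        return slim["file_version"]
    # SLiM 3.0 layout: file_version was a top-level tag (None if absent).
    return doc.get("file_version")


new_style = {"slim": {"file_version": "0.2"}}
old_style = {"file_version": "0.1"}
```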

@jeromekelleher
Member Author

> Re: "stable URL where it could be hosted", I have a stable domain (benhaller.com) where I'm happy to host (not-huge) stuff if you ever want that. We already host the downloads for SLiM there.

Thanks for the offer @bhaller --- it's not so much the web hosting, it just feels like the long-term stable URI for a schema like this should be something associated with the project. Something like tskit.org/provenance.schema.json. So, I think the simplest thing is just leave it out for now, as it's easy to add in later without breaking anything.

Having said that, tskit.org is free. Maybe I should just register it...

@jeromekelleher
Member Author

jeromekelleher commented Aug 24, 2018

The SLiM provenance looks great @bhaller. I have a minor suggestion but feel free to ignore:

{
    "environment": {
        "os": {
            "machine": "x86_64",
            "node": "darwin-447.local",
            "release": "17.6.0",
            "system": "Darwin",
            "version": "Darwin Kernel Version 17.6.0: Tue May  8 15:22:16 PDT 2018; root:xnu-4570.61.1~1/RELEASE_X86_64"
        }
    },
    "parameters": {
        "command": [],
        "model": "initialize() {\n\tinitializeTreeSeq();\n\tinitializeMutationRate(1e-7);\n\tinitializeMutationType(\"m1\", 0.5, \"f\", 0.0);\n\tinitializeGenomicElementType(\"g1\", m1, 1.0);\n\tinitializeGenomicElement(g1, 0, 99999);\n\tinitializeRecombinationRate(1e-8);\n}\n1 {\n\tsim.addSubpop(\"p1\", 500);\n}\n2000 late() { sim.treeSeqOutput(\"~/Desktop/junk.trees\"); }\n",
        "model_type": "WF",        
        "seed": 1722162964723
    },
    "schema_version": "1.0.0",
    "slim": {
        "file_version": "0.2",
        "remembered_node_count": 0,
        "generation": 2000     
    },
    "software": {
        "name": "SLiM",
        "version": "3.0"
    }
}

The idea is that parameters should contain all information that you would need to supply to SLiM to recreate the tree sequence, so I've moved the model, model_type and seed in there. The stuff left in the slim tag then is the miscellaneous information that pyslim uses to better interpret the tree sequence.
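The rearrangement can be expressed as a small transformation. This Python sketch is only an illustration of the rule above: `move_to_parameters` is a hypothetical helper, and the list of keys is taken from this example rather than from any SLiM specification.

```python
import copy

# Keys needed to re-run the simulation, per the suggestion above.
REPRODUCIBILITY_KEYS = ("model", "model_type", "seed")


def move_to_parameters(doc):
    """Return a copy with reproducibility keys moved from "slim" to "parameters"."""
    result = copy.deepcopy(doc)
    slim = result.get("slim", {})
    parameters = result.setdefault("parameters", {})
    for key in REPRODUCIBILITY_KEYS:
        if key in slim:
            parameters[key] = slim.pop(key)
    return result


before = {
    "parameters": {"command": []},
    "slim": {"file_version": "0.2", "generation": 2000, "model_type": "WF", "seed": 1722162964723},
}
after = move_to_parameters(before)
```

Everything left under "slim" is then interpretation metadata rather than simulation input.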

@jeromekelleher jeromekelleher merged commit f67d570 into tskit-dev:master Aug 24, 2018
@jeromekelleher jeromekelleher deleted the provenance-schema branch August 24, 2018 12:19
@bhaller

bhaller commented Aug 24, 2018

I like your proposed shift, I'll do that this morning. Thanks for the feedback.

@jeromekelleher
Member Author

The documentation is up here --- should explain the rationale for things.

@bhaller

bhaller commented Aug 24, 2018

Great, I'll link to that in my documentation, thanks. @petrelharp, SLiM's file_version 0.2 provenance format is now as Jerome proposed immediately above. Done and committed.

@jeromekelleher
Member Author

> Great, I'll link to that in my documentation, thanks. @petrelharp, SLiM's file_version 0.2 provenance format is now as Jerome proposed immediately above. Done and committed.

Excellent. Probably best to link to the stable version: https://msprime.readthedocs.io/en/stable/provenance.html

Doesn't exist right now, but will later today hopefully.
