
General framework #1

Closed
petrelharp opened this issue Jun 20, 2018 · 41 comments

Comments

@petrelharp
Contributor

This package needs to deal with the extra information that SLiM stores in a tree sequence, reading it easily, and creating it in tree sequences lacking it. These are defined here and are:

Individuals:

  • SLiM pedigree ID
  • age
  • subpopulation ID

Nodes:

  • SLiM genome ID
  • is a null genome?
  • type (autosome, X, Y, ...)

Mutations:

  • mutation type ID
  • selection coefficient
  • subpopulation ID it occurred in
  • origin generation

Population:

  • a whole bunch of stuff

There's one (at least?) bit of additional tree-sequence-global information:

  • generation : the current generation of the simulation

Proposed method #1:

  1. Define classes like SlimNodeMetadata, etcetera, to hold the information above. This would not be a tuple containing the information for a given node, it would have numpy vectors of the information for all nodes, so that SlimNodeMetadata.type would give you a vector of the same length as the node table it came from, with the type for every node. (These are basically new tables, but we won't call them that.) Define a SlimMetadata class to hold this information, as well as generation, the current generation of the simulation (obtained from provenance).

  2. Extend the TreeSequence class to SlimTreeSequence to include a SlimMetadata object. The metadata object will be immutable, as are the metadata columns they are obtained from.

  3. Provide a method that allows creating a new SlimTreeSequence with a different SlimMetadata object.
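Step 1 of this proposal could be sketched roughly as follows. This is a minimal illustration assuming numpy, with made-up field names and values; it is not a real pyslim class:

```python
import numpy as np

class SlimNodeMetadata:
    # Column-oriented holder: each attribute is a numpy vector with one
    # entry per row of the node table it was built from, so
    # SlimNodeMetadata.type gives a vector of types, not a single value.
    def __init__(self, genome_id, is_null, node_type):
        self.genome_id = np.asarray(genome_id, dtype=np.int64)
        self.is_null = np.asarray(is_null, dtype=bool)
        self.type = np.asarray(node_type, dtype=np.int8)

    def __len__(self):
        return len(self.genome_id)

md = SlimNodeMetadata(
    genome_id=[0, 1, 2, 3],
    is_null=[False, False, False, False],
    node_type=[0, 0, 1, 1],  # hypothetical coding, e.g. 0 = autosome, 1 = X
)
```

These are basically new tables, as noted above, just not called that.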

The above suffices, but to make some things easier, we might want to also:

  1. Extend Individual, Node, Mutation, and Population to include the above attributes (e.g., creating SlimIndividual);
  2. Extend methods SlimTreeSequence.individuals() and .individual() methods to return SlimIndividuals, and similarly for .nodes(), .mutations(), and .populations().

How's this look, roughly, @jeromekelleher, @bhaller?

@petrelharp
Contributor Author

p.s. @bhaller I see that the provenance table also records remembered_node_count - remind me how this works?

@jeromekelleher
Member

Creating subclasses seems like a lot of work to me @petrelharp, and would be brittle in the long run. Why not encode all this stuff into the relevant metadata slots? We could provide functions here that will encode and decode the metadata for (say) Individuals, so that a user could do something like

ts = msprime.load("slim-example.trees")
for ind in ts.individuals():
    slim_data = pyslim.decode_individual(ind.metadata)
    print(slim_data.age, slim_data.pedigree_id, slim_data.sub_population_id)

Ultimately, if the plan of attaching custom decoders to tree sequences works out, we could then do something like

ts = msprime.load(
    "slim-example.trees", 
    individual_metadata_decoder=pyslim.decode_individual)
for ind in ts.individuals():    
    print(ind.metadata.age, ind.metadata.pedigree_id)

This decouples things nicely, and gives a lot of freedom for how the decoders work. You can use struct.pack if you like, and the user won't know or care.
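For the encoding direction, the same struct format string could be reused; a sketch, where the three-int field layout is illustrative rather than SLiM's actual binary format:

```python
import struct

def encode_individual(age, pedigree_id, sub_population_id):
    # Pack three little-endian 32-bit ints; the matching decoder
    # unpacks with the same "<iii" format string.
    return struct.pack("<iii", age, pedigree_id, sub_population_id)

def decode_individual(buff):
    age, pedigree_id, sub_population_id = struct.unpack("<iii", buff)
    return (age, pedigree_id, sub_population_id)

buff = encode_individual(3, 12345, 0)
assert decode_individual(buff) == (3, 12345, 0)
```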

@bhaller
Collaborator

bhaller commented Jun 20, 2018

I agree with @jeromekelleher that making subclasses seems problematic. I think my Python chops aren't good enough to know what the right solution is, though. Jerome, in your code above, what sort of beast exactly is slim_data? How does it know how to respond to .age, .pedigree_id, etc., if it is not a subclass of the sort Peter was proposing? Is it some sort of Python associative array? But without knowing how it works under the hood, I can certainly say that the code snippets Jerome cited look very clean. And keeping the information in the metadata, in the regular tables, seems much simpler and cleaner than moving it to and fro, into and out of custom objects that are effectively new tables. That sounds very complicated and, as Jerome says, brittle. (But perhaps more efficient, if a lot of manipulation of metadata values is needed? But probably typical usage won't involve huge amounts of complex processing on metadata values?)

@petrelharp, remembered_node_count is just written out from remembered_genomes_.size(); it's needed so that the genomes that were "remembered" when the file was written become "remembered" again when the file is read back in. So you presumably want to preserve its value, but for new unannotated data that you're adding metadata to for the first time I imagine it would always be zero. (If the new unannotated data comes from the coalescent, most of the old genomes never existed in the first place, so there's not much to remember. :->) For an operation like recapitation that adds new nodes that are older than the existing nodes, this will maybe get tricky; you'd like the same set of nodes to stay "remembered" after recapitation, so they need to be preserved somehow. That's above my pay grade. ;->

The variable-length migration metadata for the subpopulations will doubtless be the most annoying part of the implementation; sorry about that. :->

By the way, the metadata format is also documented now in section 23.4 of the SLiM 3 manual. :->

@bhaller
Collaborator

bhaller commented Jun 20, 2018

Oh, and one other thought: getting/setting particular metadata values is great, obviously, but I also hope there will be a convenience method that makes it possible, with a single call, to annotate the result of an msprime coalescent simulation in a default manner that makes it importable into SLiM. Just grouping pairs of nodes into individuals, putting all the individuals into one population, setting the selection coefficient of all mutations to zero, setting all spatial positions and spatial bounds to zero, setting all nodes as representing non-null autosomes (unless msprime supports sex chromosomes?), setting all mutation type IDs to 1 by default (this could be an option), etc. Individual IDs and genome IDs could be assigned sequentially from zero, I think, with the genome IDs obeying the 2*individual_id + [0/1] rule. Set the population metadata to sensible values: selfing and cloning rates of 0, sex ratio of 0.5 (unused if it turns out to be a non-sexual model), no migration records, etc.
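The sequential-ID rule described above can be sketched as (an illustrative helper, not pyslim API):

```python
def default_ids(num_individuals):
    # Individual IDs assigned sequentially from zero; each individual's
    # two genome IDs obey the 2*individual_id + (0 or 1) rule.
    individual_ids = list(range(num_individuals))
    genome_ids = [(2 * i, 2 * i + 1) for i in individual_ids]
    return individual_ids, genome_ids

inds, genomes = default_ids(3)
# inds: [0, 1, 2]; genomes: [(0, 1), (2, 3), (4, 5)]
```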

And don't forget the sex values set in flags, by the way. Set everybody to hermaphrodite (male | female) for non-sexual models, set males and females otherwise. I imagine whether the model is hermaphroditic or sexual would be a flag passed to this convenience method; perhaps a sex ratio could be supplied too.

The only other tricky thing is that nonWF models expect ages to all be -1, whereas for WF models -1 is an illegal value. So the age of individuals ought to be a parameter too.

So something like:

pyslim.annotate(age=-1, sexual=false, sex_ratio=0.5, mutation_type_id=1)

? That is not in the proper Python syntax since I'm not PyLiterate, but you know what I mean. :-> If the user wanted to do something more unusual they could manually set up all the metadata, but this would handle 90% of use cases, probably.

For now it could probably assume SLiM version 3.0 and file version 0.1; parameters for those could be added later if that seemed like a good idea, but probably pyslim should just always target the latest version of SLiM. (whether it ought to be able to read metadata from older versions is a bridge we can cross later.)

I'm not sure what the generation number ought to be. It would be nice for it to be inferred from the height of the tree, I guess? But it might also be nice for the user to be able to set it to whatever value they want, and have the coalescent tree extend backward in time from the chosen endpoint. So maybe that's also a parameter?

Anyway, sorry for the verbosity, I'll be quiet now. :->

@jeromekelleher
Member

jeromekelleher commented Jun 21, 2018

> I agree with @jeromekelleher that making subclasses seems problematic. I think my Python chops aren't good enough to know what the right solution is, though. Jerome, in your code above, what sort of beast exactly is slim_data? How does it know how to respond to .age, .pedigree_id, etc., if it is not a subclass of the sort Peter was proposing? Is it some sort of Python associative array? But without knowing how it works under the hood, I can certainly say that the code snippets Jerome cited look very clean. And keeping the information in the metadata, in the regular tables, seems much simpler and cleaner than moving it to and fro, into and out of custom objects that are effectively new tables. That sounds very complicated and, as Jerome says, brittle. (But perhaps more efficient, if a lot of manipulation of metadata values is needed? But probably typical usage won't involve huge amounts of complex processing on metadata values?)

These would just be simple Python classes:

import attr  # the "attrs" package is imported as "attr"
import struct

@attr.s
class IndividualMetadata(object):
    age = attr.ib()
    pedigree_id = attr.ib()

def decode_individual(buff):
    # unpack two little-endian 32-bit ints: age, then pedigree ID
    age, pedigree_id = struct.unpack("<ii", buff)
    return IndividualMetadata(age=age, pedigree_id=pedigree_id)

I've used attrs here to keep it simple. I'm assuming that the metadata has been packed as two 32-bit little-endian integers (this code might not work, I haven't actually run it).

I would literally leave it at that though. This code will perform perfectly well enough for the vast majority of cases --- I'd need to see some strong evidence showing that it was too slow for some important application before I'd personally consider making anything more complicated.

Encoding the metadata as binary has advantages and disadvantages. It's fast and simple (easy to produce and consume on the C++ side), but there are versioning issues when we want to change the format over time. Still, if you're only adding new items over time then it's actually not too bad.
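One mitigation for the versioning issue is to prefix the binary blob with a format-version byte, so old decoders fail loudly and new ones can dispatch on it. A sketch with hypothetical fields, not SLiM's actual format:

```python
import struct

METADATA_VERSION = 1  # illustrative version tag

def encode_with_version(age, pedigree_id):
    # "<Bii": one version byte, then two little-endian 32-bit ints.
    return struct.pack("<Bii", METADATA_VERSION, age, pedigree_id)

def decode_with_version(buff):
    version = buff[0]
    if version == 1:
        _, age, pedigree_id = struct.unpack("<Bii", buff)
        return {"age": age, "pedigree_id": pedigree_id}
    raise ValueError("unknown metadata version %d" % version)
```

Adding fields later then means bumping the version and adding a branch to the decoder, while old files remain readable.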

Re. encoding the information for SLiM, I don't know enough about how SLiM is consuming the information to provide any help here.

@petrelharp
Contributor Author

Current proposal:

  1. Provide a SlimTreeSequence class that extends TreeSequence, while trying to do as little as possible beyond storing the extra information that we need to store somewhere. This extra information is:

    • reference_time (the slim-time we're setting to be tskit-time=0.0)
    • alleles - since SLiM stores allelic states as a big ugly blob that isn't nice to print even when ASCII-ified, when loading a tree sequence we reassign the ancestral_state and derived_state to be digits; to avoid information loss, we need to store the conversion from these digits back to the original alleles
  2. Provide functions decode_X() that decode the binary stuff SLiM sticks in metadata in various places, and encode_X() that do the inverse operation. These each return struct-like objects containing this information. Notes:

    • actually, the mutation metadata is a list of these structs, since the derived state could be a superposition of more than one mutation
    • population metadata can be empty, which we code as None, out of laziness
  3. Provide functions to replace the metadata columns of tables with the information in iterables of the structs produced above; called annotate_X(). These are just convenience wrappers around encode_X(), but we need them because the right way to modify an existing table collection may not be obvious. (you can't just assign to tables.mutations, for instance)
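For context on why the metadata column can't just be assigned: ragged per-row metadata is stored in the tables as one flat byte buffer plus an offset column, so replacing it means rebuilding both arrays together (msprime/tskit provide a helper, pack_bytes, that does this). A pure-Python sketch of that packing:

```python
def pack_bytes(metadata_list):
    # Row j's metadata lives at buffer[offset[j]:offset[j+1]]; the
    # offset array has one more entry than there are rows.
    offsets = [0]
    buff = bytearray()
    for md in metadata_list:
        buff.extend(md)
        offsets.append(len(buff))
    return bytes(buff), offsets

buff, off = pack_bytes([b"ab", b"", b"cde"])
assert buff == b"abcde"
assert off == [0, 2, 2, 5]
assert buff[off[2]:off[3]] == b"cde"  # row 2's metadata
```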

  4. Provide functions to fill in everything that an msprime simulation does not have that a SLiM simulation needs, which includes:

    • make up Individuals and Populations
    • fill out metadata using the tools above and default values
    • add stuff to the Provenance table, including remembered_node_count

    @bhaller has a proposal above for what default values should be; these can be passed in as iterables as well.

@petrelharp
Contributor Author

> The only other tricky thing is that nonWF models expect ages to all be -1, whereas for WF models -1 is an illegal value. So the age of individuals ought to be a parameter too.

I'm guessing that's backwards - nonWF has positive ages, and WF has age = -1? And, is it an error for WF indivis to have age not equal to -1? Otherwise, I'll just set ages to 0 unless it's passed in otherwise.

@bhaller
Collaborator

bhaller commented Jun 27, 2018

Backwards, yes. And yes, it's an error for WF individuals to have age not equal to -1, at present. That restriction could be relaxed, though, if it creates substantial trouble. In particular, it would not be a problem to allow an age of 0 instead, specifically, which would get corrected to -1 upon loading.

@petrelharp
Contributor Author

Questions: for @bhaller mostly.

  1. Are we really using the individual.flags to record sex? I thought the conclusion previously was that applications would use metadata to record stuff. The MALE | FEMALE option was the one that I liked best, but it's not the conclusion that I remembered.

  2. Do we need to record SLiM metadata for nodes that don't correspond to individuals?

  3. Should population metadata be in ASCII? Currently tskit requires it to be so; this is easy to change, but it would be nice to actually keep it easily human-readable.

  4. Should 'time' in pyslim be forwards time or backwards time?
    This is relevant for the "reference_time" argument and the "time" information in mutation metadata.

  5. Should there be more information in the provenance record, e.g., simulation type ("nonWF" or "WF")?

  6. When annotating an msprime tree sequence, do we need to reset derived states to be comma-separated strings of mutation IDs? Do we need to set ancestral states to be ''? I'm hoping the answer is "no".

  7. What should be set in the provenance table? Two rows: one for pyslim, and then one for slim?
    How does slim extract information - by looking for the last row that says "SLiM"?

@petrelharp
Contributor Author

Note: currently, the code seems to work, but requires a small patch to msprime to allow binary population metadata.

@bhaller
Collaborator

bhaller commented Jun 27, 2018

Hi @petrelharp. Answers:

  1. Yes. Perhaps I misunderstood the conclusion reached by that thread; but if we're not supposed to use MSP_INDIVIDUAL_FEMALE and MSP_INDIVIDUAL_MALE, shouldn't they have been removed from the tskit headers? Anyway, I guess we need to go back and look at that thread again and see where we ended up. Where was it? I used to look it up with a link from an email, but once the discussion was over I deleted that email. I hope we don't need to change this, though; it's very late in the game to be modifying the metadata records.

  2. I don't think we need to; by design we don't, since we don't have the information that would be needed (since the genomes/individuals no longer exist). When the .trees file gets read back in, SLiM doesn't do anything with those nodes; they don't get re-instantiated back into genomes or individuals. So there is no need for such metadata, and if it were present SLiM would ignore it I think.

  3. I think it should be binary. Some of the pop metadata is floating-point, so writing it in ASCII gets into the representation issues that we were grappling with before; those issues are solvable, but binary is really just better. And there's nothing special about the population metadata; if we were to make that ASCII, wouldn't the same logic imply that all the other metadata should be ASCII too? I think this is a slippery slope that there is no reason to start down. Metadata should be private, so tskit should not make any assumptions about the format of it, I think; I could see the point of ASCII with the derived states, since they are not technically metadata, but I think I am inclined to put my foot down here. If pyslim provides a way to extract it into a record, that then allows it to be printed in a human-readable form anyway, no?

  4. I don't know. There should certainly be a way to get and set the "current generation" in the provenance information, in SLiM generation clock terms; and the origin generation in the mutation metadata is in SLiM clock terms when written out, I can say that. I don't know what pyslim ought to do regarding translating times back and forth; this difference between conceptions of time is really difficult and confusing and annoying. I think you understand these issues better than I do, so do what you think is best.

  5. Probably? It's easy to add on the SLiM side. That does seem like it would be a sensible thing to have in the provenance info, just for annotation purposes. Right now we basically put in the bare minimum that was needed to allow the simulation to be read back in. Shall I add a "model type" field? Anything else?

  6. I think derived states have to be SLiM-compliant, and thus have to be comma-separated strings of mutation IDs, yes. And if multiple msprime mutations occur at the same position (after rounding positions to integers, which also needs to be done), they need to get stacked. This is presumably the most annoying thing about writing pyslim. Sorry! I don't think SLiM itself does anything at all with ancestral states when reading in; it ignores the sites table completely, in fact, I think. It uses a vargen to construct the mutational state of every individual; so if the vargen understands the final mutational state of each extant genome correctly, then things ought to work. I don't know what that implies about what you need to write out for ancestral states. If you don't write them out as empty, then what would you write them out as, though??

  7. SLiM presently demands that the very last provenance entry be a SLiM entry. So yes, I think a pyslim entry and then a SLiM entry. Of course if this policy doesn't make sense, it can easily be changed. The rationale was that SLiM can trust the file to be in the proper compliant format only if the last entry is a SLiM entry; if some other program has done stuff subsequently to the data, then all bets are off and SLiM should refuse to read the file. That may be overly defensive, though; perhaps we ought to trust the user more? But if it's not a problem to write out a pyslim entry followed by a SLiM entry, then perhaps that is fine.

Re: patch to msprime, that's rather unfortunate, since it means SLiM 3 and pyslim won't be compatible with the release version of msprime, no? A shame this wasn't found earlier.

@petrelharp
Contributor Author

The discussion around flags was here: 482 and 490 and 484

@petrelharp
Contributor Author

> 2. I don't think we need to;

Great.

> 3. I think it should be binary.

OK; I'll figure out something hack-ey to do in the meantime that will pull out the metadata and hide it elsewhere; I've made a pull request to msprime to discuss this.

@petrelharp
Contributor Author

> 6. I think derived states have to be SLiM-compliant, and thus have to be comma-separated strings of mutation IDs, yes. And if multiple msprime mutations occur at the same position (after rounding positions to integers, which also needs to be done), they need to get stacked.

Ah-ha. Makes sense. Could you provide some more guidance on this? Should each unique mutation just be assigned integer IDs starting from 0? And: site positions are floats; are you saying that I should actually modify the site table, replacing the positions with rounded positions, and then merge mutations at sites that, as a result, are now the same?

@petrelharp
Contributor Author

> 7. SLiM presently demands that the very last provenance entry be a SLiM entry.

I think that it should not demand this, since if you do other things to the .trees file with other programs before reading it back into SLiM, they will add their own provenance entries. I'd say just read back up from the bottom until you hit a SLiM entry.
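That scan-from-the-bottom rule could be as simple as the following; the JSON key names ("program", "generation") are assumptions for illustration, not SLiM's documented provenance schema:

```python
import json

def last_slim_generation(provenance_records):
    # Walk the provenance entries newest-to-oldest and stop at the
    # first SLiM entry, ignoring entries added by other programs.
    for record_str in reversed(provenance_records):
        record = json.loads(record_str)
        if record.get("program") == "SLiM":
            return record["generation"]
    raise ValueError("no SLiM provenance entry found")

recs = [
    '{"program": "SLiM", "generation": 100}',
    '{"program": "pyslim"}',
]
assert last_slim_generation(recs) == 100
```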

@bhaller
Collaborator

bhaller commented Jun 27, 2018

> The discussion around flags was here: 482 and 490 and 484

Having read through that, yes, we decided to move the sex info into metadata, but it didn't happen. The sex flags in msprime should be removed or moved to a private header, and SLiM should add sex to the metadata for individuals. I'll do that today. Ugh.

> Ah-ha. Makes sense. Could you provide some more guidance on this? Should each unique mutation just be assigned integer IDs starting from 0? And: site positions are floats; are you saying that I should actually modify the site table, replacing the positions with rounded positions, and then merge mutations at sites that, as a result, are now the same?

Integer IDs from 0 should be fine, as long as they don't collide with any IDs for mutations already present of course; keep in mind that the .trees file read in may already contain mutations. But you can re-assign the IDs of those if it is convenient to do so.

Re: sites and merging and stacking, yes, I think that's what I'm saying. SLiM compliance implies integer site positions, and integer site positions implies stacking, I think. I don't think we want to start relaxing the compliance rules to make pyslim's life easier (sorry!); pyslim should write out files that SLiM cannot tell were not generated by SLiM itself, otherwise we're effectively allowing two different import formats, and bugs will inevitably result. Better to be strict.
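A sketch of the rounding-and-stacking step described above (assuming truncation to integer positions and sequential IDs from zero; not pyslim's actual implementation):

```python
from collections import defaultdict

def slimify_positions(positions, next_id=0):
    # Round each site position down to an integer, assign each mutation
    # a sequential integer ID, and stack mutations that now share a
    # position into one comma-separated derived state.
    stacked = defaultdict(list)
    for pos in positions:
        stacked[int(pos)].append(next_id)
        next_id += 1
    return {p: ",".join(map(str, ids)) for p, ids in stacked.items()}

states = slimify_positions([0.2, 1.7, 1.9, 3.5])
assert states == {0: "0", 1: "1,2", 3: "3"}
```

Here the mutations at 1.7 and 1.9 both land on position 1 and get stacked into the derived state "1,2".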

> I think that it should not demand this, since if you do other things to the .trees file with other programs before reading it back into SLiM, they will add their own provenance entries. I'd say just read back up from the bottom until you hit a SLiM entry.

OK. Perhaps pyslim ought to generate a SLiM entry and then a pyslim entry, then, rather than the other way around? To say that it is the generator of the file, in the end? I'll fix SLiM to use the last SLiM entry found, that seems fine. Users processing .trees files through other software and then back into SLiM will have to know what they're doing anyway.

@ashander
Contributor

Catching up here on @petrelharp's current proposal; here are some comments on the items in the API:

  1. does SlimTreeSequence need to be exposed to / understood by a user of the API? In the current state of the code it seems not, and they can just use .load() and .load_tables(). If not, this could be a private class _slimTreeSequence.

  2. similarly, decode_X() and encode_X() are nice to use in e.g. testing, but do they need to be understood by a user outside this module? If not, these could be static methods on their class, which you can also make private, like NodeMetaData._encode(buff).

  3. annotate_X() seem like good functions for the lowest level user of the module and this seems like a good name for these functions.

  4. Such a setup function makes sense, and I think it is now in pyslim.slim_tree_sequence.annotate(), but it seems to do more than just 'annotate' and is thus confusing alongside annotate_X() (as of 63fdc47). Perhaps call this .initialize()?

So I think I agree that the stuff in pyslim.slim_tables needn't be exposed, and perhaps less than that. On the more internal side, I'm surprised to see the set_X() methods in pyslim.slim_tables not using the annotate_X() methods.

@bhaller
Collaborator

bhaller commented Jun 27, 2018

I think users of pyslim will want to decode and encode; they will want to get SLiM-specific metadata out of the tree sequence, for analysis or other purposes, and will want to set up that metadata with the intention of then importing the .trees file back into SLiM with whatever changes they made. So I think that stuff does need to be public (and so SlimTreeSequence probably does too?). I agree that perhaps there is naming confusion between annotate_X() and annotate(), but I'm not sure initialize() gets my vote; seems rather vague (initialize what, how?). Maybe a method name that more explicitly conveys what the method does?

@petrelharp
Contributor Author

> 1. does SlimTreeSequence need to be exposed to / understood by a user of the API?

At least in the current state, no - it should look just like an msprime.TreeSequence. How does a private class work?

> 2. decode_X() and encode_X() are nice to use in e.g. testing but do they need to be understood by a user outside this module?

At first I thought I agreed with @bhaller - they are necessary, because that's how people will get information out of the metadata, by doing e.g.:

ages = [pyslim.decode_individual(x.metadata).age for x in ts.individuals()]

But, we could alternatively just provide two functions, extract_metadata_X and annotate_metadata_X, where the first would return a list of XMetadata objects, and the second would use such a list to annotate the tables. Then the user would never need to deal with the actual binary in the metadata itself.

I guess I still think that encode_X and decode_X should be exposed, but maybe the previous thing is still a good idea.

> 4. annotate() seems to do more than just 'annotate'

Good point; maybe it should be annotate_default_metadata().

@ashander
Contributor

On 1: for the example, I think you saw 8d086a0 #4; doing things this way makes clear a subclass of TreeSequence isn't really being used at all. (Though this is an implementation detail and not what we're meant to discuss here :)

On 2: heh, yes, I see I didn't really finish my thinking here. To provide the kind of encapsulation I was envisioning requires the second method of @petrelharp. That is to say: with annotate_X() and the XMetadata objects, we provide an interface to add Python objects as properly encoded metadata. With a similar interface at the same level as annotate_X(), a Python user doesn't need to know about the bytes and structs. I certainly think this could be cleaner and require less duplicated documentation of the internals of SLiM's bookkeeping. OTOH, maybe not exposing the encoding functions directly would make this module not that useful!

@bhaller
Collaborator

bhaller commented Jun 28, 2018

Since I'm not good in Python, I'll bow out; it sounds like you guys are on top of the relevant issues. :->

@petrelharp
Contributor Author

I've switched things over to have (and use) symmetric annotate_X_metadata() and extract_X_metadata() functions, that deal with iterators of XxMetadata objects. So, we can do something like the following to set sexes randomly:

import pyslim
import msprime
import random
ts = msprime.simulate(10, mutation_rate = 0.0, recombination_rate = 1.0)
tables = ts.tables
pyslim.annotate(tables, model_type="nonWF")
individual_metadata = list(pyslim.extract_individual_metadata(tables))
for j in range(len(individual_metadata)):
    individual_metadata[j].sex = random.choice([0, 1])

pyslim.annotate_individual_metadata(tables, individual_metadata)
slim_ts = pyslim.load_tables(tables)

@petrelharp
Contributor Author

petrelharp commented Jun 28, 2018

Observations:

  1. To make a nice tree sequence to look at and analyze, we are wanting to modify the SLiM-produced tables as follows:

    a. change times so that they measure time before the end of the sim instead of the start
    b. replace alleles with single characters so they're easy to look at, and store this information somewhere

  2. To take a tree sequence and turn it into tables suitable for SLiM to read in, we need to:

    • add metadata to, like, all the tables (if this was msprime-produced)
    • make up individuals (if this was msprime-produced)
    • reset times (which is like inverting operation (a) above)
    • substitute msprime alleles with SLiM-based strings of mutation IDs (like inverting operation (b) above)
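The time reset in (a), and its inverse in the list just above, is a simple shift, assuming node times are recorded forwards from the start of the simulation:

```python
def rebase_times(node_times, slim_generation):
    # Convert forwards time (generations since the start) to tskit-style
    # backwards time (generations before the end), or back again: the
    # operation is its own inverse.
    return [slim_generation - t for t in node_times]

backwards = rebase_times([100, 99, 0], slim_generation=100)
assert backwards == [0, 1, 100]
assert rebase_times(backwards, slim_generation=100) == [100, 99, 0]
```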

So, the route to making SLiM-readable tables is always through a tree sequence, which is a bit annoying, because if you'd started with SLiM tables then you could just modify these directly and not have to do the inverse operations above.

This could be confusing to the user: it would be quite easy to load a .trees file with pyslim, but then write one back out with msprime, or vice-versa. Making a subclass to TreeSequence would avoid this problem... although maybe what we really want to subclass is TableCollection.

Summarizing:

  • by convention, TreeSequences have been shifted to the generation time recorded in the last SLiM provenance entry; and have another attribute, allele, which is a list of dicts that give alleles
  • writing out a .trees file with pyslim will invert operations (a) and (b), and will error if the allele attribute is not present
  • the annotate() method will add both things (provenance and alleles), as well as metadata, to an msprime-produced tree sequence, so that pyslim can write it out

@bhaller
Collaborator

bhaller commented Jun 28, 2018

> So, the route to making SLiM-readable tables is always through a tree sequence, which is a bit annoying, because if you'd started with SLiM tables then you could just modify these directly and not have to do the inverse operations above.

Huh. So why not start from tables, then, instead, if that would be the simpler design? I'm not sure I understand the issue here.

> This could be confusing to the user: it would be quite easy to load a .trees file with pyslim, but then write one back out with msprime, or vice-versa. Making a subclass to TreeSequence would avoid this problem... although maybe what we really want to subclass is TableCollection.

I thought you did plan to subclass TreeSequence, just with a private subclass. Given that, how would the user do what you say – load with pyslim but write with msprime, or vice versa? If they load with pyslim, they end up with a pyslim subclass, which then writes in the pyslim way; and if they load with msprime, they end up with an msprime TreeSequence, which they would have no way of writing out with pyslim. No? Probably I just don't understand Python...?

The rest all sounds quite reasonable to me.

@ashander
Contributor

ashander commented Jun 29, 2018 via email

@bhaller
Collaborator

bhaller commented Jun 29, 2018

OK. So how, in this design, do you prevent the user from loading with pyslim and saving with msprime, or vice versa? I guess the "vice versa" case is prevented by the "opinionated" policy you mention; pyslim would refuse to write an msprime tree sequence? How is the first case prevented?

I'm also curious, not knowing much about python: if you don't subclass TreeSequence, where does pyslim store all of the extra data associated with tree sequences that it reads in?

@petrelharp
Contributor Author

I've changed my mind again. More observations:

  1. Tree sequences are immutable, whereas table collections are not. So, to modify tables - like in my code above to add sexes to all the individuals - we need to work with tables. That's why annotate_X_metadata() takes tables, not tree sequences. We could take a tree sequence; but then we'd have to do, internally:
def annotate_X_metadata(ts, md):
    tables = ts.tables
    annotate_X_metadata_tables(tables, md)
    return tables.tree_sequence()

... which, code-wise, is just fine, but it does feel awkward to do all that tree sequence -> tables -> tree sequence -> tables -> ... for no reason.

  2. It would be much better to store the alleles information in the metadata column of the Site table. We're not using it for anything, and if we keep it there, then the alleles will stay with the sites, even if these get rearranged, with no extra effort!

So, here's my revised proposal.

  • pyslim.load() has an argument, slim_format, which if False will load as usual; and if True will load, then modify the tables to stick the alleles in the site metadata, and rebase times according to the generation given in the provenance table.
  • pyslim.annotate() will take a non-SLiM tree sequence and produce a SLiM tree sequence.
  • pyslim.dump() will write out a SLiM tree sequence, after reversing the time-adjustment and the allele translation.

Use cases:

  1. Read in a SLiM tree sequence and analyze it.
ts = pyslim.load("slim.trees", slim_format=True)
# selection coefficients and locations of all selected mutations
sel_loci = []
for mut in ts.mutations():
  md = pyslim.decode_mutation(mut.metadata)
  if md is not None and md.selection_coeff != 0:
    sel_loci.append((mut.position, md.selection_coeff))
  2. Create a coalescent simulation, annotate it minimally to be read into SLiM, and write it out:
ts = msprime.simulate(10, mutation_rate=0.0, recombination_rate=1.0)
slim_ts = pyslim.annotate(ts, model_type="nonWF", slim_generation=1)
pyslim.dump(slim_ts, "new_slim.trees")
  3. Read in a tree sequence produced by coalescent simulation, annotate it, assign individual sexes, and write it out to be read by SLiM:
ts = pyslim.load("msprime.trees", slim_format=False)
# now we will be editing it so we need to switch to tables
tables = ts.tables
pyslim.annotate_tables(tables, model_type="nonWF")
individual_metadata = list(pyslim.extract_individual_metadata(tables))
for j in range(len(individual_metadata)):
    individual_metadata[j].sex = random.choice([0, 1])

pyslim.annotate_individual_metadata(tables, individual_metadata)
slim_ts = pyslim.load_tables(tables)
pyslim.dump(slim_ts, "new_slim.trees")
  4. Read in a tree sequence produced by SLiM, reduce it to only the hermaphrodites, and write it out again:
ts = pyslim.load("slim.trees", slim_format=True)
hermaphrodites = []
for j in ts.samples():
    ind = ts.individual(ts.node(j).individual)
    if pyslim.decode_individual(ind.metadata).sex == -1:
        hermaphrodites.append(j)
sub_ts = ts.simplify(hermaphrodites)
pyslim.dump(sub_ts, "sub_slim.trees")

@petrelharp
Contributor Author

The only thing that I think we're lacking is some good way to tell if a given slim-annotated table collection (or tree sequence derived from one) has already been put through pyslim besides checking to see if there's anything in sites.metadata. Any ideas there @ashander?

@bhaller
Collaborator

bhaller commented Jun 29, 2018

Very interesting to see the actual use cases. Overall it looks very clean and logical; I like it a lot. One comment: it would be nice if pyslim defined a few constants, for sex values and perhaps for "WF" and "nonWF", and whatever else might come up; it would be better for users to use predefined constants than to hardcode values into their scripts. As far as how to tell if a tc/ts has already been put through pyslim, is that something that the provenance table could help with? Looks great so far, nice work @petrelharp.
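To make the provenance idea concrete, here's a hypothetical sketch of such a check. The helper name `has_pyslim_provenance` and the record layout are invented for illustration (not pyslim's actual API); it assumes provenance records are JSON strings with a `software` field, as in the tskit provenance schema.

```python
import json

def has_pyslim_provenance(provenance_records):
    """Return True if any provenance record claims to come from pyslim.

    `provenance_records` is a list of JSON strings, standing in for the
    record column of a table collection's provenance table.
    (Hypothetical helper; the real check would depend on exactly what
    pyslim writes when it annotates a file.)
    """
    for record in provenance_records:
        try:
            parsed = json.loads(record)
        except ValueError:
            continue  # not JSON, so not a pyslim-written record
        software = parsed.get("software", {})
        if software.get("name") == "pyslim":
            return True
    return False

# Example: a SLiM record, then a pyslim record appended on load.
records = [
    '{"software": {"name": "SLiM", "version": "3.0"}}',
    '{"software": {"name": "pyslim", "version": "0.1"}}',
]
print(has_pyslim_provenance(records))      # True
print(has_pyslim_provenance(records[:1]))  # False
```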

@bhaller
Collaborator

bhaller commented Jun 29, 2018

Ah, here's a question. You generally have the decode methods take metadata: pyslim.decode_individual(ind.metadata).sex for example. Is there a reason not to just have them take the object itself, from which they can fetch the metadata anyway? I.e., pyslim.decode_individual(ind).sex. This seems possibly more flexible, since it allows the decode methods to possibly provide information beyond what is in the metadata (in other fields of the object, or conceivably even harvested/analyzed/summarized from other linked objects). And in any case it is more compact for the user.

If the individual were passed in, then perhaps it would also be possible to collapse the different decode_X() methods into a single polymorphic decode() method that would decode whatever type of object it was passed...? I'm not sure whether that's a good idea or not, just throwing stuff out there...

@petrelharp
Contributor Author

Is there a reason not to just have them take the object itself, from which they can fetch the metadata anyway?

Yes - you could get the metadata in several ways, e.g., from the columns in the tables, or from iterating over the objects themselves:

for md in msprime.unpack_bytes(tables.nodes.metadata, tables.nodes.metadata_offset):
    print(pyslim.decode_node(md))

# or
for node in tables.nodes:
    print(pyslim.decode_node(node.metadata))

I don't think we should be opinionated about which of these is the right way to do it. I'm trying to keep things simple!

@petrelharp
Contributor Author

petrelharp commented Jun 29, 2018

it would be nice if pyslim defined a few constants, for sex values and perhaps for "WF" and "nonWF"

Yes, excellent point. I'm not sure where to find the list of possible values for these:

  • mutation type
  • genome type
  • sex
  • individual flags

@bhaller
Collaborator

bhaller commented Jun 29, 2018

Yes - you could get the metadata in several ways...

Ah, I see, makes sense.

I'm not sure where to find the list of possible values for these:

mutation type

Well, these are defined by the SLiM model. Mutation types don't get saved out to the .trees file, and the assumption is that appropriate matching mutation types will have been defined in SLiM by the time a .trees file is loaded. So I would just treat these as arbitrary integers, with a minimum value of 0.

genome type

Section 23.4.2 has these: 0 for autosome, 1 for X, 2 for Y, at present.

sex

Section 23.4.3: 0 for female, 1 for male, -1 for hermaphrodite

individual flags

Section 23.4.3 again: 0x01 represents an individual that migrated in the current generation, if set. That is the only defined flag at the moment.

I'm not sure whether the doc for all of those exists in the manual draft you have; it's a work in progress. But anyway, there they are. :->
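Collected as module-level constants, those values might look like the sketch below. The constant names here are invented for illustration (not necessarily what pyslim will end up exporting); the values are the ones listed above from Sections 23.4.2 and 23.4.3.

```python
# Genome types (Section 23.4.2 of the SLiM manual)
GENOME_TYPE_AUTOSOME = 0
GENOME_TYPE_X = 1
GENOME_TYPE_Y = 2

# Sexes (Section 23.4.3)
INDIVIDUAL_TYPE_HERMAPHRODITE = -1
INDIVIDUAL_TYPE_FEMALE = 0
INDIVIDUAL_TYPE_MALE = 1

# Individual flags (Section 23.4.3): a bit field with only one
# bit defined so far.
INDIVIDUAL_FLAG_MIGRATED = 0x01

def is_migrant(flags):
    """Check the migration bit of an individual's flags field."""
    return bool(flags & INDIVIDUAL_FLAG_MIGRATED)

print(is_migrant(0x01))  # True
print(is_migrant(0x00))  # False
```

Mutation type IDs, per the above, would stay as arbitrary integers >= 0 with no predefined constants.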

@petrelharp
Contributor Author

@bhaller points out that we should be writing out files like ts.dump(), not pyslim.dump(ts), since that's how it's done in msprime. So, I think this means we need to make a subclass.

@bhaller
Collaborator

bhaller commented Jun 29, 2018

So, if pyslim uses a subclass, and then ts.dump() calls pyslim's dump method, I assume there's a way to call the superclass implementation – msprime's dump method? Would this ever be desirable? Suppose one wanted, for some reason, to load a SLiM .trees file, strip all of SLiM's metadata from it, and save it out as a non-SLiM-compliant file (perhaps for use with some tool that would choke on the SLiM annotations); is that something that can be done in the present design? (This is not a priority for me, it just occurs to me as something we might want to think through.)

@petrelharp
Contributor Author

Suppose one wanted, for some reason, to load a SLiM .trees file, strip all of SLiM's metadata from it, and save it out as a non-SLiM-compliant file?

There is a way to call the parent method; but I think that we would provide a conversion method to/from SlimTreeSequence.

@petrelharp
Contributor Author

Update:

We want to avoid getting mixed up between tables that have and have not been SLiMified. So, here's my proposal:

  1. Write a class, SlimTreeSequence, that extends TreeSequence. When you make a SlimTreeSequence, it rebases the times, and prettifies the alleles.

  2. But, when you get the tables back out of a SlimTreeSequence, it always reverses those operations.

So, when we deal with tables, we're always dealing with the sort that SLiM output. Under the hood, a SlimTreeSequence has modified these, to make looking at the trees and haplotypes and things nicer, but that should be invisible to the user.
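Here's a toy illustration of that round trip, with plain Python lists standing in for the node table. The class name and the particular shift used (time_ago = slim_generation - slim_time) are illustrative assumptions, not necessarily the arithmetic pyslim will actually use; the point is only the "rebase on the way in, reverse on the way out" pattern.

```python
class SlimTreeSequenceSketch:
    """Toy model of the proposal: rebase node times when the tree
    sequence is constructed, and reverse the rebasing whenever the
    tables are extracted, so users always see SLiM-style tables.
    """

    def __init__(self, slim_node_times, slim_generation):
        self.slim_generation = slim_generation
        # Internally, store tskit-style times (measured backwards,
        # "time ago"), assuming SLiM times run forwards in generations.
        self._node_times = [slim_generation - t for t in slim_node_times]

    @property
    def node_times(self):
        # What the user sees when looking at the trees: time-ago values.
        return list(self._node_times)

    def tables_node_times(self):
        # Getting the tables back out reverses the rebasing, so the
        # tables are always in the form SLiM wrote them.
        return [self.slim_generation - t for t in self._node_times]

# Round trip: SLiM times in, identical SLiM times back out.
slim_times = [100, 99, 50, 0]
ts = SlimTreeSequenceSketch(slim_times, slim_generation=100)
print(ts.node_times)           # [0, 1, 50, 100]
print(ts.tables_node_times())  # [100, 99, 50, 0]
```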

@ashander
Contributor

That all sounds good. Nice idea from @bhaller to use the provenance table. Taking a closer look now

@ashander
Contributor

  • there's still that issue of .annotate() (and .annotate_tables()) doing more than the other annotates (i.e., setting defaults). @petrelharp's earlier idea of .annotate_default_metadata() seemed clearer, maybe

  • these magic version numbers should be constants up top

@petrelharp
Contributor Author

petrelharp commented Jun 29, 2018

there's still that issue of .annotate() (and .annotate_tables()) doing more than the other annotates

Good point. I've changed it to annotate_defaults() and annotate_defaults_tables() (the second name is unwieldy, but oh well).

And moved those versions up top.

@petrelharp
Contributor Author

I think the general framework is settled. Open new issues to discuss particular things further.
