New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
General framework #1
Comments
p.s. @bhaller I see that the provenance table also records |
Creating subclasses seems like a lot of work to me @petrelharp, and would be brittle in the long run. Why not encode all this stuff into the relevant metadata slots? We could provide functions here that will encode and decode the metadata for (say) Individuals, so that a user could do something like ts = msprime.load("slim-example.trees")
for ind in ts.individuals():
slim_data = pyslim.decode_individual(ind.metadata)
print(slim_data.age, slim_data.pedigree_id, slim_data.sub_population_id) Ultimately, if the plan of attaching custom decoders to tree sequences works out, we could then do something like ts = msprime.load(
"slim-example.trees",
individual_metadata_decoder=pyslim.decode_individual)
for ind in ts.individuals():
print(ind.metadata.age, ind.metadata.pedigree_id) This decouples things nicely, and gives a lot of freedom for how the decoders work. You can use struct.pack if you like, and the user won't know or care. |
I agree with @jeromekelleher that making subclasses seems problematic. I think my Python chops aren't good enough to know what the right solution is, though. Jerome, in your code above, what sort of beast exactly is @petrelharp, The variable-length migration metadata for the subpopulations will doubtless be the most annoying part of the implementation; sorry about that. :-> By the way, the metadata format is also documented now in section 23.4 of the SLiM 3 manual. :-> |
Oh, and one other thought: getting/setting particular metadata values is great, obviously, but I also hope there will be a convenience method that makes it possible, with a single call, to annotate the result of an msprime coalescent simulation in a default manner that makes it importable into SLiM. Just grouping pairs of nodes into individuals, putting all the individuals into one population, setting the selection coefficient of all individuals to zero, setting all spatial positions and spatial bounds to zero, setting all nodes as representing non-null autosomes (unless msprime supports sex chromosomes?), setting all mutation type IDs to 1 by default (this could be an option), etc. Individual IDs and genome IDs could be assigned sequentially from zero, I think, with the genome IDs obeying the 2*individual_id + [0/1] rule. Set the population metadata to sensible values: selfing and cloning rates of 0, sex ratio of 0.5 (unused if it turns out to be a non-sexual model), no migration records, etc. And don't forget the sex values set in flags, by the way. Set everybody to hermaphrodite (male | female) for non-sexual models, set males and females otherwise. I imagine whether the model is hermaphroditic or sexual would be a flag passed to this convenience method; perhaps a sex ratio could be supplied too. The only other tricky thing is that nonWF models expect ages to all be -1, whereas for WF models -1 is an illegal value. So the age of individuals ought to be a parameter too. So something like: pyslim.annotate(age=-1, sexual=false, sex_ratio=0.5, mutation_type_id=1) ? That is not in the proper Python syntax since I'm not PyLiterate, but you know what I mean. :-> If the user wanted to do something more unusual they could manually set up all the metadata, but this would handle 90% of use cases, probably. For now it could probably assume SLiM version 3.0 and file version 0.1; parameters for those could be added later if that seemed like a good idea, but probably pyslim should just always target the latest version of SLiM. (whether it ought to be able to read metadata from older versions is a bridge we can cross later.) I'm not sure what the generation number ought to be. It would be nice for it to be inferred from the height of the tree, I guess? But it might also be nice for the user to be able to set it to whatever value they want, and have the coalescent tree extend backward in time from the chosen endpoint. So maybe that's also a parameter? Anyway, sorry for the verbosity, I'll be quiet now. :-> |
These would just be simple Python classes: import attrs
import struct
@attr.s
class IndividualMetadata(object):
age = attr.ib()
pedigree_id = attr.ib()
def decode_individual(buff):
age, pedigree_id = struct.unpack("<ii", buff)
return IndividualMetadata(age=age, pedigree_id=pedigree_id) I've used attrs here to keep it simple. I'm assuming that the metadata has been packed as to 32bit little-endian integers here (this code might not work, I haven't actually run it). I would literally leave it at that though. This code will perform perfectly well enough for the vast majority of cases --- I'd need to see some strong evidence showing that it was too slow for some important application before I'd personally consider making anything more complicated. Encoding the metadata as binary has advantages and disadvantages. It's fast and simple (easy to produce and consume on the C++ side), but there are versioning issues when we want to change the format over time. Still, if you're only adding new items over time then it's actually not too bad. Re. encoding the information for SLiM, I don't know enough about how SLiM is consuming the information to provide any help here. |
Current proposal:
|
I'm guessing that's backwards - nonWF has positive ages, and WF has age = -1? And, is it an error for WF indivis to have age not equal to -1? Otherwise, I'll just set ages to 0 unless it's passed in otherwise. |
Backwards, yes. And yes, it's an error for WF individuals to have age not equal to -1, at present. That restriction could be relaxed, though, if it creates substantial trouble. In particular, it would not be a problem to allow an age of 0 instead, specifically, which would get corrected to -1 upon loading. |
Questions: for @bhaller mostly.
|
Note: currently, the code seems to work, but requires a small patch to msprime to allow binary population metadata. |
Hi @petrelharp. Answers:
Re: patch to msprime, that's rather unfortunate, since it means SLiM 3 and pyslim won't be compatible with the release version of msprime, no? A shame this wasn't found earlier. |
Great.
OK; I'll figure out something hack-ey to do in the meantime that will pull out the metadata and hide it elsewhere; I've make a pull request to msprime to discuss this. |
Ah-ha. Makes sense. Could you provide some more guidance on this? Should each unique mutation just be assigned integer IDs starting from 0? And: site positions are floats; are you saying that I should actually modify the site table, replacing the positions with rounded positions, and then merge mutations at sites that, as a result, are now the same? |
I think that it should not demand this, since if you do other things to the .trees file with other programs before reading it back into SLiM, they will add their own provenance entries. I'd say just read back up from the bottom until you hit a SLiM entry. |
Having read through that, yes, we decided to move the sex info into metadata, but it didn't happen. The sex flags in msprime should be removed or moved to a private header, and SLiM should add sex to the metadata for individuals. I'll do that today. Ugh.
Integer IDs from 0 should be fine, as long as they don't collide with any IDs for mutations already present of course; keep in mind that the .trees file read in may already contain mutations. But you can re-assign the IDs of those if it is convenient to do so. Re: sites and merging and stacking, yes, I think that's what I'm saying. SLiM compliance implies integer site positions, and integer site positions implies stacking, I think. I don't think we want to start relaxing the compliance rules to make pyslim's life easier (sorry!); pyslim should write out files that SLiM cannot tell were not generated by SLiM itself, otherwise we're effectively allowing two different import formats, and bugs will inevitably result. Better to be strict.
OK. Perhaps pyslim ought to generate a SLiM entry and then a pyslim entry, then, rather than the other way around? To say that it is the generator of the file, in the end? I'll fix SLiM to use the last SLiM entry found, that seems fine. Users processing .trees files through other software and then back into SLiM will have to know what they're doing anyway. |
Catching up here and providing some comments on @petrelharp 's current proposal here are some comments on the items in the API
So I think I agree that the stuff in |
I think users of pyslim will want to decode and encode; they will want to get SLiM-specific metadata out of the tree sequence, for analysis or other purposes, and will want to set up that metadata with the intention of then importing the .trees file back into SLiM with whatever changes they made. So I think that stuff does need to be public (and so SlimTreeSequence probably does too?). I agree that perhaps there is naming confusion between annotate_X() and annotate(), but I'm not sure initialize() gets my vote; seems rather vague (initialize what, how?). Maybe a method name that more explicitly conveys what the method does? |
At least in the current state, no - it should look just like an msprime.TreeSequence. How does a private class work?
At first I thought I agreed with @bhaller - they are necessary, because that's how people will get information out of the metadata, by doing e.g.:
But, we could alternatively just provide two functions, I guess I still think that
Good point; maybe it should be |
On 1. For the example, I think you saw 8d086a0 #4 doing things this way makes clear a subclass of TreeSequence isn't really being used at all. (Tho this is an implementation detail and not what we're meant to discuss here :) On 2. Heh yes I see I didn't really finish my thinking here --- to provide the kind of encapsulation I was envisioning requires the second method of @petrelharp That is to say: with |
Since I'm not good in Python, I'll bow out; it sounds like you guys are on top of the relevant issues. :-> |
I've switched things over to have (and use) symmetric
|
Observations:
So, the route to making SLiM-readable tables is always through a tree sequence, which is a bit annoying, because if you'd started with SLiM tables then you could just modify these directly and not have to do the inverse operations above. This could be confusing to the user: it would be quite easy to load a .trees file with pyslim, but then write one back out with msprime, or vice-versa. Making a subclass to TreeSequence would avoid this problem... although maybe what we really want to subclass is TableCollection. Summarizing:
|
Huh. So why not start from tables, then, instead, if that would be the simpler design? I'm not sure I understand the issue here.
I thought you did plan to subclass TreeSequence, just with a private subclass. Given that, how would the user do what you say – load with pyslim but write with msprime, or vice versa? If they load with pyslim, they end up with a pyslim subclass, which then writes in the pyslim way; and if they load with msprime, they end up with an msprime TreeSequence, which they would have no way of writing out with pyslim. No? Probably I just don't understand Python...? The rest all sounds quite reasonable to me. |
If a private subclass ultimately makes it easier to separate concerns of a
module user from internals of Slim Tree Sequences then we should still do
it.
(I'm not sure it does btw. Peter and I just discussed how making this
package opinionated about what flavor of Tree Sequence the python user
interacts with should make things simpler. If we're still in agreement on
that, then we should go with whatever internal architecture makes it happen
smoothly =)
|
OK. So how, in this design, do you prevent the user from loading with pyslim and saving with msprime, or vice versa? I guess the "vice versa" case is prevent by the "opinionated" policy you mention; pyslim would refuse to write an msprime tree sequence? How is the first case prevented? I'm also curious, not knowing much about python: if you don't subclass TreeSequence, where does pyslim store all of the extra data associated with tree sequences that it reads in? |
I've changed my mind again. More observations:
... which, code-wise, is just fine, but it does feel awkward to do all that tree sequence -> tables -> tree sequence -> tables -> ... for no reason.
So, here's my revised proposal.
Use cases:
|
The only thing that I think we're lacking is some good way to tell if a given slim-annotated table collection (or tree sequence derived from one) has already been put through pyslim besides checking to see if there's anything in sites.metadata. Any ideas there @ashander? |
Very interesting to see the actual use cases. Overall it looks very clean and logical; I like it a lot. One comment: it would be nice if pyslim defined a few constants, for sex values and perhaps for "WF" and "nonWF", and whatever else might come up; it would be better for users to use predefined constants than to hardcode values into their scripts. As far as how to tell if a tc/ts has already been put through pyslim, is that something that the provenance table could help with? Looks great so far, nice work @petrelharp. |
Ah, here's a question. You generally have the decode methods take metadata: If the individual were passed in, then perhaps it would also be possible to collapse the different |
Yes - you could get the metadata in several ways, e.g., from the columns in the tables, or from iterating over the objects themselves:
This isn't something I think we should be opinionated about the right way to do it. I'm trying to keep things simple! |
Yes, excellent point. I'm not sure where to find the list of possible values for these:
|
Ah, I see, makes sense.
Well, these are defined by the SLiM model. Mutation types don't get saved out to the .trees file, and the assumption is that appropriate matching mutation types will have been defined in SLiM by the time a .trees file is loaded. So I would just treat these as arbitrary integers, with a minimum value of 0.
Section 23.4.2 has these: 0 for autosome, 1 for X, 2 for Y, at present.
Section 23.4.3: 0 for female, 1 for male, -1 for hermaphrodite
Section 23.4.3 again: 0x01 represents an individual that migrated in the current generation, if set. That is the only defined flag at the moment. I'm not sure whether the doc for all of those exists in the manual draft you have; it's a work in progress. But anyway, there they are. :-> |
@bhaller points out that we should be writing out files like |
So, if |
There is a way to call the parent method; but I think that we would provide a conversion method to/from SlimTreeSequence. |
Update 5,467: We want to avoid getting mixed up between tables that have and have not been SLiMified. So, here's my proposal:
So, when we deal with tables, we're always dealing with the sort that SLiM output. Under the hood, a SlimTreeSequence has modified these, to make looking at the trees and haplotypes and things nicer, but that should be invisible to the user. |
That all sounds good. Nice idea from @bhaller to use the provenance table. Taking a closer look now |
|
Good point. I've changed it to And moved those versions up top. |
I think the general framework is settled. Open new issues to discuss particular things further. |
This package needs to deal with the extra information that SLiM stores in a tree sequence, reading it easily, and creating it in tree sequences lacking it. These are defined here and are:
Individuals:
Nodes:
Mutations:
Population:
There's one (at least?) bit of additional tree-sequence-global information:
generation
: the current generation of the simulationProposed method #1:
Define classes like
SlimNodeMetadata
, etcetera, to hold the information above. This would not be a tuple containing the information for a given node, it would have numpy vectors of the information for all nodes, so thatSlimNodeMetadata.type
would give you a vector of the same length as the node table it came from, with the type for every node. (These are basically new tables, but we won't call them that.) Define aSlimMetadata
class to hold this information, as well asgeneration
, the current generation of the simulation (obtained from provenance).Extend the
TreeSequence
class toSlimTreeSequence
to include a SlimMetadata object. The metadata object will be immutable, as are the metadata columns they are obtained from.Provide a method that allows creating a new SlimTreeSequence with a different SlimMetadata object.
The above suffices, but to make some things easier, we might want to also:
Individual
,Node
,Mutation
, andPopulation
to include the above attributes (e.g., creatingSlimIndividual
);SlimTreeSequence.individuals()
and.individual()
methods to returnSlimIndividual
s, and similarly for.nodes()
,.mutations()
, and.populations()
.How's this look, roughly, @jeromekelleher, @bhaller?
The text was updated successfully, but these errors were encountered: