Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API for 'finite_sites'? #880

Closed
jeromekelleher opened this issue Feb 5, 2020 · 18 comments · Fixed by #1115
Closed

API for 'finite_sites'? #880

jeromekelleher opened this issue Feb 5, 2020 · 18 comments · Fixed by #1115
Milestone

Comments

@jeromekelleher
Copy link
Member

With #862 merged it's now possible to simulate recombination in discrete or continuous coordinates by specifying the discrete parameter to RecombinationMap. This means we can have a "finite sites" or "infinite sites" model of recombination (although strictly, we would have to check and enforce the infinite sites recombination thing). It's probably a good idea to have a high-level API for this in simulate so we don't force users to specify a RecombinationMap if they want to simulate in discrete coordinates.

One idea is to have a new finite_sites=True argument which would apply to both recombination and mutation. We haven't implemented the mutation bit yet, but we could just ValueError for now if finite_sites is true and mutation_rate > 0. It seems to me that users would probably want finite sites semantics in both cases, and that this would be a user friendly option in the long run.

pinging @tskit-dev/all for opinions on this one.

@jeromekelleher jeromekelleher added this to the Release 1.0 milestone Feb 5, 2020
@hyanwong
Copy link
Member

hyanwong commented Feb 5, 2020

Do we want the resulting TS to "know" in some way that it is a "FiniteSitesTreeSequence"? This was mooted in tskit-dev/tskit#146 (comment) with no particular resolution.

@jeromekelleher
Copy link
Member Author

I don't see any value in FiniteSitesTreeSequence, it's just complexity for no gain. If a reference sequence exists, then tskit should use it and if not it should continue to behave as it does. That's a separate issue to generating mutations/recombinations under a finite sites model.

@hyanwong
Copy link
Member

hyanwong commented Feb 5, 2020

Sorry - I may have put that wrongly (or the comment in the issue I linked to is not strictly relevant to the title of the issue). I'm not thinking at all about reference sequences. The question is whether the TS should know, internally, whether it is a finite sites tree sequence. For example if you generate a TS with finite recombinations, and then call msprime.mutate(rate=0.1) on it, should it throw an error just as you suggest it should do when calling simulate(..., mutation_rate=0.1)?

@jeromekelleher
Copy link
Member Author

The question is whether the TS should know, internally, whether it is a finite sites tree sequence.

No, I don't think it should.

@molpopgen
Copy link
Member

No, I don't think it should.

Right, "finite-sites" recombination + infinitely-many sites mutation is actually key to Hudson's original formulation of the process, and also to various methods for simulating odd things like microsatellites.

@jeromekelleher
Copy link
Member Author

So we've gotten off the track here with the last few comments above which are actually about tskit.

The question is, should msprime have an option finite_sites=True which is a short cut for specifying a finite sites recombination model and a finite sites mutation model, or whether there should be a single finite_sites_recombination option. Or, we can keep it as it is now where finite sites recombination functionality is accessed via specifying the recombination map.

@molpopgen
Copy link
Member

molpopgen commented Feb 5, 2020 via email

@jeromekelleher
Copy link
Member Author

I was trying to address this second point--the mutation model and the recombination model have no necessary relationship.

True enough, but I'm just trying to make it easy to do what most people would want. By default, we have infinite sites both, and we can have either as we like (well, when we actually implement finite sites in mutations). It seems to me that most people, who are running simulations of (say) a human chromosome will want finite sites mutations. It's then pretty tedious writing finite_sites_mutation=True and finite_sites_recombination=True, when what you want is finite_sites=True. We can still provide the exact ms behaviour as well, if that's what's wanted.

@hyanwong
Copy link
Member

hyanwong commented Feb 6, 2020

The question is whether the TS should know, internally, whether it is a finite sites tree sequence.

No, I don't think it should.

Fine, but then I don't think that we should raise a warning or error when specifying mutation_rate>0 in a finite sites simulate() call, otherwise it is confusing that something is raised in this case, but will not be (because the TS doens't 'know' that it's a finite sites version) when doing msprime.mutate(rate=0.5)[1]. Anyway, as @molpopgen says, it is a reasonable thing want to specify a non finite sites mutation rate in this case.

On the original point, if we are using keyword-only arguments, could we have finite_sites_mutation and finite_sites_recombination boolean keyword parameters, and also finite_sites which is a shortcut for setting both at the same time? Or is that too complicated a system of parameters for such a frequently-used function?

[1] Does anyone else find it confusing that the parameter name in msprime.simulate is mutation_rate=XXX, whereas in msprime.mutate it is rate=XXX? I always find my fingers typing the wrong thing, and wondered if we could have mutation_rate as an alias for simply rate in the msprime.mutate function.

@gtsambos
Copy link
Member

gtsambos commented Feb 6, 2020

The question is, should msprime have an option finite_sites=True which is a short cut for specifying a finite sites recombination model and a finite sites mutation model, or whether there should be a single finite_sites_recombination option. Or, we can keep it as it is now where finite sites recombination functionality is accessed via specifying the recombination map.

I vote for the finite_sites=True option that controls both mutation and recombination, or at least for finite site recombination to be accompanied by finite site mutation by default, for the reason @jeromekelleher mentioned up top -- I think that most users would be choosing this option in a situation where they want to explicitly model a chromosome with discrete bps. I suspect there would be a lot of confusion and unwitting errors made if this wasn't the default behaviour. Perhaps an argument called discrete=True would be more fitting than finite_sites=True.

Right, "finite-sites" recombination + infinitely-many sites mutation is actually key to Hudson's original formulation of the process, and also to various methods for simulating odd things like microsatellites.

I hear you and wouldn't be opposed to there being some way for the user to have a finite sites recombination and infinite sites mutation model together, but I suspect that both of these uses are a bit more niche, so I don't think it should be the default behaviour.

@petrelharp
Copy link
Contributor

I like this. We should make it as easy for users to simulate in discrete coordinates as in continuous ones. I propose that the option is called discrete_coordinates (since finite_sites is a bit jargon-ey and discrete could refer to many other things).

@jeromekelleher
Copy link
Member Author

So we can't change the default behaviour @gtsambos, but you're right, we'll still support all combinations of discrete and continuous coordinates.

I like the discrete_coordinates idea @petrelharp, this is is much less opaque.

@petrelharp
Copy link
Contributor

With #946 merged we should do this.

jeromekelleher added a commit to mcveanlab/msprime that referenced this issue Aug 7, 2020
NOTE: This commit changes simulate to make all arguments after the first
keyword only. This is a good change to get in for 1.0.

Closes tskit-dev#880
@jeromekelleher
Copy link
Member Author

Should we change this to discrete_genome rather than discrete_coordinates @petrelharp? There's potential confusion here with spatial coordinates - who knows, some day we might have an SFLV model in here or something with individuals falling at continuous locations in space.

@hyanwong
Copy link
Member

discrete_loci perhaps? Or does that imply mutations at those sites?

jeromekelleher added a commit to mcveanlab/msprime that referenced this issue Aug 10, 2020
NOTE: This commit changes simulate to make all arguments after the first
keyword only. This is a good change to get in for 1.0.

Closes tskit-dev#880
jeromekelleher added a commit to mcveanlab/msprime that referenced this issue Aug 10, 2020
NOTE: This commit changes simulate to make all arguments after the first
keyword only. This is a good change to get in for 1.0.

Closes tskit-dev#880
@jeromekelleher
Copy link
Member Author

discrete_loci perhaps? Or does that imply mutations at those sites?

Yeah, that's good option too.

jeromekelleher added a commit to mcveanlab/msprime that referenced this issue Aug 10, 2020
NOTE: This commit changes simulate to make all arguments after the first
keyword only. This is a good change to get in for 1.0.

Closes tskit-dev#880
@petrelharp
Copy link
Contributor

Sounds good. I think I like discrete_genome more than discrete_loci (though at first I liked 'loci'), since a locus means so many different things (eg, it can mean a gene, or a 1Mb chunk of the genome if it's a QTL). discrete_genome_coordinates would be even more descriptive, but I don't think it needs to be so long.

jeromekelleher added a commit to mcveanlab/msprime that referenced this issue Aug 10, 2020
NOTE: This commit changes simulate to make all arguments after the first
keyword only. This is a good change to get in for 1.0.

Closes tskit-dev#880
@jeromekelleher
Copy link
Member Author

OK, discrete_genome it is. I'll update #1115 accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants