Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional genomic-scale for popgen stats #893

Closed
timothymillar opened this issue Aug 17, 2022 · 10 comments · Fixed by #945
Closed

Optional genomic-scale for popgen stats #893

timothymillar opened this issue Aug 17, 2022 · 10 comments · Fixed by #945

Comments

@timothymillar
Copy link
Collaborator

See discussion in #888

Most methods within popgen.py are automatically applied over windows and default to variants if the dataset has not been windowed. It would be useful to optionally apply these methods over the whole genome and potentially per contig. Additionally, some methods such as identity_by_state and Weir_Goudet_beta are currently applied to the whole genome, but it would also be useful to apply them within windows/contigs.

An option to allow more user control, whist keeping sensible defaults, would be to introduce a by parameter which can be used to specify the genomic scale (e.g., one of {"genome", "windows", "variants"}). The specified scale would determine the genomic-dimension and shape of the resulting values.

@timothymillar
Copy link
Collaborator Author

timothymillar commented Aug 17, 2022

A consideration is whether specifying by="genome" should result in a dimension of length 1, or the absence of a dimension.

A dimension of length 1 would potentially simplify some things programmatically, but may be confusing to users. It may also complicate post hoc generalization of methods such as identity_by_state where no genomic-dimension was included in the original implementation (e.g., requiring breaking changes to the schema).

The absence of a genomic-dimension when by="genome" would require that returned variables have variable dimensionality in the schema (already an option). This may be result in some awkward edge-cases but it probably(?) maps better to user expectations and simplifies post hoc generalization of methods. Furthermore, it shouldn't be necessary to allow all options within {"genome", "windows", "variants"} where those options aren't sensible.

@jeromekelleher
Copy link
Collaborator

The approach we've taken in tskit to "extra" dimensions question is drop them by default where possible, but to always keep them if the user has specified explicit windows (or anything else that changes the output dimensions). That seemed like a good balance between doing the "right" thing by default, but also giving the user full control and not changing the number of dimensions arbitrarily when the user changes the number of windows.

Would this general approach map here?

@jeromekelleher
Copy link
Collaborator

In general, the by= argument seems like an excellent idea to me.

@timothymillar
Copy link
Collaborator Author

That makes sense

but to always keep them if the user has specified explicit windows

In this context would by="genome" be considered specifying an explicit window, or specifying no window?

@jeromekelleher
Copy link
Collaborator

Maybe we could sketch out a few different usages here with mixtures of default and non-default options to see what the most sensible approach is? I'm having trouble seeing exactly how the by argument will be used in conjunction with other args.

@tomwhite
Copy link
Collaborator

tomwhite commented Sep 8, 2022

I've started on one possible implementation here, just to see what it looks like: https://github.com/tomwhite/sgkit/tree/by-genome

In this context would by="genome" be considered specifying an explicit window, or specifying no window?

No window.

>>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(
...     n_variant=3000,
...     n_sample=12,
...     n_ploidy=1,
...     n_allele=2,
...     n_contig=2,
...     seed=3,
...      missing_pct=0.1,
... ).chunk(1000)
>>> ds["sample_cohort"] = ("samples", np.array([0, 1, 2, 3] * 3))
>>> 
>>> # default is by variant if not windowed
>>> div = sg.divergence(ds, merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (variants: 3000, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: variants, cohorts_0, cohorts_1
Data variables:
    stat_divergence  (variants, cohorts_0, cohorts_1) float64 dask.array<chunksize=(1000, 4, 4), meta=np.ndarray>
>>> 
>>> # specify by="genome" for genome-wide stats
>>> div = sg.divergence(ds, by="genome", merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: cohorts_0, cohorts_1
Data variables:
    stat_divergence  (cohorts_0, cohorts_1) float64 dask.array<chunksize=(4, 4), meta=np.ndarray>
>>> 
>>> # now let's try with a windowed dataset
>>> ds = sg.window_by_variant(ds, size=25)
>>> 
>>> # default is by window if windowed
>>> div = sg.divergence(ds, merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (windows: 120, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: windows, cohorts_0, cohorts_1
Data variables:
    stat_divergence  (windows, cohorts_0, cohorts_1) float64 dask.array<chunksize=(40, 4, 4), meta=np.ndarray>
>>> 
>>> # specify by="genome" for genome-wide stats (even if windowed)
>>> div = sg.divergence(ds, by="genome", merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: cohorts_0, cohorts_1
Data variables:
    stat_divergence  (cohorts_0, cohorts_1) float64 dask.array<chunksize=(4, 4), meta=np.ndarray>
>>> 
>>> # specify by="variant" for per-variant stats (even if windowed)
>>> div = sg.divergence(ds, by="variant", merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (variants: 3000, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: variants, cohorts_0, cohorts_1
Data variables:
    stat_divergence  (variants, cohorts_0, cohorts_1) float64 dask.array<chunksize=(1000, 4, 4), meta=np.ndarray>

An alternative would be to have a window_by_genome function (as discussed here, example branch: https://github.com/tomwhite/sgkit/tree/window-by-genome), then we wouldn't need the by argument. The windows dimension would be kept in this case (and be of size one). Implementation-wise it needs special-casing, because we don't support such large windows (as discussed in the above issue).

>>> # window_by_genome for genome-wide stats
>>> ds = sg.window_by_genome(ds)
>>> div = sg.divergence(ds, merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (windows: 1, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: windows, cohorts_0, cohorts_1
Data variables:
    stat_divergence  (windows, cohorts_0, cohorts_1) float64 dask.array<chunksize=(1, 4, 4), meta=np.ndarray>

Interested to hear people's thoughts.

[BTW The naming of the existing window_by_variant seems a little ambiguous or misleading, since it doesn't produce a window for each variant, whereas window_by_genome produces a window for the genome. Perhaps it should be window_by_variant_position.]

@timothymillar
Copy link
Collaborator Author

This looks good @tomwhite, I personally prefer the first option.

I wonder if we should use a more specific argument name rather than by, just in case we end up wanting a similar option for another dimension in future. Maybe a pattern of <dimention>_by so variants_by="window" which leaves samples_by="cohort" as a future possibility?

@tomwhite
Copy link
Collaborator

tomwhite commented Nov 1, 2022

Looking at this again I think I prefer the second option. The by argument (first option) has a complex interaction with windowing, which I feel might be confusing for users.

The following table shows how statistics are grouped (per variant, per window, whole genome) depending on the value of the by argument, and if the dataset is windowed or not:

by Not windowed Windowed
None (default) per variant per window
genome whole genome whole genome
window ERROR per window
variant per variant per variant

With the second option there is no by argument, and genome-wide stats are treated as another way of specifying a window (by calling window_by_genome(ds)).

The second option is simpler to introduce into the codebase, since it just needs the window_by_genome function. The first option needs every stats method (divergence, Fst, etc) to be changed to support the by logic. (This isn't necessarily a show-stopper, but I thought I'd mention it.)

@jeromekelleher
Copy link
Collaborator

SGTM @tomwhite, I'll cast my vote for adding the window_by_genome function then.

@timothymillar
Copy link
Collaborator Author

SGTM also, but this may require some re-working of the identity_by_state and Weir_Goudet_beta methods if/when windowing is required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants