Optional genomic-scale for popgen stats #893

timothymillar · 2022-08-17T22:15:35Z

See discussion in #888

Most methods within popgen.py are automatically applied over windows and default to variants if the dataset has not been windowed. It would be useful to optionally apply these methods over the whole genome and potentially per contig. Additionally, some methods such as identity_by_state and Weir_Goudet_beta are currently applied to the whole genome, but it would also be useful to apply them within windows/contigs.

An option to allow more user control, whist keeping sensible defaults, would be to introduce a by parameter which can be used to specify the genomic scale (e.g., one of {"genome", "windows", "variants"}). The specified scale would determine the genomic-dimension and shape of the resulting values.

The text was updated successfully, but these errors were encountered:

timothymillar · 2022-08-17T22:38:09Z

A consideration is whether specifying by="genome" should result in a dimension of length 1, or the absence of a dimension.

A dimension of length 1 would potentially simplify some things programmatically, but may be confusing to users. It may also complicate post hoc generalization of methods such as identity_by_state where no genomic-dimension was included in the original implementation (e.g., requiring breaking changes to the schema).

The absence of a genomic-dimension when by="genome" would require that returned variables have variable dimensionality in the schema (already an option). This may be result in some awkward edge-cases but it probably(?) maps better to user expectations and simplifies post hoc generalization of methods. Furthermore, it shouldn't be necessary to allow all options within {"genome", "windows", "variants"} where those options aren't sensible.

jeromekelleher · 2022-08-25T08:36:01Z

The approach we've taken in tskit to "extra" dimensions question is drop them by default where possible, but to always keep them if the user has specified explicit windows (or anything else that changes the output dimensions). That seemed like a good balance between doing the "right" thing by default, but also giving the user full control and not changing the number of dimensions arbitrarily when the user changes the number of windows.

Would this general approach map here?

jeromekelleher · 2022-08-25T08:36:26Z

In general, the by= argument seems like an excellent idea to me.

timothymillar · 2022-08-25T09:49:19Z

That makes sense

but to always keep them if the user has specified explicit windows

In this context would by="genome" be considered specifying an explicit window, or specifying no window?

jeromekelleher · 2022-08-25T10:01:46Z

Maybe we could sketch out a few different usages here with mixtures of default and non-default options to see what the most sensible approach is? I'm having trouble seeing exactly how the by argument will be used in conjunction with other args.

tomwhite · 2022-09-08T15:44:33Z

I've started on one possible implementation here, just to see what it looks like: https://github.com/tomwhite/sgkit/tree/by-genome

In this context would by="genome" be considered specifying an explicit window, or specifying no window?

No window.

>>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(
...     n_variant=3000,
...     n_sample=12,
...     n_ploidy=1,
...     n_allele=2,
...     n_contig=2,
...     seed=3,
...      missing_pct=0.1,
... ).chunk(1000)
>>> ds["sample_cohort"] = ("samples", np.array([0, 1, 2, 3] * 3))
>>> 
>>> # default is by variant if not windowed
>>> div = sg.divergence(ds, merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (variants: 3000, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: variants, cohorts_0, cohorts_1
Data variables:
    stat_divergence  (variants, cohorts_0, cohorts_1) float64 dask.array<chunksize=(1000, 4, 4), meta=np.ndarray>
>>> 
>>> # specify by="genome" for genome-wide stats
>>> div = sg.divergence(ds, by="genome", merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: cohorts_0, cohorts_1
Data variables:
    stat_divergence  (cohorts_0, cohorts_1) float64 dask.array<chunksize=(4, 4), meta=np.ndarray>
>>> 
>>> # now let's try with a windowed dataset
>>> ds = sg.window_by_variant(ds, size=25)
>>> 
>>> # default is by window if windowed
>>> div = sg.divergence(ds, merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (windows: 120, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: windows, cohorts_0, cohorts_1
Data variables:
    stat_divergence  (windows, cohorts_0, cohorts_1) float64 dask.array<chunksize=(40, 4, 4), meta=np.ndarray>
>>> 
>>> # specify by="genome" for genome-wide stats (even if windowed)
>>> div = sg.divergence(ds, by="genome", merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: cohorts_0, cohorts_1
Data variables:
    stat_divergence  (cohorts_0, cohorts_1) float64 dask.array<chunksize=(4, 4), meta=np.ndarray>
>>> 
>>> # specify by="variant" for per-variant stats (even if windowed)
>>> div = sg.divergence(ds, by="variant", merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (variants: 3000, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: variants, cohorts_0, cohorts_1
Data variables:
    stat_divergence  (variants, cohorts_0, cohorts_1) float64 dask.array<chunksize=(1000, 4, 4), meta=np.ndarray>

An alternative would be to have a window_by_genome function (as discussed here, example branch: https://github.com/tomwhite/sgkit/tree/window-by-genome), then we wouldn't need the by argument. The windows dimension would be kept in this case (and be of size one). Implementation-wise it needs special-casing, because we don't support such large windows (as discussed in the above issue).

>>> # window_by_genome for genome-wide stats
>>> ds = sg.window_by_genome(ds)
>>> div = sg.divergence(ds, merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (windows: 1, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: windows, cohorts_0, cohorts_1
Data variables:
    stat_divergence  (windows, cohorts_0, cohorts_1) float64 dask.array<chunksize=(1, 4, 4), meta=np.ndarray>

Interested to hear people's thoughts.

[BTW The naming of the existing window_by_variant seems a little ambiguous or misleading, since it doesn't produce a window for each variant, whereas window_by_genome produces a window for the genome. Perhaps it should be window_by_variant_position.]

timothymillar · 2022-09-12T00:34:38Z

This looks good @tomwhite, I personally prefer the first option.

I wonder if we should use a more specific argument name rather than by, just in case we end up wanting a similar option for another dimension in future. Maybe a pattern of <dimention>_by so variants_by="window" which leaves samples_by="cohort" as a future possibility?

tomwhite · 2022-11-01T11:29:51Z

Looking at this again I think I prefer the second option. The by argument (first option) has a complex interaction with windowing, which I feel might be confusing for users.

The following table shows how statistics are grouped (per variant, per window, whole genome) depending on the value of the by argument, and if the dataset is windowed or not:

`by`	Not windowed	Windowed
`None` (default)	per variant	per window
`genome`	whole genome	whole genome
`window`	ERROR	per window
`variant`	per variant	per variant

With the second option there is no by argument, and genome-wide stats are treated as another way of specifying a window (by calling window_by_genome(ds)).

The second option is simpler to introduce into the codebase, since it just needs the window_by_genome function. The first option needs every stats method (divergence, Fst, etc) to be changed to support the by logic. (This isn't necessarily a show-stopper, but I thought I'd mention it.)

jeromekelleher · 2022-11-01T15:53:42Z

SGTM @tomwhite, I'll cast my vote for adding the window_by_genome function then.

timothymillar · 2022-11-01T20:03:23Z

SGTM also, but this may require some re-working of the identity_by_state and Weir_Goudet_beta methods if/when windowing is required.

timothymillar mentioned this issue Sep 14, 2022

Include dimension names is variable specifications #904

Closed

tomwhite mentioned this issue Nov 2, 2022

Add window_by_genome #945

Merged

mergify bot closed this as completed in #945 Nov 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optional genomic-scale for popgen stats #893

Optional genomic-scale for popgen stats #893

timothymillar commented Aug 17, 2022

timothymillar commented Aug 17, 2022 •

edited

Loading

jeromekelleher commented Aug 25, 2022

jeromekelleher commented Aug 25, 2022

timothymillar commented Aug 25, 2022

jeromekelleher commented Aug 25, 2022

tomwhite commented Sep 8, 2022

timothymillar commented Sep 12, 2022

tomwhite commented Nov 1, 2022

jeromekelleher commented Nov 1, 2022

timothymillar commented Nov 1, 2022

Optional genomic-scale for popgen stats #893

Optional genomic-scale for popgen stats #893

Comments

timothymillar commented Aug 17, 2022

timothymillar commented Aug 17, 2022 • edited Loading

jeromekelleher commented Aug 25, 2022

jeromekelleher commented Aug 25, 2022

timothymillar commented Aug 25, 2022

jeromekelleher commented Aug 25, 2022

tomwhite commented Sep 8, 2022

timothymillar commented Sep 12, 2022

tomwhite commented Nov 1, 2022

jeromekelleher commented Nov 1, 2022

timothymillar commented Nov 1, 2022

timothymillar commented Aug 17, 2022 •

edited

Loading