Skip to content

Latest commit

 

History

History
391 lines (281 loc) · 17.2 KB

getting_started.rst

File metadata and controls

391 lines (281 loc) · 17.2 KB

Getting Started

You can install sgkit with pip. Python 3.10, 3.11, or 3.12 is required:

$ pip install sgkit

Alternatively, you can use conda:

$ conda install -c conda-forge sgkit

Sgkit is a general purpose toolkit for quantitative and population genetics. The primary goal of sgkit is to take advantage of powerful tools in the PyData ecosystem to facilitate interactive analysis of large-scale datasets. The main libraries we use are:

  • Xarray: N-D labeling for arrays and datasets
  • Dask: Parallel computing on chunked arrays
  • Zarr: Storage for chunked arrays
  • Numpy: N-D in-memory array manipulation
  • Pandas: Tabular data frame manipulation
  • CuPy: Numpy-like array interface for CUDA GPUs

Sgkit uses Xarray Dataset objects to model genetic data. Users are free to manipulate quantities within these objects as they see fit and a set of conventions for variable names, dimensions and underlying data types is provided to aid in workflow standardization. The example below illustrates a Dataset format that would result from an assay expressible as VCF, PLINK or BGEN. This is a guideline however, and a Dataset seen in practice might include many more or fewer variables and dimensions.

_static/data-structures-xarray.jpg

The worked examples below show how to access and visualize data like this using Xarray. They also demonstrate several of the sgkit conventions in place for representing common genetic quantities.

.. ipython:: python

    import sgkit as sg
    ds = sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=250, n_contig=23, missing_pct=.1)
    ds

The presence of a single-nucleotide variant (SNV) is indicated above by the call_genotype variable, which contains an integer value corresponding to the index of the associated allele present (i.e. index into the variant_allele variable) for a sample at a given genome coordinate and chromosome. Every sgkit variable has a set of fixed semantics like this. For more information on this specific variable and any others, see :ref:`genetic_variables`.

Data subsets can be accessed as shown here, using the features described in Indexing and Selecting Xarray Data:

.. ipython:: python

    import sgkit as sg
    ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, n_contig=23, missing_pct=.1)

    # Subset the entire dataset to the first 10 variants/samples
    ds.isel(variants=slice(10), samples=slice(10))

    # Subset to a specific set of variables
    ds[['variant_allele', 'call_genotype']]

    # Extract a single variable
    ds.call_genotype[:3, :3]

    # Access the array underlying a single variable (this would return dask.array.Array if chunked)
    ds.call_genotype.data[:3, :3]

    # Access the alleles corresponding to the calls for the first variant and sample
    allele_indexes = ds.call_genotype[0, 0]
    allele_indexes

    ds.variant_allele[0, allele_indexes]

    # Get a single item from an array as a Python scalar
    ds.sample_id.item(0)

Larger subsets of data can be visualized and/or summarized through various sgkit utilities as well as the Pandas/Xarray integration:

.. ipython:: python

    import sgkit as sg
    ds = sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=250, missing_pct=.1)

    # Show genotype calls with domain-specific display logic
    sg.display_genotypes(ds, max_variants=8, max_samples=8)

    # A naive version of the above is also possible using only Xarray/Pandas and
    # illustrates the flexibility that comes from being able to transition into
    # and out of array/dataframe representations easily
    (ds.call_genotype[:5, :5].to_series()
        .unstack().where(lambda df: df >= 0, None).fillna('.')
        .astype(str).apply('/'.join, axis=1).unstack())

    # Show call rate distribution for each variant using Pandas
    df = ~ds.call_genotype_mask.to_dataframe()
    df.head(5)

    call_rates = df.groupby('variants').mean()
    call_rates

    @savefig call_rate_example.png width=6in height=3in
    call_rates.plot(kind='hist', bins=24, title='Call Rate Distribution', figsize=(6, 3))

This last example alludes to representations of missing data that are explained further in :ref:`missing_data`.

Genetic methods in sgkit are nearly always applied to individual Dataset objects. For a full list of available methods, see :ref:`api_methods`.

In this example, the variant_stats method is applied to a dataset to compute a number of statistics across samples for each individual variant:

.. ipython:: python

    import sgkit as sg
    ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, missing_pct=.1)
    sg.variant_stats(ds, merge=False)

There are two ways that the results of every function are handled -- either they are merged with the provided dataset or they are returned in a separate dataset. See :ref:`dataset_merge` for more details.

Missing data in sgkit is represented using a sentinel value within data arrays (-1 in integer arrays and NaN in float arrays) as well as a companion boolean mask array (True where data is missing). These sentinel values are handled transparently in most sgkit functions and where this isn't possible, limitations related to it are documented along with potential workarounds.

This example demonstrates one such function where missing calls are ignored:

.. ipython:: python

    import sgkit as sg
    ds = sg.simulate_genotype_call_dataset(n_variant=1, n_sample=4, n_ploidy=2, missing_pct=.3, seed=4)
    ds.call_genotype

    # Here, you can see that the missing calls above are not included in the allele counts
    sg.count_variant_alleles(ds).variant_allele_count


A primary design goal in sgkit is to facilitate ad hoc analysis. There are many useful functions in the library but they are not enough on their own to accomplish many analyses. To that end, it is often helpful to be able to handle missing data in your own functions or exploratory summaries. Both the sentinel values and the boolean mask array help make this possible, where the sentinel values are typically more useful when implementing compiled operations and the boolean mask array is easier to use in a higher level API like Xarray or Numpy. Only advanced users would likely ever need to worry about compiling their own functions (see :ref:`custom_computations` for more details). Using Xarray functions and the boolean mask is generally enough to accomplish most tasks, and this mask is often more efficient to operate on due to its high on-disk compression ratio. This example shows how it can be used in the context of doing something simple like counting heterozygous calls:

.. ipython:: python

    import sgkit as sg
    import xarray as xr
    ds = sg.simulate_genotype_call_dataset(n_variant=1, n_sample=4, n_ploidy=2, missing_pct=.2, seed=2)
    # This array contains the allele indexes called for a sample
    ds.call_genotype

    # This array represents only locations where the above calls are missing
    ds.call_genotype_mask

    # Determine which calls are heterozygous
    is_heterozygous = (ds.call_genotype[..., 0] != ds.call_genotype[..., 1])
    is_heterozygous

    # Count the number of heterozygous samples for the lone variant
    is_heterozygous.sum().item(0)

    # This is almost correct except that the calls for the first sample aren't
    # really heterozygous, one of them is just missing.  Conditional logic like
    # this can be used to isolate those values and replace them in the result:
    xr.where(ds.call_genotype_mask.any(dim='ploidy'), False, is_heterozygous).sum().item(0)

    # Now the result is correct -- only the third sample is heterozygous so the count should be 1.
    # This how many sgkit functions handle missing data internally:
    sg.variant_stats(ds).variant_n_het.values.item(0)

It is common to compute statistics in windows along the genome. Some :ref:`api_methods` in sgkit are "windowing aware" and will compute values for windows defined in a dataset. If no windows are defined then the values will typically be computed for each variant. It is therefore important to define windows before computing statistics on a dataset.

Windows are intervals that span the variants dimension in a dataset, and they are defined using the :func:`sgkit.window_by_variant`, :func:`sgkit.window_by_position`, and :func:`sgkit.window_by_genome` functions.

The first function, :func:`sgkit.window_by_variant`, is the simplest and produces windows with a fixed number of variants in each window, determined by the size argument. An optional step argument may be provided to control the spacing between windows. By default, it is the same as the size, giving contiguous windows.

The second function, :func:`sgkit.window_by_position`, produces windows whose size is measured by genomic position (base pairs). See the API documentation for usage and examples.

The third function, :func:`sgkit.window_by_genome`, produces a single window spanning the whole genome, which can be used to compute whole-genome statistics.

This example shows the effect of computing the diversity statistic: first with no windows defined, then with windows.

.. ipython:: python
    :okwarning:

    import sgkit as sg
    import xarray as xr
    ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50)

    # Define a single cohort for all samples
    ds["sample_cohort"] = xr.DataArray(np.full(ds.dims['samples'], 0), dims="samples")

    # The diversity statistic is computed for every variant since no windows are defined
    sg.diversity(ds, merge=False)

    # Define windows of size 20 variants. This creates a new dimension called `windows`, and
    # some new variables for internal use.
    ds = sg.window_by_variant(ds, size=20)

    # The diversity statistic is now computed for every window
    sg.diversity(ds, merge=False)

During analysis we often want to be able to group samples into populations, and compute statistics based on these groups. Groups of samples are referred to as cohorts in sgkit.

Cohorts are defined by a mapping from samples to cohort index. The following example creates a sample_cohort variable to group a dataset of ten samples into three cohorts. Note that first value is -1, which means the corresponding sample is not in any of the three cohorts, and will be ignored when computing cohort statistics.

.. ipython:: python
    :okwarning:

    import sgkit as sg
    import xarray as xr
    ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=10)
    ds["sample_cohort"] = xr.DataArray(np.array([-1, 0, 1, 1, 1, 1, 0, 2, 2, 2]), dims="samples")

Typically the sample_cohort variable is derived from a dataframe that has the sample/cohort mapping.

Cohort-level statistics can have repeated cohorts dimensions. :func:`sgkit.Fst`, for example, produces statistics for pairs of cohorts, which is represented as a variable with dimensions (windows, cohorts_0, cohorts_1), making it possible to read off the value of the statistic for any pair of cohorts.

It's convenient to name cohorts, to avoid errors that can occur when using index values. This example shows how to give cohorts names.

.. ipython:: python
    :okwarning:

    ds = sg.window_by_variant(ds, size=20)
    ds = sg.Fst(ds)

    cohort_names = ["Africa", "Asia", "Europe"]
    ds = ds.assign_coords({"cohorts_0": cohort_names, "cohorts_1": cohort_names})
    ds.stat_Fst.sel(cohorts_0="Africa", cohorts_1="Asia").values

Methods that work with cohorts will, by default, operate over all cohorts at once. Sometimes however you might only want to run the computation for a subset of cohorts, in which case you can explicitly specify the cohorts when calling the function.

Method chaining is a common practice with Python data tools that improves code readability and reduces the probability of introducing accidental namespace collisions. Sgkit functions are compatible with this idiom by default and this example shows to use it in conjunction with Xarray and Pandas operations in a single pipeline:

.. ipython:: python
    :okwarning:

    import sgkit as sg
    ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, missing_pct=.1)

    # Use `pipe` to apply a single sgkit function to a dataset
    ds_qc = ds.pipe(sg.variant_stats).drop_dims('samples')
    ds_qc

    # Show statistics for one of the arrays to be used as a filter
    ds_qc.variant_call_rate.to_series().describe()

    # Build a pipeline that filters on call rate and computes Fst between two cohorts
    # for windows of size 20 variants
    (
        ds
        # Add and compute call rate and other statistics
        .pipe(sg.variant_stats).compute()
        # Apply filter to include variants present across > 80% of samples
        .pipe(lambda ds: ds.sel(variants=ds.variant_call_rate > .8))
        # Create windows of size 20 variants
        .pipe(lambda ds: sg.window_by_variant(ds, size=20))
        # Assign a "cohort" variable that splits samples into two groups
        .assign(sample_cohort=np.repeat([0, 1], ds.dims['samples'] // 2))
        # Compute Fst between the groups
        .pipe(sg.Fst)
        # Extract the Fst values for cohort pairs
        .stat_Fst.values
    )

This is possible because sgkit functions nearly always take a Dataset as the first argument, create new variables, and then merge these new variables into a copy of the provided dataset in the returned value. See :ref:`dataset_merge` for more details.

Chunked arrays are required when working on large datasets. Libraries for managing chunked arrays such as Dask Array and Zarr make it possible to implement blockwise algorithms that operate on subsets of arrays (in parallel) without ever requiring them to fit entirely in memory.

By design, they behave almost identically to in-memory (typically Numpy) arrays within Xarray and can be interchanged freely when provided to sgkit functions. The most notable difference in behavior though is that operations on chunked arrays are evaluated lazily. This means that if an Xarray Dataset contains only chunked arrays, no actual computations will be performed until one of the following occurs:

This example shows a few of these features:

.. ipython:: python

    import sgkit as sg
    ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, missing_pct=.1)

    # Chunk our original in-memory dataset using a blocksize of 50 in all dimensions.
    ds = ds.chunk(chunks=50)
    ds

    # Show the chunked array representing base pair position
    ds.variant_position

    # Call compute via the dask.array API
    ds.variant_position.data.compute()[:5]

    # Coerce to numpy via Xarray
    ds.variant_position.values[:5]

    # Compute without unboxing from xarray.DataArray
    ds.variant_position.compute()[:5]


Unlike this simplified example, real datasets often contain a mixture of chunked and unchunked arrays. Sgkit will often load smaller arrays directly into memory while leaving large arrays chunked as a trade-off between convenience and resource usage. This can always be modified by users though and sgkit functions that operate on a Dataset should work regardless of the underlying array backend.

See Parallel computing with Dask in Xarray for more examples and information, as well as the Dask tutorials on delayed array execution and lazy execution in Dask graphs.