# Xarray Genetics API Prototype 

In [1]:
# Set to enable auto-complete for subclass properties
# See: https://github.com/ipython/ipython/issues/11653#issuecomment-492578777
%config Completer.use_jedi = False
import sys
import xarray as xr
import numpy as np
import pandas as pd
import dask.array as da
sys.path.append(".")
from lib import api
xr.set_options(display_style='html')

<xarray.core.options.set_options at 0x7f12142b6cd0>

## Data Structures

Data structure classes are all Xarray subclasses and do not maintain any state that will interfere with serialization or other Xarray functionality.  They function solely to:

- Provide casting mechanisms from one type to another (e.g. array containing allele index by chromosome to array containing alt allele counts)
- Apply preconditions on underlying arrays (e.g. count arrays have to be some unsigned integer type, all others will raise an error)
  - This allows the data model to do most of the upfront type, value, and shape checking that many functions would otherwise do with a bunch of boilerplate.  In other words, this shifts data structure validation to the type hint system rather than "utility" functions.
- Allow pass-through support for duck array typing, meaning that this library should make as few assumptions as possible about which underlying array libraries are used (e.g. numpy, dask, xarray).  Anything implementing [`__array_function__`](https://numpy.org/neps/nep-0018-array-function-protocol.html) is fine.
- Serve as anchors for documentation.
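
As a rough illustration of the precondition and duck-typing points above, the check below inspects only the `dtype` attribute, so the same code would accept numpy, dask, or xarray inputs without converting them. `check_count_dtype` is a hypothetical name used for this sketch and is not part of `lib.api`; the actual classes may validate differently.

```python
import numpy as np


def check_count_dtype(array):
    """Return `array` unchanged if it holds unsigned integers, else raise.

    Hypothetical helper: only `array.dtype` is inspected, so the data is
    never converted or loaded (any duck array exposing `dtype` works).
    """
    if not np.issubdtype(array.dtype, np.unsignedinteger):
        raise TypeError(
            f"Count arrays must have an unsigned integer dtype, got {array.dtype}"
        )
    return array


# Accepted: unsigned integer counts
counts = check_count_dtype(np.array([[0, 1], [2, 1]], dtype=np.uint8))

# Rejected: floating point values are not valid counts
try:
    check_count_dtype(np.array([[0.0, 1.0]]))
    rejected = False
except TypeError:
    rejected = True
```

Running a check like this once, at construction time, is what lets downstream functions skip their own dtype validation.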

Any information that is specific to a genetics dataset must be represented as xarray attributes (`attrs`), as coordinates, or as data variables in a Dataset -- no instance attributes should be attached to the class instances directly.  An example of this is phasing.  Whether or not calls are phased can vary within a single dataset, so this should not be represented as, say, an attribute on a ```GenotypeIndexArray``` but rather as a separate data variable in an ```xarray.Dataset```.  Examples of this are shown later.
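
A minimal sketch of the phasing case, using plain `xarray` rather than the prototype classes. The variable names (`allele_index`, `is_phased`) and the dimension names `variant` and `sample` are assumptions made for illustration; only `ploidy` appears elsewhere in this notebook.

```python
import numpy as np
import xarray as xr

shape = (10, 5, 2)  # variants, samples, ploidy
rng = np.random.default_rng(0)

calls = rng.integers(0, 2, size=shape, dtype=np.uint8)
# One phasing flag per call (variant x sample); no ploidy dimension
phased = rng.random(shape[:2]) < 0.5

ds = xr.Dataset(
    data_vars={
        "allele_index": (("variant", "sample", "ploidy"), calls),
        "is_phased": (("variant", "sample"), phased),
    },
    attrs={"description": "Phasing stored as a data variable (sketch)"},
)

# Keep only the phased calls; unphased entries are masked as NaN
phased_calls = ds["allele_index"].where(ds["is_phased"])
```

Because the flag is an ordinary data variable, it is serialized, sliced, and aligned together with the calls, with no state held on the array object itself.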

### Creating Arrays

In [3]:
# 10 variants, 5 samples, and 2 chromosomes for which hard calls are simulated
shape = (10, 5, 2)

In [17]:
# From numpy:
gt = np.random.randint(0, 2, size=shape, dtype=np.uint8) # Draw from [0, 1]
gt = api.GenotypeIndexArray(allele_index=gt, attrs={'description': 'Bi-allelic, diploid hard call array example [numpy]'})
gt

In [56]:
# From dask:
gt = da.random.randint(0, 2, size=shape, dtype=np.uint8)
gt = api.GenotypeIndexArray(allele_index=gt, attrs={'description': 'Bi-allelic, diploid hard call array example [dask]'})
gt

|       | Array      | Chunk         |
|-------|------------|---------------|
| Bytes | 100 B      | 100 B         |
| Shape | (10, 5, 2) | (10, 5, 2)    |
| Count | 1 Tasks    | 1 Chunks      |
| Type  | uint8      | numpy.ndarray |


In [57]:
gt.data

|       | Array      | Chunk         |
|-------|------------|---------------|
| Bytes | 100 B      | 100 B         |
| Shape | (10, 5, 2) | (10, 5, 2)    |
| Count | 1 Tasks    | 1 Chunks      |
| Type  | uint8      | numpy.ndarray |


In [58]:
# From xarray (gt is already an Xarray DataArray)
# Note, however, that when creating a DataArray from another DataArray, Xarray
# will call .asarray and coerce the underlying data to an in-memory numpy array
api.GenotypeIndexArray(gt)

### Creating Datasets

To represent more complex structures, such as CNV data for a multi-allelic, polyploid experiment, ```Dataset``` instances can be used.  These structures constitute the inputs to more advanced analyses:

In [59]:
# Create an array containing hard calls where values correspond to allele index (at most 3 alleles possible in this case)
gt_idx = api.GenotypeIndexArray(allele_index=np.random.randint(0, 3, size=shape, dtype=np.uint8))

# Create an array containing the copy numbers as counts of the alleles above on each chromosome
gt_cts = api.GenotypeCountArray(allele_count=np.random.randint(0, 5, size=shape, dtype=np.uint8))

# Combine the two arrays into a single dataset
cnv_ds = api.GenotypeAlleleCountDataset(allele_index=gt_idx, allele_count=gt_cts, attrs={'name': 'CNV Dataset Example'})
cnv_ds

### Data Structure Conversions

The API facilitates conversions to different representations where possible through methods on the data structure subclasses.  These conversions often result in a loss of information (e.g. reductions across a dimension) -- i.e. they are not invertible.  This makes it easy for an analysis to start with a complex N-dimensional structure and call these conversions where needed as inputs to algorithms that generally expect simpler structures (e.g. LD estimation only needs alternate allele counts as a 2D array).
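
The loss of information described above can be seen with plain numpy: two different hard call arrays reduce to the same dosage, so the original calls cannot be recovered from it. This sketch assumes the last axis is the ploidy dimension, as in the simulated arrays in this notebook.

```python
import numpy as np

# Two different diploid genotypes for one variant and one sample:
# alt allele on the first chromosome vs. on the second
a = np.array([[[0, 1]]], dtype=np.uint8)
b = np.array([[[1, 0]]], dtype=np.uint8)

# Summing over the last (ploidy) axis gives the alt allele count
dosage_a = a.sum(axis=-1)
dosage_b = b.sum(axis=-1)

same_dosage = np.array_equal(dosage_a, dosage_b)  # dosages agree
same_calls = np.array_equal(a, b)  # underlying calls do not
```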

These examples show a few conversions to dosage arrays:

#### From Hard Calls

In [66]:
# Simulate an array of diploid hard calls
gt = np.random.randint(0, 2, size=shape, dtype=np.uint8) # Draw from [0, 1]
gt = api.GenotypeIndexArray(allele_index=gt, attrs={'description': 'Calculating dosage from hard calls example'})
gt

In [67]:
# Convert to dosages (sum along ploidy dimension)
gtd = gt.to_dosage_array()
gtd

In [68]:
# This is the usual input many genetics methods operate on
gtd.data

array([[0, 1, 1, 2, 1],
       [0, 2, 2, 2, 2],
       [2, 1, 2, 1, 0],
       [1, 1, 1, 2, 0],
       [2, 1, 1, 2, 0],
       [0, 1, 2, 1, 0],
       [1, 1, 2, 0, 1],
       [2, 0, 1, 1, 1],
       [1, 1, 1, 0, 1],
       [2, 1, 1, 2, 2]], dtype=uint64)

In [69]:
# This is the same, but w/o API type preservation
gt.sum(dim='ploidy')

#### From Probabilities

If the genotype calls for an experiment are imputed (or probabilistic for some other reason), this is what converting to dosage looks like:

In [105]:
gp = np.random.rand(*shape, 2)
gp /= gp.sum(axis=-1, keepdims=True)
gp.shape

(10, 5, 2, 2)

In [106]:
gp[0, 0, 0]

array([0.46215599, 0.53784401])

In [107]:
gp = api.GenotypeProbabilityArray(allele_probability=gp, attrs={'description': 'Genotype probability example'})
gp

In [111]:
# Dosages will be calculated as Pr(Heterozygous) + 2 * Pr(Homozygous Alternate) (Allele 0 assumed as reference)
gp.to_dosage_array()
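
For reference, the dosage formula in the comment above can be checked with plain numpy. This is a sketch under stated assumptions, not the implementation of `to_dosage_array`: the last axis is taken to hold (reference, alternate) allele probabilities for each chromosome, and the two chromosomes are treated as independent. Under those assumptions, summing Pr(alternate) over the ploidy axis gives the same value as Pr(Heterozygous) + 2 * Pr(Homozygous Alternate).

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (10, 5, 2)  # variants, samples, ploidy

# Per-chromosome allele probabilities; last axis is (ref, alt) and sums to 1
gp = rng.random((*shape, 2))
gp /= gp.sum(axis=-1, keepdims=True)

# Expected alt allele count: sum Pr(alt) over the ploidy axis
dosage = gp[..., 1].sum(axis=-1)

# Same quantity via genotype probabilities, assuming the two
# chromosomes are independent
p1, p2 = gp[:, :, 0, 1], gp[:, :, 1, 1]
pr_het = p1 * (1 - p2) + (1 - p1) * p2
pr_hom_alt = p1 * p2
dosage_from_genotypes = pr_het + 2 * pr_hom_alt
```

The two results agree because the cross terms cancel: `p1 + p2 - 2*p1*p2 + 2*p1*p2 = p1 + p2`.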

## Operations

TBD

-----

### Example Ideas

- handling missing values
- adding fields to dataset as joins
- copy number
- converting from calls to dosages/alt_counts
- are subtypes preserved when added to dataset?
- show matrix of convertibility between types?
    
### Notes

- Xarray often invokes constructors on DataArray/Dataset like this: ```type(self)(*args, **kwargs)```
  - This means that subclasses must be able to differentiate constructor params
- Do any loci have more than 256 alleles (i.e. is np.uint8, which covers indexes 0-255, a safe assumption for the allele dimension)?
- `xr.DataArray(xr.DataArray())` will call .asarray on the input argument, pushing .data into memory as numpy
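
The constructor note above can be illustrated without xarray. In this sketch `Base` is a hypothetical stand-in for a library class that rebuilds itself via `type(self)(...)`; none of these classes are part of xarray or `lib.api`. A subclass that keeps the base argument layout survives the internal re-construction, while one that changes the signature does not.

```python
class Base:
    """Stand-in for a library class that rebuilds itself via type(self)."""

    def __init__(self, data, name=None):
        self.data = data
        self.name = name

    def doubled(self):
        # Mirrors the `type(self)(*args, **kwargs)` pattern: the subclass
        # constructor is called with the *base class* argument layout.
        return type(self)([2 * x for x in self.data], name=self.name)


class Compatible(Base):
    """Keeps the base layout and adds only an optional keyword."""

    def __init__(self, data=None, name=None, allele_index=None):
        super().__init__(data if allele_index is None else allele_index, name=name)


class Incompatible(Base):
    """Changes the constructor signature, so re-construction fails."""

    def __init__(self, allele_index, description):
        super().__init__(allele_index, name=description)


# Works: the internal call passes `data` positionally and `name` by keyword
ok = Compatible(allele_index=[0, 1, 1]).doubled()

# Fails: the internal call passes `name=...`, which this subclass rejects
try:
    Incompatible([0, 1, 1], "calls").doubled()
    failed = False
except TypeError:
    failed = True
```

Adding subclass-specific parameters as optional keywords, as `Compatible` does, is one way to let a constructor tell user calls apart from internal ones.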