# Load From Malaria Gen Zarr

The Genotype Call Dataset is central to the sgkit API: it is the data structure that most of the other functions consume. It is built on [Xarray](http://xarray.pydata.org/en/stable/), which provides a programmatic interface that allows the backend to be any of several different data file formats.

The Xarray itself is *sort of* a transposed VCF file.
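To make the "transposed" idea concrete, here is a minimal pure-Python sketch. A VCF stores one text row per variant, while the dataset holds one array per field. The two VCF lines and the `parse_gt` helper are made up for illustration and are not part of sgkit.

```python
# Two made-up VCF data lines (tab-separated columns):
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2
vcf_lines = [
    "3R\t101\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1",
    "3R\t205\t.\tT\tC\t.\tPASS\t.\tGT\t1/1\t./.",
]


def parse_gt(text):
    """Hypothetical helper: turn a GT string such as '0/1' into a list of
    ints, using -1 for a missing allele ('.')."""
    return [-1 if a == "." else int(a) for a in text.replace("|", "/").split("/")]


rows = [line.split("\t") for line in vcf_lines]

# One "column" per field instead of one row per variant.
variant_position = [int(r[1]) for r in rows]
variant_alleles = [[r[3], r[4]] for r in rows]
# Nested as (variants, samples, alleles per sample).
call_genotype = [[parse_gt(gt) for gt in r[9:]] for r in rows]

print(variant_position)
print(variant_alleles)
print(call_genotype)
```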

For this example we are going to go from the preprocessed zarr to the sgkit Genotype Call Xarray Dataset.

This is only meant to demonstrate the datatypes that we feed into the Xarray dataset. For a more conceptual understanding please check out the `Genotype-Call-Dataset-From-VCF.ipynb`.

In [1]:
import numpy as np
import zarr
import pandas as pd
import dask.array as da
import allel
from pprint import pprint
import matplotlib.pyplot as plt
%matplotlib inline

## Create a Dask Cluster

This isn't that important for this example, but sgkit can use Dask under the hood for many of its calculations. Divide and conquer your statistical genomics data!
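As a rough illustration of the divide-and-conquer idea, the sketch below chunks a small made-up genotype array along the variants axis with `dask.array` and counts alternate alleles per variant. It runs on Dask's default local scheduler and does not need a cluster. The array contents and chunk sizes are arbitrary assumptions.

```python
import numpy as np
import dask.array as da

# Toy genotype calls: 8 variants x 4 samples x 2 alleles (all made up).
rng = np.random.default_rng(0)
gt = rng.integers(0, 2, size=(8, 4, 2)).astype("int8")

# Split the variants axis into chunks of 2 variants each.
# Each chunk can be processed independently of the others.
gt_dask = da.from_array(gt, chunks=(2, 4, 2))

# Count alternate alleles per variant. This only builds a task graph;
# nothing is computed until .compute() is called.
alt_count = (gt_dask > 0).sum(axis=(1, 2))
result = alt_count.compute()

print(gt_dask.numblocks)
print(result)
```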

In [2]:
from dask_kubernetes import KubeCluster
cluster = KubeCluster(n_workers=30, silence_logs='error')
cluster


## Import sgkit

In [3]:
! pip install git+https://github.com/pystatgen/sgkit@96203d471531e7e2416d4dd9b48ca11d660a1bcc

Collecting git+https://github.com/pystatgen/sgkit@96203d471531e7e2416d4dd9b48ca11d660a1bcc
  Cloning https://github.com/pystatgen/sgkit (to revision 96203d471531e7e2416d4dd9b48ca11d660a1bcc) to /tmp/pip-req-build-7iudp4iv
  Running command git clone -q https://github.com/pystatgen/sgkit /tmp/pip-req-build-7iudp4iv
  Running command git checkout -q 96203d471531e7e2416d4dd9b48ca11d660a1bcc
Building wheels for collected packages: sgkit
  Building wheel for sgkit (setup.py) ... done
  Created wheel for sgkit: filename=sgkit-0.1.dev67+g96203d4-py3-none-any.whl size=19421 sha256=76ddd164160ed34beee7e6e8f6f0bde32b36b898074de2a50e0e1ce64f228d70
  Stored in directory: /home/jovyan/.cache/pip/wheels/6f/2b/6e/48d20c382bb6a66ea96c6dee6e6e575ea88180fef1e96a9024
Successfully built sgkit


In [4]:
import sgkit
help(sgkit.api.create_genotype_call_dataset)

Help on function create_genotype_call_dataset in module sgkit.api:

create_genotype_call_dataset(*, variant_contig_names: List[str], variant_contig: Any, variant_position: Any, variant_alleles: Any, sample_id: Any, call_genotype: Any, call_genotype_phased: Any = None, variant_id: Any = None) -> xarray.core.dataset.Dataset
    Create a dataset of genotype calls.
    
    Parameters
    ----------
    variant_contig_names : list of str
        The contig names.
    variant_contig : array_like, int
        The (index of the) contig for each variant.
    variant_position : array_like, int
        The reference position of the variant.
    variant_alleles : array_like, S1
        The possible alleles for the variant.
    sample_id : array_like, str
        The unique identifier of the sample.
    call_genotype : array_like, int
        Genotype, encoded as allele values (0 for the reference, 1 for
        the first allele, 2 for the second allele), or -1 to indicate a
        missing value.
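The parameter list above can be read as a set of arrays that share a variants axis and a samples axis. The following is a minimal sketch of inputs with matching shapes and dtypes, using only NumPy and made-up values. It does not call sgkit, and the shape checks are an illustration rather than sgkit's own validation.

```python
import numpy as np

# Toy inputs: 3 variants, 2 samples, 2 allele calls per sample. All values are made up.
variant_contig_names = ["3R"]
variant_contig = np.zeros(3, dtype=int)          # index into variant_contig_names
variant_position = np.array([101, 205, 340])     # one position per variant
variant_alleles = np.array(
    [[b"A", b"G"], [b"T", b"C"], [b"G", b"A"]], dtype="S1"
)                                                # (variants, alleles)
sample_id = np.array(["S1", "S2"], dtype="U")    # one ID per sample
call_genotype = np.array(
    [[[0, 0], [0, 1]], [[1, 1], [0, 0]], [[0, 1], [-1, -1]]], dtype="int8"
)                                                # (variants, samples, allele calls)

# Every per-variant array must agree on the number of variants,
# and the samples axis must match the number of sample IDs.
n_variants, n_samples, n_calls = call_genotype.shape
shapes_agree = (
    variant_contig.shape == (n_variants,)
    and variant_position.shape == (n_variants,)
    and variant_alleles.shape[0] == n_variants
    and sample_id.shape == (n_samples,)
)
print(shapes_agree)
```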

## Get the Malaria Gen Zarr Data

The [zarr](https://zarr.readthedocs.io/en/stable) data is hosted in a Google Cloud Storage bucket and is also available for download from the public FTP site.

In [5]:
import gcsfs

gcs_bucket_fs = gcsfs.GCSFileSystem(project='malariagen-jupyterhub', token='anon', access='read_only')

storage_path = 'ag1000g-release/phase2.AR1/variation/main/zarr/pass/ag1000g.phase2.ar1.pass'
store = gcsfs.mapping.GCSMap(storage_path, gcs=gcs_bucket_fs, check=False, create=False)
callset = zarr.Group(store)

If you explore the zarr data you will see that it is mostly the VCF data, with a few fields precalculated for convenience.

In [6]:
print(callset['samples'])

<zarr.core.Array '/samples' (1142,) object>


In [7]:
chrom = '3R'
print(callset[chrom].tree())

3R
 ├── calldata
 │   └── GT (14481509, 1142, 2) int8
 ├── samples (1142,) object
 └── variants
     ├── ABHet (14481509,) float32
     ├── ABHom (14481509,) float32
     ├── AC (14481509, 3) int32
     ├── AF (14481509, 3) float32
     ├── ALT (14481509, 3) |S1
     ├── AN (14481509,) int32
     ├── Accessible (14481509,) bool
     ├── BaseCounts (14481509, 4) int32
     ├── BaseQRankSum (14481509,) float32
     ├── Coverage (14481509,) int32
     ├── CoverageMQ0 (14481509,) int32
     ├── DP (14481509,) int32
     ├── DS (14481509,) bool
     ├── Dels (14481509,) float32
     ├── FILTER_BaseQRankSum (14481509,) bool
     ├── FILTER_FS (14481509,) bool
     ├── FILTER_HRun (14481509,) bool
     ├── FILTER_HighCoverage (14481509,) bool
     ├── FILTER_HighMQ0 (14481509,) bool
     ├── FILTER_LowCoverage (14481509,) bool
     ├── FILTER_LowMQ (14481509,) bool
     ├── FILTER_LowQual (14481509,) bool
     ├── FILTER_NoCoverage (14481509,) bool
     ├── FILTER_PASS (14481509,) bool
     ├

## Get the Call Data

In [8]:
chrom = '3R'
calldata = callset[chrom]['calldata']

# TODO Will this be changed for SGKit?
genotypes = allel.GenotypeChunkedArray(calldata['GT'])
genotypes

,0,1,2,3,4,...,1137,1138,1139,1140,1141
0,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
1,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
2,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
...,...,...,...,...,...,...,...,...,...,...,...
14481506,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
14481507,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0
14481508,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0


### Genotype Chunked Array Data Structure

When looking at the `allel.GenotypeChunkedArray` we see that its repr reports `GenotypeChunkedArray shape=(14481509, 1142, 2)`.

The three dimensions correspond to `variants`, `samples`, and `alleles` (the two allele calls of each diploid sample).

For every variant index we have the allele calls of each sample.

So let's get all the sample data for the first variant.

In [9]:
genotypes[0]

0,1,2,3,4,...,1137,1138,1139,1140,1141
0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0


And now let's look at the first variant call for the first sample.

In [10]:
genotypes[0][0]

array([0, 0], dtype=int8)

You can see above that for the first sample the genotype is 0/0, meaning it is homozygous for the reference allele.
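The same reading of allele values extends to the other genotype classes. Below is a minimal NumPy sketch that classifies made-up diploid calls for a single variant, using the encoding from the function help (0 for reference, positive values for alternates, -1 for missing). The variable names are illustrative only.

```python
import numpy as np

# Made-up calls for one variant across four diploid samples:
# 0/0, 0/1, 1/1 and a missing call.
calls = np.array([[0, 0], [0, 1], [1, 1], [-1, -1]], dtype="int8")

# A call is missing if either allele is negative.
is_missing = (calls < 0).any(axis=1)
# Homozygous reference: both alleles are 0.
is_hom_ref = (calls == 0).all(axis=1)
# Heterozygous: the two alleles differ.
is_het = ~is_missing & (calls[:, 0] != calls[:, 1])
# Homozygous alternate: both alleles equal and not the reference.
is_hom_alt = ~is_missing & ~is_hom_ref & (calls[:, 0] == calls[:, 1])

print(is_hom_ref, is_het, is_hom_alt, is_missing)
```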

## Get the Samples

In [11]:
samples = callset['samples']
sample_id = np.array(samples, dtype='U')

In [12]:
sample_id[0:5]

array(['AA0040-C', 'AA0041-C', 'AA0042-C', 'AA0043-C', 'AA0044-C'],
      dtype='<U8')
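The conversion with `dtype='U'` turns an object array of Python strings into a fixed-width Unicode array, with the width set by the longest ID. The small sketch below uses a stand-in object array instead of the zarr data. The duplicate check at the end is an optional sanity step added here, not something the notebook does.

```python
import numpy as np

# Stand-in for the zarr 'samples' array, which stores Python objects.
samples_obj = np.array(["AA0040-C", "AA0041-C", "AA0042-C"], dtype=object)

# Convert to a fixed-width Unicode array.
sample_id = np.array(samples_obj, dtype="U")
print(sample_id.dtype.kind)

# Optional sanity check: sample IDs are expected to be unique.
has_duplicates = len(np.unique(sample_id)) != len(sample_id)
print(has_duplicates)
```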

## Grab the Variant Positions

Get the position of each variant.

In [13]:
variant_position = callset[chrom]['variants/POS']

Let's investigate some of the attributes of this array. It is a zarr array rather than a NumPy array, but it exposes the same `shape` and `dtype` attributes.

In [14]:
print(variant_position.shape)
print(variant_position.dtype.kind)

(14481509,)
i
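Beyond shape and dtype, a couple of quick checks on positions can be useful. The sketch below runs them on a small made-up array. It assumes positions within one contig are stored in ascending order, as in a coordinate-sorted VCF, which is an assumption about the callset rather than something shown above.

```python
import numpy as np

# Made-up positions for one contig, including one repeated site.
variant_position = np.array([101, 205, 340, 340, 512], dtype="int32")

# Positions should be integers (signed or unsigned).
is_integer = variant_position.dtype.kind in "iu"
# Non-decreasing order along the contig.
is_sorted = bool(np.all(np.diff(variant_position) >= 0))
# Number of adjacent entries that share a position.
n_duplicate_positions = int(np.sum(np.diff(variant_position) == 0))

print(is_integer, is_sorted, n_duplicate_positions)
```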


## Grab the Reference Alleles

For each variant we need the reference allele and the alternate alleles.

In [15]:
variant_ref = callset[chrom]['variants/REF']
variant_ref

<zarr.core.Array '/3R/variants/REF' (14481509,) |S1>

In [16]:
variant_alt = callset[chrom]['variants/ALT']
variant_alt

<zarr.core.Array '/3R/variants/ALT' (14481509, 3) |S1>

Now, instead of having two separate variant arrays, we want a single NumPy array with one `[ref, alt]` row per variant:

```python

[ 
    # variant position index
    [ ref, alt ],
]    
```

In [17]:
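# ALT lists every alternate allele observed at a site. We just take the first one here;
# strictly, variants that aren't biallelic should be filtered out first.
variant_alleles = np.column_stack((variant_ref, variant_alt[:,0]))
# Every variant is on the single contig at index 0. np.zeros defaults to float,
# so the dataset call further down builds an integer array instead.
variant_contig = np.zeros(len(variant_alleles))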

In [18]:
variant_contig[0:10]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [19]:
variant_alleles[0:10]

array([[b'A', b'G'],
       [b'A', b'T'],
       [b'T', b'C'],
       [b'G', b'A'],
       [b'T', b'A'],
       [b'A', b'G'],
       [b'G', b'C'],
       [b'C', b'T'],
       [b'C', b'T'],
       [b'G', b'A']], dtype='|S1')
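The biallelic filter mentioned in the code comment could be sketched as follows. It assumes that unused ALT slots are stored as empty byte strings (`b''`), which is worth confirming against the actual callset before relying on it. The toy `ref` and `alt` arrays are made up.

```python
import numpy as np

# Made-up REF and ALT arrays for three variants, with up to three alternates each.
ref = np.array([b"A", b"T", b"G"], dtype="S1")
alt = np.array(
    [[b"G", b"", b""],    # biallelic
     [b"C", b"A", b""],   # two alternates: not biallelic
     [b"A", b"", b""]],   # biallelic
    dtype="S1",
)

# Assumption: an empty byte string marks an unused ALT slot.
is_biallelic = (alt[:, 0] != b"") & (alt[:, 1] == b"") & (alt[:, 2] == b"")

# Keep only biallelic variants and pair each REF with its single ALT.
variant_alleles = np.column_stack((ref[is_biallelic], alt[is_biallelic, 0]))
print(is_biallelic)
print(variant_alleles)
```

The same boolean mask would also need to be applied to the positions and the genotype calls so that all per-variant arrays stay aligned.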

## Create the Xarray Genotype Callset

In [20]:
# Use dataset_size to build a smaller dataset while you're just exploring.
# To load every variant instead, set: dataset_size = len(variant_alleles)
dataset_size = 10000

variant_contig_names = [chrom]
# All variants are on the single contig at index 0; the index must be an integer.
variant_contig = np.zeros(dataset_size, dtype='int')
# Slice every per-variant array to the same length so they stay aligned.
variant_position = variant_position[0:dataset_size]
variant_alleles = variant_alleles[0:dataset_size]
call_genotype = genotypes[0:dataset_size]

In [21]:
genotype_xarray_dataset = sgkit.api.create_genotype_call_dataset(
    variant_contig_names = variant_contig_names,
    # these are all on the 0th contig, because we only have one contig
    variant_contig = np.zeros(len(variant_position), dtype='int'),
    variant_position = variant_position,
    variant_alleles = variant_alleles,
    sample_id = sample_id,
    call_genotype = call_genotype,
)

In [22]:
genotype_xarray_dataset