# Understanding the Genotype Call Xarray Dataset - From MalariaGEN Zarr

Central to the sgkit API is the Genotype Call Dataset, the data structure that most other sgkit functions operate on. It uses [Xarray](http://xarray.pydata.org/en/stable/) under the hood to provide a programmatic interface that can be backed by several different file formats.

The Xarray itself is *sort of* a transposed VCF file.
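One way to picture this: a VCF stores one record per row, bundling every field of a variant together, whereas the dataset keeps one array per field, all sharing a leading variants axis. The sketch below uses plain NumPy and two made-up records; the variable names and values are illustrative only and are not taken from the MalariaGEN data.

```python
import numpy as np

# Two hypothetical VCF-style records:
# (CHROM, POS, REF, ALT, one genotype string per sample)
records = [
    ("3R", 101, "A", "T", ["0/0", "0/1", "1/1"]),
    ("3R", 250, "G", "C", ["0/0", "0/0", "0/1"]),
]

# One array per field, each indexed by variant along the first axis.
variant_position = np.array([r[1] for r in records], dtype=np.int32)
variant_allele = np.array([[r[2], r[3]] for r in records], dtype="S1")

# Genotype strings become integer allele indices with shape
# (variants, samples, ploidy).
call_genotype = np.array(
    [[[int(a) for a in gt.split("/")] for gt in r[4]] for r in records],
    dtype=np.int8,
)

print(variant_position.shape)  # (2,)
print(variant_allele.shape)    # (2, 2)
print(call_genotype.shape)     # (2, 3, 2)
```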

For this example we are going to go from the preprocessed Zarr data to the sgkit Genotype Call Xarray Dataset.

This is only meant to demonstrate the data types that we feed into the Xarray dataset. For a more conceptual understanding, please see `Genotype-Call-Dataset-From-VCF.ipynb`.

In [1]:
import numpy as np
import zarr
import pandas as pd
import dask.array as da
import allel
from pprint import pprint
import matplotlib.pyplot as plt
%matplotlib inline

## Create a Dask Cluster

This isn't that important for this example, but sgkit can use Dask under the hood for many of its calculations. Divide and conquer your statistical genomics data!
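To make the divide-and-conquer idea concrete, here is a minimal sketch in plain NumPy (not Dask itself, and not how sgkit is implemented): the variants axis is split into blocks, a per-variant statistic is computed on each block independently, and the results are concatenated. The random genotype array and the `alt_allele_count` helper are made up for illustration.

```python
import numpy as np

# Made-up genotype data: 1000 variants x 8 samples x 2 alleles.
rng = np.random.default_rng(0)
gt = rng.integers(0, 2, size=(1000, 8, 2), dtype=np.int8)

def alt_allele_count(block):
    """Count non-reference alleles per variant in one block."""
    return (block > 0).sum(axis=(1, 2))

# Split along the variants axis, process each block on its own, then combine.
blocks = np.array_split(gt, 4, axis=0)
chunked_result = np.concatenate([alt_allele_count(b) for b in blocks])

# The chunked result matches the computation done in one pass.
whole_result = alt_allele_count(gt)
print(np.array_equal(chunked_result, whole_result))
```

Because each block only depends on its own rows, the per-block work could in principle be handed to separate workers, which is the kind of scheduling Dask takes care of.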

In [2]:
from dask_kubernetes import KubeCluster
cluster = KubeCluster(n_workers=30, silence_logs='error')
cluster


## Get the Malaria Gen Zarr Data

The [Zarr](https://zarr.readthedocs.io/en/stable) data is hosted in a Google Cloud Storage bucket and is also available for download from the public FTP site.

In [3]:
import gcsfs

# Anonymous, read-only access to the public bucket
gcs_bucket_fs = gcsfs.GCSFileSystem(project='malariagen-jupyterhub', token='anon', access='read_only')

# Map the bucket path to a key-value store and open it as a Zarr group
storage_path = 'ag1000g-release/phase2.AR1/variation/main/zarr/pass/ag1000g.phase2.ar1.pass'
store = gcsfs.mapping.GCSMap(storage_path, gcs=gcs_bucket_fs, check=False, create=False)
callset = zarr.Group(store)

If you explore the Zarr data you will see that it is mostly the VCF data, with a few fields precomputed for convenience.

In [4]:
print(callset['samples'])

<zarr.core.Array '/samples' (1142,) object>


In [5]:
chrom = '3R'
print(callset[chrom].tree())

3R
 ├── calldata
 │   └── GT (14481509, 1142, 2) int8
 ├── samples (1142,) object
 └── variants
     ├── ABHet (14481509,) float32
     ├── ABHom (14481509,) float32
     ├── AC (14481509, 3) int32
     ├── AF (14481509, 3) float32
     ├── ALT (14481509, 3) |S1
     ├── AN (14481509,) int32
     ├── Accessible (14481509,) bool
     ├── BaseCounts (14481509, 4) int32
     ├── BaseQRankSum (14481509,) float32
     ├── Coverage (14481509,) int32
     ├── CoverageMQ0 (14481509,) int32
     ├── DP (14481509,) int32
     ├── DS (14481509,) bool
     ├── Dels (14481509,) float32
     ├── FILTER_BaseQRankSum (14481509,) bool
     ├── FILTER_FS (14481509,) bool
     ├── FILTER_HRun (14481509,) bool
     ├── FILTER_HighCoverage (14481509,) bool
     ├── FILTER_HighMQ0 (14481509,) bool
     ├── FILTER_LowCoverage (14481509,) bool
     ├── FILTER_LowMQ (14481509,) bool
     ├── FILTER_LowQual (14481509,) bool
     ├── FILTER_NoCoverage (14481509,) bool
     ├── FILTER_PASS (14481509,) bool
     ├

## Get the Call Data

In [6]:
chrom = '3R'
calldata = callset[chrom]['calldata']

# TODO Will this be changed for SGKit?
genotypes = allel.GenotypeChunkedArray(calldata['GT'])
genotypes

Unnamed: 0,0,1,2,3,4,...,1137,1138,1139,1140,1141,Unnamed: 12
0,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
1,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
2,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
...,...,...,...,...,...,...,...,...,...,...,...,...
14481506,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
14481507,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
14481508,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,


### Genotype Chunked Array Data Structure

Looking at the `allel.GenotypeChunkedArray`, we see that it has shape `(14481509, 1142, 2)`.

The axes correspond to `variants`, `samples`, and `alleles` (the two allele calls for each diploid sample).

Indexing by variant therefore gives the allele calls of every sample at that variant.
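The same axis layout can be tried out on a small in-memory array. This is a minimal sketch with made-up values in plain NumPy, not the chunked array loaded above; it follows the scikit-allel convention of using `-1` for a missing allele call.

```python
import numpy as np

# Toy genotype array: 3 variants x 4 samples x 2 alleles (diploid).
gt = np.array(
    [
        [[0, 0], [0, 1], [1, 1], [0, 0]],
        [[0, 0], [0, 0], [0, 1], [-1, -1]],  # -1 marks a missing call
        [[1, 1], [0, 1], [0, 0], [0, 0]],
    ],
    dtype=np.int8,
)

n_variants, n_samples, ploidy = gt.shape

first_variant = gt[0]    # every sample at variant 0, shape (4, 2)
first_sample = gt[:, 0]  # sample 0 across all variants, shape (3, 2)
one_call = gt[0, 0]      # the two alleles of a single call, shape (2,)

print(n_variants, n_samples, ploidy)
print(first_variant.shape, first_sample.shape, one_call.shape)
```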

So let's get all the sample data for the first variant.

In [7]:
genotypes[0]

0,1,2,3,4,...,1137,1138,1139,1140,1141
0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0


And now let's look at the first variant call for the first sample.

In [8]:
genotypes[0][0]

array([0, 0], dtype=int8)

You can see above that for sample 0 the genotype is 0/0, meaning it is homozygous for the reference allele.
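As a rough illustration of how a pair of allele indices maps to a genotype class, the sketch below defines a hypothetical helper, `classify_call`, which is not part of sgkit or scikit-allel. It assumes diploid calls where `0` is the reference allele, positive integers are alternate alleles, and negative values mark missing data.

```python
import numpy as np

def classify_call(call):
    """Label one diploid call given as a length-2 array of allele indices."""
    a, b = int(call[0]), int(call[1])
    if a < 0 or b < 0:
        return "missing"
    if a == b == 0:
        return "hom_ref"
    if a == b:
        return "hom_alt"
    return "het"

print(classify_call(np.array([0, 0], dtype=np.int8)))  # hom_ref
print(classify_call(np.array([0, 1], dtype=np.int8)))  # het
print(classify_call(np.array([1, 1], dtype=np.int8)))  # hom_alt
```

For whole arrays, scikit-allel's genotype arrays offer vectorized methods such as `is_hom_ref()` and `is_het()`, which are a better fit than a per-call Python function.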

In [9]:
## Get the Samples