# Minimal Numpy Example

The Genotype Call Dataset is central to the sgkit API: it is the data structure that most of the other functions operate on. It uses [Xarray](http://xarray.pydata.org/en/stable/) under the hood, which gives a programmatic interface that can be backed by several different file formats.

The Xarray dataset itself is *sort of* a transposed VCF file.

For this example we use a minimal set of NumPy arrays to create a small Genotype Call Dataset.

This is only meant to demonstrate the datatypes that we feed into the Xarray dataset. For a more conceptual understanding, please check out the `Genotype-Call-Dataset-From-VCF.ipynb` notebook.

In [1]:
import numpy as np
import zarr
import pandas as pd
import dask.array as da
import allel
from pprint import pprint
import matplotlib.pyplot as plt
%matplotlib inline

## Prep Work - Install Packages

sgkit is still under rapid development, so this notebook installs it pinned to a specific commit.

In [2]:
#! pip install git+https://github.com/pystatgen/sgkit@96203d471531e7e2416d4dd9b48ca11d660a1bcc

## Numpy Representations of the Variant Data

To prepare for our Xarray dataset, we need to express the variant data as NumPy arrays.

If you're wondering where these requirements come from, look at `sgkit.api.create_genotype_call_dataset`. It calls `check_array_like` on each input to make sure it is a NumPy array with a particular dtype kind and number of dimensions.

```
check_array_like(variant_contig, kind="i", ndim=1)
check_array_like(variant_position, kind="i", ndim=1)
check_array_like(variant_alleles, kind="S", ndim=2)
check_array_like(sample_id, kind="U", ndim=1)
check_array_like(call_genotype, kind="i", ndim=3)
```
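The checks above can be mirrored in plain NumPy before handing anything to sgkit. This is a minimal sketch, assuming the `kind` argument corresponds to NumPy's one-letter `dtype.kind` code (`"i"` for signed integers, `"S"` for byte strings, `"U"` for unicode strings); the `examples` and `summary` names are only for illustration and are not part of sgkit.

```python
import numpy as np

# Illustrative inputs, using the same dtypes as the rest of this notebook.
examples = {
    "variant_contig": np.array([0], dtype="i"),
    "variant_position": np.array([1], dtype="i"),
    "variant_alleles": np.array([["A", "T"]], dtype="S"),
    "sample_id": np.array(["sample-1"], dtype="U"),
    "call_genotype": np.array([[[0, 0]]], dtype="i"),
}

# Collect the dtype kind and number of dimensions of every array.
summary = {name: (arr.dtype.kind, arr.ndim) for name, arr in examples.items()}

for name, (kind, ndim) in summary.items():
    print(f"{name}: kind={kind!r}, ndim={ndim}")
```

Each printed pair should match the `kind` and `ndim` values in the corresponding `check_array_like` call.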

In [3]:
variant_contig_names = ['3R']
# variant_contig holds, for each variant, the index of its contig
# (chromosome) in variant_contig_names, so contigs are stored as integers
# rather than strings.
variant_contig = np.array([0], dtype='i')
variant_position = np.array([1], dtype='i')
variant_alleles = np.array([['A', 'T']], dtype='S')

sample_id = np.array(['sample-1'], dtype='U')
call_genotype_phased = None
variant_id = None
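
With more than one contig, the integer `variant_contig` array can be built by looking up each variant's contig name in `variant_contig_names`. The following is a minimal sketch in plain NumPy; the contig names and the `contig_per_variant` list are hypothetical values made up for illustration.

```python
import numpy as np

# Hypothetical contig names and per-variant contig labels.
variant_contig_names = ["2L", "3R"]
contig_per_variant = ["3R", "2L", "3R"]

# Map each contig name to its position in variant_contig_names.
name_to_index = {name: i for i, name in enumerate(variant_contig_names)}

# One integer index per variant, stored with an integer dtype.
variant_contig = np.array(
    [name_to_index[name] for name in contig_per_variant], dtype="i"
)
print(variant_contig)
```

The names can always be recovered by indexing `variant_contig_names` with the stored integers.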

In [4]:
# The genotype array has dimensions (variants, samples, ploidy):
#         "call/genotype": ([DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY], call_genotype),
# and needs to have an integer dtype (kind 'i').
# For a similar layout, you can also look at allel's GenotypeChunkedArray.
call_genotype = np.array([[[0, 0]]], dtype='i')
call_genotype.shape

(1, 1, 2)

This is correct! The dimensions are (variants, samples, ploidy), so we have 1 variant, 1 sample, and 1 diploid call (two alleles, both the reference allele `0`).
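
The same layout scales to more variants and samples. As a sketch with made-up genotype values (not real data), here is a 2 variant by 3 sample diploid array, where each innermost pair holds the allele indices of one call:

```python
import numpy as np

# Hypothetical calls: 2 variants x 3 samples x ploidy 2.
# 0 is the reference allele, 1 is the first alternate allele.
call_genotype = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],  # variant 0
        [[0, 1], [0, 0], [0, 0]],  # variant 1
    ],
    dtype="i",
)

n_variants, n_samples, ploidy = call_genotype.shape
print(n_variants, n_samples, ploidy)

# Count alternate alleles per call by summing over the ploidy axis.
alt_count = (call_genotype == 1).sum(axis=-1)
print(alt_count)
```

Summing over the last axis collapses the ploidy dimension, leaving one alternate-allele count per variant and sample.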

## Convert to Genotype Call Dataset

Finally! Let's convert this to the Genotype Call Dataset!

In [5]:
import sgkit

genotype_xarray_dataset = sgkit.api.create_genotype_call_dataset(
    variant_contig_names=variant_contig_names,
    variant_contig=variant_contig,
    variant_position=variant_position,
    variant_alleles=variant_alleles,
    sample_id=sample_id,
    call_genotype=call_genotype,
)

In [6]:
genotype_xarray_dataset

## Done!

Now we have an Xarray dataset that we can use with the rest of sgkit!