# Import data to sgkit

Convert MalariaGEN data from scikit-allel VCF Zarr format to sgkit Zarr format. This uses the `vcfzarr_to_zarr` function, which has been optimized to avoid high memory usage. (See https://github.com/pystatgen/sgkit/pull/324 and the linked issues for details.)

We also use the [rechunker](https://rechunker.readthedocs.io/en/latest/) library to rechunk to the desired chunk sizes.

In [1]:
%run setup.ipynb

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import sgkit as sg
from dask.diagnostics import ProgressBar
from sgkit.io.vcfzarr_reader import vcfzarr_to_zarr

In [4]:
input = here() / 'data/external/ag1000g/phase2/AR1/variation/main/zarr/pass/ag1000g.phase2.ar1.pass/'
output = here() / 'data/sgkit/ag1000g_import.zarr'

In [5]:
with ProgressBar():
    vcfzarr_to_zarr(input, output, grouped_by_contig=True, consolidated=True)

[########################################] | 100% Completed |  5.2s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed | 30.9s
[########################################] | 100% Completed |  7.5s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed | 39.5s
[########################################] | 100% Completed |  4.9s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed | 29.2s
[########################################] | 100% Completed |  6.6s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed | 36.8s
[########################################] | 100% Completed |  2.7s
[########################################] | 100% Completed |  0.1s
[########################################] | 100

Have a look at the dataset that has been created by reading it with Xarray, using the sgkit convenience function `sg.load_dataset`.

In [6]:
ds = sg.load_dataset(output)
ds

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| int8 | (57837885, 1142, 2) | (524288, 61, 2) | 132.10 GB | 63.96 MB | 2110 | 2109 |
| bool | (57837885, 1142, 2) | (524288, 61, 2) | 132.10 GB | 63.96 MB | 2110 | 2109 |
| object | (1142,) | (1142,) | 9.14 kB | 9.14 kB | 2 | 1 |
| \|S1 | (57837885, 4) | (4194304, 4) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |


The dataset is chunked in the `samples` dimension (chunk size `61`, out of `1142` samples in total), but the popgen functions don't support chunking in that dimension, so we need to rechunk to a single chunk along `samples`. We can do that with [rechunker](https://rechunker.readthedocs.io/en/latest/), which works directly on Zarr groups.
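
The chunk counts and sizes reported for the genotype arrays can be reproduced with a little arithmetic. The sketch below uses only the standard library; the helper names `chunk_grid` and `nbytes` are made up for illustration, and the "after" chunk shape is the one targeted in the rechunking step.

```python
from math import ceil, prod


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (the last chunk may be partial)."""
    return tuple(ceil(n / c) for n, c in zip(shape, chunks))


def nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array (or chunk) with this shape."""
    return prod(shape) * itemsize


shape = (57837885, 1142, 2)  # (variants, samples, ploidy)
itemsize = 1                 # int8 and bool both take 1 byte per element

before = (524288, 61, 2)     # chunks as imported
after = (524288, 1142, 2)    # a single chunk spanning all samples

grid_before = chunk_grid(shape, before)
grid_after = chunk_grid(shape, after)

print(grid_before, prod(grid_before))  # (111, 19, 1) 2109
print(grid_after, prod(grid_after))    # (111, 1, 1) 111
print(nbytes(before, itemsize))        # 63963136 bytes, about 63.96 MB
print(nbytes(after, itemsize))         # 1197473792 bytes, about 1.20 GB
```

Merging the 19 sample chunks into one cuts the chunk count from 2109 to 111, at the cost of chunks that are roughly 19 times larger in memory.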

In [7]:
import zarr
source_group = zarr.open(str(output))
# One chunk spans all 1142 samples; None keeps an array's existing chunks.
target_chunks = {
    "call_genotype": (524288, 1142, 2), "call_genotype_mask": (524288, 1142, 2),
    "sample_id": None, "variant_allele": None,
    "variant_contig": None, "variant_position": None,
}
max_mem = '2GB'

target_store = str(here() / 'data/sgkit/ag1000g.zarr')
temp_store = str(here() / 'data/sgkit/ag1000g_rechunked_tmp.zarr')
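
Before building the plan, it can be worth checking that one target chunk fits inside the `max_mem` budget, since each genotype chunk now spans all samples. The following is a rough sanity check only: `parse_size` and `chunk_nbytes` are hypothetical helpers, and reading `'2GB'` in decimal units (10**9 bytes per GB) is an assumption of this sketch rather than something taken from rechunker itself.

```python
from math import prod

# Assumption: decimal units, i.e. 1 GB = 10**9 bytes.
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(text):
    """Hypothetical helper: turn a string such as '2GB' into a byte count."""
    # Try longer suffixes first so that 'GB' is matched before plain 'B'.
    for unit in sorted(UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * UNITS[unit])
    raise ValueError(f"unrecognised size: {text!r}")


def chunk_nbytes(chunks, itemsize):
    """Uncompressed size in bytes of a single chunk."""
    return prod(chunks) * itemsize


max_mem_bytes = parse_size("2GB")
target = chunk_nbytes((524288, 1142, 2), itemsize=1)  # int8 or bool genotype chunk
fits = target <= max_mem_bytes

print(target, max_mem_bytes, fits)  # 1197473792 2000000000 True
```

With these numbers a single target chunk takes about 1.2 GB, which leaves some headroom under the 2 GB limit.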

In [8]:
from rechunker import api as rechunker_api
plan = rechunker_api.rechunk(source_group, target_chunks, max_mem, target_store, temp_store=temp_store)

In [9]:
with ProgressBar():
    plan.execute()

[########################################] | 100% Completed |  3min  9.5s


Now when we load the dataset from the rechunked store, each genotype array has a single chunk of `1142` in the `samples` dimension, as we want.

In [10]:
ds = sg.load_dataset(target_store)
ds

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| int8 | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| bool | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
|  | (1142,) | (1142,) | 4.57 kB | 4.57 kB | 2 | 1 |
| \|S1 | (57837885, 4) | (4194304, 4) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
