# Import haplotype data to sgkit

Convert MalariaGEN data from scikit-allel VCF Zarr format, to sgkit Zarr format. This uses the `vcfzarr_to_zarr` function that has been optimized to avoid high-memory usage. (See https://github.com/pystatgen/sgkit/pull/324 and the linked issues for details.)

We also use the [rechunker](https://rechunker.readthedocs.io/en/latest/) library to rechunk to the desired chunk sizes.

In [1]:
%run setup.ipynb

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from dask.diagnostics import ProgressBar
from sgkit.io.vcfzarr_reader import vcfzarr_to_zarr

In [4]:
input = here() / 'data/external/ag1000g/phase2/AR1/haplotypes/main/zarr/ag1000g.phase2.ar1.haplotypes/'
output = here() / 'data/sgkit/ag1000g_import_haplotypes.zarr'
contigs = ["2R", "2L", "3R", "3L"] # note R is before L; skip X since it has fewer samples (1099 vs 1164)

In [7]:
with ProgressBar():
    vcfzarr_to_zarr(input, output, contigs=contigs, grouped_by_contig=True, consolidated=True)

[########################################] | 100% Completed |  2.8s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed | 35.8s
[########################################] | 100% Completed |  2.1s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed | 26.2s
[########################################] | 100% Completed |  2.5s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed | 31.5s
[########################################] | 100% Completed |  1.8s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed | 24.3s
[########################################] | 100% Completed |  2min  9.0s


Have a look at the dataset that's been created, by reading it with Xarray.

In [5]:
import xarray as xr

In [6]:
ds = xr.open_zarr(str(output), concat_characters=False)
ds

Unnamed: 0,Array,Chunk
Bytes,92.20 GB,62.91 MB
Shape,"(39604636, 1164, 2)","(524288, 60, 2)"
Count,1521 Tasks,1520 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 92.20 GB 62.91 MB Shape (39604636, 1164, 2) (524288, 60, 2) Count 1521 Tasks 1520 Chunks Type int8 numpy.ndarray",2  1164  39604636,

Unnamed: 0,Array,Chunk
Bytes,92.20 GB,62.91 MB
Shape,"(39604636, 1164, 2)","(524288, 60, 2)"
Count,1521 Tasks,1520 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,92.20 GB,62.91 MB
Shape,"(39604636, 1164, 2)","(524288, 60, 2)"
Count,1521 Tasks,1520 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 92.20 GB 62.91 MB Shape (39604636, 1164, 2) (524288, 60, 2) Count 1521 Tasks 1520 Chunks Type bool numpy.ndarray",2  1164  39604636,

Unnamed: 0,Array,Chunk
Bytes,92.20 GB,62.91 MB
Shape,"(39604636, 1164, 2)","(524288, 60, 2)"
Count,1521 Tasks,1520 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,9.31 kB,9.31 kB
Shape,"(1164,)","(1164,)"
Count,2 Tasks,1 Chunks
Type,object,numpy.ndarray
"Array Chunk Bytes 9.31 kB 9.31 kB Shape (1164,) (1164,) Count 2 Tasks 1 Chunks Type object numpy.ndarray",1164  1,

Unnamed: 0,Array,Chunk
Bytes,9.31 kB,9.31 kB
Shape,"(1164,)","(1164,)"
Count,2 Tasks,1 Chunks
Type,object,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,79.21 MB,8.39 MB
Shape,"(39604636, 2)","(4194304, 2)"
Count,11 Tasks,10 Chunks
Type,|S1,numpy.ndarray
"Array Chunk Bytes 79.21 MB 8.39 MB Shape (39604636, 2) (4194304, 2) Count 11 Tasks 10 Chunks Type |S1 numpy.ndarray",2  39604636,

Unnamed: 0,Array,Chunk
Bytes,79.21 MB,8.39 MB
Shape,"(39604636, 2)","(4194304, 2)"
Count,11 Tasks,10 Chunks
Type,|S1,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,158.42 MB,16.78 MB
Shape,"(39604636,)","(4194304,)"
Count,11 Tasks,10 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 158.42 MB 16.78 MB Shape (39604636,) (4194304,) Count 11 Tasks 10 Chunks Type int32 numpy.ndarray",39604636  1,

Unnamed: 0,Array,Chunk
Bytes,158.42 MB,16.78 MB
Shape,"(39604636,)","(4194304,)"
Count,11 Tasks,10 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,158.42 MB,16.78 MB
Shape,"(39604636,)","(4194304,)"
Count,11 Tasks,10 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 158.42 MB 16.78 MB Shape (39604636,) (4194304,) Count 11 Tasks 10 Chunks Type int32 numpy.ndarray",39604636  1,

Unnamed: 0,Array,Chunk
Bytes,158.42 MB,16.78 MB
Shape,"(39604636,)","(4194304,)"
Count,11 Tasks,10 Chunks
Type,int32,numpy.ndarray


The dataset is chunked, in the `samples` dimension (chunk size is `60`, total number of samples is `1142`), but the popgen functions don't support chunking in that dimension, so we need to rechunk to have a single chunk in that dimension. We can do that using [rechunker](https://rechunker.readthedocs.io/en/latest/) working directly on Zarr groups.

In [7]:
source_group = zarr.open(str(output))
target_chunks = {"call_genotype": (524288, 1142, 2), "call_genotype_mask": (524288, 1142, 2), "sample_id": None, "variant_allele": None, "variant_contig": None, "variant_position": None}
max_mem = '2GB'

target_store = str(here() / 'data/sgkit/ag1000g_haplotypes.zarr')
temp_store = str(here() / 'data/sgkit/ag1000g_haplotypes_rechunked_tmp.zarr')

In [8]:
from rechunker import api as rechunker_api
plan = rechunker_api.rechunk(source_group, target_chunks, max_mem, target_store)

In [9]:
with ProgressBar():
    plan.execute()

[########################################] | 100% Completed |  3min 34.1s


Now when we look at the dataset it is chunked as we want it.

In [10]:
ds = xr.open_zarr(target_store, concat_characters=False)
ds

Unnamed: 0,Array,Chunk
Bytes,92.20 GB,1.20 GB
Shape,"(39604636, 1164, 2)","(524288, 1142, 2)"
Count,153 Tasks,152 Chunks
Type,int8,numpy.ndarray
"Array Chunk Bytes 92.20 GB 1.20 GB Shape (39604636, 1164, 2) (524288, 1142, 2) Count 153 Tasks 152 Chunks Type int8 numpy.ndarray",2  1164  39604636,

Unnamed: 0,Array,Chunk
Bytes,92.20 GB,1.20 GB
Shape,"(39604636, 1164, 2)","(524288, 1142, 2)"
Count,153 Tasks,152 Chunks
Type,int8,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,92.20 GB,1.20 GB
Shape,"(39604636, 1164, 2)","(524288, 1142, 2)"
Count,153 Tasks,152 Chunks
Type,bool,numpy.ndarray
"Array Chunk Bytes 92.20 GB 1.20 GB Shape (39604636, 1164, 2) (524288, 1142, 2) Count 153 Tasks 152 Chunks Type bool numpy.ndarray",2  1164  39604636,

Unnamed: 0,Array,Chunk
Bytes,92.20 GB,1.20 GB
Shape,"(39604636, 1164, 2)","(524288, 1142, 2)"
Count,153 Tasks,152 Chunks
Type,bool,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,37.25 kB,37.25 kB
Shape,"(1164,)","(1164,)"
Count,2 Tasks,1 Chunks
Type,numpy.ndarray,
"Array Chunk Bytes 37.25 kB 37.25 kB Shape (1164,) (1164,) Count 2 Tasks 1 Chunks Type numpy.ndarray",1164  1,

Unnamed: 0,Array,Chunk
Bytes,37.25 kB,37.25 kB
Shape,"(1164,)","(1164,)"
Count,2 Tasks,1 Chunks
Type,numpy.ndarray,

Unnamed: 0,Array,Chunk
Bytes,79.21 MB,8.39 MB
Shape,"(39604636, 2)","(4194304, 2)"
Count,11 Tasks,10 Chunks
Type,|S1,numpy.ndarray
"Array Chunk Bytes 79.21 MB 8.39 MB Shape (39604636, 2) (4194304, 2) Count 11 Tasks 10 Chunks Type |S1 numpy.ndarray",2  39604636,

Unnamed: 0,Array,Chunk
Bytes,79.21 MB,8.39 MB
Shape,"(39604636, 2)","(4194304, 2)"
Count,11 Tasks,10 Chunks
Type,|S1,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,158.42 MB,16.78 MB
Shape,"(39604636,)","(4194304,)"
Count,11 Tasks,10 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 158.42 MB 16.78 MB Shape (39604636,) (4194304,) Count 11 Tasks 10 Chunks Type int32 numpy.ndarray",39604636  1,

Unnamed: 0,Array,Chunk
Bytes,158.42 MB,16.78 MB
Shape,"(39604636,)","(4194304,)"
Count,11 Tasks,10 Chunks
Type,int32,numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,158.42 MB,16.78 MB
Shape,"(39604636,)","(4194304,)"
Count,11 Tasks,10 Chunks
Type,int32,numpy.ndarray
"Array Chunk Bytes 158.42 MB 16.78 MB Shape (39604636,) (4194304,) Count 11 Tasks 10 Chunks Type int32 numpy.ndarray",39604636  1,

Unnamed: 0,Array,Chunk
Bytes,158.42 MB,16.78 MB
Shape,"(39604636,)","(4194304,)"
Count,11 Tasks,10 Chunks
Type,int32,numpy.ndarray
