# PBS with sgkit

This notebook runs a PBS (population branch statistic) scan using sgkit, reproducing the scikit-allel scan in `pbs_scans.ipynb`.

You need to have run `sgkit_import.ipynb` first to convert the data into sgkit format.

In [1]:
%run setup.ipynb

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from dask.diagnostics import ProgressBar
import numpy as np
import sgkit as sg
import xarray as xr

First, let's inspect the input data. Note that it has a single chunk in the `samples` dimension, which is a requirement for running the popgen analyses.

In [4]:
ds = xr.open_zarr(str(here() / 'data/sgkit/ag1000g.zarr'), concat_characters=False)
ds

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| int8 | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| bool | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| (not shown) | (1142,) | (1142,) | 4.57 kB | 4.57 kB | 2 | 1 |
| `\|S1` | (57837885, 4) | (4194304, 4) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |

## Cohorts

We need to divide the samples into separate cohorts, which we get from the `pop_defs` YAML:

In [5]:
cohort_ids = list(pop_defs.keys())
cohort_ids

['ao_col',
 'bf_col',
 'bf_gam',
 'ci_col',
 'cm_sav_gam',
 'fr_gam',
 'ga_gam',
 'gh_col',
 'gh_gam',
 'gm',
 'gn_gam',
 'gq_gam',
 'gw',
 'ke',
 'ug_gam']

In [6]:
ds["cohort_id"] = xr.DataArray(cohort_ids, dims="cohorts")
ds

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| int8 | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| bool | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| (not shown) | (1142,) | (1142,) | 4.57 kB | 4.57 kB | 2 | 1 |
| `\|S1` | (57837885, 4) | (4194304, 4) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |

Sample metadata is in the `df_samples` dataframe, so we can use that to produce a mapping from sample to cohort:

In [7]:
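# Start with -1 (no cohort) for every sample.
sample_cohorts = np.full_like(ds.sample_id.values, -1, dtype=np.int8)
for i, pop in enumerate(cohort_ids):
    # Adapt the query text: 'region' becomes 'location', and two place names lose their hyphens.
    pop_query = (
        pop_defs[pop]['query']
        .replace('region', 'location')
        .replace('Gado-Badzere', 'Gado Badzere')
        .replace('Zembe-Borongo', 'Zembe Borongo')
    )
    # Index labels are used as positions here, so this relies on df_samples having
    # a default 0..n-1 index in the same order as ds.sample_id.
    loc_pop = df_samples.query(pop_query).index.values
    sample_cohorts[loc_pop] = i
sample_cohorts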

array([7, 7, 7, ..., 3, 3, 3], dtype=int8)
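The same fill-then-assign pattern can be tried on a toy example. This is a minimal sketch with made-up data: the names `sample_population`, `toy_cohort_ids`, and `toy_sample_cohorts` are hypothetical, and a boolean mask stands in for the `df_samples.query(...)` call used above.

```python
import numpy as np

# Hypothetical toy metadata: one population label per sample.
sample_population = np.array(["gh_col", "ke", "unassigned", "gh_col", "ke"])
toy_cohort_ids = ["gh_col", "ke"]

# Start with -1 (no cohort) for every sample.
toy_sample_cohorts = np.full(sample_population.shape, -1, dtype=np.int8)

# Assign each cohort's integer index to the samples that belong to it.
for i, cohort in enumerate(toy_cohort_ids):
    toy_sample_cohorts[sample_population == cohort] = i

print(toy_sample_cohorts)
```

Samples that match no cohort keep the value -1, which is the convention the rest of the notebook relies on.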

Add `sample_cohort` to the dataset:

In [8]:
ds["sample_cohort"] = xr.DataArray(sample_cohorts, dims="samples")
ds

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| int8 | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| bool | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| (not shown) | (1142,) | (1142,) | 4.57 kB | 4.57 kB | 2 | 1 |
| `\|S1` | (57837885, 4) | (4194304, 4) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |

Some samples are not in any of the named cohorts, and have -1 in the `sample_cohort` variable. These are ignored in cohort allele counts.

## Count cohort alleles

Rather than computing PBS directly, we compute the cohort allele counts as a separate step. This is a fairly expensive computation, so we save the result to disk to avoid repeating it.

In [9]:
cac = sg.count_cohort_alleles(ds, merge=False)
cac

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| int32 | (57837885, 15, 4) | (524288, 15, 4) | 13.88 GB | 125.83 MB | 337 | 111 |

In [12]:
cac_zarr_path = (here() / 'data/sgkit/ag1000g_cohort_allele_count.zarr')
if not cac_zarr_path.exists():
    with ProgressBar():
        cac.to_zarr(str(cac_zarr_path))

[########################################] | 100% Completed | 15min  9.2s


The technique used here computes a new variable in a new Dataset (via `merge=False`) and then saves that to disk, effectively checkpointing the computation. We can then load the new variable and combine it with the original dataset, as follows:

In [10]:
cac = xr.open_zarr(str(cac_zarr_path), concat_characters=False)
ds2 = xr.merge([ds, cac])
ds2

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| int8 | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| bool | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| (not shown) | (1142,) | (1142,) | 4.57 kB | 4.57 kB | 2 | 1 |
| `\|S1` | (57837885, 4) | (4194304, 4) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885, 15, 4) | (524288, 15, 4) | 13.88 GB | 125.83 MB | 112 | 111 |

## Windowing

To compute popgen stats we need to set up windows along the genome. For PBS we use contiguous (non-overlapping) windows of 200 variants each.

In [11]:
ds2 = sg.window(ds2, size=200, step=200)
ds2

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| int8 | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| bool | (57837885, 1142, 2) | (524288, 1142, 2) | 132.10 GB | 1.20 GB | 112 | 111 |
| (not shown) | (1142,) | (1142,) | 4.57 kB | 4.57 kB | 2 | 1 |
| `\|S1` | (57837885, 4) | (4194304, 4) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885,) | (4194304,) | 231.35 MB | 16.78 MB | 15 | 14 |
| int32 | (57837885, 15, 4) | (524288, 15, 4) | 13.88 GB | 125.83 MB | 112 | 111 |

## PBS

We are now in a position to calculate the PBS statistic. The following computes the statistic for all cohort triples. (For large numbers of cohorts it would be more efficient to state which subset of triples to compute, but for 15 cohorts it is feasible to compute all of them.)

In [12]:
pbs = sg.pbs(ds2, merge=False)
pbs

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| float64 | (289190, 15, 15, 15) | (2622, 15, 15, 15) | 7.81 GB | 70.79 MB | 1786 | 111 |

In [13]:
with ProgressBar():
    pbs = pbs.chunk({"windows": 65536}) # rechunk to uniform window sizes so we can save to zarr
    pbs.to_zarr(str(here() / 'data/sgkit/ag1000g_pbs.zarr'), mode="w")

[########################################] | 100% Completed |  4min 20.1s
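The rechunking step is needed because Zarr expects every chunk along a dimension to have the same size, except possibly the last. The sketch below shows why the window chunks are not uniform to begin with. It assumes each window is assigned to the variant chunk in which it starts, which is an assumption about how the windowed computation is laid out; the helper `windows_per_chunk` is hypothetical.

```python
def windows_per_chunk(n_variants, chunk_size, step):
    """Count the windows that start inside each variant chunk."""
    counts = []
    for start in range(0, n_variants, chunk_size):
        stop = min(start + chunk_size, n_variants)
        # Ceiling division gives the index of the first window
        # starting at or after a given variant offset.
        first = -(-start // step)
        last = -(-stop // step)
        counts.append(last - first)
    return counts

sizes = windows_per_chunk(57837885, chunk_size=524288, step=200)
print(len(sizes), sizes[:3], sizes[-1])
```

Since 524288 is not a multiple of 200, the counts alternate between neighbouring values, so the `windows` dimension has to be rechunked before writing.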


In [14]:
pbs = xr.open_zarr(str(here() / 'data/sgkit/ag1000g_pbs.zarr'), concat_characters=False)
pbs = pbs.assign_coords({"cohorts_0": list(pop_defs), "cohorts_1": list(pop_defs), "cohorts_2": list(pop_defs)})
pbs

| Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| float64 | (289190, 15, 15, 15) | (65536, 15, 15, 15) | 7.81 GB | 1.77 GB | 6 | 5 |

Have a look at the PBS values for a given cohort triple:

In [15]:
pbs["stat_pbs"].sel(cohorts_0="ao_col", cohorts_1="ga_gam", cohorts_2="gw")[:100].values

array([ 0.02035689,  0.04731605, -0.01172031, -0.03537765, -0.02032909,
       -0.0333362 , -0.03767379, -0.02111455,  0.05980914,  0.11571038,
        0.01851218, -0.04264947,  0.05029451, -0.04179188,  0.19336811,
       -0.02979891,  0.03450247,  0.01112275, -0.00036684, -0.0182281 ,
       -0.00058695, -0.01986418,  0.00841146,  0.04060411, -0.03413758,
        0.01322737,  0.06647295,  0.03491832,  0.02169768,  0.00198195,
        0.01425712, -0.03852   , -0.01723636,  0.09697634, -0.03494845,
        0.03890285, -0.02748769, -0.00588355,  0.00218827,  0.02507205,
        0.08253355,  0.01861447,  0.05695996,  0.01140143,  0.01135529,
       -0.01142703,  0.06941356, -0.03485373, -0.01417671,  0.04527451,
        0.06298338, -0.04952757,  0.08196584,  0.01478386,  0.00591458,
        0.13480932,  0.02203656,  0.10433447,  0.05936221, -0.0513408 ,
        0.15805215,  0.08118062,  0.01150256, -0.02030459,  0.07340336,
        0.00170747,  0.03243914,  0.14615229,  0.07863543, -0.04