### Hail BlockMatrix QC Tests

This experiment measures the performance of variant and sample call rate filtering on 1000 Genomes (1KG) data using the Hail BlockMatrix API instead of MatrixTable.

For this one set of QC operations, BlockMatrix is approximately 3x faster than MatrixTable, but it is still nowhere near as fast as PLINK or Dask.
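
For reference, call rate filtering reduces to row and column means of a 0/1 "is defined" matrix. The NumPy sketch below is an illustration only and not part of the benchmark; the toy matrix and the helper name `call_rates` are made up for the example, with variants on rows and samples on columns as in the Hail cells in this notebook.

```python
import numpy as np

def call_rates(is_defined):
    """Return (variant_call_rate, sample_call_rate) for a variants x samples 0/1 matrix."""
    is_defined = np.asarray(is_defined, dtype=float)
    variant_cr = is_defined.mean(axis=1)  # fraction of samples called, per variant (row)
    sample_cr = is_defined.mean(axis=0)   # fraction of variants called, per sample (column)
    return variant_cr, sample_cr

# Hypothetical toy data: 3 variants x 4 samples, 1 = genotype defined
toy = np.array([
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
])

variant_cr, sample_cr = call_rates(toy)

# Keep variants whose call rate meets an (arbitrary) threshold
keep_variants = variant_cr >= 0.5
filtered = toy[keep_variants]
```

Because each filter changes the denominator for the other axis, applying the variant and sample filters in sequence is not equivalent to applying them simultaneously, which is why the benchmark alternates them.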

In [1]:
import hail as hl
import pandas as pd
import numpy as np
import plotnine as pn
import matplotlib.pyplot as plt
import os.path as osp
%run ../../init/paths.py
data_dir = osp.expanduser('~/data/gwas/tutorial/2_PS_GWAS')
hl.init() 

Running on Apache Spark version 2.4.4
SparkUI available at http://d42c6af5a4e5:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.30-2ae07d872f43
LOGGING: writing to /home/eczech/repos/gwas-analysis/notebooks/tutorial/ext/dask/hail-20200205-2241-0.2.30-2ae07d872f43.log


In [2]:
mt = hl.balding_nichols_model(3, 25, 50)
#mt = hl.read_matrix_table(osp.join(data_dir, PS1_1KG_RAW_FILE + '.mt'))

2020-02-05 22:24:57 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 25 samples, and 50 variants...


In [12]:
hl.linalg.BlockMatrix.default_block_size()

4096

In [24]:
%%time
bm = hl.linalg.BlockMatrix.from_entry_expr(hl.is_defined(mt.GT))

CPU times: user 39.5 ms, sys: 0 ns, total: 39.5 ms
Wall time: 430 ms


2020-02-05 22:39:43 Hail: INFO: Coerced sorted dataset
2020-02-05 22:39:43 Hail: INFO: Wrote all 1 blocks of 50 x 25 matrix with block size 4096.
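
The block count in the log line above follows from the matrix shape and the block size. As a rough sketch, assuming the matrix is tiled into a grid of square blocks with smaller blocks along the trailing edges, the count can be estimated as below; `n_blocks` is a hypothetical helper, not a Hail function.

```python
import math

def n_blocks(n_rows, n_cols, block_size=4096):
    """Estimate the number of blocks in a grid of block_size x block_size tiles.

    Assumes edge blocks are simply truncated, so each axis needs
    ceil(dimension / block_size) blocks.
    """
    return math.ceil(n_rows / block_size) * math.ceil(n_cols / block_size)

# The 50 x 25 simulated matrix fits in a single block
print(n_blocks(50, 25))  # 1

# Shape of the filtered 1KG matrix reported at the end of this notebook
print(n_blocks(8240745, 629))  # 2012
```

With only 629 columns, every block of the 1KG matrix is far narrower than the 4096 default, so a smaller or non-default block size may be worth testing; that is a suggestion, not something measured here.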


In [2]:
bm = hl.linalg.BlockMatrix.read(osp.join(data_dir, PS1_1KG_RAW_FILE + '.is_defined.bm'))

In [3]:
bm.is_sparse

False

In [4]:
bm.element_type

dtype('float64')

In [5]:
%%time

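def filter_by_variant_call_rate(bm, threshold):
    # Rows are variants: call rate = fraction of samples with a defined genotype
    call_rate = (bm.sum(axis=1) / bm.shape[1]).to_numpy().ravel()
    # flatnonzero always returns a 1-D index array, even if only one row passes
    idx = np.flatnonzero(call_rate >= threshold)
    return bm.filter_rows(idx.tolist())

def filter_by_sample_call_rate(bm, threshold):
    # Columns are samples: call rate = fraction of variants with a defined genotype
    call_rate = (bm.sum(axis=0) / bm.shape[0]).to_numpy().ravel()
    idx = np.flatnonzero(call_rate >= threshold)
    return bm.filter_cols(idx.tolist())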

bmf = filter_by_variant_call_rate(bm, .8)
bmf = filter_by_sample_call_rate(bmf, .8)
bmf = filter_by_variant_call_rate(bmf, .98)
bmf = filter_by_sample_call_rate(bmf, .98)
bmf.shape

v 9007422
s 629
v 8240745
s 629
CPU times: user 20.2 s, sys: 2.11 s, total: 22.3 s
Wall time: 6min 57s


(8240745, 629)
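
When deriving row or column indices from a call rate vector, the way the boolean mask is converted to indices matters in edge cases. The NumPy-only sketch below uses a made-up rate vector to show that `np.argwhere(...).squeeze()` collapses to a 0-d array when exactly one entry passes, so `.tolist()` returns a bare int rather than a list, while `np.flatnonzero` always yields a 1-D array.

```python
import numpy as np

# Hypothetical call rates where exactly one entry meets the threshold
rates = np.array([0.5, 0.99, 0.7])
threshold = 0.98

# squeeze() removes every length-1 axis, leaving a 0-d array here
fragile = np.argwhere(rates >= threshold).squeeze()

# flatnonzero returns a 1-D array of indices regardless of how many pass
robust = np.flatnonzero(rates >= threshold)

print(type(fragile.tolist()))  # <class 'int'>
print(robust.tolist())         # [1]
```

This case is unlikely with millions of variants, but it can appear on small simulated inputs such as the `balding_nichols_model` matrix used for testing.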