## LD Prune on GPU

This example compares LD pruning times for a GPU implementation and PLINK on chromosome 1 of the 1000 Genomes (1KG) dataset. A single, non-chunked array is used for now: chunked processing requires overlapping blocks, which will only be practical once there is a good solution to [dask#2403](https://github.com/dask/dask/issues/2403).
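
To illustrate why block overlap matters, consider a pruning window that straddles a chunk boundary: the comparisons it needs involve rows from two chunks. The following is only a minimal sketch under the assumption of fixed-size chunks along the variant axis. The helper name `overlapping_chunks` is hypothetical and is not part of dask or of the library used in this notebook.

```python
def overlapping_chunks(n_variants, chunk_size, overlap):
    """Yield (start, stop) row ranges along the variant axis.

    Each chunk is extended by `overlap` rows into the following chunk so
    that a window starting near the end of a chunk can still see all of
    the rows it needs. The final range is clipped to the array length.
    """
    for start in range(0, n_variants, chunk_size):
        stop = min(start + chunk_size + overlap, n_variants)
        yield start, stop


# Toy sizes chosen for readability, not taken from the benchmark
chunks = list(overlapping_chunks(n_variants=20, chunk_size=8, overlap=3))
print(chunks)  # [(0, 11), (8, 19), (16, 20)]
```

With a single non-chunked array, as used below, no such bookkeeping is needed.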

Earlier [experiments](https://github.com/related-sciences/gwas-analysis/tree/master/notebooks/benchmark/method/ld_prune/tild) with a CPU implementation, JIT-compiled with Numba, showed that PLINK was still substantially faster (at least 10x) at low R² thresholds, which are the most common use case. Encouragingly, the GPU implementation here is just as fast as PLINK with little optimization effort; using shared memory within thread blocks would likely speed it up further.

Note: Tests were run on a single GeForce RTX 2070 and an 8-core Intel Core i7-9800X CPU, which retail for about the same price.

In [1]:
import os
import sys
# Intended to enable assertions in kernels; set before numba is imported,
# since numba reads its environment configuration at import time
os.environ['NUMBA_DEBUGINFO'] = '1'
import numpy as np
import dask.array as da
from numba import cuda
sys.path.append(".")
%run nb/paths.py
from lib import api
from lib.method.ld_prune import tsgpu_backend

In [2]:
path = PLINK_1KG_PATH_01
path

PosixPath('/lab/data/gwas/tutorial/2_PS_GWAS/ALL.2of4intersection.20100804.genotypes')

#### Prep

Load 1KG dataset:

In [3]:
%%time
ds = api.read_plink(path, chunks='auto', fam_sep=' ', bim_sep='\t')
ds

CPU times: user 4min 5s, sys: 20.3 s, total: 4min 25s
Wall time: 1min 49s


Dask array summaries for the dataset variables, in display order:

| Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks | Type |
|---|---|---|---|---|---|---|
| (25488488, 629) | (213382, 629) | 16.03 GB | 134.22 MB | 241 | 120 | int8 |
| (25488488, 629) | (213382, 629) | 16.03 GB | 134.22 MB | 241 | 120 | bool |
| (629,) | (629,) | 5.03 kB | 5.03 kB | 5 | 1 | object |
| (629,) | (629,) | 5.03 kB | 5.03 kB | 5 | 1 | object |
| (629,) | (629,) | 5.03 kB | 5.03 kB | 5 | 1 | int64 |
| (629,) | (629,) | 5.03 kB | 5.03 kB | 5 | 1 | int64 |
| (629,) | (629,) | 5.03 kB | 5.03 kB | 5 | 1 | int64 |
| (629,) | (629,) | 5.03 kB | 5.03 kB | 5 | 1 | int64 |
| (25488488,) | (2607671,) | 203.91 MB | 20.86 MB | 50 | 10 | int64 |
| (25488488,) | (2607671,) | 203.91 MB | 20.86 MB | 50 | 10 | object |
| (25488488,) | (2607671,) | 203.91 MB | 20.86 MB | 50 | 10 | int64 |
| (25488488,) | (2607671,) | 203.91 MB | 20.86 MB | 50 | 10 | int64 |
| (25488488,) | (2607671,) | 203.91 MB | 20.86 MB | 50 | 10 | object |
| (25488488,) | (2607671,) | 203.91 MB | 20.86 MB | 50 | 10 | object |

Filter to chromosome 1 so that the entire array will fit in GPU memory:

In [4]:
x = ds.sel(variant=ds.contig==1).data
x

| Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks | Type |
|---|---|---|---|---|---|---|
| (2001208, 629) | (213382, 629) | 1.26 GB | 134.22 MB | 251 | 10 | int8 |
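
The array sizes reported so far make the memory argument concrete. The short calculation below is a sketch that assumes one byte per genotype call (int8) and an 8 GiB card, which is the memory size of an RTX 2070. It ignores any working memory the pruning kernel itself needs, so the real headroom is smaller than shown.

```python
def int8_array_bytes(n_variants, n_samples):
    """Bytes needed for a dense int8 genotype matrix (1 byte per call)."""
    return n_variants * n_samples


GPU_MEMORY_BYTES = 8 * 1024**3  # assumption: 8 GiB of device memory

full = int8_array_bytes(25_488_488, 629)  # all variants in the dataset
chr1 = int8_array_bytes(2_001_208, 629)   # chromosome 1 only

print(f"all variants: {full / 1e9:.2f} GB, fits: {full < GPU_MEMORY_BYTES}")
# all variants: 16.03 GB, fits: False
print(f"chromosome 1: {chr1 / 1e9:.2f} GB, fits: {chr1 < GPU_MEMORY_BYTES}")
# chromosome 1: 1.26 GB, fits: True
```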


Write the array to disk so that IO is a part of the benchmark:

In [5]:
shape = x.shape
x.data.rechunk(-1)

| Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks | Type |
|---|---|---|---|---|---|---|
| (2001208, 629) | (2001208, 629) | 1.26 GB | 1.26 GB | 252 | 1 | int8 |


In [6]:
x.data.rechunk(-1).to_zarr('/tmp/x.zarr', overwrite=True)
!du -ch /tmp/x.zarr



238M	/tmp/x.zarr
238M	total


#### GPU Time

Benchmark using Marees et al. 2018 PLINK parameters:

In [7]:
%%time
res = tsgpu_backend.ld_prune(
    x = da.from_zarr('/tmp/x.zarr').compute(),
    groups = np.ones(shape[0]),
    positions = None,
    threshold = .2,
    window = 50,
    step = 5,
    scores = None,
    max_distance = None
)
assert len(res) == shape[0]

CPU times: user 6.26 s, sys: 371 ms, total: 6.63 s
Wall time: 6.66 s
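
For readers unfamiliar with the `threshold`, `window`, and `step` parameters, the sketch below shows one simple way that windowed, pairwise R² pruning can work on the CPU. It is not the `tsgpu_backend` implementation and it does not reproduce PLINK exactly: as a simplifying assumption it always drops the later variant of a correlated pair, whereas PLINK chooses which variant to drop based on allele frequency. The names `r2` and `ld_prune_sketch` are hypothetical.

```python
import numpy as np


def r2(a, b):
    """Squared Pearson correlation between two genotype vectors."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    va, vb = a.var(), b.var()
    if va == 0 or vb == 0:
        return 0.0  # a constant variant has no defined correlation; treat as unlinked
    cov = ((a - a.mean()) * (b - b.mean())).mean()
    return cov * cov / (va * vb)


def ld_prune_sketch(x, threshold=0.2, window=50, step=5):
    """Return a boolean keep-mask over the rows (variants) of x."""
    n = x.shape[0]
    keep = np.ones(n, dtype=bool)
    # Slide a window of `window` variants forward by `step` variants at a time
    for start in range(0, n, step):
        stop = min(start + window, n)
        for i in range(start, stop):
            if not keep[i]:
                continue
            for j in range(i + 1, stop):
                if keep[j] and r2(x[i], x[j]) > threshold:
                    keep[j] = False  # simplification: drop the later variant
    return keep


# Toy genotypes: row 1 duplicates row 0, row 2 is uncorrelated with row 0,
# and row 3 is constant across samples
x_toy = np.array([
    [0, 0, 2, 2],
    [0, 0, 2, 2],
    [0, 2, 0, 2],
    [1, 1, 1, 1],
], dtype=np.int8)
keep_toy = ld_prune_sketch(x_toy, threshold=0.2, window=50, step=5)
print(keep_toy.tolist())  # [True, False, True, True]
```

The triple loop is quadratic in the window size for every step, which is why a parallel GPU kernel or PLINK's optimized routines are needed at the scale of two million variants.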


#### PLINK Time

In [8]:
%%time
%%bash -s {str(path.parent)} {str(path.name)}
set -e
cd $1
plink --bfile $2 --chr 1 --indep-pairwise 50 5 0.2 --out /tmp/$2

PLINK v1.90b6.14 64-bit (7 Jan 2020)           www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /tmp/ALL.2of4intersection.20100804.genotypes.log.
Options in effect:
  --bfile ALL.2of4intersection.20100804.genotypes
  --chr 1
  --indep-pairwise 50 5 0.2
  --out /tmp/ALL.2of4intersection.20100804.genotypes

128535 MB RAM detected; reserving 64267 MB for main workspace.
2001208 out of 25488488 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /tmp/ALL.2of4intersection.20100804.genotypes.nosex
.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total gen