## LD Prune PLINK/Scikit-allel Comparison

Compare LD pruning results from three implementations (the prototype in `lib.api`, scikit-allel, and PLINK), making sure that the input dataset satisfies the following:

- No missing values
- Only founders
- No zero-variance rows/cols
- Unique variant identifiers
- Equal MAF scores

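*Conclusion*: The prototype matches scikit-allel exactly (provided scikit-allel processes all variants in a single block), but agreement with PLINK is only approximate: 387,957 variants are dropped by both, 16,026 only by the prototype, and 15,450 only by PLINK. The list of preconditions above is apparently missing something.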

For reference, see the PLINK source at https://github.com/chrchang/plink-ng/blob/master/1.9/plink_ld.c

In [1]:
from lib import api
import numpy as np
import pandas as pd
import dask
#dask.config.set(scheduler='single-threaded')
%run ../nb/paths.py

In [2]:
path = PLINK_1KG_PATH_02
path

PosixPath('/lab/data/gwas/tutorial/2_PS_GWAS/1kG_MDS5')

#### Prep

Load 1KG dataset:

In [3]:
%%time
ds = api.read_plink(path, chunks='auto', fam_sep=' ', bim_sep='\t')
ds = ds.sel(variant=ds.contig==1)
ds

CPU times: user 41.9 s, sys: 4.8 s, total: 46.7 s
Wall time: 29.3 s


Dimensions,Array shape,Chunk shape,Array bytes,Chunk bytes,Tasks,Chunks,Dtypes
"(variant, sample)","(463525, 629)","(213382, 629)",291.56 MB,134.22 MB,60,3,"int8 (1 array), bool (1 array)"
"(sample,)","(629,)","(629,)",5.03 kB,5.03 kB,5,1,"int64 (5 arrays), object (1 array)"
"(variant,)","(463525,)","(463525,)",3.71 MB,3.71 MB,21,1,"int64 (3 arrays), object (3 arrays)"

In [53]:
# Make sure only founders are present, because PLINK only considers founders for pruning
assert np.all(ds.mat_id.values == ds.pat_id.values)
assert np.all(ds.mat_id.values == ds.sample_id.values)

In [11]:
# Ensure that no data is missing in this case
# (there are missing values, but not on chr 1)
assert ds.data.data.min().compute() == 0
assert np.all(~ds.is_masked.data).compute()

In [60]:
# Ensure that all rows/cols have non-zero variance
assert ds.data.std(dim='sample').min().data.compute() > 0
assert ds.data.std(dim='variant').min().data.compute() > 0

In [7]:
# Ensure that variant identifiers are unique since they
# will be used for comparison of results
variant_ids = ds.variant_id.to_pandas()
assert variant_ids.is_unique
variant_ids.head()

variant
0    1:10583:G:A
1    1:11508:G:A
2    1:15820:G:T
3    1:16257:G:C
4    1:16378:C:T
dtype: object

In [4]:
window, step, threshold = 50, 5, 0.2

In [14]:
api.config.set('stats.axis_intervals.backend', 'numba')
api.config.set('stats.ld_matrix.backend', 'dask')
api.config.set('graph.maximal_independent_set.backend', 'dask')

In [9]:
%%time
pruned_variants = api.ld_prune(ds, window=window, step=step, threshold=threshold)

CPU times: user 1.19 s, sys: 180 ms, total: 1.37 s
Wall time: 1.37 s


In [10]:
pruned_variants

,index_to_drop
npartitions=1,
,int32
,...


In [13]:
idx_to_drop = pruned_variants.index_to_drop.compute()
idx_to_drop.shape

(403983,)
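
To make the roles of `window`, `step` and `threshold` concrete, here is a minimal sketch of one common greedy windowed pruning scheme. It is not the code behind `api.ld_prune`, scikit-allel, or PLINK: the function name `prune_sketch`, the choice to drop the later variant of a correlated pair, and the tiny genotype array are all assumptions made for illustration. Here `size` is the number of variants per window.

```python
import numpy as np

def prune_sketch(gn, size, step, threshold):
    """Greedy windowed LD pruning on an (n_variants, n_samples) array.

    Illustrative only. Returns a boolean mask that is True for kept variants.
    """
    gn = np.asarray(gn, dtype=float)
    n_variants = gn.shape[0]
    keep = np.ones(n_variants, dtype=bool)
    # Slide a window of `size` variants forward by `step` variants at a time
    for start in range(0, n_variants, step):
        stop = min(start + size, n_variants)
        for i in range(start, stop):
            if not keep[i]:
                continue
            for j in range(i + 1, stop):
                if not keep[j]:
                    continue
                r = np.corrcoef(gn[i], gn[j])[0, 1]
                if r * r > threshold:
                    # Assumption of this sketch: drop the later variant
                    keep[j] = False
    return keep

# Hypothetical genotypes: 4 variants x 6 samples, coded 0/1/2
gn = np.array([
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to row 0
    [0, 0, 0, 2, 2, 2],  # uncorrelated with row 0
    [2, 1, 0, 2, 1, 0],  # mirror image of row 0
])
keep = prune_sketch(gn, size=4, step=1, threshold=0.2)
print(keep)
```

Rows 1 and 3 are perfectly correlated (in absolute value) with row 0 and are dropped, while row 2 is kept. The check on non-zero variance earlier matters here because the correlation is undefined for a constant row.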

### Scikit-allel

In [15]:
import allel

In [17]:
# Results differ when the block length (blen) is smaller than the number of variants (a non-unit step breaks the pairwise cycle)
m = allel.locate_unlinked(ds.data.compute(), size=window+1, step=step, threshold=threshold)
np.unique(m, return_counts=True)

(array([False,  True]), array([403963,  59562]))

In [19]:
# Need to use single block for validation
m = allel.locate_unlinked(ds.data.compute(), size=window+1, step=step, threshold=threshold, blen=ds.dims['variant'])
np.unique(m, return_counts=True)

(array([False,  True]), array([403983,  59542]))

In [47]:
idx_to_drop.values

array([     7,     10,     11, ..., 463517, 463518, 463521])

In [48]:
np.argwhere(~m).squeeze()

array([     7,     10,     11, ..., 463517, 463518, 463521])

In [50]:
# Check prototype implementation results for exact equality w/ scikit-allel
assert np.array_equal(np.argwhere(~m).squeeze(), idx_to_drop.values)

In [54]:
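# With a block length below the number of variants (blen=100001) the counts change again, so multi-block results cannot be used for validation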
m = allel.locate_unlinked(ds.data.compute(), size=window+1, step=step, threshold=threshold, blen=100001)
np.unique(m, return_counts=True)

(array([False,  True]), array([403943,  59582]))
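
One plausible reading of the note about non-unit steps, offered as an assumption rather than something checked against the scikit-allel source: if each block is scanned on its own and window starts are counted from the first variant of the block, then the grid of window starts can shift relative to a single pass, and a different set of variant pairs ends up sharing a window. The helper `compared_pairs` below is hypothetical and only enumerates which index pairs fall in a common window.

```python
def compared_pairs(n_variants, size, step, start=0):
    """Index pairs (i, j), i < j, that share at least one window.

    Windows begin at start, start + step, ... and span `size` variants.
    """
    pairs = set()
    for w in range(start, n_variants, step):
        stop = min(w + size, n_variants)
        for i in range(w, stop):
            for j in range(i + 1, stop):
                pairs.add((i, j))
    return pairs

size, step, n_variants = 3, 2, 7
one_pass = compared_pairs(n_variants, size, step)            # window starts: 0, 2, 4, 6
restarted = compared_pairs(n_variants, size, step, start=3)  # window starts: 3, 5

print((3, 5) in one_pass)
print((3, 5) in restarted)
```

In this toy setup the pair (3, 5) never shares a window in the single pass but does once the grid restarts at variant 3. With `step=1` every variant is a window start, so restarting the grid cannot introduce new pairs.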

### PLINK

In [29]:
%%time
%%bash -s {str(path.parent)} {str(path.name)}
set -e
cd $1
# Show what a `freq` report looks like
plink --bfile $2 --chr 1 --freq --out /tmp/$2
head /tmp/$2.frq

PLINK v1.90b6.14 64-bit (7 Jan 2020)           www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /tmp/1kG_MDS5.log.
Options in effect:
  --bfile 1kG_MDS5
  --chr 1
  --freq
  --out /tmp/1kG_MDS5

128535 MB RAM detected; reserving 64267 MB for main workspace.
463525 out of 5808310 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /tmp/1kG_MDS5.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to /tmp/1kG_MDS5.frq .
 CHR              SNP   A1   A2          MAF  NCHROBS
   1      1:10583:G:A    A

In [20]:
# Write a `--freq`-style report with the same dummy MAF for every variant
# so that MAF differences cannot influence PLINK's pruning
freq_path = '/tmp/' + path.name + '.xarray.frq'
ds[['contig', 'variant_id', 'a1', 'a2']].to_dataframe()\
    .rename(columns={'contig': 'CHR', 'variant_id': 'SNP', 'a1': 'A1', 'a2': 'A2'})\
    .assign(MAF=0.1, NCHROBS=ds.dims['sample']*2)\
    .to_csv(freq_path, index=False, sep='\t')
!head $freq_path

CHR	SNP	A1	A2	MAF	NCHROBS
1	1:10583:G:A	A	G	0.1	1258
1	1:11508:G:A	A	G	0.1	1258
1	1:15820:G:T	T	G	0.1	1258
1	1:16257:G:C	C	G	0.1	1258
1	1:16378:C:T	T	C	0.1	1258
1	1:30860:G:C	C	G	0.1	1258
1	1:30923:T:G	G	T	0.1	1258
1	1:40261:C:A	A	C	0.1	1258
1	1:49298:C:T	T	C	0.1	1258
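
For context, the `MAF` and `NCHROBS` columns that are filled with dummy values here would normally come from allele counts. A minimal sketch, assuming genotypes are coded as 0/1/2 counts of one allele with no missing calls (the small array is hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical genotypes: 2 variants x 4 samples, coded 0/1/2
gn = np.array([
    [0, 1, 2, 0],
    [2, 2, 1, 2],
])

n_chr_obs = 2 * gn.shape[1]          # two chromosomes observed per sample
freq = gn.sum(axis=1) / n_chr_obs    # frequency of the counted allele
maf = np.minimum(freq, 1 - freq)     # minor allele frequency

print(n_chr_obs, maf)
```

Because the minor allele frequency is the smaller of the two allele frequencies, the result does not depend on which allele the 0/1/2 coding counts.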


In [44]:
%%time
%%bash -s {str(path.parent)} {str(path.name)}
set -e
cd $1
plink --bfile $2 --chr 1 --indep-pairwise 51 5 0.2 --read-freq /tmp/$2.xarray.frq --out /tmp/$2

PLINK v1.90b6.14 64-bit (7 Jan 2020)           www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /tmp/1kG_MDS5.log.
Options in effect:
  --bfile 1kG_MDS5
  --chr 1
  --indep-pairwise 51 5 0.2
  --out /tmp/1kG_MDS5
  --read-freq /tmp/1kG_MDS5.xarray.frq

128535 MB RAM detected; reserving 64267 MB for main workspace.
463525 out of 5808310 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /tmp/1kG_MDS5.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
--read-freq: .frq file loaded.
463525 variants and 629 people pass filters and QC.
Note: No phen

Compare differences:

In [23]:
!wc -l /tmp/1kG_MDS5.prune.out

403407 /tmp/1kG_MDS5.prune.out


In [26]:
df1 = pd.read_csv('/tmp/1kG_MDS5.prune.out', names=['variant_id'])
df2 = (
    ds[['variant_id']]
    .sel(variant=ds.variant.isin(idx_to_drop))
    .to_dataframe()
    .reset_index(drop=True)
)

In [91]:
df1.head()

,variant_id
0,1:40261:C:A
1,1:51803:T:C
2,1:52238:G:T
3,1:54490:G:A
4,1:54676:C:T


In [27]:
df2.head()

,variant_id
0,1:40261:C:A
1,1:51803:T:C
2,1:52238:G:T
3,1:54490:G:A
4,1:54676:C:T


In [30]:
dfm = pd.merge(df2, df1, how='outer', indicator='status')

In [31]:
dfm['status'].value_counts()

both          387957
left_only      16026
right_only     15450
Name: status, dtype: int64
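
To put a number on how close the two drop lists are, the counts above can be turned into overlap fractions and a Jaccard index. This is only arithmetic on the values printed by `value_counts`; the variable names are made up for the sketch, and `left_only` is read as "prototype only" because `df2` is the left table in the merge.

```python
# Counts copied from the `status` value counts above
both, prototype_only, plink_only = 387957, 16026, 15450

n_prototype = both + prototype_only  # variants dropped by the prototype
n_plink = both + plink_only          # variants dropped by PLINK
jaccard = both / (both + prototype_only + plink_only)

print(f"prototype drops: {n_prototype}, PLINK drops: {n_plink}")
print(f"share of prototype drops also dropped by PLINK: {both / n_prototype:.4f}")
print(f"share of PLINK drops also dropped by the prototype: {both / n_plink:.4f}")
print(f"Jaccard index: {jaccard:.4f}")
```

The two totals agree with the 403,983 indexes returned by the prototype and the 403,407 lines in `prune.out`, which is a useful sanity check on the merge itself.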

In [32]:
dfm.head()

,variant_id,status
0,1:40261:C:A,both
1,1:51803:T:C,both
2,1:52238:G:T,both
3,1:54490:G:A,both
4,1:54676:C:T,both


In [33]:
df1.loc[17:19]

,variant_id
17,1:86331:A:G
18,1:87190:G:A
19,1:88338:G:A


In [34]:
df2.loc[17:19]

,variant_id
17,1:86331:A:G
18,1:88169:C:T
19,1:88338:G:A
