### CSV Read (cuDF vs Dask vs pandas)

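Example timings for reading a large PLINK `.bim` file (about 25.5 million variants) with cuDF, Dask, and pandas. On this one dataset, `cudf.read_csv` on 1 GPU is roughly 7x faster than pandas and 9.5x faster than Dask on 8 CPU cores (1.86 s vs. 13.2 s and 17.7 s wall time). The gap is larger for `value_counts`, where cuDF is about 75x faster than pandas.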
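The ratios can be recomputed from the `%%time` outputs recorded in this notebook. The snippet below is only a small arithmetic sketch: the wall times are copied from the cells that follow, the helper name `speedups` is illustrative, and the Dask read time is taken from the `.compute()` cell because `dd.read_csv` itself is lazy.

```python
# Wall-clock times in seconds, copied from the %%time outputs in this notebook.
read_wall = {"cudf": 1.86, "pandas": 13.2, "dask": 17.7}
count_wall = {"cudf": 0.0458, "pandas": 3.45, "dask": 20.1}


def speedups(times, baseline="cudf"):
    """Return how many times slower each library is than the baseline."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


read_speedup = speedups(read_wall)
count_speedup = speedups(count_wall)

for name, ratio in read_speedup.items():
    print(f"read_csv: cudf is {ratio:.1f}x faster than {name}")
for name, ratio in count_speedup.items():
    print(f"value_counts: cudf is {ratio:.0f}x faster than {name}")
```

These are single-run timings on one file, so they indicate the order of magnitude and should not be read as a careful benchmark.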

In [1]:
import cudf
import pandas as pd
import dask.dataframe as dd
import dask
import numpy as np
%run ../nb/paths.py
dask.config.set(scheduler='threads')

<dask.config.set at 0x7f67b01a5a10>

In [2]:
path = PLINK_1KG_PATH_01
path

PosixPath('/lab/data/gwas/tutorial/2_PS_GWAS/ALL.2of4intersection.20100804.genotypes')

In [3]:
dfs = {}
names = ['contig', 'variant_id', 'cm_pos', 'pos', 'a1', 'a2']
dtypes = ['str', 'str', 'int', 'int', 'str', 'str']

#### cuDF

In [4]:
%%time
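# https://github.com/pacejohn/RAPIDS-Benchmarks/blob/master/gdf_vs_pdf_benchmarks.ipynb
# Note: .bim files have no header row, so skiprows=1 drops the first variant (rs112750067).
# This is why head() and the 'C' count below differ by one row from the pandas and Dask results.
dfs['cudf'] = cudf.read_csv(str(path) + '.bim', delimiter='\t', names=names, dtype=dtypes, skiprows=1)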



CPU times: user 1.35 s, sys: 513 ms, total: 1.86 s
Wall time: 1.86 s


In [5]:
dfs['cudf'].head()

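|   | contig | variant_id | cm_pos | pos | a1 | a2 |
|---|--------|------------|--------|-----|----|----|
| 0 | 1 | rs117577454 | 0 | 10469 | G | C |
| 1 | 1 | rs55998931 | 0 | 10492 | T | C |
| 2 | 1 | rs58108140 | 0 | 10583 | A | G |
| 3 | 1 | . | 0 | 11508 | A | G |
| 4 | 1 | . | 0 | 11565 | T | G |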


In [6]:
%%time
dfs['cudf']['a1'].value_counts()

CPU times: user 37.3 ms, sys: 10.3 ms, total: 47.6 ms
Wall time: 45.8 ms


A    7170243
T    7169977
G    5579940
C    5568318
0          9
Name: a1, dtype: int32

#### Pandas

In [7]:
%%time
dfs['pandas'] = pd.read_csv(str(path) + '.bim', sep='\t', names=names, dtype=dict(zip(names, dtypes)))

CPU times: user 12 s, sys: 1.14 s, total: 13.2 s
Wall time: 13.2 s


In [8]:
dfs['pandas'].head()

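|   | contig | variant_id | cm_pos | pos | a1 | a2 |
|---|--------|------------|--------|-----|----|----|
| 0 | 1 | rs112750067 | 0 | 10327 | C | T |
| 1 | 1 | rs117577454 | 0 | 10469 | G | C |
| 2 | 1 | rs55998931 | 0 | 10492 | T | C |
| 3 | 1 | rs58108140 | 0 | 10583 | A | G |
| 4 | 1 | . | 0 | 11508 | A | G |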


In [9]:
%%time
dfs['pandas']['a1'].value_counts()

CPU times: user 3.47 s, sys: 22.4 ms, total: 3.49 s
Wall time: 3.45 s


A    7170243
T    7169977
G    5579940
C    5568319
0          9
Name: a1, dtype: int64
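The cuDF and pandas results differ by exactly one row: the cuDF `head()` starts at `rs117577454` and reports 5568318 `C` alleles, while pandas starts at `rs112750067` and reports 5568319. The cuDF call passes `skiprows=1` and the pandas call does not, which would account for this on a headerless file. The sketch below reproduces the effect on a four-row excerpt taken from the `head()` outputs above. It uses only the standard library so it runs without a GPU, and the helper `read_bim` is illustrative, not part of cuDF or pandas.

```python
import csv
import io
from collections import Counter

# Four-row excerpt of the .bim file (tab-separated, no header row),
# taken from the pandas head() output above.
BIM_TEXT = (
    "1\trs112750067\t0\t10327\tC\tT\n"
    "1\trs117577454\t0\t10469\tG\tC\n"
    "1\trs55998931\t0\t10492\tT\tC\n"
    "1\trs58108140\t0\t10583\tA\tG\n"
)
NAMES = ["contig", "variant_id", "cm_pos", "pos", "a1", "a2"]


def read_bim(text, skiprows=0):
    """Parse headerless tab-separated text, dropping the first `skiprows` lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [dict(zip(NAMES, fields)) for fields in reader]
    return rows[skiprows:]


all_rows = read_bim(BIM_TEXT)
skipped = read_bim(BIM_TEXT, skiprows=1)

# Counter plays the role of value_counts() on the a1 column.
a1_all = Counter(row["a1"] for row in all_rows)
a1_skipped = Counter(row["a1"] for row in skipped)

print(all_rows[0]["variant_id"], skipped[0]["variant_id"])
print(a1_all["C"] - a1_skipped["C"])
```

Because the dropped variant has `a1 = C`, skipping it lowers only the `C` count by one, matching the difference between the two `value_counts` outputs.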

#### Dask

In [4]:
%%time
dfs['dask'] = dd.read_csv(str(path) + '.bim', sep='\t', names=names, dtype=dict(zip(names, dtypes)))

CPU times: user 35.2 ms, sys: 3.88 ms, total: 39.1 ms
Wall time: 37 ms




In [5]:
%%time
# Note: with the distributed scheduler the result is sent over TCP, pushing total time to ~5 minutes; use the threaded scheduler instead
df = dfs['dask'].compute()

CPU times: user 32.6 s, sys: 4.02 s, total: 36.7 s
Wall time: 17.7 s


In [6]:
%%time
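# dfs['dask'] is lazy and was never persisted, so this re-reads and re-parses the file before counting
dfs['dask']['a1'].value_counts().compute()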

CPU times: user 36.1 s, sys: 3.14 s, total: 39.3 s
Wall time: 20.1 s


A    7170243
T    7169977
G    5579940
C    5568319
0          9
Name: a1, dtype: int64
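The Dask timings also show how much of the 8-core machine the threaded scheduler kept busy. Dividing total CPU time by wall time gives a rough estimate of the average number of cores in use. The sketch below applies that ratio to the figures recorded above; the helper name `cpu_to_wall` is illustrative.

```python
# (total CPU seconds, wall seconds) copied from the %%time outputs above.
timings = {
    "pandas read_csv": (13.2, 13.2),
    "dask compute": (36.7, 17.7),
    "dask value_counts": (39.3, 20.1),
}


def cpu_to_wall(cpu_total, wall):
    """Rough estimate of the average number of cores kept busy."""
    return cpu_total / wall


ratios = {name: cpu_to_wall(cpu, wall) for name, (cpu, wall) in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: ~{ratio:.1f} cores busy on average")
```

A ratio near 2 for both Dask steps suggests the threaded run used only a fraction of the 8 available cores. One possible explanation, not verified here, is contention on the Python GIL while parsing the string columns.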