# DaskSQL vs. Apache Spark 

This notebook runs one of our popular workloads with [DaskSQL](https://docs.dask.org/en/latest/install.html) + [RAPIDS AI](https://rapids.ai), then repeats the entire ETL phase with Apache Spark + PySpark.

In this notebook, we will cover: 
- How to read and query CSV files with cuDF and DaskSQL.
- How DaskSQL compares against Apache Spark (analyzing over 20M records).

In [1]:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

In [5]:
# create a local CUDA cluster
cluster = LocalCUDACluster()
client = Client(cluster)
client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 40293 instead
distributed.diskutils - INFO - Found stale lock file and directory '/home/u00ubc1kg5n2YAppSj357/notebooks-contrib/community_tutorials_and_guides/blazingsql/dask-worker-space/worker-t_s2rl0u', purging

Connection method: Cluster object
Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:40293/status
Comm: tcp://127.0.0.1:39571
Workers: 8
Total threads: 8
Total memory: 503.79 GiB
Status: running
Using processes: True
Started: Just now

Per-worker details (each worker has Total threads: 1, Memory: 62.97 GiB, GPU: Tesla V100-SXM2-32GB-LS, GPU memory: 31.75 GiB; local directories are under /home/u00ubc1kg5n2YAppSj357/notebooks-contrib/community_tutorials_and_guides/blazingsql/dask-worker-space/):

Comm,Dashboard,Nanny,Local directory
tcp://127.0.0.1:46699,http://127.0.0.1:40649/status,tcp://127.0.0.1:37017,worker-8jesk9y3
tcp://127.0.0.1:43599,http://127.0.0.1:38079/status,tcp://127.0.0.1:42257,worker-11sovjrb
tcp://127.0.0.1:36577,http://127.0.0.1:45925/status,tcp://127.0.0.1:41893,worker-k03mnnf3
tcp://127.0.0.1:37565,http://127.0.0.1:38229/status,tcp://127.0.0.1:45707,worker-6g_hgn97
tcp://127.0.0.1:35057,http://127.0.0.1:36401/status,tcp://127.0.0.1:40023,worker-t7ywgrl0
tcp://127.0.0.1:38833,http://127.0.0.1:42301/status,tcp://127.0.0.1:42535,worker-ym8u4y3z
tcp://127.0.0.1:43553,http://127.0.0.1:46099/status,tcp://127.0.0.1:36609,worker-n03x84y_
tcp://127.0.0.1:45385,http://127.0.0.1:44847/status,tcp://127.0.0.1:45395,worker-563ts5de


In [3]:
# cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True, enable_nvlink=True, jit_unspill=False)
# client = Client(cluster)
# client

#### Download Data
This cell will check if you have the data for this demo, and, if you don't, will download it for you.

In [6]:
import os
# relative path to data folder
data_dir = '../../data/blazingsql/'
# file name
fn = 'nf-chunk2.csv'

# does folder exist?
if not os.path.exists(data_dir):
    print('creating blazingsql directory')
    # create folder (and any missing parent folders)
    os.makedirs(data_dir, exist_ok=True)

# do we have the netflow file?
if not os.path.isfile(data_dir + fn):
    # save nf-chunk2 to data folder, may take a few minutes to download (21,526,138 records)
    !wget -P ../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv
else:
    print("You've got the data!")

You've got the data!
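
The same check-then-download logic can also be sketched with only the standard library. This is a minimal sketch, not the notebook's method: `ensure_dataset` and its `fetch` parameter are hypothetical names, and `urllib.request.urlretrieve` stands in for the `wget` call. The directory, file name, and URL are the ones used in the cell above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Values taken from the download cell above
DATA_DIR = Path("../../data/blazingsql")
FILE_NAME = "nf-chunk2.csv"
URL = "https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv"


def ensure_dataset(data_dir, file_name, url, fetch=urlretrieve):
    """Return the local path to the file, downloading it only if it is missing.

    `ensure_dataset` and `fetch` are hypothetical names used for illustration;
    `fetch(url, destination)` defaults to urllib's urlretrieve.
    """
    data_dir = Path(data_dir)
    # parents=True also creates missing parent folders such as ../../data
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / file_name
    if not target.is_file():
        fetch(url, str(target))
    return target
```

Calling `ensure_dataset(DATA_DIR, FILE_NAME, URL)` would download the file on first use and return immediately on later runs. Passing a custom `fetch` makes the helper easy to exercise without network access.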


## Create DaskContext
You can think of the DaskSQL `Context` much like a Spark context: it is where the tables you create and register are stored, so that SQL queries can refer to them by name.

In [7]:
import cudf
import dask_cudf
from dask_sql import Context
# start up DaskSQL
dc = Context()

## DaskSQL + cuDF 
With the data in hand, we can test the performance of cuDF and DaskSQL on this dataset.

In [8]:
%%time
# Load CSVs into GPU DataFrames (GDF)
netflow_gdf = cudf.read_csv(data_dir + fn)

MemoryError: std::bad_alloc: CUDA error at: /home/u00ubc1kg5n2YAppSj357/miniconda3/envs/rapids-dask-sql/include/rmm/mr/device/cuda_memory_resource.hpp:71: cudaErrorMemoryAllocation out of memory

In [11]:
netflow_gdf.dtypes


TimeSeconds                  float64
parsedDate                    object
dateTimeStr                  float64
ipLayerProtocol                int64
ipLayerProtocolCode           object
firstSeenSrcIp                object
firstSeenDestIp               object
firstSeenSrcPort               int64
firstSeenDestPort              int64
moreFragments                  int64
contFragments                  int64
durationSeconds                int64
firstSeenSrcPayloadBytes       int64
firstSeenDestPayloadBytes      int64
firstSeenSrcTotalBytes         int64
firstSeenDestTotalBytes        int64
firstSeenSrcPacketCount        int64
firstSeenDestPacketCount       int64
recordForceOut                 int64
dtype: object
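
Note that `parsedDate` is loaded as `object` (a string column), so any `MIN`/`MAX` over it compares text rather than timestamps. For the fixed-width `YYYY-MM-DD HH:MM:SS` layout used in this dataset, the two orderings agree. A minimal sketch in plain Python, using a few timestamp values from the sample rows shown in this notebook, illustrates why that is safe here; it makes no claim about other date layouts.

```python
from datetime import datetime

# Timestamps in the same 'YYYY-MM-DD HH:MM:SS' layout as parsedDate
# (values copied from sample rows shown in this notebook)
stamps = [
    "2013-04-03 10:47:33",
    "2013-04-03 00:11:50",
    "2013-04-03 13:13:49",
    "2013-04-03 09:58:01",
]

# Plain string comparison, which is how a string column is ordered
first_as_text, last_as_text = min(stamps), max(stamps)

# Chronological comparison after parsing into datetime objects
layout = "%Y-%m-%d %H:%M:%S"
parsed = [datetime.strptime(s, layout) for s in stamps]
first_as_time = min(parsed).strftime(layout)
last_as_time = max(parsed).strftime(layout)

# Both comparisons pick the same first and last timestamps
print(first_as_text == first_as_time, last_as_text == last_as_time)
```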

In [5]:
%%time
# Create DaskSQL table from GDF - There is no copy in this process
dc.create_table('netflow', netflow_gdf, persist=False)

NameError: name 'netflow_gdf' is not defined

In [13]:
result = dc.sql("SELECT * FROM netflow")
result.head()
#type(result.head())

Unnamed: 0,TimeSeconds,parsedDate,dateTimeStr,ipLayerProtocol,ipLayerProtocolCode,firstSeenSrcIp,firstSeenDestIp,firstSeenSrcPort,firstSeenDestPort,moreFragments,contFragments,durationSeconds,firstSeenSrcPayloadBytes,firstSeenDestPayloadBytes,firstSeenSrcTotalBytes,firstSeenDestTotalBytes,firstSeenSrcPacketCount,firstSeenDestPacketCount,recordForceOut
0,1364948000.0,2013-04-03 00:11:50,20130400000000.0,6,TCP,10.38.37.13,172.20.0.3,42559,25,0,0,10,36,125,422,403,7,5,0
1,1364948000.0,2013-04-03 00:11:53,20130400000000.0,6,TCP,10.13.77.49,172.30.0.4,42566,25,0,0,9,0,0,186,0,3,0,0
2,1364948000.0,2013-04-03 00:11:54,20130400000000.0,17,UDP,172.10.0.40,172.255.255.255,138,138,0,0,0,201,0,243,0,1,0,0
3,1364948000.0,2013-04-03 00:11:57,20130400000000.0,6,TCP,10.156.215.83,172.10.0.7,42593,80,0,0,0,170,336,448,506,5,3,0
4,1364948000.0,2013-04-03 00:12:00,20130400000000.0,6,TCP,10.170.32.110,172.20.0.4,42612,80,0,0,3,1870,79850,5730,84250,70,80,0


In [14]:
len(netflow_gdf.index)

21526138

In [15]:
# import time

# t0 = time.time()

In [16]:
# %%time
# define the query
import time

t0 = time.time()

query = '''
        select
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            sum(a.firstSeenSrcTotalBytes) as bytesOut,
            sum(a.firstSeenDestTotalBytes) as bytesIn,
            sum(a.durationSeconds) as durationSeconds,
            min(parsedDate) as firstFlowDate,
            max(parsedDate) as lastFlowDate,
            count(*) as attemptCount
        from 
            netflow a
        group by
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# build the query (returns a lazy Dask DataFrame; nothing is computed until .head() or .compute() is called)
result_gdf = dc.sql(query)

t1 = time.time()
print(f"run_stuff took {t1-t0}s")

run_stuff took 0.6950595378875732s
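
Displaying `result_gdf` further down shows only a Dask DataFrame structure (`npartitions=1`, no values), which suggests that the interval measured above covers building the query rather than executing it. A minimal sketch of timing the two phases separately follows. It uses a plain Python generator as a stand-in for the lazy result, so `timed` and `lazy_sum_of_squares` are hypothetical helpers, not part of DaskSQL.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record wall-clock seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def lazy_sum_of_squares(n):
    """Stand-in for a lazy query: returns a generator and does no work yet."""
    return (i * i for i in range(n))


timings = {}

with timed("build", timings):
    lazy_result = lazy_sum_of_squares(200_000)  # analogous to dc.sql(query)

with timed("execute", timings):
    total = sum(lazy_result)                    # analogous to result.compute()

print(f"build took {timings['build']:.6f}s, execute took {timings['execute']:.6f}s")
```

Applied to this notebook, the "execute" block would wrap a call that materializes the result (for example `result_gdf.compute()`), which gives a figure that is more directly comparable to the Spark cell that calls `edges_df.show(5)`.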


In [17]:
# t1 = time.time()
# print(f"run_stuff took {t1-t0}s")

run_stuff took 0.9239158630371094s


In [18]:
result_gdf.head()

Unnamed: 0,source,destination,targetPorts,bytesOut,bytesIn,durationSeconds,firstFlowDate,lastFlowDate,attemptCount
0,10.0.0.10,172.10.1.10,2,1142,216,0,2013-04-03 10:47:33,2013-04-03 10:47:33,2
1,10.0.0.10,172.10.1.100,2,1266,784,0,2013-04-03 10:43:15,2013-04-03 10:43:15,2
2,10.0.0.10,172.10.1.102,2,1142,216,0,2013-04-03 09:58:01,2013-04-03 09:58:01,2
3,10.0.0.10,172.10.1.109,2,1266,784,0,2013-04-03 10:02:42,2013-04-03 10:02:42,2
4,10.0.0.10,172.10.1.112,2,1142,216,0,2013-04-03 13:13:49,2013-04-03 13:13:49,2
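
In the rows above, `targetPorts` equals `attemptCount` on every row. That is what `count(column)` yields when the column contains no NULLs: it counts rows, not distinct values. A minimal sketch using Python's built-in `sqlite3` (a stand-in engine, not DaskSQL) on a few made-up rows shows how it differs from `COUNT(DISTINCT ...)`. The sample rows and the `portRows` and `distinctPorts` aliases are illustrative assumptions.

```python
import sqlite3

# Made-up sample flows: (source IP, destination IP, destination port)
rows = [
    ("10.0.0.1", "172.10.1.10", 80),
    ("10.0.0.1", "172.10.1.10", 80),
    ("10.0.0.1", "172.10.1.10", 443),
    ("10.0.0.2", "172.10.1.20", 25),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE netflow ("
    "firstSeenSrcIp TEXT, firstSeenDestIp TEXT, firstSeenDestPort INTEGER)"
)
conn.executemany("INSERT INTO netflow VALUES (?, ?, ?)", rows)

query = """
    SELECT
        a.firstSeenSrcIp AS source,
        a.firstSeenDestIp AS destination,
        COUNT(a.firstSeenDestPort) AS portRows,
        COUNT(DISTINCT a.firstSeenDestPort) AS distinctPorts,
        COUNT(*) AS attemptCount
    FROM netflow a
    GROUP BY a.firstSeenSrcIp, a.firstSeenDestIp
    ORDER BY source, destination
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
conn.close()
```

For the first pair, `portRows` and `attemptCount` are both 3 while `distinctPorts` is 2. If the intent of `targetPorts` is the number of different ports contacted, a distinct count may be closer to that meaning.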


In [23]:
result_gdf


Unnamed: 0_level_0,source,destination,targetPorts,bytesOut,bytesIn,durationSeconds,firstFlowDate,lastFlowDate,attemptCount
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
,object,object,int64,int64,int64,int64,object,object,int64
,...,...,...,...,...,...,...,...,...


In [25]:
# how's it looking?
result_gdf.head(10)


Unnamed: 0,source,destination,targetPorts,bytesOut,bytesIn,durationSeconds,firstFlowDate,lastFlowDate,attemptCount
0,10.0.0.10,172.10.1.10,2,1142,216,0,2013-04-03 10:47:33,2013-04-03 10:47:33,2
1,10.0.0.10,172.10.1.100,2,1266,784,0,2013-04-03 10:43:15,2013-04-03 10:43:15,2
2,10.0.0.10,172.10.1.102,2,1142,216,0,2013-04-03 09:58:01,2013-04-03 09:58:01,2
3,10.0.0.10,172.10.1.109,2,1266,784,0,2013-04-03 10:02:42,2013-04-03 10:02:42,2
4,10.0.0.10,172.10.1.112,2,1142,216,0,2013-04-03 13:13:49,2013-04-03 13:13:49,2
5,10.0.0.10,172.10.1.114,2,1266,784,0,2013-04-03 11:49:06,2013-04-03 11:49:06,2
6,10.0.0.10,172.10.1.116,2,108,0,0,2013-04-03 10:55:57,2013-04-03 10:55:57,2
7,10.0.0.10,172.10.1.117,2,108,0,0,2013-04-03 10:15:42,2013-04-03 10:15:42,2
8,10.0.0.10,172.10.1.119,2,108,0,0,2013-04-03 10:25:07,2013-04-03 10:25:07,2
9,10.0.0.10,172.10.1.120,2,1142,216,0,2013-04-03 10:25:01,2013-04-03 10:25:01,2


## Apache Spark
The cell below installs Apache Spark ([PySpark](https://spark.apache.org/docs/latest/api/python/index.html)).

In [26]:
# installs the latest PySpark release from PyPI (2.4.4 as of Jan 2020)
!pip install pyspark



In [None]:
!pip install install-jdk

In [None]:
#import os
#os.environ["PYSPARK_SUBMIT_ARGS"]="--master local[3] pyspark-shell"
#os.environ['SPARK_LOCAL_IP'] =  '10.150.160.7'

#### DaskSQL vs PySpark
With everything installed, we can launch a SparkSession and see how DaskSQL stacks up.

In [None]:
%%time
#I copied this cell's snippet from another Google Colab by Luca Canali here: https://colab.research.google.com/github/LucaCanali/sparkMeasure/blob/master/examples/SparkMeasure_Jupyter_Colab_Example.ipynb

from pyspark.sql import SparkSession
from pyspark import SparkConf, StorageLevel, SparkContext

# Create Spark Session
# This example uses a local master; change it to YARN or K8S if available.
# The commented-out builder below would also download sparkMeasure 0.13 for Scala 2.11 from Maven Central.

# spark = SparkSession \
#         .builder \
#         .master("local[*]") \
#         .appName("PySpark Netflow Benchmark code") \
#         .config("spark.jars.packages","ch.cern.sparkmeasure:spark-measure_2.11:0.13")  \
#         .getOrCreate()

conf = (SparkConf().setMaster("local[*]")
                .setAppName("PySpark Netflow Benchmark code")
                .set('spark.driver.memory', '300G')
                .set('spark.driver.maxResultSize', '20G')
                .set('spark.network.timeout', '7200s'))

sc = SparkContext(conf=conf)

spark = SparkSession(sc)

### Load & Query Table

In [None]:
%%time
# load CSV into Spark
netflow_df = spark.read.csv(data_dir + fn, header=True, inferSchema=True)

In [None]:
%%time
# create table for querying
netflow_df.createOrReplaceTempView('netflow')

In [None]:
import time

t0 = time.time()


In [None]:
%%time
# define the same query that was run with DaskSQL above
query = '''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# query with Spark
edges_df = spark.sql(query)

# set/display results
edges_df.show(5)

In [None]:
t1 = time.time()
print(f"run_stuff took {t1-t0}s")

In [None]:
type(edges_df)


In [None]:
edges_df.count()