# ARCHIVED - Please run with RAPIDS 24.06 or earlier. You can use conda or Docker to get the correct version.

# Dask-SQL vs. Apache Spark 

*Converted from the [original BlazingSQL notebook](the_archive/archived_rapids_demos/blazingsql/bsql_vs_pyspark_netflow.ipynb) by Shondace Thomas*

Below, we run one of our popular workloads with [Dask-SQL](https://docs.dask.org/en/latest/install.html) + [RAPIDS AI](https://rapids.ai), then run the entire ETL phase again with Apache Spark + PySpark.

In this notebook, we will cover: 
- How to read and query CSV files with cuDF and Dask-SQL.
- How Dask-SQL compares against Apache Spark (analyzing over 20M records).

In [1]:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

In [2]:
# create a local CUDA cluster
cluster = LocalCUDACluster()
client = Client(cluster)
client

Connection method: Cluster object
Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:8787/status

Dashboard: http://127.0.0.1:8787/status
Workers: 2
Total threads: 2
Total memory: 45.79 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:42707
Workers: 2
Dashboard: http://127.0.0.1:8787/status
Total threads: 2
Started: Just now
Total memory: 45.79 GiB

Comm: tcp://127.0.0.1:34069
Total threads: 1
Dashboard: http://127.0.0.1:43989/status
Memory: 22.89 GiB
Nanny: tcp://127.0.0.1:35835
Local directory: /rapids/notebooks/extra/notebooks-contrib/community_tutorials_and_guides/dask-sql/dask-worker-space/worker-headvo3o
GPU: Quadro GV100
GPU memory: 31.75 GiB

Comm: tcp://127.0.0.1:46221
Total threads: 1
Dashboard: http://127.0.0.1:45063/status
Memory: 22.89 GiB
Nanny: tcp://127.0.0.1:33741
Local directory: /rapids/notebooks/extra/notebooks-contrib/community_tutorials_and_guides/dask-sql/dask-worker-space/worker-7a0gssnu
GPU: Quadro GV100
GPU memory: 31.74 GiB


In [3]:
# cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True, enable_nvlink=True, jit_unspill=False)
# client = Client(cluster)
# client

#### Download Data
This cell checks whether you have the data for this demo and, if you don't, downloads it for you.

In [4]:
import os
# relative path to data folder
data_dir = '../../data/dask-sql/'
# file name
fn = 'nf-chunk2.csv'

# does folder exist?
if not os.path.exists(data_dir):
    print('creating dask-sql directory')
    # create folder (and any missing parent folders)
    os.makedirs(data_dir, exist_ok=True)

# do we have the netflow file?
if not os.path.isfile(data_dir + fn):
    # save nf-chunk2 to data folder, may take a few minutes to download (21,526,138 records)
    !wget -P ../../data/dask-sql https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv
else:
    print("You've got the data!")

You've got the data!
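
The check-then-download logic above can also be sketched in plain Python, without the notebook-only `!wget` magic. This is a minimal sketch, not the notebook's own method: `ensure_dataset` and its `fetch` parameter are hypothetical names introduced here, and the default downloader is the standard library's `urllib.request.urlretrieve`.

```python
import urllib.request
from pathlib import Path


def ensure_dataset(data_dir, filename, url, fetch=urllib.request.urlretrieve):
    """Return the local path to `filename`, downloading it from `url` if missing.

    `fetch(url, destination)` is a hypothetical injection point so the logic
    can be exercised without network access; it defaults to urlretrieve.
    """
    target = Path(data_dir) / filename
    # Create the data folder and any missing parents in one step.
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_file():
        print("You've got the data!")
    else:
        print(f"downloading {filename} ...")
        fetch(url, str(target))
    return target


# Usage in this notebook would look like this (commented out to avoid a large download):
# ensure_dataset('../../data/dask-sql/', 'nf-chunk2.csv',
#                'https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv')
```

Passing the downloader in as an argument keeps the existence check separate from the network call, which makes the helper easy to try out with a stub.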


## Create DaskContext
You can think of the DaskContext much like a Spark Context: this is where information such as the file systems you have registered and the tables you have created is stored.

In [5]:
import cudf
import dask_cudf
from dask_sql import Context
# start up Dask-SQL
dc = Context()

## Dask-SQL + cuDF 
Data in hand, we can test the performance of cuDF and Dask-SQL on this dataset.

In [6]:
%%time
# Load CSVs into GPU DataFrames (GDF)
netflow_gdf = cudf.read_csv(data_dir + fn)

CPU times: user 1.62 s, sys: 614 ms, total: 2.23 s
Wall time: 2.21 s


In [7]:
netflow_gdf.dtypes


TimeSeconds                  float64
parsedDate                    object
dateTimeStr                  float64
ipLayerProtocol                int64
ipLayerProtocolCode           object
firstSeenSrcIp                object
firstSeenDestIp               object
firstSeenSrcPort               int64
firstSeenDestPort              int64
moreFragments                  int64
contFragments                  int64
durationSeconds                int64
firstSeenSrcPayloadBytes       int64
firstSeenDestPayloadBytes      int64
firstSeenSrcTotalBytes         int64
firstSeenDestTotalBytes        int64
firstSeenSrcPacketCount        int64
firstSeenDestPacketCount       int64
recordForceOut                 int64
dtype: object
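
Note that `parsedDate` is loaded as `object` (a string column), and the query later in this notebook takes `min` and `max` over it. That still yields the earliest and latest timestamps because the fixed-width `YYYY-MM-DD HH:MM:SS` format sorts the same way as text and as time. A small sketch of that property in plain Python, using a few timestamp values copied from the outputs in this notebook (the variable names are illustrative only):

```python
from datetime import datetime

# Sample parsedDate values copied from outputs shown in this notebook.
stamps = [
    "2013-04-03 10:47:33",
    "2013-04-03 09:58:01",
    "2013-04-03 13:13:49",
    "2013-04-03 00:11:50",
]

FMT = "%Y-%m-%d %H:%M:%S"
parsed = [datetime.strptime(s, FMT) for s in stamps]

# Lexicographic min/max on the strings ...
first_as_text, last_as_text = min(stamps), max(stamps)
# ... versus chronological min/max on real datetimes.
first_as_time, last_as_time = min(parsed), max(parsed)

print(first_as_text, "|", last_as_text)
print(first_as_time.strftime(FMT) == first_as_text)
print(last_as_time.strftime(FMT) == last_as_text)
```

The equivalence relies on every field being zero-padded; a format such as `3/4/2013 9:58` would not sort correctly as text.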

In [8]:
%%time
# Create Dask-SQL table from GDF - There is no copy in this process
dc.create_table('netflow', netflow_gdf, persist=False)

CPU times: user 108 ms, sys: 4.65 ms, total: 113 ms
Wall time: 110 ms


In [9]:
result = dc.sql("SELECT * FROM netflow")
result.head()
#type(result.head())

Unnamed: 0,TimeSeconds,parsedDate,dateTimeStr,ipLayerProtocol,ipLayerProtocolCode,firstSeenSrcIp,firstSeenDestIp,firstSeenSrcPort,firstSeenDestPort,moreFragments,contFragments,durationSeconds,firstSeenSrcPayloadBytes,firstSeenDestPayloadBytes,firstSeenSrcTotalBytes,firstSeenDestTotalBytes,firstSeenSrcPacketCount,firstSeenDestPacketCount,recordForceOut
0,1364948000.0,2013-04-03 00:11:50,20130400000000.0,6,TCP,10.38.37.13,172.20.0.3,42559,25,0,0,10,36,125,422,403,7,5,0
1,1364948000.0,2013-04-03 00:11:53,20130400000000.0,6,TCP,10.13.77.49,172.30.0.4,42566,25,0,0,9,0,0,186,0,3,0,0
2,1364948000.0,2013-04-03 00:11:54,20130400000000.0,17,UDP,172.10.0.40,172.255.255.255,138,138,0,0,0,201,0,243,0,1,0,0
3,1364948000.0,2013-04-03 00:11:57,20130400000000.0,6,TCP,10.156.215.83,172.10.0.7,42593,80,0,0,0,170,336,448,506,5,3,0
4,1364948000.0,2013-04-03 00:12:00,20130400000000.0,6,TCP,10.170.32.110,172.20.0.4,42612,80,0,0,3,1870,79850,5730,84250,70,80,0


In [10]:
len(netflow_gdf.index)

21526138

In [11]:
# import time

# t0 = time.time()

In [12]:
# %%time
# define the query
import time

t0 = time.time()

query = '''
        select
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            sum(a.firstSeenSrcTotalBytes) as bytesOut,
            sum(a.firstSeenDestTotalBytes) as bytesIn,
            sum(a.durationSeconds) as durationSeconds,
            min(parsedDate) as firstFlowDate,
            max(parsedDate) as lastFlowDate,
            count(*) as attemptCount
        from 
            netflow a
        group by
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# query the table (returns a lazy Dask DataFrame; the work runs when results are requested, e.g. via .head())
result_gdf = dc.sql(query)

t1 = time.time()
print(f"run_stuff took {t1-t0}s")

run_stuff took 1.6552035808563232s
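
Because `dc.sql` hands back a lazy Dask DataFrame, a timer wrapped around that call alone may not include the aggregation itself; the computation happens when results are requested, for example through `.head()` or `.compute()`. The following is a minimal sketch of a reusable timer for such measurements. `stopwatch` and `timings` are hypothetical names, and the body of the `with` block is a stand-in workload so the sketch runs without a GPU.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, timings):
    """Record the wall-clock seconds spent inside the `with` block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings = {}

# Stand-in workload so the sketch runs anywhere. In the notebook, the body
# would be something like `result = dc.sql(query).compute()` so that the
# measurement covers query execution and not only graph construction.
with stopwatch("toy aggregation", timings):
    total = sum(i * i for i in range(100_000))

print(f"toy aggregation took {timings['toy aggregation']:.4f}s")
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring short intervals.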


In [13]:
# t1 = time.time()
# print(f"run_stuff took {t1-t0}s")

In [14]:
result_gdf.head()

Unnamed: 0,source,destination,targetPorts,bytesOut,bytesIn,durationSeconds,firstFlowDate,lastFlowDate,attemptCount
0,10.0.0.10,172.10.1.10,1,571,108,0,2013-04-03 10:47:33,2013-04-03 10:47:33,1
1,10.0.0.10,172.10.1.100,1,633,392,0,2013-04-03 10:43:15,2013-04-03 10:43:15,1
2,10.0.0.10,172.10.1.102,1,571,108,0,2013-04-03 09:58:01,2013-04-03 09:58:01,1
3,10.0.0.10,172.10.1.109,1,633,392,0,2013-04-03 10:02:42,2013-04-03 10:02:42,1
4,10.0.0.10,172.10.1.112,1,571,108,0,2013-04-03 13:13:49,2013-04-03 13:13:49,1


In [15]:
result_gdf


,source,destination,targetPorts,bytesOut,bytesIn,durationSeconds,firstFlowDate,lastFlowDate,attemptCount
npartitions=1,,,,,,,,,
,object,object,int32,int64,int64,int64,object,object,int32
,...,...,...,...,...,...,...,...,...


In [16]:
# how's it looking?
result_gdf.head(10)


Unnamed: 0,source,destination,targetPorts,bytesOut,bytesIn,durationSeconds,firstFlowDate,lastFlowDate,attemptCount
0,10.0.0.10,172.10.1.10,1,571,108,0,2013-04-03 10:47:33,2013-04-03 10:47:33,1
1,10.0.0.10,172.10.1.100,1,633,392,0,2013-04-03 10:43:15,2013-04-03 10:43:15,1
2,10.0.0.10,172.10.1.102,1,571,108,0,2013-04-03 09:58:01,2013-04-03 09:58:01,1
3,10.0.0.10,172.10.1.109,1,633,392,0,2013-04-03 10:02:42,2013-04-03 10:02:42,1
4,10.0.0.10,172.10.1.112,1,571,108,0,2013-04-03 13:13:49,2013-04-03 13:13:49,1
5,10.0.0.10,172.10.1.114,1,633,392,0,2013-04-03 11:49:06,2013-04-03 11:49:06,1
6,10.0.0.10,172.10.1.116,1,54,0,0,2013-04-03 10:55:57,2013-04-03 10:55:57,1
7,10.0.0.10,172.10.1.117,1,54,0,0,2013-04-03 10:15:42,2013-04-03 10:15:42,1
8,10.0.0.10,172.10.1.119,1,54,0,0,2013-04-03 10:25:07,2013-04-03 10:25:07,1
9,10.0.0.10,172.10.1.120,1,571,108,0,2013-04-03 10:25:01,2013-04-03 10:25:01,1
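
To see what this aggregation produces on a scale small enough to check by hand, here is a sketch that runs the same `GROUP BY` over three made-up flows. It uses Python's built-in `sqlite3` as a stand-in engine so it runs without a GPU or Dask-SQL; the rows are invented for illustration and are not taken from `nf-chunk2.csv`, and an `ORDER BY` is added only to make the printed order deterministic.

```python
import sqlite3

# Made-up flows for illustration only (not rows from nf-chunk2.csv).
# Columns: srcIp, destIp, destPort, srcTotalBytes, destTotalBytes, durationSeconds, parsedDate
rows = [
    ("10.0.0.1", "172.16.0.5", 80, 500, 1200, 3, "2013-04-03 09:00:00"),
    ("10.0.0.1", "172.16.0.5", 443, 700, 300, 1, "2013-04-03 09:05:00"),
    ("10.0.0.2", "172.16.0.5", 25, 186, 0, 9, "2013-04-03 10:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE netflow (
        firstSeenSrcIp TEXT, firstSeenDestIp TEXT, firstSeenDestPort INTEGER,
        firstSeenSrcTotalBytes INTEGER, firstSeenDestTotalBytes INTEGER,
        durationSeconds INTEGER, parsedDate TEXT
    )
""")
conn.executemany("INSERT INTO netflow VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Same aggregation as the notebook query: one output row per (source, destination) pair.
edges = conn.execute("""
    SELECT
        a.firstSeenSrcIp AS source,
        a.firstSeenDestIp AS destination,
        COUNT(a.firstSeenDestPort) AS targetPorts,
        SUM(a.firstSeenSrcTotalBytes) AS bytesOut,
        SUM(a.firstSeenDestTotalBytes) AS bytesIn,
        SUM(a.durationSeconds) AS durationSeconds,
        MIN(parsedDate) AS firstFlowDate,
        MAX(parsedDate) AS lastFlowDate,
        COUNT(*) AS attemptCount
    FROM netflow a
    GROUP BY a.firstSeenSrcIp, a.firstSeenDestIp
    ORDER BY source, destination
""").fetchall()

for edge in edges:
    print(edge)
```

In this sketch the two flows from `10.0.0.1` collapse into one edge with summed byte counts. `COUNT(a.firstSeenDestPort)` counts non-null port values rather than distinct ports, which is why `targetPorts` matches `attemptCount` here, as it does in the result rows shown above.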


## Apache Spark
The cell below installs Apache Spark ([PySpark](https://spark.apache.org/docs/latest/api/python/index.html)).

In [17]:
# installs the latest PySpark from PyPI (3.2.0 in the run shown below)
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=4d5a723adc6f54e0e2168f2e1bfb3be356418285ce90f31ed893c23f09cb4943
  Stored in directory: /root/.cache/pip/wheels/23/f6/d3/110e53bd43baeb8d7d38049733d48e39cbecd056f01dba7ee8
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0


In [18]:
!pip install install-jdk

Collecting install-jdk
  Downloading install-jdk-0.3.0.tar.gz (3.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: install-jdk
  Building wheel for install-jdk (setup.py) ... done
  Created wheel for install-jdk: filename=install_jdk-0.3.0-py3-none-any.whl size=3740 sha256=e3015ac2be81c909a5e918b6935010f3417e190a9f290da78149c8940a5dfe2a
  Stored in directory: /root/.cache/pip/wheels/89/a9/a3/03dc102cdcd442b9bca361f8c64fd4bb9b47ce75d9c8d56c91
Successfully built install-jdk
Installing collected packages: install-jdk
Successfully installed install-jdk-0.3.0


In [19]:
#import os
#os.environ["PYSPARK_SUBMIT_ARGS"]="--master local[3] pyspark-shell"
#os.environ['SPARK_LOCAL_IP'] =  '10.150.160.7'

#### Dask-SQL vs PySpark
With everything installed, we can launch a SparkSession and see how Dask-SQL stacks up.

In [20]:
%%time
#I copied this cell's snippet from another Google Colab by Luca Canali here: https://colab.research.google.com/github/LucaCanali/sparkMeasure/blob/master/examples/SparkMeasure_Jupyter_Colab_Example.ipynb

from pyspark.sql import SparkSession
from pyspark import SparkConf, StorageLevel, SparkContext

# Create Spark Session
# This example uses a local cluster; you can modify master to use YARN or K8S if available
# The commented-out builder below would also download sparkMeasure 0.13 for Scala 2.11 from Maven Central

# spark = SparkSession \
#         .builder \
#         .master("local[*]") \
#         .appName("PySpark Netflow Benchmark code") \
#         .config("spark.jars.packages","ch.cern.sparkmeasure:spark-measure_2.11:0.13")  \
#         .getOrCreate()

conf = (SparkConf().setMaster("local[*]")
                .setAppName("PySpark Netflow Benchmark code")
                .set('spark.driver.memory', '300G')
                .set('spark.driver.maxResultSize', '20G')
                .set('spark.network.timeout', '7200s'))

sc = SparkContext(conf=conf)

spark = SparkSession(sc)

CPU times: user 159 ms, sys: 113 ms, total: 273 ms
Wall time: 2.98 s


### Load & Query Table

In [21]:
%%time
# load CSV into Spark
netflow_df = spark.read.csv(data_dir + fn, header=True, inferSchema=True)

CPU times: user 536 ms, sys: 247 ms, total: 783 ms
Wall time: 17 s


In [22]:
%%time
# create table for querying
netflow_df.createOrReplaceTempView('netflow')

CPU times: user 4.33 ms, sys: 439 µs, total: 4.77 ms
Wall time: 35.3 ms


In [23]:
import time

t0 = time.time()


In [24]:
%%time
# define the same query that was run with Dask-SQL above
query = '''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# query with Spark
edges_df = spark.sql(query)

# set/display results
edges_df.show(5)

+---------+------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+
|   source| destination|targetPorts|bytesOut|bytesIn|durationSeconds|      firstFlowDate|       lastFlowDate|attemptCount|
+---------+------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+
|10.0.0.10|172.10.1.116|          1|      54|      0|              0|2013-04-03 10:55:57|2013-04-03 10:55:57|           1|
|10.0.0.10|172.10.1.146|          1|     571|    108|              0|2013-04-03 10:21:51|2013-04-03 10:21:51|           1|
|10.0.0.10|172.10.1.156|          1|     571|    108|              0|2013-04-03 09:54:01|2013-04-03 09:54:01|           1|
|10.0.0.10|172.10.1.182|          1|     571|    108|              0|2013-04-03 10:44:59|2013-04-03 10:44:59|           1|
|10.0.0.10|172.10.1.201|          2|    1204|    500|              0|2013-04-03 09:50:16|2013-04-03 10:02:54|           2|
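+---------+------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+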

In [25]:
t1 = time.time()
print(f"run_stuff took {t1-t0}s")

run_stuff took 10.61590027809143s


In [26]:
type(edges_df)


pyspark.sql.dataframe.DataFrame

In [27]:
edges_df.count()

18881
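
As a rough summary, the timings printed in this notebook can be turned into ratios. This is a back-of-the-envelope sketch only: the numbers are copied from the single run shown above, and the measured steps are not identical (the Dask-SQL timer wraps the `dc.sql` call, while the Spark timer spans separate cells that include `edges_df.show(5)`), so treat the ratios as indicative rather than as a controlled benchmark. The `measured` dictionary and `speedup` helper are illustrative names.

```python
# Wall-clock timings in seconds, copied from the cell outputs in this notebook.
measured = {
    "load CSV": {"rapids": 2.21, "spark": 17.0},
    "query": {"rapids": 1.6552035808563232, "spark": 10.61590027809143},
}


def speedup(step):
    """Return how many times longer the Spark step took than the RAPIDS step."""
    return measured[step]["spark"] / measured[step]["rapids"]


load_ratio = speedup("load CSV")
query_ratio = speedup("query")

for step in measured:
    print(f"{step}: Spark took {speedup(step):.2f}x as long")
```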