# BlazingSQL vs. Apache Spark 

Below, we run one of our popular workloads with [BlazingSQL](https://blazingsql.com/) + [RAPIDS AI](https://rapids.ai), then run the entire ETL phase again with Apache Spark + PySpark.

In this notebook, we will cover: 
- How to read and query CSV files with cuDF and BlazingSQL.
- How BlazingSQL compares against Apache Spark (analyzing over 20M records).

#### BlazingSQL install check
The next cell checks that you have BlazingSQL installed, and offers to install it if not (making sure the notebook will run as expected).

In [1]:
import sys 
# point import path to notebooks-contrib/utils
sys.path.append('../../../utils/')
from sql_check import bsql_start
# check that BlazingSQL is installed
bsql_start()

"You've got BlazingSQL set up perfectly! Let's get started with SQL in RAPIDS AI!"

#### Download Data
This cell will check if you have the data for this demo, and, if you don't, will download it for you.

In [2]:
import os

# relative path to data folder
data_dir = '../../../data/blazingsql/'
# file name
fn = 'nf-chunk2.csv'

# does folder exist?
if not os.path.exists(data_dir):
    print('creating blazingsql directory')
    # create folder at the same path the rest of the cell uses (data_dir)
    os.makedirs(data_dir, exist_ok=True)

# do we have the data file?
if not os.path.isfile(data_dir + fn):
    # save nf-chunk2 to data folder, may take a few minutes to download (21,526,138 records)
    !wget -P ../../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv
else:
    print("You've got the data!")

--2020-01-21 17:27:12--  https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.115.3
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.115.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2725056295 (2.5G) [text/csv]
Saving to: ‘../../../data/blazingsql/nf-chunk2.csv’


2020-01-21 17:28:14 (46.0 MB/s) - ‘../../../data/blazingsql/nf-chunk2.csv’ saved [2725056295/2725056295]
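The cell above relies on the `!wget` shell escape, which only works inside IPython/Jupyter and on systems with `wget` installed. As a rough alternative, the same "download only if missing" check can be sketched with the standard library alone. The helper name `ensure_file` and its `fetch` parameter are hypothetical (they are not part of the notebook or of BlazingSQL); `fetch` exists only so the download step can be swapped out.

```python
import os
import urllib.request


def ensure_file(data_dir, file_name, url, fetch=urllib.request.urlretrieve):
    """Download `url` to data_dir/file_name unless that file already exists.

    Hypothetical helper for illustration. Returns (path, downloaded_flag).
    """
    # create the folder (and any missing parents); no error if it already exists
    os.makedirs(data_dir, exist_ok=True)
    target = os.path.join(data_dir, file_name)
    if os.path.isfile(target):
        # nothing to do, the data is already on disk
        return target, False
    # the real file is about 2.5 GB, so this step may take a few minutes
    fetch(url, target)
    return target, True
```

With the notebook's own names, the call would look like `ensure_file(data_dir, fn, 'https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv')`.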



## Create BlazingContext
You can think of the BlazingContext much like a SparkContext: it stores information such as the FileSystems you have registered and the Tables you have created.

In [3]:
import cudf
from blazingsql import BlazingContext
# start up BlazingSQL
bc = BlazingContext()

BlazingContext ready


## BlazingSQL + cuDF 
With the data in hand, we can test the performance of cuDF and BlazingSQL on this dataset.

In [4]:
%%time
# Load the CSV into a GPU DataFrame (GDF)
netflow_gdf = cudf.read_csv(data_dir + fn)

CPU times: user 3.75 s, sys: 1.19 s, total: 4.94 s
Wall time: 4.93 s
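`%%time` is an IPython cell magic, so it is only available inside a notebook. When running the same steps as a plain script, a comparable wall-clock measurement can be sketched with `time.perf_counter`. The `wall_timer` helper below is a hypothetical name introduced for illustration, and the timed workload is a stand-in rather than the CSV load itself.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label, results):
    """Record the wall-clock seconds spent inside the `with` block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # stored even if the timed block raises
        results[label] = time.perf_counter() - start


timings = {}
with wall_timer("sum", timings):
    # stand-in workload; in the notebook this would be cudf.read_csv(...)
    total = sum(range(1_000_000))

print(f"sum: {timings['sum']:.3f} s")
```

Unlike `%%time`, this reports only wall time, not the user/sys CPU split.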


In [5]:
%%time
# Create BlazingSQL table from GDF - There is no copy in this process
bc.create_table('netflow', netflow_gdf)

CPU times: user 4.11 ms, sys: 23 µs, total: 4.13 ms
Wall time: 3.32 ms


<pyblazing.apiv2.context.BlazingTable at 0x7f9230411278>

In [6]:
%%time
# define the query
query = '''
        select
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            sum(a.firstSeenSrcTotalBytes) as bytesOut,
            sum(a.firstSeenDestTotalBytes) as bytesIn,
            sum(a.durationSeconds) as durationSeconds,
            min(parsedDate) as firstFlowDate,
            max(parsedDate) as lastFlowDate,
            count(*) as attemptCount
        from 
            netflow a
        group by
            a.firstSeenSrcIp,
            a.firstSeenDestIp
            '''

# query the table (returns cudf dataframe)
result_gdf = bc.sql(query)

CPU times: user 1.98 s, sys: 514 ms, total: 2.49 s
Wall time: 1.95 s


In [7]:
# how's it look?
result_gdf.head(10)

| | source | destination | targetPorts | bytesOut | bytesIn | durationSeconds | firstFlowDate | lastFlowDate | attemptCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 172.10.1.162 | 10.0.0.11 | 87 | 39628 | 53983 | 24 | 2013-04-03 06:50:13 | 2013-04-03 14:58:35 | 87 |
| 1 | 172.30.2.60 | 10.0.0.9 | 82 | 34839 | 47716 | 134 | 2013-04-03 06:48:47 | 2013-04-03 12:12:37 | 82 |
| 2 | 172.30.1.56 | 172.0.0.1 | 25 | 3330 | 3240 | 67 | 2013-04-03 01:59:09 | 2013-04-03 22:05:39 | 25 |
| 3 | 172.10.1.234 | 10.0.0.5 | 104 | 47287 | 64750 | 18 | 2013-04-03 06:53:55 | 2013-04-03 15:11:07 | 104 |
| 4 | 10.1.0.76 | 172.10.1.82 | 1 | 633 | 392 | 0 | 2013-04-03 09:55:05 | 2013-04-03 09:55:05 | 1 |
| 5 | 172.30.1.85 | 10.0.0.8 | 84 | 37828 | 52864 | 3 | 2013-04-03 06:48:21 | 2013-04-03 12:06:53 | 84 |
| 6 | 172.30.1.10 | 10.0.0.12 | 69 | 31042 | 43044 | 25 | 2013-04-03 06:48:01 | 2013-04-03 12:11:40 | 69 |
| 7 | 172.30.1.201 | 172.0.0.1 | 29 | 2610 | 2610 | 0 | 2013-04-03 00:26:46 | 2013-04-03 23:06:00 | 29 |
| 8 | 172.30.2.125 | 10.0.0.9 | 69 | 30701 | 41558 | 341 | 2013-04-03 06:50:50 | 2013-04-03 12:12:37 | 69 |
| 9 | 172.10.1.89 | 10.0.0.5 | 112 | 51222 | 70260 | 24 | 2013-04-03 06:48:24 | 2013-04-03 15:17:39 | 112 |
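To make the semantics of the aggregation easier to see, here is a minimal sketch that runs the same query on four made-up rows. It uses SQLite from the Python standard library as a stand-in engine, not BlazingSQL; the sample rows and the added `order by` (only there to make the row order deterministic) are assumptions for illustration. Note that `count(a.firstSeenDestPort)` counts rows with a non-null port, not distinct ports, which is why `targetPorts` matches `attemptCount` whenever no port is missing.

```python
import sqlite3

# in-memory database with the columns the query touches (stand-in for the GPU table)
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table netflow (
        firstSeenSrcIp text, firstSeenDestIp text, firstSeenDestPort integer,
        firstSeenSrcTotalBytes integer, firstSeenDestTotalBytes integer,
        durationSeconds integer, parsedDate text)
""")

# hypothetical sample rows; the third one has a missing (NULL) destination port
rows = [
    ("10.0.0.1", "10.0.0.2", 80, 100, 200, 1, "2013-04-03 06:00:00"),
    ("10.0.0.1", "10.0.0.2", 80, 50, 25, 2, "2013-04-03 07:00:00"),
    ("10.0.0.1", "10.0.0.2", None, 10, 0, 0, "2013-04-03 08:00:00"),
    ("10.0.0.3", "10.0.0.2", 443, 7, 9, 4, "2013-04-03 09:00:00"),
]
conn.executemany("insert into netflow values (?, ?, ?, ?, ?, ?, ?)", rows)

query = """
    select
        a.firstSeenSrcIp as source,
        a.firstSeenDestIp as destination,
        count(a.firstSeenDestPort) as targetPorts,
        sum(a.firstSeenSrcTotalBytes) as bytesOut,
        sum(a.firstSeenDestTotalBytes) as bytesIn,
        sum(a.durationSeconds) as durationSeconds,
        min(parsedDate) as firstFlowDate,
        max(parsedDate) as lastFlowDate,
        count(*) as attemptCount
    from netflow a
    group by a.firstSeenSrcIp, a.firstSeenDestIp
    order by source
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
# ('10.0.0.1', '10.0.0.2', 2, 160, 225, 3, '2013-04-03 06:00:00', '2013-04-03 08:00:00', 3)
# ('10.0.0.3', '10.0.0.2', 1, 7, 9, 4, '2013-04-03 09:00:00', '2013-04-03 09:00:00', 1)
```

The first group has three attempts but only two counted ports, because the row with the NULL port is skipped by `count(column)` and included by `count(*)`.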


## Apache Spark
The cell below installs Apache Spark ([PySpark](https://spark.apache.org/docs/latest/api/python/index.html)).

In [11]:
# installs PySpark (version 2.4.4 as of Jan 2020)
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting py4j==0.10.7 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130387 sha256=7879a54a037a812709763c4abf7d3d85b5b9b9f8ac6278785942767dc8032f54
  Stored in directory: /root/.cache/pip/wheels/ab/09/4d/0d184230058e654eb1b04467dbc1292f00eaa186544604b471
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully 

#### PyBlazing vs PySpark
With everything installed, we can launch a SparkSession and see how BlazingSQL stacks up.

In [12]:
%%time
# This cell's snippet is copied from a Google Colab notebook by Luca Canali: https://colab.research.google.com/github/LucaCanali/sparkMeasure/blob/master/examples/SparkMeasure_Jupyter_Colab_Example.ipynb

from pyspark.sql import SparkSession

# Create Spark Session
# This example uses a local cluster; you can modify master to use YARN or K8S if available
# This example downloads sparkMeasure 0.13 for scala 2_11 from maven central

spark = SparkSession \
        .builder \
        .master("local[*]") \
        .appName("PySpark Netflow Benchmark code") \
        .config("spark.jars.packages","ch.cern.sparkmeasure:spark-measure_2.11:0.13")  \
        .getOrCreate()

CPU times: user 73.4 ms, sys: 27.2 ms, total: 101 ms
Wall time: 6.59 s


### Load & Query Table

In [17]:
%%time
# load CSV into Spark
netflow_df = spark.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(data_dir+fn)

CPU times: user 20.4 ms, sys: 33.4 ms, total: 53.8 ms
Wall time: 48.8 s
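For a rough sense of scale, the two CSV-load wall times reported in this notebook (4.93 s for `cudf.read_csv`, 48.8 s for `spark.read` with `inferschema='true'`) can be turned into a ratio. This is a single-run comparison on one machine, so treat it as indicative only; the variable names below are introduced for illustration.

```python
# wall times copied from the %%time outputs in this notebook (single run each)
cudf_read_s = 4.93    # cudf.read_csv
spark_read_s = 48.8   # spark.read ... load with inferschema='true'

speedup = spark_read_s / cudf_read_s
print(f"CSV load: cuDF was about {speedup:.1f}x faster than Spark in this run")
# CSV load: cuDF was about 9.9x faster than Spark in this run
```

Part of Spark's load time here comes from `inferschema='true'`, which requires a pass over the data to determine column types; supplying an explicit schema would likely change the ratio.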


In [18]:
%%time
# create table for querying
netflow_df.createOrReplaceTempView('netflow')

CPU times: user 1.87 ms, sys: 0 ns, total: 1.87 ms
Wall time: 28.6 ms


In [19]:
%%time
# define the same query that was run with BlazingSQL above
query = '''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# query with Spark
edges_df = spark.sql(query)

# display the first 5 rows of the result
edges_df.show(5)

+------------+---------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+
|      source|    destination|targetPorts|bytesOut|bytesIn|durationSeconds|      firstFlowDate|       lastFlowDate|attemptCount|
+------------+---------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+
| 172.10.1.13|239.255.255.250|         15|    2975|      0|              6|2013-04-03 06:36:19|2013-04-03 06:36:27|          15|
|172.30.1.204|239.255.255.250|          8|    1750|      0|              6|2013-04-03 06:36:13|2013-04-03 06:36:20|           8|
| 172.30.2.86|      172.0.0.1|          1|     540|      0|              2|2013-04-03 06:36:09|2013-04-03 06:36:09|           1|
|172.30.1.246|      172.0.0.1|         29|    2610|   2610|              0|2013-04-03 00:26:46|2013-04-03 23:06:00|          29|
| 172.30.1.51|239.255.255.250|         16|    3850|      0|             18|2013-04-03 06:35:22|20