# BlazingSQL vs. Apache Spark 

Below, we run one of our popular workloads with [BlazingSQL](https://blazingsql.com/) + [RAPIDS AI](https://rapids.ai), then run the entire ETL phase again with Apache Spark + PySpark.

In this notebook, we will cover: 
- How to read and query CSV files with cuDF and BlazingSQL.
- How BlazingSQL compares against Apache Spark (analyzing over 20M records).

#### BlazingSQL install check
The next cell checks whether BlazingSQL is installed. If it is not, first install RAPIDS and BlazingSQL via your preferred installation method (Docker or conda) from our [Release Selector](https://rapids.ai/start.html#rapids-release-selector).

In [1]:
import sys 
# point import path to notebooks-contrib/utils
sys.path.append('../../utils') 
from sql_check import bsql_start
# check that BlazingSQL is installed
bsql_start()

"You've got BlazingSQL set up perfectly! Let's get started with SQL in RAPIDS AI!"

#### Download Data
This cell checks whether you have the data for this demo and, if you don't, downloads it for you.

In [2]:
import os

# relative path to data folder
data_dir = '../../data/blazingsql/'
# file name
fn = 'nf-chunk2.csv'

# does folder exist?
if not os.path.exists(data_dir):
    print('creating blazingsql directory')
    # create folder (and any missing parent folders)
    os.makedirs(data_dir, exist_ok=True)

# do we have the netflow file?
if not os.path.isfile(data_dir + fn):
    # save nf-chunk2 to data folder, may take a few minutes to download (21,526,138 records)
    !wget -P ../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv
else:
    print("You've got the data!")

You've got the data!
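
The `!wget` line above only works inside IPython/Jupyter. As a rough sketch of the same check-then-download logic in plain Python, something like the following could be used. The helper name `ensure_file` and its `fetch` parameter are hypothetical (they are not part of this notebook or of BlazingSQL); only standard-library calls are used.

```python
import os
import urllib.request


def ensure_file(path, url, fetch=urllib.request.urlretrieve):
    """Download `url` to `path` unless the file already exists.

    Hypothetical helper: returns True if a download happened,
    False if the file was already in place.
    """
    if os.path.isfile(path):
        return False
    # create the parent folder (and any missing ancestors) first
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    # fetch(url, destination) defaults to urllib's urlretrieve
    fetch(url, path)
    return True


# usage with the same path and URL as the cell above (left commented out
# because the file is large):
# ensure_file('../../data/blazingsql/nf-chunk2.csv',
#             'https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv')
```

Passing the download function as a parameter is just a convenience so the logic can be exercised without network access.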


## Create BlazingContext
You can think of the BlazingContext much like a SparkContext: this is where information such as the FileSystems you have registered and the Tables you have created is stored.

In [3]:
import cudf
from blazingsql import BlazingContext
# start up BlazingSQL
bc = BlazingContext()

BlazingContext ready


## BlazingSQL + cuDF 
With the data in hand, we can test the performance of cuDF and BlazingSQL on this dataset.

In [4]:
%%time
# Load the CSV into a GPU DataFrame (GDF)
netflow_gdf = cudf.read_csv(data_dir + fn)

CPU times: user 1.63 s, sys: 236 ms, total: 1.87 s
Wall time: 1.86 s


In [5]:
%%time
# Create BlazingSQL table from GDF - There is no copy in this process
bc.create_table('netflow', netflow_gdf)

CPU times: user 1.36 s, sys: 43.5 ms, total: 1.4 s
Wall time: 545 ms


In [6]:
%%time
# define the query
query = '''
        select
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            sum(a.firstSeenSrcTotalBytes) as bytesOut,
            sum(a.firstSeenDestTotalBytes) as bytesIn,
            sum(a.durationSeconds) as durationSeconds,
            min(parsedDate) as firstFlowDate,
            max(parsedDate) as lastFlowDate,
            count(*) as attemptCount
        from 
            netflow a
        group by
            a.firstSeenSrcIp,
            a.firstSeenDestIp
            '''

# query the table (returns cudf dataframe)
result_gdf = bc.sql(query)

CPU times: user 567 ms, sys: 52.4 ms, total: 619 ms
Wall time: 290 ms
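
To make it easier to see what this query computes, here is a small sketch that runs the same SQL on a few made-up rows using Python's built-in `sqlite3` module. The three rows are hypothetical toy data (not taken from `nf-chunk2.csv`), and SQLite stands in for BlazingSQL purely for illustration; the column names match the netflow table above.

```python
import sqlite3

# Hypothetical toy rows (NOT from nf-chunk2.csv):
# srcIp, destIp, destPort, srcBytes, destBytes, duration, parsedDate
rows = [
    ('10.0.0.1', '10.0.0.2', 80, 100, 200, 1, '2013-04-03 06:00:00'),
    ('10.0.0.1', '10.0.0.2', 443, 50, 70, 2, '2013-04-03 07:00:00'),
    ('10.0.0.3', '10.0.0.2', 22, 10, 0, 0, '2013-04-03 08:00:00'),
]

conn = sqlite3.connect(':memory:')
conn.execute('''
    create table netflow (
        firstSeenSrcIp text, firstSeenDestIp text, firstSeenDestPort integer,
        firstSeenSrcTotalBytes integer, firstSeenDestTotalBytes integer,
        durationSeconds integer, parsedDate text
    )''')
conn.executemany('insert into netflow values (?, ?, ?, ?, ?, ?, ?)', rows)

# same query as in the cell above
toy_query = '''
    select
        a.firstSeenSrcIp as source,
        a.firstSeenDestIp as destination,
        count(a.firstSeenDestPort) as targetPorts,
        sum(a.firstSeenSrcTotalBytes) as bytesOut,
        sum(a.firstSeenDestTotalBytes) as bytesIn,
        sum(a.durationSeconds) as durationSeconds,
        min(parsedDate) as firstFlowDate,
        max(parsedDate) as lastFlowDate,
        count(*) as attemptCount
    from
        netflow a
    group by
        a.firstSeenSrcIp,
        a.firstSeenDestIp
'''

# one output row per (source, destination) pair; sorted here only to make
# the print order deterministic (the query itself has no ORDER BY)
edges = sorted(conn.execute(toy_query).fetchall())
for edge in edges:
    print(edge)
```

The two flows from `10.0.0.1` to `10.0.0.2` collapse into a single edge with summed byte counts and the earliest and latest timestamps, while the lone flow from `10.0.0.3` stays as its own edge.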


In [7]:
# how's it look?
result_gdf.head(10)

| | source | destination | targetPorts | bytesOut | bytesIn | durationSeconds | firstFlowDate | lastFlowDate | attemptCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 172.10.1.33 | 10.0.0.11 | 110 | 49886 | 69630 | 0 | 2013-04-03 06:51:58 | 2013-04-03 14:45:47 | 110 |
| 1 | 172.30.1.126 | 239.255.255.250 | 9 | 2275 | 0 | 12 | 2013-04-03 06:35:52 | 2013-04-03 12:05:31 | 9 |
| 2 | 172.30.2.133 | 239.255.255.250 | 5 | 1225 | 0 | 6 | 2013-04-03 06:36:08 | 2013-04-03 06:36:15 | 5 |
| 3 | 172.30.1.149 | 10.0.0.9 | 78 | 35309 | 48556 | 30 | 2013-04-03 06:48:27 | 2013-04-03 11:52:55 | 78 |
| 4 | 10.0.0.13 | 172.10.1.81 | 1 | 633 | 392 | 0 | 2013-04-03 09:48:26 | 2013-04-03 09:48:26 | 1 |
| 5 | 172.10.0.5 | 10.247.58.129 | 3 | 1617 | 108 | 4 | 2013-04-03 10:16:11 | 2013-04-03 11:37:15 | 3 |
| 6 | 10.0.0.14 | 172.10.2.143 | 1 | 571 | 108 | 0 | 2013-04-03 10:13:57 | 2013-04-03 10:13:57 | 1 |
| 7 | 172.10.1.2 | 10.0.0.10 | 97 | 44092 | 61401 | 2 | 2013-04-03 06:48:54 | 2013-04-03 15:05:37 | 97 |
| 8 | 172.10.1.212 | 10.0.0.9 | 102 | 46260 | 64410 | 23 | 2013-04-03 06:50:02 | 2013-04-03 14:31:50 | 102 |
| 9 | 172.30.1.160 | 10.0.0.12 | 65 | 29402 | 40520 | 16 | 2013-04-03 06:55:18 | 2013-04-03 11:52:13 | 65 |


## Apache Spark
The cell below installs Apache Spark ([PySpark](https://spark.apache.org/docs/latest/api/python/index.html)).

In [8]:
# installs the latest PySpark release from PyPI (3.1.1 in the run shown below)
!pip install pyspark

Collecting pyspark
  Using cached pyspark-3.1.1.tar.gz (212.3 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 27.8 MB/s eta 0:00:01
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
[?25h  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=f924906e4f699df53b02459122288353d7e6790a0d5cb0181040406516c56b44
  Stored in directory: /root/.cache/pip/wheels/43/47/42/bc413c760cf9d3f7b46ab7cd6590e8c47ebfd19a7386cd4a57
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.1


#### PyBlazing vs PySpark
With everything installed we can launch a SparkSession and see how BlazingSQL stacks up.

In [9]:
%%time
# This cell's snippet is copied from a Google Colab notebook by Luca Canali: https://colab.research.google.com/github/LucaCanali/sparkMeasure/blob/master/examples/SparkMeasure_Jupyter_Colab_Example.ipynb

from pyspark.sql import SparkSession

# Create Spark Session
# This example uses a local cluster; you can modify master to use YARN or K8S if available
# This example downloads sparkMeasure 0.13 for Scala 2.11 from Maven Central

spark = SparkSession \
        .builder \
        .master("local[*]") \
        .appName("PySpark Netflow Benchmark code") \
        .config("spark.jars.packages","ch.cern.sparkmeasure:spark-measure_2.11:0.13")  \
        .getOrCreate()

CPU times: user 39.5 ms, sys: 39.9 ms, total: 79.4 ms
Wall time: 6.87 s


### Load & Query Table

In [10]:
%%time
# load CSV into Spark
netflow_df = spark.read.csv(data_dir + fn, header=True, inferSchema=True)

CPU times: user 27.5 ms, sys: 33.8 ms, total: 61.3 ms
Wall time: 40 s


In [11]:
%%time
# create table for querying
netflow_df.createOrReplaceTempView('netflow')

CPU times: user 1.21 ms, sys: 283 µs, total: 1.49 ms
Wall time: 155 ms
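
As a quick sanity check on the numbers so far, the sketch below adds up the wall times reported above for loading the CSV and registering the table on each side. The values are copied by hand from the cell outputs in this notebook; they come from a single run, so treat the ratio as indicative only. Spark session startup (6.87 s) and the query itself are left out.

```python
# Wall times in seconds, copied from the cell outputs above (single run)
blazing = {'load_csv': 1.86, 'create_table': 0.545}   # cells 4 and 5
spark = {'load_csv': 40.0, 'create_table': 0.155}     # cells 10 and 11

blazing_total = sum(blazing.values())
spark_total = sum(spark.values())

# how many times longer the Spark load + table setup took
ratio = spark_total / blazing_total

print(f'cuDF + BlazingSQL: {blazing_total:.3f} s')
print(f'PySpark:           {spark_total:.3f} s')
print(f'ratio:             {ratio:.1f}x')
```

Almost all of the difference comes from the CSV load step. Table registration is a sub-second operation on both sides.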


In [12]:
%%time
# define the same query we ran with BlazingSQL above
query = '''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# query with Spark
edges_df = spark.sql(query)

# display the first 5 rows of the result
edges_df.show(5)

+---------+------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+
|   source| destination|targetPorts|bytesOut|bytesIn|durationSeconds|      firstFlowDate|       lastFlowDate|attemptCount|
+---------+------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+
|10.0.0.10| 172.20.1.73|          1|     571|    108|              0|2013-04-03 10:08:30|2013-04-03 10:08:30|           1|
|10.0.0.10|172.30.1.221|          1|     633|    392|              0|2013-04-03 10:10:39|2013-04-03 10:10:39|           1|
|10.0.0.10| 172.30.2.67|          1|     633|    392|              0|2013-04-03 10:43:48|2013-04-03 10:43:48|           1|
|10.0.0.11| 172.20.1.55|          1|     571|    108|              0|2013-04-03 10:11:52|2013-04-03 10:11:52|           1|
|10.0.0.11|172.30.1.245|          3|    1837|    892|              0|2013-04-03 09:45:12|2013-04-03 11:27:32|           3|
+---------+-----