# Graphistry Netflow Demo

In this example we are examining 73,397,810 rows of network traffic flow (netflow) data in search of anomalous activity within a network. 

We will query this data with BlazingSQL and then pass it to Graphistry for visualization.

## Imports

In [4]:
# Notebooks-contrib test 
import os
try:
    import matplotlib
except ModuleNotFoundError:
    os.system('conda install -y matplotlib')
    import matplotlib
    
# import blazingsql
from blazingsql import BlazingContext 

## Blazing Context
The cell above imports BlazingContext; here we create an instance of it. You can think of the BlazingContext like a Spark Context: it is where information such as the FileSystems you have registered and the Tables you have created is stored.

In [12]:
bc = BlazingContext()

Already connected to the Orchestrator
BlazingContext ready


### Create & Query Tables
In this next cell we identify the full path to the data we've downloaded.

In [1]:
# identify working directory path
path = !pwd  # type(path)==IPython.utils.text.SList

# extract path to notebooks-contrib
path = path[0].split('intermediate_notebooks')[0]

# add path to data w/ wildcard to load all 4 files at once
path = path + '/data/blazingsql/*_0.parquet'

# what's the path? 
path

'/home/winston/notebooks-contrib//data/blazingsql/*_0.parquet'
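Before creating a table from a wildcard path, it can help to check which files the pattern actually matches. The following is a minimal sketch, assuming the standard-library `glob` module is a reasonable stand-in for how the wildcard is expanded (BlazingSQL's own expansion may differ). The helper `matching_files` and the file names are hypothetical, and the files are empty placeholders created in a temporary directory.

```python
import glob
import os
import tempfile

def matching_files(pattern):
    """Return the sorted list of files that a wildcard pattern matches."""
    return sorted(glob.glob(pattern))

# hypothetical stand-in for the data directory, filled with empty placeholder files
demo_dir = tempfile.mkdtemp()
for name in ['a_0.parquet', 'b_0.parquet', 'c_1.parquet']:
    open(os.path.join(demo_dir, name), 'w').close()

# os.path.join avoids the doubled slash seen in the path above
pattern = os.path.join(demo_dir, '*_0.parquet')
matches = [os.path.basename(f) for f in matching_files(pattern)]
print(matches)
```

Only the names ending in `_0.parquet` are matched, so a stray file in the same directory would not be picked up by the pattern.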

#### Create
Here we use the path identified above to load all 4 parquet files into a single BlazingSQL table. This is done by using a wildcard (*) in the file path.

In [31]:
%%time
# BlazingSQL table from multiple files via wildcard path
bc.create_table('netflow', path)

CPU times: user 4.16 ms, sys: 4.18 ms, total: 8.35 ms
Wall time: 298 ms


<pyblazing.apiv2.sql.Table at 0x7f7189dc4ac8>

#### Query
With the table made, we can simply run a SQL query.

We are going to run a grouped aggregation in order to condense these millions of rows into thousands of rows, each representing an edge between a source node and a destination node.

In [32]:
%%time
# what are we looking for 
query = '''
        select
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            sum(a.firstSeenSrcTotalBytes) as bytesOut,
            sum(a.firstSeenDestTotalBytes) as bytesIn,
            sum(a.durationSeconds) as durationSeconds,
            min(parsedDate) as firstFlowDate,
            max(parsedDate) as lastFlowDate,
            count(*) as attemptCount
        from
            netflow a
        group by
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# run sql query (type(result)==cudf.core.dataframe.DataFrame)
result = bc.sql(query)

CPU times: user 29.3 ms, sys: 41.9 ms, total: 71.3 ms
Wall time: 4.51 s
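To see what this kind of grouping does, here is a minimal sketch that runs a trimmed-down version of the query on four made-up rows, using the standard-library `sqlite3` module rather than BlazingSQL. The column names come from the query above; the row values and the extra `distinctPorts` column are assumptions added for illustration. It also shows that `count(column)` counts non-null rows rather than distinct values.

```python
import sqlite3

# in-memory stand-in for the netflow table (toy data, not the real dataset)
conn = sqlite3.connect(':memory:')
conn.execute('''
    create table netflow (
        firstSeenSrcIp text,
        firstSeenDestIp text,
        firstSeenDestPort integer,
        firstSeenSrcTotalBytes integer,
        parsedDate text
    )
''')

rows = [
    ('10.0.0.1', '10.0.0.9', 80, 100, '2013-04-01 08:00:00'),
    ('10.0.0.1', '10.0.0.9', 80, 200, '2013-04-02 09:00:00'),
    ('10.0.0.1', '10.0.0.9', 443, 300, '2013-04-03 09:30:00'),
    ('10.0.0.2', '10.0.0.9', 22, 50, '2013-04-03 10:00:00'),
]
conn.executemany('insert into netflow values (?, ?, ?, ?, ?)', rows)

# one output row per (source, destination) pair, i.e. one row per edge
edges = conn.execute('''
    select
        a.firstSeenSrcIp as source,
        a.firstSeenDestIp as destination,
        count(a.firstSeenDestPort) as targetPorts,
        count(distinct a.firstSeenDestPort) as distinctPorts,
        sum(a.firstSeenSrcTotalBytes) as bytesOut,
        count(*) as attemptCount
    from
        netflow a
    group by
        a.firstSeenSrcIp,
        a.firstSeenDestIp
    order by
        source
''').fetchall()

for edge in edges:
    print(edge)
```

In the toy result, `targetPorts` equals `attemptCount` for each pair, while `distinctPorts` is smaller whenever a port repeats. Whether BlazingSQL accepts `count(distinct ...)` is not confirmed here.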


In [33]:
# how do the results look?
result.head(25)

|   | source | destination | targetPorts | bytesOut | bytesIn | durationSeconds | firstFlowDate | lastFlowDate | attemptCount |
|---|--------|-------------|-------------|----------|---------|-----------------|---------------|--------------|--------------|
| 0 | 10.0.0.13 | 60805 | 5 | 3165 | 25 | 856 | 2013-04-04 12:37:43 | 2013-04-06 14:21:53 | 5 |
| 1 | 10.0.0.7 | 58945 | 8 | 5056 | 40 | 1352 | 2013-04-01 10:50:05 | 2013-04-06 08:59:53 | 8 |
| 2 | 10.0.0.6 | 64531 | 9 | 5688 | 45 | 1503 | 2013-04-01 14:34:49 | 2013-04-06 12:32:14 | 9 |
| 3 | 10.1.0.76 | 1588 | 16 | 10128 | 82 | 3168 | 2013-04-01 08:42:09 | 2013-04-07 08:24:19 | 16 |
| 4 | 10.0.0.13 | 48255 | 8 | 5064 | 40 | 1408 | 2013-04-02 13:37:22 | 2013-04-07 08:30:07 | 8 |
| 5 | 10.0.0.10 | 62076 | 1 | 633 | 5 | 152 | 2013-04-03 11:16:20 | 2013-04-03 11:16:20 | 1 |
| 6 | 10.7.5.5 | 18256 | 5 | 22118307 | 9128 | 912 | 2013-04-03 11:03:52 | 2013-04-05 12:24:54 | 5 |
| 7 | 10.0.0.5 | 54363 | 12 | 7584 | 60 | 1956 | 2013-04-01 11:14:43 | 2013-04-05 13:16:40 | 12 |
| 8 | 172.20.0.4 | 36513 | 60 | 552030 | 495 | 2925 | 2013-04-03 08:26:33 | 2013-04-03 11:33:10 | 60 |
| 9 | 10.1.0.100 | 60962 | 3 | 1902 | 15 | 531 | 2013-04-02 08:22:00 | 2013-04-03 13:17:20 | 3 |
