## **Sample workflow using cugraph, custreamz and clx**

## Authors:
- Shane Ding (NVIDIA) [shaned@nvidia.com]

## Development Notes
* Developed using: CLX v0.18 and RAPIDS v0.18.0
* Last tested using: CLX v0.18 and RAPIDS v0.18.0 on June 9th, 2021

## Table of Contents
- Downloading Data
- Starting Kafka
- Configuring Kafka
- Building custreamz pipeline
- Benchmarking
- Publishing results to Kafka

## Introduction
In this notebook, we show an example workflow in which data is published to Kafka and then processed with [RAPIDS](https://rapids.ai/) (in particular `cudf`, `cugraph` and `custreamz`) and [CLX](https://github.com/rapidsai/clx) for graph analytics. The data is a sample from the [UNSW-NB15 dataset](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/), which can be downloaded [here](https://cloudstor.aarnet.edu.au/plus/index.php/s/2DhnLGDdEECo4ys) or fetched by running the cells below.

### Downloading Data

In [None]:
import cudf
import cugraph
from cugraph.utilities.utils import is_device_version_less_than
import pandas as pd

from clx.heuristics import ports
import clx.parsers.zeek as zeek
import clx.ip

from os import path
import s3fs
from streamz import Stream

In [None]:
S3_BASE_PATH = "rapidsai-data/cyber/clx"
CONN_LOG = "conn.log"

# Download Zeek conn log
if not path.exists(CONN_LOG):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + CONN_LOG, CONN_LOG)

Note that `conn.log` starts with an 8-line header and ends with a `close` line. Neither is needed for this example, so we remove both.

In [None]:
!tail -n +9 conn.log | head -n -1 > messages.log
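For environments without `tail` and `head`, the same trimming can be sketched in plain Python. The helper name `strip_zeek_header_footer` and its `header_lines` parameter are our own (hypothetical), and the default of 8 header lines is taken from the `tail -n +9` command above.

```python
from pathlib import Path


def strip_zeek_header_footer(src, dst, header_lines=8):
    """Copy src to dst, dropping the first `header_lines` lines and the last line.

    Mirrors `tail -n +9 src | head -n -1 > dst` when header_lines is 8.
    Returns the number of lines written.
    """
    lines = Path(src).read_text().splitlines(keepends=True)
    body = lines[header_lines:-1]
    Path(dst).write_text("".join(body))
    return len(body)
```

Calling `strip_zeek_header_footer("conn.log", "messages.log")` should produce the same `messages.log` as the shell command, assuming the file fits in memory.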

### Follow the instructions at https://kafka.apache.org/quickstart to start a Kafka broker

**NOTE:** At the topic creation step, make sure to name the new topic `streamz_n_graph`

In [None]:
# Ingesting data into kafka

!kafka_2.13-2.8.0/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streamz_n_graph < messages.log >/dev/null

In [None]:
# To see the data from the kafka topic

!kafka_2.13-2.8.0/bin/kafka-console-consumer.sh --topic streamz_n_graph --from-beginning --bootstrap-server localhost:9092

### Configuring Kafka Stream using custreamz

In [None]:
# Kafka
broker="localhost:9092"
input_topic="streamz_n_graph"
output_topic="output"

In [None]:
max_batch_size=100000
poll_interval="1s"

In [None]:
import random

# Generate a unique group_id to be able to re-run this demo notebook on the same data loaded to your kafka topic.
j = random.randint(0,10000)
group_id="fil-group-%d" % j

# Kafka consumer configuration
consumer_conf = {
    "bootstrap.servers": broker,
    "group.id": group_id,
    "session.timeout.ms": "60000",
    "enable.partition.eof": "true",
    "auto.offset.reset": "latest",
}
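`random.randint(0, 10000)` draws from only 10,001 values, so repeated runs can occasionally reuse an earlier group id. A minimal sketch of an alternative, assuming a UUID-based suffix is acceptable for your setup (the helper `make_group_id` is hypothetical and not part of the original notebook):

```python
import uuid


def make_group_id(prefix="fil-group"):
    """Return a consumer group id that is, for practical purposes, unique per call."""
    # uuid4 is random, so collisions between runs are extremely unlikely.
    return f"{prefix}-{uuid.uuid4().hex}"


group_id = make_group_id()
print(group_id)
```

The resulting string can be used as the `group.id` value in the consumer configuration in the same way as the integer-suffixed id.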

In [None]:
source = Stream.from_kafka_batched(
        input_topic,
        consumer_conf,
        poll_interval=poll_interval,
        npartitions=1,
        asynchronous=True,
        max_batch_size=max_batch_size
)

### Now that Kafka is set up correctly, we define the functions that process each batch

In [None]:
import time

def parse_message(line):
    split_line = line.split(b'\t')
    src, src_p = split_line[2], split_line[3]
    dest, dest_p = split_line[4], split_line[5]
    return (src, src_p, dest, dest_p)
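To see what `parse_message` returns, here is a small self-contained check. The sample record is made up for illustration and carries only the first six tab-separated Zeek `conn.log` fields (`ts`, `uid`, `id.orig_h`, `id.orig_p`, `id.resp_h`, `id.resp_p`); real records carry more fields after these.

```python
def parse_message(line):
    # Same logic as the notebook's function, repeated so this snippet runs on its own.
    split_line = line.split(b'\t')
    src, src_p = split_line[2], split_line[3]
    dest, dest_p = split_line[4], split_line[5]
    return (src, src_p, dest, dest_p)


# Hypothetical record: ts, uid, source IP, source port, destination IP, destination port.
sample = b"1331901000.000000\tCabc123\t192.168.202.79\t50463\t192.168.229.251\t80"
print(parse_message(sample))
# (b'192.168.202.79', b'50463', b'192.168.229.251', b'80')
```

The values stay as `bytes` because Kafka delivers raw message payloads, which is why the batch function decodes them before building the dataframe.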
    

In [None]:
edges_gdf = None


def process_batch(messages):
    global edges_gdf
    start_time = time.time()
    src_dest_tuples = list(map(parse_message, messages))
    
    evt_edges_df = cudf.DataFrame({
        'src': [x[0].decode('utf-8') for x in src_dest_tuples],
        'dst': [x[2].decode('utf-8') for x in src_dest_tuples]
    })
    
    # Convert the IP address strings to integers
    evt_edges_df['src'] = clx.ip.ip_to_int(evt_edges_df['src'])
    evt_edges_df['dst'] = clx.ip.ip_to_int(evt_edges_df['dst'])
    
    if edges_gdf is None:
        edges_gdf = evt_edges_df
    else:
        edges_gdf = cudf.concat([edges_gdf, evt_edges_df])
    
    end_time = time.time()
    time_diff = end_time - start_time
    return (time_diff, evt_edges_df)
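The IP-to-integer step can be illustrated on the CPU with the standard library. This is only a sketch of the usual big-endian IPv4 mapping; that `clx.ip.ip_to_int` yields the same integers is an assumption here, and the helper names `ipv4_to_int` and `int_to_ipv4` are hypothetical.

```python
import ipaddress


def ipv4_to_int(ip):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ipv4(value):
    """Inverse of ipv4_to_int: turn a 32-bit integer back into a dotted quad."""
    return str(ipaddress.IPv4Address(value))


print(ipv4_to_int("192.168.0.1"))  # 3232235521
print(int_to_ipv4(3232235521))     # 192.168.0.1
```

Storing addresses as integers keeps the edge list numeric, which is the form the graph is built from in the next cell.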

In [None]:
def pagerank(message):    
    start_time = time.time()
    
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges_gdf, source="src", destination="dst", renumber=True)    
    
    pr_gdf = cugraph.pagerank(G, alpha=0.85, max_iter=500, tol=1.0e-05)
    pr_gdf['idx'] = pr_gdf['vertex']
    
    print(pr_gdf.head())
    end_time = time.time()
    time_diff = end_time - start_time
    
    prev_time = message[0]
    return (prev_time, time_diff)
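To make the `alpha`, `max_iter` and `tol` arguments concrete, here is a plain-Python power-iteration PageRank on a tiny edge list. It is a teaching sketch under our own assumptions (uniform teleport, dangling rank spread evenly, L1 convergence test) and is not cugraph's implementation; the function name `pagerank_power_iteration` is hypothetical.

```python
def pagerank_power_iteration(edges, alpha=0.85, max_iter=500, tol=1.0e-05):
    """PageRank on a list of (src, dst) edges using power iteration."""
    nodes = sorted({v for edge in edges for v in edge})
    if not nodes:
        return {}
    n = len(nodes)

    # Duplicate edges are kept, so they act as repeated links.
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - alpha) / n + alpha * dangling / n for v in nodes}
        for src in nodes:
            if out_links[src]:
                share = alpha * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        # Stop once the L1 change between iterations falls below the tolerance.
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


ranks = pagerank_power_iteration([("a", "c"), ("b", "c"), ("c", "a")])
print(ranks)
```

In this toy graph `c` receives links from both `a` and `b`, so it ends up with the highest score, while `b`, which nothing links to, keeps only the teleport share.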

### Sinking the result to a list

In [None]:
output = source.map(process_batch).map(pagerank).sink_to_list()

In [None]:
source.start()

In [None]:
output

### Generating longer synthetic file from `messages.log`

In [None]:
# Read the original messages once; the context manager closes the file afterwards.
with open('messages.log') as f:
    file_content = f.read()
factor = 46
messages_sent = 43410 * factor  # 46 * 43410 ~ 2 million

with open('messages_duplicate.log', 'w') as f:
    for i in range(factor):
        f.write(file_content)

### Benchmarking

In [None]:
import subprocess

cumulative_time, total_time = 0, 0
trials = 10
bashCommand = "kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streamz_n_graph < messages_duplicate.log >/dev/null"


In [None]:
for i in range(trials):
    process = subprocess.Popen(bashCommand, stdout=subprocess.PIPE, cwd='/rapids/clx/my_data', shell=True)
    process.communicate()

In [None]:
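total_messages = messages_sent * trials
print(f'A total of {total_messages} messages will be sent')

# Each entry in output is (batch processing seconds, pagerank seconds) for one batch.
total_seconds = sum(x[0] + x[1] for x in output)

# len(output) * max_batch_size is an upper bound on the messages processed so far,
# because a batch can hold fewer than max_batch_size messages.
if len(output) * max_batch_size >= total_messages:
    print('Done')
    print('Average seconds per message:', total_seconds / total_messages)
else:
    print('Still running, current average seconds per message:', total_seconds / total_messages)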
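The per-batch timings can also be split into their two stages. The sketch below assumes each entry is a `(batch_seconds, pagerank_seconds)` tuple, as returned by the pipeline above; the helper `summarize_timings` and the sample numbers are hypothetical.

```python
def summarize_timings(timings, total_messages):
    """Summarize a list of (batch_seconds, pagerank_seconds) tuples, one per batch."""
    batch_seconds = sum(t[0] for t in timings)
    pagerank_seconds = sum(t[1] for t in timings)
    total_seconds = batch_seconds + pagerank_seconds
    return {
        "batches": len(timings),
        "batch_seconds": batch_seconds,
        "pagerank_seconds": pagerank_seconds,
        "seconds_per_message": total_seconds / total_messages,
        # Guard against an empty timing list, where no time has elapsed yet.
        "messages_per_second": total_messages / total_seconds if total_seconds else float("inf"),
    }


# Made-up timings for two batches covering 200,000 messages.
summary = summarize_timings([(0.5, 1.5), (0.25, 1.75)], 200_000)
print(summary)
```

Separating the two stages shows whether dataframe construction or the PageRank computation dominates the run time.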

### Publishing the results to Kafka

Instead of sinking to a list, we can also emit the edge-list and pagerank results to Kafka topics. To do so, we only need to convert each result to a string or bytes object.
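As a rough illustration of that conversion, the snippet below serializes a list of `[src, dst]` rows to compact JSON bytes using only the standard library. The helper `rows_to_kafka_payload` is hypothetical; in this notebook the dataframe's own `to_json` method does the conversion.

```python
import json


def rows_to_kafka_payload(rows):
    """Serialize a list of [src, dst] rows as compact JSON bytes."""
    return json.dumps(rows, separators=(",", ":")).encode("utf-8")


# Two made-up edges between 192.168.0.1 and 10.0.0.1, stored as integers.
payload = rows_to_kafka_payload([[3232235521, 167772161], [167772161, 3232235521]])
print(payload)
# b'[[3232235521,167772161],[167772161,3232235521]]'
```

A consumer can recover the rows with `json.loads(payload)`.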

In [None]:
broker="localhost:9092"
input_topic="streamz_n_graph"

In [None]:
max_batch_size=5000
poll_interval="1s"

In [None]:
import random

# Generate a unique group_id to be able to re-run this demo notebook on the same data loaded to your kafka topic.
j = random.randint(0,10000)
group_id="fil-group-%d" % j

# Kafka consumer configuration
consumer_conf = {
    "bootstrap.servers": broker,
    "group.id": group_id,
    "session.timeout.ms": "60000",
    "enable.partition.eof": "true",
    "auto.offset.reset": "latest",
}

In [None]:
source = Stream.from_kafka_batched(
        input_topic,
        consumer_conf,
        poll_interval=poll_interval,
        npartitions=1,
        asynchronous=True,
        max_batch_size=max_batch_size
)

In [None]:
### Creating the two new topics

!kafka_2.13-2.8.0/bin/kafka-topics.sh --create --topic edge_list --bootstrap-server localhost:9092
!kafka_2.13-2.8.0/bin/kafka-topics.sh --create --topic pagerank --bootstrap-server localhost:9092

In [None]:
def parse_message(line):
    split_line = line.split(b'\t')
    src, src_p = split_line[2], split_line[3]
    dest, dest_p = split_line[4], split_line[5]
    return (src, src_p, dest, dest_p)

In [None]:
edges_gdf = None

def process_batch(messages):
    global edges_gdf
    src_dest_tuples = list(map(parse_message, messages))
    
    evt_edges_df = cudf.DataFrame({
        'src': [x[0].decode('utf-8') for x in src_dest_tuples],
        'dst': [x[2].decode('utf-8') for x in src_dest_tuples]
    })
    
    # Convert the IP address strings to integers
    evt_edges_df['src'] = clx.ip.ip_to_int(evt_edges_df['src'])
    evt_edges_df['dst'] = clx.ip.ip_to_int(evt_edges_df['dst'])
    
    if edges_gdf is None:
        edges_gdf = evt_edges_df
    else:
        edges_gdf = cudf.concat([edges_gdf, evt_edges_df])

    return evt_edges_df.to_json(orient='values')

In [None]:
def pagerank(messages):
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges_gdf, source="src", destination="dst", renumber=True)    
    
    pr_gdf = cugraph.pagerank(G, alpha=0.85, max_iter=500, tol=1.0e-05)
    pr_gdf['idx'] = pr_gdf['vertex']
    
    return pr_gdf.to_json(orient='values')

In [None]:
ARGS = {'bootstrap.servers': 'localhost:9092'}
output = source.map(process_batch).to_kafka('edge_list', ARGS).map(pagerank).to_kafka('pagerank', ARGS).sink_to_list()

In [None]:
source.start()

#### Copy the two commands below and run each in a new terminal window to see the messages published to the `edge_list` and `pagerank` topics

In [None]:
# run below to see messages sent to output

!kafka_2.13-2.8.0/bin/kafka-console-consumer.sh --topic edge_list --from-beginning --bootstrap-server localhost:9092

In [None]:
!kafka_2.13-2.8.0/bin/kafka-console-consumer.sh --topic pagerank --from-beginning --bootstrap-server localhost:9092

#### Publishing data

In [None]:
# Ingesting data into kafka

!kafka_2.13-2.8.0/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streamz_n_graph < messages.log >/dev/null

## Conclusion

In this notebook, we have shown how CLX and RAPIDS can be used together for real-time graph analytics use cases, where speed and processing power are extremely important. Future work could include generalizing the graph creation process across message types and running more complex analyses on the resulting graph.