# Streaming Log Processing on GPUs

Almost since the phrase "big data" was coined, log processing has been a primary use case for analytics platforms.

Logs are *voluminous*:
A single website visit can result in tens to hundreds of log entries, each with lengthy strings of duplicated client information.

They're *complex*:
Extracting user activities often requires combining multiple records by time and unique session identifier(s).

They're *time-sensitive*:
When something goes wrong, you need to know quickly.

While early big data architectures were oriented towards batch jobs, the focus has shifted to lower-latency solutions. Distributed data processing tools and APIs have made it easier for developers to write _streaming_ applications.

Below we provide an example of how to do streaming web-log processing with RAPIDS, Dask, and Streamz.

## Prerequisites

We assume you're running in a RAPIDS nightly or release container, and thus already have cuDF and Dask installed.

Make sure you have [streamz](https://github.com/python-streamz/streamz) installed.

In [1]:
!conda install -c conda-forge -y streamz

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



## The Data

For demonstration purposes, we'll use a [publicly available web-log dataset from NASA](http://opensource.indeedeng.io/imhotep/docs/sample-data/).

In [2]:
import os, urllib.request, gzip, shutil

data_dir = '/data/'
url = 'http://indeedeng.github.io/imhotep/files/nasa_19950630.22-19950728.12.tsv.gz'
fn = 'logs.tsv'

# download and decompress the logs only if they are not already on disk
if not os.path.isfile(data_dir+fn):
    urllib.request.urlretrieve(url, data_dir+fn+'.gz')
    with gzip.open(data_dir+fn+'.gz', 'rb') as f_in, open(data_dir+fn, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

## Inspect the Data

Google's SRE book recommends tracking the [four golden signals](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals) of latency, traffic, errors, and saturation for any important system.

In [3]:
import cudf

df = cudf.read_csv(data_dir+fn, sep='\t')
df.head().to_pandas()

Unnamed: 0,host,logname,time,method,url,response,bytes,referer,useragent
0,199.72.81.55,-,804571201,GET,/history/apollo/,200,6245,-1,-1
1,unicomp6.unicomp.net,-,804571206,GET,/shuttle/countdown/,200,3985,-1,-1
2,199.120.110.21,-,804571209,GET,/shuttle/missions/sts-73/mission-sts-73.html,200,4085,-1,-1
3,burger.letters.com,-,804571211,GET,/shuttle/countdown/liftoff.html,304,0,-1,-1
4,199.120.110.21,-,804571211,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,200,4179,-1,-1


The data above doesn't tell us anything about request latency, but we can aggregate it to get a view into traffic, errors, and saturation.

In [4]:
# calculate total requests served per host system
traffic = df.groupby(['host']).host.count()
traffic.sort_values(ascending=False).head().to_pandas()

piweba3y.prodigy.com    12830
piweba4y.prodigy.com     7787
piweba1y.prodigy.com     7015
alyssa.prodigy.com       5280
siltb10.orl.mmc.com      4298
Name: host, dtype: int32

In [5]:
# count HTTP error codes per host system
errors = df[df['response'] >= 500].groupby(['host', 'response']).host.count()
errors.to_pandas()

host                     response
129.130.115.19           501          1
134.57.9.77              501          6
163.205.1.45             500         53
163.205.16.23            501          1
192.83.171.94            501          1
cc.newcastle.edu.au      501          1
n1032036.ksc.nasa.gov    501          1
newcastle03.nbnet.nb.ca  501          2
titan02                  500          4
titan02f                 500          4
Name: host, dtype: int32

In [6]:
# measure possible saturation of host network cards
mb_sent = df.groupby(['host']).bytes.sum()/1000000
mb_sent.sort_values(ascending=False).head().to_pandas()

piweba3y.prodigy.com    333.174661
piweba1y.prodigy.com    197.037434
piweba4y.prodigy.com    170.347230
alyssa.prodigy.com      142.961911
news.ti.com             128.065425
Name: bytes, dtype: float64
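Traffic is often easier to read as a rate over time than as a grand total. The sketch below buckets requests per minute using only the Python standard library. It assumes the `time` column holds Unix epoch seconds, which the sample values suggest but the dataset page should confirm; `requests_per_minute` is a name introduced here for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

def requests_per_minute(epoch_seconds):
    """Count requests per one-minute bucket (assumes Unix epoch seconds)."""
    counts = Counter()
    for ts in epoch_seconds:
        # round down to the start of the minute, then convert to a UTC datetime
        minute = datetime.fromtimestamp(ts - ts % 60, tz=timezone.utc)
        counts[minute] += 1
    return counts

# the first five `time` values from the sample rows
times = [804571201, 804571206, 804571209, 804571211, 804571211]
per_minute = requests_per_minute(times)
for minute, n in sorted(per_minute.items()):
    print(minute.isoformat(), n)
```

The same bucketing could be applied per batch in a streaming job, so each batch reports a request rate instead of a raw count.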

## Single GPU Streaming with RAPIDS and Streamz

A single GPU can process a lot of data quickly. Thanks to the Streamz API, it's also easy to do it in streaming fashion.

In [7]:
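from io import StringIO

# column order from the file header; only the first batch contains the header row
columns = ['host', 'logname', 'time', 'method', 'url', 'response', 'bytes', 'referer', 'useragent']

# calculate traffic, errors, and saturation per batch
def process_on_gpu(messages):
    # each message is one newline-terminated line; drop the header row if present
    lines = [m for m in messages if not m.startswith('host\t')]
    # wrap the text in a buffer: a bare string is treated as a file path (FileNotFoundError)
    df = cudf.read_csv(StringIO(''.join(lines)), sep='\t', names=columns)

    traffic = df.groupby(['host']).host.count()
    errors = df[df['response'] >= 500].groupby(['host', 'response']).host.count()
    mb_sent = df.groupby(['host']).bytes.sum()/1000000
    return {'traffic': traffic, 'errors': errors, 'mb_sent': mb_sent}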
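To sanity-check the GPU results on a small batch, the same three aggregations can be written with the Python standard library. This is only a sketch: `process_on_cpu` and `COLUMNS` are names introduced here for illustration, and it assumes each message is one tab-separated line in the column order of the file header, with integer `response` and `bytes` fields.

```python
import csv
from collections import Counter

# Column order taken from the header row of the log file.
COLUMNS = ['host', 'logname', 'time', 'method', 'url',
           'response', 'bytes', 'referer', 'useragent']

def process_on_cpu(messages):
    """Hypothetical CPU reference for one batch of tab-separated log lines."""
    traffic, errors, mb_sent = Counter(), Counter(), Counter()
    rows = csv.reader((m.rstrip('\n') for m in messages),
                      delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in rows:
        if not row or row[0] == 'host':   # skip blank lines and the header row
            continue
        rec = dict(zip(COLUMNS, row))
        traffic[rec['host']] += 1
        if int(rec['response']) >= 500:
            errors[(rec['host'], int(rec['response']))] += 1
        mb_sent[rec['host']] += int(rec['bytes']) / 1000000
    return {'traffic': traffic, 'errors': errors, 'mb_sent': mb_sent}

# a tiny batch built from the sample rows, including the header line
batch = [
    'host\tlogname\ttime\tmethod\turl\tresponse\tbytes\treferer\tuseragent\n',
    '199.72.81.55\t-\t804571201\tGET\t/history/apollo/\t200\t6245\t\t\n',
    '199.120.110.21\t-\t804571209\tGET\t/shuttle/missions/sts-73/mission-sts-73.html\t200\t4085\t\t\n',
    '199.120.110.21\t-\t804571211\tGET\t/shuttle/missions/sts-73/sts-73-patch-small.gif\t200\t4179\t\t\n',
]
result = process_on_cpu(batch)
print(result['traffic'])
```

Comparing these counters against the cuDF output for the same batch is a cheap way to catch parsing mistakes, such as a header row being counted as data.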

In [8]:
from streamz import Stream

# setup the stream
source = Stream()
# process 10 lines per batch
source.partition(10).map(process_on_gpu).sink(print)

<sink: print>
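It helps to know how batching behaves before emitting data. The sketch below imitates, in plain Python, what `partition(n)` is understood to do: buffer incoming items and pass a tuple downstream once `n` have arrived. `TinyPartition` is a hypothetical stand-in written for this illustration, not streamz code. One consequence it shows is that a final, incomplete batch stays in the buffer and is not processed unless it is flushed some other way.

```python
class TinyPartition:
    """Hypothetical stand-in for a fixed-size batching node."""

    def __init__(self, n, downstream):
        self.n = n
        self.downstream = downstream
        self.buffer = []

    def emit(self, item):
        # collect items until the buffer holds exactly n of them
        self.buffer.append(item)
        if len(self.buffer) == self.n:
            batch, self.buffer = tuple(self.buffer), []
            self.downstream(batch)

batches = []
node = TinyPartition(3, batches.append)
for i in range(7):
    node.emit(i)

print(batches)       # the full batches that reached downstream
print(node.buffer)   # items still waiting for the batch to fill
```

When the number of log lines is not a multiple of the batch size, the trailing lines are affected in the same way, so totals computed from the stream can fall slightly short of totals computed from the whole file.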

In [17]:
fn

'logs.tsv'

In [18]:
# stream data from file, one line at a time
with open(data_dir+fn, 'r', encoding="ISO-8859-1") as fp:
    for line in fp:
        source.emit(line)


### Consuming Data from Kafka with Streamz

In [31]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# create a Dask cluster with 1 worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)



In [None]:
from streamz import Stream
import confluent_kafka

# Kafka specific configurations
topic = "haproxy-topic"
bootstrap_servers = 'localhost:9092'
consumer_conf = {'bootstrap.servers': bootstrap_servers, 'group.id': 'custreamz', 'session.timeout.ms': 60000}

stream = Stream.from_kafka_batched(topic, consumer_conf, poll_interval='1s', npartitions=1, asynchronous=True, dask=False)
final_output = stream.map(process_on_gpu).sink(print)
stream.start()
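Kafka delivers message values as raw bytes, while `process_on_gpu` joins its input as text, so a small adapter may be needed between the two. The helper below, `decode_batch`, is a hypothetical name introduced here. It assumes each message value is one UTF-8 encoded log line, which may not hold for your topic.

```python
def decode_batch(batch, encoding='utf-8'):
    """Turn a batch of bytes (or str) messages into newline-terminated text lines."""
    lines = []
    for msg in batch:
        # decode raw bytes; pass through anything that is already text
        text = msg.decode(encoding, errors='replace') if isinstance(msg, bytes) else msg
        # make sure every line ends with exactly the newline it needs
        lines.append(text if text.endswith('\n') else text + '\n')
    return lines

sample = [b'10.0.0.3\t-\t804571201\tGET\t/\t200\t6245\t\t', 'already text\n']
decoded = decode_batch(sample)
print(decoded)
```

It could be placed in front of the processing step, for example `stream.map(decode_batch).map(process_on_gpu)`.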