# Streaming Log Processing on GPUs

Almost since the coining of the phrase "big data", log processing has been a primary use case for analytics platforms.

Logs are *voluminous*:
A single website visit can result in tens to hundreds of log entries, each with lengthy strings of duplicated client information.

They're *complex*:
Extracting user activities often requires combining multiple records by time and unique session identifier(s).

They're *time-sensitive*:
When something goes wrong, you need to know quickly.

While early big data architectures were oriented towards batch jobs, the focus has shifted to lower-latency solutions. Distributed data processing tools and APIs have made it easier for developers to write _streaming_ applications.

Below we provide an example of how to do streaming web-log processing with RAPIDS, Dask, and Streamz.

## Prerequisites

We assume you're running in a RAPIDS nightly or release container, and thus already have cuDF and Dask installed.

Make sure you have [streamz](https://github.com/python-streamz/streamz) installed.

In [None]:
!conda install -c conda-forge -y streamz ipywidgets

## The Data

For demonstration purposes, we'll use a [publicly available web-log dataset from NASA](http://opensource.indeedeng.io/imhotep/docs/sample-data/).

In [58]:
import os, urllib.request, gzip, shutil

data_dir = '/data/'

if not os.path.exists(data_dir):
    os.mkdir(data_dir)

url = 'http://indeedeng.github.io/imhotep/files/nasa_19950630.22-19950728.12.tsv.gz'
fn = 'logs_noheader.tsv'

if not os.path.isfile(data_dir+"logs.tsv"):
    urllib.request.urlretrieve(url, data_dir+'logs.tsv'+'.gz')
    with gzip.open(data_dir+'logs.tsv'+'.gz', 'rb') as f_in, open(data_dir+'logs.tsv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# We remove the header line to avoid sending it in some batches and not others
with open(data_dir+"logs.tsv", 'rb') as fin:
    # The file is encoded as ISO-8859-1 (Latin-1), so we re-encode it as UTF-8
    data = fin.read().decode('iso-8859-1').encode('utf8').splitlines(True)
    # Strip the line ending so the last column name does not carry a newline
    names = data[0].decode('utf8').rstrip('\r\n').split('\t')
with open(data_dir+fn, 'wb') as fout:
    fout.writelines(data[1:])
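
The cell above loads the whole file into memory, which is fine for this dataset. For much larger logs, a line-by-line variant may be preferable. The following is a minimal sketch, not part of the original notebook: `strip_header` and its parameters are hypothetical names, and the demo runs on a tiny synthetic file rather than the NASA data.

```python
import os
import tempfile

def strip_header(src_path, dst_path, src_encoding='iso-8859-1'):
    """Copy src_path to dst_path as UTF-8 without its first line.

    Returns the column names parsed from the tab-separated header.
    """
    # newline='' keeps the original line endings untouched on both sides
    with open(src_path, 'r', encoding=src_encoding, newline='') as fin, \
         open(dst_path, 'w', encoding='utf-8', newline='') as fout:
        header = fin.readline().rstrip('\r\n')
        for line in fin:  # iterate lazily instead of loading every line
            fout.write(line)
    return header.split('\t')

# Demo on a two-line synthetic file containing one Latin-1 character
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'sample.tsv')
dst = os.path.join(tmp_dir, 'sample_noheader.tsv')
with open(src, 'wb') as f:
    f.write('host\turl\nexample.org\t/caf\xe9\n'.encode('iso-8859-1'))

demo_names = strip_header(src, dst)
print(demo_names)
```

The returned list can be passed as `names=` to `cudf.read_csv` in the same way as the `names` variable above.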

## Inspect the Data

The Google SRE book recommends tracking the [four golden signals](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals) for any important system: latency, traffic, errors, and saturation.

In [59]:
import cudf

df = cudf.read_csv(data_dir+fn, sep='\t', names=names)
df.head().to_pandas()

| | host | logname | time | method | url | response | bytes | referer | useragent |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 199.72.81.55 | - | 804571201 | GET | /history/apollo/ | 200 | 6245 | -1 | -1 |
| 1 | unicomp6.unicomp.net | - | 804571206 | GET | /shuttle/countdown/ | 200 | 3985 | -1 | -1 |
| 2 | 199.120.110.21 | - | 804571209 | GET | /shuttle/missions/sts-73/mission-sts-73.html | 200 | 4085 | -1 | -1 |
| 3 | burger.letters.com | - | 804571211 | GET | /shuttle/countdown/liftoff.html | 304 | 0 | -1 | -1 |
| 4 | 199.120.110.21 | - | 804571211 | GET | /shuttle/missions/sts-73/sts-73-patch-small.gif | 200 | 4179 | -1 | -1 |


The data above doesn't tell us anything about request latency, but we can aggregate it to get a view into traffic, errors, and saturation.

In [60]:
# calculate total requests served per host system
traffic = df.groupby(['host']).host.count()
traffic[traffic > 5].head().to_pandas()

007.thegap.com                     34
01-dynamic-c.wokingham.luna.net    12
02-dynamic-c.wokingham.luna.net    13
03-dynamic-c.wokingham.luna.net    12
04-dynamic-c.rotterdam.luna.net    22
Name: host, dtype: int32

In [61]:
# count HTTP error codes per host system
errors = df[df['response'] >= 500].groupby(['host', 'response']).host.count()
errors.to_pandas()

host                     response
129.130.115.19           501          1
134.57.9.77              501          6
163.205.1.45             500         53
163.205.16.23            501          1
192.83.171.94            501          1
cc.newcastle.edu.au      501          1
n1032036.ksc.nasa.gov    501          1
newcastle03.nbnet.nb.ca  501          2
titan02                  500          4
titan02f                 500          4
Name: host, dtype: int32

In [62]:
# measure possible saturation of host network cards
mb_sent = df.groupby(['host']).bytes.sum()/1000000
mb_sent[mb_sent > 100].head().to_pandas()

163.206.137.21          109.174076
alyssa.prodigy.com      142.961911
news.ti.com             128.065425
piweba1y.prodigy.com    197.037434
piweba2y.prodigy.com    115.076846
Name: bytes, dtype: float64

The output above shows that there are few server errors, which is good news. We can also see the number of hits per host and the total megabytes sent per host.
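
To make the three aggregations concrete, here is a pure-Python sketch that computes the same quantities over the five sample rows printed by `df.head()`. It is only a CPU reference for checking intuition, not how cuDF executes the groupbys; `summarize` and the `*_ref` names are hypothetical.

```python
from collections import Counter

# (host, response, bytes) taken from the five sample rows shown earlier
rows = [
    ('199.72.81.55', 200, 6245),
    ('unicomp6.unicomp.net', 200, 3985),
    ('199.120.110.21', 200, 4085),
    ('burger.letters.com', 304, 0),
    ('199.120.110.21', 200, 4179),
]

def summarize(rows):
    traffic = Counter()     # requests per host
    errors = Counter()      # 5xx responses per (host, response) pair
    bytes_sent = Counter()  # total bytes per host
    for host, response, nbytes in rows:
        traffic[host] += 1
        if response >= 500:
            errors[(host, response)] += 1
        bytes_sent[host] += nbytes
    # Convert bytes to megabytes, matching the division by 1000000 above
    mb_sent = {host: total / 1000000 for host, total in bytes_sent.items()}
    return traffic, errors, mb_sent

traffic_ref, errors_ref, mb_sent_ref = summarize(rows)
print(traffic_ref.most_common(1))
print(dict(errors_ref))
print(mb_sent_ref['199.120.110.21'])
```

None of the five sample rows carries a 5xx response, so the error counter stays empty here.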

### Single GPU Streaming with RAPIDS and Streamz

A single GPU can process a lot of data quickly. Thanks to the Streamz API, it's also easy to do it in streaming fashion.

Many streaming systems surface only the events of interest for operations teams to investigate. We do the same here by reporting just the hosts that exceed a threshold in each batch.

In [99]:
from io import StringIO

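# calculate traffic, errors, and saturation per batch
def process_on_gpu(messages):
    # Messages may be str (file source) or bytes (Kafka); normalize to str and
    # drop trailing newlines so that joining does not produce blank lines
    lines = [m.decode('utf8') if isinstance(m, bytes) else m for m in messages]
    tsv = '\n'.join(line.rstrip('\n') for line in lines)
    df = cudf.read_csv(StringIO(tsv), sep='\t', names=names)
    traffic = df.groupby(['host']).host.count()
    errors = df[df['response'] >= 500].groupby(['host', 'response']).host.count()
    mb_sent = df.groupby(['host']).bytes.sum()/1000000

    # Return the string representation of each metric, keeping only notable hosts
    return {'traffic': str(traffic[traffic > 200]), 'errors': str(errors), 'mb_sent': str(mb_sent[mb_sent > 120])}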

In [127]:
import datetime

# append each metric type to its own file, instead of dumping lots of output to Jupyter
def save_to_file(events):
    dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    for metric in ('traffic', 'errors', 'mb_sent'):
        # open in append mode so earlier batches are not overwritten
        with open(data_dir + metric + '.txt', 'a') as fp:
            fp.write(dt + ':' + events[metric] + '\n')
    print(dt + ': metrics batch written..')

In [131]:
from streamz import Stream

# setup the stream
source = Stream()
# process 250k lines per batch
out = source.partition(250000).map(process_on_gpu).sink(save_to_file)
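
The pipeline above groups emitted lines into fixed-size batches before each GPU call. The pure-Python sketch below mimics that `partition -> map -> sink` flow to show the batching behaviour. `MiniStream` is a hypothetical stand-in written for illustration, not Streamz's implementation. In this sketch a trailing partial batch stays buffered until enough items arrive; we believe Streamz's `partition` behaves similarly, but treat that as an assumption to verify against your installed version.

```python
class MiniStream:
    """Toy stand-in for a partition(n).map(fn).sink(out) pipeline."""

    def __init__(self, batch_size, fn, out):
        self.batch_size = batch_size
        self.fn = fn
        self.out = out
        self.buffer = []

    def emit(self, item):
        # Buffer items until a full batch is available
        self.buffer.append(item)
        if len(self.buffer) == self.batch_size:
            batch, self.buffer = tuple(self.buffer), []
            self.out(self.fn(batch))

results = []
mini = MiniStream(batch_size=3, fn=len, out=results.append)
for i in range(7):
    mini.emit('line %d' % i)

print(results)      # two full batches were processed
print(mini.buffer)  # the seventh line is still waiting for a full batch
```

If the final partial batch matters for your metrics, one option is to choose a batch size that divides the input evenly or to flush the remainder explicitly.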

In [None]:
# stream data from file. This is a slow means of emitting data to a stream
# see below for a faster approach with Kafka
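# the file was re-encoded as UTF-8 earlier, so it is decoded as UTF-8 here
with open(data_dir+fn, 'rb') as fp:
    for line in fp:  # iterate lazily instead of loading every line at once
        source.emit(line.decode('utf8'))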

2019-07-18 20:28:21: metrics batch written..


In [None]:
!echo "Error Log:"
!head /data/errors.txt
!echo "\nTraffic Log:"
!head /data/traffic.txt
!echo "\nMB Sent Log:"
!head /data/mb_sent.txt

### Scaling Streamz to multiple GPUs with Dask & Kafka

Rather than streaming from files, a very common pattern is to read from a distributed log system such as Apache Kafka.

The example below assumes you have a running Kafka instance or cluster.

For help setting up your own, follow the [Kafka Quickstart guide](http://kafka.apache.org/quickstart).
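
Before the consumer can read anything, the log lines have to be published to the topic. The following is a hedged sketch using the `confluent_kafka` `Producer`. The helper names `publish_lines` and `publish_file` are hypothetical, and the default topic and broker address are assumptions chosen to match the consumer configuration used in this notebook.

```python
def publish_lines(lines, producer, topic):
    """Send each log line as one Kafka message; returns the number sent."""
    sent = 0
    for line in lines:
        payload = line.rstrip('\n').encode('utf-8')
        try:
            producer.produce(topic, value=payload)
        except BufferError:
            # Local queue is full: wait for deliveries, then retry once
            producer.poll(1)
            producer.produce(topic, value=payload)
        producer.poll(0)  # serve delivery callbacks without blocking
        sent += 1
    producer.flush()      # block until all queued messages are delivered
    return sent

def publish_file(path, topic='haproxy-topic', bootstrap_servers='localhost:9092'):
    # Imported here so publish_lines can be exercised without Kafka installed
    from confluent_kafka import Producer
    producer = Producer({'bootstrap.servers': bootstrap_servers})
    with open(path, 'r', encoding='utf-8') as fp:
        return publish_lines(fp, producer, topic)

# Example call (requires a reachable broker):
# publish_file('/data/logs_noheader.tsv')
```

Keeping the send loop separate from the connection setup makes it possible to test the loop with a stub producer before pointing it at a real cluster.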

In [None]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# create a Dask cluster with 1 worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)

In [None]:
from streamz import Stream
import confluent_kafka

# Kafka specific configurations
topic = "haproxy-topic"
bootstrap_servers = 'localhost:9092'
consumer_conf = {'bootstrap.servers': bootstrap_servers, 'group.id': 'custreamz', 'session.timeout.ms': 60000}

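# dask=False processes batches in the local process; to run them on the Dask
# cluster created above, pass dask=True and add .gather() before the sink
stream = Stream.from_kafka_batched(topic, consumer_conf, poll_interval='1s', npartitions=1, asynchronous=True, dask=False)
final_output = stream.map(process_on_gpu).sink(print)
stream.start()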