# Streaming Log Processing on GPUs

Almost since the phrase "big data" was coined, log processing has been a primary use case for analytics platforms.

Logs are *voluminous*:
A single website visit can result in tens to hundreds of log entries, each with lengthy strings of duplicated client information.

They're *complex*:
Extracting user activities often requires combining multiple records by time and unique session identifier(s).

They're *time-sensitive*:
When something goes wrong, you need to know quickly.

While early big data architectures were oriented towards batch jobs, the focus has shifted to lower-latency solutions. Distributed data processing tools and APIs have made it easier for developers to write _streaming_ applications.

Below we provide an example of how to do streaming web-log processing with RAPIDS, Dask, and Streamz.

## Prerequisites

We assume you're running in a RAPIDS nightly or release container, and thus already have cuDF and Dask installed.

Make sure you have [streamz](https://github.com/python-streamz/streamz) installed.

In [None]:
!conda install -c conda-forge -y streamz ipywidgets

## The Data

For demonstration purposes, we'll use a [publicly available web-log dataset from NASA](http://opensource.indeedeng.io/imhotep/docs/sample-data/).

In [None]:
import os, urllib.request, gzip, io

data_dir = '/rapidsai/notebooks-extended/getting_started_notebooks/basics/'

# create the directory (and any missing parents); no error if it already exists
os.makedirs(data_dir, exist_ok=True)

url = 'http://indeedeng.github.io/imhotep/files/nasa_19950630.22-19950728.12.tsv.gz'
fn = 'logs_noheader.tsv'

fileStream = io.BytesIO(urllib.request.urlopen(url).read())

# We remove the header line to avoid sending it in some batches and not others
with gzip.open(fileStream, 'rb') as f_in, open(data_dir + fn, 'wb') as fout:
    # This is a latin character set so we must re-encode it
    data = f_in.read().decode('iso-8859-1')
    p_data = data.partition('\n')
    names = p_data[0].split()
    fout.write(p_data[2].encode('utf8'))

## Inspect the Data

The Google SRE book recommends tracking the [four golden signals](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals) (latency, traffic, errors, and saturation) for any important system.

In [None]:
import cudf

df = cudf.read_csv(data_dir + fn, sep='\t', names=names)
df.head().to_pandas()

The data above doesn't tell us anything about request latency, but we can aggregate it to get a view into traffic, errors, and saturation.

In [None]:
# calculate total requests served per host system
traffic = df.groupby(['host']).host.count()
traffic[traffic > 5].head().to_pandas()

In [None]:
# count HTTP error codes per host system
errors = df[df['response'] >= 500].groupby(['host', 'response']).host.count()
errors.to_pandas()

In [None]:
# measure possible saturation of host network cards
mb_sent = df.groupby(['host']).bytes.sum()/1000000
mb_sent[mb_sent > 100].head().to_pandas()

The output above shows that there are not many errors, which is great. It also shows the number of hits per host and the total megabytes sent per host.

### Single GPU Streaming with RAPIDS and Streamz

A single GPU can process a lot of data quickly. Thanks to the Streamz API, it's also easy to do it in streaming fashion.

Many streaming systems surface events of interest for ops teams to investigate, and we take the same approach here.

In [None]:
from io import BytesIO, StringIO

# calculate traffic, errors, and saturation per batch
def process_on_gpu(messages):
    # Batches may hold decoded strings or UTF-8 encoded bytes;
    # joining bytes with a str separator raises TypeError
    try:
        message_stream = StringIO('\n'.join(messages))
    except TypeError:
        message_stream = BytesIO(b'\n'.join(messages))
                       
    df = cudf.read_csv(message_stream, sep='\t', names=names)
    traffic = df.groupby(['host']).host.count()
    errors = df[df['response'] >= 500].groupby(['host', 'response']).host.count()
    mb_sent = df.groupby(['host']).bytes.sum()/1000000

    # Return the string representation of each metric, keeping only hosts above the thresholds
    return {'traffic': str(traffic[traffic > 200]), 'errors': str(errors), 'mb_sent': str(mb_sent[mb_sent > 120])}
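
To make the three aggregations concrete without a GPU, here is a minimal standard-library sketch of what `process_on_gpu` computes for each batch. The sample rows, the three-column subset (`host`, `response`, `bytes`), and the helper name `summarize_batch` are illustrative assumptions; they are not part of the dataset or of the cuDF API.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical three-column sample; the real log has more columns.
SAMPLE_COLUMNS = ["host", "response", "bytes"]
SAMPLE_BATCH = [
    "a.example.com\t200\t1500000",
    "a.example.com\t500\t0",
    "b.example.com\t200\t2500000",
    "a.example.com\t200\t500000",
]

def summarize_batch(lines, columns=SAMPLE_COLUMNS):
    """Compute per-host traffic, 5xx error counts, and megabytes sent."""
    reader = csv.DictReader(io.StringIO("\n".join(lines)),
                            fieldnames=columns, delimiter="\t")
    traffic = Counter()            # requests per host
    errors = Counter()             # (host, status) pairs with status >= 500
    mb_sent = defaultdict(float)   # megabytes sent per host
    for row in reader:
        host, status = row["host"], int(row["response"])
        traffic[host] += 1
        if status >= 500:
            errors[(host, status)] += 1
        mb_sent[host] += int(row["bytes"]) / 1_000_000
    return {"traffic": dict(traffic), "errors": dict(errors), "mb_sent": dict(mb_sent)}

metrics = summarize_batch(SAMPLE_BATCH)
```

The cuDF version performs the same group-by logic on the GPU and then filters each result by a threshold before returning it.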

In [None]:
import datetime

# save each metric type to its own file, instead of dumping lots of output to Jupyter
# note: the files are opened in write mode, so each one holds only the most recent batch
def save_to_file(events):
    dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    for metric in ('traffic', 'errors', 'mb_sent'):
        with open(data_dir + metric + '.txt', 'w') as fp:
            fp.write(dt + ':' + events[metric])
    print(dt + ': metrics batch written..')
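
Because `save_to_file` opens each file in a truncating write mode, every batch replaces the previous one. If you would rather keep a history of batches, a minimal sketch of an appending variant follows. The helper `append_metrics` and the temporary output directory are hypothetical and used only for illustration.

```python
import datetime
import os
import tempfile

def append_metrics(events, out_dir):
    """Append one timestamped entry per metric to <out_dir>/<metric>.txt."""
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    for metric, text in events.items():
        path = os.path.join(out_dir, metric + ".txt")
        # Mode "a" keeps earlier batches; "w" or "w+" would truncate the file.
        with open(path, "a") as fp:
            fp.write(stamp + ":" + text + "\n")

# Demonstration with two fake batches in a temporary directory.
demo_dir = tempfile.mkdtemp()
append_metrics({"errors": "batch 1"}, demo_dir)
append_metrics({"errors": "batch 2"}, demo_dir)
with open(os.path.join(demo_dir, "errors.txt")) as fp:
    history = fp.read().splitlines()
```

The trade-off is that appended files grow across notebook runs, so they need to be cleared or rotated by hand.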

In [None]:
from streamz import Stream

# setup the stream
source = Stream.from_textfile(data_dir + fn)
# process 250k lines per batch
out = source.partition(250000).map(process_on_gpu).sink(save_to_file)
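
The `partition(250000)` step groups incoming lines into fixed-size tuples before they reach `process_on_gpu`. The standard-library sketch below mimics that batching to show one consequence: lines left over after the last full batch stay buffered and are never processed. The helper `fixed_batches` is hypothetical, and it assumes Streamz's default behavior of emitting only full partitions.

```python
def fixed_batches(items, n):
    """Yield tuples of exactly n items; a trailing partial batch is held back."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) == n:
            yield tuple(buffer)
            buffer = []
    # Anything still in `buffer` here is never yielded, mirroring a
    # partition step that waits for a full batch.

lines = ["line %d" % i for i in range(7)]
batches = list(fixed_batches(lines, 3))
processed = sum(len(batch) for batch in batches)
```

With 7 lines and a batch size of 3, only 6 lines are processed. When the tail of the log matters, choose a batch size that divides the input evenly or handle the remainder separately.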

In [None]:
source.start()

In [None]:
# emit the file line by line; this is a slow way to feed a stream
# (see the Kafka section below for a faster approach)
with open(data_dir + fn, 'rb') as fp:
    for line in fp:
        source.emit(line)

In [None]:
!echo "Error Log:"
!head {data_dir}errors.txt
!echo "\nTraffic Log:"
!head {data_dir}traffic.txt
!echo "\nMB Sent Log:"
!head {data_dir}mb_sent.txt

### Scaling Streamz to multiple GPUs with Dask & Kafka

Rather than streaming from files, a very common pattern is to read from distributed log systems like Apache Kafka.

The example below assumes you have a running Kafka instance or cluster.

For help setting up your own, follow the [Kafka Quickstart guide](http://kafka.apache.org/quickstart).
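
To try the Kafka example you first need log lines in a topic. A minimal sketch of a loader follows, assuming the `confluent_kafka` package and a broker at `localhost:9092`. `publish_lines` is a hypothetical helper that accepts any object exposing `produce`, `poll`, and `flush`, which is the interface `confluent_kafka.Producer` provides. A stand-in producer is used here so the sketch runs without a broker.

```python
def publish_lines(producer, topic, lines):
    """Send each non-empty line to `topic` and return how many were sent."""
    sent = 0
    for line in lines:
        payload = line.rstrip(b"\n")
        if not payload:
            continue
        producer.produce(topic, value=payload)
        producer.poll(0)   # serve delivery callbacks so the local queue drains
        sent += 1
    producer.flush()       # block until outstanding messages are delivered
    return sent

class RecordingProducer:
    """Stand-in with the same three methods, used only for this demo."""
    def __init__(self):
        self.messages = []

    def produce(self, topic, value=None):
        self.messages.append((topic, value))

    def poll(self, timeout):
        return 0

    def flush(self):
        return 0

# With a real broker (assumption: localhost:9092), swap in:
#   from confluent_kafka import Producer
#   producer = Producer({"bootstrap.servers": "localhost:9092"})
producer = RecordingProducer()
count = publish_lines(producer, "haproxy-topic",
                      [b"a\t200\t10\n", b"\n", b"b\t500\t0\n"])
```

To load the downloaded log, open it in binary mode and pass the file object as `lines`, for example `publish_lines(producer, "haproxy-topic", open(data_dir + fn, "rb"))`.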

In [None]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# create a Dask cluster with 1 worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)

In [None]:
from streamz import Stream
import confluent_kafka

# Kafka specific configurations
topic = "haproxy-topic"
bootstrap_servers = 'localhost:9092'
consumer_conf = {'bootstrap.servers': bootstrap_servers, 'group.id': 'custreamz', 'session.timeout.ms': 60000}

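# dask=False processes batches in the local process; to run the map step on the
# Dask workers created above, pass dask=True and add .gather() before the sink
stream = Stream.from_kafka_batched(topic, consumer_conf, poll_interval='1s', npartitions=4, asynchronous=True, dask=False)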
final_output = stream.map(process_on_gpu).sink(print)
stream.start()
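
Kafka message values arrive as bytes, while other sources may emit decoded strings, which is why `process_on_gpu` handles both. A minimal sketch of that dispatch with an explicit type check is shown below. `to_buffer` is a hypothetical helper, and it assumes every message in a batch has the same type.

```python
from io import BytesIO, StringIO

def to_buffer(messages):
    """Join a batch into one file-like object, matching the message type."""
    if not messages:
        raise ValueError("empty batch")
    if isinstance(messages[0], bytes):
        return BytesIO(b"\n".join(messages))
    return StringIO("\n".join(messages))

text_buf = to_buffer(["h1\t200\t10", "h2\t404\t0"])
byte_buf = to_buffer([b"h1\t200\t10", b"h2\t404\t0"])
```

Either buffer can then be passed to `cudf.read_csv`, as `process_on_gpu` does.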