# Streaming Datasets in RAPIDS

RAPIDS can handle streaming datasets as well as static ones. Doing data science on streaming data poses several challenges that do not arise with static datasets. These challenges are inherent to working with data that may arrive from several different sources. Each source generally has its own SLA and performance level, which complicates matters for a downstream consumer that wants to combine several of them. The consumer must also handle late-arriving data, windowing, and transformations of the data from the different sources.
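
To make windowing and late-arriving data concrete, here is a minimal plain-Python sketch. It does not use Streamz or cuDF, and the function name, the `allowed_lateness` rule, and the sample events are all assumptions made up for illustration: events are counted in fixed-size windows, and any event that lags too far behind the newest one seen so far is set aside.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size, allowed_lateness):
    """Count events per fixed-size window, setting aside events that arrive too late.

    events: iterable of (event_time, source) pairs in arrival order.
    Returns (counts, dropped), where counts maps window start -> event count.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")  # highest event time seen so far

    for event_time, source in events:
        watermark = max(watermark, event_time)
        if event_time < watermark - allowed_lateness:
            # Too far behind the newest event: treat as late.
            dropped.append((event_time, source))
            continue
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1

    return dict(counts), dropped

# Hypothetical events from two sources; source "b" delivers out of order.
events = [(1, "a"), (2, "b"), (11, "a"), (9, "b"), (25, "a"), (3, "b")]
counts, dropped = tumbling_window_counts(events, window_size=10, allowed_lateness=5)
print(counts)   # events per window, keyed by window start
print(dropped)  # events rejected as too late
```

Even this toy version has to make policy decisions (how late is too late, what to do with rejected events), which is the kind of bookkeeping a streaming library is meant to take off the consumer's hands.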

As you can see, this is no trivial effort. For these reasons, RAPIDS has adopted Streamz to help alleviate these problems for our users. Streamz lets consumers set up data pipelines that manage the complexities of streaming data more easily. Let's take a look at what's involved in using Streamz with RAPIDS.

## Install Streamz with Conda

In [None]:
!conda install -c conda-forge -y streamz
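
After installing, it can help to confirm that the packages imported by the cells in this notebook are visible to the current environment. The helper below is a small sketch using only the standard library; `missing_packages` is a name made up here, not part of Streamz or RAPIDS.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages imported by the cells in this notebook.
required = ["streamz", "cudf", "dask_cuda", "confluent_kafka"]
missing = missing_packages(required)
print("missing:", missing if missing else "none")
```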

## Getting Started

### Basic Usage

When used in conjunction with RAPIDS/cuDF, Streamz offers an impressive list of features that sit on top of RAPIDS while taking advantage of the computational efficiency that GPUs provide.

Let's take a look at a simple example of passing data from Streamz to cuDF.

In [24]:
from streamz import Stream
import cudf

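# Read the sample log; each line is passed to the stream as one message.
with open("haproxy_data.txt", "r") as myfile:
    data = myfile.readlines()

def gpu_preprocess_simple_agg(messages):
    # Join the batch into one newline-delimited JSON string and parse it with cuDF.
    json_input_string = "\n".join(messages)
    gdf = cudf.read_json(json_input_string, lines=True)
    # Split log_ip on whitespace; tmp holds one column per token.
    tmp = gdf['log_ip'].str.split()
    ips = tmp[tmp.columns[0]]

    # Stack the remaining token columns into a single Series of IPs.
    for ip in tmp.columns[1:]:
        ips = ips.append(tmp[ip])

    # Count the occurrences of each IP and return them as a printable string.
    return str(ips.groupby(ips).count())

source = Stream()
# Group messages into batches of 10,000, aggregate each batch, and print the result.
source.partition(10000).map(gpu_preprocess_simple_agg).sink(print)

# Push the file through the stream one line at a time.
for line in data:
    source.emit(line)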

10.0.0.3    2057
10.2.34.6    1965
10.23.34.1    2070
14.2.3.4    1978
15.2.6.9    1930
Name: 0, dtype: int32
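
To see how data moves through `partition(...).map(...).sink(...)`, here is a CPU-only stand-in written in plain Python. It is a sketch of the data flow, not Streamz's implementation: the `BatchingPipeline` class, the tiny batch size, and the sample messages are all made up for illustration.

```python
from collections import Counter

class BatchingPipeline:
    """Plain-Python stand-in for source.partition(n).map(func).sink(callback)."""

    def __init__(self, batch_size, func, callback):
        self.batch_size = batch_size
        self.func = func
        self.callback = callback
        self.buffer = []

    def emit(self, item):
        # Buffer items until a full batch is available, then process it.
        self.buffer.append(item)
        if len(self.buffer) == self.batch_size:
            batch, self.buffer = tuple(self.buffer), []
            self.callback(self.func(batch))

results = []
pipeline = BatchingPipeline(
    batch_size=2,
    func=lambda batch: dict(Counter(batch)),  # per-batch aggregation
    callback=results.append,                  # the "sink"
)

# Hypothetical messages; the fifth one never fills a batch.
for message in ["a", "b", "a", "c", "a"]:
    pipeline.emit(message)

print(results)          # one aggregated result per full batch
print(pipeline.buffer)  # leftover items still waiting for a full batch
```

One consequence worth checking against your Streamz version: in this sketch, messages that do not fill a complete batch stay in the buffer and never reach the sink, so a batch size of 10,000 only produces output once 10,000 messages have been emitted.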


### Consuming Data from Kafka with Streamz

In [31]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from streamz import Stream
import cudf
import confluent_kafka  # Kafka client used by the Streamz Kafka source

# Start a local Dask cluster on the available GPUs and connect a client to it.
cluster = LocalCUDACluster()
client = Client(cluster)

# Kafka-specific configuration
topic = "haproxy-topic"
bootstrap_servers = 'localhost:9092'
consumer_conf = {
    'bootstrap.servers': bootstrap_servers,
    'group.id': 'custreamz',
    'session.timeout.ms': 60000,
}

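def gpu_preprocess_simple_agg(messages):
    # Kafka message payloads arrive as bytes, so decode them before joining.
    json_input_string = "\n".join([msg.decode('utf-8') for msg in messages])
    gdf = cudf.read_json(json_input_string, lines=True)
    # Split log_ip on whitespace; tmp holds one column per token.
    tmp = gdf['log_ip'].str.split()
    ips = tmp[tmp.columns[0]]

    # Stack the remaining token columns into a single Series of IPs.
    for ip in tmp.columns[1:]:
        ips = ips.append(tmp[ip])

    # Count the occurrences of each IP and return them as a printable string.
    return str(ips.groupby(ips).count())

# Poll Kafka every second; each batch of messages is aggregated and printed.
stream = Stream.from_kafka_batched(topic, consumer_conf, poll_interval='1s',
                                   npartitions=1, asynchronous=True, dask=False)
final_output = stream.map(gpu_preprocess_simple_agg).sink(print)
stream.start()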

10.0.0.3    2057
10.2.34.6    1965
10.23.34.1    2070
14.2.3.4    1978
15.2.6.9    1930
Name: 0, dtype: int32
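
When debugging a pipeline like this, a CPU reference for the aggregation can be handy for checking the GPU result on a small batch. The sketch below uses only the standard library. It assumes each message is a UTF-8 JSON object with a whitespace-separated `log_ip` field, as the code above implies; the function name and the sample messages are made up for illustration.

```python
import json
from collections import Counter

def cpu_preprocess_simple_agg(messages):
    """Count whitespace-separated IPs in the log_ip field of JSON messages."""
    counts = Counter()
    for msg in messages:
        record = json.loads(msg.decode("utf-8"))
        counts.update(record["log_ip"].split())
    return dict(counts)

# Hypothetical messages; real haproxy records will carry more fields.
sample = [
    b'{"log_ip": "10.0.0.3 14.2.3.4"}',
    b'{"log_ip": "10.0.0.3"}',
    b'{"log_ip": "15.2.6.9 10.0.0.3 14.2.3.4"}',
]
print(cpu_preprocess_simple_agg(sample))
```

Running both versions on the same handful of messages and comparing the counts is a quick way to confirm that the stream is delivering what you expect before scaling up.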
