## Hello World with cuDF and Streamz

This notebook uses cuDF to perform a streaming word count with a small portion of the [Streamz API](https://streamz.readthedocs.io/en/latest/).

First, make sure you have installed the [Streamz](https://github.com/python-streamz/streamz) library.

In [1]:
!conda install -c conda-forge -y streamz

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /conda/envs/rapids

  added / updated specs:
    - streamz


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    streamz-0.5.1              |             py_0          44 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          44 KB

The following NEW packages will be INSTALLED:

  streamz            conda-forge/noarch::streamz-0.5.1-py_0



Downloading and Extracting Packages
streamz-0.5.1        | 44 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


## Getting Started

First, import the required packages. We'll generate data programmatically and process it in streaming batches.

In [1]:
from streamz import Stream
import cudf, json

# create a list of static messages
messages = [
    {'msg': 'Hello, World!'},
    {'msg': 'hi, world!'},
    {'msg': 'hey world'},
    {'msg': 'Hi'}
]
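
Later in the notebook each message is emitted as a JSON string, and a batch of those strings is joined with newlines before being parsed. As a minimal sketch of that newline-delimited JSON payload, using only the standard library (the names `encoded`, `payload`, and `decoded` are illustrative and not part of the notebook):

```python
import json

# Same static messages as in the cell above
messages = [
    {'msg': 'Hello, World!'},
    {'msg': 'hi, world!'},
    {'msg': 'hey world'},
    {'msg': 'Hi'}
]

# Serialize each message to one JSON document
encoded = [json.dumps(m) for m in messages]

# Join the documents with newlines: one JSON object per line
payload = "\n".join(encoded)
print(payload)

# Parsing the payload line by line recovers the original records
decoded = [json.loads(line) for line in payload.splitlines()]
```

A reader that accepts a `lines=True` style option expects exactly this shape: one complete JSON object per line.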

## Define a Function to Run Per Batch

Some streaming systems handle one event at a time, but running a per-event process on a GPU is not ideal: every call pays the latency of a PCI-E memory transfer plus kernel call overhead.

In this example, we'll process events one batch at a time.

In [2]:
# define function to run per batch
def process_on_gpu(messages):
    # read the batch of messages into a GPU DataFrame
    df = cudf.read_json('\n'.join(messages), lines=True)
    
    # split each line into columns, one per word
    tmp = df['msg'].str.split()
    
    # combine all word columns into a single Series
    words = tmp[tmp.columns[0]]
    for word in tmp.columns[1:]:
        words = words.append(tmp[word])
    
    # remove punctuation, lower-case
    words = words.str.fillna('').replace(',', '').replace('!', '').str.lower()

    # compute and return word counts for the batch
    return str(words.groupby(words).count())
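
To sanity-check the GPU function without a GPU, the same word count can be sketched on the CPU with the standard library. This is an illustrative stand-in, not the notebook's method: `process_on_cpu` is a hypothetical helper, and it assumes the same cleaning rules as above (drop `,` and `!`, lower-case).

```python
import json
from collections import Counter

def process_on_cpu(batch):
    """Count words in a batch of JSON-encoded messages (CPU-only reference)."""
    counts = Counter()
    for raw in batch:
        text = json.loads(raw)["msg"]
        for word in text.split():
            # Same cleaning as the GPU version: strip ',' and '!', lower-case
            cleaned = word.replace(",", "").replace("!", "").lower()
            if cleaned:
                counts[cleaned] += 1
    # Sort by word so the result is easy to compare by eye
    return dict(sorted(counts.items()))

messages = [
    {'msg': 'Hello, World!'},
    {'msg': 'hi, world!'},
    {'msg': 'hey world'},
    {'msg': 'Hi'}
]

# Build one batch of 10 events by cycling through the static messages
batch = [json.dumps(messages[i % len(messages)]) for i in range(10)]
result = process_on_cpu(batch)
print(result)  # {'hello': 3, 'hey': 2, 'hi': 5, 'world': 8}
```

Unlike the GPU version, this sketch counts only real words. The GPU function also fills missing values with an empty string, so its output may include a count for a blank entry when messages have different numbers of words.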

## Set Up the Stream and Emit Data

In [3]:
source = Stream()

# GPUs like to process sizable chunks of data at once
# source.partition(n) sends n events at a time to downstream functions
source.partition(10).map(process_on_gpu).sink(print)

# 30 events partitioned into groups of 10 yield 3 "batches"
n_messages = 30
for idx in range(0, n_messages):
    source.emit(json.dumps(messages[idx % len(messages)]))

    2
hello    3
hey    2
hi    5
world    8
dtype: int32
    3
hello    2
hey    3
hi    5
world    7
dtype: int32
    2
hello    3
hey    2
hi    5
world    8
dtype: int32
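
The grouping that produces these three results can be sketched in plain Python. This is a simplified model of fixed-size batching, not the Streamz implementation: `partition_sketch` is a hypothetical helper, and it assumes an incomplete trailing group is held back rather than emitted.

```python
def partition_sketch(events, n):
    """Yield tuples of n consecutive events; an incomplete tail is not emitted."""
    buffer = []
    for event in events:
        buffer.append(event)
        if len(buffer) == n:
            yield tuple(buffer)
            buffer = []

# 30 events in groups of 10 give exactly 3 batches
batches = list(partition_sketch(range(30), 10))
sizes = [len(b) for b in batches]
print(sizes)  # [10, 10, 10]

# 25 events give only 2 full batches; the last 5 events stay buffered
short = list(partition_sketch(range(25), 10))
print(len(short))  # 2
```

Because 10 is not a multiple of the 4 static messages, each batch starts at a different point in the cycle, which is why the second batch's counts differ from the first and third.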
