**This is a cuStreamz job for the classic example of Streaming Word Count.**

This example demonstrates how to stream from a text file. The tmpfile() helper comes from streamz.utils_test, which depends on pytest, so please install pytest (for example, with conda) before running it.

In [1]:
#Streamz and cudf imports
import cudf
from streamz import Stream
from streamz.dataframe import DataFrame
from streamz.utils_test import tmpfile
import numpy as np
import time

Let's assume that data is being written to a text file and that each line has the form "this is line x", where x is an incrementing counter.

Now we write a function that parses each line into a list of words.

One can also make use of nvstrings (now custrings, the GPU-accelerated string manipulation library) to tokenise each line. Refer to process_line_nvstrings().

In [2]:
def process_line(line):
    words = line.strip('\n').split(" ")
    return words

import nvstrings, nvtext
def process_line_nvstrings(line):
    device_line = nvstrings.to_device(line)
    words = nvtext.tokenize(device_line)
    return words
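
As a quick CPU-only sanity check of the parsing step, the sketch below repeats `process_line` so that it runs on its own and applies it to one sample line in the assumed "this is line x" format. The sample string is made up for illustration.

```python
def process_line(line):
    # Drop the trailing newline, then split on single spaces.
    words = line.strip('\n').split(" ")
    return words

sample = "this is line 3\n"  # hypothetical input line
words = process_line(sample)
print(words)  # ['this', 'is', 'line', '3']
```

Note that the counter is tokenised as a word too, which is why an entry such as 9 appears among the word counts printed at the end of this example.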

Now we create a temporary text file with tmpfile(), which streamz.utils_test provides, to simulate streaming word count from a text file.

*One can write a separate function to write continuously to a textfile, and still use the same cuStreamz code as shown below to calculate word count.*
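
One possible shape for such a writer is sketched here using only the standard library. The name `write_lines`, its parameters, and the use of a background thread are assumptions made for illustration; they are not part of streamz or cuStreamz.

```python
import os
import tempfile
import threading
import time

def write_lines(path, start, stop, delay=0.01):
    """Append 'this is line x' for x in [start, stop), one line at a time."""
    with open(path, 'at') as f:
        for i in range(start, stop):
            f.write("this is line " + str(i) + "\n")
            f.flush()          # make the line visible to a reader polling the file
            time.sleep(delay)  # hypothetical pacing between lines

# Temporary file standing in for the streamed text file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Run the writer in the background, as a continuously writing producer would.
writer = threading.Thread(target=write_lines, args=(path, 0, 5))
writer.start()
writer.join()  # a streaming job would keep polling while the writer is active

with open(path) as f:
    lines = f.read().splitlines()
print(lines)
os.remove(path)
```

Flushing after every line matters here, because a reader that polls the file only sees data once it has left the writer's buffer.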

In [3]:
with tmpfile() as fn:
    with open(fn, 'wt') as f:
        #Write some sample data to the file
        for i in range(0,10):
            f.write("this is line " + str(i) + "\n")
        f.flush()

        #Create a stream from the textfile, and specify the interval to poll the file at.
        source = Stream.from_textfile(fn, poll_interval=0.01,
                                      asynchronous=True, start=False)
        
        #Apply the process_line helper function on each element/line streamed from the textfile.
        stream = source.map(process_line)
        
        '''
        Streamz DataFrame does the trick!
        
        After we get the parsed word list on our stream from the textfile, 
        we just perform simple aggregations using the Streamz DataFrame to get the word count.
        
        We then write the output (word count) to a list.
        '''
        stream_df = stream.map(lambda words: cudf.DataFrame({'word': words, 'count': np.ones(len(words),dtype=np.int32)}))
        sdf = DataFrame(stream_df, example=cudf.DataFrame({'word':[], 'count':[]}))
        output = sdf.groupby('word').sum().stream.gather().sink_to_list()
        
        #Starting the stream!
        source.start()
        
        time.sleep(2)
        '''
        A cuDF dataframe has now been appended to the output list.
        Let's see if we can print some actual word counts.
        ''' 
        print(output[-1].loc[9:])
        
        '''
        We can! :)

        Now, we write some more data to the text file and wait for some more time before checking the output again.

        If everything works as expected, the output should now hold a list of cuDF dataframes,
        each with the cumulative streaming word count of all the data seen so far,
        the last cuDF dataframe being the most recent.
        '''
        #Write more sample data to the file
        for i in range(10,20):
            f.write("this is line " + str(i) + "\n")
        f.flush()
        
        time.sleep(2)
        print(output[-1].loc[9:])

   count
9      1
is     10
line     10
this     10
   count
9      1
is     20
line     20
this     20
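
To see why the counts move from 10 to 20 while the count for 9 stays at 1, here is a plain-Python reference for the cumulative aggregation. It is only a sketch: it uses `collections.Counter` in place of the Streamz DataFrame groupby, and it takes one snapshot per batch of lines as a simplification. It does not reproduce how often the stream itself emits results.

```python
from collections import Counter

def process_line(line):
    # Same tokenisation as in the example: strip the newline, split on spaces.
    return line.strip('\n').split(" ")

counts = Counter()  # running word count over everything seen so far
snapshots = []      # one cumulative snapshot per batch of lines

for batch in (range(0, 10), range(10, 20)):
    for i in batch:
        counts.update(process_line("this is line " + str(i) + "\n"))
    snapshots.append(dict(counts))

for snap in snapshots:
    print({w: snap[w] for w in ("9", "is", "line", "this")})
# {'9': 1, 'is': 10, 'line': 10, 'this': 10}
# {'9': 1, 'is': 20, 'line': 20, 'this': 20}
```

Each snapshot includes all earlier lines, which matches the cumulative behaviour of the streaming groupby output above.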
