**This is a cuStreamz job which streams HAProxy logs from Kafka, parses them, and performs some basic aggregations.**

This example streams from Kafka, but the same streaming operations can also be performed on a text file that is continuously being written to.

To start a local Kafka cluster, refer to https://kafka.apache.org/quickstart.

*An example HAProxy log, JSON-encoded as a message in Kafka, has the following form:*

{"logline": "[haproxy@10.0.0.1] <134>May 29 19:08:36 haproxy[113498]: 45.26.605.15:38738 [29/May/2019:19:08:36.691] HTTPS:443~ HTTP_ProvisionManagers/mp3 4/5/0/1/1 200 6182 - - --NI 3/3/0/0/0 0/0 {|} \"GET /v2/serverinfo HTTP/1.1\""}
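
As a rough illustration of the message shape, the sketch below builds such a JSON-encoded message with Python's standard `json` module, then decodes a small batch the way the helper further down does (decode bytes, join with newlines, parse as JSON Lines). It runs on the CPU and uses `json.loads` only as a stand-in for `cudf.read_json(..., lines=True)`. The variable names and the two-message batch are assumptions made for the example.

```python
import json

# Illustrative log line, copied from the example message above.
logline = ('[haproxy@10.0.0.1] <134>May 29 19:08:36 haproxy[113498]: '
           '45.26.605.15:38738 [29/May/2019:19:08:36.691] HTTPS:443~ '
           'HTTP_ProvisionManagers/mp3 4/5/0/1/1 200 6182 - - --NI '
           '3/3/0/0/0 0/0 {|} "GET /v2/serverinfo HTTP/1.1"')

# json.dumps escapes the inner double quotes, so the message is valid JSON.
message = json.dumps({"logline": logline}).encode("utf-8")

# A polled batch is a list of byte strings (here: the same message twice).
batch = [message, message]

# Decode each message and join them into one JSON Lines string.
json_lines = "\n".join(msg.decode("utf-8") for msg in batch)

# Each line parses back into a record with a single "logline" key.
records = [json.loads(line) for line in json_lines.splitlines()]
print(len(records), records[0]["logline"] == logline)
```

The escaped quotes around the HTTP request matter: without them, the message would not parse as JSON.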

In [1]:
#Streamz and cudf imports
from streamz import Stream
import cudf
from streamz.dataframe import DataFrame
import time

Below is a helper function which implements RegEx parsing on the HAProxy logs, and then calculates the average backend response time for each batch.

It also records timestamps around each important phase of the stream processing (parsing and aggregations). These timestamps are returned along with the average backend response time for each batch.
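
To make the idea concrete, here is a minimal CPU-only sketch of the same two phases applied to just the slash-separated timer field of each log line. It is not the cuDF implementation used below. The function name `average_backend_response`, the sample batch, and the use of `time.perf_counter()` are assumptions for illustration; the helper itself uses `time.time()` rounded to whole seconds, so sub-second phases show up there as identical timestamps.

```python
import time

# Names for the five slash-separated timer values, in order.
TIME_COLS = ['time_request', 'time_queue', 'time_backend_connect',
             'time_backend_response', 'time_duration']

def average_backend_response(timer_fields):
    """Return (mean backend response time, parse seconds, aggregation seconds)."""
    parse_start = time.perf_counter()
    # Parsing phase: split each field and label the parts.
    rows = [dict(zip(TIME_COLS, field.split('/'))) for field in timer_fields]
    parse_end = time.perf_counter()

    # Aggregation phase: cast to int and average one column.
    values = [int(row['time_backend_response']) for row in rows]
    mean = sum(values) / len(values)
    agg_end = time.perf_counter()

    return mean, parse_end - parse_start, agg_end - parse_end

# Hypothetical batch of timer fields, one per log line.
batch = ['4/5/0/1/1', '2/0/1/7/10', '3/1/0/4/8']
mean, parse_seconds, agg_seconds = average_backend_response(batch)
print(mean)  # 4.0
```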

In [2]:
def haproxy_regex_aggregations(messages):
    
    preprocess_start_time = int(round(time.time()))
    
    # Approximate batch size in bytes, assuming every message is as long as the first
    size = len(messages)*len(messages[0])
    num_rows = len(messages)
    json_input_string = "\n".join([msg.decode('utf-8') for msg in messages])
    
    gdf = cudf.read_json(json_input_string, lines=True, engine='cudf')
    
    pre_regex_time = int(round(time.time()))
    
    # Piecemeal log parsing for HAProxy
    
    clean_df = gdf['logline'].str.split(' ')

    clean_df['log_ip'] = clean_df[0].str.lstrip('[haproxy@').str.rstrip(']')
    clean_df.drop_column(0)

    clean_df[1] = clean_df[1].str.split('>')[1]
    syslog_timestamp = clean_df[1].data.cat([clean_df[2].data, clean_df[3].data, clean_df[4].data], sep=' ')
    clean_df['syslog_timestamp'] = cudf.Series(syslog_timestamp)
    for col in [1,2,3,4]:
        clean_df.drop_column(col)

    program_pid_df = clean_df[5].str.split('[')
    program_sr = program_pid_df[0]
    pid_sr = program_pid_df[1]
    clean_df['program'] = program_sr
    clean_df['pid'] = pid_sr.str.rstrip(']:')
    clean_df.drop_column(5)
    del program_pid_df

    client_pid_port_df = clean_df[6].str.split(':')
    clean_df['client_ip'], clean_df['client_port'] = client_pid_port_df[0], client_pid_port_df[1]
    clean_df.drop_column(6)
    del client_pid_port_df

    clean_df['accept_date'] = clean_df[7].str.lstrip('[').str.rstrip(']')
    clean_df.drop_column(7)

    clean_df.rename({8: 'frontend_name'}, inplace=True)
    backend_server_df = clean_df[9].str.split('/')
    clean_df['backend_name'], clean_df['server_name'] = backend_server_df[0], backend_server_df[1]
    clean_df.drop_column(9)

    time_cols = ['time_request', 'time_queue', 'time_backend_connect', 'time_backend_response', 'time_duration']
    time_df = clean_df[10].str.split('/')
    for col_id, col_name in enumerate(time_cols):
        clean_df[col_name] = time_df[col_id]
    clean_df.drop_column(10)
    del time_df

    clean_df.rename({11: 'http_status_code'}, inplace=True)
    clean_df.rename({12: 'bytes_read'}, inplace=True)
    clean_df.rename({13: 'captured_request', 14: 'captured_response', 15: 'termination_state'}, inplace=True)

    con_cols = ['actconn', 'feconn', 'beconn', 'srvconn', 'retries']
    con_df = clean_df[16].str.split('/')
    for col_id, col_name in enumerate(con_cols):
        clean_df[col_name] = con_df[col_id]
    clean_df.drop_column(16)
    del con_df

    q_df = clean_df[17].str.split('/')
    clean_df['srv_queue'], clean_df['backend_queue'] = q_df[0], q_df[1]
    clean_df.drop_column(17)
    del q_df
    
    post_regex_time = int(round(time.time()))
    
    # End of the piecemeal regex log parsing for HAProxy.
    # Simple aggregations are performed next.
    
    clean_df['time_backend_response'] = clean_df['time_backend_response'].astype('int')
    avg_backend_response_time = clean_df['time_backend_response'].mean()
    
    post_agg_time = int(round(time.time()))
    
    return "{0},{1},{2},{3},{4},{5},{6}".format(num_rows, preprocess_start_time, pre_regex_time,
                                            post_regex_time, post_agg_time,
                                            avg_backend_response_time, size)

Let's create a Kafka consumer.

In [3]:
#Kafka topic to read streaming data from
topic = "haproxy-logs"

#Kafka brokers
bootstrap_servers = 'localhost:9092'

#Kafka consumer configuration
consumer_conf = {'bootstrap.servers': bootstrap_servers,
                 'group.id': 'custreamz', 'session.timeout.ms': 60000}

We now use Streamz to create a Stream from Kafka by polling the topic every 10s.

In [4]:
#If you set dask=True, please ensure you have a Dask cluster up and running
stream = Stream.from_kafka_batched(topic, consumer_conf, poll_interval='10s',
                                   npartitions=1, asynchronous=True, dask=False)

We then use the helper regex+aggregations function to perform the required operations on each batch polled from Kafka, and write the result into a list.

In [5]:
final_output = stream.map(haproxy_regex_aggregations).buffer(100000).gather().sink_to_list()

Let's start the stream!

In [6]:
stream.start()

Let's check the output.

In [7]:
final_output

['57,1565208776,1565208777,1565208777,1565208777,5.543859649122807,10545',
 '125,1565208788,1565208788,1565208788,1565208788,5.512,23250']

After waiting for some more time, let's check the output again — the list should have grown, since more batches have been processed on the stream.

In [8]:
final_output

['57,1565208776,1565208777,1565208777,1565208777,5.543859649122807,10545',
 '125,1565208788,1565208788,1565208788,1565208788,5.512,23250',
 '113,1565208800,1565208800,1565208800,1565208800,5.398230088495575,21244',
 '113,1565208811,1565208811,1565208811,1565208811,5.398230088495575,20905',
 '113,1565208823,1565208823,1565208823,1565208823,5.327433628318584,20905']
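
Each entry in the list is the comma-separated string returned by `haproxy_regex_aggregations`. One way to read it, sketched below, is to split it back into named fields and subtract adjacent timestamps to get the whole-second duration of each phase. The names `BatchResult`, `parse_result`, and `phase_seconds` are hypothetical helpers introduced for this example; the field order follows the format string in the helper.

```python
from collections import namedtuple

# Field order matches the string returned by haproxy_regex_aggregations.
BatchResult = namedtuple('BatchResult', [
    'num_rows', 'preprocess_start_time', 'pre_regex_time',
    'post_regex_time', 'post_agg_time', 'avg_backend_response_time', 'size'])

def parse_result(row):
    """Split one comma-separated result string into typed fields."""
    parts = row.split(',')
    return BatchResult(int(parts[0]), int(parts[1]), int(parts[2]),
                       int(parts[3]), int(parts[4]), float(parts[5]),
                       int(parts[6]))

def phase_seconds(result):
    """Whole-second durations of the JSON read, parsing, and aggregation phases."""
    return (result.pre_regex_time - result.preprocess_start_time,
            result.post_regex_time - result.pre_regex_time,
            result.post_agg_time - result.post_regex_time)

# First batch from the output above.
first = parse_result(
    '57,1565208776,1565208777,1565208777,1565208777,5.543859649122807,10545')
print(first.num_rows, phase_seconds(first))  # 57 (1, 0, 0)
```

Because the timestamps are rounded to whole seconds, a duration of 0 only means that the phase finished within the same rounded second.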