# Parsing HAProxy Logs

This notebook demonstrates real-world log parsing. It uses the popular HAProxy
log format, since that is a common real-world use case.

This notebook will cover
+ Streaming event-level log entries from Kafka using Streamz
 - Those interested in Kafka can refer to https://kafka.apache.org/quickstart to start a local Kafka cluster.
 - Streaming from a text file instead, in case Kafka is not available to the user.
+ Counting log entries (word count)
+ Counting log entries from a file (no Kafka)
+ Calculating average backend response time

OK, so what do these HAProxy logs look like? Here is an example.

```{"logline": "[haproxy@10.0.0.1] <134>May 29 19:08:36 haproxy[113498]: 45.26.605.15:38738 [29/May/2019:19:08:36.691] HTTPS:443~ HTTP_ProvisionManagers/mp3 4/5/0/1/1 200 6182 - - --NI 3/3/0/0/0 0/0 {|} "GET /v2/serverinfo HTTP/1.1"}```
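To get a feel for the layout, the `logline` value above can be split on spaces in plain Python. This is only a CPU-side sketch: the field names follow the column names used later in this notebook, and the token positions are assumptions read off this one example line rather than taken from a format specification.

```python
# The "logline" value from the example above.
logline = ('[haproxy@10.0.0.1] <134>May 29 19:08:36 haproxy[113498]: '
           '45.26.605.15:38738 [29/May/2019:19:08:36.691] HTTPS:443~ '
           'HTTP_ProvisionManagers/mp3 4/5/0/1/1 200 6182 - - --NI '
           '3/3/0/0/0 0/0 {|} "GET /v2/serverinfo HTTP/1.1"')

tokens = logline.split(' ')

# Token positions are assumed from this single example line.
client_ip, client_port = tokens[5].rsplit(':', 1)
backend_name, server_name = tokens[8].split('/')
timer_names = ['time_request', 'time_queue', 'time_backend_connect',
               'time_backend_response', 'time_duration']
timers = dict(zip(timer_names, (int(t) for t in tokens[9].split('/'))))
http_status_code = int(tokens[10])

print(client_ip, client_port, backend_name, server_name)
print(timers['time_backend_response'], http_status_code)
```

Note that the generated sample messages later in this notebook include a year in the syslog timestamp, so their token positions are shifted by one relative to this example.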

It is unlikely that you already have a Kafka topic containing these log messages, so let's generate some sample messages. First, however, we need to install a few required dependencies and define some global configurations.


In [1]:
!conda install -c conda-forge -y streamz python-confluent-kafka

Collecting package metadata: done
Solving environment: done


  current version: 4.6.14
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.



# Global Configurations

In [1]:
import confluent_kafka as ck

num_messages_to_produce = 100000

kafka_brokers = ['localhost:9092'] # This is a list of your Kafka brokers and ports.
topic = 'haproxy-logs'

# Kafka configuration parameters. Any additional Kafka configurations can be placed here.
# 'bootstrap.servers' is given as a comma-separated string of host:port pairs.
kafka_conf = {'bootstrap.servers': ','.join(kafka_brokers), 'compression.type': 'snappy',
              'group.id': 'custreamz', 'session.timeout.ms': 60000}

producer = ck.Producer(kafka_conf)  # Kafka producer

# Generate Sample HAProxy Data
First things first: let's generate some sample HAProxy logs into our Kafka environment so that the following examples have something to pull from. This code has nothing to do with RAPIDS; it is just a simple script that generates sample HAProxy logs and publishes them to Kafka.

This is certainly not the most efficient way to write to Kafka, but it is the simplest for example purposes. Please be patient while the messages are produced.

In [None]:
import json
import sys  # needed for sys.stderr in the produce loop below
from random import randrange

sample_data = {}
sample_data['log_ip'] = ['10.0.0.3', '14.2.3.4', '15.2.6.9', '10.2.34.6', '10.23.34.1']
sample_data['syslog_timestamp'] = ['May 28 2019 00:00:09', 'May 28 2019 00:00:10', 'May 28 2019 00:00:11',\
                                   'May 28 2019 00:00:39','May 28 2019 00:00:51', 'May 28 2019 00:10:09']
sample_data['program'] = ['haproxy']
sample_data['pid'] = [113345, 756487, 352453, 352465, 164541]
sample_data['client_ip'] = ['156.23.224.56', '126.52.74.15', '247.81.56.21', '26.245.255.1', '255.116.145.2']
sample_data['client_port'] = [13345, 56487, 52453, 52465, 64541]
sample_data['accept_date'] = ['28/May/2019:00:10:09.492', '28/May/2019:00:09:10.006', '28/May/2019:00:02:10.748',\
                              '28/May/2019:00:20:10.891', '28/May/2019:00:02:10.461', '28/May/2019:00:02:11.959']
sample_data['frontend_name'] = ['px-http', 'https:443', 'tx-http']
sample_data['server_name'] = ['srv1', 'srv2', 'srv3', 'srv4', 'srv5']
sample_data['time_request'] = [0, 1, 2, 3]
sample_data['time_queue'] = [0, 1, 2, 3]
sample_data['time_backend_connect'] = [1, 2, 3]
sample_data['time_backend_response'] = [2, 3, 4, 5, 6, 7, 8, 9]
sample_data['time_duration'] = [13, 14, 16, 20, 23, 25]
sample_data['http_status_code'] = [200, 400, 201, 401, 403]
sample_data['bytes_read'] = [4, 573, 442, 234, 124, 1567]
sample_data['captured_request'] = ['-']
sample_data['captured_response'] = ['-']
sample_data['termination_state'] = ['----', 'PH--', 'CR--', '--NI', '--SG']
sample_data['actconn'] = [1, 2, 3, 4]
sample_data['feconn'] = [2, 3, 5, 7, 8]
sample_data['beconn'] = [0, 1, 2, 3, 4]
sample_data['srvconn'] = [0, 1, 3]
sample_data['retries'] = [0, 1, 2]
sample_data['srv_queue'] = [0, 1, 2, 3]
sample_data['backend_queue'] = [0, 2, 3, 4, 5, 7, 8, 9]

cols = ['log_ip','syslog_timestamp','program','pid','client_ip','client_port',\
        'accept_date','frontend_name','backend_name','server_name','time_request',\
        'time_queue','time_backend_connect', 'time_backend_response', 'time_duration',\
        'http_status_code', 'bytes_read', 'captured_request', 'captured_response',\
        'termination_state','actconn','feconn','beconn','srvconn','retries','srv_queue','backend_queue']

def generate_log():
    log_skeleton = "[haproxy@{0}] <134>{1} {2}[{3}]: {4}:{5} [{6}] {7} {8}/{9} {10}/{11}/{12}/{13}/{14} {15} {16} {17} {18} {19} {20}/{21}/{22}/{23}/{24} {25}/{26}"
    values = []
    for col in cols:
        if col in sample_data:
            # Pick a random sample value for this column
            value_list = sample_data[col]
            values.append(value_list[randrange(len(value_list))])
        else:
            # No sample values for this column (e.g. 'backend_name'): reuse the previous value
            values.append(values[-1])
    dict_out = {"logline": log_skeleton.format(*values)}
    return json.dumps(dict_out)


count = 0
try:
    while count < num_messages_to_produce:
        producer.produce(topic, generate_log())
        count = count + 1
except KeyboardInterrupt:
    sys.stderr.write('%% Aborted by user\n')
    
producer.flush()

# Counting Log Entries

Let's assume that each record (message) arriving on the Kafka topic is a line of the form "this is line x", where x is an incrementing counter.
    
Now, we write a function to parse each such message to get the list of words in each line. 

One can also make use of nvstrings (now custrings, the GPU-accelerated string manipulation library) to tokenise all the messages in the batch. Refer to process_batch_nvstrings().

In [2]:
#Streamz and cudf imports
import cudf
from streamz import Stream
from streamz.dataframe import DataFrame
import numpy as np

#Helper function operating on every batch polled from Kafka for word count
def process_batch(messages):
    words = []
    for message in messages:
        # Decode the raw bytes, drop surrounding newlines and split into words
        words.extend(message.decode('utf-8').strip('\n').split(" "))
    return words

import nvstrings, nvtext
def process_batch_nvstrings(messages):
    messages_decoded = []
    for message in messages:
        messages_decoded.append(message.decode('utf-8'))
    device_lines = nvstrings.to_device(messages_decoded)
    words = nvtext.tokenize(device_lines)
    return words


# We now use Streamz to create a Stream from Kafka, polling the topic every 10s. If you set dask=True, please ensure you have a Dask cluster up and running.
source = Stream.from_kafka_batched(topic, kafka_conf, npartitions=1, poll_interval='10s', asynchronous=True, dask=False)

#Applying process_batch function to process word count on each batch
stream = source.map(process_batch)

*Streamz DataFrame does the trick!* 

After we get the parsed word list on our stream from Kafka, we just perform simple aggregations using the Streamz DataFrame to get the word count.

We then write the output (word count) to a list.
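The aggregation described above can be pictured on the CPU with plain Python. The following is a minimal sketch, not the Streamz or cuDF implementation: the helper name `cumulative_word_counts` is hypothetical, and it takes already-decoded strings rather than Kafka message bytes. It assumes the running count is cumulative across batches, as the file-based example later in this notebook describes.

```python
from collections import Counter

def cumulative_word_counts(batches):
    """Yield a snapshot of the running word count after each batch."""
    running = Counter()
    for batch in batches:
        for line in batch:
            running.update(line.strip('\n').split(' '))
        # Copy the counter so later batches do not change earlier snapshots
        yield dict(running)

batches = [["this is line 0", "this is line 1"], ["this is line 2"]]
snapshots = list(cumulative_word_counts(batches))
print(snapshots[-1])
```

Each snapshot plays the role of one entry in the output list, with the last entry being the most recent count.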

In [3]:
stream_df = stream.map(lambda words: cudf.DataFrame({'word': words, 'count': np.ones(len(words),dtype=np.int32)}))
sdf = DataFrame(stream_df, example=cudf.DataFrame({'word':[], 'count':[]}))
output = sdf.groupby('word').sum().stream.buffer(100000).gather().sink_to_list()

Start the stream!

In [None]:
source.start()

Let's see what output we have:

In [None]:
output

We can see that a cuDF dataframe was produced to the output. Let's see if we can print some actual word counts.

In [None]:
#Printing the values
print(output[0].loc[65:])

# Counting Log Entries From File

For this example, we demonstrate how to stream from a text file. Please install pytest using conda to use the tmpfile() function from streamz.

In [None]:
#Streamz and cudf imports
import cudf
from streamz import Stream
from streamz.dataframe import DataFrame
from streamz.utils_test import tmpfile
import numpy as np
import time

Let's assume that data is being written to a text file and that each line is of the form "this is line x", where x is an incrementing counter.

Now, we write a function to parse each line to get the list of words in each line.

One can also make use of nvstrings (now custrings, the GPU-accelerated string manipulation library) to tokenise each line. Refer to process_line_nvstrings().

In [None]:
def process_line(line):
    words = line.strip('\n').split(" ")
    return words

import nvstrings, nvtext
def process_line_nvstrings(line):
    device_line = nvstrings.to_device(line)
    words = nvtext.tokenize(device_line)
    return words

Now we create a temporary textfile using tmpfile() which streamz.utils_test provides to simulate streaming word count from a textfile.

*One can write a separate function to write continuously to a textfile, and still use the same cuStreamz code as shown below to calculate word count.*
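As a rough illustration of such a separate writer, the sketch below appends numbered lines to a file in append mode. The function name `append_lines` and the throwaway file created with `tempfile` are assumptions for demonstration only; a real writer would target the file that the stream is polling.

```python
import os
import tempfile

def append_lines(path, start, stop):
    """Append lines "this is line <i>" for i in [start, stop) and flush."""
    with open(path, 'at') as f:
        for i in range(start, stop):
            f.write("this is line " + str(i) + "\n")
        f.flush()

# Demonstration on a throwaway file
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
append_lines(demo_path, 0, 3)
append_lines(demo_path, 3, 5)
with open(demo_path) as f:
    demo_lines = f.read().splitlines()
os.remove(demo_path)
print(demo_lines)
```

Calling such a function repeatedly, for example from a timer or a background thread, would simulate a log file that keeps growing while the stream is running.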

In [None]:
with tmpfile() as fn:
    with open(fn, 'wt') as f:
        #Write some random data to the file
        for i in range(0,10):
            f.write("this is line " + str(i) + "\n")
        f.flush()

        #Create a stream from the textfile, and specify the interval to poll the file at.
        source = Stream.from_textfile(fn, poll_interval=0.01, \
                                 asynchronous=True, start=False)
        
        #Apply the process_line helper function on each element/line streamed from the textfile.
        stream = source.map(process_line)
        
        '''
        Streamz DataFrame does the trick!
        
        After we get the parsed word list on our stream from the textfile, 
        we just perform simple aggregations using the Streamz DataFrame to get the word count.
        
        We then write the output (word count) to a list.
        '''
        stream_df = stream.map(lambda words: cudf.DataFrame({'word': words, 'count': np.ones(len(words),dtype=np.int32)}))
        sdf = DataFrame(stream_df, example=cudf.DataFrame({'word':[], 'count':[]}))
        output = sdf.groupby('word').sum().stream.gather().sink_to_list()
        
        #Starting the stream!
        source.start()
        
        time.sleep(2)
        '''
        We can see that a cuDF dataframe was produced to the output. 
        Let's see if we can print some actual word counts.
        ''' 
        print(output[-1].loc[9:])
        
        '''
        We can! :)

        Now, we write some more data to the text file and wait for some more time before checking the output again.

        If we're sure of what's happening, the output should now have a list of cuDF dataframes, 
        each having the cumulative streaming word count of all the data seen until now, 
        the last cuDF dataframe being the most recent.
        '''
        #Write more random data to the file
        for i in range(10,20):
            f.write("this is line " + str(i) + "\n")
        f.flush()
        
        time.sleep(2)
        print(output[-1].loc[9:])

# Calculating Average Backend Response Time

Below is a helper function that parses the HAProxy logs and then calculates the average backend response time for each batch.

It also records timestamps to determine the time taken by each important phase of the stream processing: parsing and aggregations. These timestamps are returned along with the average backend response time for each batch.
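The helper below takes whole-second wall-clock timestamps with `int(round(time.time()))`, so phases shorter than a second can show up as zero. If finer resolution is wanted, one option is `time.perf_counter()`. The sketch below is an assumption-laden illustration, not the notebook's method: `timed_phases` is a hypothetical helper and the two lambdas are stand-ins for the parsing and aggregation phases.

```python
import time

def timed_phases(phases):
    """Run (name, callable) pairs in order; return results and durations in seconds."""
    results, durations = {}, {}
    for name, func in phases:
        start = time.perf_counter()
        results[name] = func()
        durations[name] = time.perf_counter() - start
    return results, durations

# Hypothetical stand-ins for the parsing and aggregation phases
values = ['2', '3', '4', '9']
results, durations = timed_phases([
    ('parsing', lambda: [int(v) for v in values]),
    ('aggregation', lambda: sum(int(v) for v in values) / len(values)),
])
print(results['aggregation'], durations)
```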

In [2]:
#Streamz and cudf imports
from streamz import Stream
import cudf
from streamz.dataframe import DataFrame
import time

def haproxy_parsing_aggregations(messages):
    
    preprocess_start_time = int(round(time.time()))
    
    size = len(messages)*len(messages[0]) 
    num_rows = len(messages)
    json_input_string = "\n".join([msg.decode('utf-8') for msg in messages])
    
    gdf = cudf.read_json(json_input_string, lines=True, engine='cudf')
    
    pre_parsing_time = int(round(time.time()))
    
    '''
    Piecemeal log parsing for HAProxy
    '''
    
    clean_df = gdf['logline'].str.split(' ')

    clean_df['log_ip'] = clean_df[0].str.lstrip('[haproxy@').str.rstrip(']')
    clean_df.drop_column(0)

    clean_df[1] = clean_df[1].str.split('>')[1]
    syslog_timestamp = clean_df[1].data.cat([clean_df[2].data, clean_df[3].data, clean_df[4].data], sep=' ')
    clean_df['syslog_timestamp'] = cudf.Series(syslog_timestamp)
    for col in [1,2,3,4]:
        clean_df.drop_column(col)

    program_pid_df = clean_df[5].str.split('[')
    program_sr = program_pid_df[0]
    pid_sr = program_pid_df[1]
    clean_df['program'] = program_sr
    clean_df['pid'] = pid_sr.str.rstrip(']:')
    clean_df = clean_df.drop(labels=[5])
    del program_pid_df

    client_pid_port_df = clean_df[6].str.split(':')
    clean_df['client_ip'], clean_df['client_port'] = client_pid_port_df[0], client_pid_port_df[1]
    clean_df.drop_column(6)
    del client_pid_port_df

    clean_df['accept_date'] = clean_df[7].str.lstrip('[').str.rstrip(']')
    clean_df.drop_column(7)

    clean_df.rename({8: 'frontend_name'}, inplace=True)
    backend_server_df = clean_df[9].str.split('/')
    clean_df['backend_name'], clean_df['server_name'] = backend_server_df[0], backend_server_df[1]
    clean_df.drop_column(9)

    time_cols = ['time_request', 'time_queue', 'time_backend_connect', 'time_backend_response', 'time_duration']
    time_df = clean_df[10].str.split('/')
    for col_id, col_name in enumerate(time_cols):
        clean_df[col_name] = time_df[col_id]
    clean_df.drop_column(10)
    del time_df

    clean_df.rename({11: 'http_status_code'}, inplace=True)
    clean_df.rename({12: 'bytes_read'}, inplace=True)
    clean_df.rename({13: 'captured_request', 14: 'captured_response', 15: 'termination_state'}, inplace=True)

    con_cols = ['actconn', 'feconn', 'beconn', 'srvconn', 'retries']
    con_df = clean_df[16].str.split('/')
    for col_id, col_name in enumerate(con_cols):
        clean_df[col_name] = con_df[col_id]
    clean_df.drop_column(16)
    del con_df

    q_df = clean_df[17].str.split('/')
    clean_df['srv_queue'], clean_df['backend_queue'] = q_df[0], q_df[1]
    clean_df.drop_column(17)
    del q_df
    
    post_parsing_time = int(round(time.time()))
    
    '''
    End of the piecemeal log parsing for HAProxy.
    Simple aggregations to be performed now.
    '''
    
    clean_df['time_backend_response'] = clean_df['time_backend_response'].astype('int')
    avg_backend_response_time = clean_df['time_backend_response'].mean()
    
    post_agg_time = int(round(time.time()))
    
    return "{0},{1},{2},{3},{4},{5},{6}".format(num_rows, preprocess_start_time, pre_parsing_time, \
                                            post_parsing_time, post_agg_time, \
                                            avg_backend_response_time, size)

In [3]:
#If you set dask=True, please ensure you have a Dask cluster up and running
stream = Stream.from_kafka_batched(topic, kafka_conf, poll_interval='10s',
                                   npartitions=1, asynchronous=True, dask=False)

We then use the helper parsing+aggregations function to perform the required operations on each batch polled from Kafka, and write the result into a list.

In [4]:
final_output = stream.map(haproxy_parsing_aggregations).buffer(100000).gather().sink_to_list()

In [None]:
stream.start()

In [None]:
print(final_output)
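Each entry in `final_output` is a comma-separated string in the field order returned by `haproxy_parsing_aggregations`. A small sketch for turning one such string into named fields and per-phase durations follows. The function name `parse_batch_result`, the output key names, and the example string are all made up for illustration; only the field order is taken from the helper above.

```python
def parse_batch_result(result):
    """Split one result string into named fields and per-phase durations."""
    fields = result.split(',')
    num_rows, t_start, t_pre, t_post, t_agg = (int(x) for x in fields[:5])
    return {
        'num_rows': num_rows,
        'read_seconds': t_pre - t_start,        # decoding and JSON reading
        'parsing_seconds': t_post - t_pre,      # piecemeal log parsing
        'aggregation_seconds': t_agg - t_post,  # mean backend response time
        'avg_backend_response_time': float(fields[5]),
        'size': int(fields[6]),
    }

# Made-up result string in the same field order, for illustration only
example = "1000,1559070000,1559070001,1559070003,1559070003,5.5,201000"
parsed = parse_batch_result(example)
print(parsed)
```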