# CLX Workflow Overview (GTC DC 2019)
*Notebook 1 of 3*


## Author
 - Bianca Rhodes (NVIDIA) [brhodes@nvidia.com]

## Development Notes
* Developed using: RAPIDS v0.11.0 and CLX v0.11.0
* Last tested using: RAPIDS v0.11.0 and CLX v0.11.0 on Nov 5, 2019


## Introduction to the CLX Workflow

This notebook demonstrates the concept of a [CLX](https://github.com/rapidsai/clx) workflow. A CLX workflow performs analytical operations on a GPU dataframe. It also manages I/O components, so it can receive input data from a file or Kafka as a GPU dataframe and output data in that format as well.

![Visualization](./image1.png)

1. Create a workflow that performs some operations on a GPU dataframe. To do this, subclass the CLX Workflow class, which handles the initialization of I/O components and workflow processing.

In [26]:
from clx.workflow.workflow import Workflow

In [27]:
class TestWorkflowImpl(Workflow):
    
    # Define your data processing in a function called "workflow"
    def workflow(self, dataframe):
        dataframe["enriched"] = dataframe["raw"].str.len()
        return dataframe
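
The subclassing pattern above can be sketched on the CPU with only the standard library. This is a minimal illustration, not the CLX implementation: `SketchWorkflow`, `LengthWorkflow`, `read_input`, and the reader/writer callables are hypothetical names, and rows are plain dictionaries rather than a GPU dataframe.

```python
import csv
import io
from abc import ABC, abstractmethod


class SketchWorkflow(ABC):
    """Hypothetical stand-in for a workflow base class (not the CLX class)."""

    def __init__(self, name, reader, writer):
        self.name = name
        self._reader = reader  # callable returning a list of row dicts
        self._writer = writer  # callable accepting a list of row dicts

    @abstractmethod
    def workflow(self, rows):
        """Subclasses put their data processing here."""

    def run_workflow(self):
        # Read the input, apply the user-defined step, write the output.
        rows = self._reader()
        self._writer(self.workflow(rows))


class LengthWorkflow(SketchWorkflow):
    def workflow(self, rows):
        # CPU analog of dataframe["raw"].str.len()
        for row in rows:
            row["enriched"] = len(row["raw"])
        return rows


def read_input():
    # In-memory copy of the one-column sample file used in this notebook.
    return list(csv.DictReader(io.StringIO("raw\nhello gtcdc\n")))


results = []
LengthWorkflow("my-test-workflow", read_input, results.extend).run_workflow()
print(results)
```

The base class owns the read/process/write sequence, so a subclass only has to supply `workflow`.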

2. Prepare your input file if needed

In [5]:
import os
curr_path = os.getcwd()

In [10]:
!cat $curr_path/input.csv

raw
hello gtcdc

3. Next, indicate the source and destination for your workflow input and output. Each can be a file or Kafka. For this test example, let's use a file. The underlying code uses cuDF I/O to read from a CSV.

In [7]:
source = {
    "type": "fs",
    "input_format": "csv",
    "input_path": curr_path + "/input.csv",
    "schema": ["raw"],
    "delimiter": ",",
    "usecols": ["raw"],
    "dtype": ["str"],
    "header": 0,
}
destination = {
    "type": "fs",
    "output_format": "csv",
    "output_path": curr_path + "/output.csv",
    "index": False
}
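
To make the role of a few `source` keys concrete, here is a rough CPU sketch that reads a CSV with the standard `csv` module. It assumes `schema`, `delimiter`, `usecols`, and `header` carry the same meaning as the like-named CSV reader options. `read_csv_rows` and `demo_source` are hypothetical names. The sketch ignores `dtype` and is not the CLX reader.

```python
import csv
import os
import tempfile


def read_csv_rows(source):
    """Read a delimited file using a few keys from a `source` dict.

    Illustrative only: this is not the CLX reader, and "dtype" is ignored.
    """
    with open(source["input_path"], newline="") as handle:
        lines = list(csv.reader(handle, delimiter=source["delimiter"]))
    # "header": 0 is taken to mean that line 0 holds column names, not data.
    header = source.get("header")
    data = lines[header + 1:] if header is not None else lines
    names = source["schema"]
    keep = source.get("usecols", names)
    return [
        {name: value for name, value in zip(names, line) if name in keep}
        for line in data
    ]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "input.csv")
    with open(path, "w", newline="") as handle:
        handle.write("raw\nhello gtcdc\n")
    demo_source = {
        "type": "fs",
        "input_format": "csv",
        "input_path": path,
        "schema": ["raw"],
        "delimiter": ",",
        "usecols": ["raw"],
        "header": 0,
    }
    rows = read_csv_rows(demo_source)

print(rows)
```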

4. Instantiate your new workflow and run it.

In [30]:
![ -e $curr_path/output.csv ] && rm $curr_path/output.csv
workflow = TestWorkflowImpl(name="my-test-workflow", source=source, destination=destination)
workflow.run_workflow()

5. Inspect your output file.

In [31]:
!cat $curr_path/output.csv

"raw","enriched"
"hello gtcdc",11


( Talk about how this can easily be deployed to production )

# CLX Log Event Parsing

CLX parsers use regex to extract meaningful key/value pairs from raw log event data. To implement your own CLX event parser, subclass the EventParser class and specify the regex values used to parse this particular type of event, along with any pre-processing and post-processing methods you need.

![Visualization](./image7.png)

In [32]:
from clx.parsers.event_parser import EventParser
import cudf

1. Create sample input

In [33]:
test_input = cudf.DataFrame()
test_input["raw"] = ["username=gtcdc host=1.2.3.4    "]

2. Define the regex for the event log.  
  
    Here we specify that the key value `username` will be found in the log event as `username=([a-z\.\-0-9$]+)`. The value captured within the parentheses or group will be extracted as the value for the given key.  
    
    You can also define this regex in a YAML file and load it with a YAML file reader. The CLX Windows Event Log parser does so here:
    https://github.com/rapidsai/clx/blob/branch-0.11/clx/parsers/resources/windows_event_regex.yaml

In [34]:
event_regex = {
   # Raw string so that Python does not treat \. and \- as string escapes
   "username": r"username=([a-z\.\-0-9$]+)",
}
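
To see what the capture group extracts, the same pattern can be tried on the CPU with Python's `re` module. This is only a sketch: the regex engine used on the GPU may differ in details, and the `host` entry is a hypothetical second key added for illustration.

```python
import re

sketch_regex = {
    "username": r"username=([a-z\.\-0-9$]+)",
    # Hypothetical second key, added only for illustration.
    "host": r"host=([0-9\.]+)",
}

log_line = "username=gtcdc host=1.2.3.4    "

extracted = {}
for key, pattern in sketch_regex.items():
    match = re.search(pattern, log_line)
    # group(1) is the text captured by the parentheses.
    extracted[key] = match.group(1) if match else None

print(extracted)  # {'username': 'gtcdc', 'host': '1.2.3.4'}
```

Each match stops at the first character outside the bracketed class, which here is the space after the value.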

3. Create your event parser. The event parser must contain a method named `parse`. This method will handle all functionality for parsing.

In [35]:
class TestEventParser(EventParser):
    def parse(self, dataframe, raw_column):
        # First we can pre-process the data. Let's strip trailing spaces
        # from the column named by raw_column
        dataframe["processed"] = dataframe[raw_column].str.rstrip(" ")
        # Call parent class parse_raw_event method
        parsed_dataframe = self.parse_raw_event(
            dataframe, "processed", event_regex
        )
        return parsed_dataframe

4. Run the parser

In [36]:
parser = TestEventParser(columns=["username"], event_name="mylogevent")
parser.parse(test_input, "raw")

  username
0    gtcdc


## Integrating the parser and CLX workflow

We may want to perform analytics on parsed log data. In this example, we integrate the custom log parser defined above into a CLX workflow.

1. Create the input data file. This is a sample log that we will parse. Our goal is to extract the username value `gtcdc` for our analytics.

In [37]:
!printf 'raw\n    username=gtcdc host=1.2.3.4    \n' > $curr_path/input2.csv

2. Establish source and destination parameters

In [38]:
source = {
    "type": "fs",
    "input_format": "csv",
    "input_path": curr_path + "/input2.csv",
    "schema": ["raw"],
    "delimiter": ",",
    "required_cols": ["raw"],
    "dtype": ["str"],
    "header": 0,
}
destination = {
    "type": "fs",
    "output_format": "csv",
    "output_path": curr_path + "/output2.csv",
    "index": False
}

3. Create the custom workflow. This workflow first parses the data and then counts the characters in username.

In [39]:
class TestWorkflowImpl(Workflow):
    parser = TestEventParser(columns=["username"], event_name="mylogevent")
    
    # Define your data processing in a function called "workflow"
    def workflow(self, dataframe):
        output_dataframe = cudf.DataFrame()
        # Parse the dataframe passed in by the workflow, not the earlier test_input
        parsed_dataframe = self.parser.parse(dataframe, "raw")
        output_dataframe["username"] = parsed_dataframe["username"]
        output_dataframe["count"] = parsed_dataframe["username"].str.len()
        return output_dataframe
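
The parse-then-count logic of this workflow can be checked on the CPU with a small sketch. It assumes Python's `re` behaves like the GPU regex for this pattern. `parse_usernames` and `count_workflow` are hypothetical helper names, and rows are plain strings and dictionaries rather than cuDF objects.

```python
import re

USERNAME_PATTERN = re.compile(r"username=([a-z\.\-0-9$]+)")


def parse_usernames(raw_lines):
    # Pre-process (strip trailing spaces), then extract the captured group.
    usernames = []
    for line in raw_lines:
        match = USERNAME_PATTERN.search(line.rstrip(" "))
        usernames.append(match.group(1) if match else None)
    return usernames


def count_workflow(raw_lines):
    # Operate on the rows handed to the function, mirroring workflow(dataframe).
    return [
        {"username": name, "count": len(name) if name is not None else None}
        for name in parse_usernames(raw_lines)
    ]


sample = ["    username=gtcdc host=1.2.3.4    "]
result = count_workflow(sample)
print(result)  # [{'username': 'gtcdc', 'count': 5}]
```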

4. Run the workflow

In [40]:
![ -e $curr_path/output2.csv ] && rm $curr_path/output2.csv
workflow = TestWorkflowImpl(name="my-test-workflow", source=source, destination=destination)
workflow.run_workflow()

5. Display the output

In [None]:
!cat $curr_path/output2.csv