# CLX Workflow Example: Notable Alerts in Splunk (GTC DC 2019)
*Notebook 2 of 3*


## Author
 - Bianca Rhodes (NVIDIA) [brhodes@nvidia.com]

## Development Notes
* Developed using: RAPIDS v0.11.0 and CLX v0.11.0
* Last tested using: RAPIDS v0.11.0 and CLX v0.11.0 on Nov 5, 2019


## Introduction to CLX Splunk Alert Workflow

This notebook demonstrates how we built a [CLX](https://github.com/rapidsai/clx) workflow that flags anomalies in Splunk notable alert data by computing a rolling z-score over per-rule alert counts.

In [1]:
from clx.workflow.workflow import Workflow
from clx.parsers.splunk_notable_parser import SplunkNotableParser
import cudf
import clx
import clx.analytics.stats

![Visualization](./image3.png)

In [2]:
class SplunkAlertWorkflow(Workflow):
    def workflow(self, dataframe):
        interval="hour"
        threshold=float(2.0)
        window=7
        raw_data_col_name="Raw"
        
        # Parse the raw Splunk notable data with the CLX Splunk notable parser.
        snp = SplunkNotableParser()
        parsed_df = snp.parse(dataframe, raw_data_col_name)

        # Create alerts dataframe
        alerts_gdf = parsed_df
        alerts_gdf["time"] = alerts_gdf["time"].astype("int")
        alerts_gdf = alerts_gdf.rename(columns={"search_name": "rule"})
        if interval == "day":
            alerts_gdf[interval] = alerts_gdf.time.applymap(self.__round2day)
        else:  # hour
            alerts_gdf[interval] = alerts_gdf.time.applymap(self.__round2hour)

        # Group alerts by interval and pivot table
        day_rule_df = (
            alerts_gdf[["rule", interval, "time"]]
            .groupby(["rule", interval])
            .count()
            .reset_index()
        )
        day_rule_df.columns = ["rule", interval, "count"]
        day_rule_piv = self.__pivot_table(
            day_rule_df, interval, "rule", "count"
        ).fillna(0)

        # Calculate rolling zscore
        r_zscores = cudf.DataFrame()
        for rule in day_rule_piv.columns:
            x = day_rule_piv[rule]
            r_zscores[rule] = clx.analytics.stats.rzscore(x, window)

        # Flag z-score anomalies
        output = self.__flag_anomalies(r_zscores, threshold)
        print(output)
        return output

    def __flag_anomalies(self, zc_df, threshold):
        output_df = cudf.DataFrame()
        for col in zc_df.columns:
            if col != 'hour':
                temp_df = cudf.DataFrame()
                temp_df['time'] = zc_df.index[zc_df[col].abs() > threshold]
                temp_df['rule'] = col
                output_df = cudf.concat([output_df, temp_df])
        output_df = output_df.reset_index(drop=True)
        return output_df

    def __pivot_table(self, gdf, index_col, piv_col, v_col):
        index_list = gdf[index_col].unique()
        piv_gdf = cudf.DataFrame({index_col: list(range(len(index_list)))})
        piv_gdf[index_col] = index_list
        for group in gdf[piv_col].unique():
            temp_df = gdf[gdf[piv_col] == group]
            temp_df = temp_df[[index_col, v_col]]
            temp_df.columns = [index_col, group]
            piv_gdf = piv_gdf.merge(temp_df, on=[index_col], how="left")
        piv_gdf = piv_gdf.set_index(index_col)
        piv_gdf = piv_gdf.sort_index()
        return piv_gdf

    def __round2day(self, epoch_time):
        return int(epoch_time / 86400) * 86400

    def __round2hour(self, epoch_time):
        return int(epoch_time / 3600) * 3600
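
The workflow above has four stages: bucket alert times by hour, count alerts per rule and hour, pivot the counts into one series per rule, and flag hours whose rolling z-score exceeds the threshold. The following is a minimal CPU-only sketch of those stages in plain Python, meant only to make the logic easy to follow without a GPU. All function names here are hypothetical, and `rolling_zscore` assumes a trailing window that includes the current value and a population standard deviation; `clx.analytics.stats.rzscore` may differ in those details.

```python
from collections import Counter
from math import sqrt


def round_to_hour(epoch_seconds):
    """Floor an epoch timestamp (in seconds) to the start of its hour."""
    return (epoch_seconds // 3600) * 3600


def count_alerts(alerts):
    """Count alerts per (rule, hour); `alerts` is a list of (epoch, rule) pairs."""
    return Counter((rule, round_to_hour(ts)) for ts, rule in alerts)


def pivot_counts(counts):
    """Return the sorted observed hours and {rule: [count per hour]}, with 0 for gaps."""
    hours = sorted({hour for _, hour in counts})
    rules = sorted({rule for rule, _ in counts})
    table = {rule: [counts.get((rule, h), 0) for h in hours] for rule in rules}
    return hours, table


def rolling_zscore(values, window):
    """Z-score of each value against the trailing window that ends at it.

    Assumption: population standard deviation. Positions with an incomplete
    window or zero deviation yield None.
    """
    scores = []
    for i, value in enumerate(values):
        if i + 1 < window:
            scores.append(None)
            continue
        chunk = values[i + 1 - window : i + 1]
        mean = sum(chunk) / window
        std = sqrt(sum((v - mean) ** 2 for v in chunk) / window)
        scores.append((value - mean) / std if std > 0 else None)
    return scores


def flag_anomalies(hours, table, window, threshold):
    """Return (hour, rule) pairs whose absolute rolling z-score exceeds threshold."""
    flagged = []
    for rule, series in table.items():
        for hour, z in zip(hours, rolling_zscore(series, window)):
            if z is not None and abs(z) > threshold:
                flagged.append((hour, rule))
    return flagged


# Made-up demo data: one alert per hour for six hours, then a burst of eight.
base = 1548936000
demo_alerts = [(base + h * 3600 + 5, "A") for h in range(6)]
demo_alerts += [(base + 6 * 3600 + i, "A") for i in range(8)]

hours, table = pivot_counts(count_alerts(demo_alerts))
flagged = flag_anomalies(hours, table, window=7, threshold=2.0)
print(flagged)  # [(1548957600, 'A')]
```

With a window of 7 and a population standard deviation, the burst hour scores about 2.45, which is above the 2.0 threshold used in the workflow, so only that hour is flagged.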

This time, let's read the input data from Kafka and write the output back to Kafka.

In [3]:
source = {
    "type": "kafka",
    "kafka_brokers": "kafka:9092",
    "group_id": "gtcdc",
    "batch_size": 100000,
    "consumer_kafka_topics": ["gtcdemo_raw"],
    "time_window": 5,
}
dest = {
    "type": "kafka",
    "kafka_brokers": "kafka:9092",
    "batch_size": 5,
    "publisher_kafka_topic": "gtcdemo_enriched",
    "output_delimiter": ",",
}

To send test data to the Kafka topic, I use a command such as this one:

```
kafka-console-producer --broker-list kafka:9092 --topic gtcdemo_raw < test_splunk_alert_data.txt
```
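
If the console producer is not available, the same test data could be sent from Python. This is a rough sketch that assumes the `confluent_kafka` package is installed and reuses the broker address, topic, and file name from the command above. `publish_lines` and `main` are hypothetical helpers, and skipping blank lines is a choice made for this sketch, not necessarily what the console producer does.

```python
def publish_lines(producer, topic, lines):
    """Send each non-blank line as one message; return the number sent."""
    sent = 0
    for line in lines:
        payload = line.rstrip("\n")
        if not payload:
            continue  # sketch-level choice: do not publish empty messages
        producer.produce(topic, value=payload.encode("utf-8"))
        producer.poll(0)  # give the client a chance to serve delivery callbacks
        sent += 1
    producer.flush()  # block until all queued messages are delivered
    return sent


def main():
    # Imported here so the helper above can be exercised without the package.
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "kafka:9092"})
    with open("test_splunk_alert_data.txt", encoding="utf-8") as handle:
        count = publish_lines(producer, "gtcdemo_raw", handle)
    print(f"sent {count} messages")
```

Calling `main()` requires a reachable broker at `kafka:9092`; the helper itself only needs an object with `produce`, `poll`, and `flush` methods.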

Because `SplunkAlertWorkflow` is a subclass of `Workflow`, the workflow's name, source, and destination must be specified when it is instantiated.

In [None]:
workflow = SplunkAlertWorkflow(name="my-splunk-alert-workflow", source=source, destination=dest)
workflow.run_workflow()

Assignment: [TopicPartition{topic=gtcdemo_raw4,partition=0,offset=-1001,error=None}]
           time                                               rule
0    1548936000  Access - Brute Force Access Behavior Detected ...
1    1549065600  Access - Brute Force Access Behavior Detected ...
2    1549681200  Access - Brute Force Access Behavior Detected ...
3    1549911600  Access - Brute Force Access Behavior Detected ...
4    1549965600  Access - Brute Force Access Behavior Detected ...
5    1550458800  Access - Brute Force Access Behavior Detected ...
6    1550491200  Access - Brute Force Access Behavior Detected ...
7    1550563200  Access - Brute Force Access Behavior Detected ...
8    1550631600  Access - Brute Force Access Behavior Detected ...
9    1550754000  Access - Brute Force Access Behavior Detected ...
10   1550782800  Access - Brute Force Access Behavior Detected ...
11   1550934000  Access - Brute Force Access Behavior Detected ...
12   1550995200  Access - Brute Force Access