# Deep Learning Toolkit for Splunk - Robust Random Cut Forest for Anomaly Detection

This notebook contains a barebones example workflow showing how to develop custom containerized code that interfaces with the Deep Learning Toolkit for Splunk.

Note: By default, every time you save this notebook, the cells are exported into a Python module, which is then invoked by Splunk MLTK commands like <code> | fit ... | apply ... | summary </code>. Please read the Model Development Guide in the Deep Learning Toolkit app for more information.

## Stage 0 - import libraries
In stage 0 we define all the imports that the subsequent code depends on.

In [1]:
# this definition exposes all python module imports that should be available in all subsequent commands
import json
import numpy as np
import pandas as pd
import rrcf as rcf
# ...
# global constants
MODEL_DIRECTORY = "/srv/app/model/data/"

In [2]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
print("numpy version: " + np.__version__)
print("pandas version: " + pd.__version__)
print("rrcf version: " + rcf.__version__)

numpy version: 1.19.2
pandas version: 1.1.3
rrcf version: 0.4.3


## Stage 1 - get a data sample from Splunk
In Splunk run a search to pipe a dataset into your notebook environment. Note: mode=stage is used in the | fit command to do this.

| inputlookup app_usage.csv<br>
| table _time OTHER Recruiting<br>
| fit MLTKContainer mode=stage algo=random_cut_forest OTHER from Recruiting threshold=0.1 into app:random_cut_forest<br>

After you run this search, your data set sample is available as a CSV file inside the container for developing your model. The name is taken from the into keyword ("random_cut_forest" in the example above) or set to "default" if no into keyword is present. This step is intended to work with a subset of your data to create your custom model.

In [3]:
# this cell is not executed from MLTK and should only be used for staging data into the notebook environment
def stage(name):
    with open("data/"+name+".csv", 'r') as f:
        df = pd.read_csv(f)
    with open("data/"+name+".json", 'r') as f:
        param = json.load(f)
    return df, param

In [4]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
df, param = stage("random_cut_forest")
print(df.describe())
print(df.head())
print(param)

             OTHER   Recruiting
count    91.000000    91.000000
mean    418.912088   229.890110
std     361.962234   244.979113
min      24.000000     7.000000
25%     174.000000    42.500000
50%     380.000000   247.000000
75%     482.000000   305.500000
max    2102.000000  2168.000000
   OTHER  Recruiting
0    144          33
1    188          30
2   1175         297
3   1475         308
4   1111         305
{'options': {'params': {'mode': 'stage', 'algo': 'random_cut_forest', 'threshold': '0.1'}, 'args': ['OTHER', 'Recruiting'], 'target_variable': ['OTHER'], 'feature_variables': ['Recruiting'], 'model_name': 'random_cut_forest', 'algo_name': 'MLTKContainer', 'mlspl_limits': {'disabled': False, 'handle_new_cat': 'default', 'max_distinct_cat_values': '10000', 'max_distinct_cat_values_for_classifiers': '10000', 'max_distinct_cat_values_for_scoring': '10000', 'max_fit_time': '6000', 'max_inputs': '100000000', 'max_memory_usage_mb': '4000', 'max_model_size_mb': '150', 'max_score_time': '
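
The staged `param` dictionary nests the search options under `options` and then `params`, and the values arrive as strings (note `'threshold': '0.1'` above). A small helper can read an option with a default and a type cast. `get_option` is a hypothetical name, not part of the Deep Learning Toolkit, and this is only a sketch:

```python
def get_option(param, key, default, cast=str):
    """Return param['options']['params'][key] cast to a type, or the default.

    Hypothetical helper for illustration; not part of DLTK.
    """
    params = param.get("options", {}).get("params", {})
    if key not in params:
        return default
    return cast(params[key])


# Trimmed copy of the staged parameters printed above
example_param = {
    "options": {
        "params": {"mode": "stage", "algo": "random_cut_forest", "threshold": "0.1"}
    }
}

threshold = get_option(example_param, "threshold", 0.01, float)  # set in the search
num_trees = get_option(example_param, "num_trees", 15, int)      # not set, uses default
print(threshold, num_trees)
```

The later stages read `num_trees`, `tree_size`, and `threshold` with nested `if` checks that do the same job.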

## Stage 2 - create and initialize a model

In [5]:
# Create the random cut forest from the source data
def init(df,param):
    # Default model parameters, overridable from the | fit search options
    num_points=len(df)
    num_trees=15
    tree_size=30
    
    if 'options' in param:
        if 'params' in param['options']:
            if 'num_trees' in param['options']['params']:
                num_trees = int(param['options']['params']['num_trees'])
            if 'tree_size' in param['options']['params']:
                tree_size = int(param['options']['params']['tree_size'])
    
    # Cap the tree size at the number of rows so each pass yields at least one tree
    tree_size = min(tree_size, num_points)
    # Computed after the overrides so a custom tree_size takes effect:
    # each pass draws (num_points // tree_size) disjoint samples of tree_size rows
    sample_size_range=(num_points // tree_size, tree_size)
    
    # Collect target and feature columns without modifying the lists in param
    variables=[]
    if 'target_variables' in param:
        variables.extend(param['target_variables'])
    if 'feature_variables' in param:
        variables.extend(param['feature_variables'])
    
    # Convert data to a numpy array
    data=df[variables].to_numpy().astype(float)
    
    # Create the random cut forest
    forest = []
    while len(forest) < num_trees:
        # Select random subsets of points uniformly
        ixs = np.random.choice(num_points, size=sample_size_range,
                               replace=False)
        # Build one tree per sampled subset and add it to the forest
        trees = [rcf.RCTree(data[ix], index_labels=ix)
                 for ix in ixs]
        forest.extend(trees)
    return forest
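
The two-dimensional `size` argument passed to `np.random.choice` makes each pass of the loop draw several disjoint index sets at once, one per tree. The sketch below reproduces that partitioning with the standard library's `random.sample` as a stand-in for `np.random.choice(..., replace=False)`. `sample_batches` is a hypothetical name used only here:

```python
import random


def sample_batches(num_rows, tree_size, rng=random):
    """Draw num_rows // tree_size disjoint index sets of tree_size rows each.

    Stand-in for np.random.choice(num_rows, size=(k, tree_size), replace=False).
    """
    num_batches = num_rows // tree_size
    # Sampling without replacement guarantees that no row index repeats
    flat = rng.sample(range(num_rows), num_batches * tree_size)
    return [flat[i * tree_size:(i + 1) * tree_size] for i in range(num_batches)]


# With the 91 staged rows and a tree size of 30, each pass yields 3 index sets,
# so reaching 15 trees takes 5 passes of the while loop.
batches = sample_batches(91, 30)
print(len(batches), [len(b) for b in batches])
```

One row is left out of each pass (91 = 3 × 30 + 1). If the data has fewer rows than the tree size, no index set is drawn at all, so the tree size should not exceed the row count.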

In [6]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
model=init(df,param)

## Stage 3 - fit the model

In [7]:
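# train your model
# returns a fit info json object and may modify the model object
def fit(model,df,param):
    # The forest is already built in init(), so there is nothing left to train here;
    # the number of trees is returned as the fit info
    return len(model)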

In [8]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
print(fit(model,df,param))

15


## Stage 4 - apply the model

In [9]:
# apply your model
# returns the calculated results
def apply(model,df,param):
    # Calculate the collusive displacement of the points in the random trees
    # Note: df must contain the same rows, in the same order, that init() sampled from
    num_points=len(df)
    threshold=0.01
    
    if 'options' in param:
        if 'params' in param['options']:
            if 'threshold' in param['options']['params']:
                threshold = float(param['options']['params']['threshold'])
    
    avg_codisp = pd.Series(0.0, index=np.arange(num_points))
    # Number of trees in which each point appears
    tree_count = np.zeros(num_points)

    for tree in model:
        codisp = pd.Series({leaf : tree.codisp(leaf)
                           for leaf in tree.leaves})

        avg_codisp[codisp.index] += codisp
        np.add.at(tree_count, codisp.index.values, 1)
    # Average over the trees containing each point (NaN if a point was never sampled)
    avg_codisp /= tree_count
    
    # Flag the top fraction of points, given by threshold, as outliers
    num_outliers=int(threshold*num_points)
    cutoff = avg_codisp.nlargest(n=num_outliers).min()
    
    outlier=(avg_codisp >= cutoff).astype(float)
    
    result=pd.DataFrame({'outlier':outlier,'collusive_displacement':avg_codisp})
    return result
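
The outlier rule in `apply` treats `threshold` as the fraction of rows to flag, not as a cutoff on the score itself: it keeps the highest average scores, with the count given by the fraction times the number of rows, rounded down. A minimal pure-Python sketch of that rule follows. `flag_top_fraction` is a hypothetical name, and the scores are made up for illustration:

```python
def flag_top_fraction(scores, fraction):
    """Flag the highest-scoring fraction of points as outliers (1.0), else 0.0."""
    num_outliers = int(fraction * len(scores))
    if num_outliers == 0:
        # Too few rows for the requested fraction: flag nothing
        return [0.0] * len(scores)
    # Smallest score that still belongs to the top num_outliers
    cutoff = sorted(scores, reverse=True)[num_outliers - 1]
    # Ties with the cutoff are flagged too, so the count can exceed num_outliers
    return [1.0 if s >= cutoff else 0.0 for s in scores]


# Illustrative scores only: 91 distinct values, like the 91 staged rows
scores = [float(i) for i in range(91)]
flags = flag_top_fraction(scores, 0.1)
print(sum(flags))  # int(0.1 * 91) = 9 points flagged
```

With `threshold=0.1` and 91 rows this flags 9 points, which matches the count reported by the test cell.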

In [10]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
results=apply(model,df,param)
results['outlier'].sum()

9.0

## Stage 5 - save the model

In [11]:
# save model to name in expected convention "<algo_name>_<model_name>"
# placeholder: the forest is returned unchanged and nothing is written to disk
def save(model,name):
    return model

## Stage 6 - load the model

In [12]:
# load model from name in expected convention "<algo_name>_<model_name>"
# placeholder: returns an empty model because save() does not persist the forest
def load(name):
    model = {}
    return model
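
The `save` and `load` cells above do not persist anything yet. One possible approach, sketched here with the standard library's `pickle`, writes the model object to a file in a model directory. The function names, the `.pkl` suffix, and the `directory` parameter are assumptions for illustration, and whether `rrcf.RCTree` objects pickle cleanly has not been verified here:

```python
import os
import pickle
import tempfile


def save_model(model, name, directory):
    """Pickle the model to <directory>/<name>.pkl and return the file path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path


def load_model(name, directory):
    """Load a model previously written by save_model."""
    with open(os.path.join(directory, name + ".pkl"), "rb") as f:
        return pickle.load(f)


# Round trip with a stand-in object; in the notebook the model would be the
# forest and the directory would be MODEL_DIRECTORY
with tempfile.TemporaryDirectory() as tmp:
    save_model(["tree_0", "tree_1"], "random_cut_forest", tmp)
    restored = load_model("random_cut_forest", tmp)
print(restored)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.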

## Stage 7 - provide a summary of the model

In [13]:
# return a model summary
def summary(model=None):
    returns = {"version": {"numpy": np.__version__, "pandas": pd.__version__} }
    return returns

## End of Stages
All subsequent cells are not tagged and can be used for further freeform code.

### Logic: 

Set some basic parameters (tree size for example)

Convert DF into an array

Create a RRCF based on random splits of the data

Calculate co-displacement based on the random cuts

Return co-displacement to Splunk


### Next steps:

Save the tree

Add logic to append to the forest and recalculate the co-displacement as new data is seen

### Recommendations:

Scale the data

Use NPR (the Normalized Perlich Ratio algorithm in MLTK) to convert high-cardinality categorical fields into numeric values
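
On the recommendation to scale the data: a random cut tree picks the dimension to cut with probability proportional to each feature's range, so a feature with a much larger range dominates the cuts. A minimal sketch of z-score scaling with the standard library follows. `zscore` is a hypothetical helper, and the values are the first five staged rows shown earlier:

```python
import statistics


def zscore(values):
    """Rescale values to zero mean and unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


columns = {
    "OTHER": [144, 188, 1175, 1475, 1111],
    "Recruiting": [33, 30, 297, 308, 305],
}
scaled = {name: zscore(vals) for name, vals in columns.items()}
for name, vals in scaled.items():
    print(name, [round(v, 2) for v in vals])
```

With a pandas DataFrame, `(df - df.mean()) / df.std()` applies the same sample-standard-deviation scaling column by column.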