# Deep Learning Toolkit for Splunk - Notebook for 'Auto-Sklearn 2.0'

This notebook contains an example workflow for using the AutoML framework 'Auto-Sklearn' (https://automl.github.io/auto-sklearn/master/index.html) with the Deep Learning Toolkit for Splunk. The example uses the 'AutoSklearn2Classifier' classifier and can serve as a template for implementing other functionality described in the docs: https://automl.github.io/auto-sklearn/master/manual.html

Note: By default, every time you save this notebook, the cells are exported into a Python module, which is then invoked by Splunk MLTK commands like <code> | fit ... | apply ... | summary </code>. Please read the Model Development Guide in the Deep Learning Toolkit app for more information.

#### Example SPL to create and fit sample data

| inputlookup track_day.csv</br>
| rename * as x_*</br>
| rename x_vehicleType as y_vehicleType</br>
| eventstats avg(x_engineCoolantTemperature) as avg_x_engineCoolantTemperature</br>
| eval x_engineCoolantTemperature = coalesce(x_engineCoolantTemperature, floor(avg_x_engineCoolantTemperature))</br>
| fields y_* x_*</br>
| sample seed=123 100 by y_vehicleType</br>
| fit MLTKContainer algo=autosklearn_classification dataset_name=trackday_autosklearn time_left_for_this_task=30 per_run_time_limit=10 y_vehicleType from x_* into app:trackday_autosklearn_classifier

## Stage 0 - import libraries
In stage 0 we define all imports that the subsequent code depends on.

In [None]:
# mltkc_import
# this definition exposes all python module imports that should be available in all subsequent commands
import json
import pandas as pd
import pickle

import autosklearn
#from autosklearn.classification import AutoSklearnClassifier
from autosklearn.experimental.askl2 import AutoSklearn2Classifier

from copy import deepcopy
import re

# global constants
MODEL_DIRECTORY = "/srv/app/model/data/"

In [None]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes

print("pandas: " + pd.__version__)
print("autosklearn: " + autosklearn.__version__)

## Stage 1 - get a data sample from Splunk
In Splunk, run a search to pipe a dataset into your notebook environment. Note: use mode=stage in the | fit command to do this.

In [None]:
# mltkc_stage
# this cell is not executed from MLTK and should only be used for staging data into the notebook environment
def stage(name):
    with open("data/"+name+".csv", 'r') as f:
        df = pd.read_csv(f)
    with open("data/"+name+".json", 'r') as f:
        param = json.load(f)
    return df, param
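
The later stages read three keys from `param`: `options.params`, `feature_variables` and `target_variables`. As a rough sketch, a staged parameter file for the example SPL might look like the dictionary below. Only the key names are taken from the code in this notebook; the concrete values are illustrative assumptions, not output captured from Splunk. Numeric options are shown as strings because `init()` checks `.isdigit()`, which suggests that is how they arrive.

```python
import json

# Hypothetical content of data/<name>.json; values are made up for illustration.
example_param = {
    "options": {
        "params": {
            "algo": "autosklearn_classification",
            "mode": "stage",
            "dataset_name": "trackday_autosklearn",
            "time_left_for_this_task": "30",
            "per_run_time_limit": "10",
        }
    },
    "feature_variables": ["x_engineCoolantTemperature"],
    "target_variables": ["y_vehicleType"],
}

# Round-trip through JSON to mimic what stage() does with json.load
param = json.loads(json.dumps(example_param))
print(param["options"]["params"]["time_left_for_this_task"])
print(param["feature_variables"], param["target_variables"])
```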

In [None]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes

df, param = stage("trackday_autosklearn_classifier")
print(df[0:5])
print(df.shape)
print(str(param))

## Stage 2 - create and initialize a model

In [None]:
# mltkc_init
# initialize the model
# params: data and parameters
# returns the model object which will be used as a reference to call fit, apply and summary subsequently
def init(df,param):
    params = deepcopy(param['options']['params'])
    params.pop('algo', None)
    params.pop('mode', None)
    params.pop('dataset_name', None)
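    # convert digit-only string values (e.g. "30") to int; leave everything else as is
    for key in params:
        try:
            if params[key].isdigit():
                params[key] = int(params[key])
        except (AttributeError, ValueError):
            # value is not a string or cannot be parsed as int: keep it unchanged
            pass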
    model = {}
    model["model"] = AutoSklearn2Classifier(
        **params
    )
    return model
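
The parameter handling in `init()` can be tried without auto-sklearn installed. The following is a minimal sketch of the same idea, assuming option values arrive as strings; `coerce_params` and the option name `some_float_option` are hypothetical and exist only for this illustration.

```python
from copy import deepcopy

def coerce_params(raw_params):
    """Drop the DLTK-specific keys and turn digit-only strings into ints."""
    params = deepcopy(raw_params)  # do not modify the caller's dictionary
    for key in ("algo", "mode", "dataset_name"):
        params.pop(key, None)
    for key, value in params.items():
        if isinstance(value, str) and value.isdigit():
            params[key] = int(value)
    return params

raw = {
    "algo": "autosklearn_classification",
    "dataset_name": "trackday_autosklearn",
    "time_left_for_this_task": "30",
    "per_run_time_limit": "10",
    "some_float_option": "0.5",  # hypothetical option, not an auto-sklearn parameter
}
print(coerce_params(raw))
```

Note that only non-negative integers are converted: a value such as `"0.5"` or `"-1"` fails the `.isdigit()` check and is passed on as a string.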

In [None]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes

df, param = stage("trackday_autosklearn_classifier")
model = init(df,param)
print(model)

## Stage 3 - fit the model

In [None]:
# mltkc_stage_create_model_fit
# returns a fit info json object
def fit(model,df,param):
    returns = {}
    for col in df.select_dtypes(['object']):
        df[col] = df[col].astype('category')
    X = df[param['feature_variables']]
    y = df[param['target_variables']].values
    dsname = param['options']['params'].get('dataset_name')
    returns['dataset_name'] = dsname
    returns['fit_history'] = model["model"].fit(X, y, dataset_name=dsname)
    return returns

In [None]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes

print(fit(model,df,param))

## Stage 4 - apply the model

In [None]:
# mltkc_stage_create_model_apply
def apply(model,df,param):
    for col in df.select_dtypes(['object']):
        df[col] = df[col].astype('category')
    X = df[param['feature_variables']]
    y_hat = model["model"].predict(X)
    return y_hat
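
Both `fit()` and `apply()` cast object columns to `category` independently, so the set of categories is derived from whatever values happen to be present in each DataFrame. The sketch below shows one possible way to reuse the categories seen during training when scoring new data. It is an assumption-laden illustration, not part of the DLTK template: `to_category` and the column names are hypothetical, and whether a mismatch affects predictions depends on how the classifier consumes categorical columns, which this notebook does not spell out.

```python
import pandas as pd

def to_category(df, categories=None):
    """Return a copy with object columns cast to category.

    If `categories` maps a column name to a list of known values,
    those values are reused instead of being inferred from `df`.
    """
    out = df.copy()  # leave the caller's DataFrame untouched
    learned = {}
    for col in out.select_dtypes(["object"]).columns:
        if categories and col in categories:
            out[col] = pd.Categorical(out[col], categories=categories[col])
        else:
            out[col] = out[col].astype("category")
        learned[col] = list(out[col].cat.categories)
    return out, learned

# Hypothetical training and scoring data
train = pd.DataFrame({"x_gear": pd.Series(["N", "D", "R"], dtype=object), "x_speed": [0, 50, 5]})
score = pd.DataFrame({"x_gear": pd.Series(["D", "D"], dtype=object), "x_speed": [60, 70]})

train_cat, cats = to_category(train)
score_cat, _ = to_category(score, cats)
print(cats)
print(list(score_cat["x_gear"].cat.categories))
```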

In [None]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes

df, param = stage("trackday_autosklearn_classifier")
print(apply(model,df,param))

## Stage 5 - save the model

In [None]:
# save the model dictionary (estimator plus summary) as "<name>.pickle" in MODEL_DIRECTORY
def save(model,name):
    model["summary"] = {}
    model["summary"]["statistics"] = {}
    for s in model["model"].sprint_statistics().split("\n")[1:-1]:
        match = re.search(r'(.*):\s(.*)', s.strip(), re.IGNORECASE)
        if match:
            model["summary"]["statistics"][match.group(1)] = str(match.group(2))

    cv_result_keys = {'mean_test_score': 1, 'mean_fit_time': 1, 'status': 0, 'rank_test_scores': 1}
    for k,v in cv_result_keys.items():
        model["summary"][k] = str(model["model"].cv_results_[k].tolist()) if (v) else model["model"].cv_results_[k]

    model["summary"]["models"] = []
    models_ww = model["model"].get_models_with_weights()
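    # match single quotes that are not escaped with a backslash
    p = re.compile(r"(?<!\\)'")
    for m in models_ww:
        curr_weight = m[0]
        # extract the first {...} argument from the model repr and make it JSON-parsable
        curr_model = p.sub('"', re.search(r'.*\((\{.*\})', str(m[1]), re.IGNORECASE).group(1))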
        model_json = json.loads(curr_model)
        model_json["weight"] = curr_weight
        model["summary"]["models"].append(model_json)
    with open(MODEL_DIRECTORY + name + ".pickle", 'wb') as f:
        pickle.dump(model, f)
    return model
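
The two text-parsing steps in `save()` can be exercised in isolation. The sketch below applies the same regular expressions to made-up input: `sample_statistics` and `sample_repr` only imitate the "key: value" layout and the `Name({...}, ...)` layout that the code expects and are not real auto-sklearn output. The helper names are hypothetical.

```python
import json
import re

# Made-up stand-in for the statistics text; not captured from a real run.
sample_statistics = """auto-sklearn results:
  Dataset name: trackday_autosklearn
  Metric: accuracy
  Number of target algorithm runs: 7
"""

# Made-up stand-in for the string form of one ensemble member.
sample_repr = ("SimpleClassificationPipeline({'classifier:__choice__': 'random_forest', "
               "'balancing:strategy': 'none'},\ndataset_properties={})")

def parse_statistics(text):
    """Collect 'key: value' pairs, skipping the first and last line as save() does."""
    stats = {}
    for line in text.split("\n")[1:-1]:
        match = re.search(r"(.*):\s(.*)", line.strip())
        if match:
            stats[match.group(1)] = match.group(2)
    return stats

def config_from_repr(text):
    """Extract the first {...} argument and parse it as JSON after swapping quotes."""
    inner = re.search(r".*\((\{.*\})", text).group(1)
    return json.loads(re.sub(r"(?<!\\)'", '"', inner))

print(parse_statistics(sample_statistics))
print(config_from_repr(sample_repr))
```

The quote replacement only yields valid JSON when the dictionary contains strings and numbers. Values such as `True` or `None` would make `json.loads` fail, in which case `ast.literal_eval` may be a more robust choice.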

In [None]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes

save(model,"trackday_autosklearn_classifier")

## Stage 6 - load the model

In [None]:
# load the model dictionary from "<name>.pickle" in MODEL_DIRECTORY
def load(name):
    with open(MODEL_DIRECTORY + name + ".pickle", 'rb') as pickle_file:
        model = pickle.load(pickle_file)
    return model
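
The save and load stages form a plain pickle round trip. A minimal sketch of that round trip, using a temporary directory instead of `MODEL_DIRECTORY` and a simple dictionary as a stand-in for the fitted model, could look like this. The helper names `save_model` and `load_model` are hypothetical.

```python
import os
import pickle
import tempfile

def save_model(model, directory, name):
    """Pickle the model dictionary to <directory>/<name>.pickle and return the path."""
    path = os.path.join(directory, name + ".pickle")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

def load_model(directory, name):
    """Load the model dictionary written by save_model."""
    with open(os.path.join(directory, name + ".pickle"), "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for {"model": <fitted classifier>, "summary": {...}}
    stand_in = {"model": None, "summary": {"statistics": {"Metric": "accuracy"}}}
    path = save_model(stand_in, tmp, "trackday_autosklearn_classifier")
    restored = load_model(tmp, "trackday_autosklearn_classifier")
    print(os.path.basename(path), restored == stand_in)
```

Because unpickling can execute arbitrary code, only load pickle files that come from a trusted source.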

In [None]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes

model = load("trackday_autosklearn_classifier")

## Stage 7 - provide a summary of the model

In [None]:
# return model summary
def summary(model=None):
    returns = {"version": {"autosklearn": autosklearn.__version__} }
    return returns
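
`save()` stores run statistics under `model["summary"]`, while `summary()` as written returns only the library version. If the stored statistics should also be visible through `| summary`, one option is sketched below. This is a suggestion under assumptions rather than the template's behaviour: `summary_with_stats` is a hypothetical name, and the `version` argument stands in for `autosklearn.__version__` so that the sketch runs without the library.

```python
def summary_with_stats(model=None, version="unknown"):
    """Return the version info plus any summary stored in the model dictionary."""
    returns = {"version": {"autosklearn": version}}
    if model is not None and "summary" in model:
        returns["summary"] = model["summary"]
    return returns

# Stand-in for a model dictionary as written by save()
stand_in = {"model": None, "summary": {"statistics": {"Metric": "accuracy"}}}
print(summary_with_stats(stand_in))
print(summary_with_stats())
```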

In [None]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes

summary(model)

## End of Stages
All subsequent cells are untagged and can be used for further freeform code.