# Deep Learning Toolkit for Splunk - Graph Algorithms with NetworkX

This notebook contains examples of graph algorithms available in NetworkX.

Note: By default, every time you save this notebook the cells are exported into a Python module, which is then invoked by Splunk MLTK commands like <code> | fit ... | apply ... | summary </code>. Please read the Model Development Guide in the Deep Learning Toolkit app for more information.

## Stage 0 - import libraries
Stage 0 defines all the imports that the subsequent code depends on.

In [3]:
# this definition exposes all python module imports that should be available in all subsequent commands
import json
import numpy as np
import pandas as pd
import networkx as nx
# ...
# global constants
MODEL_DIRECTORY = "/srv/app/model/data/"

In [4]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
print("numpy version: " + np.__version__)
print("pandas version: " + pd.__version__)
print("networkx version: " + nx.__version__)

numpy version: 1.18.1
pandas version: 1.0.1
networkx version: 2.4


## Stage 1 - get a data sample from Splunk
In Splunk run a search to pipe a dataset into your notebook environment. Note: mode=stage is used in the | fit command to do this.

| inputlookup bitcoin_transactions.csv<br>
| head 1000<br>
| rename user_id_from as src user_id_to as dest<br>
| fit MLTKContainer mode=stage algo=graph_algo compute="eigenvector_centrality,cluster_coefficient,betweenness_centrality" from src dest into app:bitcoin_graph as graph

After you run this search, your data set sample is available as a CSV file inside the container for developing your model, together with a JSON file that holds the search parameters. The name is taken from the into keyword ("bitcoin_graph" in the example above) or set to "default" if no into keyword is present. This step is intended to work with a subset of your data to create your custom model.

In [7]:
# this cell is not executed from MLTK and should only be used for staging data into the notebook environment
def stage(name):
    with open("data/"+name+".csv", 'r') as f:
        df = pd.read_csv(f)
    with open("data/"+name+".json", 'r') as f:
        param = json.load(f)
    return df, param

In [8]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
df, param = stage("bitcoin_graph")
print(df[0:1])
print(param)

   src  dest
0    2     2
{'options': {'params': {'mode': 'stage', 'algo': 'graph_algo', 'compute': '"eigenvector_centrality,cluster_coefficient,betweenness_centrality"'}, 'feature_variables': ['src', 'dest'], 'args': ['src', 'dest'], 'model_name': 'bitcoin_graph', 'output_name': 'graph', 'algo_name': 'MLTKContainer', 'mlspl_limits': {'disabled': False, 'handle_new_cat': 'default', 'max_distinct_cat_values': '1000', 'max_distinct_cat_values_for_classifiers': '1000', 'max_distinct_cat_values_for_scoring': '1000', 'max_fit_time': '6000', 'max_inputs': '100000000', 'max_memory_usage_mb': '4000', 'max_model_size_mb': '150', 'max_score_time': '6000', 'streaming_apply': '0', 'use_sampling': '1'}, 'kfold_cv': None}, 'feature_variables': ['src', 'dest']}


## Stage 2 - create and initialize a model

In [9]:
# initialize your model
# available inputs: data and parameters
# returns the model object which will be used as a reference to call fit, apply and summary subsequently
def init(df,param):
    model = nx.Graph()
    return model

In [10]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
model = init(df,param)

## Stage 3 - fit the model

In [11]:
# train your model
# rebuilds the graph from the source and destination columns and returns the updated model object
def fit(model,df,param):
    model.clear()
    src_dest_name = param['feature_variables']
    dfg = df[src_dest_name]
    for index, row in dfg.iterrows():
        model.add_edge(row[src_dest_name[0]], row[src_dest_name[1]]) #, value=row['value'])
    return model

In [12]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
g = fit(model,df,param)
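
The row-by-row loop in fit can become slow on larger samples. As a minimal sketch, assuming the staged data frame has the `src` and `dest` columns shown above, NetworkX's `nx.from_pandas_edgelist` builds an equivalent undirected graph in one call. The sample data frame and the names `sample`, `g_fast` and `g_loop` are made up for illustration.

```python
import networkx as nx
import pandas as pd

# hypothetical sample with the same column names as the staged data
sample = pd.DataFrame({"src": [1, 1, 2, 3], "dest": [2, 3, 3, 4]})

# build an undirected graph directly from the two edge list columns
g_fast = nx.from_pandas_edgelist(
    sample, source="src", target="dest", create_using=nx.Graph()
)

# the loop-based construction used in fit, for comparison
g_loop = nx.Graph()
for _, row in sample.iterrows():
    g_loop.add_edge(row["src"], row["dest"])

# compare edges as unordered pairs, since the graph is undirected
same_edges = {frozenset(e) for e in g_fast.edges()} == {frozenset(e) for e in g_loop.edges()}
print(g_fast.number_of_nodes(), g_fast.number_of_edges(), same_edges)
```

Note that `nx.from_pandas_edgelist` returns a new graph, whereas fit updates the existing model object. One option is to copy the edges over with `model.add_edges_from(g_fast.edges())`.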


## Stage 4 - apply the model

In [13]:
param['options']['params']['compute'].strip("\"").lower().split(',')

['eigenvector_centrality', 'cluster_coefficient', 'betweenness_centrality']
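
The expression above assumes the compute option contains no stray spaces, empty entries or repeated names. A slightly more defensive parser could look like the following sketch. The helper name `parse_compute` is hypothetical and not part of the toolkit.

```python
def parse_compute(raw):
    """Split a quoted, comma-separated compute option into clean lowercase names."""
    names = []
    for part in raw.strip('"').split(","):
        name = part.strip().lower()
        # drop empty entries and keep only the first occurrence of each name
        if name and name not in names:
            names.append(name)
    return names


print(parse_compute('"eigenvector_centrality, Cluster_Coefficient,,eigenvector_centrality"'))
# ['eigenvector_centrality', 'cluster_coefficient']
```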

In [14]:
# apply your model
# returns one column per requested measure, looked up by the source node of each row
def apply(model,df,param):
    src_dest_name = param['feature_variables']
    algos = param['options']['params']['compute'].strip("\"").lower().split(',')
    outputcolumns = []
    for algo in algos:
        algo = algo.strip()
        if algo in outputcolumns:
            # already computed; joining the same column twice would raise an error
            continue
        if algo=='degree_centrality':
            cents = nx.algorithms.centrality.degree_centrality(model)
        elif algo=='betweenness_centrality':
            cents = nx.algorithms.centrality.betweenness_centrality(model)
        elif algo=='eigenvector_centrality':
            cents = nx.algorithms.centrality.eigenvector_centrality(model, max_iter=200)
        elif algo=='cluster_coefficient':
            cents = nx.algorithms.cluster.clustering(model)
        else:
            # unknown measure names are skipped
            continue
        outputcolumns.append(algo)
        # attach each node's score to the rows whose source column matches that node
        degs = pd.DataFrame(list(cents.items()), columns=[src_dest_name[0], algo])
        df = df.join(degs.set_index(src_dest_name[0]), on=src_dest_name[0])
    return df[outputcolumns]

In [15]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
print(apply(model,df,param))

     eigenvector_centrality  cluster_coefficient  betweenness_centrality
0              7.193626e-48                  0.0                0.000000
1              7.193626e-48                  0.0                0.000000
2              8.522787e-43                  0.0                0.000016
3              8.522787e-43                  0.0                0.000016
4              1.535753e-42                  0.0                0.000037
..                      ...                  ...                     ...
995            3.035609e-03                  0.0                0.031779
996            3.035609e-03                  0.0                0.031779
997            3.035609e-03                  0.0                0.031779
998            3.035609e-03                  0.0                0.031779
999            3.035609e-03                  0.0                0.031779

[1000 rows x 3 columns]
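
To make the shape of this output easier to check, the following sketch runs the same idea on a tiny star graph where the expected values are easy to reason about. It swaps the if/elif chain for a lookup table, which is one possible design and not the toolkit's own. The names `MEASURES` and `compute_measures` and the sample edges are hypothetical.

```python
import networkx as nx
import pandas as pd

# hypothetical lookup table from option name to NetworkX function
MEASURES = {
    "degree_centrality": nx.degree_centrality,
    "betweenness_centrality": nx.betweenness_centrality,
    "cluster_coefficient": nx.clustering,
}


def compute_measures(graph, edges, source_col, names):
    """Return one column per known measure, looked up by each row's source node."""
    out = pd.DataFrame(index=edges.index)
    for name in names:
        func = MEASURES.get(name)
        if func is None:
            continue  # skip unknown names, as apply does
        scores = func(graph)  # dict: node -> score
        out[name] = edges[source_col].map(scores)
    return out


# star graph: node 0 is connected to nodes 1, 2 and 3
edges = pd.DataFrame({"src": [0, 0, 0], "dest": [1, 2, 3]})
star = nx.from_pandas_edgelist(edges, source="src", target="dest")

result = compute_measures(
    star, edges, "src", ["degree_centrality", "cluster_coefficient", "unknown"]
)
print(result)
```

Every row has node 0 as its source, so each row shows the values of the hub: a degree centrality of 1.0 (it touches all three other nodes) and a clustering coefficient of 0 (its neighbours are not connected to each other).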


## Stage 5 - save the model

In [16]:
# save model to name in expected convention "<algo_name>_<model_name>"
# persistence is disabled in this example: json.dump cannot serialize an nx.Graph object directly
def save(model,name):
    # with open(MODEL_DIRECTORY + name + ".json", 'w') as file:
    #    json.dump(model, file)
    return model

## Stage 6 - load the model

In [17]:
# load model from name in expected convention "<algo_name>_<model_name>"
# persistence is disabled in this example, so an empty graph is returned
def load(name):
    model = init(None,None)
    # with open(MODEL_DIRECTORY + name + ".json", 'r') as file:
    #    model = json.load(file)
    return model
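
If the graph does need to survive between calls, one option is to store its nodes and edges as plain JSON lists. The sketch below is an assumption about how that could be done, not the toolkit's own persistence code. The helpers `save_graph` and `load_graph` are hypothetical, a temporary directory stands in for MODEL_DIRECTORY, and node labels must be JSON-serializable (NumPy integers, for example, would first need converting with `int`).

```python
import json
import os
import tempfile

import networkx as nx


def save_graph(graph, directory, name):
    """Write nodes and edges to <directory>/<name>.json (hypothetical helper)."""
    payload = {
        "nodes": list(graph.nodes()),
        "edges": [[u, v] for u, v in graph.edges()],
    }
    path = os.path.join(directory, name + ".json")
    with open(path, "w") as file:
        json.dump(payload, file)
    return path


def load_graph(directory, name):
    """Rebuild an undirected graph from a file written by save_graph."""
    with open(os.path.join(directory, name + ".json"), "r") as file:
        payload = json.load(file)
    graph = nx.Graph()
    graph.add_nodes_from(payload["nodes"])  # keeps isolated nodes
    graph.add_edges_from(payload["edges"])
    return graph


original = nx.Graph([(1, 2), (2, 3)])
with tempfile.TemporaryDirectory() as tmp:
    save_graph(original, tmp, "example_graph")
    restored = load_graph(tmp, "example_graph")

print(sorted(restored.edges()))
# [(1, 2), (2, 3)]
```

This format keeps only the graph structure. Node and edge attributes would need extra fields in the payload.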

## Stage 7 - provide a summary of the model

In [18]:
# return a model summary
def summary(model=None):
    returns = {"version": {"numpy": np.__version__, "pandas": pd.__version__, "networkx": nx.__version__} }
    return returns

## End of Stages
None of the subsequent cells are tagged, so they can be used for further freeform code.

In [None]:
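# DRAFT for connected components
# g is an undirected nx.Graph, for which weakly_connected_components is not implemented,
# so connected_components is used instead
comps = nx.algorithms.components.connected_components(g)
d = dict()
# label the nodes of each component with a running number, starting at 1
for i, x in enumerate(comps, start=1):
    for n in x:
        d[n] = i
d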
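
Weakly connected components are defined only for directed graphs. If the direction of the edges matters, a minimal sketch, assuming the edges are loaded into an `nx.DiGraph` instead of the `nx.Graph` used by the model, could label the components as follows. The edge list and the names `dg` and `labels` are made up for illustration.

```python
import networkx as nx

# hypothetical directed edges: 1 -> 2, 3 -> 2 and 4 -> 5
dg = nx.DiGraph()
dg.add_edges_from([(1, 2), (3, 2), (4, 5)])

# nodes belong to the same weak component if they are connected
# once edge directions are ignored
labels = {}
for i, component in enumerate(nx.weakly_connected_components(dg), start=1):
    for node in component:
        labels[node] = i

print(labels)
```

Nodes 1, 2 and 3 end up with one label and nodes 4 and 5 with another, because no edge links the two groups in either direction.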