# Splunk App for Data Science and Deep Learning - Graphistry Notebook for Graph Exploration

This notebook contains an example of how to connect to Graphistry for advanced graph exploration. Note that it is better suited to ad hoc investigations of Splunk data than to operationalized models. The operationalization functions fit, apply and summary are therefore intentionally kept minimal, but you can extend and use them as shown in most other DSDL notebook examples. More information about Graphistry: https://www.graphistry.com/

Note: By default, every time you save this notebook, the cells are exported into a Python module, which is then invoked by Splunk MLTK commands like <code> | fit ... | apply ... | summary </code>. Please read the Model Development Guide in the DSDL app docs for more information: https://docs.splunk.com/Documentation/DSDL

## Stage 0 - import libraries
At stage 0 we define all imports necessary to run our subsequent code depending on various libraries.

In [13]:
# this definition exposes all python module imports that should be available in all subsequent commands
import json
import numpy as np
import pandas as pd
import networkx as nx
import graphistry
# please use your graphistry credentials to use their services.
# security note: your graph data is sent to graphistry hub, so please ensure all your data security and compliance is in sync with this operation
graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username="username", password="XXXXXXXXX")    

# ...
# global constants
MODEL_DIRECTORY = "/srv/app/model/data/"
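
The register call above carries the username and password in plain text inside the notebook. As a minimal sketch of one alternative, the helper below reads them from environment variables instead. The variable names `GRAPHISTRY_USERNAME`, `GRAPHISTRY_PASSWORD` and `GRAPHISTRY_SERVER`, as well as the helper name itself, are assumptions made for this illustration and are not part of DSDL or graphistry.

```python
import os


def graphistry_register_kwargs(env=os.environ):
    """Build keyword arguments for graphistry.register() from environment variables."""
    # Hypothetical variable names, chosen for this sketch only.
    username = env.get("GRAPHISTRY_USERNAME")
    password = env.get("GRAPHISTRY_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set GRAPHISTRY_USERNAME and GRAPHISTRY_PASSWORD first")
    return {
        "api": 3,
        "protocol": "https",
        # Fall back to the hub server used in the cell above.
        "server": env.get("GRAPHISTRY_SERVER", "hub.graphistry.com"),
        "username": username,
        "password": password,
    }


# Usage inside the notebook (requires the graphistry package):
# graphistry.register(**graphistry_register_kwargs())
```

This keeps the credentials out of the saved notebook and out of the Python module that is exported from it.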

In [14]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
print("numpy version: " + np.__version__)
print("pandas version: " + pd.__version__)
print("networkx version: " + nx.__version__)
print('graphistry', graphistry.__version__)

numpy version: 1.20.3
pandas version: 1.3.5
networkx version: 2.6.3
graphistry 0.28.7


## Stage 1 - get a data sample from Splunk

### Option 1 - push data from Splunk
In Splunk run a search to pipe a dataset into your notebook environment. Note: mode=stage is used in the | fit command to do this.

| inputlookup firewall_traffic.csv<br>
| fit MLTKContainer mode=stage algo=graphistry_notebook * into app:graphistry_firewall_traffic

After you run this search, your data set sample is available as a CSV file inside the container to develop your model. The name is taken from the into keyword ("graphistry_firewall_traffic" in the example above) or set to "default" if no into keyword is present. This step is intended to work with a subset of your data to create your custom model.

In [15]:
# this cell is not executed from MLTK and should only be used for staging data into the notebook environment
def stage(name):
    with open("data/"+name+".csv", 'r') as f:
        df = pd.read_csv(f)
    with open("data/"+name+".json", 'r') as f:
        param = json.load(f)
    return df, param
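
To make the naming convention concrete, the sketch below derives the two file paths that `stage` reads for a given `into` name. It assumes, based only on how this notebook calls `stage("graphistry_firewall_traffic")` after `into app:graphistry_firewall_traffic`, that a namespace prefix such as `app:` is not part of the file name. The helper `staged_paths` is hypothetical.

```python
from pathlib import Path


def staged_paths(into_name, data_dir="data"):
    """Return the CSV and JSON paths that stage() would read for an `into` name."""
    # Assumption: a prefix such as "app:" is dropped from the file name,
    # as suggested by the stage() call used in this notebook.
    name = into_name.split(":", 1)[-1]
    base = Path(data_dir)
    return base / f"{name}.csv", base / f"{name}.json"


csv_path, json_path = staged_paths("app:graphistry_firewall_traffic")
print(csv_path)   # staged data sample
print(json_path)  # parameters passed along with the search
```

Checking `csv_path.exists()` before calling `stage` gives a clearer error than a failed `open` when the Splunk search has not been run yet.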

In [16]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
df, param = stage("graphistry_firewall_traffic")

In [17]:
df

Unnamed: 0,receive_time,serial_number,session_id,src_ip,dst_ip,bytes_sent,bytes_received,packets_sent,packets_received,dest_port,src_port,used_by_malware,has_known_vulnerability
0,10/7/15 23:59,sn_1606046662,sid_14787,138.52.78.14,73.147.88.91,85,170,1,1,p_53,p_57375,yes,yes
1,10/7/15 23:59,sn_1606046662,sid_1838,205.77.248.110,73.147.88.91,75,107,1,1,p_53,p_6289,yes,yes
2,10/7/15 23:59,sn_1606046662,sid_17519,44.165.220.174,27.90.179.152,76,108,1,1,p_53,p_45700,yes,yes
3,10/7/15 23:59,sn_1606046662,sid_36258,227.45.212.95,73.147.88.91,85,170,1,1,p_53,p_33298,yes,yes
4,10/7/15 23:59,sn_1606046662,sid_48945,40.149.50.140,226.58.156.109,1872,4620,19,18,p_443,p_55362,no,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...
98938,10/7/15 23:53,sn_0009C101998,sid_34900121,64.73.95.187,216.148.42.38,482,66,3,1,p_8089,p_60076,yes,yes
98939,10/7/15 23:53,sn_0009C101998,sid_389256,109.197.174.142,42.93.13.188,570,570,5,5,p_0,p_0,no,yes
98940,10/7/15 23:53,sn_0009C101998,sid_444276,109.197.174.142,20.205.92.188,570,570,5,5,p_0,p_0,no,yes
98941,10/7/15 23:53,sn_0009C101998,sid_61729,109.197.174.142,143.232.70.184,570,570,5,5,p_0,p_0,no,yes


### Option 2 - interactively get a data sample from Splunk
Note: you need to set up a Splunk access token and configure the DSDL app to use this functionality.

In [8]:
import libs.SplunkSearch as SplunkSearch

In [9]:
search = SplunkSearch.SplunkSearch(search='| inputlookup firewall_traffic.csv\n| fit MLTKContainer mode=stage algo=graphistry_notebook * into app:graphistry_firewall_traffic')


In [19]:
df2 = search.as_df()
df2

## Interactive Graph Exploration with Graphistry
This section shows a few simple examples of how to work interactively in this notebook with Graphistry and the Splunk data received above.
### Example 1 - simple graph from defined source and destination pairs

In [18]:
g1 = graphistry.edges(df).bind(source='src_ip', destination='dst_ip')
g1.plot()
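
The firewall sample has close to 99,000 rows, and many of them repeat the same source and destination pair. A possible preprocessing step, sketched below with pandas, collapses those repeats into one weighted edge per pair before plotting. The helper `aggregate_edges` and the toy rows are made up for this illustration. The column names follow the staged data frame shown earlier.

```python
import pandas as pd


def aggregate_edges(df, source="src_ip", destination="dst_ip"):
    """Collapse repeated source/destination pairs into one weighted edge each."""
    return (
        df.groupby([source, destination])
        .agg(
            sessions=("session_id", "count"),
            bytes_sent=("bytes_sent", "sum"),
            bytes_received=("bytes_received", "sum"),
        )
        .reset_index()
    )


# Toy rows for illustration only, not taken from the dataset.
sample = pd.DataFrame(
    {
        "src_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
        "dst_ip": ["10.0.0.9", "10.0.0.9", "10.0.0.9"],
        "session_id": ["s1", "s2", "s3"],
        "bytes_sent": [85, 75, 76],
        "bytes_received": [170, 107, 108],
    }
)
edges = aggregate_edges(sample)
print(edges)

# In the notebook the result could then be bound in the same way as g1:
# graphistry.edges(edges).bind(source="src_ip", destination="dst_ip").plot()
```

Fewer edges mean less data uploaded to the Graphistry server, and the summed columns remain available as edge attributes.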

### Example 2 - explore a subset of this dataset with a hypergraph
In Splunk run a search to pipe a dataset into your notebook environment. Note: mode=stage is used in the | fit command to do this.

| inputlookup firewall_traffic.csv<br/>
| stats count by src_ip dst_ip serial_number used_by_malware<br/>
| fit MLTKContainer mode=stage algo=graphistry_notebook * into app:graphistry_firewall_traffic_hypergraph

In [18]:
df2, param = stage("graphistry_firewall_traffic_hypergraph")

In [19]:
g2 = graphistry.hypergraph(df2)
g2['graph'].plot()

# links 72155
# events 14431
# attrib entities 10348
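
The counts printed above are consistent with a simple reading of the hypergraph transform: every row becomes an event node, every distinct column value becomes an entity node, and each event is linked to the entities in its row. With 14431 events and five columns (the four `by` fields plus `count`), that gives 14431 × 5 = 72155 links. The sketch below reproduces this counting on toy rows in plain Python. It is an illustration of the idea under that assumption, not graphistry's implementation.

```python
def hypergraph_counts(rows):
    """Count events, entities and links for a simplified hypergraph of `rows`."""
    entities = set()
    links = 0
    for row in rows:
        for column, value in row.items():
            if value is None:
                continue  # missing values produce no entity and no link
            # An entity is identified by its column together with its value.
            entities.add((column, value))
            links += 1
    return {"events": len(rows), "entities": len(entities), "links": links}


# Toy rows for illustration only.
rows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "used_by_malware": "yes"},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "used_by_malware": "no"},
]
counts = hypergraph_counts(rows)
print(counts)
```

Here two events share the destination entity `10.0.0.9`, which is what makes shared values show up as hubs in the plotted hypergraph.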


## Stage 2 - create and initialize a model

In [31]:
# initialize your model
# available inputs: data and parameters
# returns the model object which will be used as a reference to call fit, apply and summary subsequently
def init(df,param):
    model = graphistry.edges(df).bind(source='src_ip', destination='dst_ip')
    return model

In [36]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
model = init(df,param)

## Stage 3 - fit the model

In [34]:
# train your model
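# DSDL convention: return a fit info json object and optionally modify the model object
# in this example the rebuilt graphistry graph is returned instead, for interactive inspection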
def fit(model,df,param):
    model = graphistry.edges(df).bind(source='src_ip', destination='dst_ip')
    return model

In [35]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
g = fit(model,df,param)
g

{'bindings': {'edges':         receive_time   serial_number    session_id           src_ip  \
  0      10/7/15 23:59   sn_1606046662     sid_14787     138.52.78.14   
  1      10/7/15 23:59   sn_1606046662      sid_1838   205.77.248.110   
  2      10/7/15 23:59   sn_1606046662     sid_17519   44.165.220.174   
  3      10/7/15 23:59   sn_1606046662     sid_36258    227.45.212.95   
  4      10/7/15 23:59   sn_1606046662     sid_48945    40.149.50.140   
  ...              ...             ...           ...              ...   
  98938  10/7/15 23:53  sn_0009C101998  sid_34900121     64.73.95.187   
  98939  10/7/15 23:53  sn_0009C101998    sid_389256  109.197.174.142   
  98940  10/7/15 23:53  sn_0009C101998    sid_444276  109.197.174.142   
  98941  10/7/15 23:53  sn_0009C101998     sid_61729  109.197.174.142   
  98942  10/7/15 23:53  sn_0009C101998    sid_138461  109.197.174.142   
  
                 dst_ip  bytes_sent  bytes_received  packets_sent  \
  0        73.147.88.91        

## Stage 4 - apply the model

In [37]:
# apply the model
# returns the calculated results
def apply(model,df,param):
    # example to utilize graphistry functions to derive insights from the graph and return to Splunk
    topo = model.get_topological_levels()
    return topo._nodes


In [38]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing or development purposes
results = apply(model,df,param)
results

Cycle on computing level 7
Cycle on computing level 12
Cycle on computing level 16
Cycle on computing level 18
Cycle on computing level 20
Cycle on computing level 22
Cycle on computing level 26
Cycle on computing level 28
Cycle on computing level 33
Cycle on computing level 36

Unnamed: 0,id,level
0,138.52.78.14,0
1,205.77.248.110,0
2,44.165.220.174,0
3,227.45.212.95,0
4,40.149.50.140,0
...,...,...
2,56.42.67.133,36
0,16.44.180.233,37
1,176.226.66.23,37
2,165.116.26.184,37
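
To illustrate what a topological level means, the sketch below assigns levels on a small directed graph in plain Python: nodes without incoming edges get level 0, and each following level holds the nodes whose predecessors have all been placed. This is a simplified stand-in and not graphistry's algorithm. In particular, it stops with an error when it meets a cycle, whereas the "Cycle on computing level" messages above show that graphistry keeps going on the firewall graph.

```python
from collections import defaultdict


def topological_levels(edges):
    """Assign a level to every node of a directed acyclic graph."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    levels = {}
    level = 0
    current = sorted(node for node in nodes if indegree[node] == 0)
    while current:
        next_level = []
        for node in current:
            levels[node] = level
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_level.append(child)
        current = sorted(next_level)
        level += 1

    if len(levels) < len(nodes):
        # Nodes on a cycle never reach in-degree 0 in this simplified version.
        raise ValueError("graph contains a cycle")
    return levels


# Toy graph for illustration only.
toy_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
toy_levels = topological_levels(toy_edges)
print(toy_levels)
```

In the firewall data, level 0 therefore contains IP addresses that only appear as sources, and higher levels lie further downstream in the traffic graph.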


## Stage 5 - save the model

In [16]:
# save model to name in expected convention "<algo_name>_<model_name>"
def save(model,name):
    # with open(MODEL_DIRECTORY + name + ".json", 'w') as file:
    #    json.dump(model, file)
    return model

## Stage 6 - load the model

In [17]:
# load model from name in expected convention "<algo_name>_<model_name>"
def load(name):
    model = init(None,None)
    # with open(MODEL_DIRECTORY + name + ".json", 'r') as file:
    #    model = json.load(file)
    return model
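
The commented-out `json.dump(model, file)` call in the save cell would not work as written for this notebook, because the model is a graphistry graph object rather than a JSON-serializable dictionary. One possible approach, sketched below, is to persist only the edge table as CSV and rebuild the graph from it on load. The helpers `save_edges` and `load_edges` are hypothetical, and the sketch writes to a temporary directory instead of `MODEL_DIRECTORY` so that it runs anywhere.

```python
import os
import tempfile

import pandas as pd


def save_edges(edges, name, model_directory):
    """Write the edge table to <model_directory>/<name>.csv and return the path."""
    path = os.path.join(model_directory, name + ".csv")
    edges.to_csv(path, index=False)
    return path


def load_edges(name, model_directory):
    """Read the edge table back from <model_directory>/<name>.csv."""
    return pd.read_csv(os.path.join(model_directory, name + ".csv"))


# Toy edge table for illustration only.
edges = pd.DataFrame({"src_ip": ["10.0.0.1"], "dst_ip": ["10.0.0.9"]})
with tempfile.TemporaryDirectory() as tmp:
    save_edges(edges, "graphistry_notebook_demo", tmp)
    restored = load_edges("graphistry_notebook_demo", tmp)
print(restored)

# In the notebook, the graph could then be rebuilt as in init():
# model = graphistry.edges(restored).bind(source='src_ip', destination='dst_ip')
```

Storing the edges rather than the graph object keeps the saved artifact readable from Splunk or any other tool that understands CSV.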

## Stage 7 - provide a summary of the model

In [40]:
# return a model summary
def summary(model=None):
    returns = {"version": {"numpy": np.__version__, "pandas": pd.__version__, "graphistry": graphistry.__version__} }
    return returns

In [41]:
summary()

{'version': {'numpy': '1.20.3', 'pandas': '1.3.5', 'graphistry': '0.28.7'}}

## End of Stages
All subsequent cells are not tagged and can be used for further freeform code