# Cyber Use Case Tutorial #2: Network Mapping using RAPIDS

### GTC SJ 2019 (18 March 2019)
### Authors:
- Bianca Rhodes (NVIDIA)
- Bartley Richardson (NVIDIA)
- Eli Fajardo (NVIDIA)
- Bhargav Suryadevara (NVIDIA)
- Nick Becker (NVIDIA)

### Goals:
- Parse raw Windows Event Logs using cuDF
- Load netflow data into a cuDF DataFrame
- Map parsed data to network graph edges using cuDF
- Run cuGraph PageRank
- Build a network graph

### Imports

In [None]:
import os
import re
import time
import dask_cudf
import dask.delayed
import nvstrings
import nvcategory
import yaml
import pandas as pd
import numpy as np
import json
import cugraph
import dask.dataframe as dd
import dask
import cudf

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from collections import OrderedDict
from cudf.core import DataFrame

### Background

With the ever-growing flow of data and number of connected devices, it is critical to be able to map that data into a network graph for easy visual reference and analytics. The aim is to recognize patterns and anomalies in order to combat cyber attacks.

One common struggle is parsing data quickly. Here we demonstrate how to parse raw Windows Event Logs with cuDF.

By the end of this tutorial, we will have parsed raw Windows Event Logs containing authorization data and combined them with netflow data to form a network mapping graph.

### Parsing Windows Event Logs

In the cell below we'll be setting variables for the input columns and output columns.

First, define the input columns and dtypes. These input columns are defined by the data source provided by [Los Alamos National Laboratory](https://csr.lanl.gov/data/2017.html). The additional column "Raw" integrates the values from those columns to form a raw Windows Event Log.
  
Next, define the output columns and dtypes. These output columns are determined by the content of the Windows Event Logs and, more specifically, by the regex configuration in `conf/lanl_regex_configs`, which is used to parse each key-value pair from the raw log.

In [None]:
INPUT_COLS_SET = ['Time',
                  'EventID',
                  'LogHost',
                  'LogonType',
                  'LogonTypeDescription',
                  'UserName',
                  'DomainName',
                  'LogonID',
                  'SubjectUserName',
                  'SubjectDomainName',
                  'SubjectLogonID',
                  'Status',
                  'Source',
                  'ServiceName',
                  'Destination',
                  'AuthenticationPackage',
                  'FailureReason',
                  'ProcessName',
                  'ProcessID',
                  'ParentProcessName',
                  'ParentProcessID',
                  'Raw']
INPUT_DTYPES = ['str' for x in INPUT_COLS_SET]

OUTPUT_COLS_SUPERSET = ['detailed_authentication_information_authentication_package',
                        'new_logon_logon_guid',
                        'failure_information_failure_reason',
                        'failure_information_status',
                        'computername',
                        'new_logon_logon_id',
                        'subject_security_id',
                        'detailed_authentication_information_package_name_ntlm_only',
                        'logon_type',
                        'account_for_which_logon_failed_security_id',
                        'detailed_authentication_information_key_length',
                        'subject_logon_id',
                        'process_information_caller_process_name',
                        'eventcode',
                        'process_information_caller_process_id',
                        'subject_account_name',
                        'process_information_process_name',
                        'new_logon_account_name',
                        'process_information_process_id',
                        'failure_information_sub_status',
                        'new_logon_security_id',
                        'network_information_source_network_address',
                        'detailed_authentication_information_transited_services',
                        'new_logon_account_domain',
                        'subject_account_domain',
                        'detailed_authentication_information_logon_process',
                        'account_for_which_logon_failed_account_domain',
                        'account_for_which_logon_failed_account_name',
                        'network_information_workstation_name',
                        'network_information_source_port',
                        'subject_security_id']
OUTPUT_DTYPES = ['float64' for x in OUTPUT_COLS_SUPERSET]
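
To make the link between a raw log and the output columns more concrete, here is a minimal CPU-only sketch using Python's built-in `re` module. The sample log fragment and the patterns are invented for illustration; the actual patterns live in `conf/lanl_regex_configs` and may look different. The names `sample_raw`, `sample_regexes`, and `extract_fields` are hypothetical.

```python
import re

# Hypothetical, already-cleaned raw log fragment
# (lowercased, with line breaks replaced by ' | ').
sample_raw = (
    "eventcode=4624 | logon type: 3 | new logon: | "
    "account name: user123 | account domain: dom1 | "
    "network information: | workstation name: comp456"
)

# Hypothetical patterns keyed by output column name.
# Each pattern has exactly one capture group holding the value.
sample_regexes = {
    "eventcode": r"eventcode=(\d+)",
    "logon_type": r"logon type:\s*(\d+)",
    "new_logon_account_name": r"new logon:.*?account name:\s*([^|]+?)\s*\|",
    "network_information_workstation_name": r"workstation name:\s*([^|]+?)\s*(?:\||$)",
}


def extract_fields(raw, regexes):
    """Return {column: first captured group, or None if no match}."""
    parsed = {}
    for column, pattern in regexes.items():
        match = re.search(pattern, raw)
        parsed[column] = match.group(1) if match else None
    return parsed


parsed = extract_fields(sample_raw, sample_regexes)
print(parsed)
```

Each key of the pattern dictionary becomes one output column, which is the same role the per-event-code regex mappings play in the pipeline.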

### Parsing Data Pipeline

Preprocess (clean) the raw logs by lowercasing them and replacing the escaped tab, carriage-return, and newline sequences, then parse the logs by event code. We use per-event-code regex mappings to extract each data element defined in `OUTPUT_COLS_SUPERSET`.

In [None]:
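def pipeline(df, event_codes, regex_mappings, clean=True):
    """Optionally clean the logs, parse each event code separately,
    and concatenate the parsed results into one dataframe.
    """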
    
    if clean:
        df = preprocess_logs(df)
    out_dfs = [] 
    
    # separate by eventcode and process differently
    
    for code in event_codes:
        portion = filter_by_pattern(df, code)
        temp = process_logtype(portion, regex_mappings, code)
        temp['time'] = portion['Time'].astype('int')
        temp['eventcode'] = portion['EventID']
        out_dfs.append(temp)
        
    # recombine the processed output
    out_df = cudf.concat(out_dfs)
    return out_df


def concat_wrapper(df_list):
    return cudf.concat(df_list)


def preprocess_logs(logs_gdf):
    """Lowercasing and replacing characters
    """
    logs_gdf['Raw'] = (
        logs_gdf['Raw'].str.lower()
        .str.replace('\\\\t', '')
        .str.replace('\\\\r', '')
        .str.replace('\\\\n', ' | ')
    )
    return logs_gdf


def process_logtype(df, regexes, eventcode):
    """Ongoing strings development/fixes will allow for cleaner log processing code in the future
    """

    log_df_processed = cudf.read_csv('conf/LANL_OUTPUT_COLS_SUPERSET.csv', dtype=OUTPUT_DTYPES)
    
    log_df_processed = log_df_processed[:0]
    columns = list(regexes[eventcode].keys())

    for col in columns:
        regex_pattern = regexes[eventcode].get(col)
        extracted_nvstrings = df['Raw'].str.extract(regex_pattern)
        
        if not extracted_nvstrings.empty:
            log_df_processed[col] = extracted_nvstrings[0]
    
    for col in log_df_processed.columns:
        if not log_df_processed[col].empty:
            if log_df_processed[col].dtype == 'float64':
                log_df_processed[col] = log_df_processed[col].astype('int').astype('str')
            elif log_df_processed[col].dtype == 'object':
                pass
            
            else:
                log_df_processed[col] = log_df_processed[col].astype('str')
        
        if log_df_processed[col].empty:
            log_df_processed[col] = nvstrings.to_device([])
    
    return log_df_processed


def filter_by_pattern(df, pattern):
    """Filter based on whether a string contains a regex pattern
    """
    df['present'] = df['EventID'].str.contains(pattern)
    return df[df.present == True]


def read_data(filename, **kwargs):
    """Read a CSV file into a dask_cudf dataframe
    """
    gdf = dask_cudf.read_csv(filename, **kwargs)
    return gdf


def load_regex_yaml(yaml_file):
    with open(yaml_file) as f:
        regex_dict = yaml.safe_load(f)
        regex_dict = {k: v[0] for k, v in regex_dict.items()}
    return regex_dict


def create_regex_dictionaries(yaml_directory):
    regex_dict = {}
    for f in os.listdir(yaml_directory):
        name, ext = os.path.splitext(f)
        if ext == ".yaml":
            temp_regex = load_regex_yaml(os.path.join(yaml_directory, f))
            # Key the mappings by file name without the extension
            regex_dict[name] = temp_regex
        
    return regex_dict
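
The cleaning and per-event-code filtering steps above can be sketched on the CPU with plain Python, which may help when checking what the GPU code is expected to produce. This is only an illustration under stated assumptions: the sample rows are invented, and `preprocess_raw` and `split_by_event_code` are hypothetical helpers, not part of the notebook.

```python
import re


def preprocess_raw(raw):
    """Lowercase and replace literal backslash sequences, as preprocess_logs does."""
    cleaned = raw.lower()
    cleaned = re.sub(r"\\t", "", cleaned)     # literal backslash followed by 't'
    cleaned = re.sub(r"\\r", "", cleaned)     # literal backslash followed by 'r'
    cleaned = re.sub(r"\\n", " | ", cleaned)  # literal backslash followed by 'n'
    return cleaned


def split_by_event_code(rows, event_codes):
    """Group cleaned rows by the event codes of interest; other rows are ignored."""
    by_code = {code: [] for code in event_codes}
    for row in rows:
        for code in event_codes:
            # re.search mirrors a "contains" match on the EventID column.
            if re.search(code, row["EventID"]):
                by_code[code].append({
                    "time": int(row["Time"]),
                    "eventcode": row["EventID"],
                    "raw": preprocess_raw(row["Raw"]),
                })
    return by_code


# Invented sample rows; the raw strings hold literal "\t", "\r", "\n" sequences.
sample_rows = [
    {"Time": "1", "EventID": "4624", "Raw": r"Logon Type:\t3\r\nAccount Name:\tUser1"},
    {"Time": "2", "EventID": "4688", "Raw": r"Process Name:\tcmd.exe"},
    {"Time": "3", "EventID": "4625", "Raw": r"Logon Type:\t2\r\nAccount Name:\tUser2"},
]

by_code = split_by_event_code(sample_rows, ["4624", "4625"])
print(by_code["4624"][0]["raw"])
```

Note that the match is a "contains" check, so an event code pattern also matches any longer EventID string that includes it.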


### Run Parsing Data Pipeline

In this instance, we focus on parsing raw Windows Event Logs with event codes [4624](https://www.ultimatewindowssecurity.com/securitylog/encyclopedia/event.aspx?eventid=4624) (an account was successfully logged on) and [4625](https://www.ultimatewindowssecurity.com/securitylog/encyclopedia/event.aspx?eventid=4625) (an account failed to log on).

In [None]:
!mkdir -p ../../../data/input/lanl
!if [ ! -f ../../../data/input/lanl/wls.csv ]; then tar -xzvf ../../../data/lanl/wls.tar.gz -C ../../../data/input/lanl; fi

In [None]:
#raw lanl data parsing.
AUTH_INPUT_PATH = '../../../data/input/lanl/wls.csv'
REGEX_CONF_PATH = 'conf/lanl_regex_configs'
EVENT_CODES_OF_INTEREST = ['4624','4625']
REQUIRED_COLS = ['Time','EventID','Raw']
DELIMITER = ','

logs_gddf = dask_cudf.read_csv(AUTH_INPUT_PATH,
                               names=INPUT_COLS_SET,
                               delimiter=DELIMITER,
                               usecols=REQUIRED_COLS,
                               dtype=INPUT_DTYPES,
                               skip_blank_lines=True,
                              )
logs_gddf.head()

In [None]:
logs_gddf.dtypes

In [None]:
REGEX_MAPPINGS = create_regex_dictionaries(REGEX_CONF_PATH)

parts = [dask.delayed(pipeline)(x, EVENT_CODES_OF_INTEREST, REGEX_MAPPINGS) for x in logs_gddf.to_delayed()]
temp_df = dask_cudf.from_delayed(parts)
# Bring data back to a single GPU, for downstream graph analytics
auth_gdf = temp_df.compute()
print(auth_gdf.shape)

### Read edge definitions from JSON file

Now that the Windows Event Logs have been parsed, we prepare for the network mapping portion. The edge definitions configuration file defines each edge by indicating its source and destination, referencing the column names of our input data.

Further below we also read in the netflow data.

In [None]:
filename = 'conf/edge-definitions.json'
with open(filename) as f:
    edge_defs = json.load(f)
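
For orientation, the sketch below shows one plausible shape for an entry in the edge definitions file. The key names are the ones the notebook reads later (`dataSource`, `timeCol`, `stringCols`, `edges`, `srcCol`, `dstCol`, `srcNodeType`, `dstNodeType`, `relationship`, and the optional `filters`), but every value is an assumption made up for illustration, and the real `conf/edge-definitions.json` may differ. The `validate_edge_defs` helper is hypothetical.

```python
import json

# Hypothetical edge definition; values are illustrative assumptions only.
sample_edge_defs_json = """
[
  {
    "dataSource": "lanl_auth",
    "timeCol": "time",
    "stringCols": ["eventcode"],
    "edges": [
      {
        "srcCol": "new_logon_account_name",
        "dstCol": "network_information_workstation_name",
        "srcNodeType": "account",
        "dstNodeType": "address",
        "relationship": "logon",
        "filters": [{"key": "eventcode", "value": "4624"}]
      }
    ]
  }
]
"""

REQUIRED_DS_KEYS = {"dataSource", "timeCol", "stringCols", "edges"}
REQUIRED_EDGE_KEYS = {"srcCol", "dstCol", "srcNodeType", "dstNodeType", "relationship"}


def validate_edge_defs(edge_defs):
    """Return a list of human-readable problems; an empty list means no problems found."""
    problems = []
    for i, ds in enumerate(edge_defs):
        for key in sorted(REQUIRED_DS_KEYS - ds.keys()):
            problems.append(f"data source {i}: missing '{key}'")
        for j, edge in enumerate(ds.get("edges", [])):
            for key in sorted(REQUIRED_EDGE_KEYS - edge.keys()):
                problems.append(f"data source {i}, edge {j}: missing '{key}'")
    return problems


sample_edge_defs = json.loads(sample_edge_defs_json)
print(validate_edge_defs(sample_edge_defs))
```

Checking the configuration up front like this can surface a missing key before the edge-building loop fails partway through.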

### Build network mapping edge list

This helper function determines the data type of each column from the CSV header, so that string columns in particular are read with the correct type.

In [None]:
def get_dtypes(fn, delim, floats, strings):
    with open(fn, errors='replace') as fp:
        header = fp.readline().strip()
    
    types = []
    for col in header.split(delim):
        if 'date' in col:
            types.append((col, 'date'))
        elif col in floats:
            types.append((col, 'float64'))
        elif col in strings:
            types.append((col, 'str'))
        else:
            types.append((col, 'int32'))

    return OrderedDict(types)
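
To illustrate the mapping rule used by `get_dtypes`, the sketch below applies the same logic to a header string instead of a file. The header shown is an assumed netflow-style header for demonstration purposes only (the notebook itself only names `SrcDevice` and `DstDevice`), and `dtypes_from_header` is a hypothetical variant of the function above.

```python
from collections import OrderedDict


def dtypes_from_header(header, delim, floats, strings):
    """Map each column name in a header line to a dtype string.

    Same rule order as get_dtypes: names containing 'date' first,
    then listed floats, then listed strings, otherwise int32.
    """
    types = []
    for col in header.strip().split(delim):
        if "date" in col:
            types.append((col, "date"))
        elif col in floats:
            types.append((col, "float64"))
        elif col in strings:
            types.append((col, "str"))
        else:
            types.append((col, "int32"))
    return OrderedDict(types)


# Assumed header, for illustration only.
sample_header = "Time,Duration,SrcDevice,DstDevice,Protocol\n"
sample_dtypes = dtypes_from_header(
    sample_header, ",", floats=[], strings=["SrcDevice", "DstDevice"]
)
print(list(sample_dtypes.items()))
```

Any column not listed as a float or string falls through to `int32`, so columns holding non-numeric values need to be named in `strings`.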

### Read in Netflow Data

The netflow data is also provided by [Los Alamos National Laboratory](https://csr.lanl.gov/data/2017.html).

In [None]:
!mkdir -p ../../../data/input/lanl
!if [ ! -f ../../../data/input/lanl/netflow.csv ]; then tar -xzvf ../../../data/lanl/netflow.tar.gz -C ../../../data/input/lanl; fi

In [None]:
flow_input_path = '../../../data/input/lanl/netflow.csv'
dtypes_data_processed = get_dtypes(flow_input_path, ',', floats=[], strings=["SrcDevice", "DstDevice"])   
flow_gdf = cudf.io.csv.read_csv(flow_input_path, delimiter=',', names=list(dtypes_data_processed), 
                                       dtype=list(dtypes_data_processed.values()), skiprows=1)

Create a dictionary to reference both the auth data (parsed Windows Event Logs) and the netflow data.

In [None]:
ds_gdfs = {
           'lanl_auth': auth_gdf, 
           'lanl_flow': flow_gdf
          }

### Build edges dataframe

In the cell below, reference each datasource and its corresponding edge configuration to build a new dataframe containing edges. This dataframe will notably contain `srcCol` and `dstCol` along with other reference data.

In [None]:
edges_gdf = None

for ds in edge_defs:
    
    ds_gdf = ds_gdfs[ds['dataSource']]
    
    for e in ds["edges"]:
        
        evtCols = ds["stringCols"].copy()
        
        evtCols.append(e["srcCol"])
        evtCols.append(e["dstCol"])
        evtCols.append(ds["timeCol"])
        if 'filters' in e:
            for f in e['filters']:
                evtCols.append(f['key'])
        evtCols = list(set(evtCols))
        eventsDF = ds_gdf
        eventsDF = eventsDF[evtCols]
        
        # Apply filters indicated in the edge configuration file
        if 'filters' in e:
            for f in e['filters']:
                eventsDF = eventsDF[eventsDF[f['key']].str.contains(f['value']) == True]
        
        # Remove any None values
        src_idx = eventsDF[e['srcCol']].str.contains("None")
        if len(eventsDF[src_idx])>0:
            eventsDF = eventsDF[src_idx==False]
        
        dst_idx = eventsDF[e['dstCol']].str.contains("None")
        if len(eventsDF[dst_idx])>0:
            eventsDF = eventsDF[dst_idx==False]        
                
        evt_edges_gdf = cudf.DataFrame()
        evt_edges_gdf['src'] = eventsDF[e["srcCol"]]
        evt_edges_gdf['dst'] = eventsDF[e["dstCol"]]
        
        # Adjust time to recent date (LANL data source begins at 1 second)
        evt_edges_gdf['time'] = eventsDF[ds["timeCol"]]+1442131200
        
        evt_edges_gdf['src_node_type'] = e["srcNodeType"]
        evt_edges_gdf['dst_node_type'] = e["dstNodeType"]
        evt_edges_gdf['relationship'] = e["relationship"]
        evt_edges_gdf['data_source'] = ds["dataSource"]
        
        if edges_gdf is None:
            edges_gdf = evt_edges_gdf
        else:
            edges_gdf = cudf.concat([edges_gdf, evt_edges_gdf])
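
The loop above can be easier to follow on a tiny CPU example. The sketch below applies one edge definition to a list of row dictionaries: it filters, skips rows with "None" endpoints, and shifts the time column by the same constant. The rows, column names, and the `build_edges` helper are hypothetical, and plain substring matching stands in for the regex-based `str.contains` used in the notebook.

```python
from datetime import datetime, timezone

TIME_OFFSET = 1442131200  # same constant the notebook adds to the time column


def build_edges(rows, data_source, time_col, edge):
    """Return edge dicts for one edge definition applied to a list of row dicts."""
    edges = []
    for row in rows:
        # Keep the row only if every filter value occurs in the filtered column.
        if not all(f["value"] in row[f["key"]] for f in edge.get("filters", [])):
            continue
        src, dst = row[edge["srcCol"]], row[edge["dstCol"]]
        # Skip rows whose source or destination contains the string "None".
        if "None" in src or "None" in dst:
            continue
        edges.append({
            "src": src,
            "dst": dst,
            "time": row[time_col] + TIME_OFFSET,
            "src_node_type": edge["srcNodeType"],
            "dst_node_type": edge["dstNodeType"],
            "relationship": edge["relationship"],
            "data_source": data_source,
        })
    return edges


# Hypothetical edge definition and rows, for illustration only.
sample_edge = {
    "srcCol": "user", "dstCol": "host",
    "srcNodeType": "account", "dstNodeType": "address",
    "relationship": "logon",
    "filters": [{"key": "eventcode", "value": "4624"}],
}
sample_rows = [
    {"time": 1, "eventcode": "4624", "user": "user1", "host": "comp1"},
    {"time": 2, "eventcode": "4625", "user": "user2", "host": "comp1"},  # fails the filter
    {"time": 3, "eventcode": "4624", "user": "None", "host": "comp2"},   # "None" source
]

sample_edges = build_edges(sample_rows, "lanl_auth", "time", sample_edge)
first_time = datetime.fromtimestamp(sample_edges[0]["time"], tz=timezone.utc)
print(sample_edges)
print(first_time.isoformat())
```

Interpreted as a Unix timestamp, the offset places the start of the LANL data in September 2015 (UTC).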

Use pandas to drop duplicates, as this is not yet available in cuDF for string columns.

In [None]:
edges_pd = edges_gdf.to_pandas().drop_duplicates()
edges_gdf = cudf.DataFrame.from_pandas(edges_pd)

### Create node list and assign numeric ids

Now that we have `edges_gdf`, we can prepare the data for cuGraph by assigning contiguous IDs to the nodes and edges. cuGraph requires that all edges and nodes be identified using contiguous IDs.

In [None]:
src_nodes_pd = edges_pd[['src', 'src_node_type']].rename(columns={"src": "id", "src_node_type": "node_type"}).drop_duplicates()
dst_nodes = edges_pd[['dst', 'dst_node_type']].rename(columns={"dst": "id", "dst_node_type": "node_type"}).drop_duplicates()
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
all_nodes_pd = pd.concat([src_nodes_pd, dst_nodes]).drop_duplicates()
all_nodes_gdf = cudf.DataFrame.from_pandas(all_nodes_pd)

In [None]:
# Assign contiguous id's to nodes for cugraph
idx = np.arange(len(all_nodes_gdf))
all_nodes_gdf['idx'] = idx
idmap_gdf = cudf.DataFrame({'id': all_nodes_gdf['id'], 'idx': idx})

### Add numeric src and dst node ids to edge list

In [None]:
# Add contiguous src id's to edges
edges_gdf['id'] = edges_gdf['src']
edges_gdf = edges_gdf.merge(idmap_gdf, on=['id'])
edges_gdf['src_idx'] = edges_gdf['idx']
edges_gdf = edges_gdf.drop(['id', 'idx'])

In [None]:
# Add contiguous dst id's to edges
edges_gdf['id'] = edges_gdf['dst']
edges_gdf = edges_gdf.merge(idmap_gdf, on=['id'])
edges_gdf['dst_idx'] = edges_gdf['idx']
edges_gdf = edges_gdf.drop(['id', 'idx'])
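
The ID assignment and the two merges above amount to building a name-to-index lookup and applying it to both ends of every edge. A minimal pure-Python sketch of that idea follows. The sample edges are invented, and IDs are handed out in first-seen order, which is an assumption that need not match the ordering produced by the notebook.

```python
# Invented string-labelled edges: (source, destination)
sample_edges = [
    ("user1", "comp1"),
    ("user2", "comp1"),
    ("user1", "comp2"),
]

# Assign each distinct node name the next unused integer, starting at 0.
node_ids = {}
for src, dst in sample_edges:
    for name in (src, dst):
        node_ids.setdefault(name, len(node_ids))

# Translate every edge into a pair of contiguous integer IDs.
indexed_edges = [(node_ids[src], node_ids[dst]) for src, dst in sample_edges]

print(node_ids)
print(indexed_edges)
```

Because every new name receives `len(node_ids)` at the moment it is first seen, the resulting IDs are guaranteed to cover `0 .. n-1` without gaps.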

### Create input edge list for cuGraph

In [None]:
cg_edges_gdf = edges_gdf[['src_idx', 'dst_idx']]

In [None]:
cg_edges_gdf['src_idx'] = cg_edges_gdf['src_idx'].astype('int32')
cg_edges_gdf['dst_idx'] = cg_edges_gdf['dst_idx'].astype('int32')

### Run cuGraph PageRank

Next we create our graph and run PageRank.

In [None]:
G = cugraph.Graph()
G.add_edge_list(cg_edges_gdf['src_idx'], cg_edges_gdf['dst_idx'], None)

In [None]:
%time pr_gdf = cugraph.pagerank(G, alpha=0.85, max_iter=500, tol=1.0e-05)

In [None]:
print(pr_gdf)
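
For intuition about what the `alpha`, `max_iter`, and `tol` arguments control, here is a small pure-Python power-iteration sketch of PageRank. It is a textbook formulation written for illustration and is not cuGraph's implementation; details such as how dangling nodes are handled and how convergence is measured are assumptions and may differ from cuGraph. The function name `pagerank_sketch` is hypothetical.

```python
def pagerank_sketch(edges, num_nodes, alpha=0.85, max_iter=500, tol=1.0e-05):
    """Power-iteration PageRank over integer node IDs 0..num_nodes-1."""
    out_links = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        out_links[src].append(dst)

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[v] for v in range(num_nodes) if not out_links[v])
        base = (1.0 - alpha) / num_nodes + alpha * dangling / num_nodes
        new_rank = [base] * num_nodes
        for v in range(num_nodes):
            if out_links[v]:
                share = alpha * rank[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_rank[dst] += share
        # Stop once the total change between iterations falls below tol.
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


# Invented edge list using contiguous integer IDs.
sample_edges = [(0, 1), (2, 1), (0, 3)]
scores = pagerank_sketch(sample_edges, num_nodes=4)
print(scores)
```

Here `alpha` is the damping factor, `max_iter` caps the number of iterations, and `tol` is the convergence threshold. In the sample graph, node 1 receives links from two nodes and therefore ends up with the highest score.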

### Add PageRank scores to node list

In [None]:
pr_gdf['idx'] = pr_gdf['vertex'].astype('int64')
all_nodes_gdf = all_nodes_gdf.merge(pr_gdf, on=['idx'], how='left')
all_nodes_gdf = all_nodes_gdf.drop(['vertex'])

### Graphistry Viz

We use [Graphistry](https://www.graphistry.com/) to visualize the network mapping graph. A snapshot of the graph constructed from this notebook is provided below. To generate it yourself, you'll need to register an account with Graphistry and configure your key in the code below.

![Network Mapping Graphistry Image 1](images/graphistry1.png "Network Mapping using Graphistry")

Zoom in to search for interesting subgraphs.

![Network Mapping Graphistry Image 2](images/graphistry2.png "Network Mapping using Graphistry")

In [None]:
import graphistry

In [None]:
# Register Graphistry key
# A graphistry instance is required to proceed. Please enter your own graphistry key and server information in the line below.
# Please visit https://www.graphistry.com/ for more information on Graphistry.
graphistry.register(key='', 
                    protocol='http', server='')

In [None]:
g_edges_pd = edges_gdf.to_pandas()
g_edges_pd = g_edges_pd.drop(columns=['dst_idx', 'dst_node_type', 'src_idx', 'src_node_type'])

In [None]:
g_nodes_pd = all_nodes_gdf.to_pandas()
acct_nodes_pd = g_nodes_pd[g_nodes_pd['node_type']=='account'].assign(color=228004, icon="user")
addr_nodes_pd = g_nodes_pd[g_nodes_pd['node_type']=='address'].assign(color=228010, icon="desktop")
g_nodes_pd = pd.concat([acct_nodes_pd, addr_nodes_pd])

In [None]:
g = graphistry.edges(g_edges_pd) \
                .bind(source='src', destination='dst')

In [None]:
g.nodes(g_nodes_pd).bind(node='id', point_color='color').plot()