# cyBERT: A Flexible Cyber Log Parser Based on the BERT Language Model

## Table of Contents
* Introduction
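* Generating Labels For Our Training Dataset
* Subword Labeling
* Training and Validation Datasets
* Fine-tuning A Pretrained BERT Model
* Model Evaluation
* Saving Model Files for Future Parsing with cyBERT
* Inference with cyBERT
* Conclusions and Wrap-Up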

## Introduction

One of the most arduous tasks of any security operation (and equally time-consuming for a data scientist) is ETL and parsing. This notebook illustrates how to train a [BERT language model](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270) using a dataset of 1,000 Apache server logs as labeled data. We fine-tune a pretrained BERT model from [HuggingFace](https://github.com/huggingface) with a classification layer for Named Entity Recognition (NER).

We're calling this approach cyBERT - a nod to its cybersecurity application using a BERT-based language model. If you're interested in learning more about our motivation for creating cyBERT, we have a [blog post](https://medium.com/rapids-ai/cybert-28b35a4c81c4) that goes into further detail. You can also learn more about cyBERT and all of our cybersecurity work as part of RAPIDS on our [CLX GitHub repo](https://github.com/rapidsai/clx/).

The diagram below illustrates the cyBERT workflow for fine-tuning/training (upper portion) and inference (lower portion).

![cyBERT Workflow](resources/cybert_workflow.png)

Let's get started on the notebook, making sure we have a GPU and doing some necessary imports.

In [None]:
!nvidia-smi

In [None]:
import torch
from transformers import BertForTokenClassification
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.dataset import random_split
from torch.utils.dlpack import from_dlpack
import torch.nn.functional as F
from seqeval.metrics import classification_report,accuracy_score,f1_score

from tqdm import tqdm,trange
from collections import defaultdict

import pandas as pd
import numpy as np

import cupy
import cudf

import pickle
import os
import s3fs

We can also configure directories for our training data, necessary resources, and pre-trained models.

In [None]:
DATA_DIR = os.environ.get('DATA_DIR', './training_data')
RESOURCES_DIR = os.environ.get('RESOURCES_DIR', './resources')
MODELS_DIR = os.environ.get('MODELS_DIR', './models')

And we'll configure specific files we need throughout this example.

In [None]:
# file paths for the step-by-step example
TRAIN_DATA_PATH = os.path.join(DATA_DIR, 'apache_sample_1k.csv')
BERT_BASE_CASED_HASH_PATH = os.path.join(RESOURCES_DIR, 'bert-base-cased-hash.txt')
APACHE_CASED_MODEL_PATH = os.path.join(MODELS_DIR, 'apache_cased_example.pth')
APACHE_CASED_LABELMAP_PATH = os.path.join(MODELS_DIR, 'apache_label_map_example.txt')

# file paths for the larger model example
INPUT_MODEL_PATH = os.path.join(MODELS_DIR, 'apache_cased.pth')
INPUT_LABELMAP_PATH = os.path.join(MODELS_DIR, 'apache_label_map.txt')
INPUT_BERT_CASED_VOCAB_PATH = os.path.join(RESOURCES_DIR, 'bert-base-cased-vocab.txt')
INPUT_BERT_BASE_CASED_HASH_PATH = os.path.join(RESOURCES_DIR, 'bert-base-cased-hash.txt')

Some of the models we'll use are too large for a GitHub repo, but we can download them separately.

In [None]:
# configure paths for models
S3_BASE_PATH = "rapidsai-data/cyber/kdd2020/cybert/"

# specific model file for sample data
APACHE_CASED_MODEL_S3 = "apache_cased_example.pth"
INPUT_MODEL_S3 = "apache_cased.pth"

# download the example model file if it doesn't exist
# Note that this is ONLY for troubleshooting. We generate this model as part of the exercise. Only
# uncomment this code if something isn't working.
# if not os.path.exists(APACHE_CASED_MODEL_PATH):
#     fs = s3fs.S3FileSystem(anon=True)
#     print(">> '" + APACHE_CASED_MODEL_S3 + "' was not found, downloading now")
#     fs.get(S3_BASE_PATH + APACHE_CASED_MODEL_S3, APACHE_CASED_MODEL_PATH)
# else:
#     print(">> '" + APACHE_CASED_MODEL_S3 + "' was found at: " + APACHE_CASED_MODEL_PATH)
    
# download the larger model file (used later for inference) if it doesn't exist
if not os.path.exists(INPUT_MODEL_PATH):
    fs = s3fs.S3FileSystem(anon=True)
    print(">> '" + INPUT_MODEL_S3 + "' was not found, downloading now")
    fs.get(S3_BASE_PATH + INPUT_MODEL_S3, INPUT_MODEL_PATH)
else:
    print(">> '" + INPUT_MODEL_S3 + "' was found at: " + INPUT_MODEL_PATH)

We're finally ready to get started!

## Generating Labels For Our Training Dataset

To train our model, we begin with a dataframe containing parsed logs and an additional `raw` column containing the whole raw log as a string. We will use the column names as our labels. For completeness, we're presenting a method to label raw data here in the notebook. In real-world scenarios, you likely already have these labels somewhere in your SIEM and can do a simple query to extract them. It's unlikely that, in a production setting, you'd have to spend time manually generating labels like this. As with any deep learning task, the goal is to acquire enough target/training data (in this case, parsed logs with labeled entities) to successfully train a more robust model.

In [None]:
logs_df = cudf.read_csv(TRAIN_DATA_PATH)

We can inspect what a typical Apache web log looks like and print an example raw log.

In [None]:
logs_df.sample(1)

In [None]:
# sample raw log
print(logs_df.raw.loc[10])

In [None]:
def labeler(index_no, cols):
    """
    label the words in the raw log with the column name from the parsed log
    """
    raw_split = logs_df.raw_preprocess[index_no].split()
    
    # words in raw but not in parsed logs labeled as 'other'
    label_list = ['other'] * len(raw_split) 
    
    # for each parsed column find the location of the sequence of words (sublist) in the raw log
    for col in cols:
        if str(logs_df[col][index_no]) not in {'','-','None','NaN'}:
            sublist = str(logs_df[col][index_no]).split()
            sublist_len=len(sublist)
            match_count = 0
            for ind in (i for i,el in enumerate(raw_split) if el==sublist[0]):
                # words in raw log not present in the parsed log will be labeled with 'other'
                if (match_count < 1) and (raw_split[ind:ind+sublist_len]==sublist) and (label_list[ind:ind+sublist_len] == ['other'] * sublist_len):
                    label_list[ind:ind+sublist_len] = [col] * sublist_len
                    match_count = 1
    return label_list
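
To make the matching rule in `labeler` concrete, here is a minimal CPU-only sketch of the same idea applied to one hand-written log line. The function name `label_words` and the column names in `parsed` are hypothetical stand-ins chosen for illustration; in the notebook, the column names come from the CSV.

```python
def label_words(raw, parsed):
    """Label each whitespace-separated word of `raw` with a parsed column name."""
    words = raw.replace('"', '').split()
    labels = ['other'] * len(words)
    for col, value in parsed.items():
        # skip empty or placeholder values, as labeler() does
        if value in {'', '-', 'None', 'NaN'}:
            continue
        target = value.split()
        n = len(target)
        for i in range(len(words) - n + 1):
            # claim the first still-unlabeled span that matches the parsed value
            if words[i:i + n] == target and labels[i:i + n] == ['other'] * n:
                labels[i:i + n] = [col] * n
                break
    return labels


# hypothetical log line and parsed fields, for illustration only
raw = '1.2.3.4 - - [12/Dec/2015:18:25:11 +0100] "GET /index.html HTTP/1.1" 200 4263'
parsed = {'remote_host': '1.2.3.4', 'request_method': 'GET',
          'request_url': '/index.html', 'status': '200', 'bytes': '4263'}

for word, label in zip(raw.replace('"', '').split(), label_words(raw, parsed)):
    print(f'{label:>15}  {word}')
```

Words that do not appear in any parsed column, such as the timestamp pieces here, keep the `other` label.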

As of the publication of this notebook, RAPIDS does not support strings in UDFs (you can track the status [here](https://github.com/rapidsai/cudf/issues/1195)). We can extract the labels using a very slow loop. To save time for the purposes of this tutorial, we've pre-computed the labels. As mentioned earlier, in a production environment, these labels and the associated training data would be gathered via an organization's security environment. The code to generate them in this notebook is left for transparency.

In [None]:
logs_df['raw_preprocess'] = logs_df.raw.str.replace('"','')

# column names to use as labels
cols = logs_df.columns.values.tolist()

# do not use raw columns as labels
cols.remove('raw')
cols.remove('raw_preprocess')

# using a for loop for the labeling function until RAPIDS supports string UDFs - it is currently slow
# uncomment the code below to run the label extraction yourself (not necessary for the tutorial)
# labels = []
# for indx in range(len(logs_df)):
#     labels.append(labeler(indx, cols))

# to save time, we've pre-computed the labels and import them below
with open(os.path.join(MODELS_DIR, 'apache_cased_example_labels.p'), 'rb') as f:
    labels = pickle.load(f)

We can inspect a few of the labels we extracted.

In [None]:
print(labels[10])

## Subword Labeling

We are using the `bert-base-cased` tokenizer vocabulary. This tokenizer further splits our whitespace-separated words into in-dictionary subword pieces. The model eventually uses the label from the first piece of a word as the sole label for the word, so we do not care about the model's ability to predict individual labels for the subword pieces. For training, the label used for these pieces is `X`. To learn more, see the [BERT paper](https://arxiv.org/abs/1810.04805).

The tokenization step is critical to the pipeline, and it's an area that has historically performed slowly on CPUs. For that reason, we created a [GPU-based subword tokenizer](https://medium.com/rapids-ai/preprocess-your-training-data-at-lightspeed-with-our-gpu-based-tokenizer-for-bert-language-models-561cf9c46e15) that can easily be used for any BERT and BERT-based tokenization preprocessing. Compared with popular CPU-based tokenizers, the GPU subword tokenizer is up to 271 times faster.

Here we create a function `subword_labeler` that contains all of the logic necessary to flag subword pieces in a tokenized log with an `X`.

In [None]:
def subword_labeler(log_list, label_list):
    """
    label all subword pieces in tokenized log with an 'X'
    """
    subword_labels = []
    for log, tags in zip(log_list,label_list):
        temp_tags = []
        words = cudf.Series(log.split())
        words_size = len(words)
        subword_counts = words.str.subword_tokenize(BERT_BASE_CASED_HASH_PATH, 10000, 10000,\
                                                    max_num_strings=words_size,max_num_chars=10000,\
                                                    max_rows_tensor=words_size,\
                                                    do_lower=False, do_truncate=False)[2].reshape(words_size, 3)[:,2]
        for i, tag in enumerate(tags):
            temp_tags.append(tag)
            temp_tags.extend('X'* subword_counts[i].item())
        subword_labels.append(temp_tags)
    return subword_labels

We call the `subword_labeler` to perform that action. Notice that the function uses the [`subword_tokenize` function](https://docs.rapids.ai/api/cudf/nightly/api.html?highlight=subword#cudf.core.column.string.StringMethods.subword_tokenize) of cuDF.

In [None]:
subword_labels = subword_labeler(logs_df['raw_preprocess'].to_arrow().to_pylist(), labels)

In [None]:
print(subword_labels[10])
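
The expansion step inside `subword_labeler` can be illustrated without a GPU. The sketch below assumes we already know, for each word, how many additional subword pieces follow its first piece; the helper name `expand_with_x` and the counts are made up for illustration and are not produced by a real tokenizer.

```python
def expand_with_x(tags, extra_pieces):
    """Keep each word's tag on its first piece and tag every further piece 'X'."""
    expanded = []
    for tag, extra in zip(tags, extra_pieces):
        expanded.append(tag)
        expanded.extend(['X'] * extra)
    return expanded


# hypothetical word-level tags and per-word counts of additional subword pieces
word_tags = ['remote_host', 'other', 'request_method']
extra_counts = [3, 0, 0]
print(expand_with_x(word_tags, extra_counts))
```

The output has one label per subword token, which is the shape the model expects for its `labels` input.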

The next step is to create a list of all unique labels in our dataset, then add `X` for wordpiece tokens we will not have tags for and `[PAD]` for logs shorter than the model's input length.

In [None]:
# set of labels
label_values = list(set(x for l in labels for x in l))

label_values[:0] = ['[PAD]']  
label_values.append('X')

# dict mapping each tag name to its integer id
label2idx = {t: i for i, t in enumerate(label_values)}

We can inspect the label-to-index mapping we created.

In [None]:
print(label2idx)

In [None]:
def pad(l, content, width):
    l.extend([content] * (width - len(l)))
    return l

In [None]:
padded_labels = [pad(x[:256], '[PAD]', 256) for x in subword_labels]
int_labels = [[label2idx.get(l) for l in lab] for lab in padded_labels]

We have the labels that exist on the CPU (when RAPIDS supports string UDFs, we can move this whole process to the GPU), so we'll move them to a tensor and to the GPU for further processing.

In [None]:
label_tensor = torch.tensor(int_labels).to('cuda')
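
As a small worked example of the mapping and padding steps, the sketch below builds a label-to-index map and encodes one label sequence to a fixed width. The helper names `build_label_map` and `encode` are hypothetical, and the labels are sorted here only to make the output reproducible; the notebook itself uses whatever order `set` produces.

```python
def build_label_map(label_lists):
    """Index 0 is '[PAD]', then the unique word labels, then 'X' last."""
    unique = sorted({tag for tags in label_lists for tag in tags})
    values = ['[PAD]'] + unique + ['X']
    return {tag: i for i, tag in enumerate(values)}


def encode(tags, label2idx, width):
    """Truncate or pad a label sequence to `width`, then map labels to ints."""
    padded = tags[:width] + ['[PAD]'] * max(0, width - len(tags))
    return [label2idx[tag] for tag in padded]


toy_map = build_label_map([['remote_host', 'other'], ['status']])
print(toy_map)
print(encode(['remote_host', 'X', 'other'], toy_map, width=5))
```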

## Training and Validation Datasets

Like any other good ML/DL modeling project, we need to generate robust training and validation datasets. Our datasets need three specific features:

1. `input_ids` - subword tokens as integers padded to the specific length of the model,
2. `attention_mask` - a binary mask that allows the model to ignore padding, and
3. `labels` - corresponding labels for tokens as integers. 
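
To show how the three features line up position by position, here is a pure-Python sketch on one short sequence. The helper name `make_features`, the token ids, and the pad id of 0 are assumptions for illustration; real token ids come from the tokenizer vocabulary.

```python
def make_features(token_ids, label_ids, width, pad_id=0):
    """Pad (or truncate) tokens and labels to `width` and build the matching mask."""
    n = min(len(token_ids), width)
    input_ids = token_ids[:n] + [pad_id] * (width - n)
    attention_mask = [1] * n + [0] * (width - n)   # 1 = real token, 0 = padding
    labels = label_ids[:n] + [pad_id] * (width - n)
    return input_ids, attention_mask, labels


# hypothetical ids for a three-token log fragment, padded to width 6
ids, mask, labs = make_features([7592, 1012, 2088], [2, 4, 1], width=6)
print(ids)
print(mask)
print(labs)
```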

This is easy to do on the GPU because of RAPIDS, CuPy, DLPack, and the GPU subword tokenizer. We define the function `bert_cased_tokenizer` to handle the conversion of a cuDF series object to two Torch tensors - one tensor with token IDs and another tensor that represents an attention mask.

In [None]:
def bert_cased_tokenizer(strings):
    """
    converts cudf.Series of strings to two torch tensors - token ids and attention mask with padding
    """    
    num_strings = len(strings)
    num_bytes = strings.str.byte_count().sum()
    token_ids, mask = strings.str.subword_tokenize(BERT_BASE_CASED_HASH_PATH, 256, 256,
                                                            max_num_strings=num_strings,
                                                            max_num_chars=num_bytes,
                                                            max_rows_tensor=num_strings,
                                                            do_lower=False, do_truncate=True)[:2]
    # convert from cupy to torch tensor using dlpack
    input_ids = from_dlpack(token_ids.reshape(num_strings,256).astype(cupy.float).toDlpack())
    attention_mask = from_dlpack(mask.reshape(num_strings,256).astype(cupy.float).toDlpack())
    return input_ids.type(torch.long), attention_mask.type(torch.long)

In [None]:
input_ids, attention_masks = bert_cased_tokenizer(logs_df.raw_preprocess)

We can finally create a consolidated dataset.

In [None]:
dataset = TensorDataset(input_ids, attention_masks, label_tensor)

PyTorch includes the `random_split` function, and we'll utilize it to create training (80%) and validation (20%) datasets.

In [None]:
dataset_size = len(input_ids)
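# compute the validation size as the remainder so the two sizes always sum to dataset_size
train_size = int(dataset_size * .8)
training_dataset, validation_dataset = random_split(dataset, (train_size, dataset_size - train_size))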

The last step here is to create training and validation dataloaders in PyTorch.

In [None]:
train_dataloader = DataLoader(dataset=training_dataset, shuffle=True, batch_size=32)
val_dataloader = DataLoader(dataset=validation_dataset, shuffle=False, batch_size=1)

## Fine-tuning A Pretrained BERT Model

There are a number of different models we could use, and we could always take the step to train a BERT model from scratch ourselves. After some internal experimentation, we found that a pretrained BERT model works fairly well for most tasks. There are some slight advantages to training a transformer model from scratch using a combination of natural language and cybersecurity-specific training data, but that is outside the scope for this tutorial.

We'll make use of a BERT cased model for now, and we pass it the number of labels that we generated earlier.

In [None]:
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(label2idx))

The pretrained model isn't on the GPU, so be sure to move it there.

In [None]:
model.cuda();

We need to define an optimizer and a learning rate for the training process. There's a lot to experiment with here. Since the purpose of this tutorial isn't to go deep into transformer architecture but rather to show how RAPIDS and PyTorch can easily integrate, we'll use some values we've found are useful during our experiments.

In [None]:
FULL_FINETUNING = True
if FULL_FINETUNING:
    #fine tune all layer parameters
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    # torch.optim.Adam reads the 'weight_decay' key; unrecognized keys are silently ignored
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}
    ]
else:
    # only fine tune classifier parameters
    param_optimizer = list(model.classifier.named_parameters()) 
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = Adam(optimizer_grouped_parameters, lr=3e-5)
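
The grouping in the `FULL_FINETUNING` branch is a substring test on parameter names. The sketch below applies the same test to a few illustrative names (written in the style of Hugging Face BERT parameter names, but not read from the model) so the resulting groups are easy to inspect; `split_by_decay` is a hypothetical helper.

```python
def split_by_decay(names, no_decay=('bias', 'gamma', 'beta')):
    """Split parameter names into (decayed, not decayed) using substring matching."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    skipped = [n for n in names if any(nd in n for nd in no_decay)]
    return decayed, skipped


# illustrative parameter names, not read from an actual model
names = [
    'bert.encoder.layer.0.output.dense.weight',
    'bert.encoder.layer.0.output.dense.bias',
    'bert.embeddings.LayerNorm.weight',
    'classifier.bias',
]
decayed, skipped = split_by_decay(names)
print('weight decay applied:', decayed)
print('no weight decay:     ', skipped)
```

Note that a name such as `LayerNorm.weight` contains neither `gamma` nor `beta`, so under this rule it falls into the decayed group. Printing the groups for your own model is a quick way to confirm they match your intent.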

One of the goals here is to train a model to parse logs that it _hasn't seen before_. We need to be especially sensitive to overfitting our model to our training data. We can control this by limiting the number of training epochs. There's a bit of an art to this as well, and we'll use two epochs for this tutorial.

In [None]:
%%time

epochs = 2
max_grad_norm = 1.0

for _ in trange(epochs, desc="Epoch"):
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        b_input_ids, b_input_mask, b_labels = batch
        
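        # forward pass; index the output so this works whether the model
        # returns a tuple or a ModelOutput (newer transformers versions)
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]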
        
        # backward pass
        loss.backward()
        
        # track train loss
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
        
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=max_grad_norm)
        
        # update parameters
        optimizer.step()
        model.zero_grad()
        
    # print train loss per epoch
    print("Train loss: {}".format(tr_loss/nb_tr_steps))
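
The gradient clipping step in the loop rescales all gradients together when their combined norm exceeds `max_grad_norm`. As a rough illustration of that idea on plain Python floats (a sketch of clipping by global L2 norm, not PyTorch's implementation), with `clip_by_global_norm` as a hypothetical helper:

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)   # norm of the original gradients
print(clipped)       # same direction, rescaled to roughly unit norm

unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)     # already within the limit, so left as-is
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is limited.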

## Model Evaluation

We want to be able to evaluate our model and see how it's working. We'll utilize the validation dataset and corresponding validation dataloader (`val_dataloader`) we created earlier in the notebook. The code below walks through this, but essentially we're performing a NER prediction task and then matching that with the ground-truth label. This lets us calculate the precision, recall, and F1 score for every label/type.

In [None]:
# no dropout or batch norm during eval
model.eval();

In [None]:
# map index to label name
idx2label={label2idx[key] : key for key in label2idx.keys()}

eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
y_true = []
y_pred = []

for step, batch in enumerate(val_dataloader):
    input_ids, input_mask, label_ids = batch
        
    with torch.no_grad():
        outputs = model(input_ids, token_type_ids=None,
        attention_mask=input_mask,)
        
        # for eval mode, the first result of outputs is logits
        logits = outputs[0] 
        
    # get NER predicted result
    logits = torch.argmax(F.log_softmax(logits,dim=2),dim=2)
    logits = logits.detach().cpu().numpy()
    
    # get NER true result (the actual label)
    label_ids = label_ids.detach().cpu().numpy()
    
    # only positions with mask=1 are scored; mask=0 marks padding
    input_mask = input_mask.detach().cpu().numpy()
    
    # collect ground-truth and predicted labels for the scored positions
    for i,mask in enumerate(input_mask):
        
        # ground truth 
        temp_1 = []
        
        # prediction
        temp_2 = []
        
        for j, m in enumerate(mask):
            
            # mask=0 is padding, so stop at the first padded position
            if m:  # exclude the X and [PAD] labels
                if idx2label[label_ids[i][j]] != "X" and idx2label[label_ids[i][j]] != "[PAD]": 
                    temp_1.append(idx2label[label_ids[i][j]])
                    temp_2.append(idx2label[logits[i][j]])
            else:
                break      
        y_true.append(temp_1)
        y_pred.append(temp_2)

print("f1 score: %f"%(f1_score(y_true, y_pred)))
print("Accuracy score: %f"%(accuracy_score(y_true, y_pred)))

# per-label precision, recall, and F1 report
print(classification_report(y_true, y_pred,digits=4))
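
The filtering logic in the evaluation loop can be checked on a tiny hand-made example. The sketch below keeps only positions that are unmasked and whose true label is neither `X` nor `[PAD]`, then computes a simple token-level accuracy. The helper name `collect_pairs` and the toy ids are assumptions for illustration, and the accuracy here is computed by hand rather than with `seqeval`.

```python
def collect_pairs(true_ids, pred_ids, mask, idx2label, skip=('X', '[PAD]')):
    """Return (y_true, y_pred) label lists for the positions that should be scored."""
    y_true, y_pred = [], []
    for t, p, m in zip(true_ids, pred_ids, mask):
        if not m:
            break                      # padding starts here
        if idx2label[t] in skip:
            continue                   # subword pieces and pads are not scored
        y_true.append(idx2label[t])
        y_pred.append(idx2label[p])
    return y_true, y_pred


# hypothetical label map and one toy sequence
toy_idx2label = {0: '[PAD]', 1: 'remote_host', 2: 'status', 3: 'X'}
y_t, y_p = collect_pairs([1, 3, 3, 2, 0, 0], [1, 1, 3, 1, 0, 0],
                         [1, 1, 1, 1, 0, 0], toy_idx2label)
token_accuracy = sum(a == b for a, b in zip(y_t, y_p)) / len(y_t)
print(y_t, y_p, token_accuracy)
```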

## Saving Model Files for Future Parsing with cyBERT

Obviously we don't want to fine-tune (or train) a new cyBERT model constantly. We can save the model weights and the label mapping out for use later.

In [None]:
# model weights 
torch.save(model.state_dict(), APACHE_CASED_MODEL_PATH)

# label map, one label per line in index order
with open(APACHE_CASED_LABELMAP_PATH, mode='wt') as output:
    for label in idx2label.values():
        output.write(label + '\n')
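
The loader used in the inference section reads the label map back one line at a time, so the file needs exactly one label per line, in index order. A minimal round-trip sketch, using a temporary file and hypothetical helper names `save_label_map` and `load_label_map`:

```python
import os
import tempfile


def save_label_map(idx2label, path):
    """Write one label per line, ordered by integer index."""
    with open(path, 'wt') as f:
        for idx in sorted(idx2label):
            f.write(idx2label[idx] + '\n')


def load_label_map(path):
    """Read labels back, using the line number as the index."""
    with open(path) as f:
        return {i: line.strip() for i, line in enumerate(f)}


toy_labels = {0: '[PAD]', 1: 'remote_host', 2: 'status', 3: 'X'}
with tempfile.TemporaryDirectory() as tmp:
    label_path = os.path.join(tmp, 'label_map.txt')
    save_label_map(toy_labels, label_path)
    restored = load_label_map(label_path)
print(restored == toy_labels)
```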

## Inference with cyBERT

After creating a cyBERT model (or collection of models), you will likely want to use one or many of them for inference - actually parsing logs with the existing model. We already have this loaded since everything is in one notebook, but we'll include how to easily load this information back in.

We're going to take this time to load a better model file. To simplify this tutorial, we showed fine-tuning a cyBERT model on only 1,000 logs, which won't give great results. We have a pre-trained model that works on Apache logs ready to go, so we'll load that instead. It comes with its own label map file, and we'll use the same vocab and hash files (since we're still using a BERT cased pre-trained model). All of these files were configured at the beginning of the notebook (their variable names begin with `INPUT_`).

In [None]:
# load label map
label_map = {}
with open(INPUT_LABELMAP_PATH) as f:
    for index, line in enumerate(f):
        label_map[index] = line.split()[0]

# load vocab lookup
vocab_lookup = {}
with open(INPUT_BERT_CASED_VOCAB_PATH) as f:
        for index, line in enumerate(f):
            vocab_lookup[index] = line.split()[0]
            
# load model state dictionary from fine-tuning
model_state_dict = torch.load(INPUT_MODEL_PATH)

# load model
model = BertForTokenClassification.from_pretrained('bert-base-cased', state_dict=model_state_dict, num_labels=len(label_map))
model.cuda();
model.eval();

Much like our fine-tuning pipeline, we need a preprocessing step before we can run inference. We write it as a function so we can reuse it. Note that since we are not truncating our logs, we don't necessarily know how large the tensor needs to be. For now, we solve that by estimating the maximum number of rows in the tensor: we make some acceptable assumptions about the input logs and derive the value by counting and summing bytes.

In [None]:
def preprocess(raw_data_df):
    """transform string data into token ids and attention mask"""
    raw_data_df = raw_data_df.str.replace('"','')
    byte_count = raw_data_df.str.byte_count()
    max_num_chars = byte_count.sum()
    max_rows_tensor = int((byte_count/120).ceil().sum())
        
    input_ids, attention_mask, meta_data = raw_data_df.str.subword_tokenize(INPUT_BERT_BASE_CASED_HASH_PATH, 128, 116,
                                                                            max_num_strings=len(raw_data_df),\
                                                                            max_num_chars=max_num_chars,\
                                                                            max_rows_tensor=max_rows_tensor,\
                                                                            do_lower=False, do_truncate=False)
    num_rows = int(len(input_ids)/128)
    input_ids = from_dlpack((input_ids.reshape(num_rows,128).astype(cupy.float)).toDlpack())
    attention_mask = from_dlpack((attention_mask.reshape(num_rows,128).astype(cupy.float)).toDlpack())
    meta_data = meta_data.reshape(num_rows, 3)
            
    return input_ids.type(torch.long), attention_mask.type(torch.long), meta_data
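
Two pieces of arithmetic in `preprocess` are worth seeing on small numbers: the row estimate (one row per started 120 bytes of each log) and the way a long log is covered by overlapping windows of 128 tokens that advance by 116. The sketch below is a generic sliding-window illustration under those assumptions; the helper names are hypothetical, and the exact window layout produced by cuDF's tokenizer may differ.

```python
import math


def estimate_rows(byte_counts, bytes_per_row=120):
    """Upper-bound style estimate: each log needs ceil(bytes / bytes_per_row) rows."""
    return sum(math.ceil(b / bytes_per_row) for b in byte_counts)


def window_starts(n_tokens, max_len=128, stride=116):
    """Start offsets of overlapping windows needed to cover n_tokens tokens."""
    starts = [0]
    while starts[-1] + max_len < n_tokens:
        starts.append(starts[-1] + stride)
    return starts


print(estimate_rows([100, 121, 240]))   # 1 + 2 + 2 rows
print(window_starts(128))               # a single window is enough
print(window_starts(300))               # three windows, each overlapping the last by 12 tokens
```

The overlap gives tokens near a window boundary some context on both sides; the `start` and `stop` columns in `meta_data` are later used to trim the duplicated region.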

We also define an `inference` function to perform the actual parsing of the logs.

In [None]:
def inference(model, input_ids, attention_masks, meta_data):
    with torch.no_grad():
        logits = model(input_ids, attention_masks)[0]
    logits = F.softmax(logits, dim=2)
    confidences, labels = torch.max(logits,2)
    infer_pdf = pd.DataFrame(meta_data).astype(int)
    infer_pdf.columns = ['doc','start','stop']
    infer_pdf['confidences'] = confidences.detach().cpu().numpy().tolist()
    infer_pdf['labels'] = labels.detach().cpu().numpy().tolist()
    infer_pdf['token_ids'] = input_ids.detach().cpu().numpy().tolist() 
    return infer_pdf
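
For a single token, the confidence and label recorded by `inference` are the largest softmax probability and its index. A pure-Python sketch of that step, with made-up logits and a hypothetical `predict` helper:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (confidence, label index) for one token's logits."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return probs[label], label


confidence, label = predict([2.0, 1.0, 0.1])   # made-up logits for three classes
print(round(confidence, 3), label)
```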

There's a lot of post-processing that we need to do to get logs back in a format where they're useful for humans and easy to ingest into a security infrastructure or database. In addition to everything we do to make the logs look nice and exist in a convenient format, we also take the step to record a confidence score for every label for every parsed log. This allows us to monitor how cyBERT is performing and adjust (including fine-tuning or retraining a new model) accordingly. The `decode_cleanup` function is really just to make things nice to look at; we remove escape characters and extra spaces.

In [None]:
def postprocess(infer_pdf):
    
    # cut overlapping edges
    infer_pdf['confidences'] = infer_pdf.apply(lambda row: row['confidences'][row['start']:row['stop']], axis=1)
    infer_pdf['labels'] = infer_pdf.apply(lambda row: row['labels'][row['start']:row['stop']], axis=1)
    infer_pdf['token_ids'] = infer_pdf.apply(lambda row: row['token_ids'][row['start']:row['stop']], axis=1)
        
    # aggregated logs
    infer_pdf = infer_pdf.groupby('doc').agg({'token_ids': 'sum', 'confidences': 'sum', 'labels': 'sum'})
        
    # parse_by_label
    parsed_dfs = infer_pdf.apply(lambda row: parsed_by_label(row), axis=1, result_type='expand')
    parsed_df = pd.DataFrame(parsed_dfs[0].tolist())
    confidence_df = pd.DataFrame(parsed_dfs[1].tolist())
    confidence_df = confidence_df.drop(['X'], axis=1).applymap(np.mean)
        
    # decode cleanup
    parsed_df = decode_cleanup(parsed_df)
    return cudf.from_pandas(parsed_df), cudf.from_pandas(confidence_df)

def parsed_by_label(row):
    token_dict = defaultdict(str)
    confidence_dict = defaultdict(list) 
    for label, confidence, token_id in zip(row['labels'], row['confidences'], row['token_ids']):
        text_token = vocab_lookup[token_id]
        # if not a subword, use the current label; otherwise reuse the previous one
        if text_token[:2] != '##':
            new_label = label
            new_confidence = confidence 
        token_dict[label_map[new_label]] = token_dict[label_map[new_label]] + ' ' + text_token
        confidence_dict[label_map[label]].append(new_confidence)
    return token_dict, confidence_dict

def decode_cleanup(df):
    # raw strings avoid invalid-escape warnings; the final pattern escapes the
    # dot so it matches a literal '.' rather than any character
    return df.replace(' ##', '', regex=True) \
             .replace(' : ', ':', regex=True) \
             .replace(r'\[ ', '[', regex=True) \
             .replace(' ]', ']', regex=True) \
             .replace(' /', '/', regex=True) \
             .replace('/ ', '/', regex=True) \
             .replace(' - ', '-', regex=True) \
             .replace(r' \( ', ' (', regex=True) \
             .replace(r' \) ', ') ', regex=True) \
             .replace(r'\+ ', '+', regex=True) \
             .replace(r' \. ', '.', regex=True)
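
The core idea of `parsed_by_label` is that a subword piece (one starting with `##`) inherits the label of the word it belongs to, and all tokens with the same label are concatenated into one field. A simplified sketch on text tokens rather than token ids, with a hypothetical helper name and made-up predictions:

```python
from collections import defaultdict


def group_by_label(tokens, labels):
    """Concatenate tokens per label; '##' pieces reuse the label of the preceding word."""
    fields = defaultdict(str)
    current = None
    for token, label in zip(tokens, labels):
        if not token.startswith('##'):
            current = label            # a new word starts, so take its own label
        fields[current] += ' ' + token
    # glue subword pieces back onto their word and trim the leading space
    return {k: v.replace(' ##', '').strip() for k, v in fields.items()}


# made-up tokens and predicted labels for illustration
tokens = ['GET', '/', 'ad', '##min', '200']
labels = ['request_method', 'request_url', 'request_url', 'X', 'status']
print(group_by_label(tokens, labels))
```

The remaining space inside `'/ admin'` is the kind of artifact that `decode_cleanup` removes afterwards.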

We now have everything we need to preprocess, run inference on, and postprocess raw logs. To illustrate how this works, we'll define a few raw logs directly in a cuDF Series.

In [None]:
test_data = cudf.Series(['109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"',
'109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"',
'46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"',
'46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"',
'83.167.113.100 - - [12/Dec/2015:18:31:25 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"'])

In [None]:
# preprocess
input_ids, attention_masks, meta_data = preprocess(test_data)

# model inference
infer_df = inference(model, input_ids, attention_masks, meta_data)

# postprocess
parsed_df, confidence_df = postprocess(infer_df)

We can now examine the parsed log data.

In [None]:
parsed_df

Also of interest is the `confidence_df`. It lists as a single number how confident we are, per log, that that particular tag was parsed correctly. Note that if a value is not in a particular log (row), the confidence score is `null`.

In [None]:
confidence_df
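
One way to use these scores for monitoring is to average the token confidences per field and flag fields that fall under a cutoff. The sketch below is one possible approach on plain Python data; the helper name `summarize`, the 0.9 threshold, and the numbers are illustrative assumptions, not values from this notebook.

```python
from statistics import mean


def summarize(confidences, threshold=0.9):
    """Mean confidence per label, plus the labels whose mean is below `threshold`."""
    means = {label: mean(vals) for label, vals in confidences.items() if vals}
    low = sorted(label for label, m in means.items() if m < threshold)
    return means, low


# made-up per-token confidences for one parsed log
toy_confidences = {'status': [0.99, 0.97], 'request_url': [0.6, 0.8, 0.7]}
means, low = summarize(toy_confidences)
print(means)
print('review these fields:', low)
```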

## Conclusions and Wrap-Up

In this notebook, we've shown how it's possible to use a natural language model to parse cybersecurity logs. Although we only demonstrated it with one log type, in practice a single model is able to accurately parse multiple log types. Internally, we've experimented with using one model for Windows Event (all codes we collect), Active Directory, DNS (two ISVs), and Apache web logs. We're continuing to build out capabilities here, and we're always pushing new content and example notebooks to the [CLX GitHub repo](https://github.com/rapidsai/clx/) as well as new [blog posts on Medium](http://medium.com/rapids-ai/). We welcome all of your suggestions, comments, and ideas, so please engage with us on GitHub or via any of the [community engagement methods](https://rapids.ai/community.html) we have as a part of RAPIDS.

## Acknowledgments

This notebook was originally created by Rachel Allen as part of her work on cyBERT.