# cyBERT: a flexible log parser based on the BERT language model

## Authors
 - Rachel Allen, PhD (NVIDIA)
 - Bartley Richardson, PhD (NVIDIA)
 
## Table of Contents
* Introduction
* Generating Labeled Logs
* Tokenization
* Data Loading
* Fine-tuning pretrained BERT
* Model Evaluation

## Introduction

One of the most arduous tasks in any security operation (and an equally time-consuming one for a data scientist) is ETL and parsing. This notebook illustrates how to train a BERT language model using a toy dataset of just 1000 previously parsed Windows event logs as labeled data. We will fine-tune a pretrained BERT model using the PyTorch interface from the [HuggingFace](https://github.com/huggingface) library, with a classification layer for Named Entity Recognition. The HuggingFace library is the most widely accepted interface for working with BERT. It includes pre-built modifications of BERT for specific tasks; in our case we use `BertForTokenClassification`.

In [1]:
import torch

from pytorch_transformers import BertTokenizer, BertModel, BertForTokenClassification, AdamW
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from seqeval.metrics import classification_report,accuracy_score,f1_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm,trange

import pandas as pd
import numpy as np

## Generating Labeled Logs

To train our model we begin with a dataframe containing parsed logs and an additional `raw` column containing the whole raw log as a string. We will use the column names as our labels. Note that this original version is on CPU; future versions utilize RAPIDS.

In [2]:
logs_df = pd.read_csv('../data/winevt_sample.csv')

In [3]:
# sample parsed log for training
logs_df.loc[0]

time                                                                              1525108499000
eventcode                                                                                  4624
uuid                                                       fc2f8767-9917-40ce-b1b2-19ac4ec80e76
computername                                                                 laptop-70.wong.com
eventtype                                                                                     0
keywords                                                                          Audit Success
logname                                                                                Security
message                                                  An account was successfully logged on.
opcode                                                                                     Info
recordnumber                                                                         1920306769
sourcename                              

In [4]:
# sample raw log
logs_df.raw[1]

'02/28/2019 12:49:04 AM LogName= Security SourceName= Microsoft Windows security auditing. EventCode= 4624 EventType= 0 Type= Information ComputerName= lt-95.melton.com TaskCategory= Logon OpCode= Info RecordNumber= 474033423 Keywords= Audit Success Message= An account was successfully logged on.    Subject:   Account Name:  gonzalespeter   Account Domain:  taylor.com   New Logon:   Account Name:  gonzalespeter@acme.com   Account Domain:  blair.com    Network Information:   Workstation Name:  desktop-gonzalespeter   Network Address:  192.175.54.118'

In [5]:
cols = logs_df.columns.values.tolist()
cols.remove('raw')

def labeler(row):
    raw_split = row['raw'].split()
    label_list = ['other'] * len(raw_split) 
    for col in cols:
        if str(row[col]) not in {'','-','None','NaN'}:
            sublist = str(row[col]).split()
            sublist_len=len(sublist)
            match_count = 0
            for ind in (i for i,el in enumerate(raw_split) if el==sublist[0]):
                if match_count < 1:
                    if raw_split[ind:ind+sublist_len]==sublist:
                        if label_list[ind:ind+sublist_len] == ['other'] * sublist_len:
                            label_list[ind:ind+sublist_len] = [col] * sublist_len
                            match_count = 1
    return label_list

In [6]:
logs_df['labels'] = logs_df.apply(lambda x : labeler(x), axis=1)

In [7]:
print(logs_df.labels[0])

['insert_time', 'insert_time', 'insert_time', 'other', 'logname', 'other', 'sourcename', 'sourcename', 'sourcename', 'sourcename', 'other', 'eventcode', 'other', 'eventtype', 'other', 'type', 'other', 'computername', 'other', 'taskcategory', 'other', 'opcode', 'other', 'recordnumber', 'other', 'keywords', 'keywords', 'other', 'message', 'message', 'message', 'message', 'message', 'message', 'other', 'other', 'other', 'subject_account_name', 'other', 'other', 'subject_account_domain', 'other', 'other', 'other', 'other', 'new_logon_account_name', 'other', 'other', 'new_logon_account_domain', 'other', 'other', 'other', 'other', 'network_information_workstation_name', 'other', 'other', 'network_information_source_network_address']
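
The matching rule used by `labeler` can be tried on a single toy record. The following is a minimal sketch rather than the notebook's exact function: `label_tokens`, `toy_raw` and `toy_fields` are hypothetical names, a plain dict stands in for a dataframe row, and the check for empty placeholder values (`''`, `'-'`, `'None'`, `'NaN'`) is left out for brevity.

```python
def label_tokens(raw, fields):
    """Label each whitespace token of `raw` with the field whose value it belongs to."""
    tokens = raw.split()
    labels = ['other'] * len(tokens)
    for name, value in fields.items():
        value_tokens = str(value).split()
        n = len(value_tokens)
        if n == 0:
            continue
        # scan left to right and label only the first exact match that is still unlabeled
        for start in range(len(tokens) - n + 1):
            is_match = tokens[start:start + n] == value_tokens
            is_free = labels[start:start + n] == ['other'] * n
            if is_match and is_free:
                labels[start:start + n] = [name] * n
                break
    return tokens, labels

# toy record: field values are matched against the raw string token by token
toy_raw = "LogName= Security EventCode= 4624 Keywords= Audit Success"
toy_fields = {"logname": "Security", "eventcode": 4624, "keywords": "Audit Success"}
toy_tokens, toy_labels = label_tokens(toy_raw, toy_fields)

# the same value in two fields: the second field takes the next unlabeled occurrence
dup_raw = "Account Name: bob New Account Name: bob"
dup_fields = {"subject_account_name": "bob", "new_logon_account_name": "bob"}
dup_tokens, dup_labels = label_tokens(dup_raw, dup_fields)
```

Because a span is only labeled while it is still `other`, two columns holding the same value end up on different occurrences in the raw log, in column order.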


In [8]:
logs_df['logs'] = logs_df['raw'].apply(lambda x: x.split())

In [9]:
labels = logs_df['labels'].tolist()
logs = logs_df['logs'].tolist()

## Tag/Label list

We create a list of all unique labels (tags) in our dataset, then add `X` for wordpiece tokens that will not have tags of their own and `[PAD]` for logs shorter than the model's input sequence length.

In [10]:
# set of tags
tag_values = list(set(x for l in labels for x in l))

# add 'X' tag for wordpiece 
tag_values.append('X')
tag_values.append('[PAD]')

# dict mapping tag name to integer id
tag2idx = {t: i for i, t in enumerate(tag_values)}
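
As a small illustration of the tag vocabulary, here is a sketch on two toy label sequences. The `toy_` names are hypothetical, and sorting the tags is an assumption added here so that the ids are reproducible between runs; the notebook itself takes the order of the `set` as it comes.

```python
# two toy label sequences standing in for logs_df['labels']
toy_labels = [['logname', 'other', 'eventcode'],
              ['other', 'keywords', 'keywords']]

# unique tags, sorted so that every run assigns the same ids (an assumption of this sketch)
toy_tag_values = sorted({tag for seq in toy_labels for tag in seq})

# special tags: 'X' for trailing wordpieces, '[PAD]' for padding positions
toy_tag_values += ['X', '[PAD]']

# forward and inverse mappings
toy_tag2idx = {tag: i for i, tag in enumerate(toy_tag_values)}
toy_idx2tag = {i: tag for tag, i in toy_tag2idx.items()}
```

The inverse mapping is what turns predicted ids back into field names at evaluation time.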

## Wordpiece tokenization
We are using the `bert-base-uncased` tokenizer from the pretrained BERT library from [HuggingFace](https://github.com/huggingface). This tokenizer splits our whitespace-separated words further into sub-word pieces that are in its dictionary. The model eventually uses the label from the first piece of a word as its tag, so we do not care about the model's ability to predict labels for the sub-word pieces. For training, the tag used for these pieces is `X`. To learn more see the [BERT paper](https://arxiv.org/abs/1810.04805).

In [11]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

tokenized_texts = []
new_labels = []
for sentence, tags in zip(logs,labels):
    new_tags = []
    new_text = []
    for word, tag in zip(sentence,tags):
        sub_words = tokenizer.wordpiece_tokenizer.tokenize(word.lower())
        for count, sub_word in enumerate(sub_words):
            if count > 0:
                tag = 'X'
            new_tags.append(tag)
            new_text.append(sub_word)
    tokenized_texts.append(new_text)
    new_labels.append(new_tags)
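
To make the tag alignment concrete without downloading a tokenizer, here is a toy sketch of greedy longest-match-first wordpiece splitting. The vocabulary `toy_vocab` is invented for this example, so the real `bert-base-uncased` tokenizer will split these words differently; `wordpiece` and `align_tags` are hypothetical helper names.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def align_tags(words, tags, vocab):
    """Keep the word's tag on its first piece and use 'X' for the remaining pieces."""
    out_pieces, out_tags = [], []
    for word, tag in zip(words, tags):
        for count, piece in enumerate(wordpiece(word.lower(), vocab)):
            out_pieces.append(piece)
            out_tags.append(tag if count == 0 else 'X')
    return out_pieces, out_tags

toy_vocab = {'log', '##name', '##=', 'security', '46', '##24'}  # hypothetical vocabulary
toy_pieces, toy_piece_tags = align_tags(
    ['LogName=', 'Security', '4624'], ['other', 'logname', 'eventcode'], toy_vocab)
```

With this vocabulary, `4624` becomes two pieces and only the first one keeps the `eventcode` tag.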

## Model inputs
For training, our model needs (1) wordpiece tokens as integers padded to the model's sequence length, (2) the corresponding tags as integers, and (3) a binary attention mask that ignores padding. Here we have used a sequence length of 256 for each log or log piece.

In [12]:
# convert string tokens into ints
input_ids = [tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts]

# pad input_ids with zeros and labels with [PAD]
def pad(l, content, width):
    l.extend([content] * (width - len(l)))
    return l

input_ids = [pad(x, 0, 256) for x in input_ids]

new_labels = [pad(x, '[PAD]', 256) for x in new_labels]


# attention mask for model to ignore padding
attention_masks = [[int(i>0) for i in ii] for ii in input_ids]

# convert labels/tags to ints
tags = [[tag2idx.get(l) for l in lab] for lab in new_labels]
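
The `pad` helper above only extends sequences; one longer than 256 pieces would keep its full length. The sketch below shows padding and the attention mask on a toy sequence, together with a variant that also truncates. `pad_or_truncate` is a hypothetical helper that is not part of the notebook, and the token ids are made up.

```python
def pad_or_truncate(seq, fill, width):
    """Return a new list of exactly `width` items: cut long input, pad short input."""
    return list(seq[:width]) + [fill] * max(0, width - len(seq))

toy_width = 6                       # the notebook uses 256
toy_ids = [2005, 3012, 1027]        # made-up wordpiece ids
toy_tags = ['other', 'logname', 'X']

padded_ids = pad_or_truncate(toy_ids, 0, toy_width)
padded_tags = pad_or_truncate(toy_tags, '[PAD]', toy_width)

# 1 for real tokens, 0 for padding (padding id is 0)
toy_mask = [int(i > 0) for i in padded_ids]

# a sequence longer than the width is cut to fit
too_long = pad_or_truncate(list(range(1, 10)), 0, toy_width)
```

Whether truncation is the right policy depends on the logs; splitting a long log into several pieces is another option.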

## Training and testing datasets
We split the data into training and validation sets.

In [13]:
tr_inputs, val_inputs, tr_tags, val_tags,tr_masks, val_masks = train_test_split(input_ids, tags, attention_masks, random_state=1234, test_size=0.1)
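
Passing several arrays to one split call keeps inputs, tags and masks aligned row by row, and the results come back as a train/validation pair per array, in argument order. The following standard-library sketch illustrates that idea only; `aligned_split` is a hypothetical stand-in and does not reproduce scikit-learn's shuffling, so the selected rows will differ from those of `train_test_split`.

```python
import random

def aligned_split(*arrays, test_size=0.1, seed=1234):
    """Split several equally long lists with one shared shuffle of row indices."""
    n = len(arrays[0])
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    out = []
    for arr in arrays:
        # each array contributes its train part, then its validation part
        out.append([arr[i] for i in train_idx])
        out.append([arr[i] for i in test_idx])
    return out

toy_inputs = [[i] for i in range(10)]
toy_tags = [[i * 10] for i in range(10)]
toy_masks = [[1] for _ in range(10)]

tr_in, val_in, tr_tag, val_tag, tr_m, val_m = aligned_split(toy_inputs, toy_tags, toy_masks)
```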

Select the GPU device and convert the datasets to tensors. Batches are moved to the GPU inside the training loop.

In [14]:
device = torch.device("cuda")

In [15]:
tr_inputs = torch.tensor(tr_inputs)
val_inputs = torch.tensor(val_inputs)
tr_tags = torch.tensor(tr_tags)
val_tags = torch.tensor(val_tags)
tr_masks = torch.tensor(tr_masks)
val_masks = torch.tensor(val_masks)

We create dataloaders to make batches of data ready to feed into the model. The BERT authors recommend a batch size of 16 or 32; here we use 32.

In [16]:
train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=32)

valid_data = TensorDataset(val_inputs, val_masks, val_tags)
valid_sampler = SequentialSampler(valid_data)
valid_dataloader = DataLoader(valid_data, sampler=valid_sampler, batch_size=32)
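
As a quick check on batch counts, the sketch below assumes each of the 1000 logs yields one example, so the 10% split leaves 900 for training and 100 for validation. `batches` is a hypothetical plain-Python stand-in for sequential batching, not the `DataLoader` itself.

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

n_train, n_valid, batch_size = 900, 100, 32  # assumed sizes, see the lead-in

train_batches = list(batches(list(range(n_train)), batch_size))
valid_batches = list(batches(list(range(n_valid)), batch_size))

# number of optimizer steps per epoch and number of validation batches
steps_per_epoch = math.ceil(n_train / batch_size)
valid_steps = math.ceil(n_valid / batch_size)
```

Under these assumptions an epoch has 29 training steps, and the last batch of each loader holds only 4 examples.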

## Fine-tuning pretrained BERT
Download the pretrained model from HuggingFace and move it to the GPU.

In [17]:
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(tag2idx))
#model to gpu
model.cuda();

Fine-tune the parameters of all layers of the pretrained model.

In [18]:
FULL_FINETUNING = True
if FULL_FINETUNING:
    #fine tune all layer parameters
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    # torch.optim.Adam reads the 'weight_decay' key of each parameter group
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}
    ]
else:
    # only fine tune classifier parameters
    param_optimizer = list(model.classifier.named_parameters()) 
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = Adam(optimizer_grouped_parameters, lr=3e-5)
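
The grouping is done by substring match on parameter names, which can be checked without loading a model. In the sketch below the names in `toy_names` are assumptions about how the parameters might be named, not a dump of the real model; run `model.named_parameters()` to see the actual names.

```python
no_decay = ['bias', 'gamma', 'beta']

# assumed parameter names, for illustration only
toy_names = [
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'bert.encoder.layer.0.output.LayerNorm.weight',
    'bert.encoder.layer.0.output.LayerNorm.bias',
    'classifier.weight',
    'classifier.bias',
]

# same substring test as in the optimizer setup
decay_names = [n for n in toy_names if not any(nd in n for nd in no_decay)]
no_decay_names = [n for n in toy_names if any(nd in n for nd in no_decay)]
```

If the layer-norm parameters are named `LayerNorm.weight` as assumed here, the `gamma` and `beta` entries match nothing and the layer-norm weight lands in the decayed group, so it is worth inspecting the real names before relying on the split.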

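We're using a simple measure of accuracy: total correct predictions over the total number of labeled tokens across logs. Note that `flat_accuracy` counts every position of each padded sequence, including the `[PAD]` positions.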

In [19]:
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
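
To see how much padding can move this number, the sketch below compares the flat measure with a variant that skips masked positions, using plain lists of tag ids. Both helper names and all values are hypothetical. Counting padding is one plausible reason the validation accuracy printed during training is much lower than the score reported in the evaluation section.

```python
def flat_accuracy_lists(pred_ids, label_ids):
    """Share of matching positions over all positions, padding included."""
    pairs = [(p, l) for ps, ls in zip(pred_ids, label_ids) for p, l in zip(ps, ls)]
    return sum(p == l for p, l in pairs) / len(pairs)

def masked_accuracy(pred_ids, label_ids, masks):
    """Share of matching positions over positions whose attention mask is 1."""
    pairs = [(p, l)
             for ps, ls, ms in zip(pred_ids, label_ids, masks)
             for p, l, m in zip(ps, ls, ms) if m]
    return sum(p == l for p, l in pairs) / len(pairs)

# two toy sequences of length 4; tag id 5 plays the role of [PAD]
toy_label_ids = [[1, 2, 5, 5], [1, 3, 5, 5]]
toy_pred_ids = [[1, 2, 0, 0], [1, 1, 0, 0]]
toy_masks = [[1, 1, 0, 0], [1, 1, 0, 0]]

acc_flat = flat_accuracy_lists(toy_pred_ids, toy_label_ids)
acc_masked = masked_accuracy(toy_pred_ids, toy_label_ids, toy_masks)
```

Here three of the four real tokens are correct, but the flat measure reports 3 out of 8 because the four padded positions never match.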

In [20]:
# use 2 epochs to avoid overfitting; the BERT paper recommends 2 to 4 epochs
epochs = 2
max_grad_norm = 1.0

for _ in trange(epochs, desc="Epoch"):
    # TRAIN loop
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        # add batch to gpu
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        # forward pass
        loss, scores = model(b_input_ids, token_type_ids=None,
                     attention_mask=b_input_mask, labels=b_labels)
        # backward pass
        loss.backward()
        # track train loss
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=max_grad_norm)
        # update parameters
        optimizer.step()
        model.zero_grad()
    # print train loss per epoch
    print("Train loss: {}".format(tr_loss/nb_tr_steps))
    # VALIDATION on validation set
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    predictions , true_labels = [], []
    for batch in valid_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        
        with torch.no_grad():
            tmp_eval_loss, logits = model(b_input_ids, token_type_ids=None,
                           attention_mask=b_input_mask, labels=b_labels)
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
        true_labels.append(label_ids)
        
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        eval_loss += tmp_eval_loss.mean().item()
        eval_accuracy += tmp_eval_accuracy
        
        nb_eval_examples += b_input_ids.size(0)
        nb_eval_steps += 1
    eval_loss = eval_loss/nb_eval_steps
    print("Validation loss: {}".format(eval_loss))
    print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))
    pred_tags = [tag_values[p_i] for p in predictions for p_i in p]
    valid_tags = [tag_values[l_ii] for l in true_labels for l_i in l for l_ii in l_i]
    # seqeval metrics take the true tags first, then the predictions
    print("F1-Score: {}".format(f1_score(valid_tags, pred_tags)))

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Train loss: 1.1063271555407295


Epoch:  50%|█████     | 1/2 [00:15<00:15, 15.51s/it]

Validation loss: 0.2913314178586006
Validation Accuracy: 0.550140380859375
F1-Score: 0.8453539528062924
Train loss: 0.14663322655291394


Epoch: 100%|██████████| 2/2 [00:30<00:00, 15.36s/it]

Validation loss: 0.026326983235776424
Validation Accuracy: 0.58355712890625
F1-Score: 0.9789872096058471
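
The training loop clips gradients with `max_grad_norm = 1.0` before each optimizer step. As a worked example of the arithmetic, here is a plain-Python sketch of clipping by global L2 norm. It illustrates the idea only and is not PyTorch's implementation; `clip_by_global_norm` is a hypothetical name.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so that their joint L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # shrink only when the norm exceeds the limit; never scale gradients up
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [g * scale for g in grads], total_norm

# norm is 5, so every gradient is scaled by 1/5
clipped, norm_before = clip_by_global_norm([3.0, 4.0], 1.0)

# norm is 0.5, below the limit, so the gradients are left unchanged
small, small_norm = clip_by_global_norm([0.3, 0.4], 1.0)
```

The direction of the update is preserved; only its length is capped.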





## Model Evaluation

We want to look at our model's performance for individual fields and ignore the predictions for the `X` tag that we used for subword pieces.
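
The filtering described here can be sketched on one toy sequence: stop at the first padded position and drop positions whose true tag is `X`. `keep_scored` and the `toy_` values are hypothetical and only mirror the logic of the evaluation loop in this section.

```python
def keep_scored(true_ids, pred_ids, mask, id2tag, skip_tag='X'):
    """Return (y_true, y_pred) tag names for positions that should be scored."""
    y_true, y_pred = [], []
    for t, p, m in zip(true_ids, pred_ids, mask):
        if not m:
            break  # padding starts here, nothing after it is scored
        if id2tag[t] == skip_tag:
            continue  # trailing wordpiece, its prediction is ignored
        y_true.append(id2tag[t])
        y_pred.append(id2tag[p])
    return y_true, y_pred

toy_id2tag = {0: 'other', 1: 'logname', 2: 'X', 3: '[PAD]'}
toy_true_ids = [0, 2, 1, 2, 3, 3]
toy_pred_ids = [0, 0, 0, 1, 0, 0]
toy_mask = [1, 1, 1, 1, 0, 0]

toy_y_true, toy_y_pred = keep_scored(toy_true_ids, toy_pred_ids, toy_mask, toy_id2tag)
```

Only two of the six positions are scored in this example, and the second one is a miss.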

In [21]:
model.eval();

In [22]:
# Mapping index to name
tag2name={tag2idx[key] : key for key in tag2idx.keys()}

In [23]:
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
y_true = []
y_pred = []

for step, batch in enumerate(valid_dataloader):
    batch = tuple(t.to(device) for t in batch)
    input_ids, input_mask, label_ids = batch
        
    with torch.no_grad():
        outputs = model(input_ids, token_type_ids=None,
        attention_mask=input_mask,)
        
        # without labels, the first element of the model output is the logits
        logits = outputs[0]

    # predicted tag id for each token position
    logits = torch.argmax(F.log_softmax(logits,dim=2),dim=2)
    logits = logits.detach().cpu().numpy()

    # true tag ids
    label_ids = label_ids.to('cpu').numpy()

    # attention mask: positions with mask=0 are padding and are not scored
    input_mask = input_mask.to('cpu').numpy()

    # compare predictions with the ground truth on non-padding positions only
    for i,mask in enumerate(input_mask):
        # ground truth 
        temp_1 = []
        # Prediction
        temp_2 = []
        
        for j, m in enumerate(mask):
            # mask=0 marks padding, so stop at the first padded position
            if m:
                # exclude the X label used for sub-word pieces
                if tag2name[label_ids[i][j]] != "X" : 
                    temp_1.append(tag2name[label_ids[i][j]])
                    temp_2.append(tag2name[logits[i][j]])
            else:
                break      
        y_true.append(temp_1)
        y_pred.append(temp_2)

print("f1 score: %f"%(f1_score(y_true, y_pred)))
print("Accuracy score: %f"%(accuracy_score(y_true, y_pred)))

# per-tag precision, recall and F1 report
print(classification_report(y_true, y_pred,digits=4))

f1 score: 0.996850
Accuracy score: 0.998094
                                            precision    recall  f1-score   support

                                     other     1.0000    1.0000    1.0000      1696
                    subject_account_domain     1.0000    0.9400    0.9691       100
                                    opcode     1.0000    1.0000    1.0000       100
                              recordnumber     1.0000    1.0000    1.0000       100
                                 eventcode     1.0000    1.0000    1.0000       100
                    new_logon_account_name     1.0000    1.0000    1.0000       100
                              computername     1.0000    1.0000    1.0000       100
                              taskcategory     1.0000    1.0000    1.0000       100
network_information_source_network_address     1.0000    1.0000    1.0000       100
                                   logname     0.9524    1.0000    0.9756       100
                               

Even using a small toy dataset, the model performs pretty well!

## Stay tuned-- All GPU pipeline coming soon!