# cyBERT: a flexible log parser based on the BERT language model

## Table of Contents
* Introduction
* Generating Labels For Our Training Dataset
* Subword Labeling
* Training and Validation Datasets
* Fine-tuning pretrained BERT
* Model Evaluation
* Saving model files for future parsing with cyBERT

## Introduction

One of the most arduous tasks of any security operation (and an equally time-consuming one for a data scientist) is ETL and parsing. This notebook illustrates how to train a BERT language model using a toy dataset of just 1000 previously parsed Apache server logs as labeled data. We will fine-tune a pretrained BERT model from [HuggingFace](https://github.com/huggingface) with a classification layer for Named Entity Recognition.

In [1]:
from os import path
import s3fs
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.dataset import random_split
from torch.utils.dlpack import from_dlpack
from seqeval.metrics import classification_report,accuracy_score,f1_score
from transformers import BertForTokenClassification
from tqdm import tqdm,trange
from collections import defaultdict
import pandas as pd
import numpy as np
import cupy
import cudf

## Generating Labels For Our Training Dataset

To train our model, we begin with a dataframe containing parsed logs and an additional `raw` column containing the whole raw log as a string. We will use the column names as our labels.

In [2]:
# download log data
APACHE_SAMPLE_CSV = "apache_sample_1k.csv"
S3_BASE_PATH = "rapidsai-data/cyber/clx"

if not path.exists(APACHE_SAMPLE_CSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + APACHE_SAMPLE_CSV, APACHE_SAMPLE_CSV)

In [3]:
logs_df = cudf.read_csv(APACHE_SAMPLE_CSV)

In [4]:
# sample parsed log
logs_df.sample(1)

Unnamed: 0,error_level,error_message,raw,remote_host,remote_logname,remote_user,request_header_referer,request_header_user_agent,request_header_user_agent__browser__family,request_header_user_agent__browser__version_string,...,request_url_username,response_bytes_clf,status,time_received,time_received_datetimeobj,time_received_isoformat,time_received_tz_datetimeobj,time_received_tz_isoformat,time_received_utc_datetimeobj,time_received_utc_isoformat
829,,,46.105.57.86 - - [21/Oct/2018:16:52:55 +0200] ...,46.105.57.86,-,-,http://almhuette-raith.at/administrator/index....,Mozilla/5.0 (Linux; U; Android 2.2) AppleWebKi...,Android,2.2,...,,4494,200.0,[21/Oct/2018:16:52:55 +0200],1540141000000.0,2018-10-21T16:52:55,1540134000000.0,2018-10-21T16:52:55+02:00,1540134000000.0,2018-10-21T14:52:55+00:00


In [5]:
# sample raw log
print(logs_df.raw.loc[10])

95.108.213.19 - - [18/Jul/2018:21:53:07 +0200] "GET / HTTP/1.1" 200 10439 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"



In [6]:
def labeler(index_no, cols):
    """
    label the words in the raw log with the column name from the parsed log
    """
    raw_split = logs_df.raw_preprocess[index_no].split()
    
    # words in raw but not in parsed logs labeled as 'other'
    label_list = ['O'] * len(raw_split) 
    
    # for each parsed column find the location of the sequence of words (sublist) in the raw log
    for col in cols:
        if str(logs_df[col][index_no]) not in {'','-','None','NaN'}:
            sublist = str(logs_df[col][index_no]).split()
            sublist_len=len(sublist)
            match_count = 0
            for ind in (i for i,el in enumerate(raw_split) if el==sublist[0]):
                # words in raw log not present in the parsed log will be labeled with 'O'
                if (match_count < 1) and (raw_split[ind:ind+sublist_len]==sublist) and (label_list[ind:ind+sublist_len] == ['O'] * sublist_len):
                    label_list[ind] = 'B-'+col
                    label_list[ind+1:ind+sublist_len] = ['I-'+col] * (sublist_len - 1)
                    match_count = 1
    return label_list

In [7]:
logs_df['raw_preprocess'] = logs_df.raw.str.replace('"','')

# column names to use as labels
cols = logs_df.columns.values.tolist()

# do not use raw columns as labels
cols.remove('raw')
cols.remove('raw_preprocess')

# using a for loop for the labeling function until string UDF capability is available in RAPIDS; it is currently slow
labels = []
for indx in range(len(logs_df)):
    labels.append(labeler(indx, cols))

In [8]:
print(labels[10])

['B-remote_host', 'O', 'O', 'B-time_received', 'I-time_received', 'B-request_method', 'B-request_url', 'O', 'O', 'B-response_bytes_clf', 'O', 'B-request_header_user_agent', 'I-request_header_user_agent', 'I-request_header_user_agent', 'I-request_header_user_agent', 'O']
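
To make the labeling scheme concrete, here is a minimal pure-Python sketch of the same sublist-matching idea, applied to one hand-written record instead of the cuDF dataframe. The `label_words` helper and the `parsed` dictionary are illustrative names and values, not part of the notebook.

```python
def label_words(raw, parsed):
    """Tag each whitespace-separated word of `raw` with B-/I-<field> or 'O'."""
    words = raw.split()
    tags = ["O"] * len(words)
    for field, value in parsed.items():
        target = str(value).split()
        if not target:
            continue
        n = len(target)
        for start in range(len(words) - n + 1):
            # the first exact occurrence that is still unlabeled wins
            if words[start:start + n] == target and tags[start:start + n] == ["O"] * n:
                tags[start] = "B-" + field
                tags[start + 1:start + n] = ["I-" + field] * (n - 1)
                break
    return tags


raw = "95.108.213.19 - - [18/Jul/2018:21:53:07 +0200] GET / HTTP/1.1 200 10439"
parsed = {
    "remote_host": "95.108.213.19",
    "time_received": "[18/Jul/2018:21:53:07 +0200]",
    "request_method": "GET",
    "request_url": "/",
    "response_bytes_clf": 10439,
}
tags = label_words(raw, parsed)
for word, tag in zip(raw.split(), tags):
    print(f"{word:30} {tag}")
```

Words that do not appear in any parsed field, such as the protocol and the status code in this toy record, keep the `O` label.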


## Subword Labeling
We are using the `bert-base-cased` tokenizer vocabulary. This tokenizer splits our whitespace-separated words further into in-dictionary sub-word pieces. The model ultimately uses the label of the first piece of a word as the sole label for that word, so we do not care about its ability to predict labels for the remaining sub-word pieces. For training, those remaining pieces are labeled `X`. To learn more, see the [BERT paper](https://arxiv.org/abs/1810.04805).

In [9]:
def subword_labeler(tokenizer, log_list, label_list):
    """
    label all subword pieces in tokenized log with an 'X'
    """
    subword_labels = []
    for log, tags in zip(log_list,label_list):
        temp_tags = []
        words = cudf.Series(log.split())
        subword_counts = tokenizer(words,
               max_length=10000,
               max_num_rows=len(words),
              add_special_tokens=False
              )['metadata'][:,2]

        for i, tag in enumerate(tags):
            temp_tags.append(tag)
            temp_tags.extend('X'* subword_counts[i].item())
        subword_labels.append(temp_tags)
    return subword_labels

In [10]:
%%time
from cudf.core.subword_tokenizer import SubwordTokenizer
tokenizer  = SubwordTokenizer("resources/bert-base-cased-hash.txt",do_lower_case=False)
subword_labels = subword_labeler(tokenizer, logs_df.raw_preprocess.to_arrow().to_pylist(), labels)



CPU times: user 1.22 s, sys: 300 ms, total: 1.52 s
Wall time: 1.52 s
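
The effect of `subword_labeler` can be sketched without a GPU. The example below assumes a hypothetical `toy_wordpiece` function that simply chops a word into fixed-size chunks, standing in for the real `bert-base-cased` vocabulary; only the label expansion logic mirrors the notebook.

```python
def toy_wordpiece(word, size=4):
    """Hypothetical stand-in for WordPiece: chop a word into fixed-size chunks."""
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    # continuation pieces carry the '##' prefix, as in BERT vocabularies
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def expand_labels(words, word_labels, tokenize=toy_wordpiece):
    """Keep the word label on the first piece and mark the remaining pieces 'X'."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        labels.append(label)
        labels.extend(["X"] * (len(pieces) - 1))
    return tokens, labels


words = ["GET", "/index.html", "200"]
word_labels = ["B-request_method", "B-request_url", "O"]
tokens, labels = expand_labels(words, word_labels)
print(tokens)
print(labels)
```

The token and label lists always have the same length, which is what allows them to be padded and stacked into tensors side by side.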


We create a list of the unique labels in our dataset, then add `[PAD]` for logs shorter than the model's input sequence length and `X` for the wordpiece tokens that carry no tag of their own.

In [11]:
# set of labels
label_values = list(set(x for l in labels for x in l))

label_values[:0] = ['[PAD]']  

# Set a dict for mapping id to tag name
label2id = {t: i for i, t in enumerate(label_values)}
label2id.update({'X': -100})

In [12]:
print(label2id)

{'[PAD]': 0, 'O': 1, 'B-request_url': 2, 'B-time_received': 3, 'I-error_message': 4, 'B-remote_host': 5, 'B-error_level': 6, 'I-time_received': 7, 'I-request_header_user_agent': 8, 'B-response_bytes_clf': 9, 'B-request_header_referer': 10, 'B-request_header_user_agent': 11, 'B-error_message': 12, 'B-request_method': 13, 'X': -100}
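
Mapping `X` to `-100` is deliberate: `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions labeled `X` do not contribute to the training loss. The pure-Python sketch below illustrates the idea with made-up probabilities; `masked_nll` is an illustrative helper, not a PyTorch function.

```python
import math

IGNORE_INDEX = -100  # same value the notebook assigns to the 'X' label


def masked_nll(probs, targets, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose target is not ignored."""
    losses = [
        -math.log(p[t])
        for p, t in zip(probs, targets)
        if t != ignore_index
    ]
    return sum(losses) / len(losses) if losses else 0.0


# per-token class probabilities for a 3-class toy problem (made-up numbers)
probs = [
    [0.1, 0.8, 0.1],   # first piece of a word, true class 1
    [0.3, 0.3, 0.4],   # continuation piece labeled 'X', ignored
    [0.7, 0.2, 0.1],   # first piece of the next word, true class 0
]
targets = [1, IGNORE_INDEX, 0]

loss = masked_nll(probs, targets)
print(round(loss, 4))
```

Changing the probabilities in the ignored row leaves the loss unchanged, which is the behavior we want for sub-word continuation pieces.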


In [13]:
def pad(l, content, width):
    l.extend([content] * (width - len(l)))
    return l

In [14]:
padded_labels = [pad(x[:256], '[PAD]', 256) for x in subword_labels]
int_labels = [[label2id.get(l) for l in lab] for lab in padded_labels]
label_tensor = torch.tensor(int_labels).to('cuda')

## Training and Validation Datasets
For training and validation, our datasets need three features: (1) `input_ids`, the subword tokens as integers, padded to the model's fixed sequence length; (2) `attention_mask`, a binary mask that lets the model ignore padding; and (3) `labels`, the corresponding token labels as integers.
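
As a small illustration of how these three features line up, the sketch below builds them for one short sequence in plain Python. The token ids, the `MAX_LEN` of 8, and the toy `label2id` mapping are made-up values chosen for readability; the notebook itself uses a length of 256.

```python
MAX_LEN = 8          # illustrative; the notebook pads to 256
PAD_ID = 0           # padding value used for the input ids

label2id = {"[PAD]": 0, "O": 1, "B-request_method": 2, "B-request_url": 3, "X": -100}


def build_features(token_ids, token_labels, max_len=MAX_LEN):
    """Truncate/pad one sequence and return (input_ids, attention_mask, labels)."""
    token_ids = token_ids[:max_len]
    token_labels = token_labels[:max_len]
    n_pad = max_len - len(token_ids)
    input_ids = token_ids + [PAD_ID] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    labels = [label2id[l] for l in token_labels] + [label2id["[PAD]"]] * n_pad
    return input_ids, attention_mask, labels


# made-up token ids for five sub-word pieces and their labels
token_ids = [17, 451, 982, 77, 5]
token_labels = ["B-request_method", "B-request_url", "X", "X", "O"]

input_ids, attention_mask, labels = build_features(token_ids, token_labels)
print(input_ids)
print(attention_mask)
print(labels)
```

All three lists have the same length, so a batch of them can be stacked into equally shaped tensors.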

In [15]:
output = tokenizer(logs_df.raw_preprocess,
          max_length=256,
          truncation=True,
          max_num_rows = len(logs_df.raw_preprocess),
          add_special_tokens=False,
          return_tensors='pt'
     )
input_ids=output['input_ids'].type(torch.long)
attention_masks=output['attention_mask'].type(torch.long)
del output

In [16]:
# create dataset
dataset = TensorDataset(input_ids, attention_masks, label_tensor)

In [17]:
# use pytorch random_split to create training and validation data subsets
dataset_size = len(input_ids)
training_dataset, validation_dataset = random_split(dataset, (int(dataset_size*.8), int(dataset_size*.2)))
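
`random_split` expects the requested lengths to add up to the dataset size. Truncating both products with `int()` works for the 1000 rows used here but can fall short for other sizes, so one hedged alternative is to derive the validation size by subtraction. `split_sizes` below is an illustrative helper, not part of PyTorch.

```python
def split_sizes(n_samples, train_fraction=0.8):
    """Return (train_size, val_size) that always sum to n_samples."""
    train_size = int(n_samples * train_fraction)
    return train_size, n_samples - train_size


for n in (1000, 1003, 7):
    naive = (int(n * 0.8), int(n * 0.2))
    print(n, "naive:", naive, "sum", sum(naive), "| robust:", split_sizes(n))
```

The resulting tuple could then be passed to `random_split` in place of the two `int(...)` expressions.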

In [18]:
# create dataloader
train_dataloader = DataLoader(dataset=training_dataset, shuffle=True, batch_size=8)
val_dataloader = DataLoader(dataset=validation_dataset, shuffle=False, batch_size=1)

## Fine-tuning pretrained BERT
Download the pretrained model from HuggingFace and move it to the GPU.

In [19]:
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(label2id))

# model to gpu
model.cuda()
# use multi-gpu if available
model = nn.DataParallel(model)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

Define the optimizer and learning rate for training.

In [20]:
FULL_FINETUNING = True
if FULL_FINETUNING:
    #fine tune all layer parameters
    param_optimizer = list(model.named_parameters())
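    # Hugging Face BERT names its LayerNorm parameters 'LayerNorm.weight' / 'LayerNorm.bias'
    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         # torch.optim.Adam reads 'weight_decay'; a 'weight_decay_rate' key would be ignored
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}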
    ]
else:
    # only fine tune classifier parameters
    # (the model is wrapped in nn.DataParallel, so the classifier lives on model.module)
    param_optimizer = list(model.module.classifier.named_parameters())
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = Adam(optimizer_grouped_parameters, lr=3e-5)
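
The grouping logic relies on substring matches against parameter names. The pure-Python sketch below applies the same filter to a handful of names typical of Hugging Face BERT checkpoints wrapped in `DataParallel` (written out by hand here as an assumption, not read from the model), and takes the list of no-decay substrings as an argument so that different lists can be compared. `split_by_decay` is an illustrative helper.

```python
def split_by_decay(names, no_decay):
    """Partition parameter names into (decayed, exempt) by substring match."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    exempt = [n for n in names if any(nd in n for nd in no_decay)]
    return decayed, exempt


# hand-written sample of parameter names (assumed, not read from a model)
names = [
    "module.bert.embeddings.word_embeddings.weight",
    "module.bert.embeddings.LayerNorm.weight",
    "module.bert.embeddings.LayerNorm.bias",
    "module.bert.encoder.layer.0.attention.self.query.weight",
    "module.bert.encoder.layer.0.attention.self.query.bias",
    "module.classifier.weight",
    "module.classifier.bias",
]

for no_decay in (["bias", "gamma", "beta"], ["bias", "LayerNorm.weight"]):
    decayed, exempt = split_by_decay(names, no_decay)
    print(no_decay, "->", len(decayed), "decayed,", len(exempt), "exempt")
```

With these names, the substrings `gamma` and `beta` never match, so a list that relies on them leaves `LayerNorm.weight` in the decayed group.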

In [21]:
%%time
# using 2 epochs to avoid overfitting

epochs = 2
max_grad_norm = 1.0

for _ in trange(epochs, desc="Epoch"):
    # TRAIN loop
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        b_input_ids, b_input_mask, b_labels = batch
        # forward pass
        loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)[0]
        # backward pass
        loss.sum().backward()
        # track train loss
        tr_loss += loss.sum().item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=max_grad_norm)
        # update parameters
        optimizer.step()
        model.zero_grad()
    # print train loss per epoch
    print("Train loss: {}".format(tr_loss/nb_tr_steps))

Epoch:  50%|█████     | 1/2 [00:14<00:14, 14.28s/it]

Train loss: 0.23484498262405396


Epoch: 100%|██████████| 2/2 [00:28<00:00, 14.34s/it]

Train loss: 0.002498794225975871
CPU times: user 28.5 s, sys: 156 ms, total: 28.7 s
Wall time: 28.7 s
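
`clip_grad_norm_` rescales all gradients together when their combined L2 norm exceeds `max_norm`. A pure-Python sketch of that rule, on plain lists of floats rather than tensors, is shown below; `clip_by_global_norm` is an illustrative helper and omits details of the PyTorch implementation, such as the small epsilon added to the denominator.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale every gradient by one shared factor so the global L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [[g * scale for g in grad] for grad in grads], total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]          # global norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)
print(clipped)
```

Because a single factor is applied to every gradient, the direction of the overall update is preserved and only its length shrinks.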





## Model Evaluation

In [22]:
# no dropout or batch norm during eval
model.eval();

In [23]:
# Mapping id to label
id2label={label2id[key] : key for key in label2id.keys()}

eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
y_true = []
y_pred = []

for step, batch in enumerate(val_dataloader):
    input_ids, input_mask, label_ids = batch
        
    with torch.no_grad():
        outputs = model(input_ids, token_type_ids=None,
        attention_mask=input_mask,)
        
        # without labels passed in, the first element of the output is the logits
        logits = outputs[0] 
        
    # Get NER predicted result
    logits = torch.argmax(F.log_softmax(logits,dim=2),dim=2)
    logits = logits.detach().cpu().numpy()
    
    # Get NER true result
    label_ids = label_ids.detach().cpu().numpy()
    
    # Only positions with attention mask = 1 are scored; padded positions are skipped
    input_mask = input_mask.detach().cpu().numpy()
    
    # Compare predictions with the ground truth at the valid positions
    for i,mask in enumerate(input_mask):
        # ground truth 
        temp_1 = []
        # Prediction
        temp_2 = []
        
        for j, m in enumerate(mask):
            # mask == 0 marks padding; stop at the first padded position
            if m:
                # skip the 'X' (sub-word continuation) and '[PAD]' labels
                if id2label[label_ids[i][j]] != "X" and id2label[label_ids[i][j]] != "[PAD]": 
                    temp_1.append(id2label[label_ids[i][j]])
                    temp_2.append(id2label[logits[i][j]])
            else:
                break      
        y_true.append(temp_1)
        y_pred.append(temp_2)

print("f1 score: %f"%(f1_score(y_true, y_pred)))
print("Accuracy score: %f"%(accuracy_score(y_true, y_pred)))

# Per-label precision, recall and F1 report
print(classification_report(y_true, y_pred,digits=3))

f1 score: 0.997121
Accuracy score: 0.998702
                           precision    recall  f1-score   support

              error_level      0.900     1.000     0.947        18
            error_message      0.895     0.944     0.919        18
              remote_host      1.000     1.000     1.000       182
   request_header_referer      1.000     1.000     1.000        86
request_header_user_agent      1.000     1.000     1.000       165
           request_method      1.000     1.000     1.000       182
              request_url      0.995     1.000     0.997       182
       response_bytes_clf      1.000     1.000     1.000       180
            time_received      0.995     1.000     0.998       200

                micro avg      0.995     0.999     0.997      1213
                macro avg      0.976     0.994     0.985      1213
             weighted avg      0.995     0.999     0.997      1213
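
The `f1_score` and `classification_report` functions from `seqeval` work on whole entities rather than individual tokens: a predicted field counts as correct only if both its type and its full span match. The sketch below is a simplified pure-Python approximation of that idea. It is not the `seqeval` implementation (for example, it ignores `I-` tags that do not follow a matching `B-` tag), and `extract_entities` and `entity_f1` are illustrative names.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((etype, start, i - 1))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over entity spans."""
    n_true = n_pred = n_correct = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
        n_correct += len(true_ents & pred_ents)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [["B-remote_host", "O", "B-time_received", "I-time_received"]]
y_pred = [["B-remote_host", "O", "B-time_received", "O"]]
print(entity_f1(y_true, y_pred))
```

In this toy case three of the four tokens are labeled correctly, yet only one of the two entities matches exactly, which is why entity-level scores can be lower than token accuracy.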



## Saving model files for future parsing with cyBERT

In [24]:
model.module.config.id2label = id2label
model.module.config.label2id = label2id

In [25]:
# model.module.save_pretrained('path/to/model/directory')
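
Once a saved model returns one label per word, turning those labels back into a parsed record is a matter of grouping `B-`/`I-` runs. The sketch below shows that decoding step on hand-written labels; `labels_to_record` is a hypothetical helper, and the actual cyBERT inference code may differ.

```python
def labels_to_record(words, labels):
    """Group words into fields following their B-/I- labels; 'O' words are dropped."""
    record, current = {}, None
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # a B- label opens a new field
            current = label[2:]
            record[current] = word
        elif label.startswith("I-") and current == label[2:]:
            # an I- label extends the field that is currently open
            record[current] += " " + word
        else:
            current = None
    return record


words = "95.108.213.19 - - [18/Jul/2018:21:53:07 +0200] GET / HTTP/1.1 200 10439".split()
labels = [
    "B-remote_host", "O", "O",
    "B-time_received", "I-time_received",
    "B-request_method", "B-request_url",
    "O", "O", "B-response_bytes_clf",
]
record = labels_to_record(words, labels)
for field, value in record.items():
    print(f"{field}: {value}")
```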