# cyBERT: a flexible log parser based on the BERT language model

## Table of Contents
* Introduction
* Download cyBERT Apache model from HuggingFace
* Load model into cyBERT
* Download a sample of Apache logs
* Parse raw log data with cyBERT

## Introduction

One of the most arduous tasks in any security operation, and an equally time-consuming one for data scientists, is ETL and log parsing. This notebook walks through the simple steps needed to parse a sample of Apache log data with cyBERT.

In [None]:
import cudf
import s3fs
from os import path

from clx.analytics.cybert import Cybert

In [2]:
CLX_S3_BASE_PATH = "rapidsai-data/cyber/clx"
HF_S3_BASE_PATH = "models.huggingface.co/bert/raykallen/cybert_apache_parser"

CONFIG_FILENAME = "config.json"
MODEL_FILENAME = "pytorch_model.bin"
APACHE_SAMPLE_CSV = "apache_sample_1k.csv"

In [3]:
fs = s3fs.S3FileSystem(anon=True)

## Download cyBERT Apache model from HuggingFace

In [4]:
if not path.exists(MODEL_FILENAME):
    fs.get(HF_S3_BASE_PATH + "/" + MODEL_FILENAME, MODEL_FILENAME)

In [5]:
if not path.exists(CONFIG_FILENAME):
    fs.get(HF_S3_BASE_PATH + "/" + CONFIG_FILENAME, CONFIG_FILENAME)
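
The download cells in this notebook all follow the same check-then-fetch pattern. As a minimal sketch, that pattern could be wrapped in one helper. The name `fetch_if_missing` is hypothetical and not part of CLX or s3fs; the only thing it assumes about `fs` is a `get(remote_path, local_path)` method, which is how the notebook already calls it.

```python
from os import path


def fetch_if_missing(fs, remote_base, filename):
    """Download remote_base/filename into the working directory unless it exists.

    `fs` is any object exposing get(remote_path, local_path), for example the
    s3fs.S3FileSystem(anon=True) instance created earlier in this notebook.
    Returns True when a download was triggered, False when the local copy was reused.
    """
    if path.exists(filename):
        return False
    fs.get(remote_base + "/" + filename, filename)
    return True


# Hypothetical usage with the notebook's constants:
# for name in (MODEL_FILENAME, CONFIG_FILENAME):
#     fetch_if_missing(fs, HF_S3_BASE_PATH, name)
```

Returning a boolean makes it easy to log whether a cached file was reused, which helps when a stale local file is suspected.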

## Load model into cyBERT

In [6]:
cybert = Cybert()
cybert.load_model(MODEL_FILENAME, CONFIG_FILENAME)

## Download a sample of Apache logs

In [7]:
if not path.exists(APACHE_SAMPLE_CSV):
    fs.get(CLX_S3_BASE_PATH + "/" + APACHE_SAMPLE_CSV, APACHE_SAMPLE_CSV)

In [8]:
logs_df = cudf.read_csv(APACHE_SAMPLE_CSV)

In [9]:
# Print raw logs
logs_df["raw"]

0      [Sun Dec 04 20:22:49 2005] [notice] workerEnv....
1      193.106.31.130 - - [01/Sep/2019:03:28:00 +0200...
2      100.1.14.108 - - [29/Sep/2019:19:41:25 +0200] ...
3      13.84.43.203 - - [06/Nov/2019:03:15:15 +0100] ...
4      90.188.40.9 - - [18/Feb/2016:12:38:21 +0100] "...
                             ...                        
995    154.0.14.250 - - [06/Dec/2016:16:59:06 +0100] ...
996    62.210.33.127 - - [20/Oct/2019:15:15:40 +0200]...
997    100.1.14.108 - - [04/Oct/2019:12:21:10 +0200] ...
998    198.50.156.189 - - [01/Apr/2017:19:47:53 +0200...
999    100.1.14.108 - - [23/Sep/2019:17:55:54 +0200] ...
Name: raw, Length: 1000, dtype: object
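
Row 0 of the sample is an Apache error-log line, while the other visible rows are access-log lines. For contrast with cyBERT's learned parsing, the conventional approach is one hand-written regular expression per format. The sketch below is such a baseline and is not part of cyBERT or CLX: the patterns and the `parse_line` helper are illustrative assumptions, the group names are borrowed from the parsed columns shown later in this notebook, and the sample line is written by hand rather than copied from the dataset.

```python
import re

# Assumed access-log layout: host, identity, user, [timestamp],
# "method url version", status, bytes, then optional "referer" "user agent".
ACCESS_RE = re.compile(
    r'^(?P<remote_host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time_received>[^\]]+)\] '
    r'"(?P<request_method>\S+) (?P<request_url>\S+) (?P<request_http_ver>[^"]+)" '
    r'(?P<status>\d{3}) (?P<response_bytes_clf>\S+)'
    r'(?: "(?P<request_header_referer>[^"]*)" "(?P<request_header_user_agent>[^"]*)")?$'
)

# Assumed error-log layout, as in row 0: [timestamp] [level] message.
ERROR_RE = re.compile(
    r'^\[(?P<time_received>[^\]]+)\] \[(?P<error_level>[^\]]+)\] (?P<error_message>.*)$'
)


def parse_line(line):
    """Return a dict of named fields, or None when neither pattern matches."""
    for pattern in (ACCESS_RE, ERROR_RE):
        match = pattern.match(line)
        if match:
            return match.groupdict()
    return None


# Hand-written illustrative line (not taken from the sample file).
SAMPLE = '192.0.2.1 - - [29/Sep/2019:19:41:25 +0200] "GET /index.php HTTP/1.1" 404 240 "-" "python-requests/2.22.0"'
fields = parse_line(SAMPLE)
```

Every new log format, or small deviation from an expected one, needs another pattern in this approach, which is the maintenance burden a model-based parser aims to reduce.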

## Parse raw log data with cyBERT

In [10]:
parsed_df, confidence_df = cybert.inference(logs_df["raw"])
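
`inference` returns two frames, but this notebook only displays `parsed_df`. One possible use of `confidence_df` is to flag fields the model was unsure about. The sketch below assumes that each confidence row maps field names to a score between 0 and 1 and that missing fields appear as `None` or NaN; both are assumptions to check against the CLX documentation. The helper name and the 0.9 threshold are hypothetical.

```python
def low_confidence_fields(confidence_row, threshold=0.9):
    """Return the names of fields whose score falls below `threshold`.

    `confidence_row` is a plain dict of field name -> score for one log line.
    Entries that are None or NaN are treated as "field not present" and skipped.
    """
    flagged = []
    for field, score in confidence_row.items():
        if score is None or score != score:  # NaN is the only value unequal to itself
            continue
        if score < threshold:
            flagged.append(field)
    return flagged


# Hand-made example row; the scores are invented for illustration.
example_row = {"status": 0.99, "request_url": 0.42, "error_level": None}
flagged = low_confidence_fields(example_row)
```

Rows in this plain-dict form could be obtained with `confidence_df.to_pandas().to_dict("records")`, at the cost of copying the data off the GPU, so this suits spot checks better than bulk filtering.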

In [11]:
parsed_df

| | time_received | error_level | error_message | remote_host | other | request_method | request_url | request_http_ver | status | response_bytes_clf | request_header_user_agent | request_header_referer | X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [Sun Dec 04 20:22:49 2005] | [notice] | workerEnv.init () ok/etc/httpd/conf/workers2 . | | | | | | | | | | |
| 1 | [01/Sep/2019:03:28:00 +0200] | | | 193.106.31.130 | --- | POST | /administrator/index.php | HTTP/1.0 | 200 | 4481 | Mozilla/4.0 (compatible.MSIE...; Windows NT...) | | |
| 2 | [29/Sep/2019:19:41:25 +0200] | | | 100.1.14.108 | --- | GET | /components/com.users/dispacher.php | HTTP/1.1 | 404 | 240 | python-requests/2.22.0 | | |
| 3 | [06/Nov/2019:03:15:15 +0100] | | | 13.84.43.203 | --- | GET | //administrator/index.php | HTTP/1.1 | 200 | 4270 | Mozilla/5.0 (Windows NT 10.0.Win64.x64.rv:65.0... | | |
| 4 | [18/Feb/2016:12:38:21 +0100] | | | 90.188.40.9 | --- | GET | /administrator/ | HTTP/1.1 | 200 | 4263 | Mozilla/5.0 (Windows NT.. 1) AppleWebKit/537.3... | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | [06/Dec/2016:16:59:06 +0100] | | | 154.0.14.250 | - - | POST | /administrator/index.php | HTTP/1.1 | 200 | 4263 | Mozilla/5.0 (Windows NT...; WOW64.rv:17.0) Gec... | http://almhuette-raith.at/administrator/index.php | |
| 996 | [20/Oct/2019:15:15:40 +0200] | | | 62.210.33.127 | --http://www.almhuette-raith.at/index.php.opti... | GET | /images/stories/slideshow/almhuette.raith.06.jpg | HTTP/1.1 | 200 | 68977 | Mozilla/5.0 (Macintosh.Intel Mac OS.10.14.5 )W... | | |
| 997 | [04/Oct/2019:12:21:10 +0200] | | | 100.1.14.108 | --- | GET | /modules/mod.bowslideshow/tmpl/js/sliderman.1.... | HTTP/1.1 | 200 | 33472 | Python-urllib/3.7 | | |
| 998 | [01/Apr/2017:19:47:53 +0200] | | | 198.50.156.189 | --- - | POST | /administrator/index.php | HTTP/1.1 | 200 | 4498 | | | |
