### Log Anomaly Detection on BGL dataset using LogBERT model
This is a running example of an end-to-end Log Anomaly Detection workflow on the public BGL dataset using the LogBERT model.

There are similar workflows on the BGL dataset using other anomaly detectors (for example, the LSTM-based one in `bgl_lstm_unsupervised_parsed_sequential.ipynb`).

The workflow script is identical in these cases, except that in the LogBERT case we skip the log-parsing step. This simply follows past literature; the LogAI library itself imposes no such restriction.


Also check out the other config files in this directory, which cover other datasets (HDFS) and other experimental settings (parsing or non-parsing, sliding-window or session-window log partitioning, sequential or semantic log feature representations, supervised or unsupervised training, and LSTM, CNN, Transformer, or BERT models).

To use a different experimental config, you only need to point to the corresponding config file; the same workflow code should work unchanged.

Only when changing the dataset (e.g. from BGL to HDFS) do you need to change more than the config.yaml file: you must also use the HDFSPreprocessor in the preprocessing step. Note that each custom dataset that is added should have its own Preprocessor class (which should inherit from `logai.preprocess.preprocessor.Preprocessor`).
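
As a minimal sketch of the one-preprocessor-per-dataset pattern described above, the following uses stand-in classes rather than LogAI's own. The base class, the `clean_line` hook, the toy cleaning rules, and the `PREPROCESSORS` registry are all assumptions made for illustration; the real base class lives in the `logai` package and its interface may differ.

```python
from abc import ABC, abstractmethod


class ToyPreprocessor(ABC):
    """Hypothetical stand-in for a dataset-agnostic base preprocessor."""

    def clean_log(self, lines):
        # Shared driver: apply the dataset-specific rule to every log line.
        return [self.clean_line(line) for line in lines]

    @abstractmethod
    def clean_line(self, line):
        """Dataset-specific cleaning, supplied by each subclass."""


class ToyBGLPreprocessor(ToyPreprocessor):
    def clean_line(self, line):
        # Toy rule (assumption): collapse repeated whitespace.
        return " ".join(line.split())


class ToyHDFSPreprocessor(ToyPreprocessor):
    def clean_line(self, line):
        # Toy rule (assumption): collapse whitespace and lowercase.
        return " ".join(line.split()).lower()


# Hypothetical registry: switching datasets means switching both the
# config file and the preprocessor class.
PREPROCESSORS = {"bgl": ToyBGLPreprocessor, "hdfs": ToyHDFSPreprocessor}


def make_preprocessor(dataset_name):
    try:
        return PREPROCESSORS[dataset_name]()
    except KeyError:
        raise ValueError(f"no preprocessor registered for {dataset_name!r}")


print(make_preprocessor("bgl").clean_log(["RAS  KERNEL   INFO"]))
```

Keeping the dataset-specific logic in a subclass means the rest of the workflow can stay unchanged when the dataset changes.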

For more complete explanations of each step of the workflow, check out the `hdfs_lstm_unsupervised_parsed_sequential.ipynb` notebook instead.


In [1]:
import os 
from logai.applications.openset.anomaly_detection.openset_anomaly_detection_workflow import OpenSetADWorkflowConfig, validate_config_dict
from logai.utils.file_utils import read_file
from logai.utils.dataset_utils import split_train_dev_test_for_anomaly_detection
import logging 
from logai.dataloader.data_loader import FileDataLoader
from logai.preprocess.bgl_preprocessor import BGLPreprocessor
from logai.information_extraction.log_parser import LogParser
from logai.preprocess.openset_partitioner import OpenSetPartitioner
from logai.analysis.nn_anomaly_detector import NNAnomalyDetector
from logai.information_extraction.log_vectorizer import LogVectorizer
from logai.utils import constants

In [2]:
config_path = "configs/bgl_logbert_config.yaml"
config_parsed = read_file(config_path)
config_dict = config_parsed["workflow_config"]
config = OpenSetADWorkflowConfig.from_dict(config_dict)

In [3]:
dataloader = FileDataLoader(config.data_loader_config)
logrecord = dataloader.load_data()
print (logrecord.body[constants.LOGLINE_NAME])

0         RAS KERNEL INFO instruction cache parity error...
1         RAS KERNEL INFO instruction cache parity error...
2         RAS KERNEL INFO instruction cache parity error...
3         RAS KERNEL INFO instruction cache parity error...
4         RAS KERNEL INFO instruction cache parity error...
                                ...                        
358455    RAS KERNEL FATAL idoproxy communication failur...
358456    RAS KERNEL FATAL idoproxy communication failur...
358457    RAS KERNEL FATAL idoproxy communication failur...
358458    RAS KERNEL FATAL idoproxy communication failur...
358459    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 358460, dtype: object


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selected[constants.LOG_TIMESTAMPS] = pd.to_datetime(


In [4]:
preprocessor = BGLPreprocessor(config.preprocessor_config)
preprocessed_filepath = os.path.join(config.output_dir, 'BGL_11k_processed.csv')            
logrecord = preprocessor.clean_log(logrecord)
logrecord.save_to_csv(preprocessed_filepath)
print (logrecord.body[constants.LOGLINE_NAME])

0         RAS KERNEL INFO instruction cache parity error...
1         RAS KERNEL INFO instruction cache parity error...
2         RAS KERNEL INFO instruction cache parity error...
3         RAS KERNEL INFO instruction cache parity error...
4         RAS KERNEL INFO instruction cache parity error...
                                ...                        
358455    RAS KERNEL FATAL idoproxy communication failur...
358456    RAS KERNEL FATAL idoproxy communication failur...
358457    RAS KERNEL FATAL idoproxy communication failur...
358458    RAS KERNEL FATAL idoproxy communication failur...
358459    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 358460, dtype: object


In [5]:
partitioner = OpenSetPartitioner(config.open_set_partitioner_config)
partitioned_filepath = os.path.join(config.output_dir, 'BGL_11k_nonparsed_session.csv')
logrecord = partitioner.partition(logrecord)
logrecord.save_to_csv(partitioned_filepath)
print (logrecord.body[constants.LOGLINE_NAME])

0       RAS KERNEL INFO instruction cache parity error...
1       RAS KERNEL INFO instruction cache parity error...
2       RAS KERNEL INFO instruction cache parity error...
3       RAS KERNEL INFO instruction cache parity error...
4       RAS KERNEL INFO instruction cache parity error...
                              ...                        
1848    RAS APP FATAL ciod Error reading message prefi...
1849    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
1850    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
1851    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
1852    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 1853, dtype: object


In [6]:
train_filepath = os.path.join(config.output_dir, 'BGL_11k_nonparsed_session_unsupervised_train.csv')
dev_filepath = os.path.join(config.output_dir, 'BGL_11k_nonparsed_session_unsupervised_dev.csv')
test_filepath = os.path.join(config.output_dir, 'BGL_11k_nonparsed_session_unsupervised_test.csv')

(train_data, dev_data, test_data) = split_train_dev_test_for_anomaly_detection(
    logrecord,
    training_type=config.training_type,
    test_data_frac_neg_class=config.test_data_frac_neg,
    test_data_frac_pos_class=config.test_data_frac_pos,
    shuffle=config.train_test_shuffle,
)

train_data.save_to_csv(train_filepath)
dev_data.save_to_csv(dev_filepath)
test_data.save_to_csv(test_filepath)
print ('Train/Dev/Test Anomalous', len(train_data.labels[train_data.labels[constants.LABELS]==1]), 
                                   len(dev_data.labels[dev_data.labels[constants.LABELS]==1]), 
                                   len(test_data.labels[test_data.labels[constants.LABELS]==1]))
print ('Train/Dev/Test Normal', len(train_data.labels[train_data.labels[constants.LABELS]==0]), 
                                   len(dev_data.labels[dev_data.labels[constants.LABELS]==0]), 
                                   len(test_data.labels[test_data.labels[constants.LABELS]==0]))

indices_train/dev/test:  32 4 1817
Train/Dev/Test Anomalous 0 0 1808
Train/Dev/Test Normal 32 4 9
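
The counts above are consistent with an unsupervised split in which the train and dev sets contain only normal sessions and every anomalous session is held out for testing. The sketch below illustrates that idea in plain Python; it is not LogAI's implementation, and the function name, the `test_frac_normal=0.2` and `dev_frac=0.1` values, and the rounding rule are assumptions chosen only so that the toy run matches the counts printed above.

```python
import random


def unsupervised_split(labels, test_frac_normal=0.2, dev_frac=0.1, seed=0):
    """Split session indices so that train/dev hold only normal sessions.

    labels: list of 0 (normal) / 1 (anomalous), one entry per session.
    Returns (train_idx, dev_idx, test_idx).
    """
    rng = random.Random(seed)
    normal = [i for i, y in enumerate(labels) if y == 0]
    anomalous = [i for i, y in enumerate(labels) if y == 1]
    rng.shuffle(normal)

    # Hold out a fraction of the normal sessions for testing.
    n_test_normal = round(len(normal) * test_frac_normal)
    test_normal, rest = normal[:n_test_normal], normal[n_test_normal:]

    # Split the remaining normal sessions into dev and train.
    n_dev = round(len(rest) * dev_frac)
    dev, train = rest[:n_dev], rest[n_dev:]

    # All anomalous sessions go to the test set.
    return train, dev, test_normal + anomalous


# 45 normal and 1808 anomalous sessions, as in the output above.
labels = [0] * 45 + [1] * 1808
train_idx, dev_idx, test_idx = unsupervised_split(labels)
print(len(train_idx), len(dev_idx), len(test_idx))  # 32 4 1817
```

With only normal sessions seen in training, the model learns what normal logs look like, and anomalies are expected to show up as sequences it predicts poorly.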


In [7]:
vectorizer = LogVectorizer(config.log_vectorizer_config)
vectorizer.fit(train_data)
train_features = vectorizer.transform(train_data)
dev_features = vectorizer.transform(dev_data)
test_features = vectorizer.transform(test_data)
print (train_features)

        

Dataset({
    features: ['labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 32
})


In [8]:
anomaly_detector = NNAnomalyDetector(config=config.nn_anomaly_detection_config)
anomaly_detector.fit(train_features, dev_features)

initialized data collator


The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 32
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 80
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
50,6.3477,7.127536


The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 4
  Batch size = 256
Saving model checkpoint to temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50
Configuration saved in temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50/config.json
Model weights saved in temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)
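
The prediction step of this workflow logs several per-sequence anomaly scores (`loss_mean`, `loss_max`, `loss_top6_mean`) and an AUC for each. How LogAI derives them is not shown in this notebook, so the following is only one plausible reading: aggregate the masked-token losses of a sequence in three ways, then compute AUC as the probability that an anomalous sequence scores higher than a normal one. The function names, the `loss_topk_mean` key, and the toy loss values are assumptions made for illustration.

```python
def sequence_scores(token_losses, k=6):
    """Aggregate the per-token losses of one sequence into three scores."""
    ordered = sorted(token_losses, reverse=True)
    top_k = ordered[:k]
    return {
        "loss_mean": sum(ordered) / len(ordered),
        "loss_max": ordered[0],
        "loss_topk_mean": sum(top_k) / len(top_k),
    }


def auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up per-token losses, not taken from the run in this notebook.
normal = [[2.5, 2.3, 2.2, 2.0], [2.4, 2.1, 1.9, 1.8]]
anomalous = [[8.3, 7.8, 7.3, 6.9], [8.5, 2.2, 2.0, 1.9]]

for name in ("loss_mean", "loss_max", "loss_topk_mean"):
    pos = [sequence_scores(s, k=2)[name] for s in anomalous]
    neg = [sequence_scores(s, k=2)[name] for s in normal]
    print(name, auc(pos, neg))
```

A top-k mean sits between the two extremes: it is less diluted by easy tokens than the plain mean and less sensitive to a single outlier than the maximum.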




In [9]:
predict_results = anomaly_detector.predict(test_features)
print (predict_results)

INFO:root:Loading model from /Users/amrita.saha/Home/salesforce/workspace/code/AIOps/RCA_Log/logai_opensource/logai/examples/jupyter_notebook/nn_ad_benchmarking/temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-100
loading configuration file /Users/amrita.saha/Home/salesforce/workspace/code/AIOps/RCA_Log/logai_opensource/logai/examples/jupyter_notebook/nn_ad_benchmarking/temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-100/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",



***** Running Prediction *****
  Num examples = 1858
  Batch size = 256
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


INFO:root:test_loss: 6.884676456451416 test_runtime: 380.0454 test_samples/s: 4.889
INFO:root:number of original test instances 1469
INFO:root:loss_mean Pos scores:  mean: 6.9087990307260405, std: 0.7953828009166797
INFO:root:loss_mean Neg scores: mean: 3.618975732840743, std: 1.6822439832876217
INFO:root:AUC of loss_mean: 0.9352159468438538
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.483360423272025, std: 0.3953119902978529
INFO:root:loss_max Neg scores: mean: 4.501153945922852, std: 2.1830750224544437
INFO:root:AUC of loss_max: 0.9408833300762166
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.648272041080209, std: 0.5242323353348409
INFO:root:loss_top6_mean Neg scores: mean: 3.905489637738182, std: 1.8918413511898604
INFO:root:AUC of loss_top6_mean: 0.9711745163181551
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.8799302817190238, std: 0.003586756438542158
INFO:root:scores_top6_max_prob Neg scores: mean: 0.8638905097863503, std: 0.0122384589137260

INFO:root:test_loss: 6.8940958976745605 test_runtime: 391.2892 test_samples/s: 4.748
INFO:root:number of original test instances 1521
***** Running Prediction *****
  Num examples = 1858
  Batch size = 256


INFO:root:test_loss: 6.845270156860352 test_runtime: 386.8702 test_samples/s: 4.803
INFO:root:number of original test instances 1578
INFO:root:loss_mean Pos scores:  mean: 6.907718217158561, std: 0.4043983715316972
INFO:root:loss_mean Neg scores: mean: 3.382695775026246, std: 1.0991956194923773
INFO:root:AUC of loss_mean: 1.0
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.707508697631253, std: 0.3208316680386524
INFO:root:loss_max Neg scores: mean: 4.9436929523944855, std: 2.270872183178602
INFO:root:AUC of loss_max: 0.9746019108280255
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.658333589694049, std: 0.26375133667216016
INFO:root:loss_top6_mean Neg scores: mean: 3.7447068177991443, std: 1.3644584454599087
INFO:root:AUC of loss_top6_mean: 1.0
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.8797631816321065, std: 0.0034920250748031135
INFO:root:scores_top6_max_prob Neg scores: mean: 0.8662098156960889, std: 0.01312254897278986
INFO:root:AUC of scores_to

INFO:root:test_loss: 6.902355194091797 test_runtime: 382.4702 test_samples/s: 4.858
INFO:root:number of original test instances 1640
***** Running Prediction *****
  Num examples = 1858
  Batch size = 256


INFO:root:test_loss: 6.868999004364014 test_runtime: 383.3589 test_samples/s: 4.847
INFO:root:number of original test instances 1683
INFO:root:loss_mean Pos scores:  mean: 6.920148221309522, std: 0.3566757207660673
INFO:root:loss_mean Neg scores: mean: 3.2478738335926436, std: 0.9785259340926624
INFO:root:AUC of loss_mean: 1.0
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.756247967392651, std: 0.3020146397331471
INFO:root:loss_max Neg scores: mean: 4.94701823592186, std: 2.270355316893744
INFO:root:AUC of loss_max: 0.9816417910447761
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.710901800642362, std: 0.24537859594446257
INFO:root:loss_top6_mean Neg scores: mean: 3.5826374517546755, std: 1.2310723233439638
INFO:root:AUC of loss_top6_mean: 1.0
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.8796587121719942, std: 0.0034385457377637844
INFO:root:scores_top6_max_prob Neg scores: mean: 0.8664167847406741, std: 0.013133275510927451
INFO:root:AUC of scores_to

INFO:root:test_loss: 6.907938003540039 test_runtime: 394.4318 test_samples/s: 4.711
INFO:root:number of original test instances 1731
***** Running Prediction *****
  Num examples = 1857
  Batch size = 256


INFO:root:test_loss: 6.864945888519287 test_runtime: 393.193 test_samples/s: 4.723
INFO:root:number of original test instances 1762
INFO:root:loss_mean Pos scores:  mean: 6.929079766560031, std: 0.33463519180139617
INFO:root:loss_mean Neg scores: mean: 3.1887111990196586, std: 0.9732760917406157
INFO:root:AUC of loss_mean: 1.0
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.78076688376093, std: 0.2998567894493911
INFO:root:loss_max Neg scores: mean: 5.17869359254837, std: 2.5063636922181893
INFO:root:AUC of loss_max: 0.967502850627138
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.763326366827785, std: 0.2483408765544348
INFO:root:loss_top6_mean Neg scores: mean: 3.5923239290714264, std: 1.2769347057849996
INFO:root:AUC of loss_top6_mean: 1.0
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.8796277117918756, std: 0.0034515691720516223
INFO:root:scores_top6_max_prob Neg scores: mean: 0.8665742223383859, std: 0.013015930285878012
INFO:root:AUC of scores_top6

INFO:root:test_loss: 6.903265953063965 test_runtime: 384.7599 test_samples/s: 4.826
INFO:root:number of original test instances 1803
***** Running Prediction *****
  Num examples = 1857
  Batch size = 256


INFO:root:test_loss: 6.851230144500732 test_runtime: 381.9937 test_samples/s: 4.861
INFO:root:number of original test instances 1816
INFO:root:loss_mean Pos scores:  mean: 6.932833810220314, std: 0.31970477373520695
INFO:root:loss_mean Neg scores: mean: 3.1603801690085693, std: 0.9341407161133001
INFO:root:AUC of loss_mean: 1.0
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.807894197959035, std: 0.2937742492019078
INFO:root:loss_max Neg scores: mean: 5.257125748528375, std: 2.4196701242926126
INFO:root:AUC of loss_max: 0.9444751890795057
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.7535741145295765, std: 0.2687491083267956
INFO:root:loss_top6_mean Neg scores: mean: 3.5896224975585938, std: 1.24861338300561
INFO:root:AUC of loss_top6_mean: 1.0
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.8796215265727366, std: 0.0034751428831847975
INFO:root:scores_top6_max_prob Neg scores: mean: 0.8678981571308808, std: 0.012591848174185542
INFO:root:AUC of scores_t

INFO:root:test_loss: 6.874952793121338 test_runtime: 378.4836 test_samples/s: 4.906
INFO:root:number of original test instances 1817


      indices  max_loss   sum_loss num_loss  \
0           0  2.477225  19.516182        9   
1           0  2.496417  17.288784        8   
2           1  2.511517  17.309793        8   
3           2  8.343674  60.673038        9   
4           3  2.544247     17.348        8   
...       ...       ...        ...      ...   
18571    1813   8.47681  52.004642        8   
18572    1814   8.77824  55.250885        9   
18573    1815  8.773056  55.483009        9   
18574    1815  8.790743   47.12072        8   
18575    1816  9.254669  64.524132        8   

                                               top6_loss  \
0      [2.477224826812744, 2.36291241645813, 2.281082...   
1      [2.4964170455932617, 2.313321113586426, 2.2107...   
2      [2.511516571044922, 2.361931324005127, 2.20818...   
3      [8.343673706054688, 7.839829921722412, 7.30509...   
4      [2.5442466735839844, 2.326444625854492, 2.2223...   
...                                                  ...   
18571  [8.47681