### Log Anomaly Detection on HDFS Dataset using LogBERT model
This is a running example of the end-to-end workflow for Log Anomaly Detection on the public HDFS dataset using the LogBERT model.

It closely follows the workflow that uses the LSTM model, so we only point out the model-specific portions, i.e. the parts that differ between the two models.

For a more complete walkthrough of the full workflow, please refer to the `hdfs_lstm_unsupervised_parsed_sequential.ipynb` notebook.

We skip the parsing step in this workflow because the LogBERT model does not need parsed logs; it works directly on the preprocessed log data (with or without log partitioning).

In [1]:
import os 
from logai.applications.openset.anomaly_detection.openset_anomaly_detection_workflow import OpenSetADWorkflowConfig, validate_config_dict
from logai.utils.file_utils import read_file
from logai.utils.dataset_utils import split_train_dev_test_for_anomaly_detection
import logging 
from logai.dataloader.data_loader import FileDataLoader
from logai.preprocess.hdfs_preprocessor import HDFSPreprocessor
from logai.information_extraction.log_parser import LogParser
from logai.preprocess.openset_partitioner import OpenSetPartitioner
from logai.analysis.nn_anomaly_detector import NNAnomalyDetector
from logai.information_extraction.log_vectorizer import LogVectorizer
from logai.utils import constants

In [2]:
config_path = "configs/hdfs_logbert_config.yaml"
config_parsed = read_file(config_path)
config_dict = config_parsed["workflow_config"]

config = OpenSetADWorkflowConfig.from_dict(config_dict)

In [3]:
dataloader = FileDataLoader(config.data_loader_config)
logrecord = dataloader.load_data()
print (logrecord.body[constants.LOGLINE_NAME])

0       Receiving block blk_-1608999687919862906 src: ...
1       BLOCK* NameSystem.allocateBlock: /mnt/hadoop/m...
2       Receiving block blk_-1608999687919862906 src: ...
3       Receiving block blk_-1608999687919862906 src: ...
4       PacketResponder 1 for block blk_-1608999687919...
                              ...                        
4514    Deleting block blk_-2126554733521224025 file /...
4515    Deleting block blk_-66330728533676520 file /mn...
4516    Deleting block blk_872694497849122755 file /mn...
4517    Deleting block blk_3947106522258141922 file /m...
4518    Deleting block blk_-774246298521956028 file /m...
Name: logline, Length: 4519, dtype: object


In [4]:
preprocessor = HDFSPreprocessor(config.preprocessor_config, config.label_filepath)
preprocessed_filepath = os.path.join(config.output_dir, 'HDFS_5k_processed.csv')            
logrecord = preprocessor.clean_log(logrecord)
logrecord.save_to_csv(preprocessed_filepath)
print (logrecord.body[constants.LOGLINE_NAME])

0           Receiving block BLOCK src IP INT dest IP INT 
1       BLOCK NameSystem.allocateBlock /mnt/hadoop/map...
2           Receiving block BLOCK src IP INT dest IP INT 
3           Receiving block BLOCK src IP INT dest IP INT 
4         PacketResponder INT for block BLOCK terminating
                              ...                        
4514    Deleting block BLOCK file /mnt/hadoop/dfs/data...
4515    Deleting block BLOCK file /mnt/hadoop/dfs/data...
4516    Deleting block BLOCK file /mnt/hadoop/dfs/data...
4517    Deleting block BLOCK file /mnt/hadoop/dfs/data...
4518    Deleting block BLOCK file /mnt/hadoop/dfs/data...
Name: logline, Length: 4519, dtype: object
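
The output above shows block ids, IP addresses and integers replaced by the placeholder tokens `BLOCK`, `IP` and `INT`. The following is a minimal sketch of that kind of substitution using only the standard `re` module. The regular expressions, the `clean_logline` helper and the sample log line are assumptions made for illustration; the patterns that `HDFSPreprocessor` actually applies come from the preprocessor config and may differ (for example in how punctuation is handled).

```python
import re

# Hypothetical (pattern, placeholder) pairs, applied in order.
# The real replacement rules are defined in the workflow's YAML config.
PATTERNS = [
    (re.compile(r"blk_-?\d+"), "BLOCK"),            # HDFS block ids
    (re.compile(r"\d{1,3}(?:\.\d{1,3}){3}"), "IP"),  # dotted IPv4 addresses
    (re.compile(r"\b\d+\b"), "INT"),                 # remaining standalone integers
]


def clean_logline(line: str) -> str:
    """Replace volatile fields with placeholder tokens."""
    for pattern, token in PATTERNS:
        line = pattern.sub(token, line)
    return line


# Illustrative log line, not taken from the notebook output.
sample = ("Receiving block blk_-1608999687919862906 "
          "src: /10.250.19.102:54106 dest: /10.250.19.102:50010")
cleaned = clean_logline(sample)
print(cleaned)
```

Order matters in this sketch: the integer pattern runs last so that it does not break up block ids or IP addresses before they are matched.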


In [5]:
partitioner = OpenSetPartitioner(config.open_set_partitioner_config)
partitioned_filepath = os.path.join(config.output_dir, 'HDFS_5k_nonparsed_session.csv')
logrecord = partitioner.partition(logrecord)
logrecord.save_to_csv(partitioned_filepath)
print (logrecord.body[constants.LOGLINE_NAME])

0      Receiving block BLOCK src IP INT dest IP INT [...
1      Receiving block BLOCK src IP INT dest IP INT [...
2      Receiving block BLOCK src IP INT dest IP INT [...
3      Receiving block BLOCK src IP INT dest IP INT [...
4      Receiving block BLOCK src IP INT dest IP INT [...
                             ...                        
105    Receiving block BLOCK src IP INT dest IP INT [...
106    Receiving block BLOCK src IP INT dest IP INT [...
107    Receiving block BLOCK src IP INT dest IP INT [...
108    Receiving block BLOCK src IP INT dest IP INT [...
109    Receiving block BLOCK src IP INT dest IP INT [...
Name: logline, Length: 110, dtype: object
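
Partitioning reduces the 4519 individual log lines to 110 session-level records. The sketch below illustrates the general idea of session partitioning in plain Python: lines that share a session id are concatenated in their original order. The `partition_by_session` helper, the toy session ids and the `" [SEP] "` separator are assumptions for illustration only; the actual grouping key and separator are determined by `OpenSetPartitioner` and its config.

```python
def partition_by_session(records, sep=" [SEP] "):
    """Group (session_id, logline) pairs into one string per session.

    `sep` is a hypothetical separator; dicts keep insertion order,
    so sessions appear in order of first occurrence.
    """
    sessions = {}
    for session_id, line in records:
        sessions.setdefault(session_id, []).append(line)
    return {sid: sep.join(lines) for sid, lines in sessions.items()}


# Toy input: five cleaned log lines belonging to two sessions.
records = [
    ("b1", "Receiving block BLOCK src IP INT dest IP INT"),
    ("b2", "Receiving block BLOCK src IP INT dest IP INT"),
    ("b1", "PacketResponder INT for block BLOCK terminating"),
    ("b2", "Deleting block BLOCK file"),
    ("b1", "Deleting block BLOCK file"),
]
sessions = partition_by_session(records)
for sid, text in sessions.items():
    print(sid, "->", text)
```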


In [6]:
train_filepath = os.path.join(config.output_dir, 'HDFS_5k_nonparsed_session_supervised_train.csv')
dev_filepath = os.path.join(config.output_dir, 'HDFS_5k_nonparsed_session_supervised_dev.csv')
test_filepath = os.path.join(config.output_dir, 'HDFS_5k_nonparsed_session_supervised_test.csv')

(train_data, dev_data, test_data) = split_train_dev_test_for_anomaly_detection(
                logrecord,training_type=config.training_type,
                test_data_frac_neg_class=config.test_data_frac_neg,
                test_data_frac_pos_class=config.test_data_frac_pos,
                shuffle=config.train_test_shuffle
            )

train_data.save_to_csv(train_filepath)
dev_data.save_to_csv(dev_filepath)
test_data.save_to_csv(test_filepath)
print ('Train/Dev/Test Anomalous', len(train_data.labels[train_data.labels[constants.LABELS]==1]), 
                                   len(dev_data.labels[dev_data.labels[constants.LABELS]==1]), 
                                   len(test_data.labels[test_data.labels[constants.LABELS]==1]))
print ('Train/Dev/Test Normal', len(train_data.labels[train_data.labels[constants.LABELS]==0]), 
                                   len(dev_data.labels[dev_data.labels[constants.LABELS]==0]), 
                                   len(test_data.labels[test_data.labels[constants.LABELS]==0]))

indices_train/dev/test:  75 8 27
Train/Dev/Test Anomalous 0 0 10
Train/Dev/Test Normal 75 8 17
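
The counts above show the shape of the split: the train and dev sets contain only normal sessions, while every anomalous session ends up in the test set together with a share of the normal ones. The sketch below reproduces that shape in plain Python. It is not the implementation of `split_train_dev_test_for_anomaly_detection`; the `split_unsupervised` helper and its `test_frac_neg` and `dev_frac` values are hypothetical, so the exact counts will not match the notebook.

```python
import random


def split_unsupervised(labels, test_frac_neg=0.2, dev_frac=0.1, seed=0):
    """Return (train, dev, test) index lists for a list of 0/1 labels.

    Assumed scheme: all anomalies (label 1) go to test, a fraction of the
    normal instances joins them, and the remaining normals are divided
    into dev and train.
    """
    rng = random.Random(seed)
    normal = [i for i, y in enumerate(labels) if y == 0]
    anomalous = [i for i, y in enumerate(labels) if y == 1]
    rng.shuffle(normal)

    n_test_neg = int(len(normal) * test_frac_neg)
    test = anomalous + normal[:n_test_neg]
    rest = normal[n_test_neg:]

    n_dev = int(len(rest) * dev_frac)
    dev, train = rest[:n_dev], rest[n_dev:]
    return train, dev, test


labels = [0] * 100 + [1] * 10  # toy labels: 100 normal, 10 anomalous
train, dev, test = split_unsupervised(labels)
print("train/dev/test sizes:", len(train), len(dev), len(test))
print("anomalies in test:", sum(labels[i] for i in test))
```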


### Transforming log data to vectors using LogVectorizer

This step is the same in code regardless of which vectorizer or anomaly detector you choose. The differences live in the config YAML file, where you specify the algorithm name and its parameters. Each vectorizer algorithm has its own Config or Param dataclass that stores its hyperparameters.
For the exact parameters of a given algorithm, see the documentation of that algorithm's config class.
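
The config-driven pattern described here can be sketched in a few lines of plain Python: a name in the config selects the algorithm, and a per-algorithm dataclass holds its parameters. Everything in this sketch is hypothetical (the `Toy*` classes, the `REGISTRY` mapping, the `build_vectorizer` helper and the `algo_name` / `algo_params` keys); it only illustrates why the calling code stays the same when the config changes, and it is not LogAI's actual `LogVectorizer` implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyWordSplitParams:      # hypothetical params dataclass
    lowercase: bool = True


@dataclass
class ToyCharNgramParams:      # hypothetical params dataclass
    n: int = 3


class ToyWordSplit:
    def __init__(self, params: ToyWordSplitParams):
        self.params = params

    def transform(self, lines):
        return [(l.lower() if self.params.lowercase else l).split() for l in lines]


class ToyCharNgram:
    def __init__(self, params: ToyCharNgramParams):
        self.params = params

    def transform(self, lines):
        n = self.params.n
        return [[l[i:i + n] for i in range(len(l) - n + 1)] for l in lines]


# Algorithm name -> (params dataclass, algorithm class).
REGISTRY = {
    "word_split": (ToyWordSplitParams, ToyWordSplit),
    "char_ngram": (ToyCharNgramParams, ToyCharNgram),
}


def build_vectorizer(config: dict):
    """Pick the algorithm by name and build its params from the config dict."""
    params_cls, algo_cls = REGISTRY[config["algo_name"]]
    return algo_cls(params_cls(**config.get("algo_params", {})))


# The calling code is identical for both configs; only the dict changes.
for cfg in ({"algo_name": "word_split"},
            {"algo_name": "char_ngram", "algo_params": {"n": 2}}):
    vectorizer_sketch = build_vectorizer(cfg)
    print(cfg["algo_name"], vectorizer_sketch.transform(["abc"]))
```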

In [7]:
vectorizer = LogVectorizer(config.log_vectorizer_config)
vectorizer.fit(train_data)
train_features = vectorizer.transform(train_data)
dev_features = vectorizer.transform(dev_data)
test_features = vectorizer.transform(test_data)
print (train_features)

        


Dataset({
    features: ['labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 75
})


### Anomaly Detection using LogBERT 

This step is the same in code regardless of which neural anomaly detector (NNAnomalyDetector) you choose. The differences live in the config YAML file, where you specify the algorithm name and its parameters. Each algorithm has its own Config or Param dataclass that stores its hyperparameters.
For the exact parameters of a given algorithm, see the documentation of that algorithm's config class.

In [8]:
anomaly_detector = NNAnomalyDetector(config=config.nn_anomaly_detection_config)
anomaly_detector.fit(train_features, dev_features)

initialized data collator


The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 75
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 190
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step  Training Loss  Validation Loss
  50         1.5265         0.862205
 100         0.7446         0.607761
 150         0.7179         0.460633
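
The training log names `BertForMaskedLM` and a data collator, so the losses above are masked-language-modeling losses: some input tokens are hidden and the model is trained to recover them. The sketch below shows the core of that masking step in plain Python. It is a simplification under stated assumptions: the `mask_tokens` helper and the `mask_prob` value of 0.15 are hypothetical, the masking works on string tokens rather than token ids, and the collator used by the workflow may apply a more elaborate scheme.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for one token sequence.

    Each token is hidden with probability `mask_prob`. `targets` holds the
    original token at masked positions and None elsewhere, so a loss would
    only be computed where the model has to guess.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


tokens = "Receiving block BLOCK src IP INT dest IP INT".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

A model trained this way on normal sessions only should reconstruct masked tokens of normal sessions with low loss, which is the intuition behind using the loss as an anomaly score.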


The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8
  Batch size = 256
Saving model checkpoint to temp_output/HDFS_5k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50
Configuration saved in temp_output/HDFS_5k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50/config.json
Model weights saved in temp_output/HDFS_5k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 8
  Batch size = 256
Saving mod

In [9]:
predict_results = anomaly_detector.predict(test_features)
print (predict_results)

INFO:root:Loading model from /Users/amrita.saha/Home/salesforce/workspace/code/AIOps/RCA_Log/logai_opensource/logai/temp_output/HDFS_5k_parsed_session_supervised_AD/bert-base-cased/checkpoint-150
loading configuration file /Users/amrita.saha/Home/salesforce/workspace/code/AIOps/RCA_Log/logai_opensource/logai/temp_output/HDFS_5k_parsed_session_supervised_AD/bert-base-cased/checkpoint-150/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.23.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_

  0%|          | 0/27 [00:00<?, ?ba/s]

***** Running Prediction *****
  Num examples = 105
  Batch size = 256
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


INFO:root:test_loss: 1.014678955078125 test_runtime: 83.2527 test_samples/s: 1.261
INFO:root:number of original test instances 27
INFO:root:loss_mean Pos scores:  mean: 2.059218628594652, std: 1.7926128562940402
INFO:root:loss_mean Neg scores: mean: 0.5299380107454079, std: 0.39768380503463113
INFO:root:AUC of loss_mean: 0.7235294117647059
INFO:root:

INFO:root:loss_max Pos scores:  mean: 5.7560831069946286, std: 3.14190306540753
INFO:root:loss_max Neg scores: mean: 2.943949373329387, std: 1.7687664243041483
INFO:root:AUC of loss_max: 0.7235294117647059
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 2.400115772363885, std: 1.992208159608372
INFO:root:loss_top6_mean Neg scores: mean: 0.6830175074237381, std: 0.5114021108113075
INFO:root:AUC of loss_top6_mean: 0.7235294117647059
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.21630861336986226, std: 0.05769911577260041
INFO:root:scores_top6_max_prob Neg scores: mean: 0.18187326222073796, std: 0.014334467801007033

INFO:root:test_loss: 0.9861030578613281 test_runtime: 82.683 test_samples/s: 1.258
INFO:root:number of original test instances 27
***** Running Prediction *****
  Num examples = 104
  Batch size = 256


INFO:root:test_loss: 1.1745320558547974 test_runtime: 88.791 test_samples/s: 1.171
INFO:root:number of original test instances 27
INFO:root:loss_mean Pos scores:  mean: 2.1219184682465544, std: 1.6983737911106986
INFO:root:loss_mean Neg scores: mean: 0.5928176043682969, std: 0.2924837395173191
INFO:root:AUC of loss_mean: 0.7470588235294118
INFO:root:

INFO:root:loss_max Pos scores:  mean: 6.691045713424683, std: 2.9804914308932755
INFO:root:loss_max Neg scores: mean: 4.090087729341843, std: 1.3805904426704225
INFO:root:AUC of loss_max: 0.6882352941176471
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 3.4820845513397614, std: 2.2425742149074077
INFO:root:loss_top6_mean Neg scores: mean: 1.3439611899209956, std: 0.7046433189823426
INFO:root:AUC of loss_top6_mean: 0.7764705882352941
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.31607609084910815, std: 0.028842391046279624
INFO:root:scores_top6_max_prob Neg scores: mean: 0.2965105602067281, std: 0.004912850334116

INFO:root:test_loss: 1.1727038621902466 test_runtime: 93.7178 test_samples/s: 1.11
INFO:root:number of original test instances 27
***** Running Prediction *****
  Num examples = 104
  Batch size = 256


INFO:root:test_loss: 1.1584101915359497 test_runtime: 86.6201 test_samples/s: 1.201
INFO:root:number of original test instances 27
INFO:root:loss_mean Pos scores:  mean: 2.2554494030398713, std: 1.8150099338811976
INFO:root:loss_mean Neg scores: mean: 0.609534556013258, std: 0.32165884581679327
INFO:root:AUC of loss_mean: 0.8352941176470587
INFO:root:

INFO:root:loss_max Pos scores:  mean: 7.351688361167907, std: 2.7813355088245744
INFO:root:loss_max Neg scores: mean: 4.5056345182306625, std: 1.5344308862051492
INFO:root:AUC of loss_max: 0.7441176470588234
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 4.887380647312642, std: 2.8028220232620416
INFO:root:loss_top6_mean Neg scores: mean: 1.8708893243418216, std: 1.0048697887345033
INFO:root:AUC of loss_top6_mean: 0.8352941176470587
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.326788921530048, std: 0.03757485806268242
INFO:root:scores_top6_max_prob Neg scores: mean: 0.32176246679101894, std: 0.0151382380065404

INFO:root:test_loss: 0.9672116637229919 test_runtime: 83.5388 test_samples/s: 1.245
INFO:root:number of original test instances 27
***** Running Prediction *****
  Num examples = 104
  Batch size = 256


INFO:root:test_loss: 0.969925582408905 test_runtime: 84.7546 test_samples/s: 1.227
INFO:root:number of original test instances 27
INFO:root:loss_mean Pos scores:  mean: 2.349126224921584, std: 2.1523467967824894
INFO:root:loss_mean Neg scores: mean: 0.53474554428909, std: 0.28215333184122304
INFO:root:AUC of loss_mean: 0.8294117647058823
INFO:root:

INFO:root:loss_max Pos scores:  mean: 7.653564143180847, std: 3.0611919169082427
INFO:root:loss_max Neg scores: mean: 4.604733425028184, std: 1.4066104234214887
INFO:root:AUC of loss_max: 0.7441176470588234
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 5.196850432952244, std: 2.9924530904184525
INFO:root:loss_top6_mean Neg scores: mean: 2.0070443147263837, std: 1.023917104014254
INFO:root:AUC of loss_top6_mean: 0.8294117647058823
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.33317578554981286, std: 0.04539746382246543
INFO:root:scores_top6_max_prob Neg scores: mean: 0.3474506453117904, std: 0.015463721962154215

INFO:root:test_loss: 1.01661217212677 test_runtime: 83.5752 test_samples/s: 1.244
INFO:root:number of original test instances 27
***** Running Prediction *****
  Num examples = 104
  Batch size = 256


INFO:root:test_loss: 1.0347822904586792 test_runtime: 84.5661 test_samples/s: 1.23
INFO:root:number of original test instances 27
INFO:root:loss_mean Pos scores:  mean: 2.4548787350300554, std: 2.3441222706009706
INFO:root:loss_mean Neg scores: mean: 0.4871146835758065, std: 0.214966075456876
INFO:root:AUC of loss_mean: 0.7999999999999999
INFO:root:

INFO:root:loss_max Pos scores:  mean: 7.775362467765808, std: 3.1761796382891503
INFO:root:loss_max Neg scores: mean: 4.826125748017255, std: 1.4327840530789102
INFO:root:AUC of loss_max: 0.7441176470588234
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 5.497082110235675, std: 3.2717929341947243
INFO:root:loss_top6_mean Neg scores: mean: 2.172644008081893, std: 0.9551029668480685
INFO:root:AUC of loss_top6_mean: 0.7941176470588235
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.33998988034824523, std: 0.05070171762118567
INFO:root:scores_top6_max_prob Neg scores: mean: 0.3777509571047, std: 0.007681180353872846

INFO:root:test_loss: 0.9827648997306824 test_runtime: 82.3832 test_samples/s: 1.262
INFO:root:number of original test instances 27


     indices   max_loss   sum_loss num_loss  \
0          0   0.078862   0.400576        8   
1          0  10.861187   78.08194        8   
2          1   8.379167   43.85025        9   
3          1   0.332565   1.821065        8   
4          1   8.497158  52.730793        8   
...      ...        ...        ...      ...   
1036      25   0.028344   0.152666        8   
1037      26    6.27078  32.057602        8   
1038      26   5.805102  25.056643        8   
1039      26   1.282481   5.625502        8   
1040      26   0.028524    0.15167        8   

                                              top6_loss  \
0     [0.0788615420460701, 0.07201781123876572, 0.05...   
1     [10.861186981201172, 10.619929313659668, 10.04...   
2     [8.379166603088379, 7.949586391448975, 7.41764...   
3     [0.33256518840789795, 0.2971160411834717, 0.25...   
4     [8.49715805053711, 8.332608222961426, 7.788794...   
...                                                 ...   
1036  [0.0283436067402

In [10]:
type(predict_results)

pandas.core.frame.DataFrame
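
The returned DataFrame holds several rows per test instance (the `indices` column repeats), while the log reports instance-level scores such as `loss_mean` and `loss_max` together with their AUC. The sketch below shows one plausible way to go from row-level losses to instance-level scores and a ranking-based AUC, using only the standard library. The aggregation rule, the helper names and the toy numbers are assumptions for illustration; they are not taken from the LogAI implementation.

```python
from collections import defaultdict


def aggregate_by_index(rows):
    """rows: iterable of (index, max_loss, sum_loss, num_loss) tuples.

    Assumed aggregation: loss_max is the largest row-level max_loss and
    loss_mean is total loss divided by the total number of loss terms.
    """
    acc = defaultdict(lambda: {"max": float("-inf"), "sum": 0.0, "num": 0})
    for idx, max_loss, sum_loss, num_loss in rows:
        a = acc[idx]
        a["max"] = max(a["max"], max_loss)
        a["sum"] += sum_loss
        a["num"] += num_loss
    return {idx: {"loss_max": a["max"], "loss_mean": a["sum"] / a["num"]}
            for idx, a in acc.items()}


def auc(pos_scores, neg_scores):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Toy rows and labels, made up for this example.
rows = [(0, 1.0, 4.0, 8), (0, 3.0, 12.0, 8), (1, 9.0, 40.0, 8), (2, 0.5, 2.0, 8)]
toy_labels = {0: 0, 1: 1, 2: 0}

scores = aggregate_by_index(rows)
pos = [s["loss_mean"] for i, s in scores.items() if toy_labels[i] == 1]
neg = [s["loss_mean"] for i, s in scores.items() if toy_labels[i] == 0]
print(scores)
print("AUC of loss_mean:", auc(pos, neg))
```

With 10 anomalous and 17 normal test sessions, as in the split above, there are 170 such pairs, which is consistent with logged values such as 0.7235294117647059 (123/170).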