# DGX Kernel Root Cause Analysis Acceleration & Predictive Maintenance using CLX-RAPIDS

## Authors
- Gorkem Batmaz (NVIDIA)
- Eli Fajardo (NVIDIA)

# Table of Contents 
* Introduction
* Dataset
* Reading the file
* Initialize/Load CLX module
* Training
* Evaluation of the model
* Conclusion
* References

# Introduction

Like any other Linux-based machine, DGX systems generate a vast amount of logs. Analysts spend hours trying to identify the root cause of each failure, and the number of possible root causes is effectively unbounded. Known patterns can help narrow the search; however, regular expressions can only identify patterns that have been seen before. Moreover, they create another manual task: maintaining the search script.

In this notebook, we show how GPUs can accelerate the analysis of these large log volumes using machine learning. Another benefit of a probabilistic analysis is that it can flag previously unseen root causes. To achieve this, we fine-tune a pre-trained BERT* model with a classification layer using the HuggingFace library.

Once the model can identify new root causes, it can also be deployed as a process running on the machines to predict failures before they happen.

*BERT stands for Bidirectional Encoder Representations from Transformers. The paper can be found [here.](https://arxiv.org/pdf/1810.04805.pdf)

## Dataset
* DGX Linux Kernel logs

The dataset comprises `kern.log` files from multiple DGX systems. Each line has been labelled as either `0` for `ordinary` or `1` for `root cause` by a script that uses some known patterns. We are especially interested in lines that are marked as ordinary in the test set but predicted as a root cause, as they may be new types of root causes of failures.
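The pattern-based labelling described above can be sketched roughly as follows. The patterns and sample lines here are hypothetical and chosen only for illustration; the notebook does not list the patterns used by the actual labelling script.

```python
import re

# Hypothetical patterns for illustration only; the real labelling script's
# patterns are not given in this notebook.
KNOWN_ROOT_CAUSE_PATTERNS = [
    re.compile(r"NVRM: Xid", re.IGNORECASE),
    re.compile(r"Out of memory", re.IGNORECASE),
]


def label_line(line: str) -> int:
    """Return 1 (root cause) if any known pattern matches, else 0 (ordinary)."""
    return int(any(p.search(line) for p in KNOWN_ROOT_CAUSE_PATTERNS))


# Made-up sample lines in a kern.log-like style.
sample_lines = [
    "kernel: NVRM: Xid (example device): example error text",
    "kernel: usb 1-1: new high-speed USB device number 2",
]
labels = [label_line(line) for line in sample_lines]
```

A line that matches none of the known patterns is always labelled `0`, which is why such a script cannot surface a new kind of root cause on its own.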

More information on Linux log types can be found [here.](https://help.ubuntu.com/community/LinuxLogFiles)

### Required Libraries

In [1]:
import cudf
from cuml.preprocessing.model_selection import train_test_split
from clx.analytics.sequence_classifier import SequenceClassifier
import s3fs
from os import path
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score

## Reading the file

In [2]:
dflogs = cudf.read_csv("kernel.tsv", delimiter='\t', header=None, names=['label', 'log'])

Each row in the `log` column holds one line from a `kern.log` file, and the `label` column indicates whether that line is ordinary or a root cause.
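To make the assumed file layout concrete, here is a minimal CPU-only sketch using the Python standard library instead of cuDF. It assumes the layout implied by the `read_csv` call: no header row, tab-delimited, label first and log line second. The sample rows are made up.

```python
import csv
import io
from collections import Counter

# Made-up sample in the assumed kernel.tsv layout: label <TAB> log line.
sample_tsv = (
    "0\tkernel: eth0: link up\n"
    "1\tkernel: example failure message\n"
    "0\tkernel: audit: initialized\n"
)

# QUOTE_NONE keeps any quote characters inside log lines untouched.
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [(int(label), log) for label, log in reader]

# Count how many lines carry each label to see the class balance.
label_counts = Counter(label for label, _ in rows)
```

Checking the label counts early is useful here because the dataset is imbalanced, which affects how the evaluation metrics should be read.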

## Initialize/Load CLX module
We initialize the CLX sequence classifier module with a pre-trained BERT model. The pre-trained model we use is located at https://huggingface.co/jeniya/BERTOverflow. The standard BERT model produced similar performance, but we chose this one because it was trained on more relevant text, which might help inference performance in production. For more information on the model, please see the paper at https://arxiv.org/pdf/2005.01634.pdf.


In [3]:
seq_classifier = SequenceClassifier()
seq_classifier.init_model("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Training 

Part of the dataset will be used for fine-tuning the model. The rest of the dataset will be used as the test set to evaluate if the model is useful. With default settings, 80% of the dataset will be the training set.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(dflogs, dflogs.label)
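The idea of a shuffled 80/20 split can be illustrated with a plain-Python sketch. This is not cuML's implementation; the function name `split_80_20`, the fixed seed, and the toy rows are assumptions made for the example.

```python
import random


def split_80_20(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy (label, log) rows standing in for the real dataset.
rows = [(i % 2, f"log line {i}") for i in range(10)]
train_rows, test_rows = split_80_20(rows)
```

With an imbalanced dataset, a purely random split can leave very few root-cause lines in the test set, so it may be worth checking the label counts in each part after splitting.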

We run the training. The number of epochs should be adjusted for each dataset.

In [5]:
seq_classifier.train_model(X_train["log"], y_train, epochs=1)

Epoch: 100%|██████████| 1/1 [24:44<00:00, 1484.55s/it]

Train loss: 0.00996842665674772





## Evaluation of the model

`evaluate_model` returns the accuracy on the test set.

In [6]:
seq_classifier.evaluate_model(X_test["log"], y_test)

0.9992424242424243

We get the predictions from the model.

In [7]:
test_preds = seq_classifier.predict(X_test["log"], batch_size=128)

In [8]:
tests = test_preds[0].to_array()
true_labels = X_test.label.to_array()

We calculate the F1 score, since the dataset is not balanced.

In [9]:
f1_score(true_labels, tests)

0.9809410363311495

Accuracy is higher than the F1 score. Because the label distribution is imbalanced, accuracy alone can overstate performance.

We can use a confusion matrix to check how many lines of each label are predicted as labelled.

In [10]:
confusion_matrix(true_labels, tests)

array([[82756,    64],
       [    0,  1647]])
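As a cross-check, the metrics can be recomputed by hand from the confusion matrix above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are chosen for this sketch.

```python
# Counts copied from the confusion matrix above
# (rows: true label, columns: predicted label).
tn, fp = 82756, 64   # true ordinary: predicted ordinary / predicted root cause
fn, tp = 0, 1647     # true root cause: predicted ordinary / predicted root cause

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

The F1 value computed this way agrees with the `f1_score` output above. Recall is 1.0 because no labelled root cause was missed, so the gap between F1 and accuracy comes entirely from the 64 ordinary lines predicted as root causes.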

# Conclusion

The confusion matrix shows that 64 of the 82820 lines labelled as ordinary are predicted as a root cause of problems or failures. These 64 lines were not caught by the known patterns, so a regex-only search would have missed them. The lines identified by this model may indicate a problem hours before the actual failure or outage happens. This approach can be implemented on the machines to warn users well before problems occur, so that corrective action can be taken.
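A minimal sketch of how those candidate lines could be pulled out for review is shown below. It uses plain Python lists with made-up inputs; the function name `candidate_root_causes` is hypothetical and not part of CLX.

```python
def candidate_root_causes(logs, true_labels, predicted_labels):
    """Return lines labelled ordinary (0) that the model predicts as root cause (1)."""
    return [
        log
        for log, true, pred in zip(logs, true_labels, predicted_labels)
        if true == 0 and pred == 1
    ]


# Made-up inputs for illustration.
logs = ["line a", "line b", "line c", "line d"]
true_labels = [0, 0, 1, 0]
predicted = [0, 1, 1, 1]
candidates = candidate_root_causes(logs, true_labels, predicted)
```

The resulting list is what an analyst would inspect manually, since some of these lines may be false positives rather than new root causes.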

# References
* https://github.com/huggingface/transformers/tree/master/examples#
* https://arxiv.org/pdf/1810.04805.pdf
* https://arxiv.org/pdf/2005.01634.pdf