# CLX Asset Classification (Supervised)

## Authors
- Bhargav Suryadevara (NVIDIA)
- Gorkem Batmaz (NVIDIA)

## Table of Contents 
* Introduction
* Dataset
* Reading in the datasets
* Training and inference
* References

## Introduction

In this notebook, we show how to predict the function of a server from its Windows Event Logs using cuDF, cuML and PyTorch. The machines are labeled as DC, SQL, WEB, DHCP, MAIL and SAP, and the dependent variable is the machine type. The features are selected from Windows Event Logs, which are in a tabular format. This is a first step towards learning the behaviour of certain types of machines in data-centres by classifying them probabilistically. It could help to detect unusual behaviour in a data-centre: for example, a compromised computer might be acting as a web or database server while still carrying its original tag.

This work could be expanded by using different log types or different events from the machines as features to improve accuracy. Various labels can be selected to cover different types of machines or data-centres.

## Library imports

In [1]:
from clx.analytics.asset_classification import AssetClassification
import cudf
from cuml.preprocessing import train_test_split
from cuml.preprocessing import LabelEncoder
import torch
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import pandas as pd

## Initialize variables

A batch size of 10000 is chosen to optimise performance for this dataset. It can be changed depending on the data loading mechanism or the setup used.

`epochs` should also be adjusted depending on how quickly training converges for a specific dataset.

`label_col` is set to `'19'`: the 18 feature columns plus one for the dependent variable. The feature names appear in the dataframe preview below.

In [2]:
batch_size = 10000
label_col = '19'
epochs = 15

In [3]:
ac = AssetClassification()

## Read the dataset into a GPU dataframe with `cudf.read_csv()` 

The raw data had many other fields, most of which were either static or mostly blank. After filtering those out, 18 meaningful columns were left. We then categorized the feature columns and converted them to numeric values. The `AssetClassification` class includes a `categorize_columns` method that does this for your convenience.

```
win_events_gdf = ac.categorize_columns(win_events_18_features.gdf)
```
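
The idea behind categorizing a column can be sketched in plain Python. This is an illustration only, not the CLX implementation: the helper `categorize_column` and the sample values are hypothetical, and the choice to assign codes in sorted order is an assumption made for the example.

```python
def categorize_column(values):
    """Map each distinct value to an integer code (codes follow sorted order)."""
    categories = sorted(set(values))
    code_of = {value: code for code, value in enumerate(categories)}
    return [code_of[v] for v in values], categories


# Hypothetical raw values for a single feature column.
logon_process = ["Kerberos", "NtLmSsp", "Kerberos", "Advapi", "NtLmSsp"]
codes, categories = categorize_column(logon_process)

print(codes)       # integer codes, one per input row
print(categories)  # the distinct values, indexed by code
```

After this step every feature is an integer code, which is the form the categorized CSV loaded below already has.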

In [4]:
win_events_categorized_gdf = cudf.read_csv('win_events_18_features_categorized.csv')

In [5]:
win_events_categorized_gdf.head()

Unnamed: 0.1,Unnamed: 0,eventcode,keywords,privileges,message,sourcename,taskcategory,account_for_which_logon_failed_account_domain,detailed_authentication_information_authentication_package,detailed_authentication_information_key_length,detailed_authentication_information_logon_process,detailed_authentication_information_package_name_ntlm_only,logon_type,network_information_workstation_name,new_logon_security_id,impersonation_level,network_information_protocol,network_information_direction,filter_information_layer_name,label
0,0,0,1,0,15,0,4,22,0,0,5,0,1,932,38,3,6,1,1,1
1,1,14,1,0,7,0,5,22,3,2,6,1,6,932,25,3,6,1,1,0
2,2,14,1,0,7,0,5,22,3,2,6,1,6,932,25,3,6,1,1,0
3,3,14,1,0,7,0,5,22,3,2,6,1,6,932,25,3,6,1,1,0
4,4,14,1,0,7,0,5,22,3,2,6,1,6,932,25,3,6,1,1,0


### Split the dataset into training and test sets using the cuML `train_test_split` function
Column 19 (`label`) contains the ground truth about the function of the machine that each log comes from, i.e. DC, SQL, WEB, DHCP, MAIL or SAP. Hence it is used as the label.

In [6]:
X_train, X_test, Y_train, Y_test = train_test_split(win_events_categorized_gdf, "label", train_size=0.9)
X_train["label"] = Y_train

In [7]:
X_train.head()

Unnamed: 0.1,Unnamed: 0,eventcode,keywords,privileges,message,sourcename,taskcategory,account_for_which_logon_failed_account_domain,detailed_authentication_information_authentication_package,detailed_authentication_information_key_length,detailed_authentication_information_logon_process,detailed_authentication_information_package_name_ntlm_only,logon_type,network_information_workstation_name,new_logon_security_id,impersonation_level,network_information_protocol,network_information_direction,filter_information_layer_name,label
80525,80525,0,1,0,15,0,4,22,0,0,5,0,6,0,495,2,6,1,1,0
34663,34663,0,1,0,15,0,4,22,0,0,5,0,1,932,2116,2,6,1,1,2
19097,19097,0,1,0,15,0,4,22,0,0,5,0,1,932,6715,3,6,1,1,0
46525,46525,0,1,0,15,0,4,22,0,0,5,0,1,0,1894,2,6,1,1,0
32507,32507,0,1,0,15,0,4,22,0,0,5,0,1,932,1333,2,6,1,1,2


In [8]:
Y_train.unique()

0    0
1    1
2    2
3    3
4    4
5    5
Name: label, dtype: int64

### Print Labels
Make sure the test set contains all labels.

In [9]:
Y_test.unique()

0    0
1    1
2    2
3    3
4    4
5    5
Name: label, dtype: int64
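
Comparing the two `unique()` outputs by eye works for six classes; the same check can be automated. The sketch below uses plain Python lists and a hypothetical helper `missing_labels`; the label values are made up to show a failing case.

```python
def missing_labels(train_labels, test_labels):
    """Return the labels seen in training but absent from the test set."""
    return sorted(set(train_labels) - set(test_labels))


# Toy example in which class 4 never reaches the test set.
train_labels = [0, 1, 2, 3, 4, 5, 0, 2]
test_labels = [0, 1, 2, 3, 5]

print(missing_labels(train_labels, test_labels))  # -> [4]
```

An empty result means every class seen in training is also represented in the test set, which is the case for the split above.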

## Training 

Asset Classification training uses the fastai tabular model. More details can be found at https://github.com/fastai/fastai/blob/master/fastai/tabular/models.py#L6

Feature columns are embedded so that they can be used as categorical values. The embedding size limit can be changed depending on the accuracy obtained on the dataset.
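
As an illustration of how an embedding width and its limit interact, the sketch below uses a common heuristic: half the number of categories, rounded up, capped at a maximum. This rule, the `max_dim` parameter and the cardinalities are all assumptions for the example; the notebook does not state which rule `AssetClassification` applies.

```python
def embedding_sizes(cardinalities, max_dim=50):
    """Return {column: (n_categories, embedding_width)} using a capped heuristic."""
    return {
        name: (n, min(max_dim, (n + 1) // 2))
        for name, n in cardinalities.items()
    }


# Hypothetical numbers of distinct codes per feature column.
cardinalities = {"eventcode": 20, "logon_type": 8, "new_logon_security_id": 7000}
sizes = embedding_sizes(cardinalities)

for name, (n_categories, width) in sizes.items():
    print(f"{name}: {n_categories} categories -> {width}-dimensional embedding")
```

Raising the cap gives high-cardinality columns more capacity at the cost of more parameters, which is why it is worth tuning against validation accuracy.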

Adam is the optimizer used in the training process; it is popular because it produces good results across a variety of tasks. The Adam paper notes that the bias corrections of the first and second moment estimates can be folded into the step size, which gives the following effective step size at step $t$:

$$\alpha_{t}=\alpha \cdot \sqrt{1-\beta_{2}^{t}} /\left(1-\beta_{1}^{t}\right)$$

More details on Adam can be found at https://arxiv.org/pdf/1412.6980.pdf
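
To make the step-size formula concrete, here is a minimal scalar sketch of the Adam update as described in the paper, written with the folded step size $\alpha_t$. It is a toy for a single parameter, not the optimizer implementation used during training; the function name `adam_minimize` and the quadratic objective are chosen for illustration.

```python
import math


def adam_minimize(grad, theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a scalar function given its gradient, using Adam updates."""
    m = 0.0  # first moment estimate (running mean of gradients)
    v = 0.0  # second moment estimate (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias corrections folded into the step size, as in the formula above.
        alpha_t = alpha * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        theta -= alpha_t * m / (math.sqrt(v) + eps)
    return theta


# Minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
theta = adam_minimize(lambda x: 2 * (x - 3.0), theta=0.0)
print(theta)
```

The result should land close to the minimiser at 3. In the training call below, `lr` plays the role of $\alpha$.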

We found that partitioning the dataframes with a batch size of 10000 gave the best data loading performance for this dataset. The **batch_size** argument can be adjusted for datasets of different sizes.
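
How a batch size translates into partitions can be sketched as follows. The helper `batch_ranges` and the row count are hypothetical; this shows the arithmetic only and is not the CLX data loader.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


# Illustrative row count; the last batch is smaller than the rest.
ranges = list(batch_ranges(n_rows=25_000, batch_size=10_000))
print(ranges)  # -> [(0, 10000), (10000, 20000), (20000, 25000)]
```

A larger batch size means fewer, bigger partitions per epoch, so the best value depends on both dataset size and available GPU memory.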

In [10]:
# ac = AssetClassification()
ac.train_model(X_train, "label", batch_size, epochs, lr=0.01, wd=0.0)


training loss:  1.4262071646648788
valid loss 1.031 and accuracy 0.682
training loss:  0.897440600414207
valid loss 0.745 and accuracy 0.761
training loss:  0.6860394697254878
valid loss 0.626 and accuracy 0.806
training loss:  0.5719357293821185
valid loss 0.545 and accuracy 0.829
training loss:  0.4942147186719115
valid loss 0.477 and accuracy 0.846
training loss:  0.43338838699681853
valid loss 0.426 and accuracy 0.865
training loss:  0.3858283765975855
valid loss 0.387 and accuracy 0.878
training loss:  0.34896802049625425
valid loss 0.355 and accuracy 0.890
training loss:  0.3188021505160119
valid loss 0.334 and accuracy 0.896
training loss:  0.29705357024831397
valid loss 0.315 and accuracy 0.899
training loss:  0.2781167943085782
valid loss 0.302 and accuracy 0.901
training loss:  0.2630330801276372
valid loss 0.291 and accuracy 0.904
training loss:  0.25069803613619457
valid loss 0.283 and accuracy 0.907
training loss:  0.23921689129938362
valid loss 0.277 and accuracy 0.908

## Evaluation

In [11]:
pred_results = ac.predict(X_test).to_array()
true_results = Y_test.to_array()

In [12]:
ac.predict(X_test)

0       2
1       2
2       0
3       2
4       0
       ..
8204    0
8205    1
8206    0
8207    0
8208    0
Length: 8209, dtype: int64

In [13]:
type(ac.predict(X_test))

cudf.core.series.Series

In [14]:
# scikit-learn expects the ground truth first, then the predictions
f1_score_ = f1_score(true_results, pred_results, average='micro')
print('micro F1 score: %s'%(f1_score_))

micro F1 score: 0.9126568400535998
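
For single-label multiclass predictions such as these, the micro-averaged F1 score pools true positives, false positives and false negatives over all classes, which makes it equal to plain accuracy. A minimal pure-Python sketch with made-up toy labels (not the notebook's data) illustrates this; `micro_f1` is a hypothetical helper, not the scikit-learn implementation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over every class, then combine."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if t == c and p == c:
                tp += 1
            elif p == c:   # predicted c, but the true class differs
                fp += 1
            elif t == c:   # true class is c, but something else was predicted
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)
```

Because every misclassified row counts once as a false positive and once as a false negative, the score above says little about how individual classes fare.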


In [15]:
torch.cuda.empty_cache()

In [16]:
labels = ["DC","DHCP","MAIL","SAP","SQL","WEB"]
a = confusion_matrix(true_results, pred_results)

In [17]:
pd.DataFrame(a, index=labels, columns=labels)

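| True \ Predicted | DC | DHCP | MAIL | SAP | SQL | WEB |
|---|---|---|---|---|---|---|
| DC | 3401 | 35 | 14 | 9 | 69 | 7 |
| DHCP | 98 | 681 | 2 | 1 | 19 | 0 |
| MAIL | 15 | 0 | 2668 | 7 | 19 | 0 |
| SAP | 32 | 1 | 9 | 116 | 9 | 0 |
| SQL | 228 | 5 | 14 | 10 | 608 | 11 |
| WEB | 40 | 1 | 0 | 2 | 60 | 18 |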


The confusion matrix shows that the function of some machines can be predicted very well, whereas others need more tuning or more features. This work can be improved and expanded to cover individual data-centres, creating a realistic map of the network with ML rather than relying only on naming conventions. It could also help to detect larger-scale anomalies, such as multiple machines not acting according to their tags.
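
The per-class picture can be read directly from the confusion matrix above. The sketch below copies its counts into a plain list (rows are true classes, columns are predicted classes, following the scikit-learn convention) and computes precision and recall per class; `per_class_metrics` is a hypothetical helper.

```python
labels = ["DC", "DHCP", "MAIL", "SAP", "SQL", "WEB"]
matrix = [
    [3401, 35, 14, 9, 69, 7],
    [98, 681, 2, 1, 19, 0],
    [15, 0, 2668, 7, 19, 0],
    [32, 1, 9, 116, 9, 0],
    [228, 5, 14, 10, 608, 11],
    [40, 1, 0, 2, 60, 18],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} from a square confusion matrix."""
    metrics = {}
    for i, name in enumerate(labels):
        true_total = sum(matrix[i])                 # all rows whose true class is i
        pred_total = sum(row[i] for row in matrix)  # all rows predicted as class i
        recall = matrix[i][i] / true_total if true_total else 0.0
        precision = matrix[i][i] / pred_total if pred_total else 0.0
        metrics[name] = (precision, recall)
    return metrics


metrics = per_class_metrics(matrix, labels)
for name, (precision, recall) in metrics.items():
    print(f"{name:5s} precision={precision:.3f} recall={recall:.3f}")
```

On these counts, DC and MAIL are recalled for more than 96% of their test rows, while only 18 of the 121 WEB rows are identified correctly, most of the rest being predicted as SQL or DC.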

## References:
* https://github.com/fastai/fastai/blob/master/fastai/tabular/models.py#L6
* https://jovian.ml/aakashns/04-feedforward-nn
* https://www.kaggle.com/dienhoa/reverse-tabular-module-of-fast-ai-v1
* https://github.com/fastai/fastai/blob/master/fastai/layers.py#L44