# CLX Asset Classification (Supervised)

## Authors
- Bhargav Suryadevara (NVIDIA)
- Gorkem Batmaz (NVIDIA)

## Table of Contents 
* Introduction
* Dataset
* Reading in the datasets
* Training and inference
* References

# Introduction

In this notebook, we show how to predict the function of a server from its Windows Event Logs using cudf, cuml and pytorch. The machines are labeled as DC, SQL, WEB, DHCP, MAIL and SAP. The dependent variable is the type of the machine. The features are selected from Windows Event Logs, which are in a tabular format. This is a first step towards learning the behaviours of certain types of machines in data-centres by classifying them probabilistically. It could help to detect unusual behaviour in a data-centre. For example, a compromised computer might be acting as a web or database server while still carrying its original tag.

This work could be expanded by using different log types or different events from the machines as features to improve accuracy. Various labels can be selected to cover different types of machines or data-centres.

## Library imports

In [1]:
from clx.analytics.asset_classification import AssetClassification
import cudf
from cuml.preprocessing import train_test_split
from cuml.preprocessing import LabelEncoder
import torch
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import pandas as pd

## Initialize variables

A batch size of 10000 is chosen to optimise performance for this dataset. It can be changed depending on the data loading mechanism or the setup used.

`epochs` should also be adjusted depending on convergence for a specific dataset.

`label_col` is the name of the column that holds the dependent variable. Columns 1 to 18 hold the features and column 19 holds the label. Feature names are listed below.

In [2]:
batch_size = 10000
label_col = '19'
epochs = 15

In [3]:
ac = AssetClassification()

## Read the dataset into a GPU dataframe with `cudf.read_csv()` 

In [4]:
win_events_on_gpu = cudf.read_csv('win_events_18_features.csv')

In [5]:
win_events_on_gpu.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,4624,Audit Success,,An account was successfully logged on.,Microsoft Windows security auditing.,Logon,,Kerberos,0.0,Kerberos,-,3.0,,NVIDIA.COM\MCRAIGHEAD-LT2$,,,,,DHCP
1,4756,Audit Success,,A member was added to a security-enabled unive...,Microsoft Windows security auditing.,Security Group Management,,,,,,,,,,,,,DC
2,4756,Audit Success,,A member was added to a security-enabled unive...,Microsoft Windows security auditing.,Security Group Management,,,,,,,,,,,,,DC
3,4756,Audit Success,,A member was added to a security-enabled unive...,Microsoft Windows security auditing.,Security Group Management,,,,,,,,,,,,,DC
4,4756,Audit Success,,A member was added to a security-enabled unive...,Microsoft Windows security auditing.,Security Group Management,,,,,,,,,,,,,DC


The raw data had many other fields, most of which were either static or mostly blank. After filtering those out, 18 meaningful columns remained; they are listed below.

In [6]:
features = {
    "1" : "eventcode",
    "2" : "keywords",
    "3" : "privileges",
    "4" : "message",
    "5" : "sourcename", 
    "6" : "taskcategory",
    "7" : "account_for_which_logon_failed_account_domain",
    "8" : "detailed_authentication_information_authentication_package",
    "9" : "detailed_authentication_information_key_length",
    "10" : "detailed_authentication_information_logon_process",
    "11" : "detailed_authentication_information_package_name_ntlm_only",
    "12" : "logon_type",
    "13" : "network_information_workstation_name",
    "14" : "new_logon_security_id",
    "15" : "impersonation_level",
    "16" : "network_information_protocol",
    "17" : "network_information_direction",
    "18" : "filter_information_layer_name"
}

#### Categorize the columns
Categorical columns will be converted to numerical.

In [7]:
win_events_on_gpu = ac.categorize_columns(win_events_on_gpu)
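
As a rough illustration of what converting a categorical column to numerical codes means, here is a minimal plain-Python sketch. The helper `categorize` is hypothetical and is not the CLX implementation. It assumes codes are assigned in sorted order of the distinct values, which would be consistent with the alphabetical label list used for the confusion matrix later in this notebook.

```python
def categorize(values):
    """Map each distinct value to an integer code (hypothetical helper).

    Codes follow the sorted order of the distinct values; this ordering
    is an assumption made for illustration only.
    """
    categories = sorted(set(values))
    code_of = {category: code for code, category in enumerate(categories)}
    return [code_of[value] for value in values], categories


# Toy input, not taken from the real dataset.
roles = ["DHCP", "DC", "DC", "WEB", "SQL", "MAIL", "SAP"]
codes, categories = categorize(roles)
print(categories)
print(codes)
```

Keeping the `categories` list around makes it possible to map predicted codes back to readable names.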

### Split the dataset into training and test sets using cuML `train_test_split` function
Column 19 contains the ground truth: the function of the machine that each log comes from (DC, SQL, WEB, DHCP, MAIL or SAP). Hence it is used as the label.

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(win_events_on_gpu, "19", train_size=0.9)
X_train["label"] = Y_train
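
For readers without a GPU at hand, the idea behind a shuffled 90/10 split can be sketched in plain Python. This is only an illustration of the concept, not what cuML does internally; `split_rows`, its `seed` parameter and the toy rows are assumptions introduced here.

```python
import random


def split_rows(rows, train_size=0.9, seed=0):
    """Shuffle row indices and cut them into train and test parts (hypothetical helper)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(round(len(rows) * train_size))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Toy rows of (row_id, label_code); not taken from the real dataset.
rows = [(i, i % 6) for i in range(100)]
train_rows, test_rows = split_rows(rows, train_size=0.9, seed=0)
print(len(train_rows), len(test_rows))
```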

In [9]:
X_train.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,label
39595,0,1,0,15,0,4,22,0,0,5,0,1,932,2108,2,6,1,1,2
36695,2,1,2,19,0,6,22,3,2,6,1,6,932,25,3,6,1,1,0
9573,0,1,0,15,0,4,22,4,1,7,3,1,932,7980,3,6,1,1,1
69165,0,1,0,15,0,4,22,0,0,5,0,1,0,1185,2,6,1,1,0
29385,0,1,0,15,0,4,22,0,0,5,0,1,932,8714,3,6,1,1,0


In [10]:
Y_train.unique()

0    0
1    1
2    2
3    3
4    4
5    5
Name: 19, dtype: int16

### Print Labels
Check that the test set contains all six labels.

In [11]:
Y_test.unique()

0    0
1    1
2    2
3    3
4    4
5    5
Name: 19, dtype: int16
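
The visual comparison of the two `unique()` outputs can also be turned into an explicit check. The sketch below works on plain Python lists; `missing_labels` is a hypothetical helper, and with cuDF series the values would first need to be copied to the host.

```python
def missing_labels(train_labels, test_labels):
    """Return the labels seen in training that never appear in the test set."""
    return sorted(set(train_labels) - set(test_labels))


# Toy label codes, not taken from the real split.
train_codes = [0, 1, 2, 3, 4, 5, 0, 0, 2]
complete_test_codes = [5, 4, 3, 2, 1, 0]
incomplete_test_codes = [0, 0, 2, 4]

print(missing_labels(train_codes, complete_test_codes))
print(missing_labels(train_codes, incomplete_test_codes))
```

A non-empty result suggests re-splitting, for example with a different seed or a larger test fraction, before evaluating.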

## Training 

Asset Classification training uses the fastai tabular model. More details can be found at https://github.com/fastai/fastai/blob/master/fastai/tabular/models.py#L6

Feature columns will be embedded so that they can be used as categorical values. The limit can be changed depending on the accuracy of the dataset.
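
One plausible reading of the "limit" above is a cap on the width of each embedding vector. The sketch below shows a common heuristic of that kind: roughly half the number of categories, capped at a maximum. The formula, the default cap of 100 and the column cardinalities are all assumptions for illustration; they are not confirmed as what CLX or fastai use.

```python
def embedding_size(n_categories, cap=100):
    """Heuristic embedding width: about half the cardinality, capped (assumed rule)."""
    return min(cap, (n_categories + 1) // 2)


# Hypothetical cardinalities; the real values depend on the dataset.
cardinalities = {"eventcode": 40, "logon_type": 8, "new_logon_security_id": 9000}
sizes = {name: (n, embedding_size(n)) for name, n in cardinalities.items()}
for name, (n_categories, width) in sizes.items():
    print(name, n_categories, width)
```

Raising the cap gives high-cardinality columns more capacity at the cost of more parameters to train.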

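Adam is the optimizer used in the training process; it is popular because it produces good results across a wide range of tasks. It keeps exponential moving averages of the gradient ($m_t$, the first moment estimate) and of the squared gradient ($v_t$, the second moment estimate), with decay rates $\beta_1$ and $\beta_2$. The paper notes that the bias corrections of both estimates can be folded into an effective step size at timestep $t$, where $\alpha$ is the base step size:

$$\alpha_{t}=\alpha \cdot \sqrt{1-\beta_{2}^{t}} /\left(1-\beta_{1}^{t}\right)$$

The parameters $\theta$ are then updated as:

$$\theta_{t}=\theta_{t-1}-\alpha_{t} \cdot m_{t} /\left(\sqrt{v_{t}}+\hat{\epsilon}\right)$$

More details on Adam can be found at https://arxiv.org/pdf/1412.6980.pdf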
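
To make the effective step size $\alpha_t$ above concrete, here is a minimal scalar sketch of Adam in plain Python that uses it directly. The moment updates and the parameter update follow the paper; the function name `adam_minimize`, the toy objective $(x-3)^2$ and the hyperparameter values are assumptions chosen for illustration, and this is not the optimizer code used by CLX or PyTorch.

```python
import math


def adam_minimize(grad, theta, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=3000):
    """Minimise a scalar function with Adam, given its gradient `grad`."""
    m = 0.0  # first moment estimate (moving average of the gradient)
    v = 0.0  # second moment estimate (moving average of the squared gradient)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction folded into the step size, as in the equation above.
        alpha_t = alpha * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        theta -= alpha_t * m / (math.sqrt(v) + eps)
    return theta


def toy_gradient(x):
    """Gradient of the toy objective (x - 3)^2."""
    return 2.0 * (x - 3.0)


first_step = adam_minimize(toy_gradient, theta=0.0, steps=1)
theta_final = adam_minimize(toy_gradient, theta=0.0)
print(first_step, theta_final)
```

The first update moves the parameter by roughly `alpha` regardless of the gradient's magnitude, which is one reason a learning rate such as 0.01 behaves similarly across differently scaled features.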

We found that partitioning the dataframes with a batch size of 10000 gives the best data loading performance for this dataset. The **batch_size** argument can be adjusted for datasets of different sizes.

In [12]:
# ac = AssetClassification()
ac.train_model(X_train, "label", batch_size, epochs, lr=0.01, wd=0.0)

training loss:  1.4136250664809147
valid loss 0.965 and accuracy 0.767
training loss:  0.8499849299514194
valid loss 0.696 and accuracy 0.771
training loss:  0.6722361403187076
valid loss 0.591 and accuracy 0.816
training loss:  0.5685139073013329
valid loss 0.501 and accuracy 0.845
training loss:  0.4828818655419584
valid loss 0.428 and accuracy 0.868
training loss:  0.4202517175491208
valid loss 0.380 and accuracy 0.877
training loss:  0.375006778393924
valid loss 0.344 and accuracy 0.887
training loss:  0.3404633395877056
valid loss 0.317 and accuracy 0.900
training loss:  0.31362021492817377
valid loss 0.295 and accuracy 0.905
training loss:  0.29396465170887426
valid loss 0.280 and accuracy 0.908
training loss:  0.2780476712029165
valid loss 0.268 and accuracy 0.909
training loss:  0.26583347537215235
valid loss 0.258 and accuracy 0.914
training loss:  0.25608768926957526
valid loss 0.250 and accuracy 0.915
training loss:  0.2477132092886084
valid loss 0.245 and accuracy 0.918
tra

## Evaluation

In [13]:
pred_results = ac.predict(X_test).to_array()
true_results = Y_test.to_array()

In [19]:
ac.predict(X_test)

0       0
1       0
2       0
3       0
4       2
       ..
8204    0
8205    4
8206    0
8207    3
8208    0
Length: 8209, dtype: int64

In [14]:
type(ac.predict(X_test))

cudf.core.series.Series

In [15]:
# scikit-learn expects (y_true, y_pred); micro-averaged F1 is symmetric, so the score is unchanged
f1_score_ = f1_score(true_results, pred_results, average='micro')
print('micro F1 score: %s' % f1_score_)

micro F1 score: 0.9171640881958826


In [16]:
torch.cuda.empty_cache()

In [17]:
labels = ["DC","DHCP","MAIL","SAP","SQL","WEB"]
a = confusion_matrix(true_results, pred_results)

In [18]:
pd.DataFrame(a, index=labels, columns=labels)

Unnamed: 0,DC,DHCP,MAIL,SAP,SQL,WEB
DC,3413,21,14,9,80,8
DHCP,104,669,0,3,16,0
MAIL,14,0,2591,6,8,0
SAP,22,0,4,157,5,0
SQL,234,2,16,18,650,27
WEB,37,0,0,1,31,49
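
Per-class recall and overall accuracy can be read off the matrix above, where rows are true labels and columns are predictions. The sketch below re-enters the printed counts by hand in plain Python; the variable names are chosen here for illustration.

```python
labels = ["DC", "DHCP", "MAIL", "SAP", "SQL", "WEB"]
# Counts copied from the confusion matrix above (rows: true, columns: predicted).
matrix = [
    [3413, 21, 14, 9, 80, 8],
    [104, 669, 0, 3, 16, 0],
    [14, 0, 2591, 6, 8, 0],
    [22, 0, 4, 157, 5, 0],
    [234, 2, 16, 18, 650, 27],
    [37, 0, 0, 1, 31, 49],
]

# Recall per class: correctly predicted events divided by all events of that class.
recall = {label: row[i] / sum(row) for i, (label, row) in enumerate(zip(labels, matrix))}

correct = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

for label in labels:
    print(f"{label}: recall {recall[label]:.3f}")
print(f"accuracy {accuracy:.4f}")
```

For single-label multi-class predictions, the accuracy computed this way equals the micro-averaged F1 score reported earlier.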


The confusion matrix shows that the function of some machines can be predicted very well, for example MAIL (2591 of 2619 test events correct) and DC (3413 of 3545), whereas others need more tuning or more features, notably WEB (49 of 118) and SQL (650 of 947, with 234 predicted as DC). This work can be improved and expanded to cover individual data-centres, creating a realistic map of the network using ML rather than relying only on naming conventions. It could also help to detect larger-scale anomalies, such as multiple machines not acting according to their tag.
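
One way the tag-mismatch idea could be sketched is to aggregate per-event predictions by machine and compare the majority prediction with the machine's tag. Everything in this example is hypothetical: the dataset shown in this notebook has no machine identifier column, and `flag_mismatched_machines`, the host names and the `min_share` threshold are assumptions introduced for illustration.

```python
from collections import Counter, defaultdict


def flag_mismatched_machines(predictions, tags, min_share=0.5):
    """Flag machines whose majority predicted role differs from their tag.

    predictions: iterable of (machine_id, predicted_role), one per log event.
    tags: dict mapping machine_id to its tagged role.
    min_share: minimum fraction of events that must agree on the majority role.
    """
    votes = defaultdict(Counter)
    for machine_id, role in predictions:
        votes[machine_id][role] += 1

    flagged = {}
    for machine_id, counter in votes.items():
        role, count = counter.most_common(1)[0]
        share = count / sum(counter.values())
        if role != tags.get(machine_id) and share >= min_share:
            flagged[machine_id] = (tags.get(machine_id), role, share)
    return flagged


# Toy data: host-a is tagged DHCP but mostly behaves like a WEB server.
tags = {"host-a": "DHCP", "host-b": "DC"}
predictions = [
    ("host-a", "WEB"), ("host-a", "WEB"), ("host-a", "WEB"), ("host-a", "DHCP"),
    ("host-b", "DC"), ("host-b", "DC"), ("host-b", "SQL"),
]
flagged = flag_mismatched_machines(predictions, tags)
print(flagged)
```

Given the weaker recall for some classes, a threshold like `min_share` would likely need tuning per class to avoid flagging machines that are merely hard to classify.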

## References:
* https://github.com/fastai/fastai/blob/master/fastai/tabular/models.py#L6
* https://jovian.ml/aakashns/04-feedforward-nn
* https://www.kaggle.com/dienhoa/reverse-tabular-module-of-fast-ai-v1
* https://github.com/fastai/fastai/blob/master/fastai/layers.py#L44