# Anomaly Detection in Logs

The lab gives a practical demonstration of how an autoencoder can identify anomalies. That approach works, but there I chose to encode the logs character by character so that it would be easy to see what a high loss value looks like. In this notebook, we approach the problem in a more realistic way: the log lines are tokenized with a trained tokenizer, and the autoencoder is trained on the resulting token IDs.

In [1]:
import numpy as np
import re
import tensorflow as tf
from tensorflow.keras import models, layers



In [2]:
# This max_length is 20% longer than the longest tokenized entry in the training data.
max_length = 32


# We do still need to load the logs for training.
with open('../data/Day 5/messages', 'r') as f:
    log_data = f.readlines()
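
The 20% headroom mentioned in the comment above can be computed instead of hard-coded. The following is a minimal sketch, assuming the lines have already been tokenized into lists of IDs. `suggest_max_length` and `headroom_percent` are hypothetical names that do not appear in the notebook, and the toy ID lists are made up for illustration.

```python
def suggest_max_length(encoded_lines, headroom_percent=20):
    """Return the longest entry length plus a percentage of headroom, rounded up."""
    longest = max(len(ids) for ids in encoded_lines)
    # Integer ceiling division avoids floating-point surprises.
    return (longest * (100 + headroom_percent) + 99) // 100

# Toy token-ID lists standing in for tokenized log lines (hypothetical values).
toy_encoded = [[4, 8, 15], [16, 23, 42, 7, 9], [1, 2]]
print(suggest_max_length(toy_encoded))
```

Any line longer than the chosen length is truncated during preprocessing, so the headroom mainly protects against slightly longer lines at inference time.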

In [3]:
# Let's use a tokenizer to tokenize the "words"
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

tokenizer.train(['../data/Day 5/messages'])
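
To give an intuition for what byte-pair encoding learns, here is a toy sketch of the core idea: repeatedly merge the most frequent adjacent pair of symbols. This is not the `tokenizers` implementation; the function names are hypothetical, and the sketch treats each whole line as one sequence of characters.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most common adjacent symbol pair, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return max(counts, key=counts.get) if counts else None

def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_merges(lines, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    sequences = [list(line) for line in lines]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences

# Hypothetical log-like strings, just to show the merged symbols growing.
merges, sequences = learn_merges(["sshd: ok", "sshd: fail"], 3)
print(merges)
print(sequences)
```

Frequent substrings such as host names and daemon names end up as single tokens, which is why a tokenized line is much shorter than its character-level encoding.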






In [4]:
# It still makes sense to strip the timestamps to avoid inflating loss.
# The slice drops the first 16 characters (the timestamp prefix) and the trailing newline.
log_data = [line[16:-1] for line in log_data]
log_data[:5]

['munnin rsyslogd: [origin software="rsyslogd" swVersion="8.24.0-57.amzn2.2.0.1" x-pid="3223" x-info="http://www.rsyslog.com"] rsyslogd was HUPed',
 'munnin systemd: Removed slice User Slice of root.',
 'munnin dhclient[3034]: XMT: Solicit on eth0, interval 108620ms.',
 'munnin systemd: Created slice User Slice of root.',
 'munnin systemd: Started Session 263 of user root.']
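
The fixed slice works when every line starts with a timestamp of exactly the same width. A pattern-based alternative is sketched below, assuming the traditional syslog prefix of the form `Mmm dd hh:mm:ss `. The sample line's timestamp is invented for illustration, and `strip_timestamp` is a hypothetical helper rather than part of the notebook.

```python
import re

# Traditional syslog prefix such as "Dec 30 12:00:01 " (month, day, time, space).
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s")

def strip_timestamp(line):
    """Drop a leading syslog timestamp and the trailing newline, if present."""
    return TIMESTAMP.sub("", line.rstrip("\n"))

# Hypothetical raw line; only the text after the timestamp comes from the logs above.
sample = "Dec 30 12:00:01 munnin systemd: Started Session 263 of user root.\n"
print(strip_timestamp(sample))
```

For lines in this format the result matches `line[16:-1]`, but a line without a timestamp is left intact instead of losing its first 16 characters.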

In [5]:
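# However, we will not remove the numbers this time.
# Let's encode all of the lines:


def preprocess_data(x):
    # Encode each line into a list of token IDs.
    data = [tokenizer.encode(i).ids for i in x]
    # Repeat each list until it covers max_length, then truncate, so that every row
    # holds exactly max_length IDs. Short lines are padded with copies of themselves.
    return np.array([(i * round(max_length / len(i) + 1))[:max_length] for i in data])
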
x = preprocess_data(log_data)
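
The padding scheme in `preprocess_data` is easier to see on a small example. The sketch below reproduces the same repeat-then-truncate idea in plain Python; `tile_to_length` is a hypothetical name, and the token IDs are made up.

```python
def tile_to_length(ids, length):
    """Repeat ids cyclically until there are exactly `length` items, truncating any excess."""
    if not ids:
        raise ValueError("cannot tile an empty token list")
    repeats = -(-length // len(ids))  # ceiling division
    return (ids * repeats)[:length]

print(tile_to_length([5, 9, 2], 8))     # [5, 9, 2, 5, 9, 2, 5, 9]
print(tile_to_length([1, 2, 3, 4], 2))  # [1, 2]
```

Unlike the notebook's one-liner, this version raises a clear error on an empty line rather than failing with a division by zero.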

In [6]:
model = models.Sequential()
model.add(layers.Input(shape=(max_length,)))
model.add(layers.Embedding(30000, 256))
model.add(layers.Conv1D(64, 2, activation='elu'))
model.add(layers.Conv1D(32, 4, activation='elu'))
model.add(layers.Conv1D(8, 8, activation='elu'))
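# After the three convolutions the tensor has shape (-1, 21, 8).
# Dense layers act on the last axis only, so the bottleneck below is (-1, 21, 4).
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(21*8, activation='relu'))  # Expands each position to 168 features: (-1, 21, 168)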
model.add(layers.Conv1DTranspose(8, 8, activation='elu'))
model.add(layers.Conv1DTranspose(32, 4, activation='elu'))
model.add(layers.Conv1DTranspose(64, 2, activation='elu'))
model.add(layers.Flatten())
model.add(layers.Dense(max_length))
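
The sequence lengths through the network can be checked by hand. With the Keras defaults (stride 1, `'valid'` padding), a `Conv1D` with kernel size k shortens the sequence by k - 1 and a `Conv1DTranspose` lengthens it by k - 1. The sketch below traces only the length axis; the helper names are hypothetical.

```python
def conv1d_len(length, kernel):
    """Output length of a stride-1, 'valid' Conv1D."""
    return length - kernel + 1

def conv1d_transpose_len(length, kernel):
    """Output length of a stride-1, 'valid' Conv1DTranspose."""
    return length + kernel - 1

def trace_lengths(max_length=32, kernels=(2, 4, 8)):
    """Follow the sequence length through the encoder and the mirrored decoder."""
    lengths = [max_length]
    for k in kernels:
        lengths.append(conv1d_len(lengths[-1], k))
    for k in reversed(kernels):
        lengths.append(conv1d_transpose_len(lengths[-1], k))
    return lengths

print(trace_lengths())  # [32, 31, 28, 21, 28, 31, 32]
```

The `Dense` layers in between leave the length untouched because they operate on the feature axis, which is why the decoder arrives back at 32 positions before `Flatten`.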



In [7]:

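# Train the autoencoder to reconstruct its own input (x is both the input and the target).
# The 'accuracy' metric is not meaningful for real-valued reconstruction; MSE is the signal that matters.
model.compile(loss='mse', optimizer='adamax', metrics=['accuracy'])
history = model.fit(x, x, epochs=50, batch_size=32, verbose=False)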




In [8]:
test = [
    'munnin sshd: failed logon attempt by margie.',
    'hestia kernel: unknown operand from system process',
    'munnin kernel: /dev/sda1 out of diskspace',
    'munnin sudo: mike : TTY=pts/2 ; PWD=/home/mike ; USER=root ; COMMAND=/usr/sbin/adduser jim',
    'munnin groupadd[1731]: group added to /etc/group: name=jim, GID=1001',
    'munnin passwd[1742]: pam_unix(passwd:chauthtok): password changed for jim',
]
anomalies = preprocess_data(test)

In [9]:
print('Losses on known data:')
print(tf.keras.losses.mae(x[:6], model(x[:6])))
print('Losses on anomalies:')
print(tf.keras.losses.mae(anomalies, model(anomalies)))

Losses on known data:
tf.Tensor([2547.3103    122.550415  103.366776  120.90019    98.71762   122.550415], shape=(6,), dtype=float32)
Losses on anomalies:
tf.Tensor([7956.436  3611.8918 4805.9834 4142.8804 6116.6274 2715.1367], shape=(6,), dtype=float32)
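
Note that the model is trained with MSE but scored here with `tf.keras.losses.mae`, which averages the absolute error over the last axis and so returns one value per line. A plain-Python sketch of that reduction follows; the target and prediction values are invented, and `per_line_mae` is a hypothetical name.

```python
def per_line_mae(targets, predictions):
    """Mean absolute error for each row, averaged over that row's positions."""
    return [
        sum(abs(t - p) for t, p in zip(target_row, predicted_row)) / len(target_row)
        for target_row, predicted_row in zip(targets, predictions)
    ]

# Hypothetical token IDs and reconstructions for two short "lines".
targets = [[100, 2500, 40], [7, 7, 7]]
predictions = [[110.0, 2400.0, 45.0], [7.0, 7.0, 7.0]]
print(per_line_mae(targets, predictions))
```

Because the error is measured in token-ID units, and the embedding allows IDs up to 30,000, values in the hundreds or thousands are to be expected.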


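Notice the massive difference in losses. Apart from the first entry (2547.3), the known lines score a mean absolute error of roughly 100 to 120, while the anomalies range from about 2,700 to almost 8,000. However, it would not make sense to try to decode the reconstructions back into log entries. The model predicts token IDs as real numbers, so its output does not map onto meaningful tokens and the decoded text would be nonsense.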
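
One way to turn these losses into a decision is a threshold derived from the known data. The sketch below uses the loss values printed above and flags anything above a multiple of the median known loss. The factor of 5 is an arbitrary assumption chosen for illustration, and `flag_anomalies` is a hypothetical helper.

```python
from statistics import median

# Per-line losses copied from the output above.
known_losses = [2547.3103, 122.550415, 103.366776, 120.90019, 98.71762, 122.550415]
anomaly_losses = [7956.436, 3611.8918, 4805.9834, 4142.8804, 6116.6274, 2715.1367]

def flag_anomalies(losses, baseline, factor=5.0):
    """Flag losses that exceed `factor` times the median of the baseline losses."""
    threshold = factor * median(baseline)
    return [loss > threshold for loss in losses]

print(flag_anomalies(known_losses, known_losses))
print(flag_anomalies(anomaly_losses, known_losses))
```

With this rule, all six test lines are flagged, and so is the first known line, whose loss of 2547.3 sits close to the lowest anomaly score. A practical threshold would need tuning on a larger sample of normal traffic.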