# Understanding text data

This notebook demonstrates how to classify log lines from 16 technologies:
1. Android
2. Apache
3. BGL
4. Hadoop
5. HDFS
6. HealthApp
7. HPC
8. Linux
9. Mac
10. OpenSSH
11. OpenStack
12. Proxifier
13. Spark
14. Thunderbird
15. Windows
16. Zookeeper

The log data for the above technologies was downloaded from: https://github.com/logpai/loghub

## The Approach

Given log files from each technology, we train a deep learning model that learns the patterns characteristic of each technology's logs and then classifies unseen log lines into one of the technologies listed above.

This notebook uses state-of-the-art methods such as GloVe vectors, convolutions, and LSTMs to build a deep learning model. The model is trained on the log files so that it can learn the patterns in the logs. The only preprocessing applied to the log text is the removal of special characters and numbers.

## Benefits of this approach

1. Very high accuracy classification (99.98%).
2. No code needs to be written to extract specific patterns from the log files. No string-matching algorithms or regular expressions are required.
3. The model can be easily adapted to incorporate any new technology by providing the log files of that technology and just retraining the model on the new data.
4. No programming expertise is required, since the only work involved is collecting the data.

The rest of the notebook will walk through the code that achieves this state of the art performance.

## 1. Importing the required libraries

In [3]:
from __future__ import print_function

import os
import sys
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding, Dropout, LSTM, TimeDistributed
from keras.models import Model
from keras.initializers import Constant

## 2. Setting up data directory path and model parameters

In [2]:
BASE_DIR = ''
TEXT_DATA_DIR = os.path.join(BASE_DIR, 'loghub')
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

### Understanding data

In [6]:
sample_android_log = "03-17 16:13:38.811  1702  2395 D WindowManager: printFreezingDisplayLogsopening app wtoken = AppWindowToken{9f4ef63 token=Token{a64f992 ActivityRecord{de9231d u0 com.tencent.qt.qtl/.activity.info.NewsDetailXmlActivity t761}}}, allDrawn= false, startingDisplayed =  false, startingMoved =  false, isRelaunching =  false"
print(f'sample_android_log:\n\n{sample_android_log}\n')

# Raw string, so the backslashes in the Windows path are not read as escape sequences such as "\a"
sample_windows_log = r"2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll"
print(f'sample_windows_log:\n\n{sample_windows_log}\n')

sample_linux_log = "Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 "
print(f'sample_linux_log:\n\n{sample_linux_log}\n')

sample_android_log:

03-17 16:13:38.811  1702  2395 D WindowManager: printFreezingDisplayLogsopening app wtoken = AppWindowToken{9f4ef63 token=Token{a64f992 ActivityRecord{de9231d u0 com.tencent.qt.qtl/.activity.info.NewsDetailXmlActivity t761}}}, allDrawn= false, startingDisplayed =  false, startingMoved =  false, isRelaunching =  false

sample_windows_log:

2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll

sample_linux_log:

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 



## 3. Reading the log files and removing any special characters and numbers

The intuition behind the approach is that when humans read log files, they tend to scan for keywords, which are the terms commonly used in that technology. Anyone who has used Android already knows the terminology that appears in its log files.

We want our algorithm to learn the terminologies associated with a given technology and then use that knowledge to classify unseen logs.

Although this pre-processing step works for the log files used here, it may not work for some other technologies. For example, SAP S4 HANA logs contain alphanumeric keys that are not human-interpretable but are related to some module internally. In such cases, slightly more sophisticated pre-processing is needed.

1. Regex to remove all special characters: `re1 = r"[^a-zA-Z0-9]+"`
2. Regex to remove any word containing a digit/numeric character: `re2 = r"\w*\d\w*"`
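
As a quick illustration of how the two patterns behave once joined with `|`, the sketch below applies the combined expression to a made-up log line. The sample string and the helper name `clean_line` are assumptions for illustration only; the pattern itself is the one used in the loading loop.

```python
import re

# The two patterns from the list above, joined with alternation.
PATTERN = re.compile(r"([^a-zA-Z0-9]+)|\w*\d\w*")

def clean_line(line):
    """Replace runs of special characters and any word containing a digit with a space."""
    return PATTERN.sub(" ", line)

sample = "Jun 14 sshd[19939]: failure; rhost=218.188.2.4"  # hypothetical log line
cleaned = clean_line(sample)
print(cleaned.split())  # ['Jun', 'sshd', 'failure', 'rhost']
```

Every match is replaced by a single space, so the cleaned text keeps runs of spaces where timestamps, IDs, and punctuation used to be. The tokenizer splits on whitespace, so those runs do not produce extra tokens.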

In [3]:
import re

print('Processing text dataset')

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids

# # 1. Regex to remove all special characters
# re1 = r"[^a-zA-Z0-9]+"
# # Regex to remove any word containing a digit/numeric character
# re2 = r"\w*\d\w*"

for class_name in os.listdir(TEXT_DATA_DIR):
    if os.path.isdir(os.path.join(TEXT_DATA_DIR, class_name)):
        label_id = len(labels_index)
        labels_index[class_name] = label_id
        with open(os.path.join(TEXT_DATA_DIR, class_name, class_name + '_2k.log')) as f:
            for line in f:
                texts.append(re.sub(r"([^a-zA-Z0-9]+)|\w*\d\w*", ' ', line))
                labels.append(label_id)

print(f'Found {len(texts)} lines of text.')

Processing text dataset
Found 32000 lines of text.


Printing the first line of the Android log file and verifying the pre-processed text and corresponding label.

In [4]:
print(labels[0])
print(texts[0])
print()
print(labels_index.items())

0
                D WindowManager printFreezingDisplayLogsopening app wtoken AppWindowToken   token Token   ActivityRecord     com tencent qt qtl activity info NewsDetailXmlActivity   allDrawn false startingDisplayed false startingMoved false isRelaunching false 

dict_items([('Android', 0), ('Apache', 1), ('BGL', 2), ('Hadoop', 3), ('HDFS', 4), ('HealthApp', 5), ('HPC', 6), ('Linux', 7), ('Mac', 8), ('OpenSSH', 9), ('OpenStack', 10), ('Proxifier', 11), ('Spark', 12), ('Thunderbird', 13), ('Windows', 14), ('Zookeeper', 15)])


## 4. Vectorizing the text samples into a 2D integer array

In [5]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 3665 unique tokens.


`Tokenizer` builds the vocabulary index (`word_index`) from word frequency. For example, given "The cat sat on the mat.", it creates a word -> index dictionary such that `word_index["the"] = 1` and `word_index["cat"] = 2`. Every word gets a unique integer, and a lower integer means a more frequent word. Index `0` is never assigned to a word; it is reserved for padding.
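
To make the frequency ranking concrete, here is a simplified plain-Python stand-in for the index-building step. It is not the Keras implementation: the helper `build_word_index` is hypothetical, and it only lowercases and splits on whitespace, whereas `Tokenizer` also filters punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent first (0 is left free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

word_index_demo = build_word_index(["the cat sat on the mat"])
print(word_index_demo)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
```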

In [6]:
tokenizer.word_index

{'info': 1,
 'nova': 2,
 'jul': 3,
 'dec': 4,
 'for': 5,
 'c': 6,
 'from': 7,
 'to': 8,
 'sshd': 9,
 'proxy': 10,
 'error': 11,
 'in': 12,
 'kernel': 13,
 'tbird': 14,
 'com': 15,
 'cbs': 16,
 'block': 17,
 'bytes': 18,
 'hadoop': 19,
 'compute': 20,
 'dfs': 21,
 'warn': 22,
 'log': 23,
 'received': 24,
 'org': 25,
 'apache': 26,
 'exe': 27,
 'combo': 28,
 'labsz': 29,
 'nov': 30,
 'edu': 31,
 'ras': 32,
 'hk': 33,
 'node': 34,
 'connection': 35,
 'cse': 36,
 'cuhk': 37,
 'server': 38,
 'user': 39,
 'step': 40,
 'req': 41,
 'msra': 42,
 'sa': 43,
 'time': 44,
 'unix': 45,
 'not': 46,
 'pam': 47,
 'chrome': 48,
 'client': 49,
 'executor': 50,
 'quorumcnxmanager': 51,
 'failed': 52,
 'data': 53,
 'notice': 54,
 'sendworker': 55,
 'child': 56,
 'status': 57,
 'is': 58,
 'sun': 59,
 'calvisitor': 60,
 'at': 61,
 'i': 62,
 'root': 63,
 'get': 64,
 'uid': 65,
 'sent': 66,
 'on': 67,
 'blk': 68,
 'got': 69,
 'found': 70,
 'local': 71,
 'package': 72,
 'api': 73,
 'of': 74,
 'session': 75,
 ...

Finding the maximum length of the log text in the entire dataset

In [7]:
MAX_SEQUENCE_LENGTH = max([len(seq) for seq in sequences])
print(MAX_SEQUENCE_LENGTH)

91


## 5. Pre-padding and one-hot-encoding the training data

1. Pre-padding: Logs shorter than `MAX_SEQUENCE_LENGTH` are padded with zeros before the start of the text, so that every sequence has the same length.
2. One-hot encoding: Converting an array of class ids, say `[0, 1, 2, 0]`, to `[[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]`.
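
The two steps above can be sketched in plain Python. The helpers `pre_pad` and `one_hot` are hypothetical, simplified stand-ins for `pad_sequences` (with its default pre-padding and pre-truncation) and `to_categorical`, written here only to show the transformation on toy data.

```python
def pre_pad(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to maxlen; keep the last maxlen items if longer."""
    padded = []
    for seq in sequences:
        tail = list(seq[-maxlen:])  # truncate from the front, like the Keras default
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded

def one_hot(label_ids, num_classes):
    """Turn each class id into a row with a single 1 at that position."""
    rows = []
    for label_id in label_ids:
        row = [0] * num_classes
        row[label_id] = 1
        rows.append(row)
    return rows

padded = pre_pad([[5, 7], [3, 1, 4, 1]], maxlen=4)
encoded = one_hot([0, 1, 2, 0], num_classes=3)
print(padded)   # [[0, 0, 5, 7], [3, 1, 4, 1]]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```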

In [8]:
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (32000, 91)
Shape of label tensor: (32000, 16)


In [9]:
print(sequences[0])
print(data[0])
print(labels[0])

[133, 514, 2829, 123, 2830, 1164, 1065, 1065, 1445, 15, 420, 1230, 1231, 330, 1, 1861, 2831, 103, 2832, 103, 2833, 103, 2834, 103]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0  133  514 2829
  123 2830 1164 1065 1065 1445   15  420 1230 1231  330    1 1861 2831
  103 2832  103 2833  103 2834  103]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


## 6. Splitting the data into a training set and a validation set

In [10]:
# Shuffle the samples and labels with the same random permutation
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

# Hold out the last VALIDATION_SPLIT fraction of the shuffled data for validation
x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

In [11]:
print(x_train.shape)
print(y_train.shape)
print(x_val.shape)
print(y_val.shape)

(25600, 91)
(25600, 16)
(6400, 91)
(6400, 16)
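
Because the shuffle is unseeded, the split changes on every run. The following is a minimal plain-Python sketch of a repeatable variant, assuming a fixed seed is acceptable. The helper `shuffled_split`, its `seed` parameter, and the toy data are illustrative and not part of the original notebook.

```python
import random

def shuffled_split(samples, labels, validation_split, seed=0):
    """Shuffle sample/label pairs together, then hold out the last fraction."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_val = int(validation_split * len(samples))
    cut = len(indices) - n_val  # an explicit cut point also works when n_val == 0
    train_idx, val_idx = indices[:cut], indices[cut:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(samples, train_idx), pick(labels, train_idx),
            pick(samples, val_idx), pick(labels, val_idx))

toy_x = [[i, i] for i in range(10)]  # toy samples
toy_y = list(range(10))              # toy labels, one per sample
x_tr, y_tr, x_va, y_va = shuffled_split(toy_x, toy_y, validation_split=0.2)
print(len(x_tr), len(x_va))  # 8 2
```

With NumPy, calling `np.random.seed(...)` before `np.random.shuffle(indices)` achieves the same repeatability.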


The above pre-processing techniques are specific to a given dataset. There are other techniques like de-duplication, stemming, stopwords removal etc. which are not discussed here but are useful for tasks like sentiment classification where the dataset consists of English words.

Further reading:
* https://www.kaggle.com/deffro/text-pre-processing-techniques
* https://www.kaggle.com/shashanksai/text-preprocessing-using-python

## 7. Building and training Keras Model (OPTIONAL)

In [12]:
print('Preparing embedding layer.')
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

Preparing embedding layer.


In [21]:
print('Training model.')

# train a 1D convnet with LSTMS
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Dropout(0.25)(embedded_sequences)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = LSTM(128)(x)
# x = Conv1D(128, 5, activation='relu')(x)
# x = GlobalMaxPooling1D()(x)
# x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train,
          batch_size=128,
          epochs=2,
          validation_data=(x_val, y_val))

Training model.
Train on 25600 samples, validate on 6400 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x28fce719550>

In [22]:
model.save('log_classifier.h5')
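
To see how the sequence length changes through the layers defined above, the sketch below works through the arithmetic for `Conv1D(128, 5)` and `MaxPooling1D(5)`, assuming the Keras defaults of `padding='valid'`, convolution stride 1, and a pooling stride equal to the pool size. The helper names are hypothetical.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output steps of a 1D convolution with no padding."""
    return (length - kernel_size) // stride + 1

def max_pool1d_length(length, pool_size):
    """Output steps of 1D max pooling with stride == pool_size and no padding."""
    return (length - pool_size) // pool_size + 1

steps_in = 91                                   # MAX_SEQUENCE_LENGTH
after_conv = conv1d_valid_length(steps_in, 5)   # 87 steps, 128 channels each
after_pool = max_pool1d_length(after_conv, 5)   # 17 steps, 128 channels each
print(steps_in, after_conv, after_pool)         # 91 87 17
```

The `LSTM(128)` layer then reads those 17 steps and returns only its final 128-dimensional state, which the `Dense` softmax layer maps to the 16 class probabilities.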

### Results: We achieve 99.98% accuracy in just 2 epochs!

## 8. Testing model on some sample inputs (OPTIONAL)

#### A.  Sample log from Android

In [37]:
sample_android_log = "03-17 16:13:38.811  1702  2395 D WindowManager: printFreezingDisplayLogsopening app wtoken = AppWindowToken{9f4ef63 token=Token{a64f992 ActivityRecord{de9231d u0 com.tencent.qt.qtl/.activity.info.NewsDetailXmlActivity t761}}}, allDrawn= false, startingDisplayed =  false, startingMoved =  false, isRelaunching =  false"
processed_sample_log = re.sub(r"([^a-zA-Z0-9]+)|\w*\d\w*", ' ', sample_android_log)
processed_sample_log

'                D WindowManager printFreezingDisplayLogsopening app wtoken AppWindowToken   token Token   ActivityRecord     com tencent qt qtl activity info NewsDetailXmlActivity   allDrawn false startingDisplayed false startingMoved false isRelaunching false'

In [40]:
seq = tokenizer.texts_to_sequences([processed_sample_log])
sample_input = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)

In [42]:
sample_input

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,  133,  514, 2829,  123, 2830, 1164, 1065, 1065, 1445,   15,
         420, 1230, 1231,  330,    1, 1861, 2831,  103, 2832,  103, 2833,
         103, 2834,  103]])

In [44]:
np.argmax(model.predict(sample_input))

0

In [45]:
labels_index.items()

dict_items([('Android', 0), ('Apache', 1), ('BGL', 2), ('Hadoop', 3), ('HDFS', 4), ('HealthApp', 5), ('HPC', 6), ('Linux', 7), ('Mac', 8), ('OpenSSH', 9), ('OpenStack', 10), ('Proxifier', 11), ('Spark', 12), ('Thunderbird', 13), ('Windows', 14), ('Zookeeper', 15)])

The model gives `0` as the prediction. Comparing it with the label indices, we can verify that index `0` corresponds to the `Android` class.
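
Rather than comparing the index by eye, the lookup can be automated by inverting the label dictionary. This is a small sketch on a shortened copy of `labels_index`; the helper `decode_prediction` and the probability vector are made up for illustration.

```python
# Shortened copy of labels_index, for illustration only
labels_index_demo = {'Android': 0, 'Apache': 1, 'BGL': 2}

# Invert name -> id into id -> name
index_to_label = {idx: name for name, idx in labels_index_demo.items()}

def decode_prediction(probabilities):
    """Return the class name with the highest predicted probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index_to_label[best]

fake_probs = [0.90, 0.06, 0.04]  # hypothetical softmax output for one log line
print(decode_prediction(fake_probs))  # Android
```

In the notebook, the same idea would apply to the first row of `model.predict(sample_input)` with the full `labels_index`.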

#### B.  Sample log from Windows

In [46]:
sample_windows_log = r"2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll"
processed_sample_log = re.sub(r"([^a-zA-Z0-9]+)|\w*\d\w*", ' ', sample_windows_log)
processed_sample_log

'            Info CBS Loaded Servicing Stack         with Core C Windows winsxs   windows         cbscore dll'

In [47]:
seq = tokenizer.texts_to_sequences([processed_sample_log])
sample_input = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)

In [48]:
sample_input

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    1,   16, 2203, 1857, 1447,  327,  124,    6,  756, 2796,
         756, 2797, 1668]])

In [49]:
np.argmax(model.predict(sample_input))

14

In [50]:
labels_index.items()

dict_items([('Android', 0), ('Apache', 1), ('BGL', 2), ('Hadoop', 3), ('HDFS', 4), ('HealthApp', 5), ('HPC', 6), ('Linux', 7), ('Mac', 8), ('OpenSSH', 9), ('OpenStack', 10), ('Proxifier', 11), ('Spark', 12), ('Thunderbird', 13), ('Windows', 14), ('Zookeeper', 15)])

The model gives `14` as the prediction. Comparing it with the label indices, we can verify that index `14` corresponds to the `Windows` class.

#### C.  Sample log from Linux

In [51]:
sample_linux_log = "Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 "
processed_sample_log = re.sub(r"([^a-zA-Z0-9]+)|\w*\d\w*", ' ', sample_linux_log)
processed_sample_log

'Jun         combo sshd pam unix   authentication failure logname uid   euid   tty NODEVssh ruser rhost         '

In [52]:
seq = tokenizer.texts_to_sequences([processed_sample_log])
sample_input = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)

In [53]:
sample_input

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 125,
         28,   9,  47,  45,  81,  89,  91,  65,  88,  92, 197,  93,  94]])

In [54]:
np.argmax(model.predict(sample_input))

7

In [55]:
labels_index.items()

dict_items([('Android', 0), ('Apache', 1), ('BGL', 2), ('Hadoop', 3), ('HDFS', 4), ('HealthApp', 5), ('HPC', 6), ('Linux', 7), ('Mac', 8), ('OpenSSH', 9), ('OpenStack', 10), ('Proxifier', 11), ('Spark', 12), ('Thunderbird', 13), ('Windows', 14), ('Zookeeper', 15)])

The model gives `7` as the prediction. Comparing it with the label indices, we can verify that index `7` corresponds to the `Linux` class.