[**Workflow with code**](#workflow)
# Workflow with input and output sample graphs of each step: <a name='workflowgraphs'></a> 
# 1. Create dataframe


**Input: The folder path.**

("L:\BF69314\Mailbox\Fanny Mailbox\Master_Thesis_Data_Mining\Log_files")


![alt text](./pictures/image_data.png )


**Output: The dataframe with files info and corresponding log data.**

![alt text](./pictures/image_outputdf.png )

---

# 2. Data preprocessing

**Input: Log data**

![DSD](./pictures/datapreprocessing_1.png)


## *Process:* 
**1. Regex: Remove the variable components, only keep the static components (only keep the text information).**

**Output: Cleaned log**

![DSD](./pictures/datapreprocessing_3.png)



**2. Tokenization: processing the non-numerical data into numerical data.**

**Output: Integer log**

![DSD](./pictures/datapreprocessing_2.png)


**The tokenization step also produces the dictionary that maps each cleaned log line to its integer id.**

![DSD](./pictures/datapreprocessing_dic.png)

---

# 3. Training the model


## *Note: this example applies only to the uni-LSTM model; other models use different windows and targets.*

**Input:  Numerical data (The integer log)**

An example of the integer log: [43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4, 5, 6, 7, 15, 45, 46, 47, 48, 49, 3, 27]

## *Process:* 
**1. Create the windows**

![DSD](./pictures/windows.png)



**Create the targets**

 [5, 6, 7, 4, 5, 6, 7, 15, 45, 46, 47, 48, 49, 3]

**2. Train the model**

![DSD](./pictures/model.png)

---
# 4. Evaluation function (used to find the best threshold)

**Input: unseen data with labels** 

## *Process:* 
**Step 1. Label the data**

**Step 2. Data preprocessing for the unseen data.**

Output: the integer log (same data preprocessing as training data).
***

**Create the windows for each block**

Split the log data of each file into blocks. A new block starts at every "_info_sendcommand_sendcommand" line.


For example:

Suppose the numerical data of one file is: [1,2,3,4,6,2,3,5,6,7,6,2,4,6,8...]

Here 2 is the integer id of the command. The numerical data is then split into blocks, and each block after the first starts with the command that closed the previous one:
[1,2], [2,3,4,6,2],[2,3,5,6,7,6,2],[2,4,6,8....]

Then we create the windows inside each block (window size 3 here), so that no window crosses a block boundary:
[2,3,4],[3,4,6],[4,6,2],[2,3,5],[3,5,6],[5,6,7],[6,7,6],[7,6,2],.....

***

**Step 3. Evaluate the anomalies reported by the model against the labels**

Output:

![](./pictures/func.png)

---

# 5. Anomaly detection:

**Input: unseen data, saved model (trained model), saved dictionary**



## *Process:* 
**Step 1. Data preprocessing for the unseen data.**

Output: the integer log (same data preprocessing as training data).
***

**Create the windows for each block**


***

**Step 2. Do the prediction.**

Output: the predicted probability distribution over the next log id for each window.



**Step 3. Compare the predictions with the targets.**

Output: the predicted anomalous log.
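
The comparison in Step 3 can be sketched as follows. The detection cell later in this notebook treats a target as normal when it is among the `n` most probable ids (`n=10` there) and reports it as an anomaly otherwise. The helper name `is_anomaly` and the toy probabilities are illustrative assumptions, not part of the original code.

```python
def is_anomaly(probabilities, target_id, n):
    """Return True when target_id is not among the n most probable ids."""
    # Rank the ids from most to least probable (hypothetical helper, plain Python).
    ranked = sorted(range(len(probabilities)),
                    key=lambda k: probabilities[k],
                    reverse=True)
    return target_id not in ranked[:n]

# Toy distribution over 5 ids (made-up numbers, for illustration only).
probs = [0.05, 0.60, 0.25, 0.08, 0.02]
normal_case = is_anomaly(probs, target_id=2, n=2)     # id 2 is the second most probable
anomalous_case = is_anomaly(probs, target_id=4, n=2)  # id 4 is the least probable
```

A larger `n` makes the check more tolerant, so fewer lines are reported.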

![](./pictures/anomaly_detection_1.png)
![](./pictures/anomaly_detection_2.png)

---



# Contents

- [**Create dataframe**](#Create_dataframe)

- [**Data preprocessing**](#data_preprocessing)

- [**Training the model**](#training_model)

- [**Anomaly detection**](#anomaly_detection)

- [**Evaluation function**](#evaluation_function)


<a id='workflow'></a>


# Import packages

In [2]:
import zipfile
import os
import itertools
import re

import pandas as pd
import numpy as np
import pickle

import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.callbacks import Callback
from keras.models import load_model

from itertools import chain

<a id='Create_dataframe'></a>

# Create dataframe 

`Input: The Folder path. ("example_log_files" folder was used)`

`Output: The dataframe with files info and corresponding log data.`
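
The `ETL` function in the next cell derives the file info from the archive name by splitting it on dots and dropping the first and last fields. A minimal sketch of that parsing step, under stated assumptions: the archive name `BSW.2101.01.P0987E.ROBOT.artifacts_1.zip` is a made-up example chosen to match the output table, and `parse_archive_name` is a hypothetical helper.

```python
import os

def parse_archive_name(path):
    """Split an archive name into the info fields used in the dataframe."""
    # Keep only the file name, then drop the first field and the extension.
    parts = os.path.basename(path).split(".")[1:-1]
    return {
        "Year_month": parts[0],
        "Count": parts[1],
        "Project": parts[2] + " " + parts[3],
        "Instance": parts[4],
    }

# Hypothetical archive name (assumption, not taken from the data).
info = parse_archive_name("BSW.2101.01.P0987E.ROBOT.artifacts_1.zip")
```

An archive name with fewer dot-separated fields would raise an `IndexError` here, so the naming scheme has to be consistent across the folder.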


In [3]:
#read the folder path
def return_filepaths(directory):
    json_data=[]
    # walk through the folders 
    for root, dirs, files in os.walk(directory, topdown=False):
        for name in files:
            filepath=os.path.join(root, name)
            json_data.append(ETL(filepath))
    return json_data

#output the ETLlog file
def ETL(foldername):
    
    json_data=[]
    
    foldername_long=foldername
    
    # here we take the file name and split it on '.' to obtain the year/month, day, project name and artifact
    foldername=os.path.basename(foldername)
    foldername=foldername.split('.')[1:-1]
    
    # open the folder
    with zipfile.ZipFile(foldername_long,"r") as zfile:
        # read the files that have "ETL" in the name
        for name in zfile.namelist():
            if "ETL" in name:
                temp=zfile.read(name)
                json_data.append({'Year_month': foldername[0], 
                                  'Count': foldername[1], 
                                  'Project':foldername[2]+' '+foldername[3], 
                                  'Instance':foldername[4], 
                                  'Filename':name, 
                                  'Log': temp.decode('utf-8')})
    return json_data

#create the dataframe contains files' info and log data
def create_dataframe(folder):
    jsons=return_filepaths(folder)
    flat_jsons=[item for sublist in jsons for item in sublist]
    
    df=pd.DataFrame.from_records(flat_jsons)
    return df

In [4]:
example_path = 'C:\\Users\\A373503\\Desktop\\example_log_files' 
#If you want to run the code, change this to the correct path.
#You can find the example data in the "Jupyter_version_implementing" folder,
#or you can use other data.

In [5]:
example_df=create_dataframe(example_path)
example_df#see the output

Unnamed: 0,Year_month,Count,Project,Instance,Filename,Log
0,2101,01,P0987E ROBOT,artifacts_1,21_29_34_0_ETLog.txt,﻿2021-01-04 21:29:34.033 INFO [10888] [StartE...
1,2101,01,P0987E ROBOT,artifacts_1,21_34_25_0_ETLog.txt,﻿2021-01-04 21:34:25.322 INFO [3332] [StartEn...
2,2101,01,P0987E ROBOT,artifacts_1,21_38_10_0_ETLog.txt,﻿2021-01-04 21:38:10.492 INFO [3332] [StartEn...
3,2101,01,P0987E ROBOT,artifacts_1,21_55_43_0_ETLog.txt,﻿2021-01-04 21:55:43.784 INFO [3332] [StartEn...
4,2101,01,P0987E ROBOT,artifacts_1,21_56_14_0_ETLog.txt,﻿2021-01-04 21:56:14.892 INFO [3332] [StartEn...
...,...,...,...,...,...,...
167,2101,04,P0987E ROBOT,artifacts_2,19_16_03_0_ETLog.txt,﻿2021-01-07 19:16:03.486 INFO [9760] [StartEn...
168,2101,04,P0987E ROBOT,artifacts_2,19_18_50_0_ETLog.txt,﻿2021-01-07 19:18:50.216 INFO [9760] [StartEn...
169,2101,04,P0987E ROBOT,artifacts_2,19_19_53_0_ETLog.txt,﻿2021-01-07 19:19:53.986 INFO [9760] [StartEn...
170,2101,04,P0987E ROBOT,artifacts_2,19_20_56_0_ETLog.txt,﻿2021-01-07 19:20:56.627 INFO [9760] [StartEn...


<a id='data_preprocessing'></a> 

# Data preprocessing



`Input: Dataframe which contains files info and log data.`

`Output: Dataframe which contains files info, log data and numerical data.`

*Process:* 
1. Regex
2. Tokenization
***

<b>Regex:</b> 
Remove the variable components, only keep the static components (only keep the text information).

For example: 


"### 2021-02-03 10:42:39 Command listecu" will be processed to "_command_listecu" (lowercased, with the removed timestamp and "###" prefix leaving a single leading underscore)  
***
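
The cleaning of a single line can be sketched as below. This is a condensed, single-line version of the steps in `preprocess_to_log_lines` from the next cell; the helper name `clean_line` is an assumption made for this illustration. With these steps the result is lowercased and keeps a leading underscore where the timestamp and `###` prefix used to be.

```python
import re

def clean_line(line, keep_numbers=False):
    """Reduce one raw log line to its static text part."""
    # Drop the byte-order mark and collapse repeated whitespace.
    line = " ".join(line.replace("\ufeff", "").split())
    # Only lines starting with '<', a digit or '#' are kept.
    if not re.search(r"^<|^[0-9]|^#", line):
        return "empty_line"
    line = line.lower()
    if not keep_numbers:
        line = re.sub(r"[0-9]", "", line)
    # Every run of non-alphanumeric characters becomes one space.
    line = re.sub(r"[\W_]+", " ", line)
    # Lines that still contain non-ASCII characters are discarded.
    if re.search("[\u0080-\uFFFF]", line):
        return "empty_line"
    line = line.replace(" ", "_")
    return line or "empty_line"

cleaned = clean_line("### 2021-02-03 10:42:39 Command listecu")
```

Here `cleaned` is `"_command_listecu"`, which matches the style of the dictionary keys shown later, such as `_command_udsservices`.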

In [6]:
def preprocess_to_log_lines(log, keep_numbers):
    ### Split lines
    log=log.splitlines()
    
    ### remove the byte-order mark (BOM) and collapse repeated whitespace
    log=[re.sub('\ufeff','',log[i]) for i in range(len(log))]
    log=[' '.join(log[i].split()) for i in range(len(log))]
    log=[re.sub(' +', ' ', log[i]) for i in range(len(log))]
    ### keep only lines that start with '<', a digit or '#'; all other lines are blanked below
    logical=[re.search(r'^<|^[0-9]|^#', log[i]) for i in range(len(log))]

    for i in range(len(log)):
        if logical[i] is None:
            log[i]=''
     
    ### to lower
    log=[log[i].lower() for i in range(len(log))]
    ###
    
    ### remove numbers if keep_numbers is False
    if not keep_numbers:
        log=[re.sub('[0-9]', '', log[i]) for i in range(len(log))]
    ###
    
    ### remove \r\n
    log=[re.sub('\\r\\n', '', log[i]) for i in range(len(log))]
    ###
    
    ### replace every run of non-alphanumeric characters (and underscores) with one space
    log=[re.sub(r'[\W_]+', ' ',log[i]) for i in range(len(log))]
    
    ### blank out every line that still contains non-ASCII (non-English) characters
    temp=[re.search('[\u0080-\uFFFF]+', log[i]) for i in range(len(log))]
    for i in range(len(temp)):
        if temp[i]:
            log[i]=''
    
    # whitespace to underscore
    log=[re.sub(' ','_',log[i]) for i in range(len(log))]
    
    for i in range(len(log)):
        if log[i]=='':
            log[i]='empty_line'
    
    return log

In [7]:
#apply the above function on every log data
example_df['Cleaned_Log']=example_df['Log'].map(lambda x: preprocess_to_log_lines(x, False))

In [8]:
#output overview
example_df

Unnamed: 0,Year_month,Count,Project,Instance,Filename,Log,Cleaned_Log
0,2101,01,P0987E ROBOT,artifacts_1,21_29_34_0_ETLog.txt,﻿2021-01-04 21:29:34.033 INFO [10888] [StartE...,[_info_startengineeringtool_starting_engineeri...
1,2101,01,P0987E ROBOT,artifacts_1,21_34_25_0_ETLog.txt,﻿2021-01-04 21:34:25.322 INFO [3332] [StartEn...,[_info_startengineeringtool_starting_engineeri...
2,2101,01,P0987E ROBOT,artifacts_1,21_38_10_0_ETLog.txt,﻿2021-01-04 21:38:10.492 INFO [3332] [StartEn...,[_info_startengineeringtool_starting_engineeri...
3,2101,01,P0987E ROBOT,artifacts_1,21_55_43_0_ETLog.txt,﻿2021-01-04 21:55:43.784 INFO [3332] [StartEn...,[_info_startengineeringtool_starting_engineeri...
4,2101,01,P0987E ROBOT,artifacts_1,21_56_14_0_ETLog.txt,﻿2021-01-04 21:56:14.892 INFO [3332] [StartEn...,[_info_startengineeringtool_starting_engineeri...
...,...,...,...,...,...,...,...
167,2101,04,P0987E ROBOT,artifacts_2,19_16_03_0_ETLog.txt,﻿2021-01-07 19:16:03.486 INFO [9760] [StartEn...,[_info_startengineeringtool_starting_engineeri...
168,2101,04,P0987E ROBOT,artifacts_2,19_18_50_0_ETLog.txt,﻿2021-01-07 19:18:50.216 INFO [9760] [StartEn...,[_info_startengineeringtool_starting_engineeri...
169,2101,04,P0987E ROBOT,artifacts_2,19_19_53_0_ETLog.txt,﻿2021-01-07 19:19:53.986 INFO [9760] [StartEn...,[_info_startengineeringtool_starting_engineeri...
170,2101,04,P0987E ROBOT,artifacts_2,19_20_56_0_ETLog.txt,﻿2021-01-07 19:20:56.627 INFO [9760] [StartEn...,[_info_startengineeringtool_starting_engineeri...


<b>Tokenization: </b>processing the non-numerical data into numerical data.

***
For example (this example is not related to our data, just for easy understanding):

"I like apple" will be processed into [1,2,3]

"She does not like apple" will be processed into [4,5,6,2,3]

The dictionary will be [I:1, like:2, apple:3, she:4, does:5, not:6]
***
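
The toy example above can be reproduced with a few lines of plain Python. This is only a sketch of the idea: `build_dictionary` is a hypothetical helper that numbers words in first-seen order, whereas the Keras `Tokenizer` used below assigns ids by word frequency, so the ids it produces for real data can differ from this numbering.

```python
def build_dictionary(sentences):
    """Assign the next free integer id to every new word, starting at 1."""
    word2id = {}
    for sentence in sentences:
        for word in sentence:
            if word not in word2id:
                word2id[word] = len(word2id) + 1
    return word2id

# The two toy sentences, already lowercased and split into words.
sentences = [["i", "like", "apple"],
             ["she", "does", "not", "like", "apple"]]

word2id_toy = build_dictionary(sentences)
sequences = [[word2id_toy[word] for word in sentence] for sentence in sentences]
```

In the log data, one "word" is a whole cleaned log line, so each line of a file becomes one integer.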

In [9]:
#This function is for the tokenization (transfer cleaned log sequence to integer sequence)
def cleanedLog_to_integerseq(data):
    cleanedLog_list=list(data.Cleaned_Log)#all cleaned logs, since the dictionary is built from every log
    tok = Tokenizer(oov_token=True)#the oov_token reserves id 1; it shows up as the key True in the dictionary
    tok.fit_on_texts(cleanedLog_list)
    sequences = tok.texts_to_sequences(cleanedLog_list)#assign the integer id to each log
    data["EventSequence"]=sequences
    word2id=tok.word_index #the dictionary(output this because we will use this later)
    return data,word2id

In [10]:
#add the column (the integer sequence) to the dataframe.
example_df,word2id=cleanedLog_to_integerseq(example_df)

In [11]:
example_df

Unnamed: 0,Year_month,Count,Project,Instance,Filename,Log,Cleaned_Log,EventSequence
0,2101,01,P0987E ROBOT,artifacts_1,21_29_34_0_ETLog.txt,﻿2021-01-04 21:29:34.033 INFO [10888] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,..."
1,2101,01,P0987E ROBOT,artifacts_1,21_34_25_0_ETLog.txt,﻿2021-01-04 21:34:25.322 INFO [3332] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,..."
2,2101,01,P0987E ROBOT,artifacts_1,21_38_10_0_ETLog.txt,﻿2021-01-04 21:38:10.492 INFO [3332] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15..."
3,2101,01,P0987E ROBOT,artifacts_1,21_55_43_0_ETLog.txt,﻿2021-01-04 21:55:43.784 INFO [3332] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15..."
4,2101,01,P0987E ROBOT,artifacts_1,21_56_14_0_ETLog.txt,﻿2021-01-04 21:56:14.892 INFO [3332] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,..."
...,...,...,...,...,...,...,...,...
167,2101,04,P0987E ROBOT,artifacts_2,19_16_03_0_ETLog.txt,﻿2021-01-07 19:16:03.486 INFO [9760] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,..."
168,2101,04,P0987E ROBOT,artifacts_2,19_18_50_0_ETLog.txt,﻿2021-01-07 19:18:50.216 INFO [9760] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,..."
169,2101,04,P0987E ROBOT,artifacts_2,19_19_53_0_ETLog.txt,﻿2021-01-07 19:19:53.986 INFO [9760] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,..."
170,2101,04,P0987E ROBOT,artifacts_2,19_20_56_0_ETLog.txt,﻿2021-01-07 19:20:56.627 INFO [9760] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,..."


<b>Additional step: </b>
    Add the "unknown" word to the dictionary. When unseen data is tokenized, every log line that is not in the dictionary is then mapped to this single "unknown" class.

In [12]:
word2id['unknown']=len(word2id)+1#add the unknown word
word2id#the dictionary looks like this

{True: 1,
 'empty_line': 2,
 '_returnvalues_': 3,
 '_verb_receivedata_recv_got_bytes': 4,
 '_info_receivedata_creating_new_temp_data_container_for_': 5,
 '_info_receivedata_temp_container_created': 6,
 '_info_receivedata_appending_new_data_to_buffer': 7,
 '_requestconfirms_': 8,
 '_responseconfirms_': 9,
 '_engineeringtooldiagnosticsresponse_': 10,
 '_requestinfo_sa_timestamp_positive_requestinfo_': 11,
 '_diagtimeout_': 12,
 '_info_receivedata_receivedata': 13,
 '_responseinfo_sa_response_positive_timestamp_': 14,
 '_verb_receivedata_recv_got_token': 15,
 '_info_sendcommand_sendcommand': 16,
 '_verb_receivedata_raw_rcv_buffer_as_string_': 17,
 '_engineeringtoolconsoleresponse_': 18,
 '_command_udsservices': 19,
 '_xml_start_udsservices': 20,
 '_xml_end_udsservices': 21,
 '_result_uds_service_': 22,
 '_verb_sendcommand_received_engineeringtooldiagnosticsresponse_': 23,
 '_responseinfo_': 24,
 '_datainfo_': 25,
 '_partnumber_partnumber_': 26,
 '_returnvalue_name_outputformat_xml_returnv

**Save the dictionary (mandatory step, because the dictionary will be used during anomaly detection)**

In [13]:
with open("word2id.pkl", "wb") as a_file:
    pickle.dump(word2id, a_file)

**Save the dataframe (mandatory step, because the dataframe will be used during training)**

In [14]:
example_df.to_pickle("example_df.pkl")

<a id='training_model'></a> 

# Training the model
`Input:  Numerical data`

`Output: Trained model`

*Process:* 
1. create the windows
2. train the model
***


**Create the windows and target (input data) for the model**
    
***
For example (this example is not related to our data, just for easy understanding):
***
The numerical data for one file is [1,2,3,4,5,6,7,8]

If we set the window size to 3 and the step size to 1, we get 5 windows:

[1,2,3],[2,3,4],[3,4,5],[4,5,6],[5,6,7]

The target is the item that directly follows each window, so this example has 5 targets:

[4], [5], [6], [7], [8]

We create the windows file by file.
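
The window and target construction for one sequence can be sketched in plain Python. The helper name `make_windows` is an assumption made for this illustration; it uses the same loop bounds as `generate_for_file` below but leaves out the NumPy conversion and the one-hot encoding of the targets.

```python
def make_windows(sequence, window_size, step):
    """Return the sliding windows and the item that follows each window."""
    windows, targets = [], []
    # Stop early enough that every window still has a following item.
    for start in range(0, len(sequence) - window_size, step):
        windows.append(sequence[start:start + window_size])
        targets.append(sequence[start + window_size])
    return windows, targets

toy_windows, toy_targets = make_windows([1, 2, 3, 4, 5, 6, 7, 8],
                                        window_size=3, step=1)
```

A sequence of length `L` yields `L - window_size` windows at step 1, so files shorter than or equal to the window size contribute nothing.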

In [15]:
#If you run this part independently, load the saved dataframe and dictionary first.
example_df=pd.read_pickle('example_df.pkl') 
with open('word2id.pkl','rb') as a_file:
    word2id=pickle.load(a_file)

In [16]:
#parameters to define
window_size=10 #length of the input
step=1 #the space between the start of the windows
embedding_size=128 #embedding size, need to try to find the best one
batch_size=32 #batch size
epochs=10 #amount of epochs 

volab=len(word2id)+1 #the vocab size is the length of the dictionary plus one:
#the ids start at 1, so the largest id n needs n+1 slots (index n is out of bounds for axis 0 with size n)

In [17]:
#This function is to generate the windows and target for the dataframe
def generate_for_file(data, window_size,step):
    #no separated block
    windows = []
    targets = []
    sequence=[]
    for i in range(len(data)):
        for item in range(0,len(data["EventSequence"].iloc[i])-window_size, step): 
            sequence=data.EventSequence.iloc[i]
            window=sequence[item:item+window_size]
            target=sequence[item+window_size]
                
            windows.append(window)
            targets.append(target)
            
    windows=np.array(windows)
    targets=to_categorical(targets, volab)

    return windows,targets        

In [18]:
#create the inputs and targets
windows,targets=generate_for_file(example_df, window_size,step)

**Build the model**
    
The number of LSTM layers and the number of units in each layer are choices you have to make yourself. The current model uses a single LSTM layer with 64 units.

In [19]:

model = Sequential()
model.add(Embedding(volab, embedding_size, input_length = window_size))
model.add(LSTM(64))#the value inside is the LSTM units
model.add(Dense(embedding_size, activation='relu'))
model.add(Dense(volab, activation='softmax')) 
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 10, 128)           25344     
_________________________________________________________________
lstm (LSTM)                  (None, 64)                49408     
_________________________________________________________________
dense (Dense)                (None, 128)               8320      
_________________________________________________________________
dense_1 (Dense)              (None, 198)               25542     
Total params: 108,614
Trainable params: 108,614
Non-trainable params: 0
_________________________________________________________________
None
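
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard counting formulas for these layer types with bias terms (embedding: vocabulary size times embedding size; LSTM: four gates, each with input, recurrent and bias weights; dense: inputs times outputs plus biases). The variable names are chosen for this illustration; the vocabulary size 198 is read off the output shape of the last layer.

```python
vocab_size = 198       # output size of the last Dense layer in the summary
embedding_size = 128   # as set in the parameter cell
lstm_units = 64        # units of the single LSTM layer

# Embedding: one vector of length embedding_size per id.
embedding_params = vocab_size * embedding_size
# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)
# Dense (relu): weights from the LSTM output plus one bias per unit.
dense_params = lstm_units * embedding_size + embedding_size
# Dense (softmax): weights to every id plus one bias per id.
output_params = embedding_size * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params + output_params
```

The four values come out as 25344, 49408, 8320 and 25542, which sum to the 108,614 parameters reported by Keras.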


**Train the model**

The batch size and the number of epochs can be chosen freely.

In [20]:
#Only the batch_size and epochs should be self-defined 
optim=tf.keras.optimizers.Adam(learning_rate=0.001)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau( monitor="val_loss", factor=0.5,
           patience=3, verbose=1)
logs = Callback()
model.compile(loss='categorical_crossentropy', optimizer=optim, metrics=['accuracy',tf.keras.metrics.Recall(),tf.keras.metrics.Precision(),])
model.fit(windows, targets, batch_size, epochs,shuffle = True, validation_split = 0.1, callbacks=[reduce_lr])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Epoch 00010: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.


<tensorflow.python.keras.callbacks.History at 0x1635bc46508>

<b>Save the model. This step is necessary because the saved model is used in the anomaly detection part.</b>

In [21]:
# Save model
model.save('your_saved_model_name.txt') #



INFO:tensorflow:Assets written to: your_saved_model_name.txt\assets


<a id='anomaly_detection'></a> 

# Anomaly detection:
`Input: unseen data, saved model, saved dictionary`

`Output: the predicted anomalous log (if needed the output can be saved in the txt file).`

*Process:* 
1. Data preprocessing for the test data.
2. Prediction


In [22]:
#If you run this part independently, load the data, the model and the dictionary first.
#window_size, step and volab must have the same values as during training (volab=len(word2id)+1).
test_data=pd.read_pickle('example_test_df.pkl') #the data for the anomaly detection (must not be the same as the training data)
model = load_model('your_saved_model_name.txt')#the trained model
word2id=pd.read_pickle('word2id.pkl')#the dictionary

**Same data preprocessing as the training data (have to be the same!!)**

In [23]:
#The create-dataframe and regex steps are not shown here,
#but both steps are exactly the same as for the training data.
#Assuming both steps have already been done, the dataframe looks like this:
test_data


Unnamed: 0,address,Filename,Log,Cleaned_Log
0,C:\Users\A373502\Documents\Anomalies\Communica...,06_30_44_0_ETLog.txt,﻿2021-01-25 06:30:44.474 INFO [13364] [StartE...,[_info_startengineeringtool_starting_engineeri...
1,C:\Users\A373502\Documents\Anomalies\Communica...,06_33_34_0_ETLog.txt,﻿2021-01-25 06:33:34.990 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...
2,C:\Users\A373502\Documents\Anomalies\Communica...,06_34_28_0_ETLog.txt,﻿2021-01-25 06:34:28.553 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...
3,C:\Users\A373502\Documents\Anomalies\Communica...,06_38_16_0_ETLog.txt,﻿2021-01-25 06:38:16.581 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...
4,C:\Users\A373502\Documents\Anomalies\Communica...,06_38_29_0_ETLog.txt,﻿2021-01-25 06:38:29.168 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...
...,...,...,...,...
76,C:\Users\A373502\Documents\Anomalies\InvalidOp...,03_59_35_0_ETLog.txt,﻿2021-02-20 03:59:35.411 INFO [10716] [StartE...,[_info_startengineeringtool_starting_engineeri...
77,C:\Users\A373502\Documents\Anomalies\InvalidOp...,04_00_17_0_ETLog.txt,﻿2021-02-20 04:00:17.873 INFO [4784] [StartEn...,[_info_startengineeringtool_starting_engineeri...
78,C:\Users\A373502\Documents\Anomalies\InvalidOp...,04_02_36_0_ETLog.txt,﻿2021-02-20 04:02:36.554 INFO [14932] [StartE...,[_info_startengineeringtool_starting_engineeri...
79,C:\Users\A373502\Documents\Anomalies\InvalidOp...,18_30_22_0_ETLog.txt,﻿2021-02-25 18:30:22.951 INFO [7936] [StartEn...,[_info_startengineeringtool_starting_engineeri...


The tokenization step is not 100% the same as for the training data: instead of building a new dictionary, we have to reuse the saved dictionary that was built when the training data was tokenized.
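
A compact sketch of this lookup is shown below. It assumes that the saved dictionary contains the key `'unknown'`, as added in the preprocessing part; the helper name `encode_with_saved_dictionary` and the toy dictionary are made up for this illustration.

```python
def encode_with_saved_dictionary(cleaned_lines, word2id, unknown_key="unknown"):
    """Map cleaned log lines to ids; lines not in the dictionary get the unknown id."""
    unknown_id = word2id[unknown_key]
    return [word2id.get(line, unknown_id) for line in cleaned_lines]

# Toy dictionary and toy cleaned log (made-up values, for illustration only).
toy_word2id = {"empty_line": 1, "_command_listecu": 2, "unknown": 3}
toy_ids = encode_with_saved_dictionary(
    ["_command_listecu", "_never_seen_before", "empty_line"], toy_word2id)
```

Looking up the id through the `'unknown'` key does not depend on the position at which that entry was added to the dictionary.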

In [24]:
#this function is to do the tokenization step for test data
def integerId_to_unseen_data(data,word2id):
    inx=()
    integerseq_id=[]
    id_for_block=[]
    for i in range(len(data)):

        lists=data.Cleaned_Log.iloc[i]#only read the cleaned log col
        for j in range(len(lists)):#read line by line    
            char=data.Cleaned_Log.iloc[i][j]#check line by line 
            if char in word2id:
                inx = word2id[char]#look up the id in the saved dictionary
            else:
                inx = len(word2id)#not in the dictionary: use the 'unknown' id, which was added last and therefore equals len(word2id)
            integerseq_id += [inx]       
        id_for_block+=[integerseq_id]
        integerseq_id= []#empty for each block
        
    data['EventSequence']=id_for_block
    return data

In [25]:
test_data=integerId_to_unseen_data(test_data,word2id)

In [26]:
test_data#output overview

Unnamed: 0,address,Filename,Log,Cleaned_Log,EventSequence
0,C:\Users\A373502\Documents\Anomalies\Communica...,06_30_44_0_ETLog.txt,﻿2021-01-25 06:30:44.474 INFO [13364] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15..."
1,C:\Users\A373502\Documents\Anomalies\Communica...,06_33_34_0_ETLog.txt,﻿2021-01-25 06:33:34.990 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15..."
2,C:\Users\A373502\Documents\Anomalies\Communica...,06_34_28_0_ETLog.txt,﻿2021-01-25 06:34:28.553 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15..."
3,C:\Users\A373502\Documents\Anomalies\Communica...,06_38_16_0_ETLog.txt,﻿2021-01-25 06:38:16.581 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15..."
4,C:\Users\A373502\Documents\Anomalies\Communica...,06_38_29_0_ETLog.txt,﻿2021-01-25 06:38:29.168 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 15, 45, 46, 47..."
...,...,...,...,...,...
76,C:\Users\A373502\Documents\Anomalies\InvalidOp...,03_59_35_0_ETLog.txt,﻿2021-02-20 03:59:35.411 INFO [10716] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,..."
77,C:\Users\A373502\Documents\Anomalies\InvalidOp...,04_00_17_0_ETLog.txt,﻿2021-02-20 04:00:17.873 INFO [4784] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15..."
78,C:\Users\A373502\Documents\Anomalies\InvalidOp...,04_02_36_0_ETLog.txt,﻿2021-02-20 04:02:36.554 INFO [14932] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15..."
79,C:\Users\A373502\Documents\Anomalies\InvalidOp...,18_30_22_0_ETLog.txt,﻿2021-02-25 18:30:22.951 INFO [7936] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,..."


**Create the windows for each block**

Split the log data of each file into blocks. A new block starts at every "_info_sendcommand_sendcommand" line.

***
For example:

Suppose the numerical data of one file is: [1,2,3,4,6,2,3,5,6,7,6,2,4,6,8...]

Here 2 is the integer id of the command. The numerical data is then split into blocks, and each block after the first starts with the command that closed the previous one:
[1,2], [2,3,4,6,2],[2,3,5,6,7,6,2],[2,4,6,8....]

Then we create the windows inside each block (window size 3 here), so that no window crosses a block boundary:
[2,3,4],[3,4,6],[4,6,2],[2,3,5],[3,5,6],[5,6,7],[6,7,6],[7,6,2],.....
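
The block example can be reproduced with the sketch below. The helper names `split_into_blocks` and `windows_with_targets` are assumptions made for this illustration; the splitting follows the same idea as `split_at_values` further down, and the windowing follows the loop bounds of `for_block`, where a window is only paired with a target when another item follows it inside the same block.

```python
def split_into_blocks(sequence, marker):
    """Split at every marker; the marker closes one block and opens the next."""
    indices = [i for i, item in enumerate(sequence) if item == marker]
    starts = [0] + indices
    ends = indices + [len(sequence)]
    return [sequence[start:end + 1] for start, end in zip(starts, ends)]

def windows_with_targets(block, window_size, step=1):
    """Pair each window with the item that follows it inside the block."""
    return [(block[i:i + window_size], block[i + window_size])
            for i in range(0, len(block) - window_size, step)]

toy_sequence = [1, 2, 3, 4, 6, 2, 3, 5, 6, 7, 6, 2, 4, 6, 8]
toy_blocks = split_into_blocks(toy_sequence, marker=2)
toy_pairs = windows_with_targets(toy_blocks[1], window_size=3)
```

For the second block `[2,3,4,6,2]` this gives the pairs `([2,3,4], 6)` and `([3,4,6], 2)`; the trailing window `[4,6,2]` has no following item in its block and therefore produces no prediction.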

In [27]:
#check the corresponding integer id
word2id['_info_sendcommand_sendcommand']

16

In [28]:
#this function splits a list at the given values; a splitting value ends one block and starts the next
def split_at_values(lst, values):
    indices = [i for i, x in enumerate(lst) if x in values]
    for start, end in zip([0, *indices], [*indices, len(lst)]):
        yield lst[start:end+1]

In [29]:
#create the blocks
values = {word2id['_info_sendcommand_sendcommand']}
blocks=[]
for i in range(len(test_data)):
    lst_A = [test_data.EventSequence.iloc[i]]#just read the integer sequence column
    output = list(chain.from_iterable(split_at_values(sublst, values) for sublst in lst_A))   
    blocks.append(output)

In [30]:
test_data['Block_Sequence']=blocks #add this info into the dataframe

In [31]:
test_data

Unnamed: 0,address,Filename,Log,Cleaned_Log,EventSequence,Block_Sequence
0,C:\Users\A373502\Documents\Anomalies\Communica...,06_30_44_0_ETLog.txt,﻿2021-01-25 06:30:44.474 INFO [13364] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15...","[[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6..."
1,C:\Users\A373502\Documents\Anomalies\Communica...,06_33_34_0_ETLog.txt,﻿2021-01-25 06:33:34.990 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15...","[[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6..."
2,C:\Users\A373502\Documents\Anomalies\Communica...,06_34_28_0_ETLog.txt,﻿2021-01-25 06:34:28.553 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15...","[[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6..."
3,C:\Users\A373502\Documents\Anomalies\Communica...,06_38_16_0_ETLog.txt,﻿2021-01-25 06:38:16.581 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15...","[[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6..."
4,C:\Users\A373502\Documents\Anomalies\Communica...,06_38_29_0_ETLog.txt,﻿2021-01-25 06:38:29.168 INFO [17640] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 15, 45, 46, 47...","[[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 15, 45,..."
...,...,...,...,...,...,...
76,C:\Users\A373502\Documents\Anomalies\InvalidOp...,03_59_35_0_ETLog.txt,﻿2021-02-20 03:59:35.411 INFO [10716] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,...","[[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6..."
77,C:\Users\A373502\Documents\Anomalies\InvalidOp...,04_00_17_0_ETLog.txt,﻿2021-02-20 04:00:17.873 INFO [4784] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15...","[[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6..."
78,C:\Users\A373502\Documents\Anomalies\InvalidOp...,04_02_36_0_ETLog.txt,﻿2021-02-20 04:02:36.554 INFO [14932] [StartE...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15...","[[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6..."
79,C:\Users\A373502\Documents\Anomalies\InvalidOp...,18_30_22_0_ETLog.txt,﻿2021-02-25 18:30:22.951 INFO [7936] [StartEn...,[_info_startengineeringtool_starting_engineeri...,"[43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 4,...","[[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6..."


<b>Do the anomaly detection</b>

In [47]:
#this function is to generate the windows and target only for one block
def for_block(integerseq,window_size,step):
    windows=[]
    targets=[]
    for item in range(0,len(integerseq)-window_size, step):
        sentence=integerseq[item:item+window_size]
        target=integerseq[item+window_size]
        windows.append(sentence)
        targets.append(target)
    windows=np.array(windows)
    targets=to_categorical(targets, volab)
    
    return windows,targets

In [33]:
#%%capture cap --no-stderr 
#The above code is to save the output of this cell into txt file.

#do the anomaly detection for each block
# for one file:
#    for one block:
#        generate the input(I named sentence here) and target
#        do prediction by the saved model
#        compare the prediction and the target
#        if not the same:
#           output this as an anomaly

for i in range(len(test_data)):
    print(test_data.address.iloc[i],test_data.Filename.iloc[i])
    window_num=[]
    for block in range(1,len(test_data.Block_Sequence.iloc[i])):
        one_block=test_data.Block_Sequence.iloc[i][block]
        window_n=len(one_block)-window_size
        window_num.append(window_n)

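        sentences,targets=for_block(one_block,window_size,step)
        if len(sentences)==0:#the block is too short to form a window, so there is nothing to predict
            continue
        #print('In the',block,'th block, block length is',len(one_block),", has",len(targets),"prediction.")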
        
        n=10
        prediction = model.predict(sentences)
        preds=(-prediction).argsort()[:,:n]
        truth=(-targets).argsort()[:,0]
        truth=truth[:, None]
        
        for j in range(len(prediction)): #for each blocks  
            if truth[j] not in preds[j]:
                #print("Anomaly in the",j,'th line')
         
                #print(sum(window_num)-window_num[block-1]+9*block+2)
                location_num=sum(window_num)-window_num[block-1]+9*block+j+2
                Logs_list=list(test_data.Log[i].splitlines())
                
                print("Anomaly is around the",location_num+1,'th line:',Logs_list[location_num])

    print('\n')


C:\Users\A373502\Documents\Anomalies\CommunicationException\no_anomaly.BSW.2101.18.P3237A.ROBOT.artifacts.zip 06_30_44_0_ETLog.txt
Anomaly is around the 234 th line:           <SerialNumber>20220007</SerialNumber>
Anomaly is around the 235 th line:         </HardwareInfo>
Anomaly is around the 236 th line:       </MainHardwareTea2Plus>
Anomaly is around the 241 th line:       <Errors>
Anomaly is around the 242 th line:         <Error Description="DID f190 not included in node response.">
Anomaly is around the 243 th line:           <SubDescription />
Anomaly is around the 269 th line:           <SerialNumber>20220007</SerialNumber>
Anomaly is around the 270 th line:         </HardwareInfo>
Anomaly is around the 271 th line:       </MainHardwareTea2Plus>
Anomaly is around the 276 th line:       <Errors>
Anomaly is around the 277 th line:         <Error Description="DID f190 not included in node response.">
Anomaly is around the 278 th line:           <SubDescription />
Anomaly is around



C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E.ROBOT.artifacts.zip 00_14_38_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E.ROBOT.artifacts.zip 00_19_38_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E.ROBOT.artifacts.zip 00_20_58_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E.ROBOT.artifacts.zip 00_32_30_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E.ROBOT.artifacts.zip 00_33_01_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E.ROBOT.artifacts.zip 00_34_09_0_ETLog.txt
Anomaly is around the 348 th line: ### 2021-01-19 01:34:31 Progress udsservices "DiagSessionControl" 0...10...20...30...40...50...60...70...80...90...100


C:\Users\A373502\Documents\Anomali



C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E_BP.ROBOT.artifacts.zip 01_02_24_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E_BP.ROBOT.artifacts.zip 01_06_05_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E_BP.ROBOT.artifacts.zip 01_24_01_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E_BP.ROBOT.artifacts.zip 01_24_33_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E_BP.ROBOT.artifacts.zip 01_25_39_0_ETLog.txt
Anomaly is around the 348 th line: ### 2021-01-19 02:26:01 Progress udsservices "DiagSessionControl" 0...10...20...30...40...50...60...70...80...90...100


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\no_anomaly.BSW.2101.12.P3226E_BP.ROBOT.artifacts.zip 01_26_45_0_ETLog.txt


C:\Users\A373502



C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\with_anomaly.BSW.2101.13.P3226E.ROBOT.artifacts.zip 21_51_53_0_ETLog.txt
Anomaly is around the 347 th line: ### 2021-01-19 22:52:15 Progress udsservices "DiagSessionControl" 0...10...20...30...40...50...60...70...80...90...100


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\with_anomaly.BSW.2101.13.P3226E.ROBOT.artifacts.zip 21_52_56_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\with_anomaly.BSW.2101.13.P3226E.ROBOT.artifacts.zip 21_53_23_0_ETLog.txt
Anomaly is around the 562 th line: 2021-01-19 21:53:38.949 VERB  [24332] [SendCommand@258] Received: <EngineeringToolConsoleResponse Command="disconnect" ResponseType="SimpleGeneric">


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\with_anomaly.BSW.2101.13.P3226E.ROBOT.artifacts.zip 21_53_58_0_ETLog.txt
Anomaly is around the 405 th line:         <ReturnValue Name="DataRecord">13 ce 00 03 c0 74</ReturnValue>
Anomaly 



C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\with_anomaly.BSW.2101.13.P3226E_BP.ROBOT.artifacts.zip 22_42_44_0_ETLog.txt
Anomaly is around the 348 th line: ### 2021-01-19 23:43:06 Progress udsservices "DiagSessionControl" 0...10...20...30...40...50...60...70...80...90...100


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\with_anomaly.BSW.2101.13.P3226E_BP.ROBOT.artifacts.zip 22_43_47_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\with_anomaly.BSW.2101.13.P3226E_BP.ROBOT.artifacts.zip 22_44_14_0_ETLog.txt


C:\Users\A373502\Documents\Anomalies\Error_reading_DOID_P1FRS\with_anomaly.BSW.2101.13.P3226E_BP.ROBOT.artifacts.zip 22_44_49_0_ETLog.txt
Anomaly is around the 405 th line:         <ReturnValue Name="DataRecord">13 ce 00 03 c0 74</ReturnValue>
Anomaly is around the 424 th line:         <ReturnValue Name="DataRecord">13 ce 00 03 c0 74</ReturnValue>
Anomaly is around the 453 th line:         <ReturnValue Name="DataRecord">

Anomaly is around the 235 th line:         <PartNumber>[50,33,32,32,36,45,5f,32]  is not a valid part number.</PartNumber>
Anomaly is around the 236 th line:         <Revision>Value '102' is not valid as issue and revision.</Revision>
Anomaly is around the 237 th line:         <BuildId />
Anomaly is around the 240 th line:         <DataInfo>
Anomaly is around the 242 th line:           <PartNumber>[50,33,32,32,36,45,5f,32]  is not a valid part number.</PartNumber>
Anomaly is around the 243 th line:           <Caption>Missing</Caption>
Anomaly is around the 244 th line:           <Revision>Value '102' is not valid as issue and revision.</Revision>
Anomaly is around the 249 th line:           <PartNumber>[50,33,32,32,36,45,5f,32]  is not a valid part number.</PartNumber>
Anomaly is around the 251 th line:           <Revision>Value '102' is not valid as issue and revision.</Revision>
Anomaly is around the 254 th line:         <DataInfo>
Anomaly is around the 256 th line:           <PartNu

Anomaly is around the 642 th line:         <ReturnValue Name="DataRecord">f4 9e ff ff</ReturnValue>
Anomaly is around the 661 th line:         <ReturnValue Name="DataRecord">f4 9e ff ff</ReturnValue>
Anomaly is around the 709 th line:         <ReturnValue Name="DataRecord">13 d3 00 00 00 04</ReturnValue>
Anomaly is around the 738 th line:         <ReturnValue Name="DataRecord">13 d3 00 00 01 f4</ReturnValue>
Anomaly is around the 757 th line:         <ReturnValue Name="DataRecord">13 d3 00 00 01 f4</ReturnValue>
Anomaly is around the 786 th line:         <ReturnValue Name="DataRecord">13 d3 ff ff ff ff</ReturnValue>
Anomaly is around the 805 th line:         <ReturnValue Name="DataRecord">13 d3 ff ff ff ff</ReturnValue>
Anomaly is around the 882 th line:         <ReturnValue Name="DataRecord">12 9a ff ff ff ff</ReturnValue>
Anomaly is around the 901 th line:         <ReturnValue Name="DataRecord">12 9a ff ff ff ff</ReturnValue>
Anomaly is around the 930 th line:         <ReturnValue Na

Anomaly is around the 992 th line:           <SerialNumber>19444635</SerialNumber>
Anomaly is around the 993 th line:         </HardwareInfo>
Anomaly is around the 994 th line:       </MainHardwareTea2Plus>
Anomaly is around the 1049 th line:           <SerialNumber>19444635</SerialNumber>
Anomaly is around the 1050 th line:         </HardwareInfo>
Anomaly is around the 1051 th line:       </MainHardwareTea2Plus>


C:\Users\A373502\Documents\Anomalies\InvalidOperation\no_anomaly.BSW3.1510.2102.04.P3225E_BP_NOAHI.ROBOT.artifacts.zip 03_46_51_0_ETLog.txt
Anomaly is around the 233 th line:           <SerialNumber>19444635</SerialNumber>
Anomaly is around the 234 th line:         </HardwareInfo>
Anomaly is around the 235 th line:       </MainHardwareTea2Plus>
Anomaly is around the 290 th line:           <SerialNumber>19444635</SerialNumber>
Anomaly is around the 291 th line:         </HardwareInfo>
Anomaly is around the 292 th line:       </MainHardwareTea2Plus>


C:\Users\A373502\Document

Anomaly is around the 779 th line:       </Errors>
Anomaly is around the 780 th line:     </Node>
Anomaly is around the 781 th line:   </Nodes>




<b>Save the output to a text file if needed</b>

In [None]:
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(cap.stdout)
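
The cell above relies on `cap`, which presumably comes from an IPython `%%capture cap` cell magic placed on the detection cell (that cell header is not shown here, so this is an assumption). Outside a notebook, a minimal sketch of the same idea can use the standard library's `contextlib.redirect_stdout`. The function `report_anomalies` and the temporary output location are hypothetical stand-ins, not part of the original workflow.

```python
import contextlib
import io
import os
import tempfile


def report_anomalies():
    # Hypothetical stand-in for the detection loop, which prints its findings.
    print("Anomaly is around the 1 th line: example log line")


# Collect everything printed inside the with-block in an in-memory buffer.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    report_anomalies()
captured = buffer.getvalue()

# Write the captured text to a file (a temporary directory is used here).
output_path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(output_path, "w", encoding="utf-8") as f:
    f.write(captured)
```

Nothing is printed to the screen while the redirect is active, so print `captured` afterwards if you also want to see the report.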

<a id='evaluation_function'></a> 

# Evaluation function
`Input: the anomalous file names and the dataframe with the address path and log data (assuming the data preprocessing has already been done)`

`Output: evaluation results`

*Process:* 
1. Label the data 
2. Prediction 
3. Evaluate with the top-n method
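
To make the top-n step concrete, here is a small sketch on made-up numbers. It assumes the model outputs one probability per event id for every window, which matches how `prediction` is used in this notebook. The toy arrays and the names `true_ids`, `top_n`, `hits` and `file_is_anomalous` are illustrative only.

```python
import numpy as np

# Toy model output for 3 windows over a 5-event vocabulary (made-up numbers).
prediction = np.array([
    [0.04, 0.60, 0.20, 0.10, 0.06],
    [0.40, 0.30, 0.15, 0.10, 0.05],
    [0.02, 0.08, 0.12, 0.18, 0.60],
])
# Event id that actually followed each window.
true_ids = np.array([2, 3, 4])

n = 2  # threshold: how many top-ranked candidates count as "expected"

# Indices of the n most probable events per window, most probable first.
top_n = (-prediction).argsort(axis=1)[:, :n]

# A window is a hit when its true event is among the top-n candidates.
hits = (top_n == true_ids[:, None]).any(axis=1)

# The file is predicted anomalous as soon as one window misses.
file_is_anomalous = not hits.all()

print(top_n.tolist())      # [[1, 2], [0, 1], [4, 3]]
print(hits.tolist())       # [True, False, True]
print(file_is_anomalous)   # True
```

A larger `n` makes the check more tolerant (fewer files flagged), a smaller `n` makes it stricter, which is why the threshold has to be tuned on labelled data.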

In [34]:
#if you know which files are anomalous, add their names to the anomalous list here
anomalous_list=['09_41_55_0_ETLog.txt', '09_43_05_0_ETLog.txt', '21_53_58_0_ETLog.txt', '22_44_49_0_ETLog.txt', '15_17_02_0_ETLog.txt', '15_22_00_0_ETLog.txt', '18_31_49_0_ETLog.txt', '18_30_22_0_ETLog.txt']

In [35]:
#label the data: 1 if the file is anomalous, 0 if it is not.
#you could also use text labels such as "anomalous" and "non-anomalous", but 1 and 0 are easier to work with.
labels=[]
for i in range(len(test_data)):
    if any(word in test_data.Filename.iloc[i] for word in anomalous_list):
        n=1
    else:
        n=0
    labels.append(n)

In [36]:
#add the label into our dataframe
test_data['Label']=labels

In [37]:
#data overview
test_data.head()

| Unnamed: 0 | address | Filename | Log | Cleaned_Log | EventSequence | Block_Sequence | Label |
|---|---|---|---|---|---|---|---|
| 0 | C:\Users\A373502\Documents\Anomalies\Communica... | 06_30_44_0_ETLog.txt | 2021-01-25 06:30:44.474 INFO [13364] [StartE... | [_info_startengineeringtool_starting_engineeri... | [43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15... | [[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6... | 0 |
| 1 | C:\Users\A373502\Documents\Anomalies\Communica... | 06_33_34_0_ETLog.txt | 2021-01-25 06:33:34.990 INFO [17640] [StartE... | [_info_startengineeringtool_starting_engineeri... | [43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15... | [[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6... | 0 |
| 2 | C:\Users\A373502\Documents\Anomalies\Communica... | 06_34_28_0_ETLog.txt | 2021-01-25 06:34:28.553 INFO [17640] [StartE... | [_info_startengineeringtool_starting_engineeri... | [43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15... | [[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6... | 0 |
| 3 | C:\Users\A373502\Documents\Anomalies\Communica... | 06_38_16_0_ETLog.txt | 2021-01-25 06:38:16.581 INFO [17640] [StartE... | [_info_startengineeringtool_starting_engineeri... | [43, 16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6, 7, 15... | [[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 4, 5, 6... | 0 |
| 4 | C:\Users\A373502\Documents\Anomalies\Communica... | 06_38_29_0_ETLog.txt | 2021-01-25 06:38:29.168 INFO [17640] [StartE... | [_info_startengineeringtool_starting_engineeri... | [43, 16, 44, 2, 13, 4, 5, 6, 7, 15, 45, 46, 47... | [[43, 16], [16, 44, 2, 13, 4, 5, 6, 7, 15, 45,... | 0 |


In [42]:
#generate the windows and targets for one file (a file is a list of blocks)

def for_file(integerseq,window_size,step):
    windows=[]
    targets=[]
    for i in range(1,len(integerseq)):
        file_seq=integerseq[i]
        for item in range(0,len(file_seq)-window_size, step):
            sentence=file_seq[item:item+window_size]
            target=file_seq[item+window_size]
            windows.append(sentence)
            targets.append(target)
    windows=np.array(windows)
    targets=to_categorical(targets, volab)
    
    return windows,targets
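
As a worked illustration of the windowing inside `for_file`, the sketch below slides a window over a single toy block in plain Python. The helper name `make_windows`, the toy block, and the values `window_size=3` and `step=2` are assumptions chosen for readability; the notebook's real `window_size` and `step` are defined elsewhere.

```python
def make_windows(sequence, window_size, step):
    """Slide a fixed-size window over one integer sequence.

    Each window is paired with the event that directly follows it.
    """
    windows, targets = [], []
    for start in range(0, len(sequence) - window_size, step):
        windows.append(sequence[start:start + window_size])
        targets.append(sequence[start + window_size])
    return windows, targets


# Toy block of event ids (illustrative values only).
block = [4, 5, 6, 7, 4, 5, 6, 7, 15]
windows, targets = make_windows(block, window_size=3, step=2)

print(windows)  # [[4, 5, 6], [6, 7, 4], [4, 5, 6]]
print(targets)  # [7, 5, 7]
```

A block that is not longer than `window_size` produces no windows at all. Note also that `for_file` starts its outer loop at index 1, so the first block of each file is not used.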

In [46]:
#do the anomaly detection for each file

# for one file:
#    generate the inputs (named "sentences" here) and the targets
#    run the prediction with the saved model
#    choose the threshold n, then check whether each target is among the top-n predictions
#    if any target is not among the top-n predictions (anomaly prediction):
#         if the file's label is 1 (anomalous):
#              count it as one TP
#         if the file's label is 0 (non-anomalous):
#              count it as one FP
#    if every target is among the top-n predictions (normal prediction):
#         if the file's label is 0 (non-anomalous):
#              count it as one TN
#         if the file's label is 1 (anomalous):
#              count it as one FN


TN=0 #true negative
FP=0 #false positive
FN=0 #false negative
TP=0 #true positive
precision=0
recall=0
accuracy=0
f1_score=0

####the threshold n below has to be chosen by the user
n=10 #you can choose any other number

for i in range(len(test_data)):
    one_file=test_data.Block_Sequence.iloc[i]
    sentences,targets=for_file(one_file,window_size,step)
    #print('In the',i,'th file, sequence length is',len(sentences))
        
    
    prediction = model.predict(sentences)
    preds=(-prediction).argsort()[:,:n]
    truth=(-targets).argsort()[:,0]
    truth=truth[:, None]
    
    #share of windows whose true next event is among the top-n predictions
    hit_ratio=sum([truth[j] in preds[j] for j in range(0,len(sentences))])/len(prediction)
    if hit_ratio != 1: # anomaly prediction: at least one window missed
        if test_data.Label.iloc[i] == 1:
            TP+=1
        else: #label is normal
            FP+=1
    else:  #normal prediction: every window hit
        #if "no_anomaly" in test_data.label.iloc[i]: #label is normal
        if test_data.Label.iloc[i] == 0:
            TN+=1
        else: #label is anomaly
            FN+=1
            

                 
        

print("Top",n,":")
print("True positive (anomaly with anomaly prediction):",TP)
print("False positive (normal with anomaly prediction):",FP)
print("False negative (anomaly with normal prediction): ",FN)
print("True negative (normal with normal prediction):",TN)
            
accuracy=(TP+TN)/(TP+TN+FP+FN)
precision=TP/(TP+FP)
recall=TP/(TP+FN)
f1_score=(2*recall*precision)/(recall+precision)

print("Accuracy:",accuracy)
print("Precision:",precision)
print("Recall:",recall)
print("F1 score:",f1_score)
        

 

Top 10 :
True positive (anomaly with anomaly prediction): 8
False positive (normal with anomaly prediction): 21
False negative (anomaly with normal prediction):  0
True negative (normal with normal prediction): 52
Accuracy: 0.7407407407407407
Precision: 0.27586206896551724
Recall: 1.0
F1 score: 0.4324324324324324
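
The metric formulas above divide by `TP+FP`, `TP+FN` and `recall+precision`, any of which can be zero for some thresholds (for example when no file is flagged). A minimal sketch of a guarded version follows. The helper name `classification_metrics` is hypothetical, and returning `0.0` for an undefined metric is one common convention, not something the notebook prescribes.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts.

    Undefined ratios (zero denominator) are reported as 0.0 by assumption.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts taken from the top-10 run reported above.
metrics = classification_metrics(tp=8, fp=21, fn=0, tn=52)
for name, value in metrics.items():
    print(name, round(value, 4))
```

With these counts the helper reproduces the figures printed above: accuracy 60/81, precision 8/29, recall 1.0 and F1 16/37.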
