#***Web robot detection based on Log access Pattern recognition***

####17102042 Min-seon Kim
####17102062 Yong-Hoon Lee



##- Background[1]




<img src="https://drive.google.com/uc?id=1a1iV7b3dJDqakWtiLRkS7_xNk88m7B0F" width="700">

According to a recent survey, 37.2% of all internet users in 2020 were robots: 13.1% good bots and 24.1% bad bots. 
Bad bots are considered malicious because they usually threaten the security and privacy of web applications and their users. 

In this project, we apply the workflow pipeline proposed in "*DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning*" to web robot detection, using URL access patterns.
The core assumption of our model is that humans leave log records across URLs on closely related topics, whereas robots repeat a standardized pattern regardless of the subject. For that reason, we can detect the presence of robots by capturing the URL access patterns produced by various web bots.


## - Dataset[2]
 
<img src="https://drive.google.com/uc?id=1CKIBAo0RFxc5P806dWmegGGSBG7Djsil" width="700">

We used the open web robot server log dataset published on Zenodo. It contains server logs from the search engine of the Library and Information Center of the Aristotle University of Thessaloniki in Greece (http://search.lib.auth.gr/). <br>
The search engine lets users check the availability of books and other written works, and search digitized material and scientific publications.
<br>
<br>


 
<img src="https://drive.google.com/uc?id=1ClJDTdhauvd8jnxOW_JznHo1PQlZXBO0" width="1000">

There are 9 columns in the data : 

referrer, request, method, resource, bytes, response, ip, useragent, timestamp

The timestamps range from March 1 to March 31, 2018. <br>

Total request : 4,091,155 requests  <br>
Average request per day : 131,973 requests <br>
Total unique IP : 27,061  <br>
Total unique user-agent : 3,441 <br>
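
As a quick sanity check on the figures above, the daily average follows from dividing the total number of requests by the 31 days of March. The variable names are illustrative only, and the per-IP average is derived here rather than reported in the dataset description.

```python
# Figures quoted in the dataset summary above
total_requests = 4_091_155
days_in_march = 31
unique_ips = 27_061

# Average number of requests per day, rounded to the nearest integer
avg_per_day = round(total_requests / days_in_march)

# Average number of requests per unique IP (derived, not part of the summary)
avg_per_ip = total_requests / unique_ips

print(avg_per_day)
print(f"{avg_per_ip:.1f}")
```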


In [None]:
import pandas as pd
import json
import numpy as np
from google.colab import drive
drive.mount('/gdrive', force_remount=True)
# load data
test_file = pd.read_csv('/gdrive/My Drive/data300000.csv', encoding='utf-8')

In [None]:
test_file.head()

## - Data preprocessing
<br>

### Using column
* We use only 5 columns: request, response, ip, useragent, and timestamp.

###'Request' column
*  We use the first path segment that follows the GET method in the request as its representative. A GET request carries its information in the URL, and the first segment is the most informative part, so it stands in for the whole request.

* A label encoder converts 'request' from string to int.

### 'Label' column
* A label column is added: 1 means bot and 0 means human. The label is derived from the user agent string.
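
The three preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: it assumes each request is a plain HTTP request line such as `GET /Record/123 HTTP/1.1` (the raw dataset field may be laid out differently), and the helper names `first_path_segment`, `label_from_useragent`, and `encode` are hypothetical.

```python
# Substrings that mark a user agent as a bot (same markers as the notebook)
BOT_MARKERS = ("bot", "crawl", "BUbiNG", "Bot", "Crawl")

def first_path_segment(request_line):
    """Return the first path segment of the URL, or "Timeout" if the line is malformed."""
    parts = str(request_line).split()
    if len(parts) < 2 or not parts[1].startswith("/"):
        return "Timeout"
    return parts[1].split("/")[1]

def label_from_useragent(useragent):
    """Return 1 (bot) if any marker occurs in the user agent, else 0 (human)."""
    return 1 if any(marker in str(useragent) for marker in BOT_MARKERS) else 0

def encode(values):
    """Map each distinct string to an integer in sorted order."""
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping

# Toy request lines and user agents (made up for illustration)
requests = [
    "GET /Record/123 HTTP/1.1",
    "GET /Search/Results?lookfor=x HTTP/1.1",
    "GET /Record/456 HTTP/1.1",
    "-",
]
agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1)",
    "Mozilla/5.0 (Windows NT 10.0) Firefox/58.0",
]

keys = [first_path_segment(r) for r in requests]
codes, mapping = encode(keys)
labels = [label_from_useragent(a) for a in agents]
print(keys, codes, labels)
```

Collapsing every URL to its first segment keeps the number of classes small, which matters later because the log key model predicts one class per distinct request value.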


In [None]:
# use 5 columns
need_col = ['request', 'response', 'ip', 'useragent', 'timestamp']

df = test_file[need_col]
df = df.dropna(axis=0)

In [None]:
# timestamp change to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# request preprocessing
df['request'] = df['request'].apply(lambda x: str(x).split()[7].split('/')[1] if (str(x).split()[7].split('/')[0]=="") else "Timeout")

df['request'] = df['request'].apply(lambda x:"rc4.js?" if x[:7]=="rc4.js?" else x)
df['request'] = df['request'].apply(lambda x:"favicon.ico?" if x[:12]=="favicon.ico?" else x)
df['request'] = df['request'].apply(lambda x:"sitemap-n.xml" if x[:8]=="sitemap-" else x)

In [None]:
from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder() # encoder that maps strings to ints
df['request'] = LE.fit_transform(df['request']) # encode each request string as an integer
df = df.sort_values(by=['timestamp'])

# make label column by using useragent: 1 if any bot marker appears, else 0
bot_markers = ('bot', 'crawl', 'BUbiNG', 'Bot', 'Crawl')
df['label'] = df['useragent'].apply(lambda x: 1 if any(m in str(x) for m in bot_markers) else 0)
num_classes = len(df['request'].unique())

# index reset
df.reset_index(drop=True, inplace=True)

In [None]:
df.head()

##- Method and Algorithm

###Referred paper [3] <br>
We referred to the DeepLog paper, published in the Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 
<br>
<br>

###**Method1 : log key and parameter**
DeepLog paper
<br>
<img src="https://drive.google.com/uc?id=1Ha7t3iyY4akuIrguirPpd3airL7czvO5" width="700">
<br>
Our method flow
<br>
<img src="https://drive.google.com/uc?id=1xUO9Fgdw83-IfLlP0MEAZT_D1Y44ELiX" width="700">
<br>
<br>

Every log entry is divided into a key and parameters. The key is the body of the log message with its specific values removed, and the parameters are the variable values that fill those slots. Parsing the log entries with the log parser therefore turns each log into a log key and a parameter value vector. In this step, the parameter vector holds the time difference between the current log and the previous log. 
Two separate models are trained per unique IP: one on the order of log keys and one on the sequence of parameter values.
When a new log entry arrives, its log key is first checked against the pretrained log key anomaly detection model. If the key is not anomalous, the parameter value vector is then checked against the parameter value anomaly detection model for that log key. After both checks, the result determines whether the IP accessing the website belongs to a human or a robot. 
<br>
<br>
To sum up, in our project a log entry is the combination of a log key, taken from the “request” column, and a parameter value, which is the time difference between the current log and the previous log.

<img src="https://drive.google.com/uc?id=1DhahieGUkVKjAkArqL-3UslOjGn-s6Fl" width="700">
<br>
<br>
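
The two-stage check described above can be sketched as follows. This is a minimal illustration under assumptions: the hypothetical helpers `top_n` and `is_anomalous_entry` take toy probability lists in place of the trained LSTM outputs, and the candidate counts `n_key` and `n_param` are free parameters.

```python
def top_n(probabilities, n):
    """Indices of the n highest probabilities, best first."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:n]

def is_anomalous_entry(key_probs, actual_key, param_probs, actual_param,
                       n_key, n_param):
    """Return True if the entry fails either stage of the check."""
    # Stage 1: the observed log key must be among the top key candidates
    if actual_key not in top_n(key_probs, n_key):
        return True
    # Stage 2: reached only when the key looks normal
    return actual_param not in top_n(param_probs, n_param)

# Toy model outputs (made up): 4 log key classes, 3 parameter classes
key_probs = [0.10, 0.60, 0.05, 0.25]
param_probs = [0.7, 0.2, 0.1]

normal = is_anomalous_entry(key_probs, 3, param_probs, 0, n_key=2, n_param=1)
bad_key = is_anomalous_entry(key_probs, 2, param_probs, 0, n_key=2, n_param=1)
bad_param = is_anomalous_entry(key_probs, 1, param_probs, 2, n_key=2, n_param=1)
print(normal, bad_key, bad_param)
```

Checking the key first means the parameter model is consulted only for entries whose request already fits the learned pattern.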




###**Method2 : Window size**
<img src="https://drive.google.com/uc?id=1OteYOLsIphF4jS-Ok_ePCDoiATdbdXkA" width="700">
<br>
In our model, the window size is one of the most important hyperparameters. The image above shows an example of a window. In our project, we set the window size to 10.<br>
With a window size of 10, we generate encoded sequences of log keys and parameters of length 10. These length-10 sequences are the input to the LSTM model, which predicts the 11th log key and parameter value.
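
The windowing step can be illustrated with a short, self-contained sketch. The function name `sliding_windows` and the toy sequence are assumptions made for illustration; the idea is only that each run of 10 consecutive values becomes an input and the value that follows becomes its target.

```python
def sliding_windows(sequence, window_size):
    """Split a sequence into (window, next value) training pairs."""
    inputs, targets = [], []
    for start in range(len(sequence) - window_size):
        inputs.append(sequence[start:start + window_size])
        targets.append(sequence[start + window_size])
    return inputs, targets

# Toy sequence of 13 encoded log keys: 1, 2, ..., 13
keys = list(range(1, 14))
inputs, targets = sliding_windows(keys, window_size=10)

print(len(inputs))   # number of windows
print(inputs[0])     # first window
print(targets)       # value that follows each window
```

A sequence of length n yields n - 10 pairs, so sequences of length 10 or less contribute no training examples.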

# **Detailed Code Explanation**
### Method1 : Make unique key and parameter list

In [None]:
# make unique key and parameter dictionary
# Generate the sequence of parameter value of all IP.

log_entry = "ip"
unique_key = df[log_entry].unique()
unique_n = len(df[log_entry].unique())

# parameter
param_bot = dict()
param_man = dict()

# log key
log_key_bot = dict()
log_key_man = dict()

# make empty dictionary
for i in range(unique_n):
    param_bot[unique_key[i]] = []

for i in range(unique_n):
    param_man[unique_key[i]] = []

for i in range(unique_n):
    log_key_bot[unique_key[i]] = []

for i in range(unique_n):
    log_key_man[unique_key[i]] = []

In [None]:
# align parameter (time difference between the current log and previous log, label)
# align log_key (request, label)
# label is added to measure the performance of the model
for idx in range(len(df)):
    row = df.iloc[idx]
    # the first row has no predecessor, so its time difference is zero
    prev_time = df.iloc[idx-1]['timestamp'] if idx > 0 else row['timestamp']
    delta = row['timestamp'] - prev_time
    ip = row[log_entry]
    if row['label']==1:
        param_bot[ip].append((delta, row['label']))
        log_key_bot[ip].append((row['request'], row['label']))
    else:
        param_man[ip].append((delta, row['label']))
        log_key_man[ip].append((row['request'], row['label']))

In [None]:
# Make unique key and parameter list
# Sequence is according to ip and time difference
seq_param_bot = []
seq_param_man = []
seq_log_key_bot = []
seq_log_key_man = []

for k in param_bot.keys():
    seq_param_bot.append(param_bot[k])

for k in param_man.keys():
    seq_param_man.append(param_man[k])

for k in log_key_bot.keys():
    seq_log_key_bot.append(log_key_bot[k])

for k in log_key_man.keys():
    seq_log_key_man.append(log_key_man[k])

seq_param_bot.sort(key=len, reverse=True)
seq_param_man.sort(key=len, reverse=True)
seq_log_key_bot.sort(key=len, reverse=True)
seq_log_key_man.sort(key=len, reverse=True)

In [None]:
# delete sequences of length 5 or less (lists are sorted by length, longest first)
idx = len(seq_param_bot)
for item in range(len(seq_param_bot)):
    if len(seq_param_bot[item]) <= 5:    
        idx = item
        break

seq_param_bot = seq_param_bot[:idx]

# delete sequences of length 5 or less
idx = len(seq_param_man)
for item in range(len(seq_param_man)):
    if len(seq_param_man[item]) <= 5:   
        idx = item  # record the cut position before leaving the loop
        break

seq_param_man = seq_param_man[:idx]

# delete sequences of length 5 or less
idx = len(seq_log_key_bot)
for item in range(len(seq_log_key_bot)):
    if len(seq_log_key_bot[item]) <= 5:
        idx = item
        break

seq_log_key_bot = seq_log_key_bot[:idx]

# delete sequences of length 5 or less
idx = len(seq_log_key_man)
for item in range(len(seq_log_key_man)):
    if len(seq_log_key_man[item]) <= 5:
        idx = item
        break
        
seq_log_key_man = seq_log_key_man[:idx]

In [None]:
seq_param_man[0][:10]

In [None]:
seq_log_key_bot[5][:10]

### Divide each IP's sequence into a train set and a validation set with a ratio of 0.8 to 0.2
<img src="https://drive.google.com/uc?id=1mAfFAzyioY-DZLq-oCm6fjauw8xh-nlF" width="700">
<br>
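
A minimal sketch of the per-IP split, assuming each sequence is a plain list ordered by time. The helper name `split_per_sequence` and the toy sequences are illustrative.

```python
def split_per_sequence(sequences, ratio=0.8):
    """Cut every sequence at the same relative position.

    The earlier part goes to the training set and the later part to the
    validation set, so the time order within each IP is preserved.
    """
    train, valid = [], []
    for seq in sequences:
        cut = int(ratio * len(seq))
        train.append(seq[:cut])
        valid.append(seq[cut:])
    return train, valid

# Two toy IP sequences of length 10 and 5
sequences = [list(range(10)), list(range(5))]
train, valid = split_per_sequence(sequences)
print(train)
print(valid)
```

Splitting inside each sequence, rather than assigning whole IPs to one side, means every IP contributes to both sets. Note that a validation tail of length 10 or less produces no windows once the window size of 10 is applied.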

In [None]:
nums = 0
for seq in seq_log_key_bot:
    nums += len(seq)

ratio = 0.8
train_num = int(nums*ratio)
tmp = 0
# log key train and log key validation 
key_train = []
key_valid = []

for seq in seq_log_key_bot:
    tmp = len(seq)
    idx = int(ratio * tmp)
    key_train.append(seq[:idx])
    key_valid.append(seq[idx:])

In [None]:
ratio = 0.8
train_num = int(nums*ratio)
tmp = 0
# parameter train and parameter validation
param_train = []
param_valid = []

for seq in seq_param_bot:
    tmp = len(seq)
    idx = int(ratio * tmp)
    param_train.append(seq[:idx])
    param_valid.append(seq[idx:])

###Method2 : based on window size 10, make 2 dimensional list

In [None]:
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
import time
import tensorflow as tf
import os
# os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

# make list by pre-set window size
def generate(name, window_size):
    num_sessions = 0
    inputs = []
    outputs = []

    for line in name:
        num_sessions += 1
        for i in range(len(line) - window_size):
            inputs_tmp = []
            for j in line[i:i + window_size]:
                inputs_tmp.append(j[0])
            inputs.append(inputs_tmp)
            outputs.append(line[i + window_size][0])
    return inputs, outputs


# window size is 10
window_size = 10
num_classes = len(df['request'].unique())

TP = 0
FP = 0
n_candidates = 10  # top n probability for predicted result

In [None]:
# generate the train data and validation data
X, Y = generate(key_train, window_size)
X = np.reshape(X, (len(X), window_size, 1))
Y = to_categorical(Y, num_classes)

X_valid, Y_valid = generate(key_valid, window_size)
X_valid = np.reshape(X_valid, (len(X_valid), window_size, 1))
Y_valid = to_categorical(Y_valid, num_classes)

In [None]:
X[5][:5]

In [None]:
Y[5][:5]

###This is parameter case (using window size)
<img src="https://drive.google.com/uc?id=1xoTBCl1oRs4bw30pmvO2zcRQ3cfaB0N_" width="700">
<br>
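
For the parameter model, each time difference is reduced to a 10-second bucket and then one-hot encoded over 30 classes. The sketch below shows that encoding in plain Python. The names `time_bucket` and `one_hot` are hypothetical, and raising an error for gaps of 300 seconds or more is an assumption of this sketch, not behaviour taken from the notebook.

```python
from datetime import timedelta

BUCKET_SECONDS = 10
NUM_BUCKETS = 30  # buckets 0..29 cover gaps from 0 up to 299 seconds

def time_bucket(delta):
    """Map a time difference to its 10-second bucket index."""
    return int(delta.total_seconds()) // BUCKET_SECONDS

def one_hot(index, size=NUM_BUCKETS):
    """Return a one-hot list of the given size with a 1 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"bucket {index} is outside 0..{size - 1}")
    vector = [0] * size
    vector[index] = 1
    return vector

gaps = [timedelta(seconds=3), timedelta(seconds=47), timedelta(minutes=2, seconds=5)]
buckets = [time_bucket(gap) for gap in gaps]
encoded = one_hot(buckets[1])
print(buckets)
print(encoded.index(1), sum(encoded), len(encoded))
```

Bucketing turns a continuous quantity into a small set of classes, so the parameter model can reuse the same softmax classification setup as the log key model.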

In [None]:
# function that generates the training data set for the parameter model
def generate_param(name, window_size):
    num_sessions = 0
    inputs = []
    outputs = []

    for line in name:
        num_sessions += 1
        for i in range(len(line) - window_size):
            inputs_tmp = []
            for j in line[i:i + window_size]:
                inputs_tmp.append(j[0])
            inputs.append(inputs_tmp)
            outputs.append(line[i + window_size][0])
    return inputs, outputs

In [None]:
# generate the train data and validation data
X_p, Y_p = generate_param(param_train, window_size)
X_p_valid, Y_p_valid = generate_param(param_valid, window_size)

In [None]:
# encode each time difference as its quotient when divided by 10 seconds
for i in range(len(X_p)):
    for j in range(len(X_p[i])):
        X_p[i][j] = int(X_p[i][j].total_seconds())//10

for i in range(len(Y_p)):
    Y_p[i] = int(Y_p[i].total_seconds())//10

for i in range(len(X_p_valid)):
    for j in range(len(X_p_valid[i])):
        X_p_valid[i][j] = int(X_p_valid[i][j].total_seconds())//10

for i in range(len(Y_p_valid)):
    Y_p_valid[i] = int(Y_p_valid[i].total_seconds())//10

new_list = []
url_set = set()

for item in X_p:
    if item[2] not in url_set:
        url_set.add(item[2])
        new_list.append(item[2])
    else:
        pass

for item in Y_p:
    if item not in url_set:
        url_set.add(item)
        new_list.append(item)
    else:
        pass

In [None]:
num_params = len(new_list)
num_params = 30  # from 0 seconds to 300 seconds

X_p = np.array(X_p).reshape(-1,10,1)
targets = np.array([Y_p]).reshape(-1)
Y_p = np.eye(num_params)[targets]

X_p_valid = np.array(X_p_valid).reshape(-1,10,1)
targets = np.array([Y_p_valid]).reshape(-1)
Y_p_valid = np.eye(num_params)[targets]

In [None]:
X_p[41981]

In [None]:
Y_p[5]

## - Make model

### Set hyperparameter
We try various hyperparameter values and pick the best ones through validation.
<br>
<br>
batch_size = 2000<br>
optimizer = Adam with a learning rate of 3e-4<br>
max epoch_num = 100<br><br>

### Using callback
**ModelCheckpoint**<br>
Save the model only when the validation accuracy improves<br><br>

**EarlyStopping**<br>
Stop training if the validation accuracy has not improved for 10 epochs<br><br>


**LSTM model with Dense Layer**<br>
Three stacked LSTM layers followed by a Dense softmax layer<br><br>
**Adam optimizer**
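
The patience rule behind the two callbacks can be illustrated without any framework. The sketch below is a simplified stand-in, not the Keras implementation: the function `early_stopping_epoch` and the validation scores are hypothetical, and a higher score is assumed to be better, as with accuracy.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None when training runs through every epoch.
    best_epoch is the epoch a best-only checkpoint would keep.
    """
    best_score = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Improvement: remember it and reset the patience counter
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation accuracies for eight epochs
scores = [0.50, 0.60, 0.59, 0.58, 0.61, 0.60, 0.60, 0.60]
print(early_stopping_epoch(scores, patience=3))
print(early_stopping_epoch(scores, patience=2))
```

A larger patience tolerates temporary plateaus at the cost of extra epochs. With the smaller patience in the example, training stops before the later, better epoch is ever reached.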

In [None]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau, CSVLogger

output_size = Y.shape[1]
batch_size = 2000
optimizer = Adam(learning_rate=3e-4)
epoch_num = 100

filename = 'checkpoint-epoch-{}-trial02.h5'.format(epoch_num)
checkpoint_callback = ModelCheckpoint(filename,             # file name
                             monitor='val_accuracy',   # save when validation accuracy improves
                             verbose=1,            # print log
                             save_best_only=True,  # only save best value
                             mode='auto'          # automatically find best
                            )

early_stopping = EarlyStopping(monitor='val_accuracy',  # monitored metric (validation accuracy)
                              patience=10,         # stop if validation accuracy has not improved for 10 epochs
                              )

# log key model: three stacked LSTM layers with a softmax output
model = Sequential()
model.add(LSTM(512, activation='relu', return_sequences=True, input_shape=(X.shape[1], X.shape[2])))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256, return_sequences=False))
model.add(Dense(output_size, activation='softmax'))
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])
model.fit(X, Y, epochs=epoch_num ,validation_data=(X_valid, Y_valid), callbacks=[checkpoint_callback, early_stopping], batch_size=batch_size, shuffle=True)

In [None]:
#output_size = Y.shape[1]
batch_size = 2000
optimizer = Adam(learning_rate=3e-4)
epoch_num = 100

filename = 'checkpoint-epoch-{}-trial02.h5'.format(epoch_num)
checkpoint_callback = ModelCheckpoint(filename,             # file name
                             monitor='val_accuracy',   # save when validation accuracy improves
                             verbose=1,            # print log
                             save_best_only=True,  # only save best value
                             mode='auto'          # automatically find best
                            )

early_stopping = EarlyStopping(monitor='val_accuracy',  # monitored metric (validation accuracy)
                              patience=10,         # stop if validation accuracy has not improved for 10 epochs
                              )

# the number of classes for parameter value is 30 (one-hot encoded)
output_size = 30

model2 = Sequential()
model2.add(LSTM(128, activation='relu', return_sequences=True, input_shape=(X_p.shape[1], X_p.shape[2])))
model2.add(LSTM(64, return_sequences=True))
model2.add(LSTM(32, return_sequences=False))
model2.add(Dense(output_size, activation='softmax'))
model2.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])
model2.fit(X_p, Y_p, epochs=epoch_num ,validation_data=(X_p_valid, Y_p_valid), callbacks=[checkpoint_callback, early_stopping], batch_size=batch_size, shuffle=True)

In [None]:
for i in range(len(seq_param_man)):
    for j in range(len(seq_param_man[i])):
        seq_param_man[i][j] = list(seq_param_man[i][j])
for i in range(len(seq_param_man)):
    for j in range(len(seq_param_man[i])):
        seq_param_man[i][j][0] = int(seq_param_man[i][j][0].total_seconds())//10

In [None]:
# A function that returns the log key and parameter sequence and the label value
def generate_pred(file, window_size):
    hdfs = list()
    haaa = list()
    hhhh = list()

    uri = []
    time = []
    trid = []

    for line in file:
        uri_tmp = []
        trid_tmp = []
        time_tmp = []
        for i in line:
            uri_tmp.append(i[0])
            time_tmp.append(int(i[1].total_seconds())//10)
            trid_tmp.append(i[2])
        uri.append(uri_tmp)
        time.append(time_tmp)
        trid.append(trid_tmp)
    # pad the sequence when shorter than window size
    for ln in uri:
        line = list(map(lambda n: n - 1, ln))
        ln = line + [-2] * (window_size + 1 - len(line))
        hdfs.append(tuple(ln))

    for ll in trid:
        line = list(ll)
        ll = line + [-2] * (window_size + 1 - len(line))
        hhhh.append(tuple(ll))
        
    for l in time:
        line = list(l)
        l = line + [-2] * (window_size + 1 - len(line))
        haaa.append(tuple(l))

    return hdfs, haaa, hhhh

In [None]:
# the dictionary for the final test using the log entries from human
man = dict()

for i in range(unique_n):
    man[unique_key[i]] = []

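for idx in range(len(df)):
    if df.iloc[idx]['label']==0:
        # the first row has no predecessor, so its time difference is zero
        prev_time = df.iloc[idx-1]['timestamp'] if idx > 0 else df.iloc[idx]['timestamp']
        man[df[log_entry][idx]].append((df.iloc[idx]['request'], df.iloc[idx]['timestamp']-prev_time, df.iloc[idx]["label"]))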

man_ = []
for k in man.keys():
    man_.append(man[k])

man_.sort(key=len, reverse=True)

In [None]:
# delete sequences of length 5 or less
idx = len(man_)
for item in range(len(man_)):
    if len(man_[item]) <= 5:
        idx = item
        break

man_ = man_[:idx]

In [None]:
test_key_normal_loader, test_normal_loader, y_test = generate_pred(man_, window_size)

## Performance evaluation result
<img src="https://drive.google.com/uc?id=1sSPk_5XrWVt9nYhBFmPvFgcTd-L31PSz" width="700">
<br>
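
The decision rule used in the evaluation can be sketched in isolation: a sequence is attributed to a human once the number of failed predictions reaches 30% of its length, and the reported score is the share of human sequences flagged this way. The helper names and the toy hit lists below are assumptions for illustration; in the notebook the hits come from the two trained models.

```python
def classify_sequence(hits, sequence_length, miss_ratio=0.3):
    """Return "human" once the misses reach the threshold, else "bot".

    hits holds one boolean per sliding window, True when the models
    predicted the next log entry correctly.
    """
    threshold = int(sequence_length * miss_ratio)
    misses = 0
    for hit in hits:
        if not hit:
            misses += 1
            if misses >= threshold:
                return "human"
    return "bot"

def human_detection_rate(hit_lists, sequence_lengths):
    """Percentage of sequences classified as human."""
    flagged = sum(
        classify_sequence(hits, length) == "human"
        for hits, length in zip(hit_lists, sequence_lengths)
    )
    return flagged / len(hit_lists) * 100

# Toy results for two sequences of length 20 (10 windows each)
mostly_missed = [True, True] + [False] * 8
mostly_hit = [True] * 8 + [False] * 2
rate = human_detection_rate([mostly_missed, mostly_hit], [20, 20])
print(classify_sequence(mostly_missed, 20), classify_sequence(mostly_hit, 20), rate)
```

Because the models are trained on bot sequences only, frequent misses on a sequence indicate that it does not follow a learned bot pattern.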

In [None]:
from tqdm import tqdm
from tensorflow import keras 
from tensorflow.keras.activations import softmax
total = 0
correct = 0
fail = 0
human_count = 0 # count the number of human
proba = []
y_labeled = []
start_time = time.time()

# predict for the entire sequence
for line, line2, y in tqdm(zip(test_key_normal_loader, test_normal_loader, y_test)):
    compare_int = 0  # number of failed predictions in this sequence (one sequence per unique IP)
    for i in range(len(line) - window_size):
        # threshold: 30% of the sequence length
        # the sequence is counted as human once compare_int reaches this value
        compare = int(len(line) * 0.3)
        seq = line[i:i + window_size]
        seq_param = line2[i:i + window_size]
        label = line[i + window_size]
        label_param = line2[i + window_size]
        trid = y[i + window_size]
        if label == -2:
            continue

        X = np.reshape(seq, (1, window_size, 1))
        X = X / float(num_classes)
        Y = to_categorical(label, num_classes)
        prediction = model.predict(X, verbose=0)

        predicted = prediction.argsort()[0][::-1][: n_candidates]
        y_pred = prediction

        proba.append(y_pred)
        total += 1

        if np.argmax(Y) in prediction.argsort()[0][::-1][: n_candidates]:
            Xp = np.reshape(seq_param, (1, window_size, 1))
            Xp = Xp / float(30)
            Yp = to_categorical(label_param, 30)
            
            prediction2 = model2.predict(Xp, verbose=0)
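            # the candidate list is not truncated here, so any class from 0 to 29 matches;
            # slice with [: n_candidates] to apply the same top-n check as the log key model
            if np.argmax(Yp) in prediction2.argsort()[0][::-1]: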
                correct += 1
            else:
                compare_int += 1
                if (compare_int >= compare):
                    human_count += 1
                    break
        else:
            compare_int += 1
            if (compare_int >= compare):
                human_count += 1
                break
            
elapsed_time = time.time() - start_time
print('elapsed_time: {:.3f}s'.format(elapsed_time))
print("total : %d" % total)
accu = human_count/len(man_)*100
print("accuracy : %f" % accu)

## Any Insights and Future Work

 We made predictions with two models: the first predicts the request URI, and the second predicts the time difference between logs. We analyzed human access patterns with both models, and the resulting predictions reached an accuracy above 90%. This shows that web robots and humans leave distinctly different patterns in access logs, and that these patterns can be learned and predicted. With this approach, web robots that access a specific site can be detected and blocked. We expect it to help stop bad bots that try to exploit company data for their own benefit, and to be a practical solution for companies suffering from traffic or data problems caused by web robots.<br>

##- Reference

[1] Web Robot survey https://ppcprotect.com/blog/ad-fraud/how-many-of-the-internets-users-are-robots/ <br>
[2] Lagopoulos, Athanasios and Tsoumakas, Grigorios. (2019). Web robot detection - Server logs [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3477932.<br>
[3] Du, Min, et al. "DeepLog: Anomaly detection and diagnosis from system logs through deep learning." Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017.
<br>
[4] Shinil Kwon, Young-Gab Kim, Sungdeok Cha "Web robot detection based on pattern-matching technique" Proceedings of the 2012 SAGE journals. 2012
<br>[5] Du, Min, et al. "LightLog: A lightweight temporal convolutional network for log anomaly detection on the edge." Proceedings of the 2021 Science Direct. 2021
<br>[6] C. Kim, M. Jang, S. Seo, K. Park and P. Kang, "Intrusion Detection Based on Sequential Information Preserving Log Embedding Methods and Anomaly Detection Algorithms," in IEEE Access, vol. 9, pp. 58088-58101, 2021, doi: 10.1109/ACCESS.2021.3071763.



##- Member's contribution statement

### Min-seon Kim
Topic Selection, Reference survey, Pre-processing and Network Modeling Pipeline, Presenter
### Yong-Hoon Lee
Data Search, Pre-processing and presentation material, Reference survey, Presenter

##- Debugging experience worth sharing

Tuning the hyperparameters for optimal performance was difficult, in particular the window size, the number of candidates, and num_params.

Because the two models are trained on different data, it took considerable effort to adjust the input features to the format each model expects.

Building a workflow that integrates the log key model and the parameter model required us to redefine the workflow configuration.

Data preprocessing needed custom logic for the encoding and the train/validation split, as explained during the presentation.



##- The Github repository with the commit history

https://github.com/sperospera1225/WebRobotDetection