# Improving ModSecurity WAF Using a Structured-Language Classifier


## For Computer Science B.Sc. Ariel University

By Maor Saadon, Shahar Zaidel, Eden Mor.

# Introduction

In the ever-evolving landscape of cybersecurity, Web Application Firewalls (WAF) stand as critical defenders against malicious web requests. Traditionally, WAFs rely on predefined rule sets to identify and block potential threats. However, the complexity and sophistication of modern cyber attacks often outpace the capabilities of these rule-based systems, necessitating more dynamic and adaptive approaches. Machine learning (ML), particularly deep learning, has emerged as a promising solution, offering the potential to significantly enhance the detection and mitigation capabilities of WAFs through its ability to learn from data and identify complex patterns.

Our research explores the integration of machine learning techniques into web application security, focusing on a comparative analysis of ML models that classify web requests. We examine models built on text vectorization techniques such as TF-IDF and Word2Vec, which transform web requests into numerical vectors: TF-IDF weights each token by how distinctive it is across the dataset, while Word2Vec learns dense embeddings from the contexts in which tokens appear. These vectors are then used by machine learning algorithms, including the traditional k-Nearest Neighbors (kNN) algorithm, to distinguish between benign and malicious requests.

Central to our investigation is the exploration of deep learning models in contrast to traditional ML models. Deep learning's capacity for learning hierarchical representations offers a nuanced approach to understanding and classifying web traffic. This leads us to our guiding research question: Does a machine learning model enhanced with deep learning techniques outperform traditional models in the context of Web Application Firewalls?

By examining the efficacy of deep learning models against their non-deep learning counterparts, our study aims to shed light on their relative performance and potential for improving the security mechanisms of WAFs. Through this comparative analysis, we aspire to contribute valuable insights into the advancement of cybersecurity methodologies, specifically in enhancing the resilience of web applications against the growing threat of cyber attacks.

## Install and import libraries

In [7]:
%pip install tensorflow
%pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.


In [1]:
# Standard library
import sys
import getopt
import random
import pickle as pkl

# Numerical and data handling
import numpy as np
import pandas as pd
from pandas import DataFrame

# TensorFlow / Keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# scikit-learn
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Text embedding and tokenization
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize




## Feature extraction functions

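We will use the following feature extraction functions to extract features from the HTTP request:
1. calculate_alphanumeric_ratio - the share of ASCII letters and digits in the payload, scaled by 10.
2. calculate_input_length - the number of characters in the payload.
3. calculate_special_character_ratio - the share of all other characters in the payload, expressed as a percentage.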


In [2]:
# ASCII letters and digits; every other character counts as "special"
ALPHANUMERIC_CHARACTERS = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')

def calculate_alphanumeric_ratio(payload):
    """Return the share of alphanumeric characters in the payload, scaled by 10."""
    alphanumeric_count = sum(1 for char in payload if char in ALPHANUMERIC_CHARACTERS)
    input_length = max(len(payload), 1)  # Avoid division by zero for an empty payload
    # Note: this ratio is scaled by 10, while the special-character ratio below is scaled by 100
    alphanumeric_ratio = (alphanumeric_count / input_length) * 10
    return alphanumeric_ratio

def calculate_input_length(payload):
    """Return the number of characters in the payload."""
    return len(payload)

def calculate_special_character_ratio(payload):
    """Return the share of non-alphanumeric characters in the payload, as a percentage."""
    special_count = sum(1 for char in payload if char not in ALPHANUMERIC_CHARACTERS)
    input_length = max(len(payload), 1)  # Avoid division by zero for an empty payload
    special_ratio = (special_count / input_length) * 100
    return special_ratio
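
As a quick sanity check, the three features can be computed by hand for a few payloads. The sketch below is a condensed, self-contained restatement of the functions above; the helper name `payload_features` is hypothetical, and two of the sample payloads are taken from the dataset preview shown later in the notebook.

```python
import string

# ASCII letters and digits, matching the character set used by the notebook's functions
ALNUM = set(string.ascii_letters + string.digits)

def payload_features(payload):
    """Hypothetical helper: return the three hand-crafted features for one payload."""
    length = len(payload)
    denominator = max(length, 1)  # avoid division by zero for an empty payload
    alnum_count = sum(1 for ch in payload if ch in ALNUM)
    return {
        'payload_len': length,
        'alpha': alnum_count / denominator * 10,                 # scaled by 10, as in the notebook
        'non_alpha': (length - alnum_count) / denominator * 100,  # percentage
    }

for sample in ['/146054/', '/javascript/stackdump.exe', '']:
    print(repr(sample), payload_features(sample))
```

For `/146054/`, six of the eight characters are digits, so `alpha` is 7.5 and `non_alpha` is 25.0. Because the two ratios use different scales, they do not sum to a fixed constant.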


# Load and process the dataset

The dataset is a CSV file with two columns, payload and label. The payload column contains the HTTP request payload. The label column is 1 if the request is malicious and 0 if it is benign.

In [3]:
# Load the dataset
df = pd.read_csv('QueriesDataset.csv')
# df = df.drop(['method','url','attack_feature'], axis=1)

# Shuffle the combined dataset to mix malicious and benign URLs
balanced_df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print('Balanced Dataset Shape:', balanced_df.shape)
print('Balanced Dataset Distribution:', balanced_df['label'].value_counts())
print('Balanced Dataset Head (First 10 Rows):')
print(balanced_df.head(10))

Balanced Dataset Shape: (98126, 2)
Balanced Dataset Distribution: label
0    50000
1    48126
Name: count, dtype: int64
Balanced Dataset Head (First 10 Rows):
                                             payload  label
0  <iframe  src="data:text/html,%3C%73%63%72%69%7...      1
1                          /javascript/stackdump.exe      1
2  &lt;?import namespace=\"t\" implementation=\"#...      1
3  /awstats/awstats.pl?migrate=|echo;wget -p /tmp...      1
4                                           /146054/      0
5                             123+len(1234)-len(123)      1
6                                         /revhosts/      0
7                                /javascript/devs.7z      0
8                       /javascript/certificate.core      0
9                               /javascript/item.swf      0


## Start testing with the options

We split the dataset into a training set (80%) and a test set (20%). The training set is used to fit each model and the test set is used to evaluate it.
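
The arithmetic of the 80/20 hold-out split can be sketched without scikit-learn. The function `holdout_split` below is a hypothetical stand-in, not the notebook's `train_test_split` call: it rounds the test size up, which is how scikit-learn sizes a fractional `test_size`, but its shuffle will not select the same rows.

```python
import math
import random

def holdout_split(items, test_size=0.2, seed=42):
    """Hypothetical helper: shuffle with a fixed seed, then cut off the test portion."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = math.ceil(test_size * len(items))  # test size is rounded up
    return items[n_test:], items[:n_test]

# 98126 is the number of rows reported for the dataset
train_rows, test_rows = holdout_split(range(98126))
print(len(train_rows), len(test_rows))
```

With 98,126 rows this gives 78,500 training rows and 19,626 test rows, which matches the totals printed in the evaluation output.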

## TF-IDF Vectorizer preprocessing
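
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF over payloads tokenized on `/`. It assumes the smoothed IDF and L2 normalization that `TfidfVectorizer` uses by default. Two simplifications are illustrative choices rather than the notebook's behavior: empty tokens are dropped, and there is no lowercasing or feature cap.

```python
import math
from collections import Counter

def split_on_slash(payload):
    """Tokenize on '/' and drop empty tokens (a simplification for this sketch)."""
    return [token for token in payload.split('/') if token]

def tfidf(payloads):
    """Return one {token: weight} dict per payload, L2-normalized."""
    tokenized = [split_on_slash(p) for p in payloads]
    n_docs = len(tokenized)
    # Document frequency: in how many payloads each token appears
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = {t: term_freq[t] * idf[t] for t in term_freq}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

payloads = ['/javascript/item.swf', '/javascript/devs.7z', '/revhosts/']
vectors = tfidf(payloads)
for payload, vector in zip(payloads, vectors):
    print(payload, vector)
```

The token `javascript` appears in two of the three payloads, so it receives a lower weight than `item.swf` or `devs.7z`, which each appear once.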

In [4]:
# Apply feature extraction
print('Feature extraction...')
features = balanced_df['payload'].apply(lambda x: pd.Series({
    'payload_len': calculate_input_length(str(x)),
    'alpha': calculate_alphanumeric_ratio(str(x)),
    'non_alpha': calculate_special_character_ratio(str(x)),
    }))


# Concatenate original DF with features
balanced_df = pd.concat([balanced_df, features], axis=1)
print('Balanced Dataset Shape:', balanced_df.shape)
print('Balanced Dataset:')
print(balanced_df)

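# Extracting TF-IDF features from URLs
# Use a custom tokenizer that splits the payload into tokens on '/'.
# It is a named function rather than a lambda so that the fitted vectorizer can be pickled later.
def split_on_slash(payload):
    return payload.split('/')

tfidf_vectorizer = TfidfVectorizer(max_features=5000, tokenizer=split_on_slash)  # Limiting to top 5000 features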

# Before fitting the TF-IDF Vectorizer, replace NaN values with empty strings
balanced_df['payload'] = balanced_df['payload'].fillna('')

# Fit and transform the TF-IDF vectorizer
print('Fitting and transforming the TF-IDF Vectorizer to the payloads in the dataset...')
tfidf_features = tfidf_vectorizer.fit_transform(balanced_df['payload'])
print('TF-IDF Vectorization complete.')

# Convert TF-IDF features from a sparse matrix to a dense format and then to an np.ndarray
tfidf_dense = np.asarray(tfidf_features.todense())

# Define X for numerical features
X_numerical = balanced_df.drop(['label', 'payload'], axis=1).values  # Make sure this matches your feature extraction output


# Combining TF-IDF features with numerical features
X_combined = np.hstack((X_numerical, tfidf_dense))

# Define y
y = balanced_df['label'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

print('Saving the TF-IDF Vectorizer to disk...')

# Save the TF-IDF vectorizer to disk
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pkl.dump(tfidf_vectorizer, f)

print('TF-IDF Vectorizer saved to disk.')

Feature extraction...
Balanced Dataset Shape: (98126, 5)
Balanced Dataset:
                                                 payload  label  payload_len  \
0      <iframe  src="data:text/html,%3C%73%63%72%69%7...      1        115.0   
1                              /javascript/stackdump.exe      1         25.0   
2      &lt;?import namespace=\"t\" implementation=\"#...      1         65.0   
3      /awstats/awstats.pl?migrate=|echo;wget -p /tmp...      1        142.0   
4                                               /146054/      0          8.0   
...                                                  ...    ...          ...   
98121                                           /110992/      0          8.0   
98122  /recordings/index.php?action=login&languages[n...      1        119.0   
98123  /scripts/search.jsp?q=%"<script>alert(13319041...      1         58.0   
98124                                   /quake_4-teaser/      0         16.0   
98125                                        



TF-IDF Vectorization complete.
Saving the TF-IDF Vectorizer to disk...


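PicklingError: Can't pickle <function <lambda> at 0x000001A0C971E840>: attribute lookup <lambda> on __main__ failed

This error comes from a run in which the vectorizer's tokenizer was a lambda. pickle stores functions by name, and an anonymous function has no name it can look up. A tokenizer defined as a named, module-level function can be pickled.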
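
The difference between the two kinds of tokenizer can be checked in isolation with the standard library. This is a minimal sketch; `can_pickle` is a hypothetical helper, and `split_on_slash` stands in for whatever named tokenizer is passed to the vectorizer.

```python
import pickle

def split_on_slash(payload):
    """Named, module-level tokenizer: pickle can store it by reference."""
    return payload.split('/')

def can_pickle(obj):
    """Hypothetical helper: report whether pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

if __name__ == '__main__':
    print('lambda        :', can_pickle(lambda payload: payload.split('/')))
    print('named function:', can_pickle(split_on_slash))
```

Note that pickle stores only a reference to the function, so any code that later loads the saved vectorizer must define or import a function with the same name before unpickling.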

## K-Nearest Neighbors model

In [5]:

# Create an instance of kNN classifier
# the number of neighbors (k= 350) 
knn_model = KNeighborsClassifier(n_neighbors=350) 

# Train the kNN model (x_train-train_features,y_train-train_labels)
knn_model.fit(X_train, y_train)
print("K-Nearest Neighbors training is complete,\n predictions: ")

# Predict on the test set
predictions = knn_model.predict(X_test)

# #Evaluate the model
# accuracy = accuracy_score(y_test, predictions)
# CMatrix = confusion_matrix(y_test, predictions)

# # Print all evaluation metrics
# print("Accuracy:", accuracy,"Confusion Matrix:", CMatrix ,"\n","Classification Report:\n", classification_report(y_test, predictions))

# # Extract recall, precision, and F1-score from the classification report
# classification_dict = classification_report(y_test, predictions, output_dict=True)
# recall = classification_dict['1']['recall']  # Recall for the positive class (malicious)
# precision = classification_dict['1']['precision']  # Precision for the positive class (malicious)
# f1_score = classification_dict['1']['f1-score']  # F1-score for the positive class (malicious)

# # Dump the model to disk
# with open('knn_model_counter.pkl', 'wb') as f:
#     pkl.dump(knn_model, f)

# print('K-Nearest Neighbors model saved to disk.')

# print("Recall:", recall)
# print("Precision:", precision)
# print("F1 score:", f1_score)


#y_test = y_test.values.ravel()

print('[+] \t Classification accuracy : ', "{:.2f}".format(metrics.accuracy_score(y_test, predictions) * 100), '%\n')

print('[+] \t Percentage of Anomaly requests in test set : ', "{:.2f}".format(y_test.mean() * 100), '%\n')
print('[+] \t Percentage of Normal requests in test set : ', "{:.2f}".format((1 - y_test.mean()) * 100), '%\n')

confusion = metrics.confusion_matrix(y_test, predictions)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

print('[+] \t TP : ', TP, 'from ', (TP + TN + FP + FN))
print('    \t TN : ', TN, 'from ', (TP + TN + FP + FN))
print('    \t FP : ', FP, 'from ', (TP + TN + FP + FN))
print('    \t FN : ', FN, 'from ', (TP + TN + FP + FN))

print('\n[+] \t Metrics : ')
print('\t[-]  Accuracy Score (train_test_split): ', "{:.2f}".format(metrics.accuracy_score(y_test, predictions) * 100), '%')
print('\t[-]  Accuracy Score (k-fold): ',
      "{:.2f}".format(cross_val_score(knn_model, X_train, y_train, cv=100, scoring='accuracy').mean() * 100), '%')

print('\t[-]  Classification Error : ', "{:.2f}".format((1 - metrics.accuracy_score(y_test, predictions)) * 100), '%')
print('\t[-]  Recall : ', "{:.2f}".format(metrics.recall_score(y_test, predictions) * 100), '%')
specificity = TN / (TN + FP)
print('\t[-]  Specificity : ', "{:.2f}".format(specificity * 100), '%')
false_positive_rate = FP / float(TN + FP)
print('\t[-]  False Positive Rate : ', "{:.2f}".format(false_positive_rate * 100), '%')
precision = TP / float(TP + FP)
print('\t[-]  Precision : ', "{:.2f}".format(precision * 100), '%')


K-Nearest Neighbors training is complete,
 predictions: 
[+] 	 Classification accuracy :  91.27 %

[+] 	 Percentage of Anomaly requests in test set :  49.24 %

[+] 	 Percentage of Normal requests in test set :  50.76 %

[+] 	 TP :  8481 from  19626
    	 TN :  9432 from  19626
    	 FP :  530 from  19626
    	 FN :  1183 from  19626

[+] 	 Metrics : 
	[-]  Accuracy Score (train_test_split):  91.27 %
	[-]  Accuracy Score (k-fold):  91.29 %
	[-]  Classification Error :  8.73 %
	[-]  Recall :  87.76 %
	[-]  Specificity :  94.68 %
	[-]  False Positive Rate :  5.32 %
	[-]  Precision :  94.12 %
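
The reported percentages can be re-derived from the four confusion-matrix counts alone. The sketch below does this with a hypothetical helper, `confusion_metrics`, and also computes the F1 score, which the active code in the cell does not print.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Hypothetical helper: derive standard binary metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of malicious requests that were caught
    precision = tp / (tp + fp)     # share of flagged requests that were malicious
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'recall': recall,
        'specificity': tn / (tn + fp),
        'false_positive_rate': fp / (tn + fp),
        'precision': precision,
        'f1': 2 * precision * recall / (precision + recall),
    }

# Counts taken from the kNN output above
knn_metrics = confusion_metrics(tp=8481, tn=9432, fp=530, fn=1183)
for name, value in knn_metrics.items():
    print(f'{name:>20}: {value * 100:.2f} %')
```

Specificity and false positive rate always sum to 1, so only one of them carries independent information.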


## TensorFlow neural network, with feature extraction

This section trains a deep learning model: a dense neural network (DNN), also known as a fully connected network. In this architecture, every neuron in one layer is connected to every neuron in the next layer. The model below uses two hidden layers with 10 and 8 ReLU units, followed by a single sigmoid output for the binary label.
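
Full connectivity can be illustrated with a small pure-Python forward pass, in which each output neuron takes a weighted sum over all inputs of the previous layer. The sketch uses random Gaussian weights with a He-style standard deviation and a toy 4-value input, and it illustrates the layer shapes only, not the trained Keras model. The parameter count at the end assumes 5,003 input features (3 numerical features plus 5,000 TF-IDF features).

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def random_layer(n_in, n_out, activation, rng):
    """Build one dense layer: an n_out x n_in weight matrix, zero biases, an activation."""
    std = math.sqrt(2.0 / n_in)  # He-style scale (plain Gaussian, not truncated)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out, activation

def dense_forward(x, layers):
    """Propagate x through the layers; every neuron sees every input of its layer."""
    for weights, biases, activation in layers:
        x = [activation(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

def dense_param_count(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

rng = random.Random(0)
toy_layers = [
    random_layer(4, 10, relu, rng),
    random_layer(10, 8, relu, rng),
    random_layer(8, 1, sigmoid, rng),
]
toy_output = dense_forward([0.8, 0.75, 0.25, 0.0], toy_layers)
print('toy output:', toy_output)
print('parameters for 5003 inputs:', dense_param_count([5003, 10, 8, 1]))
```

Almost all of the parameters sit in the first layer, because its size scales with the number of input features.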

In [6]:

n_features = X_train.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', 'Recall', 'Precision'])
# fit the model

# use asarray to convert the dataframe to array
X_train_1 = np.asarray(X_train)
y_train_1 = np.asarray(y_train)

model.fit(X_train_1, y_train_1, epochs=10, batch_size=32, validation_split=0.2, verbose=1)
# evaluate the model
X_test_1 = np.asarray(X_test)
y_test_1 = np.asarray(y_test)

loss, accuracy, recall, precision = model.evaluate(X_test_1, y_test_1, verbose=0)
# print the accuracy in a different way
print('Test Accuracy: %.3f' % accuracy)
print('Recall: %.3f' % recall)
print('Precision: %.3f' % precision)




Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.942
Recall: 0.920
Precision: 0.961


## Ensemble learning

In [13]:
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Initialize individual models
log_clf = LogisticRegression(random_state=42, max_iter=1000)
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
nb_clf = GaussianNB()  # Initialized but not included in the ensemble below
# et_clf = ExtraTreesClassifier(n_estimators=100, max_depth=10, random_state=42)


# Create ensemble model using soft voting (averaged predicted probabilities).
# Note: the random forest is listed twice, so it carries two of the three votes.
# The estimator labels 'dt' and 'svc' are kept as they were in the reported run.
ensemble_clf = VotingClassifier(estimators=[
    ('lr', log_clf),
    ('dt', rf_clf),
    ('svc', rf_clf),
    # ('knn', knn_model)  # Optionally add the kNN model
    ], voting='soft')

ensemble_clf.fit(X_train, y_train)

# Predict on the test set
ensemble_predictions = ensemble_clf.predict(X_test)

# Evaluate the ensemble model
ensemble_accuracy = accuracy_score(y_test, ensemble_predictions)
ensemble_CMatrix = confusion_matrix(y_test, ensemble_predictions)

# Print evaluation metrics
print("Ensemble Model Accuracy:", ensemble_accuracy)
print("Ensemble Model Confusion Matrix:", ensemble_CMatrix)
print("Ensemble Classification Report:\n", classification_report(y_test, ensemble_predictions))

import pickle

# Save the ensemble model to disk
with open('ensemble_model.pkl', 'wb') as f:
    pickle.dump(ensemble_clf, f)

print('Ensemble model saved to disk.')

Ensemble Model Accuracy: 0.9503210027514521
Ensemble Model Confusion Matrix: [[9726  236]
 [ 739 8925]]
Ensemble Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.98      0.95      9962
           1       0.97      0.92      0.95      9664

    accuracy                           0.95     19626
   macro avg       0.95      0.95      0.95     19626
weighted avg       0.95      0.95      0.95     19626

Ensemble model saved to disk.
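
Soft voting averages the class probabilities predicted by each estimator and picks the class with the highest mean. Because the `estimators` list in the cell above passes the same classifier object twice, that classifier effectively gets double weight. The sketch below shows the effect with made-up probabilities; the numbers and the `soft_vote` helper are illustrative assumptions, not outputs of the trained models.

```python
def soft_vote(probabilities):
    """Average per-class probabilities across estimators and return (class, averages)."""
    n_estimators = len(probabilities)
    n_classes = len(probabilities[0])
    averages = [sum(p[c] for p in probabilities) / n_estimators for c in range(n_classes)]
    return averages.index(max(averages)), averages

# Made-up [P(benign), P(malicious)] predictions for a single request
lr_proba = [0.8, 0.2]
rf_proba = [0.3, 0.7]
nb_proba = [0.6, 0.4]

print('three distinct models:', soft_vote([lr_proba, rf_proba, nb_proba]))
print('random forest twice  :', soft_vote([lr_proba, rf_proba, rf_proba]))
```

In this example, three distinct models vote benign (class 0), while counting the random forest twice flips the decision to malicious (class 1).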
