# Improving ModSecurity WAF Using a Structured-Language Classifier


## For Computer Science B.Sc. Ariel University

By Maor Saadon, Shahar Zaidel, Eden Mor.

# Introduction

In the ever-evolving landscape of cybersecurity, Web Application Firewalls (WAFs) are critical defenders against malicious web requests. Traditionally, WAFs rely on predefined rule sets to identify and block potential threats. However, the complexity and sophistication of modern cyber attacks often outpace these rule-based systems, which calls for more dynamic and adaptive approaches. Machine learning (ML), particularly deep learning, has emerged as a promising solution: because it learns from data and can identify complex patterns, it has the potential to significantly enhance the detection and mitigation capabilities of WAFs.

Our research explores the integration of machine learning techniques into the domain of web application security, focusing on the comparative analysis of various ML models in classifying web requests. We delve into models based on advanced text vectorization techniques, such as TF-IDF and Word2Vec, which transform web requests into numerical vectors that capture their semantic essence. These vectors are then utilized by machine learning algorithms, including the traditional k-Nearest Neighbors (kNN) algorithm, to distinguish between benign and malicious requests effectively.
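
As a rough illustration of the final step described above, the sketch below classifies a toy request vector by majority vote among its k nearest neighbors under Euclidean distance. The two-dimensional vectors, their labels, and the `knn_predict` helper are made up for illustration; they are not the features or the scikit-learn classifier used later in this notebook.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors closest to query."""
    distances = [
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D request vectors: (special-character ratio, length / 100)
train_vectors = [(0.05, 0.2), (0.10, 0.3), (0.08, 0.25),
                 (0.60, 0.9), (0.55, 1.1), (0.70, 0.8)]
train_labels = [0, 0, 0, 1, 1, 1]  # 0 = benign, 1 = malicious

print(knn_predict(train_vectors, train_labels, (0.65, 1.0)))   # 1
print(knn_predict(train_vectors, train_labels, (0.07, 0.22)))  # 0
```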

Central to our investigation is the exploration of deep learning models in contrast to traditional ML models. Deep learning's capacity for learning hierarchical representations offers a nuanced approach to understanding and classifying web traffic. This leads us to our guiding research question: Does a machine learning model enhanced with deep learning techniques outperform traditional models in the context of Web Application Firewalls?

By examining the efficacy of deep learning models against their non-deep learning counterparts, our study aims to shed light on their relative performance and potential for improving the security mechanisms of WAFs. Through this comparative analysis, we aspire to contribute valuable insights into the advancement of cybersecurity methodologies, specifically in enhancing the resilience of web applications against the growing threat of cyber attacks.

## Install and import libraries

In [4]:
%pip install tensorflow
%pip install scikit-learn





In [3]:
# Standard library
import getopt
import pickle as pkl
import random
import sys

# Third-party libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample, shuffle
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer



## Feature extraction functions


We will use the following feature extraction functions to extract features from the HTTP request:

In [None]:
def calculate_alphanumeric_ratio(payload):
    alphanumeric_characters = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
    alphanumeric_count = sum(1 for char in payload if char in alphanumeric_characters)
    payload_length = len(payload)
    input_length = max(payload_length, 1)  # Avoid division by zero if payload_length is 0
    # Note: this ratio is scaled by 10, while calculate_special_character_ratio scales by 100
    alphanumeric_ratio = (alphanumeric_count / input_length) * 10
    return alphanumeric_ratio

def calculate_input_length(payload):
    input_length = len(payload)
    return input_length

def calculate_special_character_ratio(payload):
    alphanumeric_characters = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
    special_count = sum(1 for char in payload if char not in alphanumeric_characters)
    payload_length = len(payload)
    input_length = max(payload_length, 1)  # Avoid division by zero if payload_length is 0
    special_ratio = (special_count / input_length) * 100
    return special_ratio


def calculate_url_weight(url, discovered_malicious):
    # Define weights per category of malicious indicator
    malicious_weights = {
        "special_character": 10,
        "attack_word": 50,
        "unauthorized_resource_access": 200
    }
    # Initialize URL weight
    url_weight = 0
    # Add a weight for every indicator that appears in the URL
    for malicious in discovered_malicious:
        if malicious in url:
            # Indicators that are not category names (e.g. 'select', 'union')
            # would otherwise always score 0; we assume they count as attack words.
            url_weight += malicious_weights.get(malicious, malicious_weights["attack_word"])

    return url_weight

def calculate_attack_weight(row):
    # Example of weights, these should be determined based on your analysis
    url_weight = row['url_weight']
    payload_weight = row['payload_weight']
    manipulation = row['manipulation']  # Number of attack words
    alphanumeric_ratio = row['alpha'] / (row['alpha'] + row['non_alpha']) if row['alpha'] + row['non_alpha'] > 0 else 0
    files_weight = row['files_weight']  # This should be calculated based on your context
    attack_weight = url_weight + payload_weight + manipulation + alphanumeric_ratio + files_weight
    return attack_weight
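
To make the ratio features concrete, here is a small worked example on two made-up payloads. The helper `special_character_percentage` is a compact stand-in written for this illustration (it follows the same counting rule as `calculate_special_character_ratio` above), and the payload strings are hypothetical. Note that whitespace, quotes, and `=` all count as special characters under this rule.

```python
import string

ALPHANUMERIC = set(string.ascii_letters + string.digits)

def special_character_percentage(payload):
    """Percentage of characters that are not ASCII letters or digits."""
    special_count = sum(1 for char in payload if char not in ALPHANUMERIC)
    return special_count / max(len(payload), 1) * 100

benign = "username=alice"        # 14 characters, 1 special ('=')
injection = "id=1' OR '1'='1"    # 15 characters, 8 special

print(round(special_character_percentage(benign), 1))     # 7.1
print(round(special_character_percentage(injection), 1))  # 53.3
```

In this toy case the injection-style payload scores several times higher than the benign one, which is the kind of separation these features are meant to expose.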

# Load and process the dataset

The dataset is a CSV file with two columns: payload and label. The payload column contains the HTTP request payload, and the label column contains its class: 1 if the request is malicious and 0 if it is benign.

In [None]:
# Load the dataset
df = pd.read_csv('request_payload_updated.csv')

# Separate the dataset into malicious and benign
malicious_df = df[df['label'] == 1]
benign_df = df[df['label'] == 0]

# Randomly sample 150,000 entries from each class
# (resample draws with replacement by default, so rows may repeat)
malicious_sampled_df = resample(malicious_df, n_samples=150000, random_state=42)
benign_sampled_df = resample(benign_df, n_samples=150000, random_state=42)

# Combine the sampled data
balanced_df = pd.concat([malicious_sampled_df, benign_sampled_df])

# Shuffle the combined dataset to mix malicious and benign payloads
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

print('Balanced Dataset Shape:', balanced_df.shape)
print('Balanced Dataset Distribution:', balanced_df['label'].value_counts())
print('Balanced Dataset Head (First 10 Rows):')
print(balanced_df.head(10))
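
Because `resample` draws with replacement unless told otherwise, the balanced set may contain repeated rows. The sketch below uses only the standard library and hypothetical row indices to show the difference between the two sampling modes; it is an illustration, not the notebook's sampling code.

```python
import random

rng = random.Random(42)
row_ids = list(range(1000))  # hypothetical row indices of one class

with_replacement = rng.choices(row_ids, k=500)    # a row may be drawn several times
without_replacement = rng.sample(row_ids, k=500)  # every row is drawn at most once

print("unique rows (with replacement):   ", len(set(with_replacement)))
print("unique rows (without replacement):", len(set(without_replacement)))
```

Repeated rows can end up on both sides of a later train/test split, which tends to inflate test scores. If each class has at least 150,000 rows, passing `replace=False` to `resample` would avoid this.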

## Start testing with the options

We split the sampled, balanced dataset into a training set (80%) and a test set (20%). Each model is trained on the training set and evaluated on the test set.
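
The 80/20 split can be sketched as follows. This version keeps the class proportions identical in both parts (a stratified split), which is an assumption added for illustration; `stratified_split` and the label list are hypothetical, and scikit-learn's `train_test_split` offers a `stratify` parameter for the same purpose.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_indices, test_indices = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)

labels = [0] * 50 + [1] * 50  # hypothetical balanced labels
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 80 20
```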

## TF-IDF Vectorizer preprocessing

In [None]:
# Apply feature extraction
print('Feature extraction...')
features = balanced_df['payload'].apply(lambda payload: pd.Series({
    'alphanumeric_ratio': calculate_alphanumeric_ratio(payload),
    'input_length': calculate_input_length(payload),
    'special_character_ratio': calculate_special_character_ratio(payload),
    'url_weight': calculate_url_weight(payload, ['select', 'union', 'insert', 'delete', 'update'])
}))


# Concatenate original DF with features
balanced_df = pd.concat([balanced_df, features], axis=1)
print('Balanced Dataset Shape:', balanced_df.shape)
print('Balanced Dataset:')
print(balanced_df)

# Extracting TF-IDF features from the payloads
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limiting to top 5000 features

# Fit and transform the TF-IDF vectorizer
print('Fitting and transforming the TF-IDF Vectorizer (max_features=5000) to the payloads in the dataset...')
tfidf_features = tfidf_vectorizer.fit_transform(balanced_df['payload'])
print('TF-IDF Vectorization complete.')

# Convert TF-IDF features from a sparse matrix to a dense np.ndarray
# Note: a dense 300,000 x 5,000 float64 array takes about 12 GB of RAM
tfidf_dense = np.asarray(tfidf_features.todense())

# Define X for numerical features
X_numerical = balanced_df.drop(['label', 'payload'], axis=1).values  # Make sure this matches your feature extraction output

# Combining TF-IDF features with numerical features
X_combined = np.hstack((X_numerical, tfidf_dense))

# Define y
y = balanced_df['label'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

print('Saving the TF-IDF Vectorizer to disk...')

# Save the TF-IDF vectorizer to disk
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pkl.dump(tfidf_vectorizer, f)

print('TF-IDF Vectorizer saved to disk.')
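
To show what the vectorizer does to a payload, the sketch below recomputes TF-IDF by hand with the standard library. It mirrors, to the best of our understanding, the `TfidfVectorizer` defaults (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization); the helper names and the three payloads are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters

def tokenize(payload):
    """Lowercase the payload and keep only tokens matching TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(payload.lower())

def tfidf_vectors(payloads):
    """Smoothed, L2-normalized TF-IDF vectors as {term: weight} dicts."""
    tokenized = [tokenize(p) for p in payloads]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

payloads = ["id=1' OR '1'='1", "id=42&name=alice", "q=union select password"]
print(tokenize(payloads[0]))  # ['id', 'or']
for vector in tfidf_vectors(payloads):
    print({t: round(w, 3) for t, w in vector.items()})
```

Under this tokenization, quotes, `=`, and single-character tokens are discarded, so much of an injection payload's punctuation never reaches the model through TF-IDF. A custom `token_pattern` or a character-level `analyzer` might be worth evaluating for that reason.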

## K-Nearest Neighbors model

In [None]:
# Create a kNN classifier with k = 350 neighbors
knn_model = KNeighborsClassifier(n_neighbors=350)

# Train the kNN model (X_train: training features, y_train: training labels)
knn_model.fit(X_train, y_train)
print("K-Nearest Neighbors training is complete,\n predictions: ")

# Predict on the test set
predictions = knn_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
CMatrix = confusion_matrix(y_test, predictions)

# Print all evaluation metrics
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", CMatrix)
print("Classification Report:\n", classification_report(y_test, predictions))

# Extract recall, precision, and F1-score from the classification report
classification_dict = classification_report(y_test, predictions, output_dict=True)
recall = classification_dict['1']['recall']  # Recall for the positive class (malicious)
precision = classification_dict['1']['precision']  # Precision for the positive class (malicious)
f1_score = classification_dict['1']['f1-score']  # F1-score for the positive class (malicious)

print("Recall:", recall)
print("Precision:", precision)
print("F1-score:", f1_score)
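
For reference, the three reported metrics follow directly from the confusion matrix. The sketch below computes them for the positive (malicious) class from hypothetical counts; `binary_metrics` is an illustrative helper, not part of scikit-learn.

```python
def binary_metrics(true_negative, false_positive, false_negative, true_positive):
    """Precision, recall, and F1 for the positive (malicious) class."""
    predicted_positive = true_positive + false_positive
    actual_positive = true_positive + false_negative
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts, laid out like scikit-learn's [[TN, FP], [FN, TP]]
precision, recall, f1 = binary_metrics(true_negative=90, false_positive=10,
                                       false_negative=20, true_positive=80)
print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
print("F1-score:", round(f1, 3))
```

For a WAF, recall on the malicious class measures how many attacks are caught, while precision measures how many blocked requests were actually malicious.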

## TensorFlow neural network, with feature extraction

Next, we use a neural network to classify the HTTP requests. It is trained on the same combined feature matrix as the kNN model (the extracted numerical features plus the TF-IDF features) and on the same split: 80% training and 20% test.

In [None]:

# Define the model
model = Sequential([
    Dense(512, activation='relu', input_shape=(X_train.shape[1],)),  # Input layer with input_shape matching your features
    Dropout(0.5),  # Dropout for regularization
    Dense(256, activation='relu'),  # Hidden layer
    Dropout(0.5),  # Dropout for regularization
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',  # Use 'categorical_crossentropy' for multi-class classification
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2,  # Use a part of the training set for validation
                    verbose=1)

# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=2)
print('\nTest accuracy:', test_acc)
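
To compare the network with the kNN model on precision, recall, and F1 rather than accuracy alone, its sigmoid outputs first need to be turned into class labels. The sketch below shows that thresholding step on made-up probabilities; the 0.5 cut-off and the `probabilities_to_labels` helper are assumptions for illustration.

```python
def probabilities_to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs in [0, 1] to 0 (benign) or 1 (malicious)."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical sigmoid outputs, standing in for model.predict(X_test).ravel()
probabilities = [0.02, 0.91, 0.47, 0.58, 0.76]

print(probabilities_to_labels(probabilities))       # [0, 1, 0, 1, 1]
print(probabilities_to_labels(probabilities, 0.8))  # [0, 1, 0, 0, 0]
```

The resulting labels could then be passed to `classification_report(y_test, labels)` in the same way as the kNN predictions. Raising the threshold trades recall for precision, which matters when false positives block legitimate traffic.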
