## Jupyter Notebook: HTTP Request Classifier

This notebook demonstrates training a classifier to distinguish between malicious and normal HTTP requests using a labeled dataset.

**Libraries**


In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score

# Feel free to import other libraries as needed

The code below was added after supplying Gemini with a small sample of the training dataset file. The data loading part then needed several rounds of adjustment to the parsing and loading functions.

In [12]:
# The request line is assumed to be space-delimited: METHOD URL PROTOCOL
delimiter = " "

def parse_http_request(lines):
  """
  Parses the lines of a single HTTP request into a dictionary.

  Args:
      lines (list): The request line followed by its header lines.

  Returns:
      dict: The request method, URL, protocol, and headers (None if there are none).
  """
  parts = lines[0].strip().split(delimiter)
  if len(parts) != 3:
      raise ValueError(f"Malformed request line: {lines[0]!r}")
  request = {"method": parts[0], "url": parts[1], "protocol": parts[2]}
  headers = {}
  for line in lines[1:]:
      if not line.strip():
          break  # A blank line marks the end of the headers
      if ":" not in line:
          continue  # Skip lines that are not "Name: value" headers
      key, value = line.split(":", 1)
      headers[key.strip()] = value.strip()
  request["headers"] = headers if headers else None  # Set headers to None if empty
  return request


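def load_data(filename):
  """
  Loads HTTP request data from a file that holds many requests.

  Args:
      filename (str): The filename of the data file.

  Returns:
      list: A list of dictionaries, each representing an HTTP request.
  """
  # Assumption: every request starts with a line such as "GET <url> HTTP/1.1"
  methods = ("GET ", "POST ", "PUT ", "DELETE ", "HEAD ", "OPTIONS ")
  data = []
  block = []  # Lines of the request currently being collected
  with open(filename, "r") as f:
    for line in f:
      if line.startswith(methods) and block:
        # A new request line begins, so the previous request is complete
        data.append(parse_http_request(block))
        block = []
      if block or line.startswith(methods):
        block.append(line)
  if block:
    data.append(parse_http_request(block))  # Parse the last request in the file
  return data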




**Load Data**

In [13]:

# Load data using the custom function
normal_train_data = load_data("csic_dataset/normalTrafficTraining.txt")
normal_test_data = load_data("csic_dataset/normalTrafficTest.txt")
anomalous_test_data = load_data("csic_dataset/anomalousTrafficTest.txt")

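# Label each subset (0 = normal, 1 = anomalous) so the combined data has the "label" column used later
for subset, label in ((normal_train_data, 0), (normal_test_data, 0), (anomalous_test_data, 1)):
  for request in subset:
    request["label"] = label

# Combine normal and anomalous data into one labeled dataset
all_data = normal_train_data + normal_test_data + anomalous_test_data
all_data = pd.DataFrame(all_data)  # Convert to pandas dataframe for further processing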




**Feature Engineering**


This step is crucial for classifier performance and needs to be tailored to your data. However, Gemini did not suggest any code for this part, so I asked it for feature engineering ideas. Here is its response:

Here are some potential feature engineering solutions you can explore based on the parsed HTTP request data (dictionary format):

**Basic Features:**

* **Method:** Categorical feature indicating the HTTP method used (e.g., GET, POST, PUT, DELETE). You can encode this using one-hot encoding or label encoding.
* **URL Path:** Extract the path component from the URL. This can reveal the targeted resource on the server. You can further process this by:
    * Splitting the path into segments (e.g., "/products/123") and using them as features.
    * Identifying specific patterns or keywords in the path that might be indicative of certain functionalities.
* **Protocol Version:** Categorical feature indicating the HTTP protocol version (e.g., HTTP/1.1, HTTP/2).
* **Status Code (if available):** This might be present in the response data, but if included in your request logs, it can be a valuable indicator of success or error conditions.
* **Content Length (if available):** This might be present in the request headers, indicating the size of the request body (if any).
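
The basic features above can be sketched with the standard library alone. This is a minimal sketch, not part of the Gemini output: the helper name `basic_features` and the feature names are hypothetical, the example request is made up, and the input is assumed to be a dictionary shaped like the output of `parse_http_request`.

```python
from urllib.parse import urlsplit, parse_qsl


def basic_features(request):
    """Derive simple features from one parsed request (hypothetical helper)."""
    headers = request.get("headers") or {}  # headers may be None
    parts = urlsplit(request["url"])
    # Non-empty path segments, e.g. "/a/b/c.jsp" -> ["a", "b", "c.jsp"]
    path_segments = [seg for seg in parts.path.split("/") if seg]
    query_params = parse_qsl(parts.query, keep_blank_values=True)
    raw_length = headers.get("Content-Length", "0")
    return {
        "method": request["method"],
        "protocol": request["protocol"],
        "path_depth": len(path_segments),
        "url_length": len(request["url"]),
        "num_query_params": len(query_params),
        # Fall back to 0 when the header is missing or not a plain number
        "content_length": int(raw_length) if raw_length.isdigit() else 0,
    }


# Made-up request for illustration only
example = {
    "method": "GET",
    "url": "http://localhost:8080/shop/public/add.jsp?id=2&name=Jam",
    "protocol": "HTTP/1.1",
    "headers": {"Host": "localhost:8080"},
}
features = basic_features(example)
print(features)
```

The method and protocol stay categorical here, so they still need encoding before they reach a model.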

**Header-based Features:**

* **User-Agent:** Categorical feature indicating the browser or client making the request. You can potentially group similar user agents or identify known malicious ones.
* **Referer:** This header indicates the previous webpage that linked to the current request. It can be helpful in understanding the context of the request.
* **Content-Type (if available):** This indicates the format of the request body data (e.g., application/json, application/x-www-form-urlencoded).
* **Number of Headers:** This can be a simple feature indicating the complexity of the request.
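
A similar sketch covers the header-based features. The helper name `header_features`, the feature names, and the example headers are assumptions made for illustration. Taking the text before the first `/` as a user-agent "family" is only a rough grouping heuristic.

```python
def header_features(request):
    """Derive features from the headers of one parsed request (hypothetical helper)."""
    headers = request.get("headers") or {}  # headers may be None
    # HTTP header names are case-insensitive, so normalize them before lookup
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")
    content_type = lowered.get("content-type", "")
    return {
        "num_headers": len(headers),
        "has_referer": int("referer" in lowered),
        # Rough grouping: the product token before the first "/"
        "user_agent_family": user_agent.split("/", 1)[0] or "unknown",
        # Drop parameters such as "; charset=UTF-8"
        "content_type": content_type.split(";", 1)[0].strip().lower() or "none",
    }


# Made-up headers for illustration only
example = {
    "headers": {
        "User-Agent": "Mozilla/5.0 (compatible; Konqueror/3.5; Linux)",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Accept": "*/*",
    }
}
features = header_features(example)
print(features)
```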

**Advanced Features:**

* **Request Frequency:** Analyze access logs to calculate the frequency of specific request patterns over time. Deviations from normal patterns might indicate suspicious activity.
* **Time-based Features:** Extract features like hour of the day, day of the week, or even time zone from the timestamps (if available) in the logs. This can help identify unusual access patterns.
* **IP Address Features:** Analyze source IP addresses and identify geographically anomalous origins for requests. Geolocation databases can be used for this purpose.

**Remember:**

* The choice of features depends on the specific characteristics of your data and the type of anomalies you are trying to detect.
* Feature selection techniques like correlation analysis can help identify the most informative features for your model.
* Consider data anonymization for sensitive information like IP addresses before processing, especially if sharing data for model development or evaluation.
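
As a rough illustration of the correlation analysis mentioned above, the sketch below ranks numeric features by the absolute value of their Pearson correlation with the label. The feature names and all values are made up, and correlation only captures linear relationships, so treat the ranking as a first screening step rather than a final selection.

```python
import pandas as pd

# Made-up numeric features; label: 0 = normal, 1 = anomalous
toy = pd.DataFrame({
    "url_length": [20, 25, 22, 90, 110, 95],
    "path_depth": [3, 2, 3, 2, 3, 2],
    "label": [0, 0, 0, 1, 1, 1],
})

# Absolute correlation of each feature column with the label
correlations = toy.drop(columns="label").corrwith(toy["label"]).abs()
ranked = correlations.sort_values(ascending=False)
print(ranked)
```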

By exploring these feature engineering solutions and tailoring them to your data, you can create a more robust and informative feature set for training your HTTP request classifier.
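
Whichever features are chosen, scikit-learn estimators expect a numeric matrix, so categorical values such as the method need encoding first. Below is a minimal sketch of the one-hot encoding mentioned above, using `pandas.get_dummies`. The rows and column names are made up for illustration.

```python
import pandas as pd

# Made-up feature rows, one dictionary per request
feature_rows = [
    {"method": "GET", "url_length": 42, "num_headers": 9},
    {"method": "POST", "url_length": 57, "num_headers": 11},
    {"method": "GET", "url_length": 230, "num_headers": 9},
]
frame = pd.DataFrame(feature_rows)

# One-hot encode the categorical column; numeric columns pass through unchanged
X_numeric = pd.get_dummies(frame, columns=["method"], dtype=int)
print(X_numeric)
```

One caveat: categories that appear only in the test data would produce columns the model never saw, so the encoding should be derived from the training data and reused.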


**Train-Test Split**

In [None]:
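# X must hold numeric features only: replace the raw columns (method, url, protocol, headers)
# with engineered features before fitting a model
X = all_data.drop("label", axis=1)  # Features
y = all_data["label"]  # Labels

# stratify=y keeps the normal/anomalous ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)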


**Classifier Training (Isolation Forest)**

In [None]:
# Isolation Forest is an unsupervised anomaly detector, so it is fit without y_train
# contamination is the expected fraction of anomalies in the training data; adjust as needed
clf = IsolationForest(contamination=0.1, random_state=42)
clf.fit(X_train)

**Prediction and Evaluation**

In [None]:
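# IsolationForest.predict returns 1 for inliers and -1 for outliers;
# map that to the label encoding (assumed here: 0 = normal, 1 = anomalous)
y_pred = [1 if p == -1 else 0 for p in clf.predict(X_test)]
accuracy = accuracy_score(y_test, y_pred)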

print(f"Accuracy on Test Set: {accuracy:.4f}")

# Additional evaluation metrics like precision, recall, F1 score can be included
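
A minimal sketch of those additional metrics follows, using made-up labels and predictions. It assumes the labels use 1 for anomalous and 0 for normal, and that the raw predictions follow the Isolation Forest convention of -1 for outliers.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up values for illustration: 1 = anomalous, 0 = normal
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# IsolationForest-style output: 1 = inlier, -1 = outlier
raw_pred = np.array([1, 1, 1, -1, -1, -1, 1, -1])

# Map outliers (-1) to the anomalous label 1, inliers to 0
y_hat = np.where(raw_pred == -1, 1, 0)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_hat, average="binary", pos_label=1
)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```

On imbalanced traffic data these metrics are usually more informative than accuracy, because a model that labels everything as normal can still score a high accuracy.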


**Further Exploration**

* Try different anomaly detection algorithms (e.g., Local Outlier Factor, One-Class SVM) and compare their performance.
* Experiment with feature selection techniques to identify the most informative features.
* Visualize the distribution of features for normal and anomalous data to understand patterns.
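
The first suggestion can be sketched as a loop over several detectors that share the same `fit`/`predict` interface. The data here is synthetic (two Gaussian clusters), so the scores say nothing about HTTP traffic. The parameter values are arbitrary starting points, and fitting on normal samples only is an assumption of this sketch.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic data for illustration only
rng = np.random.default_rng(0)
normal_train = rng.normal(0, 1, size=(200, 2))
normal_test = rng.normal(0, 1, size=(50, 2))
anomalous_test = rng.normal(6, 1, size=(50, 2))
X_eval = np.vstack([normal_test, anomalous_test])
y_eval = np.array([0] * 50 + [1] * 50)  # 0 = normal, 1 = anomalous

detectors = {
    "IsolationForest": IsolationForest(random_state=0),
    # novelty=True is required to call predict on unseen data
    "LocalOutlierFactor": LocalOutlierFactor(novelty=True),
    "OneClassSVM": OneClassSVM(nu=0.05, gamma="scale"),
}

scores = {}
for name, detector in detectors.items():
    detector.fit(normal_train)  # Fit on normal samples only
    pred = np.where(detector.predict(X_eval) == -1, 1, 0)  # -1 means outlier
    scores[name] = f1_score(y_eval, pred)

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```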

**Converting to Production**

* Save the trained model using libraries like pickle or joblib.
* Integrate the model into a web application or security system for real-time traffic classification.
* Continuously monitor model performance and retrain with new data to maintain accuracy.
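
Saving and reloading a fitted model with joblib might look like the sketch below. The training data is random, the filename is hypothetical, and a temporary directory stands in for wherever the model would actually be stored.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Random data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
model = IsolationForest(random_state=0).fit(X_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "http_request_classifier.joblib"  # hypothetical filename
    joblib.dump(model, model_path)      # Serialize the fitted model to disk
    restored = joblib.load(model_path)  # Load it back, e.g. in the serving process

# The restored model should reproduce the original predictions
same_predictions = bool(np.array_equal(model.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

joblib files are pickle-based, so loading one can execute arbitrary code. Only load model files from a trusted source, and keep the scikit-learn version used for loading the same as the one used for saving.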

**Note:**

* This is a basic example, and the specific implementation will depend on your data format and chosen features.
* Remember to adapt the feature engineering and classifier selection based on your specific dataset.