# Anomaly detection - HTTP requests - CSIC dataset 2010

## Task
The primary objective is to develop a classifier that identifies malicious HTTP requests by training on normal traffic and evaluating on both normal and anomalous test data. The dataset is structured as follows:

 * Normal Traffic (Train)
 * Normal Traffic (Test)
 * Anomalous Traffic (Test)

Although the dataset is designed for anomaly detection trained on normal traffic only, it can also be used for supervised learning by combining the normal and anomalous data into a single labeled dataset, on which any standard classifier can be trained. This notebook takes the supervised approach.
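
As a minimal sketch of that labeling step, assume each traffic split has already been parsed into a pandas DataFrame. The tiny frames below are illustrative stand-ins, not rows from the dataset; the `Normal_Anom` column name matches the label column used later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the parsed splits (hypothetical rows, for illustration only).
normal = pd.DataFrame({"Method": ["GET", "POST"],
                       "URL": ["/tienda/index.jsp", "/tienda/login.jsp"]})
anomalous = pd.DataFrame({"Method": ["GET"],
                          "URL": ["/tienda/index.jsp?id=1' OR '1'='1"]})

# Tag each split with its class, then stack them into one labeled dataset.
labeled = pd.concat(
    [normal.assign(Normal_Anom="Normal"), anomalous.assign(Normal_Anom="Anomalous")],
    ignore_index=True,
)

print(labeled["Normal_Anom"].value_counts())
```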

## Dataset
The dataset contains automatically generated traffic targeting an e-commerce web
application: 36,000 normal requests and more than 25,000 anomalous requests
(i.e., web attacks).

### 0. Packages

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
import re

### 1. Data Load and Preparation

Data Preprocessing:

* Combine HTTP method, URL, version, user-agent, and body into a single text feature

* Perform basic text cleaning (lowercasing, removing special characters)

Feature Extraction:

* Use TF-IDF with n-grams (1-2 words) to convert text to numerical features

* Limit to top 1000 features to manage dimensionality
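
To make the feature extraction step concrete, here is a small sketch of TF-IDF with unigrams and bigrams on two toy strings. The strings are made up and already cleaned; only the vectorizer settings are taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy, already-cleaned request strings (illustrative only).
docs = ["get index jsp http11", "post login jsp http11"]

vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# The vocabulary holds single words and adjacent word pairs.
print(sorted(vectorizer.vocabulary_))
print(X.shape)  # (number of documents, number of n-gram features)
```

With `max_features=1000`, only the 1000 most frequent n-grams across the corpus are kept; on this toy input the limit is never reached.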

In [21]:

def preprocess_data(df):
    """Preprocess the HTTP request data"""
    # Combine relevant features into a single text feature
    df['text'] = df['Method'] + ' ' + df['URL'] + ' ' + df['HTTP_Version'] + ' ' + df['User-Agent']
    
    # Add body content if exists
    df['text'] = df.apply(lambda x: x['text'] + ' ' + str(x['Body']) if pd.notna(x['Body']) else x['text'], axis=1)
    
    # Basic text cleaning
    df['text'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', '', str(x).lower()))
    
    return df

def load_and_prepare_data(filepath):
    """Load and prepare the dataset"""
    df = pd.read_csv(filepath)
    
    # Preprocess the data
    df = preprocess_data(df)
    
    # Extract features and labels
    X = df['text']
    y = df['Normal_Anom']
    
    return X, y
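
To see what the cleaning rule in `preprocess_data` does to a single request, the sketch below applies the same regular expression to a made-up request line. The `clean` helper and the sample request are hypothetical and exist only for this illustration.

```python
import re

def clean(text: str) -> str:
    """Same rule as in preprocess_data: lowercase, then drop non-word, non-space characters."""
    return re.sub(r"[^\w\s]", "", text.lower())

# Hypothetical request line containing typical injection characters.
raw = "GET /tienda/login.jsp?user=admin'--&pwd=<script> HTTP/1.1"
cleaned = clean(raw)
print(cleaned)  # get tiendaloginjspuseradminpwdscript http11
```

Quotes, angle brackets, and URL separators are removed, and the path and query collapse into one long token. It may be worth checking whether the attack patterns of interest are still distinguishable after this step.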

### 2. Model Training

* We use a Random Forest classifier on the TF-IDF features as a baseline

In [22]:
def train_model(X_train, y_train):
    """Train a classification model"""
    # Create a pipeline with TF-IDF and Random Forest
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
    ])
    
    # Train the model
    pipeline.fit(X_train, y_train)
    
    return pipeline
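
As a usage sketch, the same pipeline can be fitted on a handful of toy strings. The texts and labels below are invented for illustration and say nothing about performance on the real data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy cleaned requests and labels (illustrative only).
texts = [
    "get index jsp http11",
    "get cart jsp http11",
    "get index jsp id1 or 11 http11",
    "post login jsp script alert http11",
]
labels = ["Normal", "Normal", "Anomalous", "Anomalous"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(texts, labels)

# The pipeline vectorizes raw strings itself, so predict takes text directly.
prediction = pipeline.predict(["get cart jsp http11"])
print(prediction)
print(list(pipeline.classes_))
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned from the training split only and is reused unchanged at prediction time.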

### 3. Model Evaluation

* Provide accuracy and classification report (precision, recall, F1-score)

* Use stratified sampling to maintain class distribution
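
The effect of stratified sampling can be checked on toy labels. This sketch assumes an 80/20 class ratio purely for illustration and uses the same `train_test_split` arguments as this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with an 80/20 class ratio (illustrative only).
y = ["Normal"] * 80 + ["Anomalous"] * 20
X = list(range(len(y)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits keep the original class ratio.
print(Counter(y_train))
print(Counter(y_test))
```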

In [23]:

def evaluate_model(model, X_test, y_test):
    """Evaluate the model performance"""
    y_pred = model.predict(X_test)
    
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
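
A confusion matrix is a useful companion to the classification report because it shows the raw counts behind precision and recall. The sketch below uses invented labels, not predictions from this notebook's model.

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (illustrative only).
y_true = ["Anomalous", "Anomalous", "Anomalous", "Normal", "Normal"]
y_pred = ["Anomalous", "Normal", "Normal", "Normal", "Normal"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
labels = ["Anomalous", "Normal"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Here the first row reads as one anomaly caught and two missed, which is the kind of error the recall column summarizes.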

### 4. Main execution

In [24]:

def main():
    # Load and prepare data
    X, y = load_and_prepare_data('http_requests_all.csv')
    
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # Train the model
    model = train_model(X_train, y_train)
    
    # Evaluate the model
    evaluate_model(model, X_test, y_test)
    
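    # Example prediction
    # Note: this string is passed as-is, without the cleaning applied in
    # preprocess_data, so its tokens differ from the text seen during training
    example_request = "GET http://example.com/login.php HTTP/1.1 Mozilla/5.0"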
    print(f"\nExample prediction for '{example_request}':")
    print(model.predict([example_request]))

if __name__ == "__main__":
    main()

Accuracy: 0.81809247956967

Classification Report:
              precision    recall  f1-score   support

   Anomalous       1.00      0.29      0.45      4934
      Normal       0.80      1.00      0.89     14400

    accuracy                           0.82     19334
   macro avg       0.90      0.64      0.67     19334
weighted avg       0.85      0.82      0.78     19334


Example prediction for 'GET http://example.com/login.php HTTP/1.1 Mozilla/5.0':
['Anomalous']
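
The figures in the report can be cross-checked by hand. The sketch below recomputes the F1 scores from the rounded precision and recall values and recovers the number of misclassified requests from the accuracy. The last step assumes that essentially all errors are missed anomalies, which the rounded 1.00 precision for `Anomalous` suggests but does not prove.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the output above.
total = 19334
n_anomalous = 4934
accuracy = 0.81809247956967

f1_anomalous = f1(1.00, 0.29)
f1_normal = f1(0.80, 1.00)

correct = round(accuracy * total)
errors = total - correct

# Assumption: every error is an anomalous request predicted as Normal.
implied_anomalous_recall = (n_anomalous - errors) / n_anomalous

print(round(f1_anomalous, 2), round(f1_normal, 2))
print(correct, errors)
print(round(implied_anomalous_recall, 2))
```

Under that assumption, roughly 3,500 of the 4,934 anomalous test requests are classified as normal, so the 82% accuracy mostly reflects the majority class.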


## Possible enhancements

1. Advanced Feature Engineering:

- Extract specific URL patterns, parameter counts, etc.

- Add length-based features (URL length, parameter length)

- Detect special characters or encodings

2. Model Improvements:

- Try other algorithms like SVM or neural networks

- Use grid search for hyperparameter tuning

- Implement ensemble methods

3. Handling Imbalanced Data:

- Use SMOTE or other resampling techniques; anomalous requests are the minority class (4,934 of the 19,334 test samples above)

- Adjust class weights in the classifier

4. Deployment:

- Save the trained model to disk for later use

- Create an API endpoint for real-time classification
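
As one possible starting point for the feature engineering ideas listed above, the sketch below derives a few length, parameter, and special-character features from a request. The `request_features` function, the feature names, and the `SPECIAL_CHARS` set are hypothetical choices for illustration, not part of this notebook's pipeline.

```python
from urllib.parse import parse_qsl, urlsplit

# Characters that often show up in injection payloads (an assumed, non-exhaustive set).
SPECIAL_CHARS = set("'\"<>;()%")

def request_features(url: str, body: str = "") -> dict:
    """Compute simple handcrafted features from the raw URL and body."""
    query = urlsplit(url).query
    # Parameters may arrive in the query string or in a form-encoded body.
    params = parse_qsl(query, keep_blank_values=True)
    params += parse_qsl(body, keep_blank_values=True)
    raw = url + body
    return {
        "url_length": len(url),
        "n_params": len(params),
        # Length of the longest decoded parameter value (0 if there are none).
        "max_param_length": max((len(value) for _, value in params), default=0),
        "n_special_chars": sum(ch in SPECIAL_CHARS for ch in raw),
    }

# Hypothetical request with a URL-encoded SQL injection attempt.
url = "http://localhost:8080/tienda/index.jsp?id=1&name=a%27+OR+1%3D1"
features = request_features(url)
print(features)
```

Unlike the cleaned text feature, these values are computed on the raw request, so punctuation and percent-encoding still contribute. They could be combined with the TF-IDF features, for example through scikit-learn's `ColumnTransformer`.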

### 5. Result Visualization

In [None]:
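import seaborn as sns  # not included in the imports of section 0

# csic_data is assumed to be the raw dataset DataFrame, loaded beforehand
sns.set_style('darkgrid')
sns.countplot(data=csic_data, x='Unnamed: 0')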