# DistilBERT Fine-Tuned on CSIC‑2010 Web Attacks Dataset
### Rayen NAIT SLIMANE - Master's Student, ISIMA INP

**NOTE:** This notebook is designed to run on [Kaggle](https://www.kaggle.com) with the [CSIC‑2010 Web Attacks Dataset](https://www.kaggle.com/datasets/ispangler/csic-2010-web-application-attacks). Enable the P100 GPU (free 30h/week quota) to keep training time reasonable.

I propose fine-tuning DistilBERT, a compact and efficient transformer model, to detect SQL Injection (SQLi) attempts in HTTP requests. This notebook evaluates the model on a Kaggle dataset to find an accurate, resource-lean detection strategy that can be trained and deployed quickly.

I observed an increase from 60% accuracy and 50% recall to 95% accuracy and 89% recall by using the following methods:

* *Using only 'URL', 'METHOD', 'CONTENT' and 'USER-AGENT' as the feature set, since they are the most essential fields*
* *Prefixing each field in the concatenated feature string with a tag like '[URL]:', so the model can tell the fields apart*
* *Imputing missing values with column-specific placeholders like 'NULL_URL', so that no rows are dropped*
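
The three steps above can be sketched on a single request record. This is a minimal, pandas-free illustration: the helper name `build_text_input` and the sample record are hypothetical, while the field tags and `NULL_` placeholders mirror the ones used in the preprocessing cell of this notebook. It only treats `None` as missing, whereas the real dataset stores missing values as NaN.

```python
# Hypothetical helper: mirrors the notebook's feature-string template on a plain dict.
FIELDS = [
    # (column name, tag used in the text, placeholder for a missing value)
    ("URL", "URL", "NULL_URL"),
    ("Method", "METHOD", "NULL_METHOD"),
    ("content", "CONTENT", "NULL_CONTENT"),
    ("User-Agent", "USER_AGENT", "NULL_USER_AGENT"),
]

def build_text_input(record: dict) -> str:
    """Concatenate the kept fields into one tagged string, imputing missing ones."""
    parts = []
    for column, tag, placeholder in FIELDS:
        value = record.get(column)
        if value is None:
            value = placeholder
        parts.append(f"[{tag}]: {value}")
    return " ".join(parts)

# Made-up GET request without a body, so 'content' is missing.
example = {
    "URL": "http://localhost:8080/tienda1/index.jsp",
    "Method": "GET",
    "content": None,
    "User-Agent": "Mozilla/5.0",
}
print(build_text_input(example))
```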

Furthermore, fine-tuning with Low-Rank Adaptation (LoRA) let me iterate rapidly over different hyperparameters and converge on a good model faster.

## Initial Setup

In [2]:
!pip install --upgrade pip -q
!pip install transformers datasets accelerate peft evaluate safetensors -q

In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
import numpy as np
import evaluate
import pandas as pd

MODEL_NAME = "distilbert-base-uncased"
NUM_LABELS = 2             # binary task: benign (0) vs. malicious (1)
BATCH_SIZE = 16
EPOCHS = 3
LR = 2e-5
OUTPUT_DIR = "./lora-distilbert-checkpoint"

## Dataset Preprocessing

In [5]:
raw_df = pd.read_csv("/kaggle/input/csic-2010-web-application-attacks/csic_database.csv")
raw_df.info()

print("\n", raw_df["classification"].value_counts())
print("\nMissing value counts:\n", raw_df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61065 entries, 0 to 61064
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       61065 non-null  object
 1   Method           61065 non-null  object
 2   User-Agent       61065 non-null  object
 3   Pragma           61065 non-null  object
 4   Cache-Control    61065 non-null  object
 5   Accept           60668 non-null  object
 6   Accept-encoding  61065 non-null  object
 7   Accept-charset   61065 non-null  object
 8   language         61065 non-null  object
 9   host             61065 non-null  object
 10  cookie           61065 non-null  object
 11  content-type     17977 non-null  object
 12  connection       61065 non-null  object
 13  lenght           17977 non-null  object
 14  content          17977 non-null  object
 15  classification   61065 non-null  int64 
 16  URL              61065 non-null  object
dtypes: int64(1), object(16)
memory 

In [6]:
raw_df = raw_df.drop(columns=['Unnamed: 0'])

kept_columns = ['URL', 'Method', 'content', 'User-Agent']
kept_columns_null = ['NULL_URL', 'NULL_METHOD', 'NULL_CONTENT', 'NULL_USER_AGENT']

for col, null_val in zip(kept_columns, kept_columns_null):
    raw_df[col] = raw_df[col].fillna(null_val)
    raw_df[col] = raw_df[col].astype(str)

raw_df['text_input'] = raw_df.apply(
    lambda row: f"[URL]: {row['URL']} [METHOD]: {row['Method']} [CONTENT]: {row['content']} [USER_AGENT]: {row['User-Agent']}",
    axis=1
)

df = pd.DataFrame({
    'text': raw_df['text_input'],
    'labels': raw_df['classification']
})

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61065 entries, 0 to 61064
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    61065 non-null  object
 1   labels  61065 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 954.3+ KB


,text,labels
0,[URL]: http://localhost:8080/tienda1/index.jsp...,0
1,[URL]: http://localhost:8080/tienda1/publico/a...,0
2,[URL]: http://localhost:8080/tienda1/publico/a...,0
3,[URL]: http://localhost:8080/tienda1/publico/a...,0
4,[URL]: http://localhost:8080/tienda1/publico/a...,0


In [7]:
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict

# 1. First split: 80% train+val, 20% test
# Note: stratify on the 'labels' column so each split keeps the class balance
train_val_df, test_df = train_test_split(
    df, 
    test_size=0.2, 
    random_state=42, 
    stratify=df['labels'] 
)

# 2. Second split: 80% train, 20% validation (of the remaining train_val set)
train_df, val_df = train_test_split(
    train_val_df, 
    test_size=0.2, # This is 20% of the 80% (i.e., 16% of the original data)
    random_state=42, 
    stratify=train_val_df['labels']
)

# 3. Create the DatasetDict object
raw_datasets = DatasetDict({
    "train": Dataset.from_pandas(train_df, preserve_index=False),
    "validation": Dataset.from_pandas(val_df, preserve_index=False),
    "test": Dataset.from_pandas(test_df, preserve_index=False),
})
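
Because the second split takes 20% of the remaining 80%, the overall proportions come out at roughly 64% train, 16% validation and 20% test. The sketch below reproduces the split sizes for the 61,065 rows of this dataset. It assumes that scikit-learn rounds the held-out size up to the next integer, and the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows: int, holdout_percent: int = 20) -> dict:
    """Sizes of a two-stage split: hold out a test set, then a validation set."""
    # Assumption: the held-out part is rounded up, the remainder is kept.
    n_test = math.ceil(n_rows * holdout_percent / 100)
    n_train_val = n_rows - n_test
    n_val = math.ceil(n_train_val * holdout_percent / 100)
    n_train = n_train_val - n_val
    return {"train": n_train, "validation": n_val, "test": n_test}

sizes = split_sizes(61065)
print(sizes)
for name, size in sizes.items():
    print(f"{name}: {size / 61065:.1%} of all rows")
```

The validation and test sizes obtained this way agree with the 9,771 and 12,213 examples reported in the evaluation output.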

## Model Selection and Preparation

In [8]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def preprocess(batch):
    # Truncate or pad every request string to exactly 128 tokens
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = raw_datasets.map(preprocess, batched=True)
encoded.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)

Map:   0%|          | 0/39081 [00:00<?, ? examples/s]

Map:   0%|          | 0/9771 [00:00<?, ? examples/s]

Map:   0%|          | 0/12213 [00:00<?, ? examples/s]
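
With `max_length=128`, any request whose tokenized form is longer loses its tail, which may matter when the informative part of a request sits near the end of the string. A quick check is to count how many examples exceed the limit. The sketch below is tokenizer-agnostic: `count_truncated` and its `encode` argument are hypothetical names, and a whitespace split stands in for the real tokenizer so that the block runs on its own.

```python
from typing import Callable, Iterable, List

def count_truncated(texts: Iterable[str],
                    encode: Callable[[str], List],
                    max_length: int = 128) -> int:
    """Count the texts whose token sequence is longer than max_length."""
    return sum(1 for text in texts if len(encode(text)) > max_length)

# Stand-in tokenizer for illustration only; a real subword tokenizer
# usually produces more tokens than a whitespace split does.
toy_encode = str.split

samples = ["a b c", "a " * 200, "x " * 128]
print(count_truncated(samples, toy_encode, max_length=128))  # 1
```

In this notebook, passing `lambda t: tokenizer(t)["input_ids"]` as `encode` and `raw_datasets["train"]["text"]` as `texts` should give the count for the training split.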

In [10]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


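This LoRA setup (rank 8, alpha 32) attaches low-rank adapters to the attention projections q_lin, k_lin, v_lin and out_lin, so only a small fraction of the weights is trained. It balances **efficiency and adaptability**, enabling fast, resource-light fine-tuning, with a LoRA dropout rate of 0.05 to **reduce overfitting** on the sequence classification task.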

In [None]:
lora_config = LoraConfig(
    r=8,               # rank
    lora_alpha=32,
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS"
)

In [14]:
model = get_peft_model(model, lora_config)
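
To see why this setup is resource-light, the number of weights LoRA adds can be worked out by hand. Each adapted linear layer with input size d_in and output size d_out gains two matrices, A (r × d_in) and B (d_out × r), so r · (d_in + d_out) parameters. The sketch below assumes the usual distilbert-base-uncased dimensions (6 transformer layers, hidden size 768), which are not stated in this notebook, and the function name is hypothetical.

```python
def lora_param_count(rank: int, d_in: int, d_out: int,
                     modules_per_layer: int, num_layers: int) -> int:
    """Number of parameters LoRA adds across all adapted linear layers."""
    per_module = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return per_module * modules_per_layer * num_layers

# Assumed DistilBERT-base dimensions: hidden size 768, 6 layers.
# Four adapted projections per layer: q_lin, k_lin, v_lin, out_lin.
added = lora_param_count(rank=8, d_in=768, d_out=768,
                         modules_per_layer=4, num_layers=6)
print(added)  # 294912
```

Any classification-head weights that remain trainable come on top of this figure; `model.print_trainable_parameters()` reports the actual total.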

In [15]:
def compute_metrics(eval_pred):
    """
    Compute evaluation metrics
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    # Calculate accuracy
    accuracy = (predictions == labels).mean()
    
    # Calculate precision, recall, and F1
    from sklearn.metrics import precision_recall_fscore_support
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='binary'
    )
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

In [None]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=EPOCHS,
    learning_rate=LR,
    fp16=torch.cuda.is_available(),   # use mixed precision if GPU available
    load_best_model_at_end=False,
    save_total_limit=2,
    logging_steps=10,
    report_to="none"
)
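
As a rough sanity check on these settings, the number of optimizer steps follows from the split sizes and the batch size. The sketch below assumes a single GPU and no gradient accumulation, and it takes the 39,081 training and 9,771 validation examples from the tokenization output; the helper name is hypothetical.

```python
import math

def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Batches needed to see every example once (the last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)

BATCH_SIZE = 16
EPOCHS = 3

train_steps = steps_per_epoch(39081, BATCH_SIZE) * EPOCHS
eval_steps = steps_per_epoch(9771, BATCH_SIZE)
print(train_steps)  # 7329
print(eval_steps)   # 611
```

With `logging_steps=10`, the training loss shown for each epoch is an average over the last few batches only, which may explain why it fluctuates while the validation loss keeps decreasing.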

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

  trainer = Trainer(
No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [18]:
trainer.train()
print("\nTraining completed!")

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.1204,0.135368,0.938184,0.9728,0.873847,0.920672
2,0.1218,0.123291,0.947191,0.982868,0.886811,0.932372
3,0.1383,0.113578,0.950568,0.981441,0.896535,0.937068



Training completed!


## Evaluate Model

In [19]:
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def get_labels_from_logits(logits):
    """
    Convert logits to predicted class labels.
    """
    if isinstance(logits, tuple):
        logits = logits[0]
    return np.argmax(logits, axis=1)

# --- Evaluate on validation set ---
eval_results = trainer.evaluate()
print("\nValidation Results:")
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")

# --- Detailed validation diagnostics ---
print(f"\nValidation dataset size: {len(encoded['validation'])}")
print(f"Validation dataset columns: {encoded['validation'].column_names}")

# Note: predict() prefixes its metric keys with "test_" by default, even on the validation set
val_preds = trainer.predict(encoded["validation"])

# Debug: Check what we got back
print(f"\nPredictions shape: {val_preds.predictions.shape}")
print(f"Label IDs: {val_preds.label_ids.shape if val_preds.label_ids is not None else 'None'}")
print(f"Metrics returned: {val_preds.metrics}")

val_labels = get_labels_from_logits(val_preds.predictions)
print(f"Number of predictions: {len(val_labels)}")

# CRITICAL: Check if we got all predictions
if len(val_labels) != len(encoded["validation"]):
    print(f"\n⚠️ WARNING: Prediction count mismatch!")
    print(f"Expected {len(encoded['validation'])} but got {len(val_labels)}")
    print(f"This suggests the Trainer is dropping samples!")

# Use label_ids if available, otherwise fall back to dataset labels
if val_preds.label_ids is not None:
    val_true = val_preds.label_ids
    print(f"Using label_ids from predictions: {len(val_true)} labels")
else:
    print("⚠️ label_ids is None, using fallback")
    val_true = np.array(encoded["validation"]["labels"])[:len(val_labels)]
    print(f"Using sliced dataset labels: {len(val_true)} labels")

print(f"\n✓ Final lengths - Predictions: {len(val_labels)}, True labels: {len(val_true)}")

print("\nClassification Report:")
print(classification_report(val_true, val_labels, target_names=["Benign", "Malicious"]))

print("\nConfusion Matrix:")
cm = confusion_matrix(val_true, val_labels)
print(cm)
print("\n[TN  FP]\n[FN  TP]")

# --- Test set ---
if "test" in encoded and len(encoded["test"]) > 0:
    print(f"\n\nTest dataset size: {len(encoded['test'])}")
    test_preds = trainer.predict(encoded["test"])
    test_labels = get_labels_from_logits(test_preds.predictions)
    
    print(f"Test predictions: {len(test_labels)}")
    
    if test_preds.label_ids is not None:
        test_true = test_preds.label_ids
    else:
        test_true = np.array(encoded["test"]["labels"])[:len(test_labels)]
    
    print(f"\nTest Set Results:")
    print(classification_report(test_true, test_labels, target_names=["Benign", "Malicious"]))
    
    print("\nTest Confusion Matrix:")
    print(confusion_matrix(test_true, test_labels))


Validation Results:
eval_loss: 0.1136
eval_accuracy: 0.9506
eval_precision: 0.9814
eval_recall: 0.8965
eval_f1: 0.9371
eval_runtime: 20.1256
eval_samples_per_second: 485.5020
eval_steps_per_second: 30.3590
epoch: 3.0000

Validation dataset size: 9771
Validation dataset columns: ['text', 'labels', 'input_ids', 'attention_mask']

Predictions shape: (9771, 2)
Label IDs: (9771,)
Metrics returned: {'test_loss': 0.11357788741588593, 'test_accuracy': 0.9505680073687442, 'test_precision': 0.9814410480349345, 'test_recall': 0.8965345300423835, 'test_f1': 0.9370684039087948, 'test_runtime': 20.5966, 'test_samples_per_second': 474.398, 'test_steps_per_second': 29.665}
Number of predictions: 9771
Using label_ids from predictions: 9771 labels

✓ Final lengths - Predictions: 9771, True labels: 9771

Classification Report:
              precision    recall  f1-score   support

      Benign       0.93      0.99      0.96      5760
   Malicious       0.98      0.90      0.94      4011

    accuracy   

Test predictions: 12213

Test Set Results:
              precision    recall  f1-score   support

      Benign       0.93      0.99      0.96      7200
   Malicious       0.98      0.89      0.94      5013

    accuracy                           0.95     12213
   macro avg       0.96      0.94      0.95     12213
weighted avg       0.95      0.95      0.95     12213


Test Confusion Matrix:
[[7126   74]
 [ 533 4480]]
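
The figures in the classification report can be recomputed directly from this confusion matrix, which makes the trade-off explicit: 74 benign requests are flagged as attacks, while 533 attacks are missed. The sketch below is a minimal recomputation for the malicious class, using the [[TN, FP], [FN, TP]] layout printed earlier; the helper name is hypothetical.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive (malicious) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Test confusion matrix from the output above: [[TN, FP], [FN, TP]]
metrics = binary_metrics(tn=7126, fp=74, fn=533, tp=4480)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these values match the Malicious row and the accuracy of the test report: 0.98 precision, 0.89 recall, 0.94 F1 and 0.95 accuracy.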


In [24]:
# Save the adapter to the same path that is reported below
model.save_pretrained(OUTPUT_DIR)
print("Saved LoRA adapter ->", OUTPUT_DIR)

Saved LoRA adapter -> ./lora-distilbert-checkpoint
