# XIDS Data Exploration Notebook

This notebook explores the KDD IDS dataset used in the XIDS (Explainable Intrusion Detection System) project.

## Contents:
1. Dataset Overview
2. Data Loading and Inspection
3. Statistical Analysis
4. Class Distribution Analysis
5. Feature Characteristics
6. Data Quality Assessment

## 1. Dataset Overview

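The KDD Cup 1999 (KDD99) dataset is a widely used benchmark for intrusion detection systems. The full release contains:
- **Training set**: 4,898,431 records
- **Test set**: 311,029 records
- **Features**: 41 network traffic features per connection record
- **Classes**: normal (benign) traffic + 22 attack types in the training set

This dataset is used for building and evaluating machine learning models that detect network intrusions.

Note that the file names loaded in Section 2 (`KDDTrain+.txt`, `KDDTest+.txt`) match the NSL-KDD release, a de-duplicated derivative of KDD99 with 125,973 training and 22,544 test records. If those are the files on disk, the record counts seen when loading will differ from the figures above.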

## 2. Data Loading and Inspection

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set up visualization style
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Define data paths
DATA_DIR = Path('../backend/data/raw')
TRAIN_FILE = DATA_DIR / 'KDDTrain+.txt'
TEST_FILE = DATA_DIR / 'KDDTest+.txt'

print(f"Data directory: {DATA_DIR}")
print(f"Train file exists: {TRAIN_FILE.exists()}")
print(f"Test file exists: {TEST_FILE.exists()}")

In [None]:
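# Load a sample (first 10,000 rows) for exploration.
# In production, use the full dataset.
# The raw KDD files have no header row, so header=None keeps the first record as data.
try:
    train_df = pd.read_csv(TRAIN_FILE, header=None, nrows=10000)
    test_df = pd.read_csv(TEST_FILE, header=None, nrows=10000)
    print(f"Train set shape: {train_df.shape}")
    print(f"Test set shape: {test_df.shape}")
except FileNotFoundError:
    print("Data files not found. Using synthetic data for exploration.")
    rng = np.random.default_rng(42)  # fixed seed so the fallback is reproducible
    train_df = pd.DataFrame(rng.standard_normal((1000, 41)))
    test_df = pd.DataFrame(rng.standard_normal((100, 41)))
    # Add a synthetic label column so the class-distribution cells still work
    train_df['label'] = rng.choice(['normal', 'attack'], size=len(train_df))
    test_df['label'] = rng.choice(['normal', 'attack'], size=len(test_df))

print("\nFirst few rows of training data:")
train_df.head()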

In [None]:
# Data information
print("Training Data Info:")
print(f"Shape: {train_df.shape}")
print(f"Data types:\n{train_df.dtypes}")
print(f"\nMemory usage: {train_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
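The raw KDD files carry no header row, so column names have to be supplied at load time. The sketch below shows one way to do that with `header=None` and `names=`. It runs on a tiny in-memory sample, and the four-column name list is an illustrative subset rather than the full 41-feature schema.

```python
import io

import pandas as pd

# Hypothetical, shortened sample: real KDD rows have 41 features plus a label.
sample_csv = io.StringIO(
    "0,tcp,http,normal\n"
    "0,udp,domain_u,normal\n"
    "2,tcp,private,neptune\n"
)

# Illustrative subset of column names (assumption: the full list lives elsewhere).
column_names = ["duration", "protocol_type", "service", "label"]

# header=None keeps the first record as data instead of consuming it as a header.
sample_df = pd.read_csv(sample_csv, header=None, names=column_names)

print(sample_df.dtypes)
print(sample_df)
```

Naming the columns up front makes later cells easier to read than positional indices such as `train_df.columns[-1]`.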

## 3. Statistical Analysis

In [None]:
# Get statistical summary
print("Training Data Statistics:")
print(train_df.describe())

In [None]:
# Check for missing values
missing_values = train_df.isnull().sum()
if missing_values.sum() > 0:
    print("Missing values detected:")
    print(missing_values[missing_values > 0])
else:
    print("No missing values found in training data")

# Check for infinite values
infinite_values = np.isinf(train_df.select_dtypes(include=[np.number])).sum()
print(f"\nInfinite values: {infinite_values.sum()}")
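If the checks above do report infinite values, one common follow-up is to convert them to `NaN` so that a single missing-value strategy covers both cases. The following is a minimal sketch on a made-up frame; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame containing both infinite and missing entries.
demo = pd.DataFrame({
    "src_bytes": [10.0, np.inf, 250.0, np.nan],
    "count": [1.0, 2.0, -np.inf, 4.0],
})

# Count infinite values per numeric column (same check as in the cell above).
numeric = demo.select_dtypes(include=[np.number])
inf_per_column = np.isinf(numeric).sum()

# Replace +/-inf with NaN so later imputation or dropping treats them uniformly.
cleaned = demo.replace([np.inf, -np.inf], np.nan)
nan_per_column = cleaned.isna().sum()

print("Infinite values per column:")
print(inf_per_column)
print("\nMissing values after replacement:")
print(nan_per_column)
```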

## 4. Class Distribution Analysis

In [None]:
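# Pick the label column. In KDD99 files the label is the last column; NSL-KDD files
# (KDDTrain+.txt) append a numeric difficulty score after it, so prefer the last
# non-numeric column and fall back to the last column if every column is numeric.
non_numeric_cols = train_df.select_dtypes(exclude=[np.number]).columns
label_col = non_numeric_cols[-1] if len(non_numeric_cols) > 0 else train_df.columns[-1]
print(f"Label column: {label_col}")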

# Class distribution
class_dist = train_df[label_col].value_counts()
print("\nClass Distribution:")
print(class_dist)
print("\nClass Distribution (%):")
print(class_dist / len(train_df) * 100)

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
class_dist.plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Class Distribution (Count)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Attack Type')
axes[0].set_ylabel('Number of Samples')
axes[0].tick_params(axis='x', rotation=45)

# Pie chart
class_dist.plot(kind='pie', ax=axes[1], autopct='%1.1f%%')
axes[1].set_title('Class Distribution (Percentage)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

print("\nClass imbalance ratio (most common / least common):")
print(f"{class_dist.max() / class_dist.min():.2f}x")
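The imbalance ratio above can be turned into per-class weights for training. The sketch below uses the inverse-frequency formula `n_samples / (n_classes * count)`, which is the same heuristic scikit-learn calls "balanced". The label values are a made-up toy sample, not counts from the dataset.

```python
import pandas as pd

# Toy label column with a deliberately skewed distribution.
labels = pd.Series(["normal"] * 6 + ["neptune"] * 3 + ["guess_passwd"] * 1)

counts = labels.value_counts()

# Inverse-frequency weights: rare classes get weights above 1, common ones below 1.
class_weights = len(labels) / (len(counts) * counts)

print(class_weights)
```

Whether weighting, resampling, or neither is appropriate depends on the model and the metric being optimized, so treat this as a starting point.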

## 5. Feature Characteristics

In [None]:
# Get numeric features
numeric_cols = train_df.select_dtypes(include=[np.number]).columns
print(f"Number of numeric features: {len(numeric_cols)}")
print("\nFeature names (first 10):")
for i, col in enumerate(numeric_cols[:10]):
    print(f"{i+1}. {col}")

In [None]:
# Feature statistics
feature_stats = train_df[numeric_cols].describe().T
feature_stats['skewness'] = train_df[numeric_cols].skew()
feature_stats['kurtosis'] = train_df[numeric_cols].kurtosis()

print("Feature Statistics (First 5 features):")
print(feature_stats.head())
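Features with high skewness, typically byte and packet counts, are often compressed with a log transform before modeling. The sketch below illustrates the effect on a synthetic heavy-tailed column; `src_bytes` is used here only as a stand-in name, and the data is generated rather than read from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic heavy-tailed stand-in for a byte-count feature.
demo = pd.DataFrame({"src_bytes": rng.lognormal(mean=3.0, sigma=2.0, size=5000)})

# log1p(x) = log(1 + x) is defined at zero, which matters for count features.
demo["src_bytes_log"] = np.log1p(demo["src_bytes"])

skew_before = demo["src_bytes"].skew()
skew_after = demo["src_bytes_log"].skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")
```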

In [None]:
# Feature variance (computed on raw, unscaled values, so the 0.01 threshold is scale-dependent)
feature_variance = train_df[numeric_cols].var()
low_variance_features = feature_variance[feature_variance < 0.01]

print(f"Features with low variance (< 0.01): {len(low_variance_features)}")
if len(low_variance_features) > 0:
    print("\nLow variance features:")
    print(low_variance_features)
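A fixed variance threshold is sensitive to units: a byte count and a rate in [0, 1] with the same shape have very different raw variances. One option, sketched below under the assumption that min-max scaling is acceptable for this check, is to scale each column to [0, 1] before thresholding. The frame and its column names are illustrative.

```python
import pandas as pd

# Illustrative frame: two columns with the same shape but different units,
# plus one constant column.
demo = pd.DataFrame({
    "src_bytes": [0, 1000, 2000, 3000, 4000],
    "serror_rate": [0.0, 0.25, 0.5, 0.75, 1.0],
    "num_outbound_cmds": [0, 0, 0, 0, 0],
})

# Min-max scale each column; constant columns get a divisor of 1 to avoid 0/0.
col_range = demo.max() - demo.min()
scaled = (demo - demo.min()) / col_range.replace(0, 1)

scaled_variance = scaled.var()
low_variance = scaled_variance[scaled_variance < 0.01].index.tolist()

print(scaled_variance)
print(f"Low-variance features after scaling: {low_variance}")
```

After scaling, the two non-constant columns have identical variance, and only the constant column falls below the threshold.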

## 6. Data Quality Assessment

In [None]:
# Data quality summary
quality_report = {
    'Total Samples': len(train_df),
    'Total Features': len(train_df.columns),
    'Numeric Features': len(numeric_cols),
    'Categorical Features': len(train_df.select_dtypes(include=['object']).columns),
    'Missing Values': train_df.isnull().sum().sum(),
    'Duplicate Rows': train_df.duplicated().sum(),
    'Memory Usage (MB)': round(train_df.memory_usage(deep=True).sum() / 1024**2, 2),
    'Classes': len(train_df[label_col].unique())
}

print("Data Quality Report:")
for key, value in quality_report.items():
    print(f"  {key}: {value}")
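When the report shows a non-zero duplicate count, it is worth checking how removing duplicates shifts the class distribution, since repeated records can inflate some classes. The sketch below does this on a made-up frame; column names and values are illustrative.

```python
import pandas as pd

# Made-up frame in which one attack record is repeated three times.
demo = pd.DataFrame({
    "protocol_type": ["icmp", "icmp", "icmp", "tcp", "tcp"],
    "src_bytes": [1032, 1032, 1032, 215, 300],
    "label": ["smurf", "smurf", "smurf", "normal", "normal"],
})

# duplicated() flags every repeat after the first occurrence of a row.
n_duplicates = int(demo.duplicated().sum())
deduped = demo.drop_duplicates().reset_index(drop=True)

# Compare class shares before and after de-duplication.
before = demo["label"].value_counts(normalize=True)
after = deduped["label"].value_counts(normalize=True)

print(f"Duplicate rows removed: {n_duplicates}")
print("\nClass share before:")
print(before)
print("\nClass share after:")
print(after)
```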

In [None]:
# Comparison with test set
print("\nTrain vs Test Set Comparison:")
print(f"{'Metric':<20} {'Train':<15} {'Test':<15}")
print("-" * 50)
print(f"{'Total Samples':<20} {len(train_df):<15} {len(test_df):<15}")
print(f"{'Total Features':<20} {len(train_df.columns):<15} {len(test_df.columns):<15}")
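# Use the same label column as the training set when the test frame has it
test_label_col = label_col if label_col in test_df.columns else test_df.columns[-1]
print(f"{'Classes':<20} {train_df[label_col].nunique():<15} {test_df[test_label_col].nunique():<15}")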
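Comparing class counts alone can hide the fact that the test set contains labels that never occur in training. A small sketch of that check follows; the label values are a toy sample chosen for illustration, not read from the files.

```python
import pandas as pd

# Toy label columns; in the notebook these would come from train_df and test_df.
train_labels = pd.Series(["normal", "neptune", "smurf", "normal"])
test_labels = pd.Series(["normal", "neptune", "mscan", "saint"])

# Labels present in the test set but absent from training.
unseen = sorted(set(test_labels) - set(train_labels))

# Share of test records that belong to an unseen label.
unseen_share = test_labels.isin(unseen).mean()

print(f"Labels only in test set: {unseen}")
print(f"Share of test records with unseen labels: {unseen_share:.1%}")
```

A multi-class model cannot predict a label it has never seen, so a large unseen share argues for evaluating at a coarser level (for example attack vs. normal).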

## Summary

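This exploration summarizes the characteristics of the KDD IDS dataset used in the XIDS project. Key findings:

1. **Dataset Size**: Millions of connection records in the full KDD99 release; this notebook inspects only a 10,000-row sample
2. **Class Distribution**: Imbalanced; a few classes (normal traffic and high-volume DoS attacks such as `smurf` and `neptune`) dominate, while several attack types have very few records
3. **Feature Space**: Mix of continuous and categorical features (`protocol_type`, `service`, and `flag` are categorical)
4. **Data Quality**: The missing-value, infinite-value, duplicate, and low-variance checks indicate which preprocessing steps are needed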

Next steps:
- Use the preprocessing pipeline to clean and prepare data
- Apply feature selection to reduce dimensionality
- Train models for intrusion detection
- Use explainability techniques for model interpretation
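As a pointer for the preprocessing step, categorical features need to be encoded before most models can use them. The sketch below one-hot encodes a categorical column with `pd.get_dummies` and aligns the test frame to the training columns, so that categories missing from the test sample still get a column. It is an illustration on made-up frames, not the project's preprocessing pipeline.

```python
import pandas as pd

# Made-up frames; the test sample lacks some categories seen in training.
train = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp"],
    "src_bytes": [215, 45, 1032],
})
test = pd.DataFrame({
    "protocol_type": ["tcp", "tcp"],
    "src_bytes": [300, 120],
})

# One-hot encode the categorical column on the training data.
train_encoded = pd.get_dummies(train, columns=["protocol_type"])

# Encode the test data, then align it to the training columns.
# Columns missing from the test sample are filled with 0.
test_encoded = pd.get_dummies(test, columns=["protocol_type"]).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(train_encoded.columns.tolist())
print(test_encoded)
```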