# KDD99 Anomaly Detection Dataset - Exploratory Data Analysis

This notebook provides a comprehensive exploratory data analysis of the **KDD99 dataset** for network intrusion detection. The KDD99 dataset is a well-known benchmark dataset used for evaluating anomaly detection and intrusion detection systems.

## 🎯 Objectives

1. **Understand the dataset structure** - Examine data types, dimensions, and feature characteristics
2. **Analyze data quality** - Check for missing values, duplicates, and data integrity
3. **Explore feature distributions** - Univariate analysis of key features
4. **Examine relationships** - Bivariate and multivariate analysis
5. **Identify patterns** - Attack type distributions and anomaly characteristics
6. **Generate insights** - Findings to inform model development

## 📊 Dataset Overview

The **KDD99 dataset** contains network connection records with:
- **41 features** describing various aspects of network connections
- **Labels** indicating normal traffic vs. different types of attacks
- **Attack categories**: DoS, Probe, R2L, U2R attacks plus Normal traffic
- **~4.9 million records** in the full dataset (about 494k in the commonly used 10% subset)

Let's begin our analysis!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import our custom analysis modules
import sys
sys.path.append('../')

from analyze_src.basic_data_inspection import DataInspector, DataTypesInspectionStrategy, SummaryStatisticsInspectionStrategy, KDD99InspectionStrategy
from analyze_src.missing_values_analysis import KDD99MissingValuesAnalysis
from analyze_src.univariate_analysis import UnivariateAnalyzer, KDD99NumericalUnivariateAnalysis, KDD99CategoricalUnivariateAnalysis
from analyze_src.bivariate_analysis import BivariateAnalyzer, KDD99NumericalVsNumericalAnalysis, KDD99CategoricalVsNumericalAnalysis
from analyze_src.multivariate_analysis import KDD99MultivariateAnalysis

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("🔍 KDD99 Anomaly Detection Dataset - Exploratory Data Analysis")
print("=" * 70)
print("📊 All analysis modules imported successfully!")
print("📈 Ready to explore the network intrusion detection dataset")

## 📥 1. Data Loading and Initial Setup

Let's load the KDD99 dataset using our custom data ingestion module to ensure proper column naming and structure.
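
`KDD99DataIngestor` is project-specific and its implementation is not shown in this notebook. As a rough pandas-only equivalent (an assumption about what the ingestor does, not its actual code), the raw file can be read as a headerless CSV using the 41 documented feature names plus `label`. The helpers `load_kdd99` and `make_record` and the two synthetic records are illustrative only.

```python
import io
import pandas as pd

# The 41 KDD99 feature names in file order, followed by the label column.
KDD99_COLUMNS = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
    "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label",
]


def load_kdd99(source):
    """Read a headerless, comma-separated KDD99 file (path or file-like object)."""
    return pd.read_csv(source, header=None, names=KDD99_COLUMNS)


def make_record(**overrides):
    """Build one synthetic CSV line: every field is 0 unless overridden."""
    values = {name: 0 for name in KDD99_COLUMNS}
    values.update(overrides)
    return ",".join(str(values[name]) for name in KDD99_COLUMNS)


# Two made-up records, used only to demonstrate the file layout.
raw = "\n".join([
    make_record(protocol_type="tcp", service="http", flag="SF",
                src_bytes=181, label="normal."),
    make_record(protocol_type="icmp", service="ecr_i", flag="SF",
                src_bytes=1032, label="smurf."),
])
sample_df = load_kdd99(io.StringIO(raw))
print(sample_df[["protocol_type", "service", "src_bytes", "label"]])
```

Passing a file path instead of the `StringIO` buffer reads a real file the same way. Note that the labels keep their trailing dot (`normal.`), which the later cells rely on.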

In [None]:
# Load the data using our data ingestion system
from src.data_ingester import KDD99DataIngestor

# Initialize the KDD99 data ingester
data_ingester = KDD99DataIngestor(config_path='../config.yaml')

# Load the dataset
print("🔄 Loading KDD99 dataset...")
data_path = '../data/kddcup.data.corrected'
df = data_ingester.ingest(data_path)

print(f"✅ Data loaded successfully!")
print(f"📊 Dataset shape: {df.shape}")

## 🔍 2. Basic Data Inspection

Let's start with a comprehensive inspection of our dataset structure, data types, and basic statistics.

In [None]:
# Step 1: KDD99-specific data inspection
print("🔍 KDD99-Specific Dataset Inspection")
print("=" * 50)

data_inspector = DataInspector(KDD99InspectionStrategy())
data_inspector.execute_inspection(df)

In [None]:
# Step 2: Data types and structure inspection
print("\n" + "="*50)
print("📋 Data Types and Structure Analysis")
print("="*50)

data_inspector.set_strategy(DataTypesInspectionStrategy())
data_inspector.execute_inspection(df)

In [None]:
# Step 3: Summary statistics for numerical and categorical features
print("\n" + "="*50)
print("📈 Summary Statistics Analysis")
print("="*50)

data_inspector.set_strategy(SummaryStatisticsInspectionStrategy())
data_inspector.execute_inspection(df)

## 🚫 3. Missing Values Analysis

Data quality is crucial for effective anomaly detection. Let's analyze missing values in our dataset.

In [None]:
# Perform comprehensive missing values analysis
print("🔍 Missing Values Analysis for KDD99")
print("=" * 50)

missing_values_analyzer = KDD99MissingValuesAnalysis()
missing_values_analyzer.analyze(df)

## 📊 4. Univariate Analysis

Let's examine the distribution of individual features to understand their characteristics and identify potential patterns.

### 4.1 Numerical Features Analysis

In [None]:
# Analyze key numerical features
print("📊 Analyzing Key Numerical Features")
print("=" * 50)

# Select important numerical features for analysis
key_numerical_features = [
    'duration',      # Connection length in seconds
    'src_bytes',     # Bytes from source to destination
    'dst_bytes',     # Bytes from destination to source
    'count',         # Connections to the same host in the past 2 seconds
    'srv_count',     # Connections to the same service in the past 2 seconds
    'serror_rate',   # Fraction (0-1) of connections with SYN errors
    'dst_host_count' # Count of connections to same destination host
]

# Initialize the univariate analyzer for numerical features
univariate_analyzer = UnivariateAnalyzer(KDD99NumericalUnivariateAnalysis())

# Analyze each numerical feature
for feature in key_numerical_features:
    if feature in df.columns:
        print(f"\n🔍 Analyzing: {feature}")
        univariate_analyzer.execute_analysis(df, feature)
    else:
        print(f"⚠️ Feature '{feature}' not found in dataset")

### 4.2 Categorical Features Analysis

In [None]:
# Analyze key categorical features
print("📊 Analyzing Key Categorical Features")
print("=" * 50)

# Select important categorical features for analysis
key_categorical_features = [
    'protocol_type',  # Network protocol (TCP, UDP, ICMP)
    'service',        # Network service on destination
    'flag',           # Status flag of the connection
    'label'           # Attack type label (target variable)
]

# Switch to categorical analysis strategy
univariate_analyzer.set_strategy(KDD99CategoricalUnivariateAnalysis())

# Analyze each categorical feature
for feature in key_categorical_features:
    if feature in df.columns:
        print(f"\n🔍 Analyzing: {feature}")
        univariate_analyzer.execute_analysis(df, feature)
        print("-" * 50)
    else:
        print(f"⚠️ Feature '{feature}' not found in dataset")

## 🌐 5. Multivariate Analysis

Let's examine the complex relationships between multiple features simultaneously using correlation analysis and pair plots.

In [None]:
# Perform comprehensive multivariate analysis
print("📊 Multivariate Analysis for KDD99")
print("=" * 50)

# Initialize KDD99-specific multivariate analyzer
multivariate_analyzer = KDD99MultivariateAnalysis()

# Use every numerical feature for the correlation analysis
# (pass a shorter list here if the resulting plots become unreadable)
numerical_features = df.select_dtypes(include='number').columns.tolist()

# Remove label if it's numerical
if 'label' in numerical_features:
    numerical_features.remove('label')

print(f"📈 Analyzing correlations among {len(numerical_features)} numerical features")
print(f"📊 Features included: {', '.join(numerical_features[:10])}{'...' if len(numerical_features) > 10 else ''}")

# Create a subset for multivariate analysis
selected_features = df[numerical_features]

# Perform the analysis
multivariate_analyzer.analyze(selected_features)

## 🎯 6. Summary and Conclusions

### 📋 **Key Findings from EDA:**

#### **Dataset Characteristics:**
- ✅ **Complete Dataset**: No missing values; duplicate records still need a check, since KDD99 is known to contain many
- 📊 **Rich Feature Set**: 41 features covering various network connection aspects  
- 🎯 **Clear Target**: Well-defined attack categories and normal traffic
- ⚖️ **Class Imbalance**: Attack records (mostly DoS) outnumber normal traffic, while R2L and U2R are very rare

#### **Feature Insights:**
1. **Numerical Features**: Show distinct patterns between normal and attack traffic
2. **Categorical Features**: Protocol, service, and flag provide strong discriminative power
3. **Feature Correlations**: Some redundancy exists, feature selection will help
4. **Attack Signatures**: Each attack type has characteristic feature patterns

#### **Data Quality:**
- 🔍 **No Missing Values**: Dataset is complete and ready for analysis
- 📈 **Consistent Structure**: All features have appropriate data types
- 🎨 **Rich Patterns**: Clear separability between classes observed
- ⚡ **Scalable**: Dataset size is appropriate for modern ML algorithms

### 🚀 **Recommendations for Model Development:**

#### **Preprocessing Pipeline:**
1. **Numerical Features**: Apply StandardScaler or MinMaxScaler
2. **Categorical Features**: Use one-hot encoding; label encoding imposes an artificial order and mainly suits tree-based models
3. **Feature Selection**: Remove highly correlated features
4. **Class Balancing**: Apply SMOTE or adjust class weights
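
The scaling and encoding steps above can be sketched with pandas alone. This is a minimal illustration, not the project's pipeline: `fit_preprocessor` and `transform` are hypothetical helpers, and scikit-learn's `StandardScaler` and `OneHotEncoder` would play the same role inside a real pipeline. The key point is that the statistics and the one-hot column layout are learned from the training split only and then reused on new data.

```python
import pandas as pd


def fit_preprocessor(train, numeric_cols, categorical_cols):
    """Learn scaling statistics and the one-hot layout from training data only."""
    mean = train[numeric_cols].mean()
    # Constant columns have std 0; dividing by 1 instead leaves them at 0.
    std = train[numeric_cols].std(ddof=0).replace(0, 1.0)
    dummy_cols = pd.get_dummies(train[categorical_cols]).columns
    return {
        "numeric_cols": numeric_cols,
        "categorical_cols": categorical_cols,
        "mean": mean,
        "std": std,
        "dummy_cols": dummy_cols,
    }


def transform(df, params):
    """Apply z-score scaling and one-hot encoding with the learned parameters."""
    scaled = (df[params["numeric_cols"]] - params["mean"]) / params["std"]
    dummies = pd.get_dummies(df[params["categorical_cols"]])
    # Align to the training layout: unseen categories are dropped,
    # categories missing from this batch become all-zero columns.
    dummies = dummies.reindex(columns=params["dummy_cols"], fill_value=0)
    return pd.concat([scaled, dummies.astype(float)], axis=1)


# Toy data, for illustration only
train = pd.DataFrame({
    "src_bytes": [0, 100, 200, 300],
    "num_outbound_cmds": [0, 0, 0, 0],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
})
new = pd.DataFrame({
    "src_bytes": [150],
    "num_outbound_cmds": [0],
    "protocol_type": ["tcp"],
})

params = fit_preprocessor(train, ["src_bytes", "num_outbound_cmds"], ["protocol_type"])
X_new = transform(new, params)
print(X_new)
```

Fitting on the training split and only transforming the validation or test split avoids leaking information from held-out data into the scaler.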

#### **Model Selection Strategy:**
1. **Tree-based Models**: XGBoost, Random Forest (handle feature interactions well)
2. **Ensemble Methods**: Voting classifiers, Stacking (leverage different algorithms)
3. **Neural Networks**: Deep learning for complex pattern recognition
4. **Anomaly Detection**: Isolation Forest, One-Class SVM for unsupervised learning

#### **Evaluation Approach:**
- **Metrics**: Focus on Recall (catch all attacks), Precision, F1-score, AUC
- **Cross-validation**: Stratified K-fold to handle class imbalance
- **Attack-specific**: Evaluate performance per attack category
- **Threshold Tuning**: Optimize decision thresholds for operational requirements
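
As a sketch of the attack-specific evaluation mentioned above, per-category precision and recall can be read off a confusion matrix. `per_class_report` is a hypothetical helper and the labels below are made up; scikit-learn's `classification_report` provides the same numbers in practice.

```python
import numpy as np
import pandas as pd


def per_class_report(y_true, y_pred):
    """Precision, recall and F1 for every class, from a confusion matrix."""
    actual = pd.Series(list(y_true), name="actual")
    predicted = pd.Series(list(y_pred), name="predicted")
    labels = sorted(set(actual) | set(predicted))
    cm = pd.crosstab(actual, predicted).reindex(
        index=labels, columns=labels, fill_value=0
    )

    true_pos = pd.Series(np.diag(cm.to_numpy()), index=labels, dtype=float)
    support = cm.sum(axis=1)          # rows: how often each class really occurs
    predicted_count = cm.sum(axis=0)  # columns: how often each class is predicted

    # Undefined ratios (no true or no predicted samples) are left as NaN.
    recall = true_pos / support.where(support > 0)
    precision = true_pos / predicted_count.where(predicted_count > 0)
    f1 = 2 * precision * recall / (precision + recall)
    return pd.DataFrame(
        {"precision": precision, "recall": recall, "f1": f1, "support": support}
    )


# Made-up predictions, for illustration only
y_true = ["normal", "normal", "dos", "dos", "dos", "u2r"]
y_pred = ["normal", "dos", "dos", "dos", "dos", "normal"]
report = per_class_report(y_true, y_pred)
print(report)
```

In this toy example the single `u2r` record is missed entirely, which overall accuracy would hide. That is why rare categories deserve their own row in the report.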

### ✨ **Next Steps:**
1. **Implement preprocessing pipeline** based on EDA insights
2. **Develop baseline models** using identified important features  
3. **Feature engineering** to create interaction features
4. **Model comparison** across different algorithms
5. **Production deployment** with real-time inference capabilities

The comprehensive EDA has provided valuable insights that will guide the development of an effective anomaly detection system for network intrusion detection! 🎉

In [None]:
<VSCode.Cell language="markdown">
# KDD99 Anomaly Detection Dataset - Exploratory Data Analysis

This notebook provides a comprehensive exploratory data analysis of the **KDD99 dataset** for network intrusion detection. The KDD99 dataset is a well-known benchmark dataset used for evaluating anomaly detection and intrusion detection systems.

## 🎯 Objectives

1. **Understand the dataset structure** - Examine data types, dimensions, and feature characteristics
2. **Analyze data quality** - Check for missing values, duplicates, and data integrity
3. **Explore feature distributions** - Univariate analysis of key features
4. **Examine relationships** - Bivariate and multivariate analysis
5. **Identify patterns** - Attack type distributions and anomaly characteristics
6. **Generate insights** - Findings to inform model development

## 📊 Dataset Overview

The **KDD99 dataset** contains network connection records with:
- **41 features** describing various aspects of network connections
- **Labels** indicating normal traffic vs. different types of attacks
- **Attack categories**: DoS, Probe, R2L, U2R attacks plus Normal traffic
- **~4.9 million records** in the full dataset (about 494k in the commonly used 10% subset)

Let's begin our analysis!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import our custom analysis modules
import sys
sys.path.append('../')

from analyze_src.basic_data_inspection import DataInspector, DataTypesInspectionStrategy, SummaryStatisticsInspectionStrategy, KDD99InspectionStrategy
from analyze_src.missing_values_analysis import KDD99MissingValuesAnalysis
from analyze_src.univariate_analysis import UnivariateAnalyzer, KDD99NumericalUnivariateAnalysis, KDD99CategoricalUnivariateAnalysis
from analyze_src.bivariate_analysis import BivariateAnalyzer, KDD99NumericalVsNumericalAnalysis, KDD99CategoricalVsNumericalAnalysis
from analyze_src.multivariate_analysis import KDD99MultivariateAnalysis

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("🔍 KDD99 Anomaly Detection Dataset - Exploratory Data Analysis")
print("=" * 70)
print("📊 All analysis modules imported successfully!")
print("📈 Ready to explore the network intrusion detection dataset")

## 📥 1. Data Loading and Initial Setup

Let's load the KDD99 dataset using our custom data ingestion module to ensure proper column naming and structure.

In [None]:
# Load the data using our data ingestion system
from src.data_ingester import KDD99DataIngestor

# Initialize the KDD99 data ingester
data_ingester = KDD99DataIngestor(config_path='../config.yaml')

# Load the dataset
print("🔄 Loading KDD99 dataset...")
data_path = '../data/kddcup.data.corrected'
df = data_ingester.ingest(data_path)

print(f"✅ Data loaded successfully!")
print(f"📊 Dataset shape: {df.shape}")

## 🔍 2. Basic Data Inspection

Let's start with a comprehensive inspection of our dataset structure, data types, and basic statistics.

In [None]:
# Step 1: KDD99-specific data inspection
print("🔍 KDD99-Specific Dataset Inspection")
print("=" * 50)

data_inspector = DataInspector(KDD99InspectionStrategy())
data_inspector.execute_inspection(df)

In [None]:
# Step 2: Data types and structure inspection
print("\n" + "="*50)
print("📋 Data Types and Structure Analysis")
print("="*50)

data_inspector.set_strategy(DataTypesInspectionStrategy())
data_inspector.execute_inspection(df)

In [None]:
# Step 3: Summary statistics for numerical and categorical features
print("\n" + "="*50)
print("📈 Summary Statistics Analysis")
print("="*50)

data_inspector.set_strategy(SummaryStatisticsInspectionStrategy())
data_inspector.execute_inspection(df)

### 📝 Initial Observations

Based on the initial inspection, we can observe:

1. **Dataset Scale**: The full KDD99 file contains roughly 4.9 million records (about 494,000 in the 10% subset), enough to train models and still hold out sizeable validation and test splits.

2. **Feature Diversity**: The dataset includes both numerical and categorical features covering different aspects of network connections:
   - **Connection basics**: duration, protocol_type, service, flag
   - **Content features**: src_bytes, dst_bytes, hot, num_failed_logins
   - **Traffic features**: count, srv_count, error rates
   - **Host-based features**: dst_host_* features

3. **Attack Distribution**: The classes are heavily imbalanced. Unlike most intrusion detection settings, attack records (dominated by the smurf and neptune DoS floods) outnumber normal traffic, while R2L and U2R attacks are very rare.

4. **Data Quality**: Initial inspection suggests the dataset is well-structured with consistent data types.
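
To make the imbalance noted above concrete, class weights inversely proportional to class frequency can be computed directly from the labels. This is a minimal sketch using the common "balanced" heuristic, `n_samples / (n_classes * class_count)`; the helper name and the toy labels are illustrative.

```python
import pandas as pd


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = pd.Series(list(labels)).value_counts()
    weights = len(labels) / (len(counts) * counts)
    return weights.to_dict()


# Toy label column, for illustration only
toy_labels = ["smurf."] * 6 + ["normal."] * 3 + ["rootkit."] * 1
class_weights = balanced_class_weights(toy_labels)
for name, weight in sorted(class_weights.items()):
    print(f"{name:<10} {weight:.3f}")
```

Rare classes receive the largest weights, so a model that accepts per-class weights is penalized more for missing them.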

## 🚫 3. Missing Values Analysis

Data quality is crucial for effective anomaly detection. Let's analyze missing values in our dataset.

In [None]:
# Perform comprehensive missing values analysis
print("🔍 Missing Values Analysis for KDD99")
print("=" * 50)

missing_values_analyzer = KDD99MissingValuesAnalysis()
missing_values_analyzer.analyze(df)

### ✅ Missing Values Assessment

KDD99 is generally reported to contain no missing values, and the analysis above should confirm that for the loaded file. When it does, it means:

- **No imputation required** - We can focus on feature engineering and model development
- **Reliable analysis** - Statistical measures and visualizations are computed on every record  
- **Robust training** - Models are not affected by the choice of a missing-value handling strategy
- **Consistent evaluation** - No samples have to be dropped before computing performance metrics

Completeness does not rule out other quality issues: KDD99 is known to contain many duplicate records, which should be checked separately.
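
Because completeness is worth verifying rather than assuming, a quick pandas check can report missing cells alongside the duplicates and constant columns that the objectives also mention. `data_quality_summary` is a hypothetical helper, independent of `KDD99MissingValuesAnalysis`, and the frame below is toy data.

```python
import pandas as pd


def data_quality_summary(df):
    """Count missing cells, exact duplicate rows and constant columns."""
    return {
        "rows": len(df),
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "constant_columns": [
            col for col in df.columns if df[col].nunique(dropna=False) <= 1
        ],
    }


# Toy frame with one duplicate row, one missing value and one constant column
toy = pd.DataFrame({
    "duration": [0, 0, 5, 0],
    "service": ["http", "http", "smtp", None],
    "num_outbound_cmds": [0, 0, 0, 0],
})
summary = data_quality_summary(toy)
print(summary)
```

On the real data, `summary = data_quality_summary(df)` gives the same overview in one call.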

## 📊 4. Univariate Analysis

Let's examine the distribution of individual features to understand their characteristics and identify potential patterns.

### 4.1 Numerical Features Analysis

In [None]:
# Analyze key numerical features
print("📊 Analyzing Key Numerical Features")
print("=" * 50)

# Select important numerical features for analysis
key_numerical_features = [
    'duration',      # Connection length in seconds
    'src_bytes',     # Bytes from source to destination
    'dst_bytes',     # Bytes from destination to source
    'count',         # Connections to the same host in the past 2 seconds
    'srv_count',     # Connections to the same service in the past 2 seconds
    'serror_rate',   # Fraction (0-1) of connections with SYN errors
    'dst_host_count' # Count of connections to same destination host
]

# Initialize the univariate analyzer for numerical features
univariate_analyzer = UnivariateAnalyzer(KDD99NumericalUnivariateAnalysis())

# Analyze each numerical feature
for feature in key_numerical_features:
    if feature in df.columns:
        print(f"\n🔍 Analyzing: {feature}")
        univariate_analyzer.execute_analysis(df, feature)
    else:
        print(f"⚠️ Feature '{feature}' not found in dataset")

### 4.2 Categorical Features Analysis

In [None]:
# Analyze key categorical features
print("📊 Analyzing Key Categorical Features")
print("=" * 50)

# Select important categorical features for analysis
key_categorical_features = [
    'protocol_type',  # Network protocol (TCP, UDP, ICMP)
    'service',        # Network service on destination
    'flag',           # Status flag of the connection
    'label'           # Attack type label (target variable)
]

# Switch to categorical analysis strategy
univariate_analyzer.set_strategy(KDD99CategoricalUnivariateAnalysis())

# Analyze each categorical feature
for feature in key_categorical_features:
    if feature in df.columns:
        print(f"\n🔍 Analyzing: {feature}")
        univariate_analyzer.execute_analysis(df, feature)
        print("-" * 50)
    else:
        print(f"⚠️ Feature '{feature}' not found in dataset")

### 📋 Univariate Analysis Insights

From the univariate analysis, we can observe several important patterns:

#### **Numerical Features:**
1. **Duration**: Most connections are very short, with a long tail of longer connections
2. **Bytes transferred**: Highly skewed distributions with many zero-byte connections
3. **Count features**: Show the frequency of connections, important for detecting scanning attacks
4. **Error rates**: Most connections have low error rates, with spikes indicating potential attacks

#### **Categorical Features:**
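1. **Protocol Type**: In the standard KDD99 files, ICMP typically has the most records (driven by the smurf flood), followed by TCP and UDP
2. **Service**: `ecr_i` and `private` typically dominate because of the smurf and neptune attacks, ahead of HTTP and SMTP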
3. **Flag**: Connection status flags show various completion states
4. **Attack Labels**: Confirms the imbalanced nature with normal traffic and various attack types

These patterns suggest that feature engineering and normalization will be important for effective anomaly detection.
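
One common way to tame the heavy skew in features such as `duration` and the byte counts is a `log1p` transform, which keeps zero values at zero. The helper below is a minimal sketch; its name and the skewness threshold of 1.0 are assumptions, not values taken from this analysis.

```python
import numpy as np
import pandas as pd


def log_transform_skewed(df, columns, skew_threshold=1.0):
    """Apply log1p to non-negative columns whose skewness exceeds the threshold."""
    out = df.copy()
    transformed = []
    for col in columns:
        if (out[col] >= 0).all() and out[col].skew() > skew_threshold:
            out[col] = np.log1p(out[col])
            transformed.append(col)
    return out, transformed


# Toy data: one long-tailed byte count and one evenly spread rate
toy = pd.DataFrame({
    "src_bytes": [0, 0, 0, 10, 1_000_000],
    "same_srv_rate": [0.0, 0.25, 0.5, 0.75, 1.0],
})
toy_log, changed = log_transform_skewed(toy, ["src_bytes", "same_srv_rate"])
print(changed)
print(toy_log)
```

Rate features that already live in the 0 to 1 range are left untouched, so only the long-tailed columns change scale.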

## 🔗 5. Bivariate Analysis

Now let's examine relationships between pairs of features to understand how they interact and influence each other.

### 5.1 Numerical vs Numerical Relationships

In [None]:
# Analyze relationships between numerical features
print("📊 Bivariate Analysis: Numerical vs Numerical")
print("=" * 50)

# Initialize bivariate analyzer for numerical features
bivariate_analyzer = BivariateAnalyzer(KDD99NumericalVsNumericalAnalysis())

# Define important feature pairs to analyze
numerical_pairs = [
    ('src_bytes', 'dst_bytes'),          # Bytes sent vs received
    ('count', 'srv_count'),              # Connection counts
    ('duration', 'src_bytes'),           # Duration vs data volume
    ('serror_rate', 'srv_serror_rate'),  # Error rates comparison
]

# Analyze each pair
for feature1, feature2 in numerical_pairs:
    if feature1 in df.columns and feature2 in df.columns:
        print(f"\n🔍 Analyzing relationship: {feature1} vs {feature2}")
        bivariate_analyzer.execute_analysis(df, feature1, feature2)
        print("-" * 60)
    else:
        print(f"⚠️ One or both features not found: {feature1}, {feature2}")

### 5.2 Categorical vs Numerical Relationships

In [None]:
# Analyze relationships between categorical and numerical features
print("📊 Bivariate Analysis: Categorical vs Numerical")
print("=" * 50)

# Switch to categorical vs numerical analysis
bivariate_analyzer.set_strategy(KDD99CategoricalVsNumericalAnalysis())

# Define important categorical-numerical pairs
categorical_numerical_pairs = [
    ('protocol_type', 'duration'),        # Protocol impact on duration
    ('service', 'src_bytes'),             # Service impact on data volume
    ('flag', 'dst_bytes'),                # Connection status vs bytes received
    ('protocol_type', 'count'),           # Protocol vs connection frequency
]

# Analyze each pair
for cat_feature, num_feature in categorical_numerical_pairs:
    if cat_feature in df.columns and num_feature in df.columns:
        print(f"\n🔍 Analyzing relationship: {cat_feature} vs {num_feature}")
        bivariate_analyzer.execute_analysis(df, cat_feature, num_feature)
        print("-" * 60)
    else:
        print(f"⚠️ One or both features not found: {cat_feature}, {num_feature}")

### 🔍 Bivariate Analysis Insights

The bivariate analysis reveals important relationships:

#### **Numerical Relationships:**
1. **Data Flow Patterns**: The relationship between src_bytes and dst_bytes shows different communication patterns for different attack types
2. **Connection Characteristics**: Count features show clustering patterns that differentiate normal from abnormal behavior
3. **Temporal Patterns**: Duration relationships help identify prolonged attacks vs quick scans

#### **Protocol and Service Impact:**
1. **Protocol Differences**: Different protocols show distinct patterns in duration and data volume
2. **Service Characteristics**: Various network services have different typical behaviors
3. **Attack Signatures**: Certain combinations of categorical and numerical features create attack signatures

These relationships will be crucial for:
- **Feature engineering** - Creating interaction features
- **Model selection** - Understanding which algorithms might work best
- **Anomaly detection** - Identifying unusual combinations that indicate attacks
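
A few interaction features built from the pairs analyzed above might look like the sketch below. The derived column names (`bytes_total`, `src_bytes_share`, `bytes_per_second`, `srv_count_ratio`) are hypothetical, and whether they help is something to validate on the real data.

```python
import pandas as pd


def add_interaction_features(df):
    """Derive simple ratio features from byte, duration and count columns."""
    out = df.copy()
    total = out["src_bytes"] + out["dst_bytes"]
    out["bytes_total"] = total
    # Share of traffic sent by the source; 0 when nothing was transferred.
    out["src_bytes_share"] = (out["src_bytes"] / total.where(total > 0)).fillna(0.0)
    # Duration is recorded in whole seconds, so add 1 to avoid dividing by zero.
    out["bytes_per_second"] = total / (out["duration"] + 1)
    # Fraction of recent same-host connections that also hit the same service.
    out["srv_count_ratio"] = out["srv_count"] / out["count"].clip(lower=1)
    return out


# Toy records, for illustration only
toy = pd.DataFrame({
    "duration": [0, 9],
    "src_bytes": [100, 0],
    "dst_bytes": [300, 0],
    "count": [4, 0],
    "srv_count": [2, 0],
})
features = add_interaction_features(toy)
print(features)
```

Guarding every denominator matters here because many KDD99 connections have zero duration and zero bytes.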

## 🌐 6. Multivariate Analysis

Let's examine the complex relationships between multiple features simultaneously using correlation analysis and pair plots.

In [None]:
# Perform comprehensive multivariate analysis
print("📊 Multivariate Analysis for KDD99")
print("=" * 50)

# Initialize KDD99-specific multivariate analyzer
multivariate_analyzer = KDD99MultivariateAnalysis()

# Use every numerical feature for the correlation analysis
# (pass a shorter list here if the resulting plots become unreadable)
numerical_features = df.select_dtypes(include='number').columns.tolist()

# Remove label if it's numerical
if 'label' in numerical_features:
    numerical_features.remove('label')

print(f"📈 Analyzing correlations among {len(numerical_features)} numerical features")
print(f"📊 Features included: {', '.join(numerical_features[:10])}{'...' if len(numerical_features) > 10 else ''}")

# Create a subset for multivariate analysis
selected_features = df[numerical_features]

# Perform the analysis
multivariate_analyzer.analyze(selected_features)

### 🎯 Multivariate Analysis Insights

The multivariate analysis provides crucial insights for our anomaly detection system:

#### **Correlation Patterns:**
1. **Feature Redundancy**: High correlations indicate potential feature redundancy
2. **Feature Groups**: Related features cluster together (e.g., error rates, count features)
3. **Multicollinearity**: Strong correlations may require feature selection or regularization

#### **Attack Detection Implications:**
1. **Signature Patterns**: Attack types show distinct patterns in feature space
2. **Separability**: Good separation between normal and attack patterns
3. **Dimensionality**: Some features may be redundant for classification

#### **Model Development Recommendations:**
- **Feature Selection**: Remove highly correlated features
- **Regularization**: Use L1/L2 regularization to handle multicollinearity  
- **Dimensionality Reduction**: Consider PCA for feature reduction
- **Ensemble Methods**: May handle feature correlations better than single models
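
The feature selection recommendation can be sketched as a scan of the upper triangle of the absolute correlation matrix, flagging the later column of each highly correlated pair. `find_correlated_features` is a hypothetical helper, the 0.95 threshold is an assumption to tune, and the data is a toy example.

```python
import numpy as np
import pandas as pd


def find_correlated_features(df, threshold=0.95):
    """Return columns whose |Pearson r| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(upper_mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data: the second column is an exact linear function of the first
base = [1, 2, 3, 4, 5, 6]
toy = pd.DataFrame({
    "serror_rate": base,
    "srv_serror_rate": [2 * value + 1 for value in base],
    "duration": [3, 1, 4, 1, 5, 9],
})
to_drop = find_correlated_features(toy, threshold=0.95)
print(to_drop)
reduced = toy.drop(columns=to_drop)
```

Only one column of each correlated pair is flagged, so the information they share stays in the dataset.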

## 📋 7. Key Feature Analysis

Let's focus on the most important features for anomaly detection based on our analysis.

In [None]:
# Analyze the distribution of attack types in detail
print("🎯 Detailed Attack Type Analysis")
print("=" * 50)

if 'label' in df.columns:
    # Get label distribution
    label_counts = df['label'].value_counts()
    
    print(f"📊 Total unique labels (including normal): {len(label_counts)}")
    print(f"📈 Total samples: {len(df):,}")
    
    # Show the 20 most frequent labels
    print("\n🔝 Top 20 Labels:")
    print("-" * 40)
    
    top_20 = label_counts.head(20)
    for i, (attack, count) in enumerate(top_20.items(), 1):
        percentage = (count / len(df)) * 100
        print(f"{i:2d}. {attack:<20} {count:>8,} ({percentage:5.2f}%)")
    
    # Create attack categories
    attack_types = {
        'normal': ['normal.'],
        'dos': ['back.', 'land.', 'neptune.', 'pod.', 'smurf.', 'teardrop.'],
        'probe': ['ipsweep.', 'nmap.', 'portsweep.', 'satan.'],
        'r2l': ['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'],
        'u2r': ['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.']
    }
    
    # Create mapping
    attack_category = {}
    for category, attacks in attack_types.items():
        for attack in attacks:
            attack_category[attack] = category
    
    # Map categories
    df_temp = df.copy()
    df_temp['attack_category'] = df_temp['label'].map(attack_category)
    
    # Show category distribution
    print("\n🏷️ Attack Category Distribution:")
    print("-" * 40)
    
    category_counts = df_temp['attack_category'].value_counts()
    for category, count in category_counts.items():
        percentage = (count / len(df)) * 100
        print(f"{category:<10} {count:>8,} ({percentage:5.2f}%)")
    
    # Visualize attack categories in a 2x2 grid of subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('KDD99 Attack Type Distribution Analysis', fontsize=16, fontweight='bold')
    
    # 1. Attack category pie chart
    category_counts.plot(kind='pie', ax=axes[0, 0], autopct='%1.1f%%')
    axes[0, 0].set_title('Attack Category Distribution')
    axes[0, 0].set_ylabel('')
    
    # 2. Top 10 specific attacks (normal traffic excluded)
    top_10_attacks = label_counts.drop('normal.', errors='ignore').head(10)
    top_10_attacks.plot(kind='barh', ax=axes[0, 1])
    axes[0, 1].set_title('Top 10 Specific Attack Types')
    axes[0, 1].set_xlabel('Count')
    
    # 3. Attack category bar chart
    category_counts.plot(kind='bar', ax=axes[1, 0])
    axes[1, 0].set_title('Attack Category Counts')
    axes[1, 0].set_xlabel('Attack Category')
    axes[1, 0].set_ylabel('Count')
    axes[1, 0].tick_params(axis='x', rotation=45)
    
    # 4. Log scale view of all attacks
    axes[1, 1].bar(range(len(label_counts)), label_counts.values)
    axes[1, 1].set_yscale('log')
    axes[1, 1].set_title('All Attack Types (Log Scale)')
    axes[1, 1].set_xlabel('Attack Type Index')
    axes[1, 1].set_ylabel('Count (Log Scale)')
    
    plt.tight_layout()
    plt.show()
    
else:
    print("⚠️ Label column not found in dataset")

## 🔍 8. Feature Importance Analysis

Let's identify which features are most discriminative for anomaly detection.

In [None]:
# Analyze feature importance using statistical measures
print("📊 Feature Importance Analysis")
print("=" * 50)

if 'label' in df.columns:
    # Create binary labels for anomaly detection
    df_analysis = df.copy()
    df_analysis['is_anomaly'] = (df_analysis['label'] != 'normal.').astype(int)
    
    # Calculate feature importance using correlation with anomaly label
    numerical_features = [
        col for col in df_analysis.select_dtypes(include='number').columns
        if col not in ('is_anomaly', 'label')
    ]
    
    # Absolute Pearson correlation with the binary anomaly flag.
    # Constant columns (e.g. num_outbound_cmds) produce NaN and are dropped,
    # because NaN values make the sort order unreliable.
    feature_correlations = (
        df_analysis[numerical_features]
        .corrwith(df_analysis['is_anomaly'])
        .abs()
        .dropna()
    )
    
    # Sort by importance: list of (feature, correlation) tuples, highest first
    important_features = list(feature_correlations.sort_values(ascending=False).items())
    
    print("🔝 Top 15 Most Important Features (by correlation with anomaly):")
    print("-" * 60)
    
    for i, (feature, correlation) in enumerate(important_features[:15], 1):
        print(f"{i:2d}. {feature:<25} {correlation:.4f}")
    
    # Visualize feature importance
    plt.figure(figsize=(12, 8))
    
    top_features = important_features[:20]
    features_names = [f[0] for f in top_features]
    correlations = [f[1] for f in top_features]
    
    plt.barh(range(len(top_features)), correlations)
    plt.yticks(range(len(top_features)), features_names)
    plt.xlabel('Absolute Correlation with Anomaly Label')
    plt.title('Top 20 Features by Importance for Anomaly Detection')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
    # Statistical analysis by attack category
    print("\n📈 Statistical Analysis by Attack Category:")
    print("-" * 50)
    
    attack_category = {}
    attack_types = {
        'normal': ['normal.'],
        'dos': ['back.', 'land.', 'neptune.', 'pod.', 'smurf.', 'teardrop.'],
        'probe': ['ipsweep.', 'nmap.', 'portsweep.', 'satan.'],
        'r2l': ['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'],
        'u2r': ['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.']
    }
    
    for category, attacks in attack_types.items():
        for attack in attacks:
            attack_category[attack] = category
    
    df_analysis['attack_category'] = df_analysis['label'].map(attack_category)
    df_analysis = df_analysis.dropna(subset=['attack_category'])
    
    # Show mean values of top features by attack category
    top_5_features = [f[0] for f in important_features[:5]]
    
    category_stats = df_analysis.groupby('attack_category')[top_5_features].mean()
    
    print("📊 Mean values of top 5 features by attack category:")
    print(category_stats.round(4))
    
    # Visualize feature patterns by attack category
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    fig.suptitle('Feature Patterns by Attack Category', fontsize=16, fontweight='bold')
    
    for i, feature in enumerate(top_5_features):
        row = i // 3
        col = i % 3
        
        sns.boxplot(data=df_analysis, x='attack_category', y=feature, ax=axes[row, col])
        axes[row, col].set_title(f'{feature} by Attack Category')
        axes[row, col].tick_params(axis='x', rotation=45)
    
    # Remove any unused subplots
    for ax in axes.flat[len(top_5_features):]:
        fig.delaxes(ax)
    
    plt.tight_layout()
    plt.show()
    
else:
    print("⚠️ Label column not found - cannot perform importance analysis")

## 📊 9. Data Quality and Preprocessing Insights

Based on our comprehensive EDA, let's summarize key insights for data preprocessing and model development.

In [None]:
# Comprehensive data quality assessment
print("🔍 Data Quality Assessment and Preprocessing Recommendations")
print("=" * 70)

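# 1. Data completeness (computed, not assumed)
missing_total = int(df.isnull().sum().sum())
print("1️⃣ DATA COMPLETENESS:")
if missing_total == 0:
    print("   ✅ No missing values detected")
    print(f"   ✅ All {len(df.columns)} columns are complete")
else:
    print(f"   ⚠️ {missing_total:,} missing values detected")
print(f"   ✅ Dataset size: {len(df):,} samples")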

# 2. Data types and structure
numerical_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()

print(f"\n2️⃣ DATA STRUCTURE:")
print(f"   📊 Numerical features: {len(numerical_cols)}")
print(f"   🏷️ Categorical features: {len(categorical_cols)}")
print(f"   📏 Total features: {len(df.columns)}")

# 3. Class imbalance analysis
if 'label' in df.columns:
    print(f"\n3️⃣ CLASS DISTRIBUTION:")
    
    # Create attack categories
    attack_types = {
        'normal': ['normal.'],
        'dos': ['back.', 'land.', 'neptune.', 'pod.', 'smurf.', 'teardrop.'],
        'probe': ['ipsweep.', 'nmap.', 'portsweep.', 'satan.'],
        'r2l': ['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'],
        'u2r': ['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.']
    }
    
    attack_category = {}
    for category, attacks in attack_types.items():
        for attack in attacks:
            attack_category[attack] = category
    
    df_temp = df.copy()
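    # Labels missing from the mapping become 'unknown' rather than being dropped as NaN
    df_temp['attack_category'] = df_temp['label'].map(attack_category).fillna('unknown')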
    category_counts = df_temp['attack_category'].value_counts()
    
    for category, count in category_counts.items():
        percentage = (count / len(df)) * 100
        print(f"   {category.upper():<8}: {count:>8,} samples ({percentage:5.2f}%)")

# 4. Feature scaling requirements
print(f"\n4️⃣ FEATURE SCALING ANALYSIS:")
for feature in numerical_cols[:5]:  # Check first 5 numerical features
    min_val = df[feature].min() 
    max_val = df[feature].max()
    mean_val = df[feature].mean()
    std_val = df[feature].std()
    
    print(f"   {feature:<20}: Range=[{min_val:.2f}, {max_val:.2f}], μ={mean_val:.2f}, σ={std_val:.2f}")

# 5. Categorical encoding requirements
print(f"\n5️⃣ CATEGORICAL ENCODING NEEDS:")
for feature in categorical_cols:
    unique_count = df[feature].nunique()
    print(f"   {feature:<20}: {unique_count:>3} unique values")

print(f"\n" + "="*70)
print("🎯 PREPROCESSING RECOMMENDATIONS:")
print("="*70)

print("\n✅ RECOMMENDED PREPROCESSING STEPS:")
print("   1. Feature Scaling: StandardScaler or MinMaxScaler for numerical features (fit on the training split only)")
print("   2. Categorical Encoding: One-hot encoding for categorical features")
print("   3. Feature Selection: Remove highly correlated features (|r| > 0.95)")
print("   4. Class Balancing: Consider SMOTE or class weights for imbalanced classes")
print("   5. Outlier Treatment: Prefer robust scaling; avoid dropping extreme rows blindly, since they may be attacks")

print("\n📊 MODEL DEVELOPMENT INSIGHTS:")
print("   • No missing values were found, so no imputation step is needed")
print("   • One-hot encoding enlarges the feature space, which may benefit from dimensionality reduction")
print("   • Class imbalance makes plain accuracy misleading; prefer per-class recall, precision, and F1")
print("   • Correlated features are tolerated by tree ensembles but can destabilize linear models")
print("   • Attack categories show distinct feature patterns in the boxplots above")

print("\n🚀 READY FOR MODEL DEVELOPMENT!")
print("   The dataset can support anomaly detection models")
print("   once the preprocessing steps above are applied.")

## 🎯 10. Summary and Conclusions

### 📋 **Key Findings from EDA:**

#### **Dataset Characteristics:**
- ✅ **Complete Dataset**: No missing values detected
- 📊 **Rich Feature Set**: 41 features covering various network connection aspects
- 🎯 **Clear Target**: Labels map to four attack categories (DoS, Probe, R2L, U2R) plus normal traffic
- ⚖️ **Class Imbalance**: Attack categories are unevenly represented, as is typical of intrusion detection datasets
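
To make the imbalance point concrete, the sketch below computes an imbalance ratio and inverse-frequency class weights from category counts. The counts are made-up placeholders, not values produced by this notebook, and the formula `n_samples / (n_classes * count)` is the heuristic scikit-learn applies for `class_weight="balanced"`.

```python
import pandas as pd

# Hypothetical category counts (placeholders, not the notebook's actual output).
category_counts = pd.Series({"normal": 60, "dos": 30, "probe": 8, "r2l": 2})

n_samples = category_counts.sum()
n_classes = len(category_counts)

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = category_counts.max() / category_counts.min()

# Inverse-frequency weights: rarer classes receive proportionally larger weights.
class_weights = n_samples / (n_classes * category_counts)

print(f"Imbalance ratio: {imbalance_ratio:.1f}")
print(class_weights.round(3))
```

In the notebook itself, `category_counts` from the class distribution step could be passed through the same arithmetic.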

#### **Feature Insights:**
1. **Numerical Features**: Show distinct patterns between normal and attack traffic
2. **Categorical Features**: Protocol, service, and flag provide strong discriminative power
3. **Feature Correlations**: Some redundancy exists, feature selection will help
4. **Attack Signatures**: Each attack type has characteristic feature patterns

#### **Data Quality:**
- 🔍 **No Missing Values**: No imputation step is needed
- 📈 **Consistent Structure**: All features have appropriate data types
- 🎨 **Rich Patterns**: Attack categories show visibly different feature distributions
- ⚡ **Scalable**: At roughly 500k records, the dataset is small enough for standard in-memory ML workflows

### 🚀 **Recommendations for Model Development:**

#### **Preprocessing Pipeline:**
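1. **Numerical Features**: Apply StandardScaler or MinMaxScaler, fitted on the training split only
2. **Categorical Features**: Use one-hot encoding; label encoding imposes an arbitrary order, so reserve it for tree-based models
3. **Feature Selection**: Remove highly correlated features (|r| > 0.95)
4. **Class Balancing**: Apply SMOTE to the training split only, or adjust class weights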

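
The first three steps above could be sketched as follows, assuming scikit-learn is available. The toy frame, its values, and the `src_bytes_copy` column are hypothetical stand-ins for illustration; `drop_correlated` is a helper defined here, not part of the notebook's `analyze_src` modules.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def drop_correlated(frame, threshold=0.95):
    """Drop one feature from each pair whose |Pearson r| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


# Toy stand-in for a few KDD99-style columns (values are made up).
toy = pd.DataFrame({
    "duration": [0, 2, 0, 5, 1, 3],
    "src_bytes": [181, 239, 0, 1032, 520, 0],
    "src_bytes_copy": [362, 478, 0, 2064, 1040, 0],  # exactly 2 * src_bytes
    "protocol_type": ["tcp", "tcp", "icmp", "udp", "tcp", "icmp"],
})

# Step 3: remove redundant numerical features.
reduced, dropped = drop_correlated(toy.select_dtypes(include="number"))
numeric_cols = reduced.columns.tolist()
categorical_cols = ["protocol_type"]

# Steps 1 and 2: scale numerical columns, one-hot encode categorical ones.
# Columns not listed (here the dropped one) are discarded by default.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X = preprocessor.fit_transform(toy)

print("Dropped:", dropped)
print("Transformed shape:", X.shape)
```

With real data, `fit_transform` would be called on the training split and `transform` on the validation and test splits, so that scaling statistics and category vocabularies do not leak from held-out data.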
#### **Model Selection Strategy:**
1. **Tree-based Models**: XGBoost, Random Forest (handle feature interactions well)
2. **Ensemble Methods**: Voting classifiers, Stacking (leverage different algorithms)
3. **Neural Networks**: Deep learning for complex pattern recognition
4. **Anomaly Detection**: Isolation Forest, One-Class SVM for unsupervised learning
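
As a rough illustration of the supervised and unsupervised options in this list, the sketch below fits a class-weighted Random Forest and an Isolation Forest on synthetic data. The data, the hyperparameters, and the `contamination` value are arbitrary assumptions chosen for a quick demonstration, not tuned settings for KDD99.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 "normal" points near the origin, 25 "attack" points far away.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
attack = rng.normal(loc=6.0, scale=1.0, size=(25, 4))
X = np.vstack([normal, attack])
y = np.array([0] * 500 + [1] * 25)  # 1 = attack

# Supervised baseline: class_weight="balanced" upweights the rare attack class.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)

# Unsupervised baseline: no labels are used; contamination is a guess at the attack rate.
iso = IsolationForest(contamination=0.05, random_state=0)
iso_pred = iso.fit_predict(X)            # +1 = inlier, -1 = outlier
iso_flags = (iso_pred == -1).astype(int)

print(f"Random Forest training accuracy: {train_accuracy:.3f}")
print(f"Isolation Forest flagged {iso_flags.sum()} of {len(X)} points")
```

Training accuracy is shown only to confirm the model fits; any real comparison should use held-out data and the per-category metrics discussed in the evaluation approach.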

#### **Evaluation Approach:**
- **Metrics**: Prioritize recall (missed attacks are costly) alongside precision, F1-score, and AUC; avoid plain accuracy, which class imbalance inflates
- **Cross-validation**: Stratified K-fold, so every fold preserves the class proportions
- **Attack-specific**: Evaluate performance per attack category
- **Threshold Tuning**: Optimize decision thresholds for operational requirements
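
The attack-specific evaluation above can be sketched with a confusion matrix and per-category precision and recall. The true and predicted labels below are invented for illustration; in practice they would come from a fitted model on a held-out split.

```python
import pandas as pd

# Hypothetical true and predicted categories for ten connections.
y_true = pd.Series(
    ["normal", "dos", "dos", "probe", "r2l", "normal", "dos", "u2r", "probe", "normal"],
    name="actual",
)
y_pred = pd.Series(
    ["normal", "dos", "dos", "probe", "normal", "normal", "dos", "normal", "dos", "normal"],
    name="predicted",
)

# Square confusion matrix: rows are actual categories, columns are predictions.
labels = sorted(set(y_true) | set(y_pred))
confusion = pd.crosstab(y_true, y_pred).reindex(index=labels, columns=labels, fill_value=0)

true_positives = pd.Series({c: confusion.loc[c, c] for c in labels})
recall = true_positives / confusion.sum(axis=1)
# A category that is never predicted has undefined precision; report it as 0.
precision = (true_positives / confusion.sum(axis=0)).fillna(0.0)
accuracy = (y_true == y_pred).mean()

report = pd.DataFrame({
    "precision": precision,
    "recall": recall,
    "support": confusion.sum(axis=1),
})
print(report.round(2))
print(f"Overall accuracy: {accuracy:.2f}")
```

In this toy example the rare `r2l` and `u2r` rows have zero recall even though most predictions are correct overall, which is the failure mode a single aggregate score would hide.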

### ✨ **Next Steps:**
1. **Implement preprocessing pipeline** based on EDA insights
2. **Develop baseline models** using identified important features  
3. **Feature engineering** to create interaction features
4. **Model comparison** across different algorithms
5. **Production deployment** with real-time inference capabilities

This EDA lays the groundwork for building a network intrusion detection system on KDD99: it identifies the preprocessing the data needs, the models worth trying, and the metrics to judge them by. 🎉