# Network Intrusion Detection System (NIDS)
## Notebook 1: Data Exploration & Exploratory Data Analysis (EDA)

**Team Member:** Member 1  
**Dataset:** CIC-IDS2017 (Multi-class Classification)  
**Date:** November 24, 2025  

**Objectives:**
1. Load and combine all CIC-IDS2017 CSV files
2. Perform univariate outlier detection using manual IQR method
3. Generate correlation heatmaps
4. Analyze target variable distribution
5. Provide initial insights for preprocessing

---

## 0. Automatic Dataset Download

This section automatically downloads the CIC-IDS2017 dataset from Kaggle.

**First-time users:** The dataset (~844 MB) will be downloaded automatically.
Already downloaded datasets will be reused from cache.


In [1]:
# Install required packages if not already installed
import sys
import subprocess
import tqdm as notebook_tqdm

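def install_package(import_name, pip_name=None):
    """Import a package; pip-install it first if the import fails."""
    try:
        __import__(import_name)
    except ImportError:
        pip_name = pip_name or import_name
        print(f"Installing {pip_name}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])

print("Checking and installing dependencies...")
# Import names and pip names can differ: scikit-learn is imported as 'sklearn',
# but 'pip install sklearn' targets a deprecated placeholder package.
required_packages = {
    'numpy': 'numpy', 'pandas': 'pandas', 'matplotlib': 'matplotlib',
    'seaborn': 'seaborn', 'scipy': 'scipy', 'kagglehub': 'kagglehub',
    'pyarrow': 'pyarrow', 'sklearn': 'scikit-learn',
}

for import_name, pip_name in required_packages.items():
    install_package(import_name, pip_name)

print("Dependencies checked!")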
# Install kagglehub if not already installed

try:
    import kagglehub
except ImportError:
    print("Installing kagglehub...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "kagglehub"])
    import kagglehub

print("kagglehub version:", kagglehub.__version__)

Checking and installing dependencies...
Dependencies checked!
kagglehub version: 0.3.13


In [2]:
# Set Kaggle API credentials
import os

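# Read the Kaggle API key from the environment instead of hardcoding it.
# Never commit a real key to a notebook; export KAGGLE_KEY before launching Jupyter.
if os.environ.get('KAGGLE_KEY'):
    print("Kaggle authentication configured")
else:
    print("KAGGLE_KEY is not set; the download may still work if the dataset is public")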

Kaggle authentication configured


In [3]:
# Download CIC-IDS2017 dataset from Kaggle (with STRENGTHENED verification)
import shutil
from pathlib import Path

# Define target directory
target_dir = Path('../data/CICIDS2017')
target_dir.mkdir(parents=True, exist_ok=True)

# Enhanced dataset check: verify files exist AND are valid
def verify_dataset(directory):
    """
    Verify dataset files exist and are valid.
    Returns: (is_valid, csv_files)
    """
    csv_files = list(directory.glob('*.csv'))
    
    # Check 1: Minimum file count
    if len(csv_files) < 7:
        return False, csv_files
    
    # Check 2: Verify each file is readable and has reasonable size
    total_size_mb = 0
    for csv_file in csv_files:
        try:
            size_mb = csv_file.stat().st_size / (1024**2)
            total_size_mb += size_mb
            
            # Each file should be at least 10MB (sanity check)
            if size_mb < 10:
                print(f"Warning: {csv_file.name} is suspiciously small ({size_mb:.2f} MB)")
                return False, csv_files
        except Exception as e:
            print(f"Error accessing {csv_file.name}: {e}")
            return False, csv_files
    
    # Check 3: Total size should fall in the 500-1500 MB range
    if not (500 <= total_size_mb <= 1500):
        print(f"Warning: Total dataset size ({total_size_mb:.2f} MB) is outside expected range")
        return False, csv_files
    
    return True, csv_files

# Verify dataset
dataset_valid, existing_csv_files = verify_dataset(target_dir)

if dataset_valid:
    print("Dataset already exists and is VALID!")
    print(f"Found {len(existing_csv_files)} CSV files in {target_dir}")
    print("\nExisting files:")
    for f in sorted(existing_csv_files):
        size_mb = f.stat().st_size / (1024**2)
        print(f"  - {f.name} ({size_mb:.2f} MB)")
    print("\nSkipping download. Dataset is ready to use!")
else:
    print("Dataset not found or invalid. Downloading from Kaggle...")
    print(f"Found {len(existing_csv_files)} files, expected 7-8")
    print("This may take a few minutes (~844 MB)")
    print("-" * 70)
    
    # Download dataset
    try:
        downloaded_path = kagglehub.dataset_download("chethuhn/network-intrusion-dataset")
        print(f"\nDataset downloaded to: {downloaded_path}")
        
        # Copy CSV files to our data directory
        downloaded_path_obj = Path(downloaded_path)
        csv_files = list(downloaded_path_obj.glob('*.csv'))
        
        if csv_files:
            print(f"\nCopying {len(csv_files)} CSV files to {target_dir}...")
            for csv_file in csv_files:
                target_file = target_dir / csv_file.name
                shutil.copy2(csv_file, target_file)
                size_mb = target_file.stat().st_size / (1024**2)
                print(f"  Copied: {csv_file.name} ({size_mb:.2f} MB)")
            print("\nDataset download complete!")
        else:
            print("\nERROR: No CSV files found in downloaded dataset")
            print(f"Please check: {downloaded_path}")
            raise FileNotFoundError("Downloaded dataset contains no CSV files")
    except Exception as e:
        print(f"\nERROR during download: {e}")
        raise

# Final verification with enhanced checks
final_valid, final_csv_files = verify_dataset(target_dir)
final_csv_count = len(final_csv_files)
print("\n" + "="*70)
print(f"VERIFICATION: {final_csv_count} CSV files in {target_dir}")
if final_valid:
    print("Status: Dataset is VALID and ready for use!")
else:
    print("ERROR: Dataset verification failed!")
    print(f"Expected 7-8 valid files, found {final_csv_count}")
    raise ValueError("Dataset verification failed - please re-download")
print("="*70)


Dataset already exists and is VALID!
Found 8 CSV files in ../data/CICIDS2017

Existing files:
  - Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv (73.55 MB)
  - Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv (73.34 MB)
  - Friday-WorkingHours-Morning.pcap_ISCX.csv (55.62 MB)
  - Monday-WorkingHours.pcap_ISCX.csv (168.73 MB)
  - Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv (79.25 MB)
  - Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv (49.61 MB)
  - Tuesday-WorkingHours.pcap_ISCX.csv (128.82 MB)
  - Wednesday-workingHours.pcap_ISCX.csv (214.74 MB)

Skipping download. Dataset is ready to use!

VERIFICATION: 8 CSV files in ../data/CICIDS2017
Status: Dataset is VALID and ready for use!


## 1. Import Libraries

Import all necessary libraries for data manipulation, visualization, and analysis.

In [4]:
# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

# Utilities
import glob
import warnings
import os
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.4f}'.format)

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print(" Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

 Libraries imported successfully!
NumPy version: 2.0.2
Pandas version: 2.2.2


In [None]:
# ============================================================================
# LOCAL OUTPUT SAVER (for Colab VS Code Extension)
# ============================================================================
# This ensures all outputs are saved to your local machine
# ============================================================================

import os
from pathlib import Path

# Detect if running on Colab
IN_COLAB = 'COLAB_GPU' in os.environ or 'google.colab' in str(get_ipython())

if IN_COLAB:
    # Mount Google Drive
    try:
        from google.colab import drive
        drive.mount('/content/drive', force_remount=True)
        
        # Set base path to your local project in Drive
        # IMPORTANT: Update this path to match your Google Drive structure
        BASE_PATH = '/content/drive/MyDrive/MLCEProject'
        
        # Create output directories if they don't exist
        for dir_name in ['outputs', 'models', 'data']:
            Path(f'{BASE_PATH}/{dir_name}').mkdir(parents=True, exist_ok=True)
        
        print("✓ Google Drive mounted")
        print(f"✓ Base path: {BASE_PATH}")
        print(f"✓ Outputs will save to: {BASE_PATH}/outputs")
        print(f"✓ Models will save to: {BASE_PATH}/models")
        print(f"✓ Data will save to: {BASE_PATH}/data")
        
    except Exception as e:
        print(f"⚠️  Could not mount Drive: {e}")
        print("Using Colab local storage (will not sync automatically)")
        BASE_PATH = '/content'
else:
    # Running locally - use relative paths
    BASE_PATH = '..'
    print("✓ Running locally")
    print("✓ Using relative paths (../outputs, ../models, ../data)")

# Helper functions for saving with correct paths
def get_output_path(filename):
    """Get correct path for output file"""
    return f"{BASE_PATH}/outputs/{filename}"

def get_model_path(filename):
    """Get correct path for model file"""
    return f"{BASE_PATH}/models/{filename}"

def get_data_path(filename):
    """Get correct path for data file"""
    return f"{BASE_PATH}/data/{filename}"

print("\n✓ Local save helper ready!")
print("\nUse these functions to save files:")
print("  - get_output_path('plot.png')  → saves to outputs/")
print("  - get_model_path('model.pkl')  → saves to models/")
print("  - get_data_path('data.csv')    → saves to data/\n")


---
## 2. Load CIC-IDS2017 Dataset

Load all CSV files from the `data/CICIDS2017/` directory and combine them into a single DataFrame.

**Expected files:**
- Monday-WorkingHours.pcap_ISCX.csv
- Tuesday-WorkingHours.pcap_ISCX.csv
- Wednesday-workingHours.pcap_ISCX.csv
- Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
- Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
- Friday-WorkingHours-Morning.pcap_ISCX.csv
- Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
- Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
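
The load-and-combine step can be illustrated on a small scale. The sketch below uses two in-memory CSV strings as stand-ins for the per-day files (the contents are made up for illustration) and assumes the raw headers carry leading spaces, as the column-cleaning step in this notebook suggests.

```python
import io
import pandas as pd

# Two tiny in-memory CSVs standing in for per-day files (illustrative data only).
# The headers deliberately start with a space to mimic untidy raw column names.
csv_a = " Destination Port, Label\n80,BENIGN\n443,BENIGN\n"
csv_b = " Destination Port, Label\n22,SSH-Patator\n"

# Read each file into its own DataFrame, then stack them row-wise.
frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]
combined = pd.concat(frames, ignore_index=True)

# Strip whitespace so columns can be addressed as 'Label' rather than ' Label'.
combined.columns = combined.columns.str.strip()

print(combined.shape)
print(list(combined.columns))
```

Stripping the column names once, right after concatenation, means later cells do not need to guess between `' Label'` and `'Label'`.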

In [5]:
# Define path to dataset
data_path = '../data/CICIDS2017/'

# Find all CSV files
csv_files = glob.glob(os.path.join(data_path, '*.csv'))

if len(csv_files) == 0:
    raise FileNotFoundError(f"No CSV files found in {data_path}. Please run dataset download cell first.")

print(f"Found {len(csv_files)} CSV files:")
for file in csv_files:
    print(f"  - {os.path.basename(file)}")

# OPTIMIZATION: Function to reduce memory usage (Robust Version)
def optimize_dtypes(df):
    # Handle duplicate columns if any (keep first)
    df = df.loc[:, ~df.columns.duplicated()]
    
    # Downcast floats to float32
    float_cols = df.select_dtypes(include=['float64']).columns
    for col in float_cols:
        try:
            df[col] = df[col].astype('float32')
        except Exception:
            pass
            
    # Downcast integers
    int_cols = df.select_dtypes(include=['int64']).columns
    for col in int_cols:
        try:
            df[col] = pd.to_numeric(df[col], downcast='integer')
        except Exception:
            pass
            
    return df

# Load and combine all CSV files with ERROR HANDLING & OPTIMIZATION
print("\nLoading CSV files (with PyArrow engine if available)...")
dfs = []
failed_files = []

for file in csv_files:
    try:
        print(f"Reading: {os.path.basename(file)}...", end=" ")
        # Try using pyarrow engine for faster reading
        try:
            df_temp = pd.read_csv(file, engine='pyarrow')
        except Exception:
            # pyarrow is missing or could not parse this file; use the default parser
            df_temp = pd.read_csv(file)
        
        # Validate loaded data
        if df_temp.empty:
            raise ValueError(f"File {os.path.basename(file)} is empty")
        
        if df_temp.shape[1] != 79:
            print(f"\nWarning: Expected 79 columns, got {df_temp.shape[1]}")
        
        # Optimize memory immediately
        df_temp = optimize_dtypes(df_temp)
        
        dfs.append(df_temp)
        print(f" Shape: {df_temp.shape} ✓")
        
    except Exception as e:
        print(f" FAILED! Error: {e}")
        failed_files.append(file)

# Check if any files failed to load
if failed_files:
    print(f"\nERROR: Failed to load {len(failed_files)} file(s):")
    for f in failed_files:
        print(f"  - {os.path.basename(f)}")
    raise RuntimeError("Some CSV files could not be loaded. Please check the dataset.")

if not dfs:
    raise RuntimeError("No data was loaded successfully!")

# Combine all dataframes
print("\nCombining all dataframes...")
df = pd.concat(dfs, ignore_index=True)

# Clean column names (remove leading/trailing spaces) immediately after loading
df.columns = df.columns.str.strip()

# CLEANUP: Fix encoding issues in 'Label' column
# The dataset contains replacement characters (�) in some Web Attack labels
if 'Label' in df.columns:
    print("\nCleaning 'Label' column (fixing encoding issues)...", end=" ")
    # Replace replacement character with hyphen
    df['Label'] = df['Label'].str.replace('\ufffd', '-', regex=False)
    # Also replace any other potential non-standard characters if needed
    # Ensure labels are stripped of whitespace
    df['Label'] = df['Label'].str.strip()
    print("✓")

# Verify combined dataset
if df.empty:
    raise ValueError("Combined dataset is empty!")

print("\n" + "="*70)
print("COMBINED DATASET")
print("="*70)
print(f"Total records: {len(df):,}")
print(f"Total features: {df.shape[1]}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("="*70)


Found 8 CSV files:
  - Tuesday-WorkingHours.pcap_ISCX.csv
  - Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
  - Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
  - Monday-WorkingHours.pcap_ISCX.csv
  - Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
  - Friday-WorkingHours-Morning.pcap_ISCX.csv
  - Wednesday-workingHours.pcap_ISCX.csv
  - Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv

Loading CSV files (with PyArrow engine if available)...
Reading: Tuesday-WorkingHours.pcap_ISCX.csv...  Shape: (445909, 78) ✓
Reading: Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv...  Shape: (288602, 78) ✓
Reading: Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv...  Shape: (170366, 78) ✓
Reading: Monday-WorkingHours.pcap_ISCX.csv...  Shape: (529918, 78) ✓
Reading: Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv...  Shape: (225745, 78) ✓
Reading: Friday-WorkingHours-Morning.pcap_ISCX.csv...  Shape: (191033, 78) ✓
Reading: Wednesday-workingHours.pcap_ISCX.

---
## 3. Initial Data Inspection

Examine the structure and basic statistics of the dataset.

In [6]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Min Packet Length,Max Packet Length,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,88,640,7,4,440,358,220,0,62.8571,107.349,179,0,89.5,103.3457,1246875.0,17187.5,64.0,135.5573,445,1,640,106.6667,194.3251,497,1,538,179.3333,303.6862,530,4,0,0,0,0,164,104,10937.5,6250.0,0,220,66.5,99.0018,9801.3633,0,0,0,1,0,0,0,0,0,72.5455,62.8571,89.5,0,0,0,0,0,0,7,440,4,358,8192,2053,2,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
1,88,900,9,4,600,2944,300,0,66.6667,132.2876,1472,0,736.0,849.8596,3937777.75,14444.4443,75.0,192.7952,684,1,900,112.5,252.4112,734,1,766,255.3333,435.3232,758,3,0,0,0,0,204,104,10000.0,4444.4443,0,1472,253.1429,527.4343,278186.9062,0,0,0,1,0,0,0,0,0,272.6154,66.6667,736.0,0,0,0,0,0,0,9,600,4,2944,8192,2053,2,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2,88,1205,7,4,2776,2830,1388,0,396.5714,677.2747,1415,0,707.5,816.9506,4652282.0,9128.6309,120.5,236.4333,777,1,1205,200.8333,397.0584,1008,1,927,309.0,491.6472,876,1,0,0,0,0,164,104,5809.1284,3319.502,0,1415,467.1667,690.0989,476236.5,0,0,0,1,0,0,0,0,0,509.6364,396.5714,707.5,0,0,0,0,0,0,7,2776,4,2830,8192,2053,2,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
3,88,511,7,4,452,370,226,0,64.5714,110.2767,185,0,92.5,106.8098,1608610.625,21526.418,51.1,90.7677,299,1,511,85.1667,131.9006,349,1,462,154.0,224.7198,412,1,0,0,0,0,164,104,13698.6299,7827.7886,0,226,68.5,101.9336,10390.4541,0,0,0,1,0,0,0,0,0,74.7273,64.5714,92.5,0,0,0,0,0,0,7,452,4,370,8192,2053,2,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,88,773,9,4,612,2944,306,0,68.0,134.9333,1472,0,736.0,849.8596,4600258.5,16817.5938,64.4167,148.6983,531,1,773,96.625,196.6657,580,1,675,225.0,348.9011,627,1,0,0,0,0,204,104,11642.9492,5174.644,0,1472,254.0,527.5208,278278.1562,0,0,0,1,0,0,0,0,0,273.5385,68.0,736.0,0,0,0,0,0,0,9,612,4,2944,8192,2053,2,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN


In [7]:
# Data types and missing values
print("Dataset Information:")
df.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2830743 entries, 0 to 2830742
Data columns (total 78 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   Destination Port             int32  
 1   Flow Duration                int32  
 2   Total Fwd Packets            int32  
 3   Total Backward Packets       int32  
 4   Total Length of Fwd Packets  int32  
 5   Total Length of Bwd Packets  int32  
 6   Fwd Packet Length Max        int16  
 7   Fwd Packet Length Min        int16  
 8   Fwd Packet Length Mean       float32
 9   Fwd Packet Length Std        float32
 10  Bwd Packet Length Max        int16  
 11  Bwd Packet Length Min        int16  
 12  Bwd Packet Length Mean       float32
 13  Bwd Packet Length Std        float32
 14  Flow Bytes/s                 float32
 15  Flow Packets/s               float32
 16  Flow IAT Mean                float32
 17  Flow IAT Std                 float32
 18  Flow IAT Max         

In [8]:
# Clean column names (remove leading/trailing spaces)
df.columns = df.columns.str.strip()

print(f"\nColumn names ({len(df.columns)} total):")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")


Column names (78 total):
 1. Destination Port
 2. Flow Duration
 3. Total Fwd Packets
 4. Total Backward Packets
 5. Total Length of Fwd Packets
 6. Total Length of Bwd Packets
 7. Fwd Packet Length Max
 8. Fwd Packet Length Min
 9. Fwd Packet Length Mean
10. Fwd Packet Length Std
11. Bwd Packet Length Max
12. Bwd Packet Length Min
13. Bwd Packet Length Mean
14. Bwd Packet Length Std
15. Flow Bytes/s
16. Flow Packets/s
17. Flow IAT Mean
18. Flow IAT Std
19. Flow IAT Max
20. Flow IAT Min
21. Fwd IAT Total
22. Fwd IAT Mean
23. Fwd IAT Std
24. Fwd IAT Max
25. Fwd IAT Min
26. Bwd IAT Total
27. Bwd IAT Mean
28. Bwd IAT Std
29. Bwd IAT Max
30. Bwd IAT Min
31. Fwd PSH Flags
32. Bwd PSH Flags
33. Fwd URG Flags
34. Bwd URG Flags
35. Fwd Header Length
36. Bwd Header Length
37. Fwd Packets/s
38. Bwd Packets/s
39. Min Packet Length
40. Max Packet Length
41. Packet Length Mean
42. Packet Length Std
43. Packet Length Variance
44. FIN Flag Count
45. SYN Flag Count
46. RST Flag Count
47. PSH Flag Cou

In [9]:
# Check for missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
}).sort_values('Missing Count', ascending=False)

print("\nMissing Values Summary:")
print(missing_df[missing_df['Missing Count'] > 0])

total_missing = missing.sum()
print(f"\nTotal missing values: {total_missing:,}")
print(f"Percentage of total data: {(total_missing / (df.shape[0] * df.shape[1]))*100:.4f}%")


Missing Values Summary:
              Missing Count  Percentage
Flow Bytes/s           1358      0.0480

Total missing values: 1,358
Percentage of total data: 0.0006%
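
One caveat on the count above: `isnull()` reports NaN only, so infinite values in rate features are not included. The sketch below shows one way to count them separately. It runs on a toy frame with made-up values; whether the real data contains infinities is an assumption that should be checked against `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Flow Bytes/s": [1200.0, np.nan, np.inf, 50.0],
    "Flow Packets/s": [10.0, np.inf, -np.inf, 2.0],
})

# isnull() counts NaN only; +inf and -inf pass through as ordinary floats.
nan_counts = toy.isnull().sum()

# np.isinf flags both +inf and -inf; restrict it to numeric columns.
numeric = toy.select_dtypes(include=[np.number])
inf_counts = np.isinf(numeric).sum()

print(pd.DataFrame({"NaN": nan_counts, "Inf": inf_counts}))
```

If infinities do turn up, converting them to NaN with `replace([np.inf, -np.inf], np.nan)` lets a single missing-value strategy handle both cases during preprocessing.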


In [10]:
# Summary statistics
print("\nSummary Statistics for Numerical Features:")
# OPTIMIZATION: describe() runs on a 100,000-row random sample for speed,
# so 'count' is 100000 and min/max may miss extremes present in the full data
df.sample(n=100000, random_state=42).describe().T



Summary Statistics for Numerical Features:


| Feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Destination Port | 100000.0 | 8090.8984 | 18305.6152 | 0.0 | 53.0 | 80.0 | 443.0 | 65522.0 |
| Flow Duration | 100000.0 | 14826515.3821 | 33688057.3301 | -1.0 | 153.0 | 31282.0 | 3288194.5 | 119999998.0 |
| Total Fwd Packets | 100000.0 | 8.8931 | 621.422 | 1.0 | 2.0 | 2.0 | 5.0 | 162890.0 |
| Total Backward Packets | 100000.0 | 9.9073 | 863.4331 | 0.0 | 1.0 | 2.0 | 4.0 | 228618.0 |
| Total Length of Fwd Packets | 100000.0 | 523.9209 | 3661.8007 | 0.0 | 12.0 | 62.0 | 186.0 | 646049.0 |
| Total Length of Bwd Packets | 100000.0 | 14537.9729 | 1814401.6223 | 0.0 | 0.0 | 122.0 | 476.0 | 474000000.0 |
| Fwd Packet Length Max | 100000.0 | 204.0863 | 700.7471 | 0.0 | 6.0 | 37.0 | 79.0 | 23360.0 |
| Fwd Packet Length Min | 100000.0 | 18.6232 | 59.6748 | 0.0 | 0.0 | 2.0 | 36.0 | 1983.0 |
| Fwd Packet Length Mean | 100000.0 | 57.3085 | 182.1353 | 0.0 | 6.0 | 34.0 | 50.0 | 4183.0571 |
| Fwd Packet Length Std | 100000.0 | 67.2587 | 270.1484 | 0.0 | 0.0 | 0.0 | 25.3114 | 5795.5005 |


---
## 4. Target Variable Analysis

Analyze the distribution of the target variable `Label` (attack types).

**Professor Requirement #2 (Partial): Understanding the target variable before EDA**

In [11]:
# Identify target column (should be ' Label' or 'Label')
target_col = ' Label' if ' Label' in df.columns else 'Label'

print(f"Target column: '{target_col}'")
print(f"\nUnique attack types: {df[target_col].nunique()}")
print("\nClass Distribution:")
class_dist = df[target_col].value_counts().sort_values(ascending=False)
class_dist_pct = (class_dist / len(df) * 100).round(2)

class_summary = pd.DataFrame({
    'Count': class_dist,
    'Percentage': class_dist_pct
})

print(class_summary)
print("\n" + "="*70)

Target column: 'Label'

Unique attack types: 15

Class Distribution:
                              Count  Percentage
Label                                          
BENIGN                      2273097     80.3000
DoS Hulk                     231073      8.1600
PortScan                     158930      5.6100
DDoS                         128027      4.5200
DoS GoldenEye                 10293      0.3600
FTP-Patator                    7938      0.2800
SSH-Patator                    5897      0.2100
DoS slowloris                  5796      0.2000
DoS Slowhttptest               5499      0.1900
Bot                            1966      0.0700
Web Attack - Brute Force       1507      0.0500
Web Attack - XSS                652      0.0200
Infiltration                     36      0.0000
Web Attack - Sql Injection       21      0.0000
Heartbleed                       11      0.0000



In [12]:
# Ensure outputs directory exists (robust version for Colab/local)
import os
from pathlib import Path

# Create outputs directory with absolute path handling
outputs_dir = Path('../outputs').resolve()
outputs_dir.mkdir(parents=True, exist_ok=True)
print(f"Outputs directory: {outputs_dir}")
print(f"Directory exists: {outputs_dir.exists()}")

# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar plot
class_dist.plot(kind='bar', ax=axes[0], color='steelblue', edgecolor='black')
axes[0].set_title('Attack Type Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Attack Type', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].tick_params(axis='x', rotation=45, labelsize=10)
axes[0].grid(True, axis='y', alpha=0.3)

# Pie chart
axes[1].pie(class_dist, labels=class_dist.index, autopct='%1.1f%%', 
            startangle=90, textprops={'fontsize': 9})
axes[1].set_title('Attack Type Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()

# Save with robust error handling
output_file = outputs_dir / 'target_distribution.png'
try:
    plt.savefig(output_file, dpi=150, bbox_inches='tight')
    print(f"✓ Plot saved: {output_file}")
except Exception as e:
    print(f"Warning: Could not save plot - {e}")
    print(f"Attempted to save to: {output_file}")

# Display first, then close: calling plt.close() before plt.show() leaves nothing to display
plt.show()
plt.close()  # Free memory


Outputs directory: /outputs
Directory exists: True
✓ Plot saved: /outputs/target_distribution.png


### Key Observations:
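- The dataset is heavily imbalanced: BENIGN accounts for 2,273,097 of 2,830,743 records (80.30%).
- The most common attacks are DoS Hulk (8.16%), PortScan (5.61%), and DDoS (4.52%). The rarest are Heartbleed (11 records), Web Attack - Sql Injection (21), and Infiltration (36), each of which rounds to 0.00%.
- For model training, stratified sampling is needed so that rare classes appear in every split. Overall accuracy would be dominated by BENIGN, so per-class metrics are more informative.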
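
As a minimal sketch of the stratified sampling mentioned above, the example below splits a toy 90/10 label column with scikit-learn's `train_test_split`. The toy data and the 80/20 split ratio are assumptions chosen for illustration, not choices made elsewhere in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with a 90/10 imbalance (illustrative, not the real class counts).
toy = pd.DataFrame({
    "feature": range(100),
    "Label": ["BENIGN"] * 90 + ["DoS Hulk"] * 10,
})

# stratify= keeps each class's share the same in the train and test splits.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Label"], random_state=42
)

train_share = train["Label"].value_counts(normalize=True)
test_share = test["Label"].value_counts(normalize=True)
print(train_share)
print(test_share)
```

With classes as small as Heartbleed (11 records), even a stratified 20% test split holds only a couple of examples, so per-class scores for such classes will be noisy.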

---
## 5. Univariate Outlier Detection

**Professor Requirement #2: Perform univariate outlier detection**

Implement manual IQR (Interquartile Range) method to detect outliers in numerical features.

**Method:**
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 - Q1
- Lower Bound = Q1 - 1.5 × IQR
- Upper Bound = Q3 + 1.5 × IQR
- Outliers = values < Lower Bound OR values > Upper Bound
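
The method above can be traced by hand on a small array. The values below are made up for illustration, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Nine illustrative values with one obvious extreme.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1 = np.percentile(values, 25)   # 3.0
q3 = np.percentile(values, 75)   # 7.0
iqr = q3 - q1                    # 4.0

lower = q1 - 1.5 * iqr           # 3.0 - 6.0 = -3.0
upper = q3 + 1.5 * iqr           # 7.0 + 6.0 = 13.0

# Anything outside [lower, upper] is flagged as an outlier.
outliers = values[(values < lower) | (values > upper)]
print(outliers)                  # [100.]
```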

In [13]:
def detect_outliers_iqr(data, column):
    """
    Detect outliers using Interquartile Range (IQR) method.
    
    Parameters:
    -----------
    data : pd.DataFrame
        Input dataset
    column : str
        Column name to check for outliers
        
    Returns:
    --------
    n_outliers : int
        Number of outliers detected
    lower_bound : float
        Lower threshold
    upper_bound : float
        Upper threshold
    outlier_indices : pd.Index
        Indices of outlier rows
    """
    # Compute Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = np.percentile(data[column].dropna(), 25)
    Q3 = np.percentile(data[column].dropna(), 75)
    
    # Compute IQR
    IQR = Q3 - Q1
    
    # Define outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Identify outliers
    outlier_mask = (data[column] < lower_bound) | (data[column] > upper_bound)
    outlier_indices = data[outlier_mask].index
    n_outliers = len(outlier_indices)
    
    return n_outliers, lower_bound, upper_bound, outlier_indices

print(" Outlier detection function defined")

 Outlier detection function defined
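
One property of the IQR rule is worth keeping in mind when reading the results: if Q1 equals Q3, the IQR is zero and both bounds collapse onto that single value, so every other value is flagged. The sketch below shows this on a toy flag-like column. Whether any given CIC-IDS2017 feature behaves this way is an assumption to verify against the outlier summary.

```python
import pandas as pd

# Toy flag-like column: 90% zeros, 10% ones (illustrative only).
flag = pd.Series([0] * 90 + [1] * 10)

q1 = flag.quantile(0.25)
q3 = flag.quantile(0.75)
iqr = q3 - q1                      # 0.0, because Q1 and Q3 are both 0

lower = q1 - 1.5 * iqr             # 0.0
upper = q3 + 1.5 * iqr             # 0.0

# Every nonzero value falls outside [0, 0] and is counted as an outlier.
n_flagged = int(((flag < lower) | (flag > upper)).sum())
print(f"IQR={iqr}, flagged={n_flagged} of {len(flag)}")
```

For sparse or near-constant features, a high outlier percentage may therefore reflect the shape of the distribution rather than anomalous traffic.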


In [14]:
# Get numerical columns (exclude target column)
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Remove target column if present
if target_col in numerical_cols:
    numerical_cols.remove(target_col)

print(f"Number of numerical features: {len(numerical_cols)}")
print("\nApplying VECTORIZED IQR outlier detection...\n")

# OPTIMIZATION: Vectorized outlier detection (10-20x faster)
Q1 = df[numerical_cols].quantile(0.25)
Q3 = df[numerical_cols].quantile(0.75)
IQR = Q3 - Q1

# Calculate bounds
lower_bounds = Q1 - 1.5 * IQR
upper_bounds = Q3 + 1.5 * IQR

# Count outliers for all columns at once
outlier_counts = ((df[numerical_cols] < lower_bounds) | (df[numerical_cols] > upper_bounds)).sum()

# Create summary DataFrame
outlier_df = pd.DataFrame({
    'Feature': numerical_cols,
    'Outliers': outlier_counts.values,
    'Percentage': (outlier_counts.values / len(df)) * 100,
    'Lower Bound': lower_bounds.values,
    'Upper Bound': upper_bounds.values
}).sort_values('Percentage', ascending=False)

print("="*90)
print("OUTLIER DETECTION SUMMARY (Top 20 features)")
print("="*90)
print(outlier_df.head(20).to_string(index=False))
print("="*90)

# Save full summary
outlier_df.to_csv('../outputs/outlier_summary.csv', index=False)
print("\n✓ Full outlier summary saved: outputs/outlier_summary.csv")


Number of numerical features: 77

Applying VECTORIZED IQR outlier detection...

OUTLIER DETECTION SUMMARY (Top 20 features)
                    Feature  Outliers  Percentage   Lower Bound  Upper Bound
               Fwd IAT Mean    671833     23.7335  -309459.5391  515765.8984
                Fwd IAT Max    666292     23.5377 -1396509.0000 2327515.0000
              Fwd IAT Total    665766     23.5191 -1864265.2500 3107108.7500
      Fwd Packet Length Max    664214     23.4643     -106.5000     193.5000
      Fwd Packet Length Std    663959     23.4553      -39.2444      65.4074
                Fwd IAT Std    660519     23.3338   -98984.7363  164974.5605
     Packet Length Variance    654542     23.1226   -45583.2583   75972.0972
      Bwd Packet Length Std    654269     23.1130     -116.9108     194.8513
      Bwd Packet Length Max    637112     22.5069     -420.0000     700.0000
           Destination Port    627245     22.1583     -532.0000    1028.0000
          Max Packet Length  

In [None]:
# Ensure outputs directory exists (robust version for Colab/local)
import os
from pathlib import Path

# Create outputs directory with absolute path handling
outputs_dir = Path('../outputs').resolve()
outputs_dir.mkdir(parents=True, exist_ok=True)

# STRATEGIC FEATURE SELECTION for Diverse Distribution Analysis
# -------------------------------------------------------------
# Instead of just top outliers, we select features based on:
# 1. High Outlier Count (Anomalies)
# 2. High Variance (Spread/Diversity)
# 3. Domain Knowledge (Key Network Traffic Indicators)

# 1. Top Outliers
top_outliers = outlier_df.head(3)['Feature'].tolist()

# 2. High Variance (Coefficient of Variation to be scale-independent)
# CV = std / mean
cv = df[numerical_cols].std() / (df[numerical_cols].mean().abs() + 1e-9)
top_variance = cv.sort_values(ascending=False).head(3).index.tolist()

# 3. Key Domain Features (Network Traffic)
# These are critical for NIDS regardless of statistical properties
domain_features = ['Flow Duration', 'Total Fwd Packets', 'Packet Length Mean']
# Filter to ensure they exist in our dataset
domain_features = [f for f in domain_features if f in df.columns]

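# Combine and deduplicate (dict.fromkeys preserves insertion order; set() does not,
# which would make the selection order vary between runs)
selected_features = list(dict.fromkeys(top_outliers + top_variance + domain_features))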

# Fill to exactly 9 features if needed (using more outliers)
if len(selected_features) < 9:
    remaining = 9 - len(selected_features)
    more_outliers = outlier_df[~outlier_df['Feature'].isin(selected_features)].head(remaining)['Feature'].tolist()
    selected_features.extend(more_outliers)

# Limit to 9 features
selected_features = selected_features[:9]

print("Selected Features for Strategic Distribution Analysis:")
for f in selected_features:
    reason = []
    if f in top_outliers: reason.append('High Outliers')
    if f in top_variance: reason.append('High Variance')
    if f in domain_features: reason.append('Domain Knowledge')
    print(f" - {f} ({', '.join(reason)})")


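# Visualize distributions
# Plot a random sample: drawing 9 boxplots over all ~2.8M rows is slow.
# The outlier percentages in the titles still come from the full dataset.
plot_df = df.sample(n=min(len(df), 100_000), random_state=42)

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(selected_features):
    sns.boxplot(data=plot_df, y=col, ax=axes[idx], color='lightblue')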
    
    # Get outlier percentage for title
    out_pct = outlier_df[outlier_df['Feature']==col]['Percentage'].values
    pct_text = f"{out_pct[0]:.2f}%" if len(out_pct) > 0 else "N/A"
    
    axes[idx].set_title(f'{col}\n(Outliers: {pct_text})', 
                        fontsize=10, fontweight='bold')
    axes[idx].set_ylabel('Value', fontsize=9)
    axes[idx].tick_params(labelsize=8)

plt.suptitle('Strategic Feature Distribution Analysis (Top 9 Diverse Features)', fontsize=16, fontweight='bold')
plt.tight_layout()

# Save with robust error handling
output_file = outputs_dir / 'strategic_distribution_boxplots.png'
try:
    plt.savefig(output_file, dpi=150, bbox_inches='tight')
    print(f"✓ Plot saved: {output_file}")
except Exception as e:
    print(f"Warning: Could not save plot - {e}")
    print(f"Attempted to save to: {output_file}")

# Display first, then close: calling plt.close() before plt.show() leaves nothing to display
plt.show()
plt.close()  # Free memory


Selected Features for Strategic Distribution Analysis:
 - Fwd Header Length (High Variance)
 - Packet Length Mean (Domain Knowledge)
 - Bwd Header Length (High Variance)
 - Fwd IAT Mean (High Outliers)
 - Total Fwd Packets (Domain Knowledge)
 - min_seg_size_forward (High Variance)
 - Fwd IAT Total (High Outliers)
 - Flow Duration (Domain Knowledge)
 - Fwd IAT Max (High Outliers)


KeyboardInterrupt: 

---
## 6. Correlation Analysis & Heatmap

**Professor Requirement #2: Show heatmaps to display correlation among features**

Compute correlation matrix and visualize feature relationships.

In [None]:
# Select numerical features for correlation
# OPTIMIZATION: Use sampling for correlation matrix to reduce execution time

print(f"Computing correlation matrix for {len(numerical_cols)} features...")

# Use a simple random sample of up to 100k rows for speed.
# Pearson estimates from a sample this large are usually close to the full-data
# values, though rare attack classes may be underrepresented in the sample.
SAMPLE_SIZE = 100000
if len(df) > SAMPLE_SIZE:
    print(f"Using a sample of {SAMPLE_SIZE:,} rows for faster correlation calculation...")
    # Simple random sample is sufficient for correlation
    corr_df = df.sample(n=SAMPLE_SIZE, random_state=42)
else:
    corr_df = df

# Compute correlation matrix
correlation_matrix = corr_df[numerical_cols].corr(method="pearson")

print(f"Correlation matrix shape: {correlation_matrix.shape}")

# Display correlation matrix (first 10x10)
print("\nCorrelation Matrix (first 10x10):")
correlation_matrix.iloc[:10, :10]



In [None]:
# Plot heatmap
plt.figure(figsize=(20, 16))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8},
            vmin=-1, vmax=1)
plt.title('Feature Correlation Heatmap', fontsize=18, fontweight='bold', pad=20)
plt.xlabel('Features', fontsize=14)
plt.ylabel('Features', fontsize=14)
plt.xticks(fontsize=8, rotation=90)
plt.yticks(fontsize=8)
plt.tight_layout()
# The filename goes to get_output_path; dpi and bbox_inches belong to savefig
plt.savefig(get_output_path('correlation_heatmap.png'), dpi=150, bbox_inches='tight')
plt.close()  # Free memory

print("Plot saved: outputs/correlation_heatmap.png")


In [None]:
# Find highly correlated feature pairs (|correlation| > 0.8)
high_corr_pairs = []

for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            high_corr_pairs.append({
                'Feature 1': correlation_matrix.columns[i],
                'Feature 2': correlation_matrix.columns[j],
                'Correlation': correlation_matrix.iloc[i, j]
            })

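# Naming the columns up front keeps this cell working even if no pair exceeds the threshold
high_corr_df = pd.DataFrame(high_corr_pairs, columns=['Feature 1', 'Feature 2', 'Correlation'])
if not high_corr_df.empty:
    high_corr_df = high_corr_df.sort_values('Correlation', key=abs, ascending=False)

print(f"\nFound {len(high_corr_df)} highly correlated feature pairs (|r| > 0.8):\n")
print(high_corr_df.head(20).to_string(index=False))

# Save highly correlated pairs alongside the other outputs
high_corr_df.to_csv(outputs_dir / 'high_correlation_pairs.csv', index=False)
print("\nHigh correlation pairs saved: outputs/high_correlation_pairs.csv")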


---
## 7. Distribution Analysis

Analyze the distribution of key numerical features.

In [None]:
# Select the first 9 numerical features for distribution analysis
# (replace this list with features chosen from domain knowledge if preferred)
sample_features = numerical_cols[:9]

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(sample_features):
    axes[idx].hist(df[col].dropna(), bins=50, edgecolor='black', alpha=0.7, color='skyblue')
    axes[idx].set_title(f'{col}', fontsize=10, fontweight='bold')
    axes[idx].set_xlabel('Value', fontsize=9)
    axes[idx].set_ylabel('Frequency', fontsize=9)
    axes[idx].tick_params(labelsize=8)
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Distribution of Selected Features (Histograms)', fontsize=16, fontweight='bold')
plt.tight_layout()
# The filename goes to get_output_path; dpi and bbox_inches belong to savefig
plt.savefig(get_output_path('feature_distributions.png'), dpi=150, bbox_inches='tight')
plt.close()  # Free memory

print("Plot saved: outputs/feature_distributions.png")


---
## 8. Key Insights & Recommendations for Preprocessing

**Summary of EDA Findings:**

### Dataset Characteristics:
- **Total Records:** [Fill from cell output]
- **Total Features:** [Fill from cell output]
- **Attack Types:** [Fill from class distribution]
- **Class Balance:** [Balanced/Imbalanced - based on target distribution]

### Outlier Detection:
- **Features with >20% outliers:** [List top features]
- **Recommendation:** Consider capping or removal for features with extreme outliers
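
The capping option can be sketched with the same IQR fences used for outlier detection. This is a minimal illustration, not the notebook's own implementation: the helper name `cap_outliers_iqr`, the multiplier `k=1.5`, and the toy values are assumptions.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values.

    Hypothetical helper; k=1.5 is the conventional Tukey multiplier (assumed).
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy data standing in for one numerical feature (made-up values)
demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
capped = cap_outliers_iqr(demo)
print(capped.tolist())
```

Capping keeps every row, which matters when the extreme values belong to attack traffic; removal would discard those rows entirely.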

### Correlation Analysis:
- **Highly correlated pairs (|r| > 0.8):** [Count from analysis]
- **Recommendation:** Consider feature selection or PCA to reduce multicollinearity
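
One simple form of correlation-based feature selection is to drop the later feature of each highly correlated pair. The sketch below assumes a hypothetical helper `correlated_features_to_drop` and made-up demo columns; the 0.8 threshold mirrors the one used in the analysis above.

```python
import numpy as np
import pandas as pd


def correlated_features_to_drop(frame: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return columns whose |Pearson r| with an earlier column exceeds threshold.

    Hypothetical helper for illustration only.
    """
    corr = frame.corr(method="pearson").abs()
    # Keep the strict upper triangle so each pair is considered once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (made-up values): "b" is nearly a multiple of "a", "c" is not
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
to_drop = correlated_features_to_drop(demo)
reduced = demo.drop(columns=to_drop)
print(to_drop)
```

Which member of a pair to keep is a design choice; this sketch keeps the column that appears first, which may not be the most informative one.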

### Missing Values:
- **Features with missing values:** [List if any]
- **Recommendation:** Impute or remove based on percentage
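
A per-feature missing percentage gives the number this decision depends on. The sketch below uses made-up data and additionally treats infinite values as missing, which is an assumption about how such values should be handled rather than something established in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with one NaN and one infinite entry
demo = pd.DataFrame({
    "x": [1.0, np.inf, 3.0, np.nan],
    "y": [1.0, 2.0, 3.0, 4.0],
})

# Assumption: infinite values are treated as missing before counting
cleaned = demo.replace([np.inf, -np.inf], np.nan)

# Share of missing entries per feature, in percent
missing_pct = cleaned.isna().mean() * 100
print(missing_pct)
```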

### Distribution Observations:
- **Skewed features:** [List heavily skewed features]
- **Recommendation:** Apply log transformation or scaling
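
As a rough illustration of the log transformation, `np.log1p` computes log(1 + x) and so handles zeros. The demo values are made up, and clipping negatives to zero first is an assumption that would need checking against the actual features.

```python
import numpy as np
import pandas as pd

# Toy right-skewed feature (made-up values spanning several orders of magnitude)
demo = pd.Series([0.0, 9.0, 99.0, 999.0, 9999.0, 99999.0])

# Assumption: negative values, if any, are clipped to 0 before the transform
log_scaled = np.log1p(demo.clip(lower=0))

print(f"Skewness before: {demo.skew():.2f}")
print(f"Skewness after:  {log_scaled.skew():.2f}")
```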

---

**Next Steps for Member 2 (Preprocessing):**
1. Handle outliers identified in this notebook
2. Address multicollinearity (consider feature selection)
3. Scale features using StandardScaler
4. Handle class imbalance (if severe); stratified splitting preserves class proportions across splits but does not rebalance them
5. Create train-test split (70-30 stratified)
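
Steps 3 and 5 can be sketched together with scikit-learn. This is a minimal example on synthetic data: the column names, class names, and `random_state` are placeholders, not values taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (placeholder names)
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(["class_a"] * 80 + ["class_b"] * 20, name="Label")

# 70-30 split that keeps the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Fitting the scaler after the split, rather than on the full data, keeps test-set statistics from leaking into the training features.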

---

**Proceed to:** `02_preprocessing_feature_engineering.ipynb`