# Network Intrusion Detection Dataset Preprocessing
## Google Colab / VS Code Jupyter Notebook

This notebook preprocesses three network intrusion detection datasets:
- **CIC-IDS 2017**
- **CIC-IDS 2018**
- **UNSW-NB15**

### Features:
- ‚úÖ Class imbalance handling with sample weights
- ‚úÖ Complete data cleaning pipeline
- ‚úÖ Feature scaling and standardization
- ‚úÖ Parquet export for efficient storage
- ‚úÖ Google Drive integration

### Prerequisites:
1. Upload your dataset CSV files to Google Drive
2. Run this notebook in Google Colab or VS Code with Jupyter extension

## Step 1: Install and Setup PySpark

In [1]:
# Install PySpark (if not already installed in Colab)
!pip install -q pyspark

# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, isnan, isnull, count, lit, lower, trim
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, FloatType, IntegerType, LongType, StringType
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer
from functools import reduce
import os
import time
from datetime import datetime

print("‚úÖ PySpark installed successfully!")

‚úÖ PySpark installed successfully!


## Step 2: Mount Google Drive (for Colab)

In [2]:
# Mount Google Drive (only for Colab)
try:
    from google.colab import drive
    drive.mount('/content/drive')
    # Path to your data in Google Drive
    BASE_DIR = "/content/drive/MyDrive/NetworkIDS"
    print(f"‚úÖ Google Drive mounted successfully!")
    print(f"üìÅ Data directory: {BASE_DIR}")
    IS_COLAB = True
except:
    # Running locally in VS Code
    BASE_DIR = "d:/Coding/real-time-network-intrusion-detection-spark-kafka/data"
    print(f"‚úÖ Running locally. Data directory: {BASE_DIR}")
    IS_COLAB = False

Mounted at /content/drive
‚úÖ Google Drive mounted successfully!
üìÅ Data directory: /content/drive/MyDrive/NetworkIDS


## Step 3: Create Spark Session

In [3]:
# Create optimized Spark session
spark = SparkSession.builder \
    .appName("NetworkIDS-Preprocessing") \
    .master("local[*]") \
    .config("spark.driver.memory", "12g" if IS_COLAB else "4g") \
    .config("spark.executor.memory", "12g" if IS_COLAB else "4g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.driver.host", "localhost") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
print(f"‚úÖ Spark Session created")
print(f"üìä Spark UI: {spark.sparkContext.uiWebUrl}")

‚úÖ Spark Session created
üìä Spark UI: http://localhost:4040


## Step 4: Preprocessing Utility Functions

In [12]:
def clean_column_names(df):
    """Clean column names by removing special characters"""
    for column in df.columns:
        new_name = ''.join(c if c.isalnum() or c == '_' else '_' for c in column)
        new_name = new_name.lower().strip('_')
        while '__' in new_name:
            new_name = new_name.replace('__', '_')
        if new_name != column:
            df = df.withColumnRenamed(column, new_name)
    return df

def handle_missing_values(df, label_col):
    """Handle missing values by filling with mean - OPTIMIZED"""
    numeric_cols = [f.name for f in df.schema.fields
                   if isinstance(f.dataType, (DoubleType, IntegerType, LongType))
                   and f.name != label_col]

    # Calculate all means in ONE PASS
    mean_exprs = [F.mean(col_name).alias(col_name) for col_name in numeric_cols]
    means_row = df.select(*mean_exprs).first()
    means_dict = means_row.asDict()

    # Fill missing values using precomputed means
    for col_name in numeric_cols:
        mean_val = means_dict[col_name]
        if mean_val is not None:
            df = df.withColumn(col_name,
                             F.when(F.col(col_name).isNull(), F.lit(mean_val))
                             .otherwise(F.col(col_name)))
    return df

def handle_infinite_values(df, label_col):
    """Replace infinite values with 0 - OPTIMIZED"""
    numeric_cols = [f.name for f in df.schema.fields
                   if isinstance(f.dataType, (DoubleType, IntegerType, LongType))
                   and f.name != label_col]

    # Replace inf/-inf with 0 for all columns
    for col_name in numeric_cols:
        df = df.withColumn(col_name,
                          F.when(F.col(col_name).isin([float('inf'), float('-inf')]), 0)
                          .otherwise(F.col(col_name)))
    return df

def calculate_class_weights(df, label_col):
    """Calculate inverse frequency weights for handling class imbalance"""
    class_counts = df.groupBy(label_col).count().collect()
    max_count = max([row['count'] for row in class_counts])
    weights = {row[label_col]: float(max_count) / row['count'] for row in class_counts}
    return weights

def add_sample_weights(df, label_col, weights):
    """Add sample_weight column based on class weights"""
    weight_expr = F.when(F.col(label_col) == list(weights.keys())[0], list(weights.values())[0])
    for label_val, weight_val in list(weights.items())[1:]:
        weight_expr = weight_expr.when(F.col(label_col) == label_val, weight_val)
    df = df.withColumn('sample_weight', weight_expr)
    return df

def create_binary_label(df, label_col, benign_value):
    """Create binary label (0=benign, 1=attack)"""
    # Handle both string and list inputs
    if isinstance(benign_value, list):
        benign_value = benign_value[0]

    df = df.withColumn(
        'binary_label',
        F.when(F.col(label_col) == benign_value, 0).otherwise(1)
    )
    return df

def scale_features(df, exclude_cols, output_dir, dataset_name):
    """Scale features using StandardScaler - FIXED for string columns"""
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.sql.types import StringType, TimestampType

    # Automatically determine feature columns (exclude labels, metadata, and non-numeric)
    exclude_list = exclude_cols if isinstance(exclude_cols, list) else [exclude_cols]
    exclude_list.extend(['binary_label', 'sample_weight'])

    # Only include numeric columns (exclude strings and timestamps)
    feature_cols = [f.name for f in df.schema.fields
                   if f.name not in exclude_list
                   and not isinstance(f.dataType, (StringType, TimestampType))]

    print(f"  Using {len(feature_cols)} numeric features for scaling")

    assembler = VectorAssembler(inputCols=feature_cols, outputCol='features_temp', handleInvalid='skip')
    df_assembled = assembler.transform(df)

    scaler = StandardScaler(inputCol='features_temp', outputCol='features_scaled', withMean=True, withStd=True)
    scaler_model = scaler.fit(df_assembled)
    df_scaled = scaler_model.transform(df_assembled)

    scaler_model.write().overwrite().save(f"{output_dir}/models/{dataset_name}_scaler")

    return df_scaled.drop('features_temp'), feature_cols

## Step 5: Dataset-Specific Loading Functions

In [5]:
def load_cicids2017(data_dir):
    """Load CIC-IDS 2017 dataset"""
    files = [
        'Monday-WorkingHours.pcap_ISCX.csv',
        'Tuesday-WorkingHours.pcap_ISCX.csv',
        'Wednesday-workingHours.pcap_ISCX.csv',
        'Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv',
        'Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv',
        'Friday-WorkingHours-Morning.pcap_ISCX.csv',
        'Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv',
        'Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv'
    ]

    print(f"Loading {len(files)} CIC-IDS 2017 files...")
    dfs = []
    for file in files:
        path = f"{data_dir}/CSE-CIC-IDS2017/{file}"
        print(f"  Loading {file}...")
        df = spark.read.csv(path, header=True, inferSchema=True)
        dfs.append(df)

    print("Combining all files...")
    return reduce(lambda df1, df2: df1.union(df2), dfs)

def load_cicids2018(data_dir):
    """Load CIC-IDS 2018 dataset"""
    files = [
        '02-14-2018.csv', '02-15-2018.csv', '02-16-2018.csv',
        '02-20-2018.csv', '02-21-2018.csv', '02-22-2018.csv',
        '02-23-2018.csv', '02-28-2018.csv', '03-01-2018.csv', '03-02-2018.csv'
    ]

    print(f"Loading {len(files)} CIC-IDS 2018 files...")
    dfs = []
    for file in files:
        path = f"{data_dir}/CSE-CIC-IDS2018/{file}"
        print(f"  Loading {file}...")
        df = spark.read.csv(path, header=True, inferSchema=True)

        # Drop identifier columns that cause schema mismatch
        cols_to_drop = ['Flow ID', 'Src IP', 'Src Port', 'Dst IP']
        for col in cols_to_drop:
            if col in df.columns:
                df = df.drop(col)

        dfs.append(df)

    print("Combining all files...")
    return reduce(lambda df1, df2: df1.union(df2), dfs)

def load_unsw_nb15(data_dir):
    """Load UNSW-NB15 dataset"""
    train_path = f"{data_dir}/UNSW-NB15/UNSW_NB15_training-set.csv"
    test_path = f"{data_dir}/UNSW-NB15/UNSW_NB15_testing-set.csv"

    print("Loading UNSW-NB15 training set...")
    train_df = spark.read.csv(train_path, header=True, inferSchema=True)
    print("Loading UNSW-NB15 testing set...")
    test_df = spark.read.csv(test_path, header=True, inferSchema=True)

    print("Combining train and test sets...")
    return train_df.union(test_df)

def encode_categorical_features(df):
    """Encode categorical features for UNSW-NB15"""
    categorical_cols = ['proto', 'service', 'state']

    for col in categorical_cols:
        if col in df.columns:
            print(f"  Encoding {col}...")
            indexer = StringIndexer(inputCol=col, outputCol=f"{col}_index", handleInvalid="keep")
            df = indexer.fit(df).transform(df)
            df = df.drop(col).withColumnRenamed(f"{col}_index", col)

    return df

print("‚úÖ Dataset loading functions defined")

‚úÖ Dataset loading functions defined


## Step 6: Preprocessing Pipeline Functions

In [9]:
def preprocess_cicids2017(df, output_dir):
    """Full preprocessing pipeline for CIC-IDS 2017"""
    print("üîÑ Starting CIC-IDS 2017 preprocessing...")

    # Step 1: Clean column names
    print("Step 1: Cleaning column names...")
    df = clean_column_names(df)

    # Step 2: Deduplicate
    print("Step 2: Removing duplicates...")
    initial_count = df.count()
    df = df.dropDuplicates()
    final_count = df.count()
    print(f"  Removed {initial_count - final_count:,} duplicates")

    # Step 3: Handle missing values
    print("Step 3: Handling missing values...")
    df = handle_missing_values(df, 'label')

    # Step 4: Handle infinite values
    print("Step 4: Handling infinite values...")
    df = handle_infinite_values(df, 'label')

    # Step 5: Create binary label
    print("Step 5: Creating binary label...")
    df = create_binary_label(df, 'label', 'BENIGN')

    # Step 6: Calculate and add sample weights
    print("Step 6: Calculating class weights...")
    class_weights = calculate_class_weights(df, 'binary_label')
    df = add_sample_weights(df, 'binary_label', class_weights)

    # Step 7: Scale features
    print("Step 7: Scaling features...")
    df, feature_cols = scale_features(df, 'label', output_dir, 'cicids2017')

    # Step 8: Save to parquet
    print("Step 8: Saving to parquet...")
    parquet_path = f"{output_dir}/parquet/cicids2017_preprocessed"
    df.write.mode('overwrite').parquet(parquet_path)
    print(f"‚úÖ Saved to {parquet_path}")

    return df, class_weights, feature_cols

def preprocess_cicids2018(df, output_dir):
    """Full preprocessing pipeline for CIC-IDS 2018"""
    print("üîÑ Starting CIC-IDS 2018 preprocessing...")

    print("Step 1: Cleaning column names...")
    df = clean_column_names(df)

    print("Step 2: Removing duplicates...")
    initial_count = df.count()
    df = df.dropDuplicates()
    final_count = df.count()
    print(f"  Removed {initial_count - final_count:,} duplicates")

    print("Step 3: Handling missing values...")
    df = handle_missing_values(df, 'label')

    print("Step 4: Handling infinite values...")
    df = handle_infinite_values(df, 'label')

    print("Step 5: Creating binary label...")
    df = create_binary_label(df, 'label', 'Benign')

    print("Step 6: Calculating class weights...")
    class_weights = calculate_class_weights(df, 'binary_label')
    df = add_sample_weights(df, 'binary_label', class_weights)

    print("Step 7: Scaling features...")
    df, feature_cols = scale_features(df, 'label', output_dir, 'cicids2018')

    print("Step 8: Saving to parquet...")
    parquet_path = f"{output_dir}/parquet/cicids2018_preprocessed"
    df.write.mode('overwrite').parquet(parquet_path)
    print(f"‚úÖ Saved to {parquet_path}")

    return df, class_weights, feature_cols

def preprocess_unsw_nb15(df, output_dir):
    """Full preprocessing pipeline for UNSW-NB15"""
    print("üîÑ Starting UNSW-NB15 preprocessing...")

    print("Step 1: Cleaning column names...")
    df = clean_column_names(df)

    print("Step 2: Encoding categorical features...")
    df = encode_categorical_features(df)

    print("Step 3: Removing duplicates...")
    initial_count = df.count()
    df = df.dropDuplicates()
    final_count = df.count()
    print(f"  Removed {initial_count - final_count:,} duplicates")

    print("Step 4: Handling missing values...")
    df = handle_missing_values(df, 'label')

    print("Step 5: Handling infinite values...")
    df = handle_infinite_values(df, 'label')

    print("Step 6: Renaming label to binary_label...")
    df = df.withColumnRenamed('label', 'binary_label')

    print("Step 7: Calculating class weights...")
    class_weights = calculate_class_weights(df, 'binary_label')
    df = add_sample_weights(df, 'binary_label', class_weights)

    print("Step 8: Scaling features...")
    df, feature_cols = scale_features(df, 'attack_cat', output_dir, 'unsw_nb15')

    print("Step 9: Saving to parquet...")
    parquet_path = f"{output_dir}/parquet/unsw_nb15_preprocessed"
    df.write.mode('overwrite').parquet(parquet_path)
    print(f"‚úÖ Saved to {parquet_path}")

    return df, class_weights, feature_cols

## Step 7: Run Preprocessing for All Datasets

**‚ö†Ô∏è IMPORTANT**: Before running this cell, ensure your datasets are uploaded to Google Drive at:
- `/content/drive/MyDrive/NetworkIDS/CSE-CIC-IDS2017/`
- `/content/drive/MyDrive/NetworkIDS/CSE-CIC-IDS2018/`
- `/content/drive/MyDrive/NetworkIDS/UNSW-NB15/`

This will take **30-60 minutes** depending on dataset size and Colab resources.

In [13]:
# Check which datasets are already processed (checkpoint recovery)
import os
processed = []
if os.path.exists(f"{BASE_DIR}/output/parquet/cicids2017_preprocessed"):
    processed.append("cicids2017")
    print("‚úÖ CIC-IDS 2017 already processed - will skip")
if os.path.exists(f"{BASE_DIR}/output/parquet/cicids2018_preprocessed"):
    processed.append("cicids2018")
    print("‚úÖ CIC-IDS 2018 already processed - will skip")
if os.path.exists(f"{BASE_DIR}/output/parquet/unsw_nb15_preprocessed"):
    processed.append("unsw_nb15")
    print("‚úÖ UNSW-NB15 already processed - will skip")

if not processed:
    print("üìù No datasets processed yet. Starting fresh...")
else:
    print(f"üìù Found {len(processed)} already processed dataset(s). Resuming...")

‚úÖ CIC-IDS 2017 already processed - will skip
‚úÖ UNSW-NB15 already processed - will skip
üìù Found 2 already processed dataset(s). Resuming...


In [14]:
from functools import reduce
import time

# Re-mount Google Drive if connection lost (important for long sessions)
if IS_COLAB:
    try:
        # Test if drive is still mounted
        os.listdir(BASE_DIR)
        print("‚úÖ Google Drive connection verified")
    except:
        print("‚ö†Ô∏è Drive disconnected! Remounting...")
        from google.colab import drive
        drive.mount('/content/drive', force_remount=True)
        print("‚úÖ Drive remounted successfully")

# Set paths
DATA_DIR = BASE_DIR  # Uses mounted drive or local data/
OUTPUT_DIR = f"{BASE_DIR}/output"

# Create output directories
import os
os.makedirs(f"{OUTPUT_DIR}/models", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/parquet", exist_ok=True)

results = {}

# ===== CIC-IDS 2017 =====
if 'cicids2017' not in processed:
    try:
        start = time.time()
        print("\n" + "="*60)
        print("üì¶ PROCESSING CIC-IDS 2017")
        print("="*60)

        df_2017 = load_cicids2017(DATA_DIR)
        df_2017, weights_2017, features_2017 = preprocess_cicids2017(df_2017, OUTPUT_DIR)

        elapsed = time.time() - start
        results['cicids2017'] = {
            'status': 'SUCCESS',
            'time': f"{elapsed/60:.2f} minutes",
            'class_weights': weights_2017,
            'num_features': len(features_2017)
        }
        print(f"‚úÖ CIC-IDS 2017 completed in {elapsed/60:.2f} minutes")
        print(f"üíæ Saved to: {OUTPUT_DIR}/parquet/cicids2017_preprocessed")

    except Exception as e:
        results['cicids2017'] = {'status': 'FAILED', 'error': str(e)}
        print(f"‚ùå CIC-IDS 2017 failed: {e}")
else:
    results['cicids2017'] = {'status': 'SKIPPED', 'reason': 'Already processed'}
    print("‚è≠Ô∏è Skipping CIC-IDS 2017 (already processed)")

# ===== CIC-IDS 2018 =====
if 'cicids2018' not in processed:
    try:
        start = time.time()
        print("\n" + "="*60)
        print("üì¶ PROCESSING CIC-IDS 2018")
        print("="*60)

        df_2018 = load_cicids2018(DATA_DIR)
        df_2018, weights_2018, features_2018 = preprocess_cicids2018(df_2018, OUTPUT_DIR)

        elapsed = time.time() - start
        results['cicids2018'] = {
            'status': 'SUCCESS',
            'time': f"{elapsed/60:.2f} minutes",
            'class_weights': weights_2018,
            'num_features': len(features_2018)
        }
        print(f"‚úÖ CIC-IDS 2018 completed in {elapsed/60:.2f} minutes")
        print(f"üíæ Saved to: {OUTPUT_DIR}/parquet/cicids2018_preprocessed")

    except Exception as e:
        results['cicids2018'] = {'status': 'FAILED', 'error': str(e)}
        print(f"‚ùå CIC-IDS 2018 failed: {e}")
else:
    results['cicids2018'] = {'status': 'SKIPPED', 'reason': 'Already processed'}
    print("‚è≠Ô∏è Skipping CIC-IDS 2018 (already processed)")

# ===== UNSW-NB15 =====
if 'unsw_nb15' not in processed:
    try:
        start = time.time()
        print("\n" + "="*60)
        print("üì¶ PROCESSING UNSW-NB15")
        print("="*60)

        df_unsw = load_unsw_nb15(DATA_DIR)
        df_unsw, weights_unsw, features_unsw = preprocess_unsw_nb15(df_unsw, OUTPUT_DIR)

        elapsed = time.time() - start
        results['unsw_nb15'] = {
            'status': 'SUCCESS',
            'time': f"{elapsed/60:.2f} minutes",
            'class_weights': weights_unsw,
            'num_features': len(features_unsw)
        }
        print(f"‚úÖ UNSW-NB15 completed in {elapsed/60:.2f} minutes")
        print(f"üíæ Saved to: {OUTPUT_DIR}/parquet/unsw_nb15_preprocessed")

    except Exception as e:
        results['unsw_nb15'] = {'status': 'FAILED', 'error': str(e)}
        print(f"‚ùå UNSW-NB15 failed: {e}")
else:
    results['unsw_nb15'] = {'status': 'SKIPPED', 'reason': 'Already processed'}
    print("‚è≠Ô∏è Skipping UNSW-NB15 (already processed)")

# ===== SUMMARY =====
print("\n" + "="*60)
print("üìä PREPROCESSING SUMMARY")
print("="*60)
for dataset, result in results.items():
    status = result['status']
    print(f"\n{dataset.upper()}: {status}")
    if status == 'SUCCESS':
        print(f"  ‚è±Ô∏è  Time: {result['time']}")
        print(f"  üè∑Ô∏è  Class Weights: {result['class_weights']}")
        print(f"  üìê Features: {result['num_features']}")
    elif status == 'FAILED':
        print(f"  ‚ùå Error: {result['error']}")
    elif status == 'SKIPPED':
        print(f"  ‚è≠Ô∏è  {result['reason']}")

print("\n" + "="*60)
print("‚úÖ All available datasets processed!")
print(f"üìÅ Output location: {OUTPUT_DIR}")
print("="*60)

‚úÖ Google Drive connection verified
‚è≠Ô∏è Skipping CIC-IDS 2017 (already processed)

üì¶ PROCESSING CIC-IDS 2018
Loading 10 CIC-IDS 2018 files...
  Loading 02-14-2018.csv...
  Loading 02-15-2018.csv...
  Loading 02-16-2018.csv...
  Loading 02-20-2018.csv...
  Loading 02-21-2018.csv...
  Loading 02-22-2018.csv...
  Loading 02-23-2018.csv...
  Loading 02-28-2018.csv...
  Loading 03-01-2018.csv...
  Loading 03-02-2018.csv...
Combining all files...
üîÑ Starting CIC-IDS 2018 preprocessing...
Step 1: Cleaning column names...
Step 2: Removing duplicates...
  Removed 433,253 duplicates
Step 3: Handling missing values...
Step 4: Handling infinite values...
Step 5: Creating binary label...
Step 6: Calculating class weights...
Step 7: Scaling features...
  Using 78 numeric features for scaling
Step 8: Saving to parquet...
‚úÖ Saved to /content/drive/MyDrive/NetworkIDS/output/parquet/cicids2018_preprocessed
‚úÖ CIC-IDS 2018 completed in 106.76 minutes
üíæ Saved to: /content/drive/MyDrive/Netw

## Step 8: Verify Output Files

Check the generated parquet files and models

In [15]:
# List output files
import os

print("üìÅ Generated Files:\n")

parquet_dir = f"{OUTPUT_DIR}/parquet"
model_dir = f"{OUTPUT_DIR}/models"

if os.path.exists(parquet_dir):
    print("Parquet Files:")
    for f in os.listdir(parquet_dir):
        size = os.path.getsize(os.path.join(parquet_dir, f)) / (1024**2)  # MB
        print(f"  ‚úì {f} ({size:.2f} MB)")

if os.path.exists(model_dir):
    print("\nScaler Models:")
    for d in os.listdir(model_dir):
        print(f"  ‚úì {d}")

# Sample data from one dataset
print("\nüìä Sample Data (CIC-IDS 2017):")
parquet_path = f"{OUTPUT_DIR}/parquet/cicids2017_preprocessed.parquet"
if os.path.exists(parquet_path):
    sample_df = spark.read.parquet(parquet_path)
    print(f"  Total Records: {sample_df.count():,}")
    print(f"  Columns: {len(sample_df.columns)}")
    print(f"\n  Schema:")
    sample_df.printSchema()
    print(f"\n  Sample Rows:")
    sample_df.select('binary_label', 'sample_weight').show(5)

üìÅ Generated Files:

Parquet Files:
  ‚úì cicids2017_preprocessed (0.00 MB)
  ‚úì unsw_nb15_preprocessed (0.00 MB)
  ‚úì cicids2018_preprocessed (0.00 MB)

Scaler Models:
  ‚úì cicids2017_scaler
  ‚úì unsw_nb15_scaler
  ‚úì cicids2018_scaler

üìä Sample Data (CIC-IDS 2017):


## Step 9: Download Results (Google Colab Only)

If running on Google Colab, download the preprocessed files to your local machine

In [16]:
try:
    from google.colab import files
    import shutil

    print("üì¶ Creating download archive...")

    # Create zip of output directory
    shutil.make_archive('/content/preprocessing_output', 'zip', OUTPUT_DIR)

    print("‚¨áÔ∏è  Downloading preprocessing_output.zip...")
    files.download('/content/preprocessing_output.zip')

    print("‚úÖ Download complete!")
    print("\nThe zip contains:")
    print("  ‚Ä¢ parquet/ - Preprocessed datasets in Parquet format")
    print("  ‚Ä¢ models/ - StandardScaler models for each dataset")

except ImportError:
    print("‚ÑπÔ∏è  Not running on Google Colab - files are already saved locally at:")
    print(f"   {OUTPUT_DIR}")
    print("\nTo download from Colab, uncomment and run this cell.")

üì¶ Creating download archive...
‚¨áÔ∏è  Downloading preprocessing_output.zip...


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

‚úÖ Download complete!

The zip contains:
  ‚Ä¢ parquet/ - Preprocessed datasets in Parquet format
  ‚Ä¢ models/ - StandardScaler models for each dataset


## Next Steps

### Using Preprocessed Data

Once preprocessing is complete, you can use the parquet files for training:

```python
# Load preprocessed data
df = spark.read.parquet(f"{OUTPUT_DIR}/parquet/cicids2017_preprocessed.parquet")

# Use sample_weight column for training
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    featuresCol='scaled_features',
    labelCol='binary_label',
    weightCol='sample_weight',  # ‚Üê Handles class imbalance
    numTrees=100
)

model = rf.fit(df)
```

### Output Structure

```
output/
‚îú‚îÄ‚îÄ parquet/
‚îÇ   ‚îú‚îÄ‚îÄ cicids2017_preprocessed.parquet
‚îÇ   ‚îú‚îÄ‚îÄ cicids2018_preprocessed.parquet
‚îÇ   ‚îî‚îÄ‚îÄ unsw_nb15_preprocessed.parquet
‚îî‚îÄ‚îÄ models/
    ‚îú‚îÄ‚îÄ cicids2017_scaler/
    ‚îú‚îÄ‚îÄ cicids2018_scaler/
    ‚îî‚îÄ‚îÄ unsw_nb15_scaler/
```

### Class Imbalance Handling

Sample weights are calculated as: `weight = max_count / class_count`

This ensures minority classes (attacks) have higher weights during training, preventing the model from becoming biased toward the majority class (benign traffic).