# Automated CSV File Aggregation and Renaming from Subfolders

This code automates the process of aggregating CSV files from multiple subfolders and renaming them with a consistent naming pattern. It is particularly useful when data files are stored in a structured directory (e.g., from different runs or scenarios) and need to be consolidated into a single destination folder with sequential and descriptive file names.

## Purpose

The primary objectives of this code are to:

1. **Define Source and Destination Paths:**  
   The code sets a `source_path` (the directory containing multiple subfolders) and a `dest_path` (the directory where the renamed CSV files will be saved). It also defines a base string (`new_file_string`) used as part of the new filenames.

2. **Ensure Destination Directory Exists:**  
   The code makes sure that the destination directory is available by creating it if it does not already exist.

3. **List and Sort Subfolders:**  
   It scans the source directory for subfolders (ignoring any files) and sorts them. The assumption is that these subfolders are organized in a predictable naming pattern such as `base-5-1`, `base-5-2`, etc.

4. **Extract and Rename CSV Files:**  
   For each subfolder, the script:
   - Constructs the path to the CSV file (specifically named `features_timeseries_60_sec.csv`).
   - Checks if the CSV file exists.
   - Reads the CSV file into a pandas DataFrame.
   - Constructs a new filename by appending an incremental number to the base string.
   - Saves the DataFrame as a new CSV file in the destination directory.
   - Provides console feedback about the operation.



In [2]:
import os
import pandas as pd

In [None]:


# 1) Define source and destination paths
# source_path = "/Users/souba636/Documents/vinnova_project_python/data-traces3-NOMS-2025/worst_parent/5/oo"
source_path = "/Users/souba636/Documents/vinnova_project_python/data/blackhole_new/5/gc"

dest_path = "/Users/souba636/Documents/vinnova_project_python/data/scenario/blackhole_var5_dec"

new_file_string = "blackhole_var5_dec"

# Make sure the destination folder exists
os.makedirs(dest_path, exist_ok=True)

# 2) List all subfolders (assuming each subfolder is named, for example, base-5-1, base-5-2, etc.)
#    If they're named differently, you'll need to adapt this logic.
subfolders = sorted(
    [folder for folder in os.listdir(source_path) 
     if os.path.isdir(os.path.join(source_path, folder))]
)

# 3) Iterate over subfolders, read the CSV, and save to new location
for i, folder in enumerate(subfolders, start=1):
    folder_path = os.path.join(source_path, folder)
    csv_file = os.path.join(folder_path, "features_timeseries_60_sec.csv")
    
    if os.path.exists(csv_file):
        # Read CSV
        df = pd.read_csv(csv_file)
        
        # Create new file name
        new_name = f"{new_file_string}_{i}_features_timeseries_60_sec.csv"
        dest_file = os.path.join(dest_path, new_name)
        
        # Save renamed CSV
        df.to_csv(dest_file, index=False)
        
        print(f"Copied and renamed {csv_file} to {dest_file}")
    else:
        print(f"No features_timeseries_60_sec.csv found in {folder_path}")


Copied and renamed /Users/souba636/Documents/vinnova_project_python/data/blackhole_new/5/gc/1/features_timeseries_60_sec.csv to /Users/souba636/Documents/vinnova_project_python/data/scenario/blackhole_var5_dec/blackhole_var5_dec_1_features_timeseries_60_sec.csv
Copied and renamed /Users/souba636/Documents/vinnova_project_python/data/blackhole_new/5/gc/10/features_timeseries_60_sec.csv to /Users/souba636/Documents/vinnova_project_python/data/scenario/blackhole_var5_dec/blackhole_var5_dec_2_features_timeseries_60_sec.csv
Copied and renamed /Users/souba636/Documents/vinnova_project_python/data/blackhole_new/5/gc/11/features_timeseries_60_sec.csv to /Users/souba636/Documents/vinnova_project_python/data/scenario/blackhole_var5_dec/blackhole_var5_dec_3_features_timeseries_60_sec.csv
Copied and renamed /Users/souba636/Documents/vinnova_project_python/data/blackhole_new/5/gc/12/features_timeseries_60_sec.csv to /Users/souba636/Documents/vinnova_project_python/data/scenario/blackhole_var5_dec/b

# CSV Leave-One-Out Merge Script

This Python script merges multiple CSV files in a folder into training and validation datasets using a leave-one-out cross-validation strategy. It repeats the process 10 times, each time keeping a different file as the validation set.

---

## Features

- **Automatic label column detection**  
  Supports `"label"`, `"class"`, `"target"`, `"y"`, or any binary column.
- **Leave-One-Out Cross-Validation**  
  Each round selects one CSV as **validation** and merges the remaining CSVs into **training**.
- **Class Ordering**  
  - Training data: All **class 0 samples** first, followed by **class 1 samples**.  
  - Validation data: Ordered in the same way.
- **Configurable Parameters**  
  Folder path, number of rounds, file pattern, and label column name.
- **Output**  
  For each round, saves:
  - `run_XX_train.csv`
  - `run_XX_val.csv`

---

## Configuration

Update the following section in the script:

```python
FOLDER = Path("/path/to/your/folder")   # Path containing your CSV files
OUTDIR = FOLDER / "merged_runs"         # Output folder for merged files
N_ROUNDS = 10                           # Number of leave-one-out rounds
LABEL_COL = None                        # Set manually if known (e.g. "label"), else auto-detected
CSV_GLOB = "*.csv"                      # Pattern to match CSV files


## Remove duplicate rows from train and test set

In [50]:
from pathlib import Path
import pandas as pd

BASE = Path("/Users/souba636/Documents/vinnova_project_python/data/cross_val_scenario")

# Get subfolders as Path objects
folders = [f for f in BASE.iterdir() if f.is_dir()]
print("Scenario folders:", [f.name for f in folders])

for folder in folders:
    train_files = sorted(folder.glob("run_*_train.csv"))
    val_files   = sorted(folder.glob("run_*_val.csv"))

    if not train_files or not val_files:
        print(f"[WARN] Skipping {folder.name} — no run CSVs found.")
        continue

    print(f"Processing scenario: {folder.name}")

    for tr_file, va_file in zip(train_files, val_files):
        # Read CSVs
        df_train = pd.read_csv(tr_file).drop(columns=["Unnamed: 0"], errors="ignore")
        df_val   = pd.read_csv(va_file).drop(columns=["Unnamed: 0"], errors="ignore")

        # Remove duplicate rows within each set
        df_train = df_train.drop_duplicates(ignore_index=True)
        df_val   = df_val.drop_duplicates(ignore_index=True)

        # Optional: Remove rows that appear in both train and val (cross-leakage)
        common_rows = df_val.merge(df_train, how="inner")
        if not common_rows.empty:
            df_val = df_val[~df_val.apply(tuple, axis=1).isin(common_rows.apply(tuple, axis=1))]
            print(f"  Removed {len(common_rows)} overlapping rows from validation set in {va_file.name}")

        # Save cleaned CSVs
        df_train.to_csv(tr_file, index=False)
        df_val.to_csv(va_file, index=False)

        print(f"  Cleaned: {tr_file.name} ({len(df_train)} rows), {va_file.name} ({len(df_val)} rows)")

print("Duplicate removal completed for all scenarios.")


Scenario folders: ['localrepair_var20_dec', 'worstparent_var10_dec', 'disflooding_var5_base', 'disflooding_var5_dec', 'worstparent_var5_dec', 'worstparent_var15_dec', 'blackhole_var20_dec', 'worstparent_var20_base', 'localrepair_var15_base', 'localrepair_var5_base', 'disflooding_var20_dec', 'blackhole_var15_base', 'localrepair_var5_dec', 'worstparent_var10_base', 'blackhole_var5_dec', 'disflooding_var5_oo', 'blackhole_var5_base', 'disflooding_var15_base', 'worstparent_var20_oo', 'disflooding_var20_oo', 'worstparent_var10_oo', 'disflooding_var20_base', 'disflooding_var10_oo', 'disflooding_var15_dec', 'localrepair_var10_base', 'localrepair_var5_oo', 'worstparent_var5_base', 'disflooding_var10_dec', 'blackhole_var20_base', 'disflooding_var15_oo', 'worstparent_var15_oo', 'blackhole_var5_oo', 'worstparent_var15_base', 'localrepair_var20_oo', 'blackhole_var20_oo', 'localrepair_var20_base', 'worstparent_var5_oo', 'localrepair_var15_dec', 'localrepair_var10_oo', 'disflooding_var10_base', 'blac

## Drop Missing cells

In [52]:
from pathlib import Path
import pandas as pd

BASE = Path("/Users/souba636/Documents/vinnova_project_python/data/cross_val_scenario")

# Get subfolders as Path objects
folders = [f for f in BASE.iterdir() if f.is_dir()]
print("Scenario folders:", [f.name for f in folders])

for folder in folders:
    train_files = sorted(folder.glob("run_*_train.csv"))
    val_files   = sorted(folder.glob("run_*_val.csv"))

    if not train_files or not val_files:
        print(f"[WARN] Skipping {folder.name} — no run CSVs found.")
        continue

    print(f"Processing scenario: {folder.name}")

    for tr_file, va_file in zip(train_files, val_files):
        # Read CSVs and remove Unnamed: 0 if exists
        df_train = pd.read_csv(tr_file).drop(columns=["Unnamed: 0"], errors="ignore")
        df_val   = pd.read_csv(va_file).drop(columns=["Unnamed: 0"], errors="ignore")

        # Remove rows with missing values
        before_train = len(df_train)
        before_val = len(df_val)
        df_train = df_train.dropna()
        df_val = df_val.dropna()


        # Save cleaned CSVs
        df_train.to_csv(tr_file, index=False)
        df_val.to_csv(va_file, index=False)

        print(
            f"  Cleaned {tr_file.name}: "
            f"{before_train - len(df_train)} rows removed (missing/duplicates)"
        )
        print(
            f"  Cleaned {va_file.name}: "
            f"{before_val - len(df_val)} rows removed (missing/duplicates)"
        )

print(" Cleaning completed for all scenarios.")


Scenario folders: ['localrepair_var20_dec', 'worstparent_var10_dec', 'disflooding_var5_base', 'disflooding_var5_dec', 'worstparent_var5_dec', 'worstparent_var15_dec', 'blackhole_var20_dec', 'worstparent_var20_base', 'localrepair_var15_base', 'localrepair_var5_base', 'disflooding_var20_dec', 'blackhole_var15_base', 'localrepair_var5_dec', 'worstparent_var10_base', 'blackhole_var5_dec', 'disflooding_var5_oo', 'blackhole_var5_base', 'disflooding_var15_base', 'worstparent_var20_oo', 'disflooding_var20_oo', 'worstparent_var10_oo', 'disflooding_var20_base', 'disflooding_var10_oo', 'disflooding_var15_dec', 'localrepair_var10_base', 'localrepair_var5_oo', 'worstparent_var5_base', 'disflooding_var10_dec', 'blackhole_var20_base', 'disflooding_var15_oo', 'worstparent_var15_oo', 'blackhole_var5_oo', 'worstparent_var15_base', 'localrepair_var20_oo', 'blackhole_var20_oo', 'localrepair_var20_base', 'worstparent_var5_oo', 'localrepair_var15_dec', 'localrepair_var10_oo', 'disflooding_var10_base', 'blac

## Files those have number of rows in validation file = 0

In [63]:
from pathlib import Path
import pandas as pd

BASE = Path("/Users/souba636/Documents/vinnova_project_python/data/cross_val_scenario")

# Get subfolders as Path objects
folders = [f for f in BASE.iterdir() if f.is_dir()]
print("Scenario folders:", [f.name for f in folders])

empty_val_files = []

for folder in folders:
    val_files = sorted(folder.glob("run_*_val.csv"))

    if not val_files:
        print(f"[WARN] Skipping {folder.name} — no validation CSVs found.")
        continue

    for va_file in val_files:
        df_val = pd.read_csv(va_file).drop(columns=["Unnamed: 0"], errors="ignore")

        if len(df_val) == 0:
            empty_val_files.append(str(va_file))

# Final report
if empty_val_files:
    print("\n Validation files with 0 rows:")
    for f in empty_val_files:
        print(" -", f)
else:
    print("\n No validation files with 0 rows found.")


Scenario folders: ['localrepair_var20_dec', 'worstparent_var10_dec', 'disflooding_var5_base', 'disflooding_var5_dec', 'worstparent_var5_dec', 'worstparent_var15_dec', 'blackhole_var20_dec', 'worstparent_var20_base', 'localrepair_var15_base', 'localrepair_var5_base', 'disflooding_var20_dec', 'blackhole_var15_base', 'localrepair_var5_dec', 'worstparent_var10_base', 'blackhole_var5_dec', 'disflooding_var5_oo', 'blackhole_var5_base', 'disflooding_var15_base', 'worstparent_var20_oo', 'disflooding_var20_oo', 'worstparent_var10_oo', 'disflooding_var20_base', 'disflooding_var10_oo', 'disflooding_var15_dec', 'localrepair_var10_base', 'localrepair_var5_oo', 'worstparent_var5_base', 'disflooding_var10_dec', 'blackhole_var20_base', 'disflooding_var15_oo', 'worstparent_var15_oo', 'blackhole_var5_oo', 'worstparent_var15_base', 'localrepair_var20_oo', 'blackhole_var20_oo', 'localrepair_var20_base', 'worstparent_var5_oo', 'localrepair_var15_dec', 'localrepair_var10_oo', 'disflooding_var10_base', 'blac

### Create 10 cross_validation files for each attacks


In [62]:
import pandas as pd
from pathlib import Path

# ==== CONFIG ====
BASE = Path("/Users/souba636/Documents/vinnova_project_python/data/scenario/disflooding_var10_base")  # path with CSV files
OUTDIR = Path("/Users/souba636/Documents/vinnova_project_python/data/cross_val_scenario/disflooding_var10_base")
TARGET_RUNS = 10                        # how many valid runs we want
LABEL_COL = None                        # set manually if known
CSV_GLOB = "*.csv"                      # match all CSV files in BASE

# List all CSV files
files = sorted(BASE.glob(CSV_GLOB))
total_files = len(files)
print(f"Found {total_files} CSV files in {BASE}")
if total_files < TARGET_RUNS:
    raise ValueError(f"Found {total_files} CSVs but TARGET_RUNS={TARGET_RUNS}.")

# Detect label column from first file
def detect_label_col(df: pd.DataFrame):
    for c in ["label", "class", "target", "y"]:
        if c in df.columns and set(df[c].dropna().unique()).issubset({0, 1}):
            return c
    for c in df.columns[::-1]:
        vals = df[c].dropna().unique()
        if len(vals) <= 3 and set(vals).issubset({0, 1}):
            return c
    raise ValueError("Couldn't detect the label column. Set LABEL_COL manually.")

first_df = pd.read_csv(files[0]).drop(columns=["Unnamed: 0"], errors="ignore")
label_col = LABEL_COL or detect_label_col(first_df)
print(f"Using label column: {label_col}")

# Create output directory
OUTDIR.mkdir(parents=True, exist_ok=True)

# Loop until we get TARGET_RUNS valid runs
valid_run_count = 0
skipped_runs = []

for i in range(total_files):
    if valid_run_count >= TARGET_RUNS:
        break  # Stop when we have enough valid runs

    val_file = files[i]
    train_files = [f for f in files if f != val_file]

    # Load and split training data by class
    zeros, ones = [], []
    for f in train_files:
        df = pd.read_csv(f).drop(columns=["Unnamed: 0"], errors="ignore")
        if label_col not in df.columns:
            raise ValueError(f"{f.name} does not have label column '{label_col}'")
        zeros.append(df[df[label_col] == 0])
        ones.append(df[df[label_col] == 1])

    # Combine and clean train data
    train_0 = pd.concat(zeros, ignore_index=True) if zeros else pd.DataFrame(columns=first_df.columns)
    train_1 = pd.concat(ones, ignore_index=True) if ones else pd.DataFrame(columns=first_df.columns)
    train_df = pd.concat([train_0, train_1], ignore_index=True)

    # Load and order validation data
    val_df = pd.read_csv(val_file).drop(columns=["Unnamed: 0"], errors="ignore")
    val_df = pd.concat(
        [val_df[val_df[label_col] == 0], val_df[val_df[label_col] == 1]],
        ignore_index=True
    )

    # === Cleaning ===
    train_df = train_df.dropna().drop_duplicates(ignore_index=True)
    val_df = val_df.dropna().drop_duplicates(ignore_index=True)

    # Remove duplicate rows across train and validation (leakage)
    common_rows = val_df.merge(train_df, how="inner")
    if not common_rows.empty:
        val_df = val_df[~val_df.apply(tuple, axis=1).isin(common_rows.apply(tuple, axis=1))]

    # If validation set becomes empty, skip this run
    if len(val_df) == 0:
        print(f"[SKIP] Run {i+1:02d} skipped — validation empty after cleaning")
        skipped_runs.append(val_file.name)
        continue

    # Save cleaned sets
    run_name = f"run_{valid_run_count+1:02d}"  # valid run numbering
    train_path = OUTDIR / f"{run_name}_train.csv"
    val_path = OUTDIR / f"{run_name}_val.csv"
    train_df.to_csv(train_path, index=False)
    val_df.to_csv(val_path, index=False)

    # Summary
    c0_train, c1_train = (train_df[label_col] == 0).sum(), (train_df[label_col] == 1).sum()
    c0_val, c1_val = (val_df[label_col] == 0).sum(), (val_df[label_col] == 1).sum()
    print(f"[{run_name}] val={val_file.name} | train: 0={c0_train}, 1={c1_train} | val: 0={c0_val}, 1={c1_val}")

    valid_run_count += 1

print(f" Done. Created {valid_run_count} valid runs. Files written to: {OUTDIR}")

if skipped_runs:
    print("\n Skipped runs due to empty validation set after cleaning:")
    for f in skipped_runs:
        print(" -", f)


Found 20 CSV files in /Users/souba636/Documents/vinnova_project_python/data/scenario/disflooding_var10_base
Using label column: label
[run_01] val=disflooding_var10_base_10_features_timeseries_60_sec.csv | train: 0=6585, 1=7795 | val: 0=408, 1=433
[run_02] val=disflooding_var10_base_11_features_timeseries_60_sec.csv | train: 0=6592, 1=7794 | val: 0=401, 1=434
[SKIP] Run 03 skipped — validation empty after cleaning
[run_03] val=disflooding_var10_base_13_features_timeseries_60_sec.csv | train: 0=6608, 1=7794 | val: 0=385, 1=434
[run_04] val=disflooding_var10_base_14_features_timeseries_60_sec.csv | train: 0=6594, 1=7801 | val: 0=399, 1=427
[run_05] val=disflooding_var10_base_15_features_timeseries_60_sec.csv | train: 0=6640, 1=7794 | val: 0=353, 1=434
[run_06] val=disflooding_var10_base_16_features_timeseries_60_sec.csv | train: 0=6654, 1=7794 | val: 0=339, 1=434
[run_07] val=disflooding_var10_base_17_features_timeseries_60_sec.csv | train: 0=6626, 1=7794 | val: 0=367, 1=434
[run_08] val

### Check for 

1. missing values

2. duplicate rows (within each file)

3. train/val overlap (data leakage) per run

4. label column validity (binary {0,1}, no NaNs)

5. schema consistency (same columns across all files)

6. column count (should be 14 features + 1 label)

7. non‑numeric feature values

8. constant (zero-variance) feature columns

9. class counts and ordering (0s before 1s)

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np

BASE = Path("/Users/souba636/Documents/vinnova_project_python/data/cross_val_scenario")

# Get subfolders as Path objects
folders = [f for f in BASE.iterdir() if f.is_dir()]
print("Scenario folders:", [f.name for f in folders])

# === PARAMETERS ===
LABEL_COL = None       # set if known, else auto-detect
EXPECTED_FEATURES = 14 # you said 14 features + 1 label = 15 cols

# --- Helper functions ---
def detect_label_col(df: pd.DataFrame):
    for c in ["label", "class", "target", "y"]:
        if c in df.columns and set(df[c].dropna().unique()).issubset({0,1}):
            return c
    for c in df.columns[::-1]:
        vals = df[c].dropna().unique()
        if len(vals) <= 3 and set(vals).issubset({0,1}):
            return c
    raise ValueError("Could not detect label column; set LABEL_COL explicitly.")

def is_ordered_zeros_then_ones(s: pd.Series) -> bool:
    arr = s.to_numpy()
    return not np.any((arr[1:] == 0) & (arr[:-1] == 1))

# Collect all reports
reports = []

for folder in folders:
    train_files = sorted(folder.glob("run_*_train.csv"))
    val_files   = sorted(folder.glob("run_*_val.csv"))
    
    if not train_files or not val_files:
        print(f"[WARN] Skipping {folder.name} — no run CSVs found.")
        continue
    
    # Detect label column
    first_df = pd.read_csv(train_files[0], nrows=1000)
    label_col = LABEL_COL or detect_label_col(first_df)

    for tr, va in zip(train_files, val_files):
        run_id = tr.stem.replace("_train","")

        df_train = pd.read_csv(tr)
        df_val   = pd.read_csv(va)

        # Basic checks
        same_cols = list(df_train.columns) == list(df_val.columns)
        col_count_ok = len(df_train.columns) == (EXPECTED_FEATURES + 1)
        miss_train = int(df_train.isna().sum().sum())
        miss_val   = int(df_val.isna().sum().sum())
        dup_train  = int(df_train.duplicated().sum())
        dup_val    = int(df_val.duplicated().sum())
        # leakage    = int(len(df_val.merge(df_train, on=list(df_train.columns), how="inner")))
        duplicates = df_val.merge(df_train, on=list(df_train.columns), how="inner")
        leakage = len(duplicates)
        leakage_label0 = int((duplicates[label_col] == 0).sum())
        leakage_label1 = int((duplicates[label_col] == 1).sum())
        ordered_train = is_ordered_zeros_then_ones(df_train[label_col])
        ordered_val   = is_ordered_zeros_then_ones(df_val[label_col])

        reports.append({
            "scenario": folder.name,
            "run": run_id,
            "total_rows_train": len(df_train),
            "total_rows_val": len(df_val),
            "missing_cells_train": miss_train,
            "missing_cells_val": miss_val,
            "dup_rows_train": dup_train,
            "dup_rows_val": dup_val,
            "leakage_rows_train∩val": leakage,
            "leakage_label0_count": leakage_label0,
            "leakage_label1_count": leakage_label1,
            "same_schema": same_cols,
            "col_count_ok": col_count_ok,
            "ordered_train_0_then_1": ordered_train,
            "ordered_val_0_then_1": ordered_val
        })

# Save report
report_df = pd.DataFrame(reports)
report_path = BASE / "cross_val_validation_report.csv"
report_df.to_csv(report_path, index=False)
print(f"\nValidation report saved: {report_path}")

# Quick summary
print("\nSummary of issues:")
print(report_df[
    (~report_df["same_schema"]) |
    (~report_df["col_count_ok"]) |
    (report_df["missing_cells_train"] > 0) |
    (report_df["missing_cells_val"] > 0) |
    (report_df["dup_rows_train"] > 0) |
    (report_df["dup_rows_val"] > 0) |
    (report_df["leakage_rows_train∩val"] > 0) |
    (~report_df["ordered_train_0_then_1"]) |
    (~report_df["ordered_val_0_then_1"])
])


In [71]:
from pathlib import Path
import pandas as pd
import numpy as np

BASE = Path("/Users/souba636/Documents/vinnova_project_python/data/cross_val_scenario")

# Get subfolders as Path objects
folders = [f for f in BASE.iterdir() if f.is_dir()]
print("Scenario folders:", [f.name for f in folders])

# === PARAMETERS ===
LABEL_COL = "label"       # set if known, else auto-detect
EXPECTED_FEATURES = 14 # you said 14 features + 1 label = 15 cols

# Collect all reports
results = []

for folder in folders:
    train_files = sorted(folder.glob("run_*_train.csv"))
    val_files   = sorted(folder.glob("run_*_val.csv"))
    if not train_files or not val_files:
        print(f"[WARN] Skipping {folder.name} — no train/val CSVs found.")
        continue

    for tr_file, va_file in zip(train_files, val_files):
        # Load data
        print(f"Processing scenario: {folder.name}")
        print(f"Train_file: {tr_file}")
        print(f"Processing: {tr_file.name} and {va_file.name}")
        df_train = pd.read_csv(tr_file)
        df_val   = pd.read_csv(va_file)

        print(f"df_train columns: {df_train.columns.tolist()}")
        print(f"df_val columns: {df_val.columns.tolist()}")
        

        # if LABEL_COL not in df_train.columns or LABEL_COL not in df_val.columns:
        #    raise ValueError(f"Label column '{LABEL_COL}' not found in {tr_file.name} or {va_file.name}")

        # Count labels
        train_label0 = (df_train[LABEL_COL] == 0).sum()
        train_label1 = (df_train[LABEL_COL] == 1).sum()
        val_label0   = (df_val[LABEL_COL] == 0).sum()
        val_label1   = (df_val[LABEL_COL] == 1).sum()

        # Append results
        results.append({
            "scenario": folder.name,
            "run": tr_file.stem.replace("_train", ""),
            "train_label0": train_label0,
            "train_label1": train_label1,
            "val_label0": val_label0,
            "val_label1": val_label1
        })

# Save summary CSV
summary_df = pd.DataFrame(results)
summary_path = BASE / "label_counts_summary.csv"
summary_df.to_csv(summary_path, index=False)

print(f"\n Summary saved to {summary_path}")


Scenario folders: ['localrepair_var20_dec', 'worstparent_var10_dec', 'disflooding_var5_base', 'disflooding_var5_dec', 'worstparent_var5_dec', 'worstparent_var15_dec', 'blackhole_var20_dec', 'worstparent_var20_base', 'localrepair_var15_base', 'localrepair_var5_base', 'disflooding_var20_dec', 'blackhole_var15_base', 'localrepair_var5_dec', 'worstparent_var10_base', 'blackhole_var5_dec', 'disflooding_var5_oo', 'blackhole_var5_base', 'disflooding_var15_base', 'worstparent_var20_oo', 'disflooding_var20_oo', 'worstparent_var10_oo', 'disflooding_var20_base', 'disflooding_var10_oo', 'disflooding_var15_dec', 'localrepair_var10_base', 'localrepair_var5_oo', 'worstparent_var5_base', 'disflooding_var10_dec', 'blackhole_var20_base', 'disflooding_var15_oo', 'worstparent_var15_oo', 'blackhole_var5_oo', 'worstparent_var15_base', 'localrepair_var20_oo', 'blackhole_var20_oo', 'localrepair_var20_base', 'worstparent_var5_oo', 'localrepair_var15_dec', 'localrepair_var10_oo', 'disflooding_var10_base', 'blac

## MMD Between the domains

In [77]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

# ==== CONFIG ====
BASE = Path("/Users/souba636/Documents/vinnova_project_python/data/cross_val_scenario")
LABEL_COL = "label"     # Change if needed
CSV_TYPE = "train"      # "train" or "val"
SIGMA = 1.0             # RBF kernel width
RUNS = range(1, 2)      # run_01 ... run_10

# ==== MMD FUNCTIONS ====
def rbf_kernel(X, Y, sigma):
    XX = np.sum(X**2, axis=1)[:, None]
    YY = np.sum(Y**2, axis=1)[None, :]
    d2 = XX + YY - 2 * (X @ Y.T)
    return np.exp(-d2 / (2 * sigma**2 + 1e-12))

def mmd_rbf(X, Y, sigma):
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    return float(Kxx.mean() + Kyy.mean() - 2 * Kxy.mean())

# ==== FOLDERS ====
folders = sorted([f for f in BASE.iterdir() if f.is_dir()])
folder_names = [f.name for f in folders]
print(f"Found {len(folders)} folders.")

# ==== RESULTS ====
all_results = []

for run_id in RUNS:
    print(f"\n=== Processing run_{run_id:02d} ===")
    
    # Step 1: Pre-load all datasets for this run
    datasets = []
    valid_names = []
    for folder in folders:
        fpath = folder / f"run_{run_id:02d}_{CSV_TYPE}.csv"
        if not fpath.exists():
            continue
        df = pd.read_csv(fpath).drop(columns=["Unnamed: 0"], errors="ignore")
        if LABEL_COL in df.columns:
            df = df.drop(columns=[LABEL_COL])
        df = df.apply(pd.to_numeric, errors="coerce").dropna()
        if df.shape[0] > 0 and df.shape[1] > 0:
            datasets.append(df.to_numpy())
            valid_names.append(folder.name)
    
    n = len(datasets)
    print(f"Found {n} valid datasets for run_{run_id:02d}: {valid_names}")
    if n < 2:
        print(f"[SKIP] Not enough datasets for run_{run_id:02d}")
        continue
    
    # Step 2: Compute lower-triangular MMD matrix
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            val = mmd_rbf(datasets[i], datasets[j], SIGMA)
            print(f"MMD({valid_names[i]}, {valid_names[j]}) = {val:.4f}")
            M[i, j] = val
            M[j, i] = val  # fill symmetric position
    
    # Step 3: Store results in a long-form table
    for i in range(n):
        for j in range(i):
            all_results.append({
                "run": run_id,
                "folder_a": valid_names[i],
                "folder_b": valid_names[j],
                "mmd": M[i, j]
            })
    
    # Step 4: Save CSV + heatmap for this run
    mat_df = pd.DataFrame(M, index=valid_names, columns=valid_names)
    mat_df.to_csv(BASE / f"mmd_matrix_run_{run_id:02d}_{CSV_TYPE}.csv")
    
    plt.figure(figsize=(10, 8))
    plt.imshow(M, interpolation="nearest", aspect="auto")
    plt.xticks(range(n), valid_names, rotation=90)
    plt.yticks(range(n), valid_names)
    plt.colorbar(label="MMD")
    plt.title(f"MMD Heatmap — run_{run_id:02d} ({CSV_TYPE})")
    plt.tight_layout()
    plt.savefig(BASE / f"mmd_heatmap_run_{run_id:02d}_{CSV_TYPE}.png", dpi=200)
    plt.close()

# ==== Save long-form results ====
results_df = pd.DataFrame(all_results)
results_df.to_csv(BASE / f"mmd_results_{CSV_TYPE}.csv", index=False)
print(f"All results saved to {BASE}/mmd_results_{CSV_TYPE}.csv")


Found 48 folders.

=== Processing run_01 ===
Found 48 valid datasets for run_01: ['blackhole_var10_base', 'blackhole_var10_dec', 'blackhole_var10_oo', 'blackhole_var15_base', 'blackhole_var15_dec', 'blackhole_var15_oo', 'blackhole_var20_base', 'blackhole_var20_dec', 'blackhole_var20_oo', 'blackhole_var5_base', 'blackhole_var5_dec', 'blackhole_var5_oo', 'disflooding_var10_base', 'disflooding_var10_dec', 'disflooding_var10_oo', 'disflooding_var15_base', 'disflooding_var15_dec', 'disflooding_var15_oo', 'disflooding_var20_base', 'disflooding_var20_dec', 'disflooding_var20_oo', 'disflooding_var5_base', 'disflooding_var5_dec', 'disflooding_var5_oo', 'localrepair_var10_base', 'localrepair_var10_dec', 'localrepair_var10_oo', 'localrepair_var15_base', 'localrepair_var15_dec', 'localrepair_var15_oo', 'localrepair_var20_base', 'localrepair_var20_dec', 'localrepair_var20_oo', 'localrepair_var5_base', 'localrepair_var5_dec', 'localrepair_var5_oo', 'worstparent_var10_base', 'worstparent_var10_dec', 

In [4]:
import os
import re
from pathlib import Path

folder = Path("/Users/souba636/Documents/IoT_attack_CL_IDS/data/attack_data/")

for root, dirs, files in os.walk(folder):
    for fname in files:
        # only target files ending with the pattern
        if fname.endswith("_features_timeseries_60_sec.csv"):
            m = re.search(r"(\d+_features_timeseries_60_sec\.csv)$", fname)
            if m:
                new_name = m.group(1)  # keep only from the number onwards
                old_path = os.path.join(root, fname)
                new_path = os.path.join(root, new_name)
                
                # rename only if different
                if old_path != new_path:
                    os.rename(old_path, new_path)
                    print(f"Renamed in {root}: {fname} -> {new_name}")

Renamed in /Users/souba636/Documents/IoT_attack_CL_IDS/data/attack_data/localrepair_var20_dec: localrepair_var20_dec_16_features_timeseries_60_sec.csv -> 16_features_timeseries_60_sec.csv
Renamed in /Users/souba636/Documents/IoT_attack_CL_IDS/data/attack_data/localrepair_var20_dec: localrepair_var20_dec_10_features_timeseries_60_sec.csv -> 10_features_timeseries_60_sec.csv
Renamed in /Users/souba636/Documents/IoT_attack_CL_IDS/data/attack_data/localrepair_var20_dec: localrepair_var20_dec_7_features_timeseries_60_sec.csv -> 7_features_timeseries_60_sec.csv
Renamed in /Users/souba636/Documents/IoT_attack_CL_IDS/data/attack_data/localrepair_var20_dec: localrepair_var20_dec_1_features_timeseries_60_sec.csv -> 1_features_timeseries_60_sec.csv
Renamed in /Users/souba636/Documents/IoT_attack_CL_IDS/data/attack_data/localrepair_var20_dec: localrepair_var20_dec_11_features_timeseries_60_sec.csv -> 11_features_timeseries_60_sec.csv
Renamed in /Users/souba636/Documents/IoT_attack_CL_IDS/data/atta

In [5]:
import os, glob, re, random
import numpy as np
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader
from pathlib import Path

# -----------------------
# Config
# -----------------------
BASE = Path("/Users/souba636/Documents/IoT_attack_CL_IDS/data/attack_data/blackhole_var5_base")  # folder that directly contains the 20 CSVs
DROP_COLS = ["Unnamed: 0"]
SEQUENCE_LENGTH = 10
BATCH_SIZE = 128
RANDOM_STATE = 42
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# -----------------------
# File discovery (supports both spellings)
# -----------------------

pat1 = os.path.join(BASE, "*_features_timeseries_60_sec.csv")   # common variant
files = glob.glob(pat1) + glob.glob(pat1)
assert len(files) >= 20, f"Expected at least 20 files, found {len(files)} in {BASE}"

def extract_index(path):
    # pull the number between ..._ and _60_sec.csv
    m = re.search(r"_(\d+)_60_sec\.csv$", os.path.basename(path))
    return int(m.group(1)) if m else 10**9  # push unknowns to end

files = sorted(files, key=extract_index)[:20]  # ensure exactly 20, ordered

random.seed(RANDOM_STATE)
random.shuffle(files)
train_files = files[:16]
test_files  = files[16:20]

print("Train files:", [os.path.basename(f) for f in train_files])
print("Test  files:", [os.path.basename(f) for f in test_files])

# -----------------------
# Utilities
# -----------------------
def load_csv(path):
    df = pd.read_csv(path)
    for c in DROP_COLS:
        if c in df.columns:
            df = df.drop(columns=[c])
    assert "label" in df.columns, f"'label' column missing in {os.path.basename(path)}"
    return df

def seq_maker(df, sequence_length=10, label_col="label"):
    df_feat = df.drop(columns=[label_col])
    labels = df[label_col].astype(int).values

    attack_idxs = np.where(labels == 1)[0]
    if len(attack_idxs) == 0:
        start_attack = len(labels) + sequence_length   # all zeros
    else:
        start_attack = max(0, attack_idxs[0] - sequence_length)

    sequences = []
    for i in range(len(df_feat) - sequence_length):
        sequences.append(df_feat.iloc[i:i+sequence_length].values.flatten())

    if not sequences:
        return pd.DataFrame(columns=[*range(df_feat.shape[1]*sequence_length), "label"])

    seq_df = pd.DataFrame(sequences)
    zeros = [0] * min(start_attack, len(seq_df))
    ones  = [1] * (len(seq_df) - len(zeros))
    seq_df["label"] = zeros + ones
    return seq_df

def safe_minmax_normalize(df, global_min, global_max, label_col="label"):
    feat_cols = [c for c in df.columns if c != label_col]
    denom = (global_max - global_min).replace(0, 1)  # avoid div/0
    out = df.copy()
    out[feat_cols] = (out[feat_cols] - global_min) / denom
    out[feat_cols] = out[feat_cols].replace([np.inf, -np.inf], np.nan).fillna(0.0)
    return out

# -----------------------
# Load all, compute train-only global min/max (excluding 'label')
# -----------------------
train_dfs = [load_csv(p) for p in train_files]
test_dfs  = [load_csv(p) for p in test_files]

feat_cols = [c for c in train_dfs[0].columns if c != "label"]
train_feat_mins = [df[feat_cols].min(axis=0) for df in train_dfs]
train_feat_maxs = [df[feat_cols].max(axis=0) for df in train_dfs]
global_min = pd.concat(train_feat_mins, axis=1).min(axis=1)
global_max = pd.concat(train_feat_maxs, axis=1).max(axis=1)

# -----------------------
# Normalize using train stats
# -----------------------
norm_train = [safe_minmax_normalize(df, global_min, global_max, "label") for df in train_dfs]
norm_test  = [safe_minmax_normalize(df, global_min, global_max, "label") for df in test_dfs]

# -----------------------
# Sequence-ify and concat
# -----------------------
seq_train_parts = [seq_maker(df, SEQUENCE_LENGTH, "label") for df in norm_train]
seq_test_parts  = [seq_maker(df, SEQUENCE_LENGTH, "label") for df in norm_test]

seq_train_parts = [df for df in seq_train_parts if not df.empty]
seq_test_parts  = [df for df in seq_test_parts if not df.empty]

seq_train = pd.concat(seq_train_parts, ignore_index=True)
seq_test  = pd.concat(seq_test_parts,  ignore_index=True)

# -----------------------
# Tensors & Dataloaders
# -----------------------
X_train = torch.tensor(seq_train.iloc[:, :-1].values, dtype=torch.float32)
y_train = torch.tensor(seq_train.iloc[:,  -1].values.astype(int), dtype=torch.long)
X_test  = torch.tensor(seq_test.iloc[:,  :-1].values, dtype=torch.float32)
y_test  = torch.tensor(seq_test.iloc[:,   -1].values.astype(int), dtype=torch.long)

X_train = torch.nan_to_num(X_train, nan=0.0)
X_test  = torch.nan_to_num(X_test,  nan=0.0)

feature_dim = X_train.shape[1]  # (#features * SEQUENCE_LENGTH)
X_train = X_train.view(-1, 1, feature_dim)
X_test  = X_test.view(-1, 1, feature_dim)

train_dataset = TensorDataset(X_train, y_train)
test_dataset  = TensorDataset(X_test,  y_test)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader  = DataLoader(test_dataset,  batch_size=BATCH_SIZE, shuffle=False)

print("X_train:", X_train.shape, "y_train:", y_train.shape)
print("X_test :", X_test.shape,  "y_test :", y_test.shape)


Train files: ['20_features_timeseries_60_sec.csv', '2_features_timeseries_60_sec.csv', '10_features_timeseries_60_sec.csv', '4_features_timeseries_60_sec.csv', '3_features_timeseries_60_sec.csv', '16_features_timeseries_60_sec.csv', '1_features_timeseries_60_sec.csv', '17_features_timeseries_60_sec.csv', '13_features_timeseries_60_sec.csv', '6_features_timeseries_60_sec.csv', '11_features_timeseries_60_sec.csv', '5_features_timeseries_60_sec.csv', '8_features_timeseries_60_sec.csv', '19_features_timeseries_60_sec.csv', '12_features_timeseries_60_sec.csv', '7_features_timeseries_60_sec.csv']
Test  files: ['9_features_timeseries_60_sec.csv', '15_features_timeseries_60_sec.csv', '14_features_timeseries_60_sec.csv', '18_features_timeseries_60_sec.csv']
X_train: torch.Size([8640, 1, 140]) y_train: torch.Size([8640])
X_test : torch.Size([2160, 1, 140]) y_test : torch.Size([2160])
