# Step 1: Data Preprocessing and Splitting by Type

This notebook loads the full CICIoT2023 dataset from its CSV part files, checks for missing values, drops rows with missing values and duplicate rows, and splits the result by attack type (the `label` column). The preprocessed dataset is also saved as **preprocessed_CICIoT2023.csv** for future use.

In [1]:
# Import required libraries
import os
import glob
import pandas as pd

In [2]:
# Define the dataset directory and pattern for CSV files
dataset_dir = "./CICIoT2023"
file_pattern = os.path.join(dataset_dir, "part*.csv")

# List all CSV files
csv_files = glob.glob(file_pattern)
print("Number of CSV files found:", len(csv_files))

# Load and concatenate all CSV files into a single DataFrame
data_list = []
for file in csv_files:
    print("Reading file:", file)
    df = pd.read_csv(file)
    data_list.append(df)

data = pd.concat(data_list, ignore_index=True)
print("Combined dataset shape:", data.shape)

Number of CSV files found: 10
Reading file: ./CICIoT2023/part-00006-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Reading file: ./CICIoT2023/part-00007-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Reading file: ./CICIoT2023/part-00008-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Reading file: ./CICIoT2023/part-00004-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Reading file: ./CICIoT2023/part-00009-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Reading file: ./CICIoT2023/part-00005-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Reading file: ./CICIoT2023/part-00002-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Reading file: ./CICIoT2023/part-00003-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Reading file: ./CICIoT2023/part-00000-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Reading file: ./CICIoT2023/part-00001-363d1ba3-8ab5-4f96-bc25-4d5862db7cb9-c000.csv
Combined dataset shape: (2366956, 47)
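
The listing above shows that `glob.glob` returns the part files in no particular order (part-00006 is read first). If a reproducible row order matters downstream, one option is to sort the paths before concatenating. The following is a minimal sketch; the helper name `load_parts` is hypothetical and not part of the original notebook.

```python
import glob
import os

import pandas as pd


def load_parts(dataset_dir, pattern="part*.csv"):
    """Load all matching CSV parts in sorted filename order (hypothetical helper)."""
    # Sorting makes the concatenation order independent of the filesystem.
    paths = sorted(glob.glob(os.path.join(dataset_dir, pattern)))
    if not paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {dataset_dir!r}")
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

Under these assumptions, `data = load_parts("./CICIoT2023")` would replace the loop in the cell above and yield the same rows in a stable order.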


In [3]:
# Data Cleaning

# Check for missing values in each column
missing_counts = data.isnull().sum()
print("Missing values in each column:\n", missing_counts)

# Drop rows with missing values
data_cleaned = data.dropna().reset_index(drop=True)
print("Dataset shape after dropping missing values:", data_cleaned.shape)

# Drop duplicate rows if any
data_cleaned = data_cleaned.drop_duplicates().reset_index(drop=True)
print("Dataset shape after dropping duplicates:", data_cleaned.shape)

Missing values in each column:
 flow_duration      0
Header_Length      0
Protocol Type      0
Duration           0
Rate               0
Srate              0
Drate              0
fin_flag_number    0
syn_flag_number    0
rst_flag_number    0
psh_flag_number    0
ack_flag_number    0
ece_flag_number    0
cwr_flag_number    0
ack_count          0
syn_count          0
fin_count          0
urg_count          0
rst_count          0
HTTP               0
HTTPS              0
DNS                0
Telnet             0
SMTP               0
SSH                0
IRC                0
TCP                0
UDP                0
DHCP               0
ARP                0
ICMP               0
IPv                0
LLC                0
Tot sum            0
Min                0
Max                0
AVG                0
Std                0
Tot size           0
IAT                0
Number             0
Magnitue           0
Radius             0
Covariance         0
Variance           0
Weight             0
la
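
Printing a count for every column produces a long listing even when nothing is missing. In addition, `isnull()` does not flag infinite values, which rate-style features can in principle contain; whether this dataset has any is an assumption, not something the output above shows. A minimal sketch that reports only the affected columns (the helper name `summarize_problems` is hypothetical):

```python
import numpy as np
import pandas as pd


def summarize_problems(df):
    """Count missing and infinite values per column, keeping only affected columns."""
    missing = df.isnull().sum()
    # np.isinf only applies to numeric data, so non-numeric columns get a count of 0.
    numeric = df.select_dtypes(include="number")
    infinite = np.isinf(numeric).sum().reindex(df.columns, fill_value=0)
    report = pd.DataFrame({"missing": missing, "infinite": infinite})
    return report[(report["missing"] > 0) | (report["infinite"] > 0)]
```

An empty result from `summarize_problems(data)` would indicate that neither `dropna()` nor any infinity handling is needed.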

In [4]:
# Display the first 5 rows of the cleaned dataset to verify the preprocessing steps
print("First 5 rows of the cleaned dataset:")
print(data_cleaned.head())

First 5 rows of the cleaned dataset:
   flow_duration  Header_Length  Protocol Type  Duration       Rate  \
0       4.625763           84.0           17.0      64.0   0.432361   
1      76.135781       428611.8            8.2     151.9  42.352135   
2       4.549627          108.0            6.0      64.0   0.762226   
3       0.000000           54.0            6.0      64.0   0.000000   
4       0.000000            0.0            1.0      64.0   3.178249   

       Srate  Drate  fin_flag_number  syn_flag_number  rst_flag_number  ...  \
0   0.432361    0.0              0.0              0.0              0.0  ...   
1  42.352135    0.0              0.0              0.0              0.0  ...   
2   0.762226    0.0              0.0              1.0              0.0  ...   
3   0.000000    0.0              0.0              1.0              0.0  ...   
4   3.178249    0.0              0.0              0.0              0.0  ...   

           Std  Tot size           IAT  Number   Magnitue    

In [5]:
# Save the preprocessed dataset for future use
preprocessed_file = "preprocessed_CICIoT2023.csv"
data_cleaned.to_csv(preprocessed_file, index=False)
print(f"Preprocessed dataset saved as {preprocessed_file}. Shape: {data_cleaned.shape}")

Preprocessed dataset saved as preprocessed_CICIoT2023.csv. Shape: (2366956, 47)
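
To confirm that the saved file matches the in-memory DataFrame without reloading all 47 columns, one possibility is to read back a single column and compare row counts. This is a sketch only; the helper name `saved_row_count` is hypothetical, and it assumes the file contains a `label` column as in this notebook.

```python
import pandas as pd


def saved_row_count(csv_path, column="label"):
    """Count rows in a saved CSV by reading a single column (hypothetical helper)."""
    # usecols limits parsing to one column, which keeps memory use low.
    return len(pd.read_csv(csv_path, usecols=[column]))
```

For example, `saved_row_count(preprocessed_file) == len(data_cleaned)` should hold after the save in the cell above.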


In [6]:
# Split Data by Type based on the 'label' column

# Get unique labels in the dataset
unique_labels = data_cleaned['label'].unique()
print("Unique labels in the dataset:", unique_labels)

# Create a dictionary to store data subsets by label
data_by_label = {label: data_cleaned[data_cleaned['label'] == label] for label in unique_labels}

# Print the number of samples for each label
for label, df in data_by_label.items():
    print(f"Label: {label}, Number of samples: {len(df)}")

Unique labels in the dataset: ['DoS-UDP_Flood' 'BenignTraffic' 'DDoS-SynonymousIP_Flood'
 'DDoS-SYN_Flood' 'DDoS-ICMP_Flood' 'DDoS-RSTFINFlood' 'DDoS-UDP_Flood'
 'DDoS-ACK_Fragmentation' 'Mirai-udpplain' 'DDoS-UDP_Fragmentation'
 'DDoS-TCP_Flood' 'DoS-TCP_Flood' 'Mirai-greeth_flood' 'DDoS-PSHACK_Flood'
 'Mirai-greip_flood' 'DoS-SYN_Flood' 'MITM-ArpSpoofing'
 'Recon-HostDiscovery' 'DDoS-ICMP_Fragmentation' 'VulnerabilityScan'
 'Recon-PortScan' 'DoS-HTTP_Flood' 'Uploading_Attack' 'DNS_Spoofing' 'XSS'
 'Backdoor_Malware' 'Recon-OSScan' 'DDoS-HTTP_Flood' 'BrowserHijacking'
 'DDoS-SlowLoris' 'Recon-PingSweep' 'SqlInjection' 'DictionaryBruteForce'
 'CommandInjection']
Label: DoS-UDP_Flood, Number of samples: 168753
Label: BenignTraffic, Number of samples: 55859
Label: DDoS-SynonymousIP_Flood, Number of samples: 182094
Label: DDoS-SYN_Flood, Number of samples: 206146
Label: DDoS-ICMP_Flood, Number of samples: 364557
Label: DDoS-RSTFINFlood, Number of samples: 204892
Label: DDoS-UDP_Flood, Num
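
The dictionary comprehension above builds one boolean mask over the whole DataFrame per label, so it scans the data once for each of the 34 labels. A possible alternative, sketched here under the assumption that the label column contains no missing values (true after `dropna()`), is to group once with `groupby`. The helper name `split_by_label` is hypothetical.

```python
import pandas as pd


def split_by_label(df, label_col="label"):
    """Return per-label counts and subsets using one grouped pass (hypothetical helper)."""
    counts = df[label_col].value_counts()
    # sort=False keeps labels in order of first appearance, like unique() does.
    subsets = {label: group for label, group in df.groupby(label_col, sort=False)}
    return counts, subsets
```

Here `counts` is sorted from the most to the least frequent label, which makes class imbalance easy to see, while `subsets` plays the role of `data_by_label`.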

In [7]:
# Optionally, save each subset to a separate CSV file
output_dir = "./split_by_label"
os.makedirs(output_dir, exist_ok=True)
for label, df in data_by_label.items():
    # Create a safe filename by replacing problematic characters
    safe_label = label.replace("/", "_").replace(" ", "_")
    output_file = os.path.join(output_dir, f"{safe_label}.csv")
    df.to_csv(output_file, index=False)
    print(f"Saved {label} data to {output_file}")

Saved DoS-UDP_Flood data to ./split_by_label/DoS-UDP_Flood.csv
Saved BenignTraffic data to ./split_by_label/BenignTraffic.csv
Saved DDoS-SynonymousIP_Flood data to ./split_by_label/DDoS-SynonymousIP_Flood.csv
Saved DDoS-SYN_Flood data to ./split_by_label/DDoS-SYN_Flood.csv
Saved DDoS-ICMP_Flood data to ./split_by_label/DDoS-ICMP_Flood.csv
Saved DDoS-RSTFINFlood data to ./split_by_label/DDoS-RSTFINFlood.csv
Saved DDoS-UDP_Flood data to ./split_by_label/DDoS-UDP_Flood.csv
Saved DDoS-ACK_Fragmentation data to ./split_by_label/DDoS-ACK_Fragmentation.csv
Saved Mirai-udpplain data to ./split_by_label/Mirai-udpplain.csv
Saved DDoS-UDP_Fragmentation data to ./split_by_label/DDoS-UDP_Fragmentation.csv
Saved DDoS-TCP_Flood data to ./split_by_label/DDoS-TCP_Flood.csv
Saved DoS-TCP_Flood data to ./split_by_label/DoS-TCP_Flood.csv
Saved Mirai-greeth_flood data to ./split_by_label/Mirai-greeth_flood.csv
Saved DDoS-PSHACK_Flood data to ./split_by_label/DDoS-PSHACK_Flood.csv
Saved Mirai-greip_flood da
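
The filename cleanup above only replaces `/` and spaces. If labels could contain other characters that are unsafe in filenames, or if two labels could map to the same cleaned name, a stricter approach might look like the following sketch. Both helper names are hypothetical, and the allowed character set is an assumption chosen for illustration.

```python
import re


def safe_filename(label):
    """Replace any character outside letters, digits, '.', '_' and '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", str(label))


def build_output_names(labels):
    """Map each label to a CSV filename, raising if two labels collide."""
    names = {}
    for label in labels:
        name = safe_filename(label) + ".csv"
        if name in names.values():
            raise ValueError(f"Filename collision for label {label!r}: {name}")
        names[label] = name
    return names
```

The labels shown in this notebook contain only letters, digits, hyphens, and underscores, so they would pass through `safe_filename` unchanged.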