# Feature Engineering

Feature engineering is a crucial step in preparing data for model training because it directly affects the performance and interpretability of the network anomaly detection system. The notebook proceeds as follows:

1. Load the processed datasets and confirm that the earlier preprocessing steps, such as label encoding and standardization, were applied correctly.
2. Handle missing values, since gaps in the data can distort model predictions. We use median imputation for numeric features (mode imputation is the analogous choice for categorical ones), which fills the gaps without shifting the data distribution much.
3. Select features by removing redundant, highly correlated, or low-variance columns that add noise and computational cost, which improves model efficiency and generalization.
4. Transform features by scaling numeric columns so that models, especially those relying on distance-based calculations, are not dominated by features with large numeric ranges. Categorical variables are encoded numerically so that machine learning algorithms can process them.
5. Save the transformed dataset and upload it to AWS S3, giving model training a single accessible, reproducible copy of the data.

Careful feature engineering improves model accuracy, reduces the risk of overfitting, and shortens training time, leading to a more robust and reliable network anomaly detection system.
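As an illustration of the imputation step, here is a minimal sketch of median imputation for numeric columns combined with mode imputation for everything else. The helper name `impute_missing` and the toy column names are hypothetical and do not come from the dataset. Treating infinities as missing is an added assumption, not something the notebook is known to do.

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs filled: median for numeric columns, mode for the rest."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include=[np.number]).columns
    other_cols = out.columns.difference(numeric_cols)

    # Assumption: treat +/-inf as missing so it cannot distort the medians
    out[numeric_cols] = out[numeric_cols].replace([np.inf, -np.inf], np.nan)
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    for col in other_cols:
        modes = out[col].mode(dropna=True)
        if not modes.empty:  # an all-NaN column has no mode; leave it untouched
            out[col] = out[col].fillna(modes.iloc[0])
    return out


# Toy frame with hypothetical columns
toy = pd.DataFrame({
    "flow_duration": [1.0, np.nan, 3.0, np.inf],
    "protocol": ["tcp", None, "tcp", "udp"],
})
imputed = impute_missing(toy)
```

The median is preferred over the mean here because it is less sensitive to the extreme values that are common in network flow statistics.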

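The feature-selection criteria described above (low variance and high correlation) can be sketched as follows. This is one possible approach, not necessarily the procedure used to choose the columns dropped in this notebook. The function name `select_features`, both thresholds, and the toy columns are hypothetical. In practice the target column should be excluded before running it, and the variance check should run before scaling, because standardized columns all have unit variance.

```python
import numpy as np
import pandas as pd


def select_features(df: pd.DataFrame, var_threshold: float = 0.0,
                    corr_threshold: float = 0.95):
    """Drop numeric columns with variance <= var_threshold, then drop one column
    from every pair whose absolute Pearson correlation exceeds corr_threshold."""
    numeric = df.select_dtypes(include=[np.number])

    # 1) Low-variance (here: constant) columns carry no signal
    variances = numeric.var()
    low_var = variances[variances <= var_threshold].index.tolist()
    kept = numeric.drop(columns=low_var)

    # 2) Keep only the upper triangle so each pair is inspected once,
    #    then drop the later column of every highly correlated pair
    corr = kept.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    high_corr = [c for c in upper.columns if (upper[c] > corr_threshold).any()]

    return df.drop(columns=low_var + high_corr), low_var, high_corr


# Toy frame with hypothetical columns
toy_features = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with "a"
    "c": [7, 7, 7, 7, 7],    # constant
    "d": [5, 1, 4, 2, 3],    # weakly correlated with "a"
})
reduced, low_var, high_corr = select_features(toy_features)
```

On the toy frame this removes the constant column `c` and the duplicate-information column `b`, leaving `a` and `d`.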


In [4]:
# Feature Engineering Notebook (Feature_Engineering.ipynb)

import pandas as pd
import numpy as np
import boto3  # AWS S3 for data handling
import os
from sklearn.preprocessing import StandardScaler

# 🔹 Load Processed Data from S3
s3_bucket_name = "network-anomaly-dataset-001aefd6"
processed_data_prefix = "processed/"

# List of processed files
selected_files = [
    "processed_Monday-WorkingHours.pcap_ISCX.csv",
    "processed_Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
    "processed_Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv",
    "processed_Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv",
    "processed_Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv"
]

dfs = {}

s3 = boto3.client("s3")

for file_name in selected_files:
    file_path = processed_data_prefix + file_name
    obj = s3.get_object(Bucket=s3_bucket_name, Key=file_path)
    df = pd.read_csv(obj["Body"])
    dfs[file_name] = df
    print(f" Loaded {file_name} - Shape: {df.shape}")

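# 🔹 Feature Selection - Drop Unnecessary Columns
drop_columns = ["Flow Bytes/s", "Flow Packets/s", "Fwd Header Length.1"]

for name, df in dfs.items():
    # Match on stripped names: raw ISCX CSV headers often carry leading spaces,
    # and errors="ignore" would otherwise skip such columns silently
    to_drop = [c for c in df.columns if c.strip() in drop_columns]
    df.drop(columns=to_drop, inplace=True)  # in place, so dfs[name] is already updated

print("Dropped unnecessary columns.")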

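# 🔹 Handle Missing Values
for name, df in dfs.items():
    # Treat +/-inf as missing; StandardScaler rejects infinite values
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    # numeric_only=True avoids a TypeError on non-numeric columns in pandas >= 2.0
    df.fillna(df.median(numeric_only=True), inplace=True)  # Fill NaN values with median

print("Missing values handled.")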

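# 🔹 Feature Scaling - Normalize Numeric Data
scaler = StandardScaler()
label_col = "Label"  # assumption: name of the encoded target column; adjust if it differs

for name, df in dfs.items():
    # Scale feature columns only; a label-encoded target must keep its class values
    numeric_cols = [c for c in df.select_dtypes(include=[np.number]).columns if c != label_col]
    # fit_transform refits the scaler, so each file is standardized with its own mean and std
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print("Features scaled.")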

# 🔹 Save Final Features Dataset Locally & Upload to S3
output_path = "Final_Features/"
os.makedirs(output_path, exist_ok=True)

for name, df in dfs.items():
    final_name = name.replace("processed_", "final_")
    output_file = os.path.join(output_path, final_name)
    df.to_csv(output_file, index=False)
    print(f" Saved: {output_file}")

    # Upload to S3 under the same prefix as the local directory
    s3.upload_file(output_file, s3_bucket_name, "Final_Features/" + final_name)
    print(f" Uploaded to S3: {final_name}")

print("Feature Engineering Completed!")


 Loaded processed_Monday-WorkingHours.pcap_ISCX.csv - Shape: (78701, 79)
 Loaded processed_Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv - Shape: (94342, 79)
 Loaded processed_Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv - Shape: (110641, 79)
 Loaded processed_Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv - Shape: (79071, 79)
 Loaded processed_Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv - Shape: (89472, 79)
Dropped unnecessary columns.
Missing values handled.
Features scaled.
 Saved: Final_Features/final_Monday-WorkingHours.pcap_ISCX.csv
 Uploaded to S3: final_Monday-WorkingHours.pcap_ISCX.csv
 Saved: Final_Features/final_Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
 Uploaded to S3: final_Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
 Saved: Final_Features/final_Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
 Uploaded to S3: final_Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
 Saved: Final_Features/final_Thursday-WorkingHours-Morn