<a href="https://colab.research.google.com/github/vangapandukundan/Threat-Detection-in-Cyber-Security-Using-AI/blob/v2-harshitha/datapreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [18]:
!unzip -u /content/MachineLearningCSV.zip -d /content/MachineLearningCSV
!unzip -u /content/GeneratedLabelledFlows.zip -d /content/GeneratedLabelledFlows

Archive:  /content/MachineLearningCSV.zip
Archive:  /content/GeneratedLabelledFlows.zip


In [27]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Function to read and combine all CSV files in a folder
def load_all_csvs_from_folder(folder_path):
    all_dataframes = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".csv"):
            file_path = os.path.join(folder_path, filename)
            try:
                df = pd.read_csv(file_path, low_memory=False)
                all_dataframes.append(df)
                print(f"Loaded: {filename}")
            except Exception as e:
                print(f"Error reading {filename}: {e}")
    # Combine all DataFrames
    combined_df = pd.concat(all_dataframes, ignore_index=True)
    return combined_df

# Function to preprocess the dataset
def preprocess_data(df):
    # Drop columns with too many missing values
    df.dropna(axis=1, thresh=len(df) * 0.6, inplace=True)

    # Fill remaining missing values with 0
    df.fillna(0, inplace=True)

    # Drop non-informative or redundant columns
    df.drop(columns=[col for col in df.columns if 'Flow ID' in col or 'Timestamp' in col], errors='ignore', inplace=True)

    # Encode categorical variables
    for column in df.select_dtypes(include=['object']).columns:
        if column != 'Label':
            le = LabelEncoder()
            df[column] = le.fit_transform(df[column].astype(str))

    # Encode labels (binary classification: normal vs attack)
    df['Label'] = df['Label'].apply(lambda x: 0 if 'BENIGN' in str(x).upper() else 1)

    return df

# Function to train and evaluate a Random Forest model
def train_improved_model(df):
    X = df.drop('Label', axis=1)
    y = df['Label']

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Split into training and testing data
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42, stratify=y)

    # Train a Random Forest Classifier
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = clf.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)

    return report

# Example usage
folder_path = "/content/MachineLearningCSV/MachineLearningCVE"  # Replace with your real path

try:
    raw_df = load_all_csvs_from_folder(folder_path)
    processed_df = preprocess_data(raw_df)
    model_report = train_improved_model(processed_df)
except Exception as e:
    model_report = str(e)

print(model_report)


Loaded: Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
Loaded: Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
Loaded: Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
Loaded: Monday-WorkingHours.pcap_ISCX.csv
Loaded: Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
Loaded: Friday-WorkingHours-Morning.pcap_ISCX.csv
Loaded: Tuesday-WorkingHours.pcap_ISCX.csv
Loaded: Wednesday-workingHours.pcap_ISCX.csv
'Label'


The code performs the following preprocessing steps on the dataset:

Handles missing values: It drops columns with more than 40% missing values and fills the remaining missing values with 0.
Drops irrelevant columns: It removes columns like 'Flow ID' and 'Timestamp' which are not useful for the model.
Encodes categorical features: It uses LabelEncoder to convert categorical columns (except 'Label') into numerical representations.
Encodes the 'Label' column: It converts the 'Label' column into a binary format (0 for 'BENIGN' and 1 for attacks).