Anomaly detection has been a major focus of research because of its potential to detect novel attacks. However, its adoption in real-world applications has been hampered by system complexity: these systems require a substantial amount of testing, evaluation, and tuning prior to deployment.
Our project aims to help both cybersecurity experts and non-experts by notifying them of potential malicious activity and its nature. The dataset we use, CSE-CIC-IDS2018, was generated by the Canadian Institute for Cybersecurity (CIC) and the Communications Security Establishment (CSE) to support the use of anomaly detection techniques for network intrusion detection.
The attacking infrastructure includes 50 machines, and the victim organization has 5 departments comprising 420 machines and 30 servers. The dataset includes the captured network traffic and system logs of each machine, along with 80 features extracted from the captured traffic using CICFlowMeter-V3.[1]
We use AWS SageMaker for pre-processing, exploratory data analysis (EDA), model training, and testing.
The dataset used in this project was "A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)". It was accessed on 14th Nov 2021 from https://registry.opendata.aws/cse-cic-ids2018
We load the CSV files into a pandas DataFrame and reduce its memory footprint. This is done in the Preprocessing for Pickling file.
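The loading and memory-reduction step could look roughly like the sketch below. It is not the project's actual code: the helper `reduce_memory`, the tiny in-memory CSV, and the column names are illustrative assumptions, and downcasting numeric dtypes is only one common way to shrink a DataFrame before pickling it.

```python
import io
import os
import tempfile

import pandas as pd


def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Tiny in-memory CSV standing in for one of the dataset files (assumed columns).
csv_text = (
    "Dst Port,Flow Duration,Flow Byts/s,Label\n"
    "80,1000,12.5,Benign\n"
    "443,250000,3.0,Bot\n"
)
raw = pd.read_csv(io.StringIO(csv_text))
small = reduce_memory(raw)

# Pickle the reduced frame so later steps can reload it without re-parsing CSV.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "traffic.pkl")
    small.to_pickle(path)
    restored = pd.read_pickle(path)
```

Pickling preserves the downcast dtypes, which a round trip through CSV would not.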
We remove outliers and invalid entries from the dataset in Unpickle, Clean and Drop.
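As a rough illustration of this kind of cleaning, the sketch below treats infinite values, missing values, negative measurements, and duplicate rows as invalid. These rules, the helper `clean`, and the column names are assumptions made for the example; the project's own criteria may differ.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with non-finite, missing, or negative numeric values, then duplicates."""
    out = df.copy()
    numeric = out.select_dtypes(include="number").columns
    # Infinite rates are treated as missing so one dropna() handles both cases.
    out[numeric] = out[numeric].replace([np.inf, -np.inf], np.nan)
    out = out.dropna()
    # Assumption: every numeric feature in this toy frame should be non-negative.
    out = out[(out[numeric] >= 0).all(axis=1)]
    return out.drop_duplicates().reset_index(drop=True)


# Toy frame with one example of each kind of invalid row (assumed columns).
df = pd.DataFrame(
    {
        "Flow Duration": [100, 200, 300, -5, 100, 400],
        "Flow Byts/s": [1.5, np.inf, np.nan, 2.0, 1.5, 3.0],
        "Label": ["Benign", "Bot", "Bot", "Benign", "Benign", "Bot"],
    }
)
cleaned = clean(df)
```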
In EDA, we examine correlation heatmaps, both per class label and for the data distribution as a whole.
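The matrices behind such heatmaps can be computed with pandas alone. The sketch below uses synthetic data with hypothetical feature names (`feat_a`, `feat_b`, `feat_c`) and builds one correlation matrix for the whole frame and one per class label.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the processed dataset (feature names are hypothetical).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
df = pd.DataFrame(
    {
        "feat_a": x,
        "feat_b": 2 * x + rng.normal(scale=0.1, size=n),  # strongly tied to feat_a
        "feat_c": rng.normal(size=n),                      # independent noise
        "Label": np.where(np.arange(n) % 2 == 0, "Benign", "Attack"),
    }
)

# Correlation over the entire data distribution.
overall = df.drop(columns="Label").corr()

# One correlation matrix per class label.
per_class = {
    label: group.drop(columns="Label").corr()
    for label, group in df.groupby("Label")
}
```

Each matrix can then be passed to a plotting function such as `seaborn.heatmap` to produce the figure.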
We train and evaluate a baseline model in LogisticRegression.
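A minimal baseline of this kind, assuming scikit-learn and a synthetic imbalanced dataset in place of the real traffic features, might look like this. The scaling step, the `class_weight` setting, and the macro F1 metric are choices made for the sketch, not details taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for benign vs. attack flows.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale features first; logistic regression converges more reliably on scaled inputs.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
baseline.fit(X_train, y_train)

pred = baseline.predict(X_test)
macro_f1 = f1_score(y_test, pred, average="macro")
```

Macro-averaged F1 is used here because plain accuracy can look good on imbalanced traffic even when the minority class is missed.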
In Model Selection, we evaluate 3 ensemble models on a sample of the processed data and then perform hyperparameter tuning on the full dataset.
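The two-stage workflow can be sketched as follows. The three scikit-learn ensembles chosen here, the parameter grid, and the synthetic data are all assumptions for illustration; the text above does not name the models or the search space that were actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the processed dataset.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

# Stage 1: compare candidate ensembles on a small sample to keep the cost low.
X_sample, y_sample = X[:150], y[:150]
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=30, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=30, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=30, random_state=0),
}
scores = {
    name: cross_val_score(model, X_sample, y_sample, cv=3, scoring="f1_macro").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

# Stage 2: tune the best candidate on the full data with a small illustrative grid.
param_grid = {"n_estimators": [30, 60], "max_depth": [3, 6]}
search = GridSearchCV(candidates[best_name], param_grid, cv=3, scoring="f1_macro")
search.fit(X, y)
```

Selecting on a sample first means the expensive grid search runs for only one model family on the full data.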