# HYBRID INTRUSION DETECTION SYSTEM (PLACEHOLDER)

This is the code for the paper " "

Author: William S. Ventura (w.stephan.ventura@gmail.com)
Organization: Whiting School of Engineering, Johns Hopkins University

## IMPORT LIBRARIES

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from genetic_selection import GeneticSelectionCV
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.tree import DecisionTreeClassifier
from sklearn_genetic.plots import plot_fitness_evolution
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_recall_fscore_support
from sklearn.metrics import f1_score, roc_auc_score
import hdbscan
import xgboost as xgb
from xgboost import plot_importance



## Reading in the CIC_Collection Dataset

The CIC Collection dataset combines CIC-IDS2017, CIC-DoS2017, CSE-CIC-IDS2018, and CIC-DDoS2019. It is publicly available on Kaggle at https://www.kaggle.com/code/dhoogla/cic-collection-00-clean-up

Note: The cleaned-up version of this collection removes contaminating features found in the CIC datasets and other NIDS datasets. Such features give a model equal blind predictive power across all available attack classes, even when it had access to only one attack class during training.
The features that contaminate the CIC collection in this way are, in order of severity:

    PSH Flag Count, ECE Flag Count, RST Flag Count, ACK Flag Count
    Fwd Packet Length Min
    Bwd Packet Length Min
    Packet Length Min
    Protocol


Note: This project initially used the original cic-collection and was later modified to use the cleaned-up version. Both .parquet files are available in the "ids_data" folder.

Note: Because of the dataset's size and hardware limitations, a sampled subset of CIC_Collection is used. The subsets are in the "ids_data" folder.

In [3]:
# Load Dataset
pd.set_option('display.max_columns', None)
df = pd.read_parquet("./ids_data/cic-collection.parquet")

In [4]:
df.shape

(9167581, 79)

## Improving the CIC Collection
### 1. Removing contaminating features

In [5]:
df.columns

Index(['Protocol', 'Flow Duration', 'Total Fwd Packets',
       'Total Backward Packets', 'Fwd Packets Length Total',
       'Bwd Packets Length Total', 'Fwd Packet Length Max',
       'Fwd Packet Length Min', 'Fwd Packet Length Mean',
       'Fwd Packet Length Std', 'Bwd Packet Length Max',
       'Bwd Packet Length Min', 'Bwd Packet Length Mean',
       'Bwd Packet Length Std', 'Flow Bytes/s', 'Flow Packets/s',
       'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min',
       'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std', 'Fwd IAT Max',
       'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean', 'Bwd IAT Std',
       'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags', 'Bwd PSH Flags',
       'Fwd URG Flags', 'Bwd URG Flags', 'Fwd Header Length',
       'Bwd Header Length', 'Fwd Packets/s', 'Bwd Packets/s',
       'Packet Length Min', 'Packet Length Max', 'Packet Length Mean',
       'Packet Length Std', 'Packet Length Variance', 'FIN Flag Count',
       'SYN Flag Count', 'RST Fla

In [6]:
# Drop the contaminating features listed above, plus 'Down/Up Ratio' (columns= already implies the column axis)
df = df.drop(columns=['PSH Flag Count', 'ECE Flag Count', 'RST Flag Count', 'ACK Flag Count', 'Fwd Packet Length Min', 'Bwd Packet Length Min', 'Packet Length Min', 'Protocol', 'Down/Up Ratio'])
df.shape

(9167581, 70)

### 2. Removing features with no separating power

For the CIC collection, 11 features with zero predictive power have been identified, based on the findings in "Discovering Non-Metadata Contaminant Features in Intrusion Detection Datasets".

Link:
https://www.researchgate.net/publication/363265363_Discovering_Non-Metadata_Contaminant_Features_in_Intrusion_Detection_Datasets?channel=doi&linkId=6314afa85eed5e4bd1478531&showFulltext=true
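
One quick, partial check for features with no separating power is to look for columns that hold a single value in every row. The sketch below runs on a small made-up frame (toy values, not taken from the CIC collection), and `constant_columns` is a hypothetical helper name. It is only a proxy and does not reproduce the analysis in the cited paper.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that contain at most one distinct value."""
    # dropna=False counts NaN as a value, so an all-NaN column is also reported
    distinct = frame.nunique(dropna=False)
    return distinct[distinct <= 1].index.tolist()

# Toy data for illustration only
toy = pd.DataFrame({
    "Flow Duration": [10, 250, 37],
    "Fwd URG Flags": [0, 0, 0],
    "Bwd Avg Bulk Rate": [0.0, 0.0, 0.0],
})
constant = constant_columns(toy)
print(constant)
```

A constant column cannot separate classes, but a non-constant column can still be uninformative, so this check may miss some candidates.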

In [7]:
#df = df.drop(columns=['Bwd Avg Bulk Rate', 'Bwd Avg Bytes/Bulk', 'Bwd Avg Packets/Bulk', 'Bwd PSH Flags', 'Bwd URG Flags', 'CWE Flag Count', 'FIN Flag Count', 'Fwd Avg Bulk Rate', 'Fwd Avg Bytes/Bulk', 'Fwd Avg Packets/Bulk', 'Fwd URG Flags'])
#df.shape


In [8]:
df.to_parquet('./ids_data/cic-collection-clean.parquet')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9167581 entries, 0 to 9167580
Data columns (total 70 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   Flow Duration             int64  
 1   Total Fwd Packets         int32  
 2   Total Backward Packets    int32  
 3   Fwd Packets Length Total  float64
 4   Bwd Packets Length Total  float64
 5   Fwd Packet Length Max     float64
 6   Fwd Packet Length Mean    float32
 7   Fwd Packet Length Std     float32
 8   Bwd Packet Length Max     float64
 9   Bwd Packet Length Mean    float32
 10  Bwd Packet Length Std     float32
 11  Flow Bytes/s              float64
 12  Flow Packets/s            float64
 13  Flow IAT Mean             float32
 14  Flow IAT Std              float32
 15  Flow IAT Max              float64
 16  Flow IAT Min              float64
 17  Fwd IAT Total             float64
 18  Fwd IAT Mean              float32
 19  Fwd IAT Std               float32
 20  Fwd IAT Max             

In [10]:
# Drop unnecessary columns (Extra Label Column)
df.drop(columns=["Label"], axis=1, inplace=True)

## Preprocessing

In [11]:
# Z-Score Normalization
features = df.dtypes[df.dtypes != 'object'].index
df[features] = df[features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# Fill nan values with 0
df = df.fillna(0)
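
The small sketch below illustrates why the `fillna(0)` step follows the z-score normalization: a constant column has a standard deviation of 0, so the division yields NaN. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],   # varies: mean 2, sample standard deviation 1
    "b": [5.0, 5.0, 5.0],   # constant: standard deviation 0, so 0/0 gives NaN
})

# pandas .std() uses the sample standard deviation (ddof=1) by default
z = (toy - toy.mean()) / toy.std()
has_nan_before_fill = bool(z["b"].isna().all())

# Replace the NaN values produced by constant columns with 0
z = z.fillna(0)
print(z)
```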

In [12]:
# Encode the class labels as integers
labelencoder = LabelEncoder()
df['ClassLabel'] = labelencoder.fit_transform(df['ClassLabel'])
df.ClassLabel.value_counts()

0    7186189
3    1234729
4     397344
1     145968
2     103244
5      94857
7       2995
6       2255
Name: ClassLabel, dtype: int64
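
To read the integer codes back as class names, the fitted encoder's `classes_` attribute can be used: the position of each name in that sorted array is its code. The sketch below uses hypothetical label strings, not the dataset's actual class names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings for illustration
names = ["Benign", "DDoS", "Benign", "Portscan"]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ is sorted; the index of each entry is its integer code
mapping = {code: str(name) for code, name in enumerate(encoder.classes_)}
decoded = str(encoder.inverse_transform([2])[0])
print(codes.tolist())
print(mapping)
```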

In [13]:
df.shape

(9167581, 69)

## Data Sampling
Because the full dataset is too large to process, a smaller subset is generated for model training: the data is randomly sampled, then clustered with HDBSCAN and sampled per cluster.
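
As a rough sketch of the per-cluster sampling idea, the example below draws the same fraction of rows from every cluster so that each cluster stays represented in the subset. The toy frame, the hand-written cluster labels, and the fraction of 0.5 are assumptions for illustration; no clustering is run here.

```python
import pandas as pd

# Toy frame with pre-assigned cluster labels (-1 marks noise points)
toy = pd.DataFrame({
    "ClusterLabels": [0] * 4 + [1] * 6 + [-1] * 2,
    "value": range(12),
})

# Sample the same fraction of rows from each cluster
subset = toy.groupby("ClusterLabels", group_keys=False).sample(frac=0.5, random_state=0)
counts = subset["ClusterLabels"].value_counts().to_dict()
print(counts)
```

Sampling within clusters, rather than from the frame as a whole, keeps small clusters from disappearing when the fraction is low.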

In [14]:
# The sample fraction can be adjusted; HDBSCAN took too long on the full dataset
# The data is sampled twice: randomly here, then per cluster after HDBSCAN
df_sample1 = df.sample(frac=0.01, random_state=1)
print(f"DF Sampled Shape: {df_sample1.shape}")

DF Sampled Shape: (91676, 69)


In [15]:
# Minority classes are 5: Infiltration, 6: Portscan, 7: Webattack
# Keep every minority-class row and sample only from the remaining majority classes
df_minor = df_sample1[df_sample1['ClassLabel'].isin([5, 6, 7])]
df_major = df_sample1.drop(df_minor.index)

In [16]:
X = df_major.drop(['ClassLabel'], axis=1)
y = df_major['ClassLabel'].values
X.shape

(90599, 68)

In [17]:
# Use HDBSCAN to cluster the data samples
print('start clustering')
clusterer = hdbscan.HDBSCAN()
clusterer.fit(X)
cluster_labels = clusterer.labels_
df_major["ClusterLabels"] = cluster_labels
print(df_major["ClusterLabels"].value_counts())
print("done clustering \n")

start clustering
-1       27569
 1380     2905
 2941     1311
 925       710
 2892      618
         ...  
 3672        5
 1423        5
 2322        5
 2178        5
 3738        5
Name: ClusterLabels, Length: 3854, dtype: int64
done clustering 



In [18]:
# Move 'ClassLabel' to the last column, after the new 'ClusterLabels' column
cols = list(df_major)
cols.append(cols.pop(cols.index('ClassLabel')))
df_major = df_major.loc[:, cols]

In [19]:
df_major

Unnamed: 0,Flow Duration,Total Fwd Packets,Total Backward Packets,Fwd Packets Length Total,Bwd Packets Length Total,Fwd Packet Length Max,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,URG Flag Count,CWE Flag Count,Avg Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init Fwd Win Bytes,Init Bwd Win Bytes,Fwd Act Data Packets,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,ClusterLabels,ClassLabel
6300475,-0.024155,-0.018775,-0.012927,-0.023878,-0.007616,-0.519466,-0.339401,-0.543296,-0.407501,-0.201760,-0.501630,-0.044793,-0.104969,-0.015411,-0.005265,-0.011938,-0.002767,-0.023239,-0.017209,-0.005652,-0.011422,-0.002981,-0.334857,-0.200174,-0.262981,-0.275231,-0.086333,-0.179931,-0.023297,-0.014409,0.0,0.009329,0.004222,-0.093281,-0.091193,-0.477361,-0.340576,-0.463334,-0.203713,-0.114025,-0.205288,-0.193904,-0.049057,-0.310628,-0.339401,-0.201760,0.0,0.0,0.0,0.0,0.0,0.0,-0.018775,-0.023878,-0.012927,-0.007616,-0.557987,-0.430904,-0.017215,0.031772,-0.079321,-0.064744,-0.097397,-0.065099,-0.022912,-0.002439,-0.010545,-0.087599,3534,0
5913184,-0.017002,-0.017323,-0.009483,-0.013505,-0.007278,1.275878,0.758992,1.634291,-0.223685,-0.078187,-0.089415,-0.044922,-0.105976,-0.013454,-0.001905,-0.006988,-0.002767,-0.016085,-0.013337,-0.000905,-0.006472,-0.003018,-0.323544,-0.182879,-0.239616,-0.258876,-0.086376,-0.179931,-0.023297,-0.014409,0.0,0.009329,0.004232,-0.093808,-0.094031,0.185264,0.095845,0.235656,-0.107130,-0.114025,-0.205288,-0.193904,-0.049057,0.089143,0.758992,-0.078187,0.0,0.0,0.0,0.0,0.0,0.0,-0.017323,-0.013505,-0.009483,-0.007278,-0.546362,-0.419995,-0.017215,0.031772,-0.079321,-0.064744,-0.097397,-0.065099,-0.022912,-0.002439,-0.010545,-0.087599,1391,0
423760,-0.024212,-0.019259,-0.014650,-0.024691,-0.007813,-0.587215,-0.583488,-0.543296,-0.514233,-0.543963,-0.501630,-0.044928,0.122823,-0.015453,-0.005312,-0.011980,-0.002767,-0.023295,-0.017335,-0.005652,-0.011464,-0.003018,-0.334876,-0.200260,-0.262981,-0.275270,-0.086426,-0.179931,-0.023297,-0.014409,0.0,0.009329,0.004223,0.026202,0.549481,-0.580564,-0.679005,-0.600738,-0.206392,-0.114025,-0.205288,5.157202,-0.049057,-0.696823,-0.583488,-0.543963,0.0,0.0,0.0,0.0,0.0,0.0,-0.019259,-0.024691,-0.014650,-0.007813,-0.539070,-0.410372,-0.017702,0.031772,-0.079321,-0.064744,-0.097397,-0.065099,-0.022912,-0.002439,-0.010545,-0.087599,418,0
1616769,0.156313,-0.010548,0.011187,-0.009007,-0.005949,0.283557,-0.088134,0.309289,0.160043,-0.138616,0.166299,-0.044927,-0.105991,-0.003678,0.005398,-0.000357,-0.002767,0.157841,0.004902,0.005178,0.000159,-0.003018,3.897836,1.093564,0.642553,0.492960,-0.086316,-0.179931,-0.023297,-0.014409,0.0,0.009330,0.004261,-0.093817,-0.094070,0.071414,-0.187852,0.066182,-0.143281,-0.114025,-0.205288,-0.193904,-0.049057,-0.235634,-0.088134,-0.138616,0.0,0.0,0.0,0.0,0.0,0.0,-0.010548,-0.009007,0.011187,-0.005949,0.985021,-0.415930,-0.016240,0.031772,-0.057800,-0.039814,-0.049389,-0.046643,0.005244,0.002855,0.001712,-0.015847,-1,0
5473594,-0.024212,-0.018291,-0.014650,-0.024320,-0.007813,-0.525444,-0.509305,-0.450089,-0.514233,-0.543963,-0.501630,-0.043752,-0.013365,-0.015453,-0.005312,-0.011979,-0.002767,-0.023295,-0.017334,-0.005651,-0.011464,-0.003018,-0.334876,-0.200260,-0.262981,-0.275270,-0.086426,-0.179931,-0.023297,-0.014409,0.0,0.009329,0.004222,-0.020939,0.036185,-0.555173,-0.649369,-0.562935,-0.206189,-0.114025,-0.205288,-0.193904,-0.049057,-0.663004,-0.509305,-0.543963,0.0,0.0,0.0,0.0,0.0,0.0,-0.018291,-0.024320,-0.014650,-0.007813,-0.505516,-0.430852,-0.017215,0.031772,-0.079321,-0.064744,-0.097397,-0.065099,-0.022912,-0.002439,-0.010545,-0.087599,2824,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3007592,0.099460,-0.018775,-0.016372,-0.024691,-0.007813,-0.587215,-0.583488,-0.543296,-0.514233,-0.543963,-0.501630,-0.044928,-0.105994,0.258664,-0.005312,0.080548,0.078858,0.100304,0.256765,-0.005652,0.081064,0.078606,-0.334876,-0.200260,-0.262981,-0.275270,-0.086426,-0.179931,-0.023297,-0.014409,0.0,0.009329,0.004221,-0.093819,-0.094077,-0.580564,-0.679005,-0.600738,-0.206392,-0.114025,-0.205288,-0.193904,-0.049057,-0.696823,-0.583488,-0.543963,0.0,0.0,0.0,0.0,0.0,0.0,-0.018775,-0.024691,-0.016372,-0.007813,-0.449663,-0.430904,-0.017702,0.031772,-0.079321,-0.064744,-0.097397,-0.065099,0.209201,-0.002439,0.087033,0.871862,2485,3
6771053,-0.023580,-0.018775,-0.016372,-0.024691,-0.007813,-0.587215,-0.583488,-0.543296,-0.514233,-0.543963,-0.501630,-0.044928,-0.105948,-0.014052,-0.005312,-0.011507,-0.002350,-0.022663,-0.015934,-0.005652,-0.010991,-0.002601,-0.334876,-0.200260,-0.262981,-0.275270,-0.086426,-0.179931,-0.023297,-0.014409,0.0,0.009329,0.004221,-0.093771,-0.094077,-0.580564,-0.679005,-0.600738,-0.206392,-0.114025,-0.205288,-0.193904,-0.049057,-0.696823,-0.583488,-0.543963,0.0,0.0,0.0,0.0,0.0,0.0,-0.018775,-0.024691,-0.016372,-0.007813,2.823886,-0.430904,-0.017702,0.031772,-0.079321,-0.064744,-0.097397,-0.065099,-0.022912,-0.002439,-0.010545,-0.087599,1742,0
2842438,-0.024178,-0.018291,-0.009483,-0.024452,-0.007061,-0.547363,-0.535628,-0.483162,0.302353,0.110568,0.501685,-0.044235,-0.102984,-0.015441,-0.005292,-0.011955,-0.002767,-0.023295,-0.017334,-0.005651,-0.011463,-0.003018,-0.334081,-0.199041,-0.260307,-0.273627,-0.086422,-0.179931,-0.023297,-0.014409,0.0,0.009329,0.004228,-0.092466,-0.084401,0.209017,-0.091056,0.326065,-0.084512,-0.114025,-0.205288,-0.193904,-0.049057,-0.083400,-0.535628,0.110568,0.0,0.0,0.0,0.0,0.0,0.0,-0.018291,-0.024452,-0.009483,-0.007061,-0.125061,-0.419995,-0.017215,0.031772,-0.079321,-0.064744,-0.097397,-0.065099,-0.022912,-0.002439,-0.010545,-0.087599,819,3
5876236,-0.017600,-0.017323,-0.011205,-0.013505,-0.007526,1.275878,0.758992,1.634291,-0.202508,-0.210813,-0.059370,-0.044923,-0.105977,-0.013358,-0.001728,-0.007106,-0.002767,-0.016683,-0.013669,-0.000923,-0.006590,-0.003018,-0.332980,-0.195898,-0.254990,-0.271281,-0.086365,-0.179931,-0.023297,-0.014409,0.0,0.009329,0.004229,-0.093807,-0.094040,0.185264,0.013044,0.272796,-0.098119,-0.114025,-0.205288,-0.193904,-0.049057,0.013928,0.758992,-0.210813,0.0,0.0,0.0,0.0,0.0,0.0,-0.017323,-0.013505,-0.011205,-0.007526,-0.545411,-0.419995,-0.017215,0.031772,-0.079321,-0.064744,-0.097397,-0.065099,-0.022912,-0.002439,-0.010545,-0.087599,1380,0


In [21]:
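def sampling(group):
    # Fraction of each cluster to keep; 1.0 keeps every row (in shuffled order)
    frac = 1.0
    return group.sample(frac=frac)

result = df_major.groupby('ClusterLabels', group_keys=False).apply(sampling)
result = result.drop(["ClusterLabels"], axis=1)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
result = pd.concat([result, df_minor])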

In [22]:
result['ClassLabel'].value_counts()

0    71634
3    12505
4     3997
1     1413
2     1050
5     1018
7       36
6       23
Name: ClassLabel, dtype: int64

In [23]:
result.to_csv('./ids_data/CIC_Collection_clean_sample.csv', index=False)

## Split train set and test set

In [24]:
df_clean_sample = pd.read_csv('./ids_data/CIC_Collection_clean_sample.csv')

In [25]:
X = df_clean_sample.drop(['ClassLabel'], axis=1).values
y = df_clean_sample['ClassLabel'].values

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0, stratify=y)
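
The `stratify` argument matters here because the classes are heavily imbalanced. The sketch below, on made-up labels with an 80/20 class ratio, shows that a stratified 80/20 split keeps the same class proportions in both parts. The names `X_toy` and `y_toy` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 samples of class 0 and 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# Per-class counts in each part of the split
train_counts = np.bincount(y_tr).tolist()
test_counts = np.bincount(y_te).tolist()
print(train_counts, test_counts)
```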

## Feature Engineering
### Feature selection by genetic algorithm

In [None]:
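# Base estimator whose cross-validated accuracy scores each candidate feature subset
clf = SVC(gamma='auto')

# Genetic-algorithm search over feature subsets
evolved_estimator = GAFeatureSelectionCV(
    estimator=clf,
    cv=3,
    scoring="accuracy",
    population_size=100,
    generations=15,
    n_jobs=-1,
    verbose=True,
    keep_top_k=2,
    elitism=True,
)

# Note: the search runs on the full sample (X, y), not only on the training split
evolved_estimator.fit(X, y)
# Selected features, used below to index the columns of X
features = evolved_estimator.best_features_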

In [None]:
print(f"Original X Shape: {X.shape}")

X_best_features = X[:, features]

print(f"Best Features X Shape: {X_best_features.shape}")

pd.DataFrame(X_best_features).to_csv('./ids_data/X_best_features.csv', index=False)
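
The column selection `X[:, features]` can be illustrated on a toy array, assuming the selector returns a boolean mask with one entry per feature. The mask below is made up and is not the result of any genetic search.

```python
import numpy as np

X_toy = np.arange(12).reshape(3, 4)            # 3 samples, 4 features
mask = np.array([True, False, True, False])    # hypothetical selection mask

# Boolean indexing keeps only the columns where the mask is True
X_selected = X_toy[:, mask]
# Integer positions of the kept columns, useful for mapping back to feature names
kept_idx = np.flatnonzero(mask)
print(X_selected.shape, kept_idx.tolist())
```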

### Feature selection by Fast Correlation Based Filter (FCBF)
GitHub repo: https://github.com/SantiagoEG/FCBF_module
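
FCBF ranks features by their symmetrical uncertainty (SU) with the class and discards features that are more strongly related to an already selected feature than to the class. The sketch below computes SU for two discrete, integer-coded arrays as SU(x, y) = 2 I(x; y) / (H(x) + H(y)). It is a standalone illustration with hypothetical helper names and does not reproduce the internals of FCBF_module.

```python
import numpy as np

def entropy(values):
    """Shannon entropy in bits of a discrete array (1-D values or 2-D rows)."""
    _, counts = np.unique(values, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), which lies in [0, 1]."""
    x = np.asarray(x)
    y = np.asarray(y)
    h_x, h_y = entropy(x), entropy(y)
    # Joint entropy is computed over the paired values (one row per sample)
    h_xy = entropy(np.stack([x, y], axis=1))
    mutual_info = h_x + h_y - h_xy
    denom = h_x + h_y
    # Two constant arrays carry no information; define SU as 0 in that case
    return 2.0 * mutual_info / denom if denom > 0 else 0.0

su_identical = symmetrical_uncertainty([0, 0, 1, 1], [0, 0, 1, 1])
su_independent = symmetrical_uncertainty([0, 0, 1, 1], [0, 1, 0, 1])
print(su_identical, su_independent)
```

Continuous features, such as the z-scored columns here, would need to be discretized before SU can be computed this way.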

In [1]:
from FCBF_module import FCBF, FCBFK, FCBFiP, get_i
fcbf = FCBFK(k = 20)
X_bf = pd.read_csv('./ids_data/X_best_features.csv')

In [None]:
# FCBF is supervised, so the class labels are passed along with the feature array
X_bbf = fcbf.fit_transform(X_bf.values, y)

## Resplit train & test sets after feature selection from genetic algorithm and fast correlation based filter

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_bbf, y, train_size=0.8, test_size=0.2, random_state=0, stratify=y)

In [None]:
X_train.shape

In [None]:
pd.Series(y_train).value_counts()

## Synthetic Minority Oversampling Technique (SMOTE) to solve class imbalance

In [None]:
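from imblearn.over_sampling import SMOTE
# Oversample every class except the majority class
# n_jobs is omitted: it is deprecated in recent imbalanced-learn releases
smote = SMOTE(sampling_strategy='not majority')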

In [None]:
X_train, y_train = smote.fit_resample(X_train, y_train)

In [None]:
pd.Series(y_train).value_counts()
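
The core idea of SMOTE is to create each synthetic minority sample on the line segment between an existing minority sample and one of its nearest minority-class neighbours. The NumPy sketch below shows only that interpolation step on four made-up points. `smote_like_sample` is a hypothetical helper and not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy minority-class points: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def smote_like_sample(points, k, rng):
    """Return one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

With k=2, each corner's neighbours are its two adjacent corners, so every synthetic point falls on an edge of the square.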

## Intrusion Detection System Model Training
### Training base learners: Random forest, XGBoost (Isolation forest for anomaly detection)

In [None]:
# Applying Random Forest
import matplotlib.pyplot as plt
import seaborn as sns

randomforest = RandomForestClassifier(random_state=7)
randomforest.fit(X_train, y_train)
rf_score = randomforest.score(X_test, y_test)
y_predict = randomforest.predict(X_test)
y_actual = y_test
print('Accuracy of RF: ' + str(rf_score))
# Metrics are averaged over classes, weighted by each class's support
precision, recall, fscore, _ = precision_recall_fscore_support(y_actual, y_predict, average='weighted')
print('Precision of RF: ' + str(precision))
print('Recall of RF: ' + str(recall))
print('F1-score of RF: ' + str(fscore))
print(classification_report(y_actual, y_predict))
# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_actual, y_predict)
f, ax = plt.subplots(figsize=(7, 7))
sns.heatmap(cm, annot=True, linewidth=0.5, linecolor="blue", fmt=".0f", ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()
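
To make the `average='weighted'` metrics concrete, the sketch below derives them by hand from a small made-up confusion matrix, with rows as true classes and columns as predicted classes. It assumes every class is predicted at least once, so no division by zero occurs. The counts are illustrative and not results from this notebook.

```python
import numpy as np

# Toy confusion matrix: rows are true classes, columns are predicted classes
cm = np.array([[8, 2],
               [1, 4]])

tp = np.diag(cm).astype(float)   # correctly classified samples per class
support = cm.sum(axis=1)         # true samples per class
predicted = cm.sum(axis=0)       # predicted samples per class

precision = tp / predicted
recall = tp / support
f1 = 2 * precision * recall / (precision + recall)

# Weighted averaging uses each class's share of the true samples
weights = support / support.sum()
weighted_precision = float((weights * precision).sum())
weighted_recall = float((weights * recall).sum())
weighted_f1 = float((weights * f1).sum())
accuracy = float(tp.sum() / cm.sum())
print(weighted_precision, weighted_recall, weighted_f1, accuracy)
```

Weighted recall always equals overall accuracy, so on imbalanced data the per-class rows of the classification report are more informative for the minority classes.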