## Machine Learning-Based Intrusion Detection System

We performed the machine learning workflow, which included the following steps:

1. **Data Collection:** 
  - This included finding and selecting the data that was processed and used to train our machine learning models.
  - The dataset comes from [Kaggle](https://www.kaggle.com/sampadab17/network-intrusion-detection), a large repository of community-published data.
  - Short description: The dataset consists of a wide variety of intrusions simulated in a military network environment. It has 25192 rows and 42 columns.
  - *Note: Right after loading, the data was split 7:3 into training and testing sets to prevent data leakage.*

2. **Data Analysis and Preparation:**
  - Here, we analysed the data to find any discrepancies, interesting patterns, correlations in the data, etc. This step is popularly known as *Exploratory Data Analysis*.
  - After the analysis, we applied standard data preprocessing techniques wherever we felt they would affect the results.
  - Most of our time was spent on this step.
  - Some methods used are:
    - *Data cleaning* - handling missing values by mean imputation, etc.
    - *Data Scaling and Normalisation* - Scaling is a common preprocessing technique that brings features onto a comparable scale. Normalisation usually maps the data to a scale of 0 to 1; this notebook uses `StandardScaler`, which instead standardises each feature to zero mean and unit variance.
    - *Data Encoding* - Most models cannot process strings/objects, so such data needs to be transformed into numerical data. This process is known as data encoding (also data transformation).
    - *Feature Selection* - Removing redundant features or selecting the most "useful" features. We used `recursive feature elimination` for feature selection.
3. **Model Selection/Building:**
  - Here, we chose models suited to the required task.
  - Our task required a *classification model*.
  - We selected two state-of-the-art models: *LightGBM* and *XGBoost*.
  - We also selected a few standard models for comparison, namely Logistic Regression, SVC and Naive Bayes.

4. **Model Training:**
  - Each model was trained on the training data, which took a few minutes per model.

5. **Model Evaluation**
  - After a model is trained, we evaluate its performance.
  - The evaluation metrics used are *Precision* and *Recall*; the F1-score is also reported on the test set.

6. **Parameter Tuning**
  - A model needs to be "tuned" for each particular scenario or use case, based on the dataset.
  - This involves changing various parameters and evaluating the results after each change.

7. **Making Predictions**
  - After the model parameters are finalized and it is trained, it can be saved and used for making predictions.
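
Steps 6 and 7 can be sketched roughly as follows. This is a minimal illustration, not the tuning used in this notebook: it assumes synthetic data from `make_classification`, a hypothetical parameter grid, a `RandomForestClassifier` as a stand-in model, and a hypothetical file name `ids_model.joblib`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the intrusion data (assumption: binary target, numeric features).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 6: try a small, hypothetical parameter grid and keep the setting with the best recall.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="recall",
    cv=3,
)
search.fit(X_tr, y_tr)
print("Best parameters:", search.best_params_)

# Step 7: save the refitted best model, load it back, and make predictions.
model_path = os.path.join(tempfile.mkdtemp(), "ids_model.joblib")
joblib.dump(search.best_estimator_, model_path)
loaded = joblib.load(model_path)
preds_original = search.best_estimator_.predict(X_te)
preds_loaded = loaded.predict(X_te)
```

Scoring on recall is one possible choice here, on the assumption that missing an intrusion is the costlier error; precision or F1 could be used the same way.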

### Import Relevant Libraries

In [None]:
# Data Manipulation libraries
import pandas as pd
import numpy as np
# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# for normalization
from sklearn.preprocessing import StandardScaler
# for encoding
from sklearn.preprocessing import LabelEncoder
# for feature selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# for model selection and training
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB 
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import cross_val_score

# for model evaluation
from sklearn.metrics import confusion_matrix, classification_report, f1_score

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Seed for random state
SEED = 42

### Load the dataset

In [None]:
df = pd.read_csv("../input/network-intrusion-detection/Train_data.csv")

In [None]:
# Let's view the data.
print("Training data has {} rows & {} columns".format(df.shape[0],df.shape[1]))
df.head()

#### Data Splitting

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop("class", axis=1), df["class"], test_size=0.3, random_state=SEED)

### Data Analysis and Preprocessing

In [None]:
X_train.info()

In [None]:
# Descriptive analysis of the data
X_train.describe()

- We identified various features with binary values and a few that are redundant.
- A few features have the object data type and need to be encoded into numerical values.
- Also, a few features differ greatly in scale and need to be normalised.

In [None]:
print(X_train['is_host_login'].value_counts())
print(X_train['num_outbound_cmds'].value_counts())

Here, we found that `is_host_login` and `num_outbound_cmds` have only one unique value, i.e. 0. A feature with a single value carries no information for the model, so it is redundant. Removing both reduces the size of the data and speeds up training.
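
The same check can be run over every column at once instead of inspecting columns one by one. A minimal sketch, assuming a toy DataFrame with made-up values (only the column names are borrowed from the dataset):

```python
import pandas as pd

# Toy frame; the column names come from the dataset, the values are made up.
toy = pd.DataFrame({
    "duration": [0, 2, 5, 1],
    "is_host_login": [0, 0, 0, 0],
    "num_outbound_cmds": [0, 0, 0, 0],
})

# A column is constant when it holds exactly one distinct value.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

# Drop the constant columns and keep the rest.
reduced = toy.drop(columns=constant_cols)
```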

In [None]:
# 'num_outbound_cmds' and 'is_host_login' are constant columns, so remove them from the data.
X_train = X_train.drop(columns=['num_outbound_cmds','is_host_login'])
X_test = X_test.drop(columns=['num_outbound_cmds','is_host_login'])

#### Exploratory Data Analysis

*We plot various graphs to identify distributions, relationships or any patterns that are not visible in the raw data.*

In [None]:
# Target Class Distribution
# Pass the column explicitly as x; positional use is not reliable across seaborn versions
sns.countplot(x=y_train)

In [None]:
sns.countplot(x=X_train['protocol_type'], hue=y_train)

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=X_train['flag'], hue=y_train)

In [None]:
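plt.figure(figsize=(16,6))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(x=X_train['count'])

In [None]:
# kde=True with stat="density" approximates distplot's default histogram-plus-KDE view
sns.histplot(x=X_train['dst_host_srv_count'], kde=True, stat="density")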

##### Observations from Data Analysis
- We identified a slight imbalance in the target column `class` of our dataset. It is not significant; otherwise, we could have used oversampling.
- 80% of the traffic belongs to `TCP`, 12% to `UDP` and the rest to `ICMP`.
- Most of the `ICMP` traffic was `anomaly`; most of the `UDP` traffic was `normal`; the split was almost equal for `TCP`.
- The traffic distribution across flags was also uneven: most connections had the `SF` flag (normal connection establishment and termination).
- Most of the traffic with `SF` was `normal`, while most of the traffic with the `S0` flag was `anomaly`.
- Most of the traffic recorded was `unique`.
- Count of most of the connections having the same destination host and using the same service was either very low or very high.
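
Percentages like the ones above can be computed directly rather than read off the plots. A minimal sketch on a made-up toy frame (only the column names and class labels mirror the dataset, so the numbers do not match the observations above):

```python
import pandas as pd

# Made-up rows; only the column names and labels mirror the dataset.
toy = pd.DataFrame({
    "protocol_type": ["tcp", "tcp", "tcp", "udp", "icmp", "tcp", "udp", "icmp"],
    "class": ["normal", "anomaly", "normal", "normal",
              "anomaly", "anomaly", "normal", "anomaly"],
})

# Share of each target class (a quick check for imbalance).
class_share = toy["class"].value_counts(normalize=True)

# Share of each protocol in the traffic.
protocol_share = toy["protocol_type"].value_counts(normalize=True)

# Within each protocol, the fraction of normal vs. anomalous connections.
by_protocol = pd.crosstab(toy["protocol_type"], toy["class"], normalize="index")
print(class_share, protocol_share, by_protocol, sep="\n\n")
```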

**We are `encoding` the target class to 0s and 1s, so that it can be used for further analysis and training.**

In [None]:
# Encoding target class to 0 and 1
y_train = y_train.apply(lambda x: 1 if x=="anomaly" else 0)
y_test = y_test.apply(lambda x: 1 if x=="anomaly" else 0)

In [None]:
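# Correlation Heatmap
# numeric_only=True skips the object columns, which are not encoded yet (needs pandas >= 1.5)
plt.figure(figsize=(16,10))
sns.heatmap(X_train.corr(numeric_only=True).abs())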

In [None]:
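# Absolute correlation of each numeric feature with the encoded target (needs pandas >= 1.5)
corr_with_target = X_train.corrwith(y_train, numeric_only=True).abs()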
corr_with_target[corr_with_target>0.7]

- From the correlation heatmap above, we can see that most features have very low correlation with each other. This is a good characteristic for our machine learning process.
- A few features had high correlation with our target class, namely `same_srv_rate` and `dst_host_srv_count`, which will be helpful for our model.

#### Encoding Categorical Data

In [None]:
# Custom Label Encoder for handling unknown values
class LabelEncoderExt(object):
    def __init__(self):
        self.label_encoder = LabelEncoder()

    def fit(self, data):
        self.label_encoder = self.label_encoder.fit(list(data) + ['Unknown'])
        self.classes_ = self.label_encoder.classes_
        return self

    def transform(self, data):
        new_data = list(data)
        for unique_item in np.unique(data):
            if unique_item not in self.label_encoder.classes_:
                new_data = ['Unknown' if x==unique_item else x for x in new_data]
        return self.label_encoder.transform(new_data)

In [None]:
le = LabelEncoderExt()

# encode the selected columns
for col in X_train.select_dtypes("object"):
  le.fit(X_train[col])
  X_train[col] = le.transform(X_train[col])
  X_test[col] = le.transform(X_test[col])
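
As an alternative to a custom wrapper, scikit-learn's `OrdinalEncoder` can handle unseen categories on its own. The sketch below is not the encoder used in this notebook; it assumes scikit-learn 0.24 or later and made-up category values, with `-1` chosen as an arbitrary code for unknowns.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical values; "gre" appears only in the test split.
train = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp", "tcp"]})
test = pd.DataFrame({"protocol_type": ["udp", "gre"]})

# Categories unseen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train)
test_encoded = encoder.transform(test)
print(test_encoded.ravel())
```

`OrdinalEncoder` fits all passed columns in one call and keeps a separate category list per column, so no per-column loop is needed.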

#### Normalizing the numerical data.

In [None]:
scaler = StandardScaler()
# store the columns
cols = X_train.columns

# transform the data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns = cols)
X_test_scaled = pd.DataFrame(X_test_scaled, columns = cols)
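
`StandardScaler` standardises each column to zero mean and unit variance, which differs from min-max scaling onto the 0 to 1 range mentioned in the workflow overview. A small sketch on made-up numbers shows the difference between the two:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales.
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0], [4.0, 7000.0]])

# Standardisation: each column ends up with mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: each column is mapped onto the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```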

#### Feature Selection

In [None]:
# Fix the random state so the feature importances are reproducible
rfc = RandomForestClassifier(random_state=SEED)

rfc.fit(X_train_scaled, y_train)

feat_imp = pd.DataFrame({'feature':X_train.columns,'importance':rfc.feature_importances_})
feat_imp = feat_imp.sort_values('importance',ascending=False).set_index('feature')

**Visualization of the Feature Importances**

In [None]:
# plot feat_imp
plt.figure(figsize = (10, 5))
plt.title("Feature Importance")
plt.ylabel("Importances")
plt.xlabel("Features")
plt.xticks(rotation=90)
plt.plot(feat_imp)

**Using `recursive feature elimination` for Feature Selection**

In [None]:
# Fix the random state so the selected features are reproducible
estimator = RandomForestClassifier(random_state=SEED)
selector = RFE(estimator)
selector.fit(X_train_scaled, y_train)

X_train_scaled = selector.transform(X_train_scaled)
X_test_scaled = selector.transform(X_test_scaled)
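
To see which features recursive feature elimination keeps, the fitted selector can be inspected. A minimal sketch on synthetic data, assuming hypothetical feature names `f0` to `f9` and a smaller forest than the notebook uses so it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 10 features, of which 3 are informative (assumption for illustration).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"f{i}" for i in range(X.shape[1])]

# With n_features_to_select left unset, RFE keeps half of the features.
selector = RFE(RandomForestClassifier(n_estimators=25, random_state=42))
selector.fit(X, y)

# support_ is a boolean mask over the input columns; ranking_ is 1 for kept features.
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
X_reduced = selector.transform(X)
print(selected)
```

Passing `n_features_to_select` explicitly would make the number of retained features a deliberate choice rather than the default half.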

### Model Selection

In [None]:
# SVC Model
svc = SVC(random_state=SEED)

# LogisticRegression Model
lr = LogisticRegression()

# Bernoulli Naive Bayes Model
bnb = BernoulliNB()

In [None]:
# XGBoost Classifier (instantiated here, trained later)
xgbc = XGBClassifier(eval_metric="logloss", random_state=SEED)

# LightGBM Classifier (instantiated here, trained later)
lgbmc = LGBMClassifier(random_state=SEED)

#### Model Testing on Validation Data

In [None]:
models = {}
models['SVC']= svc
models['LogisticRegression']= lr
models['Naive Bayes Classifier']= bnb
models['XGBoost Classifier']= xgbc
models['LightGBM Classifier']= lgbmc
scores = {}
for name in models:
  scores[name]={}
  for scorer in ['precision','recall']:
    scores[name][scorer] = cross_val_score(models[name], X_train_scaled, y_train, cv=10, scoring=scorer)

In [None]:
def line(name):
  return '*'*(25-len(name)//2)

for name in models:
  print(line(name), name, 'Model Validation', line(name))

  for scorer in ['precision','recall']:
    mean = round(np.mean(scores[name][scorer])*100,2)
    stdev = round(np.std(scores[name][scorer])*100,2)
    print ("Mean {}:".format(scorer),"\n", mean,"%", "+-",stdev)
    print()

Although the SVC classifier came close, the results above show that the XGBoost and LightGBM classifiers perform best on the validation data.
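
The validation loop above runs a full cross-validation once per metric for every model. scikit-learn's `cross_validate` can compute several metrics from a single round of fits. A minimal sketch, assuming synthetic data and a `LogisticRegression` stand-in rather than the notebook's models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data standing in for the scaled training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# One call computes both metrics on the same folds.
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    scoring=["precision", "recall"],
)

# Per-fold scores are stored under "test_<metric>".
for metric in ["precision", "recall"]:
    fold_scores = cv_results[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} +- {fold_scores.std():.3f}")
```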

The evaluation metrics used are:
- Precision: also called positive predictive value, is the fraction of positive predictions that are correct.

- Recall: also known as sensitivity, is the fraction of actual positives that are correctly predicted as positive.

Precision and Recall can be calculated by:

![Precision and Recall](https://wikimedia.org/api/rest_v1/media/math/render/svg/d37e557b5bfc8de22afa8aad1c187a357ac81bdb)
Reference: [Wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall)
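
A small worked example may make the two definitions concrete. The labels below are made up, with 1 standing for `anomaly` (the positive class); the hand-computed values can be compared against scikit-learn's `precision_score`, `recall_score` and `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = anomaly (positive class), 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Count the outcomes by hand: 3 true positives, 2 false positives, 1 false negative.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # 3 / 5 = 0.6
recall = tp / (tp + fn)     # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The library functions should agree with the hand calculation.
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```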


In [None]:
for name in models:
    for scorer in ['precision','recall']:
        scores[name][scorer] = scores[name][scorer].mean()
# Transpose so that models are rows and metrics are columns (swapaxes is deprecated in recent pandas)
scores = pd.DataFrame(scores).T * 100

In [None]:
scores.plot(kind = "bar",  ylim=[80,100], figsize=(10,6), rot=0)

In [None]:
models = {}
models['SVC']= svc
models['LogisticRegression']= lr
models['Naive Bayes Classifier']= bnb
models['XGBoost Classifier']= xgbc
models['LightGBM Classifier']= lgbmc
preds={}
for name in models:
    models[name].fit(X_train_scaled, y_train)
    preds[name] = models[name].predict(X_test_scaled)
print("Predictions complete.")

In [None]:
def line(name,sym="*"):
    return sym*(25-len(name)//2)
target_names=["normal","anomaly"]
for name in models:
    print(line(name), name, 'Model Testing', line(name))
    print(confusion_matrix(y_test, preds[name]))
    print(line(name,'-'))
    print(classification_report(y_test, preds[name], target_names=target_names))

In [None]:
f1s = {}
for name in models:
    f1s[name]=f1_score(y_test, preds[name])
# Convert the dict views to lists before building the DataFrame
f1s = pd.DataFrame(list(f1s.values()), index=list(f1s.keys()), columns=["F1-score"])*100

In [None]:
f1s.plot(kind = "bar",  ylim=[80,100], figsize=(10,6), rot=0)