<a href="https://colab.research.google.com/github/securitylab-repository/ia-security-project-2021/blob/main/projetiasec2021_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI & Security Project - 2021

## Part 1

### 1. 
<img src='https://drive.google.com/uc?id=17QDSJ6k0ANSKEReSUUYG99WEE4DdILfI' alt="50%" style="zoom:20%" />

As we saw in class, linear classifiers such as the perceptron and SVM work only if the dataset is linearly separable, as in the image above.

However, two techniques, basis expansion and kernels, extend these linear classifiers so that they remain efficient and practical when the data are not linearly separable but instead admit a quadratic or polynomial separator (cf. images below).

<img src='https://drive.google.com/uc?id=1BJ99omnZMtaFDW1_iXROEaeWx7KQFg3z' alt="50%" style="zoom:20%" />

![](https://drive.google.com/uc?id=1fNsV8VR4DvXnXQNNwQRu8DqPixprn3D6)
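
As a starting point for the presentation, here is a minimal sketch of both ideas on toy data. The points, the feature map `phi`, and the weights `w`, `b` are all made up for illustration; they are not taken from the course material. The sketch shows that a circular boundary in the original 2-D space becomes a linear rule after a quadratic basis expansion, and that the degree-2 polynomial kernel returns the same inner product without building the expanded vectors.

```python
import math

def phi(x):
    """Explicit quadratic basis expansion of a 2-D point (illustrative)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    """Plain inner product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

# Toy data (hypothetical): points inside the unit circle are class -1,
# points outside are class +1, so no straight line separates them in 2-D.
points = [(0.5, 0.0), (0.0, -0.5), (2.0, 0.0), (0.0, -2.0), (1.5, 1.5)]
labels = [-1, -1, 1, 1, 1]

# In the expanded space the rule w . phi(x) + b > 0 is linear; with these
# hand-picked weights it is the circle x1^2 + x2^2 = 1 in the original space.
w, b = (1.0, 0.0, 1.0), -1.0
predictions = [1 if dot(w, phi(p)) + b > 0 else -1 for p in points]

# Kernel trick: k(x, z) equals the inner product of the expanded vectors.
kernel_matches = all(
    math.isclose(poly_kernel(x, z), dot(phi(x), phi(z)))
    for x in points for z in points
)

print(predictions == labels, kernel_matches)
```

A kernelized perceptron or SVM only ever needs `poly_kernel(x, z)`, which is why it can work in high-dimensional expanded spaces without computing `phi` explicitly.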


> See these videos

- [video 1](https://efrei365net-my.sharepoint.com/:v:/g/personal/boussad_aitsalem_efrei_net/Ec_sBy1Rzq5JuI_kcoDoia0BAdNvbf7puoawodRgTCWFGQ?e=8dFfXe)
- [video 2](https://efrei365net-my.sharepoint.com/:v:/g/personal/boussad_aitsalem_efrei_net/Ec_sBy1Rzq5JuI_kcoDoia0BAdNvbf7puoawodRgTCWFGQ?e=w4D8HC)
- [video 3](https://efrei365net-my.sharepoint.com/:v:/g/personal/boussad_aitsalem_efrei_net/EfhSuex2aqlKnhE4TAU-PHQBwCcPJdPupeIEpQW4ZyWlHg?e=cHwbX1)
- [video 4](https://efrei365net-my.sharepoint.com/:v:/g/personal/boussad_aitsalem_efrei_net/EeeVAcBUz5VEj9zsUobX_ZMBotaaKI5sssw0_M-1KdLRyQ?e=vhFaCP)

and prepare a presentation explaining how to go from the linear perceptron/SVM algorithm to its non-linear counterpart.

### 2.

In class we discussed binary classification. In practice, however, we often face multi-class learning problems. In this context, explain the differences between:

  - The One-vs-Rest strategy
  - The One-vs-One strategy 
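
To make the comparison concrete, the sketch below only enumerates the binary sub-problems each strategy would create; it trains nothing. The class list reuses the five labels from Part 2 purely as an example, and the variable names are hypothetical.

```python
from itertools import combinations

# Example class labels (borrowed from Part 2, for illustration only)
classes = ["benign", "dos", "r2l", "u2r", "probe"]

# One-vs-Rest: one binary problem per class (that class vs. all the others)
ovr_problems = [(c, [o for o in classes if o != c]) for c in classes]

# One-vs-One: one binary problem per unordered pair of classes
ovo_problems = list(combinations(classes, 2))

print(len(ovr_problems))  # K binary problems
print(len(ovo_problems))  # K * (K - 1) / 2 binary problems
```

Beyond the number of classifiers, the answer should also consider how much data each binary problem sees and how the individual decisions are combined into a final prediction.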

## Part 2

In this section you will try to build a network attack classifier from scratch using machine learning. The dataset that we will use is the [NSL-KDD dataset](https://www.unb.ca/cic/datasets/nsl.html), which is an improvement to a classic network intrusion detection dataset used widely by security data science professionals. The original [1999 KDD Cup dataset](https://kdd.ics.uci.edu/databases/kddcup99/task.html) was created for the DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Laboratory.

The data was collected over nine weeks and consists of raw tcpdump traffic in a local area network (LAN) that simulates the environment of a typical United States Air Force LAN. Some network attacks were deliberately carried out during the recording period. There were 38 different types of attacks, but only 24 are available in the training set. These attacks belong to four general categories:

- `dos`: Denial of service
- `r2l`: Unauthorized accesses from remote servers
- `u2r`: Privilege escalation attempts
- `probe`: Brute-force probing attacks



### Exploring the Data

Let’s begin by getting more intimate with the data on hand. The labeled training data as comma-separated values (CSV) looks like this:
```
0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,
0.00,0.00,1.00,0.00,0.00,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,
0.00,normal,20
```
```
0,icmp,eco_i,SF,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,21,0.00,0.00,
0.00,0.00,1.00,0.00,1.00,2,60,1.00,0.00,1.00,0.50,0.00,0.00,0.00,
0.00,ipsweep,17 
```

The last value in each record is an artifact of the NSL-KDD improvement that we can ignore. The class label is the second-to-last value in each record, and the other 41 values correspond to these features:

<img src='https://drive.google.com/uc?id=1gw43YWxXRbDC2PHqIV2v_Wsv0GxylnG8' alt="50%" style="zoom:20%;" />

<img src='https://drive.google.com/uc?id=1L1ItB_4j1wYXhyvjCVibs25mMBq8eOU5' alt="50%" style="zoom:20%;" />

<img src='https://drive.google.com/uc?id=1BuM-twCi4hgKKPg63ULjjtgWNaQqurSQ' alt="50%" style="zoom:20%;" />


## Reading and processing the dataset



In [None]:
import os
from collections import defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
dataset_root = 'datasets/nsl-kdd'

In [None]:
train_file = os.path.join(dataset_root, 'KDDTrain+.txt')
test_file = os.path.join(dataset_root, 'KDDTest+.txt')

In [None]:
# header_names is a list of feature names in the same order as the data
header_names = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'attack_type', 'success_pred']

Our task is to devise a general classifier that categorizes each individual sample as one of five classes: `benign`, `dos`, `r2l`, `u2r`, or `probe`. The training dataset contains samples that are labeled with the specific attack: `ftp_write` and `guess_passwd` attacks correspond to the `r2l` category, `smurf` and `udpstorm` correspond to the `dos` category, and so on. The mapping from attack labels to attack categories is specified in the file `training_attack_types.txt`.

Thanks to the following code, we find that there are 41 attack types specified, each belonging to one of the categories `benign`, `dos`, `r2l`, `u2r`, or `probe`. The mapping from attack types to categories is saved in the `attack_mapping` variable.


In [None]:
# training_attack_types.txt maps each of the 22 different attacks to 1 of 4 categories
# file obtained from http://kdd.ics.uci.edu/databases/kddcup99/training_attack_types

category = defaultdict(list)
category['benign'].append('normal')

with open('datasets/training_attack_types.txt', 'r') as f:
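    for line in f:
        # Skip blank lines; every other line is "<attack> <category>"
        if not line.strip():
            continue
        attack, cat = line.split()
        category[cat].append(attack)

# Invert category -> [attacks] into attack -> category
attack_mapping = {attack: cat for cat, attacks in category.items() for attack in attacks}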
print(attack_mapping)

> Starting from this point, create the training and test sets by:
  - adding a new feature `attack_category` that gives the category of each sample.
  - removing any unnecessary features.
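
One possible shape for this step is sketched below. It is only a sketch: the column list is abridged, the two records are shortened versions of the samples shown earlier, and the small `attack_mapping` dictionary stands in for the one built from `training_attack_types.txt`. Treating `success_pred` as the unnecessary feature is an assumption based on the earlier remark that the last value of each record can be ignored.

```python
import io
import pandas as pd

# Abridged, hypothetical column list: the real files have the 43 columns
# listed in header_names
columns = ['duration', 'protocol_type', 'service', 'flag',
           'attack_type', 'success_pred']

# Two shortened records standing in for KDDTrain+.txt
raw = io.StringIO(
    "0,tcp,ftp_data,SF,normal,20\n"
    "0,icmp,eco_i,SF,ipsweep,17\n"
)

# Toy stand-in for the attack_mapping dictionary
attack_mapping = {'normal': 'benign', 'ipsweep': 'probe'}

# The data files have no header row, so column names are passed explicitly
train_df = pd.read_csv(raw, names=columns)

# New feature: the category of each specific attack label
train_df['attack_category'] = train_df['attack_type'].map(attack_mapping)

# Assumption: success_pred is the artifact column that can be dropped
train_df = train_df.drop(columns=['success_pred'])

print(train_df)
```

With the real files, `raw` would be replaced by `train_file` or `test_file` and `columns` by `header_names`. Labels missing from the mapping become `NaN` after `map`, which is worth checking on the test set.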

## Generating and analyzing train and test sets

> To understand the two datasets, let’s **compute** and **plot** the `attack_type` and `attack_category` distributions.

> What do you notice?
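
A distribution can be computed with `Series.value_counts()`. The sketch below uses a tiny hypothetical label column in place of `attack_category`, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical labels standing in for the attack_category column
attack_category = pd.Series(
    ['benign', 'benign', 'benign', 'dos', 'dos', 'probe'],
    name='attack_category',
)

# Absolute and relative class frequencies, most frequent first
counts = attack_category.value_counts()
shares = attack_category.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

In the notebook, `counts.plot(kind='barh')` draws the same distribution as a horizontal bar chart. Running this on both the training and the test set makes the two distributions easy to compare.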

## Data preparation

> Split the **test** and **training** DataFrames into **data (x)** and **labels (y)**
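
A minimal sketch of the split, assuming the label columns are named `attack_type` and `attack_category`. The frame below is a made-up stand-in for the real training DataFrame.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the training DataFrame
train_df = pd.DataFrame({
    'duration': [0, 0, 2],
    'flag': ['SF', 'S0', 'SF'],
    'attack_type': ['normal', 'neptune', 'ipsweep'],
    'attack_category': ['benign', 'dos', 'probe'],
})

# Labels (y): the five-class target
train_y = train_df['attack_category']

# Data (x): every column except the two label columns
train_x = train_df.drop(columns=['attack_category', 'attack_type'])

print(train_x.columns.tolist(), train_y.tolist())
```

Dropping `attack_type` along with `attack_category` matters: leaving either one in `train_x` would leak the answer to the classifier.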

The following code uses [kddcup.names](http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names) to distinguish between nominal (categorical), binary, and numeric (continuous) features.

In [None]:
# Differentiating between nominal, binary, and numeric features

# root_shell is marked as a continuous feature in the kddcup.names 
# file, but it is supposed to be a binary feature according to the 
# dataset documentation

col_names = np.array(header_names)

nominal_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 20, 21]
numeric_idx = sorted(set(range(41)) - set(nominal_idx) - set(binary_idx))

nominal_cols = col_names[nominal_idx].tolist()
binary_cols = col_names[binary_idx].tolist()
numeric_cols = col_names[numeric_idx].tolist()


> Use the function `pd.get_dummies()`, which applies [one-hot encoding](https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding) to categorical (nominal) variables such as `flag`, creating one binary variable for each possible value of `flag` that appears in the dataset. For instance, if a sample has the value `flag=S2`, its dummy variable representation (for `flag`) will be:

```
# flag_S0, flag_S1, flag_S2, flag_S3, flag_SF, flag_SH
[    0,       0,       1,       0,       0,       0    ]
```

> For each sample, only one of these variables can have the value 1; hence the name “one-hot.”
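
The sketch below shows one way to apply `pd.get_dummies()` so that the training and test sets end up with identical columns. The two frames are hypothetical, and encoding them together is a design choice made here for illustration, not something the assignment prescribes.

```python
import pandas as pd

# Hypothetical toy frames; only the nominal column 'flag' is shown
train_x = pd.DataFrame({'duration': [0, 5, 1], 'flag': ['SF', 'S2', 'S0']})
test_x = pd.DataFrame({'duration': [3, 0], 'flag': ['SF', 'SH']})

# Encode train and test together so both get the same dummy columns,
# even for values (here SH) that appear in only one of them
combined = pd.concat([train_x, test_x], keys=['train', 'test'])
combined = pd.get_dummies(combined, columns=['flag'], dtype=int)

# Split back using the outer index level added by concat
train_x = combined.loc['train']
test_x = combined.loc['test']

print(train_x)
```

Encoding each set separately would give `train_x` no `flag_SH` column and `test_x` no `flag_S0` or `flag_S2` column, and a model trained on one layout cannot score the other.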



The following code gives us the descriptive statistics for the `duration` feature. 

In [None]:
# Example statistics for the 'duration' feature before scaling
train_x['duration'].describe()

> Extend the code to all continuous features.

> Is the dataset standardized or normalized? If not, fix it.
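
One way to approach both questions is sketched below on hypothetical data. Here `numeric_cols` is shortened to two features, and standardization (zero mean, unit variance) is chosen as the fix; min-max normalization would be an equally valid answer.

```python
import pandas as pd

# Hypothetical toy frames with two of the continuous features
numeric_cols = ['duration', 'src_bytes']
train_x = pd.DataFrame({'duration': [0.0, 2.0, 4.0],
                        'src_bytes': [100.0, 300.0, 500.0]})
test_x = pd.DataFrame({'duration': [1.0, 3.0],
                       'src_bytes': [200.0, 900.0]})

# Descriptive statistics for all continuous features at once
print(train_x[numeric_cols].describe())

# Standardize using statistics computed on the training set only
mean = train_x[numeric_cols].mean()
std = train_x[numeric_cols].std(ddof=0)  # population standard deviation
train_x[numeric_cols] = (train_x[numeric_cols] - mean) / std
test_x[numeric_cols] = (test_x[numeric_cols] - mean) / std

print(train_x[numeric_cols].describe())
```

The test set is scaled with the training mean and standard deviation so that no information from the test data leaks into preprocessing. A constant column has zero standard deviation and would need separate handling before the division.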

### Classification

It is now time to train and test the models. For each algorithm (SVM, k-NN, and decision tree):
- Train a model
- Use cross-validation to set the algorithm's parameters
- Compute the error rate
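
The three steps can be sketched with scikit-learn as follows. Using scikit-learn is an assumption, since the notebook does not name a library. The synthetic data, the parameter grids, and the 5-fold setting are likewise placeholders to adapt to the prepared NSL-KDD features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared train/test sets
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
train_x, test_x, train_y, test_y = X[:200], X[200:], y[:200], y[200:]

# One estimator and one small, illustrative parameter grid per algorithm
candidates = {
    'SVM': (SVC(), {'C': [0.1, 1, 10]}),
    'kNN': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}),
    'Decision tree': (DecisionTreeClassifier(random_state=0),
                      {'max_depth': [3, 5, None]}),
}

error_rates = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validation on the training set picks the parameters
    search = GridSearchCV(estimator, grid, cv=5)
    search.fit(train_x, train_y)
    # Error rate on the held-out test set = 1 - accuracy
    error_rates[name] = 1.0 - search.score(test_x, test_y)
    print(name, search.best_params_, round(error_rates[name], 3))
```

`GridSearchCV` refits the best configuration on the whole training set by default, so `search.score` evaluates the tuned model. The test set is touched only once, after the parameters are chosen.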


> Are you satisfied with the error rate? If not:
  - Can the high error rate be explained by the problem raised in the `Generating and analyzing train and test sets` section?
  - Try to address the problem.