# Advanced Machine Learning and Artificial Intelligence (MScA 32017)

# Project: Anomalies Detection using Autoencoders

## Notebook 4: Project Instructions

## Yuri Balasanov, Andrey Kobyshev, &copy; iLykei 2018

This notebook describes the project.

# Instructions on data preparation using simple example

The following data preparation methods will be useful in the project.  

Once a model has been trained, it can be used to make predictions on new data. The new data must be prepared in the same way as the training data.

A simple example shows what needs to be done in this case.

## Toy Train Dataset

Consider a toy dataset containing some information about cars. The response contains two classes. 

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing

In [2]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [3]:
toyTrain = pd.DataFrame([['Cadillac', 'Red', 62391, 304, 1],
                          ['Cadillac', 'Blue', 2310, 213, 0], 
                          ['Ford', 'Red', 25391, 375, 0], 
                          ['Ford', 'Green', 12840, 160, 1]],
                         columns=['Brand', 'Color', 'Mileage','Horsepower','Class'])
toyTrain

Unnamed: 0,Brand,Color,Mileage,Horsepower,Class
0,Cadillac,Red,62391,304,1
1,Cadillac,Blue,2310,213,0
2,Ford,Red,25391,375,0
3,Ford,Green,12840,160,1


There are two categorical columns that represent Brand and Color.  

## One hot encoding

Transform character variables into numeric format using **one hot encoding** implemented in `pd.get_dummies()`.

In [4]:
toyTrain = pd.get_dummies(toyTrain, columns = ['Brand', 'Color'])
toyTrain

Unnamed: 0,Mileage,Horsepower,Class,Brand_Cadillac,Brand_Ford,Color_Blue,Color_Green,Color_Red
0,62391,304,1,1,0,0,0,1
1,2310,213,0,1,0,1,0,0
2,25391,375,0,0,1,0,0,1
3,12840,160,1,0,1,0,1,0


There are two "one hot" columns for **Brand** categories and three columns for **Color**.  
Create a list of features and make **"Class"** the last column.

In [5]:
featuresList = [col for col in toyTrain if col != 'Class']
print('featuresList: ',featuresList)
toyTrain = toyTrain[featuresList + ['Class']]
print('\nToy sample: \n')
toyTrain

featuresList:  ['Mileage', 'Horsepower', 'Brand_Cadillac', 'Brand_Ford', 'Color_Blue', 'Color_Green', 'Color_Red']

Toy sample: 



Unnamed: 0,Mileage,Horsepower,Brand_Cadillac,Brand_Ford,Color_Blue,Color_Green,Color_Red,Class
0,62391,304,1,0,0,0,1,1
1,2310,213,1,0,1,0,0,0
2,25391,375,0,1,0,0,1,0
3,12840,160,0,1,0,1,0,1


## Standardization 

Summaries of all variables are given by:

In [6]:
toyTrain.describe(percentiles=[])

Unnamed: 0,Mileage,Horsepower,Brand_Cadillac,Brand_Ford,Color_Blue,Color_Green,Color_Red,Class
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,25733.0,263.0,0.5,0.5,0.25,0.25,0.5,0.5
std,26196.642953,95.453304,0.57735,0.57735,0.5,0.5,0.57735,0.57735
min,2310.0,160.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,19115.5,258.5,0.5,0.5,0.0,0.0,0.5,0.5
max,62391.0,375.0,1.0,1.0,1.0,1.0,1.0,1.0


To standardize variables by z-scoring use `StandardScaler()`.

In [7]:
scaler = preprocessing.StandardScaler()
scaler.fit(toyTrain[featuresList]);

In [8]:
toyTrain[featuresList] = scaler.transform(toyTrain[featuresList])
toyTrain

Unnamed: 0,Mileage,Horsepower,Brand_Cadillac,Brand_Ford,Color_Blue,Color_Green,Color_Red,Class
0,1.615818,0.495978,1.0,-1.0,-0.57735,-0.57735,1.0,1
1,-1.032443,-0.604851,1.0,-1.0,1.732051,-0.57735,-1.0,0
2,-0.015075,1.354866,-1.0,1.0,-0.57735,-0.57735,1.0,0
3,-0.5683,-1.245993,-1.0,1.0,-0.57735,1.732051,-1.0,1


Check results of standardization.

In [9]:
print('Mean values:')
print(toyTrain[featuresList].mean())
print('\nStd values:')
print(toyTrain[featuresList].std(ddof=0))

Mean values:
Mileage          -2.775558e-17
Horsepower        5.551115e-17
Brand_Cadillac    0.000000e+00
Brand_Ford        0.000000e+00
Color_Blue       -5.551115e-17
Color_Green      -5.551115e-17
Color_Red         0.000000e+00
dtype: float64

Std values:
Mileage           1.0
Horsepower        1.0
Brand_Cadillac    1.0
Brand_Ford        1.0
Color_Blue        1.0
Color_Green       1.0
Color_Red         1.0
dtype: float64


Note that `ddof=0` (delta degrees of freedom) must be set explicitly: `pandas` computes the standard deviation with `ddof=1` by default, while `sklearn.preprocessing.StandardScaler` uses `numpy.std()` with `ddof=0`.  
$std = \sqrt{\frac{\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}{N-ddof}}$  
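
As a small illustration of the effect of `ddof`, the sketch below recomputes the standard deviation of the toy **Mileage** column both ways. The variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Mileage column of the toy train dataset
mileage = pd.Series([62391, 2310, 25391, 12840], dtype=float)

std_sample = mileage.std()            # pandas default: ddof=1, divide by N - 1
std_population = mileage.std(ddof=0)  # divide by N, as numpy.std() does by default

print(std_sample)       # the value reported by describe() for Mileage
print(std_population)   # smaller by a factor of sqrt((N - 1) / N)
```

Dividing by `std_population` is what gives the standardized column a standard deviation of exactly 1 when checked with `ddof=0`.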


## Preparing Test Dataset

Create test data manually.

In [10]:
toyTest = pd.DataFrame([['Cadillac', 'Black', 8332, 304, 1],
                          ['Chevrolet', 'Green', 3194, 355, 0]],
                         columns=['Brand', 'Color', 'Mileage','Horsepower','Class'])
toyTest

Unnamed: 0,Brand,Color,Mileage,Horsepower,Class
0,Cadillac,Black,8332,304,1
1,Chevrolet,Green,3194,355,0


To obtain predictions, transform the test dataset in the same way as the train dataset.

In [11]:
toyTest = pd.get_dummies(toyTest, columns = ['Brand', 'Color'])
toyTest

Unnamed: 0,Mileage,Horsepower,Class,Brand_Cadillac,Brand_Chevrolet,Color_Black,Color_Green
0,8332,304,1,1,0,1,0
1,3194,355,0,0,1,0,1


This creates a different set of "one hot" columns. 

To get the same set of columns, add the train set columns that are missing from the test set, filled with zeros, and drop the extra columns that were not included in the train set.

In [12]:
missing_cols = set(toyTrain.columns) - set(toyTest.columns)
print('missing_cols: ',missing_cols)
for c in missing_cols:
    toyTest[c] = 0
toyTest = toyTest[toyTrain.columns].copy()
toyTest

missing_cols:  {'Brand_Ford', 'Color_Blue', 'Color_Red'}


Unnamed: 0,Mileage,Horsepower,Brand_Cadillac,Brand_Ford,Color_Blue,Color_Green,Color_Red,Class
0,8332,304,1,0,0,0,0,1
1,3194,355,0,0,0,1,0,0


Now use the `scaler` that was fit to the train dataset to transform the test dataset.

In [13]:
toyTest[featuresList] = scaler.transform(toyTest[featuresList])
toyTest

Unnamed: 0,Mileage,Horsepower,Brand_Cadillac,Brand_Ford,Color_Blue,Color_Green,Color_Red,Class
0,-0.767005,0.495978,1.0,-1.0,-0.57735,-0.57735,-1.0,1
1,-0.993478,1.112926,-1.0,-1.0,-0.57735,1.732051,-1.0,0


The test dataset is now ready for making predictions.
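
The test preparation steps above can also be collected into one function. The following is only a sketch: `prepare_features` is a hypothetical helper, not part of the course code, and it uses `DataFrame.reindex()` with `fill_value=0` as an alternative to the explicit loop over missing columns. It returns the feature columns only, so labels need to be handled separately.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_features(raw, cat_cols, features, scaler):
    """Hypothetical helper: one-hot encode, align with train features, scale."""
    encoded = pd.get_dummies(raw, columns=cat_cols)
    # Add train columns that are missing (as zeros), drop unseen ones, fix the order
    aligned = encoded.reindex(columns=features, fill_value=0).astype(float)
    scaled = scaler.transform(aligned)
    return pd.DataFrame(scaled, columns=features, index=raw.index)

cols = ['Brand', 'Color', 'Mileage', 'Horsepower', 'Class']
train_raw = pd.DataFrame([['Cadillac', 'Red', 62391, 304, 1],
                          ['Cadillac', 'Blue', 2310, 213, 0],
                          ['Ford', 'Red', 25391, 375, 0],
                          ['Ford', 'Green', 12840, 160, 1]], columns=cols)
test_raw = pd.DataFrame([['Cadillac', 'Black', 8332, 304, 1],
                         ['Chevrolet', 'Green', 3194, 355, 0]], columns=cols)

# Fit the scaler on the encoded train features only
train_enc = pd.get_dummies(train_raw, columns=['Brand', 'Color'])
features = [c for c in train_enc.columns if c != 'Class']
toy_scaler = StandardScaler().fit(train_enc[features].astype(float))

test_ready = prepare_features(test_raw, ['Brand', 'Color'], features, toy_scaler)
print(test_ready.round(6))
```

The printed values should agree with the feature columns of the transformed `toyTest` shown above.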

# Project data and step-by-step instructions

The goal of this project is to detect illegitimate connections in a computer network using an **autoencoder**.

## Data set description

The data set for this project comes from [The Third International Knowledge Discovery and Data Mining Tools Competition](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) at KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The file `kddCupTrain.csv`, which holds the data necessary for this project, contains only one of the multiple types of attacks (see below).  

The competition [task](http://kdd.ics.uci.edu/databases/kddcup99/task.html) was building a network intrusion detector capable of distinguishing "bad" connections, called intrusions or attacks, from "good" normal connections. This database contains a variety of intrusions simulated in a military network environment.

The original KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. The simulated attacks fall into one of the following four categories:
1. Denial of Service Attack (DoS): is an attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
2. User to Root Attack (U2R): is a class in which the attacker starts out with access to a normal user account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and is able to exploit some vulnerability to gain root access to the system.
3. Remote to Local Attack (R2L): occurs when an attacker who has the ability to send packets to a machine over a network but who does not have an account on that machine exploits some vulnerability to gain local access as a user of that machine.
4. Probing Attack: is an attempt to gather information about a network of computers for the apparent purpose of circumventing its security controls.


Attacks contained in the dataset: 

Attack Category | Attack Type
--- | ---
DoS | back, land, neptune, <br>pod, smurf, teardrop
U2R | buffer_overflow, loadmodule, <br>perl, rootkit
R2L | ftp_write, guess_passwd, <br>imap, multihop, phf, <br>spy, warezclient, warezmaster
Probe | portsweep, ipsweep, <br>satan, nmap


KDD-99 features can be classified into three groups:  
1) **Basic features**: this category encapsulates all the attributes that can be extracted from a TCP/IP connection. Most of these features lead to an implicit delay in detection.  
2) **Traffic features**: this category includes features that are computed with respect to a window interval and is divided into two groups:
* **"same host" features**: examine only the connections in the past 2 seconds that have the same destination host as the current connection, and calculate statistics related to protocol behavior, service, etc.
* **"same service" features**: examine only the connections in the past 2 seconds that have the same service as the current connection.  

These two types of "traffic" features are called time-based as opposed to the following connection-based type. 

* **"connection-based" features**: there are several types of slow probing attacks that scan the hosts (or ports) using a much larger time interval than 2 seconds, for example, once every minute. As a result, these attacks do not produce intrusion patterns within a time window of 2 seconds. To detect such attacks, the "same host" and "same service" features are recalculated based on a connection window of 100 connections rather than a time window of 2 seconds. These features are called **connection-based traffic features**.  

3) **Content features**: unlike most of the DoS and Probing attacks, the R2L and U2R attacks don’t have any frequent sequential intrusion patterns. This is because the DoS and Probing attacks involve many connections to some host(s) in a very short period of time. Unlike them, the R2L and U2R attacks are embedded in the data portions of the packets, and normally involve only a single connection. To detect these kinds of attacks, one needs some features in order to look for suspicious behavior in the data portion, e.g., number of failed login attempts. These features are called **content features**.

#### Table 1: Basic features of individual TCP connections.
nn | feature name |	description |	type
--:|------------|-----------|-----------
0 | duration | length (number of seconds) of the connection |	continuous
1 | protocol_type |	type of the protocol, e.g. tcp, udp, etc. |	symbolic
2 | service |	network service on the destination, e.g., http, telnet, etc. |	symbolic
3 | flag |	normal or error status of the connection |	symbolic 
4 | src_bytes |	number of data bytes from source to destination |	continuous
5 | dst_bytes |	number of data bytes from destination to source |	continuous
6 | land |	1 if connection is from/to the same host/port; 0 otherwise |	binary
7 | wrong_fragment |	number of "wrong" fragments |	continuous
8 | urgent |	number of urgent packets |	continuous

#### Table 2: Content features within a connection suggested by domain knowledge.
nn | feature name |	description |	type
---:|------------ | ------------ | --------
9 | hot |	number of "hot" indicators |	continuous
10 | num_failed_logins |	number of failed login attempts |	continuous
11 | logged_in |	1 if successfully logged in; 0 otherwise |	binary
12 | num_compromised |	number of "compromised" conditions |	continuous
13 | root_shell |	1 if root shell is obtained; 0 otherwise |	binary
14 | su_attempted |	1 if "su root" command attempted; 0 otherwise |	binary
15 | num_root |	number of "root" accesses |	continuous
16 | num_file_creations |	number of file creation operations |	continuous
17 | num_shells |	number of shell prompts |	continuous
18 | num_access_files |	number of operations on access control files |	continuous
19 | num_outbound_cmds |	number of outbound commands in an ftp session |	continuous
20 | is_hot_login |	1 if the login belongs to the "hot" list; 0 otherwise |	binary
21 | is_guest_login |	1 if the login is a "guest" login; 0 otherwise |	binary

#### Table 3: Traffic features computed using a two-second time window.
nn  | feature name |	description |	type
---:|------------ | ------------ | --------
22 | count |	number of connections to the same host as the current connection in the past two seconds |	continuous
 | | Note: The following  features refer to these same-host connections.	|
23 | serror_rate |	% of connections that have "SYN" errors |	continuous
24 | rerror_rate |	% of connections that have "REJ" errors |	continuous
25 | same_srv_rate |	% of connections to the same service |	continuous
26 | diff_srv_rate |	% of connections to different services |	continuous
27 | srv_count |	number of connections to the same service as the current connection in the past two seconds |	continuous
 |  | Note: The following features refer to these same-service connections.	|
28 | srv_serror_rate |	% of connections that have "SYN" errors |	continuous
29 | srv_rerror_rate |	% of connections that have "REJ" errors |	continuous
30 | srv_diff_host_rate |	% of connections to different hosts |	continuous
31 | dst_host_count |	number of connections from the same address to the same host as the current connection in the past two seconds |	continuous
32 | dst_host_srv_count | number of connections from the same host to the same service as the current connection in the past two seconds | continuous
 |  | Note: The following features refer to these same-host and same-service connections.	|
33 | dst_host_same_srv_rate | |	continuous
34 | dst_host_diff_srv_rate | |	continuous
35 | dst_host_same_src_port_rate | |	continuous
36 | dst_host_srv_diff_host_rate | |	continuous
37 | dst_host_serror_rate | |	continuous
38 | dst_host_srv_serror_rate | |	continuous
39 | dst_host_rerror_rate | |	continuous
40 | dst_host_srv_rerror_rate | |	continuous

The attribute labeled **41** in the data set is the **"Class"** attribute which indicates whether a given instance is a normal connection instance or an attack.

## Project instructions

### 1. Preparing the data

#### 1.1. Reading data

Read the train dataset  `kddCupTrain.csv` from `kddCupData.zip` and check it for missing values.

In [14]:
kddCupTrain = pd.read_csv('kddCupTrain.csv',header=None)
print("Shape of kddCupTrain: ",kddCupTrain.shape)
print("There are any missing values: ", kddCupTrain.isnull().values.any())
print(kddCupTrain.head(3))

Shape of kddCupTrain:  (985262, 42)
There are any missing values:  False
   0    1     2   3    4      5   6   7   8   9   10  11  12  13  14  15  16  \
0   0  tcp  http  SF  215  45076   0   0   0   0   0   1   0   0   0   0   0   
1   0  tcp  http  SF  162   4528   0   0   0   0   0   1   0   0   0   0   0   
2   0  tcp  http  SF  236   1228   0   0   0   0   0   1   0   0   0   0   0   

   17  18  19  20  21  22  23   24   25   26   27   28   29   30  31  32   33  \
0   0   0   0   0   0   1   1  0.0  0.0  0.0  0.0  1.0  0.0  0.0   0   0  0.0   
1   0   0   0   0   0   2   2  0.0  0.0  0.0  0.0  1.0  0.0  0.0   1   1  1.0   
2   0   0   0   0   0   1   1  0.0  0.0  0.0  0.0  1.0  0.0  0.0   2   2  1.0   

    34   35   36   37   38   39   40       41  
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  normal.  
1  0.0  1.0  0.0  0.0  0.0  0.0  0.0  normal.  
2  0.0  0.5  0.0  0.0  0.0  0.0  0.0  normal.  


The train dataset contains instances of only two class types from the [original](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) data: 
- "normal." - no attack
- "ipsweep." - a subtype of "probe" or Probing Attack type.  

In [15]:
kddCupTrain.iloc[:,-1].unique()

array(['normal.', 'ipsweep.'], dtype=object)

Rename column '41' to 'Class' and transform its values from symbolic type to binary:  
- "normal." to 0  
- "ipsweep." to 1

In [16]:
kddCupTrain.rename(columns={41:'Class'}, inplace=True)
kddCupTrain['Class'] = np.where(kddCupTrain['Class'] == 'normal.', 0, 1)

Check counts of classes.

In [17]:
count_classes = kddCupTrain['Class'].value_counts(sort=True)
print(count_classes)

0    972781
1     12481
Name: Class, dtype: int64


The dataset is highly imbalanced: normal connections overwhelmingly outnumber attacks. 
This suggests using an autoencoder to detect attacks as rare deviations from normal behavior.

#### 1.2. Remove the uninformative columns

Look at summaries of numeric features.

In [18]:
print(kddCupTrain.describe(percentiles=[]))

                   0             4             5              6         7  \
count  985262.000000  9.852620e+05  9.852620e+05  985262.000000  985262.0   
mean      215.078631  1.459258e+03  3.193730e+03       0.000007       0.0   
std      1343.633640  1.097984e+05  3.401613e+04       0.002665       0.0   
min         0.000000  0.000000e+00  0.000000e+00       0.000000       0.0   
50%         0.000000  2.300000e+02  4.060000e+02       0.000000       0.0   
max     58329.000000  8.958152e+07  1.173059e+07       1.000000       0.0   

                   8              9             10             11  \
count  985262.000000  985262.000000  985262.000000  985262.000000   
mean        0.000036       0.048908       0.000097       0.710185   
std         0.015897       0.926008       0.013058       0.453677   
min         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       1.000000   
max        14.000000      77.000000       4.00

Note that some features are constant (min = max and std = 0.0). Such features carry no information and should be removed using

`kddCupTrain.drop(columnsList, axis=1, inplace=True)`.
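
One possible way to build `columnsList` is to look for columns with a single distinct value. The sketch below uses a small stand-in data frame `df` invented for illustration; in the project the same logic would be applied to `kddCupTrain`.

```python
import pandas as pd

# Small stand-in frame; column labels mimic the integer labels of kddCupTrain
df = pd.DataFrame({0: [0, 5, 2],
                   7: [0, 0, 0],
                   1: ['tcp', 'udp', 'tcp'],
                   19: [0, 0, 0]})

# A column is constant when it holds exactly one distinct value
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
print(constant_cols)  # [7, 19]

df = df.drop(columns=constant_cols)
print(list(df.columns))  # [0, 1]
```

Checking `nunique()` works for both numeric and symbolic columns, whereas `std = 0.0` is only available for numeric ones.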
  
#### 1.3. Transform symbolic features to "One Hot" columns

Transform character features "1", "2" and "3" into "One Hot" columns using `pandas.get_dummies()` as shown in the section above. 

As a result, the first two rows of `kddCupTrain` should look like this:  

|  0  |  4  |    5 | 6 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 20 | 21 | 22 | 23   
--|--
0 | 0 | 215 | 45076 | 0 | 0 | 0 |  0 |  1 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  1 | 1   
1 | 0 | 162 | 4528 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2   

| 24 | 25 | 26 | 27 |  28 |  29  | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38
--|--
0|0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |  0 |  0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0   
1|0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |  1 |  1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0   

  | 39 | 40 | Class | 1_icmp | 1_tcp | 1_udp | 2_IRC | 2_X11 | 2_auth | 2_ctf
--|--
0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 |  0  |    0  |     0  |    0   
1 | 0.0 | 0.0 | 0 | 0 | 1 | 0 |  0  |    0  |     0  |    0   

 | 2_domain | 2_domain_u | 2_eco_i | 2_ecr_i | 2_finger | 2_ftp | 2_ftp_data
 --|--
0  |   0  |    0  |   0  |  0  |   0  |    0   |   0   
1  |   0  |    0  |   0  |  0  |   0  |    0   |   0   

| 2_gopher | 2_http | 2_imap4 | 2_link | 2_mtp | 2_name | 2_ntp_u | 2_other 
--|--
0 |   0  |     1   |     0  |     0  |    0  |     0  |      0   |     0   
1 |   0  |     1   |     0  |     0  |    0  |     0  |      0   |     0   

| 2_pop_3 | 2_private | 2_red_i | 2_remote_job | 2_rje | 2_shell | 2_smtp | 2_ssh 
--|--
0 |       0 |         0 |       0 |            0 |     0 |       0 |      0 |     0   
1 |       0 |         0 |       0 |            0 |     0 |       0 |      0 |     0   

| 2_telnet | 2_tftp_u | 2_tim_i | 2_time | 2_urh_i | 2_urp_i | 2_whois | 3_OTH  
--|--
0 |        0 |     0 |       0 |      0 |       0 |       0 |       0 |     0   
1 |        0 |     0 |       0 |      0 |       0 |       0 |       0 |     0   

| 3_REJ | 3_RSTO | 3_RSTR | 3_S0 | 3_S1 | 3_S2 | 3_S3 | 3_SF | 3_SH  
--|--
0 |     0 |      0 |      0 |    0 |    0 |    0 |    0 |    1 |    0  
1 |     0 |      0 |      0 |    0 |    0 |    0 |    0 |    1 |    0  

After removing uninformative variables and replacing character variables with one-hot encoding, the dataset contains 83 numeric features.

#### 1.4. Standardize the training dataset

Create a list of features and standardize the feature columns using `sklearn.preprocessing.StandardScaler`.

**Further steps are similar to those covered in the notebook `MScA_32017_AMLAI_AE3_FraudDetection.ipynb`.**

#### 1.5. Split the data into train and test subsets 

Use the `sklearn.model_selection.train_test_split()` function. Reserve 20% of the data for the test set. Do not forget to set the parameter `stratify` to keep the class ratio the same in both data sets.
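
A minimal sketch of a stratified split is shown below. The data frame `data`, its column names, and the `random_state` value are assumptions made for this example; in the project the prepared `kddCupTrain` would take the place of `data`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 5% of rows belong to class 1
rng = np.random.default_rng(0)
data = pd.DataFrame({'f1': rng.normal(size=1000),
                     'f2': rng.normal(size=1000),
                     'Class': [1] * 50 + [0] * 950})

# stratify keeps the share of each class equal in both parts
train, test = train_test_split(data, test_size=0.2,
                               stratify=data['Class'], random_state=42)

print(train.shape, test.shape)
print(train['Class'].mean(), test['Class'].mean())
```

Both printed class shares should equal the 5% share of the full synthetic data.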

#### 1.6. Detach the labels from the train and the test datasets

#### 1.7. Separate the "normal" instances

An autoencoder will be trained to reconstruct class "normal". Separate the "normal" instances in both `train` and `test` datasets.
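
Steps 1.6 and 1.7 can be sketched as follows. The tiny `train` frame and the names `x_train`, `y_train`, `x_train_normal` are illustrative assumptions; the same operations would be repeated for the test subset.

```python
import numpy as np
import pandas as pd

features = ['f1', 'f2']
train = pd.DataFrame({'f1': [0.1, 0.2, 3.0, 0.0],
                      'f2': [1.0, 0.9, -2.0, 1.1],
                      'Class': [0, 0, 1, 0]})

# 1.6: detach the labels from the features
y_train = train['Class'].to_numpy()
x_train = train[features].to_numpy()

# 1.7: keep only the "normal" rows (Class == 0) for fitting the autoencoder
x_train_normal = x_train[y_train == 0]
print(x_train_normal.shape)  # (3, 2)
```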

### 2. Build the model

Follow the steps of `MScA_32017_AMLAI_AE3_FraudDetection.ipynb` to create a similar model for this project.

#### 2.1. Select architecture of autoencoder

Try different numbers and dimensions of layers. Use `BatchNormalization` and `Dropout` layers to achieve better results.

#### 2.2. Fit the model

Fit autoencoder to the "normal" instances of the train dataset.  
Use `ModelCheckpoint` callback to save the best model to file:

`checkpointer = ModelCheckpoint(filepath="autoencoder.h5",
                               verbose=0,
                               save_best_only=True)`    
                               
### 3. Evaluation

#### 3.1. Load the fitted autoencoder from file `"autoencoder.h5"`.

#### 3.2.  Reconstruction

Reconstruct the **test** dataset using the fitted autoencoder, calculate the **mean squared error** of the prediction.  

#### 3.3. Evaluate

Calculate MSE of each observation by averaging squared errors of reconstruction of all features in a row.  
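
The per-row MSE can be computed with `numpy` as in the sketch below. The arrays `x_test` and `x_reconstructed` are made-up stand-ins; in the project the second one would be the autoencoder output for the test data.

```python
import numpy as np

# Stand-in for the test features and their reconstruction
x_test = np.array([[0.0, 1.0, 2.0],
                   [1.0, 1.0, 1.0]])
x_reconstructed = np.array([[0.0, 1.0, 1.0],
                            [1.0, 3.0, 1.0]])

# Average the squared errors over the features (axis=1) of each row
mse_per_row = np.mean((x_test - x_reconstructed) ** 2, axis=1)
print(mse_per_row)  # one value per observation: 1/3 and 4/3
```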

Calculate ROC and AUC. Select appropriate quantile and calculate Accuracy and Cohen's Kappa.
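
A possible evaluation sketch using `sklearn.metrics` is given below. The reconstruction errors are synthetic, and the 0.95 quantile is an arbitrary choice for illustration, not a recommended value.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, cohen_kappa_score

# Synthetic example: 950 normal rows and 50 attacks;
# attacks are generated with larger reconstruction errors on average
rng = np.random.default_rng(1)
y_true = np.array([0] * 950 + [1] * 50)
mse = np.concatenate([rng.gamma(2.0, 0.05, 950),
                      rng.gamma(2.0, 0.5, 50)])

# AUC uses the raw errors as anomaly scores; no threshold is needed
auc = roc_auc_score(y_true, mse)

# Rows with error above the chosen quantile are flagged as attacks
threshold = np.quantile(mse, 0.95)
y_pred = (mse > threshold).astype(int)

print('AUC:     ', auc)
print('Accuracy:', accuracy_score(y_true, y_pred))
print('Kappa:   ', cohen_kappa_score(y_true, y_pred))
```

With highly imbalanced classes, accuracy alone can look good even for a poor detector, which is why Cohen's Kappa is reported alongside it.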

Tune the parameters of the autoencoder to achieve better results (see Section 2.1). Try to reach an AUC of 0.97 or better to get a good score.

### 4. Create submission

#### 4.1. Read test data

Read the **test** dataset from ['kddCupTest.csv'](kddCupTest.csv) and check it for missing values.  

In [19]:
kddCupTest = pd.read_csv('kddCupTest.csv', header=None)
print(kddCupTest.head(3))

   0    1         2   3    4    5   6   7   8   9   10  11  12  13  14  15  \
0   0  tcp      http  SF  220  370   0   0   0   0   0   1   0   0   0   0   
1   0  udp   private  SF  105  145   0   0   0   0   0   0   0   0   0   0   
2   0  tcp  ftp_data  SF  245    0   0   0   0   0   0   0   0   0   0   0   

   16  17  18  19  20  21  22  23   24   25   26   27   28   29   30   31  \
0   0   0   0   0   0   0   4   4  0.0  0.0  0.0  0.0  1.0  0.0  0.0   23   
1   0   0   0   0   0   0   1   1  0.0  0.0  0.0  0.0  1.0  0.0  0.0  255   
2   0   0   0   0   0   0   1   1  0.0  0.0  0.0  0.0  1.0  0.0  0.0  227   

    32    33    34    35    36   37    38   39    40  
0  255  1.00  0.00  0.04  0.02  0.0  0.01  0.0  0.03  
1  241  0.95  0.01  0.00  0.00  0.0  0.00  0.0  0.00  
2   71  0.31  0.02  0.31  0.00  0.0  0.00  0.0  0.00  


Notice that there are no labels in the test dataset. The feature columns are the same.

#### 4.2. Do "One hot" transformation of categorical features

Don't forget to make the features list exactly the same as in the **train** dataset.  

#### 4.3. Standardize the test dataset

Use the scaler fitted to the training dataset.

#### 4.4. Make predictions and save the results to csv file.

Reconstruct the **kddCupTest** dataset, calculate **mean squared error** as reconstruction error.  
Save MSE to a csv file. 

`result_df = pd.DataFrame({'reconstruction_error': testMSE})
 result_df.to_csv('filename.csv')`

The format should be as follows:  
  
,reconstruction_error  
0,0.019312
1,0.049165
2,0.084997 

#### 4.5. Upload the results

Upload the saved file using [shiny test application](http://shiny.ilykei.com:3838/courses/AdvancedML/AutoEncoder).
The uploaded results will be used to calculate AUC. The goal for this project is to reach an AUC of at least 0.97.
