# Pickle CICIDS 2017

This notebook's intended use is to load the CSV data into a Pandas DataFrame, normalize and scale the data, then write the DataFrame into a pickle so that these steps don't have to be repeated for every ML framework run.  
The output is two pickle files: cic_data.pkl and cic_labels.pkl.  
**TODO**: Split these into training and test in a meaningful way! (Most likely by hand?)  
These pickles can be restored as DataFrames by calling [pandas.read_pickle()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_pickle.html).  
**Hint**: If you are missing the *_clean* CSVs, try running the notebook *Data Sanitazation.ipynb*

In [None]:
import os
import json
import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer
from sklearn import preprocessing
pd.set_option('display.max_columns', None)

## Data Loading and Prep

As there is literally "Infinity" written in the CSV dataset, we set an additional filter so that these values are replaced by a NaN representation (which will later be set to zero).  
Also, the external_ip field is set to 0.0.0.0 if it is either NaN or nonexistent.  
Finally, the dtype for the external_ip column had to be set manually to object as Pandas kept getting confused.

In [None]:
datafile_names_sorted = [
    'Monday-WorkingHours.pcap_ISCX_clean.csv',
    'Tuesday-WorkingHours.pcap_ISCX_clean.csv',
    'Wednesday-WorkingHours.pcap_ISCX_clean.csv',
    'Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX_clean.csv',
    'Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX_clean.csv',
    'Friday-WorkingHours-Morning.pcap_ISCX_clean.csv',
    'Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX_clean.csv',
    'Friday-WorkingHours-Afternoon-DDos.pcap_ISCX_clean.csv'
]

# Collect the per-file DataFrames and concatenate them once at the end
# (DataFrame.append was removed in pandas 2.0)
flow_frames = []

for filename in datafile_names_sorted:
    inputFileName = os.path.join('CICIDS2017', filename)
    print('Appending', inputFileName)
    
    new_flows = pd.read_csv(
        inputFileName,
        na_values="Infinity",
        dtype={'external_ip':'object'},
        parse_dates=['timestamp']
    )
    
    # as this field is not in all flows, double check for it
    if 'external_ip' not in new_flows:
        new_flows['external_ip'] = "0.0.0.0"
    new_flows['external_ip'] = new_flows['external_ip'].fillna("0.0.0.0")
    
    flow_frames.append(new_flows)

cic_data = pd.concat(flow_frames, ignore_index=True, sort=False)

print('Found these class labels:', str(cic_data.label.unique()))

In [None]:
cic_data.fillna(value=0, inplace=True)
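
The "Infinity" handling can be tried in isolation. The following is a minimal sketch on an in-memory toy CSV; the column names `flow_bytes_s` and `flow_pkts_s` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Toy CSV with hypothetical column names and literal "Infinity" entries
csv_text = "flow_bytes_s,flow_pkts_s\n10.5,Infinity\nInfinity,3.0\n"

# na_values turns every "Infinity" string into NaN while parsing
toy = pd.read_csv(io.StringIO(csv_text), na_values="Infinity")
nan_count = int(toy.isnull().sum().sum())

# Same second step as in the notebook: replace the NaNs by zero
toy = toy.fillna(0)
remaining_nan = bool(toy.isnull().values.any())

print(nan_count, remaining_nan)
```

Note that zero is a placeholder choice here, just as in the notebook; it does not preserve the information that the original value was infinite.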

In [None]:
print(cic_data.isnull().values.any())

In [None]:
cic_data.head()

## Data Encoding

There's still a problem: How can we encode IP addresses in a way that the neural network can make use of them while preserving the hierarchical information they contain?  
Encoding IPs through one-hot encoding makes complexity and training times explode, so for now I am splitting each IP into its four octets and interpreting them as numbers.  
Maybe there's a better way to represent them (especially because I am only able to encode IPv4 right now).  

**Important**: If this breaks, you forgot to remove the broken external IP in Friday DDoS @ 2017-07-07T15:58:00,26794

In [None]:
# https://stackoverflow.com/questions/14745022/how-to-split-a-column-into-two-columns
# FIXME: Right now, only IPv4 (4 octets)

# Split the string representation of each IP into its four octets, which are delimited by a dot.
# expand=True returns one column per octet; assigning the four column names fails
# if an address does not split into exactly four parts.
for ip_column in ['source_ip', 'destination_ip', 'external_ip']:
    octets = cic_data[ip_column].str.split('.', expand=True)
    octets.columns = ['{}_o{}'.format(ip_column, i) for i in range(1, 5)]
    cic_data = pd.concat([cic_data, octets], axis=1)

# After completion, drop the initial columns, as they aren't needed anymore
cic_data.drop(['source_ip'], axis=1, inplace=True)
cic_data.drop(['destination_ip'], axis=1, inplace=True)
cic_data.drop(['external_ip'], axis=1, inplace=True)
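
A malformed address like the one mentioned above can be located before splitting. This is a minimal sketch on a toy frame (the addresses are invented), assuming that a well-formed IPv4 string contains exactly three dots; it also converts the octets to integers, which the notebook itself does not do at this point.

```python
import pandas as pd

# Toy data; the last entry is malformed on purpose
toy = pd.DataFrame({"source_ip": ["192.168.10.5", "10.0.0.1", "8.8.8"]})

# A dotted IPv4 string has exactly three dots
is_valid = toy["source_ip"].str.count(r"\.") == 3
malformed = toy[~is_valid]

# Split only the valid rows and interpret every octet as an integer
clean = toy[is_valid].copy()
octets = clean["source_ip"].str.split(".", expand=True).astype(int)
octets.columns = ["source_ip_o{}".format(i) for i in range(1, 5)]

print(malformed)
print(octets)
```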

In [None]:
# as we're going to normalize the dataset, we should drop the flow_id, as this info cannot be normalized.
cic_data.drop(['flow_id'], axis=1, inplace=True)

# Finally, let's inspect the outcome
cic_data.head()

Now that this is out of the way, we still need to encode the labels column to numeric values.  
To do this, I'm going to be using the Keras Tokenizer.  
The labels of the dataset (as in: *Benign*, *DDoS*, *Portscan*, etc) are converted into a list of integers and split off of the main DataFrame.  
After this step there is a variable `cic_labels` that holds an integer-encoded list of labels.  
A humble example (not representative):  

|Label       | Value |
|------------|------:|
|Benign      | 1|
|DDoS        | 2|
|Portscan    | 3|  

So if the first three Netflows are *Benign*, *Benign*, *DDoS*,  
the resulting `enc_labels` would look like this: `[[1], [1], [2]]` (the Tokenizer starts counting at 1 and returns one list per label).

In [None]:
cic_labels = cic_data.filter(['label'])
cic_data.drop(['label'], axis=1, inplace=True)

In [None]:
number_of_classes = len(cic_labels['label'].unique())

# FIXME: don't fit the tokenizer on the full ds! Split off a test set first!

# tokenize the LABELS
label_tokenizer = Tokenizer(num_words=number_of_classes+1, filters='') # don't filter any of the characters. 1 entry = 1 label 
label_tokenizer.fit_on_texts(cic_labels['label'].unique())

# Run the fitted tokenizer on the label column and save the encoded data as dataframe
enc_labels = label_tokenizer.texts_to_sequences(cic_labels['label'])

# finally, append the encoded labels to the label dataframe
cic_labels = pd.concat([cic_labels, pd.DataFrame(columns=['label_encoded'],dtype=np.int8,data=enc_labels)], axis=1)
cic_labels.head()
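
The label encoding above can be illustrated without Keras. The following plain-Python sketch only mimics the relevant Tokenizer behaviour (lowercasing, indices starting at 1, one list per input text); the label values are invented, and it assumes labels without spaces, since the Tokenizer splits on whitespace by default.

```python
# Hypothetical label column
labels = ["Benign", "Benign", "DDoS", "Portscan", "Benign"]

# Build a 1-based index in order of first appearance; 0 stays unused
word_index = {}
for label in labels:
    key = label.lower()
    if key not in word_index:
        word_index[key] = len(word_index) + 1

# One single-element list per label, like texts_to_sequences returns
encoded = [[word_index[label.lower()]] for label in labels]

print(word_index)
print(encoded)
```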

To be able to translate the encoded labels back, write the Tokenizer wordlist to a file near the CSVs.

In [None]:
filename = os.path.join('CICIDS2017','cic_label_wordindex.json')
print('Writing encoder data to file {}: {}'.format(filename, label_tokenizer.word_index))
with open(filename, 'w') as outfile:
    json.dump(label_tokenizer.word_index, outfile)
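
To translate encoded labels back later, the JSON file can be loaded and the mapping inverted. A minimal sketch, using a temporary directory and an invented word index in place of the real file:

```python
import json
import os
import tempfile

# Hypothetical word index, standing in for label_tokenizer.word_index
word_index = {"benign": 1, "ddos": 2, "portscan": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cic_label_wordindex.json")
    with open(path, "w") as outfile:
        json.dump(word_index, outfile)
    with open(path) as infile:
        restored = json.load(infile)

# Invert label -> index into index -> label for decoding
index_to_label = {index: label for label, index in restored.items()}
decoded = [index_to_label[i] for i in [1, 1, 2]]

print(decoded)
```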

At this point, no non-number-stuff should remain in cic_data.  
Let's check.

In [None]:
cic_data.head()

## Feature Standardization

As the scaler converts the DataFrames to numpy arrays, save the header info to recreate a DataFrame afterwards.

In [None]:
cic_data_header = list(cic_data.columns.values)

As many ML implementations behave badly when confronted with non-scaled inputs, we rescale every feature to the range [0, 1] using a MinMaxScaler (this scaler does not center the data).

In [None]:
scaler = preprocessing.MinMaxScaler()
scaler.fit(cic_data) # FIXME: fitted on the full dataset for now; fit on the training split only once it exists

# transform samples without any refitting
cic_data = scaler.transform(cic_data)
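
Once a train/test split exists, the scaling parameters should come from the training part only. The following sketch uses plain NumPy to spell out the default min-max transform, x' = (x - min) / (max - min); the random features and the positional 8/2 split are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 3))           # toy feature matrix
train, test = features[:8], features[8:]      # hypothetical split

# "Fit": per-column minimum and range, taken from the training rows only
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
col_range[col_range == 0] = 1.0               # guard against constant columns

# "Transform": reuse the training parameters for both parts
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

With this approach the test values may fall slightly outside [0, 1], which is expected, because the test rows did not influence the minimum and maximum.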

In [None]:
# Recreate the DataFrame
cic_data = pd.DataFrame(columns=cic_data_header, data=cic_data)

Let's look at the scaled data

In [None]:
cic_data.head()

## Serialization

So at this point, we have one DataFrame with the data and one with the labels (the split into training and test sets is still a TODO). The data is encoded and scaled, and the label word index has been written to a JSON file.  
It would be nice if this data could be used for future runs, right? Right!  
That's why we serialize each DataFrame into its own Python binary pickle (which is a feature directly supported by [Pandas](https://pandas.pydata.org/pandas-docs/stable/api.html#id12) - nice, eh?)

In [None]:
def write_to_pickle(dataframe, filename):
    dataframe.to_pickle(os.path.join('CICIDS2017', filename+'.pkl'))

In [None]:
write_to_pickle(cic_data, 'cic_data')
write_to_pickle(cic_labels, 'cic_labels')
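
A later run can restore the DataFrames with `pandas.read_pickle()`. This is a small round-trip sketch on a toy frame written to a temporary directory; the column names and the file name `toy.pkl` are placeholders.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for cic_data / cic_labels
toy = pd.DataFrame({"feature": [0.0, 0.5, 1.0], "label_encoded": [1, 1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)                 # same call as in write_to_pickle
    restored = pd.read_pickle(path)     # restores values, dtypes and index

print(restored.equals(toy))
```

Pickles should only be loaded from trusted sources, as unpickling can execute arbitrary code.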