# A Lightweight Concept Drift Detection and Adaptation Framework for IoT Data Streams
This is the code for the paper entitled "**A Lightweight Concept Drift Detection and Adaptation Framework for IoT Data Streams**" accepted in IEEE Internet of Things Magazine.  
Authors: Li Yang (lyang339@uwo.ca) and Abdallah Shami (Abdallah.Shami@uwo.ca)  
Organization: The Optimized Computing and Communications (OC2) Lab, ECE Department, Western University

**Notebook 1: Data pre-processing**  
Aims:  
&nbsp; 1): Assign column names and load the original 'txt' files into dataframes  
&nbsp; 2): Transform the multi-class dataset into a binary dataset (normal vs. attack) for anomaly detection  
&nbsp; 3): Apply label encoding to convert string features to numerical features  

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## Read the NSL-KDD dataset
The NSL-KDD dataset is publicly available at: [[1]](https://www.unb.ca/cic/datasets/nsl.html) [[2]](https://github.com/jmnwong/NSL-KDD-Dataset)

In [2]:
#Assign column names
col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label","difficulty"]

In [3]:
#Read the original training and test sets
df1 = pd.read_csv("KDDTrain+.txt", header = None, names = col_names)
df2 = pd.read_csv("KDDTest+.txt", header = None, names = col_names)
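
The raw files carry no header row, so `header=None` together with `names=` is what attaches the column names. A minimal sketch of that behavior on an in-memory sample follows. The three rows and the shortened `toy_cols` list are made up for illustration and are not the real NSL-KDD schema.

```python
import io
import pandas as pd

# Hypothetical header-less sample standing in for a raw 'txt' file
raw = io.StringIO(
    "0,tcp,http,normal,21\n"
    "0,udp,other,normal,15\n"
    "0,tcp,private,neptune,19\n"
)

# Shortened, illustrative column list (the real list has 43 names)
toy_cols = ["duration", "protocol_type", "service", "label", "difficulty"]

# header=None stops pandas from consuming the first data row as a header
toy = pd.read_csv(raw, header=None, names=toy_cols)
print(toy.shape)
print(toy.dtypes)
```

Checking `shape` and `dtypes` right after loading is a cheap way to confirm that the number of names matches the number of fields in the file.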

In [4]:
#display the dataset
df1

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label,difficulty
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20.0
1,0,udp,other,SF,146,0,0,0,0,0,...,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal,15.0
2,0,tcp,private,S0,0,0,0,0,0,0,...,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,neptune,19.0
3,0,tcp,http,SF,232,8153,0,0,0,0,...,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal,21.0
4,0,tcp,http,SF,199,420,0,0,0,0,...,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76030,2,tcp,ftp_data,SF,2194619,0,0,0,0,0,...,1.00,0.00,1.00,0.07,0.00,0.00,0.00,0.00,normal,7.0
76031,0,tcp,http,SF,193,369,0,0,0,0,...,1.00,0.00,0.50,0.05,0.00,0.00,0.00,0.00,normal,21.0
76032,0,tcp,http,SF,247,9195,0,0,0,0,...,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal,21.0
76033,0,tcp,ftp_data,SF,12,0,0,0,0,0,...,0.22,0.02,0.22,0.03,0.00,0.00,0.01,0.00,normal,20.0


## Transform the NSL-KDD dataset into a binary dataset (normal & attack)

In [5]:
#"normal" label is set to 0, all attack labels are set to 1

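# Drop the difficulty column, which is not used as a feature
df1.drop(['difficulty'], axis=1, inplace=True)
df2.drop(['difficulty'], axis=1, inplace=True)

# Assign the whole column at once. Chained indexing such as
# df['label'][mask] = 0 triggers SettingWithCopyWarning and may leave
# the dataframe unchanged, depending on the pandas version.
df1['label'] = (df1['label'] != 'normal').astype(int)
df2['label'] = (df2['label'] != 'normal').astype(int)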
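
The binary mapping can be checked on a small frame before it is applied to the full dataset. This is a sketch on toy values only; the label strings below are illustrative and the printed counts say nothing about the real class balance.

```python
import pandas as pd

# Toy multi-class labels: "normal" plus two illustrative attack names
toy = pd.DataFrame({"label": ["normal", "neptune", "normal", "smurf"]})

# Boolean comparison, then cast: normal -> 0, anything else -> 1
toy["label"] = (toy["label"] != "normal").astype(int)

# Class counts after the mapping (index 0 = normal, 1 = attack)
counts = toy["label"].value_counts().sort_index()
print(counts)
```

Running the same `value_counts()` call on `df1` and `df2` shows how imbalanced the normal and attack classes are after the transformation.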

## Label encoding to transform string features to numerical features

In [6]:
#Using Label encoder to transform string features to numerical features
from sklearn.preprocessing import LabelEncoder
def Encoding(df):
    # Every non-numeric column is treated as a categorical (string) feature
    cat_features = df.select_dtypes(exclude="number").columns
    for col in cat_features:
        # A fresh encoder is fit per column; the integer codes follow the
        # sorted unique values found in this dataframe only
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df

In [7]:
df1 = Encoding(df1)
df2 = Encoding(df2)
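
Fitting a label encoder on each dataframe separately yields codes that depend on which category values that dataframe happens to contain. The same string can therefore map to different integers in the training and test sets. One possible alternative fits each column's encoder on the values of both frames. It is sketched here as an assumption and is not the paper's method. The helper name `encode_consistently` and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_consistently(train, test):
    """Label-encode non-numeric columns with one encoder per column,
    fit on the values of both frames so the codes agree."""
    train, test = train.copy(), test.copy()
    for col in train.columns:
        if not pd.api.types.is_numeric_dtype(train[col]):
            le = LabelEncoder()
            # Fit on the union of values seen in either frame
            le.fit(pd.concat([train[col], test[col]]).astype(str))
            train[col] = le.transform(train[col].astype(str))
            test[col] = le.transform(test[col].astype(str))
    return train, test

# Toy frames: the test frame lacks the "icmp" value present in training
toy_train = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp"],
                          "src_bytes": [491, 146, 0]})
toy_test = pd.DataFrame({"protocol_type": ["udp", "tcp"],
                         "src_bytes": [12, 232]})

enc_train, enc_test = encode_consistently(toy_train, toy_test)
print(enc_train)
print(enc_test)
```

With a shared encoder, "tcp" and "udp" receive the same codes in both toy frames even though "icmp" is missing from the test frame. With separate encoders, the test frame would renumber its two values starting from 0.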

## Save the pre-processed dataset
df1: training set  
df2: test set  
df: combined training and test set  

In [9]:
df = pd.concat([df1, df2], ignore_index=True)

In [10]:
df1.to_csv('NSL_KDD_binary_train.csv', index=False)
df2.to_csv('NSL_KDD_binary_test.csv', index=False)

In [11]:
df.to_csv('NSL_KDD_binary(train+test).csv', index=False)
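
A quick round trip can confirm that a saved file reloads with the same columns and values and without a stray index column. This is a sketch on a toy frame written to a temporary directory; the file name `toy_binary.csv` and the values are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for a pre-processed, fully numeric dataframe
toy = pd.DataFrame({"duration": [0, 2],
                    "protocol_type": [1, 2],
                    "label": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_binary.csv")
    # Omitting the row index keeps an extra unnamed column out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.equals(toy))
```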