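### Continuous variables
1. Start time:<br>
    The time at which the flow starts; it is one of the basic NetFlow fields. The timestamp is converted to a numeric value with the formula hours * 3600 + minutes * 60 + seconds. Comparing it with the end time can help indicate whether the flow is malicious.
2. Duration:<br>
    The total time needed to complete a given flow. It is used to compute the average packet rate and the average byte rate.  
3. TotPkts: Total packets<br>
    The number of packets transmitted during a given flow or time window.  
4. TotBytes: Total Bytes (TotBytes, SrcBytes)<br>
    The total number of bytes the client sends for its request. Total byte size is an important metric in network measurement.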
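
As a minimal sketch of the start-time formula above (hours * 3600 + minutes * 60 + seconds), the helper below converts one timestamp string to seconds since midnight. The function name `start_time_to_seconds` is hypothetical, and the `YYYY/MM/DD HH:MM:SS.ffffff` layout is assumed from the `StartTime` values in this dataset.

```python
def start_time_to_seconds(timestamp: str) -> float:
    """Convert 'YYYY/MM/DD HH:MM:SS.ffffff' to seconds since midnight."""
    # Assumption: date and clock time are separated by a single space.
    clock = timestamp.split(" ", 1)[1]
    hours, minutes, seconds = clock.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# 9 h, 46 min, 53.047277 s after midnight, i.e. roughly 35213 seconds
print(start_time_to_seconds("2011/08/10 09:46:53.047277"))
```

The date part is discarded, so flows from different days map onto the same 0 to 86400 range.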

### Categorical variables
1. Source IP Address: used to derive the botnet label<br>
    The IP address uniquely identifies the host to contact and is one of the basic NetFlow fields. The source IP address is the address of the computer or website that initiated the flow. It is converted to decimal format for further processing; for example, 10.0.2.112 becomes 167772784.
2. Protocol:<br>
    A protocol is the set of rules that endpoints follow when they communicate over a connection; it specifies how the communicating entities interact. The protocols seen include TCP, UDP, ICMP, and SMTP.
3. port: (Sport, Dport)<br>
    Ports identify the service or application that data or a request must be sent to. They can also be used to gather information about a remote system that is the target of an attack. Port numbers 80, 53, and 25 are flagged as malicious flows associated with different botnet attacks: HTTP-based botnets, DNS-server-based botnets, and spam botnets, respectively. Port values written in hexadecimal (base 16) are converted to decimal (base 10).
4. dir:<br>
    Indicates whether data moves in both directions or only one. The direction also describes the path the traffic takes from source to destination across the network. Most flows are bidirectional, shown with a double-headed arrow; unidirectional flows are shown with a single-headed arrow. Most spam botnets use unidirectional flows.
5. States:<br>
    Several states describe a network flow: SYN, RST, CON, ACK, and FIN. In the SYN state the client sends the server a SYN message containing the server's port and the client's initial sequence number. The server replies with its own SYN and an ACK, and the client then sends an ACK. In the FIN state the connection is half-closed: the client no longer sends data but can still receive data from the server; on receiving the FIN, the server enters a close state. CON is the state of a connection once it has been established. RST is the connection-reset state, in which the host refuses a connection. Receiving too many SYN states suggests that the sender is infected; receiving too many RST states suggests that the receiver is infected.
6. Tos:  (sToS, dToS)<br>
    ToS stands for type of service. It is a mechanism for assigning a priority to each IP packet and for requesting specific handling such as high throughput, high reliability, or low latency. The field is usually 0.
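
The IP-to-decimal conversion mentioned for the source address can be sketched as follows. This is an illustration only; the notebook itself does not apply this conversion in its preprocessing code, and both function names are hypothetical. The first version relies on the standard-library `ipaddress` module, the second spells out the arithmetic.

```python
import ipaddress


def ip_to_decimal(address: str) -> int:
    """Dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


def ip_to_decimal_manual(address: str) -> int:
    """Same result, written out: a*2^24 + b*2^16 + c*2^8 + d."""
    a, b, c, d = (int(part) for part in address.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


print(ip_to_decimal("10.0.2.112"))         # 167772784
print(ip_to_decimal_manual("10.0.2.112"))  # 167772784
```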

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import cm

In [3]:
pd.options.mode.chained_assignment = None
%matplotlib inline

In [4]:
def loadingdata(path):
    # import dataset
    df = pd.read_csv(path)
    print (df.isnull().sum())
    print ('Number of Original Dataset : %d' % (df.shape[0]))
    print ('Number of Drop NaN         : %d' % (df.dropna(axis = 0).shape[0]))
    return df

In [5]:
path_10 = "capture20110810.binetflow.txt"
df_10 = loadingdata(path_10)

StartTime         0
Dur               0
Proto             0
SrcAddr           0
Sport          9379
Dir               0
DstAddr           0
Dport          4390
State             1
sTos          10590
dTos         195190
TotPkts           0
TotBytes          0
SrcBytes          0
Label             0
dtype: int64
Number of Original Dataset : 2824636
Number of Drop NaN         : 2619340


In [5]:
df_10.iloc[0, :]

StartTime         2011/08/10 09:46:53.047277
Dur                                  3550.18
Proto                                    udp
SrcAddr                        212.50.71.179
Sport                                  39678
Dir                                      <->
DstAddr                        147.32.84.229
Dport                                  13363
State                                    CON
sTos                                       0
dTos                                       0
TotPkts                                   12
TotBytes                                 875
SrcBytes                                 413
Label        flow=Background-UDP-Established
Name: 0, dtype: object
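
The duration is described earlier as the basis for the average packet rate and average byte rate. A minimal sketch of that calculation for the row shown above (`Dur` 3550.18, `TotPkts` 12, `TotBytes` 875) follows. The function name `average_rates` is hypothetical, `Dur` is assumed to be in seconds, and returning NaN for zero-length flows is an assumption rather than something the notebook specifies.

```python
import math


def average_rates(tot_pkts: int, tot_bytes: int, duration: float):
    """Return (packets per second, bytes per second) for one flow."""
    if duration <= 0:
        # Assumption: a rate is undefined for zero-duration flows.
        return math.nan, math.nan
    return tot_pkts / duration, tot_bytes / duration


pkt_rate, byte_rate = average_rates(12, 875, 3550.18)
print(f"{pkt_rate:.5f} packets/s, {byte_rate:.3f} bytes/s")
```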

In [6]:
# Fill missing values of categorical variables with the mode
# Convert StartTime to seconds since midnight
# Some Sport/Dport values are hexadecimal strings; convert them to decimal


def preprocessing(df, con_normal):
    # label flows: 1 = botnet, 0 = normal
    # .copy() avoids assigning into a view of the original frame
    bot = df[~con_normal].copy()
    nonbot = df[con_normal].copy()
    bot["label"] = 1
    nonbot["label"] = 0
    df = pd.concat([bot, nonbot])
    df = df.drop(["Label"], axis=1).reset_index(drop=True)
    print('Number of Botnet : %d' % (bot.shape[0]))
    print('Number of Normal : %d' % (nonbot.shape[0]))

    # fill missing values with the column mode (mode() ignores NaN)
    for col in ["Sport", "Dport", "State", "sTos", "dTos"]:
        df[col] = df[col].fillna(df[col].mode()[0])

    # StartTime ("YYYY/MM/DD HH:MM:SS.ffffff") to seconds since midnight
    clock = df["StartTime"].str.split(" ", n=1, expand=True)[1]
    hms = clock.str.split(":", expand=True).apply(pd.to_numeric, errors="coerce")
    df["StartTime"] = hms[0] * 3600 + hms[1] * 60 + hms[2]

    # convert hexadecimal port values (those containing "x") to decimal
    for col in ["Sport", "Dport"]:
        is_hex = df[col].str.contains("x", regex=False, na=False)
        df.loc[is_hex, col] = df.loc[is_hex, col].map(lambda v: int(v, 16))

    df = df.sort_values(by=["StartTime"]).reset_index(drop=True)
    return df
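
The port handling in `preprocessing` comes down to parsing base-16 strings as integers. A stand-alone sketch of that single step is shown here; the helper name `port_to_decimal` and the sample values `0x0303` and `0x000a` are hypothetical and chosen only for illustration.

```python
def port_to_decimal(port: str) -> int:
    """Parse a port given as decimal text or as hexadecimal text with an 'x'."""
    port = port.strip()
    if "x" in port.lower():
        # int(..., 16) accepts an optional "0x" prefix
        return int(port, 16)
    return int(port)


print([port_to_decimal(p) for p in ["39678", "0x0303", "0x000a"]])
# [39678, 771, 10]
```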

In [7]:
con_normal = (df_10.SrcAddr != "147.32.84.165")
df_10_new = preprocessing(df_10, con_normal)

Number of Botnet : 40961
Number of Normal : 2783675


In [8]:
df_10_new.iloc[0, :]

StartTime            35213
Dur                3550.18
Proto                  udp
SrcAddr      212.50.71.179
Sport                39678
Dir                    <->
DstAddr      147.32.84.229
Dport                13363
State                  CON
sTos                     0
dTos                     0
TotPkts                 12
TotBytes               875
SrcBytes               413
label                    0
Name: 0, dtype: object

In [8]:
df_10_new.to_csv("20110810_preprocessing.txt", index=False)

1. Label the dataset by botnet IP address
2. Fill missing values with the mode
3. Convert StartTime to seconds
4. Convert hexadecimal Sport and Dport values to decimal
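
Filling missing values with the mode can be illustrated without pandas. The sketch below is plain Python, with a hypothetical helper `fill_with_mode` and `None` standing in for a missing value. One difference to keep in mind: when several values tie for most frequent, `Counter.most_common` returns the one seen first, whereas pandas `Series.mode()[0]` returns the smallest.

```python
from collections import Counter


def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


print(fill_with_mode(["CON", None, "FIN", "CON"]))
# ['CON', 'CON', 'FIN', 'CON']
```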

In [9]:
path_11 = "capture20110811.binetflow.txt"
df_11 = loadingdata(path_11)

con_normal = (df_11.SrcAddr != "147.32.84.165")
df_11_new = preprocessing(df_11, con_normal)

df_11_new.to_csv("20110811_preprocessing.txt", index=False)

StartTime         0
Dur               0
Proto             0
SrcAddr           0
Sport          3993
Dir               0
DstAddr           0
Dport          2973
State             0
sTos           4324
dTos         269835
TotPkts           0
TotBytes          0
SrcBytes          0
Label             0
dtype: int64
Number of Original Dataset : 1808122
Number of Drop NaN         : 1534307
Number of Botnet : 20941
Number of Normal : 1787181


In [10]:
path_19 = "capture20110819.binetflow.txt"
df_19 = loadingdata(path_19)

con_normal = (df_19.SrcAddr != "147.32.84.165") & (df_19.SrcAddr != "147.32.84.191") & (df_19.SrcAddr != "147.32.84.192") 
df_19_new = preprocessing(df_19, con_normal)
df_19_new.to_csv("20110819_preprocessing.txt", index=False)

StartTime        0
Dur              0
Proto            0
SrcAddr          0
Sport         1812
Dir              0
DstAddr          0
Dport          837
State            0
sTos          2059
dTos         28267
TotPkts          0
TotBytes         0
SrcBytes         0
Label            0
dtype: int64
Number of Original Dataset : 325471
Number of Drop NaN         : 295270
Number of Botnet : 2143
Number of Normal : 323328
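
The three captures are labeled by comparing `SrcAddr` with known botnet addresses. When more than one address is involved, as in the 2011-08-19 capture, a set lookup reads more easily than a chain of `!=` comparisons. The sketch below shows the idea in plain Python; the names `BOTNET_IPS_0819` and `label_flow` are hypothetical, and the addresses are the ones used in the cell above.

```python
# Botnet source addresses used for the 2011-08-19 capture in this notebook
BOTNET_IPS_0819 = {"147.32.84.165", "147.32.84.191", "147.32.84.192"}


def label_flow(src_addr: str, botnet_ips) -> int:
    """Return 1 for a flow that originates from a botnet host, else 0."""
    return 1 if src_addr in botnet_ips else 0


print(label_flow("147.32.84.191", BOTNET_IPS_0819))  # 1
print(label_flow("212.50.71.179", BOTNET_IPS_0819))  # 0
```

With pandas, the same condition could be written as `con_normal = ~df_19.SrcAddr.isin(BOTNET_IPS_0819)`.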
