## **Predict Hacker Behaviour from Honeypot Logs**

##### **Goal:** Build a machine learning model that classifies a user/IP address as malicious (hacker) or benign (normal user) based on how it interacts with a fake server (called a honeypot).

We got the dataset from:
https://www.kaggle.com/datasets/jsrojas/labeled-network-traffic-flows-114-applications

### Importing necessary libraries

In [188]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## **Loading Dataset**

### Reading the original dataset that we downloaded and previewing it

In [189]:
df = pd.read_csv('./data/network_traffic.csv')

In [190]:
df.head()


|  | flow_key | src_ip_numeric | src_ip | src_port | dst_ip | dst_port | proto | pktTotalCount | octetTotalCount | min_ps | ... | b_flowEnd | b_flowDuration | b_min_piat | b_max_piat | b_avg_piat | b_std_dev_piat | flowEndReason | category | application_protocol | web_service |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3acee4f4ea001cd5e6d9584d4036b53d | 3232266497 | 192.168.121.1 | 67 | 172.16.255.185 | 67 | 17 | 22 | 7620 | 328 | ... | 1555955000.0 | 1554399000000.0 | 0.010354 | 198.657965 | 51.689181 | 84.916348 | 2.0 | Network | Unknown | DHCP |
| 1 | 974ec5991b439c9a7176b88be0c90df0 | 3232266497 | 192.168.121.1 | 67 | 172.16.255.186 | 67 | 17 | 17 | 5670 | 328 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | Network | Unknown | DHCP |
| 2 | 3acee4f4ea001cd5e6d9584d4036b53d | 3232266497 | 192.168.121.1 | 67 | 172.16.255.185 | 67 | 17 | 43 | 15124 | 328 | ... | 1555965000.0 | 1554409000000.0 | 0.022332 | 340.268454 | 89.927588 | 124.270745 | 2.0 | Network | Unknown | DHCP |
| 3 | 974ec5991b439c9a7176b88be0c90df0 | 3232266497 | 192.168.121.1 | 67 | 172.16.255.186 | 67 | 17 | 30 | 10086 | 328 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | Network | Unknown | DHCP |
| 4 | cfa7c2740072befaa89c202499729e08 | 3232266497 | 192.168.121.1 | 0 | 10.130.1.166 | 0 | 1 | 1 | 56 | 56 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | Network | Unknown | ICMP |


## **PreProcessing**

### The dataset is very large (around 1.6 GB), so we keep only the 5 columns we need and sample 1,000 rows to train our model

In [191]:
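columns_to_keep = ['src_ip_numeric', 'src_port', 'dst_port', 'proto', 'flowDuration']
# .copy() gives an independent DataFrame, so adding the label column later does not trigger SettingWithCopyWarning
df = df[columns_to_keep].copy()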
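
Since the full CSV is around 1.6 GB, one option is to read only the needed columns in chunks instead of loading everything first. The following is a minimal sketch of that idea, not the approach used in this notebook: `load_columns`, the chunk size, and the in-memory demo CSV (including its values) are assumptions made for illustration.

```python
import io
import pandas as pd

COLUMNS_TO_KEEP = ['src_ip_numeric', 'src_port', 'dst_port', 'proto', 'flowDuration']

def load_columns(source, columns=COLUMNS_TO_KEEP, chunksize=100_000):
    """Read only the selected columns, one chunk at a time, then concatenate."""
    chunks = pd.read_csv(source, usecols=columns, chunksize=chunksize)
    return pd.concat(list(chunks), ignore_index=True)

# Tiny in-memory stand-in for the real file (made-up rows, for illustration only)
demo_csv = io.StringIO(
    "flow_key,src_ip_numeric,src_port,dst_port,proto,flowDuration,web_service\n"
    "a,3232266497,67,67,17,0.5,DHCP\n"
    "b,3232266497,0,0,1,0.0,ICMP\n"
    "c,3232266584,63181,3128,6,1.06,HTTP_Proxy\n"
)
small_df = load_columns(demo_csv, chunksize=2)
print(small_df.shape)
```

With the real file, the same helper could be called as `load_columns('./data/network_traffic.csv')`.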

### Assigning labels with a rule-based heuristic, because the dataset has no malicious/benign labels
**Malicious ( 1 ) | Benign ( 0 )**

In [192]:
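def assign_label(row):
    # Destination ports used by SSH (22), Telnet (23), SMB (445) and Squid proxy (3128)
    if row['dst_port'] in [22, 23, 445, 3128]:
        return 1
    # ICMP (protocol 1) flow with zero duration
    elif row['proto'] == 1 and row['flowDuration'] == 0.0:
        return 1
    # Short flow that is neither TCP (6) nor UDP (17)
    elif row['flowDuration'] < 1.0 and row['proto'] not in [6, 17]:
        return 1
    else:
        return 0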

df['label'] = df.apply(assign_label, axis=1)
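
Row-wise `apply` can be slow on a dataset of this size. The sketch below expresses the same three rules as boolean masks over whole columns; `assign_labels_vectorized`, `SUSPICIOUS_PORTS`, and the demo rows are names and values assumed here for illustration.

```python
import pandas as pd

SUSPICIOUS_PORTS = [22, 23, 445, 3128]

def assign_labels_vectorized(frame):
    """Apply the three labelling rules to whole columns at once."""
    port_rule = frame['dst_port'].isin(SUSPICIOUS_PORTS)
    icmp_rule = (frame['proto'] == 1) & (frame['flowDuration'] == 0.0)
    short_rule = (frame['flowDuration'] < 1.0) & ~frame['proto'].isin([6, 17])
    # A flow is labelled malicious (1) if any rule matches, otherwise benign (0)
    return (port_rule | icmp_rule | short_rule).astype(int)

# Made-up rows that exercise each rule
demo = pd.DataFrame({
    'dst_port': [22, 53, 0, 443, 80],
    'proto': [6, 17, 1, 6, 47],
    'flowDuration': [12.0, 0.1, 0.0, 77.0, 0.5],
})
demo['label'] = assign_labels_vectorized(demo)
print(demo['label'].tolist())  # [1, 0, 1, 0, 1]
```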

In [193]:
malicious = df[df['label'] == 1]
benign = df[df['label'] == 0]

malicious_sample = malicious.sample(n=min(500, len(malicious)), random_state=42)
benign_sample = benign.sample(n=min(500, len(benign)), random_state=42)

df_balanced = pd.concat([malicious_sample, benign_sample]).sample(frac=1, random_state=42).reset_index(drop=True)
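
A variation on the sampling above is to downsample every class to the size of the smallest one, which keeps the result balanced even when one class has fewer than 500 rows. This is a sketch under that assumption; `balance_classes` and the demo frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def balance_classes(frame, label_col='label', per_class=500, seed=42):
    """Downsample each class to the same size (at most per_class rows each)."""
    n = int(min(per_class, frame[label_col].value_counts().min()))
    sampled = frame.groupby(label_col).sample(n=n, random_state=seed)
    # Shuffle so the classes are interleaved, then reset the index
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Made-up imbalanced data: 3 malicious rows, 7 benign rows
demo = pd.DataFrame({'dst_port': range(10), 'label': [1, 1, 1] + [0] * 7})
balanced = balance_classes(demo, per_class=5)
print(balanced['label'].value_counts().to_dict())
```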


#### Saving our balanced dataset to a new .csv file

In [194]:
df_balanced.to_csv("small_subset.csv", index=False)

### Now working on the **new**, **preprocessed** dataset

In [195]:
hp = pd.read_csv('./data/real_honeypot.csv')
hp.sample(10)

|  | src_ip_numeric | src_port | dst_port | proto | flowDuration | label |
|---|---|---|---|---|---|---|
| 639 | 3232267590 | 54053 | 443 | 6 | 77.070391 | 0 |
| 197 | 3232266584 | 63181 | 3128 | 6 | 1.059775 | 1 |
| 675 | 3232266771 | 49841 | 445 | 6 | 0.004014 | 1 |
| 110 | 3232266544 | 50472 | 53 | 17 | 0.124374 | 0 |
| 323 | 3232268398 | 62929 | 53 | 17 | 0.380242 | 0 |
| 730 | 3232266802 | 50874 | 3128 | 6 | 0.773341 | 1 |
| 965 | 3232266781 | 50544 | 3128 | 6 | 12.640676 | 1 |
| 142 | 3232268081 | 65052 | 53 | 17 | 0.000828 | 0 |
| 344 | 3232266756 | 50043 | 3128 | 6 | 9.755951 | 1 |
| 801 | 3232268349 | 6881 | 1 | 17 | 0.0 | 0 |


In [196]:
hp.dtypes

src_ip_numeric      int64
src_port            int64
dst_port            int64
proto               int64
flowDuration      float64
label               int64
dtype: object

In [198]:
X = hp.drop('label', axis=1)
y = hp['label']               


## **Training The Model**

In [199]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
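
If the class proportions should be guaranteed in both splits, `train_test_split` accepts a `stratify` argument. The sketch below shows it on made-up data; the `demo_X`/`demo_y` frames and the `Xs_*`/`ys_*` names are assumptions for illustration, not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and a 50/50 label column
demo_X = pd.DataFrame({'dst_port': range(100), 'proto': [6] * 100})
demo_y = pd.Series([0] * 50 + [1] * 50, name='label')

# stratify=demo_y keeps the same class ratio in the train and test sets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y)

print(ys_test.value_counts().to_dict())
```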


In [200]:
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
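
To see which features a fitted tree actually relies on, scikit-learn provides `export_text` and the `feature_importances_` attribute. The following sketch trains a small tree on made-up data (`demo_X`, `demo_y`, and `demo_clf` are hypothetical); the same two calls could be applied to `clf` and `X.columns` from the cells above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training data: four "malicious" ports and four "benign" ports
demo_X = pd.DataFrame({
    'dst_port': [22, 23, 445, 3128, 53, 80, 443, 8080],
    'proto': [6, 6, 6, 6, 17, 6, 6, 6],
})
demo_y = [1, 1, 1, 1, 0, 0, 0, 0]

demo_clf = DecisionTreeClassifier(random_state=42).fit(demo_X, demo_y)

# Human-readable view of the learned split rules
print(export_text(demo_clf, feature_names=list(demo_X.columns)))

# Relative contribution of each feature to the splits (sums to 1)
importances = pd.Series(demo_clf.feature_importances_, index=demo_X.columns)
print(importances.sort_values(ascending=False))
```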


## **Model Evaluation**

In [201]:
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.995
Confusion Matrix:
 [[ 99   1]
 [  0 100]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.99      0.99       100
           1       0.99      1.00      1.00       100

    accuracy                           0.99       200
   macro avg       1.00      0.99      0.99       200
weighted avg       1.00      0.99      0.99       200
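
The figures in the report can be recomputed by hand from the confusion matrix printed above (rows are true labels, columns are predicted labels). A short worked check, with variable names chosen here for illustration:

```python
import numpy as np

# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
cm = np.array([[99, 1],
               [0, 100]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 199 / 200
precision_1 = tp / (tp + fp)      # 100 / 101
recall_1 = tp / (tp + fn)         # 100 / 100
precision_0 = tn / (tn + fn)      # 99 / 99
recall_0 = tn / (tn + fp)         # 99 / 100

print(f"accuracy={accuracy:.3f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
```

The single misclassified flow is a benign sample predicted as malicious, which is why recall for class 0 and precision for class 1 both come out at 0.99.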

