**This notebook looks for hacking attempts made against a server and rates the extent of those attempts using network flow data.**

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline

In [6]:
data = pd.read_csv('../input/network_data.csv')

In [7]:
data.head()

In [8]:
data['total_data'] = data['num_packets']*data['num_bytes']

In [9]:
data.head()

In [10]:
data.sort_values('start_time',inplace=True)

In [11]:
data.reset_index(drop=True,inplace=True)

In [12]:
data.head()

In [13]:
data['source_ip'].nunique()

In [14]:
data['source_port'].nunique()

In [15]:
data['destination_port'].nunique()

In [16]:
data['destination_ip'].nunique()

In [17]:
source_ip = data.groupby('source_ip')['source_port'].count()

In [18]:
source_ip.sort_values(ascending=False,inplace=True)

In [19]:
source_ip = pd.DataFrame(source_ip)

In [20]:
source_ip.columns = ['count']

In [21]:
source_ip.head()

In [22]:
source_ip.describe()

In [23]:
data['num_packets'].nunique()

In [24]:
data['num_bytes'].nunique()

In [25]:
data['total_data'].nunique()

In [26]:
source_ip['count'].hist()

**Almost all source IP addresses appear in fewer than 5000 rows over the captured period. The two most abundant
source_ip addresses stand out and are candidates for malicious activity.**
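
One way to make "most abundant" concrete is to look at each address's share of all rows instead of reading it off the histogram. The sketch below is only an illustration: `toy` and its addresses are invented stand-ins for `data`, and only the `source_ip` column name comes from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for `data`; the addresses are invented.
toy = pd.DataFrame({
    'source_ip': ['a'] * 6 + ['b'] * 3 + ['c'],
})

# value_counts(normalize=True) gives each address's fraction of all rows,
# sorted from most to least frequent.
share = toy['source_ip'].value_counts(normalize=True)
top_two_share = share.iloc[:2].sum()

print(share)
print('Top two addresses account for {:.0%} of rows'.format(top_two_share))
```

Run against the real frame, the same two lines would show directly how much of the traffic the two leading addresses are responsible for.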

In [28]:
destination_ip = pd.DataFrame(data.groupby('destination_ip')['source_port'].count().sort_values(ascending=False))

In [29]:
destination_ip.head()

In [30]:
#same case here as the source_ip one.

In [31]:
destination_ip.columns = ['count']

In [32]:
destination_ip['count'].hist()

In [33]:
data['total_data'].describe()

In [34]:
plt.hist(data['total_data'],color='g',range=(0,10000))

In [35]:
sorted_total_data = data['total_data'].sort_values(ascending=False)

In [36]:
sorted_total_data.reset_index(drop=True,inplace=True)

In [37]:
#something is happening when the total data size lies in [2000,4000]

In [38]:
data['num_bytes'].describe()

In [39]:
plt.hist(data['num_bytes'],color='g',range=(0,1000))

**Rows with a byte count in the ranges 300-400 and 500-600 are the most abundant.**
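
The bin counts behind a histogram like this can also be read as numbers. This is a minimal sketch with invented byte counts (`toy_bytes` is not from the dataset); it uses 10 bins over `(0, 1000)`, which is what `plt.hist` does by default for that range, so each bin is 100 wide.

```python
import numpy as np

# Hypothetical byte counts, invented for illustration only.
toy_bytes = np.array([320, 350, 390, 510, 560, 590, 120, 880])

# Same binning as plt.hist(..., range=(0, 1000)) with its default of 10 bins.
counts, edges = np.histogram(toy_bytes, bins=10, range=(0, 1000))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print('{:4.0f}-{:4.0f}: {}'.format(left, right, n))
```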

In [41]:
data['num_packets'].describe()

In [42]:
plt.hist(data['num_packets'],color='g',range=(0,100))

In [43]:
single_packet_data = data[data['num_packets'] == 1]['num_packets']

In [44]:
len(single_packet_data)

In [45]:
source_ip['count'][:5]

In [46]:
type(data['source_ip'][0])

In [47]:
ports_1 = data['destination_port'][data['source_ip']=='135.0777d.04511.237']

In [48]:
ports_1.describe()

In [49]:
packet_1 = data['num_packets'][data['source_ip']=='135.0777d.04511.237']

In [50]:
packet_1.describe()

In [51]:
plt.hist(packet_1,range=(10000,90000))

In [52]:
sns.kdeplot(packet_1[:10000])

In [53]:
sns.kdeplot(packet_1[10000:])

**For the source_ip with the most occurrences, all the destination ports are the same (i.e., 22), and packet counts
in the range [11000, 12000] seem to be the most abundant.**

In [55]:
bytes_1 = data['num_bytes'][data['source_ip']=='135.0777d.04511.237']

In [56]:
bytes_1.describe()

In [57]:
#let's check total data

In [58]:
total_data_1  = data['total_data'][data['source_ip']=='135.0777d.04511.237']

In [59]:
total_data_1.describe()

**We see that the 25th, 50th and 75th percentile values of total data are identical.**
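
When the three quartiles coincide, at least half of the rows must carry that single value, which is what one would expect from a script repeating the same request. The sketch below checks this on invented numbers (`toy_total` is hypothetical and not taken from the dataset).

```python
import pandas as pd

# Hypothetical total_data values: one value repeated, plus two outliers.
toy_total = pd.Series([500] * 8 + [100, 9000])

quartiles = toy_total.quantile([0.25, 0.5, 0.75])
quartiles_equal = quartiles.nunique() == 1

# Fraction of rows that carry exactly the repeated quartile value.
dominant_share = (toy_total == quartiles.iloc[0]).mean()

print(quartiles)
print('Quartiles identical:', quartiles_equal)
print('Share of rows with that value:', dominant_share)
```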

In [61]:
ports_2 = data['destination_port'][data['source_ip']=='135.0777d.04511.232']

In [62]:
ports_2.describe()

The same thing happens with the second most abundant address.
Let's check the destination IP side.

In [65]:
ports_1_ = data['source_port'][data['destination_ip']=='135.0777d.04511.237']

In [66]:
ports_1_.describe()

In [67]:
ports_2_ = data['source_port'][data['destination_ip']=='135.0777d.04511.232']

In [68]:
ports_2_.describe()

**From the above we can see that** *135.0777d.04511.232* **and** *135.0777d.04511.237* **seem to be communicating far more than any other addresses. This may be malicious, or it may simply be heavy legitimate traffic between the two.**
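
To see which pairs of addresses talk to each other, and how often, the rows can be grouped by source and destination together. A minimal sketch on invented data follows; `toy` and the addresses `x`, `y`, `z` are hypothetical, and only the two column names come from the notebook.

```python
import pandas as pd

# Hypothetical rows; the addresses are invented.
toy = pd.DataFrame({
    'source_ip':      ['x', 'x', 'y', 'x', 'z', 'y'],
    'destination_ip': ['y', 'y', 'x', 'y', 'x', 'x'],
})

# Number of rows for every (source, destination) pair, largest first.
pair_counts = (toy.groupby(['source_ip', 'destination_ip'])
                  .size()
                  .sort_values(ascending=False))

print(pair_counts)
```

If the two leading addresses show up as each other's main partner in both directions, that supports reading the traffic as one conversation rather than two independent sources.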

**Let's find the time intervals between consecutive rows that have 135.0777d.04511.237, and then 135.0777d.04511.232, as the source IP address.**

In [71]:
data.columns

In [72]:
def convert_timestamp(time):
    return datetime.datetime.fromtimestamp(int(time))

In [73]:
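def calculate_time_diff(source_ip):
    """Return the gaps between consecutive rows of `data` sent by `source_ip`."""
    time_diff = []
    flag = 0  # becomes 1 once the first matching row has been seen
    for i in range(len(data)):
        if data['source_ip'][i] == source_ip:
            if flag == 0:
                # first matching row: nothing to compare against yet
                t = data['start_time'][i]
                flag = 1
            else:
                # gap to the previous matching row, at whole-second resolution
                diff = convert_timestamp(data['start_time'][i]) - convert_timestamp(t)
                t = data['start_time'][i]
                time_diff.append(diff)
    return time_diff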

In [74]:
time_diff_1 = calculate_time_diff('135.0777d.04511.237')

In [75]:
time_diff_1.sort(reverse=True)

In [76]:
time_diff_1[:5]

In [77]:
count = 0
for i in range(len(time_diff_1)):
    if time_diff_1[i] == datetime.timedelta(0):
        count+=1
print('Total '+str(count)+' entries have ZERO time difference')

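**Rows that start within the same second as the previous row from the same address point to automated traffic, so a large count here suggests a strong hacking attempt.**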
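
The same per-address gaps can be computed without an explicit Python loop, which matters on large captures. This is a sketch under assumptions: `toy` is invented data, `start_time` is taken to be in seconds, and the gaps are left as plain numbers rather than `timedelta` objects.

```python
import pandas as pd

# Hypothetical rows; start_time is assumed to be in seconds.
toy = pd.DataFrame({
    'source_ip':  ['a', 'b', 'a', 'a', 'b', 'a'],
    'start_time': [100, 100, 100, 103, 105, 103],
})
toy = toy.sort_values('start_time', kind='mergesort').reset_index(drop=True)

# Gap between each row and the previous row from the same address.
toy['gap'] = toy.groupby('source_ip')['start_time'].diff()

# The first row of each address has no predecessor, hence dropna().
gaps_a = toy.loc[toy['source_ip'] == 'a', 'gap'].dropna()
zero_gaps = int((gaps_a == 0).sum())

print('Total ' + str(zero_gaps) + ' of ' + str(len(gaps_a)) + ' gaps are ZERO')
```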

In [79]:
time_diff_2 = calculate_time_diff('135.0777d.04511.232')

In [80]:
count = 0
for i in range(len(time_diff_2)):
    if time_diff_2[i] == datetime.timedelta(0):
        count+=1
print('Total '+str(count)+' entries have ZERO time difference')

In [81]:
len(time_diff_2)

**Again, this suggests a strong hacking attempt.**

In [83]:
sns.jointplot(x='num_packets',y='num_bytes',data=data,color='g')

**There are 585 unique source IP addresses in total, of which the top two account for almost 50% of the data. For both addresses, most rows have zero time delay and use the same port number (22).**

Now let us define some features that can be fed to the clustering model.

In [86]:
len(data[data['destination_port']==22])

In [87]:
# Time elapsed since the previous row (0 for the first row)
time_diff = data['start_time'].diff().fillna(0)

In [88]:
num_bytes = data['num_bytes']

In [89]:
num_packets = data['num_packets']

In [90]:
print(len(data))
print(len(data[data['source_port']==22]))
print(data['source_port'].nunique())

In [91]:
print(len(data))
print(len(data[data['destination_port']==22]))
print(data['destination_port'].nunique())

In [92]:
data.groupby('destination_port')['source_ip'].count().sort_values(ascending=False)[:10]

We can see that all other ports have negligible counts compared to port 22.

In [94]:
data.groupby('source_port')['source_ip'].count().sort_values(ascending=False)[:10]

**The same holds here, so the ports are turned into a binary category: records with port number 22 are encoded as 1 and all remaining records as 0.**
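
The encoding described above amounts to a single comparison per row. A minimal sketch on invented port numbers (`toy_ports` is hypothetical):

```python
import pandas as pd

# Hypothetical port numbers, invented for illustration.
toy_ports = pd.Series([22, 443, 22, 80, 51234])

# 1 where the port is 22, 0 everywhere else.
is_port_22 = toy_ports.eq(22).astype(int)

print(is_port_22.tolist())  # [1, 0, 1, 0, 0]
```

The vectorised comparison gives the same result as a list comprehension and keeps the index of the original column.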

In [96]:
source_port = pd.Series([1 if x==22 else 0 for x in data['source_port']])

In [97]:
destination_port = pd.Series([1 if x==22 else 0 for x in data['destination_port']])

In [98]:
type(source_port)

In [99]:
source_ip = [1 if x=='135.0777d.04511.237' or x=='135.0777d.04511.232' else 0 for x in data['source_ip']]
dest_ip = [1 if x=='135.0777d.04511.237' or x=='135.0777d.04511.232' else 0 for x in data['destination_ip']]

In [100]:
df = pd.DataFrame({'time_diff':time_diff,'num_bytes':num_bytes,'num_packets':num_packets,'source_port':source_port,
                   'destination_port':destination_port,'destination_ip':dest_ip,'source_ip':source_ip})

In [101]:
df.columns

In [102]:
df_with_dummy = pd.get_dummies(df,prefix=['dest_with','source_with','dest_ip','source_ip'],
                               columns=['destination_port','source_port','destination_ip','source_ip'])

In [103]:
df_with_dummy.head()

In [104]:
mx = MinMaxScaler()

In [105]:
df_with_dummy = mx.fit_transform(df_with_dummy)

In [106]:
df_with_dummy.shape

In [107]:
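# Take the column names from the unscaled dummy frame so every label matches its column,
# whatever column order the installed pandas version produces
dummy_columns = pd.get_dummies(df,prefix=['dest_with','source_with','dest_ip','source_ip'],
                               columns=['destination_port','source_port','destination_ip','source_ip']).columns
final_df = pd.DataFrame(df_with_dummy,columns=dummy_columns)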

In [108]:
final_df.head()

In [109]:
from sklearn.cluster import KMeans

In [110]:
# n_init=10 keeps the number of restarts fixed across scikit-learn versions
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10).fit(df_with_dummy)

In [111]:
new_vals = pd.Series(kmeans.predict(df_with_dummy))
print(new_vals.groupby(new_vals).count())

In [112]:
final_df.insert((final_df.shape[1]),'kmeans',new_vals)

In [113]:
final_df.head()

In [114]:
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(final_df['num_bytes'],final_df['dest_ip_1'],c=new_vals,s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('num_bytes')
ax.set_ylabel('dest_ip_1')
plt.colorbar(scatter)

In [115]:
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(final_df['num_packets'],final_df['dest_ip_1'],c=new_vals,s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('num_packets')
ax.set_ylabel('dest_ip_1')
plt.colorbar(scatter)

In [116]:
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(final_df['time_diff'],final_df['dest_ip_1'],c=new_vals,s=50)
ax.set_title('time_difference')
ax.set_xlabel('time_diff')
ax.set_ylabel('dest_ip_1')
plt.colorbar(scatter)

**So, we can see that the data points in cluster 1 show anomalous behaviour because:**
1. The number of bytes and packets sent to port 22 is very low, whereas in cluster 4 it is very high. The attacker might have found a free port and started sending bulk data to try candidate values. Alternatively, cluster 4 may represent a long-term secure connection.
2. The other clusters show some time difference between rows, but clusters 0 and 1 do not. This may be due to continuously checking for open or free ports.
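
Claims like these can be checked against numbers by averaging each feature within each cluster. The sketch below is illustrative only: `toy` holds invented, already-scaled values and hand-assigned labels in a `kmeans` column, mirroring the layout of `final_df`.

```python
import pandas as pd

# Hypothetical scaled features with hand-assigned cluster labels.
toy = pd.DataFrame({
    'num_bytes':   [0.01, 0.02, 0.90, 0.95, 0.40],
    'num_packets': [0.02, 0.01, 0.80, 0.85, 0.30],
    'time_diff':   [0.00, 0.00, 0.10, 0.20, 0.50],
    'kmeans':      [1, 1, 4, 4, 2],
})

# Mean of every feature per cluster, plus the number of rows in each cluster.
profile = toy.groupby('kmeans').mean()
sizes = toy['kmeans'].value_counts().sort_index()

print(profile)
print(sizes)
```

A table like `profile` makes it easier to justify why a cluster is treated as suspicious than reading the same information off the scatter plots.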

In [118]:
cluster_count = new_vals.groupby(new_vals).count()

In [119]:
rating = (cluster_count[0]+cluster_count[1])/len(final_df) * 10

In [120]:
cluster_count

In [121]:
rating
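
The rating is the fraction of rows that fall into the suspicious clusters, scaled to a 0-10 range. The worked example below uses invented labels (`toy_labels`) and treats clusters 0 and 1 as suspicious, as in the cells above. Cluster numbers assigned by k-means are arbitrary, so that choice is an assumption that has to be re-checked whenever the model is refitted on different data.

```python
import pandas as pd

# Hypothetical cluster labels for ten rows.
toy_labels = pd.Series([0, 0, 1, 2, 3, 4, 1, 0, 2, 2])

# Assumption: these labels were judged suspicious by inspecting the clusters.
suspicious_clusters = [0, 1]

cluster_count = toy_labels.value_counts()

# reindex(..., fill_value=0) avoids a KeyError if a suspicious label is absent.
suspicious_rows = cluster_count.reindex(suspicious_clusters, fill_value=0).sum()
rating = suspicious_rows / len(toy_labels) * 10

print(rating)  # 5 of the 10 rows are in clusters 0 and 1, so the rating is 5.0
```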