Description of code: This code sample was used in our project to extract features from packet-capture (pcap) data exported from Wireshark. It also batches those features into blocks of consecutive rows that share an IP address, with an intended maximum of 10 rows per block. This is the first version of batching we used in our project; the feature values of the rows in each block are summed together.

The first portion of the code loads the sample data file, which is provided in our GitHub folder and is titled "sample pcap data for code peer review.csv". The file contains 25 rows of data from 3 different IP addresses, so the code should produce one row of summed features for each of those 3 IP addresses.

I have marked the block of code where your review should start. You will need to change the file path in pd.read_csv so that it points to the location of the sample pcap data file.
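To keep the path change in one place, the load step can be wrapped in a small helper. This is only a sketch: `load_pcap_csv` and `DATA_PATH` are hypothetical names that do not appear in the notebook, and the check assumes the file is a Wireshark CSV export with an `Info` column. The in-memory row is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical location; replace with wherever you saved the sample file.
DATA_PATH = 'sample pcap data for code peer review.csv'

def load_pcap_csv(source):
    """Read a Wireshark CSV export and confirm it has the Info column used later."""
    df = pd.read_csv(source)
    if 'Info' not in df.columns:
        raise ValueError("expected an 'Info' column in the Wireshark export")
    return df

# Tiny in-memory stand-in for the CSV (one made-up row, for illustration only).
sample_csv = io.StringIO(
    "No.,Time,Source,Destination,Protocol,Length,Info\n"
    "1,0.0,10.0.0.1,10.0.0.2,TCP,60,80 > 1234 [ACK] Seq=1 Ack=1\n"
)
sample_df = load_pcap_csv(sample_csv)
print(sample_df.shape)

# With the real file, the call would look like:
# data_df_SUEE17_short = load_pcap_csv(DATA_PATH)
```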

In [73]:
import numpy as np
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [74]:
data_df_SUEE17_short = pd.read_csv('/content/drive/My Drive/Colab Notebooks/ECE 697/sample pcap data for code peer review.csv')

a,b = np.shape(data_df_SUEE17_short) # dimension of dataset to be used later in manual feature extraction
info_data = data_df_SUEE17_short['Info'] # loading only the "Info" column of Wireshark output file

print('number of samples in dataset: ', a)

number of samples in dataset:  25


# Start review here:

In [75]:
new_features = np.zeros((a,16)) # matrix to house the newly extracted features

#define some keywords that we are going to search for
PCAP_keywords = ['SYN','FIN','PSH','ACK', 'reassembled PDU', 'unseen segment', 'Previous segment not captured', 'HTTP', 'Fragmented', 'Bad', 'UDP', 'Malformed', 'Null', 'TCP']
PCAP_integers = ['Seq', 'Ack']

for i in range(0,a): # iterate through all the data samples
  info = info_data[i] # Info string for this sample
  keyword_count = 0 # column index into new_features

  for j in PCAP_keywords: # iterate through all of the keywords we want to search for
    # flag is 1 if the keyword appears anywhere in the Info string, otherwise 0
    new_features[i,keyword_count] = 1 if info.find(j) >= 0 else 0
    keyword_count = keyword_count + 1

  for k in PCAP_integers: # now iterate through all the integers we are looking for
    start_location = info.find(k)
    if start_location >= 0:
      value_start = start_location + len(k) + 1 # skip the keyword and the '=' that follows it
      end_location = info.find(' ', value_start) # the integer ends at the next space
      if end_location == -1: # no space afterwards, so the integer runs to the end of the string
        end_location = len(info)
      new_features[i,keyword_count] = int(info[value_start:end_location]) # if keyword is found, save integer
    else:
      new_features[i,keyword_count] = 0 # if keyword is not found, save a zero

    keyword_count = keyword_count + 1
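The same keyword flags and Seq/Ack values can also be computed without an explicit loop over rows, using the pandas string methods `Series.str.contains` and `Series.str.extract`. The following is a minimal sketch rather than the code used in the project: the Info strings are made up, the keyword list is shortened, and it assumes the fields always appear as `Seq=<digits>` and `Ack=<digits>`.

```python
import pandas as pd

# Made-up Info strings for illustration; they are not rows from the sample file.
info = pd.Series([
    '443 > 51000 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0',
    '51000 > 443 [PSH, ACK] Seq=1 Ack=1 Win=229 Len=517',
    'Standard query 0x1a2b A example.com',
])

keywords = ['SYN', 'FIN', 'PSH', 'ACK']  # shortened keyword list for the sketch

features = pd.DataFrame(index=info.index)
for kw in keywords:
    # regex=False gives a plain, case-sensitive substring match, like str.find
    features[kw] = info.str.contains(kw, regex=False).astype(int)

for field in ['Seq', 'Ack']:
    # capture the digits after "Seq=" / "Ack="; rows without the field become 0
    digits = info.str.extract(field + r'=(\d+)', expand=False)
    features[field + '#'] = pd.to_numeric(digits).fillna(0).astype(int)

print(features)
```

Because the match is case-sensitive, the `ACK` flag is not triggered by the `Ack=` field, which matches the behaviour of `str.find` in the loop above.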

In [76]:
#create a new dataframe with our newly extracted features, then concatenate it with our original sample dataframe:
new_features_df = pd.DataFrame(new_features, columns = ['SYN','FIN','PSH','ACK','PDU','unseen','nocap','HTTPreq','fragmented','bad','UDP','malformed','null','TCP','Seq#','Ack#']) #https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
data_df_SUEE17_short_new = pd.concat([data_df_SUEE17_short,new_features_df], axis=1) #https://pandas.pydata.org/docs/reference/api/pandas.concat.html

In [77]:
#next we will batch these features based on IP address:
batching_samples = data_df_SUEE17_short_new.head(a) #https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html
batching_samples = batching_samples.to_numpy() #https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html
np.shape(batching_samples)

start_index = 0 #start at the first IP address
stop_index = 0
new_data = np.zeros((a,17)) # new matrix to contain our newly batched features

for j in range (0,a):
  ip_start = batching_samples[start_index,2] # locate the IP address for the starting index

  # scan the rows after start_index and stop at the first row whose IP address differs;
  # the scan looks at most 9 rows ahead and never goes past the last sample
  scan_end = min(start_index+10, a)
  for i in range (start_index+1,scan_end):
    stop_index = i
    if ip_start != batching_samples[i,2]:
      break

  # matrix that contains only the rows we care about for this specific batch
  # (the slice excludes row stop_index itself)
  batch_matrix = batching_samples[start_index:stop_index,0:23]

  # calculate new features: column sums of the 16 extracted features (columns 7 to 22)
  for k in range(0,16):
    new_data[j,k] = sum(batch_matrix[:,k+7])

  start_index = stop_index

# Stop Review Here

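#### Note: Here is the output generated. The first three rows of the new_data table have been populated with sum() values from the batched IP addresses; the remaining rows stay at zero because there are no further batches. The 25 samples provided consist of 9 rows from the first IP address, 6 rows from the second IP address, and 10 rows from the third IP address. The first row of the table sums the newly generated features over the first 9 samples, and the second row over the next 6. For the third IP address the inner loop ends without finding an IP change, so stop_index is left at the last row index (24) and the slice start_index:stop_index covers only rows 15 to 23; the third row therefore sums 9 of that address's 10 samples. The 17th column of new_data is never written and stays at zero.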


In [78]:
new_data[0:5]

array([[  1.,   0.,   7.,   9.,   6.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   7., 885.,   8.,   0.],
       [  1.,   0.,   4.,   6.,   4.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   4., 466.,   5.,   0.],
       [  5.,   0.,   1.,   8.,   1.,   0.,   0.,   1.,   0.,   0.,   0.,
          0.,   0.,   3.,   3.,   3.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.]])
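
In the batching loop of cell [77], when the inner loop finishes without finding an IP change, stop_index is left at the last row examined, and the slice start_index:stop_index excludes that row. One way to avoid this is to treat the end index as exclusive throughout, which keeps the final row of the data in the last batch and allows a full 10 rows per batch. The following is a minimal sketch under stated assumptions, not the project's implementation: `batch_sums` is a hypothetical helper, and the IP addresses and feature values are made up to mirror the 9, 6 and 10 row layout of the sample file.

```python
import numpy as np

def batch_sums(ips, features, max_batch=10):
    """Sum feature rows over runs of consecutive identical IPs, at most max_batch rows per batch."""
    ips = list(ips)
    features = np.asarray(features, dtype=float)
    n = len(ips)
    sums, sizes = [], []
    start = 0
    while start < n:
        # stop is exclusive: the first row with a different IP,
        # or start + max_batch, or the end of the data, whichever comes first
        stop = start + 1
        limit = min(start + max_batch, n)
        while stop < limit and ips[stop] == ips[start]:
            stop += 1
        sums.append(features[start:stop].sum(axis=0))
        sizes.append(stop - start)
        start = stop
    return np.array(sums), sizes

# Made-up data with the same layout as the sample file: 9, 6 and 10 consecutive rows per IP.
ips = ['10.0.0.1'] * 9 + ['10.0.0.2'] * 6 + ['10.0.0.3'] * 10
features = np.ones((25, 2))  # two dummy columns of ones, so each sum equals the batch size

sums, sizes = batch_sums(ips, features)
print(sizes)  # [9, 6, 10]
print(sums)
```

This version returns one row per batch instead of filling a preallocated 25-row table, so there are no trailing all-zero rows to ignore.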