# NOTEBOOK 3 

**TEAM MEMBERS :-** <br/>
**1. Lacey Hamilton**<br/>
**2. Megha Viswanath**<br/>
**3. Yena Hong**

## WHY WORK ON A SAMPLE?

**To understand the features and choose a feature engineering approach, we first ran our modeling and analysis on a subset of the dataset: one CSV file from each malware family rather than all the provided CSVs. Working on this sample let us explore relationships between features, compare feature engineering techniques, and refine our models quickly, before carrying the approach over to the final Android malware detection model.**
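
The sampling step described above might look like the following minimal sketch. It assumes a hypothetical layout in which each malware family has its own sub-directory of CSV files under a root folder. The function name `pick_one_csv_per_family` and that layout are illustrative and are not taken from the project.

```python
from pathlib import Path


def pick_one_csv_per_family(root):
    """Return {family_name: first CSV path in alphabetical order} for each sub-directory of root.

    Assumed (hypothetical) layout: root/<family>/<capture>.csv
    """
    sample = {}
    for family_dir in sorted(Path(root).iterdir()):
        if not family_dir.is_dir():
            continue  # skip loose files next to the family folders
        csv_paths = sorted(family_dir.glob("*.csv"))
        if csv_paths:  # families without any CSV are left out
            sample[family_dir.name] = csv_paths[0]
    return sample
```

The returned paths could then be passed to `pd.read_csv` and concatenated, in the same way the notebook combines the CSVs in its working directory.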


### The same code is applied in Notebook 4 (Multiclass Classifier), with some added improvements

In [1]:
## All Library imports for the entire project given here
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

In [2]:
# Combine all datasets
# Get all CSV files in the current directory
csv_files = [f for f in os.listdir('.') if f.endswith('.csv')]

# Combine all CSV files into a single DataFrame; ignore_index gives the combined rows a fresh, unique index
df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)


In [4]:
df.head(2)

| Unnamed: 0 | Flow ID | Source IP | Source Port | Destination IP | Destination Port | Protocol | Timestamp | Flow Duration | Total Fwd Packets | Total Backward Packets | ... | min_seg_size_forward | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.42.0.151-104.46.62.41-58063-443-6 | 10.42.0.151 | 58063 | 104.46.62.41 | 443 | 6 | 13-06-2017 05:22 | 209484 | 1 | 2 | ... | 32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ADWARE |
| 1 | 10.42.0.151-104.46.62.41-58063-443-6 | 10.42.0.151 | 58063 | 104.46.62.41 | 443 | 6 | 13-06-2017 05:22 | 31308 | 2 | 0 | ... | 32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ADWARE |


In [5]:
df.shape

(11368, 85)

In [7]:
df = df.dropna()
df.shape

(11368, 85)

In [9]:
df.columns = df.columns.str.strip().str.replace(' ', '_')

In [11]:
# Check whether the two columns are identical: the number of matching values must equal the number of rows in the DataFrame
if df.shape[0] == len(np.where(df.iloc[:, 40] == df.iloc[:, 61])[0]):
    print("The values in both the columns are same.")
    df = df.drop(df.columns[61], axis=1)
    #print("Duplicate column thus dropped.")
else:
    print("The values do not match. Hence we can not assume both the columns have same values.")


The values in both the columns are same.
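
The check above compares two columns chosen by position. As a hedged generalization, the sketch below scans every pair of columns and reports the ones whose values are identical. The helper name `find_duplicate_columns` and the toy DataFrame are illustrative assumptions, and the pairwise scan is only practical for a moderate number of columns.

```python
import pandas as pd


def find_duplicate_columns(frame):
    """Return (kept, duplicate) column-name pairs whose values are identical row for row."""
    pairs = []
    n_cols = frame.shape[1]
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            # Series.equals requires the same dtype and treats NaNs in the same position as equal
            if frame.iloc[:, i].equals(frame.iloc[:, j]):
                pairs.append((frame.columns[i], frame.columns[j]))
    return pairs


# Toy data: 'b' repeats 'a', 'c' is different
toy = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [3, 2, 1]})
duplicate_pairs = find_duplicate_columns(toy)
print(duplicate_pairs)
```

Dropping the second name of each reported pair would remove the redundant copies without relying on hard-coded column positions.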


In [12]:
df.select_dtypes(exclude='number').columns

Index(['Flow_ID', 'Source_IP', 'Destination_IP', 'Timestamp', 'Label'], dtype='object')

In [13]:
# Drop the non-numeric identifier columns (the target column 'Label' is kept)
df = df.drop(columns=["Flow_ID", "Source_IP", "Destination_IP", "Timestamp"])

In [17]:
# Work on a copy so that the cleaned df stays unchanged
# Create new features by calculating forward/backward ratios
# (a zero denominator yields inf or NaN; these values are replaced and imputed before modeling)
data = df.copy()
data['Fwd_Bwd_Packet_Length_Mean_Ratio'] = data['Fwd_Packet_Length_Mean'] / data['Bwd_Packet_Length_Mean']
data['Fwd_Bwd_Packets_per_s_Ratio'] = data['Fwd_Packets/s'] / data['Bwd_Packets/s']
data['Fwd_Bwd_Header_Length_Ratio'] = data['Fwd_Header_Length'] / data['Bwd_Header_Length']
data['Fwd_Bwd_IAT_Mean_Ratio'] = data['Fwd_IAT_Mean'] / data['Bwd_IAT_Mean']
data['Flow_Bytes_Packets_per_s_Ratio'] = data['Flow_Bytes/s'] / data['Flow_Packets/s']
data['Avg_Fwd_Bwd_Segment_Size_Ratio'] = data['Avg_Fwd_Segment_Size'] / data['Avg_Bwd_Segment_Size']
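
The ratio features above divide by columns that are often zero, which produces `inf` (or NaN for 0/0). A minimal sketch of one way to make that explicit at creation time is shown below. The helper `safe_ratio` and the toy columns are hypothetical, and this is an alternative to replacing `inf` later rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator):
    """Element-wise ratio that yields NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Toy forward/backward values, including zero denominators
toy = pd.DataFrame({"fwd": [10.0, 4.0, 0.0], "bwd": [5.0, 0.0, 0.0]})
toy["ratio"] = safe_ratio(toy["fwd"], toy["bwd"])
print(toy)
```

With this helper the undefined ratios show up directly as missing values, so the share of rows affected can be inspected with `isna().mean()` right after feature creation.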


In [19]:
# Aggregated features
data['Total_Packets'] = data['Total_Fwd_Packets'] + data['Total_Backward_Packets']
data['Total_Bytes'] = data['Total_Length_of_Fwd_Packets'] + data['Total_Length_of_Bwd_Packets']
data['Avg_Packet_Length'] = (data['Fwd_Packet_Length_Mean'] + data['Bwd_Packet_Length_Mean']) / 2
data['Total_IAT'] = data['Fwd_IAT_Total'] + data['Bwd_IAT_Total']
data['Total_Header_Length'] = data['Fwd_Header_Length'] + data['Bwd_Header_Length']

# Interaction features
data['Fwd_Bwd_Packet_Length_Mean_Diff'] = data['Fwd_Packet_Length_Mean'] - data['Bwd_Packet_Length_Mean']
data['Fwd_Bwd_Packet_Length_Mean_Product'] = data['Fwd_Packet_Length_Mean'] * data['Bwd_Packet_Length_Mean']
data['Fwd_Bwd_IAT_Mean_Diff'] = data['Fwd_IAT_Mean'] - data['Bwd_IAT_Mean']
data['Fwd_Bwd_IAT_Mean_Product'] = data['Fwd_IAT_Mean'] * data['Bwd_IAT_Mean']

# Statistical features
packet_length_features = ['Fwd_Packet_Length_Max', 'Fwd_Packet_Length_Min', 'Fwd_Packet_Length_Mean',
                          'Fwd_Packet_Length_Std', 'Bwd_Packet_Length_Max', 'Bwd_Packet_Length_Min',
                          'Bwd_Packet_Length_Mean', 'Bwd_Packet_Length_Std']

data['Packet_Length_Mean'] = data[packet_length_features].mean(axis=1)
data['Packet_Length_Median'] = data[packet_length_features].median(axis=1)
data['Packet_Length_Std'] = data[packet_length_features].std(axis=1)
data['Packet_Length_Range'] = data[packet_length_features].max(axis=1) - data[packet_length_features].min(axis=1)


In [21]:
# Encode the categorical variables: Protocol and Label

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encoding for the 'Protocol' column
# Note: scikit-learn 1.2 renamed this argument to sparse_output; sparse was removed in 1.4
one_hot_encoder = OneHotEncoder(sparse=False)
protocol_one_hot = one_hot_encoder.fit_transform(data['Protocol'].values.reshape(-1, 1))

# Create new column names for the one-hot encoded 'Protocol' features
# (the suffix is the position in one_hot_encoder.categories_, not the protocol number itself)
protocol_columns = ['Protocol_' + str(i) for i in range(protocol_one_hot.shape[1])]

# Add the one-hot encoded 'Protocol' features to the DataFrame and drop the original column
data[protocol_columns] = pd.DataFrame(protocol_one_hot, index=data.index)
data = data.drop('Protocol', axis=1)

# Label encoding for the 'Label' column (target variable)
label_encoder = LabelEncoder()
data['Label'] = label_encoder.fit_transform(data['Label'])
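
Because the one-hot columns above are named by position, a name such as `Protocol_1` does not say which protocol it stands for. The sketch below shows an alternative that names each column after the protocol value, using `pd.get_dummies`. The toy values are made up for illustration, and this is a swapped-in technique, not the encoding used in this notebook.

```python
import pandas as pd

# Toy frame with a numeric protocol column (values are illustrative)
toy = pd.DataFrame({"Protocol": [6, 17, 6, 0], "Flow_Duration": [10, 20, 30, 40]})

# One indicator column per distinct protocol value, named Protocol_<value>
dummies = pd.get_dummies(toy["Protocol"], prefix="Protocol", dtype=int)

# Replace the original column with the indicator columns
encoded = pd.concat([toy.drop(columns="Protocol"), dummies], axis=1)
print(encoded.columns.tolist())
```

Named this way, a feature that ranks highly in a later importance listing can be traced back to a specific protocol number.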


In [28]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Separate features and target
X = data.drop('Label', axis=1)
y = data['Label']

# Replace infinite values with NaN
X = X.replace([np.inf, -np.inf], np.nan)

# Check the percentage of missing values in each column
missing_percentages = X.isna().mean().sort_values(ascending=False)
print("Missing value percentages:\n", missing_percentages)

# Impute missing values with the column mean
# (the median or a constant value are alternative strategies)
X = X.fillna(X.mean())

# Scale the features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

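# LASSO model for feature importance
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)

# Coefficients of the fitted model, one per feature (a coefficient of 0 means LASSO dropped the feature)
lasso_importances = lasso.coef_

# Pair feature names with their coefficients
feature_coefficient_pairs = list(zip(X.columns, lasso_importances))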

# Sort the feature coefficient pairs by absolute coefficient value (descending order)
sorted_feature_coefficient_pairs = sorted(feature_coefficient_pairs, key=lambda x: abs(x[1]), reverse=True)

# Print the sorted feature coefficient pairs
print("Feature coefficients (LASSO):")
for feature, coefficient in sorted_feature_coefficient_pairs:
    print(f"{feature}: {coefficient}")

Missing value percentages:
 Fwd_Bwd_IAT_Mean_Ratio              0.602657
Avg_Fwd_Bwd_Segment_Size_Ratio      0.499912
Fwd_Bwd_Packet_Length_Mean_Ratio    0.499912
Fwd_Bwd_Header_Length_Ratio         0.308058
Fwd_Bwd_Packets_per_s_Ratio         0.307618
                                      ...   
Subflow_Fwd_Bytes                   0.000000
Subflow_Bwd_Packets                 0.000000
Subflow_Bwd_Bytes                   0.000000
Init_Win_bytes_forward              0.000000
Source_Port                         0.000000
Length: 98, dtype: float64
Feature coefficients (LASSO):
Protocol_1: 0.23129489862524058
min_seg_size_forward: -0.21495615533169332
FIN_Flag_Count: -0.12914669701213655
Packet_Length_Std: -0.1252223275532595
Bwd_Packet_Length_Max: -0.10548986530830447
Flow_Bytes_Packets_per_s_Ratio: 0.10434929680346951
URG_Flag_Count: -0.09051037266251083
Flow_IAT_Mean: -0.07810852199074796
Fwd_PSH_Flags: -0.05443746777880298
Bwd_Packets/s: -0.05299150538152009
Init_Win_bytes_forward: 0.05

In [26]:
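# The cell that fits the forest is not shown in this export; a default
# RandomForestClassifier is assumed here so that rf_importances is defined
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_importances = rf.feature_importances_

# Pair feature names with their importances
feature_importance_pairs = list(zip(X.columns, rf_importances))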

# Sort the feature importance pairs by importance (descending order)
sorted_feature_importance_pairs = sorted(feature_importance_pairs, key=lambda x: x[1], reverse=True)

# Print the sorted feature importance pairs
print("Feature importances (Random Forest):")
for feature, importance in sorted_feature_importance_pairs:
    print(f"{feature}: {importance}")


Feature importances (Random Forest):
Source_Port: 0.05267238990400279
Flow_Duration: 0.04081082960740954
Flow_IAT_Max: 0.039271399895039874
Flow_IAT_Min: 0.039124293290823016
Flow_IAT_Mean: 0.037536511670189815
Init_Win_bytes_forward: 0.03685480406148481
Fwd_Packets/s: 0.03598903236403668
Flow_Packets/s: 0.03553883420169848
Fwd_IAT_Min: 0.03231410572370882
Fwd_Bwd_IAT_Mean_Diff: 0.027967505005339552
Fwd_IAT_Mean: 0.02686560303086267
Fwd_IAT_Max: 0.026513242483825902
Total_IAT: 0.026420069190839154
Fwd_IAT_Total: 0.026344271022269834
Destination_Port: 0.02460041928292783
Bwd_Packets/s: 0.02340695198231803
Init_Win_bytes_backward: 0.01817452790215907
Total_Header_Length: 0.015315144957373938
Flow_Bytes/s: 0.015213608049957956
Flow_IAT_Std: 0.015167267410301263
Fwd_IAT_Std: 0.014159714624352664
Fwd_Bwd_Header_Length_Ratio: 0.01327302862868136
Fwd_Header_Length: 0.012566247696052706
Fwd_Bwd_Packets_per_s_Ratio: 0.010899833467612887
min_seg_size_forward: 0.010792969231798018
Avg_Fwd_Segment

In [30]:
selected_features = [
    'Source_Port',
    'Flow_Duration',
    'Flow_IAT_Max',
    'Flow_IAT_Min',
    'Flow_IAT_Mean',
    'Init_Win_bytes_forward',
    'Fwd_Packets/s',
    'Flow_Packets/s',
    'Fwd_IAT_Min',
    'Fwd_IAT_Mean',
    'Fwd_IAT_Max',
    'Total_IAT',
    'Fwd_IAT_Total',
    'Destination_Port',
    'Bwd_Packets/s',
    'Init_Win_bytes_backward',
    'Total_Header_Length',
    'Flow_Bytes/s',
    'Fwd_IAT_Std',
    'Fwd_Bwd_Header_Length_Ratio',
    'Fwd_Header_Length',
    'Fwd_Bwd_Packets_per_s_Ratio',
    'min_seg_size_forward',
    'Avg_Fwd_Segment_Size',
    'Flow_Bytes_Packets_per_s_Ratio',
    'Average_Packet_Size',
    'Fwd_Bwd_Packet_Length_Mean_Ratio',
    'Fwd_Packet_Length_Mean',
    'Avg_Fwd_Bwd_Segment_Size_Ratio',
    'Subflow_Fwd_Bytes',
    'Packet_Length_Std',
    'Fwd_Bwd_Packet_Length_Mean_Product',
    'Packet_Length_Mean'
]

# Create a new DataFrame with selected features
selected_X = X[selected_features]

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(selected_X, y, test_size=0.2, random_state=42)

# Create a list of classifiers
classifiers = [
    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
    ("XGB Classifier", XGBClassifier(use_label_encoder=False, eval_metric="mlogloss", random_state=42)),
    ("KNN", KNeighborsClassifier()),
    ("Neural Network", MLPClassifier(max_iter=500, random_state=42)),
    ("Naive Bayes", GaussianNB())
]

# Train and evaluate each classifier
for name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy:.4f}")
    print(classification_report(y_test, y_pred))


Decision Tree Accuracy: 0.6310
              precision    recall  f1-score   support

           0       0.59      0.62      0.61       591
           1       0.57      0.58      0.57       354
           2       0.65      0.63      0.64       800
           3       0.71      0.67      0.69       529

    accuracy                           0.63      2274
   macro avg       0.63      0.63      0.63      2274
weighted avg       0.63      0.63      0.63      2274

XGB Classifier Accuracy: 0.7476
              precision    recall  f1-score   support

           0       0.71      0.70      0.70       591
           1       0.79      0.67      0.72       354
           2       0.73      0.76      0.75       800
           3       0.79      0.83      0.81       529

    accuracy                           0.75      2274
   macro avg       0.76      0.74      0.75      2274
weighted avg       0.75      0.75      0.75      2274

KNN Accuracy: 0.5031
              precision    recall  f1-score   

In [36]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Create a list of classifiers
classifiers = [
    ("Random Forest", RandomForestClassifier(random_state=42)),
    ("Gradient Boosting", GradientBoostingClassifier(random_state=42)),
    ("AdaBoost", AdaBoostClassifier(random_state=42)),
    ("CatBoost", CatBoostClassifier(verbose=0, random_state=42)),
    ("LightGBM", lgb.LGBMClassifier(random_state=42)),
    ("SVM", SVC(random_state=42)),
    ("Logistic Regression", LogisticRegression(max_iter=1000, random_state=42))
]

# Train and evaluate each classifier
for name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy:.4f}")
    print(classification_report(y_test, y_pred))




Random Forest Accuracy: 0.6689
              precision    recall  f1-score   support

           0       0.62      0.66      0.64       591
           1       0.68      0.56      0.61       354
           2       0.68      0.68      0.68       800
           3       0.70      0.74      0.72       529

    accuracy                           0.67      2274
   macro avg       0.67      0.66      0.66      2274
weighted avg       0.67      0.67      0.67      2274

Gradient Boosting Accuracy: 0.6288
              precision    recall  f1-score   support

           0       0.60      0.54      0.57       591
           1       0.71      0.47      0.57       354
           2       0.62      0.67      0.65       800
           3       0.63      0.77      0.69       529

    accuracy                           0.63      2274
   macro avg       0.64      0.61      0.62      2274
weighted avg       0.63      0.63      0.62      2274

AdaBoost Accuracy: 0.4916
              precision    recall  f1-

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
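
The convergence warning above suggests scaling the data. A minimal sketch of that idea follows: the scaler and the classifier are wrapped in a pipeline so that scaling is fitted on the training split only. The synthetic data from `make_classification` stands in for the notebook's features and is purely illustrative, as are all variable names, so the printed accuracy says nothing about the malware dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the selected features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=4, random_state=42
)
# Inflate one feature to mimic the very different scales of network-flow columns
X_demo[:, 0] *= 1e6

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted on the training split only, then applied to the test split
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_lr.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, scaled_lr.predict(X_te))
print(f"Scaled Logistic Regression accuracy on synthetic data: {demo_accuracy:.4f}")
```

The same wrapping could be tried for the other scale-sensitive models in the lists above (KNN, the neural network, and SVM), while tree-based models generally do not need it.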
