# NOTEBOOK 4

**TEAM MEMBERS :-** <br/>
**1. Lacey Hamilton**<br/>
**2. Megha Viswanath**<br/>
**3. Yena Hong**

## Multiclass Classification

During the course of our analysis, we encountered a challenge with the large volume of data that we had initially obtained. Specifically, we had 355,630 values, which proved to be too cumbersome to work with during multiclass classification modeling. The time taken to run each model was not feasible, and when we attempted to apply feature engineering, the time requirements were even more substantial. Additionally, we occasionally encountered "Out of Memory" errors, which further complicated the analysis process.

Approach:
To streamline our analysis, we selected one family from each type of malware and worked exclusively with those values. This approach allowed us to effectively manage the size of the dataset while still enabling us to produce reliable results. By reducing the volume of data, we also mitigated the frequency of memory errors, which facilitated a more seamless analysis process.

Conclusion:
In conclusion, our team had to overcome a significant hurdle in dealing with the large volume of data that we had initially acquired. However, by selecting a smaller subset of malware families to work with, we were able to improve the efficiency of our analysis while still producing meaningful results. This approach could serve as a useful template for future research endeavors that encounter similar challenges with large datasets and memory limitations.

In [139]:
## All Library imports for the entire project given here
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

### STEP 1: DATASET FORMATION

To create a comprehensive dataframe that we could work with, we had to combine multiple CSV files obtained from the CIC website. We accomplished this by using the pandas library in Python and the pandas.concat() function to concatenate the files along the rows. By doing so, we were able to effectively manage the data and produce meaningful results for our analysis.

In [140]:
def process_folder(folder_path, malware_type):
    all_files = []
    
    for file in os.listdir(folder_path):
        if file.endswith(".csv"):
            file_path = os.path.join(folder_path, file)
            df = pd.read_csv(file_path)
            df["Malware_Type"] = malware_type
            all_files.append(df)
    
    concatenated_df = pd.concat(all_files, ignore_index=True)
    return concatenated_df


In [141]:
folders_and_types = [
    ("Adware/", "ADWARE"),
    ("Ransomware/", "RANSOMWARE"),
    ("Scareware/", "SCAREWARE"),
    ("SMSmalware/", "SMSMALWARE")
]

In [142]:
all_data = []

for folder, malware_type in folders_and_types:
    processed_data = process_folder(folder, malware_type)
    all_data.append(processed_data)

df = pd.concat(all_data, ignore_index=True)


In [143]:
df.shape

(67289, 87)

In [144]:
df.head(2)

Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Malware_Type,Destination Port.1
0,172.217.0.238-10.42.0.211-443-54819-6,10.42.0.211,45389.0,172.217.0.238,7070.0,6.0,14-06-2017 04:22,15032328.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ADWARE_DOWGIN,ADWARE,
1,172.217.1.170-10.42.0.211-443-51023-6,10.42.0.211,443.0,172.217.1.170,42520.0,6.0,14-06-2017 04:22,3606.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ADWARE_DOWGIN,ADWARE,


### STEP 2 : PART OF VISUALIZATION

In Jupyter Notebooks 1 and 2, we have detailed our visualizations and the steps we took to clean the dataset. However, due to the large volume of data, we were unable to test one of the features we had designed, namely the destination IP address. This is a crucial feature as it reveals where the malware data is being downloaded to, potentially leading us back to the malware creator. To investigate this further, we developed a function that leverages the API from ipinfo.io/ to trace the location of the IP addresses collected. Only the public IP addresses were traced in this analysis.

In [146]:
# Check how many IP address's are there -
# Because the limit to use the API calls is only 5000
unique_values = df['Destination_IP'].nunique()
print(unique_values)

1908


In [147]:
# function to find the city from IP address 
import requests

def get_ip_location(ip_address):
    try:
        response = requests.get(f"https://ipinfo.io/{ip_address}?token=e8be65bb2d6c16")
        data = response.json()
        if "city" in data:
            location = data["city"]
        else:
            #Sometime the packet information is not enough to detect the City as well
            #It may only pin point the country
            location = "Not Traceable"
    except:
        # The given IP address may be private address in which case ascessing the geolocation would throw an error
        location = "Not Traceable"
    return location

#USe a cache so that you need not use another API call if the place is allready traced
ip_cache = {}

def get_ip_location_cached(ip_address):
    if ip_address in ip_cache:
        return ip_cache[ip_address]

    location = get_ip_location(ip_address)
    ip_cache[ip_address] = location
    return location

# Create a new column called "Destination" which gives the value of teh City to which the IP address was traced 
# or Not Traceable
df['Destination'] = df['Destination_IP'].apply(get_ip_location_cached)

In [152]:
# Get unique values of the 'Destination' column, excluding 'Not Traceable'
places = [value for value in df['Destination'].unique() if value != 'Not Traceable']

In [153]:
import folium
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut


# Initialize the geolocator
geolocator = Nominatim(user_agent="geoapi")

def get_location(city):
    try:
        location = geolocator.geocode(city)
        if location:
            return location.latitude, location.longitude
    except GeocoderTimedOut:
        return get_location(city)

# Get the coordinates for each place
coordinates = [get_location(place) for place in places]

# Remove places that couldn't be geolocated
coordinates = [coord for coord in coordinates if coord]

# Initialize the folium map
m = folium.Map()

# Add markers for each place
for coord, place in zip(coordinates, places):
    folium.Marker(
        location=coord,
        popup=place,
        icon=folium.Icon(color="blue", icon="info-sign"),
    ).add_to(m)

# Display the map directly in the notebook
m


### STEP3 : DATA CLEANING
The steps taken here are explained in Notebook 1.

In [115]:
## Drop the NaN values
df = df.dropna()
df.shape

(65092, 87)

In [116]:
df.head(2)

Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Malware_Type,Destination Port.1
0,172.217.0.238-10.42.0.211-443-54819-6,10.42.0.211,45389.0,172.217.0.238,7070.0,6.0,14-06-2017 04:22,15032328.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ADWARE_DOWGIN,ADWARE,9475.212951
1,172.217.1.170-10.42.0.211-443-51023-6,10.42.0.211,443.0,172.217.1.170,42520.0,6.0,14-06-2017 04:22,3606.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ADWARE_DOWGIN,ADWARE,9475.212951


In [145]:
## Remove all spaces from the column names, and replace them with _
df.columns = df.columns.str.strip().str.replace(' ', '_')

In [118]:
#Check if the number of same values between teh two columns equals the number of rows in the dataframe
if df.shape[0] == len(np.where(df.iloc[:, 40] == df.iloc[:, 61])[0]):
    print("The values in both the columns are same.")
    df = df.drop(df.columns[61], axis=1)
    print("Duplicate column are dropped.")
    print("Current Shape of Dataframe: ",df.shape)
else:
    print("The values do not match. Hence we can not assume both the columns have same values.")


The values do not match. Hence we can not assume both the columns have same values.


In [119]:
## Check for non numerical columns 
df.select_dtypes(exclude='number').columns

Index(['Flow_ID', 'Source_IP', 'Destination_IP', 'Timestamp',
       'Packet_Length_Std', 'CWE_Flag_Count', 'Label', 'Malware_Type'],
      dtype='object')

In [120]:
df.head(1)

Unnamed: 0,Flow_ID,Source_IP,Source_Port,Destination_IP,Destination_Port,Protocol,Timestamp,Flow_Duration,Total_Fwd_Packets,Total_Backward_Packets,...,Active_Std,Active_Max,Active_Min,Idle_Mean,Idle_Std,Idle_Max,Idle_Min,Label,Malware_Type,Destination_Port.1
0,172.217.0.238-10.42.0.211-443-54819-6,10.42.0.211,45389.0,172.217.0.238,7070.0,6.0,14-06-2017 04:22,15032328.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ADWARE_DOWGIN,ADWARE,9475.212951


In [121]:
## 'Label' and 'Malware_Type' are the same , Label simply has an additional value of family.
## For the current working, we are not classifying the type of family , Hence dropping the column
df = df.drop("Label", axis=1)

## 'Flow_ID' is nothing but the combination of Destination IP - Source Ip - Destination Port - Source Port
## Allready available columns for these hence dropping them
df = df.drop("Flow_ID", axis=1)

## 'Source_IP' is not really giving any important information towards the detection of Malware
df = df.drop("Source_IP", axis=1)

## Similar to 'Source_IP', "Destination_IP" also is dropped
df = df.drop("Destination_IP", axis=1)

## "Timestamp" only gived information about the the date and time this was captured, not really connected to Malware
df = df.drop("Timestamp", axis=1)

## "Down/Up_Ratio" is an object time, but really contains only float number.
## Hence simply convert the column type
df['Down/Up_Ratio'] = df['Down/Up_Ratio'].astype(float)

In [126]:
df['Packet_Length_Std'] = df['Packet_Length_Std'].astype(float)
df['CWE_Flag_Count'] = df['CWE_Flag_Count'].astype(float)

In [127]:
# Should contain only the target variable
df.select_dtypes(exclude='number').columns

Index(['Malware_Type'], dtype='object')

### STEP 4: FEATURE ENGINEERING

The feature engineering described here aims to create new features that can help improve the performance of a machine learning model for detecting Android malware. They can provide additional information about the data, enabling the model to identify patterns and relationships between features more effectively. The process of feature engineering can be divided into four categories here:

1. Ratio features: These features are calculated by dividing one feature by another. The purpose of creating these features is to find relative differences between the features, which might help in identifying patterns related to malware. For example, the 'Fwd_Bwd_Packet_Length_Mean_Ratio' feature calculates the ratio of forward packet length mean to backward packet length mean. A small constant (epsilon) is added to the denominator to avoid division by zero.

2. Aggregated features: These features are created by combining two or more existing features. Aggregating features can provide additional insights and may reveal hidden relationships in the data. For example, 'Total_Packets' is created by adding 'Total_Fwd_Packets' and 'Total_Backward_Packets', which gives an overall count of packets in both directions.

3. Interaction features: These features are created by performing mathematical operations (e.g., subtraction, multiplication) on two or more existing features. Interaction features can help in finding nonlinear relationships between features. For example, 'Fwd_Bwd_Packet_Length_Mean_Diff' calculates the difference between forward and backward packet length means, which may reveal some trends related to malware traffic.

4. Statistical features: These features are calculated by applying statistical functions (e.g., mean, median, standard deviation) on a set of existing features. Statistical features help in summarizing the distribution of the data and can reveal important patterns or trends. For example, 'Packet_Length_Mean' calculates the mean of all packet length features.

In [128]:
########################################### FEATURE ENGINEERING ####################################################
data = df

# Create new features by calculating ratios
epsilon = 1e-8  # Small constant to avoid division by zero
data['Fwd_Bwd_Packet_Length_Mean_Ratio'] = data['Fwd_Packet_Length_Mean'] / (data['Bwd_Packet_Length_Mean'] + epsilon)
data['Fwd_Bwd_Packets_per_s_Ratio'] = data['Fwd_Packets/s'] / (data['Bwd_Packets/s'] + epsilon)
data['Fwd_Bwd_Header_Length_Ratio'] = data['Fwd_Header_Length'] / (data['Bwd_Header_Length'] + epsilon)
data['Fwd_Bwd_IAT_Mean_Ratio'] = data['Fwd_IAT_Mean'] / (data['Bwd_IAT_Mean'] + epsilon)
data['Flow_Bytes_Packets_per_s_Ratio'] = data['Flow_Bytes/s'] / (data['Flow_Packets/s'] + epsilon)
data['Avg_Fwd_Bwd_Segment_Size_Ratio'] = data['Avg_Fwd_Segment_Size'] / (data['Avg_Bwd_Segment_Size'] + epsilon)


In [129]:
# Aggregated features
data['Total_Packets'] = data['Total_Fwd_Packets'] + data['Total_Backward_Packets']
data['Total_Bytes'] = data['Total_Length_of_Fwd_Packets'] + data['Total_Length_of_Bwd_Packets']
data['Avg_Packet_Length'] = (data['Fwd_Packet_Length_Mean'] + data['Bwd_Packet_Length_Mean']) / 2
data['Total_IAT'] = data['Fwd_IAT_Total'] + data['Bwd_IAT_Total']
data['Total_Header_Length'] = data['Fwd_Header_Length'] + data['Bwd_Header_Length']

# Interaction features
data['Fwd_Bwd_Packet_Length_Mean_Diff'] = data['Fwd_Packet_Length_Mean'] - data['Bwd_Packet_Length_Mean']
data['Fwd_Bwd_Packet_Length_Mean_Product'] = data['Fwd_Packet_Length_Mean'] * data['Bwd_Packet_Length_Mean']
data['Fwd_Bwd_IAT_Mean_Diff'] = data['Fwd_IAT_Mean'] - data['Bwd_IAT_Mean']
data['Fwd_Bwd_IAT_Mean_Product'] = data['Fwd_IAT_Mean'] * data['Bwd_IAT_Mean']

# Statistical features
packet_length_features = ['Fwd_Packet_Length_Max', 'Fwd_Packet_Length_Min', 'Fwd_Packet_Length_Mean',
                          'Fwd_Packet_Length_Std', 'Bwd_Packet_Length_Max', 'Bwd_Packet_Length_Min',
                          'Bwd_Packet_Length_Mean', 'Bwd_Packet_Length_Std']

data['Packet_Length_Mean'] = data[packet_length_features].mean(axis=1)
data['Packet_Length_Median'] = data[packet_length_features].median(axis=1)
data['Packet_Length_Std'] = data[packet_length_features].std(axis=1)
data['Packet_Length_Range'] = data[packet_length_features].max(axis=1) - data[packet_length_features].min(axis=1)

In [130]:
## After feature engineering let us check the shape of the dataset:
data.shape

(65091, 99)

### STEP 5: ENCODING CATEGORICAL ATTRIBUTES

Before building a machine learning model, it is often necessary to preprocess the data to ensure that it is in the right format for the model. In this case, two features require encoding: 'Protocol' and 'Malware Type'. Encoding is the process of converting categorical variables into numerical values, which can be easily processed by machine learning algorithms.

1. One Hot Encoding for 'Protocol':<br/>
The 'Protocol' feature is a categorical variable with three unique values, representing the three different protocols used in the dataset. One Hot Encoding is a method used to convert a categorical variable into multiple binary columns, with each column representing a unique category (in this case, a unique protocol). Each row of the dataset will have a '1' in the column corresponding to the protocol it uses and a '0' in the other columns. This type of encoding is particularly useful when there is no ordinal relationship between the categories, meaning the categories cannot be sorted or ranked in a meaningful way. By using One Hot Encoding, the machine learning model can treat each protocol independently without assuming any inherent order.

2. Label Encoding for 'Malware Type':<br/>
The 'Malware Type' feature represents the type of malware present in the dataset. Unlike the 'Protocol' feature, the number of malware types can be large, and there may be an implicit order or hierarchy between the types. In this case, Label Encoding is more appropriate. Label Encoding is a method where each unique category is assigned a numerical value, typically an integer starting from 0. For example, if there are five malware types, they will be assigned values from 0 to 4. While this encoding method is more compact and straightforward, it may introduce an ordinal relationship between categories, which might not be suitable for all use cases. However, for 'Malware Type', it can be beneficial, as the model can learn the relationships between different types of malware.

By encoding the 'Protocol' and 'Malware Type' features appropriately, the machine learning model can better understand and process the dataset, leading to improved performance in detecting Android malware.

In [131]:
#Encode the categorical variables: Protocol and Label
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encoding for the 'Protocol' column
one_hot_encoder = OneHotEncoder(sparse=False)
protocol_one_hot = one_hot_encoder.fit_transform(data['Protocol'].values.reshape(-1, 1))

# Create new column names for the one-hot encoded 'Protocol' features
protocol_columns = ['Protocol_' + str(i) for i in range(protocol_one_hot.shape[1])]

# Add the one-hot encoded 'Protocol' features to the DataFrame and drop the original column
data[protocol_columns] = pd.DataFrame(protocol_one_hot, index=data.index)
data = data.drop('Protocol', axis=1)

# Label encoding for the 'Label' column (target variable)
label_encoder = LabelEncoder()
data['Malware_Type'] = label_encoder.fit_transform(data['Malware_Type'])

### STEP 6: RESAMPLE AND SCALE THE DATA

Standard Scaler is used to normalize the data, ensuring that all features contribute equally to the model by transforming them to have a mean of 0 and a standard deviation of 1. <br/>
SMOTE is an oversampling technique applied to address class imbalance, generating synthetic samples for the minority class to improve the model's performance on underrepresented data, like detecting Android malware in a dataset dominated by benign apps. <br/>
By combining data scaling and SMOTE, we enhance the model's effectiveness and generalization capabilities in real-world scenarios.

In [155]:
################################################## SCALE THE DATA #########################################################

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Separate features and target
X = data.drop('Malware_Type', axis=1)
y = data['Malware_Type']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply SMOTE
sm = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train, y_train)

# Print the new class distribution
##print(pd.Series(y_train_resampled).value_counts())

In [134]:
# Scale the features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

### STEP 7: FEATURE SELECTION

For feature selection, we employed two different techniques: LASSO (Least Absolute Shrinkage and Selection Operator) and Random Forest. By comparing the results of both methods, we were able to identify the top 35 most important features that contributed significantly to detecting Android malware. Utilizing insights from both LASSO, a linear model with built-in feature selection, and Random Forest, an ensemble-based method that estimates feature importance through numerous decision trees, we ensured a comprehensive and robust selection of relevant features, ultimately enhancing the performance and interpretability of the final model.

In [156]:
# LASSO model for feature importance
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)

# Get feature importances
lasso_importances = lasso.coef_

# Create a list of tuples containing feature names and their corresponding coefficients
feature_coef = list(zip(df.columns[:-1], lasso_importances))

# Sort the feature coefficients in descending order of their absolute values
feature_coef_sorted = sorted(feature_coef, key=lambda x: abs(x[1]), reverse=True)

# Print the feature names and their corresponding coefficients in the sorted order
for feature, coef in feature_coef_sorted:
    print(feature, ':', coef)


Packet_Length_Std : -0.05397281914249613
Bwd_IAT_Total : 0.04277275949903097
Subflow_Bwd_Packets : 0.016567263046203932
PSH_Flag_Count : -0.0050922750581315285
Bwd_URG_Flags : 0.0010423896034828651
Subflow_Fwd_Bytes : -0.0009033420745324329
Fwd_Packet_Length_Max : 0.0008747241063513355
Fwd_Avg_Packets/Bulk : 0.0006358129548136472
Total_Fwd_Packets : -0.0005616105725574037
Bwd_Avg_Bytes/Bulk : 0.0004568652155747236
Destination_Port : 0.0004317848773109084
Total_Length_of_Bwd_Packets : 0.0004112035872513742
Flow_Duration : 0.0003538074889301432
URG_Flag_Count : 0.0003477748421156236
Destination_IP : -0.00030933720175187754
Bwd_Header_Length : -0.00029552376804126943
Fwd_Packets/s : 0.00028362145890123365
Fwd_Packet_Length_Mean : 0.0001861244973394131
Total_Length_of_Fwd_Packets : 0.00014615826254945233
Fwd_Packet_Length_Min : -6.072490939079805e-05
CWE_Flag_Count : -3.17961372812805e-05
Fwd_Header_Length : -1.9122949663568193e-05
Bwd_IAT_Min : 1.4778182507091342e-05
ACK_Flag_Count : 1.45

  positive)


In [135]:
#Random Forest for Feature Importance
from sklearn.ensemble import RandomForestClassifier

# Random Forest model for feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
rf_importances = rf.feature_importances_

# Pair feature names with their importances
feature_importance_pairs = list(zip(X.columns, rf_importances))

# Sort the feature importance pairs by importance (descending order)
sorted_feature_importance_pairs = sorted(feature_importance_pairs, key=lambda x: x[1], reverse=True)

# Print the sorted feature importance pairs
print("Feature importances (Random Forest):")
for feature, importance in sorted_feature_importance_pairs:
    print(f"{feature}: {importance}")

Feature importances (Random Forest):
Flow_Duration: 0.042665931657165944
Flow_IAT_Mean: 0.03479853736216587
Flow_IAT_Min: 0.03471225234717118
Source_Port: 0.03459711322402445
Flow_IAT_Max: 0.03275114516269383
Fwd_Packets/s: 0.03046033455091102
Flow_Packets/s: 0.030378897955566187
Init_Win_bytes_forward: 0.028697024239119198
Fwd_IAT_Min: 0.022882723792413455
Bwd_Packets/s: 0.02184826049959215
Average_Packet_Size: 0.02041855055148638
Fwd_Bwd_IAT_Mean_Diff: 0.020376565344438487
Fwd_IAT_Mean: 0.019592887119266725
Fwd_IAT_Max: 0.019156805629110867
Total_IAT: 0.018852039668801272
Fwd_IAT_Total: 0.018753018916520665
Destination_Port: 0.018363053913545636
Max_Packet_Length: 0.017464045382510188
Flow_Bytes_Packets_per_s_Ratio: 0.016802345804527885
Flow_Bytes/s: 0.016467662812375515
Init_Win_bytes_backward: 0.01640093845889432
Subflow_Fwd_Bytes: 0.01563972006282603
Packet_Length_Variance: 0.015379392644519067
Fwd_Bwd_IAT_Mean_Ratio: 0.014989883418684078
Total_Header_Length: 0.014646001761366401


In [136]:
selected_features = [
    'Source_Port',
    'Flow_Duration',
    'Flow_IAT_Max',
    'Flow_IAT_Min',
    'Flow_IAT_Mean',
    'Init_Win_bytes_forward',
    'Fwd_Packets/s',
    'Flow_Packets/s',
    'Fwd_IAT_Min',
    'Fwd_Bwd_IAT_Mean_Diff',
    'Fwd_IAT_Mean',
    'Fwd_IAT_Max',
    'Total_IAT',
    'Fwd_IAT_Total',
    'Destination_Port',
    'Bwd_Packets/s',
    'Init_Win_bytes_backward',
    'Total_Header_Length',
    'Flow_Bytes/s',
    'Flow_IAT_Std',
    'Fwd_IAT_Std',
    'Fwd_Bwd_Header_Length_Ratio',
    'Fwd_Header_Length',
    'Fwd_Bwd_Packets_per_s_Ratio',
    'min_seg_size_forward',
    'Avg_Fwd_Segment_Size',
    'Flow_Bytes_Packets_per_s_Ratio',
    'Average_Packet_Size',
    'Fwd_Bwd_Packet_Length_Mean_Ratio',
    'Fwd_Packet_Length_Mean',
    'Avg_Fwd_Bwd_Segment_Size_Ratio',
    'Subflow_Fwd_Bytes',
    'Packet_Length_Std',
    'Fwd_Bwd_Packet_Length_Mean_Product',
    'Packet_Length_Mean'
]

# Create a new DataFrame with selected features
selected_X = X[selected_features]

### STEP 8: MODEL CREATION AND PERFORMANCE COMPARISON

The five classifiers chosen for the task of detecting the type of malware:

1. Decision Tree: A decision tree is a simple, interpretable, and non-parametric classifier that recursively splits the input space into regions based on feature values. It constructs a tree-like structure, where each internal node represents a decision based on a feature value, and each leaf node represents the predicted class label. Decision trees can handle both continuous and categorical features and are easily visualized for better understanding.

2. XGBoost (Extreme Gradient Boosting) Classifier: XGBoost is an advanced implementation of gradient boosted trees, an ensemble learning technique that combines the predictions of multiple weak learners (typically shallow decision trees) in a weighted manner to improve the overall performance. XGBoost is known for its high accuracy, speed, and scalability, making it suitable for a wide range of applications, including malware detection.

3. LightGBM (Light Gradient Boosting Machine): LightGBM is another gradient boosting framework that employs tree-based learning algorithms. It is designed to be efficient and scalable for large datasets and high-dimensional data. LightGBM differs from other boosting algorithms by using a histogram-based algorithm, which reduces memory usage and speeds up training. It also supports categorical features and can handle imbalanced datasets effectively.

4. RandomForest: A RandomForest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (for classification) or mean prediction (for regression) of the individual trees. This method leverages the power of multiple trees to reduce overfitting and improve generalization, making it more robust and accurate than a single decision tree.

5. CatBoost (Categorical Boosting): CatBoost is a gradient boosting framework that focuses on improving the handling of categorical features. It uses an efficient method for encoding categorical variables called "ordered target statistics" or "ordered boosting," which reduces overfitting and provides better performance compared to traditional encoding methods. CatBoost is well-suited for datasets with a large number of categorical features, like malware detection tasks, where file properties, API calls, or network data might be categorical.

These classifiers were selected based on the preliminary results on the sample data, offering a diverse set of algorithms to ensure a robust and accurate model for Android malware detection.

In [138]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(selected_X, y, test_size=0.2, random_state=42, stratify=y)

# Convert label data to integers
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)

# Create a list of classifiers
classifiers = [
    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
    ("XGB Classifier", XGBClassifier(use_label_encoder=False, eval_metric="mlogloss", random_state=42)),
    ("LightGBM", lgb.LGBMClassifier(random_state=42)),
    ("Random Forest", RandomForestClassifier(random_state=42)),
    ("CatBoost", CatBoostClassifier(verbose=0, random_state=42)),
]

# Train and evaluate each classifier
for name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy:.4f}")
    print(classification_report(y_test, y_pred))


Decision Tree Accuracy: 0.7184
              precision    recall  f1-score   support

           0       0.72      0.69      0.70      3210
           1       0.71      0.70      0.70      3586
           2       0.75      0.75      0.75      3166
           3       0.71      0.73      0.72      3057

    accuracy                           0.72     13019
   macro avg       0.72      0.72      0.72     13019
weighted avg       0.72      0.72      0.72     13019

XGB Classifier Accuracy: 0.7830
              precision    recall  f1-score   support

           0       0.81      0.75      0.78      3210
           1       0.74      0.83      0.78      3586
           2       0.89      0.77      0.83      3166
           3       0.73      0.78      0.75      3057

    accuracy                           0.78     13019
   macro avg       0.79      0.78      0.78     13019
weighted avg       0.79      0.78      0.78     13019

LightGBM Accuracy: 0.7515
              precision    recall  f1-sco

## CONCLUSION:
Random Forest performed the best for detecting the type of malware. And with our Feature Engineering, Sampling and Encoding we were able to build a model that gave an accuracy of 82%