# Malware Classification Using Machine Learning

## Introduction to Malware Analysis

Malware is software specifically designed to cause damage or perform unauthorized actions on computer systems or networks. Malicious programs can be categorized by their characteristics, operational methods, and intended purposes, among other factors. Such a category is commonly referred to as a **malware family**. Researchers and security professionals can look up detailed information about different malware families in resources like Malpedia. Notable examples include Emotet and WannaCry, both of which have received significant attention in cybersecurity circles.


## Malware Classification Methodology

### Key Features for Malware Classification

When classifying malware, several critical features must be considered to accurately categorize different types of malicious software:

- **Behavior and Functionality**: The specific actions and capabilities of the malware
- **Delivery and Propagation Methods**: How the malware spreads and infects systems
- **Technical Characteristics**: Low-level attributes and implementation details

Traditional malware classification requires a combination of static and dynamic analysis techniques, including reverse engineering of malware binaries. This manual process is time-consuming and resource-intensive. Machine learning classifiers can therefore speed up the analysis considerably by assigning samples to malware families automatically.


In this section, we will implement a malware classifier that works on images of malware binaries, a technique first explored in academic research. This approach applies computer vision methods to a cybersecurity problem.


## Malware Image Classification Approach

### Understanding the Image-Based Classification Method

While classifying malware based on images might initially appear counterintuitive, we will explore the dataset in the upcoming sections and discover why this approach proves remarkably effective. The concept leverages the visual representation of binary data to identify patterns and characteristics that distinguish different malware families.

For this module, training a classifier on images offers several significant advantages:

- **Safety**: We avoid handling potentially dangerous malicious binaries directly
- **Security**: By only processing images that represent these binaries, we eliminate the risk of accidentally infecting our system with malware
- **Educational Appropriateness**: This approach is more suitable for a learning environment compared to working directly with binary files
- **Efficiency**: Image-based analysis can be faster and more scalable than traditional reverse engineering methods


### Convolutional Neural Networks for Malware Classification

In the upcoming sections, we will explore the process of training a **Convolutional Neural Network (CNN)** to classify malware images. CNNs are particularly well-suited for this task because they excel at:

- **Pattern Recognition**: Identifying visual patterns and textures in image data
- **Feature Extraction**: Automatically learning relevant features from raw pixel data
- **Hierarchical Learning**: Building complex representations from simple visual elements
- **Spatial Relationships**: Understanding the spatial arrangement of features within images

This deep learning approach will enable us to automatically discover distinguishing characteristics between different malware families based on their visual representations, providing an efficient and accurate classification system.
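To make the idea of pattern recognition more concrete, the following sketch applies a single hand-written 3x3 filter to a tiny grayscale image in plain Python. It is only an illustration under simplifying assumptions: the function `convolve2d`, the toy image, and the vertical-edge kernel are made up for this example, and a real CNN learns its filter values during training instead of using fixed ones.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1).

    Like most deep learning frameworks, this computes a cross-correlation,
    i.e., the kernel is not flipped.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            # weighted sum of the k x k window anchored at (row, col)
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[row + i][col + j] * kernel[i][j]
            out_row.append(acc)
        output.append(out_row)
    return output

# toy 4x5 image: dark left part (0), bright right part (255)
image = [
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
    [0, 0, 255, 255, 255],
]

# hypothetical kernel that responds to dark-to-bright vertical edges
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

The output is large wherever the window covers the dark-to-bright transition and zero where the window only sees uniform pixels. A CNN stacks many such filters, which is what allows it to pick up recurring textures in malware images.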


# Malware Dataset and Data Exploration

## Introduction to the Malimg Dataset

The dataset of malware images we will be using is the **Malimg dataset**, which can be obtained from multiple sources. This dataset was originally proposed in academic research on malware analysis through visual representation.

### Dataset Acquisition

We can download and unpack the dataset using the following commands:

```bash
wget https://www.kaggle.com/api/v1/datasets/download/ikrambenabd/malimg-original -O malimg.zip
unzip malimg.zip
```


## Dataset Structure and Organization

The dataset consists of **9,339 image files** representing **25 different malware families**. The dataset is organized in folders, where each folder contains all samples for a single malware family. The folder name corresponds to the malware family's name:

```bash
ls malimg_paper_dataset_imgs

Adialer.C        C2LOP.P          Lolyda.AA3      'Swizzor.gen!I'
Agent.FYI        Dialplatform.B   Lolyda.AT        VB.AT
Allaple.A        Dontovo.A       'Malex.gen!J'     Wintrim.BX
Allaple.L        Fakerean         Obfuscator.AD    Yuner.A
'Alueron.gen!J'  Instantaccess   'Rbot!gen'
Autorun.K        Lolyda.AA1       Skintrim.N
'C2LOP.gen!g'    Lolyda.AA2      'Swizzor.gen!E'
```


## Understanding Malware Image Format

Each image contains a visual representation of a **PE (Portable Executable) file**, which is a Windows executable format. The images are grayscale in PNG format.

### Binary-to-Image Conversion Process

These images represent a direct visualization of the malware binaries. Each pixel in the image corresponds to a single byte in the binary file. The byte can have any value in the 0-255 range, and this exact value is represented by the corresponding pixel's brightness:

- **Byte value 0**: Results in a black pixel
- **Byte value 255**: Results in a white pixel  
- **Values in between**: Result in corresponding gray pixels

### Information Preservation

Each binary byte is fully encoded within the image, meaning the image can be used to exactly reconstruct the binary without any loss of information. Furthermore, the images can visibly convey patterns in the binary structure.
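The byte-to-pixel mapping can be sketched in a few lines of plain Python. This is a minimal illustration and not the conversion tool used to build the dataset: the helper names `bytes_to_rows` and `rows_to_bytes`, the fixed image width, and the zero padding of the last row are assumptions made for this example. Note that with padding, the original file length has to be known to restore the binary exactly.

```python
def bytes_to_rows(data, width):
    """Arrange raw bytes into rows of `width` pixels, zero-padding the last row."""
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

def rows_to_bytes(rows, original_length):
    """Flatten the pixel rows and drop the padding again."""
    flat = bytes(value for row in rows for value in row)
    return flat[:original_length]

# harmless stand-in for a binary: PE files start with the ASCII characters "MZ"
data = b"MZ" + bytes(range(9))

# each inner list is one image row, each number one pixel brightness (0-255)
rows = bytes_to_rows(data, width=4)
print(rows)

# the pixel values are sufficient to restore the original bytes
restored = rows_to_bytes(rows, len(data))
print(restored == data)
```

In this sketch, the two leading values `77` and `90` are the byte values of `M` and `Z`, and the final `0` in the last row is padding.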

For instance, consider two samples from the FakeRean malware family (stored in the `Fakerean` folder). Both malware images show similar, distinct patterns, demonstrating how visual analysis can reveal structural characteristics of a malware family.


## Dataset Exploration and Analysis

To familiarize ourselves with the dataset, let's start by creating a plot of the class distribution within it. This enables us to identify classes that are over- or underrepresented, which is crucial for understanding potential biases in our training data.

### Required Imports and Setup

To achieve this analysis, we will need the following imports as well as a base path to the folder containing the data:


```python
import os
import matplotlib.pyplot as plt
import seaborn as sns

DATA_BASE_PATH = "./malimg_paper_dataset_imgs/"
```


### Computing Class Distribution

Afterward, we can iterate over all malware families and count the number of images within the corresponding folder to compute the overall class distribution:


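```python
# compute the class distribution: one folder per malware family
dist = {}
for mlw_class in os.listdir(DATA_BASE_PATH):
    mlw_dir = os.path.join(DATA_BASE_PATH, mlw_class)
    # skip stray files so that only family folders are counted
    if os.path.isdir(mlw_dir):
        dist[mlw_class] = len(os.listdir(mlw_dir))
```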


### Visualizing Class Distribution

Finally, we can create a barplot to visualize the class distribution using a custom color palette:


```python
# plot the class distribution

# HTB Color Palette
htb_green = "#9FEF00"
node_black = "#141D2B"
hacker_grey = "#A4B1CD"

# data
classes = list(dist.keys())
frequencies = list(dist.values())

# plot
plt.figure(facecolor=node_black)
sns.barplot(y=classes, x=frequencies, edgecolor="black", orient='h', color=htb_green)
plt.title("Malware Class Distribution", color=htb_green)
plt.xlabel("Malware Class Frequency", color=htb_green)
plt.ylabel("Malware Class", color=htb_green)
plt.xticks(color=hacker_grey)
plt.yticks(color=hacker_grey)
ax = plt.gca()
ax.set_facecolor(node_black)
ax.spines['bottom'].set_color(hacker_grey)
ax.spines['top'].set_color(node_black)
ax.spines['right'].set_color(node_black)
ax.spines['left'].set_color(hacker_grey)
plt.show()
```


### Analysis of Class Distribution

From the resulting diagram, we can identify which malware families are represented more than others, potentially skewing the model's performance. The visualization reveals:

- **Class Imbalance**: Some families like Allaple.A and Allaple.L have significantly higher frequencies
- **Underrepresented Classes**: Other families may have very few samples
- **Training Implications**: This imbalance could affect model performance and generalization

If the trained model does not provide the expected performance in terms of accuracy, number of false positives, and number of false negatives, we may want to rebalance the dataset before training to obtain a more even class distribution. This could involve techniques such as:

- **Data Augmentation**: Creating additional samples for underrepresented classes
- **Class Balancing**: Using techniques like SMOTE or undersampling
- **Weighted Training**: Adjusting loss functions to account for class imbalance
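As an illustration of weighted training, the sketch below derives one weight per class from a class distribution using the common heuristic `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy counts are hypothetical; the counts are not the real Malimg numbers.

```python
def balanced_class_weights(distribution):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(distribution.values())
    n_classes = len(distribution)
    return {
        name: n_samples / (n_classes * count)
        for name, count in distribution.items()
    }

# hypothetical counts for three families, chosen to be clearly imbalanced
toy_dist = {"FamilyA": 800, "FamilyB": 150, "FamilyC": 50}

weights = balanced_class_weights(toy_dist)
for name, weight in weights.items():
    # rare classes receive weights above 1, frequent classes below 1
    print(f"{name}: {weight:.2f}")
```

Weights like these could be converted to a tensor, ordered by class index, and passed to the `weight` parameter of PyTorch's `CrossEntropyLoss`, so that mistakes on rare families contribute more to the loss.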


# Data Preprocessing and Model Architecture

## Overview of Data Preprocessing

We need to prepare the data before we can feed the images to a CNN for training and inference. In particular, we need to split the data into two distinct datasets: a training and a test set. Furthermore, we need to apply the preprocessing functions expected by our model so the model can work on the images. Lastly, we must create DataLoaders that we can use during training and inference.

## Preparing the Datasets

To split the data into two distinct datasets, one for training and one for testing, we will use the library `split-folders`, which we can install with pip:

```bash
pip3 install split-folders
```

Afterward, we can use the following code to split the data accordingly. We will use an 80-20 split, meaning 80% of the data will be used for training and 20% for testing:


```python
import splitfolders

DATA_BASE_PATH = "./malimg_paper_dataset_imgs/"
TARGET_BASE_PATH = "./newdata/"

TRAINING_RATIO = 0.8
TEST_RATIO = 1 - TRAINING_RATIO

# the ratio tuple is ordered (train, val, test); we do not use a validation set
splitfolders.ratio(input=DATA_BASE_PATH, output=TARGET_BASE_PATH, ratio=(TRAINING_RATIO, 0, TEST_RATIO))
```


After running the code once, a new directory `./newdata/` will be created containing three folders:

```bash
ls -la ./newdata/

total 0
drwxr-xr-x 1 t t  24 26. Nov 10:52 .
drwxr-xr-x 1 t t 160 26. Nov 10:52 ..
drwxr-xr-x 1 t t 498 26. Nov 10:52 test
drwxr-xr-x 1 t t 498 26. Nov 10:52 train
drwxr-xr-x 1 t t 498 26. Nov 10:52 val
```

The `test` folder contains the test dataset, the `train` folder contains the training dataset, and the `val` folder is reserved for the validation dataset. Since we set the validation ratio to 0 and will not use a validation dataset, the `val` folder contains no images. We can confirm the 80-20 split by counting the number of files in each dataset:

```bash
find ./newdata/test/ -type f | wc -l
1880

find ./newdata/train/ -type f | wc -l
7459

find ./newdata/val/ -type f | wc -l
0
```

The split was successful, as we can see. We can now create DataLoaders for training and inference and apply the required preprocessing to the images.
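As a quick sanity check, the file counts from the output above can be turned into percentages. The snippet only restates those numbers; the explanation that the shares are not exactly 80 and 20 because the split is presumably applied per class folder, with rounding in each folder, is an assumption on our part.

```python
# file counts taken from the `find ... | wc -l` output above
n_train = 7459
n_test = 1880
n_total = n_train + n_test

# share of each split in percent, rounded to two decimals
train_share = round(100 * n_train / n_total, 2)
test_share = round(100 * n_test / n_total, 2)

print(f"total: {n_total}, train: {train_share}%, test: {test_share}%")
```

The total matches the 9,339 images of the full dataset, so no file was lost or duplicated by the split.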


## Applying Preprocessing and Creating DataLoaders

In the first step, let us define the preprocessing required for our model to read the data. For CNNs, this typically consists of resizing, so that all input images have the same size, and normalization. Normalization standardizes the pixel values before they are fed to the model, which makes the model easier to train. In PyTorch, our preprocessing looks like this:


```python
from torchvision import transforms

# Define preprocessing transforms
transform = transforms.Compose([
    transforms.Resize((75, 75)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
```
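To see what the normalization step does to a single pixel, the following plain-Python sketch mimics `ToTensor` (scaling a byte value to the range 0 to 1) followed by `Normalize` (subtracting the mean and dividing by the standard deviation per channel). The mean and standard deviation values are the ImageNet statistics that pre-trained torchvision models are commonly used with. The helper `normalize_pixel` is hypothetical, and the sketch assumes that a grayscale image is loaded with three identical channels, which is what `ImageFolder` does by default to our knowledge.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(byte_value):
    """Mimic ToTensor (scale to [0, 1]) followed by per-channel Normalize."""
    scaled = byte_value / 255
    return [(scaled - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD)]

# black, mid-gray, and white pixels
for value in (0, 128, 255):
    channels = [round(c, 3) for c in normalize_pixel(value)]
    print(value, channels)
```

Black pixels end up at roughly -2 and white pixels at roughly +2 in every channel, so the model input is centered around zero instead of covering the range 0 to 255.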


Afterward, we can load the datasets from their corresponding folders and apply the preprocessing functions. We need to specify the root folder for each dataset in the `root` parameter and the preprocessing transform in the `transform` parameter. As we have discussed above, the root folders for the datasets are `./newdata/train/` and `./newdata/test/`, respectively.


```python
from torchvision.datasets import ImageFolder
import os

BASE_PATH = "./newdata/"

# Load training and test datasets
train_dataset = ImageFolder(
    root=os.path.join(BASE_PATH, "train"),
    transform=transform
)

test_dataset = ImageFolder(
    root=os.path.join(BASE_PATH, "test"),
    transform=transform
)
```


Finally, we can create DataLoader instances, which we can use to iterate over the data for training and inference. We can supply a batch size and specify the number of workers to load the data in the `num_workers` parameter. This enables parallelization and will speed up the data handling:


```python
from torch.utils.data import DataLoader

TRAIN_BATCH_SIZE = 1024
TEST_BATCH_SIZE = 1024

# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=TRAIN_BATCH_SIZE,
    shuffle=True,
    num_workers=2
)

test_loader = DataLoader(
    test_dataset,
    batch_size=TEST_BATCH_SIZE,
    shuffle=False,
    num_workers=2
)
```
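The batch size determines how many batches a loader yields per pass over the data. The following sketch computes this for the split sizes shown earlier (7,459 training and 1,880 test images). The helper `batch_layout` is hypothetical and assumes the `DataLoader` default of keeping the last, smaller batch.

```python
import math

def batch_layout(n_samples, batch_size):
    """Return (number of batches, size of the last batch) when no batch is dropped."""
    n_batches = math.ceil(n_samples / batch_size)
    last_batch = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch

print(batch_layout(7459, 1024))  # training set
print(batch_layout(1880, 1024))  # test set
```

With a batch size of 1024, one epoch therefore consists of only eight optimizer steps on the training set. A smaller batch size yields more, but noisier, steps per epoch.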


Let us take a look at one of the preprocessed images to see its effects:


```python
import matplotlib.pyplot as plt

# HTB Color Palette
htb_green = "#9FEF00"
node_black = "#141D2B"
hacker_grey = "#A4B1CD"

# first image of the first training batch
sample = next(iter(train_loader))[0][0]

# plot (imshow expects channels last, hence the permute;
# normalized values outside [0, 1] are clipped for display)
plt.figure(facecolor=node_black)
plt.imshow(sample.permute(1, 2, 0))
plt.xticks(color=hacker_grey)
plt.yticks(color=hacker_grey)
ax = plt.gca()
ax.set_facecolor(node_black)
ax.spines['bottom'].set_color(hacker_grey)
ax.spines['top'].set_color(node_black)
ax.spines['right'].set_color(node_black)
ax.spines['left'].set_color(hacker_grey)
ax.tick_params(axis='x', colors=hacker_grey)
ax.tick_params(axis='y', colors=hacker_grey)
plt.show()
```


This is the raw malware image:

*[Image of static noise pattern]*

This is the resized and normalized image from our DataLoader that we will feed to the model:

*[Heatmap visualization with varying shades of blue indicating data intensity]*

The coarse structure of the raw image can still be discerned in the preprocessed version. However, many of the fine details have been lost in the resize to 75x75 pixels.

After combining the above code into a single function, we end up with the following code:


```python
from torchvision import transforms
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
import os

def load_datasets(base_path, train_batch_size, test_batch_size):
    # Define preprocessing transforms
    transform = transforms.Compose([
        transforms.Resize((75, 75)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    # Load training and test datasets
    train_dataset = ImageFolder(
        root=os.path.join(base_path, "train"),
        transform=transform
    )

    test_dataset = ImageFolder(
        root=os.path.join(base_path, "test"),
        transform=transform
    )

    # Create data loaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=train_batch_size,
        shuffle=True,
        num_workers=2
    )

    test_loader = DataLoader(
        test_dataset,
        batch_size=test_batch_size,
        shuffle=False,
        num_workers=2
    )

    # one class per subfolder of the training directory
    n_classes = len(train_dataset.classes)
    return train_loader, test_loader, n_classes
```


Note that the function also returns the number of classes in the dataset. As we have mentioned before, the Malimg dataset consists of 25 classes, so we could omit this step and simply assume there are always 25 classes. However, by reading this information dynamically from the data itself, we can use the same code even after making changes to the dataset, either by removing one of the classes or adding new classes to the dataset.
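The class labels themselves are derived from the folder names. To our understanding, `ImageFolder` sorts the subfolder names and assigns consecutive indices in that order. The sketch below imitates this in plain Python for a few of the family names; the helper `build_class_index` is hypothetical and only mirrors that behavior.

```python
def build_class_index(folder_names):
    """Sort the folder names and assign consecutive indices to them."""
    classes = sorted(folder_names)
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    return classes, class_to_idx

# a few folder names from the dataset, deliberately unordered
classes, class_to_idx = build_class_index(["Yuner.A", "Adialer.C", "Allaple.A"])
print(class_to_idx)

# translate a predicted index back into a family name
predicted_index = 2
print(classes[predicted_index])
```

On the real dataset objects, the same information should be available through `train_dataset.classes` and `train_dataset.class_to_idx`, which is useful for turning the model's numeric predictions back into malware family names.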


# Model Architecture and Implementation

## Overview of the Model

The heart of any classifier is the model. As discussed previously, we will be using a CNN model. To speed up the training process, we will base our model on a pre-trained version of a well-established CNN called ResNet50.

## ResNet50 Architecture

The ResNet family of CNNs was proposed in academic research in 2015. We will use a variant called ResNet50, named for its depth of 50 layers. The model has roughly 25.6 million parameters, about 23.5 million of which lie outside the final fully connected layer. It performs strongly on image classification tasks, which fits our needs for malware classification.

### Transfer Learning Approach

To significantly speed up the training process, we will not start with randomly initialized weights but rather with a pre-trained ResNet50 model. Our code will download pre-trained weights and apply them to our model as a baseline. We will then run our training on the malware image dataset to fine-tune it for our purpose. Compared to training the full network from scratch, this can save a substantial amount of training time.

### Weight Freezing Strategy

To speed up the training process further, we will freeze the weights of all pre-trained ResNet layers and replace the final fully connected layer with a new, trainable head. Thus, during our training process, only the weights of this head will change. While this may reduce our classifier's performance, it significantly reduces training time and is a good trade-off for our simple proof-of-concept experiment. The new head consists of a hidden layer, whose number of neurons we can adjust, followed by an output layer whose size is fixed to the number of classes in our training data. This results in the following `MalwareClassifier` class:


```python
import torch.nn as nn
import torchvision.models as models

HIDDEN_LAYER_SIZE = 1000

class MalwareClassifier(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        # Load pretrained ResNet50
        self.resnet = models.resnet50(weights='DEFAULT')

        # Freeze ResNet parameters
        for param in self.resnet.parameters():
            param.requires_grad = False

        # Replace the last fully connected layer; the new layers are trainable by default
        num_features = self.resnet.fc.in_features
        self.resnet.fc = nn.Sequential(
            nn.Linear(num_features, HIDDEN_LAYER_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_LAYER_SIZE, n_classes)
        )

    def forward(self, x):
        return self.resnet(x)
```
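To get a feeling for how little of the network is actually trained, we can count the parameters of the new head by hand. The sketch below is plain arithmetic and does not load the model. It assumes that `resnet.fc.in_features` is 2048 for ResNet50 and uses the 25 classes of our dataset; the helper `linear_params` is hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

NUM_FEATURES = 2048       # assumed value of resnet.fc.in_features for ResNet50
HIDDEN_LAYER_SIZE = 1000  # same value as in the classifier above
N_CLASSES = 25            # number of malware families in the dataset

# the head consists of two linear layers; ReLU has no parameters
trainable = (
    linear_params(NUM_FEATURES, HIDDEN_LAYER_SIZE)
    + linear_params(HIDDEN_LAYER_SIZE, N_CLASSES)
)
print(f"{trainable:,} trainable parameters")
```

Under these assumptions, only about two million parameters are updated during training, while the frozen backbone with its roughly 23.5 million parameters stays untouched.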


When initializing the model, we need to specify the number of classes. Since our dataset consists of 25 classes, we can initialize the model like so:


```python
model = MalwareClassifier(25)
```


However, as discussed in the previous section, the advantage of dynamically setting the number of classes is that we can directly use it from the dataset. By combining the above code with the code from the previous section, we can take the number of classes from the dataset and initialize the model accordingly:


```python
DATA_PATH = "./newdata/"
TRAINING_BATCH_SIZE = 1024
TEST_BATCH_SIZE = 1024

# Load datasets
train_loader, test_loader, n_classes = load_datasets(DATA_PATH, TRAINING_BATCH_SIZE, TEST_BATCH_SIZE)

# Initialize model
model = MalwareClassifier(n_classes)
```


# Model Training and Evaluation

## Overview of Training and Evaluation Process

After loading the datasets and initializing the model, let's finally discuss model training and evaluation to see how well our model performs.

## Model Training Implementation

Let us define a training function that takes a model, a training loader, and the number of epochs. We use `CrossEntropyLoss` as the loss function and Adam as the optimizer. In each epoch, we iterate over the entire training data and run the forward and backward passes for every batch. The function calls the helper `compute_accuracy`, which is defined in the Model Evaluation section below, so that definition must be executed before `train` is called. For a refresher on backpropagation and gradient descent, check out the Fundamentals of AI module.

The final training function looks like this:


```python
import torch
import time

def train(model, train_loader, n_epochs, verbose=False):
    model.train()
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())

    training_data = {"accuracy": [], "loss": []}

    for epoch in range(n_epochs):
        running_loss = 0
        n_total = 0
        n_correct = 0
        checkpoint = time.time() * 1000

        for inputs, labels in train_loader:
            # forward pass, backward pass, and weight update for one batch
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # bookkeeping: the predicted class is the index of the largest output
            _, predicted = outputs.max(1)
            n_total += labels.size(0)
            n_correct += predicted.eq(labels).sum().item()
            running_loss += loss.item()

        # average loss per batch, duration in milliseconds, accuracy in percent
        epoch_loss = running_loss / len(train_loader)
        epoch_duration = int(time.time() * 1000 - checkpoint)
        epoch_accuracy = compute_accuracy(n_correct, n_total)

        training_data["accuracy"].append(epoch_accuracy)
        training_data["loss"].append(epoch_loss)

        if verbose:
            print(f"[i] Epoch {epoch+1} of {n_epochs}: Acc: {epoch_accuracy:.2f}% Loss: {epoch_loss:.4f} (Took {epoch_duration} ms).")

    return training_data
```


Note that much of the code within the training function keeps track of information about the training, such as time elapsed, accuracy, and loss.
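To illustrate what the loss function measures, the following plain-Python sketch computes the cross-entropy for a single sample: the model outputs are turned into probabilities with a softmax, and the loss is the negative logarithm of the probability assigned to the true class. The function `cross_entropy` and the example outputs are hypothetical; PyTorch's `CrossEntropyLoss` additionally averages this value over all samples in a batch by default.

```python
import math

def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the true class."""
    largest = max(logits)
    shifted = [z - largest for z in logits]  # improves numerical stability
    exp = [math.exp(z) for z in shifted]
    total = sum(exp)
    probabilities = [e / total for e in exp]
    return -math.log(probabilities[label])

logits = [2.0, 0.5, -1.0]  # hypothetical model outputs for three classes

# the true class has the highest score: small loss
print(cross_entropy(logits, label=0))

# the true class has the lowest score: large loss
print(cross_entropy(logits, label=2))
```

The loss approaches zero when the model assigns nearly all probability to the correct class, which is why the loss values reported per epoch shrink as the training accuracy rises.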

Additionally, we will define a function to save the trained model to disk for later use:


```python
def save_model(model, path):
    # compile the model to TorchScript and write it to disk
    model_scripted = torch.jit.script(model)
    model_scripted.save(path)
```


## Model Evaluation

To evaluate the model, we will first define a function that runs the model on a batch of inputs and returns the predicted class for each of them:


```python
def predict(model, test_data):
    model.eval()

    with torch.no_grad():
        output = model(test_data)
        # index of the largest output per sample is the predicted class
        _, predicted = torch.max(output, 1)

    return predicted
```


We set the model to evaluation mode using the call `model.eval()` and disable gradient calculation using `torch.no_grad()`. From there, we can write an evaluation function that iterates over the entire test dataset and evaluates the model's performance in terms of accuracy:


```python
def compute_accuracy(n_correct, n_total):
    # accuracy in percent, rounded to two decimals
    return round(100 * n_correct / n_total, 2)


def evaluate(model, test_loader):
    model.eval()

    n_correct = 0
    n_total = 0

    with torch.no_grad():
        for data, target in test_loader:
            predicted = predict(model, data)
            n_total += target.size(0)
            n_correct += (predicted == target).sum().item()

    accuracy = compute_accuracy(n_correct, n_total)

    return accuracy
```
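Overall accuracy can hide weak performance on rare malware families, which matters given the class imbalance we observed in the dataset. As a complement, the sketch below computes the recall per class, i.e., the fraction of samples of each true class that were predicted correctly. The function `per_class_recall` and the tiny label lists are hypothetical and not part of the module's code.

```python
from collections import Counter

def per_class_recall(targets, predictions):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(targets)
    hits = Counter(t for t, p in zip(targets, predictions) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in sorted(totals)}

# hypothetical labels for a tiny, imbalanced test set
targets     = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
predictions = [0, 0, 0, 0, 0, 0, 1, 0, 0, 2]

overall = sum(t == p for t, p in zip(targets, predictions)) / len(targets)
print(overall)
print(per_class_recall(targets, predictions))
```

In this toy example, the overall accuracy looks decent even though only half of the samples of the two rare classes are recognized. To apply the same idea to our model, the predictions and targets of all test batches would first have to be collected into flat lists.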


## Visualization Functions

Lastly, let us define a couple of helper functions that create simple plots for the training accuracy and loss per epoch, respectively:


```python
import matplotlib.pyplot as plt

def plot(data, title, label, xlabel, ylabel):
    # HTB Color Palette
    htb_green = "#9FEF00"
    node_black = "#141D2B"
    hacker_grey = "#A4B1CD"

    # plot one value per epoch, starting at epoch 1
    plt.figure(figsize=(10, 6), facecolor=node_black)
    plt.plot(range(1, len(data)+1), data, label=label, color=htb_green)
    plt.title(title, color=htb_green)
    plt.xlabel(xlabel, color=htb_green)
    plt.ylabel(ylabel, color=htb_green)
    plt.xticks(color=hacker_grey)
    plt.yticks(color=hacker_grey)
    ax = plt.gca()
    ax.set_facecolor(node_black)
    ax.spines['bottom'].set_color(hacker_grey)
    ax.spines['top'].set_color(node_black)
    ax.spines['right'].set_color(node_black)
    ax.spines['left'].set_color(hacker_grey)

    legend = plt.legend(facecolor=node_black, edgecolor=hacker_grey, fontsize=10)
    plt.setp(legend.get_texts(), color=htb_green)

    plt.show()

def plot_training_accuracy(training_data):
    plot(training_data['accuracy'], "Training Accuracy", "Accuracy", "Epoch", "Accuracy (%)")

def plot_training_loss(training_data):
    plot(training_data['loss'], "Training Loss", "Loss", "Epoch", "Loss")
```


## Complete Training and Evaluation Pipeline

After defining all helper functions, we can write a script that defines all parameters and runs the helper functions to load the data, initialize the model, train the model, save the model, and finally evaluate the model:


```python
# data parameters
DATA_PATH = "./newdata/"

# training parameters
N_EPOCHS = 10
TRAINING_BATCH_SIZE = 512
TEST_BATCH_SIZE = 1024

# model parameters
HIDDEN_LAYER_SIZE = 1000
MODEL_FILE = "malware_classifier.pth"


# Load datasets
train_loader, test_loader, n_classes = load_datasets(DATA_PATH, TRAINING_BATCH_SIZE, TEST_BATCH_SIZE)

# Initialize model
model = MalwareClassifier(n_classes)

# Train model
print("[i] Starting Training...")
training_information = train(model, train_loader, N_EPOCHS, verbose=True)

# Save model
save_model(model, MODEL_FILE)

# Evaluate model
accuracy = evaluate(model, test_loader)
print(f"[i] Inference accuracy: {accuracy}%.")

# Plot training details
plot_training_accuracy(training_information)
plot_training_loss(training_information)
```


## Training Results and Analysis

Running the final code, we can achieve an accuracy of 88.54% on the test dataset:

```bash
python3 main.py

[i] Epoch 1 of 10: Acc: 57.09% Loss: 1.4741 (Took 41128 ms).
[i] Epoch 2 of 10: Acc: 85.01% Loss: 0.4631 (Took 40630 ms).
[i] Epoch 3 of 10: Acc: 89.60% Loss: 0.2880 (Took 39567 ms).
[i] Epoch 4 of 10: Acc: 91.88% Loss: 0.2294 (Took 39464 ms).
[i] Epoch 5 of 10: Acc: 92.97% Loss: 0.2113 (Took 39367 ms).
[i] Epoch 6 of 10: Acc: 93.86% Loss: 0.1744 (Took 39172 ms).
[i] Epoch 7 of 10: Acc: 95.13% Loss: 0.1572 (Took 39804 ms).
[i] Epoch 8 of 10: Acc: 94.81% Loss: 0.1501 (Took 39092 ms).
[i] Epoch 9 of 10: Acc: 96.51% Loss: 0.1188 (Took 39328 ms).
[i] Epoch 10 of 10: Acc: 96.26% Loss: 0.1198 (Took 39125 ms).
[i] Inference accuracy: 88.54%.
```

During the training process, we can observe a steady increase in accuracy up until the final couple of epochs:

*[Line graph of training accuracy over epochs, showing an increase from 60% to 95%]*

While the final accuracy is not great, it is acceptable given our simple training setup. We have chosen many parameters to favor training time over model performance. Keep in mind that the model's accuracy may vary depending on the random split of the datasets. Additionally, changing the parameters affects both training time and model performance. Feel free to experiment with all the parameters the script defines to determine their effects.
