# **CDS Project: Part 1**

*Institute of Software Security (E22)*  
*Hamburg University of Technology*  
*SoSe 2023*

## Learning objectives
---

- Use a basic Machine Learning (ML) pipeline with pre-trained models.
- Build your own data loader.
- Load and run a pre-trained ML model.
- Evaluate the performance of an ML model.
- Calculate and interpret performance metrics.

## Materials
---

- Lecture Slides 1, 2, and 3.
- PyTorch Documentation: [Datasets and Data Loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)


## Project Description
---

In this project, you are given an ML model that is pre-trained on a vulnerability dataset. The dataset consists of code samples labeled True or False, depending on the presence or absence of a vulnerability. Your goal is to use the pre-trained model to predict whether the code samples in the validation set contain vulnerabilities and to analyse the results. Please proceed to the tasks below.

### *Task 1*

Build a data loader for the validation dataset located at "*data_students/student_dataset.hdf5*". You will use this dataset to validate the performance of the ML model. The dataset is stored in HDF5, a binary data format designed for storing large amounts of data. Make sure that you import and familiarise yourself with the right Python libraries to handle HDF5 files.
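Before writing the dataset class, it can help to list what an HDF5 file actually contains. The sketch below is one possible way to do that with `h5py`. The helper name `describe_hdf5` is hypothetical, and the small demo file is a stand-in created only for illustration: it reuses the dataset names `labels` and `vectors` from the notebook code, but its contents are made up and are not the course data.

```python
import os
import tempfile

import h5py
import numpy as np


def describe_hdf5(path):
    """Return {dataset_name: (shape, dtype_string)} for every dataset in the file."""
    summary = {}

    def visitor(name, obj):
        # visititems also walks groups; only datasets carry a shape and dtype
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visitor)
    return summary


# Stand-in file with made-up contents (assumption: not the real student dataset)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("labels", data=np.array([1, 0, 1], dtype=np.int64))
    f.create_dataset("vectors", data=np.zeros((3, 1, 768), dtype=np.float32))

demo_summary = describe_hdf5(demo_path)
for name, (shape, dtype) in demo_summary.items():
    print(name, shape, dtype)
```

Calling `describe_hdf5` on the real file path should reveal the dataset names and shapes that the custom `Dataset` class needs to handle.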


In [22]:
# Import necessary libraries
import h5py
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
import random

# Define custom dataset class
class VulnerabilityDataset(Dataset):
    def __init__(self, file_path):
        # Open the file passed by the caller instead of a hard-coded path
        with h5py.File(file_path, 'r') as f:
            self.labels = np.array(f['labels'])
            self.vectors = np.array(f['vectors'])
            self.sources = np.array(f['source'])

        # Reshape vectors from (1000,1,768) to (1000,768)
        self.vectors = self.vectors.reshape(self.vectors.shape[0], -1)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        vector = torch.tensor(self.vectors[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.float32)
        return vector, label

# Load the dataset
dataset = VulnerabilityDataset("data/student_dataset.hdf5")
print("Dataset loaded successfully!")
print(f"Total samples: {len(dataset)}")

Dataset loaded successfully!
Total samples: 1000


### *Task 2*

Generate a table with 10 random samples from the dataset and show their corresponding labels.

In [23]:
# Create a DataFrame to display samples
random_indices = random.sample(range(len(dataset)), 10)
samples = []

for idx in random_indices:
    vector, label = dataset[idx]
    samples.append({
        'Index': idx,
        'Source Length': len(dataset.sources[idx]),
        'Vector Shape': vector.shape,
        'Label': 'Vulnerable' if label.item() == 1 else 'Not Vulnerable'
    })

pd.DataFrame(samples)

|   | Index | Source Length | Vector Shape | Label |
|---|-------|---------------|--------------|-------|
| 0 | 195 | 1055 | (768,) | Not Vulnerable |
| 1 | 325 | 166 | (768,) | Not Vulnerable |
| 2 | 226 | 3738 | (768,) | Vulnerable |
| 3 | 448 | 685 | (768,) | Not Vulnerable |
| 4 | 600 | 233 | (768,) | Not Vulnerable |
| 5 | 766 | 1724 | (768,) | Vulnerable |
| 6 | 182 | 1149 | (768,) | Vulnerable |
| 7 | 939 | 876 | (768,) | Vulnerable |
| 8 | 740 | 1395 | (768,) | Not Vulnerable |
| 9 | 134 | 2171 | (768,) | Vulnerable |


### *Task 3*

Inspect the dataset and answer the following questions:
1. How many samples are in the dataset?
2. How many positive examples (vulnerability-labeled instances) are in the dataset?
3. What is the vulnerable/non-vulnerable ratio?

In [24]:
# Calculate dataset statistics
total_samples = len(dataset)
# Labels are 0/1 flags, so their sum is the number of vulnerable samples
positive_samples = int(np.sum(dataset.labels))
negative_samples = total_samples - positive_samples
ratio = positive_samples / negative_samples

print(f"1. Total samples in dataset: {total_samples}")
print(f"2. Positive examples (vulnerable): {positive_samples}")
print(f"3. Vulnerable/Non-vulnerable ratio: {ratio:.2f}:1")

1. Total samples in dataset: 1000
2. Positive examples (vulnerable): 283
3. Vulnerable/Non-vulnerable ratio: 0.39:1


### *Task 4*

Load and run the following pre-trained neural network model, called `VulnPredictModel` in the code below.

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")
```

In [25]:
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

class VulnPredictModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        x = self.flatten(x)
        pred = self.linear_stack(x)
        return pred

# Initialize and load the model
model = VulnPredictModel().to(device)
# map_location lets the weights load on CPU-only machines as well
model.load_state_dict(torch.load("model_2023-03-28_20-03.pth", map_location=device))
model.eval()
print("Model loaded successfully!")

Using cuda device
Model loaded successfully!


### *Task 5*

Make a prediction on the provided dataset and compute the following values:
- True Positives
- True Negatives
- False Positives
- False Negatives

In [26]:
# Create data loader
data_loader = DataLoader(dataset, batch_size=32, shuffle=False)

# Initialize counters
true_pos = 0
true_neg = 0
false_pos = 0
false_neg = 0

# Make predictions
with torch.no_grad():
    for vectors, labels in data_loader:
        vectors, labels = vectors.to(device), labels.to(device)
        outputs = model(vectors)
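        # Label a sample as vulnerable when the predicted probability exceeds 0.3;
        # squeeze(1) turns the (batch, 1) output into (batch,) to match the labels
        predictions = (outputs > 0.3).float().squeeze(1)

        # Convert to numpy for easier counting
        predictions = predictions.cpu().numpy()
        labels = labels.cpu().numpy()

        # Update the confusion-matrix counters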
        true_pos += np.sum((predictions == 1) & (labels == 1))
        true_neg += np.sum((predictions == 0) & (labels == 0))
        false_pos += np.sum((predictions == 1) & (labels == 0))
        false_neg += np.sum((predictions == 0) & (labels == 1))

print(f"True Positives: {true_pos}")
print(f"True Negatives: {true_neg}")
print(f"False Positives: {false_pos}")
print(f"False Negatives: {false_neg}")

True Positives: 94
True Negatives: 704
False Positives: 13
False Negatives: 189
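A quick sanity check on these four counts is to confirm that they agree with the dataset statistics from Task 3. The sketch below hard-codes the values printed above; the helper name `check_confusion_counts` is hypothetical and only serves this illustration.

```python
def check_confusion_counts(tp, tn, fp, fn, total, positives):
    """Return True when the four counts agree with the dataset statistics."""
    actual_positives = tp + fn  # every vulnerable sample is either caught or missed
    actual_negatives = tn + fp  # every clean sample is either cleared or flagged
    return (
        actual_positives == positives
        and actual_negatives == total - positives
        and tp + tn + fp + fn == total
    )


# Values copied from the Task 3 and Task 5 outputs
counts_ok = check_confusion_counts(tp=94, tn=704, fp=13, fn=189, total=1000, positives=283)
print(counts_ok)  # True
```

If the check returned `False`, the most likely causes would be a shape mismatch between predictions and labels or samples being counted twice.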


### *Task 6*

Compute the corresponding performance metrics **manually** (do not use PyTorch's predefined metrics):
- Accuracy
- Precision
- Recall
- F1

In [27]:
# Calculate metrics manually
accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)
precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) > 0 else 0
recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print(f"Accuracy: {accuracy:.4f} ({(accuracy*100):.2f}%)")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")



Accuracy: 0.7980 (79.80%)
Precision: 0.8785
Recall: 0.3322
F1 Score: 0.4821


### *Task 7*

Based on your performance metrics, answer the following questions:

- Explain the impact of accuracy vs. F1 score.
- In this particular problem, which metric should one focus on more?
- Is there a better metric suitable for the use case of vulnerability prediction? Why?


## Task 7: Performance Metrics Analysis

### Accuracy vs. F1 Score Impact
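- **Accuracy** measures overall correctness, but it can be misleading on imbalanced datasets. Only 283 of the 1000 samples here are vulnerable, so correct predictions on the majority class dominate the 79.80% score.
- **F1 Score** is the harmonic mean of precision and recall, which makes it better suited to imbalanced cases like vulnerability detection. Here it is only 0.4821 because recall is low (0.3322), even though precision is high (0.8785).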
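The difference between the two metrics can be made concrete by comparing the model with a trivial baseline that labels every sample "not vulnerable". The sketch below uses the confusion-matrix counts from Task 5; the baseline counts are derived from the class split reported in Task 3 (283 vulnerable, 717 not vulnerable), and the helper name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(tp, tn, fp, fn):
    """Compute accuracy and F1 directly from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return accuracy, f1


# Pre-trained model (counts from Task 5)
model_acc, model_f1 = accuracy_and_f1(tp=94, tn=704, fp=13, fn=189)
# Trivial baseline: never predicts "vulnerable"
baseline_acc, baseline_f1 = accuracy_and_f1(tp=0, tn=717, fp=0, fn=283)

print(f"Model:    accuracy={model_acc:.4f}, F1={model_f1:.4f}")
print(f"Baseline: accuracy={baseline_acc:.4f}, F1={baseline_f1:.4f}")
# Model:    accuracy=0.7980, F1=0.4821
# Baseline: accuracy=0.7170, F1=0.0000
```

The baseline finds no vulnerabilities at all yet trails the model by only about 8 percentage points in accuracy, whereas its F1 score drops to zero.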

### Key Metric Focus
For vulnerability prediction, **recall** (true positive rate) is the most critical metric because:
- Missing a vulnerability (false negative) is more dangerous than raising a false alarm (false positive). The model currently misses 189 of the 283 vulnerable samples.
- Security applications prioritize catching as many potential threats as possible.

### Better Metrics for Vulnerability Prediction
The **F2 Score**, which weights recall higher than precision, would be an even better fit because:
1. It emphasizes reducing false negatives.
2. In security contexts, false negatives are more costly than false positives.
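As a minimal sketch, the general F-beta score can be computed manually from the Task 5 counts, with `beta=1` reproducing F1 and `beta=2` giving F2. The function name `f_beta` is hypothetical; the formula used is the standard count-based form, $F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP}$.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score from confusion-matrix counts; beta > 1 favours recall."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0


# Counts from Task 5
f1_score = f_beta(tp=94, fp=13, fn=189, beta=1.0)
f2_score = f_beta(tp=94, fp=13, fn=189, beta=2.0)

print(f"F1 Score: {f1_score:.4f}")  # F1 Score: 0.4821
print(f"F2 Score: {f2_score:.4f}")  # F2 Score: 0.3793
```

The F2 score is lower than the F1 score here because it penalizes the 189 false negatives more heavily, which matches the security-oriented view that missed vulnerabilities matter most.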