# **CDS Project: Part 1**

*Institute of Software Security (E22)*  
*Hamburg University of Technology*  
*SoSe 2023*

## Learning objectives
---

- Use a basic Machine Learning (ML) pipeline with pre-trained models.
- Build your own data loader.
- Load and run a pre-trained ML model.
- Evaluate the performance of an ML model.
- Calculate and interpret performance metrics.

## Materials
---

- Lecture Slides 1, 2, and 3.
- PyTorch Documentation: [Datasets and Data Loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 


## Project Description
---

In this project, you are given an ML model that is pre-trained on a vulnerability dataset. The dataset consists of code samples labeled True or False, depending on the presence or absence of a vulnerability. Your goal is to use the pre-trained model to predict whether the code samples in the validation set contain vulnerabilities, and to analyse the results. Please proceed to the tasks below.

### *Task 1*

Build a data loader for the validation dataset located at the following path: "*data_students/student_dataset.hdf5*". You will use this dataset to validate the performance of the ML model. The dataset is stored in HDF5, a binary format designed for large amounts of data. Make sure that you import and familiarise yourself with the right Python libraries to handle HDF5 files.


In [8]:
# TODO: import the necessary libraries to load the data from the specified path.
import h5py
import numpy as np
import pandas as pd

# Open the HDF5 file read-only; adjust the path to your local copy of the dataset.
data = h5py.File("C:/Users/usha kiran.k/OneDrive/Desktop/cyds/material/lab/data_students/student_dataset.hdf5", 'r')

In [9]:
print(list(data.keys()))

['labels', 'source', 'vectors']
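
The cell above only opens the file. To serve samples in batches, the `vectors` and `labels` keys can be wrapped in a map-style dataset. The following is a minimal sketch, not the course's reference solution: the class name `VulnDataset` is hypothetical, the demo shape `(6, 1, 768)` is an assumption, and the code presumes that `vectors` and `labels` are aligned along the first axis. It accepts anything indexable by key, so an open `h5py.File` or a plain dict of NumPy arrays should both work.

```python
import numpy as np


class VulnDataset:
    """Map-style dataset over aligned 'vectors' and 'labels' arrays (hypothetical helper)."""

    def __init__(self, store):
        # `store` is assumed to behave like a mapping, e.g. an open h5py.File
        # or a dict of NumPy arrays with the keys 'vectors' and 'labels'.
        self.vectors = store["vectors"]
        self.labels = store["labels"]
        if len(self.vectors) != len(self.labels):
            raise ValueError("vectors and labels must have the same length")

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (feature vector, label) pair as float32 arrays.
        vector = np.asarray(self.vectors[idx], dtype=np.float32)
        label = np.asarray(self.labels[idx], dtype=np.float32)
        return vector, label


# Small synthetic stand-in for the HDF5 file, used only for illustration.
demo_store = {
    "vectors": np.zeros((6, 1, 768), dtype=np.float32),
    "labels": np.array([0, 1, 0, 0, 1, 0], dtype=bool),
}
demo_dataset = VulnDataset(demo_store)
first_vector, first_label = demo_dataset[0]
print(len(demo_dataset), first_vector.shape, float(first_label))
```

Because the class implements `__len__` and `__getitem__`, an instance should be usable with `torch.utils.data.DataLoader(demo_dataset, batch_size=64)`, which collates the NumPy arrays into tensors.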


### *Task 2*

Generate a table with 10 random samples from the dataset and show their corresponding labels.

In [16]:
# TODO: display 10 random samples from the loaded dataset
vectors = data['vectors']
labels = data['labels']

random_indices = np.random.choice(vectors.shape[0], 10, replace=False)

for idx in random_indices:
    print("Vector:", vectors[idx])
    print("Label:", labels[idx])
    print("-" * 40)

Vector: [[-1.77435160e-01 -6.05050564e-01 -1.87222695e+00 -2.16816783e-01
  -4.22865301e-02  1.39477170e+00 -2.04704928e+00 -1.39769959e+00
  -8.10957909e-01  2.04176903e+00 -1.11201012e+00  2.18376303e+00
  -4.69477266e-01 -1.29200101e+00  4.00903910e-01 -1.11643839e+00
  -6.76240385e-01  8.45095396e-01 -1.91237181e-01 -2.07130671e+00
   3.03808004e-01 -2.48111367e-01  1.15145934e+00  1.34148157e+00
   1.22134471e+00  6.84932351e-01 -3.83677155e-01  1.75673461e+00
   2.92540044e-01  1.51137829e+00  6.86896205e-01 -1.00152183e+00
  -6.07295930e-01 -8.65779996e-01  7.51211762e-01 -6.81445420e-01
  -3.29223454e-01 -1.30273247e+00 -7.11957097e-01 -7.52116516e-02
  -3.22232485e+00  9.14146379e-03  7.22815275e-01  8.58798444e-01
  -5.30240297e-01 -6.34389400e-01 -1.39112210e+00  4.35938865e-01
   2.01331511e-01  2.13268852e+00 -9.08417702e-02  1.64976072e+00
   2.24996716e-01 -7.04123080e-01 -7.39999354e-01 -8.51388514e-01
  -1.81230021e+00  9.59425449e-01  1.78591013e+00 -3.17791373e-01
  

In [24]:
import numpy as np

# Load the source field from the dataset
sources = data['source'][:]

# Decode from bytes to strings
decoded_sources = [s.decode('utf-8') for s in sources]

# Number of random samples to display
num_samples_to_show = 5
random_indices = np.random.choice(len(decoded_sources), size=num_samples_to_show, replace=False)

# Display only the source code
for i in random_indices:
    print(f"Sample index: {i}")
    print(decoded_sources[i])
    print("-" * 60)


Sample index: 245
recalculate_yres(RawXYZControls *controls)
{
    RawXYZArgs *args = controls->args;
    gint yres;

    if (controls->in_update || !args->xymeasureeq)
        return;

    yres = GWY_ROUND((args->ymax - args->ymin)/(args->xmax - args->xmin)
                     *args->xres);
    yres = CLAMP(yres, 2, 16384);
    set_adjustment_in_update(controls, GTK_ADJUSTMENT(controls->yres), yres);
}
------------------------------------------------------------
Sample index: 188
push_stop_watch()
{
  static struct timeval start_time, end_time;
  static bool start = true;
  if (start) {
    gettimeofday(&start_time, NULL);
    start = false;
    return 0;
  }

  gettimeofday(&end_time, NULL);
  int elapse_msec = (end_time.tv_sec - start_time.tv_sec) * 1000 +
    (int)((end_time.tv_usec - start_time.tv_usec) * 0.001);
  cerr << elapse_msec << " msec" << endl;
  start = true;
  return elapse_msec;
}
------------------------------------------------------------
Sample index: 562
ath10k_d
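
Task 2 asks for a table, whereas the cells above print raw vectors and source code separately. The sketch below shows one possible way to tabulate random samples together with their labels. The helper `sample_rows` and its parameters are hypothetical, and the demo inputs are placeholders rather than entries from the dataset.

```python
import numpy as np


def sample_rows(sources, labels, n=10, seed=None, width=60):
    """Pick n distinct random samples and return them as table rows."""
    rng = np.random.default_rng(seed)
    indices = rng.choice(len(sources), size=n, replace=False)
    rows = []
    for idx in sorted(int(i) for i in indices):
        text = sources[idx]
        if isinstance(text, bytes):  # h5py returns byte strings
            text = text.decode("utf-8", errors="replace")
        # Keep only the first line so each row stays readable.
        first_line = text.strip().splitlines()[0] if text.strip() else ""
        rows.append({"index": idx, "label": bool(labels[idx]), "source": first_line[:width]})
    return rows


# Placeholder data for illustration only (not taken from the dataset).
demo_sources = [f"func_{i}(void)\n{{\n    return {i};\n}}".encode() for i in range(20)]
demo_labels = np.arange(20) % 3 == 0
demo_rows = sample_rows(demo_sources, demo_labels, n=10, seed=0)

print(f"{'index':>5}  {'label':<5}  source")
for row in demo_rows:
    print(f"{row['index']:>5}  {str(row['label']):<5}  {row['source']}")
```

The returned list of dicts can also be passed to `pandas.DataFrame(demo_rows)` to render a table in the notebook.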

### *Task 3*

Inspect the dataset and answer the following questions:
1. How many samples are in the dataset?
2. How many positive examples (vulnerability-labeled instances) are in the dataset?
3. What is the vulnerable/non-vulnerable ratio?

In [31]:
# TODO: inspect and understand the loaded dataset
vectors = data['vectors'][:]
num_samples = vectors.shape[0]
print("Number of samples in the dataset:", num_samples)

labels = data['labels'][:]
# Count the samples labeled as vulnerable (True / 1)
num_positive = int(np.count_nonzero(labels))
print("Number of positive (vulnerable) examples:", num_positive)

# Every remaining sample is non-vulnerable
num_negative = int(labels.size - num_positive)
ratio = num_positive / num_negative
print("Vulnerable / Non-vulnerable ratio:", ratio)

Number of samples in the dataset: 1000
Number of positive (vulnerable) examples: 283
Vulnerable / Non-vulnerable ratio: 0.3947001394700139
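
The ratio printed above divides positives by negatives, which is not the same as the share of positives in the whole dataset. The sketch below makes both quantities explicit. `class_balance` is a hypothetical helper, and the demo labels are synthetic, constructed only to match the counts reported above (283 positives out of 1000 samples).

```python
import numpy as np


def class_balance(labels):
    """Return counts, positive/negative ratio, and positive share for binary labels."""
    flat = np.asarray(labels).astype(bool).ravel()
    n_pos = int(flat.sum())
    n_neg = int(flat.size - n_pos)
    return {
        "positives": n_pos,
        "negatives": n_neg,
        # Guard against division by zero when one class is absent.
        "pos_neg_ratio": n_pos / n_neg if n_neg else float("inf"),
        "positive_share": n_pos / flat.size if flat.size else 0.0,
    }


# Synthetic labels reproducing the reported counts: 283 positive, 717 negative.
demo_labels = np.array([True] * 283 + [False] * 717)
balance = class_balance(demo_labels)
print(balance)
```

With these counts, about 28% of the samples are vulnerable, which corresponds to roughly one vulnerable sample for every 2.5 non-vulnerable ones.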


### *Task 4*

Load and run the following pre-trained neural network model, called `VulnPredictModel`.

``` python 
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

```

In [25]:
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

class VulnPredictModel(nn.Module):
    # initialize the model architecture
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    # forward propagation
    def forward(self, x):
        pred = self.linear_stack(x)
        return pred

# Initialize the model and move it to the device
model = VulnPredictModel().to(device)

# Load the pre-trained weights
# Adjust the path below to your local copy of the model file
model.load_state_dict(torch.load("C:/Users/usha kiran.k/OneDrive/Desktop/cyds/material/lab/model_2023-03-28_20-03.pth", map_location=device))

# Set the model to evaluation mode
model.eval()



Using cpu device


VulnPredictModel(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_stack): Sequential(
    (0): Linear(in_features=768, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=1, bias=True)
    (5): Sigmoid()
  )
)
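
Note that `self.flatten` is defined in the class above but never called in `forward`. The vectors printed in Task 2 begin with `[[`, which suggests (but does not confirm) that each sample has shape `(1, 768)`. Under that assumption, the sketch below checks output shapes with a dummy batch. `SketchModel` is a hypothetical, randomly initialized copy of the architecture, not the pre-trained model, and applying `flatten` in `forward` is a variation on the original code.

```python
import torch
import torch.nn as nn


class SketchModel(nn.Module):
    """Same layer layout as above, but with flatten applied (illustrative variant)."""

    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.flatten(x)          # assumed [N, 1, 768] -> [N, 768]
        return self.linear_stack(x)  # [N, 1] probabilities


torch.manual_seed(0)
sketch_model = SketchModel().eval()
dummy_batch = torch.randn(4, 1, 768)  # random input, for a shape check only

with torch.no_grad():
    out = sketch_model(dummy_batch)

print(out.shape)            # shape with flatten applied
print(out.squeeze(1).shape)  # one probability per sample
```

Without the `flatten` call, `nn.Linear` acts on the last dimension only, so an input of shape `[N, 1, 768]` yields `[N, 1, 1]`; a later `.squeeze()` still reduces this to `[N]` as long as `N > 1`. Since `nn.Flatten` has no parameters, calling it should not affect which weights `load_state_dict` expects.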

### *Task 5*

Make a prediction on the provided dataset and compute the following values:
- True Positives
- True Negatives
- False Positives
- False Negatives

In [33]:

# Convert the vectors and labels to torch tensors and move them to the model's device
inputs = torch.tensor(vectors, dtype=torch.float32).to(device)
true_labels = torch.tensor(labels, dtype=torch.float32).to(device)  # 1.0 = vulnerable, 0.0 = non-vulnerable

# No gradient needed for evaluation
with torch.no_grad():
    # Get prediction probabilities from model (shape: [num_samples, 1])
    outputs = model(inputs).squeeze()  # shape: [num_samples]
    
    # Convert probabilities to binary predictions using 0.5 threshold
    preds = (outputs >= 0.5).float()
    
# Compute TP, TN, FP, FN
TP = ((preds == 1) & (true_labels == 1)).sum().item()
TN = ((preds == 0) & (true_labels == 0)).sum().item()
FP = ((preds == 1) & (true_labels == 0)).sum().item()
FN = ((preds == 0) & (true_labels == 1)).sum().item()

print(f"True Positives: {TP}")
print(f"True Negatives: {TN}")
print(f"False Positives: {FP}")
print(f"False Negatives: {FN}")


# TODO: compute true positives, true negatives, false positives, and false negatives.

True Positives: 20
True Negatives: 716
False Positives: 1
False Negatives: 263
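
The counts above depend on the 0.5 decision threshold applied to the model's probabilities. The sketch below, written with NumPy instead of PyTorch, shows how the four counts shift when the threshold changes. `confusion_counts` is a hypothetical helper, and the probabilities and labels are made up for illustration; they are not model outputs.

```python
import numpy as np


def confusion_counts(probs, labels, threshold=0.5):
    """Count TP, TN, FP, FN for binary labels at a given probability threshold."""
    probs = np.asarray(probs, dtype=float).ravel()
    truth = np.asarray(labels).astype(bool).ravel()
    preds = probs >= threshold  # predicted vulnerable
    return {
        "TP": int(np.sum(preds & truth)),
        "TN": int(np.sum(~preds & ~truth)),
        "FP": int(np.sum(preds & ~truth)),
        "FN": int(np.sum(~preds & truth)),
    }


# Made-up probabilities and labels, for illustration only.
demo_probs = [0.92, 0.40, 0.35, 0.10, 0.55, 0.05]
demo_labels = [1, 1, 1, 0, 0, 0]

for t in (0.5, 0.3):
    print(t, confusion_counts(demo_probs, demo_labels, threshold=t))
```

In this toy example, lowering the threshold from 0.5 to 0.3 turns both false negatives into true positives without adding false positives; on real data, a lower threshold typically trades fewer false negatives for more false positives.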


### *Task 6*

Compute the corresponding performance metrics **manually** (do not use PyTorch's predefined metrics):
- Accuracy
- Precision
- Recall
- F1

In [34]:
# TODO: calculate accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) > 0 else 0
print(f"Accuracy: {accuracy:.4f}")
# TODO: calculate precision
precision = TP / (TP + FP) if (TP + FP) > 0 else 0
print(f"Precision: {precision:.4f}")
# TODO: calculate recall
recall = TP / (TP + FN) if (TP + FN) > 0 else 0
print(f"Recall: {recall:.4f}")
# TODO: calculate F1-score
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
print(f"F1 Score: {f1_score:.4f}")

Accuracy: 0.7360
Precision: 0.9524
Recall: 0.0707
F1 Score: 0.1316
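
As a cross-check, the same values follow directly from the counts reported in Task 5 (TP = 20, TN = 716, FP = 1, FN = 263):

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{20 + 716}{1000} = 0.7360 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{20}{21} \approx 0.9524 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{20}{283} \approx 0.0707 \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{40}{304} \approx 0.1316
\end{aligned}
$$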


### *Task 7*

Based on your performance metrics, answer the following questions:

- Explain the impact of accuracy vs. F1 score.
- In this particular problem, which metric should one focus on more?
- Is there a better metric suitable for the use case of vulnerability prediction? Why?

1. Impact of accuracy vs. F1 score:
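Accuracy measures overall correctness, but on imbalanced datasets it can be misleading because it is dominated by the majority class. The F1 score is the harmonic mean of precision and recall, so it reflects performance on the minority (positive) class. Here the accuracy is 0.7360 while the F1 score is only 0.1316: the model misses 263 of the 283 vulnerable samples, and a classifier that always predicts "non-vulnerable" would still reach an accuracy of 0.717 (717/1000). High accuracy combined with a low F1 score therefore signals poor precision or recall on the positive class, which is why F1 is usually more informative under class imbalance.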

2. Which metric to focus more on?
For vulnerability prediction, the F1 score deserves more attention than accuracy because it balances catching actual vulnerabilities (recall) against avoiding false alarms (precision). Accuracy alone is misleading when vulnerable samples are the minority. High recall is critical so that vulnerabilities are not missed, while high precision keeps reviewers from wasting effort on false positives.

3. A better metric suitable for vulnerability prediction:
Recall (sensitivity) is especially important because missed vulnerabilities (false negatives) can have severe consequences. Alternatively, the Area Under the Precision-Recall Curve (AUPRC) captures performance on imbalanced data across all decision thresholds rather than at a single one, which makes the trade-off between precision and recall explicit. Both metrics prioritize the reliable detection of true vulnerabilities.
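
To illustrate the AUPRC idea, the sketch below computes average precision, a common step-wise approximation of the area under the precision-recall curve, in plain Python. `average_precision` is a hypothetical helper that assumes all scores are distinct (tied scores are not grouped), and the scores and labels are made up for illustration.

```python
def average_precision(scores, labels):
    """Step-wise approximation of the area under the precision-recall curve."""
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        return 0.0  # undefined without positives; 0.0 is returned by convention here
    # Rank samples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    ap = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            tp += 1
            # Each newly found positive raises recall by 1 / n_pos;
            # weight that step by the precision at this rank.
            ap += (tp / rank) / n_pos
    return ap


# Made-up scores and labels, for illustration only.
demo_scores = [0.9, 0.8, 0.7, 0.6]
demo_labels = [1, 0, 1, 0]
demo_ap = average_precision(demo_scores, demo_labels)
print(f"Average precision: {demo_ap:.4f}")
```

In the demo, the positives sit at ranks 1 and 3, so the result is (1/1 + 2/3) / 2. Unlike the metrics in Task 6, this value does not depend on a single threshold such as 0.5.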
