# **CDS Project: Part 1**

*Institute of Software Security (E22)*  
*Hamburg University of Technology*  
*SoSe 2023*

## Learning objectives
---

- Use a basic Machine Learning (ML) pipeline with pre-trained models.
- Build your own data loader.
- Load and run a pre-trained ML model.
- Evaluate the performance of an ML model.
- Calculate and interpret performance metrics.

## Materials
---

- Lecture Slides 1, 2, and 3.
- PyTorch Documentation: [Datasets and Data Loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 


## Project Description
---

In this project, you are given an ML model that is pre-trained on a vulnerability dataset. The dataset consists of code samples labeled True or False, depending on the presence or absence of a vulnerability. Your goal is to use the pre-trained model to predict whether the code samples in the validation set contain vulnerabilities and to analyse the results. Proceed to the tasks below.

### *Task 1*

Build a data loader for the validation dataset located at "*data_students/student_dataset.hdf5*". You will use this dataset to validate the performance of the ML model. The dataset is stored in HDF5, a binary format designed for storing large amounts of data. Make sure that you import and familiarise yourself with the right Python libraries for handling HDF5 files.


In [6]:
!pip install h5py torch



In [7]:
# TODO: import the necessary libraries to load the data from the specified path.
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

# Path to the validation dataset (adjust to your local setup).
file_path = r"C:\Users\srijo\Documents\DS\Cyber DS\material\lab\data_students\student_dataset.hdf5"

# List the top-level keys to see what the file contains.
with h5py.File(file_path, 'r') as f:
    print(list(f.keys()))

class CodeDataset(Dataset):
    """Serves (vector, label) pairs from the HDF5 file."""

    def __init__(self, file_path):
        # Open the file given by the argument; it stays open for the dataset's lifetime.
        self.file = h5py.File(file_path, 'r')
        self.vectors = self.file['vectors']
        self.labels = self.file['labels']

    def __len__(self):
        return len(self.vectors)

    def __getitem__(self, idx):
        x = torch.tensor(self.vectors[idx], dtype=torch.float32)
        y = torch.tensor(self.labels[idx], dtype=torch.long)
        return x, y

dataset = CodeDataset(file_path)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# View a sample batch
for x_batch, y_batch in loader:
    print("X batch shape:", x_batch.shape)
    print("Y batch:", y_batch)
    break  # just show one batch


['labels', 'source', 'vectors']
X batch shape: torch.Size([4, 1, 768])
Y batch: tensor([1, 0, 0, 0])



### *Task 2*

Generate a table with 10 random samples from the dataset and show their corresponding labels.
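
Random draws differ on every run, which makes the resulting table hard to compare between runs. A minimal sketch of reproducible index sampling, assuming a fixed seed is acceptable for this task; the helper name `sample_indices` and the seed value are illustrative and not part of the assignment:

```python
import random

def sample_indices(dataset_size, k=10, seed=0):
    """Return k distinct indices in [0, dataset_size), reproducible for a given seed."""
    # A local generator leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(range(dataset_size), k)

# 1000 is the dataset size assumed here for illustration.
indices = sample_indices(1000, k=10, seed=0)
print(indices)
```

Calling the helper again with the same seed returns the same indices; changing the seed gives a different draw.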

In [9]:
# TODO: display 10 random samples from the loaded dataset
import random
import pandas as pd

# Select 10 random indices
random_indices = random.sample(range(len(dataset)), 10)

# Prepare data for the table
samples = []
for idx in random_indices:
    x, y = dataset[idx]
    samples.append({
        "Index": idx,
        "Vector (first 5 values)": x.squeeze().numpy()[:5],  
        "Label": y.item()
    })


df = pd.DataFrame(samples)
#pd.set_option('display.max_colwidth', None)  # prevents truncation

display(df)


|   | Index | Vector (first 5 values) | Label |
|---|-------|-------------------------|-------|
| 0 | 174 | [0.5804418, 0.75389934, 0.14379884, 0.7272156,... | 0 |
| 1 | 171 | [0.6437178, -0.11841801, -2.5057147, -1.390898... | 0 |
| 2 | 164 | [0.3450719, 1.7583288, -0.265322, 0.7399738, 1... | 0 |
| 3 | 190 | [0.81041217, 1.0358808, -0.2620668, -0.8361679... | 0 |
| 4 | 12 | [0.27083325, -0.8489423, -0.537367, -0.3146219... | 0 |
| 5 | 791 | [-0.47601297, -0.51335967, -0.84174156, -0.401... | 1 |
| 6 | 351 | [-1.3893349, -0.14533298, -1.7307389, 0.456176... | 0 |
| 7 | 574 | [1.3120141, 1.1071213, -0.8958371, -1.0750082,... | 0 |
| 8 | 631 | [0.18595368, 0.35955852, 0.548162, 1.4417868, ... | 1 |
| 9 | 462 | [3.4308567, 0.9149247, -2.5442762, 0.14785768,... | 0 |


### *Task 3*

Inspect the dataset and answer the following questions:
1.  How many samples are in the dataset?
2. How many positive examples (vulnerability-labeled instances) are in the dataset?
3. What is the vulnerable/non-vulnerable ratio?

In [11]:
# TODO: inspect and understand the loaded dataset
num_samples = len(dataset)
print(num_samples)

# Count how often each label occurs; int() turns the stored label into a plain dict key.
label_count = {0: 0, 1: 0}
for label in dataset.labels:
    label_count[int(label)] += 1

print(f"Label 0 (non-vulnerable) count: {label_count[0]}")
print(f"Label 1 (vulnerable) count: {label_count[1]}")
print(f"Ratio of vulnerable/non-vulnerable: {label_count[1]/label_count[0]}")

1000
Label 0 (non-vulnerable) count: 717
Label 1 (vulnerable) count: 283
Ratio of vulnerable/non-vulnerable: 0.3947001394700139
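
The counts above show that the two classes are imbalanced. As a rough sketch (the helper `class_balance` is hypothetical, not part of the assignment), the following computes the share of vulnerable samples and the accuracy that a trivial classifier would reach by always predicting the majority class. That baseline is a useful reference point when reading accuracy figures.

```python
def class_balance(num_positive, num_negative):
    """Summarise a binary label distribution from its two class counts."""
    total = num_positive + num_negative
    return {
        "positive_fraction": num_positive / total,
        "positive_to_negative_ratio": num_positive / num_negative,
        # Accuracy of a trivial classifier that always predicts the majority class.
        "majority_baseline_accuracy": max(num_positive, num_negative) / total,
    }

# Counts taken from the output above: 283 vulnerable, 717 non-vulnerable.
summary = class_balance(283, 717)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```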


### *Task 4*

Load and run the following pre-trained neural network model, called VulnPredictModel.

In [13]:

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")


Using cpu device


In [14]:

from torch import nn

class VulnPredictModel(nn.Module):
    # initialize the model architecture
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    # forward propagation
    def forward(self, x):
        # Note: self.flatten is defined but not applied here, so an input of
        # shape (batch, 1, 768) produces an output of shape (batch, 1, 1).
        pred = self.linear_stack(x)
        return pred


# TODO: initialize and load the model

model = VulnPredictModel()
model.to(device)
model_weights_path = r"C:\Users\srijo\Documents\DS\Cyber DS\material\lab\model_2023-03-28_20-03.pth"
model.load_state_dict(torch.load(model_weights_path, map_location=device))

<All keys matched successfully>

### *Task 5*

Make a prediction on the provided dataset and compute the following values:
- True Positives
- True Negatives
- False Positives
- False Negatives

In [16]:
from sklearn.metrics import confusion_matrix

# TODO: make the prediction for all the samples in the validation set.
model.eval()
true_labels = []
predicted_labels = []

dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=False)

with torch.no_grad():
    for data, labels in dataloader:
        data, labels = data.to(device), labels.to(device)

        # The model returns one sigmoid probability per sample.
        predictions = model(data)

        # Probabilities above 0.5 are assigned to class 1 (vulnerable).
        # reshape(-1) keeps one value per sample, even for a batch of size 1.
        predicted_classes = (predictions.reshape(-1) > 0.5).long()

        true_labels.extend(labels.reshape(-1).cpu().tolist())
        predicted_labels.extend(predicted_classes.cpu().tolist())

# TODO: compute true positives, true negatives, false positives and false negatives.
# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(true_labels, predicted_labels, labels=[0, 1]).ravel()

print(f"True Positives (TP): {tp}")
print(f"True Negatives (TN): {tn}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")

True Positives (TP): 20
True Negatives (TN): 716
False Positives (FP): 1
False Negatives (FN): 263
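
The four counts can also be obtained without scikit-learn. The following is a minimal sketch that thresholds predicted probabilities and tallies each outcome; the function name `confusion_counts` and the toy data are made up for illustration and are not taken from the dataset. Note that it returns the counts in the order TP, TN, FP, FN, which differs from the order produced by `confusion_matrix(...).ravel()`.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, TN, FP, FN for binary labels and predicted probabilities."""
    tp = tn = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        # Same decision rule as in the cell above: strictly above the threshold is class 1.
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

# Toy data (illustrative only).
toy_labels = [1, 0, 1, 0, 1]
toy_probs = [0.9, 0.2, 0.4, 0.7, 0.5]
print(confusion_counts(toy_labels, toy_probs))  # (1, 1, 1, 2)
```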



### *Task 6*

Compute the corresponding performance metrics **manually** (do not use PyTorch's predefined metrics):
- Accuracy
- Precision
- Recall
- F1

In [18]:
# TODO: calculate accuracy
acc = (tp+tn)/(tp+tn+fp+fn)
print(f"Accuracy :{acc}")
# TODO: calculate precision
pre = (tp)/(tp+fp)
print(f"Precision :{pre}")
# TODO: calculate recall
recall = (tp)/(tp+fn)
print(f"Recall :{recall}")
# TODO: calculate F1-score
f1 = 2*(pre*recall)/(pre+recall)
print(f"F1 :{f1}")

Accuracy :0.736
Precision :0.9523809523809523
Recall :0.0706713780918728
F1 :0.13157894736842105
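
For reference, the values printed above follow from the confusion counts reported in Task 5 (TP = 20, TN = 716, FP = 1, FN = 263). The worked equations below use the same formulas as the code cell; the second form of $F_1$ is an algebraically equivalent rewrite in terms of the counts.

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{20 + 716}{1000} = 0.736 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{20}{21} \approx 0.952 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{20}{283} \approx 0.071 \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{40}{304} \approx 0.132
\end{aligned}
$$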


### *Task 7*

Based on your performance metrics, answer the following questions:

- Explain how accuracy and F1 score differ for this model and what each implies about its performance.
- In this particular problem, which metric should one focus on more?
- Is there a metric better suited to the use case of vulnerability prediction? Why?
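
As one possible starting point for the last question, the sketch below computes two metrics that are sometimes considered for imbalanced binary problems: the F-beta score (which weights recall more heavily when beta > 1) and the Matthews correlation coefficient. The function names `f_beta` and `mcc` are illustrative, and whether either metric is actually better suited here is left to your own analysis.

```python
import math

def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion counts; beta > 1 weights recall more than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Confusion counts reported in Task 5.
tp, tn, fp, fn = 20, 716, 1, 263
print(f"F2:  {f_beta(tp, fp, fn, beta=2.0):.4f}")
print(f"MCC: {mcc(tp, tn, fp, fn):.4f}")
```

With `beta=1.0`, `f_beta` reduces to the F1 score computed in Task 6.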
