# **CDS Project: Part 1**

*Institute of Software Security (E22)*  
*Hamburg University of Technology*  
*SoSe 2023*

## Learning objectives
---

- Use a basic Machine Learning (ML) pipeline with pre-trained models.
- Build your own data loader.
- Load and run a pre-trained ML model.
- Evaluate the performance of an ML model.
- Calculate and interpret performance metrics.

## Materials
---

- Lecture Slides 1, 2, and 3.
- PyTorch Documentation: [Datasets and Data Loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 


## Project Description
---

In this project, you are given an ML model that has been pre-trained on a vulnerability dataset. The dataset consists of code samples labeled True or False, depending on the presence or absence of a vulnerability. Your goal is to use the pre-trained model to predict whether the code samples in the validation set contain vulnerabilities, and to analyse the results. Work through the tasks below.

### *Task 1*

Build a data loader for the validation dataset located at "*data_students/student_dataset.hdf5*". You will use this dataset to validate the performance of the ML model. The dataset is stored in HDF5, a binary format designed for storing large amounts of data. Make sure that you import and familiarise yourself with the right Python libraries for handling HDF5 files.


In [1]:
# import the necessary libraries to load the data from the specified path.
import h5py
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch import nn
import pandas as pd
from pathlib import Path
import numpy as np

In [2]:
f = h5py.File("./data_students/student_dataset.hdf5")
print(list(f.keys()))

['labels', 'source', 'vectors']


In [3]:
dset = f['source']
print('dtype', dset.dtype)

dtype object


In [4]:
class VulnDataset(Dataset):
  def __init__(self, filepath):
    self.path = filepath
    self.file_handle = h5py.File(filepath)

  def __len__(self):
    return len(self.file_handle["labels"])

  def __getitem__(self, idx):
    return (self.file_handle["source"][idx], self.file_handle["labels"][idx])
  
  def __getposvuln__(self):
    count_true = list(self.file_handle["labels"]).count(True)
    return count_true


class PreprocessedDataset(Dataset):
  def __init__(self, filepath):
    self.path = filepath
    self.file_handle = h5py.File(filepath)

  def __len__(self):
    return len(self.file_handle["labels"])

  def __getitem__(self, idx):
    return (self.file_handle["vectors"][idx], self.file_handle["labels"][idx])
 
dataset_path = Path("data_students/student_dataset.hdf5")
val_dataset = VulnDataset("data_students/student_dataset.hdf5")
val_loader = DataLoader(val_dataset)

process_dataset = PreprocessedDataset("data_students/student_dataset.hdf5")
preproc_loader = DataLoader(process_dataset)

In [5]:
# print(val_dataset)
print(val_dataset.__getposvuln__())
print(len(preproc_loader.dataset))

283
1000


### *Task 2*

Generate a table with 10 random samples from the dataset and show their corresponding labels.

In [6]:
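# TODO: display 10 random samples from the loaded dataset
# NOTE: head(10) returns the first 10 rows, not a random draw;
# df_samples.sample(n=10) would select 10 rows at random instead.
samples = val_dataset[:]  # reads all (source, label) pairs into memory
df_samples = pd.DataFrame(samples).transpose()
print(df_samples.head(10))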

                                                   0      1
0  b'get_charcode(VMG_ uint argc)\r\n{\r\n    con...  False
1  b"find_open_file_info(char * id) {\n    unsign...  False
2  b'_openipmi_read (ipmi_openipmi_ctx_t ctx,\n  ...   True
3  b'camel_store_get_inbox_folder_sync (CamelStor...   True
4  b"locate_var_of_level_walker(Node *node,\n\t\t...  False
5  b'apply(ast_sent* s) {\n        if (s->get_nod...  False
6  b'addr_ston(const struct sockaddr *sa, struct ...   True
7  b'printStats(const RunSummary& sol, const Solv...  False
8  b'extendtimeline() {\n  if (timeline.recording...  False
9  b'Document(Conf& conf, Encodings& encodings, i...  False
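
Task 2 asks for random samples. A minimal sketch of one way to draw ten distinct random indices is shown below. `pick_random_indices` and `toy_samples` are hypothetical names introduced for illustration, and the toy list only stands in for the real (source, label) pairs. The indices are sorted because h5py expects index lists in increasing order when they are used for fancy indexing.

```python
import random


def pick_random_indices(n_total, k=10, seed=None):
    """Return k distinct indices in [0, n_total), sorted ascending."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return sorted(rng.sample(range(n_total), k))


# Toy stand-in for the dataset: (source snippet, vulnerability label) pairs.
toy_samples = [(f"func_{i}() {{ ... }}", i % 3 == 0) for i in range(1000)]

indices = pick_random_indices(len(toy_samples), k=10, seed=0)
for i in indices:
    source, label = toy_samples[i]
    print(f"{i:>4}  {source:<20}  {label}")
```

With the real dataset, the same indices could be passed one at a time to `val_dataset[i]`; passing a seed is optional and only makes the selection repeatable.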


### *Task 3*

Inspect the dataset and answer the following questions:
1.  How many samples are in the dataset?
2. How many positive examples (vulnerability-labeled instances) are in the dataset?
3. What is the vulnerable/non-vulnerable ratio?

In [7]:
# TODO: inspect and understand the loaded dataset
length_dataset = len(val_dataset)
print(f'length of dataset is {length_dataset}')
count_vuln = val_dataset.__getposvuln__()
print(f'Number of positive examples in dataset: {count_vuln}')
# vuln/non-vuln ratio: positives divided by negatives (not by the total)
count_nonvuln = length_dataset - count_vuln
print(f'The ratio vuln/non-vuln in dataset is {count_vuln/count_nonvuln:.3f}')

length of dataset is 1000
Number of positive examples in dataset: 283
The ratio vuln/non-vuln in dataset is 0.395


### *Task 4*

Load and run the following pre-trained neural network model, called `VulnPredictModel`.

``` python 
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

```

``` python
from torch import nn

class VulnPredictModel(nn.Module):
    # initialize the model architecture
    def __init__(self):
      super().__init__()
      self.flatten = nn.Flatten()
      self.linear_stack = nn.Sequential(
         nn.Linear(768, 64),
         nn.ReLU(),
         nn.Linear(64, 64),
         nn.ReLU(),
         nn.Linear(64, 1),
         nn.Sigmoid()
      )

    # forward propagation (a method of the class, not nested inside __init__)
    def forward(self, x):
      pred = self.linear_stack(x)
      return pred


# TODO: initialize and load the model

```

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cpu device


In [9]:
class VulnPredictModel(nn.Module):
    # initialize the model architecture
    def __init__(self):
      super().__init__()
      self.flatten = nn.Flatten()
      self.linear_stack = nn.Sequential(
         nn.Linear(768, 64),
         nn.ReLU(),
         nn.Linear(64, 64),
         nn.ReLU(),
         nn.Linear(64, 1),
         nn.Sigmoid()
      )

    # forward propagation
    def forward(self, x):
      pred = self.linear_stack(x)
      return pred

In [10]:
# TODO: initialize and load the model
model = VulnPredictModel()
model.load_state_dict(torch.load("model_2023-03-28_20-03.pth", map_location=torch.device('cpu')))
# model.eval()

<All keys matched successfully>
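
The loading step can be checked in isolation with a small round trip. The sketch below is an illustration under stated assumptions: `make_model` builds a tiny hypothetical stand-in network (not the project model), and an in-memory buffer replaces the `.pth` file. It saves a `state_dict`, loads it into a fresh instance with `map_location="cpu"`, and confirms that both models produce the same outputs.

```python
import io

import torch
from torch import nn


def make_model():
    # Small hypothetical stand-in with the same layer types as the project model.
    return nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())


source_model = make_model()

# Save only the parameters (the state_dict), here into an in-memory buffer.
buffer = io.BytesIO()
torch.save(source_model.state_dict(), buffer)
buffer.seek(0)

# Load the parameters into a freshly constructed model of the same architecture.
restored_model = make_model()
state_dict = torch.load(buffer, map_location="cpu")
result = restored_model.load_state_dict(state_dict)
restored_model.eval()  # switch to inference mode before predicting

x = torch.randn(3, 8)
with torch.no_grad():
    same_output = torch.allclose(source_model(x), restored_model(x))

print(result.missing_keys, result.unexpected_keys, same_output)
```

An empty `missing_keys` and `unexpected_keys` corresponds to the "All keys matched successfully" message shown above.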

### *Task 5*

Make a prediction on the provided dataset and compute the following values:
- True Positives
- True Negatives
- False Positives
- False Negatives

In [11]:
# TODO: make the prediction for all the samples in the validation set.

predictions = []
false_pos, false_neg, true_pos, true_neg = 0, 0, 0, 0
loss, correct = 0, 0
loss_func = nn.BCELoss()

model.to(device)
model.eval()  # inference mode
with torch.no_grad():
    # the loader uses the default batch size of 1: one sample per iteration
    for ip, op in preproc_loader:
        ip, op = ip.to(device), op.to(device)
        pred = model(ip)
        predictions.extend(pred)
        # BCELoss expects a float target with the same shape as the prediction
        target = op.float().reshape(pred.shape)
        loss += loss_func(pred, target).item()
        # threshold the sigmoid output at 0.5 to obtain the predicted class
        pred_vuln = pred.item() >= 0.5
        true_vuln = bool(op.item())
        if pred_vuln and true_vuln:
            true_pos += 1
        elif pred_vuln:
            false_pos += 1
        elif true_vuln:
            false_neg += 1
        else:
            true_neg += 1
        # argmax over a single output is always 0, so compare class labels instead
        correct += int(pred_vuln == true_vuln)
loss /= len(preproc_loader)
correct /= len(preproc_loader.dataset)
print(f"Test Error: \nAccuracy: {(100*correct):>0.1f}%, \nAverage loss: {loss:>8f} \n")

# print(predictions)

Test Error: 
Accuracy: 73.6%, 
Average loss: 0.567931 



In [12]:
# TODO: compute true positives, true negatives, false positives and false negatives.
print(f'True Positives: {true_pos}\nTrue Negatives: {true_neg}\nFalse Positives: {false_pos}\nFalse Negatives: {false_neg}')

True Positives: 20
True Negatives: 716
False Positives: 1
False Negatives: 263
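
The counting logic can also be written independently of PyTorch, which makes it easy to check on a handful of values. The sketch below assumes a decision threshold of 0.5 on the sigmoid output; `confusion_counts` and the toy inputs are hypothetical and only illustrate the four cases.

```python
def confusion_counts(probabilities, labels, threshold=0.5):
    """Count (TP, TN, FP, FN) for binary predictions thresholded at `threshold`."""
    tp = tn = fp = fn = 0
    for prob, label in zip(probabilities, labels):
        predicted = prob >= threshold  # predicted class: vulnerable or not
        actual = bool(label)           # ground-truth class
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


# Toy inputs covering each of the four outcomes at least once.
toy_probs = [0.9, 0.2, 0.6, 0.4, 0.1]
toy_labels = [True, False, False, True, False]
print(confusion_counts(toy_probs, toy_labels))  # (1, 2, 1, 1)
```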


### *Task 6*

Compute the corresponding performance metrics **manually** (do not use PyTorch's predefined metrics):
- Accuracy
- Precision
- Recall
- F1

In [13]:
# TODO: calculate accuracy

accuracy = (true_neg + true_pos) / (true_neg + true_pos + false_neg + false_pos)

# TODO: calculate precision

precision = true_pos / (true_pos + false_pos)

# TODO: calculate recall

recall = true_pos / (true_pos + false_neg)

# TODO: calculate F1-score

f1score = 2 * (precision * recall) / (precision + recall)

print(f'Accuracy {accuracy}\nPrecision {precision} \nRecall {recall} \nF1 Score {f1score}')


Accuracy 0.736
Precision 0.9523809523809523 
Recall 0.0706713780918728 
F1 Score 0.13157894736842105
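
The formulas above divide by counts that can be zero, for example when the model never predicts the positive class. The sketch below wraps the same four formulas in a hypothetical helper, `binary_metrics`, and returns 0.0 whenever a denominator is zero. That fallback is a convention chosen for this illustration, not something the assignment prescribes.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (a chosen convention)."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the Task 5 output above.
metrics = binary_metrics(tp=20, tn=716, fp=1, fn=263)
for name, value in metrics.items():
    print(f"{name:<9} {value:.4f}")
```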


### *Task 7*

Based on your performance metrics, answer the following questions:

- Explain the difference in what accuracy and the F1 score tell you about the model.
- In this particular problem, which metric should one focus on more?
- Is there a better metric suitable for the use case of vulnerability prediction? Why?
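
One way to explore the last question is to compute metrics that treat the two classes or the two error types differently. The sketch below is an illustration, not a reference answer: it computes the F-beta score (with beta > 1, recall is weighted more heavily than precision) and the Matthews correlation coefficient from the Task 5 counts. `f_beta` and `mcc` are hypothetical helper names, and returning 0.0 for a zero denominator is an assumed convention.

```python
import math


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion counts; beta=1 gives the usual F1 score."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; ranges from -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Counts taken from the Task 5 output above.
tp, tn, fp, fn = 20, 716, 1, 263
print(f"F1  {f_beta(tp, fp, fn, beta=1.0):.4f}")
print(f"F2  {f_beta(tp, fp, fn, beta=2.0):.4f}")
print(f"MCC {mcc(tp, tn, fp, fn):.4f}")
```

Which of these is most suitable depends on how costly a missed vulnerability is compared with a false alarm, which is the judgement the question asks you to make.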
