# **CDS Project: Part 1**

*Institute of Software Security (E22)*  
*Hamburg University of Technology*  
*SoSe 2023*

## Learning objectives
---

- Use a basic Machine Learning (ML) pipeline with pre-trained models.
- Build your own data loader.
- Load and run a pre-trained ML model.
- Evaluate the performance of an ML model.
- Calculate and interpret performance metrics.

## Materials
---

- Lecture Slides 1, 2, and 3.
- PyTorch Documentation: [Datasets and Data Loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 


## Project Description
---

In this project, you are given an ML model that is pre-trained on a vulnerability dataset. The dataset consists of code samples labeled True or False, depending on the presence or absence of a vulnerability. Your goal is to use the pre-trained model to predict whether the code samples in the validation set contain vulnerabilities and to analyse the results. Please proceed to the tasks below.

### *Task 1*

Build a data loader for the validation dataset at the following path: "*data/dataset.hdf5*". You will use this dataset to validate the performance of the ML model. The dataset is stored in HDF5, a binary format used to store large amounts of data. Make sure that you import and familiarise yourself with the right Python libraries to handle HDF5 files.


In [1]:
import h5py    
import numpy as np   
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import pandas as pd

In [2]:
f = h5py.File("data/dataset.hdf5", "r")  # read-only: the file is only inspected, never modified
f.keys()

<KeysViewHDF5 ['labels', 'source', 'vectors']>

In [3]:
class CyberSecurityDataset(Dataset):
    """Raw source code samples paired with their vulnerability labels."""

    def __init__(self, path):
        # Read-only mode is enough, since the dataset is never modified.
        self.file = h5py.File(path, 'r')

    def __len__(self):
        return len(self.file["labels"])

    def __getitem__(self, idx):
        return (self.file["source"][idx], self.file["labels"][idx])

    def __getlabels__(self):
        # [()] reads the whole HDF5 dataset into a NumPy array.
        return self.file["labels"][()]

    def __getsources__(self):
        return self.file["source"][()]


class PreprocessDataset(Dataset):
    """Pre-computed vectors paired with their vulnerability labels."""

    def __init__(self, path):
        self.file = h5py.File(path, 'r')

    def __len__(self):
        return len(self.file["labels"])

    def __getvectors__(self):
        return self.file["vectors"][()]

    def __getitem__(self, idx):
        # Convert the stored vector to a tensor so it can be fed to the model.
        return (torch.Tensor(self.file["vectors"][idx]), self.file["labels"][idx])
    

In [4]:
dataset = CyberSecurityDataset("data/dataset.hdf5")
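The `DataLoader` class imported in the first cell can wrap any map-style dataset to provide batched iteration. The following is a minimal sketch of that pattern, not the notebook's own solution: `ToyVectorDataset` is a hypothetical in-memory stand-in for the HDF5-backed classes, the 768-dimensional vectors are an assumption chosen to match the model's input size in Task 4, and the batch size of 32 is arbitrary.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyVectorDataset(Dataset):
    """Hypothetical in-memory stand-in for an HDF5-backed dataset."""

    def __init__(self, num_samples=100, dim=768, seed=0):
        gen = torch.Generator().manual_seed(seed)
        # Random vectors and random boolean labels, for illustration only.
        self.vectors = torch.rand(num_samples, dim, generator=gen)
        self.labels = torch.rand(num_samples, generator=gen) < 0.3

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.vectors[idx], self.labels[idx]


toy_dataset = ToyVectorDataset()
# shuffle=False keeps the sample order fixed, which is usual for validation.
toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=False)

# Each iteration yields one batch of stacked vectors and stacked labels.
batch_shapes = [(vectors.shape, labels.shape) for vectors, labels in toy_loader]
print(len(toy_loader), batch_shapes[0], batch_shapes[-1])
```

With 100 samples and a batch size of 32, the last batch is smaller than the others; the loader keeps it unless `drop_last=True` is set.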

### *Task 2*

Generate a table with 10 random samples from the dataset and show their corresponding labels.

In [5]:
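import random

# Draw 10 random indices; randrange excludes the upper bound, so every index is valid
# (randint includes it and could return an out-of-range index).
rows = []
for _ in range(10):
    index = random.randrange(len(dataset))
    source, label = dataset[index]
    rows.append({"source": source, "label": label})

df = pd.DataFrame(rows, columns=["source", "label"])
print(df)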

                                              source  label
0  b'_gf_log_eh (const char *function, const char...   True
1  b'ASU_141( const int& h, const int& k, const i...  False
2  b"container() const\n{\n    // This method is ...  False
3  b'unconvert(obj)\nCcWnnObject obj;\n{\n    if ...  False
4  b'git_odb_object__free(void *object)\n{\n\tif ...  False
5  b"il_separator(c)\n    char c;\n{\n    if ((c ...  False
6  b'attributesSigned(DcmItem& item, DcmAttribute...  False
7  b'AddSharedLibNoSOName(std::string const& item...  False
8  b'xen_vcpu_notify_restore(void *data)\n{\n\t/*...  False
9  b'LoadNyquistEffect(wxString fname)\n{\n   Eff...  False


### *Task 3*

Inspect the dataset and answer the following questions:
1. How many samples are in the dataset?
2. How many positive examples (vulnerability-labeled instances) are in the dataset?
3. What is the vulnerable/non-vulnerable ratio?

In [14]:
# TODO: inspect and understand the loaded dataset
n_samples = len(dataset)
n_positive = dataset.__getlabels__().sum()
print(f"There are {n_samples} samples in the dataset")
print(f"There are {n_positive} positive examples in the dataset")
print(f"The ratio between positive and negative is {n_positive/(n_samples-n_positive)*100}%")

There are 1000 samples in the dataset
There are 283 positive examples in the dataset
The ratio between positive and negative is 39.47001394700139%
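Note that the printed percentage is the ratio of vulnerable to non-vulnerable samples (283 to 717), which differs from the share of vulnerable samples in the whole dataset (283 of 1000). A minimal sketch that computes both from a plain list of boolean labels is shown below; the helper name `class_balance` and the synthetic label list are illustrative assumptions, not part of the provided dataset API.

```python
def class_balance(labels):
    """Summarise a sequence of boolean labels (True = vulnerable)."""
    total = len(labels)
    positives = sum(1 for label in labels if label)
    negatives = total - positives
    return {
        "total": total,
        "positives": positives,
        "negatives": negatives,
        # Share of vulnerable samples in the whole dataset.
        "positive_fraction": positives / total if total else 0.0,
        # Vulnerable samples per non-vulnerable sample.
        "pos_to_neg_ratio": positives / negatives if negatives else float("inf"),
    }


# Synthetic labels with the same counts as reported above.
labels = [True] * 283 + [False] * 717
stats = class_balance(labels)
print(f"positive fraction: {stats['positive_fraction']:.1%}")
print(f"positive/negative ratio: {stats['pos_to_neg_ratio']:.2%}")
```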


### *Task 4*

Load and run the following pre-trained neural network model, called VulnPredictModel.

``` python 
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

```

``` python
from torch import nn

class VulnPredictModel(nn.Module):
    # initialize the model architecture
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    # forward propagation
    def forward(self, x):
        pred = self.linear_stack(x)
        return pred


# TODO: initialize and load the model

```

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cpu device


In [8]:
from torch import nn

class VulnPredictModel(nn.Module):
    # initialize the model architecture
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    # forward propagation
    def forward(self, x):
        pred = self.linear_stack(x)
        return pred

In [9]:
model = VulnPredictModel()
model.load_state_dict(torch.load("model_2023-03-28_20-03.pth", map_location=device))
model.to(device)

VulnPredictModel(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_stack): Sequential(
    (0): Linear(in_features=768, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=1, bias=True)
    (5): Sigmoid()
  )
)

### *Task 5*

Make a prediction on the provided dataset and compute the following values:
- True Positives
- True Negatives
- False Positives
- False Negatives

In [10]:
# TODO: make the prediction for all the samples in the validation set.
processed_dataset = PreprocessDataset("data/dataset.hdf5")
model.eval()  # inference mode; good practice even though this model has no dropout or batch norm

# TODO: compute true positives, true negatives, false positives and false negatives.
FN = 0
FP = 0
TP = 0
TN = 0
with torch.no_grad():  # gradients are not needed for evaluation
    for i in range(len(processed_dataset)):
        data, label = processed_dataset[i]
        # A sigmoid output above 0.5 is counted as a "vulnerable" prediction.
        prediction = model(data.to(device)).item()
        if prediction > 0.5:
            if label:
                TP += 1
            else:
                FP += 1
        else:
            if label:
                FN += 1
            else:
                TN += 1



In [11]:
print(f"Number of false positives: {FP}")
print(f"Number of false negatives: {FN}")
print(f"Number of true positives: {TP}")
print(f"Number of true negatives: {TN}")

Number of false positives: 1
Number of false negatives: 263
Number of true positives: 20
Number of true negatives: 716
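The counting logic above does not depend on PyTorch and can be isolated in a small function, which makes it easy to check on a hand-made example. The following is a minimal sketch under stated assumptions: the function name `confusion_counts`, the 0.5 threshold default, and the toy scores are illustrative and not part of the provided material.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count TP, TN, FP, FN for scores in [0, 1] and boolean labels."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif not predicted_positive and label:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Toy example: two vulnerable samples followed by three non-vulnerable ones.
toy_scores = [0.9, 0.2, 0.6, 0.4, 0.1]
toy_labels = [True, True, False, False, False]
counts = confusion_counts(toy_scores, toy_labels)
print(counts)
```

The four counts always sum to the number of samples, which is a quick sanity check for any implementation.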


### *Task 6*

Compute the corresponding performance metrics **manually** (do not use PyTorch's predefined metrics):
- Accuracy
- Precision
- Recall
- F1

In [12]:
# TODO: calculate accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) > 0 else 0
# TODO: calculate precision
precision = TP / (TP + FP) if (TP + FP) > 0 else 0
# TODO: calculate recall
recall = TP / (TP + FN) if (TP + FN) > 0 else 0
# TODO: calculate F1-score
f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

In [13]:
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1_score}")

Accuracy: 0.736
Precision: 0.9523809523809523
Recall: 0.0706713780918728
F1 Score: 0.13157894736842105
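As a cross-check, substituting the Task 5 counts (TP = 20, TN = 716, FP = 1, FN = 263) into the standard definitions reproduces the printed values:

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{20 + 716}{1000} = 0.736 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{20}{21} \approx 0.952 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{20}{283} \approx 0.071 \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{40}{304} \approx 0.132
\end{aligned}
$$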


### *Task 7*

Based on your performance metrics, answer the following questions:

- Explain the impact of accuracy vs. F1 score.
- In this particular problem, which metric should one focus on more?
- Is there a better metric suitable for the use case of vulnerability prediction? Why?


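Accuracy measures the fraction of all samples that are classified correctly, whereas the F1 score is the harmonic mean of precision and recall, so it is sensitive to both false positives and false negatives but ignores true negatives. Because only 283 of the 1000 samples are vulnerable, the accuracy of 0.736 is misleading: a model that labels every sample as non-vulnerable would already reach 0.717. The F1 score of about 0.13 exposes that the model misses 263 of the 283 vulnerable samples.

In this problem, a false negative is costly because it leaves a vulnerable system undetected. Thus, F1 is more appropriate than accuracy. Since missed vulnerabilities matter most, recall, which directly measures the share of vulnerable samples that are found (here about 0.07), is arguably an even better fit for vulnerability prediction.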
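The effect of class imbalance on accuracy can be illustrated with the counts from Task 5. The sketch below compares the model's accuracy with a hypothetical baseline that predicts "not vulnerable" for every sample, and computes an F-beta score, a generalisation of F1 in which beta > 1 weights recall more heavily than precision. The helper names and the choice of beta = 2 are assumptions made for illustration, not part of the assignment.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total > 0 else 0.0


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from counts; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0


# Counts reported in Task 5.
TP, TN, FP, FN = 20, 716, 1, 263

model_accuracy = accuracy(TP, TN, FP, FN)
# Hypothetical baseline: every sample is predicted as non-vulnerable,
# so all negatives become true negatives and all positives false negatives.
baseline_accuracy = accuracy(tp=0, tn=TN + FP, fp=0, fn=TP + FN)

f1 = f_beta(TP, FP, FN, beta=1.0)
f2 = f_beta(TP, FP, FN, beta=2.0)  # beta = 2 is an illustrative choice

print(f"model accuracy:    {model_accuracy:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"F1: {f1:.3f}, F2: {f2:.3f}")
```

The small gap between the two accuracies shows how little accuracy says about the vulnerable class, while the recall-weighted score drops further because of the 263 missed vulnerabilities.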