# PART 1 - BENCHMARKING ANALYSIS

For the first part of the project, we load a pretrained model from Hugging Face (CyberPeace-Institute/SecureBERT-NER), load the DNRTI dataset, and evaluate the model on that dataset.

Let's start with the imports.

In [85]:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from torch.utils.data import DataLoader
from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
import time
import numpy as np

Defining the model and tokenizer loading function

In [86]:
def load_pretrained_ner_model(model_name="CyberPeace-Institute/SecureBERT-NER"):
    """
    Load the tokenizer and token-classification model for the given Hugging Face model name.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    return tokenizer, model

Defining dataset loading function

In [87]:
def load_dataset(file_path, delimiter=" "):
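    """
    Load a token-per-line dataset: one "token label" pair per line, sentences separated by blank lines.
    Returns parallel lists of token sequences and label sequences.
    """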
    sentences = []
    labels = []
    current_sentence = []
    current_labels = []
    
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip() == "":
                if current_sentence:
                    sentences.append(current_sentence)
                    labels.append(current_labels)
                    current_sentence = []
                    current_labels = []
            else:
                splits = line.strip().split(delimiter)
                if len(splits) >= 2:
                    token, label = splits[0], splits[-1]
                    current_sentence.append(token)
                    current_labels.append(label)
        # Add the last sentence - it doesn't have newline after
        if current_sentence:
            sentences.append(current_sentence)
            labels.append(current_labels)
    
    return sentences, labels
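
As a quick illustration of the file format the loader expects, the sketch below parses a small in-memory sample the same way: one token and its label per line, with a blank line between sentences. The sample text and the helper name `parse_conll` are made up for illustration and are not taken from DNRTI or from the project code.

```python
import io

# Hypothetical sample in the token-per-line format (not real DNRTI data).
sample = "APT28 B-HackOrg\nused O\nMimikatz B-Tool\n\nRussia B-Area\n"

def parse_conll(stream, delimiter=" "):
    """Parse token/label lines into parallel lists of sentences and labels."""
    sentences, labels = [], []
    tokens, tags = [], []
    for line in stream:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        parts = line.split(delimiter)
        if len(parts) >= 2:
            tokens.append(parts[0])   # first column: token
            tags.append(parts[-1])    # last column: label
    if tokens:
        # The last sentence may not be followed by a blank line.
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels

sentences_demo, labels_demo = parse_conll(io.StringIO(sample))
print(sentences_demo)
print(labels_demo)
```

The second sentence has no blank line after it, so it is only captured by the final check after the loop.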

Because the model and the dataset use different tag sets, we map each dataset tag to a model tag with a dictionary. The dataset's OffAct and Way tags both map to the model's ACT tag, and the dataset's Exp tag maps to the model's VULID tag; these targets are arbitrary choices among several valid ones. Dataset tags with no corresponding model tag default to 'O'.

In [88]:
tag_mapping = {
    # Dataset tag : Model tag
    "HackOrg": "APT",
    "SecTeam": "SECTEAM",
    "Idus": "IDTY",
    "Org": "IDTY",
    "OffAct": "ACT", # arbitrary choice among the model's ACT, OS and TOOL tags
    "Way": "ACT", # arbitrary choice among the model's ACT, OS and TOOL tags
    "Exp": "VULID", # arbitrary choice between the model's VULID and VULNAME tags
    "Tool": "MAL",
    "SamFile": "FILE",
    "Time": "TIME",
    "Area": "LOC"
}

def map_tags(labels, tag_mapping, default_tag="O"):
    """
    Map the dataset's tags to the pre-trained model's tags based on the mapping.
    Unmapped tags are set to the default_tag.
    """
    mapped_labels = []
    for label_seq in labels:
        mapped_seq = []
        for label in label_seq:
            mapped_label = tag_mapping.get(label[2:], default_tag)
            if mapped_label != default_tag:
                mapped_label = f"{label[:2]}{mapped_label}"
            mapped_seq.append(mapped_label)
        mapped_labels.append(mapped_seq)
    return mapped_labels
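
To make the mapping rule concrete, here is a minimal sketch applied to a single label sequence. It uses a reduced copy of the dictionary and a hypothetical single-label helper `map_tag`; `B-Purp` stands in for any dataset tag that has no entry in the mapping.

```python
# Reduced copy of the mapping, for illustration only.
demo_mapping = {"HackOrg": "APT", "Tool": "MAL", "Area": "LOC"}

def map_tag(label, mapping, default_tag="O"):
    """Map one BIO label, keeping its B-/I- prefix when a mapping exists."""
    mapped = mapping.get(label[2:], default_tag)
    return mapped if mapped == default_tag else f"{label[:2]}{mapped}"

demo_labels = ["B-HackOrg", "I-HackOrg", "O", "B-Tool", "B-Purp"]
mapped_demo = [map_tag(label, demo_mapping) for label in demo_labels]
print(mapped_demo)
```

The `B-`/`I-` prefix is carried over unchanged, while `O` and unmapped tags both end up as a plain `O`.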


We now tokenize the data and align the word-level labels with the resulting sub-word tokens.

In [89]:
def preprocess_data(tokenizer, sentences, mapped_labels, label_list, max_length=128):
    """
    Tokenize the dataset and align labels with tokens.
    """
    tokenized_inputs = tokenizer(
        sentences,
        truncation=True,
        is_split_into_words=True,
        return_offsets_mapping=True,
        padding=True,
        max_length=max_length
    )

    label_to_id = {label: i for i, label in enumerate(label_list)}
    
    aligned_labels = []
    for i, label in enumerate(mapped_labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens or padding
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # Start of a new word, assign label
                label_ids.append(label_to_id.get(label[word_idx], -100))
            else:
                # Same word, assign -100 - subword tokens ignored
                label_ids.append(-100)

            previous_word_idx = word_idx

        aligned_labels.append(label_ids)
    
    tokenized_inputs["labels"] = aligned_labels
    return tokenized_inputs
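
The alignment rule can be sketched without a tokenizer by writing the `word_ids` by hand. The values below are hypothetical: they assume a two-word sentence whose first word was split into two sub-word pieces, surrounded by special tokens and one padding position. The toy `label_to_id` is also an assumption, not the model's real label inventory.

```python
# Hypothetical word_ids: None marks special or padding tokens,
# a repeated index marks further sub-word pieces of the same word.
word_ids = [None, 0, 0, 1, None, None]
word_labels = ["B-MAL", "O"]            # one label per original word
label_to_id = {"O": 0, "B-MAL": 1}      # assumed toy label inventory

IGNORE = -100  # positions with this id are skipped during evaluation
aligned = []
previous = None
for word_idx in word_ids:
    if word_idx is None or word_idx == previous:
        # Special token, padding, or a non-initial sub-word piece.
        aligned.append(IGNORE)
    else:
        # First piece of a new word carries the word's label.
        aligned.append(label_to_id.get(word_labels[word_idx], IGNORE))
    previous = word_idx

print(aligned)
```

Only the first piece of each word keeps a real label id, so each word is scored exactly once no matter how many pieces the tokenizer produces.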

In the project's definition, we stated that the model's ACT, OS and TOOL tags are all valid predictions for the dataset's OffAct and Way tags, and that VULID and VULNAME are valid predictions for the dataset's Exp tag. A predicted TOOL or OS tag is therefore equivalent to a predicted ACT tag. Because we arbitrarily chose ACT as the mapping target above, we convert TOOL and OS predictions to ACT so that they count as correct. The same holds for VULNAME, which we convert to VULID.
Also, some of the model's classes have no counterpart in our dataset. We want to ignore those predictions, so we map those classes to 'O'.

In [90]:
tag_mapping_correction = {
    # Model tag : Model tag
    "TOOL": "ACT",
    "OS": "ACT",
    "VULNAME": "VULID",
    "DOM": "O",
    "ENCR": "O",
    "IP": "O",
    "URL": "O",
    "MD5": "O",
    "PROT": "O",
    "EMAIL": "O",
    "SHA1": "O",
    "SHA2": "O"
}

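def fix_unmapped_model_labels(pred_label: str) -> str:
    """
    Collapse equivalent model tags (e.g. TOOL -> ACT) and map model-only tags to 'O'.
    """
    if pred_label == "O":
        return pred_label
    prefix = pred_label[:2]  # "B-" or "I-"
    postfix = pred_label[2:]
    fixed_label = tag_mapping_correction.get(postfix, postfix)
    if fixed_label != "O":
        fixed_label = f"{prefix}{fixed_label}"
    return fixed_label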

def align_predictions(predictions, labels, label_list):
    """
    Align the predictions with the labels, ignoring special tokens.
    """
    true_labels = []
    true_preds = []
    for i, label in enumerate(labels):
        current_true = []
        current_pred = []
        for j, lab in enumerate(label):
            if lab != -100:
                current_true.append(label_list[lab])
                pred_label = label_list[predictions[i][j]]
                current_pred.append(fix_unmapped_model_labels(pred_label))
        true_labels.append(current_true)
        true_preds.append(current_pred)
    
    return true_labels, true_preds
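
The effect of the correction step on raw predictions can be sketched as follows. The dictionary is a reduced copy of the one above, and `normalize_prediction` is a hypothetical stand-alone helper written only for this example.

```python
# Reduced copy of the correction mapping, for illustration only.
demo_correction = {"TOOL": "ACT", "OS": "ACT", "VULNAME": "VULID", "IP": "O"}

def normalize_prediction(pred_label, correction):
    """Collapse equivalent tags and drop tags the dataset does not annotate."""
    if pred_label == "O":
        return "O"
    prefix, tag = pred_label[:2], pred_label[2:]
    tag = correction.get(tag, tag)   # tags without an entry are kept as-is
    return "O" if tag == "O" else f"{prefix}{tag}"

preds_demo = ["B-TOOL", "I-OS", "B-VULNAME", "B-IP", "B-APT", "O"]
normalized = [normalize_prediction(p, demo_correction) for p in preds_demo]
print(normalized)
```

Tags such as `APT` that need no correction pass through untouched, so the step only changes predictions that would otherwise be counted as errors by construction.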

Torch dataset wrapper

In [91]:
class NERDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings["input_ids"])

Finally, we define an evaluation function that evaluates the model and outputs metrics.

In [92]:
def evaluate_model(model, tokenized_data, label_list, batch_size=32):
    """
    Run the model on the dataset and evaluate.
    """
    model.eval()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    dataset = NERDataset(tokenized_data)
    dataloader = DataLoader(dataset, batch_size=batch_size)
    
    all_predictions = []
    all_labels = []
    latencies = []
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels']
            
            # Measure start time
            start_time = time.perf_counter()
            
            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits  # Shape: (batch_size, seq_length, num_labels)
            preds = torch.argmax(logits, dim=2)
            
            # CUDA kernels run asynchronously, so wait for them before stopping the timer
            if device.type == "cuda":
                torch.cuda.synchronize()
            
            # Measure end time
            end_time = time.perf_counter()
            latency = end_time - start_time
            latencies.append(latency)
            
            preds = preds.detach().cpu().numpy()
            all_predictions.append(preds)
            all_labels.append(labels.numpy())
    
    all_labels = np.concatenate(all_labels, axis=0)
    all_predictions = np.concatenate(all_predictions, axis=0)
    
    # Align predictions and labels
    true_labels, true_preds = align_predictions(all_predictions, all_labels, label_list)
    
    # Calculate average latency
    average_latency = np.mean(latencies)
    
    # Calculate metrics using seqeval
    precision = precision_score(true_labels, true_preds, average='macro')
    recall = recall_score(true_labels, true_preds, average='macro')
    f1 = f1_score(true_labels, true_preds, average='macro')
    report = classification_report(true_labels, true_preds)
    
    print(f"Average Latency: {average_latency:.4f} seconds per batch")
    print("\nEvaluation Metrics:")
    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1-Score:  {f1:.4f}")
    print("\nDetailed Classification Report:")
    print(report)
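
seqeval scores whole entity spans rather than individual tokens: a prediction counts as correct only if both its type and its boundaries match the gold annotation. The sketch below is a simplified pure-Python illustration of that idea, not seqeval's actual implementation; `extract_spans` and the two toy sequences are assumptions made for the example.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in one BIO sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current
        )
        if current is not None and (tag == "O" or starts_new):
            spans.add((current, start, i - 1))
            current = None
        if starts_new:
            current, start = tag[2:], i
    return spans

gold = ["B-APT", "I-APT", "O", "B-MAL"]
pred = ["B-APT", "O", "O", "B-MAL"]   # the APT span is cut short

gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
correct = len(gold_spans & pred_spans)
precision = correct / len(pred_spans)
recall = correct / len(gold_spans)
print(gold_spans, pred_spans, precision, recall)
```

In this toy case three of the four tokens are tagged correctly, yet only one of the two entities matches exactly, which is why entity-level scores are usually lower than token-level accuracy.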



Now let's run the code, starting by loading the pre-trained model and tokenizer.

In [93]:
tokenizer, model = load_pretrained_ner_model()

Loading the dataset

In [94]:
dataset_path = "data/DNRTI/test.txt"

sentences, labels = load_dataset(dataset_path)

Applying the tag mapping to align the dataset's labels with the model's labels

In [95]:
mapped_labels = map_tags(labels, tag_mapping, default_tag="O")

Defining the model's label list

In [96]:
model_labels = model.config.id2label
label_list = [model_labels[i] for i in range(len(model_labels))]

Preprocessing the data

In [97]:
tokenized_data = preprocess_data(tokenizer, sentences, mapped_labels, label_list)

And calling the evaluation function to print our metrics!

In [98]:
evaluate_model(model, tokenized_data, label_list)

Average Latency: 0.9159 seconds per batch

Evaluation Metrics:
Precision: 0.7227
Recall:    0.7058
F1-Score:  0.7088

Detailed Classification Report:
              precision    recall  f1-score   support

         ACT       0.35      0.52      0.42       250
         APT       0.79      0.58      0.67       369
        FILE       0.92      0.70      0.79       248
        IDTY       0.71      0.77      0.74       266
         LOC       0.80      0.78      0.79       216
         MAL       0.59      0.63      0.61       315
     SECTEAM       0.84      0.89      0.87       152
        TIME       0.81      0.79      0.80       169
       VULID       0.69      0.70      0.70       132

   micro avg       0.68      0.68      0.68      2117
   macro avg       0.72      0.71      0.71      2117
weighted avg       0.71      0.68      0.69      2117
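
The headline precision, recall and F1 printed above are macro averages, so every class counts equally regardless of its support. As a sanity check, the sketch below recomputes the macro and support-weighted averages from the per-class rows of the report. The inputs are the two-decimal values shown in the table, so the results can only match the report to within rounding error.

```python
# (precision, recall, f1, support) per class, copied from the report above.
report_rows = {
    "ACT": (0.35, 0.52, 0.42, 250),
    "APT": (0.79, 0.58, 0.67, 369),
    "FILE": (0.92, 0.70, 0.79, 248),
    "IDTY": (0.71, 0.77, 0.74, 266),
    "LOC": (0.80, 0.78, 0.79, 216),
    "MAL": (0.59, 0.63, 0.61, 315),
    "SECTEAM": (0.84, 0.89, 0.87, 152),
    "TIME": (0.81, 0.79, 0.80, 169),
    "VULID": (0.69, 0.70, 0.70, 132),
}

rows = list(report_rows.values())
total_support = sum(row[3] for row in rows)

def macro(column):
    """Unweighted mean over classes."""
    return sum(row[column] for row in rows) / len(rows)

def weighted(column):
    """Mean over classes, weighted by each class's support."""
    return sum(row[column] * row[3] for row in rows) / total_support

print(f"total support: {total_support}")
for name, column in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:>9}: macro={macro(column):.4f} weighted={weighted(column):.4f}")
```

The low-precision ACT class pulls the macro precision down by the same amount as any other class, whereas in the weighted average its influence is proportional to its 250 supporting entities.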

