This notebook demonstrates the procedure used to train the multi-label models during the 2023 TRAM effort.

Only the annotations produced during this effort can be adapted for multi-label modeling. First, we will load the annotations.

In [None]:
import pandas as pd

data = pd.read_json('multi_label.json')
all_labels = data['labels'].explode().dropna().unique()
data

index,sentence,labels,doc_title
0,title: NotPetya Technical Analysis – A Triple ...,[],NotPetya Technical Analysis A Triple Threat F...
1,Executive Summary This technical analysis prov...,[],NotPetya Technical Analysis A Triple Threat F...
2,For more information on CrowdStrike’s proactiv...,[],NotPetya Technical Analysis A Triple Threat F...
3,NotPetya combines ransomware with the ability ...,[],NotPetya Technical Analysis A Triple Threat F...
4,It spreads to Microsoft Windows machines using...,[T1210],NotPetya Technical Analysis A Triple Threat F...
...,...,...,...
19173,[2] Eclypsium Blog - TrickBot Now Offers 'Tric...,[],AA21076A TrickBot Malware
19174,"Initial Version March 24, 2021:",[],AA21076A TrickBot Malware
19175,Added MITRE ATT&CK Technique T1592.003 used fo...,[],AA21076A TrickBot Malware
19176,Added new MITRE ATT&CKs and updated Table 1,[],AA21076A TrickBot Malware


In [None]:
import transformers
import torch

mode: 'bert or gpt' = 'bert'
cuda = torch.device('cuda')

if mode == 'bert':
    model = transformers.BertForSequenceClassification.from_pretrained(
        "allenai/scibert_scivocab_uncased",
        num_labels=len(all_labels),
        output_attentions=False,
        output_hidden_states=False,
    )
    tokenizer = transformers.BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased", max_length=512)
elif mode == 'gpt':
    model = transformers.GPT2ForSequenceClassification.from_pretrained(
        "gpt2",
        num_labels=len(all_labels),
        output_attentions=False,
        output_hidden_states=False,
    )
    tokenizer = transformers.GPT2Tokenizer.from_pretrained("gpt2", max_length=512)
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id
else:
    raise ValueError(f"mode must be one of bert or gpt, but is {mode = !r}")

model.train().to(cuda)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31090, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

For single-label modeling, we represented the labels using one-hot encoding. The representation here is similar, except that a row can contain more than one `1` if it represents a multi-label instance, and a row contains no `1`s at all if it represents a negative sample.
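
To make the representation concrete, here is a minimal pure-Python sketch of multi-hot encoding. The three technique IDs and the helper name `multi_hot` are illustrative assumptions, not part of the TRAM code. The notebook itself relies on scikit-learn's `MultiLabelBinarizer`, which likewise orders its columns by sorted class name.

```python
# Illustrative label vocabulary (a hypothetical subset of ATT&CK technique IDs).
classes = sorted({"T1027", "T1140", "T1210"})

def multi_hot(labels, classes):
    """Return a 0/1 vector with a 1 in each column whose class is in `labels`."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]

rows = [
    ["T1210"],           # single-label instance: exactly one 1
    ["T1027", "T1140"],  # multi-label instance: more than one 1
    [],                  # negative sample: all zeros
]
encoded = [multi_hot(r, classes) for r in rows]
for labels, vector in zip(rows, encoded):
    print(labels, vector)
```

Each row has one column per known label, so the label tensor has the same width regardless of how many labels a sentence carries.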

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer as MLB
from sklearn.model_selection import train_test_split

mlb = MLB()
mlb.fit([[c] for c in all_labels])

train, test = train_test_split(data, test_size=0.2, shuffle=True)

def load_data(x, y, batch_size=10):
    """Yield (inputs, labels) mini-batches, moving each batch to the GPU as it is served."""
    x_len, y_len = x.shape[0], y.shape[0]
    assert x_len == y_len, "inputs and labels must have the same number of rows"
    for i in range(0, x_len, batch_size):
        slc = slice(i, i + batch_size)
        yield x[slc].to(cuda), y[slc].to(cuda)

def tokenize(instances: 'list[str]'):
    """Return input IDs, one row per sentence, padded or truncated to 512 tokens."""
    return tokenizer(instances, return_tensors='pt', padding='max_length', truncation=True, max_length=512).input_ids

def encode_labels(labels):
    """Multi-hot encode the `labels` column (a Series) of the DataFrame as a float tensor."""
    return torch.Tensor(mlb.transform(labels.to_numpy()))

In [None]:
x_train = tokenize(train['sentence'].tolist())
x_train

tensor([[  102,  2289,  5290,  ...,     0,     0,     0],
        [  102,  4624,   137,  ...,     0,     0,     0],
        [  102,  1352,   115,  ...,     0,     0,     0],
        ...,
        [  102,   121,   755,  ...,     0,     0,     0],
        [  102,   256,   241,  ...,     0,     0,     0],
        [  102, 10037,   862,  ...,     0,     0,     0]])

In [None]:
y_train = encode_labels(train['labels'])
y_train

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

In [None]:
from statistics import mean

from tqdm import tqdm
from torch.optim import AdamW

optim = AdamW(model.parameters(), lr=2e-5, eps=1e-8)

for epoch in range(6):
    epoch_losses = []
    for x, y in tqdm(load_data(x_train, y_train, batch_size=10)):
        model.zero_grad()
        # Mask out padding positions; passing the float multi-hot `labels` makes the model return a loss
        out = model(x, attention_mask=x.ne(tokenizer.pad_token_id).to(int), labels=y)
        epoch_losses.append(out.loss.item())
        out.loss.backward()
        optim.step()
    print(f"epoch {epoch + 1} loss: {mean(epoch_losses)}")

1535it [09:33,  2.68it/s]
epoch 1 loss: 0.05419413093265134
1535it [09:36,  2.66it/s]
epoch 2 loss: 0.02837962201198848
1535it [09:37,  2.66it/s]
epoch 3 loss: 0.023639297901564778
1535it [09:37,  2.66it/s]
epoch 4 loss: 0.01821894876947869
1535it [09:37,  2.66it/s]
epoch 5 loss: 0.013511397138845437
1535it [09:37,  2.66it/s]
epoch 6 loss: 0.010214085078359117
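
When `labels` is a float tensor and `num_labels` is greater than 1, the `transformers` sequence-classification heads treat the task as multi-label and, as far as we are aware, compute a binary cross-entropy loss over the logits (`BCEWithLogitsLoss`), averaged over every instance-label pair. The sketch below reproduces that computation in plain Python for one hypothetical instance. The logits, targets, and the helper name `bce_with_logits` are made up for illustration.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy between sigmoid(logits) and 0/1 targets."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form of -[t*log(s) + (1-t)*log(1-s)], with s = sigmoid(z)
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

logits = [2.0, -3.0, 0.0]   # one raw score per label (made-up values)
targets = [1.0, 0.0, 0.0]   # multi-hot row: only the first label applies
loss = bce_with_logits(logits, targets)
print(loss)
```

Because the mean runs over all label columns, most of which are 0 for any given sentence, this averaging is a plausible reason the epoch losses above are small in absolute terms.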





In [None]:
# Switch to inference mode (disables dropout)
model.eval()

x_test = tokenize(test['sentence'].tolist())

batch_size = 20
preds = []

with torch.no_grad():
    for i in range(0, x_test.shape[0], batch_size):
        x = x_test[i : i + batch_size].to(cuda)
        out = model(x, attention_mask=x.ne(tokenizer.pad_token_id).to(int))
        preds.extend(out.logits.to('cpu'))


In [None]:
binary_preds = torch.vstack(preds).sigmoid().gt(.5).to(int)

predicted = pd.Series(mlb.inverse_transform(binary_preds)).apply(frozenset).reset_index(drop=True)
actual = test['labels'].apply(frozenset).reset_index(drop=True)
results = pd.concat({'preds': predicted, 'actual': actual}, axis=1)

results

index,preds,actual
0,(T1140),(T1140)
1,(),()
2,(T1027),(T1027)
3,(),()
4,(T1027),(T1027)
...,...,...
3831,(),()
3832,(),()
3833,(),()
3834,(),()
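
The decoding step in the cell above can be sketched without tensors: each logit passes through a sigmoid independently, and a label is predicted when its probability exceeds 0.5 (equivalently, when its logit is positive). The label order, the logits, and the helper name `decode` below are assumptions for illustration only.

```python
import math

def decode(logits, classes, threshold=0.5):
    """Return the set of classes whose sigmoid probability exceeds `threshold`."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return frozenset(c for c, p in zip(classes, probs) if p > threshold)

classes = ["T1027", "T1140", "T1210"]          # hypothetical label order
predicted = decode([1.2, -0.4, 3.0], classes)  # made-up logits for one sentence
empty = decode([-2.0, -1.5, -0.1], classes)    # no label clears the threshold
print(sorted(predicted), sorted(empty))
```

Since the labels are thresholded independently, a sentence can receive several labels or none at all, which is how the empty sets in the table arise.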


The formulae for precision, recall, and F1 are the same for multi-label evaluation as for single-label evaluation, but the procedure for counting true positives, false positives, and false negatives is not.

Where $P$ is the set of predicted labels for a given instance, and $A$ is the set of actual labels for the same instance, the true positives are the labels in $P \cap A$, the false positives are those in $P - A$, and the false negatives are those in $A - P$.

To give an example, if the actual labels for a sample are $\{a, c\}$, and the model predicts $\{c, d\}$, that is a true positive for $c$, a false positive for $d$, and a false negative for $a$.
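
The counting rule can be sketched with plain Python sets. The helper name `count_outcomes` is hypothetical; the first pair reproduces the example above, and the second shows that a correctly predicted negative sample contributes nothing to any count.

```python
from collections import Counter

def count_outcomes(pairs):
    """Tally per-label TP/FP/FN over (predicted, actual) label-set pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for predicted, actual in pairs:
        tp.update(predicted & actual)  # predicted and annotated
        fp.update(predicted - actual)  # predicted but not annotated
        fn.update(actual - predicted)  # annotated but missed
    return tp, fp, fn

pairs = [
    (frozenset({"c", "d"}), frozenset({"a", "c"})),  # the example from the text
    (frozenset(), frozenset()),                      # negative sample, predicted correctly
]
tp, fp, fn = count_outcomes(pairs)
print(dict(tp), dict(fp), dict(fn))
```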

In [None]:
tp = results.apply((lambda r: r.preds & r.actual), axis=1).explode().value_counts()
fp = results.apply((lambda r: r.preds - r.actual), axis=1).explode().value_counts()
fn = results.apply((lambda r: r.actual - r.preds), axis=1).explode().value_counts()

support = actual.explode().value_counts().rename('#')

counts = pd.concat({'tp': tp, 'fp': fp, 'fn': fn}, axis=1).fillna(0).astype(int)

p = counts.tp.div(counts.tp + counts.fp).fillna(0)
r = counts.tp.div(counts.tp + counts.fn).fillna(0)
f1 = (2 * p * r) / (p + r)
scores = pd.concat({'P': p, 'R': r, 'F1': f1}, axis=1).fillna(0).sort_values(by='F1', ascending=False)

# calculate macro scores
scores.loc['(macro)'] = scores.mean()

# calculate micro scores
micro = counts.sum()
scores.loc['(micro)', 'P'] = mP = micro.tp / (micro.tp + micro.fp)
scores.loc['(micro)', 'R'] = mR = micro.tp / (micro.tp + micro.fn)
scores.loc['(micro)', 'F1'] = (2 * mP * mR) / (mP + mR)

scores.join(support)

label,P,R,F1,#
T1056.001,0.833333,0.833333,0.833333,12.0
T1574.002,1.0,0.7,0.823529,20.0
T1548.002,1.0,0.666667,0.8,6.0
T1140,0.807229,0.72043,0.761364,93.0
T1047,0.818182,0.692308,0.75,13.0
T1055,0.719298,0.706897,0.713043,58.0
T1053.005,0.764706,0.65,0.702703,20.0
T1218.011,0.75,0.642857,0.692308,14.0
T1003.001,0.84,0.583333,0.688525,36.0
T1059.003,0.730159,0.638889,0.681481,72.0
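
The difference between the macro and micro rows computed in the cell above can be illustrated with a small sketch. The two labels and their counts are hypothetical, not taken from the TRAM results. The identity $F_1 = 2TP / (2TP + FP + FN)$ is used in place of the precision/recall form; the two are algebraically equivalent.

```python
# Hypothetical per-label counts: label -> (tp, fp, fn)
counts = {
    "frequent": (90, 10, 10),
    "rare": (1, 1, 3),
}

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Macro: average the per-label scores, so every label weighs the same.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts across labels first, so frequent labels dominate.
pooled = [sum(col) for col in zip(*counts.values())]
micro_f1 = f1_from_counts(*pooled)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

In this toy case the poorly predicted rare label pulls the macro score well below the micro score, which is why it helps to report both.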


In [None]:
# Save the trained model and tokenizer
model.save_pretrained("saved_model")
tokenizer.save_pretrained("saved_model")

# Save the label binarizer (used to decode predictions later)
import joblib
joblib.dump(mlb, "saved_model/label_binarizer.pkl")


['saved_model/label_binarizer.pkl']