# ARINC Fingerprinting BERT Multi-Label Classifier

The built-in loss of Hugging Face's `BertForSequenceClassification` targets single-label classification (`CrossEntropyLoss`), so we bypass it and apply our own loss function (`BCEWithLogitsLoss`) to the model's logits.

Also, `sigmoid` is used instead of `softmax` at the final layer because it scores each label independently, so several labels can be active for the same input.

For more details, see [Transformer for Multi-Label](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1).
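
To make the sigmoid/softmax distinction concrete, here is a minimal sketch on made-up logits (the tensor values are illustrative assumptions, not taken from the dataset). It shows that softmax probabilities compete and sum to 1, that sigmoid probabilities are independent per label, and that `BCEWithLogitsLoss` gives the same value as applying `sigmoid` followed by `BCELoss`.

```python
import torch
import torch.nn as nn

# Hypothetical logits for one sample and three labels
logits = torch.tensor([[2.0, -1.0, 0.5]])
# Multi-hot target: labels 0 and 2 are both active
targets = torch.tensor([[1.0, 0.0, 1.0]])

softmax_probs = logits.softmax(dim=1)  # labels compete, probabilities sum to 1
sigmoid_probs = logits.sigmoid()       # one independent probability per label

# BCEWithLogitsLoss applies the sigmoid internally
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_manual = nn.BCELoss()(sigmoid_probs, targets)

print("softmax sum:", softmax_probs.sum().item())
print("sigmoid probs:", sigmoid_probs)
print("losses:", loss_with_logits.item(), loss_manual.item())
```

Because the sigmoid outputs do not have to sum to 1, two labels can both receive a high probability, which is what a multi-hot target requires.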


Import related libraries:

In [176]:
!pip install transformers
!pip install torch

'''Train with PyTorch.'''
# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
import torch.utils.data as data

# BERT Related Libraries
from transformers import BertTokenizer, BertForSequenceClassification

# Python
import pandas as pd
import numpy as np
import os
import time


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Declaring machine learning parameters:

In [177]:
# ML Parameters
# NOTE: 1e-2 is unusually high for fine-tuning BERT; values around 2e-5 to 5e-5 are more typical
lr = 1e-2
epoch = 5
batch_size = 64


Data Source:

In [178]:
train_path = "/content/drive/MyDrive/dataset/features.csv"
labels_path = "/content/drive/MyDrive/dataset/labels.csv"
####
train_df = pd.read_csv(train_path)
# train_df.set_index('ProductId',drop=True,inplace=True)
train_df.drop(columns=['ProductId'],inplace=True)
train_df.rename(columns = {'MarketingDescription_DE':'text'}, inplace = True)

labels_df = pd.read_csv(labels_path)
labels_df.drop(columns=['ProductId'],inplace=True)

# labels_df.set_index('ProductId',drop=True,inplace=True)

texts = pd.concat([train_df,labels_df],axis=1)
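
Since `ProductId` is dropped from both frames before `pd.concat(..., axis=1)`, the rows are joined purely by position. The sketch below uses two hypothetical miniature frames (their contents are assumptions for illustration) to show one way of checking that both files list the same products in the same order before the key column is discarded.

```python
import pandas as pd

# Hypothetical miniature versions of features.csv and labels.csv
features = pd.DataFrame({"ProductId": [10, 11, 12], "text": ["a", "b", "c"]})
labels = pd.DataFrame({"ProductId": [10, 11, 12], "2542": [1, 0, 0], "3352": [0, 1, 1]})

# Both files must list the same products in the same order,
# otherwise a positional concat would pair texts with the wrong labels
assert features["ProductId"].equals(labels["ProductId"])

combined = pd.concat(
    [features.drop(columns=["ProductId"]), labels.drop(columns=["ProductId"])],
    axis=1,
)
print(combined)
```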

In [179]:
train_df

Unnamed: 0,text
0,produktreihe netgear gigabit unmanaged switche...
1,hochwertiges flexibles patchkabel paar gesamta...
2,verpassen geniessen lifecam cinema hochauflöse...
3,rj45 patchkabel cat 6a anwendungen 10 gbit eth...
4,vorhangschloss abus safe code 78 lässt tresor ...
...,...
5978,44mm chalk link bracelet small
5979,wechsel armbändern kompatible armband probleml...
5980,silikon case magsafe apple speziell iphone 12 ...
5981,silikon case magsafe apple speziell iphone 12 ...


Create one data accessor (for PyTorch to read the data above easily):

In [180]:
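class SentenceDataset(data.Dataset):

    def __init__(self, database):
        self.database = database
        # names of the label columns (taken from the global labels_df)
        self.label_columns = labels_df.columns.tolist()

    def __len__(self):
        # only the first 1000 rows are used;
        # return self.database.shape[0] to use the full dataset
        return 1000

    def __getitem__(self, idx):

        # the sentence
        sentence = self.database["text"][idx]
        # the multi-hot label vector; float32 matches the dtype of the model logits
        label = self.database.loc[idx, self.label_columns]
        label = np.array(label, dtype=np.float32)

        return sentence, label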
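
As a rough illustration of what a `DataLoader` produces from a dataset of this shape, the sketch below uses a hypothetical `ToySentenceDataset` with three sentences and two labels (both the class and its data are assumptions, not part of the notebook). Each batch contains the sentences as a sequence of strings and the labels stacked into one tensor.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToySentenceDataset(Dataset):
    """Hypothetical stand-in for SentenceDataset with two labels per sentence."""

    def __init__(self):
        self.sentences = ["usb kabel", "silikon case", "patchkabel cat 6a"]
        self.labels = np.array([[1, 0], [0, 1], [1, 1]], dtype=np.float32)

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return self.sentences[idx], self.labels[idx]


loader = DataLoader(ToySentenceDataset(), batch_size=2, shuffle=False)
sentences, labels = next(iter(loader))

print(len(sentences), list(sentences))  # strings are passed through, not tensorised
print(labels.dtype, labels.shape)       # label rows are stacked into one tensor
```

Keeping the label arrays in `float32` means the stacked tensor already has the dtype that `BCEWithLogitsLoss` expects alongside float32 logits.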


In [181]:
labels_df.columns.tolist()

['2542',
 '3352',
 '4061',
 '1997',
 '3621',
 '3907',
 '1622',
 '3896',
 '4216',
 '4049',
 '4151',
 '3697',
 '3769',
 '3900',
 '3329',
 '3354',
 '1855',
 '3105',
 '2693',
 '4441',
 '3412',
 '3956',
 '4383',
 '1006',
 '4308',
 '3323',
 '4379',
 '4428',
 '3787',
 '1772',
 '484',
 '4423',
 '4385',
 '2393',
 '4543',
 '3786',
 '4545',
 '3967',
 '4430',
 '4542',
 '4419',
 '3374',
 '4425',
 '4386',
 '1780',
 '4424',
 '4307',
 '3351',
 '4418',
 '4429',
 '4422',
 '810',
 '4417',
 '1839',
 '2848',
 '4416',
 '2983',
 '4643',
 '4306',
 '4421',
 '1771',
 '3530',
 '4420',
 '4427',
 '4426',
 '2740',
 '2955',
 '3680',
 '1500',
 '4540',
 '3679',
 '2122',
 '4589',
 '1464',
 '326',
 '3678',
 '1463',
 '2710',
 '3139',
 '1520',
 '4608',
 '750',
 '4148',
 '2969',
 '2966',
 '4146',
 '4147',
 '2709',
 '2965',
 '1517',
 '701',
 '3502',
 '3503',
 '3138',
 '3501',
 '4486',
 '2202',
 '2203',
 '2967']

Prepare the training and validation sets:

In [182]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load training dataset
dataset = SentenceDataset(texts)
print("Total: %i" % len(dataset))

# Split training and validation set
train_len = int(0.7*len(dataset))
valid_len = len(dataset) - train_len
TrainData1, ValidationData1 = random_split(dataset,[train_len, valid_len])
print("Training: %i / Testing: %i" %(len(TrainData1), len(ValidationData1)))

# Load into Iterator (each time get one batch)
train_loader = data.DataLoader(TrainData1, batch_size=batch_size, shuffle=True, drop_last=False, num_workers=0)
# shuffling is not needed for evaluation
test_loader = data.DataLoader(ValidationData1, batch_size=batch_size, shuffle=False, drop_last=False, num_workers=0)


Total: 1000
Training: 700 / Testing: 300
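
`random_split` draws a different 70/30 partition on every run unless a seeded generator is supplied. A minimal sketch of a reproducible split, using a plain list as a stand-in dataset and an arbitrary seed of 42 (both are assumptions for illustration):

```python
import torch
from torch.utils.data import random_split

# Any object with __len__ and __getitem__ works as a stand-in dataset
toy_dataset = list(range(10))


def split(seed):
    # A dedicated, seeded generator makes the partition repeatable
    generator = torch.Generator().manual_seed(seed)
    return random_split(toy_dataset, [7, 3], generator=generator)


train_a, valid_a = split(42)
train_b, valid_b = split(42)

print(train_a.indices, valid_a.indices)
print(train_a.indices == train_b.indices)  # same seed, same partition
```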


Create model instance:

In [184]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# hard-code the label dimension to 99 (labels_df has 99 label columns)
num_labels = 99

# Define model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)
model.to(device)

# Define tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define optimizer
#optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
optimizer = optim.AdamW(model.parameters(), lr=lr)

# Define Loss function
criterion = nn.BCEWithLogitsLoss()


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
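
As an alternative to an external criterion, recent versions of `transformers` expose a `problem_type` config option. The sketch below is an assumption-laden illustration rather than the notebook's method: it builds a tiny, randomly initialised BERT (the config sizes are arbitrary, chosen so nothing has to be downloaded) and checks that, with `problem_type="multi_label_classification"`, the loss returned by the model matches `BCEWithLogitsLoss` applied to its logits.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Tiny, randomly initialised BERT so the sketch runs without downloading weights
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
    problem_type="multi_label_classification",
)
tiny_model = BertForSequenceClassification(config)
tiny_model.eval()

input_ids = torch.randint(0, 100, (2, 8))                   # 2 samples, 8 tokens each
labels = torch.tensor([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])   # multi-hot float targets

with torch.no_grad():
    outputs = tiny_model(input_ids=input_ids, labels=labels)

# Recompute the loss by hand from the returned logits
manual_loss = nn.BCEWithLogitsLoss()(outputs.logits, labels)
print(outputs.loss.item(), manual_loss.item())
```

Applying the criterion manually, as this notebook does, keeps the loss explicit and works regardless of the installed library version.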

Training and Testing Functions:

In [185]:
###########################
# Train with training set #
###########################
def train(model, iterator, optimizer, criterion, device):
    
    model.train()     # Enter Train Mode
    train_loss = 0.0

    for batch_idx, (sentences, labels) in enumerate(iterator):
        
        # debug output: the raw sentences in this batch
        print(sentences)
        
        # tokenize the sentences (the DataLoader yields them as a tuple of strings)
        encoding = tokenizer(list(sentences), return_tensors='pt', padding=True, truncation=True)

        # move inputs and labels to the same device as the model;
        # BCEWithLogitsLoss expects float targets with the same dtype as the logits
        input_ids = encoding['input_ids'].to(device)
        attention_mask = encoding['attention_mask'].to(device)
        labels = labels.to(device, dtype=torch.float)
        
        # generate prediction
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)  # no labels passed, so the model's internal loss is not used
        
        # compute gradients and update weights
        loss = criterion(outputs.logits, labels) # BCEWithLogitsLoss applies sigmoid internally
        loss.backward()
        optimizer.step()

        # accumulate train loss as a Python float so the autograd graph is not kept alive
        train_loss += loss.item()
        
    # print completed result
    print('train_loss: %f' % (train_loss))
    return train_loss


#############################
# Validate with testing set #
#############################
def test(model, iterator, optimizer, criterion, device):

    model.eval()     # Enter Evaluation Mode
    correct = 0
    total = 0

    # a label is predicted as present when its probability exceeds this threshold
    THRESHOLD = 0.7

    with torch.no_grad():
        for batch_idx, (sentences, labels) in enumerate(iterator):
            
            # tokenize the sentences (the DataLoader yields them as a tuple of strings)
            encoding = tokenizer(list(sentences), return_tensors='pt', padding=True, truncation=True)
            
            # move inputs and labels to the same device as the model
            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)
            labels = labels.to(device)
            
            # generate prediction
            outputs = model(input_ids, attention_mask=attention_mask)  # no labels passed, so the model's internal loss is not used
            prob = outputs.logits.sigmoid()   # apply sigmoid explicitly; the criterion is not used here
            
            # record processed data count (one entry per sample and label)
            total += (labels.size(0)*labels.size(1))

            # threshold each label's probability independently
            prediction = (prob > THRESHOLD).to(labels.dtype)
            correct += prediction.eq(labels).sum().item()
    
    # print completed result (element-wise accuracy over all sample/label pairs)
    acc = 100.*correct/total
    print('correct: %i  / total: %i / test_acc: %f' % (correct, total, acc))
    return acc
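
The accuracy computed by `test()` counts every (sample, label) pair, so with many mostly-zero label columns a model that predicts no labels at all can still score highly. As a hedged sketch of a complementary metric, the hypothetical helper `micro_f1` below (not part of the notebook) computes micro-averaged F1 from thresholded probabilities; the probabilities and labels are made-up values.

```python
import torch


def micro_f1(prob, labels, threshold=0.7):
    """Micro-averaged F1 over all (sample, label) pairs."""
    pred = (prob > threshold).float()
    tp = (pred * labels).sum().item()          # predicted 1, actually 1
    fp = (pred * (1 - labels)).sum().item()    # predicted 1, actually 0
    fn = ((1 - pred) * labels).sum().item()    # predicted 0, actually 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Made-up probabilities and multi-hot labels for 2 samples and 4 labels
prob = torch.tensor([[0.9, 0.2, 0.1, 0.05], [0.1, 0.8, 0.3, 0.6]])
labels = torch.tensor([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]])

pred = (prob > 0.7).float()
elementwise_acc = pred.eq(labels).float().mean().item()
all_zero_acc = torch.zeros_like(labels).eq(labels).float().mean().item()

print("element-wise accuracy:", elementwise_acc)
print("accuracy of predicting nothing:", all_zero_acc)
print("micro F1:", micro_f1(prob, labels))
```

In this toy case, predicting no labels at all still reaches an element-wise accuracy above 60%, whereas its F1 would be 0, which is why F1 can be the more informative number for sparse label matrices.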


Actual execution:

- Run `train()` and `test()` once per epoch, `epoch` times in total


In [None]:
for e in range(epoch):
    
    print("===== Epoch %i =====" % e)
    
    # training
    print("Training started ...")
    train(model, train_loader, optimizer, criterion, device)

    # validation testing
    print("Testing started ...")
    test(model, test_loader, optimizer, criterion, device)



===== Epoch 0 =====
Training started ...
('lightning usb kamera adapter einfach fotos videos digitalkamera ipad iphone lightning anschluss laden brillanten retina display gemeinsam freunden anschauen lightning usb kamera adapter angeschlossen öffnet ipad iphone automatisch fotos app auswählen fotos videos laden möchtest alben organisiert synchronisieren ipad iphone pc mac fotos videos ipad iphone fotoarchiv computer hinzugefügt lightning usb kamera adapter unterstützt gängige fotoformate jpeg raw sd hd videoformate 264 mpeg 4 erfordert ios 9 2 neuer', 'design details wichtigste surface keyboard detail sorgfältig durchdacht mattgrauen lackierung passt perfekt surface surface mouse ergänzt organisierten arbeitsplatz ideal solide gefühl tastendruck arbeit angenehm tastatur verbindet einfach drahtlose bluetooth verbindung surface akkulaufzeit tastenhub höhe winkel abstand genau festgelegt tippgeschwindigkeit genauigkeit verbessern optimiertes feedback rückstellkraft versehentliche tastenan