<a href="https://colab.research.google.com/github/yinhao0424/reuster/blob/master/ReustersMultilabelClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Multilabel Classification on Reuters News with DistilBERT

In this notebook, I fine-tune a transformer model (DistilBERT) for multilabel text classification.

#### Data:
- The data is the Reuters-21578 dataset.
- The data is preprocessed to group the 135 topics into 9 categories, which are:
  - money-fx
  - ship
  - interest
  - economic_indicator
  - currency
  - commodity
  - energy
  - acq
  - earn

#### Language Model
- References:
  - paper: [DistilBERT, a distilled version of BERT](https://arxiv.org/pdf/1910.01108.pdf)
  - huggingface: [Transformers Tutorial](https://huggingface.co/transformers/notebooks.html)
  - github: 
    - [Transformers](https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb)
    - [Fine Tuning Transformer for MultiLabel Text Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)


In [1]:
# pin the specific version of transformers used in this notebook
! pip install -q transformers==3.0.2



In [2]:
! nvidia-smi

Tue Dec 29 03:09:33 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
import numpy as np
import pandas as pd
from sklearn import metrics
from tqdm import tqdm

import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import DistilBertTokenizer, DistilBertModel

import warnings
warnings.simplefilter('ignore')
import logging
logging.basicConfig(level=logging.ERROR)

In [4]:
if torch.cuda.is_available():
  device = torch.device("cuda")
else:
  device = torch.device("cpu")
device

device(type='cuda')

In [5]:
def hamming_score(y_true, y_pred, normalize=True, sample_weight=None):
    """Mean per-sample overlap (intersection over union) of true and predicted label sets.

    `normalize` and `sample_weight` are accepted but not used.
    """
    acc_list = []
    for i in range(y_true.shape[0]):
        set_true = set(np.where(y_true[i])[0])
        set_pred = set(np.where(y_pred[i])[0])
        if len(set_true) == 0 and len(set_pred) == 0:
            # Both label sets are empty: count the sample as a perfect match.
            tmp_a = 1
        else:
            tmp_a = len(set_true.intersection(set_pred)) / \
                    float(len(set_true.union(set_pred)))
        acc_list.append(tmp_a)
    return np.mean(acc_list)
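
As a quick check of what `hamming_score` measures, the sketch below computes the same per-sample intersection-over-union with array operations on a toy example. The name `hamming_score_vectorized` and the toy arrays are illustrative only and are not part of the original notebook.

```python
import numpy as np

def hamming_score_vectorized(y_true, y_pred):
    """Illustrative array-based version of the per-sample overlap score."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = (y_true & y_pred).sum(axis=1)
    union = (y_true | y_pred).sum(axis=1)
    # Rows where both label sets are empty score 1, as in the loop version.
    scores = np.where(union == 0, 1.0, intersection / np.maximum(union, 1))
    return scores.mean()

toy_true = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0]])
toy_pred = np.array([[1, 0, 0], [0, 0, 0], [0, 1, 1]])

# Per-sample scores are 1/2, 1 and 1/3, so the mean is 11/18.
toy_score = hamming_score_vectorized(toy_true, toy_pred)
print(round(toy_score, 4))
```

A partially correct prediction therefore earns partial credit, unlike exact-match accuracy.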

In [6]:
# Prepare Training Data
train_data = pd.read_csv("/content/drive/MyDrive/data/reuters/reuters_multilabel_train.csv")

# convert labels to list
from ast import literal_eval
train_data.labels = train_data.labels.apply(literal_eval)

# extract texts and labels
train_data = train_data[['texts','labels']].copy()

train_data.head()

Unnamed: 0,texts,labels
0,u.s. economic data key to debt futures outlook...,"[0, 0, 1, 1, 0, 0, 0, 0, 0]"
1,bank of british columbia 1st qtr jan 31 netope...,"[0, 0, 0, 0, 0, 0, 0, 0, 1]"
2,restaurant associates inc <ra> 4th qtr jan 3sh...,"[0, 0, 0, 0, 0, 0, 0, 0, 1]"
3,michigan general corp <mgl> 4th qtrshr loss 1....,"[0, 0, 0, 0, 0, 0, 0, 0, 1]"
4,"usx <x> proved oil, gas reserves fall in 1986u...","[0, 0, 0, 0, 0, 1, 1, 0, 0]"
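
Each `labels` entry is a multi-hot vector with one position per category. The sketch below maps a vector back to category names, assuming the nine positions follow the order of the category list at the top of the notebook. The sample rows above look consistent with that order, but the notebook does not state it, so treat `CATEGORIES` and `decode_labels` as hypothetical helpers.

```python
# Assumed ordering: same as the category list in the introduction.
CATEGORIES = [
    "money-fx", "ship", "interest", "economic_indicator", "currency",
    "commodity", "energy", "acq", "earn",
]

def decode_labels(vector, categories=CATEGORIES):
    """Return the names of the categories whose flag is set in `vector`."""
    return [name for name, flag in zip(categories, vector) if flag]

row0 = decode_labels([0, 0, 1, 1, 0, 0, 0, 0, 0])
row4 = decode_labels([0, 0, 0, 0, 0, 1, 1, 0, 0])
print(row0)
print(row4)
```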


In [7]:
# Parse Testing Data
test_data = pd.read_csv("/content/drive/MyDrive/data/reuters/reuters_multilabel_test.csv")

# convert labels to list
from ast import literal_eval
test_data.labels = test_data.labels.apply(literal_eval)

# extract texts and labels
test_data = test_data[['texts','labels']].copy()

test_data.head()

Unnamed: 0,texts,labels
0,hospital corp says it received 47 dlr a share ...,"[0, 0, 0, 0, 0, 0, 0, 1, 0]"
1,beverly enterprises <bev> sets regular dividen...,"[0, 0, 0, 0, 0, 0, 0, 0, 1]"
2,treasury's baker says floating exchange rate s...,"[1, 0, 0, 0, 0, 0, 0, 0, 0]"
3,"crude oil netbacks up sharply in europe, u.s.c...","[0, 0, 0, 0, 0, 0, 1, 0, 0]"
4,treasury's baker says system needs stabilitytr...,"[1, 0, 0, 0, 0, 0, 0, 0, 0]"


### Preparing the Dataset and Dataloader


In [8]:
# Sections of config
# Defining some key variables that will be used later on in the training
MAX_LEN = 128
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 4
EPOCHS = 1
LEARNING_RATE = 1e-05
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', truncation=True, do_lower_case=True)





In [9]:
class MultiLabelDataset(Dataset):
    # Map-style dataset: dataset[idx] tokenizes the idx-th text and returns it
    # as tensors together with its multi-hot label vector.
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.texts
        self.targets = self.data.labels
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        # Truncate or pad every text to exactly max_len tokens.
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,
            padding='max_length',
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }
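
To make the fixed-length encoding concrete, here is a minimal pure-Python sketch of how a token sequence ends up with exactly `max_len` ids and a matching attention mask. It is an illustration of the idea, not the tokenizer's implementation; `pad_or_truncate` is a hypothetical helper, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are assumed from the BERT uncased vocabulary.

```python
# Assumed special-token ids of the uncased BERT vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_or_truncate(token_ids, max_len):
    """Wrap token ids in [CLS]/[SEP], then cut or pad to exactly max_len."""
    body = token_ids[: max_len - 2]          # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 marks real tokens
    n_pad = max_len - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([7, 8, 9], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions.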

In [10]:
print("TRAIN Dataset: {}".format(train_data.shape))
print("TEST Dataset: {}".format(test_data.shape))

training_set = MultiLabelDataset(train_data, tokenizer, MAX_LEN)
testing_set = MultiLabelDataset(test_data, tokenizer, MAX_LEN)

TRAIN Dataset: (7775, 2)
TEST Dataset: (3019, 2)


In [11]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)
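
With the dataset sizes printed above and a batch size of 4, the number of batches per pass can be worked out directly. The small sketch below does the arithmetic; `batches_per_epoch` is an illustrative helper, and it assumes the final, smaller batch is kept (the `DataLoader` default).

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

train_batches = batches_per_epoch(7775, 4)
test_batches = batches_per_epoch(3019, 4)
print(train_batches, test_batches)  # 1944 755
```

These counts match the iteration totals that `tqdm` reports during training and validation.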

### Creating the Neural Network for Fine Tuning

In [12]:
class DistilBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistilBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.1)
        self.classifier = torch.nn.Linear(768, 9)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.Tanh()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

model = DistilBERTClass()
model.to(device)


DistilBERTClass(
  (l1): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_featu

In [13]:
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)
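
`BCEWithLogitsLoss` treats each of the nine outputs as an independent binary decision and, with its default mean reduction, averages the per-label losses. The pure-Python sketch below reproduces that calculation for a single example using the usual numerically stable form; `bce_with_logits` is a hypothetical helper written for illustration, not the PyTorch implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for x, z in zip(logits, targets):
        # Stable form of -[z*log(sigmoid(x)) + (1-z)*log(1-sigmoid(x))]
        total += max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# All-zero logits give log(2) for every label, whatever the targets are.
untrained = bce_with_logits([0.0] * 9, [0, 0, 1, 1, 0, 0, 0, 0, 0])
# Confident, correct logits give a loss close to zero.
confident = bce_with_logits([5.0, -5.0], [1, 0])
print(untrained, confident)
```

The first training loss logged later in the notebook (about 0.708) is close to log(2) ≈ 0.693, which is roughly what one would expect from a freshly initialized classification head.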

In [14]:
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

In [15]:
def train(epoch):
    model.train()
    for step, data in tqdm(enumerate(training_loader, 0)):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)

        outputs = model(ids, mask, token_type_ids)

        loss = loss_fn(outputs, targets)
        if step % 2000 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        # Clear stale gradients, backpropagate, then update the weights.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [16]:
for epoch in range(EPOCHS):
    train(epoch)

1it [00:00,  3.55it/s]

Epoch: 0, Loss:  0.7082439064979553


1944it [02:12, 14.66it/s]


In [17]:
def validation(testing_loader):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in tqdm(enumerate(testing_loader, 0)):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.float)
            outputs = model(ids, mask, token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

In [18]:
outputs, targets = validation(testing_loader)

final_outputs = np.array(outputs) >=0.5

755it [00:16, 45.01it/s]


In [19]:

accuracy = metrics.accuracy_score(targets, final_outputs)
f1_score_micro = metrics.f1_score(targets, final_outputs, average='micro')
f1_score_macro = metrics.f1_score(targets, final_outputs, average='macro')
print(f"Accuracy Score = {accuracy}")
print(f"F1 Score (Micro) = {f1_score_micro}")
print(f"F1 Score (Macro) = {f1_score_macro}")

Accuracy Score = 0.909572706194104
F1 Score (Micro) = 0.9430100755667505
F1 Score (Macro) = 0.8789685170242958


In [20]:
val_hamming_loss = metrics.hamming_loss(targets, final_outputs)
val_hamming_score = hamming_score(np.array(targets), np.array(final_outputs))

print(f"Hamming Score = {val_hamming_score}")
print(f"Hamming Loss = {val_hamming_loss}")

Hamming Score = 0.9366236060505686
Hamming Loss = 0.013323028228626108
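
The accuracy and Hamming loss reported above answer different questions. For multilabel input, `accuracy_score` counts a sample as correct only when every label matches (subset accuracy), while `hamming_loss` is the fraction of individual label slots that are wrong. The toy NumPy sketch below computes both by hand; the arrays are made up for illustration.

```python
import numpy as np

toy_true = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0]])
toy_pred = np.array([[1, 0, 0], [0, 0, 0], [0, 1, 1]])

# Subset accuracy: a row counts only if all of its labels match.
subset_accuracy = np.all(toy_true == toy_pred, axis=1).mean()

# Hamming loss: share of mismatched label slots over all rows and labels.
toy_hamming_loss = (toy_true != toy_pred).mean()

print(subset_accuracy, toy_hamming_loss)
```

Here only the middle row matches exactly (1 of 3 samples), while 3 of the 9 label slots are wrong. This is why a model can show a very low Hamming loss and still have a noticeably lower exact-match accuracy.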


In [21]:

output_model_file = '/content/drive/MyDrive/data/reuters/pytorch_distilbert_news.bin'
output_vocab_file = '/content/drive/MyDrive/data/reuters/vocab_distilbert_news.bin'

torch.save(model, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)

print('Saved')

Saved
