# Text classification with BERT in PyTorch

## Data

Let's first get our data. Our corpus is collected from NTSB reports. It includes the narrative report of aviation accidents. In this notebook, we want to classifiy the injury level of the accident through narrative reports. 

In [1]:
import pandas as pd
mypath = "data/NTSB"
DataDict = pd.read_csv(mypath+'/eADMSPUB_DataDictionary.csv')
narratives = pd.read_csv(mypath+'/narratives.csv')
events = pd.read_csv(mypath+'/events.csv')
# select key columns
events_select = events[['ev_id','ev_highest_injury','ev_year']]
# Merge table
data = pd.merge(narratives, events_select, on=['ev_id'])

data_select = data[data['ev_year']<=2008]
data_select = data_select[data_select['ev_highest_injury'] != 'UNKN']
data_select = data_select[data_select['ev_highest_injury'] != 'NONE']
data_final = data_select.dropna(subset=['ev_highest_injury','narr_accf'])
# data_group = data_select.groupby(['ev_id','Aircraft_Key'])['Occurrence_No','Occurrence_Code','Phase_of_Flight','Subj_Code','damage','ev_highest_injury','ev_year'].agg(lambda x: x.tolist()).reset_index()
# data_final = pd.merge(data_group, narratives, on=['ev_id', 'Aircraft_Key'])
# data_final['label_injury'] = data_final.ev_highest_injury.apply(lambda x: x[0])
# data_final = data_final.dropna(subset=['narr_accf'])

In [2]:
print(len(data_final))
data_final.head(2)

28683


Unnamed: 0,ev_id,Aircraft_Key,narr_accp,narr_accf,narr_cause,narr_inc,lchg_date,lchg_userid,ev_highest_injury,ev_year
0,20001208X07734,1,"HISTORY OF FLIGHT\r\n\r\nOn April 2, 1997, at...","During an IFR flight at night, the pilot repor...",The pilot's failure to maintain aircraft contr...,,12/08/00 12:15:45,dbo,FATL,1997
1,20021008X05297,1,"On September 29, 2002, at 2100 central dayligh...",The instructional flight sustained substantial...,The inadequate fuel management by the certifie...,,03/13/03 11:39:26,SULP,SERS,2002


In [3]:
from tqdm import tqdm
import pandas as pd
import pickle
import numpy as np
import re
import nltk
import string

import pandas as pd
import numpy as np
from tqdm import tqdm

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer

lemmatizer=WordNetLemmatizer()
MAX_LENGTH = 100
data_final['narr_accf'] = data_final['narr_accf'].replace('\\n', '', regex=True)
data_final['narr_accf'] =  data_final['narr_accf'].replace('\\r', '', regex=True)
data_filter = data_final[[len(x.split('.'))<=MAX_LENGTH for x in data_final['narr_accf']]].reset_index()
for i, row in tqdm(data_filter.iterrows(), total=data_filter.shape[0], position=0):
    sentence = data_filter.at[i,'narr_accf']
    sentence = re.sub('[^A-Za-z0-9.-]+', ' ', sentence)
    #Remove whitespaces
    sentence = sentence.strip()
    sentence = nltk.word_tokenize(sentence)
    # Remove stop words
    sentence = [i for i in sentence if not i in stop_words]
    # Stemming and Lemmatization
    data_buffer = []
    for word in sentence:
        word_ = word
#         word_ = stemmer.stem(word)
        word_ = lemmatizer.lemmatize(word_)
        data_buffer.append(word_)
    data_filter.loc[i,'narr_accf'] = ' '.join(data_buffer)
data_final = data_filter

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
100%|██████████| 28683/28683 [00:59<00:00, 481.31it/s]


First, we split up the data into a train, development and test portion. 

In [4]:
import ndjson
import time
start = time.time()
from sklearn.model_selection import train_test_split

texts = data_final["narr_accf"].tolist()
labels = data_final["ev_highest_injury"].tolist()
    
rest_texts, test_texts, rest_labels, test_labels = train_test_split(texts, labels, test_size=0.1, random_state=1)
train_texts, dev_texts, train_labels, dev_labels = train_test_split(rest_texts, rest_labels, test_size=0.1, random_state=1)

print("Train size:", len(train_texts))
print("Dev size:", len(dev_texts))
print("Test size:", len(test_texts))

Train size: 23232
Dev size: 2582
Test size: 2869


In [5]:
texts[0]

'During IFR flight night pilot reported Air Route Traffic Control Center controller lost alternator switched standby generator . He requested lower altitude cloud lost cockpit lighting . He reported loss compass looking clear area . As controller attempting provide no-gyro vector nearest airport pilot reported various problem flight instrument including altimeter stated know whether could fly straight level . He reported altimeter working still instrument meteorological condition lost vacuum pump . He told controller know bank indicator DG HSI providing conflicting information . The pilot subsequently could maintain heading provided controller consistent altitude profile next several minute . His last transmission said descent trying pull . Radar data showed series 360-degree left turn followed turn right . The last turn left computed 5.487 g load factor 80-degree angle bank . The first turn right 4.213 g bank angle 76 degree . The vertical stabilizer horizontal stabilizer outboard sec

Next, we need to determine the number of labels in our data. We'll map each of these labels to an index. In problem, there are three labels: Fatl, minor and seriours. We can also find the distribution of three labels through counting it. 

In [6]:
target_names = list(set(labels))
label2idx = {label: idx for idx, label in enumerate(target_names)}
print(label2idx)
from collections import Counter
print(Counter(labels))
print(Counter(test_labels))
print(Counter(train_labels))

{'MINR': 0, 'SERS': 1, 'FATL': 2}
Counter({'FATL': 11939, 'MINR': 9501, 'SERS': 7243})
Counter({'FATL': 1210, 'MINR': 934, 'SERS': 725})
Counter({'FATL': 9646, 'MINR': 7724, 'SERS': 5862})


## Baseline

Let's train a baseline model for our task. In this way we have something to compare BERT's performance to. As our baseline, we choose a simple random foreset classifier from Scikit-learn. We use grid search to find the optimal settings for its hyperparameter. At the end of this process, we find that our best baseline classifier obtains an accuracy of 70.04%.

In [7]:
%%time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('rf', RandomForestClassifier())
])

parameters = {'rf__n_estimators': [10, 50, 100]}

best_classifier = GridSearchCV(pipeline, parameters, cv=5, verbose=1)
best_classifier.fit(train_texts, train_labels)
best_predictions = best_classifier.predict(test_texts)

baseline_accuracy = np.mean(best_predictions == test_labels)
print("Baseline accuracy:", baseline_accuracy)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
Baseline accuracy: 0.695364238410596
CPU times: user 4min 20s, sys: 276 ms, total: 4min 20s
Wall time: 4min 20s


## BERT

Now we move to BERT. The team at [HuggingFace](https://github.com/huggingface) has developed a great Python library, [transformers](https://github.com/huggingface/transformers), with implementations of an impressive number of transfer-learning models in PyTorch and Tensorflow. It makes finetuning these models pretty easy. Let's first install this library. 

You really need a GPU to finetune BERT. Still, to make sure this code runs on any machine we'll let PyTorch determine whether a GPU is available.

In [8]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Initializing a model

Google has made available a range of BERT models for us to experiment with. For English, there is a choice between three models: `bert-large-uncased` is the largest model that will likely give the best results. Its smaller siblings are `bert-base-uncased` and `bert-base-cased`, which are more practical to work with. For Chinese there is `bert-base-chinese`, and for the other languages we have `bert-base-multilingual-uncased` and `bert-base-multilingual-cased`. 

Uncased means that the training text has been lowercased and accents have been stripped. This is usually better, unless you know that case information is important for your task, such as with Named Entity Recognition. 

In our example, we're going to investigate sentiment analysis on English. We'll therefore use the English BERT-base model.

In [9]:
BERT_MODEL = "bert-base-uncased"

Each model comes with its own tokenizer. This tokenizer splits texts into [word pieces](https://github.com/google/sentencepiece). In addition, we'll tell the tokenizer it should lowercase the text, as we're going to work with the uncased model. 

In [10]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(BERT_MODEL)

A full BERT model consists of a common, pretrained core, and an extension on top that depends on the particular NLP task. After all, the output of a sequence classification model, where we have just one prediction for every sequence, looks very different from the output of a sequence labelling or question answering model. As we're looking at sentiment classification, we're going to use the pretrained BERT model with a final layer for sequence classification on top.

In [11]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(BERT_MODEL, num_labels = len(label2idx))
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

### Preparing the data

Next we need to prepare our data for BERT. We'll present every document as a BertInputItem object, which contains all the information BERT needs: 

- a list of input ids. Take a look at the logging output to see what this means. Every text has been split up into subword units, which are shared between all the languages in the multilingual model. When a word appears frequently enough in a combined corpus of all languages, it is kept intact. If it is less frequent, it is split up into subword units that do occur frequently enough across all languages. This allows our model to process every text as a sequence of strings from a finite vocabulary of limited size. Note also the first `[CLS]` token. This token is added at the beginning of every document. The vector at the output of this token will be used by the BERT model for its sequence classification tasks: it serves as the input of the final, task-specific part of the neural network.
- the input mask: the input mask tells the model which parts of the input it should look at and which parts it should ignore. In our example, we have made sure that every text has a length of 100 tokens. This means that some texts will be cut off after 100 tokens, while others will have to be padded with extra tokens. In this latter case, these padding tokens will receive a mask value of 0, which means BERT should not take them into account for its classification task. 
- the segment_ids: some NLP task take several sequences as input. This is the case for question answering, natural language inference, etc. In this case, the segment ids tell BERT which sequence every token belongs to. In a text classification task like ours, however, there's only one segment, so all the input tokens receive segment id 0.
- the label id: the id of the label for this document.

In [12]:
%%time
import logging
import numpy as np

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

MAX_SEQ_LENGTH=100

class BertInputItem(object):
    """An item with all the necessary attributes for finetuning BERT."""

    def __init__(self, text, input_ids, input_mask, segment_ids, label_id):
        self.text = text
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
        

def convert_examples_to_inputs(example_texts, example_labels, label2idx, max_seq_length, tokenizer, verbose=0):
    """Loads a data file into a list of `InputBatch`s."""
    
    input_items = []
    examples = zip(example_texts, example_labels)
    for (ex_index, (text, label)) in enumerate(examples):

        # Create a list of token ids
        input_ids = tokenizer.encode(f"[CLS] {text} [SEP]")
        if len(input_ids) > max_seq_length:
            input_ids = input_ids[:max_seq_length]

        # All our tokens are in the first input segment (id 0).
        segment_ids = [0] * len(input_ids)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding = [0] * (max_seq_length - len(input_ids))
        input_ids += padding
        input_mask += padding
        segment_ids += padding

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        label_id = label2idx[label]

        input_items.append(
            BertInputItem(text=text,
                          input_ids=input_ids,
                          input_mask=input_mask,
                          segment_ids=segment_ids,
                          label_id=label_id))

        
    return input_items

train_features = convert_examples_to_inputs(train_texts, train_labels, label2idx, MAX_SEQ_LENGTH, tokenizer, verbose=0)
dev_features = convert_examples_to_inputs(dev_texts, dev_labels, label2idx, MAX_SEQ_LENGTH, tokenizer)
test_features = convert_examples_to_inputs(test_texts, test_labels, label2idx, MAX_SEQ_LENGTH, tokenizer)

Token indices sequence length is longer than the specified maximum sequence length for this model (513 > 512). Running this sequence through the model will result in indexing errors


CPU times: user 1min 6s, sys: 63.9 ms, total: 1min 6s
Wall time: 1min 6s


Finally, we're going to initialize a data loader for our training, development and testing data. This data loader puts all our data in tensors and will allow us to iterate over them during training.

In [13]:
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

def get_data_loader(features, max_seq_length, batch_size, shuffle=True): 

    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
    data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)

    dataloader = DataLoader(data, shuffle=shuffle, batch_size=batch_size)
    return dataloader

BATCH_SIZE = 16

train_dataloader = get_data_loader(train_features, MAX_SEQ_LENGTH, BATCH_SIZE, shuffle=True)
dev_dataloader = get_data_loader(dev_features, MAX_SEQ_LENGTH, BATCH_SIZE, shuffle=False)
test_dataloader = get_data_loader(test_features, MAX_SEQ_LENGTH, BATCH_SIZE, shuffle=False)

### Evaluation method

Now it's time to write our evaluation method. This method takes as input a model and a data loader with the data we would like to evaluate on. For each batch, it computes the output of the model and the loss. We use this output to compute the obtained precision, recall and F-score. During training, we will print the simple numbers. When we evaluate on the test set, we will output a full classification report.

In [14]:
def evaluate(model, dataloader):
    model.eval()
    
    eval_loss = 0
    nb_eval_steps = 0
    predicted_labels, correct_labels = [], []

    for step, batch in enumerate(tqdm(dataloader, desc="Evaluation iteration", position=0)):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        with torch.no_grad():
            tmp_eval_loss, logits = model(input_ids, attention_mask=input_mask,
                                          token_type_ids=segment_ids, labels=label_ids, return_dict=False)
        outputs = np.argmax(logits.cpu().detach().numpy(), axis=1)
        label_ids = label_ids.cpu().numpy()
        
        predicted_labels += list(outputs)
        correct_labels += list(label_ids)
        
        eval_loss += tmp_eval_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    
    correct_labels = np.array(correct_labels)
    predicted_labels = np.array(predicted_labels)
        
    return eval_loss, correct_labels, predicted_labels

### Training

Now it's time to start training. We're going to use the AdamW optimizer with a base learning rate of 5e-5, and train for a maximum of 100 epochs. Here are some additional things to note: 

- Gradient Accumulation allows us to keep our batches small enough to fit into the memory of our GPU, while getting the advantages of using larger batch sizes. In practice, it means we sum the gradients of several batches, before we perform a step of gradient descent. 
- We use the WarmupLinearScheduler to vary our learning rate during the training process. First, we're going to start with a small learning rate, which increases linearly during the warmup stage. Afterwards it slowly decreases again.

In [15]:
from transformers import AdamW, get_linear_schedule_with_warmup

GRADIENT_ACCUMULATION_STEPS = 1
NUM_TRAIN_EPOCHS = 20
LEARNING_RATE = 5e-5
WARMUP_PROPORTION = 0.1
MAX_GRAD_NORM = 5

num_train_steps = int(len(train_dataloader.dataset) / BATCH_SIZE / GRADIENT_ACCUMULATION_STEPS * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(WARMUP_PROPORTION * num_train_steps)

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

optimizer = AdamW(optimizer_grouped_parameters, lr=LEARNING_RATE, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_train_steps)




We're finally ready to train our model. At each epoch, we're going to train it on our training data and evaluate it on the development data. We keep a history of the loss, and stop training when the loss on the development set doesn't improve for a certain number of steps (we call this number our `patience`). Whenever the development loss of our model improves, we save it. 

In [16]:
%%time
import torch
import os
from tqdm import trange
from tqdm import tqdm
from sklearn.metrics import classification_report, precision_recall_fscore_support

OUTPUT_DIR = "tmp/"
MODEL_FILE_NAME = "pytorch_model.bin"
PATIENCE = 5

loss_history = []
no_improvement = 0
for current_epoch in tqdm(range(NUM_TRAIN_EPOCHS), desc="Epoch"):
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(tqdm(train_dataloader, desc="Training iteration", position=0)):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        outputs = model(input_ids, attention_mask=input_mask, token_type_ids=segment_ids, labels=label_ids)
        loss = outputs[0]

        if GRADIENT_ACCUMULATION_STEPS > 1:
            loss = loss / GRADIENT_ACCUMULATION_STEPS

        loss.backward()
        tr_loss += loss.item()

        if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)  
            
            optimizer.step()
            optimizer.zero_grad()
            scheduler.step()
            
    dev_loss, test_correct, test_predicted = evaluate(model, dev_dataloader)
    
    print("Loss history:", loss_history)
    print("Dev loss:", dev_loss)
    print("Bert Accuracy:", np.mean(test_predicted == test_correct))
    
    if len(loss_history) == 0 or dev_loss < min(loss_history):
        no_improvement = 0
        model_to_save = model.module if hasattr(model, 'module') else model
        output_model_file = os.path.join(OUTPUT_DIR, MODEL_FILE_NAME)
        torch.save(model_to_save.state_dict(), output_model_file)
    else:
        no_improvement += 1
    
    if no_improvement >= PATIENCE: 
        print("No improvement on development set. Finish training.")
        break
        
    
    loss_history.append(dev_loss)

Training iteration: 100%|██████████| 1452/1452 [04:24<00:00,  5.50it/s]
Evaluation iteration: 100%|██████████| 162/162 [00:09<00:00, 16.85it/s]


Loss history: []
Dev loss: 0.6300921010566346
Bert Accuracy: 0.7215336948102247


Training iteration: 100%|██████████| 1452/1452 [04:25<00:00,  5.48it/s]
Evaluation iteration: 100%|██████████| 162/162 [00:09<00:00, 17.08it/s]


Loss history: [0.6300921010566346]
Dev loss: 0.6174459494190452
Bert Accuracy: 0.7219209914794733


Training iteration: 100%|██████████| 1452/1452 [04:26<00:00,  5.45it/s]
Evaluation iteration: 100%|██████████| 162/162 [00:09<00:00, 16.98it/s]


Loss history: [0.6300921010566346, 0.6174459494190452]
Dev loss: 0.5984634027620892
Bert Accuracy: 0.7269558481797056


Training iteration: 100%|██████████| 1452/1452 [04:25<00:00,  5.47it/s]
Evaluation iteration: 100%|██████████| 162/162 [00:09<00:00, 17.00it/s]
Epoch:  20%|██        | 4/20 [18:28<1:13:50, 276.92s/it]

Loss history: [0.6300921010566346, 0.6174459494190452, 0.5984634027620892]
Dev loss: 0.7412531508339776
Bert Accuracy: 0.6967467079783114


Training iteration: 100%|██████████| 1452/1452 [04:25<00:00,  5.48it/s]
Evaluation iteration: 100%|██████████| 162/162 [00:09<00:00, 17.01it/s]
Epoch:  25%|██▌       | 5/20 [23:03<1:09:01, 276.11s/it]

Loss history: [0.6300921010566346, 0.6174459494190452, 0.5984634027620892, 0.7412531508339776]
Dev loss: 0.7229733126766887
Bert Accuracy: 0.7261812548412083


Training iteration: 100%|██████████| 1452/1452 [04:25<00:00,  5.48it/s]
Evaluation iteration: 100%|██████████| 162/162 [00:09<00:00, 16.98it/s]
Epoch:  30%|███       | 6/20 [27:38<1:04:19, 275.65s/it]

Loss history: [0.6300921010566346, 0.6174459494190452, 0.5984634027620892, 0.7412531508339776, 0.7229733126766887]
Dev loss: 0.9126466329634926
Bert Accuracy: 0.7137877614252517


Training iteration: 100%|██████████| 1452/1452 [04:25<00:00,  5.47it/s]
Evaluation iteration: 100%|██████████| 162/162 [00:09<00:00, 17.02it/s]
Epoch:  35%|███▌      | 7/20 [32:13<59:39, 275.35s/it]  

Loss history: [0.6300921010566346, 0.6174459494190452, 0.5984634027620892, 0.7412531508339776, 0.7229733126766887, 0.9126466329634926]
Dev loss: 1.0832219621925443
Bert Accuracy: 0.6932610379550735


Training iteration: 100%|██████████| 1452/1452 [04:25<00:00,  5.48it/s]
Evaluation iteration: 100%|██████████| 162/162 [00:09<00:00, 16.81it/s]
Epoch:  35%|███▌      | 7/20 [36:47<1:08:20, 315.40s/it]

Loss history: [0.6300921010566346, 0.6174459494190452, 0.5984634027620892, 0.7412531508339776, 0.7229733126766887, 0.9126466329634926, 1.0832219621925443]
Dev loss: 1.1932067844878744
Bert Accuracy: 0.6835786212238575
No improvement on development set. Finish training.
CPU times: user 30min 30s, sys: 6min 13s, total: 36min 44s
Wall time: 36min 47s





### Evaluation

Let's now evaluate the model on some documents it has never seen. We'll load our best model and have it predict the labels for all documents in our data. We'll compute its precision, recall and F-score for the training, development and test set and print a full classification report for the test set.

In [17]:
%%time
model_state_dict = torch.load(os.path.join(OUTPUT_DIR, MODEL_FILE_NAME), map_location=lambda storage, loc: storage)
model = BertForSequenceClassification.from_pretrained(BERT_MODEL, state_dict=model_state_dict, num_labels = len(target_names))
model.to(device)

model.eval()

_, train_correct, train_predicted = evaluate(model, train_dataloader)
_, dev_correct, dev_predicted = evaluate(model, dev_dataloader)
_, test_correct, test_predicted = evaluate(model, test_dataloader)

print("Training performance:", precision_recall_fscore_support(train_correct, train_predicted, average="micro"))
print("Development performance:", precision_recall_fscore_support(dev_correct, dev_predicted, average="micro"))
print("Test performance:", precision_recall_fscore_support(test_correct, test_predicted, average="micro"))

bert_accuracy = np.mean(test_predicted == test_correct)

print(classification_report(test_correct, test_predicted, target_names=target_names))

Evaluation iteration: 100%|██████████| 1452/1452 [01:26<00:00, 16.82it/s]
Evaluation iteration: 100%|██████████| 162/162 [00:09<00:00, 16.86it/s]
Evaluation iteration: 100%|██████████| 180/180 [00:10<00:00, 16.87it/s]


Training performance: (0.8176222451790633, 0.8176222451790633, 0.8176222451790633, None)
Development performance: (0.7269558481797056, 0.7269558481797056, 0.7269558481797055, None)
Test performance: (0.7249912861624259, 0.7249912861624259, 0.724991286162426, None)
              precision    recall  f1-score   support

        MINR       0.68      0.74      0.71       934
        SERS       0.52      0.40      0.45       725
        FATL       0.85      0.91      0.88      1210

    accuracy                           0.72      2869
   macro avg       0.68      0.68      0.68      2869
weighted avg       0.71      0.72      0.72      2869

CPU times: user 1min 27s, sys: 20.5 s, total: 1min 48s
Wall time: 1min 48s


In [18]:
print("Result for random forest classifier")
print(classification_report(test_labels, best_predictions, target_names=target_names))

Result for random forest classifier
              precision    recall  f1-score   support

        MINR       0.77      0.92      0.84      1210
        SERS       0.61      0.84      0.71       934
        FATL       0.71      0.14      0.24       725

    accuracy                           0.70      2869
   macro avg       0.70      0.63      0.59      2869
weighted avg       0.70      0.70      0.64      2869



We see that BERT obtains an accuracy of around 72% on the test data.