<a href="https://colab.research.google.com/github/stele-and-rivers-001/study-series-nlp-1/blob/main/Open_source_model_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

In this study we will continue researching ways to improve our AI text classification model performance with smaller datasets. In Study I, we focused on data augmentation as well as different parameters within the fastai library to optimize our model performance. We were able to improve the accuracy of our model to 81% from 60.5% at baseline.

In Study II, we are going to explore the Hugging Face transformers library and use transformer model architectures for our text classification problem. The transformers library is a bit more verbose than the fastai library that we used in our first study, but it should feel fairly intuitive if you're already familiar with fastai. Hugging Face is a company that offers an open-source machine learning platform with a large and supportive community of AI enthusiasts sharing their knowledge and discoveries. It is commonly described as a GitHub for the machine learning community.

https://huggingface.co/

We will explore the effects of *transfer learning* on model performance and then take it a step further with the ULMFiT method on a language model. Transfer learning involves leveraging knowledge gained from training a model on one task and applying it to a different but related task or domain. In this study, we will take an NLP model trained on the English language and fine-tune it on our job title and industry data for a text classification task. The ULMFiT approach, developed by Jeremy Howard of fast.ai and Sebastian Ruder, involves an extra step where the pre-trained model is first fine-tuned on task-specific data *as a language model* (we'll provide more detail, but at a high level: language models predict the next word, or in BERT's case a masked word, which is different from the classification-style models that we've been using) and then fine-tuned again for the text classification task.

Hugging Face's open-source platform has a variety of different models to choose from, whereas in Study I we only used the AWD-LSTM model. This time we will look at the popular BERT model, which is available through Hugging Face, and at different variations of the BERT base model.

BERT uses a transformer type architecture, whereas AWD-LSTM is a recurrent neural network (RNN) based on long short-term memory cells. BERT is also bidirectional, meaning it considers the context from both the left and right sides of each word, which allows it to capture deeper contextual information compared to traditional unidirectional models like LSTMs.

## Install libraries and import data

In [None]:
#hide
!pip install -Uqq fastai
!pip install -Uqq datasets
!pip install -Uqq evaluate
!pip install -Uqq scikit-learn
!pip install -Uqq accelerate==0.27.2
!pip install -Uqq tokenizers
!pip install -Uqq transformers==4.39.0 -Uqq torch==2.0.1 -Uqq torchvision

In [None]:
#hide
from pathlib import Path
from google.colab import drive, files
from fastai.text.all import *
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset, random_split
from transformers import BertTokenizer, BertForSequenceClassification, BertForMaskedLM, AdamW, RobertaTokenizer, RobertaForSequenceClassification, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from datasets import Dataset, load_dataset
import evaluate
import os

Note on the below cell: you can find links to the datasets that we used <a href="https://drive.google.com/drive/folders/1lHfx-bCiTGn3m6wwPvv5IIbwf96FqN-B?usp=drive_link">here</a>.

To run this notebook, you can 1) add the datasets to your own google drive and use the code to mount your drive below, OR 2) just access them via a path to the correct folder, as shown below. <b>The below code will not work without adding the data to the proper folder.</b>

In [None]:
## note that we have the files loaded temporarily into our working directory
!ls ./*.csv

./mask_df.csv  ./test_data.csv	./training_data.csv


In [None]:
## APPROACH 1 - code for mounting Google drive below (commented out)
# drive.mount('/content/drive')

## APPROACH 2 - path to folder
data_path = Path('./')

Quick refresher on our data: the dataset is a CSV file with two columns, "label" and "text". The "label" column holds the category assigned to each row of text. For this study, we have 8 categories and 950 total data items. For simplicity, we asked ChatGPT to provide 125 job titles in 8 different industries. In this text classification study, we will split the data into training and validation sets and keep an unseen dataset of size 200 as our test data. The final test will be to categorize the unseen job titles and measure the accuracy of the model. This will reflect how the model performs in a real-life scenario where it needs to classify new data.

In Study I, we used data augmentation to optimize our model performance with the limited data. Let's load the final cleaned and augmented dataset to maintain consistency and get the best performance possible.

In [None]:
#hide
train_df = pd.read_csv(data_path / 'mask_df.csv')
preds_df = pd.read_csv(data_path / 'test_data.csv', sep='|')
train_df.columns
# train_df.dtypes

Index(['label', 'text'], dtype='object')

Below we will take a look at the data format and some sample data points. Example: Location Scout Manager is a job in the drama and arts sector.

In [None]:
#hide
train_df.columns = ['label','text']
preds_df.columns = ['label','text']
train_df.head()

Unnamed: 0,label,text
0,education,Assessment Specialist
1,drama_arts,Location Scout Manager
2,healthcare,Health Information Technician
3,technology,Technical Recruiter
4,finance,Financial Advisor Associate


In [None]:
#hide
preds_df.head()

Unnamed: 0,label,text
0,education,Education Technology Specialist
1,technology,Incident Response Analyst
2,retail_hospitality,Spa Operations Manager
3,marketing_advertising,Data Analyst
4,drama_arts,Hair Assistant


In [None]:
#hide
train_df.describe()

Unnamed: 0,label,text
count,3640,3640
unique,8,1411
top,retail_hospitality,[MASK] [MASK]
freq,515,91


In [None]:
#hide
## show unique labels to ensure no typos or missing categories
unique_labels = train_df['label'].unique()
label_counts = train_df['label'].value_counts()
print(unique_labels)

['education' 'drama_arts' 'healthcare' 'technology' 'finance'
 'marketing_advertising' 'retail_hospitality' 'legal']


## Study I Baseline and Top Performer

Let's take a quick look at the baseline AWD-LSTM model and the top performing fine-tuned model from Study I for reference.

Our baseline model had a validation accuracy of ~66% and a test accuracy of ~61%. We were able to significantly improve these accuracy scores through data augmentation methods to a test accuracy of 81%.



## Hugging Face

### Preprocessing and Tokenizer

The Hugging Face library requires slightly different data input than what we've seen thus far. Let's format the data.

First, we need to encode our labels as numerical values. We will use the LabelEncoder class to fit numerical values to the "label" column in the dataset, creating a dictionary-like mapping between unique labels and integer values. We want to ensure the same mapping is applied to both the training and test data.

In [None]:
## map the same integer values from the train_df to the preds_df
def map_labels(train_df):
    ## Initialize a LabelEncoder
    le = preprocessing.LabelEncoder()
    ## Fit LabelEncoder on the labels in train_df
    le.fit(train_df['label'].tolist())
    ## Create a mapping dictionary from labels to integers
    label_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    return label_mapping

label_mapping = map_labels(train_df)

train_df['label'] = train_df['label'].map(label_mapping)
preds_df['label'] = preds_df['label'].map(label_mapping)
## ensure proper integer format of labels
preds_df['label'] = preds_df['label'].astype(int)
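
To make the integer codes easier to read later on, here is a minimal sketch of the mapping that `LabelEncoder` produces for our eight labels. It assumes only that `LabelEncoder` orders its classes alphabetically, which matches the encoded values shown in this notebook. The names `label_to_id` and `id_to_label` are our own and are not part of scikit-learn.

```python
# Labels as printed by train_df['label'].unique() earlier in this notebook.
labels = ['education', 'drama_arts', 'healthcare', 'technology', 'finance',
          'marketing_advertising', 'retail_hospitality', 'legal']

# LabelEncoder keeps its classes in sorted order, so enumerating the sorted
# unique labels reproduces the same label -> integer mapping.
label_to_id = {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Inverse mapping (our own helper), handy for turning predicted integers
# back into readable label names.
id_to_label = {idx: label for label, idx in label_to_id.items()}

for idx in sorted(id_to_label):
    print(idx, id_to_label[idx])
```

Keeping an inverse mapping like this around makes the classification reports further down easier to interpret, since they only show the integer codes.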

In [None]:
#hide
train_df.head()

Unnamed: 0,label,text
0,1,Assessment Specialist
1,0,Location Scout Manager
2,3,Health Information Technician
3,7,Technical Recruiter
4,2,Financial Advisor Associate


In [None]:
#hide
preds_df.head()

Unnamed: 0,label,text
0,1,Education Technology Specialist
1,7,Incident Response Analyst
2,6,Spa Operations Manager
3,5,Data Analyst
4,0,Hair Assistant


In [None]:
## hugging face refers to the train/valid datasets as train/test, this is why we are calling the final dataset "preds"
## we will still test on unseen data at the end
## set random seed to 42 to avoid variations due to randomness on the data splits
train_df, test_df = train_test_split(train_df, test_size=0.2, random_state=42)

Hugging Face uses Datasets which require special formatting, as opposed to fastai, which accepts pandas dataframes as inputs. We've written a quick helper function below to create a Dataset.

In [None]:
## create a function to convert dataframes to hugging face datasets
def create_dataset(df):
    return Dataset.from_pandas(df)

In [None]:
## create datasets on all three dataframes
train_dataset = create_dataset(train_df)
test_dataset = create_dataset(test_df)
preds_dataset = create_dataset(preds_df)

In [None]:
## create a function to preprocess the text column data with the tokenizer
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, return_tensors="pt")

In [None]:
## create a function to tokenize the datasets and apply the preprocess function in batches
def tokenize_dataset(dataset):
    return dataset.map(preprocess_function, batched=True)

In [None]:
## create a function to make predictions using argmax on the logits
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
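
For intuition, the same calculation can be sketched in plain Python. This is only an illustration of what argmax-then-accuracy means, not a replacement for the `evaluate` metric above; the helper names and the toy logits are made up for this example.

```python
def argmax(values):
    """Return the index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy example: 4 samples, 3 classes (hypothetical numbers).
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 1.5],
    [0.9, 1.1, 0.0],
    [1.2, 0.4, 0.3],
]
toy_labels = [0, 2, 0, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

The third row is the only miss: its largest logit is at index 1 while the true label is 0.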

### BERT

Hugging Face has a substantial library of models to choose from. This easy to use library allows users to filter based on task ("text classification") or keyword search to find a specific subject matter based model.

Let's start with a popular transformer model architecture from Hugging Face: BERT. BERT is a great starting point for many NLP tasks. Using bidirectionality, BERT is pretrained on two different but related NLP tasks: masked language modeling (MLM) and next sentence prediction (NSP).

MLM: Trains a model to predict a masked (hidden) word in a sentence based on its context. In the previous study, we used data masking to augment our dataset. MLM takes this practice and then trains a model to predict the masked token.

NSP: The goal is to train the model to predict whether a given pair of sentences in a text corpus are consecutive or not.
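
To make the two pretraining tasks more concrete, here is a small sketch of how training examples for each could be constructed. It is a simplification under stated assumptions: it splits on whitespace rather than using BERT's subword tokenizer, it masks a single chosen word rather than sampling tokens at random, and the helper names and example sentences are hypothetical.

```python
import random

def mask_one_token(text, position, mask="[MASK]"):
    """MLM-style example: hide the word at `position`, return (masked text, hidden word)."""
    tokens = text.split()
    target = tokens[position]
    tokens[position] = mask
    return " ".join(tokens), target

def make_nsp_pairs(sentences, seed=0):
    """NSP-style examples: (first, second, is_next) triples from ordered sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        # Positive example: the sentence that really follows.
        pairs.append((sentences[i], sentences[i + 1], True))
        # Negative example: any sentence that is neither this one nor its successor.
        candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
        pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs

masked_text, target = mask_one_token("Health Information Technician", position=1)
print(masked_text, "->", target)

# Hypothetical mini-corpus, in document order.
sentences = [
    "The clinic hired a new technician.",
    "She manages patient records.",
    "The theater opened a new show.",
    "A scout found the filming location.",
]
pairs = make_nsp_pairs(sentences)
for first, second, is_next in pairs:
    print(is_next, "|", first, "|", second)
```

In MLM the model is trained to recover `target` from the masked text; in NSP it is trained to predict the `is_next` flag for each pair.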

In [None]:
model_name = 'bert-base-uncased'

Hugging Face provides the class AutoTokenizer to select the most appropriate tokenizer based on the model. We will explore tokenizer options in greater depth later on.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
## data collator combines samples into batches while padding sequences to ensure uniform length within each batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
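
As a rough picture of what padding within a batch means, here is a plain-Python sketch. It is not the `DataCollatorWithPadding` implementation; the `pad_batch` helper and the token ids are hypothetical and only illustrate padding each batch to its own longest sequence, with an attention mask marking the real tokens.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two tokenized job titles of different lengths (hypothetical token ids).
toy_batch = [[101, 7, 8, 102], [101, 9, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to one global length keeps batches of short job titles small, which is the main appeal of a collator-based approach.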

In [None]:
tokenized_train = tokenize_dataset(train_dataset)
tokenized_test = tokenize_dataset(test_dataset)
tokenized_preds = tokenize_dataset(preds_dataset)

We will use the BertForSequenceClassification class to load the model as that is the task we are fine-tuning for. We have 8 labels to choose from based on our unique values in the training data.

In [None]:
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=8)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Here we will set our training arguments. For this dataset and task, we found the following parameter values to work best. It is important to explore these yourself, as the best values vary with the dataset and task at hand, and understanding these arguments is vital to creating an effective model. Hugging Face offers quite a few options at this stage; we will focus on just a few basics here.

For the complete list of training arguments, see the documentation: https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/trainer#transformers.TrainingArguments

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=1e-5,
    evaluation_strategy = "epoch",
    logging_strategy="epoch"
)

optimizer = AdamW(
    model.parameters(),
    lr=training_args.learning_rate,
    weight_decay=training_args.weight_decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    optimizers=(optimizer, None)
)



In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.974,0.506258,0.84478
2,0.434,0.365673,0.899725
3,0.3121,0.335833,0.901099
4,0.243,0.320072,0.902473


TrainOutput(global_step=1456, training_loss=0.4907518061962757, metrics={'train_runtime': 185.3266, 'train_samples_per_second': 62.851, 'train_steps_per_second': 7.856, 'total_flos': 37384339257216.0, 'train_loss': 0.4907518061962757, 'epoch': 4.0})
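
As a sanity check, the `global_step` value reported by the trainer can be reproduced from the numbers used in this notebook. The sketch below assumes training on a single device with no gradient accumulation, which is our assumption about the Colab setup rather than something the output states.

```python
import math

total_rows = 3640        # rows in the augmented training file (see train_df.describe())
validation_percent = 20  # test_size=0.2 passed to train_test_split
batch_size = 8           # per_device_train_batch_size
epochs = 4               # num_train_epochs

# Integer arithmetic avoids floating-point surprises when rounding up.
validation_rows = math.ceil(total_rows * validation_percent / 100)
train_rows = total_rows - validation_rows

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(validation_rows, train_rows, steps_per_epoch, total_steps)
```

The resulting split sizes match the 2912 and 728 examples tokenized earlier, and the step count matches the `global_step` in the training output.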

A 90% validation accuracy looks like a great start. Remember that in the previous study we had overfitting issues that led to a steep drop from validation accuracy to test accuracy. We still need to check the unseen data.

Below we will save the model so that it can be loaded for future applications without having to retrain.

In [None]:
## save the model
filename = 'bert_base_model'
model.save_pretrained(filename)

We can create a classification report to study more about how our model is performing. Let's run it on our training dataset:

In [None]:
## make predictions on the training data
preds = trainer.predict(tokenized_train)
# print(preds)
## process raw predictions
preds = np.argmax(preds[:3][0],axis=1)
## extract the "Ground Truth" labels from the dataframe
GT = train_df['label'].tolist()
## compare the correct labels to the predictions
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.98      0.91      0.94       385
           1       0.96      0.95      0.95       352
           2       0.96      0.92      0.94       325
           3       0.92      0.96      0.94       385
           4       0.99      0.91      0.95       399
           5       0.86      0.93      0.90       271
           6       0.82      0.91      0.87       411
           7       0.97      0.93      0.95       384

    accuracy                           0.93      2912
   macro avg       0.93      0.93      0.93      2912
weighted avg       0.93      0.93      0.93      2912



Precision: The ratio of true positive predictions to the total number of positive predictions made by the model. It measures the accuracy of positive predictions. A high precision indicates that the model is making fewer false positive predictions. It is calculated as:
Precision = True Positives/(False Positives+True Positives)

Recall: Also called sensitivity or true positive rate, recall is the ratio of true positive predictions to the total number of actual positive instances in the data. It measures the ability of the model to correctly identify positive instances. A high recall indicates that the model is capturing most of the positive instances in the data. It is calculated as:
Recall = True Positives/(False Negatives+True Positives)


F1-score: The harmonic mean of precision and recall. It provides a balance between precision and recall and is often used as a single metric to evaluate the overall performance of a classification model. F1-score reaches its best value at 1 and worst value at 0. It is calculated as:
F1 = 2 x ((Precision x Recall)/(Precision + Recall))

Support: The number of items with that label in the dataset.
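
To see how these four numbers come out of raw predictions, here is a minimal plain-Python sketch for a single label. It treats the chosen label as the "positive" class and everything else as "negative", which is how per-label rows in a classification report are usually read. The `per_class_report` helper and the toy data are ours, not part of scikit-learn.

```python
def per_class_report(y_true, y_pred, label):
    """Compute precision, recall, F1 and support for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}

# Toy ground truth and predictions for three labels (hypothetical data).
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]

report = per_class_report(toy_true, toy_pred, label=1)
print(report)
```

Here label 1 is predicted three times but is only correct twice, so precision is 2/3, while both true instances of label 1 are found, so recall is 1.0.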

We can see how the model performs on each label. On the training data it has 98% precision on label 0 (which from our mapping is "drama_arts") and the lowest precision, 82%, on label 6, or "retail_hospitality". In other words, a larger share of the model's retail_hospitality predictions are wrong than is the case for any other label. With this knowledge, you could further augment the data in retail_hospitality, or focus efforts on adjusting the training data to achieve better results. Using the classification report is a nice way to get deeper insight into the model's performance.

In [None]:
## create a classification report on the validation, or "test" data
preds = trainer.predict(tokenized_test)
preds = np.argmax(preds[:3][0],axis=1)
GT = test_df['label'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.94      0.86      0.90       105
           1       0.94      0.90      0.92        88
           2       0.95      0.95      0.95        75
           3       0.91      0.96      0.93       100
           4       0.99      0.90      0.94        86
           5       0.84      0.83      0.84        59
           6       0.75      0.91      0.83       104
           7       0.95      0.90      0.93       111

    accuracy                           0.90       728
   macro avg       0.91      0.90      0.90       728
weighted avg       0.91      0.90      0.90       728



Finally we will test the BERT model on our unseen data. This is the true test of performance for our study. Ideally, the dropoff from train/validate accuracy to test accuracy will not be significant, but we now have a solid foundation of knowledge to adjust if any overfitting is identified.

In [None]:
## create a classification report on the test, or "preds" data
preds = trainer.predict(tokenized_preds)
preds = np.argmax(preds[:3][0],axis=1)
GT = preds_df['label'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.83      0.86      0.84        22
           1       1.00      0.89      0.94        36
           2       0.83      0.95      0.88        20
           3       0.96      0.93      0.95        28
           4       0.79      0.96      0.87        24
           5       0.82      0.64      0.72        22
           6       0.91      0.91      0.91        22
           7       0.89      0.92      0.91        26

    accuracy                           0.89       200
   macro avg       0.88      0.88      0.88       200
weighted avg       0.89      0.89      0.88       200



An 89% test accuracy is very impressive. The BERT model outperforms the AWD-LSTM model by 8 percentage points (89% vs. 81%) and drops off from the validation accuracy by only about 1 percentage point (90% to 89%). The *transformer* model architecture is showing promising results.

In [None]:
## add our predictions to a df and save as csv
bert_preds_df = preds_df.copy()
bert_preds_df['predicted_labels'] = preds
bert_preds_df.head()

Unnamed: 0,label,text,predicted_labels
0,1,Education Technology Specialist,1
1,7,Incident Response Analyst,7
2,6,Spa Operations Manager,6
3,5,Data Analyst,7
4,0,Hair Assistant,0
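
To read rows like these more easily, the integer codes can be translated back into label names. The sketch below hard-codes the five rows shown above and assumes the alphabetical label mapping produced by `LabelEncoder`; the variable names are our own.

```python
# Integer -> label name, assuming LabelEncoder's alphabetical ordering.
id_to_label = {
    0: 'drama_arts', 1: 'education', 2: 'finance', 3: 'healthcare',
    4: 'legal', 5: 'marketing_advertising', 6: 'retail_hospitality',
    7: 'technology',
}

# (true label, text, predicted label) copied from the head() output above.
rows = [
    (1, "Education Technology Specialist", 1),
    (7, "Incident Response Analyst", 7),
    (6, "Spa Operations Manager", 6),
    (5, "Data Analyst", 7),
    (0, "Hair Assistant", 0),
]

# Keep only the rows where the prediction disagrees with the ground truth.
misses = [
    (text, id_to_label[true], id_to_label[pred])
    for true, text, pred in rows
    if true != pred
]
for text, true_name, pred_name in misses:
    print(f"{text}: labeled {true_name}, predicted {pred_name}")
```

Seen this way, the one miss in the sample is easy to sympathize with: "Data Analyst" is labeled as a marketing role but predicted as a technology role.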


In [None]:
## if you'd like to save the results to a CSV
bert_preds_df.to_csv('bert_preds_mask_df.csv', index=False)

### RoBERTa

RoBERTa documentation:
https://huggingface.co/docs/transformers/en/model_doc/roberta#transformers.RobertaForSequenceClassification

RoBERTa builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates. It has the same architecture but uses a byte-level BPE as a tokenizer.

In Study I, we tried a character based tokenizer, which resulted in lower prediction accuracy than the default fastai subword tokenizer. We also tried two different open-source tokenizers from Hugging Face, which showed similar performance.

Hugging Face has three main types of tokenizers available for use:

1. Byte-Pair Encoding: BPE relies on a pre-tokenizer that splits the training data into words. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer and uses a different pretraining scheme.

2. WordPiece: Similar to BPE, WordPiece first initializes the vocabulary to include every character present in the training data and progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary. The BERT tokenizer falls into this category.

3. SentencePiece: The first two tokenizers assume spaces separate words. SentencePiece does not. It treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary. The XLNet tokenizer is an example of this category.

The processes by which these tokenizers operate are fairly complex. For additional information, see the associated documentation on tokenizers from Hugging Face:
https://huggingface.co/docs/transformers/en/tokenizer_summary
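
To give a feel for the merge-rule idea behind BPE, here is a toy sketch that learns a few merges from a tiny word-frequency table. It works at the character level on made-up counts, so it is a simplification of the byte-level BPE that RoBERTa uses, and the function names are hypothetical.

```python
from collections import Counter

def count_pairs(word_freqs):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Base vocabulary: each word starts as a tuple of single characters
# (hypothetical frequencies).
word_freqs = {tuple("manager"): 5, tuple("management"): 3, tuple("analyst"): 2}

merges = []
for _ in range(3):  # learn three merge rules
    pairs = count_pairs(word_freqs)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    word_freqs = merge_pair(word_freqs, best)

print(merges)
print(list(word_freqs))
```

A real tokenizer keeps merging until the vocabulary reaches a target size. WordPiece follows the same loop but, as described above, scores candidate pairs by likelihood gain rather than raw frequency.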

In [None]:
model_name = 'cardiffnlp/twitter-roberta-base-emotion'

In [None]:
## reset the tokenizer for the updated model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_train = tokenize_dataset(train_dataset)
tokenized_test = tokenize_dataset(test_dataset)
tokenized_preds = tokenize_dataset(preds_dataset)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

In [None]:
model = RobertaForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-emotion", num_labels=8, ignore_mismatched_sizes=True)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-emotion and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([4, 768]) in the checkpoint and torch.Size([8, 768]) in the model instantiated
- classifier.out_proj.bias: found shape torch.Size([4]) in the checkpoint and torch.Size([8]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=1e-5,
    evaluation_strategy = "epoch",
    logging_strategy="epoch"
)

optimizer = AdamW(
    model.parameters(),
    lr=training_args.learning_rate,
    weight_decay=training_args.weight_decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    optimizers=(optimizer, None)
)



In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8996,0.487897,0.853022
2,0.5237,0.523616,0.869505
3,0.3911,0.415109,0.888736
4,0.3323,0.397602,0.887363


TrainOutput(global_step=1456, training_loss=0.536658397087684, metrics={'train_runtime': 189.8957, 'train_samples_per_second': 61.339, 'train_steps_per_second': 7.667, 'total_flos': 95777583562752.0, 'train_loss': 0.536658397087684, 'epoch': 4.0})

In [None]:
## if you'd like to save the model
filename = 'roberta_model'
model.save_pretrained(filename)

In [None]:
preds = trainer.predict(tokenized_train)
preds = np.argmax(preds[:3][0],axis=1)
GT = train_df['label'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.97      0.90      0.93       385
           1       0.99      0.93      0.96       352
           2       0.98      0.91      0.94       325
           3       0.77      0.96      0.85       385
           4       0.97      0.92      0.95       399
           5       0.90      0.89      0.89       271
           6       0.87      0.89      0.88       411
           7       0.96      0.94      0.95       384

    accuracy                           0.92      2912
   macro avg       0.93      0.92      0.92      2912
weighted avg       0.93      0.92      0.92      2912



In [None]:
preds = trainer.predict(tokenized_test)
preds = np.argmax(preds[:3][0],axis=1)
GT = test_df['label'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.95      0.84      0.89       105
           1       0.96      0.92      0.94        88
           2       0.93      0.92      0.93        75
           3       0.71      0.95      0.81       100
           4       0.94      0.91      0.92        86
           5       0.88      0.83      0.85        59
           6       0.86      0.86      0.86       104
           7       0.97      0.87      0.92       111

    accuracy                           0.89       728
   macro avg       0.90      0.89      0.89       728
weighted avg       0.90      0.89      0.89       728



In [None]:
preds = trainer.predict(tokenized_preds)
preds = np.argmax(preds[:3][0],axis=1)
GT = preds_df['label'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.81      0.77      0.79        22
           1       0.97      0.78      0.86        36
           2       0.88      0.75      0.81        20
           3       0.90      0.93      0.91        28
           4       0.65      0.92      0.76        24
           5       0.76      0.59      0.67        22
           6       0.83      0.91      0.87        22
           7       0.86      0.96      0.91        26

    accuracy                           0.83       200
   macro avg       0.83      0.83      0.82       200
weighted avg       0.84      0.83      0.83       200



In [None]:
roberta_preds_df = preds_df.copy()
roberta_preds_df['predicted_labels'] = preds
roberta_preds_df.head()

Unnamed: 0,label,text,predicted_labels
0,1,Education Technology Specialist,1
1,7,Incident Response Analyst,4
2,6,Spa Operations Manager,6
3,5,Data Analyst,7
4,0,Hair Assistant,0


In [None]:
roberta_preds_df.to_csv('roberta_preds_mask_df.csv', index=False)

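The RoBERTa model has a test accuracy of 83%. While this is higher than our best performing model from Study I, it is 6 percentage points lower than the BERT model. This suggests that the BERT checkpoint and its WordPiece tokenizer are a better fit for our problem and dataset, although the comparison is not perfectly controlled: the RoBERTa checkpoint we used was previously fine-tuned for emotion classification on Twitter data, so the training data differs as well as the tokenizer. Be sure to try different tokenizers and models on your data. Next we will look at a BERT model already fine-tuned on a task closer to our domain.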

### Industry Classification API

From Hugging Face user **sampathkethineedi**, we will load the industry-classification-api model which is a BERT Model that has been fine-tuned to classify a business description into one of 62 industry tags. Per the documentation, it was trained on 7000 samples of Business Descriptions and associated labels of companies in India. While this is not a perfect 1:1 relation to our task, it is a BERT model already fine-tuned on classifying businesses into an industry.

See the documentation:
https://huggingface.co/sampathkethineedi/industry-classification-api

Instead of the 62 output labels the model was fine-tuned on, we will replace the last layer of the model with one that matches our 8 labels. This requires setting the *ignore_mismatched_sizes* parameter to True. Since the model was already fine-tuned for another task, it will be interesting to see how it performs on ours: the tasks are related but not identical, with a different set and number of labels. Later on, we will take this a step further by using the ULMFiT technique.

In [None]:
model_name = 'sampathkethineedi/industry-classification-api'
num_labels = 8
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels, ignore_mismatched_sizes=True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sampathkethineedi/industry-classification-api and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([62, 768]) in the checkpoint and torch.Size([8, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([62]) in the checkpoint and torch.Size([8]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
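
The warning above can be read as a simple shape comparison between the saved checkpoint and the model we asked for. The sketch below is only a conceptual illustration of that comparison, not the transformers implementation: the classifier shapes are copied from the warning, the pooler entry stands in for the many layers that do match, and `split_by_shape` is a hypothetical helper.

```python
# Shapes stored in the checkpoint (62 industry tags).
checkpoint_shapes = {
    "bert.pooler.dense.weight": (768, 768),  # stand-in for the unchanged layers
    "classifier.weight": (62, 768),
    "classifier.bias": (62,),
}

# Shapes of the model we instantiate with num_labels=8.
model_shapes = {
    "bert.pooler.dense.weight": (768, 768),
    "classifier.weight": (8, 768),
    "classifier.bias": (8,),
}

def split_by_shape(checkpoint_shapes, model_shapes):
    """Separate parameters that can be reused from those that must start fresh."""
    reused, reinitialized = [], []
    for name, shape in model_shapes.items():
        if checkpoint_shapes.get(name) == shape:
            reused.append(name)
        else:
            reinitialized.append(name)
    return reused, reinitialized

reused, reinitialized = split_by_shape(checkpoint_shapes, model_shapes)
print("reused:", reused)
print("newly initialized:", reinitialized)
```

Only the classification head starts from scratch; the rest of the network keeps what it learned from the business descriptions, which is the knowledge we hope transfers to job titles.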


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_train = tokenize_dataset(train_dataset)
tokenized_test = tokenize_dataset(test_dataset)
tokenized_preds = tokenize_dataset(preds_dataset)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/2912 [00:00<?, ? examples/s]

Map:   0%|          | 0/728 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=1e-5,
    evaluation_strategy = "epoch",
    logging_strategy="epoch"
)

optimizer = AdamW(
    model.parameters(),
    lr=training_args.learning_rate,
    weight_decay=training_args.weight_decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    optimizers=(optimizer, None)
)



In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8431,0.45646,0.855769
2,0.431,0.389198,0.888736
3,0.3194,0.326877,0.907967
4,0.247,0.318786,0.907967


TrainOutput(global_step=1456, training_loss=0.46009954515394275, metrics={'train_runtime': 174.595, 'train_samples_per_second': 66.714, 'train_steps_per_second': 8.339, 'total_flos': 37384339257216.0, 'train_loss': 0.46009954515394275, 'epoch': 4.0})

In [None]:
#hide
filename = 'ind_class_bert_model'
model.save_pretrained(filename)

In [None]:
output = trainer.predict(tokenized_train)
## output.predictions holds the raw logits; argmax picks the highest-scoring class per row
preds = np.argmax(output.predictions, axis=1)
GT = train_df['label'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.98      0.91      0.94       385
           1       0.99      0.92      0.96       352
           2       0.97      0.92      0.95       325
           3       0.93      0.96      0.94       385
           4       0.98      0.93      0.95       399
           5       0.89      0.92      0.90       271
           6       0.77      0.94      0.85       411
           7       0.98      0.94      0.96       384

    accuracy                           0.93      2912
   macro avg       0.94      0.93      0.93      2912
weighted avg       0.94      0.93      0.93      2912



In [None]:
output = trainer.predict(tokenized_test)
preds = np.argmax(output.predictions, axis=1)
GT = test_df['label'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.95      0.86      0.90       105
           1       0.98      0.91      0.94        88
           2       0.97      0.93      0.95        75
           3       0.91      0.96      0.94       100
           4       0.97      0.91      0.94        86
           5       0.91      0.83      0.87        59
           6       0.72      0.93      0.82       104
           7       0.95      0.91      0.93       111

    accuracy                           0.91       728
   macro avg       0.92      0.90      0.91       728
weighted avg       0.92      0.91      0.91       728



In [None]:
output = trainer.predict(tokenized_preds)
preds = np.argmax(output.predictions, axis=1)
GT = preds_df['label'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.90      0.86      0.88        22
           1       0.94      0.86      0.90        36
           2       0.82      0.90      0.86        20
           3       0.93      0.93      0.93        28
           4       0.88      0.96      0.92        24
           5       0.82      0.64      0.72        22
           6       0.95      0.91      0.93        22
           7       0.81      1.00      0.90        26

    accuracy                           0.89       200
   macro avg       0.88      0.88      0.88       200
weighted avg       0.89      0.89      0.88       200



In [None]:
ind_class_preds_df = preds_df.copy()
ind_class_preds_df['predicted_labels'] = preds
ind_class_preds_df.head()

Unnamed: 0,label,text,predicted_labels
0,1,Education Technology Specialist,1
1,7,Incident Response Analyst,7
2,6,Spa Operations Manager,6
3,5,Data Analyst,7
4,0,Hair Assistant,0


In [None]:
ind_class_preds_df.to_csv('ind_class_preds_mask_df.csv', index=False)

Using a BERT model that has already been fine-tuned on industry classification data gives us the same test accuracy as the regular BERT model, at 89%. However, you can see a slightly lower loss and higher validation accuracy during training. This difference could be more significant with more data or on another dataset. It is important to remember that not all models work the same on all datasets! Starting from a model fine-tuned on a related task can be an easy way to give your model a boost.

## ULMFit fastai

Universal Language Model Fine-tuning for Text Classification (ULMFit) is a transfer learning method that has been shown to improve performance for NLP models. Here we will go back to the fastai library and apply this method to the AWD-LSTM model architecture, further training the language model on domain-specific data before fine-tuning it for our text classification problem. Introducing domain-specific data at this stage prepares the model and lets it adjust its layers to the specific characteristics of the domain. With a smaller final dataset, this middle step should help the model match the performance of similar models working with larger datasets. See the paper by Jeremy Howard and Sebastian Ruder for an in-depth look at how ULMFit compares to other training methods for NLP models.

https://arxiv.org/abs/1801.06146

In [None]:
from fastai.text.all import *

Fastai handles tokenization and numericalization automatically for us within the TextDataLoaders object. This object saves time by caching tokenized documents in a temporary folder (to avoid repeating work) and by running operations in parallel. We can set values for batch size and sequence length, but we will use fastai's defaults.

In [None]:
## be sure to use the full training dataset (we split it into train/test when working with transformers)
mask_df = train_df.copy()
dls_lm = TextDataLoaders.from_df(mask_df, text_col='text', valid_pct=0.1, seed=42, is_lm=True)

Let's take a look at the first two sequences.

In [None]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,xxbos xxmaj investment [ mask ] xxbos xxmaj museum xxmaj educator xxbos xxmaj legal xxmaj secretary xxbos [ mask ] xxmaj analyst xxbos xxmaj hospice xxmaj nurse xxbos xxmaj xxunk xxmaj manager xxbos [ mask ] xxmaj coordinator xxbos xxup it xxmaj deployment xxmaj specialist xxbos xxup it xxmaj configuration xxmaj manager xxbos [ mask ] xxmaj asset xxmaj management xxmaj specialist xxbos xxmaj merchandise xxmaj planner xxbos xxmaj art [ mask,xxmaj investment [ mask ] xxbos xxmaj museum xxmaj educator xxbos xxmaj legal xxmaj secretary xxbos [ mask ] xxmaj analyst xxbos xxmaj hospice xxmaj nurse xxbos xxmaj xxunk xxmaj manager xxbos [ mask ] xxmaj coordinator xxbos xxup it xxmaj deployment xxmaj specialist xxbos xxup it xxmaj configuration xxmaj manager xxbos [ mask ] xxmaj asset xxmaj management xxmaj specialist xxbos xxmaj merchandise xxmaj planner xxbos xxmaj art [ mask ]
1,xxmaj investigator xxbos xxmaj wound [ mask ] xxmaj nurse xxbos xxmaj online [ mask ] xxmaj manager xxbos xxup it xxmaj business xxmaj analyst xxbos xxmaj virtual xxmaj reality xxmaj educational [ mask ] xxmaj creator xxbos [ mask ] xxmaj activities xxmaj coordinator xxbos xxmaj compliance xxmaj auditor xxbos xxmaj sales xxmaj manager xxbos xxmaj geriatric xxmaj nurse xxmaj practitioner xxbos xxmaj legal xxmaj assistant xxbos xxmaj sound xxmaj mixer,investigator xxbos xxmaj wound [ mask ] xxmaj nurse xxbos xxmaj online [ mask ] xxmaj manager xxbos xxup it xxmaj business xxmaj analyst xxbos xxmaj virtual xxmaj reality xxmaj educational [ mask ] xxmaj creator xxbos [ mask ] xxmaj activities xxmaj coordinator xxbos xxmaj compliance xxmaj auditor xxbos xxmaj sales xxmaj manager xxbos xxmaj geriatric xxmaj nurse xxmaj practitioner xxbos xxmaj legal xxmaj assistant xxbos xxmaj sound xxmaj mixer xxbos
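
In the batch shown above, the `text_` column is the `text` column shifted forward by one token, which is how a next-word prediction target is built. A minimal sketch of that shift, using a handful of tokens copied from the first row (the variable names are only for illustration):

```python
# Tokens copied from the start of the first row in the batch above.
tokens = ["xxbos", "xxmaj", "investment", "[", "mask", "]", "xxbos", "xxmaj", "museum"]

inputs = tokens[:-1]   # what the language model sees at each position
targets = tokens[1:]   # the token it is asked to predict at that position

for seen, expected in zip(inputs, targets):
    print(f"{seen:>12} -> {expected}")
```

Each input position is paired with the token that follows it, so a single sequence yields one prediction target per token.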


Now we fine-tune the language model. The default loss function is cross-entropy loss. We'll make sure to calculate the *Perplexity* metric, which is common in NLP for language models. Perplexity is the exponential of the loss. Accuracy in this case measures how often the model is able to predict the next word.
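
Since perplexity is the exponential of the cross-entropy loss, it can be recomputed by hand from any validation loss. The short sketch below does this for two `valid_loss` values reported in the training logs in this section; the function name `perplexity` is just for illustration and is not fastai's `Perplexity` metric class.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(loss) for a cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)

# valid_loss after the single frozen epoch, and after the last unfrozen epoch
for loss in (4.986378, 1.713612):
    print(f"loss={loss:.6f} -> perplexity={perplexity(loss):.3f}")
```

A lower perplexity means the model is, on average, less uncertain about which token comes next.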

In [None]:
## create the language model learner object
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()])

We will use *fit_one_cycle* in place of *fine_tune* so that we can save at intervals. Note that this process will take a while to run with larger datasets.

The *language_model_learner* function (in the above cell) automatically calls freeze when using a pretrained model (the default), so this first epoch only trains the embeddings: the part of the model that contains randomly initialized weights for words that are in our new training vocab but not in the pretrained model's vocab. When running the code below, you can specify a learning rate or leave it at fastai's default. We found that the default worked best here, but depending on your task, it may make sense to specify a lower or higher learning rate.

In [None]:
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.999808,4.986378,0.296875,146.405136,00:05


In [None]:
## save the model
filename = '1epoch'
learn.save(filename)

Path('models/1epoch.pth')

In [None]:
## load the model
learn = learn.load('1epoch')

Now let's unfreeze and continue to fine-tune.

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.304725,3.528572,0.411765,34.075275,00:03
1,3.460022,2.315832,0.560662,10.133351,00:02
2,2.944287,2.064255,0.592371,7.879424,00:02
3,2.635334,1.920541,0.607996,6.824649,00:03
4,2.425909,1.822892,0.616268,6.189732,00:03
5,2.315758,1.832099,0.626379,6.246984,00:02
6,2.197302,1.760237,0.634191,5.813812,00:02
7,2.101868,1.72851,0.632812,5.632258,00:02
8,2.022819,1.715351,0.633732,5.558628,00:03
9,1.960566,1.713612,0.634191,5.548969,00:04


Let's save the model again, this time leaving out the final layer that converts activations to probabilities of picking each token in our vocabulary. The model without this final layer is called the *encoder*.

In [None]:
## save the model
filename = 'finetuned'
learn.save_encoder(filename)

We have now trained the language model. This is the main difference between ULMFit and simply fine-tuning. Now all that is left is to fine-tune this model for our text classification task.

In [None]:
## change the vocab to our trained lm vocab
dls_clas = TextDataLoaders.from_df(mask_df, vocab=dls_lm.vocab, text_col='text', label_col='label', valid_pct=0.1)

In [None]:
## show batch, should include both text and label
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,xxbos [ mask ] [ mask ] xxmaj engineer in xxmaj test [ mask ],technology
1,xxbos xxmaj software xxmaj development xxmaj engineer [ mask ] xxmaj test ( sdet ),technology
2,xxbos xxmaj software xxmaj development xxmaj engineer [ mask ] xxmaj test ( sdet ),technology


Create the classification learner.

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy)

In [None]:
## load the encoder
learn = learn.load_encoder('finetuned')

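Use *gradual unfreezing* to unfreeze a few layers at a time. This helps prevent catastrophic forgetting, where the model forgets older tasks in order to learn newer ones. We start by training only the newly added classifier head, since the learner is frozen by default when using a pretrained model.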
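
The order in which layers become trainable in the cells that follow can be sketched with a plain list. This is an illustration of the idea rather than fastai's code: the five group names are an assumption about how the AWD-LSTM classifier is split, and `trainable_after_freeze_to` is a hypothetical helper.

```python
# Assumption: five parameter groups, named here for illustration only.
groups = ["embedding", "rnn_1", "rnn_2", "rnn_3", "classifier_head"]

def trainable_after_freeze_to(groups, n):
    """Groups left trainable when every group before index n is frozen."""
    return groups[n:]

# Head only (the default freeze), then the last two, the last three, then everything.
for n in (-1, -2, -3, 0):
    print(f"freeze_to({n}): {trainable_after_freeze_to(groups, n)}")
```

Each stage adds one more group from the end of the model, so the layers closest to the output adapt first and the earliest layers are touched last.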

In [None]:
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,time
0,1.552967,1.133472,0.631868,00:06


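The *freeze_to* method freezes every parameter group except the last few: *freeze_to(-2)* leaves only the last two groups trainable. The *fit_one_cycle* method ramps the learning rate up and then anneals it back down over the course of training. Passing a *slice* as the learning rate sets a range rather than a single value: the earliest trainable layers use the lower bound, the final layers use the upper bound, and the groups in between are spaced across that range.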
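
As a rough illustration of what the slice in the calls that follow represents, the sketch below spaces learning rates geometrically between the two bounds. It assumes five parameter groups (an assumption about how the AWD-LSTM classifier is split, not something stated in this notebook), and the helper name `spread_lrs` is hypothetical; it is not fastai's own code.

```python
def spread_lrs(lo, hi, n_groups):
    """Geometrically space n_groups learning rates from lo (earliest group) to hi (last group)."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]

# Same bounds as slice(1e-2/(2.6**4), 1e-2), spread over an assumed five groups.
lrs = spread_lrs(1e-2 / 2.6**4, 1e-2, 5)
for i, lr in enumerate(lrs):
    print(f"group {i}: lr = {lr:.6f}")
```

With five groups, dividing by 2.6**4 works out to a factor of 2.6 between neighbouring groups, which is the per-layer ratio suggested in the ULMFit paper. Earlier layers hold more general language knowledge, so they are updated more gently than the task-specific head.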

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,1.124627,0.723443,0.769231,00:05


In [None]:
## unfreeze more
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.846155,0.56997,0.81044,00:06


In [None]:
## unfreeze the whole model
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.681956,0.492608,0.851648,00:04
1,0.631067,0.474776,0.854396,00:04


In [None]:
## let's define a helper function for measuring the model's performance on our test set
def test_set_accuracy(test_df,learn,test_df_col_name='text'):
  test_dls = learn.dls.test_dl(test_df[test_df_col_name])
  ## grab the vocab from our learner so that we can map to text
  label_mapping = learn.dls.vocab[1]
  ## make predictions on the test dataset
  preds, _ = learn.get_preds(dl=test_dls)
  ## NOTE df.copy() is good practice, otherwise you're potentially modifying the original object
  preds_df = test_df.copy()
  ## argmax finds the predicted value (prediction=max value) for each multi-category prediction vector
  preds_df['predictions'] = preds.argmax(dim=-1)
  ## convert integer predictions to label values
  preds_df['predicted_label'] = preds_df['predictions'].map(lambda x: label_mapping[x])
  ## simple accuracy calc using pandas - TRUE/FALSE evaluates to 1/0 when using .mean()
  ## so taking average is a handy shortcut for calculating accuracy
  accuracy = (preds_df['predicted_label'] == preds_df['label']).mean()
  print(f"Accuracy: {accuracy}")

In [None]:
test_set_accuracy(preds_df,learn)

Accuracy: 0.795


In [None]:
## save the model if desired
filename = 'jobs_ulmfit'
learn.save(filename)

We found the ULMFit method with the fastai library and AWD-LSTM architecture to have a final test accuracy of ~80%, which is very similar to our best performance using the fastai library in the first study. Unfortunately, ULMFit did not improve our performance on this task. That doesn't mean it won't work well for other datasets and tasks. It has been shown to be an effective method for training classification models. Perhaps it works better for longer input strings like the Kaggle movie sentiment review task referenced in the fastai course notebook provided in Study I. Regardless of performance on this task, it is a useful method to know how to leverage, which is why we chose to include it in this study.

## Conclusion

In conclusion, we were able to further boost our model performance from Study I by 8 percentage points (from 81% to 89%) using the BERT transformer model instead of the AWD-LSTM model. Overall, we have now increased accuracy from the 60.5% baseline by nearly 30 percentage points. Not all NLP models are built the same way, and certain architectures will work better based on the data and problem at hand. In our case, the transformer architecture was a better fit for predicting the correct labels for our text classification problem. While the fastai library is a bit simpler and more user-friendly, being familiar and comfortable with other options is valuable. We have now explored working with various model architectures, have shown data augmentation techniques, and have shown how variations in training parameters and tokenizers can affect model performance. We can also reference the ULMFit method for potential use in the future. In the next study, instead of doing more model fine-tuning, we will look at working with Large Language Models like the GPT model series to see how they compare with what we've engaged with so far.

As a reminder, this series is meant to be a survey of techniques for improving performance of NLP classification tasks with small datasets.

Stay tuned for more case studies like this one and if there is something you would like to chat about, feel free to reach out at:

shane@stelerivers.com