<a href="https://colab.research.google.com/github/stele-and-rivers-001/study-series-nlp-1/blob/main/Deep_learning_and_data_augmentation_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Before beginning, we need to give a special thank you to Jeremy Howard and the team at fast.ai. This study was modeled after the lessons taught in their excellent Practical Deep Learning for Coders course.

Check out the course <a href="https://course.fast.ai/">here</a> - you won't regret it!

# Introduction

A large, clean dataset is usually a helpful ingredient for training and validating an AI model.

- Larger datasets provide more diverse examples for the model to learn from, which tends to lead to better pattern recognition and better performance.

- Small datasets can lead to *overfitting*, where the model memorizes all of the data within the training set, but performs poorly on data that it didn't see during training.

- More varied data allows the model to learn more robust representations, improving feature recognition and making it more resistant to noise or outliers. It can also reduce the effect of bias and skew in the data.
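
To make the overfitting point concrete, here is a deliberately extreme sketch: a "model" that is nothing more than a lookup table of its training examples. The job titles and labels are made up for illustration and are not rows from our dataset.

```python
# Toy illustration of overfitting: a "model" that only memorizes its training set.
# The job titles below are hypothetical examples, not rows from our dataset.
train = {
    "Financial Advisor": "finance",
    "Public Defender": "legal",
    "Registered Nurse": "healthcare",
}
unseen = {
    "Financial Analyst": "finance",
    "Defense Attorney": "legal",
    "Nurse Practitioner": "healthcare",
}

def memorizing_model(title):
    # Returns the stored label for an exact match, otherwise a fixed guess.
    return train.get(title, "finance")

def accuracy(examples):
    hits = [memorizing_model(title) == label for title, label in examples.items()]
    return sum(hits) / len(hits)

train_acc = accuracy(train)    # perfect, because every title was memorized
unseen_acc = accuracy(unseen)  # only the fixed guess can be right
print(train_acc, unseen_acc)
```

A real neural network is far less crude than this, but the pattern to watch for is the same: high accuracy on data it has seen and much lower accuracy on data it has not.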

In general, training and validation data are very important topics - <a href="https://www.fast.ai/posts/2017-11-13-validation-sets.html">this is a great starting point</a> for learning more.

Gaining access to a large, cleaned dataset can be difficult, especially when working in a previously unexplored domain. The world of AI is still relatively new and not every subject has a documented study to reference. For example, the underlying models that power ChatGPT were trained on over 45 terabytes of text data, including the entirety of Wikipedia, which is clearly not a realistic option for everyday AI practitioners. Luckily, companies like fast.ai and Hugging Face have libraries of pretrained models available for use. These models can then be adapted to a problem's specific domain via *transfer learning* (note: we'll compare this approach with using general models like GPT later in this series). Although the data requirements are much smaller for transfer learning, it still takes time to gather properly labeled and formatted data. So, are there ways for practitioners to do more with less?

In this study, we are going to take a look at how to improve the performance of a text classification model trained on a small dataset of <b>fewer than 1,000 items</b>. Since the focus will be on improving model performance by changing techniques around the same dataset, we won't dive too deeply into the weeds with all of the AI topics and terminology. We will show important portions of the code and outputs at certain stages, but will skip over some of the boilerplate lines of code to make this an accessible read for everyone. In future installments of this series, we hope to explore many other subjects at a deeper level.

## Install libraries, import data, instantiate dataloaders object

In [None]:
## installing fastai library
! pip install -Uqq fastai

In [None]:
## importing packages
from pathlib import Path
from fastai.text.all import *
import pandas as pd

Note on the below cell: you can find links to the datasets that we used <a href="https://drive.google.com/drive/folders/10oeVWwIHQ8pqbZktgwnegyNid_I3w-ho">here</a>.

To run this notebook, you can 1) add the datasets to your own Google Drive and use the code below to mount your drive, OR 2) access them via a path to the correct folder, as shown below. <b>The code below will not work without adding the data to the proper folder.</b>

In [None]:
## note that we have the files loaded temporarily into our working directory
!ls ./*.csv

./test_data.csv  ./training_data.csv


In [None]:
## APPROACH 1 - code for mounting Google drive below (commented out)
# from google.colab import drive
# drive.mount('/content/drive')

## APPROACH 2 - path to folder
data_path = Path('./')

Let's import our training data. This dataset is a csv file with two columns: "label" and "text". Label is the category in which the text is classified. For this case study, we have 8 categories and 950 total data items. For simplicity, we asked ChatGPT to provide 125 job titles in 8 different industries. In this text classification study, we will split the data into training and validation sets and keep an unseen dataset of size 200 as our test data. The final test will be for our model to predict the unseen job titles, which we'll use to measure model accuracy. This will reflect how the model might perform in a real-life scenario where it needs to classify data into our categories.

In [None]:
train_df = pd.read_csv(data_path / 'training_data.csv', sep='|')
test_df = pd.read_csv(data_path / 'test_data.csv', sep='|')
train_df.columns

Index(['label', 'text'], dtype='object')

Below, we will take a look at the data format and some sample data points. Example: "Location Scout Manager" is a job in the "Drama & Arts" sector.

In [None]:
# train_df.columns = ['label','text']
# test_df.columns = ['label','text']
train_df.head()

Unnamed: 0,label,text
0,education,Assessment Specialist
1,drama_arts,Location Scout Manager
2,healthcare,Health Information Technician
3,technology,Technical Recruiter
4,finance,Financial Advisor Associate


It's important to make sure that our data is clean. Properly labeled data without missing values helps avoid outliers or errors in the weights of the model. This is a major issue for most companies trying to integrate AI into their tech stack. Access to all the data in the world doesn't help if the data is not labeled: without a properly labeled and organized dataset, there is no way to train a classifier like ours. For a sense of scale, the popular data science platform Kaggle hosts a dataset of movie reviews with 100,000 reviews labeled as positive or negative. Think about how long it would take to manually label just 1,000 items, let alone 100,000. Over time, as data is added, this problem will fade. Getting started, however, is a tall task.

After cleaning the data, we found 22 duplicates which were removed (we used ChatGPT to generate our sample data, so this isn't shocking). We separated another 200 into our test dataset and are left with 728 items in the train and validate sets.
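
The cleaning code itself is not shown in this notebook, so the following is only a minimal, dependency-free sketch of the two steps described above: dropping exact duplicates and holding out a fixed-size test set. The rows are synthetic placeholders and the helper name `dedupe_and_split` is ours; only the counts (950 rows, 22 duplicates, 200 test items) mirror the numbers quoted above.

```python
import random

def dedupe_and_split(rows, test_size, seed=42):
    """Drop exact duplicate rows (keeping first occurrences), then hold out a test set."""
    unique_rows = list(dict.fromkeys(rows))   # dicts preserve insertion order
    rng = random.Random(seed)                 # local RNG, leaves the global state alone
    rng.shuffle(unique_rows)
    return unique_rows[test_size:], unique_rows[:test_size]

# Synthetic stand-in for the raw data: 950 rows, 22 of which repeat an earlier row.
raw = [("label", f"title {i}") for i in range(928)]
raw += raw[:22]

train_val, test = dedupe_and_split(raw, test_size=200)
print(len(raw), len(train_val), len(test))
```

Shuffling before slicing keeps the held-out items from being biased toward whatever order the generated data happened to arrive in.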

In [None]:
train_df.describe()

Unnamed: 0,label,text
count,728,728
unique,8,728
top,retail_hospitality,Assessment Specialist
freq,103,1


In [None]:
## show unique labels to ensure no typos or missing categories
unique_labels = train_df['label'].unique()
print(unique_labels)

['education' 'drama_arts' 'healthcare' 'technology' 'finance'
 'marketing_advertising' 'retail_hospitality' 'legal']


We will set our random seed to 42 so that randomness does not distort the comparison of model performance across the various techniques. Randomness can have a larger effect with a smaller sample size: think of different shuffles, different train/validation splits, and so on. These issues might fade with 100,000 data points, but for this scenario we want to ensure randomness is limited.
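
As a rough illustration of what the seed buys us, the sketch below shuffles item indices and cuts off a 20% validation slice. It is a stand-in written with the standard library, not fastai's internal splitting code, and the function name `split_indices` is ours.

```python
import random

def split_indices(n_items, valid_pct=0.2, seed=None):
    """Shuffle item indices and cut off a validation slice."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(n_items * valid_pct)
    return idxs[cut:], idxs[:cut]   # (train, valid)

train_a, valid_a = split_indices(728, seed=42)
train_b, valid_b = split_indices(728, seed=42)
train_c, valid_c = split_indices(728, seed=7)

print(valid_a == valid_b)   # same seed, same split
print(valid_a == valid_c)   # different seed, almost certainly a different split
print(len(valid_a))         # size of the validation slice
```

With only around 145 validation items, swapping a handful of titles between the two sides can move the accuracy by a few points, which is why we pin the split before comparing techniques.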

In [None]:
dls = TextDataLoaders.from_df(train_df, label_col='label', text_col='text', seed=42)

In [None]:
## here's a sample batch from our TextDataLoaders
x,y = dls.one_batch()
dls.show_batch()

Unnamed: 0,text,category
0,xxbos xxmaj software xxmaj development xxmaj engineer xxunk xxmaj xxunk ( xxunk ),technology
1,xxbos xxmaj xxunk / xxmaj xxunk ( xxunk / xxup xxunk ),healthcare
2,xxbos xxmaj xxunk xxmaj xxunk xxmaj educational xxmaj content xxmaj xxunk,education
3,xxbos xxmaj xxunk and xxmaj xxunk ( xxunk ) xxmaj analyst,finance
4,xxbos xxmaj digital xxmaj xxunk xxmaj technician ( xxunk ),drama_arts
5,xxbos xxmaj certified xxmaj xxunk xxmaj assistant ( xxunk ),healthcare
6,xxbos xxmaj xxunk xxmaj xxunk xxmaj general ( xxunk ),legal
7,xxbos xxmaj certified xxmaj medical xxmaj assistant ( xxunk ),healthcare
8,xxbos xxmaj chief xxmaj financial xxmaj officer ( xxunk ),finance
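
The many `xxunk` entries above are what we would expect from a small dataset: fastai only keeps tokens that occur a minimum number of times (`min_freq=3` by default in the versions we are aware of) and replaces everything else with `xxunk`. The sketch below is a very rough imitation of that behavior with made-up titles. It is not fastai's tokenizer, and it only mimics three of the markers (`xxbos`, `xxmaj`, `xxunk`).

```python
from collections import Counter

def toy_tokenize(text):
    """Very rough imitation of the markers seen above (not fastai's tokenizer)."""
    tokens = ["xxbos"]                      # marks the beginning of a text
    for word in text.split():
        if word[0].isupper():
            tokens.append("xxmaj")          # flags that the next word was capitalized
        tokens.append(word.lower())
    return tokens

def build_vocab(token_lists, min_freq=3):
    """Keep only tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, c in counts.items() if c >= min_freq}

# Made-up sample titles for illustration
titles = ["Software Engineer", "Data Engineer", "Sales Engineer", "Location Scout"]
tokenized = [toy_tokenize(t) for t in titles]
vocab = build_vocab(tokenized)

# Anything outside the vocabulary is shown as xxunk
shown = [[tok if tok in vocab else "xxunk" for tok in toks] for toks in tokenized]
for row in shown:
    print(" ".join(row))
```

In this toy sample only "engineer" appears three times, so every other word collapses into `xxunk`. The augmentation techniques later in this study repeat words more often, which should leave fewer of them below the cutoff.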


## Baseline performance: create a learner object using fastai and fine-tune a model

Training time! We're now going to fine-tune the <a href="https://arxiv.org/pdf/1708.02182.pdf">AWD_LSTM</a> model.

This is our first attempt at fine tuning the model without any alterations to the data or learner parameters, so we will treat this as our baseline performance to (hopefully) improve upon.

In [None]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4)

epoch,train_loss,valid_loss,accuracy,time
0,2.172464,1.980982,0.386207,00:02


epoch,train_loss,valid_loss,accuracy,time
0,1.710424,1.79129,0.551724,00:01
1,1.567279,1.410868,0.62069,00:01
2,1.457842,1.212973,0.662069,00:01
3,1.376143,1.136232,0.655172,00:01


A validation accuracy of ~66% is a decent start, but there is plenty of room for improvement. The validation loss of > 1 is also high. Since the training loss is higher still, this points to a model that has not yet learned the task well rather than to overfitting.

Separately, taking a look at the following code block will be helpful in understanding the function for testing accuracy that comes after it.

In [None]:
## create a test dataloader from the test_df 'text' column by calling the test_dl method on the learner object
test_dls = learn.dls.test_dl(test_df['text'])
## retrieve the vocab and select the second element, which is the labels
## predictions will be in integer format (0-7), not the text strings that the labels are currently formatted in the dataset
## label_mapping will be used to map integer predictions to the text label values, see samples below
label_mapping = learn.dls.vocab[1]

print("Label Mapping:")
for idx, label in enumerate(label_mapping):
    print(f"Index {idx}: Label '{label}'")

Label Mapping:
Index 0: Label 'drama_arts'
Index 1: Label 'education'
Index 2: Label 'finance'
Index 3: Label 'healthcare'
Index 4: Label 'legal'
Index 5: Label 'marketing_advertising'
Index 6: Label 'retail_hospitality'
Index 7: Label 'technology'
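
Before wrapping this in a helper, here is a dependency-free sketch of the three steps involved: take the index of the largest value in each prediction vector, map that index to a label using the mapping above, and average the True/False matches. The probability vectors and true labels are invented for illustration.

```python
labels = ["drama_arts", "education", "finance", "healthcare",
          "legal", "marketing_advertising", "retail_hospitality", "technology"]

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical model outputs: one probability vector per test item.
probs = [
    [0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.10, 0.05],   # highest at index 2
    [0.02, 0.03, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05],   # highest at index 3
    [0.10, 0.10, 0.10, 0.10, 0.40, 0.10, 0.05, 0.05],   # highest at index 4
]
true_labels = ["finance", "healthcare", "technology"]

predicted = [labels[argmax(p)] for p in probs]
correct = [p == t for p, t in zip(predicted, true_labels)]
accuracy = sum(correct) / len(correct)   # True counts as 1, False as 0
print(predicted, accuracy)
```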


In [None]:
## let's define a helper function for measuring the model's performance on our test set
def test_set_accuracy(test_df,learn,test_df_col_name='text'):
  test_dls = learn.dls.test_dl(test_df[test_df_col_name])
  ## grab the vocab from our learner so that we can map to text
  label_mapping = learn.dls.vocab[1]
  ## make predictions on the test dataset
  preds, _ = learn.get_preds(dl=test_dls)
  ## NOTE df.copy() is good practice, otherwise you're potentially modifying the original object
  preds_df = test_df.copy()
  ## argmax finds the predicted value (prediction=max value) for each multi-category prediction vector
  preds_df['predictions'] = preds.argmax(dim=-1)
  ## convert integer predictions to label values
  preds_df['predicted_label'] = preds_df['predictions'].map(lambda x: label_mapping[x])
  ## simple accuracy calc using pandas - TRUE/FALSE evaluates to 1/0 when using .mean()
  ## so taking average is a handy shortcut for calculating accuracy
  accuracy = (preds_df['predicted_label'] == preds_df['label']).mean()
  print(f"Accuracy: {accuracy}")

In [None]:
test_set_accuracy(test_df,learn)

Accuracy: 0.605


When we test the baseline model on the unseen test data, we see our accuracy is a bit lower at 60.5%. Let's try to improve it by introducing some variance into our dataset.

## Data augmentation

Let's focus on data augmentation techniques first. Data augmentation artificially increases the size of the dataset by applying various transformations to the existing data. These techniques introduce diversity into the data, which is important when working with a smaller dataset.

In [None]:
from fastai.text.all import *

### Technique: duplicate the data

One easy method to increase the size of the dataset is to duplicate it. While this sounds simple enough, it can lead to overfitting. It might show better performance on the training and validation sets, but our final test involves using the model on brand new, unseen data. Simply feeding it more of the same thing might not help much in the long run.

After playing around with different numbers of copies, we found 5 to be the preferred multiplier, with 3 training epochs. Feel free to test out different combinations of duplication sizes and training epochs to see if you can achieve a better result.

In [None]:
## first duplicate the dataset multiple times to allow for unaltered data post augmentation
num_copies = 5
dup_df = pd.concat([train_df.copy()]*num_copies, ignore_index=True)
dup_df.describe()

Unnamed: 0,label,text
count,3640,3640
unique,8,728
top,retail_hospitality,Assessment Specialist
freq,515,5


In [None]:
dls = TextDataLoaders.from_df(dup_df, label_col='label', text_col='text', seed=42)



In [None]:
## now we fine tune AWD_LSTM with our new approach
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(3)

epoch,train_loss,valid_loss,accuracy,time
0,1.576521,0.79703,0.817308,00:04


epoch,train_loss,valid_loss,accuracy,time
0,0.930728,0.416555,0.892857,00:05
1,0.741077,0.287379,0.916209,00:04
2,0.618313,0.250166,0.932692,00:04


Just by duplicating the data five times, we have increased our validation accuracy to ~93%. This improvement looks great, but remember, we have a final unseen set of test data. Because the validation set is drawn at random from the duplicated data, most validation items also have copies in the training set, so this number is inflated. We most likely have overfitting occurring and will see that come to light when we run this model on our test set.

In [None]:
test_set_accuracy(test_df,learn)

Accuracy: 0.76


As you can see, our test accuracy improved to 76%, compared to our baseline of ~61%, which is a huge improvement. Still, the gap between our validation accuracy of ~93% and our test accuracy of 76% makes it clear that some overfitting is happening. Data augmentation is an effective way to address this issue.
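
One likely contributor to that gap is leakage: once the dataset has been duplicated, a random train/validation split places copies of the same title on both sides. The sketch below estimates how often that happens, using integer ids as stand-ins for job titles and assuming a random 80/20 split (fastai's default `valid_pct` is 0.2). The function name `leaked_fraction` is ours.

```python
import random

def leaked_fraction(n_unique=728, num_copies=5, valid_pct=0.2, seed=42):
    """Share of validation items whose exact text also sits in the training split."""
    rng = random.Random(seed)
    items = list(range(n_unique)) * num_copies   # ids stand in for job titles
    rng.shuffle(items)
    cut = int(len(items) * valid_pct)
    valid, train = items[:cut], items[cut:]
    train_ids = set(train)
    return sum(v in train_ids for v in valid) / len(valid)

print(f"with 5 copies: {leaked_fraction():.3f}")
print(f"with 1 copy:   {leaked_fraction(num_copies=1):.3f}")
```

With five copies, nearly every validation item has a twin in the training set, so the validation accuracy mostly measures memorization. The held-out test set does not share this problem, which is why we treat it as the number that matters.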

Disclaimer: Be careful with duplicating too many times. In other studies, I have found duplicating the dataset worsens performance. It works well in this scenario, but that is not always the case! Different datasets and different tasks behave differently. This study is an exploration of the different tools and techniques available when working with limited data quantity.

### Technique: data masking

In [None]:
from fastai.text.all import *

Data masking (with tokens) introduces noise and variability into the training data. It selectively obscures or modifies portions of the input text data with a special token [MASK] to prevent the model from memorizing specific patterns. It encourages the model to learn more robust and generalizable representations of language.

There are a few parameters to experiment with in this function. For conciseness, I will show the ones I found to work best: a mask probability of 25% on each token, applied to the dataset 4 times. With a small dataset, it would not make sense to mask the data only once and train on that alone, as it would alter our limited amount of text. Masking works well on longer strings of text, such as a sentence of 10+ words, where the model can still pick out the key words. On a single word or short phrase, masking could hide the entire text. This is why we keep one set of our data unchanged and concatenate the masked copies, which gives us a mix of unaltered and masked data and increases our overall size.
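
To put numbers on the short-phrase concern, the sketch below computes two probabilities for a title of a given length, assuming each word is masked independently with probability 0.25: the chance that every word is masked, and the chance that the "masked" copy comes out unchanged. The function names are ours.

```python
def prob_all_masked(num_words, mask_prob=0.25):
    """Chance that every word of a title is replaced when words are masked independently."""
    return mask_prob ** num_words

def prob_unchanged(num_words, mask_prob=0.25):
    """Chance that a masked copy comes out identical to the original."""
    return (1 - mask_prob) ** num_words

for k in (1, 2, 3, 4):
    print(k, round(prob_all_masked(k), 4), round(prob_unchanged(k), 4))
```

For a two-word title, about 6% of masked copies lose both words and about 56% are left untouched, so with short job titles a good share of the augmented rows are either fully hidden or plain duplicates.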

In [None]:
## introduce data masking to increase size of dataset and force model to focus on important words and interpret others
import random

def mask_text(sentence, mask_token="[MASK]", mask_prob=0.25):
    words = sentence.split()
    masked_words = []
    for word in words:
        ## Apply masking with probability mask_prob
        if random.random() < mask_prob:
            masked_words.append(mask_token)
        else:
            masked_words.append(word)
    return " ".join(masked_words)

## set the dataset
original_data = train_df.copy()

## Function to apply text masking and create augmented rows
def augment_data_with_text_masking(data, num_augmented_rows=4, mask_token="[MASK]", mask_prob=0.25):
    augmented_data = []
    for index, row in data.iterrows():
        original_text = row['text']
        for _ in range(num_augmented_rows):
            masked_text = mask_text(original_text, mask_token=mask_token, mask_prob=mask_prob)
            augmented_data.append({'text': masked_text, 'label': row['label']})
    return pd.DataFrame(augmented_data)

## Apply text masking and create augmented dataset
augmented_data = augment_data_with_text_masking(original_data)

## Concatenate original dataset and augmented dataset
mask_dataset = pd.concat([original_data, augmented_data], ignore_index=True)

## Create masked dataset
mask_df = mask_dataset.copy()

In [None]:
## show some sample [MASK] tokens on the df
mask_df.tail()

Unnamed: 0,label,text
3635,finance,Financial Risk Consultant
3636,legal,Public Defender
3637,legal,Public Defender
3638,legal,Public Defender
3639,legal,Public [MASK]


In [None]:
#hide
mask_df.describe()

Unnamed: 0,label,text
count,3640,3640
unique,8,1434
top,retail_hospitality,[MASK] [MASK]
freq,515,106


In [None]:
dls = TextDataLoaders.from_df(mask_df, label_col='label', text_col='text', seed=42)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4)



epoch,train_loss,valid_loss,accuracy,time
0,1.682657,1.111586,0.65522,00:04


epoch,train_loss,valid_loss,accuracy,time
0,1.263887,0.820902,0.717033,00:05
1,1.0945,0.683007,0.778846,00:04
2,0.974719,0.625035,0.806319,00:04
3,0.890392,0.60584,0.817308,00:05


In [None]:
test_set_accuracy(test_df,learn)

Accuracy: 0.8


While our validation accuracy is only ~82%, the dropoff to the test accuracy of 80% is minimal. The gap that pointed to overfitting has nearly closed, and test performance improved from 76% to 80%.

### Technique: random insertion

In [None]:
from fastai.text.all import *

Let's try a few more data augmentation techniques. Next up, random insertion.

Random insertion introduces variability into the training data by randomly inserting additional words into sequences. This helps increase diversity and improve performance on unseen data. It helps expose the model to a wider range of linguistic structures and patterns, making it more adaptable.

Here we will apply an insertion probability of 25% with an insertion value of 1. After trying several insertion values, we found 1 to work best, which intuitively makes sense: inserting more than one word at a time into a short word or phrase would be too drastic, and the data would lose meaning. We will apply the augmentation function to the duplicated dataset so that we include the improved performance from that technique.

In [None]:
#hide
import random
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
## implement random insertion to add noise
## n = number of insertions
## NOTE: with the default seed=42 the global random generator is reseeded on every call,
## so the same title always receives the same insertion; pass seed=None to avoid this
def random_insertion(sentence, n=1, seed=42):
    if seed is not None:
        random.seed(seed)

    words = word_tokenize(sentence)
    for _ in range(n):
        word_to_insert = random.choice(words)
        words.insert(random.randint(0, len(words)), word_to_insert)
    return ' '.join(words)

## Apply random insertion to a subset of values in the DataFrame
rand_ins_df = dup_df.copy()
apply_insertion_probability = 0.25  ## Adjust this probability as needed
for index, row in rand_ins_df.iterrows():
    if random.random() < apply_insertion_probability:
        ## write through .at: assigning to the row yielded by iterrows() is not guaranteed to update the DataFrame
        rand_ins_df.at[index, 'text'] = random_insertion(row['text'])

rand_ins_df.reset_index(drop=True, inplace=True)
rand_ins_df.describe()

Unnamed: 0,label,text
count,3640,3640
unique,8,730
top,retail_hospitality,Manager Influencer Marketing Manager
freq,515,5


In [None]:
## sample of what random insertion looks like in the data
rand_ins_df.head()

Unnamed: 0,label,text
0,education,Assessment Assessment Specialist
1,drama_arts,Location Scout Manager
2,healthcare,Technician Health Information Technician
3,technology,Technical Technical Recruiter
4,finance,Financial Advisor Associate


In [None]:
dls = TextDataLoaders.from_df(rand_ins_df, label_col='label', text_col='text', seed=42)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4)



epoch,train_loss,valid_loss,accuracy,time
0,1.562065,0.773424,0.778846,00:05


epoch,train_loss,valid_loss,accuracy,time
0,0.906862,0.492547,0.848901,00:05
1,0.748815,0.324971,0.895604,00:07
2,0.597869,0.24611,0.910714,00:06
3,0.517832,0.252703,0.914835,00:06


In [None]:
test_set_accuracy(test_df,learn)

Accuracy: 0.785


Random insertion leads to a test accuracy between duplication and masking. Try applying different probabilities to the function and see how it changes performance. For example, we tested duplicating the dataset 8 times and applying a 25% insertion probability which gave us the highest validation accuracy, but a lower test accuracy. This goes to show that validation accuracy is not the most important metric to optimize.
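
One detail worth knowing if you experiment with the probabilities: `random_insertion` reseeds Python's global random generator on every call through its default `seed=42`, which also resets the stream used for the `random.random()` draw in the loop. The sketch below is a possible variant that creates a single generator up front instead. It uses a plain whitespace split in place of `word_tokenize` to stay dependency-free, and we have not re-run the experiments above with it.

```python
import random

def random_insertion(words, rng, n=1):
    """Return a copy of `words` with n existing words re-inserted at random positions."""
    words = list(words)
    for _ in range(n):
        words.insert(rng.randint(0, len(words)), rng.choice(words))
    return words

def augment(titles, apply_prob=0.25, seed=42):
    rng = random.Random(seed)   # created once, so every draw advances the same stream
    out = []
    for title in titles:
        words = title.split()   # whitespace split stands in for word_tokenize
        if words and rng.random() < apply_prob:
            words = random_insertion(words, rng)
        out.append(" ".join(words))
    return out

titles = ["Assessment Specialist", "Location Scout Manager", "Technical Recruiter"] * 100
augmented = augment(titles)
changed = sum(a != t for a, t in zip(augmented, titles))
print(f"{changed} of {len(titles)} titles changed")
```

Because the generator is seeded once, the whole run is still reproducible, but different copies of the same title can now receive different insertions.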

### Technique: random deletion

In [None]:
from fastai.text.all import *

Random deletion does... you guessed it, the opposite of random insertion. On a dataset with short words or phrases, this could have an adverse effect by removing too much information. Let's check it out anyway:

In [None]:
## implement random deletion to focus on most important words
## p =  prob that a word is deleted
def random_deletion(sentence, p=0.25):
    words = word_tokenize(sentence)
    words = [word for word in words if random.uniform(0, 1) > p]
    return ' '.join(words)

In [None]:
rand_del_df = dup_df.copy()
rand_del_df['text'] = rand_del_df['text'].apply(random_deletion)

In [None]:
dls = TextDataLoaders.from_df(rand_del_df, label_col='label', text_col='text', seed=42)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4)



epoch,train_loss,valid_loss,accuracy,time
0,1.73814,1.16945,0.614011,00:05


epoch,train_loss,valid_loss,accuracy,time
0,1.278887,0.912356,0.690934,00:04
1,1.144541,0.721705,0.759615,00:05
2,1.021388,0.668705,0.788462,00:04
3,0.918061,0.674042,0.782967,00:07


We were correct: the validation accuracy drops off. Still, ~78% is pretty good, and the lower number could be a sign of less overfitting, so let's check the test accuracy.

In [None]:
test_set_accuracy(test_df,learn)

Accuracy: 0.78


There is nearly no dropoff between the validation and test accuracy. As with masking, random deletion seems to reduce overfitting on this dataset.
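
One edge case to keep in mind with short titles: at `p=0.25`, a two-word title loses both words about 6% of the time (0.25 × 0.25), which leaves an empty string attached to a label. The sketch below shows one possible guard, falling back to a single randomly chosen word. The fallback rule is our choice for illustration, the whitespace split stands in for `word_tokenize`, and the results above were produced without it.

```python
import random

def random_deletion(words, rng, p=0.25):
    """Drop each word with probability p, but never return an empty title."""
    kept = [w for w in words if rng.random() > p]
    if not kept:                       # everything was deleted
        kept = [rng.choice(words)]     # fall back to one randomly chosen word
    return kept

rng = random.Random(42)
titles = ["Public Defender", "Chief Financial Officer", "Radiologist"]   # sample titles
results = [" ".join(random_deletion(t.split(), rng)) for t in titles * 200]
print(sum(r == "" for r in results), "empty titles")
```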

### Technique: random swapping

In [None]:
from fastai.text.all import *

Random swapping picks a pair of words and swaps them within a sentence or phrase. This creates variation in the order of the data while maintaining the overall context of the text. Different word arrangements could be important with our limited data.

In [None]:
## implement random swap to create variations in word order
## n = number of swaps to perform (each swap exchanges one pair of words)
## ensure at least 3 words in the object before performing swapping
def random_swap(sentence, n=2, seed=42):
    random.seed(seed)
    words = word_tokenize(sentence)

    ## Check if there are at least 3 words
    if len(words) >= 3:
        for _ in range(n):
            idx1, idx2 = random.sample(range(len(words)), 2)
            words[idx1], words[idx2] = words[idx2], words[idx1]
        return ' '.join(words)
    else:
        ## Return the original sentence if there are fewer than 3 words
        return sentence

In [None]:
rand_swap_df = dup_df.copy()
rand_swap_df['text'] = rand_swap_df['text'].apply(random_swap)

In [None]:
dls = TextDataLoaders.from_df(rand_swap_df, label_col='label', text_col='text', seed=42)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(3)



epoch,train_loss,valid_loss,accuracy,time
0,1.654608,0.852613,0.769231,00:04


epoch,train_loss,valid_loss,accuracy,time
0,1.062303,0.501862,0.868132,00:05
1,0.852094,0.326176,0.916209,00:04
2,0.705763,0.314054,0.931319,00:05


This is pretty solid performance at ~93% validation accuracy.

In [None]:
test_set_accuracy(test_df,learn)

Accuracy: 0.725


But the test accuracy of ~73% is lower than what we were hoping for. This is okay! Not all of these techniques will be optimal for every dataset. This is an example of why it is important to run multiple experiments--it's hard to predict what will work best for your application.
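
A possible factor in this result: `random_swap` calls `random.seed(seed)` with the same default on every call, so every title with the same number of words is rearranged in exactly the same way, and all five duplicated copies of a title come out identical. The sketch below replays the swaps on position indices to make that visible. The helper names are ours, and a local `random.Random` stands in for the global generator.

```python
import random

def swap_positions(num_words, n=2, seed=42):
    """Replay the swaps on position indices to see where each word ends up."""
    rng = random.Random(seed)           # fresh generator with the same seed every call
    order = list(range(num_words))
    for _ in range(n):
        i, j = rng.sample(range(num_words), 2)
        order[i], order[j] = order[j], order[i]
    return order

def apply_order(words, order):
    return [words[k] for k in order]

order = swap_positions(3)
print(order)
print(apply_order(["Location", "Scout", "Manager"], order))
print(apply_order(["Chief", "Financial", "Officer"], order))
```

Both three-word titles are rearranged by the same `order`. Seeding once outside the function (or passing in a shared generator) would give each copy its own arrangement, at the cost of results that differ from the ones reported above.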

Besides data augmentation, there are many other tips and tricks to improve performance within the fastai library. Let's take our best performing model and try some of these.


## Technique: weight decay

In [None]:
from fastai.text.all import *


Up first we will add weight decay to the learner. Weight decay is a regularization technique used during training to prevent overfitting. It works by adding a penalty term to the loss function during training. Essentially, weight decay encourages the model to learn simpler patterns by penalizing large parameter values.
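
As a minimal sketch of what that penalty does, consider one plain SGD step where the gradient of the data loss is zero, so that only the penalty term acts on the weights. This uses ordinary SGD for clarity; fastai's `Adam` applies weight decay in its own way, so treat this as an illustration of the idea rather than of the library's implementation.

```python
def sgd_step(weights, grads, lr=0.1, wd=0.0):
    """One plain SGD update; wd adds the gradient of (wd / 2) * sum(w ** 2)."""
    return [w - lr * (g + wd * w) for w, g in zip(weights, grads)]

weights = [4.0, -0.5]
grads = [0.0, 0.0]   # pretend the data loss is flat here, to isolate the penalty

no_decay = sgd_step(weights, grads, wd=0.0)
with_decay = sgd_step(weights, grads, wd=0.1)
print(no_decay)     # unchanged
print(with_decay)   # both weights pulled toward zero, the large one by more
```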

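We found weight decay to be counterproductive on the base unaltered dataset, so let's try adding it to the top performing dataset thus far. Note that fastai's `Adam` already applies some weight decay by default (`wd=0.01` in the versions we are aware of), so the code below effectively sets the strength to `1e-4` rather than switching weight decay on.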

In [None]:
## Add weight decay to the optimizer
dls = TextDataLoaders.from_df(mask_df, label_col='label', text_col='text', seed=42)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.opt_func = partial(Adam, wd=1e-4)
learn.fine_tune(4)



epoch,train_loss,valid_loss,accuracy,time
0,1.655415,1.062761,0.657967,00:06


epoch,train_loss,valid_loss,accuracy,time
0,1.237568,0.821037,0.723901,00:04
1,1.092565,0.639611,0.802198,00:04
2,0.954315,0.577896,0.815934,00:05
3,0.868232,0.570183,0.818681,00:04


In [None]:
test_set_accuracy(test_df,learn)

Accuracy: 0.81


While adding weight decay doesn't drastically change things, it does give a slight boost to our performance, hitting an all-time high of 81%.
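
When comparing numbers this close, it helps to remember that the test set has only 200 items, so one percentage point corresponds to two job titles. A quick way to gauge the noise is the binomial standard error of an accuracy estimate, sketched below. The function name is ours, and the formula assumes the test items are independent.

```python
import math

def accuracy_std_error(acc, n_items):
    """Binomial standard error of an accuracy measured on n_items examples."""
    return math.sqrt(acc * (1 - acc) / n_items)

n_test = 200
for name, acc in [("masking", 0.80), ("masking + weight decay", 0.81)]:
    se = accuracy_std_error(acc, n_test)
    print(f"{name}: {acc:.3f} +/- {se:.3f} ({round(acc * n_test)} of {n_test} correct)")
```

The standard error here is close to three percentage points, so differences of a point or two between techniques should be read as suggestive rather than conclusive.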

## Technique: batch size

In [None]:
from fastai.text.all import *

A *batch* refers to a set of training examples that are processed together during one iteration of the training algorithm. Instead of updating the model's parameters after processing each individual example (which would be computationally inefficient), batches allow for more efficient processing by updating the parameters once per batch.

The batch size is the number of training examples processed in one iteration. For example, a batch size of 32 means that the model will process 32 training examples at a time before updating its parameters.

Larger batch sizes generally lead to faster training because they exploit more parallelism and utilize hardware more efficiently. However, larger batch sizes may require more memory, and they may not generalize as well as smaller batch sizes. Smaller batch sizes can lead to slower training but may generalize better and allow for more exploration of the parameter space.
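
To see how the batch size changes the number of parameter updates, here is a small sketch that chunks a training set into batches and counts them. The training-set size of 2,912 is an assumption (80% of the 3,640 augmented rows), and exact counts inside fastai may differ slightly, for example if an incomplete final batch is dropped.

```python
def batches(items, bs):
    """Yield consecutive chunks of at most bs items."""
    for start in range(0, len(items), bs):
        yield items[start:start + bs]

n_train = 2912   # assumed: 80% of the 3,640 augmented rows
items = list(range(n_train))

for bs in (64, 12):
    n_updates = sum(1 for _ in batches(items, bs))
    print(f"bs={bs}: {n_updates} parameter updates per epoch")
```

Going from 64 to 12 items per batch gives roughly five times as many updates per epoch, each computed from a noisier estimate of the gradient.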

fastai uses a default batch size of 64 when creating the dataloaders object (dls), unless we pass a different value. We can adjust the batch size manually to see how that affects performance.

After playing around with different batch sizes, we found the fastai default to provide the best performance. For illustration, the sample code below uses our best performing dls from masking, includes weight decay, and sets a batch size of 12.

In [None]:
dls = TextDataLoaders.from_df(mask_df, label_col='label', text_col='text', seed=42, bs=12)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.opt_func = partial(Adam, wd=1e-4)
learn.fine_tune(4)



epoch,train_loss,valid_loss,accuracy,time
0,1.526844,1.033561,0.657967,00:07


epoch,train_loss,valid_loss,accuracy,time
0,1.197656,0.774163,0.741758,00:09
1,0.954373,0.618713,0.815934,00:08
2,0.73233,0.535005,0.826923,00:09
3,0.762653,0.51464,0.837912,00:09


In [None]:
test_set_accuracy(test_df,learn)

Accuracy: 0.805


## Technique: adjust the tokenizer

In [None]:
from fastai.text.all import *

fastai uses a default tokenizer, which works fairly well. However, model performance can change with a different tokenizer, and other libraries like Hugging Face offer a variety to choose from. We want to use all of the resources at our disposal and leave no stone unturned! Since our data consists of short phrases or words, let's try a character-based tokenizer (the fastai default is a word-level tokenizer built on spaCy). This will make each character its own token, increasing the number of tokens per input.
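
As a quick illustration of how much longer the inputs become, the sketch below compares a word-level split with a character-level split of one job title. Both functions are simplified stand-ins written for this comparison, not the tokenizers used in the cells that follow.

```python
def word_tokens(text):
    return text.split()

def char_tokens(text):
    # Split into words first, then into characters, so spaces are not kept as tokens.
    return [ch for word in text.split() for ch in word]

title = "Speech Pathologist"
print(len(word_tokens(title)), word_tokens(title))
print(len(char_tokens(title)), char_tokens(title))
```

Two word tokens become seventeen character tokens, and because the spaces are dropped the model no longer sees where one word ends and the next begins.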

In [None]:
! pip install -Uqq tokenizers
! pip install -Uqq transformers torch torchvision

[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastai 2.7.14 requires torch<2.3,>=1.10, but you have torch 2.3.0 which is incompatible.
torchaudio 2.2.1+cu121 requires torch==2.2.1, but you have torch 2.3.0 which is incompatible.
torchtext 0.17.1 requires t

In [None]:
class CharacterTokenizer():

    def __call__(self, items):

        ## List where we temporarily store the tokens as they are being parsed.
        final_list = []

        ## We don't want to mess with the special fastai tokens
        special_chars = ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj']

        ## Break up each string into words. If a word is in special_chars, don't touch it.
        ## Otherwise break up the word into its individual characters.
        for words in items:
            tmp = []
            for word in words.split():
                if word not in special_chars:
                    for char in word:
                        tmp.append(char)
                else:
                    tmp.append(word)
            ## tmp has each token
            ## We need to put the tmp list into another list to generate a generator below
            final_list.append(tmp)

        ## Returns a generator
        return (t for t in final_list)

In [None]:
## Create an instance of CharacterTokenizer
tokenizer = CharacterTokenizer()

## Tokenize the text in the DataFrame
tokenized_texts = tokenizer(mask_df['text'])

## Print the original and tokenized texts
# for original_text, tokenized_text in zip(df['text'], tokenized_texts):
#     print(f"Original Text: {original_text}")
#     print(f"Tokenized Text: {tokenized_text}")
#     print()

In [None]:
dls = TextDataLoaders.from_df(mask_df, label_col='label', text_col='text', text_func=CharacterTokenizer, seed=42)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.opt_func = partial(Adam, wd=1e-4)
learn.fine_tune(4)


epoch,train_loss,valid_loss,accuracy,time
0,1.692136,1.106152,0.662088,00:04



epoch,train_loss,valid_loss,accuracy,time
0,1.237917,0.84071,0.718407,00:05
1,1.081058,0.67246,0.796703,00:04
2,0.953455,0.597669,0.81456,00:04
3,0.881926,0.588762,0.807692,00:05



In [None]:
test_set_accuracy(test_df,learn)


Accuracy: 0.795


The character tokenizer performs slightly worse than fastai's default. Maybe going letter by letter was a bit too extreme.
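One way to see why letter-by-letter splitting might be "too extreme": each title turns into a much longer sequence of tokens, and each token carries very little meaning on its own. The sketch below is only an illustration. `char_tokens` is a hypothetical standalone helper that mirrors the splitting logic above, and treating `xxunk` as a protected special token is an assumption made for the example.

```python
def char_tokens(text, special_tokens=("xxunk",)):
    """Split text into characters, keeping any special token whole.

    Hypothetical helper for illustration; the special_tokens default is an assumption.
    """
    tokens = []
    for word in text.split():
        if word in special_tokens:
            tokens.append(word)   # keep special tokens intact
        else:
            tokens.extend(word)   # one token per character
    return tokens


# Sample titles in the style of our dataset
titles = ["health information technician", "devops engineer", "xxunk specialist"]

for title in titles:
    n_words = len(title.split())
    n_chars = len(char_tokens(title))
    print(f"{title!r}: {n_words} word tokens vs {n_chars} character tokens")
```

The model has to recover the meaning of a word like "technician" from ten separate tokens instead of one, which is a harder problem on a dataset of this size.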

## Technique: BERT tokenizer

Next we will try a BERT tokenizer. It is designed to preprocess text for the BERT model, but I found that it also works as a subword tokenizer for our dataset. It breaks larger words into subwords and creates one token per piece. Think about job titles in the medical field, like Speech Pathologist or Radiologist. Maybe breaking these words into subwords like "Speech" "Path" "ologist" and "Radi" "ologist" will enable better pattern recognition, since the shared "ologist" piece gives the model a common signal across related titles.

I also listed the XLNet tokenizer for those who want another option.
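To build some intuition for how a subword tokenizer of this kind splits words, here is a minimal sketch of greedy longest-match splitting, the general idea behind WordPiece. It is not the Hugging Face implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the tiny vocabulary is made up for the example. The real `bert-base-uncased` vocabulary has roughly 30,000 entries, so its actual splits may differ.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one lowercase word into the longest pieces found in vocab.

    Pieces that continue a word are prefixed with '##', as in BERT's output.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position: give up on the whole word
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, invented for this example
toy_vocab = {"recruit", "##er", "dev", "##ops", "engineer", "radi", "path", "##ologist"}

for word in ["recruiter", "devops", "radiologist", "pathologist", "engineer", "barista"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, "radiologist" and "pathologist" end up sharing the `##ologist` piece, which is the kind of overlap we are hoping the model can pick up on.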

In [None]:
from transformers import AutoTokenizer, XLNetTokenizer

def custom_tokenizer(df):
    ## Load a pretrained tokenizer (e.g., BERT tokenizer)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased")

    ## Tokenize each text entry
    tokenized_texts = []
    for text in df['text']:
        encoding = tokenizer.tokenize(text)
        tokenized_texts.append(encoding)

    return tokenized_texts

In [None]:
tokenized_texts = custom_tokenizer(mask_df)

## Displaying the first batch of tokenized texts
batch_size = 8
for i in range(batch_size):
    print("Tokenized Text", i+1, ":", tokenized_texts[i])

Tokenized Text 1 : ['assessment', 'specialist']
Tokenized Text 2 : ['location', 'scout', 'manager']
Tokenized Text 3 : ['health', 'information', 'technician']
Tokenized Text 4 : ['technical', 'recruit', '##er']
Tokenized Text 5 : ['financial', 'advisor', 'associate']
Tokenized Text 6 : ['dev', '##ops', 'engineer']
Tokenized Text 7 : ['medical', 'equipment', 'technician']
Tokenized Text 8 : ['seo', 'specialist']


In [None]:
dls = TextDataLoaders.from_df(mask_df, label_col='label', text_col='text', text_func=custom_tokenizer, seed=42)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.opt_func = partial(Adam, wd=1e-4)
learn.fine_tune(4)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 1.741384 | 1.08645 | 0.637363 | 00:05 |


| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 1.272253 | 0.833466 | 0.721154 | 00:04 |
| 1 | 1.083904 | 0.691471 | 0.78022 | 00:06 |
| 2 | 0.968678 | 0.603323 | 0.809066 | 00:06 |
| 3 | 0.889113 | 0.59429 | 0.806319 | 00:07 |


In [None]:
test_set_accuracy(test_df,learn)

Accuracy: 0.81


The BERT subword tokenizer shows similar performance to the fastai default in our study, but it is worth trying on your data to see if it has any effect. In Part II we will be working with a BERT model, for which this tokenizer was created.

# Conclusion

In conclusion, we took our baseline model at 60.5% accuracy and improved it to 81% using various techniques. It is not impossible to build an effective model with limited data! Instead of taking weeks to process and clean thousands of inputs, try these methods to add some variance and beef up the dataset with what you already have. Not every method will have the same effect on every dataset, so it is important to try a few and see how they work. In this study, we found duplication in combination with data masking to be the most effective augmentation technique, but maybe a dataset with longer sentences, like the Kaggle movie sentiment reviews, will perform better with random deletion or swapping. It's worthwhile to explore these techniques before jumping to an expensive paid-for model like OpenAI's GPT series.

Remember, as the dataset grows over time, performance should follow. <a href="https://nbviewer.org/github/fastai/fastbook/blob/master/10_nlp.ipynb">In the attached</a> lesson notebook from the fastai course, they were able to achieve 95.1% accuracy from a baseline of 83% on a movie sentiment classification task with a training dataset of 50,000 items. Imagine what can be done when this dataset of ~1,000 items grows tenfold. This study was only meant to be an exploration of these techniques and of how important it is to know how to leverage them to improve performance on small datasets.

If you want to learn more or to discuss how to get more out of your data, please don't hesitate to reach out to our team at Stele & Rivers Group. We'd be happy to work with you.

Stay tuned for more studies like this one, and if there is something you would like to chat about, feel free to reach out at:

shane@stelerivers.com


