## Iterate like a grandmaster ... blurrified!

Jeremy Howard recently released a wonderful [notebook](https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster/) on how to tackle the [U.S. Patent Phrase to Phrase Matching](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/) competition effectively with a focus on two key topics:

1. Creating an effective validation set
1. Iterating rapidly to find changes which improve results on the validation set.

As the author of the [blurr](https://ohmeow.github.io/blurr/) library, a fastai-first framework for training Hugging Face transformers, I thought it might be fun to ***blurrify*** his notebook and demonstrate how one might work through this competition using it rather than the Hugging Face Trainer API. Along the way I'll point out some **TIP**s for those looking to use blurr as well.

So without further ado, let's go!!!

### Step 1: Install and imports

In [None]:
! pip install -q ohmeow-blurr==1.0.4

This will get you blurr, fastai, transformers, etc....  To keep things simple, I'm going to import all the data and modeling bits from blurr.

In [None]:
import gc

from fastai.callback.all import *
from fastai.data.block import CategoryBlock, ColReader, DataBlock, IndexSplitter, RegressionBlock
from fastai.imports import *
from fastai.learner import *
from fastai.optimizer import Adam
from fastai.metrics import *
from fastai.torch_core import *
from fastai.torch_imports import *
from transformers import AutoModelForSequenceClassification, logging

from blurr.text.data.core import TextBlock
from blurr.text.modeling.core import BaseModelWrapper, BaseModelCallback, blurr_splitter
from blurr.text.utils import get_hf_objects
from blurr.utils import PreCalculatedCrossEntropyLoss, print_versions, set_seed

os.environ["TOKENIZERS_PARALLELISM"] = "false"


Before going through Jeremy's notebook, I didn't know I could even do the below and determine whether I'm running locally or on kaggle ... so I'm keeping this in here as well :)

**TIP**: Define a variable that indicates whether you are running locally or on Kaggle.

In [None]:
is_kaggle = os.environ.get("KAGGLE_KERNEL_RUN_TYPE", "")

### Step 2: Data and EDA

I have things set up a bit differently locally, path-wise, but other than that (and some formatting/naming differences) what follows here is from Jeremy's notebook. For all the EDA and resulting observations, see his notebook [here](https://www.kaggle.com/code/jhoward?scriptVersionId=92968513&cellId=7).

**TIP**: Set your paths up to work when running locally or on kaggle.

In [None]:
if is_kaggle:
    path = Path("../input/us-patent-phrase-to-phrase-matching")
else:
    path = Path("./data")

path.ls()
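
The two tips above can also be folded into one small helper that takes the environment as an argument, which makes it easy to check without actually being on Kaggle. This is only a sketch: `resolve_data_path` and its parameters are hypothetical names (not part of blurr or fastai), and the two directories are simply the ones used above.

```python
import os
from pathlib import Path

def resolve_data_path(env=None,
                      kaggle_dir="../input/us-patent-phrase-to-phrase-matching",
                      local_dir="./data"):
    """Return the competition data directory for the current environment."""
    env = os.environ if env is None else env
    # KAGGLE_KERNEL_RUN_TYPE is the same variable checked above; empty/missing means "local"
    on_kaggle = bool(env.get("KAGGLE_KERNEL_RUN_TYPE", ""))
    return Path(kaggle_dir if on_kaggle else local_dir)

# Passing a plain dict stands in for os.environ
print(resolve_data_path({}))
print(resolve_data_path({"KAGGLE_KERNEL_RUN_TYPE": "Interactive"}))
```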

Let's look at the training set:

In [None]:
train_df = pd.read_csv(path/"train.csv")

print(len(train_df))
train_df.head()

...and the test set:

In [None]:
test_df = pd.read_csv(path/"test.csv")

print(len(test_df))
test_df.head()

In [None]:
train_df['section'] = train_df.context.str[0]
train_df.section.value_counts()

**TIP**: As you explore the counts and distributions, note your observations!

### Step 3: Get your Hugging Face objects

Time to blurrify, and we'll begin to do so by using blurr's `get_hf_objects` method to get all the Hugging Face objects we need. We'll define our pretrained model checkpoint and some hyperparameters here as well.  Since we're structuring this as a regression problem, we need to tell the Hugging Face configuration object that we have only *one* label via the `config_kwargs` parameter.

In [None]:
model_checkpoint = "microsoft/deberta-v3-small" #"distilroberta-base"
batch_size = 128
weight_decay = 0.01

**TIP**: For classification tasks, my go-to baseline architectures are roberta, deberta, and bart.

**TIP**: To iterate quickly, choose a small or distilled model.

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(model_checkpoint, model_cls=AutoModelForSequenceClassification, config_kwargs={"num_labels": 1})

### Step 4: Build your DataLoaders

#### 4a. Define a **good** validation set

We'll start by following the same strategy Jeremy went with in [defining our validation set](https://www.kaggle.com/code/jhoward?scriptVersionId=92968513&cellId=54). Once we have our training/validation indices, we'll use the `IndexSplitter` splitter function to ensure our training and validation datasets get separated properly.

From Jeremy's notebook: "BTW, a lot of people do more complex stuff for creating their validation set, but with a dataset this large there's not much point. As you can see, the mean scores in the two groups are very similar despite just doing a random shuffle."

**TIP**: A good validation set is critical to every ML solution. For more info, see [How to create good validation and test sets](https://ohmeow.com/posts/2020/11/06/ajtfb-chapter-1.html#How-to-create-good-validation-and-test-sets) in my blog post on lessons learned from chapter 1 of the fastbook, ["Deep Learning for Coders with fastai & PyTorch"](https://github.com/fastai/fastbook).

In [None]:
anchors = train_df.anchor.unique()
np.random.seed(42)
np.random.shuffle(anchors)

In [None]:
val_prop = 0.25
val_sz = int(len(anchors) * val_prop)
val_anchors = anchors[:val_sz]

In [None]:
is_val = np.isin(train_df.anchor, val_anchors)

idxs = np.arange(len(train_df))
val_idxs = idxs[ is_val]
trn_idxs = idxs[~is_val]

len(val_idxs),len(trn_idxs)

In [None]:
train_df.iloc[trn_idxs].score.mean(), train_df.iloc[val_idxs].score.mean()
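
The point of shuffling *anchors* rather than rows is that every anchor ends up entirely on one side of the split. Here is a minimal sketch of that idea in plain Python, using made-up anchors; `split_by_group` is a hypothetical helper, not something from fastai or blurr.

```python
import random

def split_by_group(groups, val_prop=0.25, seed=42):
    """Return (train_idxs, val_idxs) such that no group value lands on both sides."""
    unique = sorted(set(groups))              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique)
    val_groups = set(unique[: int(len(unique) * val_prop)])
    val_idxs = [i for i, g in enumerate(groups) if g in val_groups]
    trn_idxs = [i for i, g in enumerate(groups) if g not in val_groups]
    return trn_idxs, val_idxs

# Toy data (made up for illustration): 8 distinct anchors with 3 rows each
anchors = [f"anchor_{i % 8}" for i in range(24)]
trn_idxs, val_idxs = split_by_group(anchors)

trn_anchors = {anchors[i] for i in trn_idxs}
val_anchors = {anchors[i] for i in val_idxs}
print(len(trn_idxs), len(val_idxs))   # 18 6
print(trn_anchors & val_anchors)      # set()
```

An empty intersection is the property to look for: the model is always validated on anchors it has never seen during training.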

#### 4b. Define your `DataBlock`

From Jeremy's notebook ... "We'll need to combine the context, anchor, and target together somehow. There's not much research as to the best way to do this, so we may need to iterate a bit. To start with, we'll just combine them all into a single string. The model will need to know where each section starts, so we can use the special separator token to tell it."

... and that is what I've done via the `build_inputs()` method.

In [None]:
set_seed(42)
sep = f' {hf_tokenizer.sep_token} '

def build_inputs(example):
    return f'{example["context"]}{sep}{example["anchor"]}{sep}{example["target"]}'

blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), RegressionBlock)
dblock = DataBlock(blocks=blocks, get_x=build_inputs, get_y=ColReader("score"), splitter=IndexSplitter(val_idxs))
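
To make the combined string concrete, here is a small standalone sketch of the same formatting. The row is illustrative, and `"[SEP]"` is assumed as the separator; in the notebook the real value comes from `hf_tokenizer.sep_token`.

```python
def build_inputs(example, sep=" [SEP] "):
    # Same layout as above: context, anchor, and target joined by the separator
    return f'{example["context"]}{sep}{example["anchor"]}{sep}{example["target"]}'

# An illustrative row; a dict behaves like a DataFrame row for this purpose
row = {"context": "A47", "anchor": "abatement", "target": "abatement of pollution"}
print(build_inputs(row))
# A47 [SEP] abatement [SEP] abatement of pollution
```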

#### 4c. Build your `DataLoaders`

In [None]:
dls = dblock.dataloaders(train_df, bs=batch_size)

**TIP**: Transformer models are very sensitive to how you prepare your inputs, and exploring different approaches is usually worthwhile.

**TIP**: We'll use blurr's `set_seed()` method to ensure reproducibility (which is important as you iterate over different hyperparameters, explore different augmentation strategies, etc...). For a great discussion on how to do this in fastai, see the ["[Solved] Reproducibility: Where is the randomness coming in?"](https://forums.fast.ai/t/solved-reproducibility-where-is-the-randomness-coming-in/31628/25) post on the forums.

In [None]:
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)

**TIP**: Always take a look at your batches to make sure they look right and that you understand what they represent.

In [None]:
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])

In [None]:
b[0]

### Step 5: Train

Sticking with blurr's mid-level API, let's define our `Learner`.  You can learn all about the blurr specific bits [here](https://ohmeow.github.io/blurr/text-modeling-core.html).

From Jeremy's notebook: "Let's now train our model! We'll need to specify a metric, which is the correlation coefficient."

**TIP**: Your metric should reflect your objective, and in the case of kaggle comps they tell you exactly what that is.
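
If you haven't worked with the correlation coefficient as a metric before, it can help to compute it by hand once. The sketch below is a plain-Python version of the Pearson formula with made-up scores; it is not how fastai's `PearsonCorrCoef` is implemented, just the same quantity.

```python
import math

def pearson_r(preds, targs):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(preds)
    mean_p, mean_t = sum(preds) / n, sum(targs) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targs))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_t = sum((t - mean_t) ** 2 for t in targs)
    return cov / math.sqrt(var_p * var_t)

# Made-up scores: the predictions are a shifted and scaled copy of the targets
targs = [0.0, 0.25, 0.5, 0.75, 1.0]
preds = [0.2 + 0.5 * t for t in targs]
print(round(pearson_r(preds, targs), 6))
```

Note that the metric is unchanged by shifting or rescaling the predictions, so it rewards getting the *relative* ordering and spacing of the scores right rather than their absolute values.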

In [None]:
set_seed(42)

model = BaseModelWrapper(hf_model)

learn = Learner(
    dls,
    model,
    opt_func=partial(Adam, wd=weight_decay),
    loss_func=PreCalculatedCrossEntropyLoss(),
    metrics=[PearsonCorrCoef()],
    cbs=[BaseModelCallback],
    splitter=blurr_splitter
)

learn = learn.to_fp16()

In [None]:
learn.lr_find(start_lr=1e-9, suggest_funcs=[minimum, steep, valley, slide])

Using **1cycle** learning allows us to be a bit more aggressive with our learning rates.  Want to learn more? Check out the ["A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay" paper](https://arxiv.org/abs/1803.09820).

**TIP**: Try different schedulers and see what works and what doesn't.
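
To get a feel for what a 1cycle schedule does to the learning rate, here is a rough standalone sketch: a cosine warm-up to `lr_max` followed by a cosine anneal down to a much smaller value. The default values for `div`, `div_final`, and `pct_start` are what I believe `fit_one_cycle` uses, but treat them as assumptions and check the fastai docs; `one_cycle_lr` itself is a hypothetical helper.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pos, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress `pos` in [0, 1]."""
    if pos < pct_start:
        # Warm-up: from lr_max/div up to lr_max
        return cos_anneal(lr_max / div, lr_max, pos / pct_start)
    # Anneal: from lr_max down to lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final, (pos - pct_start) / (1 - pct_start))

for pos in (0.0, 0.25, 0.5, 1.0):
    print(f"{pos:.2f} -> {one_cycle_lr(pos, lr_max=1e-2):.2e}")
```

The short warm-up is what lets us get away with a fairly aggressive `lr_max`: the model only sees the peak rate briefly, after it has had a chance to settle.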

In [None]:
set_seed(42)
learn.fit_one_cycle(4, lr_max=slice(1e-5, 1e-2))

In [None]:
try:
    del learn
    gc.collect()
    torch.cuda.empty_cache()
except NameError:  # `learn` was never created
    pass


## Improving the model

As Jeremy says in his notebook, "Iteration speed is critical, so we need to quickly be able to try different data processing and trainer parameters."

... so let's create some helper methods to help us iterate more quickly. First, our `DataLoaders`:

In [None]:
def get_dls(df, get_x_func, batch_size, seed=42):
    set_seed(seed)

    blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), RegressionBlock)
    dblock = DataBlock(blocks=blocks, get_x=get_x_func, get_y=ColReader("score"), splitter=IndexSplitter(val_idxs))
    
    return dblock.dataloaders(df, bs=batch_size)

...and also a function to create a `Learner`:

In [None]:
def get_learner(dls, hf_model, weight_decay=0.01, use_fp16=True, seed=42):
    set_seed(seed)

    model = BaseModelWrapper(hf_model)

    learn = Learner(
        dls,
        model,
        opt_func=partial(Adam, wd=weight_decay),
        loss_func=PreCalculatedCrossEntropyLoss(),
        metrics=[PearsonCorrCoef()],
        cbs=[BaseModelCallback],
        splitter=blurr_splitter
    )

    if use_fp16:
        learn = learn.to_fp16()
        
    return learn

In [None]:
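def build_inputs_with_sep(example, sep=" - ", lower_case=False):
    # Lowercase the fields (not the separator) when `lower_case` is set
    fields = [str(example[col]) for col in ("context", "anchor", "target")]
    return sep.join(f.lower() if lower_case else f for f in fields)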

Using our utility functions above, we can rewrite our exploratory data prep/training loop as such:

In [None]:
model_checkpoint = "microsoft/deberta-v3-small"
batch_size = 128
weight_decay = 0.01
use_fp16 = True
seed = 42

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(model_checkpoint, model_cls=AutoModelForSequenceClassification, config_kwargs={"num_labels": 1})

dls = get_dls(train_df, get_x_func=partial(build_inputs_with_sep, sep=f' {hf_tokenizer.sep_token} '), batch_size=batch_size, seed=seed)
learn = get_learner(dls, hf_model, weight_decay=weight_decay, use_fp16=use_fp16, seed=seed)

learn.fit_one_cycle(4, lr_max=slice(1e-5, 1e-2))

**TIP**: Make sure you instantiate a fresh version of your Hugging Face objects each time.

In [None]:
try:
    del learn
    gc.collect()
    torch.cuda.empty_cache()
except NameError:  # `learn` was never created
    pass


### Let's now try out some ideas...

From Jeremy's notebook: "**Perhaps using the special separator character isn't a good idea, and we should use something we create instead**. Let's see if that makes things better."

In [None]:
model_checkpoint = "microsoft/deberta-v3-small"
batch_size = 128
weight_decay = 0.01
use_fp16 = True
seed = 42

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(model_checkpoint, model_cls=AutoModelForSequenceClassification, config_kwargs={"num_labels": 1})

dls = get_dls(train_df, get_x_func=partial(build_inputs_with_sep, sep=" [s] "), batch_size=batch_size, seed=seed)
learn = get_learner(dls, hf_model, weight_decay=weight_decay, use_fp16=use_fp16, seed=seed)

learn.fit_one_cycle(4, lr_max=slice(1e-5, 1e-2))

In [None]:
try:
    del learn
    gc.collect()
    torch.cuda.empty_cache()
except NameError:  # `learn` was never created
    pass


From Jeremy's notebook: "That's looking quite a bit better, so we'll keep that change. **Often changing to lowercase is helpful**. Let's see if that helps too."

In [None]:
model_checkpoint = "microsoft/deberta-v3-small"
batch_size = 128
weight_decay = 0.01
use_fp16 = True
seed = 42

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(model_checkpoint, model_cls=AutoModelForSequenceClassification, config_kwargs={"num_labels": 1})

dls = get_dls(train_df, get_x_func=partial(build_inputs_with_sep, sep=" [s] ", lower_case=True), batch_size=batch_size, seed=seed)
learn = get_learner(dls, hf_model, weight_decay=weight_decay, use_fp16=use_fp16, seed=seed)

learn.fit_one_cycle(4, lr_max=slice(1e-5, 1e-2))

In [None]:
try:
    del learn
    gc.collect()
    torch.cuda.empty_cache()
except NameError:  # `learn` was never created
    pass


From Jeremy's notebook: "**What if we made the patent section a special token?** Then potentially the model might learn to recognize that different sections need to be handled in different ways. To do that, we'll use, e.g. `[A]` for section A. We'll then add those as special tokens."

In [None]:
train_df['sectok'] = '[' + train_df.section + ']'
sectoks = list(train_df.sectok.unique())
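
The mapping here is per row: each context code contributes its first letter, wrapped in brackets. A tiny plain-Python sketch of that, using illustrative context codes (`section_token` is a hypothetical helper):

```python
def section_token(context_code):
    """Wrap the first character of a context code in brackets, e.g. 'A47' -> '[A]'."""
    return f"[{context_code[0]}]"

# Illustrative context codes, not taken from the data
contexts = ["A47", "G06", "H04", "A61"]
sectoks = sorted({section_token(c) for c in contexts})
print(sectoks)
# ['[A]', '[G]', '[H]']
```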

In [None]:
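def build_inputs_with_sep_and_sectoks(example, sep=" - ", lower_case=False):
    # Lowercase the text fields only, so special section tokens such as `[A]` stay intact
    fields = [str(example[c]).lower() if lower_case else str(example[c]) for c in ("context", "anchor", "target")]
    return sep.join([example["sectok"], *fields])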

In [None]:
model_checkpoint = "microsoft/deberta-v3-small"
batch_size = 128
weight_decay = 0.01
use_fp16 = True
seed = 42

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(model_checkpoint, model_cls=AutoModelForSequenceClassification, config_kwargs={"num_labels": 1})

# After adding the new tokens, we need to resize the embedding matrix in the model and initialize the weights
hf_tokenizer.add_special_tokens({'additional_special_tokens': sectoks})
hf_model.resize_token_embeddings(len(hf_tokenizer))

with torch.no_grad():
    # Zero-initialize only the rows of the newly added section tokens (the last len(sectoks) rows)
    hf_model.get_input_embeddings().weight[-len(sectoks):, :] = 0.0

dls = get_dls(train_df, get_x_func=partial(build_inputs_with_sep_and_sectoks, sep=" [s] ", lower_case=True), batch_size=batch_size, seed=seed)
learn = get_learner(dls, hf_model, weight_decay=weight_decay, use_fp16=use_fp16, seed=seed)

learn.fit_one_cycle(4, lr_max=slice(1e-5, 1e-2))
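
The bookkeeping behind adding tokens is worth spelling out: new tokens get ids at the *end* of the vocabulary, so after resizing, their embedding rows are the last ones in the matrix. The sketch below mimics that with a toy vocab and a list-of-lists "embedding matrix"; these are stand-ins for illustration and not the Hugging Face API.

```python
def add_tokens(vocab, embeddings, new_tokens, hidden_size):
    """Append new tokens to a toy vocab and give each one a zero embedding row."""
    added = [t for t in new_tokens if t not in vocab]
    for tok in added:
        vocab[tok] = len(vocab)                  # new ids go at the end
        embeddings.append([0.0] * hidden_size)   # one new row per new token
    return len(added)

# Toy stand-ins: 3 known tokens, hidden size 4
vocab = {"[CLS]": 0, "[SEP]": 1, "patent": 2}
embeddings = [[0.1] * 4, [0.2] * 4, [0.3] * 4]

n_added = add_tokens(vocab, embeddings, ["[A]", "[B]"], hidden_size=4)

# The new rows are the last n_added rows; the pretrained rows are untouched
print(n_added, len(embeddings))
print(embeddings[-n_added:])
```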

Jeremy's list of things to try:

- Try a model pretrained on legal vocabulary. E.g. how about [BERT for patents](https://huggingface.co/anferico/bert-for-patents)?
- You'd likely get better results by using a sentence similarity model. Did you know that there's a [patent similarity model](https://huggingface.co/AI-Growth-Lab/PatentSBERTa) you could try?
- You could also fine-tune any HuggingFace model using the full patent database (which is provided in BigQuery), before applying it to this dataset
- Replace the patent context field with the description of that context provided by the patent office
- ...and try out your own ideas too!

Wayde's list of things to try:

- Experiment with other smaller models that typically work well for classification tasks (e.g., roberta, deberta, and bart are my go-tos)
- Experiment with some bigger versions of those smaller models that worked well for you.
- Be creative with your inputs; you can improve your results by adding special or regular tokens and/or structuring your inputs differently
- Try using K-Fold or Stratified K-Fold cross validation and ensemble your results (see Jeremy's notebook for more info)
- Once you have a decent set of hyperparameters working for you, you can use an optimization framework like Optuna and/or Weights & Biases to fine-tune your choices.
- Read the papers related to the architectures you are using. Often you'll find recommended hyperparameter values and other important recommendations to training them well.
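
For the K-Fold idea in the list above, one option is to build the folds over anchors, so each fold keeps the same "unseen anchors" property as our single validation split. This is a rough plain-Python sketch with made-up anchors; `group_kfold` is a hypothetical helper, and combining K-Fold with anchor grouping is my assumption rather than something taken from Jeremy's notebook.

```python
import random

def group_kfold(groups, n_splits=4, seed=42):
    """Yield (train_idxs, val_idxs) per fold; each group is validated in exactly one fold."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    for k in range(n_splits):
        val_groups = set(unique[k::n_splits])    # every n_splits-th group goes to fold k
        val_idxs = [i for i, g in enumerate(groups) if g in val_groups]
        trn_idxs = [i for i, g in enumerate(groups) if g not in val_groups]
        yield trn_idxs, val_idxs

# Toy data (made up for illustration): 8 distinct anchors with 3 rows each
anchors = [f"anchor_{i % 8}" for i in range(24)]
folds = list(group_kfold(anchors))
print([len(val) for _, val in folds])
# [6, 6, 6, 6]
```

You would then train one model per fold and average the predictions on the test set.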

And as Jeremy said, "Before submitting a model, retrain it on the full dataset, rather than just the 75% training subset we've used here. Create a function like the ones above to make that easy for you!"

## In conclusion

I hope you've learned a little bit about training transformers with blurr, and may even be encouraged to give it a go on Kaggle or at work. If you enjoyed this notebook, **I would greatly appreciate an upvote**. 

Please use the comments section below to ask any questions or share insights you may have on using blurr, fastai, and the transformers library to effectively train transformer models. For folks new to working with the Hugging Face transformers library with a particular interest in using fastai and blurr, I heartily recommend the study group hosted by Weights & Biases that I've been leading for the past few months.  You can watch the entire playlist [here](https://www.youtube.com/playlist?list=PLD80i8An1OEF8UOb9N9uSoidOGIMKW96t).


And if you made it this far, thanks for reading all the way to the end :)

You can find me tweeting at [@waydegilliam](https://twitter.com/waydegilliam) and blogging about ML and software development at [ohmeow.com](https://ohmeow.com/).