# distilBERT: text classification for intent detection

This notebook uses the SNIPS dataset for intent classification. The aim is to fine-tune a pre-trained distilBERT model to classify utterances into the following intents:
- `AddToPlaylist`
- `BookRestaurant`
- `GetWeather`
- `PlayMusic`
- `RateBook`
- `SearchCreativeWork`
- `SearchScreeningEvent`

## Data loading, transformation and exploration

In [1]:
import json

def load_intents(file_name, intent_name):
    with open(file_name, 'r') as file:
        raw_data = json.load(file)
    
    return [
        ''.join(
            [
                data.get('text', '') for data in intent.get('data', [])
            ]
        ).strip() for intent in raw_data.get(intent_name, [])
    ]

intents = [
    'AddToPlaylist',
    'BookRestaurant',
    'GetWeather',
    'PlayMusic',
    'RateBook',
    'SearchCreativeWork',
    'SearchScreeningEvent'
]
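
To make the parsing step concrete, here is a minimal sketch of what `load_intents` does with a single record. The sample record is hypothetical: it only mimics the nested `data`/`text` layout the function reads and is not taken from the dataset files. The helper name `join_fragments` is also an assumption made for this illustration.

```python
import json

# Hypothetical record, shaped like the structure load_intents() reads:
# {intent_name: [{"data": [{"text": ...}, ...]}, ...]}
sample = json.loads("""
{
  "PlayMusic": [
    {"data": [
      {"text": "play "},
      {"text": "some jazz", "entity": "genre"},
      {"text": " please "}
    ]}
  ]
}
""")

def join_fragments(raw_data, intent_name):
    # Concatenate the text fragments of each utterance, then trim whitespace
    return [
        ''.join(chunk.get('text', '') for chunk in utterance.get('data', [])).strip()
        for utterance in raw_data.get(intent_name, [])
    ]

utterances = join_fragments(sample, 'PlayMusic')
print(utterances)  # ['play some jazz please']
```

An intent name that is missing from the file yields an empty list rather than an error, because both lookups fall back to a default value.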

In [2]:
import pandas as pd
from random import shuffle

dataset = {}

for intent in intents:
    intents_list = list(set(
        load_intents(f'data/custom-intent-engines/{intent}/train_{intent}_full.json', intent)
    ).union(
        load_intents(f'data/custom-intent-engines/{intent}/train_{intent}.json', intent)
    ).union(
        load_intents(f'data/custom-intent-engines/{intent}/validate_{intent}.json', intent)
    ))
    # shuffle the data of the list
    shuffle(intents_list)
    # calculate 1/3 split
    split = len(intents_list) // 3
    # 1/3 of the data is used for validation
    validation = intents_list[:split]
    # 2/3 of the data is used for training
    training = intents_list[split:]
    
    dataset[intent] = {}
    dataset[intent]['training'] = training
    dataset[intent]['validation'] = validation

dataset = pd.DataFrame(dataset)
dataset

| | AddToPlaylist | BookRestaurant | GetWeather | PlayMusic | RateBook | SearchCreativeWork | SearchScreeningEvent |
|---|---|---|---|---|---|---|---|
| training | [I want to add Aprite le finestre to my playli... | [book australian food in Armour for 7 pm for f... | [How temperate will it be here this week?, Wil... | [Play a record from 2015, Play some good music... | [rate this book titled The Silver Chalice a 1,... | [Find Return to Krondor., find the album Just ... | [I want a list of films that are going to be s... |
| validation | [add the track by josh kear to myra's playlist... | [Book a restaurant for one person at 7 AM, I n... | [Is it warm in Albania at noon, What is the fo... | [play a tune by Syreeta Wright from twenties f... | [Rate this book a five, rate this book four of... | [find the TV show Tribute to the Troops, Can y... | [Show movies in the neighborhood, What time is... |
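
The split above depends on an unseeded `shuffle` and on the iteration order of a `set`, so the rows that land in each part can change between runs. A minimal sketch of a reproducible variant follows. The function name `split_utterances` and the seed value are assumptions for this illustration, not part of the original notebook.

```python
from random import Random

def split_utterances(utterances, seed=42):
    # Sort first: the iteration order of a set can differ between interpreter runs
    items = sorted(set(utterances))
    # A dedicated Random instance keeps the shuffle independent of global state
    Random(seed).shuffle(items)
    split = len(items) // 3
    # first third -> validation, remaining two thirds -> training
    return items[split:], items[:split]

training_part, validation_part = split_utterances([f'utterance {i}' for i in range(10)])
print(len(training_part), len(validation_part))  # 7 3
```

Because `len(items) // 3` rounds down, the validation part is never larger than one third of the utterances.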


Extract the training row, which holds the list of utterances per intent, into one row per utterance

In [17]:
training = dataset.loc[['training']].T.explode('training').reset_index().rename(columns={
    'index': 'intent',
    'training': 'utterance'
})
training

| | intent | utterance |
|---|---|---|
| 0 | AddToPlaylist | I want to add Aprite le finestre to my playlis... |
| 1 | AddToPlaylist | put Before the Eulogy onto Acoustic Blues |
| 2 | AddToPlaylist | Add teriazume to the nação reggae playlist |
| 3 | AddToPlaylist | add party with friends by Constructs of the St... |
| 4 | AddToPlaylist | add this track to my Comedy New Releases |
| ... | ... | ... |
| 10458 | SearchScreeningEvent | What is the movie schedules for films in the n... |
| 10459 | SearchScreeningEvent | Find the movie schedule at IMAX Corporation. |
| 10460 | SearchScreeningEvent | find films at Alamo Drafthouse Cinema |
| 10461 | SearchScreeningEvent | What animated movies are playing AMC Theaters |


Extract the validation row, which holds the list of utterances per intent, into one row per utterance

In [18]:
validation = dataset.loc[['validation']].T.explode('validation').reset_index().rename(columns={
    'index': 'intent',
    'validation': 'utterance'
})
validation

| | intent | utterance |
|---|---|---|
| 0 | AddToPlaylist | add the track by josh kear to myra's playlist ... |
| 1 | AddToPlaylist | Add another tune to my songs for you, not your... |
| 2 | AddToPlaylist | add chas chandler to my Aux Cord Privileges |
| 3 | AddToPlaylist | add armistead burwell smith iv to Blues Masters |
| 4 | AddToPlaylist | add Rak biszewilo to my playlist named Jazz |
| ... | ... | ... |
| 5222 | SearchScreeningEvent | I'd like to watch The Slender Thread at 6 PM |
| 5223 | SearchScreeningEvent | Gimme movie times. |
| 5224 | SearchScreeningEvent | What is the movie schedule for movies in the n... |
| 5225 | SearchScreeningEvent | Find movie schedules for animated movies aroun... |


Check that the training set holds about $2/3$ of the rows and the validation set about $1/3$ of the entire dataset.

In [5]:
len(training) * 100 / (len(training) + len(validation))

66.68578712555768

In [6]:
len(validation) * 100 / (len(training) + len(validation))

33.31421287444232

Check that each intent accounts for a similar share of the rows in both the training and validation sets (stratification).

In [7]:
training.value_counts(subset=['intent'], normalize=True) * 100

intent              
GetWeather              14.823664
PlayMusic               14.775877
BookRestaurant          14.747204
RateBook                14.078180
SearchCreativeWork      14.039950
AddToPlaylist           13.896588
SearchScreeningEvent    13.638536
Name: proportion, dtype: float64

In [8]:
validation.value_counts(subset=['intent'], normalize=True) * 100

intent              
GetWeather              14.826861
PlayMusic               14.769466
BookRestaurant          14.750335
RateBook                14.080735
SearchCreativeWork      14.042472
AddToPlaylist           13.889420
SearchScreeningEvent    13.640712
Name: proportion, dtype: float64
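
The two listings can also be compared programmatically instead of by eye. The sketch below uses only the standard library and toy labels (hypothetical, not the notebook's data) to show the idea: compute the share of each intent in both sets and look at the largest gap.

```python
from collections import Counter

def proportions(labels):
    # Share of each label among all labels
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels (hypothetical): each intent was split 2/3 training, 1/3 validation
train_labels = ['GetWeather'] * 4 + ['PlayMusic'] * 2
valid_labels = ['GetWeather'] * 2 + ['PlayMusic'] * 1

train_share = proportions(train_labels)
valid_share = proportions(valid_labels)

# Largest absolute difference in share between the two sets
max_gap = max(abs(train_share[k] - valid_share[k]) for k in train_share)
print(max_gap)
```

A gap close to zero means both sets follow the same intent distribution, which is what splitting each intent separately is expected to produce.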

## distilBERT modelling

Create the training and evaluation datasets for distilBERT

In [9]:
from datasets import Dataset

label2id = {v: k for k, v in enumerate(intents)}
id2label = {k: v for k, v in enumerate(intents)}

def encode(row):  
    return label2id[row['intent']]

train_dataset = Dataset.from_dict({
    'text': training.utterance.to_list(),
    'label': training.apply(encode, axis=1).to_list()
})

eval_dataset = Dataset.from_dict({
    'text': validation.utterance.to_list(),
    'label': validation.apply(encode, axis=1).to_list()
})

In [10]:
train_dataset[0]

{'text': 'I want to add Aprite le finestre to my playlist entitled This Is Earth, Wind & Fire',
 'label': 0}

In [11]:
eval_dataset[100]

{'text': 'I want 4813 added to my Rhythm and Blues playlist', 'label': 0}

Load a pretrained distilBERT tokenizer

In [12]:
from transformers import DistilBertTokenizerFast

BERT = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizerFast.from_pretrained(BERT)

Tokenize the text and truncate the token sequences so they are no longer than distilBERT’s maximum input length

In [13]:
def preprocess(examples):
    # Truncate only; padding is applied per batch by the data collator
    return tokenizer(examples['text'], truncation=True)

train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)

Map:   0%|          | 0/10463 [00:00<?, ? examples/s]

Map:   0%|          | 0/5227 [00:00<?, ? examples/s]

Create a batch of examples using DataCollatorWithPadding. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [14]:
train_dataset[0]

{'text': 'I want to add Aprite le finestre to my playlist entitled This Is Earth, Wind & Fire',
 'label': 0,
 'input_ids': [101,
  1045,
  2215,
  2000,
  5587,
  19804,
  4221,
  3393,
  10418,
  2890,
  2000,
  2026,
  2377,
  9863,
  4709,
  2023,
  2003,
  3011,
  1010,
  3612,
  1004,
  2543,
  102],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}

In [15]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
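
To illustrate why dynamic padding is cheaper, the sketch below counts padding tokens for one toy batch under both strategies. It is a plain-Python illustration of the idea, not the collator's actual implementation. The token ids are hypothetical, and the fixed length of 512 and the padding id of 0 are assumptions for this example.

```python
def pad_batch(batch, pad_id=0, length=None):
    # Pad every sequence to `length`, or to the longest sequence in the batch
    target = length if length is not None else max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]

def count_padding(batch, padded):
    # Number of padding tokens that were added
    return sum(len(p) for p in padded) - sum(len(s) for s in batch)

# Toy token-id sequences (hypothetical values) of lengths 3, 5 and 4
batch = [[101, 2377, 102], [101, 2377, 2070, 2189, 102], [101, 5587, 2009, 102]]

dynamic = pad_batch(batch)             # pad to the longest in the batch (5)
fixed = pad_batch(batch, length=512)   # pad to a fixed maximum length

print(count_padding(batch, dynamic), count_padding(batch, fixed))  # 3 1524
```

Short utterances like the ones in this dataset make the difference large, since most sequences are far below the maximum input length.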

Load an evaluation method with the 🤗 Evaluate library. Including a metric during training is often helpful for evaluating your model’s performance.

In [None]:
import evaluate

accuracy = evaluate.load('accuracy')

Then create a function that converts the model's logits into predicted labels and passes the predictions and reference labels to `compute` to calculate the accuracy

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
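
As a sanity check of the logic, the same computation can be written without any library. The sketch below is a standard-library re-implementation for illustration only (it does not call Evaluate), and the logits and labels are toy values.

```python
def argmax(row):
    # Index of the largest value in a row of logits
    return max(range(len(row)), key=row.__getitem__)

def accuracy_score(logits, labels):
    # Predicted class = position of the highest logit
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(labels)}

# Toy logits for 4 examples and 3 classes (hypothetical values)
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 3.0, 0.5], [1.0, 0.9, 0.8]]
toy_labels = [0, 2, 1, 1]

print(accuracy_score(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

The last example is predicted as class 0 while its label is 1, so three out of four predictions are correct.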

In [None]:
from transformers import DistilBertForSequenceClassification, TrainingArguments, Trainer

model = DistilBertForSequenceClassification.from_pretrained(
    BERT,
    num_labels=len(intents),
    id2label=id2label,
    label2id=label2id
)
model.config

## References
- [Create Dataset](https://huggingface.co/docs/datasets/v4.0.0/en/create_dataset)
- [NLP quickstart](https://huggingface.co/docs/datasets/en/quickstart#nlp)
- [Trainer API](https://huggingface.co/docs/transformers/en/main_classes/trainer)
- [Evaluate](https://huggingface.co/docs/evaluate/a_quick_tour)
- [Text Classification](https://huggingface.co/docs/transformers/tasks/sequence_classification)

```
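import torch

# Alternative to the Trainer: feed the tokenized dataset to a PyTorch DataLoader.
# distilBERT has no token_type_ids, and the label column here is named "label".
torch_dataset = train_dataset.select_columns(["input_ids", "attention_mask", "label"])
torch_dataset = torch_dataset.with_format(type="torch")
dataloader = torch.utils.data.DataLoader(torch_dataset, batch_size=32, collate_fn=data_collator)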
```

Define the hyperparameters

In [None]:
from transformers import TrainingArguments, Trainer

epochs = 2

training_args = TrainingArguments(
    output_dir = './models/output',
    num_train_epochs=epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,

    # deep learning parameters
    warmup_steps=len(training) // 5, # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,
    learning_rate=1e-5,

    logging_steps=1,
    log_level='info',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
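
It can help to relate `warmup_steps` to the total number of optimisation steps. The arithmetic below is a rough sketch that assumes a single device, no gradient accumulation, a kept last partial batch and a linear warmup. The training set size is the one reported by the progress bar of the `map` step.

```python
import math

train_rows = 10463   # training set size reported by the map step
batch_size = 32
epochs = 2

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = train_rows // 5

print(steps_per_epoch, total_steps, warmup_steps)  # 327 654 2092

# Fraction of the peak learning rate reached at the last step under linear warmup
peak_fraction = min(1.0, total_steps / warmup_steps)
```

Under these assumptions the warmup is longer than the whole run, so the learning rate would still be ramping up when training ends and would reach only about 31% of `learning_rate`. If that is not intended, a warmup expressed as a fraction of `total_steps` may be a better fit.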

In [None]:
trainer.evaluate()

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

## Inference

In [None]:
from transformers import pipeline

classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

In [None]:
classifier('Is it going to be sunny tomorrow?')
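
The classifier returns a label together with a score. The sketch below shows one way such a result can be derived from raw logits, assuming the post-processing is a softmax followed by picking the top class. The logits are hypothetical values, and the code is a standard-library illustration rather than the pipeline's own implementation.

```python
import math

intents = ['AddToPlaylist', 'BookRestaurant', 'GetWeather', 'PlayMusic',
           'RateBook', 'SearchCreativeWork', 'SearchScreeningEvent']
id2label = dict(enumerate(intents))

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [-1.2, 0.3, 4.1, 0.0, -0.7, 0.2, -0.5]   # hypothetical values
scores = softmax(toy_logits)

# Keep the class with the highest probability
best = max(range(len(scores)), key=scores.__getitem__)
result = [{'label': id2label[best], 'score': scores[best]}]
print(result[0]['label'])  # GetWeather
```

The score is a probability over the seven intents, so a low top score can be used as a hint that the utterance does not clearly match any of them.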

In [None]:
trainer.save_model()

In [None]:
classifier = pipeline('text-classification', model='./models/output', tokenizer=tokenizer)

In [None]:
classifier('Add Stitches by Shawn Mendes to my favourites')

In [None]:
classifier('Harry Potter was 10 out of 10')

In [None]:
classifier('Put aside a crib for 2 at 20:30 at the Sartori inn')

## Fine-tuning by freezing BERT parameters

In [None]:
frozen_model = DistilBertForSequenceClassification.from_pretrained(
    BERT,
    num_labels=len(intents),
    id2label=id2label,
    label2id=label2id
)

In [None]:
for param in frozen_model.distilbert.parameters():
    param.requires_grad = False # freezing the parameters of the pre-trained model

We freeze the parameters of the pre-trained model and then run the fine-tuning as before.
Training should take less time because only the classification head is updated, but the accuracy is likely to be lower than that of the fully fine-tuned model.
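
To get a feel for how little is left to train, the arithmetic below estimates the number of parameters in the classification head. It assumes the head consists of two linear layers, one of size hidden to hidden and one of size hidden to number of labels, and that the hidden size is 768. Both are assumptions about the model architecture, not values read from this notebook.

```python
hidden_size = 768   # assumed hidden size of distilbert-base-uncased
num_labels = 7      # number of intents in this notebook

def linear_params(in_features, out_features):
    # weight matrix plus bias vector
    return in_features * out_features + out_features

pre_classifier = linear_params(hidden_size, hidden_size)   # assumed first head layer
classifier = linear_params(hidden_size, num_labels)        # assumed output layer
trainable = pre_classifier + classifier
print(trainable)  # 595975
```

On a loaded model, the actual figure can be checked by summing `p.numel()` over the parameters whose `requires_grad` is still `True`.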

In [None]:
from transformers import TrainingArguments, Trainer

epochs = 2

training_args = TrainingArguments(
    output_dir = './models/frozen/output',
    num_train_epochs=epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,

    # deep learning parameters
    warmup_steps=len(training) // 5, # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,
    learning_rate=1e-5,

    logging_steps=1,
    log_level='info',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=frozen_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

In [None]:
trainer.evaluate()

In [None]:
trainer.train()

In [None]:
classifier = pipeline('text-classification', model=frozen_model, tokenizer=tokenizer)

In [None]:
classifier('Add Stitches by Shawn Mendes to my favourites')

In [None]:
classifier('Put aside a crib for 2 at 20:30 at the Sartori inn')