# Intent Detection
> A **Text Classification Problem**: `identify the Glorious Purpose` behind a `user's utterance`. It is crucial for **text-based** applications such as chatbots, where both the inputs and the outputs are text.

## Key Points

* **Knowledge distillation** is a technique that **transfers knowledge** from `Larger Complex Model to Smaller Simple Model`.

* **Quantization** is a technique that **reduces the numerical precision** of `model weights and activations` to **speed up inference** and **save memory**.

* **Pruning** is a technique that **removes redundant model parameters** to `reduce model size and model complexity`.
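To make the quantization idea concrete, here is a minimal sketch in plain Python. It assumes a simple symmetric scheme with one shared scale and an int8 range; the helper names `quantize_int8` and `dequantize` and the sample weights are hypothetical, and real toolkits use more elaborate schemes.

```python
def quantize_int8(values):
    """Map floats onto integers in [-128, 127] using one shared scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Illustrative weights only, not taken from a real checkpoint
weights = [-0.0278, 0.0494, 0.0836, 0.6016, -0.1563]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"scale={scale:.6f}", f"max_error={max_error:.6f}")
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale.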

## Scenario

Let's say I am building **Jarvis** to help me with the stock market. When I ask Jarvis "`Hey, I have done some savings, I want to invest some of it in the Stock Market for stable monthly income of 6%...`", it should automatically categorize the request under the "`Investing`" intent. And when I ask Jarvis something out of scope, like "`Find me a date...`", it should reply "`👎🏻 You are not eligible. Focus on your Finance instead, that's more future-proof...`".

## Data
> **CLINC150** A dataset for **intent detection** with 150 intents in 10 domains and out-of-scope examples.

**Purpose**: To test intent detection models on realistic and challenging queries that may not fit any intent.

**Size**: The dataset has **23,700** examples in total, with **22,500 in-scope examples** and **1,200 out-of-scope examples**.

* The **in-scope examples** are divided into 100 train, 20 validation, and 30 test examples per intent class.
    
* The **out-of-scope examples** are divided into 100 train, 100 validation, and 1,000 test examples.
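The totals quoted above follow directly from the per-class counts. As a quick sanity check, the arithmetic can be sketched like this (the variable names are illustrative):

```python
NUM_INTENTS = 150

# Per-intent counts for in-scope data and absolute counts for out-of-scope data
in_scope_per_intent = {"train": 100, "validation": 20, "test": 30}
out_of_scope = {"train": 100, "validation": 100, "test": 1000}

in_scope_total = NUM_INTENTS * sum(in_scope_per_intent.values())
out_of_scope_total = sum(out_of_scope.values())

# Size of each split when in-scope and out-of-scope examples are combined
split_sizes = {
    split: NUM_INTENTS * in_scope_per_intent[split] + out_of_scope[split]
    for split in in_scope_per_intent
}

print(in_scope_total, out_of_scope_total, in_scope_total + out_of_scope_total)
# 22500 1200 23700
print(split_sizes)
# {'train': 15100, 'validation': 3100, 'test': 5500}
```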

**Variants**: The dataset has four variants:

* **data_full** variant is the main version of the dataset, with 100 train examples per in-scope class.

* **data_small** variant has 50 train examples per in-scope class.

* **data_imbalanced** variant has either 25, 50, 75, or 100 train examples per in-scope class, randomly assigned.

* **data_oos_plus** variant is the same as data_full, except it has 250 out-of-scope train examples.

**Benchmarks**: The dataset has been used to benchmark various intent classification models, such as **BERT, RoBERTa**, and **DSSCC**. The models are evaluated on their `accuracy` and `F1-score` on the in-scope and out-of-scope test sets.

* The **current state-of-the-art model** is `RoBERTa-Large + ICDA`, which achieves **98.9% accuracy** and **98.8% F1-score** on the in-scope test set, and **95.7% accuracy** and **95.6% F1-score** on the out-of-scope test set.

**Source**: Larson et al., [“An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction”](https://archive.ics.uci.edu/dataset/570/clinc150).

## Get Started

In [None]:
from transformers import pipeline

In [None]:
model_checkpoint="transformersbook/bert-base-uncased-finetuned-clinc"
pipe=pipeline(
    task="text-classification",
    model=model_checkpoint
)

In [None]:
query = """Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in
Paris and I need a 15 passenger van"""
pipe(query)

[{'label': 'car_rental', 'score': 0.5490031242370605}]

In [None]:
query = """Hey, I have done some savings, I want to invest some of it in
the Stock Market for stable monthly income of 6%..."""
pipe(query)

[{'label': 'direct_deposit', 'score': 0.1069217249751091}]

In [None]:
query = """Find me a date..."""
pipe(query)

[{'label': 'date', 'score': 0.5273397564888}]
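The scores above suggest one simple way to handle out-of-scope queries: treat low-confidence predictions as "no matching intent". The sketch below is only an illustration. The helper `detect_intent`, the `0.5` threshold, and the `"oos"` fallback label are assumptions, not part of the pipeline API; the function only relies on the list-of-dicts output format shown above.

```python
def detect_intent(predictions, threshold=0.5, fallback_label="oos"):
    """Return the top label, or the fallback label when the confidence is below the threshold."""
    top = max(predictions, key=lambda p: p["score"])
    if top["score"] < threshold:
        return fallback_label
    return top["label"]


# Pipeline outputs copied from the three queries above
car_rental = [{"label": "car_rental", "score": 0.5490031242370605}]
investing = [{"label": "direct_deposit", "score": 0.1069217249751091}]
dating = [{"label": "date", "score": 0.5273397564888}]

print(detect_intent(car_rental))  # car_rental
print(detect_intent(investing))   # oos
print(detect_intent(dating))      # date
```

Note that the last query still slips through: a threshold catches uncertain predictions, not confidently wrong ones, so the threshold value would need tuning on validation data.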

## Creating a Performance Benchmark

### Model performance

How well does our model perform on a well-crafted test set that reflects production data? This is especially important when the cost of making errors is large (and best mitigated with a human in the loop), or when we need to run inference on millions of examples and small improvements to the model metrics can translate into large gains in aggregate.

### Latency

How fast can our model deliver predictions? We usually care about latency in real-time environments that deal with a lot of traffic, like how Stack Overflow needed a classifier to quickly detect unwelcome comments on the website.

### Memory

How can we deploy billion-parameter models like GPT-2 or T5 that require gigabytes of disk storage and RAM? Memory plays an especially important role on mobile or edge devices, where a model has to generate predictions without access to a powerful cloud server.

Of these three constraints, model performance deserves a closer look. It should be evaluated on a test set that resembles the data the model will encounter in production, and it matters most when errors are costly or when the model is applied at large scale. The metrics chosen should reflect the desired objectives and trade-offs. The key terms are:

* **Model performance**: How well the model can predict the correct labels for the input data.

* **Test set**: A subset of data that is used to evaluate the model performance after training. The test set should reflect the production data as much as possible.

* **Cost of errors**: The potential negative consequences of the model making wrong predictions, such as losing customers, money, or reputation. Errors can be reduced by involving a human expert to check or correct the model outputs.

* **Scale of inference**: The amount of data that the model needs to process and make predictions for. A large scale of inference can amplify the benefits or drawbacks of the model performance.

## Performance Benchmark
>  Use the `test set` to benchmark models.

In [None]:
class PerformanceBenchmark:
    def __init__(self, pipeline, dataset, optim_type="BERT baseline"):
        self.pipeline=pipeline
        self.dataset=dataset
        self.optim_type=optim_type # keep track of different optimization techniques

    def computeAccuracy(self):
        pass

    def computeSize(self):
        pass

    def timePipeline(self):
        pass

    def runBenchmark(self):
        metrics:dict[str,dict] = dict() # collect all the metrics in a dictionary, with keys given by optim_type
        metrics[self.optim_type]=self.computeSize()
        metrics[self.optim_type].update(self.timePipeline())
        metrics[self.optim_type].update(self.computeAccuracy())
        return metrics

In [None]:
! pip install datasets -Uqq

In [None]:
from datasets import load_dataset

data=load_dataset(
    path="clinc_oos",
    name="plus", # refers to the subset containing out-of-scope training examples
    trust_remote_code=True
)

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'intent'],
        num_rows: 15250
    })
    validation: Dataset({
        features: ['text', 'intent'],
        num_rows: 3100
    })
    test: Dataset({
        features: ['text', 'intent'],
        num_rows: 5500
    })
})

In [None]:
data["test"].features

{'text': Value(dtype='string', id=None),
 'intent': ClassLabel(names=['restaurant_reviews', 'nutrition_info', 'account_blocked', 'oil_change_how', 'time', 'weather', 'redeem_rewards', 'interest_rate', 'gas_type', 'accept_reservations', 'smart_home', 'user_name', 'report_lost_card', 'repeat', 'whisper_mode', 'what_are_your_hobbies', 'order', 'jump_start', 'schedule_meeting', 'meeting_schedule', 'freeze_account', 'what_song', 'meaning_of_life', 'restaurant_reservation', 'traffic', 'make_call', 'text', 'bill_balance', 'improve_credit_score', 'change_language', 'no', 'measurement_conversion', 'timer', 'flip_coin', 'do_you_have_pets', 'balance', 'tell_joke', 'last_maintenance', 'exchange_rate', 'uber', 'car_rental', 'credit_limit', 'oos', 'shopping_list', 'expiration_date', 'routing', 'meal_suggestion', 'tire_change', 'todo_list', 'card_declined', 'rewards_balance', 'change_accent', 'vaccines', 'reminder_update', 'food_last', 'change_ai_name', 'bill_due', 'who_do_you_work_for', 'share_locat

In [None]:
data["test"].to_pandas()

| | text | intent |
|---|---|---|
| 0 | how would you say fly in italian | 61 |
| 1 | what's the spanish word for pasta | 61 |
| 2 | how would they say butter in zambia | 61 |
| 3 | how do you say fast in spanish | 61 |
| 4 | what's the word for trees in norway | 61 |
| ... | ... | ... |
| 5495 | find my wallet | 42 |
| 5496 | can you give me the gps location of harvey | 42 |
| 5497 | where's my buddy steve right this second | 42 |
| 5498 | locate jenny at her present position | 42 |


In [None]:
data["test"].features["intent"].int2str(values=[61,42])

['translate', 'oos']

In [None]:
data["test"].features["intent"].int2str

<bound method ClassLabel.int2str of ClassLabel(names=['restaurant_reviews', 'nutrition_info', 'account_blocked', 'oil_change_how', 'time', 'weather', 'redeem_rewards', 'interest_rate', 'gas_type', 'accept_reservations', 'smart_home', 'user_name', 'report_lost_card', 'repeat', 'whisper_mode', 'what_are_your_hobbies', 'order', 'jump_start', 'schedule_meeting', 'meeting_schedule', 'freeze_account', 'what_song', 'meaning_of_life', 'restaurant_reservation', 'traffic', 'make_call', 'text', 'bill_balance', 'improve_credit_score', 'change_language', 'no', 'measurement_conversion', 'timer', 'flip_coin', 'do_you_have_pets', 'balance', 'tell_joke', 'last_maintenance', 'exchange_rate', 'uber', 'car_rental', 'credit_limit', 'oos', 'shopping_list', 'expiration_date', 'routing', 'meal_suggestion', 'tire_change', 'todo_list', 'card_declined', 'rewards_balance', 'change_accent', 'vaccines', 'reminder_update', 'food_last', 'change_ai_name', 'bill_due', 'who_do_you_work_for', 'share_location', 'internati

## computeAccuracy
> Accuracy is **effective** when the target feature's classes are `balanced`.

In [None]:
! pip install evaluate -Uqq

In [None]:
from evaluate import load

accuracy_score=load(path="accuracy")

In [None]:
# expects y_true(references), y_pred(predictions)
accuracy_score.compute(references=[5,3,7], predictions=[5,3,6])

{'accuracy': 0.6666666666666666}
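The caveat about balanced classes is easy to demonstrate. In the sketch below (toy labels and a hypothetical `accuracy` helper, not the `evaluate` metric), a classifier that always predicts the majority class reaches 90% accuracy while never recognizing the minority intent:

```python
from collections import Counter


def accuracy(references, predictions):
    """Fraction of predictions that match the reference labels."""
    correct = sum(r == p for r, p in zip(references, predictions))
    return correct / len(references)


# Imbalanced toy data: 90 out-of-scope queries and 10 translation queries
references = ["oos"] * 90 + ["translate"] * 10

# A degenerate "model" that always predicts the most frequent label
majority_label = Counter(references).most_common(1)[0][0]
predictions = [majority_label] * len(references)

overall = accuracy(references, predictions)

# Recall on the minority class: how many "translate" queries were recognized
minority_pairs = [(r, p) for r, p in zip(references, predictions) if r == "translate"]
minority_recall = sum(r == p for r, p in minority_pairs) / len(minority_pairs)

print(f"accuracy={overall:.2f}, translate recall={minority_recall:.2f}")
# accuracy=0.90, translate recall=0.00
```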

In [None]:
sample=data["test"][0]["text"]
prediction=pipe(sample)
prediction_label=prediction[0]["label"]
prediction_id=data["test"].features["intent"].str2int(prediction_label)

print(sample)
print(prediction)
print(prediction_label,"--->", prediction_id)

how would you say fly in italian
[{'label': 'translate', 'score': 0.5855257511138916}]
translate ---> 61


In [None]:
for example in data["test"]:
    print(example)
    break

{'text': 'how would you say fly in italian', 'intent': 61}


In [None]:
def computeAccuracy(self):
    """Override the PerformanceBenchmark.computeAccuracy() method"""
    prediction_ids:list[int]=[]
    labels:list[int]=[] # ground-truth intents are stored as integer class ids
    intents=self.dataset["test"].features["intent"]

    for example in self.dataset["test"]:
        pred_label=self.pipeline(example["text"])[0]["label"]
        pred_id=intents.str2int(pred_label)
        prediction_ids.append(pred_id)

        label=example["intent"]
        labels.append(label)
    accuracy=accuracy_score.compute(references=labels, predictions=prediction_ids)
    print(f"Accuracy on test set ---> {accuracy['accuracy']:.4f}")
    return accuracy

# Override the PerformanceBenchmark.computeAccuracy() method
PerformanceBenchmark.computeAccuracy=computeAccuracy

In [None]:
list(pipe.model.state_dict().items())[42]

('bert.encoder.layer.2.attention.self.value.bias',
 tensor([-2.7834e-02,  4.9434e-02,  8.3551e-02,  4.1092e-02,  6.0157e-01,
          1.1774e-01, -5.2112e-02, -6.5143e-02, -2.9358e-02, -4.2250e-02,
          7.9177e-02,  8.0409e-02,  2.9921e-03,  1.7816e-01, -5.0480e-02,
         -1.5634e-01, -2.1707e-02,  1.4381e-02,  2.5132e-02, -2.4110e-02,
         -1.9183e-01, -7.8657e-02,  5.0709e-02,  3.3632e-02, -3.1946e-02,
          1.1616e-01,  9.2720e-02, -1.1787e-01,  2.3233e-01, -1.2678e-02,
         -1.3138e-01, -4.0024e-02,  7.4823e-02, -5.4148e-02, -1.5184e-01,
         -7.4407e-02,  1.1559e-01,  8.2729e-02, -1.3787e-01,  8.3528e-02,
          1.2154e-01,  1.6880e-02, -5.6629e-02, -3.9295e-02,  5.3725e-02,
          6.8602e-02, -1.1294e-01,  4.4001e-02, -2.5884e-01,  1.6767e-01,
          1.8316e-01,  5.6272e-02, -3.6874e-02, -2.7938e-02, -9.3204e-02,
         -7.5239e-03,  4.1141e-02, -1.1542e-02, -9.9749e-02, -3.0910e-02,
          4.1398e-02, -4.4389e-02, -2.6279e-02,  7.2100e-02, 

## computeSize

In [None]:
import torch
from pathlib import Path

torch.save(obj=pipe.model.state_dict(), f="model.pt")

In [None]:
Path("model.pt").stat()

os.stat_result(st_mode=33188, st_ino=4980924, st_dev=55, st_nlink=1, st_uid=0, st_gid=0, st_size=438459014, st_atime=1703875130, st_mtime=1703875132, st_ctime=1703875132)

In [None]:
Path("model.pt").stat().st_size

438459014
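The raw byte count is easier to interpret after a unit conversion. The sketch below converts it to megabytes and, assuming the weights are stored as 32-bit floats (4 bytes each) and ignoring serialization overhead, derives a rough parameter count:

```python
size_bytes = 438_459_014  # st_size reported above

# Convert bytes to megabytes (1 MB taken as 1024 * 1024 bytes)
size_mb = size_bytes / (1024 * 1024)

# Rough parameter estimate, assuming 4 bytes per float32 weight
approx_params_millions = size_bytes / 4 / 1e6

print(f"{size_mb:.2f} MB")                         # 418.15 MB
print(f"~{approx_params_millions:.0f}M parameters")  # ~110M parameters
```

The estimate is in line with the roughly 110 million parameters usually quoted for BERT-base.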

In [None]:
def computeSize(self):
    """Override the PerformanceBenchmark.computeSize() method"""
    state_dict=self.pipeline.model.state_dict()
    tmp_path=Path("model.pt")
    torch.save(obj=state_dict, f=tmp_path)
    # Calculate size in megabytes (1 MB = 1024*1024 bytes here)
    size_mb=tmp_path.stat().st_size / (1024*1024)
    # Delete temporary file
    tmp_path.unlink()
    print(f"Model Size ---> {size_mb:.2f} mb")
    return {"size_mb":size_mb}

# Override the PerformanceBenchmark.computeSize() method
PerformanceBenchmark.computeSize=computeSize

## timePipeline
> Compute the average latency from feeding input to getting output.

A simple way to measure the execution time of a code snippet is to use the `perf_counter()` function from Python's `time` module. This function has a **better time resolution** than the `time.time()` function and is well suited for getting precise results.

In [None]:
from time import perf_counter

for _ in range(3):
    start_time=perf_counter()
    _=pipe(data["test"][0]["text"])
    latency=perf_counter() - start_time
    print(f"Latency: {1000*latency:.3f} ms")

Latency: 100.242 ms
Latency: 102.908 ms
Latency: 94.523 ms


In [None]:
import numpy as np

def timePipeline(self, query="What is the pin number for my account?"):
    """Override the PerformanceBenchmark.timePipeline() method"""
    latencies:list[float]=[]
    # Warmup the CPU before performing actual timed run
    for _ in range(10):
        _=self.pipeline(query)
    # Timed Run
    for _ in range(100):
        start_time=perf_counter()
        _=self.pipeline(query)
        latency=perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    avg_time_in_ms=1000 * np.mean(latencies)
    std_time_in_ms=1000 * np.std(latencies)
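    # The backslash is escaped so the printed "+\-" separator does not trigger an invalid-escape warning
    print(f"Latency: {avg_time_in_ms:.2f} ms +\\- {std_time_in_ms:.2f} ms")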
    return {
        "avg_time_in_ms":avg_time_in_ms,
        "std_time_in_ms":std_time_in_ms
    }

PerformanceBenchmark.timePipeline=timePipeline

In general, the `latency` will depend on the query length, and a good practice is to **benchmark your models with queries that they're likely to encounter in production environments**.
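One way to act on this advice is to time several representative queries rather than a single one. The sketch below is a minimal illustration: `time_queries` and `dummy_predict` are hypothetical names, and `dummy_predict` is a stand-in so the snippet runs without a model. In the notebook, `pipe` could be passed as `predict` instead.

```python
from statistics import mean
from time import perf_counter


def time_queries(predict, queries, warmup=3, runs=20):
    """Return the mean latency in milliseconds for each query."""
    results = {}
    for query in queries:
        # Warm-up calls are excluded from the measurement
        for _ in range(warmup):
            predict(query)
        latencies = []
        for _ in range(runs):
            start_time = perf_counter()
            predict(query)
            latencies.append(perf_counter() - start_time)
        results[query] = 1000 * mean(latencies)
    return results


def dummy_predict(text):
    """Stand-in for a real pipeline; does a little work proportional to the input length."""
    return [{"label": "oos", "score": 1.0 / (1 + len(text.split()))}]


queries = [
    "What is the pin number for my account?",
    "Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in Paris and I need a 15 passenger van",
]
latency_by_query = time_queries(dummy_predict, queries)
for query, latency_ms in latency_by_query.items():
    print(f"{len(query.split()):>2} words ---> {latency_ms:.4f} ms")
```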

## Performance Benchmarking
> For the **baseline model**, we just need to pass the pipeline and the dataset we wish to perform the benchmark on.

In [None]:
performance_benchmark=PerformanceBenchmark(pipeline=pipe, dataset=data)
performance_metrics=performance_benchmark.runBenchmark()

Model Size ---> 418.15 mb
Latency: 86.24 ms +\- 10.03 ms
Accuracy on test set ---> 0.8673


# Knowledge Distillation
> Making Models Smaller via Knowledge Distillation

* **Knowledge distillation**: Training a smaller model (student) to mimic the behavior of a slower, larger, but better-performing model (teacher).

* **Soft probabilities**: The teacher's confidence scores for each possible output, which give extra information to the student.

* **Dark knowledge**: The teacher's insights that are not captured by the labels alone, such as the similarity of different outputs.

* **History**: First used for ensemble models in 2006, then extended to deep neural networks in 2015.

* **Applications**: Used to `compress large pretrained language models` for practical use.
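Soft probabilities are commonly obtained by dividing the logits by a temperature before applying softmax. The sketch below uses made-up teacher logits and a hypothetical helper name to show how a higher temperature spreads probability mass onto the less likely classes, which is where the "dark knowledge" lives:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 2.0, 0.5]  # illustrative values only

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=2.0)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

With the higher temperature, the top class keeps the largest share but the other classes receive noticeably more probability, giving the student more signal about how the teacher ranks the alternatives.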

## Creating a Knowledge Distillation Trainer

To implement **knowledge distillation** we need to add a few things to the `Trainer` base class:

* The new hyperparameters `α` and `T`, which control the relative weight of the `distillation loss` and how much the probability distribution of the labels should be smoothed.

* The fine-tuned teacher model, which in our case is BERT-base.

* A new loss function that combines the cross-entropy loss with the knowledge distillation loss.

Adding the new hyperparameters is quite simple, since we just need to subclass `TrainingArguments` and include them as new attributes:

In [None]:
from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha=alpha
        self.temperature=temperature


For the trainer itself, we need a new loss function. The way to implement this is by subclassing `Trainer` and overriding the `compute_loss()` method to include the `knowledge distillation loss` term $L_{KD}$.

* **Cross Entropy Loss:** measures how well the `student model predicts` the correct labels.

* **Knowledge Distillation Loss:** measures how well the `student model mimics` the teacher model's output.
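Written out, the combined objective that the trainer code below computes can be sketched as follows. Here $p$ denotes the teacher's softened probabilities, $q$ the student's softened probabilities, and $N$ the batch size; these symbols are notation introduced for this summary, not taken from the source.

$$
L_{\text{student}} = \alpha \, L_{CE} + (1 - \alpha) \, L_{KD},
\qquad
L_{KD} = T^{2} \, \frac{1}{N} \sum_{i=1}^{N} \sum_{k} p_k(x_i) \log \frac{p_k(x_i)}{q_k(x_i)}
$$

with $p(x) = \mathrm{softmax}(z_{\text{teacher}}(x) / T)$ and $q(x) = \mathrm{softmax}(z_{\text{student}}(x) / T)$. The factor $T^{2}$ compensates for the smaller gradients produced by the softened distributions, and the average over the batch corresponds to `reduction="batchmean"`.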


In [84]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher_model=teacher_model

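    # num_items_in_batch is accepted (and ignored) because recent transformers releases pass it to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):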
        student_outputs=model(**inputs)
        # Extract Cross Entropy Loss and Logits from Student
        cross_entropy_loss=student_outputs.loss
        student_logits=student_outputs.logits

        # Extract logits from Teacher
        with torch.no_grad():
            teacher_outputs=self.teacher_model(**inputs)
            teacher_logits=teacher_outputs.logits
        # Soften Probabilities and Compute Distillation Loss
        loss_function=nn.KLDivLoss(reduction="batchmean")
        # Temperature
        T=self.args.temperature
        knowledge_distillation_loss=T ** 2 * loss_function(
            F.log_softmax(student_logits/T, dim=-1),
            F.softmax(teacher_logits/T, dim=-1)
        )

        # Return weighted student loss
        loss=self.args.alpha * cross_entropy_loss + (1. - self.args.alpha) * knowledge_distillation_loss
        return (loss, student_outputs) if return_outputs else loss

## Choosing a good student initialization