DistilBERT (Distilled BERT) is a smaller, faster, and lighter version of BERT (Bidirectional Encoder Representations from Transformers). It was introduced by Hugging Face and leverages knowledge distillation to achieve performance similar to BERT with significantly fewer parameters.

### Key Concepts of DistilBERT

1. Knowledge Distillation:

    * A smaller student model (DistilBERT) is trained to mimic the behavior of a larger, already-trained teacher model (BERT).
    * During training, the student learns not only from the true labels (hard targets) but also from the "soft labels" (output probabilities) produced by the teacher.
    * This helps the smaller model retain much of the performance of the larger one.

2. Size Reduction:
  * DistilBERT has about 40% fewer parameters than BERT.
  * This reduction is achieved by:
      * Removing the token-type (segment) embeddings, which BERT uses to tell apart the two segments of a sentence-pair input, and the pooler.
      * Reducing the number of transformer layers by half (BERT-base has 12 layers; DistilBERT has 6).

3. Speed and Efficiency:
  * According to its authors, DistilBERT is about 60% faster at inference while retaining about 97% of BERT's performance on the GLUE natural language understanding benchmark.

4. Training Objectives:
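  * Masked Language Modeling (MLM): Predicts randomly masked tokens in a sentence (like BERT).
  * Distillation Loss: Pushes the student's soft output probabilities toward the teacher's.
  * Cosine Embedding Loss: Aligns the direction of the student's hidden-state vectors with the teacher's.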
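
The combination of hard targets and soft labels described above can be sketched in plain Python. This is a minimal illustration of the general knowledge-distillation idea, not DistilBERT's actual training code: the `temperature` and `alpha` values, the temperature-squared scaling, and the example logits are all assumptions chosen for the demonstration.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    # Scaling by temperature**2 is a common convention (an assumption here).
    return kl * temperature ** 2

def hard_label_loss(student_logits, true_index):
    """Cross-entropy against the true label (the hard target)."""
    return -math.log(softmax(student_logits)[true_index])

def combined_loss(student_logits, teacher_logits, true_index,
                  alpha=0.5, temperature=2.0):
    """Weighted sum of the soft-label and hard-label terms (alpha is assumed)."""
    soft = distillation_loss(student_logits, teacher_logits, temperature)
    hard = hard_label_loss(student_logits, true_index)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical logits over three classes for a single example.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]
print(combined_loss(student_logits, teacher_logits, true_index=0))
```

When the student's logits equal the teacher's, the distillation term is zero, so only the hard-label term remains.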

### Applications of DistilBERT

DistilBERT can perform various NLP tasks, similar to BERT:

1. Text classification (e.g., sentiment analysis)
2. Named Entity Recognition (NER)
3. Question Answering
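4. Text Summarization (extractive; as an encoder-only model, DistilBERT does not generate text by itself)
5. Language Translation (only as the encoder of a larger encoder-decoder system)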

### Advantages of DistilBERT

1. Compact: Requires less memory and computational power.
2. Fast: Well suited to latency-sensitive settings such as chatbots or on-device (mobile) inference.
3. Generalizable: Retains the versatility of BERT for various NLP tasks.
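
As a rough back-of-the-envelope check on the size claim, the sketch below estimates how much memory the weights alone would take. It assumes the commonly reported approximate parameter counts (about 110 million for BERT-base and about 66 million for DistilBERT) and 32-bit floats; activations, optimizer state, and framework overhead are not included.

```python
BYTES_PER_FP32 = 4  # assumption: weights stored as 32-bit floats

# Approximate, commonly reported parameter counts.
bert_base_params = 110_000_000
distilbert_params = 66_000_000

def weights_megabytes(num_params, bytes_per_param=BYTES_PER_FP32):
    """Memory needed to hold the weights only, in megabytes."""
    return num_params * bytes_per_param / 1e6

reduction = 1 - distilbert_params / bert_base_params
print(f"Parameter reduction: {reduction:.0%}")
print(f"BERT-base weights:  {weights_megabytes(bert_base_params):.0f} MB")
print(f"DistilBERT weights: {weights_megabytes(distilbert_params):.0f} MB")
```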

Here's how you can use DistilBERT for text classification with the Hugging Face Transformers library:

In [1]:
!pip install transformers datasets



In [1]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

In [2]:
import os
os.environ["WANDB_MODE"] = "disabled"

In [3]:
# Load dataset (e.g., IMDb for sentiment analysis)
dataset = load_dataset("imdb")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")




In [4]:
# Tokenize data
def preprocess_data(example):
    return tokenizer(example['text'], truncation=True, padding='max_length', max_length=128)

tokenized_datasets = dataset.map(preprocess_data, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
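
To make the effect of `truncation=True`, `padding='max_length'`, and `max_length=128` in the cell above more concrete, here is a simplified pure-Python sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate`, the `PAD_ID` value, and the token ids are hypothetical and for illustration only; the real tokenizer also takes care to keep special tokens in place when truncating, which this sketch ignores.

```python
PAD_ID = 0  # assumed padding token id, for illustration only

def pad_or_truncate(token_ids, max_length=128, pad_id=PAD_ID):
    """Cut a sequence to max_length, or right-pad it up to max_length."""
    kept = token_ids[:max_length]        # truncation
    n_pad = max_length - len(kept)       # number of padding positions to add
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }

short_review = [101, 2023, 3185, 2001, 2307, 102]  # made-up token ids
encoded = pad_or_truncate(short_review, max_length=8)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Padding every example to the same length keeps batching simple, at the cost of spending compute on padding positions for short reviews.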

In [5]:
# Load model
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # evaluate after every epoch; older transformers releases name this argument evaluation_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
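
To get a feel for how long training with the arguments above will run, the number of optimizer steps can be estimated with simple arithmetic. The sketch assumes a single device and no gradient accumulation (both are assumptions about the runtime, not something the notebook states), and uses the 25,000 examples in the IMDb training split.

```python
import math

train_examples = 25_000           # size of the IMDb training split
per_device_train_batch_size = 16  # from the training arguments above
num_train_epochs = 3              # from the training arguments above
num_devices = 1                   # assumption: one GPU or CPU
gradient_accumulation_steps = 1   # assumption: no accumulation

effective_batch = (per_device_train_batch_size * num_devices
                   * gradient_accumulation_steps)
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch)  # 1563
print(total_steps)      # 4689
```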


In [None]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

# Train model
trainer.train()





In [None]:
# Evaluate
trainer.evaluate()
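
With the `Trainer` configured as above, evaluation reports the loss but no task metric such as accuracy. One possible way to add accuracy is a `compute_metrics` function passed to the `Trainer` constructor. The sketch below is written in plain Python so it runs without extra dependencies; in practice NumPy's `argmax` is the more common choice, and the example logits and labels are made up.

```python
def compute_metrics(eval_pred):
    """Return accuracy given (logits, labels) for the evaluation set."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Made-up logits for three reviews and their true labels.
example_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]]
example_labels = [1, 0, 1]
print(compute_metrics((example_logits, example_labels)))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the returned dictionary is then included in the evaluation results.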

### Summary

* DistilBERT is a smaller, faster alternative to BERT that uses knowledge distillation to retain most of BERT's performance.
* It is ideal for real-world applications where computational resources are limited.
* You can use the Hugging Face Transformers and Datasets libraries to quickly fine-tune DistilBERT for your specific tasks.