In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch # pytorch

torch.cuda.is_available() # check whether PyTorch can see a CUDA GPU

True

We use the IMDb dataset for sentiment classification. Each review already comes with a label (1 for positive, 0 for negative sentiment).
- See it here: https://huggingface.co/datasets/stanfordnlp/imdb

In [11]:
dataset = load_dataset("imdb")
train_data = dataset['train'].shuffle(seed=42).select(range(2000))  # Limit for faster execution
test_data = dataset['test'].shuffle(seed=42).select(range(500))

In [12]:
model_name = "distilbert-base-uncased" # model name from huggingface repo
tokenizer = AutoTokenizer.from_pretrained(model_name) # load the tokenizer model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) # load the model itself

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


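Note: the above warning is expected. `distilbert-base-uncased` is a (smaller) BERT model whose final layer is the last encoder layer (no classification layer), so on its own it just computes a vector embedding for each token in a sequence. With `AutoModelForSequenceClassification` we put a freshly initialized classification head on top (the `pre_classifier` and `classifier` layers named in the warning). In the Hugging Face implementation this head takes the embedding of the first (`[CLS]`) token, rather than mean pooling all tokens, and classifies that vector into "positive" or "negative". The head's weights are random until we fine-tune the model, which is why the warning tells us to train it.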
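To make the role of the new head concrete, here is a minimal sketch of what the `pre_classifier` and `classifier` layers named in the warning might look like. It assumes the head reads the first token's embedding, applies a linear layer with ReLU and dropout, and then projects to two logits. The class name `ClassificationHeadSketch`, the dropout rate of 0.2, and the random input tensor are illustrative assumptions, not the library's actual code.

```python
import torch
from torch import nn


class ClassificationHeadSketch(nn.Module):
    """Hypothetical stand-in for the head placed on top of the encoder."""

    def __init__(self, hidden_size=768, num_labels=2, dropout=0.2):
        super().__init__()
        self.pre_classifier = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)  # assumed rate, for illustration
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state):
        # (batch_size, sequence_length, hidden_size) -> (batch_size, hidden_size)
        first_token = last_hidden_state[:, 0]
        x = torch.relu(self.pre_classifier(first_token))
        x = self.dropout(x)
        # (batch_size, hidden_size) -> (batch_size, num_labels)
        return self.classifier(x)


head = ClassificationHeadSketch().eval()  # eval() disables dropout
fake_hidden = torch.randn(4, 16, 768)     # random stand-in for encoder output
with torch.no_grad():
    logits = head(fake_hidden)
print(logits.shape)
```

Each of the 4 sequences in the fake batch ends up with one logit per class, whatever the sequence length.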

We can visualize this model (using an [onnx](https://onnx.ai/) version) using [netron](https://github.com/lutzroeder/netron): see [here](https://netron.app/?url=https://huggingface.co/onnxport/distilbert-base-uncased-onnx/blob/main/model.onnx).

Note how the last node (the output) is named "last_hidden_layer". If you click on it you see that the dimension of the layer is: (batch_size, sequence_length, 768).

- Batch size is the number of sequences passed through the model together
- Sequence length is the number of tokens per sequence, after padding to the longest sequence in the batch
- 768 is the size of each token embedding

We can use ONNX to export the model we just initialized. If we visualize the exported file in Netron, we see that a linear head has been added on top:

In [13]:
inputs = tokenizer("This is a sample input for ONNX export.", return_tensors="pt")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

torch.onnx.export(
    model,                                          # Model to export
    (inputs["input_ids"].to(device), inputs["attention_mask"].to(device)), # Input example (tuple), on the same device as the model
    "distilbert_model_binary_classification.onnx",                         # Path to save ONNX model
    input_names=["input_ids", "attention_mask"],     # Names for the inputs
    output_names=["output"],                         # Names for the outputs
    dynamic_axes={                                   # Dynamic axes for variable sequence lengths
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size"}
    },
    opset_version=14                                # ONNX opset version
)

Let's now fine-tune the model using the `Trainer` class from the transformers library.

In [14]:
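def tokenize(batch):
    # Pad every review to the model's maximum length so all examples share one shape.
    # (padding=True would only pad to the longest review within each map() batch.)
    return tokenizer(batch["text"], padding="max_length", truncation=True)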

train_data = train_data.map(tokenize, batched=True)
test_data = test_data.map(tokenize, batched=True)

In [15]:
train_data[0].keys()

dict_keys(['text', 'label', 'input_ids', 'attention_mask'])

`input_ids` is the list of token ids for the input text, and `attention_mask` is a list of the same length that marks real tokens with 1 and padding tokens with 0, so the model can ignore the padding.
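As a rough illustration of how these two fields relate, the sketch below pads two short lists of token ids to a common length and builds the matching masks. The helper `pad_batch`, the token ids, and the choice of 0 as the padding id are assumptions made for the example. The real tokenizer does this internally.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative token ids, not actual tokenizer output
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2428, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence gets two padding ids, and its mask has zeros in exactly those positions.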

In [16]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # evaluate after every epoch (newer transformers releases name this eval_strategy)
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
)



Let's train the transformer. This will take a few minutes (note: if you are running on a GPU, you can use `nvidia-smi` to monitor GPU utilization).

In [17]:
trainer.train()
trainer.evaluate()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.311137 |
| 2 | 0.279300 | 0.465389 |


{'eval_loss': 0.46538928151130676,
 'eval_runtime': 9.4743,
 'eval_samples_per_second': 52.775,
 'eval_steps_per_second': 6.65,
 'epoch': 2.0}
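The "No log" entry for epoch 1 is most likely a logging-interval effect rather than a training problem. The arithmetic below is a sketch under two assumptions that the notebook does not state: training ran on a single device, and `logging_steps` was left at what we believe is its default of 500.

```python
import math

num_train_samples = 2000           # size of the training subset selected above
per_device_train_batch_size = 8    # from TrainingArguments
num_train_epochs = 2               # from TrainingArguments
num_devices = 1                    # assumption: single GPU
logging_steps = 500                # assumption: Trainer default

steps_per_epoch = math.ceil(
    num_train_samples / (per_device_train_batch_size * num_devices)
)
total_steps = steps_per_epoch * num_train_epochs
# First epoch whose end has seen at least one logged training loss
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)
```

Under these assumptions the first training loss is only recorded at step 500, which is the end of epoch 2. Setting a smaller `logging_steps` in `TrainingArguments` should give a training loss for every epoch.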

Note: the run above fine-tunes the whole model. Backpropagation updates every parameter, in the encoder as well as in the classification head, as the `requires_grad` flags below confirm:

In [18]:
for name, param in model.named_parameters():
    print(f"Parameter name: {name} | Shape: {param.shape} | Requires grad: {param.requires_grad}")

Parameter name: distilbert.embeddings.word_embeddings.weight | Shape: torch.Size([30522, 768]) | Requires grad: True
Parameter name: distilbert.embeddings.position_embeddings.weight | Shape: torch.Size([512, 768]) | Requires grad: True
Parameter name: distilbert.embeddings.LayerNorm.weight | Shape: torch.Size([768]) | Requires grad: True
Parameter name: distilbert.embeddings.LayerNorm.bias | Shape: torch.Size([768]) | Requires grad: True
Parameter name: distilbert.transformer.layer.0.attention.q_lin.weight | Shape: torch.Size([768, 768]) | Requires grad: True
Parameter name: distilbert.transformer.layer.0.attention.q_lin.bias | Shape: torch.Size([768]) | Requires grad: True
Parameter name: distilbert.transformer.layer.0.attention.k_lin.weight | Shape: torch.Size([768, 768]) | Requires grad: True
Parameter name: distilbert.transformer.layer.0.attention.k_lin.bias | Shape: torch.Size([768]) | Requires grad: True
Parameter name: distilbert.transformer.layer.0.attention.v_lin.weight | Shap
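To get a feel for how many values are being updated, we can multiply out the shapes shown above. This sketch only covers the first eight parameters that are fully visible in the printout, with the shapes copied by hand, so the total is a partial count and not the size of the whole model.

```python
from math import prod

# Shapes copied from the fully visible lines of the printout above
shapes = {
    "distilbert.embeddings.word_embeddings.weight": (30522, 768),
    "distilbert.embeddings.position_embeddings.weight": (512, 768),
    "distilbert.embeddings.LayerNorm.weight": (768,),
    "distilbert.embeddings.LayerNorm.bias": (768,),
    "distilbert.transformer.layer.0.attention.q_lin.weight": (768, 768),
    "distilbert.transformer.layer.0.attention.q_lin.bias": (768,),
    "distilbert.transformer.layer.0.attention.k_lin.weight": (768, 768),
    "distilbert.transformer.layer.0.attention.k_lin.bias": (768,),
}

# Number of scalar values in each tensor = product of its dimensions
counts = {name: prod(shape) for name, shape in shapes.items()}
for name, n in counts.items():
    print(f"{name}: {n:,}")

listed_total = sum(counts.values())
print(f"Total for the listed parameters: {listed_total:,}")
```

On the live model, `sum(p.numel() for p in model.parameters() if p.requires_grad)` gives the full trainable count.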

In [32]:
tok = tokenizer("this is really bad", return_tensors="pt", truncation=True, padding=True, return_attention_mask=True)

In [36]:
with torch.no_grad():  # Disable gradient calculations for inference
    outputs = model(input_ids=tok["input_ids"].to(model.device), attention_mask=tok["attention_mask"].to(model.device))  # move inputs to the model's device

In [37]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.2898, -2.6190]], device='cuda:0'), hidden_states=None, attentions=None)

In [38]:
predictions = torch.argmax(outputs.logits, dim=1)

In [39]:
predictions

tensor([0], device='cuda:0')
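The predicted class is 0, which is the negative label in this dataset, so the model reads "this is really bad" as negative. The logits themselves are unnormalized scores. A softmax turns them into probabilities, as in this small sketch that reuses the two logit values printed above. The `softmax` helper and the `labels` list are written here for illustration.

```python
import math

logits = [2.2898, -2.6190]          # values copied from the output above
labels = ["negative", "positive"]   # index 0 = negative, 1 = positive


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
print(predicted, [round(p, 4) for p in probs])
```

With tensors, `torch.softmax(outputs.logits, dim=1)` performs the same computation on the model output directly.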