HuggingFace's transformers library makes it much easier to build and use LLM architectures, especially with PyTorch, which is my main framework. It automates many routine steps, so we can avoid reinventing the wheel and focus on the core of our work.

Link: https://huggingface.co/learn/nlp-course/chapter1/1?fw=pt

## Tokenizer

In [2]:
from transformers import AutoTokenizer

In [4]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
raw_inputs = [
    "I'm using HuggingFace to refresh my NLP knowledge",
    "I love machine learning!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  1049,  2478, 17662, 12172,  2000, 25416, 21898,
          2026, 17953,  2361,  3716,   102],
        [  101,  1045,  2293,  3698,  4083,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
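To make the output above concrete, here is a minimal sketch of what `padding=True` appears to do: right-pad every sequence to the longest one in the batch and mark real tokens with 1 in the attention mask. The helper `pad_batch` is hypothetical, written in plain Python for illustration, and is not the library's implementation. The pad id 0 is taken from the printed `input_ids`.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # padded ids
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids copied from the tokenizer output above (without padding).
first = [101, 1045, 1005, 1049, 2478, 17662, 12172, 2000, 25416, 21898,
         2026, 17953, 2361, 3716, 102]
second = [101, 1045, 2293, 3698, 4083, 999, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The second sequence has 7 tokens, so it receives 8 padding positions, which matches the row of zeros in the printed tensors.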


In [24]:
tokens = tokenizer.tokenize(raw_inputs[0])

print(tokens)

['i', "'", 'm', 'using', 'hugging', '##face', 'to', 'ref', '##resh', 'my', 'nl', '##p', 'knowledge']
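The `##` pieces above come from WordPiece-style subword splitting. A rough sketch of the greedy longest-match-first idea follows. The function `wordpiece` and the tiny `vocab` are hypothetical and only cover the words shown here. The real tokenizer also lowercases, splits punctuation, and uses a vocabulary of roughly 30k entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption), just enough for the example sentence.
vocab = {"using", "hugging", "##face", "ref", "##resh", "nl", "##p", "knowledge"}

for word in ["using", "huggingface", "refresh", "nlp"]:
    print(word, "->", wordpiece(word, vocab))
```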


In [25]:
decoded_string = tokenizer.decode([101,  1045,  1005,  1049,  2478, 17662, 12172,  2000, 25416, 21898,
          2026, 17953,  2361,  3716,   102])
print(decoded_string)

[CLS] i'm using huggingface to refresh my nlp knowledge [SEP]


In [22]:
# tokenizer.save_pretrained("...path...")

## Feeding inputs to the model

In [8]:
from transformers import AutoModel
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
# The output is the last hidden state, which can serve as input to downstream tasks
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 15, 768])
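The shape reads as (batch size 2, sequence length 15, hidden size 768): one 768-dimensional vector per token. One common way to turn these into a single vector per sentence for a downstream task is masked mean pooling. The sketch below is an assumption about how one might use the hidden states, not something the notebook does. It uses plain nested lists and tiny made-up numbers so it runs without PyTorch.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors of each sequence, skipping padded positions."""
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
        dim = len(token_vectors[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled


# Hypothetical hidden states: batch of 2, sequence length 3, hidden size 2.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]

sentence_vectors = masked_mean_pool(hidden, mask)
print(sentence_vectors)  # [[2.0, 3.0], [4.0, 2.0]]
```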


In [10]:
outputs

BaseModelOutput(last_hidden_state=tensor([[[ 0.2838,  0.1341,  0.7624,  ..., -0.0148,  0.5160, -0.3466],
         [ 0.8246,  0.5177,  0.7209,  ..., -0.0891,  0.5928, -0.0592],
         [ 0.8274,  0.4883,  0.4438,  ...,  0.2366, -0.2426, -0.6068],
         ...,
         [-0.1527,  0.2495,  0.4099,  ..., -0.3064, -0.0788, -0.0025],
         [ 0.2095,  0.4228,  0.9903,  ...,  0.1401,  0.4759, -0.2111],
         [ 0.5834,  0.6062,  0.4520,  ...,  0.4326, -0.0893, -0.4952]],

        [[ 0.6307, -0.0675, -0.1219,  ...,  0.3686,  0.8636, -0.4454],
         [ 1.0353,  0.1182, -0.3107,  ...,  0.3189,  0.9273, -0.1798],
         [ 1.1944,  0.4115,  0.1433,  ...,  0.2893,  1.0146, -0.2526],
         ...,
         [ 0.4094,  0.1410,  0.1120,  ...,  0.5282,  0.6460, -0.3872],
         [ 0.6244,  0.3469, -0.1221,  ...,  0.1465,  0.7602, -0.1418],
         [ 0.6210,  0.1682,  0.0198,  ...,  0.4434,  0.5817, -0.4767]]],
       grad_fn=<NativeLayerNormBackward>), hidden_states=None, attentions=None)

### Classification

In [16]:
# For classification
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([2, 2])


In [17]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-2.7711,  2.8365],
        [-4.2119,  4.5484]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [18]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[3.6564e-03, 9.9634e-01],
        [1.5681e-04, 9.9984e-01]], grad_fn=<SoftmaxBackward>)
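As a sanity check, the softmax values above can be reproduced by hand from the printed logits. This is a plain-Python sketch of the formula `exp(x_i) / sum_j exp(x_j)`, not the PyTorch kernel. Since the logits are copied at four decimals, the results should agree only up to rounding.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the SequenceClassifierOutput above.
logits = [[-2.7711, 2.8365], [-4.2119, 4.5484]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([f"{p:.4e}" for p in row])
```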


In [20]:
torch.argmax(predictions, dim=1)

tensor([1, 1])
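The class indices still need to be mapped to label names. The sketch below assumes that this SST-2 checkpoint maps 0 to NEGATIVE and 1 to POSITIVE. That mapping is not shown in this notebook, so it is worth confirming with `model.config.id2label` before relying on it.

```python
# Assumed label mapping for this checkpoint; verify with model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
probs = [[3.6564e-03, 9.9634e-01], [1.5681e-04, 9.9984e-01]]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


predicted_ids = [argmax(row) for row in probs]
predicted_labels = [id2label[i] for i in predicted_ids]
print(predicted_ids, predicted_labels)  # [1, 1] ['POSITIVE', 'POSITIVE']
```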

### Question Answering

In [14]:
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
outputs = model(**inputs)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertForQuestionAnswering: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stre

In [15]:
outputs

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-0.3043, -0.2337, -0.0926, -0.1852, -0.0555, -0.0929, -0.1327, -0.1678,
          0.0032, -0.0625, -0.0135,  0.0215,  0.0277,  0.0527, -0.0299],
        [-0.1896, -0.0478,  0.1463, -0.0870, -0.0384, -0.0273, -0.0391,  0.0259,
          0.0408,  0.0670, -0.0503,  0.0164,  0.0483,  0.0132, -0.0048]],
       grad_fn=<CopyBackwards>), end_logits=tensor([[ 0.0609,  0.0902, -0.0635, -0.1254, -0.1044, -0.0873, -0.1234, -0.1236,
          0.1952,  0.2212,  0.3397,  0.1109,  0.0489,  0.0885, -0.0533],
        [-0.1319,  0.0147, -0.1183, -0.0720, -0.0228, -0.0718, -0.0561, -0.1151,
         -0.1420, -0.1248, -0.1660, -0.2351, -0.2014, -0.0825, -0.1387]],
       grad_fn=<CopyBackwards>), hidden_states=None, attentions=None)
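The question-answering head returns one start logit and one end logit per token. A usual way to decode them is to pick the span `(start, end)` with `start <= end` that maximizes the sum of the two scores. The sketch below illustrates that idea with made-up logits. `best_span` and `max_answer_len` are hypothetical names. Note that the logits printed above come from a freshly initialized `qa_outputs` layer (see the warning), so decoding them would not give a meaningful answer.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return the (start, end) token indices with the highest combined score."""
    best_score, best = float("-inf"), (0, 0)
    for start, start_score in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):  # end is never before start
            score = start_score + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best


# Hypothetical logits for a 4-token input.
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [0.0, 0.5, 1.5, 3.0]

print(best_span(start_logits, end_logits))                    # (1, 3)
print(best_span(start_logits, end_logits, max_answer_len=2))  # (1, 2)
```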

## Finetuning

Training on a single batch

In [29]:
import torch
# transformers.AdamW is deprecated in favor of torch.optim.AdamW
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
# checkpoint = "bert-base-uncased"
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
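When `labels` are present in the batch, the sequence-classification model returns a loss. For integer class labels this is, as far as I understand, the mean cross-entropy over the batch. The sketch below computes that quantity by hand. The logits are hypothetical, since the notebook does not print them for these two sentences.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class for one example."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[label]


def batch_loss(batch_logits, labels):
    """Mean cross-entropy over the batch."""
    losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


example_logits = [[0.0, 0.0], [-1.0, 1.0]]  # hypothetical model outputs
labels = [1, 1]                             # same labels as in the cell above

loss = batch_loss(example_logits, labels)
print(round(loss, 4))
```

An undecided prediction (`[0.0, 0.0]`) contributes `log(2)`, while a confident correct one contributes much less, which is what `loss.backward()` then pushes the weights toward.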



In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Found cached dataset glue (/home/yi/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



Loading cached processed dataset at /home/yi/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-03cc5d1bf1a32620.arrow



Loading cached processed dataset at /home/yi/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-b52f9423caf13a03.arrow
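`tokenize_function` passes two sentences at once. For BERT-style tokenizers this should yield a single sequence of the form `[CLS] sentence1 [SEP] sentence2 [SEP]`. A simplified sketch of that layout follows. The ids 101 and 102 are the `[CLS]` and `[SEP]` ids visible in the earlier tokenizer output. `encode_pair`, the `max_length` default of 512, and the truncation rule (trim the longer side one token at a time) are assumptions for illustration.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] from the earlier output


def encode_pair(ids_a, ids_b, max_length=512):
    """Join two tokenized sentences as [CLS] A [SEP] B [SEP], truncating if needed."""
    if max_length < 3:
        raise ValueError("max_length must leave room for the 3 special tokens")
    ids_a, ids_b = list(ids_a), list(ids_b)
    budget = max_length - 3  # room left after the special tokens
    while len(ids_a) + len(ids_b) > budget:
        # Drop one token from whichever sentence is currently longer.
        if len(ids_a) > len(ids_b):
            ids_a.pop()
        else:
            ids_b.pop()
    return [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]


# Word ids reused from the first section ("i love" / "machine learning !").
print(encode_pair([1045, 2293], [3698, 4083, 999]))
print(encode_pair([1, 2, 3, 4], [5, 6], max_length=7))
```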


In [3]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

In [5]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [6]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 0.5205        |
| 1000 | 0.29          |


TrainOutput(global_step=1377, training_loss=0.3316495087844527, metrics={'train_runtime': 38.926, 'train_samples_per_second': 282.69, 'train_steps_per_second': 35.375, 'total_flos': 204140540541312.0, 'train_loss': 0.3316495087844527, 'epoch': 3.0})
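The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch assumes 3,668 training examples in MRPC and the `TrainingArguments` defaults of batch size 8 on a single device. Neither figure is printed in this notebook, so treat both as assumptions. The 3 epochs and the runtime come from the output above.

```python
import math

train_examples = 3668   # assumed size of the MRPC train split
batch_size = 8          # assumed default per-device batch size, single device
epochs = 3              # matches 'epoch': 3.0 in the output above
train_runtime = 38.926  # seconds, from the output above

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 459 1377

steps_per_second = total_steps / train_runtime
samples_per_second = train_examples * epochs / train_runtime
print(round(steps_per_second, 3), round(samples_per_second, 2))
```

Under these assumptions the step count and both throughput figures line up with `global_step`, `train_steps_per_second`, and `train_samples_per_second`.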

In [None]:
import numpy as np
import evaluate
predictions = trainer.predict(tokenized_datasets["validation"])
preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
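The GLUE MRPC metric reports accuracy and F1. For intuition, here is a hand-rolled sketch of both on a tiny made-up example. `accuracy_and_f1` is a hypothetical helper, and the predictions and references are invented, not results from this run.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy and binary F1 (for the positive class) from two label lists."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Invented labels for illustration only.
preds = [1, 0, 1, 1, 0, 1]
refs = [1, 0, 0, 1, 1, 1]

scores = accuracy_and_f1(preds, refs)
print(scores)
```

Here 4 of 6 predictions are correct, with 3 true positives, 1 false positive, and 1 false negative, so F1 is 6 / 8 = 0.75.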

## Save model

In [21]:
# Save the model
# model.save_pretrained("..path..")
# Writes config.json plus a weights file (pytorch_model.bin; newer transformers versions write model.safetensors by default)