<a href="https://colab.research.google.com/github/vkjadon/llm/blob/main/05hf_tokens_to_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

When you call the tokenizer directly on a sentence, you get back inputs that are ready to pass through your model.

In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

It also handles multiple sequences at a time, with no change in the API:

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

In [None]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
print(len(model_inputs["input_ids"][0]))
print(len(model_inputs["input_ids"][1]))
print(model_inputs["input_ids"])

In [None]:
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
print(len(model_inputs["input_ids"][0]))
print(model_inputs["input_ids"])

In [None]:
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
print(len(model_inputs["input_ids"][0]))
print(len(model_inputs["input_ids"][1]))
print(model_inputs["input_ids"][1])

In [None]:
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
print(len(model_inputs["input_ids"][0]))
print(len(model_inputs["input_ids"][1]))
print(model_inputs["input_ids"][1])

In [None]:
# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
print(len(model_inputs["input_ids"][0]))
print(len(model_inputs["input_ids"][1]))
print(model_inputs["input_ids"][1])
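
To make the padding and truncation options above concrete, here is a minimal pure-Python sketch of the same logic applied to plain lists of IDs. The helper `pad_and_truncate`, its parameters, and the token IDs are hypothetical illustrations, not part of the `transformers` API, and the real tokenizer handles many more cases.

```python
# Illustrative stand-in for the tokenizer's padding/truncation behaviour.
# pad_and_truncate is a hypothetical helper, not a transformers function.

def pad_and_truncate(batch, padding=None, truncation=False, max_length=None,
                     pad_id=0, model_max_length=512):
    """Mimic padding="longest" / "max_length" and truncation=True on ID lists."""
    # Without an explicit max_length, fall back to the model's maximum length
    limit = max_length if max_length is not None else model_max_length
    if truncation:
        # Cut long sequences but keep the closing special token
        batch = [ids[:limit - 1] + ids[-1:] if len(ids) > limit else ids
                 for ids in batch]
    if padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        target = limit
    else:
        target = None
    if target is not None:
        # Pad on the right; sequences already longer than target are left alone
        batch = [ids + [pad_id] * (target - len(ids)) for ids in batch]
    return batch

batch = [[101, 7, 8, 9, 10, 102], [101, 7, 102]]  # made-up token IDs

longest = pad_and_truncate(batch, padding="longest")
pad_only = pad_and_truncate(batch, padding="max_length", max_length=4)
pad_and_cut = pad_and_truncate(batch, padding="max_length", max_length=4,
                               truncation=True)

print([len(ids) for ids in longest])
print([len(ids) for ids in pad_only])
print(pad_and_cut)
```

Note that padding to a `max_length` without `truncation=True` leaves longer sequences untouched, which is why the lengths printed by the `max_length=8` padding cell differ between the two sentences.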

The tokenizer object can handle the conversion to specific framework tensors, which can then be sent directly to the model. For example, in the following code samples we ask the tokenizer to return tensors from different frameworks: "pt" returns PyTorch tensors and "np" returns NumPy arrays.

In [None]:
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

print(len(model_inputs["input_ids"][0]))
print(len(model_inputs["input_ids"][1]))
print(type(model_inputs["input_ids"]))

In [None]:
# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
print(len(model_inputs["input_ids"][0]))
print(len(model_inputs["input_ids"][1]))
print(type(model_inputs["input_ids"][1]))
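
Both calls above pass `padding=True` because a tensor or array must be rectangular. The sketch below, in plain Python with made-up IDs, shows the check that a ragged batch fails and how padding plus an attention mask (1 for real tokens, 0 for padding) resolves it. The helper `is_rectangular` is a hypothetical name used only for illustration.

```python
# Why padding is needed before converting a batch to a tensor.
# is_rectangular is a hypothetical helper, not a library function.

def is_rectangular(batch):
    """True when every row has the same length."""
    return len({len(row) for row in batch}) <= 1

ragged = [[101, 7, 8, 102], [101, 7, 102]]  # made-up token IDs
width = max(len(row) for row in ragged)

# Pad every row to the longest length and record which positions are real
padded = [row + [0] * (width - len(row)) for row in ragged]
attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in ragged]

print(is_rectangular(ragged), is_rectangular(padded))
print(attention_mask)
```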

If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:

In [None]:
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

One token ID was added at the beginning, and one at the end. Let’s decode the two sequences of IDs above to see what this is about:

In [None]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

The tokenizer added the special word [CLS] at the beginning and the special word [SEP] at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don’t add special words, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
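
As a rough sketch of what the tokenizer does for BERT-style models, the snippet below wraps a list of content IDs with the `[CLS]` and `[SEP]` IDs (101 and 102 in the BERT uncased vocabulary). The helper names and the content IDs are assumptions made for illustration; in practice the tokenizer does this for you.

```python
# Hypothetical helpers illustrating BERT-style special tokens.
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the BERT uncased vocabulary

def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Single sequence: [CLS] A [SEP]."""
    return [cls_id] + list(ids) + [sep_id]

def add_special_tokens_pair(ids_a, ids_b, cls_id=CLS_ID, sep_id=SEP_ID):
    """Sequence pair: [CLS] A [SEP] B [SEP]."""
    return [cls_id] + list(ids_a) + [sep_id] + list(ids_b) + [sep_id]

content_ids = [2023, 2003, 1037, 3231]  # example content IDs
with_special = add_special_tokens(content_ids)
pair = add_special_tokens_pair([2023, 2003], [1037, 3231])

print(with_special)
print(pair)
```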

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [None]:
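# The model expects tensors, so tokenize with return_tensors="pt"
model_inputs = tokenizer(sequences[0], return_tensors="pt")
output = model(**model_inputs)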

In [None]:
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

In [None]:
print(type(output))

In [None]:
print(dir(output))

The `logits` attribute holds the model's raw prediction scores, one per class, before any softmax is applied:

In [None]:
print(output.logits)
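
Logits are unnormalized scores, so they are usually passed through a softmax to obtain probabilities. Here is a small pure-Python sketch of that step. The logit values are hypothetical, and the label mapping assumes the order used by the SST-2 checkpoint loaded above (0 = NEGATIVE, 1 = POSITIVE).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label order of the checkpoint
example_logits = [-1.5, 1.5]               # hypothetical values, not real model output

probs = softmax(example_logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability

print(id2label[pred], probs)
```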


In [None]:
model.config.output_hidden_states = True
output = model(**tokens)
print(len(output.hidden_states))   # embedding output + one entry per layer (7 for DistilBERT)
print(output.hidden_states[-1])    # output of the final layer


In [None]:
print(output.keys())


In [None]:
print(output.values())


In [None]:
for k, v in output.items():
    print(k, type(v))


In [None]:
print(output.get("logits"))
print(output.get("loss", "No loss present"))  # default value


In [None]:
import torch

tok = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
tokens = tok("I love robotics!", return_tensors="pt")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
output = model(**tokens, labels=torch.tensor([1]))
print(output.loss)
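
When integer `labels` are passed to a classification model with more than one label, the returned loss is a cross-entropy between the logits and the label. The sketch below reproduces that computation for a single example in plain Python. The logit values are hypothetical, and `cross_entropy` is an illustrative helper rather than the library's implementation.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one example: -log softmax(logits)[label]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

logits = [-2.0, 2.0]  # hypothetical scores for classes 0 and 1

loss_if_label_1 = cross_entropy(logits, 1)  # model agrees with the label: small loss
loss_if_label_0 = cross_entropy(logits, 0)  # model disagrees: large loss

print(loss_if_label_1, loss_if_label_0)
```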

In [None]:
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2   # spam or not spam
)

text = "Congratulations! You won a lottery."
tokens = tok(text, return_tensors="pt")

label = torch.tensor([1])   # 1 = spam, 0 = not spam

output = model(**tokens, labels=label)

print(output.loss)
print(output.logits)


In [None]:
from transformers import AutoModelForQuestionAnswering

tok = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
model = AutoModelForQuestionAnswering.from_pretrained(
    "distilbert-base-cased-distilled-squad"
)

context = "The Eiffel Tower is located in Paris, France."
question = "Where is the Eiffel Tower located?"

tokens = tok(question, context, return_tensors="pt")

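# Locate the answer "Paris" inside the context (sequence_index=1)
# instead of hard-coding a token index, which would point into the question
answer_start = context.index("Paris")
answer_end = answer_start + len("Paris") - 1
start_label = torch.tensor([tokens.char_to_token(0, answer_start, sequence_index=1)])
end_label = torch.tensor([tokens.char_to_token(0, answer_end, sequence_index=1)])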

output = model(
    **tokens,
    start_positions=start_label,
    end_positions=end_label
)

print(output.loss)
print(output.start_logits, output.end_logits)


In [None]:
from transformers import AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

text = "The Indian Constitution was adopted in 1949 and came into effect in 1950."

tokens = tok(text, return_tensors="pt")

summary_ids = tok("Indian Constitution adopted in 1949.", return_tensors="pt")["input_ids"]

output = model(**tokens, labels=summary_ids)

print(output.loss)
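
When only `labels` are given, T5 derives the decoder inputs by shifting the labels one position to the right and starting with the decoder start token (ID 0 for T5, the same as its padding ID). Below is a plain-Python sketch of that shift. The label IDs are made up, and `shift_right` is an illustrative helper, not the library's own method.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0, ignore_index=-100):
    """Build decoder inputs from labels by shifting one step to the right."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    # Positions masked out of the loss (-100) are not valid IDs, so use padding
    return [pad_token_id if t == ignore_index else t for t in shifted]

labels = [363, 19, 8, 1]           # made-up IDs ending with T5's end-of-sequence ID (1)
masked_labels = [363, -100, 8, 1]  # same, with one position ignored by the loss

print(shift_right(labels))
print(shift_right(masked_labels))
```

At each step the decoder sees the previous target token and is trained to predict the current one, which is why the inputs lag the labels by one position.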

In [None]:
from transformers import AutoModelForSequenceClassification
import torch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

text = "Book a flight to Delhi"

# 0 = BookFlight, 1 = CancelTicket, 2 = WeatherQuery
label = torch.tensor([0])

tokens = tok(text, return_tensors="pt")

output = model(**tokens, labels=label)

print(output.loss)
print(output.logits)


In [None]:
from datasets import Dataset

data = {
    "text": [
        "Congratulations! You won a lottery of $5000. Click here to claim.",
        "Limited-time offer! Free coupons waiting for you.",
        "Urgent: Your bank account will be closed. Verify now.",
        "Hi John, can we meet tomorrow regarding the project?",
        "Dear team, here is the report from last week.",
        "Hello, your Amazon order has been shipped successfully.",
    ],
    "label": [
        1, 1, 1, 0, 0, 0
    ]
    # 1:spam examples 0:not-spam examples
}

dataset = Dataset.from_dict(data)
dataset


In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)


In [None]:
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized_ds = dataset.map(tokenize, batched=True)
tokenized_ds = tokenized_ds.train_test_split(test_size=0.2)


In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="spam_model",

)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
)


In [None]:
trainer.train()


In [None]:
import torch
from torch.nn.functional import softmax

def classify_mail(text):
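    # Move the inputs to the model's device (the Trainer may have placed the model on a GPU)
    tokens = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
    model.eval()
    with torch.no_grad():  # inference only, so no gradients are needed
        output = model(**tokens)

    probs = softmax(output.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()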

    print("Text:", text)
    print("Spam Probability:", probs[0][1].item())
    print("Not-Spam Probability:", probs[0][0].item())
    print("Prediction:", "SPAM" if pred == 1 else "NOT SPAM")

# Test Email
email = """
Dear user,
Your account has been temporarily locked due to unusual activity.
Click the link below to verify your identity.
"""

classify_mail(email)


In [None]:
more_data = {
    "text": [
        "Your OTP is 23456. Do not share.",
        "Free vacation! Book now!",
        "Meeting postponed to Monday.",
        "Earn $500 per day working from home!"
    ],
    "label": [0, 1, 0, 1]
}

from datasets import concatenate_datasets

new_ds = Dataset.from_dict(more_data)
# Append the new examples to the existing dataset
dataset = concatenate_datasets([dataset, new_ds])


Then rerun the tokenization and training cells so the model is trained on the enlarged dataset.