In [None]:
# Disable Weights & Biases logging for the Trainer
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
# 1. Load and pre-process data
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

raw_datasets = load_dataset("SetFit/tweet_sentiment_extraction")
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

task_description = "Classify the following sentence: "

def tokenize_function(example):
    prompt = task_description + example['text'] + f" Label: {example['label_text']}{tokenizer.eos_token}"
    inputs = tokenizer(prompt, truncation=True)
    inputs['labels'] = inputs['input_ids'][:]
    return inputs

tokenized_datasets = raw_datasets.map(tokenize_function, remove_columns=['text', 'label', 'textID', 'label_text'])
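# DataCollatorForSeq2Seq pads `labels` (with -100) as well as the inputs,
# so padded positions are ignored by the loss.
data_collator = DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt"
)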

# 2. Train
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments, Trainer

# Evaluation is off by default, so no evaluation strategy needs to be passed.
training_args = TrainingArguments("test-trainer", do_train=True, do_eval=False)
model = AutoModelForCausalLM.from_pretrained(model_name)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"].select(range(100)),
    eval_dataset=tokenized_datasets["test"].select(range(10)),
    data_collator=data_collator,
)
trainer.train()

### The Hugging Face Hub

As outlined earlier, transfer learning is one of the key factors driving the success of transformers because it makes it possible to reuse pretrained models for new tasks. Consequently, it is crucial to be able to load pretrained models quickly and run experiments with them.

The Hugging Face Hub hosts over 500,000 freely available models and datasets. As we’ve seen with the pipelines, loading a promising model or a dataset takes just one line of code. This makes experimenting with a wide range of models and datasets simple, and allows you to focus on the domain-specific parts of your project.

![](https://i.imgur.com/SSTNZU3.png)

### 🤗 Tokenizers

Behind each of the pipeline examples that we’ve seen is a tokenization step that *splits the raw text into smaller pieces* called **tokens**. Tokens may be *words, parts of words, or just characters* like punctuation. Transformer models are trained on numerical representations of these tokens, so getting this step right is very important for the whole NLP project!

🤗 Tokenizers provides many tokenization strategies and is extremely fast at tokenizing text thanks to its Rust backend. It also takes care of all the pre- and post-processing steps, such as normalizing the inputs and transforming the model outputs to the required format. With 🤗 Tokenizers, we can load a tokenizer in the same way we can load pretrained model weights with 🤗 Transformers.

In [None]:
from transformers import AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

The **tokenization** process is done by the `tokenize()` method of the tokenizer:

In [None]:
sequence = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
tokens = tokenizer.tokenize(sequence)
print(tokens)

This tokenizer is a **subword** tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with `TinyLlama`, which is split into five tokens: `▁T`, `iny`, `L`, `l` and `ama`. The leading `▁` marks a token that starts a new word (it stands in for the preceding space).
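To make the idea of splitting a word into vocabulary pieces concrete, here is a minimal sketch. It uses a greedy longest-match rule over a tiny, hypothetical vocabulary (`TOY_VOCAB`); the real tokenizer learns its merges from data and does not work this way internally, so treat this only as an illustration of the subword principle.

```python
# Hypothetical toy vocabulary, invented for this illustration only.
TOY_VOCAB = {"▁T", "iny", "L", "l", "ama", "▁The", "▁project"}

def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: emit an unknown marker and skip one character.
            pieces.append("<unk>")
            start += 1
    return pieces

# "▁" stands in for the space before the word.
print(greedy_subword_split("▁TinyLlama", TOY_VOCAB))
# ['▁T', 'iny', 'L', 'l', 'ama']
```

Because `TinyLlama` is not itself in the toy vocabulary, the function falls back to progressively shorter pieces until every part is covered.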

The conversion to input IDs is handled by the `convert_tokens_to_ids()` tokenizer method:

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model as seen later in this notebook.

**Decoding** is going the other way around: from vocabulary indices, we want to get a string. This can be done with the `decode()` method as follows:

In [None]:
decoded_string = tokenizer.decode([450, 323, 4901, 29931, 29880, 3304, 2060, 263, 9893, 304, 758, 14968, 263, 29871, 29896, 29889, 29896, 29933, 365, 29880, 3304, 1904, 373, 29871, 29941, 534, 453, 291, 18897, 29889])
print(decoded_string)
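The two directions (tokens to IDs, and IDs back to text) can be sketched with a plain dictionary. The vocabulary and its IDs below are hypothetical and much smaller than a real one; the helper names `tokens_to_ids` and `decode_ids` are made up for this example and are not the library's implementation.

```python
# Hypothetical mini-vocabulary; real IDs come from the pretrained tokenizer.
TOY_VOCAB = {"<unk>": 0, "▁The": 1, "▁T": 2, "iny": 3, "L": 4, "l": 5, "ama": 6, "▁project": 7}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}

def tokens_to_ids(tokens):
    """Look up each token, falling back to the <unk> ID when it is missing."""
    return [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in tokens]

def decode_ids(ids):
    """Join the tokens and turn the "▁" word-boundary markers back into spaces."""
    text = "".join(ID_TO_TOKEN[i] for i in ids)
    return text.replace("▁", " ").lstrip()

tokens = ["▁The", "▁T", "iny", "L", "l", "ama", "▁project"]
ids = tokens_to_ids(tokens)
print(ids)              # [1, 2, 3, 4, 5, 6, 7]
print(decode_ids(ids))  # The TinyLlama project
```

Note that decoding does more than a reverse lookup: the subword pieces are merged back into readable words.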

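In practice, we call the tokenizer directly on a batch of raw sentences. It tokenizes, converts the tokens to IDs, pads the batch, and returns tensors ready to feed to a model: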

In [None]:
raw_inputs = [
    "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens.",
    "The training has started on 2023-09-01.",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
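The `padding=True` option is what makes sentences of different lengths fit into one rectangular tensor. A rough sketch of the idea in plain Python follows; `pad_batch` is a hypothetical helper, the token IDs and `pad_id=2` are made up, and the side on which a real tokenizer pads depends on its `padding_side` setting.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padding, mask_pad = [pad_id] * n_pad, [0] * n_pad
        real = [1] * len(seq)  # 1 marks a real token, 0 marks padding
        if side == "right":
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + mask_pad)
        else:
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + real)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[1, 450, 323], [1, 450]], pad_id=2)
print(batch["input_ids"])       # [[1, 450, 323], [1, 450, 2]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask is what lets the model tell real tokens from padding, which matters here because the pad token was set to the EOS token.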

An increasingly common use case for LLMs is chat. In a chat context, rather than continuing a single string of text (as is the case with a standard language model), the model instead continues a conversation that consists of one or more messages, each of which includes a role, like “user” or “assistant”, as well as message text. Chat templates are part of the tokenizer. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects. Refer to [Chat Template](https://huggingface.co/docs/transformers/main/en/chat_templating).

In [None]:
template = """{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{% for message in messages %}
{% if message['role'] == 'user' %}{{ bos_token + 'Classify the following sentence: ' + message['content'] + '\n' }}{% elif message['role'] == 'assistant' %}{{ 'Label: '  + message['content'] + eos_token }}
{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Label: ' }}{% endif %}"""

tokenizer.chat_template = template

messages = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    # {"role": "assistant", "content": "The sun."}
]
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
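If the Jinja syntax is hard to read, the same logic can be written as an ordinary Python function. This is only an approximation for reading purposes: `render_classification_chat` is a hypothetical helper, the `<s>` and `</s>` defaults are assumed values for the BOS and EOS tokens, and the exact whitespace of the real rendering depends on how the template engine trims newlines.

```python
def render_classification_chat(messages, bos_token="<s>", eos_token="</s>",
                               add_generation_prompt=False):
    """Plain-Python approximation of the classification chat template."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            # The user turn carries the task description and the sentence.
            parts.append(bos_token + "Classify the following sentence: "
                         + message["content"] + "\n")
        elif message["role"] == "assistant":
            # The assistant turn carries the label and closes the example.
            parts.append("Label: " + message["content"] + eos_token + "\n")
    if add_generation_prompt:
        # Cue the model to produce a label next.
        parts.append("Label: ")
    return "".join(parts)

messages = [{"role": "user", "content": "Which is bigger, the moon or the sun?"}]
print(render_classification_chat(messages, add_generation_prompt=True))
```

With `add_generation_prompt=True` the string ends in `Label: `, so the next tokens the model generates are read as its answer.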

Here is how we would train an LLM as a sequence classifier on a single batch in PyTorch.

I recommend that you apply the chat template as a preprocessing step for your dataset. After this, you can simply continue like any other language model training task. When training, you should usually set `add_generation_prompt=False`, because the added tokens to prompt an assistant response will not be helpful during training. Let’s see an example:

In [None]:
import torch
from torch.optim import AdamW  # PyTorch's AdamW; the transformers one is deprecated
from transformers import AutoModelForCausalLM
from datasets import Dataset

# Same as before
model = AutoModelForCausalLM.from_pretrained(model_name)

chat1 = [
    {"role": "user", "content": "what interview! leave me alone"},
    {"role": "assistant", "content": "negative"}
]
chat2 = [
    {"role": "user", "content": "I really really like the song Love Story by Taylor Swift"},
    {"role": "assistant", "content": "positive"}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])

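# Tokenize dataset; the template already inserts the BOS token,
# so don't let the tokenizer add special tokens a second time
batch = tokenizer(dataset['formatted_chat'], padding=True, truncation=True, add_special_tokens=False, return_tensors="pt")
# Clone inputs for labels and ignore padded positions in the loss
batch["labels"] = batch['input_ids'].clone()
batch["labels"][batch["attention_mask"] == 0] = -100

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()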
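One detail worth spelling out is how labels relate to padding. Since the pad token is the same as the EOS token here, the token ID alone cannot tell a real end-of-sequence from padding; the attention mask can. The sketch below shows the idea in plain Python with made-up token IDs: `mask_padding_labels` is a hypothetical helper, and `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Made-up IDs: 2 plays the role of both EOS and padding.
input_ids = [[1, 450, 323, 2], [1, 450, 2, 2]]
attention_mask = [[1, 1, 1, 1], [1, 1, 1, 0]]
print(mask_padding_labels(input_ids, attention_mask))
# [[1, 450, 323, 2], [1, 450, 2, -100]]
```

In the second row the first `2` is a genuine EOS and stays in the labels, so the model still learns where to stop, while the trailing pad is excluded from the loss.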