This notebook is an introduction to natural language processing (NLP) using libraries from the Hugging Face ecosystem (🤗 Transformers, 🤗 Datasets, 🤗 Tokenizers, and 🤗 Accelerate) as well as the Hugging Face Hub.

![](https://avatars.githubusercontent.com/u/25720743?s=200&v=4)

Hugging Face provides a standardized interface to a wide range of transformer models, as well as code and tools to adapt these models (especially NLP models) to new use cases. The libraries currently support three major deep learning frameworks (PyTorch, TensorFlow, and JAX) and allow you to switch between them easily. In addition, they provide task-specific heads so you can easily fine-tune transformers on downstream tasks such as text classification, question answering, and text generation. This can reduce the time it takes a practitioner to train and test a handful of models from a week to a single afternoon!

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

### Descriptions:
- **datasets**: This library allows you to easily load, process, and explore various NLP datasets from a central hub or your local filesystem. It simplifies data preparation for your machine learning tasks.

- **evaluate**: This library provides tools for evaluating the performance of your NLP models on different metrics. It allows you to compare your models against baselines and get insights into their strengths and weaknesses.

- **transformers**: This is a popular library for building and using state-of-the-art pre-trained NLP models. It offers pre-trained models for various tasks like text classification, question answering, and text generation. The `[sentencepiece]` addition ensures the installation includes support for `sentencepiece` tokenization, a method commonly used by these models.

- **accelerate**: This library helps with training and evaluating models on multiple GPUs or TPUs (Tensor Processing Units) for faster training and better resource utilization. It simplifies distributed training setups, allowing you to scale your computations effectively.

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## A Tour of 🤗 Transformer Applications

 ### Working with pipelines

The most pythonic object in the 🤗 Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:

In [None]:
from transformers import pipeline

text = "Khách hàng không hài lòng về món ăn nhưng rất thích cách phục vụ của nhân viên"
text = ' '.join([sent.strip() for sent in text.splitlines()])

classifier = pipeline("sentiment-analysis", "wonrax/phobert-base-vietnamese-sentiment")
classifier(text)

In [None]:
classifier(
    [text,
     "Hôm nay kẹt xe quá",
     "Tôi được sếp khen",
     "Hôm nay trời cũng bình thường",]
)

In [None]:
classifier = None  # drop the reference so the model's memory can be reclaimed

There are three main steps involved when you pass some text to a pipeline:
1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.
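
The three steps above can be sketched without any real model. The following is a minimal toy illustration, not how `pipeline()` is implemented: the vocabulary, the per-token scores, and the function names (`preprocess`, `toy_model`, `postprocess`, `toy_pipeline`) are all hypothetical and exist only to show where each step sits.

```python
import math

# Hypothetical toy vocabulary and per-token scores, for illustration only.
VOCAB = {"<unk>": 0, "good": 1, "bad": 2, "service": 3, "food": 4}
# One row per token id: [score for NEGATIVE, score for POSITIVE].
TOKEN_SCORES = [[0.0, 0.0], [-1.0, 2.0], [2.0, -1.0], [0.0, 0.1], [0.0, 0.1]]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the toy model understands."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def toy_model(input_ids):
    """Step 2: map token ids to one raw score (logit) per label."""
    logits = [0.0] * len(LABELS)
    for token_id in input_ids:
        for i, score in enumerate(TOKEN_SCORES[token_id]):
            logits[i] += score
    return logits


def postprocess(logits):
    """Step 3: convert logits to probabilities (softmax) and pick the best label."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("good food bad service")
print(result)
```

A real pipeline replaces step 1 with a tokenizer, step 2 with a neural network, and step 3 with task-specific post-processing, but the overall flow is the same.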

Some of the currently available pipelines are:
* feature-extraction (get the vector representation of a text)
* fill-mask
* ner (named entity recognition)
* question-answering
* sentiment-analysis
* summarization
* text-generation
* translation
* zero-shot-classification

Let's have a look at a few of these!

#### Translation

In [None]:
text = "en: Write me a function to calculate the first 10 digits of the fibonacci sequence in Python and print it out to the CLI."
translator = pipeline("translation_en_to_vi",
                      model="VietAI/envit5-translation")
outputs = translator(text, max_length=400)
vi_text = outputs[0]['translation_text']
print(vi_text)

In [None]:
translator = None  # drop the reference so the model's memory can be reclaimed

#### Text Generation


In [None]:
import torch
from transformers import set_seed
from transformers import pipeline
set_seed(42) # Set the seed to get reproducible results

prompt = "<|system|> You are a chatbot who can help code!</s> <|user|> Write me a function to calculate the first 10 digits of the fibonacci sequence in Python and print it out to the CLI.</s> <|assistant|>"

generator = pipeline("text-generation", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(generator(prompt, max_length=256)[0]['generated_text'])

In [None]:
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a chatbot who can help code!",
    },
    {"role": "user", "content": "Write me a function to calculate the first 10 digits of the fibonacci sequence in Python and print it out to the CLI."},
]
prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.1, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

In [None]:
prompt

In [None]:
generator.tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)

### Load models with tokenizer and model

In [None]:
from transformers import LlamaTokenizer, LlamaForCausalLM

# Load the tokenizer for TinyLlama
tokenizer = LlamaTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Load the TinyLlama model
model = LlamaForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Prepare some text input
input_text = prompt
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text using the model
output = model.generate(input_ids, max_new_tokens=10)

# Decode the output
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

print(decoded_output)


What if the model does not fit in your memory, for example because you only have a small GPU? Quantization might be a good solution.

Fortunately, Hugging Face provides a straightforward way to quantize your models to 8-bit or 4-bit: you just need to pass one extra parameter when loading the model.

You need to install the `bitsandbytes` package to use this feature.

The following cells show how to load a model in 8-bit and in 4-bit.

In [None]:
import time
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# Load the TinyLlama model in 8-bit
model = LlamaForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", quantization_config=quantization_config, low_cpu_mem_usage=True)

# Generate and calculate time
start_time = time.time()
output = model.generate(input_ids.to(model.device), max_new_tokens=10)
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
end_time = time.time()
print("Time:", (end_time-start_time))

Let's compare the inference time with the 4-bit model.

In [None]:
import time
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
# Load the TinyLlama model in 4-bit
model = LlamaForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", quantization_config=quantization_config, low_cpu_mem_usage=True)

# Generate and calculate time (same number of new tokens as the 8-bit run, so the timings are comparable)
start_time = time.time()
output = model.generate(input_ids.to(model.device), max_new_tokens=10)
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
end_time = time.time()
print("Time:", (end_time-start_time))
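
To build some intuition for what storing weights in 8 bits means, here is a minimal sketch of absmax quantization in plain Python. This is only the basic idea: `bitsandbytes` uses more sophisticated schemes, and the weights and function names below are made up for illustration.

```python
def quantize_int8(weights):
    """Absmax quantization: scale floats into the signed 8-bit range [-127, 127]."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -1.27]  # made-up example weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q_weights)
print(max_error)
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale per weight.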

## The Hugging Face Ecosystem

![](https://raw.githubusercontent.com/nlp-with-transformers/notebooks/48e4a5e5c44b86e1593c0945a49af9675cfd7158//images/chapter01_hf-ecosystem.png)

These few lines of code are all you need to train and evaluate using Hugging Face.

In [None]:
import os

# Disable Weights & Biases logging for the Trainer runs below
os.environ["WANDB_DISABLED"] = "true"

In [None]:
# 1. Load and pre-process data
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

raw_datasets = load_dataset("SetFit/tweet_sentiment_extraction")
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

task_description = "Classify the following sentence: "

def tokenize_function(example):
    prompt = task_description + example['text'] + f" Label: {example['label_text']}{tokenizer.eos_token}"
    inputs = tokenizer(prompt, truncation=True)
    inputs['labels'] = inputs['input_ids'][:]
    return inputs

tokenized_datasets = raw_datasets.map(tokenize_function, remove_columns=['text', 'label', 'textID', 'label_text'])
data_collator = DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt"
)



# 2. Train
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments, Trainer
import numpy as np

training_args = TrainingArguments("test-trainer", evaluation_strategy='no', do_eval=False, do_train=True)
model = AutoModelForCausalLM.from_pretrained(model_name)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"].select(range(100)),
    eval_dataset=tokenized_datasets["test"].select(range(10)),
    data_collator=data_collator,
)
trainer.train()

### The Hugging Face Hub

As outlined earlier, transfer learning is one of the key factors driving the success of transformers because it makes it possible to reuse pretrained models for new tasks. Consequently, it is crucial to be able to load pretrained models quickly and run experiments with them.

The Hugging Face Hub hosts over 500,000 freely available models and datasets. As we've seen with the pipelines, loading a promising model in your code or loading a dataset is then just one line of code away. This makes experimenting with a wide range of models and datasets simple, and allows you to focus on the domain-specific parts of your project.

![](https://i.imgur.com/SSTNZU3.png)

### 🤗 Tokenizers

Behind each of the pipeline examples that we've seen is a tokenization step that *splits the raw text into smaller pieces* called **tokens**. Tokens may be *words, parts of words, or just characters* like punctuation. Transformer models are trained on numerical representations of these tokens, so getting this step right is very important for the whole NLP project!

🤗 Tokenizers provides many tokenization strategies and is extremely fast at tokenizing text thanks to its Rust backend. It also takes care of all the pre- and post-processing steps, such as normalizing the inputs and transforming the model outputs to the required format. With 🤗 Tokenizers, we can load a tokenizer in the same way we can load pretrained model weights with 🤗 Transformers.

In [None]:
from transformers import AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

The **tokenization** process is done by the `tokenize()` method of the tokenizer:


In [None]:
sequence = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
tokens = tokenizer.tokenize(sequence)
print(tokens)

This tokenizer is a **subword** tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That's the case here with `TinyLlama`, which is split into five tokens: `▁T`, `iny`, `L`, `l` and `ama` (the `▁` character marks the start of a new word).
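
To make the idea of splitting a word until every piece is in the vocabulary concrete, here is a toy sketch. It uses greedy longest-match splitting over a tiny hypothetical vocabulary, which is an assumption made for illustration: the real TinyLlama tokenizer uses a learned SentencePiece model, not this algorithm.

```python
# Hypothetical vocabulary, chosen so the example reproduces the TinyLlama split.
TOY_VOCAB = {"▁The", "▁T", "iny", "L", "l", "ama", "▁", "T", "h", "e", "i", "n", "y", "a", "m"}


def toy_subword_tokenize(text, vocab):
    """Greedy longest-match split; spaces are marked with '▁' as in SentencePiece."""
    marked = "▁" + text.replace(" ", "▁")
    tokens, start = [], 0
    while start < len(marked):
        # Try the longest candidate first, then shrink until a vocabulary entry matches.
        for end in range(len(marked), start, -1):
            piece = marked[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            tokens.append("<unk>")  # no vocabulary entry covers this character
            start += 1
    return tokens


tokens = toy_subword_tokenize("The TinyLlama", TOY_VOCAB)
print(tokens)
```

A frequent word such as `The` stays whole because it is in the vocabulary, while the rarer `TinyLlama` is broken into smaller known pieces.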

The conversion to input IDs is handled by the `convert_tokens_to_ids()` tokenizer method:

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model as seen later in this notebook.

**Decoding** is going the other way around: from vocabulary indices, we want to get a string. This can be done with the `decode()` method as follows:

In [None]:
decoded_string = tokenizer.decode([450, 323, 4901, 29931, 29880, 3304, 2060, 263, 9893, 304, 758, 14968, 263, 29871, 29896, 29889, 29896, 29933, 365, 29880, 3304, 1904, 373, 29871, 29941, 534, 453, 291, 18897, 29889])
print(decoded_string)

This is how we can use a tokenizer to prepare a batch of inputs for a model:

In [None]:
raw_inputs = [
    "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens.",
    "The training has started on 2023-09-01.",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

An increasingly common use case for LLMs is chat. In a chat context, rather than continuing a single string of text (as is the case with a standard language model), the model instead continues a conversation that consists of one or more messages, each of which includes a role, like "user" or "assistant", as well as message text. Chat templates are part of the tokenizer. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects. Refer to [Chat Template](https://huggingface.co/docs/transformers/main/en/chat_templating).

In [None]:
template = """{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{% for message in messages %}
{% if message['role'] == 'user' %}{{ bos_token + 'Classify the following sentence: ' + message['content'] + '\n' }}{% elif message['role'] == 'assistant' %}{{ 'Label: '  + message['content'] + eos_token }}
{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Label: ' }}{% endif %}"""

tokenizer.chat_template = template

In [None]:
messages = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    # {"role": "assistant", "content": "The sun."}
]
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
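
For readers unfamiliar with Jinja, the template defined above corresponds roughly to the plain-Python function below. This is a sketch for intuition only: `render_chat` is a hypothetical helper, the default `<s>` and `</s>` tokens are assumed to match the Llama tokenizer's BOS and EOS tokens, and the exact whitespace produced by the real template may differ.

```python
def render_chat(messages, bos_token="<s>", eos_token="</s>", add_generation_prompt=False):
    """Plain-Python approximation of the classification chat template."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            # The user turn carries the task description and the sentence to classify.
            parts.append(bos_token + "Classify the following sentence: " + message["content"] + "\n")
        elif message["role"] == "assistant":
            # The assistant turn carries the label and closes the example.
            parts.append("Label: " + message["content"] + eos_token)
    if add_generation_prompt:
        # At inference time, end with the cue that asks the model to produce a label.
        parts.append("Label: ")
    return "".join(parts)


training_text = render_chat([
    {"role": "user", "content": "what interview! leave me alone"},
    {"role": "assistant", "content": "negative"},
])
inference_text = render_chat(
    [{"role": "user", "content": "what interview! leave me alone"}],
    add_generation_prompt=True,
)
print(training_text)
print(inference_text)
```

The training string contains the label and the end-of-sequence token, while the inference string stops right after `Label: ` so that the model generates the label itself.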

Okay, so here is how we would train an LLM as a sequence classifier on one batch in PyTorch.

I recommend that you apply the chat template as a preprocessing step for your dataset. After this, you can simply continue like any other language model training task. When training, you should usually set `add_generation_prompt=False`, because the added tokens to prompt an assistant response will not be helpful during training. Let's see an example:

In [None]:
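import torch
from torch.optim import AdamW  # the AdamW class in transformers is deprecated and removed in recent releases
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset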

# Same as before
model = AutoModelForCausalLM.from_pretrained(model_name)

chat1 = [
    {"role": "user", "content": "what interview! leave me alone"},
    {"role": "assistant", "content": "negative"}
]
chat2 = [
    {"role": "user", "content": "I really really like the song Love Story by Taylor Swift"},
    {"role": "assistant", "content": "positive"}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])

# Tokenize dataset
batch = tokenizer(dataset['formatted_chat'], padding=True, truncation=True, return_tensors="pt")
# Clone inputs for labels, and ignore padded positions in the loss
batch["labels"] = batch['input_ids'].clone()
batch["labels"][batch["attention_mask"] == 0] = -100

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Of course, just training the model on two sentences is not going to yield good results. To get better results, you will need to prepare a bigger dataset.

### 🤗 Datasets

Loading, processing, and storing datasets can be a cumbersome process, especially when the datasets get too large to fit in your laptop's RAM. In addition, you usually need to implement various scripts to download the data and transform it into a standard format.

🤗 Datasets simplifies this process by providing a standard interface for thousands of datasets that can be found on the Hub. It also provides smart caching (so you don't have to redo your preprocessing each time you run your code) and avoids RAM limitations by leveraging a special mechanism called **memory mapping** that stores the contents of a file in virtual memory and enables multiple processes to modify a file more efficiently. The library is also interoperable with popular frameworks like Pandas and NumPy.
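
Memory mapping itself can be tried with Python's standard `mmap` module. The sketch below is only an illustration of the mechanism, under the assumption of a toy file with fixed-size rows. 🤗 Datasets relies on Apache Arrow files rather than on this code.

```python
import mmap
import os
import tempfile

# Write a small file to disk; a real dataset could be far larger than RAM.
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(f"row-{i:04d}\n".encode("ascii"))  # every row is exactly 9 bytes

ROW_SIZE = 9
with open(path, "rb") as f:
    # Map the file into virtual memory; pages are read from disk only when touched.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        row_500 = mapped[500 * ROW_SIZE:501 * ROW_SIZE].decode("ascii").strip()
        total_bytes = len(mapped)

print(row_500, total_bytes)
```

Accessing row 500 only touches the part of the file that holds it, which is why random access stays cheap even when the whole file would not fit in memory.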

The 🤗 Datasets library provides a very simple command to download and cache a dataset from the Hub. We can download the tweet sentiment dataset used earlier like this:

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("SetFit/tweet_sentiment_extraction")
raw_datasets

This command downloads and caches the dataset, by default in `~/.cache/huggingface/datasets`. You can customize your cache folder by setting the `HF_HOME` environment variable.

We can access each element in our `raw_datasets` object by indexing:

In [None]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[10]

In [None]:
raw_train_dataset

To preprocess the dataset, we need to convert the text to numbers the model can make sense of. As you saw in the previous section, this is done with a tokenizer. We will use the `Dataset.map()` method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The `map()` method works by applying a function on each element of the dataset, so let's define a function that tokenizes our inputs:

In [None]:
def tokenize_function(example):
    message = [
        {'role': 'user', 'content': example['text']},
        {'role': 'assistant', 'content': example['label_text']}
    ]
    inputs = tokenizer.apply_chat_template(message, tokenize=True, return_dict=True, add_generation_prompt=False)
    inputs['labels'] = inputs['input_ids'][:]
    return inputs

In [None]:
import time
t1 = time.time()
tokenized_datasets = raw_datasets.map(tokenize_function, remove_columns=['textID', 'text', 'label', 'label_text'])
print("Tokenization time: ", time.time()-t1)
tokenized_datasets

The last thing we will need to do is pad all the examples to the length of the longest element when we batch elements together, a technique we refer to as *dynamic padding*.

The function that is responsible for putting together samples inside a batch is called a **collate** function. It's an argument you can pass when you build a DataLoader, the default being a function that will just *convert your samples to PyTorch tensors and concatenate them* (recursively if your elements are lists, tuples, or dictionaries). This won't be possible in our case since the inputs we have won't all be of the same size.

We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding. This will speed up training significantly.
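
The idea of dynamic padding can be sketched in a few lines of plain Python. This is a toy collate function written for illustration, not the library's implementation: `toy_collate` and its defaults are assumptions, and it always pads on the right.

```python
def toy_collate(samples, pad_token_id=0, label_pad_id=-100, pad_to_multiple_of=8):
    """Pad every sample to the longest length in the batch, rounded up to a multiple."""
    longest = max(len(s["input_ids"]) for s in samples)
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of  # ceiling division
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for s in samples:
        n_pad = target - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_token_id] * n_pad)
        # The attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * len(s["input_ids"]) + [0] * n_pad)
        # Labels are padded with -100 so padded positions are ignored by the loss.
        batch["labels"].append(s["labels"] + [label_pad_id] * n_pad)
    return batch


samples = [
    {"input_ids": [5, 6, 7], "labels": [5, 6, 7]},
    {"input_ids": list(range(1, 11)), "labels": list(range(1, 11))},
]
batch = toy_collate(samples)
print([len(row) for row in batch["input_ids"]])
```

The padded length depends only on the samples in the current batch, so a batch of short samples stays short instead of being padded to the longest sample in the whole dataset.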

To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the 🤗 Transformers library provides such functions. Here we use `DataCollatorForSeq2Seq` rather than `DataCollatorWithPadding` because it also pads the `labels` column (with `-100`, which the loss ignores).

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt"
)

To test this new toy, let's grab a few samples from our training set that we would like to batch together.

In [None]:
samples = tokenized_datasets["train"].select(range(10)).to_list()
[len(x["input_ids"]) for x in samples]

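No surprise, we get samples of varying length, from 18 to 42. Dynamic padding means the samples in this batch are padded to the maximum length inside the batch rather than to a global maximum. Because we set `pad_to_multiple_of=8`, that length is rounded up to the next multiple of 8, so a longest sample of 42 tokens gives a padded length of 48. Let's double-check that our `data_collator` is dynamically padding the batch properly: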

In [None]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

Looking good! Now that we've gone from raw text to batches our model can deal with, we're ready to fine-tune it!

The first step before we can define our `Trainer` is to create a `TrainingArguments` object that will contain all the hyperparameters the `Trainer` will use for training and evaluation.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer",
                                  eval_steps=10,
                                  evaluation_strategy='steps',
                                  logging_steps=10,
                                  do_train=True,
                                  do_eval=True,
                                  warmup_steps=10,
                                  max_steps=500,
                                  learning_rate=2e-4,
                                  optim="adamw_torch",
                                  per_device_train_batch_size=16,
                                 )

The second step is to define our model. Because we treat classification as text generation (the model learns to generate the label after the prompt), we use the `AutoModelForCausalLM` class:

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name)

Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now: the `model`, the `training_args`, the training and validation datasets, and our `data_collator`:

In [None]:
from transformers import Trainer

small_valid_set = tokenized_datasets["test"].select(range(20))  # select() expects an iterable of indices

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=small_valid_set,
    data_collator=data_collator,
)

To fine-tune the model on our dataset, we just have to call the `train()` method of our Trainer:

In [None]:
trainer.train()

This will start the fine-tuning (which should take a couple of minutes on a GPU) and report the training loss every 10 steps.


The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options, like mixed-precision training (use `fp16=True` in your training arguments).

### Making training faster

1. Flash Attention

Flash Attention is a method that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. It is based on the paper "[FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)". It can accelerate training by up to 3x. Flash Attention is currently only available for Ampere (A10, A40, A100, ...) & Hopper (H100, ...) GPUs.

In [None]:
!pip install ninja packaging
!pip install flash-attn --no-build-isolation

Installing flash attention can take quite a bit of time (10-45 minutes).

2. Low-rank Adaptation (LoRA)

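Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method. Instead of updating a large weight matrix directly, it freezes the pretrained weights and learns the update as the product of two much smaller low-rank matrices, typically in the attention layers. This drastically reduces the number of parameters that need to be fine-tuned.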
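
A quick back-of-the-envelope calculation shows where the savings come from. For a weight matrix of shape `d_out x d_in`, a rank-`r` update uses two matrices of shapes `d_out x r` and `r x d_in`. The sizes below (a square 2048 x 2048 projection and rank 64) are assumptions chosen for illustration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # updating the weight matrix W directly
    lora = rank * (d_out + d_in)   # B is d_out x r, A is r x d_in
    return full, lora


# Assumed sizes for illustration: a square 2048 x 2048 projection and rank 64.
full, lora = lora_param_counts(2048, 2048, 64)
print(full, lora, lora / full)
```

With these numbers the adapter trains about 6% of the parameters of the original matrix, and the fraction shrinks further for smaller ranks.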

To use LoRA training, you need to install the `peft` package.

In [None]:
%%capture
!pip install peft

In [None]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# LoRA config based on QLoRA paper
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)


# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

3. Gradient Accumulation & Gradient Checkpointing

The idea behind gradient accumulation is to calculate the gradients in smaller steps instead of for the whole batch at once. We do this by running a forward and backward pass on each of several smaller batches and accumulating the gradients in the process. When enough gradients are accumulated, we run the model's optimization step. This way we can easily increase the overall batch size to numbers that would never fit into the GPU's memory. In turn, however, the added forward and backward passes can slow down the training a bit.

We can use gradient accumulation in the Trainer by simply adding the `gradient_accumulation_steps` argument to TrainingArguments.
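
The following toy example checks the claim numerically for a one-parameter linear model with a mean squared error loss. It is a sketch under simple assumptions (made-up data, equally sized micro-batches), not what the `Trainer` does internally.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# One large batch of 8 samples.
full_batch_grad = mse_gradient(w, xs, ys)

# Four micro-batches of 2 samples, accumulated before a single optimizer step.
accumulation_steps, micro_batch_size = 4, 2
accumulated_grad = 0.0
for step in range(accumulation_steps):
    lo, hi = step * micro_batch_size, (step + 1) * micro_batch_size
    # Divide by the number of steps so the sum equals the full-batch mean.
    accumulated_grad += mse_gradient(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

# A single plain gradient-descent update using the accumulated gradient.
learning_rate = 0.01
w_after_step = w - learning_rate * accumulated_grad
print(full_batch_grad, accumulated_grad)
```

Both gradients agree up to floating-point rounding, so the update is the same as with the large batch while only two samples need to be processed at a time.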

Even when we set the batch size to 1 and use gradient accumulation, we can still run out of memory when working with large models. In order to compute the gradients during the backward pass, all activations from the forward pass are normally saved. This can create a big memory overhead. Alternatively, one could forget all activations during the forward pass and recompute them on demand during the backward pass. This would, however, add a significant computational overhead and slow down training. Gradient checkpointing is a compromise between the two approaches: it saves only selected activations during the forward pass, so only a fraction of the activations needs to be recomputed during the backward pass.

To enable gradient checkpointing in the Trainer, we only need to pass `gradient_checkpointing=True` to the TrainingArguments.
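
The trade-off between storing and recomputing activations can be illustrated with a toy chain of scalar layers. This is a conceptual sketch with hypothetical functions and made-up weights. It differentiates the output with respect to the input by hand and does not reflect how PyTorch implements checkpointing.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of tanh(w * x) with respect to the input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full_storage(weights, x):
    """Keep the input of every layer, as a normal backward pass does."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, saved in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, saved)
    return grad, len(inputs)


def grad_checkpointed(weights, x, segment):
    """Keep only the input of every `segment`-th layer and recompute the rest."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        # Recompute the activations of this segment from its checkpoint.
        segment_weights = weights[start:start + segment]
        segment_inputs, h = [], checkpoints[start]
        for w in segment_weights:
            segment_inputs.append(h)
            h = layer(w, h)
        for w, saved in zip(reversed(segment_weights), reversed(segment_inputs)):
            grad *= layer_grad(w, saved)
    return grad, len(checkpoints)


weights = [0.9, 1.1, 0.8, 1.2, 0.7, 1.3, 1.0, 0.6]
g_full, stored_full = grad_full_storage(weights, 0.5)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.5, segment=4)
print(g_full, stored_full, g_ckpt, stored_ckpt)
```

Both functions return the same gradient, but the checkpointed version keeps 2 activations from the forward pass instead of 8 (plus one segment's activations while that segment is being recomputed), at the price of running each layer's forward computation a second time.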

4. BF16/FP16 Training

The idea of mixed precision training is that not all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision, the variables and their computations are faster. The main advantage comes from saving the activations in half (16-bit) precision. Although the gradients are also computed in half precision, they are converted back to full precision for the optimization step, so no memory is saved here. Since the model is present on the GPU in both 16-bit and 32-bit precision, this can use more GPU memory (1.5x the original model is on the GPU), especially for small batch sizes. Since some computations are performed in full and some in half precision, this approach is also called mixed precision training. Enabling mixed precision training is also just a matter of setting the `fp16` flag (or `bf16` on hardware that supports it) to `True`.
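
The 1.5x figure follows from simple arithmetic on the bytes per parameter. The sketch below counts model weights only (no activations, gradients, or optimizer state) and assumes a model of roughly 1.1 billion parameters, like the TinyLlama model used in this notebook.

```python
def model_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in gigabytes."""
    return num_params * bytes_per_param / 1e9


num_params = 1.1e9  # roughly the size of the TinyLlama model used in this notebook

fp32_weights = model_memory_gb(num_params, 4)  # full-precision master copy
fp16_weights = model_memory_gb(num_params, 2)  # half-precision working copy
mixed_precision_total = fp32_weights + fp16_weights

print(fp32_weights, fp16_weights, mixed_precision_total / fp32_weights)
```

Keeping both copies costs 6 bytes per parameter instead of 4, which is where the factor of 1.5 comes from. The savings of mixed precision come from the activations, which grow with batch size and sequence length.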

Now, let's try a different package from the Hugging Face ecosystem, `trl`. TRL is a full-stack library that provides a set of tools to train transformer language models with reinforcement learning, from the supervised fine-tuning (SFT) step and the reward modeling (RM) step to the Proximal Policy Optimization (PPO) step. The library is integrated with 🤗 Transformers.

You can install the `trl` package as follows:

In [None]:
%%capture
!pip install trl

In [None]:
from trl import SFTTrainer


def format_instruction(example):
    message = [
        {'role': 'user', 'content': example['text']},
        {'role': 'assistant', 'content': example['label_text']}
    ]
    return tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False)


training_args = TrainingArguments("test-trainer",
                                  eval_steps=10,
                                  evaluation_strategy='no',
                                  logging_steps=10,
                                  do_train=True,
                                  do_eval=False,
                                  warmup_steps=10,
                                  max_steps=50,
                                  learning_rate=2e-4,
                                  optim="adamw_torch",
                                  fp16=True,
                                  per_device_train_batch_size=16,
                                  gradient_accumulation_steps=4,
                                  gradient_checkpointing=True
                                 )

model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="flash_attention_2", device_map='auto')


trainer = SFTTrainer(
    model=model,
    train_dataset=raw_datasets['train'],
    peft_config=peft_config,
    max_seq_length=50,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction,
    args=training_args,
)

# train
trainer.train()

# save model
trainer.save_model()

Test this model

In [None]:
import torch
from peft import AutoPeftModelForCausalLM

output_dir = "test-trainer"

# Load the base LLM together with the fine-tuned LoRA adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)

input_ids = tokenizer.apply_chat_template([{'role': 'user', 'content': 'my boss is bullying me...'}], tokenize=True, return_tensors="pt", add_generation_prompt=True).cuda()
with torch.inference_mode():
    outputs = model.generate(input_ids=input_ids, max_new_tokens=10, do_sample=True, top_p=0.9, temperature=0.1)

# Decode only the newly generated tokens, i.e. everything after the prompt
new_tokens = outputs[0][input_ids.shape[1]:]
print(f"Predicted label:\n{tokenizer.decode(new_tokens, skip_special_tokens=True)}")

### 🤗 Accelerate

With the 🤗 Transformers, 🤗 Tokenizers, and 🤗 Datasets libraries we have everything we need to train our very own transformer models! However, as we'll see in the upcoming weeks, there are situations where we need fine-grained control over the training loop. That's where 🤗 Accelerate comes in.

If you've ever had to write your own training script in PyTorch, chances are that you've had some headaches when trying to port the code that runs on your laptop to the code that runs on your organization's server. 🤗 Accelerate adds a layer of abstraction to your normal training loops that takes care of all the custom logic necessary for the training infrastructure. This literally accelerates your workflow by simplifying the change of infrastructure when necessary.

In [None]:
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import accelerate


model_name = 'mistralai/Mistral-7B-Instruct-v0.2'

Initialize an empty skeleton of the model, which won't take up any RAM:

In [None]:
config = AutoConfig.from_pretrained(model_name)
with accelerate.init_empty_weights():
    dummy_model = AutoModelForCausalLM.from_config(config)

We have some options to set the `device_map`:
- `device_map="auto"` (`"balanced"`, `"balanced_low_0"`, `"sequential"`) (GPU > CPU > disk)
- `device_map = {"block1": 0, "block2": 1}`
- `max_memory={0: "4GiB", 1: "8GiB", "cpu": "5GiB", "disk": "30GiB"}`
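
To give a feel for what an automatically computed device map looks like, here is a toy sketch that assigns layers in order to the first device with enough room left. It is a simplification for intuition only: the function name, layer sizes, and budgets are made up, and Accelerate's real algorithm handles many more details.

```python
def toy_device_map(layer_sizes_gib, max_memory_gib):
    """Assign layers in order to the first device with enough room left.

    Devices are tried in the order given, which mimics a GPU > CPU > disk priority.
    """
    remaining = dict(max_memory_gib)
    devices = list(remaining)
    device_map, current = {}, 0
    for name, size in layer_sizes_gib.items():
        # Move on to the next device once the current one is full.
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise MemoryError(f"no device has room for {name}")
        device = devices[current]
        remaining[device] -= size
        device_map[name] = device
    return device_map


# Made-up layer sizes and memory budgets (in GiB), for illustration only.
layers = {"embed": 0.5, "block1": 1.5, "block2": 1.5, "block3": 1.5, "lm_head": 0.5}
budget = {0: 3.0, "cpu": 2.0, "disk": 30.0}
placement = toy_device_map(layers, budget)
print(placement)
```

The result is a dictionary from module name to device, which is the same shape as a hand-written `device_map`.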

We can also let Accelerate compute the `device_map` automatically from the memory limits we provide:

In [None]:
device_map = accelerate.infer_auto_device_map(dummy_model, max_memory={0: "10GiB", "cpu": "4GiB"})
device_map

Load the model with this `device_map` and `offload_state_dict=True`, which temporarily offloads the state dict to disk to avoid running out of CPU RAM:

In [None]:
# Initialize model with custom device_map
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    offload_state_dict=True,
    offload_folder="offload"
)

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Then, we can try to generate responses

In [None]:
messages = [
    {"role": "user", "content": "What is the capital of Vietnam?"},
    # ...
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
device = "cpu"   # "cuda"
model_inputs = encodeds.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Thank you for your attention. Let's enjoy coding!

## Reference
1. [Natural Language Processing with Transformers Book](https://transformersbook.com/)
2. [Hugging Face course](https://huggingface.co/course/chapter1/1)