
We can use the `AutoModel` class from the `transformers` library to download and cache a model's architecture and weights with the `from_pretrained()` method. `AutoModel` is a wrapper that picks the appropriate model architecture for a given checkpoint.

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

In this case, `AutoModel` fetches a BERT model based on the checkpoint name passed to `from_pretrained()`. It downloads and caches the model architecture (12 layers, hidden size 768, 12 attention heads) and the weights from the Hugging Face Hub.
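To make the idea of "fetching the appropriate architecture" concrete, here is a toy sketch of the dispatch. It assumes the checkpoint's configuration carries a `model_type` string that is looked up in a registry of classes. The names `ToyBertModel`, `ToyGPT2Model`, `MODEL_REGISTRY` and `toy_auto_model` are hypothetical, and this is not the `transformers` source code.

```python
# Toy illustration of checkpoint-to-class dispatch; NOT the transformers source.

class ToyBertModel:
    """Hypothetical stand-in for a BERT architecture class."""

    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    """Hypothetical stand-in for a GPT-2 architecture class."""

    def __init__(self, config):
        self.config = config


# Registry keyed by the "model_type" string stored in a checkpoint's config.
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


def toy_auto_model(config):
    """Pick the class that matches config["model_type"] and instantiate it."""
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return MODEL_REGISTRY[model_type](config)


# Values taken from the bert-base-cased description above.
bert_config = {
    "model_type": "bert",
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "num_attention_heads": 12,
}
toy_model = toy_auto_model(bert_config)
print(type(toy_model).__name__)
```

The real library does considerably more (downloading, caching, loading weights), but the selection step is essentially a lookup of this kind.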

However, if we already know which type of model the checkpoint uses, we can use the specific model class directly.

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

## Loading and Saving

We can use the `save_pretrained()` method to save the model's weights and architecture configuration.

In [None]:
model.save_pretrained("bert")

This saves two files in the directory passed to `save_pretrained()`: `config.json` and `model.safetensors`.

The `config.json` file holds the attributes needed to build the model architecture, along with some metadata. The `model.safetensors` file is the state dictionary containing the weights.
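Since `config.json` is plain JSON, we can inspect it with the standard library. The sketch below writes a small stand-in file to a temporary directory so that it runs on its own. The stand-in holds only a few keys, with values taken from the `bert-base-cased` figures quoted earlier, and a real `config.json` has many more entries.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a saved config.json (assumption: a real file has more keys).
example_config = {
    "model_type": "bert",
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "num_attention_heads": 12,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    config_path.write_text(json.dumps(example_config, indent=2))

    # Reading the file back is ordinary JSON parsing.
    loaded_config = json.loads(config_path.read_text())

# The hidden size is split evenly across the attention heads.
head_dim = loaded_config["hidden_size"] // loaded_config["num_attention_heads"]
print(loaded_config["model_type"], head_dim)
```

To look at the file actually written by `save_pretrained("bert")`, point `config_path` at `bert/config.json` instead of the temporary file.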

To reuse a saved model, call `from_pretrained()` again, passing the saved directory instead of a Hub checkpoint name.

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("bert")

## Model Flow

Transformer models handle text by turning the inputs into numbers. Here we will look at exactly what happens when your text is processed by the tokenizer.
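As a rough mental model of "turning text into numbers", the toy encoder below splits text into tokens and looks each one up in a vocabulary. The vocabulary `toy_vocab` and the function `toy_encode` are hypothetical. Real tokenizers such as BERT's use a much larger vocabulary and subword splitting, so this is only a sketch of the idea.

```python
import re

# Hypothetical five-entry vocabulary; unknown tokens map to "[UNK]".
toy_vocab = {"[UNK]": 0, "hello": 1, ",": 2, "world": 3, "!": 4}


def toy_encode(text):
    """Split text into words and punctuation, then map each token to an id."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    ids = [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]
    return tokens, ids


tokens, ids = toy_encode("Hello, world!")
print(tokens)
print(ids)
```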

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

We can decode the input IDs to get the text back:

In [3]:
tokenizer.decode(encoded_input["input_ids"])

"[CLS] Hello, I ' m a single sentence! [SEP]"

Notice that the tokenizer has added the special tokens `[CLS]` and `[SEP]`, which the model requires. Not all models need special tokens; they are used only when a model was pretrained with them, and the tokenizer then adds them because the model expects them.

In [None]:
# Passing two sentences encodes them as one pair: [CLS] A [SEP] B [SEP]
encoded_input = tokenizer("How are you?", "I'm fine, thank you!")
print(encoded_input)
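The layout of the special tokens, and of the `token_type_ids` that mark which sentence each token belongs to, can be sketched in plain Python. This is a toy illustration of the BERT-style layout, not the tokenizer's own code. The ids 101 and 102 are assumed to be `[CLS]` and `[SEP]` in the BERT vocabulary, and the content ids 7, 8, 9 are made up.

```python
# Toy sketch of BERT-style special-token insertion; not the tokenizer's code.
CLS_ID = 101  # assumed id of [CLS] in the BERT vocabulary
SEP_ID = 102  # assumed id of [SEP] in the BERT vocabulary


def add_special_tokens(ids_a, ids_b=None):
    """Wrap one sequence, or a pair, as [CLS] A [SEP] (B [SEP])."""
    input_ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # first segment is marked 0
    if ids_b is not None:
        second = list(ids_b) + [SEP_ID]
        input_ids += second
        token_type_ids += [1] * len(second)  # second segment is marked 1
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


single = add_special_tokens([7, 8, 9])  # made-up content ids
pair = add_special_tokens([7, 8], [9])
print(single)
print(pair)
```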

In [None]:
# return_tensors="pt" returns PyTorch tensors instead of Python lists
encoded_input = tokenizer("How are you?", "I'm fine, thank you!", return_tensors="pt")
print(encoded_input)

In [None]:
# padding=True pads the shorter sentence to the length of the longest one in the batch
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
print(encoded_input)
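Padding makes every sequence in a batch the same length, and the attention mask records which positions are real tokens (1) and which are padding (0). Below is a minimal sketch of that idea. It assumes right-side padding with a pad id of 0, as in BERT; `pad_batch` is a hypothetical helper, and the content ids are made up.

```python
PAD_ID = 0  # assumed id of [PAD] in the BERT vocabulary


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already-encoded sequences of different lengths (made-up content ids).
batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

A rectangular batch like this is what allows the result to be packed into a single tensor when `return_tensors="pt"` is requested.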

In [None]:
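# truncation=True cuts inputs longer than the model's maximum length (512 tokens for bert-base-cased)
encoded_input = tokenizer(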
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)
print(encoded_input["input_ids"])
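Truncation can be pictured as cutting the content tokens so that the final sequence, special tokens included, fits within a length limit. The sketch below assumes a default limit of 512 (the maximum length of `bert-base-cased`) and the `[CLS]`/`[SEP]` ids 101 and 102. `truncate_and_wrap` is a hypothetical helper, not a library function.

```python
CLS_ID, SEP_ID = 101, 102  # assumed ids of [CLS] and [SEP]


def truncate_and_wrap(content_ids, max_length=512):
    """Keep at most max_length - 2 content ids, then add [CLS] and [SEP]."""
    num_special = 2  # one [CLS] plus one [SEP]
    kept = list(content_ids)[: max_length - num_special]
    return [CLS_ID] + kept + [SEP_ID]


long_ids = list(range(1000, 1600))  # 600 made-up token ids
truncated = truncate_and_wrap(long_ids)
print(len(truncated))

short = truncate_and_wrap([7, 8, 9, 10, 11], max_length=5)
print(short)
```

Inputs that are already shorter than the limit pass through unchanged apart from the added special tokens.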

In [None]:
# Truncate each sequence to at most 5 tokens (special tokens included), then pad to the longest
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
print(encoded_input)

In [None]:
# Encode a single sentence, then decode the ids back to text
encoded_input = tokenizer("How are you?")
print(encoded_input["input_ids"])
tokenizer.decode(encoded_input["input_ids"])

In [None]:
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]