# **AutoModel - The Universal Model Loader**

## **What's Covered?**
1. Introduction to AutoModel
    - What is AutoModel?
    - Using .from_pretrained("model_name")
    - Configuration Files
    - What are these warnings?
    - What is last_hidden_state and pooler_output?
    - Essential Model Configuration and Architecture
2. Core AutoModel vs. Task-Specific AutoModelFor...
    - AutoModel Generates Embeddings
    - Task-Specific AutoModelFor... (Ready for Downstream Tasks)
3. AutoModelForSequenceClassification
    - Sentiment Analysis (Step by Step)
    - Step 1: Import and initialize the model and tokenizer
    - Step 2: Preprocess the data batch
    - Step 3: Pass your preprocessed batch of inputs directly to the model
    - Step 4: Applying softmax activation on the output of model
    - Step 5: Print the predicted labels
    - Sentiment Analysis with pipeline()
4. AutoModelForTokenClassification
5. AutoModelFor`*`, TFAutoModelFor`*` and FlaxAutoModelFor`*`


## **Introduction to AutoModel**

### **What is AutoModel?**
Just as AutoTokenizer simplifies text preprocessing, AutoModel is your gateway to interacting with the pre-trained brain of a Transformer network.

AutoModel is a class from the transformers library designed to automatically load the correct pre-trained model architecture and its weights based on a given checkpoint name (e.g., "bert-base-uncased", "gpt2", "facebook/bart-large").

It loads:
- The weights
- The architecture class (like BertModel)
- The configuration (like hidden_size, num_layers, etc.)

Note that AutoModel gives you the raw model without task-specific heads, so it is mainly used for feature extraction from text data. (Discussed in detail later in this notebook.)


### **Using .from_pretrained("model_name")**

When you call AutoModel.from_pretrained("model_name"), the library does several things:
1. **Configuration Download (`config.json`):** The library first looks for the `config.json` file for that model_name, downloading it from the Hugging Face Hub if it is not already cached. This JSON file contains the architectural blueprint and hyperparameters of the model (e.g., number of layers, hidden dimension, attention heads, vocabulary size, the task it's meant for).
2. **Class Selection:** Based on the model_type specified in config.json (e.g., "bert"), AutoModel determines the correct Python class to instantiate (e.g., transformers.BertModel).
3. **Weight Download:** Once the architecture is known, it downloads the pre-trained model weights (usually large binary files such as **`model.safetensors`**, **`pytorch_model.bin`**, or **`tf_model.h5`**). These weights contain the knowledge the model acquired during pre-training on large amounts of data.
4. **Local Caching:** All downloaded files (config, weights) are stored in your local Hugging Face cache directory (by default `~/.cache/huggingface/hub` in recent versions of the library), so subsequent loads of the same model are much faster.
5. **Model Loading:** The model instance is created, and the downloaded weights are loaded into its layers.
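
The class-selection step can be pictured as a lookup from `model_type` to a Python class. The sketch below is a toy illustration of that idea only, not the actual transformers implementation: `ToyBertModel`, `ToyGPT2Model`, `MODEL_REGISTRY`, and `toy_from_config` are hypothetical names, and no files are downloaded.

```python
import json


class ToyBertModel:
    """Hypothetical stand-in for an architecture class such as BertModel."""

    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    """Hypothetical stand-in for a second architecture class."""

    def __init__(self, config):
        self.config = config


# Hypothetical registry: model_type string -> Python class
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


def toy_from_config(config_json):
    """Parse a config.json string and instantiate the class named by model_type."""
    config = json.loads(config_json)
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return MODEL_REGISTRY[model_type](config)


config_json = '{"model_type": "bert", "hidden_size": 768, "num_hidden_layers": 12}'
toy_model = toy_from_config(config_json)
print(type(toy_model).__name__)
```

The real library additionally downloads and loads the weights, which this sketch leaves out on purpose.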

### **Configuration Files**
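- config.json: the architecture and hyperparameters described above.
- model.safetensors: the weights file, stored in the safetensors format.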

In [28]:
from transformers import AutoModel, AutoTokenizer

model_checkpoint = "google-bert/bert-base-uncased"

model = AutoModel.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



### **What are these warnings?**
- These warnings are emitted while the checkpoint's weights are loaded into the PyTorch model. They typically appear when a checkpoint was converted from a TensorFlow-style format (e.g., .ckpt or .h5), or when its parameter names do not fully align with PyTorch naming conventions.
- In TensorFlow / Keras, the learnable parameters of Batch Normalization and Layer Normalization layers are commonly named:
    - gamma (scale) → maps to weight in PyTorch
    - beta (offset) → maps to bias in PyTorch
- When Hugging Face loads these weights into a PyTorch model, it renames them internally for compatibility.
- No need to worry about these warnings. The model still loads and runs correctly, and they do not affect performance, outputs, or fine-tuning.


In [29]:
# Preprocess the input text
inputs = tokenizer("What will be the output of model?", return_tensors="pt")

print(inputs)

{'input_ids': tensor([[ 101, 2054, 2097, 2022, 1996, 6434, 1997, 2944, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [30]:
# Now pass your preprocessed batch of inputs directly to the model. 
# You just have to unpack the dictionary by adding **
outputs = model(**inputs)

print(outputs.keys())

odict_keys(['last_hidden_state', 'pooler_output'])


### **What is last_hidden_state and pooler_output?**

**last_hidden_state (Embeddings for each token)**
- It's the final output from the last encoder layer of the transformer. It contains token-level contextual embeddings.
- Tensor of shape **(batch_size, seq_len, hidden_size)**
- Used for token-level tasks such as NER, and as the starting point for sentence embeddings (e.g., by pooling the token vectors)

**pooler_output (Sentence-Level Embedding)**
- It is a summary representation of the entire sentence, derived from the [CLS] token.
- Specifically:
    - Take the embedding of [CLS] from last_hidden_state
    - Pass it through a dense (Linear) layer + Tanh activation
- This gives you a fixed-size sentence vector.
- Tensor of shape **(batch_size, hidden_size)**
- Not every model returns pooler_output: DistilBERT, for example, has no pooler layer, and RoBERTa checkpoints loaded into task-specific classes drop it.

**Example**  

Let's say input is: "I love GenAI!"

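For the above input, last_hidden_state and pooler_output will look as follows (simplified to one token per word; a real subword tokenizer may split a rare word such as "GenAI" into several tokens):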

- last_hidden_state contains:
    - [CLS] → Vector 1
    - I → Vector 2
    - love → Vector 3
    - GenAI → Vector 4
    - ! → Vector 5
    - [SEP] → Vector 6
- pooler_output = Tanh(Dense(Vector 1))
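
The relation pooler_output = Tanh(Dense(Vector 1)) can be sketched in plain Python. This is a toy illustration with made-up numbers and a hidden size of 2; the `pooler` function, the identity weight matrix, and the zero bias are assumptions chosen so the result is easy to check, not values from a real checkpoint.

```python
import math


def pooler(cls_vector, weight, bias):
    """Dense layer followed by tanh: tanh(W @ cls_vector + b)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy last_hidden_state for one sentence: 3 tokens, hidden_size = 2 (made-up values)
last_hidden_state = [
    [0.5, -1.0],  # [CLS]
    [0.1, 0.3],   # a word token
    [0.7, 0.2],   # [SEP]
]

# Identity weights and zero bias (illustrative only)
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

# Only the [CLS] vector (index 0) feeds the pooler
pooler_output = pooler(last_hidden_state[0], weight, bias)
print(pooler_output)  # equals [tanh(0.5), tanh(-1.0)]
```

Whatever the sequence length, the result has one value per hidden dimension, which is why pooler_output has shape (batch_size, hidden_size).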



In [32]:
outputs.last_hidden_state.shape

torch.Size([1, 10, 768])

In [33]:
outputs.pooler_output.shape

torch.Size([1, 768])

### **Essential Model Configuration and Architecture**
- You can access the model's configuration via its .config attribute. The config object contains all the architectural details and hyperparameters.
- **model.config.model_type:** This defines the type of model architecture used, such as: "bert", "roberta", "gpt2", "t5", "distilbert", "bloom", etc. It helps Hugging Face determine:
    - What tokenizer class to use
    - Which model architecture to load
    - How to handle special tokens like [CLS], [SEP], etc.
- **model.config.vocab_size:** The number of unique tokens (words, subwords, or characters) the model knows. For example:
    - BERT: 30522
    - GPT-2: 50257
    - RoBERTa: 50265
    - T5: 32128
- **model.config.num_attention_heads:** Number of self-attention heads in each Transformer layer. For example:
    - BERT-base: 12 heads
    - BERT-large: 16 heads
    - GPT-2: 12, 16, 20, or 25 depending on size (small, medium, large, XL)
- **model.config.num_hidden_layers:** The number of Transformer encoder (or decoder) layers in the model. Each layer contains multi-head self-attention, a feed-forward network, and layer normalization with residual connections. For example:
    - BERT-base: 12 layers
    - BERT-large: 24 layers
    - GPT-2 medium: 24 layers
- **model.config.hidden_size:** The size of each hidden layer’s output vector, which is also the embedding dimension. It controls the size of the token embeddings, the size of the [CLS] vector, and the input/output shape of the attention blocks. A larger hidden_size gives the model more capacity, at the cost of slower training and inference. For example:
    - BERT-base: 768
    - BERT-large: 1024
    - GPT-2: 768, 1024, 1280, or 1600 depending on size
- **model.config.max_position_embeddings:** This defines the maximum input sequence length the model can handle. Each token in the input gets a positional embedding based on its position (1st token, 2nd token, etc.). If your sequence is longer than this, it must be truncated (for example with truncation=True in the tokenizer) or handled specially, e.g. by chunking or a sliding window. Typical values:
    - BERT: 512
    - RoBERTa: 514
    - GPT-2: 1024
- **model.training:** This is a PyTorch flag that indicates whether the model is in training mode (True) or evaluation mode (False).
    - Training mode: Dropout is enabled
    - Evaluation mode: Dropout is disabled
```python
model.train()  # enables training mode
model.eval()   # sets evaluation mode
```
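
To see why the mode matters, here is a minimal plain-Python sketch of a dropout layer that behaves differently in the two modes. `ToyDropout` is a hypothetical class written for illustration; it mimics the inverted-dropout scaling by 1 / (1 - p), but it is not the PyTorch implementation.

```python
import random


class ToyDropout:
    """Inverted dropout that is only active in training mode (illustrative)."""

    def __init__(self, p=0.1, seed=0):
        self.p = p
        self.training = True
        self.rng = random.Random(seed)

    def train(self):
        self.training = True
        return self

    def eval(self):
        self.training = False
        return self

    def __call__(self, values):
        if not self.training:
            # Evaluation mode: values pass through unchanged
            return list(values)
        keep = 1.0 - self.p
        # Training mode: zero each value with probability p, rescale the rest
        return [v / keep if self.rng.random() < keep else 0.0 for v in values]


layer = ToyDropout(p=0.5)
x = [1.0, 2.0, 3.0, 4.0]

layer.eval()
eval_out = layer(x)    # identical to x

layer.train()
train_out = layer(x)   # each entry is either 0.0 or twice the input

print(eval_out)
print(train_out)
```

Because training mode injects randomness, the same input can give different outputs from call to call, which is why evaluation mode is the right choice for inference.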

In [34]:
print("\n--- Model Configuration ---")
print(f"  Model Type: {model.config.model_type}")
print(f"  Vocabulary Size: {model.config.vocab_size}")
print(f"  Number of Attention Heads: {model.config.num_attention_heads}")
print(f"  Number of Layers: {model.config.num_hidden_layers}")
print(f"  Hidden Size (Embedding Dimension): {model.config.hidden_size}")
print(f"  Max Position Embeddings (max sequence length it can handle): {model.config.max_position_embeddings}")


--- Model Configuration ---
  Model Type: bert
  Vocabulary Size: 30522
  Number of Attention Heads: 12
  Number of Layers: 12
  Hidden Size (Embedding Dimension): 768
  Max Position Embeddings (max sequence length it can handle): 512
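
With a limit of 512 positions, longer inputs have to be truncated or split. The sliding-window idea can be sketched as follows; `sliding_window_chunks` is a hypothetical helper, and the sketch ignores the special tokens ([CLS], [SEP]) that a real pipeline must leave room for in every chunk.

```python
def sliding_window_chunks(token_ids, max_len, stride):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that context at the
    chunk boundaries is not lost.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the sequence
    return chunks


token_ids = list(range(10))  # stand-in for a long sequence of token ids
chunks = sliding_window_chunks(token_ids, max_len=4, stride=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Hugging Face tokenizers expose a similar mechanism through the `stride` and `return_overflowing_tokens` arguments, which also handle the special tokens.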


In [35]:
# Check whether the model is currently in training or evaluation mode.
# from_pretrained() returns the model in evaluation mode, so this prints False.
print(f"Training mode after from_pretrained(): {model.training}")

# It's still good practice to call eval() explicitly before inference
model.eval()
print(f"Training mode after model.eval(): {model.training}")

Training mode after from_pretrained(): False
Training mode after model.eval(): False


In [36]:
print("----Model Architecture----")
print(model)

----Model Architecture----
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): D

## **Core AutoModel vs. Task-Specific AutoModelFor...**

### **AutoModel Generates Embeddings**
- Loads only the core Transformer architecture (encoder or decoder layers) without any task-specific output layers (like a classification head or a language modeling head).
- **Output:** The raw hidden states (or embeddings) produced by the model's final layer for each input token. These are rich, contextualized numerical representations of your text.
- Use Cases:
    - **Feature Extraction:** You want to get embeddings for your text that you can then feed into a traditional machine learning model (e.g., SVM, Logistic Regression).
    - **Building Custom Heads:** You want to design your own custom layers on top of the Transformer backbone for a highly specific task.
    - **Semantic Search:** Using sentence embeddings to find similar documents.
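
For the semantic search use case, one common recipe is to mean-pool the token embeddings (ignoring padding) and compare the resulting sentence vectors with cosine similarity. The sketch below uses plain Python and made-up 2-dimensional vectors in place of a real last_hidden_state; `mean_pool` and `cosine_similarity` are illustrative helpers, not library functions.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token embeddings (hidden_size = 2); the last query token is padding
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
doc_tokens = [[2.0, 2.0], [4.0, 4.0]]
doc_mask = [1, 1]

query_vec = mean_pool(query_tokens, query_mask)  # [0.5, 0.5]
doc_vec = mean_pool(doc_tokens, doc_mask)        # [3.0, 3.0]
score = cosine_similarity(query_vec, doc_vec)
print(query_vec, doc_vec, score)
```

The padded position is excluded by the mask, so its large values do not distort the sentence vector.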

### **Task-Specific AutoModelFor... (Ready for Downstream Tasks)**
These models include the pre-trained Transformer backbone plus a specific "head" (a small neural network layer) trained or fine-tuned for a particular task. Their output is directly usable for that task.

This is a critical distinction! The transformers library provides various AutoModelFor... classes, each tailored for a specific downstream task. They all share the same pre-trained backbone but differ in the "head" (the final layers) added on top.

| Class                                | Use Case                         | Head / Architecture               |
| ------------------------------------ | -------------------------------- | --------------------------------- |
| `AutoModel`                          | Raw model for feature extraction | BERT, RoBERTa, GPT, etc.          |
| `AutoModelForSequenceClassification` | Text classification              | Adds a classification head        |
| `AutoModelForTokenClassification`    | NER, POS tagging                 | Token-wise classification         |
| `AutoModelForQuestionAnswering`      | QnA tasks like SQuAD             | Outputs start/end logits          |
| `AutoModelForCausalLM`               | Text generation (GPT)            | Decoder-only LM                   |
| `AutoModelForMaskedLM`               | Fill-in-the-blank (BERT-style)   | Masked token prediction           |
| `AutoModelForSeq2SeqLM`              | Translation, Summarization       | Encoder-decoder models (T5, BART) |

## **AutoModelForSequenceClassification**

- For tasks where you classify an entire sequence (sentence/document) into one or more categories.
- **Head:** A linear layer applied to the hidden state of the special [CLS] token (or the pooled output of the sequence) to predict class logits.
- **Output:** logits (raw scores for each class) with shape (batch_size, num_labels). Applying a softmax to the logits (i.e. softmax(logits)) turns them into classification probabilities.
- **Use Cases:** Sentiment analysis, spam detection, topic classification, intent recognition.

### **Sentiment Analysis (Step by Step)**
- Step 1: Import and initialize the model and tokenizer
- Step 2: Preprocess the data batch
- Step 3: Pass your preprocessed batch of inputs directly to the model
- Step 4: Applying softmax activation on the output of model
- Step 5: Print the predicted labels

### **Step 1: Import and initialize the model and tokenizer**

In [27]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_checkpoint = "cardiffnlp/twitter-roberta-base-sentiment-latest"

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### **Step 2: Preprocess the data batch**

In [22]:
tweets = ["A very bad experience at the airport.", 
          "Amazing job done by the government with their initiatives.", 
          "This is just a random string."]

token_ids = tokenizer(tweets, padding=True, truncation=True, max_length=15, return_tensors="pt")

print(token_ids)

{'input_ids': tensor([[    0,   250,   182,  1099,   676,    23,     5,  3062,     4,     2,
             1,     1],
        [    0, 41710,   633,   626,    30,     5,   168,    19,    49,  5287,
             4,     2],
        [    0,   713,    16,    95,    10,  9624,  6755,     4,     2,     1,
             1,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}
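
What padding=True does to this batch can be reproduced by hand. The sketch below is a plain-Python illustration, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the token ids are copied from the output above, and the padding id 1 is read off that same output.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Unpadded token ids of the three tweets, taken from the output above
sequences = [
    [0, 250, 182, 1099, 676, 23, 5, 3062, 4, 2],
    [0, 41710, 633, 626, 30, 5, 168, 19, 49, 5287, 4, 2],
    [0, 713, 16, 95, 10, 9624, 6755, 4, 2],
]

input_ids, attention_mask = pad_batch(sequences, pad_id=1)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The longest tweet has 12 tokens, which is below max_length=15, so no sequence is truncated and every row is padded to length 12.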


### **Step 3: Pass your preprocessed batch of inputs directly to the model**

In [23]:
# Now pass your preprocessed batch of inputs directly to the model. 
# You just have to unpack the dictionary by adding **
outputs = model(**token_ids)

print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.5498, -0.3649, -2.5233],
        [-2.2652, -1.1891,  3.2718],
        [ 0.5962,  1.1877, -1.9644]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


### **Step 4: Applying softmax activation on the output of model**

In [24]:
# Applying softmax activation on the output of model
from torch import nn

predictions = nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

tensor([[0.9430, 0.0511, 0.0059],
        [0.0039, 0.0114, 0.9847],
        [0.3468, 0.6265, 0.0268]], grad_fn=<SoftmaxBackward0>)
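
As a hand check, the first row of probabilities can be recomputed from the first row of logits with a plain-Python softmax. This is a sketch for illustration; the `softmax` helper below is written from the textbook definition and is not the PyTorch function.

```python
import math


def softmax(logits):
    """Numerically stable softmax: exp(x - max) / sum(exp(x - max))."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits of the first tweet, copied from the model output in Step 3
logits = [2.5498, -0.3649, -2.5233]
probs = softmax(logits)
print([round(p, 4) for p in probs])  # approximately 0.9430, 0.0511, 0.0059
```

Subtracting the maximum before exponentiating does not change the result; it only avoids overflow for large logits.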


### **Step 5: Print the predicted labels**

In [26]:
# Get the index of the maximum value (the predicted class) for each tweet
predicted_class_idx = predictions.argmax(dim=-1).tolist()

# Define the mapping from indices to class labels.
# (The checkpoint's config should expose the same mapping as model.config.id2label.)
class_labels = ['negative', 'neutral', 'positive']

# Convert the predicted indices to the corresponding labels
predicted_labels = [class_labels[idx] for idx in predicted_class_idx]

# Print the predicted labels
for tweet, label in zip(tweets, predicted_labels):
    print(f"Tweet: {tweet}")
    print(f"Prediction: {label}")
    print()

Tweet: A very bad experience at the airport.
Prediction: negative

Tweet: Amazing job done by the government with their initiatives.
Prediction: positive

Tweet: This is just a random string.
Prediction: neutral



### **Sentiment Analysis with pipeline()**

In [39]:
from transformers import pipeline
classifier = pipeline(task="text-classification", 
                      model="cardiffnlp/twitter-roberta-base-sentiment-latest")
tweet = "A very bad experience at the airport."
classifier(tweet)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'negative', 'score': 0.9429681897163391}]

## **AutoModelForTokenClassification**
- For tasks where you classify each token in a sequence.
- **Head:** A linear layer applied to the hidden state of every token in the sequence.
- **Output:** logits for each token, for each possible class.
- **Use Cases:** Named Entity Recognition (NER), Part-of-Speech (POS) tagging.

In [42]:
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_checkpoint = "dslim/bert-base-NER"

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

inputs = tokenizer("My name is ThatAIGuy", return_tensors="pt")
outputs = model(**inputs)

print(outputs.logits.shape)  # [1, seq_len, num_labels]

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([1, 9, 9])
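
The shape above reads as 1 sentence, 9 tokens, 9 labels: the same linear layer is applied to every token's hidden state. A toy sketch of that idea in plain Python follows; the weights and hidden states are made-up numbers with hidden_size = 2 and 3 labels, and `token_classification_head` is a hypothetical helper rather than the library's head.

```python
def linear(vector, weight, bias):
    """One linear layer: weight @ vector + bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


def token_classification_head(hidden_states, weight, bias):
    """Apply the same linear layer to every token vector."""
    return [linear(h, weight, bias) for h in hidden_states]


# Made-up hidden states for 3 tokens (hidden_size = 2)
hidden_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Made-up head parameters: 3 labels x hidden_size 2
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.5]

logits = token_classification_head(hidden_states, weight, bias)  # 3 tokens x 3 labels
predicted = [row.index(max(row)) for row in logits]              # one label per token
print(logits)
print(predicted)  # [0, 1, 2]
```

A sequence-classification head differs only in that it is applied once per sequence, to a single pooled vector, instead of once per token.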


## **AutoModelFor`*`, TFAutoModelFor`*` and FlaxAutoModelFor`*`**

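The example below uses the PyTorch class AutoModelForSeq2SeqLM for summarization (the TFAutoModelFor`*` and FlaxAutoModelFor`*` classes are the TensorFlow and Flax counterparts). It follows this pattern:

* Start from a set of input articles.
* Tokenize them (converting text to token indices).
* Apply the model to the tokenized data to generate summaries (represented as token indices).
* Decode the summaries into human-readable text.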

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd


# Load the pre-trained tokenizer.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Load the pre-trained model.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

In [7]:
# `xsum_sample` was defined in a cell that is not included here (presumably a
# sample of the XSum dataset). The two-row DataFrame below is a hypothetical
# stand-in with made-up text; only its "document" column is needed.
xsum_sample = pd.DataFrame({
    "document": [
        "The city council voted on Tuesday to extend the tram line to the airport. "
        "Construction is expected to start next year and take three years.",
        "Heavy rain flooded several roads in the region overnight. "
        "Emergency services asked drivers to avoid unnecessary travel.",
    ]
})

# For summarization, T5-small expects a prefix "summarize: ",
# so we prepend that to each article as a prompt.
articles = ["summarize: " + article for article in xsum_sample["document"]]

pd.DataFrame(articles, columns=["prompts"])

In [8]:
# Tokenize the input

inputs = tokenizer(
    articles, return_tensors="pt", padding=True, truncation=True, max_length=1024
)

print("input_ids:")
print(inputs["input_ids"])
print("attention_mask:")
print(inputs["attention_mask"])

In [9]:
# Generate summaries (beam search with 2 beams, at most 40 tokens each)

summary_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=2,
    min_length=0,
    max_length=40,
)

print(summary_ids)

In [10]:
# Decode the generated summaries

decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

pd.DataFrame(decoded_summaries, columns=["decoded_summaries"])


## **Fine-Tuning**

https://huggingface.co/docs/transformers/training#train-a-tensorflow-model-with-keras