# Transformers Library: A Comprehensive Guide

This notebook provides a thorough introduction to the Hugging Face Transformers library, bridging the gap between basic tokenization concepts and advanced applications like math problem solving. We'll explore various transformer model architectures, how to access pre-trained models, and how to use them effectively.

## 1. Introduction to the Transformers Library

The Hugging Face Transformers library has become the de facto standard for working with transformer-based models in natural language processing (NLP) and beyond. Let's install and import the necessary libraries:

In [1]:
# Install required libraries (uncomment if needed)
# !pip install transformers datasets torch pandas numpy matplotlib seaborn

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM
from transformers import pipeline, set_seed
import os
from datasets import Dataset
import seaborn as sns

In [3]:
# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cpu


## 2. Understanding Transformer Architecture Types

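Transformer models come in three main architectural varieties: encoder-only, decoder-only, and encoder-decoder. This section covers the first two; encoder-decoder models such as T5 appear in the tokenizer examples later on.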

### 2.1 Encoder-Only Models

Encoder-only models process the entire input sequence and generate contextualized representations for each token. They're excellent for understanding tasks.

**Examples**: BERT, RoBERTa, DistilBERT, ALBERT

**Best for**: Classification, named entity recognition, token classification, feature extraction

In [4]:
# Load a BERT model
bert_model_name = "bert-base-uncased"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
bert_model = AutoModel.from_pretrained(bert_model_name)

# Example input
text = "Transformers are powerful models for various NLP tasks."
inputs = bert_tokenizer(text, return_tensors="pt")

# Get embeddings
with torch.no_grad():
    outputs = bert_model(**inputs)

# The last hidden states contain contextual embeddings for each token
last_hidden_states = outputs.last_hidden_state
print(f"BERT output shape: {last_hidden_states.shape}")
print(f"This represents embeddings for {last_hidden_states.shape[1]} tokens, each with dimension {last_hidden_states.shape[2]}")

BERT output shape: torch.Size([1, 12, 768])
This represents embeddings for 12 tokens, each with dimension 768
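
The output above holds one vector per token. For sentence-level feature extraction, one common option is to average the token vectors while ignoring padded positions. The following is a minimal pure-Python sketch of that idea; `mean_pool` and the toy values are hypothetical and only illustrate the arithmetic, not a Transformers API.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]


# Toy data (assumed): three tokens with 2-dimensional embeddings, last one is padding
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

sentence_vector = mean_pool(toy_embeddings, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

With real model outputs the same computation would run over the 768-dimensional vectors in `last_hidden_state`, typically with tensor operations rather than Python loops.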


### 2.2 Decoder-Only Models

Decoder-only models generate text autoregressively, predicting one token at a time based on previous tokens. They excel at text generation tasks.

**Examples**: GPT, GPT-2, GPT-3, BLOOM, LLaMA

**Best for**: Text generation, completion, story writing, code generation
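
The autoregressive loop described above can be sketched without downloading a model. In this toy version a hypothetical lookup table (`NEXT_TOKEN_PROBS`) stands in for the network, and the next token is chosen greedily. A real decoder conditions on the whole prefix rather than only the last token, so treat this as an illustration of the control flow only.

```python
# Hypothetical next-token probabilities standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "<bos>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Append the most probable next token until <eos> or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # The toy table has no continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


generated = greedy_generate(["<bos>"])
print(generated)  # ['<bos>', 'the', 'cat', 'sat']
```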

In [5]:
# Load a GPT-2 model
gpt2_model_name = "gpt2"
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_name)
gpt2_model = AutoModelForCausalLM.from_pretrained(gpt2_model_name)

# Set padding token to be the same as EOS token
if gpt2_tokenizer.pad_token is None:
    gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Example text generation
prompt = "The solution to the equation 2x + 5 = 13 is"
# Keep the attention mask together with the input IDs so generate() does not have to infer it
inputs = gpt2_tokenizer(prompt, return_tensors="pt")

# Generate text
set_seed(42)  # For reproducibility
generated_outputs = gpt2_model.generate(
    **inputs,
    max_length=30,  # Total length in tokens, prompt included
    num_return_sequences=1,
    temperature=0.7,
    do_sample=True,
    pad_token_id=gpt2_tokenizer.pad_token_id
)

generated_text = gpt2_tokenizer.decode(generated_outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")


Generated text: The solution to the equation 2x + 5 = 13 is:

(13 x 5) = 13 x 13 = 12

This is


## 3. Accessing and Using Pre-trained Models

Hugging Face hosts thousands of pre-trained models that can be easily accessed and used.

### 3.1 The Hugging Face Hub

The Hugging Face Hub is a platform for sharing and discovering models, datasets, and more:
- Contains over 100,000 models
- Supports multiple frameworks (PyTorch, TensorFlow, JAX)
- Provides model cards with documentation
- Allows search by task, language, and other criteria

### 3.2 Model Naming Conventions

Understanding model names helps you choose the right one:

In [7]:
# Common naming patterns
model_naming = {
    'Architecture': ['bert-base-uncased', 'roberta-large', 'gpt2-medium', 't5-small'],
    'Size': ['base/small (110-125M params)', 'large (300-350M params)', 'xl (750M-1.5B params)'],
    'Case Sensitivity': ['uncased (lowercase)', 'cased (preserves case)'],
    'Language': ['multilingual (bert-base-multilingual)', 'english (roberta-base)', 'german (dbmdz/bert-base-german)'],
    'Domain': ['financial (finbert)', 'biomedical (biomed-roberta-base)', 'code (codegen)']
}

# Print each category with its examples
for category, examples in model_naming.items():
    print(f"{category}: {', '.join(examples)}")

Architecture: bert-base-uncased, roberta-large, gpt2-medium, t5-small
Size: base/small (110-125M params), large (300-350M params), xl (750M-1.5B params)
Case Sensitivity: uncased (lowercase), cased (preserves case)
Language: multilingual (bert-base-multilingual), english (roberta-base), german (dbmdz/bert-base-german)
Domain: financial (finbert), biomedical (biomed-roberta-base), code (codegen)
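
As a rough illustration of these conventions, the sketch below splits a model name into its likely components. `describe_model_name` is a hypothetical helper based on simple string heuristics; Hub names are not required to follow any fixed scheme, so the model card remains the authoritative source.

```python
def describe_model_name(model_id):
    """Guess organization, family, size, and casing from a Hub-style name."""
    org, _, name = model_id.rpartition("/")
    parts = name.lower().replace("_", "-").split("-")
    sizes = {"tiny", "small", "base", "medium", "large", "xl"}
    casings = {"cased", "uncased"}
    return {
        "organization": org or None,
        "family": parts[0],
        "size": next((p for p in parts if p in sizes), None),
        "casing": next((p for p in parts if p in casings), None),
    }


for model_id in ["bert-base-uncased", "gpt2-medium", "dbmdz/bert-base-german"]:
    print(model_id, "->", describe_model_name(model_id))
```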


### 3.3 Loading Pre-trained Models

The `AutoModel` classes simplify loading models for different tasks:

In [8]:
# Using Auto classes to load models for different tasks
from transformers import (
    AutoModel,                 # Base model (hidden states, no task head)
    AutoModelForSequenceClassification,  # Classification
    AutoModelForTokenClassification,     # NER, POS tagging
    AutoModelForQuestionAnswering,       # Question answering
    AutoModelForMaskedLM,                # Masked language modeling
    AutoModelForCausalLM,                # Text generation
    AutoModelForSeq2SeqLM                # Translation, summarization
)

# Example of loading a model for classification
num_labels = 2  # Binary classification
classifier = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=num_labels)
print(f"Model loaded with {sum(p.numel() for p in classifier.parameters() if p.requires_grad):,} trainable parameters")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded with 66,955,010 trainable parameters
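
The parameter count printed above can be reproduced by hand from the architecture. The sketch below assumes the usual `distilbert-base-uncased` configuration (vocabulary 30,522, 512 positions, hidden size 768, feed-forward size 3,072, 6 layers) and the two-layer classification head named in the warning (`pre_classifier` and `classifier`). The helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def distilbert_classifier_params(vocab_size=30522, max_positions=512, dim=768,
                                 ffn_dim=3072, n_layers=6, num_labels=2):
    # Word + position embeddings, followed by a LayerNorm
    embeddings = vocab_size * dim + max_positions * dim + layer_norm_params(dim)
    # Query, key, value, and output projections
    attention = 4 * linear_params(dim, dim)
    # Two dense layers of the feed-forward block
    feed_forward = linear_params(dim, ffn_dim) + linear_params(ffn_dim, dim)
    layer = attention + feed_forward + 2 * layer_norm_params(dim)
    # pre_classifier (dim -> dim) and classifier (dim -> num_labels)
    head = linear_params(dim, dim) + linear_params(dim, num_labels)
    return embeddings + n_layers * layer + head


total = distilbert_classifier_params()
print(f"{total:,}")  # 66,955,010
```

Only the 592,130 parameters of the head are newly initialized; the rest come from the pre-trained checkpoint.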


## 4. Working with Tokenizers in Depth

Tokenizers convert text into a format models can understand. Different models use different tokenization strategies.

### 4.1 Types of Tokenizers

In [9]:
# Illustrative, hand-written token splits for each strategy
# (real tokenizer output can differ; Section 4.2 runs the actual tokenizers)
tokenization_examples = {
    'Text': "The transformer library allows fine-tuning on math problems like 3x + 7 = 28.",
    'WordPiece (BERT)': ["the", "transform", "##er", "library", "allows", "fine", "-", "tuning", "on", "math", "problems", "like", "3", "x", "+", "7", "=", "28", "."],
    'BPE (GPT-2)': ["The", "Ġtransformer", "Ġlibrary", "Ġallows", "Ġfine", "-", "tuning", "Ġon", "Ġmath", "Ġproblems", "Ġlike", "Ġ3", "x", "Ġ+", "Ġ7", "Ġ=", "Ġ28", "."],
    'SentencePiece (T5)': ["▁The", "▁transformer", "▁library", "▁allows", "▁fine", "-", "tuning", "▁on", "▁math", "▁problems", "▁like", "▁3", "x", "▁+", "▁7", "▁=", "▁28", "."]
}

# Compare tokenization outputs
for method, tokens in tokenization_examples.items():
    print(f"{method}: {tokens}")
    if method != 'Text':
        print(f"Token count: {len(tokens)}")
    print()

Text: The transformer library allows fine-tuning on math problems like 3x + 7 = 28.

WordPiece (BERT): ['the', 'transform', '##er', 'library', 'allows', 'fine', '-', 'tuning', 'on', 'math', 'problems', 'like', '3', 'x', '+', '7', '=', '28', '.']
Token count: 19

BPE (GPT-2): ['The', 'Ġtransformer', 'Ġlibrary', 'Ġallows', 'Ġfine', '-', 'tuning', 'Ġon', 'Ġmath', 'Ġproblems', 'Ġlike', 'Ġ3', 'x', 'Ġ+', 'Ġ7', 'Ġ=', 'Ġ28', '.']
Token count: 18

SentencePiece (T5): ['▁The', '▁transformer', '▁library', '▁allows', '▁fine', '-', 'tuning', '▁on', '▁math', '▁problems', '▁like', '▁3', 'x', '▁+', '▁7', '▁=', '▁28', '.']
Token count: 18
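
To make the `##` continuation markers less mysterious, here is a minimal sketch of the greedy longest-match-first procedure that WordPiece-style tokenizers apply to a single word. The vocabulary is a tiny hypothetical one, and the real BERT tokenizer also lowercases and splits on punctuation first, so this only shows the core matching step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # Marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # No piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary
toy_vocab = {"3", "x", "##x", "token", "##izer", "##s"}

print(wordpiece_tokenize("3x", toy_vocab))          # ['3', '##x']
print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("zebra", toy_vocab))       # ['[UNK]']
```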



### 4.2 Practical Tokenization Examples

Let's tokenize the same text using different tokenizers:

In [10]:
# Example text for tokenization
text = "The equation 3x + 7 = 28 has the solution x = 7."

# Get tokenizers for different models
tokenizers = {
    "BERT": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "GPT-2": AutoTokenizer.from_pretrained("gpt2"),
    "T5": AutoTokenizer.from_pretrained("t5-small"),
    "RoBERTa": AutoTokenizer.from_pretrained("roberta-base")
}

# Compare tokenization results
for name, tokenizer in tokenizers.items():
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)
    decoded = tokenizer.decode(token_ids)
    
    print(f"{name} tokenization:")
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {token_ids}")
    print(f"Token count: {len(tokens)}")
    print(f"Decoded text: {decoded}")
    print(f"Vocabulary size: {tokenizer.vocab_size}")
    print("-" * 50)

BERT tokenization:
Tokens: ['the', 'equation', '3', '##x', '+', '7', '=', '28', 'has', 'the', 'solution', 'x', '=', '7', '.']
Token IDs: [101, 1996, 8522, 1017, 2595, 1009, 1021, 1027, 2654, 2038, 1996, 5576, 1060, 1027, 1021, 1012, 102]
Token count: 15
Decoded text: [CLS] the equation 3x + 7 = 28 has the solution x = 7. [SEP]
Vocabulary size: 30522
--------------------------------------------------
GPT-2 tokenization:
Tokens: ['The', 'Ġequation', 'Ġ3', 'x', 'Ġ+', 'Ġ7', 'Ġ=', 'Ġ28', 'Ġhas', 'Ġthe', 'Ġsolution', 'Ġx', 'Ġ=', 'Ġ7', '.']
Token IDs: [464, 16022, 513, 87, 1343, 767, 796, 2579, 468, 262, 4610, 2124, 796, 767, 13]
Token count: 15
Decoded text: The equation 3x + 7 = 28 has the solution x = 7.
Vocabulary size: 50257
--------------------------------------------------
T5 tokenization:
Tokens: ['▁The', '▁equation', '▁3', 'x', '▁+', '▁7', '▁=', '▁28', '▁has', '▁the', '▁solution', '▁', 'x', '▁=', '▁7.']
Token IDs: [37, 13850, 220, 226, 1768, 489, 3274, 2059, 65, 8, 1127, 3, 226, 3274

### 4.3 Handling Special Tokens

Understanding special tokens is crucial for proper model usage:

In [11]:
# Display special tokens for different models
special_token_attrs = {
    'CLS token': 'cls_token',
    'SEP token': 'sep_token',
    'PAD token': 'pad_token',
    'MASK token': 'mask_token',
    'EOS token': 'eos_token',
    'BOS token': 'bos_token'
}

for name, tokenizer in tokenizers.items():
    print(f"{name} special tokens:")

    for token_name, attr in special_token_attrs.items():
        # Tokens a model does not define come back as None
        token = getattr(tokenizer, attr, None)
        token_id = tokenizer.convert_tokens_to_ids(token) if token else None
        print(f"  {token_name}: '{token}' (ID: {token_id})")

    print()

BERT special tokens:
  CLS token: '[CLS]' (ID: 101)
  SEP token: '[SEP]' (ID: 102)
  PAD token: '[PAD]' (ID: 0)
  MASK token: '[MASK]' (ID: 103)
  EOS token: 'None' (ID: None)
  BOS token: 'None' (ID: None)

GPT-2 special tokens:
  CLS token: 'None' (ID: None)
  SEP token: 'None' (ID: None)
  PAD token: 'None' (ID: None)
  MASK token: 'None' (ID: None)
  EOS token: '<|endoftext|>' (ID: 50256)
  BOS token: '<|endoftext|>' (ID: 50256)

T5 special tokens:
  CLS token: 'None' (ID: None)
  SEP token: 'None' (ID: None)
  PAD token: '<pad>' (ID: 0)
  MASK token: 'None' (ID: None)
  EOS token: '</s>' (ID: 1)
  BOS token: 'None' (ID: None)

RoBERTa special tokens:
  CLS token: '<s>' (ID: 0)
  SEP token: '</s>' (ID: 2)
  PAD token: '<pad>' (ID: 1)
  MASK token: '<mask>' (ID: 50264)
  EOS token: '</s>' (ID: 2)
  BOS token: '<s>' (ID: 0)
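
The sketch below shows, in plain Python, how BERT-style special tokens frame a sequence and how padding relates to the attention mask. It uses the BERT IDs printed above (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0); `build_inputs` is a hypothetical helper, not the tokenizer's own implementation.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # BERT IDs from the output above


def build_inputs(token_ids, max_length):
    """Wrap IDs in [CLS] ... [SEP], then pad to max_length."""
    ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


# 1996 and 8522 are the BERT IDs for 'the' and 'equation' seen in Section 4.2
encoded = build_inputs([1996, 8522], max_length=6)
print(encoded["input_ids"])       # [101, 1996, 8522, 102, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```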



## 5. Using Transformer Models with Pipelines

Pipelines are the easiest way to use models for practical tasks:

### 5.1 Text Classification

In [12]:
# Sentiment analysis pipeline (model named explicitly so results do not depend on the library default)
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english"
)

texts = [
    "I love working with transformer models!",
    "This code is confusing and difficult to understand.",
    "Learning about neural networks is challenging but rewarding."
]

for text in texts:
    result = sentiment_analyzer(text)
    print(f"Text: {text}")
    print(f"Sentiment: {result[0]['label']} (Score: {result[0]['score']:.4f})")
    print()

Device set to use cpu


Text: I love working with transformer models!
Sentiment: POSITIVE (Score: 0.9993)

Text: This code is confusing and difficult to understand.
Sentiment: NEGATIVE (Score: 0.9996)

Text: Learning about neural networks is challenging but rewarding.
Sentiment: POSITIVE (Score: 0.9999)
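
Roughly speaking, a classification pipeline tokenizes the text, runs the model, and turns the resulting logits into a label and a score. The last step can be sketched in plain Python as a softmax followed by an argmax. The logits and the `ID2LABEL` mapping below are assumed values for illustration, not output from the model used above.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # Assumed label mapping


def postprocess(logits):
    """Convert raw logits into the most likely label and its probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # Subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


prediction = postprocess([-3.0, 4.0])  # Hypothetical logits
print(prediction["label"], round(prediction["score"], 4))
```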



### 5.2 Question Answering

In [13]:
# Question answering pipeline (model named explicitly so results do not depend on the library default)
qa_pipeline = pipeline(
    "question-answering",
    model="distilbert/distilbert-base-cased-distilled-squad"
)

context = """
Transformer models were introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017.
The original transformer architecture uses an encoder-decoder structure and relies entirely on attention
mechanisms without recurrence or convolution. Since then, many variants have been developed, including
BERT, GPT, T5, and many others. These models have significantly advanced the state of the art in
natural language processing tasks.
"""

questions = [
    "When were transformer models introduced?",
    "Who introduced transformer models?",
    "What is special about the architecture?",
    "What are some variants of the transformer model?"
]

for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"Question: {question}")
    print(f"Answer: {result['answer']} (Score: {result['score']:.4f})")
    print()

Device set to use cpu


Question: When were transformer models introduced?
Answer: 2017 (Score: 0.9008)

Question: Who introduced transformer models?
Answer: Vaswani et al. (Score: 0.6975)

Question: What is special about the architecture?
Answer: attention
mechanisms without recurrence or convolution (Score: 0.1351)

Question: What are some variants of the transformer model?
Answer: BERT, GPT, T5 (Score: 0.9328)



### 5.3 Text Generation

In [None]:
# Text generation pipeline
generator = pipeline("text-generation", model="gpt2")

prompts = [
    "The quadratic formula for solving ax² + bx + c = 0 is",
    "To calculate the area of a circle, you need to",
    "The Pythagorean theorem states that"
]

for prompt in prompts:
    set_seed(42)  # For reproducibility
    results = generator(prompt, max_length=50, num_return_sequences=1)
    print(f"Prompt: {prompt}")
    print(f"Generated: {results[0]['generated_text']}")
    print()

### 5.4 Custom Pipeline for Math Problem Generation

In [None]:
def generate_math_problem(generator, topic, difficulty, max_length=60):
    """Generate a math problem based on topic and difficulty."""
    prompt = f"Generate a {difficulty} math problem about {topic}:\n\nProblem:"
    
    set_seed(42)  # For reproducibility
    results = generator(prompt, max_length=max_length, num_return_sequences=1)
    
    return results[0]['generated_text']

# Create a text generation pipeline with GPT-2
math_generator = pipeline("text-generation", model="gpt2")

# Topics and difficulties
topics = ["algebra", "geometry", "calculus"]
difficulties = ["easy", "medium", "hard"]

# Generate problems
for topic in topics:
    for difficulty in difficulties:
        problem = generate_math_problem(math_generator, topic, difficulty)
        print(f"{difficulty.capitalize()} {topic} problem:")
        print(problem)
        print("-" * 60)

## 6. Understanding Model Hyperparameters

Hyperparameters significantly affect model performance and behavior.

### 6.1 Model Hyperparameters

These parameters define the model's architecture and capabilities:

In [None]:
# Common model hyperparameters
model_hyperparams = {
    'Model Architecture': {
        'hidden_size': 'Dimension of hidden layers (e.g., 768, 1024)',
        'num_hidden_layers': 'Number of transformer layers (e.g., 12, 24)',
        'num_attention_heads': 'Number of attention heads per layer (e.g., 12, 16)',
        'intermediate_size': 'Size of feedforward layer (typically 4x hidden_size)',
        'hidden_dropout_prob': 'Dropout probability for hidden layers (e.g., 0.1)',
        'attention_probs_dropout_prob': 'Dropout probability for attention probabilities'
    }
}

# Display model hyperparameters
for category, params in model_hyperparams.items():
    print(f"\n{category}:")
    for param, description in params.items():
        print(f"  {param}: {description}")

In [None]:
# Examine a model's configuration
bert_model = AutoModel.from_pretrained("bert-base-uncased")
config = bert_model.config
print("BERT model configuration:")
for param, value in config.to_dict().items():
    print(f"  {param}: {value}")

### 6.2 Generation Hyperparameters

These parameters control how text is generated:

In [None]:
# Common generation hyperparameters
generation_hyperparams = {
    'Basic Control': {
        'max_length': 'Maximum total length in tokens (prompt plus generated tokens)',
        'min_length': 'Minimum length of generated sequence',
        'do_sample': 'Whether to use sampling (True) or greedy decoding (False)'
    },
    'Sampling Parameters': {
        'temperature': 'Controls randomness (lower = more deterministic, higher = more random)',
        'top_k': 'Limits sampling to top k highest probability tokens',
        'top_p': 'Limits sampling to the smallest set of most probable tokens whose cumulative probability reaches p (nucleus sampling)'
    },
    'Other Controls': {
        'num_beams': 'Number of beams for beam search',
        'no_repeat_ngram_size': 'Prevents repetition of n-grams',
        'repetition_penalty': 'Penalizes repeated tokens'
    }
}

# Display generation hyperparameters
for category, params in generation_hyperparams.items():
    print(f"\n{category}:")
    for param, description in params.items():
        print(f"  {param}: {description}")
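
To make `top_k` and `top_p` concrete, the sketch below applies both filters to a small, made-up next-token distribution and renormalizes what is left. The function names and probabilities are hypothetical; the library applies the same idea to logits over the full vocabulary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Made-up distribution over four candidate tokens
toy_probs = {"7": 0.5, "x": 0.3, "=": 0.15, "banana": 0.05}

print(sorted(top_k_filter(toy_probs, k=2)))    # ['7', 'x']
print(sorted(top_p_filter(toy_probs, p=0.9)))  # ['7', '=', 'x']
```

Sampling then draws from the renormalized distribution, so unlikely tokens such as `banana` can no longer be picked.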

### 6.3 Effect of Temperature on Generation

Let's see how temperature affects text generation:

In [None]:
# Temperature experiment
prompt = "The formula for solving a quadratic equation is"
temperatures = [0.2, 0.5, 1.0, 2.0]

print(f"Prompt: {prompt}\n")
for temp in temperatures:
    set_seed(42)  # For reproducibility
    output = generator(prompt, max_length=40, temperature=temp, do_sample=True)
    print(f"Temperature = {temp}:")
    print(f"{output[0]['generated_text']}")
    print()
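
The effect can also be seen directly in the math. Temperature divides the logits before the softmax, so low values sharpen the distribution and high values flatten it. The sketch below uses made-up logits for three candidate tokens and the same temperatures as the experiment above; `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # Made-up scores for three candidate tokens

for temp in [0.2, 0.5, 1.0, 2.0]:
    probs = softmax_with_temperature(toy_logits, temp)
    print(f"Temperature = {temp}: {[round(p, 3) for p in probs]}")
```

At 0.2 nearly all probability mass sits on the top token, which is why low-temperature samples look almost greedy, while at 2.0 the three options are much closer together.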