# Module 17: HuggingFace Ecosystem

**The Complete Toolkit for Modern NLP**

---

## 1. Objectives

- ✅ Master HuggingFace pipelines
- ✅ Understand tokenizers in depth
- ✅ Use the `datasets` library
- ✅ Know the Hub and model sharing

## 2. Prerequisites

- [Module 16: GPT](../16_gpt/16_gpt.ipynb)

## 3. HuggingFace Libraries

| Library | Purpose |
|---------|--------|
| `transformers` | Models, tokenizers, pipelines |
| `datasets` | Data loading & processing |
| `tokenizers` | Fast tokenizers (Rust) |
| `evaluate` | Metrics (accuracy, F1, BLEU) |
| `accelerate` | Distributed training |
| `peft` | Parameter-efficient fine-tuning |

In [1]:
# Install if needed
# !pip install transformers datasets evaluate accelerate

import torch
from transformers import pipeline, AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from datasets import load_dataset

print(f"PyTorch: {torch.__version__}")



PyTorch: 2.9.0+cpu


## 4. Pipelines (Fastest Way to Use Models)

In [2]:
# Sentiment Analysis
classifier = pipeline('sentiment-analysis')
results = classifier(["I love this!", "This is terrible."])
print("Sentiment:", results)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Sentiment: [{'label': 'POSITIVE', 'score': 0.9998764991760254}, {'label': 'NEGATIVE', 'score': 0.9996345043182373}]
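
The `score` in the output above is a probability for the predicted class. As a rough sketch of the last step of a classification pipeline, the snippet below applies a softmax to raw model outputs and picks the most likely label. The logits and the label order are hypothetical values chosen for illustration, not numbers taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence, assumed to be ordered (NEGATIVE, POSITIVE)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-4.2, 4.8]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

A large gap between the two logits is what produces scores close to 1.0, like the ones printed by the pipeline.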


In [3]:
# Named Entity Recognition
ner = pipeline('ner', aggregation_strategy='simple')
text = "Elon Musk founded SpaceX in California."
entities = ner(text)
print(f"\nNER: {text}")
for e in entities:
    print(f"  {e['word']}: {e['entity_group']} ({e['score']:.2f})")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu



NER: Elon Musk founded SpaceX in California.
  Elon Musk: PER (1.00)
  SpaceX: ORG (1.00)
  California: LOC (1.00)
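
The cell above passes `aggregation_strategy='simple'`, which is why "Elon Musk" comes back as one entity rather than two tokens. The sketch below shows the general idea of merging adjacent tokens that share an entity type and averaging their scores. It is a simplified illustration with hypothetical tags and scores, and it ignores subword handling, so it should not be read as the library's exact implementation.

```python
def group_entities(tagged_tokens):
    """Merge adjacent tokens of the same entity type into one span.

    tagged_tokens: list of (word, tag, score) with tags such as 'B-PER', 'I-PER', 'O'.
    """
    groups = []
    current = None  # the entity span currently being built
    for word, tag, score in tagged_tokens:
        if tag == "O":
            current = None  # a non-entity token closes any open span
            continue
        prefix, entity = tag.split("-", 1)
        if current is not None and prefix == "I" and current["entity_group"] == entity:
            current["words"].append(word)
            current["scores"].append(score)
        else:
            current = {"entity_group": entity, "words": [word], "scores": [score]}
            groups.append(current)
    return [
        {
            "entity_group": g["entity_group"],
            "word": " ".join(g["words"]),
            "score": sum(g["scores"]) / len(g["scores"]),  # mean of the token scores
        }
        for g in groups
    ]

# Hypothetical per-token predictions for the sentence used above
tagged = [
    ("Elon", "B-PER", 0.99), ("Musk", "I-PER", 0.99), ("founded", "O", 0.99),
    ("SpaceX", "B-ORG", 0.98), ("in", "O", 0.99), ("California", "B-LOC", 0.99),
    (".", "O", 0.99),
]
grouped = group_entities(tagged)
for g in grouped:
    print(f"{g['word']}: {g['entity_group']} ({g['score']:.2f})")
```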


In [4]:
# Text Generation
generator = pipeline('text-generation', model='gpt2')
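# Note: in this run the pipeline's default max_new_tokens (256) takes precedence over
# max_length, as the warning below reports, so the output is longer than 30 tokens.
# Pass max_new_tokens=30 instead of max_length to cap the number of generated tokens.
output = generator("The future of AI is", max_length=30, do_sample=True)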
print("\nGeneration:", output[0]['generated_text'])

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



Generation: The future of AI is in the hands of a single leader - not some kind of AI.

The question is, how do we build it? I'll spare you the details, but let me give you a quick overview: AI is probably the most complex and successful artificial intelligence technology ever devised. In a sense, it's the most complete and efficient.

AI was invented by John Searle, who created a computer program called DeepDiscovery in 1965. That program, which enabled computers to learn, understand, and even create new things, was used in several industries, including medical research, automotive engineering, and government.

What's more, it's also the most powerful machine ever created by a computer. A machine with 10,000 neurons can learn from 500,000 neurons of a single neuron. It's also a computer that can do a task at speeds up to 250 times faster than humans.

It's also a computer that can learn a lot from other machines, such as other humans, but it's not as powerful as the human brain.

I
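
The generated text above runs far past 30 tokens because, as the warning in the log states, `max_new_tokens` took precedence over `max_length`. The sketch below spells out the difference between the two limits: `max_length` counts prompt plus generated tokens, while `max_new_tokens` counts generated tokens only. The helper `new_token_budget` and the prompt length of 5 are hypothetical, used purely to show the arithmetic.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Return how many tokens may be generated after the prompt.

    max_new_tokens limits generated tokens only.
    max_length limits prompt tokens plus generated tokens.
    When both are set, max_new_tokens wins (matching the warning in the log).
    """
    if max_new_tokens is not None:
        return max_new_tokens
    if max_length is not None:
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_length or max_new_tokens")

prompt_len = 5  # assumed token count for the prompt, for illustration only

only_max_length = new_token_budget(prompt_len, max_length=30)
both_set = new_token_budget(prompt_len, max_length=30, max_new_tokens=256)
print(only_max_length)  # 25
print(both_set)         # 256
```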

In [5]:
# Question Answering
qa = pipeline('question-answering')
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France."
)
print(f"\nQA: {result['answer']} (score: {result['score']:.2f})")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu



QA: Paris (score: 0.99)
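
The answer above is extracted from the context rather than generated. A common way extractive QA models do this is to score every token as a possible answer start and answer end, then pick the best valid span. The sketch below illustrates that span search with hypothetical scores; the real pipeline's scoring and post-processing differ in detail.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end) indices maximizing start score + end score, with start <= end."""
    best_score, best_start, best_end = float("-inf"), 0, 0
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best_start, best_end = score, s, e
    return best_start, best_end

# Hypothetical per-token scores for the context used above
tokens = ["Paris", "is", "the", "capital", "and", "largest", "city", "of", "France", "."]
start_scores = [5.0, 0.1, 0.2, 0.3, 0.0, 0.1, 0.2, 0.0, 1.0, 0.0]
end_scores = [4.5, 0.0, 0.1, 0.8, 0.0, 0.1, 0.9, 0.0, 1.5, 0.0]

s, e = best_span(start_scores, end_scores)
answer = " ".join(tokens[s:e + 1])
print(answer)  # Paris
```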


### Common Pipelines

| Task | Pipeline Name |
|------|---------------|
| Classification | `text-classification`, `sentiment-analysis` |
| NER | `ner`, `token-classification` |
| QA | `question-answering` |
| Generation | `text-generation`, `text2text-generation` |
| Summarization | `summarization` |
| Translation | `translation` |
| Fill Mask | `fill-mask` |
| Zero-shot | `zero-shot-classification` |

## 5. Tokenizers Deep Dive

In [6]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Hello, how are you doing today?"

# Different encoding methods
print("1. Basic tokenize:")
tokens = tokenizer.tokenize(text)
print(f"   {tokens}")

print("\n2. Encode (token IDs):")
ids = tokenizer.encode(text)
print(f"   {ids}")

print("\n3. Full encoding (with attention mask):")
encoded = tokenizer(text, return_tensors='pt')
print(f"   input_ids: {encoded['input_ids']}")
print(f"   attention_mask: {encoded['attention_mask']}")

1. Basic tokenize:
   ['hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']

2. Encode (token IDs):
   [101, 7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029, 102]

3. Full encoding (with attention mask):
   input_ids: tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029,  102]])
   attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
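
Every word in the example sentence happens to be in the vocabulary, so no subword pieces appear in the output above. BERT-style tokenizers use WordPiece, which splits unknown words into known pieces marked with `##`. The sketch below is a toy greedy longest-match splitter over a tiny hypothetical vocabulary; it only illustrates the idea and skips normalization, punctuation handling, and everything else the real tokenizer does.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny hypothetical vocabulary, not the real bert-base-uncased vocab
vocab = {"hello", "token", "##ize", "##izer", "##s", "play", "##ing"}

print(wordpiece("tokenizers", vocab))  # ['token', '##izer', '##s']
print(wordpiece("playing", vocab))     # ['play', '##ing']
print(wordpiece("xyz", vocab))         # ['[UNK]']
```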


In [7]:
# Batch encoding with padding
texts = ["Short text.", "This is a much longer text that needs padding."]

batch = tokenizer(
    texts,
    padding=True,          # Pad to longest
    truncation=True,       # Truncate if too long
    max_length=20,         # Max sequence length
    return_tensors='pt'    # Return PyTorch tensors
)

print("Batch encoding:")
print(f"  input_ids shape: {batch['input_ids'].shape}")
print(f"  attention_mask: {batch['attention_mask']}")

Batch encoding:
  input_ids shape: torch.Size([2, 13])
  attention_mask: tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
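
In the output above, the short sentence has five real tokens followed by eight padded positions. The sketch below reproduces that padding logic by hand on hypothetical token IDs, to show how `input_ids` and `attention_mask` relate. The helper `pad_batch` is illustrative only, and its truncation is naive (the real tokenizer keeps the closing special token).

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad ID lists to the longest sequence and build the matching attention mask."""
    if max_length is not None:
        # Naive truncation for illustration; it may cut off the final [SEP]
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical IDs: 101 = [CLS], 102 = [SEP], everything else is a placeholder
padded = pad_batch([
    [101, 11, 12, 13, 102],
    [101, 21, 22, 23, 24, 25, 26, 102],
])
print(padded["input_ids"][0])       # [101, 11, 12, 13, 102, 0, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 1, 1, 0, 0, 0]
```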


In [8]:
# Special tokens
print("Special tokens:")
print(f"  [CLS] = {tokenizer.cls_token} (id: {tokenizer.cls_token_id})")
print(f"  [SEP] = {tokenizer.sep_token} (id: {tokenizer.sep_token_id})")
print(f"  [PAD] = {tokenizer.pad_token} (id: {tokenizer.pad_token_id})")
print(f"  [UNK] = {tokenizer.unk_token} (id: {tokenizer.unk_token_id})")
print(f"  [MASK] = {tokenizer.mask_token} (id: {tokenizer.mask_token_id})")

Special tokens:
  [CLS] = [CLS] (id: 101)
  [SEP] = [SEP] (id: 102)
  [PAD] = [PAD] (id: 0)
  [UNK] = [UNK] (id: 100)
  [MASK] = [MASK] (id: 103)
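
The special tokens matter most when two texts are encoded together, for example a question and a context. As a sketch of the BERT layout `[CLS] A [SEP] B [SEP]`, the snippet below assembles a pair by hand and builds the matching `token_type_ids`. The IDs 101 and 102 come from the output above; the other token IDs are hypothetical placeholders.

```python
CLS_ID, SEP_ID = 101, 102  # IDs printed above for bert-base-uncased

def build_pair(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] and mark which segment each token belongs to."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers B and the last [SEP]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

# Placeholder IDs standing in for two short texts
pair_ids, pair_types = build_pair([11, 12], [21, 22, 23])
print(pair_ids)    # [101, 11, 12, 102, 21, 22, 23, 102]
print(pair_types)  # [0, 0, 0, 0, 1, 1, 1, 1]
```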


## 6. Datasets Library

In [9]:
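# Load a dataset from the Hub
# Note: the IMDB train split appears to be ordered by label, so this slice is likely
# all label 0; shuffle before selecting rows if you plan to train on it.
dataset = load_dataset('imdb', split='train[:1000]')  # First 1000 samples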

print(f"Dataset: {dataset}")
print(f"\nSample:")
print(f"  Text: {dataset[0]['text'][:100]}...")
print(f"  Label: {dataset[0]['label']}")

Dataset: Dataset({
    features: ['text', 'label'],
    num_rows: 1000
})

Sample:
  Text: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it w...
  Label: 0


In [10]:
# Preprocessing with map
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['text']  # Remove original text column
)

print(f"Tokenized columns: {tokenized_dataset.column_names}")

Tokenized columns: ['label', 'input_ids', 'token_type_ids', 'attention_mask']
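
With `batched=True`, the mapped function receives a dict of lists (one list per column) instead of a single row, and returns new columns as lists of the same length. The sketch below imitates that contract in plain Python to make the data shapes concrete. The helper `batched_map` is hypothetical and much simpler than the real `Dataset.map`, which also handles caching, column removal, and Arrow storage.

```python
def batched_map(columns, fn, batch_size=1000):
    """Apply fn to slices of a column dict and merge the returned columns.

    columns: dict mapping column name -> list of values (all the same length).
    fn: receives a dict of lists and returns a dict of new or updated lists.
    """
    n_rows = len(next(iter(columns.values())))
    out = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        merged = {**batch, **fn(batch)}  # returned columns are added or overwrite existing ones
        for name, values in merged.items():
            out.setdefault(name, []).extend(values)
    return out

def add_length(examples):
    # examples["text"] is a list of strings, one per row in the batch
    return {"n_chars": [len(text) for text in examples["text"]]}

data = {"text": ["good movie", "bad", "fine"], "label": [1, 0, 1]}
mapped = batched_map(data, add_length, batch_size=2)
print(mapped)
```

Because tokenizers accept a list of strings in one call, handing them a whole batch at a time is generally much faster than tokenizing row by row.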


In [11]:
# Convert to PyTorch format
tokenized_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

# Create DataLoader
from torch.utils.data import DataLoader
dataloader = DataLoader(tokenized_dataset, batch_size=8, shuffle=True)

batch = next(iter(dataloader))
print(f"Batch keys: {batch.keys()}")
print(f"input_ids shape: {batch['input_ids'].shape}")

Batch keys: dict_keys(['label', 'input_ids', 'attention_mask'])
input_ids shape: torch.Size([8, 128])


## 7. HuggingFace Trainer

In [13]:
!pip install evaluate accelerate

from transformers import TrainingArguments, Trainer
import evaluate
import torch
from transformers import AutoModelForSequenceClassification

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2
)

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=10,
    save_strategy='epoch',
)

# Metrics
accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

print("Trainer setup complete!")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainer setup complete!
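
The `compute_metrics` function above takes the model's raw outputs, picks the highest-scoring class per example, and compares it with the labels. The sketch below does the same computation in plain Python on hypothetical logits, so the arithmetic is visible without loading a model or the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Pick the top class per example and compute the fraction that matches the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Hypothetical logits for four examples and two classes
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 0]

metrics = accuracy_from_logits(logits, labels)
print(metrics)  # {'accuracy': 0.75}
```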


## 8. 🔥 Real-World Usage

### Workflow Pattern

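```python
# 1. Quick prototype
pipe = pipeline('sentiment-analysis')
pipe("Test text")

# 2. Custom model (checkpoint name reused from Section 7 as an example)
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# 3. Fine-tune with Trainer
#    (training_args, tokenized_dataset, and compute_metrics as defined in the cells above)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

# 4. Push to Hub (requires being logged in with a Hub token that has write access)
model.push_to_hub('my-model')
```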

### Best Practices

| Stage | Tool |
|-------|------|
| Prototype | `pipeline()` |
| Fine-tune | `Trainer` |
| Production | Convert to ONNX |

## 9. Interview Questions

**Q1: What's the difference between `encode` and `__call__`?**
<details><summary>Answer</summary>

- `encode()`: Returns only a list of token IDs
- `__call__()`: Returns a dict-like `BatchEncoding` with `input_ids`, `attention_mask`, and, for models that use them (such as BERT), `token_type_ids`
</details>

**Q2: How do you handle variable-length sequences?**
<details><summary>Answer</summary>

Pass `padding=True` and `truncation=True` to the tokenizer, and set `max_length` to cap the sequence length. The returned attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.
</details>

## 10. Summary

- **Pipelines**: One-line inference for common tasks
- **Tokenizers**: Handle encoding, padding, truncation
- **Datasets**: Efficient data loading and preprocessing
- **Trainer**: Simplified training loop with logging

## 11. References

- [HuggingFace Documentation](https://huggingface.co/docs)
- [HuggingFace Course](https://huggingface.co/course)
- [Model Hub](https://huggingface.co/models)

---
**Next:** [Module 18: Large Language Models](../18_llms/18_llms.ipynb)