 # Hugging Face Transformers: A Practical Introduction



 Welcome! This notebook will guide you through the Hugging Face ecosystem, focusing on the `transformers` library. We'll cover how to use pre-trained models for various tasks, how to fine-tune a model for text classification, and briefly touch upon other NLP tasks and sharing your work.



 **Target Audience:** Undergraduate biomedical engineering students.

 **Prior Knowledge:** Transformer architecture theory.



 Let's get started!

 ## Table of Contents



 - [1. Introduction to the Hugging Face Ecosystem](#1-introduction-to-the-hugging-face-ecosystem)

   - [1.1. The Hugging Face Hub](#11-the-hugging-face-hub)

   - [1.2. Core Libraries](#12-core-libraries)

 - [2. Using Pipelines](#2-using-pipelines)

   - [2.1. Sentiment Analysis](#21-sentiment-analysis)

   - [2.2. Text Generation](#22-text-generation)

   - [2.3. Zero-Shot Classification](#23-zero-shot-classification)

   - [2.4. Other Common Use Cases](#24-other-common-use-cases)

   - [Exercise 1: Translation with Pipelines](#exercise-1-translation-with-pipelines)

 - [3. Fine-tuning for Text Classification](#3-fine-tuning-for-text-classification)

   - [3.1. Load Dataset](#31-load-dataset)

   - [3.2. Preprocessing with a Tokenizer](#32-preprocessing-with-a-tokenizer)

   - [3.3. Load Pretrained Model](#33-load-pretrained-model)

   - [3.4. Define Training Arguments](#34-define-training-arguments)

   - [3.5. Define Metrics](#35-define-metrics)

   - [3.6. Initialize Trainer](#36-initialize-trainer)

   - [3.7. Train the Model](#37-train-the-model)

   - [3.8. Evaluate the Model](#38-evaluate-the-model)

   - [3.9. Brief Mention of Token Classification](#39-brief-mention-of-token-classification)

   - [Exercise 2: Fine-tune on a Different Dataset](#exercise-2-fine-tune-on-a-different-dataset)

 - [4. Other NLP Tasks](#4-other-nlp-tasks)

   - [4.1. Question Answering](#41-question-answering)

   - [4.2. Summarization](#42-summarization)

   - [4.3. Masked Language Modeling (Fill-Mask)](#43-masked-language-modeling-fill-mask)

   - [Exercise 3: Question Answering](#exercise-3-question-answering)

 - [5. Model Sharing & Demos](#5-model-sharing--demos)

   - [5.1. Sharing Models on the Hugging Face Hub](#51-sharing-models-on-the-hugging-face-hub)

   - [5.2. Creating Simple Demos with Gradio](#52-creating-simple-demos-with-gradio)

   - [Exercise 4: Gradio for Question Answering](#exercise-4-gradio-for-question-answering)

 - [Conclusion](#conclusion)

 ## Setup



 First, let's install the necessary libraries. If you're running this in Google Colab, some of these might already be installed.

In [15]:
# It's good practice to install specific versions for reproducibility,
# but for a general intro, the latest stable versions are usually fine.
# !pip install transformers datasets evaluate torch accelerate gradio -q

# optional to download the models faster
# !pip install huggingface_hub[hf_xet]
# !pip install hf_xet

# sometimes needed
# !pip install scikit-learn


 ## 1. Introduction to the Hugging Face Ecosystem



 The Hugging Face ecosystem is a collection of tools and resources designed to make state-of-the-art machine learning, especially Natural Language Processing (NLP), accessible to everyone.



 ![HF Ecosystem](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/summary.svg)

 *(Image Source: Hugging Face NLP Course)*



 ### 1.1. The Hugging Face Hub



 The [Hugging Face Hub](https://huggingface.co/) is a central platform where the community shares:

 - **Models**: Thousands of pre-trained models for various tasks (NLP, Computer Vision, Audio).

 - **Datasets**: A vast collection of datasets for training and evaluation.

 - **Spaces**: Interactive demos of ML models.



 You can browse, download, and contribute to the Hub. We'll be using it extensively.



 ### 1.2. Core Libraries



 The ecosystem revolves around several key open-source libraries:



 - **`transformers`**:

   - Provides access to thousands of pre-trained Transformer models (like BERT, GPT-2, T5, etc.).

   - Offers a unified API for loading models, tokenizers, and using them for inference and fine-tuning.

   - Simplifies complex architectures into easy-to-use classes.



 - **`datasets`**:

   - A library for easily accessing and processing datasets, especially large ones.

   - Provides tools for downloading, caching, mapping (preprocessing), and evaluating datasets.

   - Integrates seamlessly with `transformers` and other ML frameworks like PyTorch and TensorFlow.



 - **`tokenizers`**:

   - Offers highly optimized (Rust-backed) tokenizer implementations.

   - Handles the conversion of raw text into numerical inputs that models can understand.

   - Supports various tokenization strategies like BPE, WordPiece, and Unigram.



 - **`evaluate`**:

   - A library for easily evaluating machine learning models and datasets.

   - Provides implementations for many common metrics (e.g., accuracy, F1, BLEU, ROUGE).



 - **`accelerate`**:

   - Simplifies running PyTorch training scripts on any distributed configuration (single GPU, multiple GPUs, TPUs).

   - Requires minimal changes to your existing PyTorch code.



 These libraries work together to provide a comprehensive toolkit for your NLP projects.
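
To make the tokenization step more concrete, here is a toy sketch of WordPiece-style splitting (greedy, longest match first). The vocabulary and the `wordpiece_tokenize` helper are invented for illustration; the real `tokenizers` library is far more optimized and also handles normalization, special tokens, and learned vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, invented for this example.
toy_vocab = {"bio", "##medical", "##med", "##ical", "engineer", "##ing", "[UNK]"}

tokens = [piece
          for word in "biomedical engineering".split()
          for piece in wordpiece_tokenize(word, toy_vocab)]

# Map each piece to an integer id, as a tokenizer would before feeding a model.
token_to_id = {tok: i for i, tok in enumerate(sorted(toy_vocab))}
input_ids = [token_to_id[tok] for tok in tokens]

print(tokens)
print(input_ids)
```

The integer ids are what a model actually consumes; the subword pieces are only an intermediate, human-readable view.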

 ## 2. Using Pipelines



 The `pipeline` function from the `transformers` library is the easiest way to use pre-trained models for inference. It abstracts away most of the preprocessing and postprocessing steps.



 ![NLP Pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

 *(Image Source: Hugging Face NLP Course)*

In [2]:
from transformers import pipeline
import pandas as pd


 ### 2.1. Sentiment Analysis



 Let's start with a sentiment analysis pipeline. This will classify a piece of text as positive or negative.

In [None]:
sentiment_classifier = pipeline("sentiment-analysis")
results = sentiment_classifier(["I love using Hugging Face!", "This is not what I expected."])
pd.DataFrame(results)
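
The `score` that a classification pipeline reports is typically a softmax probability computed from the model's raw outputs (logits). The sketch below shows that postprocessing step in isolation. The logits and the two-entry `id2label` mapping are invented for illustration and this is not the pipeline's actual internal code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical model output for one sentence, invented for illustration.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-2.0, 2.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
prediction = {"label": id2label[best], "score": probs[best]}

print(prediction)
```

A score close to 1 means the model strongly prefers one label over the others. It does not guarantee that the prediction is correct.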


 ### 2.2. Text Generation



 Next, let's try generating some text. This pipeline uses a model like GPT-2 to complete a given prompt.

In [None]:
from transformers import set_seed
import textwrap

set_seed(42) # For reproducible results

text_generator = pipeline("text-generation", model="gpt2") # You can specify other models like "distilgpt2"
prompt = "Biomedical engineering is a field that"
generated_texts = text_generator(prompt, max_length=50, num_return_sequences=1)

print("\n" + "="*50)
print("GENERATED TEXT:")
print("="*50)

generated_text = generated_texts[0]['generated_text']
wrapped_text = textwrap.fill(generated_text, width=60)
print(wrapped_text)

print("="*50)
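
Why does the cell above call `set_seed`? When a generation model samples its next token instead of always taking the most likely one, the output depends on a random number generator. The toy sketch below uses an invented next-token distribution and Python's `random` module (not the `transformers` generation code) to show that fixing the seed makes sampling repeatable.

```python
import random

# Hypothetical next-token distribution, invented for illustration.
next_token_probs = {"combines": 0.5, "applies": 0.3, "studies": 0.2}


def sample_tokens(probs, n, seed):
    """Draw n tokens from a categorical distribution using a seeded generator."""
    rng = random.Random(seed)
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=n)


run_a = sample_tokens(next_token_probs, n=5, seed=42)
run_b = sample_tokens(next_token_probs, n=5, seed=42)

# Greedy decoding needs no seed: it always picks the most likely token.
greedy = max(next_token_probs, key=next_token_probs.get)

print(run_a)
print(run_a == run_b)  # same seed, same draws
print(greedy)
```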


 ### 2.3. Zero-Shot Classification



 This pipeline allows you to classify text into categories you define on the fly, without needing to fine-tune a model specifically for those categories.

In [None]:
zero_shot_classifier = pipeline("zero-shot-classification")
sequence_to_classify = "This new medical device can detect heart anomalies."
candidate_labels = ["cardiology", "oncology", "neurology", "pediatrics"]
results = zero_shot_classifier(sequence_to_classify, candidate_labels)
pd.DataFrame(results)


 ### 2.4. Other Common Use Cases



 Pipelines support many other tasks, including:

 - **Named Entity Recognition (NER)**: `pipeline("ner")` - Identifies entities like persons, organizations, locations.

 - **Question Answering**: `pipeline("question-answering")` - Extracts answers from a given context.

 - **Translation**: `pipeline("translation_en_to_fr")` - Translates text between languages.

 - **Fill-Mask (Masked Language Modeling)**: `pipeline("fill-mask")` - Fills in masked words in a sentence.



 You can find more tasks and models on the [Hugging Face Hub](https://huggingface.co/models).

 ### Exercise 1: Translation with Pipelines



 **Task:** Use a pipeline to translate the following English sentence into French: "Hugging Face is making NLP easy."



 **Hint:** You'll need to specify the task and a model suitable for English-to-French translation (e.g., `google-t5/t5-base`).

 #### Student Solution for Exercise 1

In [None]:
# Exercise 1: Student Code Cell
# Your code here



 ## 3. Fine-tuning for Text Classification

![](https://raw.githubusercontent.com/nlp-with-transformers/notebooks/0cb211095b4622fa922f80fbdc9d83cc5d9e0c34/images/chapter01_transfer-learning.png)



![](https://raw.githubusercontent.com/nlp-with-transformers/notebooks/0cb211095b4622fa922f80fbdc9d83cc5d9e0c34/images/chapter02_encoder-fine-tuning.png)

 While pre-trained models are powerful, fine-tuning them on a specific dataset can significantly improve performance for your particular task. We'll walk through fine-tuning DistilBERT for emotion classification on the `emotion` dataset.



 This dataset contains tweets classified into six basic emotions: anger, fear, joy, love, sadness, and surprise.

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate # The new library for metrics


 ### 3.1. Load Dataset



 We'll use the `emotion` dataset from the Hugging Face Hub.

In [None]:
raw_datasets = load_dataset("dair-ai/emotion")  # full Hub id of the `emotion` dataset
print(raw_datasets)


 ### 3.2. Preprocessing with a Tokenizer



 We need to convert the text inputs into numerical representations (tokens) that the model can understand. We'll use the `AutoTokenizer` to load the tokenizer associated with `distilbert-base-uncased`.

In [None]:
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


 Create a tokenization function:

In [5]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


 Apply the tokenization function to all splits of the dataset using `map`:

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
print(tokenized_datasets)


**Attention masks** are binary arrays (0s and 1s) that tell Transformer models which tokens are real content (1) and which are padding (0). Sequences in a batch must all have the same length, so shorter texts are padded. With `padding="max_length"`, as used in the tokenization function above, every text is padded to the tokenizer's maximum length (512 tokens for `distilbert-base-uncased`); with dynamic padding, texts are padded only to the longest sequence in each batch. The attention mask makes the model ignore the padding tokens so that they do not affect the representation of the actual content. Hugging Face tokenizers create the mask automatically alongside `input_ids`.
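
The relationship between padding and the attention mask can be sketched in a few lines of plain Python. The `pad_batch` helper and the token ids below are invented for illustration; a real tokenizer does this for you.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad token-id lists to a common length and build matching attention masks."""
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]  # truncate anything longer than the target length
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids, invented for illustration.
batch = [[101, 1045, 2514, 102], [101, 2204, 102]]

longest = pad_batch(batch)              # pad to the longest sequence in the batch
fixed = pad_batch(batch, max_length=6)  # pad every sequence to a fixed length

print(longest["attention_mask"])
print(fixed["attention_mask"])
```

Padding to a fixed maximum length is simple but wastes computation on short texts. Padding to the longest sequence in each batch keeps the tensors smaller.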


In [None]:
# Let's examine two examples from the tokenized training set
print("Example 1:")
print(f"Text: {tokenized_datasets['train'][0]['text']}")
print(f"Label: {tokenized_datasets['train'][0]['label']}")
print(f"Input IDs (first 20): {tokenized_datasets['train'][0]['input_ids'][:20]}")
print(f"Attention Mask (first 20): {tokenized_datasets['train'][0]['attention_mask'][:20]}")

print("\nExample 2:")
print(f"Text: {tokenized_datasets['train'][1]['text']}")
print(f"Label: {tokenized_datasets['train'][1]['label']}")
print(f"Input IDs (first 20): {tokenized_datasets['train'][1]['input_ids'][:20]}")
print(f"Attention Mask (first 20): {tokenized_datasets['train'][1]['attention_mask'][:20]}")


 ### 3.3. Load Pretrained Model



 We load `distilbert-base-uncased` with a sequence classification head. We need to specify the number of labels.

In [None]:
num_labels = raw_datasets["train"].features["label"].num_classes
print(f"Number of labels: {num_labels}")

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
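
By default the classification head only knows its labels as integers. A small sketch of building readable label mappings follows. The label order is an assumption hard-coded for this example; in the notebook you would read it from `raw_datasets["train"].features["label"].names`.

```python
# Assumed label order for the emotion dataset (hard-coded for illustration only).
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

id2label = dict(enumerate(label_names))                   # 0 -> "sadness", 1 -> "joy", ...
label2id = {name: idx for idx, name in id2label.items()}  # "sadness" -> 0, ...

print(id2label)
print(label2id["joy"])
```

Passing `id2label=id2label, label2id=label2id` to `from_pretrained` stores the mappings in the model configuration, so predictions show emotion names instead of generic identifiers such as `LABEL_0`.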


#### Log in to Hugging Face

In [None]:
# Log in to Hugging Face (optional, but recommended for saving models)
from huggingface_hub import notebook_login

# Uncomment the line below to log in to Hugging Face.
# This will allow you to save your trained model to the Hugging Face Hub.
# notebook_login()

print("Note: If you want to save your model to the Hugging Face Hub, uncomment the notebook_login() line above.")


 ### 3.4. Define Training Arguments



 `TrainingArguments` is a class that contains all the hyperparameters for training and evaluation.

In [11]:
# To push your model to the Hub, log in first with notebook_login() from huggingface_hub.

training_args = TrainingArguments(
    output_dir="my_awesome_emotion_model", # Directory to save the model
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1, # 1 epoch keeps the demo quick; 2-4 epochs usually give better results
    weight_decay=0.01,
    eval_strategy="epoch", # Evaluate at the end of each epoch
    save_strategy="epoch",       # Save at the end of each epoch
    load_best_model_at_end=True,
    # push_to_hub=True, # Uncomment if you logged in and want to push
    # hub_model_id="your_username/my_awesome_emotion_model" # Uncomment and set your username
)


 ### 3.5. Define Metrics



 We'll use the `evaluate` library to compute accuracy and F1-score.

In [None]:
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="weighted") # "weighted" averages per-class F1 by class frequency, which suits imbalanced datasets
    return {"accuracy": acc["accuracy"], "f1": f1["f1"]}
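
To see what `average="weighted"` means, the sketch below computes accuracy and weighted F1 by hand on a tiny invented example. It is a plain-Python illustration of the definition, not the implementation used by the `evaluate` library.

```python
def f1_per_class(predictions, references, label):
    """F1 score for a single class, treating it as the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == label and r == label for p, r in pairs)
    fp = sum(p == label and r != label for p, r in pairs)
    fn = sum(p != label and r == label for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(predictions, references):
    """Average the per-class F1 scores, weighted by how often each class occurs."""
    total = len(references)
    return sum(
        f1_per_class(predictions, references, label) * references.count(label) / total
        for label in sorted(set(references))
    )


# Invented toy data: class 0 is twice as frequent as class 1.
references = [0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 0, 1]

accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(accuracy, weighted_f1(predictions, references))
```

Here the rare class is only half recovered, which lowers its F1 score. The weighted average still leans toward the frequent class, so inspecting per-class scores as well is often worthwhile.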


 ### 3.6. Initialize Trainer



 The `Trainer` class handles the training and evaluation loop.

In [None]:
# For demonstration, let's take smaller subsets of the data
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) # 1000 samples for training
small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(200)) # 200 samples for evaluation


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset, # Use tokenized_datasets["train"] for full training
    eval_dataset=small_eval_dataset,   # Use tokenized_datasets["validation"] for full evaluation
    tokenizer=tokenizer,  # recent transformers releases name this argument processing_class
    compute_metrics=compute_metrics,
)
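
It helps to know how many optimizer steps to expect before starting a run. The sketch below is a back-of-the-envelope estimate using the subset size and batch size from the cells above. The `optimizer_steps` helper is hypothetical, and the size of the full training split is an assumption; the exact count reported by `Trainer` can differ slightly when gradient accumulation is used.

```python
import math


def optimizer_steps(num_samples, per_device_batch_size, num_epochs,
                    num_devices=1, gradient_accumulation_steps=1):
    """Estimate the total number of optimizer steps for a training run."""
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * num_epochs


print(optimizer_steps(1000, 16, 1))   # the 1000-sample subset, batch size 16, 1 epoch
print(optimizer_steps(16000, 16, 3))  # assumed full training split size, 3 epochs
```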


 ### 3.7. Train the Model



 Now, we can start the fine-tuning process.

In [None]:
# This will take a few minutes depending on your hardware.
# If running on CPU, it will be much slower.
# Consider using Google Colab with a GPU for faster training.
trainer.train()


 ### 3.8. Evaluate the Model



 After training, evaluate the model on the validation set.

In [None]:
eval_results = trainer.evaluate()
print(eval_results)


 ### 3.9. Brief Mention of Token Classification



 **Token Classification** is another important task where each token in a sentence is assigned a label.

 - **Named Entity Recognition (NER)**: Labels tokens as part of entities (e.g., Person, Organization, Location).

 - **Part-of-Speech (POS) Tagging**: Labels tokens with their grammatical role (e.g., Noun, Verb, Adjective).



 Fine-tuning for token classification is similar to sequence classification, but:

 1. You use `AutoModelForTokenClassification`.

 2. The labels are sequences of tags, one for each token.

 3. Special care is needed to align labels with tokens after subword tokenization (as some words might be split into multiple tokens). The `word_ids()` method from a fast tokenizer is helpful here.

 4. Metrics like `seqeval` are used for evaluation.



 **Data Collator for Token Classification:**

 For tasks like NER or POS tagging, a specialized data collator `DataCollatorForTokenClassification` is used. This collator correctly pads not only the input IDs and attention mask, but also the labels. It ensures that labels for padding tokens are set to a special value (often -100) so they are ignored by the loss function. This is crucial because, unlike sequence classification where you have one label per sequence, here you have a label for each token, and the padding needs to be handled consistently across inputs and labels.
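
The label alignment described in point 3 can be sketched without any library. The `word_ids` list below mimics what a fast tokenizer's `word_ids()` method returns (`None` for special tokens, otherwise the index of the source word). The sentence, tokens, and tag ids are invented, and labeling only the first piece of each word is one common convention among several.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def align_labels(word_ids, word_labels):
    """Map one label per word onto one label per token.

    Special tokens (word id None) and non-first subword pieces get IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # [CLS], [SEP], or padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a new word
        else:
            aligned.append(IGNORE_INDEX)          # continuation piece of the same word
        previous = word_id
    return aligned


# Hypothetical example: "Marie Curie discovered polonium"
# tokens: [CLS] marie curie discovered pol ##onium [SEP]
word_ids = [None, 0, 1, 2, 3, 3, None]
word_labels = [1, 2, 0, 0]  # assumed tag ids: 0 = O, 1 = B-PER, 2 = I-PER

print(align_labels(word_ids, word_labels))
```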

 ### Exercise 2: Fine-tune on a Different Dataset



 **Task:** Modify the fine-tuning script above to use the `sst2` (Stanford Sentiment Treebank) dataset from the GLUE benchmark. This is a binary sentiment classification task (positive/negative).



 **Hints:**

 1. Load the `sst2` dataset: `raw_datasets = load_dataset("glue", "sst2")`.

 2. The text column in `sst2` is named `"sentence"`.

 3. The `sst2` dataset has 2 labels.

 4. You might need to adjust batch sizes or number of samples for quicker execution.

 #### Student Solution for Exercise 2

In [None]:
# Exercise 2: Student Code Cell
# Your code here. Try to load "glue", "sst2", adapt the tokenize_function,
# num_labels, and then train and evaluate.
# For a quick run, you can use small subsets like:
# small_sst2_train_dataset = tokenized_sst2_datasets["train"].shuffle(seed=42).select(range(200))
# small_sst2_eval_dataset = tokenized_sst2_datasets["validation"].shuffle(seed=42).select(range(50))



 ## 4. Other NLP Tasks



 Transformers excel at a wide range of NLP tasks. Here's a quick look at a few more, using the `pipeline` for simplicity.



 ### 4.1. Question Answering



 Extractive Question Answering models find the answer to a question within a given text (context).

In [None]:
from transformers import pipeline

qa_pipeline = pipeline("question-answering")
context = """
Hugging Face is a company developing tools for machine learning.
It is famous for its Transformers library, which provides thousands of pre-trained models.
The main office is in New York City.
"""
question = "Where is Hugging Face's main office?"
answer = qa_pipeline(question=question, context=context)
print(f"Question: {question}")
print(f"Answer: {answer['answer']} (Score: {answer['score']:.4f})")
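
Under the hood, an extractive QA model scores every token as a possible answer start and as a possible answer end, and the answer is the best-scoring valid span. The sketch below illustrates that selection step with invented tokens and logits. The `best_span` helper is hypothetical, and the real pipeline adds further details such as converting scores to probabilities and mapping tokens back to characters.

```python
def best_span(start_logits, end_logits, max_answer_len=5):
    """Return (start, end, score) for the highest-scoring valid answer span."""
    best = None
    for start, s_logit in enumerate(start_logits):
        # Only consider spans that end at or after the start and are not too long.
        for end in range(start, min(start + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical tokens and logits, invented for illustration.
tokens = ["the", "main", "office", "is", "in", "new", "york", "city", "."]
start_logits = [0.1, 0.2, 0.3, 0.0, 0.5, 4.0, 1.0, 0.5, 0.0]
end_logits = [0.0, 0.1, 0.4, 0.0, 0.1, 0.5, 2.0, 3.5, 0.2]

start, end, score = best_span(start_logits, end_logits)
answer_text = " ".join(tokens[start:end + 1])
print(answer_text)
```

Because the answer is always a span copied from the context, an extractive model cannot produce an answer that is not literally present in the text.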


 ### 4.2. Summarization



 Summarization models generate a shorter version of a long text while preserving key information.

In [None]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6") # Using a specific smaller model for faster demo
long_text = """
Biomedical engineering (BME) or medical engineering is the application of engineering principles and design concepts
to medicine and biology for healthcare purposes (e.g., diagnostic or therapeutic). 
BME combines engineering with the medical and biological sciences to advance health care treatment, including diagnosis,
monitoring, and therapy. Also included under the scope of a biomedical engineer is the management 
of current medical equipment within hospitals while adhering to relevant industry standards. 
This involves making equipment recommendations, procurement, routine testing, and preventive maintenance, 
a role also known as a Biomedical Equipment Technician (BMET) or as clinical engineering.
"""
summary = summarizer(long_text, max_length=45, min_length=10, do_sample=False)
print("Original text length (characters):", len(long_text))
print("Summary:", summary[0]['summary_text'])
print("Summary length (characters):", len(summary[0]['summary_text']))


 ### 4.3. Masked Language Modeling (Fill-Mask)



 These models predict masked (hidden) tokens in a sentence.

In [None]:
fill_masker = pipeline("fill-mask", model="distilbert-base-uncased")
masked_sentence = "Biomedical engineers design [MASK] and systems."
predictions = fill_masker(masked_sentence, top_k=3)
for pred in predictions:
    print(f"{pred['sequence']} (Score: {pred['score']:.4f})")


 ### Exercise 3: Question Answering



 **Task:** Use the question answering pipeline to find the answer to "What is the capital of France?" using the following context: "France is a country in Western Europe. Its capital and largest city is Paris."

 #### Student Solution for Exercise 3

In [None]:
# Exercise 3: Student Code Cell
# Your code here



 ## 5. Model Sharing & Demos



 Sharing your models and creating interactive demos is crucial for collaboration and showcasing your work.



 ### 5.1. Sharing Models on the Hugging Face Hub



 The Hugging Face Hub makes it easy to share your fine-tuned models.



 **Method 1: Using `TrainingArguments`**

   - Set `push_to_hub=True` in `TrainingArguments`.

   - Optionally, set `hub_model_id="your_username/your_model_name"`.

   - You need to be logged in (`huggingface-cli login` or `notebook_login()`).

   - The `Trainer` will automatically upload your model during/after training.



 **Method 2: Manual Push**

   - After training, you can use:

     ```python

     # Assuming `model` and `tokenizer` are your trained objects

     model.push_to_hub("your_username/your_model_name")

     tokenizer.push_to_hub("your_username/your_model_name")

     ```

   - This creates a repository on the Hub and uploads the model files.



 **Model Cards:**

 It's important to create a "model card" (a `README.md` file in your model repository) that describes your model, its training data, intended uses, limitations, and biases. When the `Trainer` pushes a model to the Hub, it generates a basic card that you should review and extend.



 ![Model Card](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter4/model_card.png)

 *(Image Source: Hugging Face NLP Course)*

 ### 5.2. Creating Simple Demos with Gradio



 [Gradio](https://www.gradio.app/) is a Python library that allows you to quickly create web-based UIs for your machine learning models.



 ![Gradio Demo](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter9/gradio-demo-overview.png)

 *(Image Source: Hugging Face NLP Course)*



 Here's a very simple example using a pre-trained sentiment classifier. If you fine-tuned and saved your own model, you can swap it in.

In [None]:
import gradio as gr

# Let's use the pre-trained sentiment pipeline for this demo.
# If you fine-tuned and saved my_awesome_emotion_model (e.g. with trainer.save_model()), you could use that instead:
# sentiment_pipeline_gradio = pipeline("sentiment-analysis", model="my_awesome_emotion_model")
sentiment_pipeline_gradio = pipeline("sentiment-analysis")


def predict_sentiment(text):
    # top_k=None asks the pipeline for a score for every label, not just the best one
    results = sentiment_pipeline_gradio(text, top_k=None)
    # Gradio's Label component expects a dictionary of labels to scores
    return {res['label']: res['score'] for res in results}

# Create the Gradio interface
iface = gr.Interface(
    fn=predict_sentiment,
    inputs=gr.Textbox(lines=2, placeholder="Enter text here..."),
    outputs=gr.Label(num_top_classes=2), # Show top 2 classes (POSITIVE, NEGATIVE)
    title="Sentiment Analyzer",
    description="Enter some text and see its predicted sentiment."
)

# Launch the demo
# iface.launch() # This will launch in a new tab or inline if in a notebook.
# For this script, we'll just show how to define it.
# To run it, you'd typically save this script and run `python your_script_name.py`
# or run the cell in a Jupyter notebook.
print("Gradio interface defined. Call iface.launch() to run it.")


 **Running the Gradio Demo:**

 - If you run `iface.launch()`, Gradio will start a local web server.

 - You can access the demo in your browser (usually at `http://127.0.0.1:7860`).

 - You can also easily share a temporary live link by setting `share=True` in `launch()`.

 - For permanent hosting, you can use [Hugging Face Spaces](https://huggingface.co/spaces).

In [None]:
iface.launch(share=True)

 ### Exercise 4: Gradio for Question Answering



 **Task:** Create a simple Gradio interface for the question answering pipeline.

 It should take two text inputs: `context` and `question`, and output the `answer` as text.



 **Hints:**

 1. Your prediction function will take `context` and `question` as arguments.

 2. The `inputs` for `gr.Interface` will be a list of two `gr.Textbox` components.

 3. The `outputs` will be a single `gr.Textbox`.

 #### Student Solution for Exercise 4

In [None]:
# Exercise 4: Student Code Cell
# Your code here



 ## Conclusion



 You've now had a practical introduction to the Hugging Face ecosystem!

 We've covered:

 - The main libraries: `transformers`, `datasets`, `tokenizers`, `evaluate`, `accelerate`.

 - Using `pipeline` for quick inference on various tasks.

 - The process of fine-tuning a Transformer model for text classification, including data loading, tokenization, training, and evaluation.

 - A brief look at other NLP tasks like Question Answering and Summarization.

 - How to share your models on the Hub and create simple demos with Gradio.



 This foundation should help you explore more advanced topics and apply Transformers to your own projects in biomedical engineering and beyond! Remember that the Hugging Face Hub and documentation are your best friends for finding models, datasets, and learning more.



 Happy transforming!

### Quick minimal scripts for NLP

https://github.com/muellerzr/minimal-trainer-zoo