# **Hugging Face Basics**

**Date:** February 5th, Day 3

---

## 1. Introduction and Objective

**Introduction:**  
Hugging Face's `transformers` library has become a go-to toolkit for working with large language models (LLMs). The surrounding ecosystem provides pre-trained models (such as GPT-2, BERT, and T5), tokenizers, easy-to-use pipelines, and a Model Hub with thousands of community-contributed models. Together, these simplify prototyping and experimentation with LLMs and other NLP tasks.

**Objective:**  
- To introduce the Hugging Face ecosystem and its key components.
- To load and explore a pre-trained model (using GPT-2 as an example) for text generation.
- To understand tokenizers and pipelines through practical demonstrations.
- To showcase relevant use cases such as text generation, sentiment analysis, and token classification.
- To discuss the advantages and limitations of using these pre-trained models.

---

## 2. Metadata

- **Framework:** Hugging Face Transformers
- **Pre-trained Models:** GPT-2 (text generation), DistilBERT (sentiment analysis), BERT (token classification)
- **Technologies:** Python, Transformers, PyTorch/TensorFlow (back-end agnostic), Tokenizers, Pipelines
- **Environment:** Jupyter Notebook / Google Colab (CPU-friendly execution)
- **Application:** Text generation, sentiment analysis, token classification, and rapid prototyping in NLP

---

## 3. Conceptual Overview and Mathematical Intuition

### 3.1 Pre-trained Models and the Model Hub

Hugging Face’s Model Hub hosts thousands of pre-trained models for various NLP tasks. These models are pre-trained on large-scale text corpora, and many are further fine-tuned for specific tasks. They build on architectures such as GPT, BERT, T5, and RoBERTa, and they let developers deploy state-of-the-art NLP systems without training models from scratch.

### 3.2 Tokenizers and Pipelines

- **Tokenizers:**  
  Tokenizers convert raw text into numerical tokens that models can understand. They handle operations such as tokenization, padding, truncation, and the addition of special tokens required by the models.

- **Pipelines:**  
  Pipelines offer high-level abstractions for various NLP tasks (e.g., text generation, sentiment analysis, question answering). They combine a tokenizer and a pre-trained model to deliver task-specific outputs with minimal code.
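
To make the padding and truncation steps concrete, here is a minimal pure-Python sketch of what a tokenizer does when it batches sequences of different lengths. The function name `pad_and_truncate` and the padding ID `0` are illustrative assumptions, not part of the `transformers` API; real tokenizers perform the equivalent work when called with their padding and truncation options.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Right-pad or truncate each ID list to max_length and build attention masks.

    pad_id=0 is a hypothetical padding ID chosen for illustration.
    """
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:max_length]                    # truncation
        n_pad = max_length - len(ids)             # how many pad tokens are needed
        input_ids.append(ids + [pad_id] * n_pad)  # padding
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_truncate([[5, 6, 7], [8], [1, 2, 3, 4, 5]], max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets a model distinguish real tokens from padding, which is why generation code is usually given both `input_ids` and `attention_mask`.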

### 3.3 Mathematical Intuition Behind Language Models

Consider GPT-2 as an example. It is based on the Transformer architecture and uses self-attention to model relationships between tokens. It is trained to maximize the probability of each token given all previous tokens, so the probability of a sequence \(x = (x_1, \ldots, x_T)\) factorizes as:
\[
P(x) = \prod_{t=1}^{T} P(x_t \mid x_1, x_2, \ldots, x_{t-1})
\]
This autoregressive training allows GPT-2 to generate coherent sequences of text by sampling one token at a time.
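
The factorization can be checked by hand on a toy example. The sketch below uses a small hypothetical lookup table, `TOY_MODEL`, in place of a neural network: it maps each prefix to a distribution over the next token. The table and its probabilities are invented for illustration; only the chain-rule arithmetic carries over to GPT-2.

```python
import math

# Hypothetical conditional distributions P(next token | full prefix).
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 0.8, "ran": 0.2},
}


def sequence_log_prob(tokens):
    """Sum log P(x_t | x_1, ..., x_{t-1}) over the sequence (log of the product)."""
    total = 0.0
    for t, token in enumerate(tokens):
        prefix = tuple(tokens[:t])  # everything generated before position t
        total += math.log(TOY_MODEL[prefix][token])
    return total


log_p = sequence_log_prob(["the", "cat", "sat"])
print(math.exp(log_p))  # equals the product 0.6 * 0.5 * 0.8, up to rounding
```

Working in log space, as here, is the usual way to avoid numerical underflow when many small probabilities are multiplied.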

### 3.4 Advantages and Disadvantages

**Advantages:**
- **Rapid Prototyping:** Pre-trained models allow quick experimentation without extensive computational resources.
- **Wide Range of Tasks:** A single library can handle multiple NLP tasks (text generation, classification, translation, etc.).
- **Community Support:** Extensive documentation, tutorials, and community contributions.

**Disadvantages:**
- **Computational Requirements:** Some models (e.g., large GPT-2 or BERT variants) require significant memory and processing power.
- **Domain Adaptation:** Pre-trained models may not perform optimally on domain-specific tasks without fine-tuning.
- **Interpretability:** The internal mechanisms (like attention weights) can be complex and less interpretable.

---

## 4. Implementation

Below, you will find several code cells with detailed explanations that showcase the practical usage of Hugging Face. We cover:

- Loading a pre-trained GPT-2 model for text generation.
- Using tokenizers for processing text.
- Employing pipelines for rapid prototyping.
- Additional use cases such as sentiment analysis and named entity recognition.

---



In [1]:
# Install and Import Hugging Face Transformers

!pip install transformers --quiet

from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline


**Explanation for Cell 1:**  
We install the Hugging Face `transformers` library if it is not already installed, then import the following:
- **GPT2LMHeadModel:** The GPT-2 model class with a language-modeling head, used here for text generation.
- **GPT2Tokenizer:** The tokenizer that converts text into tokens for GPT-2.
- **pipeline:** A high-level interface that simplifies inference tasks such as text generation.
  
This cell sets up the environment to work with pre-trained models and tokenizers.


In [2]:
# Load the Pre-trained GPT-2 Model and Tokenizer

# Load the GPT-2 tokenizer from the Hugging Face Model Hub
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load the pre-trained GPT-2 model for text generation
model = GPT2LMHeadModel.from_pretrained("gpt2")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


**Explanation:**  
Here, we load the pre-trained GPT-2 model and its corresponding tokenizer from Hugging Face's Model Hub:
- **Tokenizer:** Converts input text into numerical tokens required by the model.
- **Model:** The GPT-2 model, which is designed for text generation tasks.
  
These components are fetched automatically from the Model Hub.


In [3]:
# Generate Text Using GPT-2

# Define a prompt for text generation
prompt = "In the future, artificial intelligence will"

# Encode the prompt into tokens using the GPT-2 tokenizer
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate text with the GPT-2 model using specified parameters
generated_outputs = model.generate(
    input_ids,
    max_length=100,          # Maximum total length in tokens (prompt + generated)
    num_return_sequences=1,  # Return one sequence
    no_repeat_ngram_size=2,  # Prevent any 2-gram from appearing twice
    early_stopping=True      # Only relevant for beam search; no effect with greedy decoding
)

# Decode the generated tokens back to a human-readable text string
generated_text = tokenizer.decode(generated_outputs[0], skip_special_tokens=True)
print("Generated Text:\n", generated_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated Text:
 In the future, artificial intelligence will be able to do things like search for information about people, and to search through the data for clues.

"We're going to see a lot of things that are going on in the next few years," said Dr. Michael S. Hirsch, a professor of computer science at the University of California, Berkeley. "We'll see more of these things in a few decades."
.


**Explanation:**  
This cell demonstrates text generation with GPT-2:
- We define a prompt that serves as the starting point for text generation.
- The prompt is tokenized using the GPT-2 tokenizer.
- The model’s `generate` method is called with parameters to control the output:
  - `max_length` sets the maximum total length in tokens, counting the prompt as well as the generated continuation.
  - `num_return_sequences` specifies the number of sequences to generate.
  - `no_repeat_ngram_size` prevents any n-gram of that size (here, any pair of consecutive tokens) from appearing twice.
  - `early_stopping` controls when beam search stops; because this call uses the default greedy decoding (`num_beams=1`), it has no effect here.
- Finally, the generated tokens are decoded back into text and printed.
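
To illustrate what the `no_repeat_ngram_size` constraint means, the following pure-Python sketch computes which next tokens would be ruled out at a given step. The helper name `banned_next_tokens` and the token IDs are illustrative assumptions; this is not the library's implementation, only the same idea: a token is blocked if emitting it would complete an n-gram that already occurs in the sequence.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) + 1 < ngram_size:
        return set()
    prefix_len = ngram_size - 1
    # The last (n - 1) tokens form the start of the n-gram about to be completed.
    current_prefix = tuple(generated[len(generated) - prefix_len:])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:prefix_len] == current_prefix:
            banned.add(ngram[-1])  # emitting this token would repeat `ngram`
    return banned


# With bigrams (10, 11), (11, 12), (12, 10) already generated and the sequence
# ending in 10, emitting 11 next would repeat the bigram (10, 11).
print(banned_next_tokens([10, 11, 12, 10], ngram_size=2))
```

Conceptually, a decoder applying this rule removes the banned tokens from consideration before choosing the next token, which is why small values such as 2 can noticeably constrain longer outputs.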


In [4]:
# Using a Pipeline for Text Generation

# Create a text generation pipeline using GPT-2
text_generator = pipeline("text-generation", model="gpt2", tokenizer="gpt2")

# Generate text using the pipeline with a new prompt
pipeline_output = text_generator("The future of technology is", max_length=100, num_return_sequences=1)
print("Pipeline Generated Text:\n", pipeline_output[0]["generated_text"])


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Pipeline Generated Text:
 The future of technology is changing the way we think about technology, and one of the biggest benefits of this is that it allows us to see beyond conventional technologies in the current day. Today, it is less possible to design a smartphone and we are now able to develop a new mobile device (MUMV.) and use it to explore new technologies that we have never thought about before."

The new products may also be a preview at E3, which kicked off this year with a video showing


**Explanation:**  
This cell demonstrates how to use Hugging Face's high-level pipeline API:
- We create a **text-generation** pipeline that internally handles the tokenization and model inference.
- By providing a prompt, the pipeline generates text with minimal code.
- The result is printed, showcasing how pipelines can simplify rapid prototyping.


In [5]:
# Exploring Tokenizers - Tokenizing and Decoding

# Define a sample sentence for tokenization
sample_text = "Hugging Face provides state-of-the-art NLP models."

# Tokenize the sample text
tokens = tokenizer.tokenize(sample_text)
print("Tokens:", tokens)

# Convert tokens to input IDs
input_ids_sample = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids_sample)

# Decode the input IDs back to text
decoded_text = tokenizer.decode(input_ids_sample)
print("Decoded Text:", decoded_text)


Tokens: ['Hug', 'ging', 'ĠFace', 'Ġprovides', 'Ġstate', '-', 'of', '-', 'the', '-', 'art', 'ĠN', 'LP', 'Ġmodels', '.']
Input IDs: [48098, 2667, 15399, 3769, 1181, 12, 1659, 12, 1169, 12, 433, 399, 19930, 4981, 13]
Decoded Text: Hugging Face provides state-of-the-art NLP models.


**Explanation:**  
This cell explores the functionality of the tokenizer:
- A sample sentence is defined.
- The sentence is split into sub-word tokens using GPT-2's byte-level BPE vocabulary; the `Ġ` prefix marks a token that was preceded by a space.
- The tokens are then converted into their corresponding numerical IDs.
- Finally, the numerical IDs are decoded back into text to confirm that the tokenizer works as expected.
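
The role of the `Ġ` marker in the printed tokens can be verified without the library. The sketch below takes the token list from the output above and rebuilds the sentence by concatenating the pieces and turning each `Ġ` back into a space. The helper name `detokenize` is an assumption for illustration, and the shortcut only holds for plain ASCII text like this sentence, since GPT-2's byte-level scheme remaps other bytes as well.

```python
# Tokens copied from the tokenizer output shown above.
tokens = ['Hug', 'ging', 'ĠFace', 'Ġprovides', 'Ġstate', '-', 'of', '-',
          'the', '-', 'art', 'ĠN', 'LP', 'Ġmodels', '.']


def detokenize(pieces):
    """Concatenate sub-word pieces and map the space marker back to a space."""
    return "".join(pieces).replace("Ġ", " ")


print(detokenize(tokens))
```

Pieces without the marker, such as `ging` or `LP`, attach directly to the previous piece, which is how a single word ends up split across several tokens.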


In [7]:
# Additional Use Case - Sentiment Analysis Pipeline

# Create a sentiment analysis pipeline with the task's default model (a DistilBERT model fine-tuned on SST-2)
sentiment_pipeline = pipeline("sentiment-analysis")

# Analyze the sentiment of a sample text
sentiment_output = sentiment_pipeline("I love the simplicity and power of Hugging Face!")
print("Sentiment Analysis Output:", sentiment_output)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


Sentiment Analysis Output: [{'label': 'POSITIVE', 'score': 0.9998835325241089}]


**Explanation:**  
This cell showcases another use case of Hugging Face pipelines: sentiment analysis.
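- We create a sentiment analysis pipeline without naming a model, so it falls back to the task default: a DistilBERT checkpoint fine-tuned on SST-2, as the log above shows. For reproducible results, pass an explicit `model` argument.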
- A sample text is provided, and the pipeline outputs the sentiment (e.g., positive or negative) along with a confidence score.
  
This example highlights the versatility of pipelines for various NLP tasks.
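
The confidence score in the output is a probability derived from the classifier's raw outputs (logits). As a rough sketch of that last step, the snippet below applies a softmax to two hypothetical logits and picks the most likely label. The logit values and the label order in `LABELS` are assumptions made for illustration, not values read from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["NEGATIVE", "POSITIVE"]  # assumed label order, for illustration only
logits = [-4.3, 4.8]               # hypothetical classifier outputs

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": LABELS[best], "score": probs[best]}
print(result)
```

A large gap between the two logits produces a score very close to 1, which is consistent with the high confidence reported for the strongly positive example sentence.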


In [6]:
# Additional Use Case - Named Entity Recognition (NER)

# Create a named entity recognition pipeline using a pre-trained model
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", tokenizer="bert-base-cased")

# Analyze a sample text for named entities
ner_output = ner_pipeline("Hugging Face is based in New York and was founded by Clément Delangue.")
print("NER Output:")
for entity in ner_output:
    print(entity)


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


NER Output:
{'entity': 'I-ORG', 'score': 0.98935014, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.90081507, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.97460246, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-LOC', 'score': 0.99907994, 'index': 7, 'word': 'New', 'start': 25, 'end': 28}
{'entity': 'I-LOC', 'score': 0.9985216, 'index': 8, 'word': 'York', 'start': 29, 'end': 33}
{'entity': 'I-PER', 'score': 0.999686, 'index': 13, 'word': 'C', 'start': 53, 'end': 54}
{'entity': 'I-PER', 'score': 0.9924972, 'index': 14, 'word': '##lé', 'start': 54, 'end': 56}
{'entity': 'I-PER', 'score': 0.999303, 'index': 15, 'word': '##ment', 'start': 56, 'end': 60}
{'entity': 'I-PER', 'score': 0.9986883, 'index': 16, 'word': 'Del', 'start': 61, 'end': 64}
{'entity': 'I-PER', 'score': 0.98653525, 'index': 17, 'word': '##ang', 'start': 64, 'end': 67}
{'entity': 'I-PER', 'score': 0.9919424, 'index': 18, 'word': '##ue

**Explanation:**  
In this cell, we demonstrate how to perform Named Entity Recognition (NER) using a pipeline:
- We create an NER pipeline using a pre-trained model fine-tuned on the CoNLL-2003 dataset.
- A sample sentence is processed to extract named entities (such as organizations, locations, and people).
- The output is printed one word piece at a time, showing each piece's entity label, confidence score, and character offsets. Pieces prefixed with `##` continue the preceding word.
  
This example illustrates the flexibility of Hugging Face for various NLP tasks beyond text generation.
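
Because the raw output lists word pieces rather than whole entities, a post-processing step is usually needed. The sketch below groups the first five rows of the output above into entity spans; the helper name `group_entities` and its merging rule (same entity type and consecutive `index` values) are assumptions for illustration. The `pipeline` function also accepts an `aggregation_strategy` argument for token classification, which performs this kind of grouping inside the library.

```python
# First five rows of the NER output shown above (scores and offsets omitted).
rows = [
    {"entity": "I-ORG", "index": 1, "word": "Hu"},
    {"entity": "I-ORG", "index": 2, "word": "##gging"},
    {"entity": "I-ORG", "index": 3, "word": "Face"},
    {"entity": "I-LOC", "index": 7, "word": "New"},
    {"entity": "I-LOC", "index": 8, "word": "York"},
]


def group_entities(token_rows):
    """Merge adjacent word pieces that share an entity type into one span."""
    groups = []
    for row in token_rows:
        label = row["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        word = row["word"]
        prev = groups[-1] if groups else None
        if prev and prev["label"] == label and row["index"] == prev["last_index"] + 1:
            # "##" marks a continuation of the previous word; otherwise add a space.
            prev["text"] += word[2:] if word.startswith("##") else " " + word
            prev["last_index"] = row["index"]
        else:
            groups.append({"label": label, "text": word, "last_index": row["index"]})
    return [(g["label"], g["text"]) for g in groups]


entities = group_entities(rows)
print(entities)
```

Grouping by consecutive indices keeps "Hugging Face" and "New York" as separate spans even though other tokens sit between them in the sentence.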


## 5. Conclusion and Key Learnings

**Conclusion:**  
In this notebook, we explored the Hugging Face ecosystem by:
- Introducing pre-trained models and the Hugging Face Model Hub.
- Demonstrating text generation with GPT-2 using both low-level API calls and high-level pipelines.
- Exploring the tokenizer functionality to convert text to tokens and back.
- Showcasing additional use cases including sentiment analysis and named entity recognition.

Hugging Face streamlines the use of state-of-the-art NLP models, enabling rapid prototyping and experimentation. Its pipelines abstract complex tasks into a few lines of code, making advanced NLP accessible to a wide audience.

**Key Learnings:**
- **Ease of Use:** Hugging Face’s high-level pipelines significantly simplify working with pre-trained models.
- **Flexibility:** The library supports a wide range of NLP tasks from text generation to sentiment analysis and NER.
- **Community and Resources:** The extensive Model Hub and active community contribute to a rich ecosystem for research and development.
- **Considerations:** While pre-trained models accelerate development, they may require fine-tuning and substantial computational resources for domain-specific applications.

**Next Steps:**
- Experiment with additional models (e.g., BERT, T5) for different NLP tasks.
- Fine-tune pre-trained models on custom datasets to improve performance.
- Explore Hugging Face’s other libraries such as `datasets` and `accelerate` for scalable, production-level NLP workflows.
