In [None]:
from transformers import pipeline
from google.colab import output

# Using Transformer Pipelines

Think of the `pipeline` as a *magic wand*. It bundles together all the necessary steps to go from raw text to a structured, understandable output:

1.  **Preprocessing:** Converts your text into numbers a model can understand (tokenization).
2.  **Model Inference:** Feeds the numbers through the model to get a prediction.
3.  **Post-processing:** Cleans up the model's output and presents it in a human-readable format.

We're going to play around with the following types of pipelines.


| Task | Input | Output | Example Use Case |
|------|-------|--------|------------------|
| **Sentiment Analysis** | Text | Sentiment label + score | Product reviews |
| **Text Generation** | Prompt text | Generated continuation | Chatbots, story writing |
| **Zero-shot Classification** | Text + Categories | Category probabilities | Email routing |
| **Question Answering** | Question + Context | Extracted answer | FAQ systems |
| **Fill-Mask** | Text with `<mask>` | Predicted words | Autocomplete |
| **Summarization** | Long text | Short summary | News digests |
| **Tokenization** | Raw text | Tokens + IDs | All NLP tasks |
| **NER** | Text | Entity labels + positions | Privacy compliance |

# Sentiment Analysis

Let's start with a _real_ simple example. We'll use the sentiment analysis model to get the general gist of how positive or negative a given sentence is. Sentiment analysis is one of the most common NLP (Natural Language Processing) tasks. It determines whether a piece of text expresses positive or negative emotions. Think of it as teaching a computer to understand if someone is happy or upset based on their words.

Sentiment analysis has numerous real-world applications including customer review analysis, social media monitoring, product feedback classification, and support ticket prioritization.

Imagine you run an online store with thousands of reviews. Reading them all would be mind-numbingly boring. Sentiment analysis can:

- Instantly categorize reviews as positive or negative
- Alert you to unhappy customers who need attention
- Track satisfaction trends over time
- Identify your best and worst products

In [None]:
sentiment_analyzer = pipeline(
    "sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english"
)

texts = [
    "I love this product! It's amazing.",
    "This is terrible. I hate it.",
    "It's okay, nothing special.",
    "This is bad and whoever made it should feel bad.",
    "Absolutely fantastic! Best purchase ever!",
    "Meh, could be better.",
]


for text in texts:
    result = sentiment_analyzer(text)[0]
    label = result["label"]
    confidence = result["score"]

    print("\n" + "-" * 50 + "\n")
    print(f"Text: {text}")
    print(f"Sentiment: {label} (Confidence: {confidence:.3f})")

The confidence score tells you how sure the model is. A score of 0.999 means 99.9% confident (very sure), 0.750 means 75% confident (pretty sure), and 0.500 means 50% confident (uncertain).
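
To see where a score like 0.999 comes from, here is a minimal sketch. It assumes the pipeline converts the model's two raw outputs (logits) into probabilities with a softmax and reports the larger one. The logit values are made up for illustration and are not real model output.

```python
import math


def softmax(logits):
    """Convert raw model outputs (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
hypothetical_logits = [-3.2, 3.6]  # illustrative values only

probs = softmax(hypothetical_logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print(f"{labels[best]} (Confidence: {probs[best]:.3f})")
```

The further apart the two logits are, the closer the winning probability gets to 1.0. Two nearly equal logits give a score near 0.5.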



`pipeline` is an abstraction that handles loading the model, tokenization (converting text to numbers), inference (making predictions), and post-processing (converting outputs to readable format).

- **Pre-processing**: Using a tokenizer to pre-process the text and break it into tokens.
- **Processing**: Feeds the tokens to the model and gets the result.
- **Post-processing**: Converts the raw model output into labels and scores we can read.

## Common Pitfalls

- **Binary Classification**: This default model only outputs `POSITIVE` or `NEGATIVE`. It doesn't have a `NEUTRAL` category.
- **Context Limitations**: The model might miss sarcasm or complex emotions.
- **Language**: The default model works best with English. For other languages, you'll need specific models.

### Demonstrating a Pitfall: Neutrality and Sarcasm

This model is trained for binary classification (`POSITIVE`/`NEGATIVE`) and can be easily confused by neutral statements or sarcasm.

In [None]:
neutral_text = "This notebook is running on a server."
sarcastic_text = "Oh, great. Another meeting. That's exactly what I needed."

print("Analyzing a neutral sentence…")
print(sentiment_analyzer(neutral_text))

print("\nAnalyzing a sarcastic sentence…")
print(sentiment_analyzer(sarcastic_text))

# Text Generation

Text generation is like having an AI co-writer. You provide a starting text (a "prompt"), and the model continues writing, mimicking the style and content it learned from a massive dataset of text. Unlike sentiment analysis, which *interprets* existing text, text generation *creates* new text.

In [None]:
from transformers import pipeline

text_generator = pipeline("text-generation", model="gpt2")

output.clear()

prompts = [
    "The future of artificial intelligence is",
    "Once upon a time in a distant galaxy",
    "The secret to happiness is",
    "In the year 2050, technology will",
]

for prompt in prompts:
    print(f"{prompt}…")

    generated = text_generator(
        prompt,
        max_new_tokens=100,
        truncation=True,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=text_generator.tokenizer.eos_token_id,
    )

    generated_text = generated[0]["generated_text"]
    print(generated_text)
    print("\n" + "-" * 50 + "\n")


Notice the difference from our sentiment analyzer:

- We specify `"text-generation"` as the task
- We explicitly choose `model="gpt2"`

**Why specify the model?** Different models have different capabilities and sizes. GPT-2 is a good balance of quality and speed for learning. But, mostly we're impatient and we wanted a small, fast model.

These prompts are intentionally:

- **Open-ended**: They invite continuation rather than yes/no answers
- **Diverse**: From sci-fi to philosophy to predictions
- **Incomplete**: They end mid-thought, encouraging the model to continue

## Understanding the Parameters

- `max_new_tokens=100`
  - Maximum number of tokens to generate, not counting the prompt
  - Prevents endless generation
  - Measured in "tokens" (roughly words/subwords)
- `num_return_sequences=1`
  - How many different versions to generate
  - Set to 2 or 3 to see different possibilities from the same prompt
- `temperature=0.7`: Think of temperature like a creativity dial.
  - Controls randomness/creativity; it must be greater than 0 when sampling
  - **Low (0.1-0.5)**: Conservative, safe choices that can become predictable and repetitive
  - **Medium (0.6-0.9)**: Balanced creativity (we use 0.7 here; the Transformers default is 1.0)
  - **High (1.0+)**: Experimental, risky choices; at 1.5+ the output gets wild and sometimes nonsensical
- `do_sample=True`
  - Enables probabilistic sampling
  - Without this, generation would be deterministic (same output every time)
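
To make the temperature dial concrete, here is a small sketch of the usual mechanism: divide the logits by the temperature, then apply a softmax. The candidate tokens and logits are hypothetical stand-ins, not real GPT-2 output, and `apply_temperature` is an illustrative helper rather than a library function.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive when sampling")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token candidates and logits (illustration only).
tokens = ["bright", "uncertain", "robots", "cheese"]
logits = [2.0, 1.5, 0.5, -1.0]

for temperature in (0.3, 0.7, 1.5):
    probs = apply_temperature(logits, temperature)
    summary = ", ".join(f"{t}={p:.2f}" for t, p in zip(tokens, probs))
    print(f"temperature={temperature}: {summary}")

# Sampling draws from the distribution; greedy decoding always
# takes the single most likely token.
rng = random.Random(0)
sampled = rng.choices(tokens, weights=apply_temperature(logits, 0.7), k=5)
greedy = tokens[logits.index(max(logits))]
print("sampled:", sampled)
print("greedy:", greedy)
```

Low temperatures concentrate probability on the top token. High temperatures flatten the distribution, so unlikely tokens get picked more often.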

## Common Pitfalls

- **Repetition**: Models can get stuck in loops. Adjust temperature or use repetition penalties.
- **Context Limits**: GPT-2 can only "remember" about 1024 tokens. Long prompts reduce generation space.
- **Inappropriate Content**: Models learn from internet text and may generate biased or inappropriate content.
- **Coherence Drift**: Longer generations may lose focus or contradict earlier statements.

## Experimentation Ideas

Try different temperatures (0.3, 0.7, and 1.2) to see how creativity changes. Use longer prompts to give more context for guiding generation. Set `num_return_sequences=3` to generate multiple versions. Compare different lengths with `max_new_tokens=50` versus `max_new_tokens=200`.

You can also influence the generation style through your prompt:

```python
# Formal style
"In academic terms, artificial intelligence represents"

# Casual style
"So basically, AI is like"

# Technical style
"The implementation of neural networks requires"
```

# Classification

Zero-shot classification is _almost_ magical: you can classify text into categories you define, without training the model on any examples. Unlike traditional machine learning (which needs thousands of labeled examples), zero-shot classification understands language well enough to match text to categories based on meaning alone.

You can totally think of it as asking a smart assistant: "Which of these categories best describes this text?"

"Zero-shot" means zero training examples. The model uses its understanding of language to connect texts with labels it's never seen before.

## Tasting Notes

- With the default single-label setting, all scores sum to 1.0 (100%): the model distributes probability across the categories.
- Under the hood, the model turns each category into a hypothesis (for example, "This example is [category]."), checks how well each hypothesis fits the text, and converts those fits into probabilities.

## How It Works

The model converts classification into a natural language inference task:

- For each category, it creates a hypothesis such as "This example is [category]." (the wording is controlled by the pipeline's `hypothesis_template` parameter)
- It evaluates how well each hypothesis fits the text
- It returns probability scores for all categories

This approach is incredibly powerful because you can define new categories on the fly without retraining the model.
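
The three steps above can be sketched in plain Python. This sketch assumes the template is "This example is {}." and that single-label scores come from a softmax over each label's entailment logit. The logits below are hypothetical stand-ins for what the NLI model would produce.

```python
import math


def softmax(values):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


hypothesis_template = "This example is {}."  # assumed template wording
candidate_labels = ["business", "technology", "sports"]
text = "Apple reported record quarterly earnings driven by strong iPhone sales."

# Step 1: build one (premise, hypothesis) pair per candidate label.
pairs = [(text, hypothesis_template.format(label)) for label in candidate_labels]

# Step 2: an NLI model would score each pair. These entailment logits
# are hypothetical values used only to illustrate the arithmetic.
entailment_logits = {"business": 2.9, "technology": 1.7, "sports": -2.4}

# Step 3: a softmax across labels yields scores that sum to 1.
scores = softmax([entailment_logits[label] for label in candidate_labels])
ranked = sorted(zip(candidate_labels, scores), key=lambda item: item[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the labels only appear inside the hypothesis sentence, swapping in new categories means changing a list of strings, not retraining anything.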

In [None]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

output.clear()

candidate_labels = ["business", "technology", "sports", "entertainment", "politics"]

texts = [
    "Apple reported record quarterly earnings driven by strong iPhone sales.",
    "The new artificial intelligence model shows significant improvements in natural language understanding.",
    "The Denver Nuggets defeated the Lakers in last night's basketball game.",
    "The upcoming movie sequel has generated significant buzz among fans.",
    "The senator announced new legislation aimed at healthcare reform.",
]

for text in texts:
    result = classifier(text, candidate_labels)

    print(f"Text: {text}")
    print("Classifications:")

    for label, score in zip(result["labels"], result["scores"]):
        print(f"  {label}: {score:.3f}")

    print("\n" + "-" * 50 + "\n")

## Common Pitfalls

- **Ambiguous Labels** can cause confusion: "positive" could mean sentiment or medical test results, so be specific.
- **Too Many Categories** leads to lower confidence; start with 3-7 categories.
- **Similar Categories** like "technology" versus "computers" confuse the model.
- **Language Mismatch** between an English model and Spanish text produces poor results.

## Practical Applications

Zero-shot classification excels in **Customer Support** for routing tickets to departments, **Content Moderation** for flagging inappropriate content, **News Categorization** for organizing articles by topic, **Email Filtering** beyond simple spam detection, and **Sentiment Analysis** with custom emotion categories.

## Using Multiple Labels

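By default, the classifier assumes each text belongs to ONE category, so the scores compete with each other and sum to 1. Passing `multi_label=True` scores each label independently, so several labels can score high at once.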

In [None]:
# Create classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Single-label classification: scores compete and sum to 1
single_result = classifier(
    "I love programming in Python!",
    candidate_labels=["positive", "negative", "neutral"],
)
print("Single-label:", single_result)

# Multi-label classification: each label is scored on its own
multi_result = classifier(
    "The new iPhone is expensive but innovative.",
    candidate_labels=["technology", "business", "review"],
    multi_label=True,
)
print("Multi-label:", multi_result)
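
Why can multi-label scores add up to more than 1? A rough sketch, assuming each label is scored on its own by comparing the NLI model's entailment logit against its contradiction logit. The logits are hypothetical, and `entailment_probability` is an illustrative helper, not a library function.

```python
import math


def entailment_probability(contradiction_logit, entailment_logit):
    """Softmax over just [contradiction, entailment] for a single label."""
    peak = max(contradiction_logit, entailment_logit)
    e_con = math.exp(contradiction_logit - peak)
    e_ent = math.exp(entailment_logit - peak)
    return e_ent / (e_con + e_ent)


# Hypothetical (contradiction, entailment) logits per label.
label_logits = {
    "technology": (-1.8, 2.6),
    "business": (-0.9, 1.4),
    "sports": (3.1, -2.7),
}

scores = {
    label: entailment_probability(*pair) for label, pair in label_logits.items()
}
for label, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {score:.3f}")
print(f"Sum of scores: {sum(scores.values()):.3f}")
```

Each label gets its own yes/no probability, so two labels can both land above 0.9 while a third sits near 0.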

## Try It Yourself

Experiment with **Different Categories** like emotional states: `["happy", "sad", "angry", "surprised"]`. Try **Longer Labels** using phrases: `["related to climate change", "about the economy"]`. Test **Multi-Label** scenarios where a tech business story could be both "technology" AND "business". Explore **Non-English** labels in other languages, though results may vary.

# Question Answering

This is probably the task that looks most like ChatGPT, Claude, and Gemini if you squint, but not quite, as we'll see in a moment. The one thing you'll notice is that you're providing _all_ of its context. So it's only good at answering questions about the context you have given it.

Question answering (QA) is like having a smart assistant read a document and answer your questions about it. The model finds the exact portion of text that answers your question—it doesn't generate new text, it extracts existing text.

This is called _extractive_ QA because it extracts answers directly from the provided context.

- **Extractive** (what we're using): Finds and extracts the exact text span that answers the question.
- **Generative** (a la ChatGPT): Creates new text to answer the question.

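Extractive QA is more reliable when you need precise information from source documents: it can only return text that appears in the context, so it can't invent new facts (though it can still pick the wrong span).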

In [None]:
hotdog_historian = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

context = """
🌭 The modern hot dog traces its roots to German sausages brought to the U.S. by immigrants in the 1800s.
The first famous hot-dog stand popped up on Coney Island in 1871, allegedly selling 3,684 sausages in buns
during its first year. Classic toppings include mustard, ketchup (controversial in Chicago), relish, onions,
and sauerkraut. Nathan’s Famous launched its annual Hot Dog Eating Contest in 1916, turning competitive eating
into a Fourth-of-July spectacle.
"""

questions = [
    "Where did the first famous hot-dog stand appear?",
    "Which condiment causes drama in Chicago?",
    "What competitive eating event did Nathan’s Famous start?",
]

for question in questions:
    result = hotdog_historian(question=question, context=context)

    print(f"Question: {question}")
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.3f}")
    print("\n" + "-" * 50 + "\n")


Simple and consistent with our previous pipelines. Here we explicitly load `distilbert-base-cased-distilled-squad`, a [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert)-based model fine-tuned for extractive QA. If you leave out `model`, the pipeline falls back to a default question-answering model.

The context is the "document" where the model looks for answers. The context must contain the answer since the model can't draw from general knowledge. Longer contexts are harder to process because there's more text to search through. Quality matters greatly—clear, well-written text produces better results.

## Key Concepts

### Extractive vs. Generative

**Extractive**—what we're using here—finds answers within provided text, always returns exact text from context, and can't answer if information isn't there. **Generative**—which is *actually* like ChatGPT—can combine information and reason, generates new text for answers, and can use general knowledge.

### Confidence Scores

Different from classification, QA confidence scores work as follows: high scores (>0.9) indicate the model found a clear answer, medium scores (0.5-0.9) suggest an answer was found but the model is uncertain, and low scores (<0.5) usually mean the model is probably guessing.

### Answer Span

The model actually predicts the start position of the answer, the end position of the answer, and then extracts the text between these positions.
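
The span selection can be sketched as a small search. It assumes the model gives every token a start score and an end score, and that the best answer is the valid span (start before or at end, within a maximum length) with the highest combined score. The tokens and scores are hypothetical, and `best_span` is an illustrative helper.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end, score) for the highest-scoring valid span."""
    best = None
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical tokens and per-token scores (illustration only).
tokens = ["The", "first", "stand", "opened", "on", "Coney", "Island", "in", "1871"]
start_scores = [0.1, 0.0, 0.2, 0.1, 0.9, 4.8, 1.2, 0.1, 0.7]
end_scores = [0.0, 0.1, 0.3, 0.1, 0.2, 1.1, 5.2, 0.2, 0.9]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start : end + 1])
print(f"Answer: {answer} (tokens {start}-{end})")
```

Note that the search always returns some span, which is why the model still answers when the context does not contain the answer.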

Question answering transforms the challenge of information extraction into a simple Q&A interface. The model doesn't generate answers—it finds them in the text you provide. Success depends on asking clear questions and ensuring the answer actually exists in the context. High confidence scores (>0.9) usually indicate reliable answers, while low scores suggest the model is guessing. Remember: garbage in, garbage out—the quality of your context determines the quality of your answers.

## Common Pitfalls

- **Answer Not in Context**: The model will guess, usually poorly. For example, with context "The sky is blue" and question "What color is grass?", it might return "blue" with low confidence.
- **Ambiguous Questions**: Multiple possible answers confuse the model.
- **Complex Reasoning**: The model can't handle "Why?" or "How?" questions well.
- **Long Contexts**: Performance degrades with very long documents.

### What Happens When the Answer Isn't There?

The model is forced to find the *best possible* answer within the context, even if it's completely wrong. This is a critical limitation of extractive QA.

In [None]:
no_answer_question = "What is the best brand of mustard for a hot dog?"

result = hotdog_historian(question=no_answer_question, context=context)

print(f"Question: {no_answer_question}")
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.3f}")

**Notice**: The model still returns a span from the context, such as "Nathan’s Famous", with a very low confidence score. When the answer is not present, the model essentially makes its best guess from the text it has, and the low score is your signal that the answer is unreliable.
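
One way to act on that signal is to discard answers below a cutoff. This is a minimal sketch: `answer_or_none` is a hypothetical helper, the 0.3 threshold is an arbitrary assumption you would tune for your own data, and the result dicts only mimic the `answer` and `score` keys used above.

```python
def answer_or_none(result, min_score=0.3):
    """Return the answer text only when the score clears the threshold."""
    if result["score"] < min_score:
        return None
    return result["answer"]


# Hypothetical results shaped like the pipeline output used above.
confident = {"answer": "Coney Island", "score": 0.97}
doubtful = {"answer": "mustard", "score": 0.02}

for result in (confident, doubtful):
    answer = answer_or_none(result)
    print(answer if answer is not None else "No reliable answer found")
```

In the notebook you could pass the pipeline's own result, for example `answer_or_none(hotdog_historian(question=..., context=context))`.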

# Fill Mask

[Fill-Mask](https://huggingface.co/tasks/fill-mask) is like a smart autocomplete that can predict missing words based on context. Fill-mask reveals how language models predict missing words. This is actually how [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) was trained—by learning to predict masked words in millions of sentences.

Think of it as a sophisticated version of:

> The cat sat on the `fill in the blank` → "mat"

This is a fundamental task that BERT was trained on. While simpler than question answering, it provides insight into how language models learn word relationships and context.

- Use masked language models to predict missing words
- Understand how BERT and similar models were trained
- Get multiple predictions with confidence scores
- Work with the special mask token (`<mask>` for RoBERTa, `[MASK]` for BERT)

The `<mask>` token is special: exact spelling matters (must be lowercase in angle brackets for RoBERTa models), standard models handle only one mask at a time, and position matters since the model uses surrounding context.

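**Note**: Different models use different mask tokens: BERT uses `[MASK]` while RoBERTa uses `<mask>`. The pipeline does not translate between them, so use the token your model expects. You can look it up with `fill_mask.tokenizer.mask_token`.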

## Practical Applications

Practical applications include text restoration (filling in damaged or censored text), data augmentation (generating training data variations), language learning (creating fill-in-the-blank exercises), autocomplete (smarter suggestions in text editors), and quality checking (testing if sentences make sense).

### Why This Matters

- **Shows model understanding**: See what the model "thinks" should come next
- **Reveals potential biases**: Discover what associations the model has learned
- **Demonstrates vocabulary**: Understand the model's word knowledge and limitations

Fill-mask gives us a window into the model's internal representations and learned patterns.

In [None]:
fill_mask = pipeline("fill-mask", model="distilroberta-base")

sentences = [
    "The capital of France is <mask>.",
    "Python is a popular <mask> language.",
    "Machine learning is a subset of <mask> intelligence.",
    "The <mask> of gravity was discovered by Newton.",
    "Shakespeare wrote the play <mask>.",
]


for sentence in sentences:
    print(f"Original: {sentence}")

    results = fill_mask(sentence, top_k=3)

    print("Top predictions:")
    for i, result in enumerate(results, 1):
        filled_sentence = result["sequence"]
        token = result["token_str"]
        score = result["score"]
        print(f"  {i}. {filled_sentence} (token: '{token}', score: {score:.3f})")

    print("\n" + "-" * 50 + "\n")

## Key Concepts

### Masked Language Modeling (MLM)

This is how BERT was trained: take a sentence ("The dog chased the cat"), randomly mask about 15% of its tokens ("The <mask> chased the cat"), train the model to predict "dog", and repeat over millions of sentences.
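
The masking step can be sketched in a few lines. This is a simplified illustration: `mask_tokens` is a hypothetical helper, it splits on whitespace instead of using a real tokenizer, and it always substitutes the mask token. BERT's actual recipe sometimes keeps or swaps the chosen token instead.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Randomly replace roughly mask_prob of the tokens with mask_token.

    Returns the masked tokens plus a {position: original_token} dict,
    which is what a model would be trained to recover.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


sentence = "The dog chased the cat across the yard and over the fence".split()
masked, targets = mask_tokens(sentence)
print("Masked: ", " ".join(masked))
print("Targets:", targets)
```

Try different `seed` values to see different words hidden. A short sentence may end up with no masks at all, since each token is masked independently.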

### Bidirectional Context

Unlike GPT (left-to-right), BERT looks at both sides. For "The <mask> barked loudly", it uses both "The" and "barked loudly". This makes it better at understanding but worse at generation.

### Token vs. Word

Sometimes predictions include spaces: `token_str: " Paris"` (with leading space) or `token_str: "Paris"` (no space). The model handles subword tokens and spacing automatically.

## Common Pitfalls

- **Wrong mask format**: Use `<mask>` for RoBERTa models, not `[mask]`, `{MASK}`, or `[MASK]`.
- **Multiple Masks**: Standard models struggle with multiple masks (e.g., `"The <mask> <mask> is blue."`).
- **Out-of-Vocabulary Words**: The model might not know specialized terms.
- **Context Dependence**: Same mask, different contexts produce different results—"Apple makes great <mask>" → "products" while "Apple is a delicious <mask>" → "fruit".

**Warning**: The predictions from fill-mask models reflect patterns in their training data, which can include societal biases. Always be aware of these limitations when deploying models in production, especially for sensitive applications.

# Summarization

Summarization is the task of condensing long texts while preserving the most important information. It's like having an AI assistant read a long article and give you the key points. This is incredibly useful for processing large amounts of text quickly.

There are two types: **extractive** (pulls out important sentences, like highlighting) and **abstractive** (rewrites in new words, like note-taking). Modern models, including most summarization models on Hugging Face, are **abstractive**: they don't just extract sentences, they rewrite content concisely.

- Use AI to summarize long texts automatically
- Control summary length with parameters
- Understand different summarization approaches
- Work with real-world text lengths

This tutorial combines concepts from previous lessons - using pipelines to process text while controlling output with parameters similar to text generation.

## Key Concepts

- **Compression Ratio**: How much shorter the summary is compared to the original
- **Information Retention**: Which key facts are preserved in the summary
- **Readability**: How natural and coherent the summary sounds

Good summarization balances brevity with completeness—keeping what matters while cutting what doesn't.

## Understanding the Parameters

You can control the summary's length and style, much like with text generation:

-   `max_length`: Sets the absolute maximum number of tokens in the final summary. The model will stop generating once it hits this limit.
-   `min_length`: Sets a minimum number of tokens. This forces the model to generate a summary that isn't trivially short.
-   `do_sample=False`: When `False` (the default for summarization), the model uses a deterministic approach, meaning it will produce the same summary every time for a given input. This is usually preferred for summarization to ensure consistency.

In [None]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text_to_summarize = """
The modern hot dog traces its roots to German sausages brought to the U.S. by immigrants in the 1800s.
The first famous hot-dog stand popped up on Coney Island in 1871, allegedly selling 3,684 sausages in buns
during its first year. Classic toppings include mustard, ketchup (controversial in Chicago), relish, onions,
and sauerkraut. Nathan’s Famous launched its annual Hot Dog Eating Contest in 1916, turning competitive eating
into a Fourth-of-July spectacle.
"""

summary = summarizer(text_to_summarize, max_length=50, min_length=10, do_sample=False)

print("Original Text:")
print(text_to_summarize)
print("\nSummary:")
print(summary[0]['summary_text'])

## Key Concepts

### Compression Ratio

The compression ratio compares the length of the original text to the length of the summary. For example, condensing 1000 characters into 200 characters is 5:1 compression. Higher compression means more information loss, so balance brevity with completeness.
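
The arithmetic is simple enough to check in code. A minimal sketch, with `compression_ratio` as a hypothetical helper that measures length in characters:

```python
def compression_ratio(original, summary):
    """Return how many times shorter the summary is, in characters."""
    if not summary:
        raise ValueError("summary must not be empty")
    return len(original) / len(summary)


original = "x" * 1000  # stand-in for a 1000-character article
summary = "x" * 200  # stand-in for a 200-character summary

ratio = compression_ratio(original, summary)
print(f"{ratio:.1f}:1 compression")  # prints "5.0:1 compression"
```

In the notebook you could call it as `compression_ratio(text_to_summarize, summary[0]["summary_text"])`.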

### Token Limits

Models have maximum context windows and can't summarize extremely long documents in one pass. You may need to chunk very long texts, with typical limits being 512-1024 tokens.
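
One simple way to chunk is to split on words with a small overlap, so sentences cut at a boundary still appear in the next chunk. This sketch uses word counts as a rough proxy for tokens, which is an approximation. `chunk_words` and its default sizes are illustrative assumptions, not library settings.

```python
def chunk_words(text, max_words=300, overlap=30):
    """Split text into word-based chunks that overlap slightly."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + max_words]))
        # Stop once a chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


long_text = " ".join(f"word{i}" for i in range(1000))  # stand-in document
chunks = chunk_words(long_text)
print(f"{len(chunks)} chunks")
for index, chunk in enumerate(chunks):
    print(f"  chunk {index}: {len(chunk.split())} words")
```

Each chunk could then be summarized separately and the partial summaries joined, or summarized once more.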

### Quality vs Length Trade-off

Shorter summaries provide key facts only, while longer summaries include more detail and nuance. Find the sweet spot for your use case.

## Common Pitfalls

- **Text Too Short**: Summarizing already-brief text produces poor results.
- **Length Mismatch**: Don't set max_length longer than the input text (e.g., `summarizer("Hello world", max_length=100)`).
- **Information Loss**: Critical details might be omitted.
- **Factual Errors**: Models can sometimes generate incorrect information.

# Named Entity Recognition

Let's talk about how to extract named entities (people, organizations, locations) from text, understand entity types and confidence scores, use aggregation strategies for better results, and highlight entities within text programmatically.

[Named Entity Recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) is the task of identifying and classifying named entities in text, such as people, organizations, and locations. It's like having a smart highlighter that automatically marks the important names and places in a document. This is crucial for information extraction, document analysis, building knowledge graphs, and privacy/compliance (finding personal information).

Standard NER models recognize **PER** (Person: names of people), **ORG** (Organization: companies, agencies, institutions), **LOC** (Location: countries, cities, addresses), and **MISC** (Miscellaneous: other named entities such as nationalities, events, and products).

Different from previous tasks, NER confidence scores work as follows: high confidence (>0.9) indicates a clear entity, medium (0.7-0.9) suggests a probable entity, and low (<0.7) means uncertain and might be wrong.

In [None]:
ner_pipeline = pipeline(
    "ner",
    aggregation_strategy="simple",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
)

texts = [
    "My name is John Smith and I work at Apple Inc. in Cupertino, California.",
    "Barack Obama was the 44th President of the United States from 2009 to 2017.",
    "Microsoft was founded by Bill Gates and Paul Allen in Albuquerque, New Mexico.",
    "The meeting is scheduled for December 15th at Google headquarters in Mountain View.",
    "Tesla CEO Elon Musk announced the new factory will be built in Austin, Texas.",
]

for text in texts:
    print(f"Text: {text}")

    entities = ner_pipeline(text)

    if entities:
        print("Entities found:")
        for entity in entities:
            word = entity["word"]
            label = entity["entity_group"]
            confidence = entity["score"]
            start = entity["start"]
            end = entity["end"]

            print(
                f"  '{word}' -> {label} (confidence: {confidence:.3f}, position: {start}-{end})"
            )
    else:
        print("  No entities found")

    print("\n" + "-" * 50 + "\n")

sample_text = "Apple CEO Tim Cook visited the Tesla factory in Berlin, Germany."
entities = ner_pipeline(sample_text)

highlighted_text = sample_text
offset = 0  # how far earlier replacements have shifted later positions

for entity in sorted(entities, key=lambda x: x["start"]):
    start = entity["start"] + offset
    end = entity["end"] + offset
    label = entity["entity_group"]

    # Slice the span from the text instead of using entity["word"],
    # which the tokenizer may have normalized (spacing, casing).
    span = highlighted_text[start:end]
    replacement = f"[{span}:{label}]"
    highlighted_text = highlighted_text[:start] + replacement + highlighted_text[end:]
    offset += len(replacement) - (end - start)

print(f"Original: {sample_text}")
print(f"Highlighted: {highlighted_text}")


## Common Pitfalls

- **Missing Aggregation**: Always use `aggregation_strategy="simple"` to get complete words instead of subword tokens.
- **Case Sensitivity**: Some models are case-sensitive—"apple" (fruit) versus "Apple" (company).
- **Context Matters**: "Jordan" could be person or country, "Washington" could be person, city, or state.
- **Overlapping Entities**: Standard models don't handle overlaps well.

## Entity Type Reference

NER uses the **IOB** (Inside-Outside-Beginning) tagging scheme:

- **B-XXX**: Beginning of entity type XXX
- **I-XXX**: Inside (continuation) of entity type XXX  
- **O**: Outside any entity

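This scheme allows models to handle multi-word entities correctly. For example, "New York City" is tagged as [B-LOC, I-LOC, I-LOC]. The table below collects entity types from several common tagging schemes; the CoNLL-03 model used above only emits PER, ORG, LOC, and MISC.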

| Type       | Description                      | Examples                          |
| ---------- | -------------------------------- | --------------------------------- |
| PER/PERSON | People, characters               | "Albert Einstein", "Harry Potter" |
| ORG        | Organizations                    | "NASA", "United Nations"          |
| LOC/GPE    | Locations, geopolitical entities | "Paris", "Pacific Ocean"          |
| DATE       | Dates and periods                | "January 2024", "last week"       |
| TIME       | Times                            | "3:30 PM", "morning"              |
| MONEY      | Monetary values                  | "$100", "50 euros"                |
| PERCENT    | Percentages                      | "50%", "half"                     |
| MISC       | Miscellaneous                    | Events, products, etc.            |
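
To make the IOB scheme concrete, here is a small decoder that groups tagged tokens back into entities. `decode_iob` is a hypothetical helper and the tokens and tags are hand-written examples. In practice, `aggregation_strategy="simple"` does this grouping for you.

```python
def decode_iob(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and tag[2:] == current_type
        if continues:
            current_tokens.append(token)
            continue
        # Any other tag closes the entity in progress, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        if tag == "O":
            current_tokens, current_type = [], None
        else:  # "B-XXX", or a stray "I-XXX" treated as a new entity
            current_tokens, current_type = [token], tag[2:]
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Tim", "Cook", "visited", "New", "York", "City", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
print(decode_iob(tokens, tags))
# [('Tim Cook', 'PER'), ('New York City', 'LOC')]
```

The `B-` prefix is what lets two entities of the same type sit next to each other without being merged.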

# Translation

Translation is another core NLP task where `pipeline` shines. You can translate text between numerous languages by simply loading the correct model.

Models are typically named with the source and target languages in mind. For this example, we'll use a model from the Helsinki-NLP group, which has published many high-quality translation models.
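
As a rough illustration of that naming pattern, the sketch below builds a model ID and a task name from two language codes. Both helpers are hypothetical, and not every language pair has a published model, so check the Hub before relying on a generated name.

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build a Helsinki-NLP style model ID from two language codes."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


def translation_task_name(source_lang, target_lang):
    """Build a pipeline task name such as 'translation_en_to_de'."""
    return f"translation_{source_lang}_to_{target_lang}"


for source, target in [("en", "de"), ("en", "fr")]:
    task = translation_task_name(source, target)
    model = opus_mt_model_name(source, target)
    print(f"{task} -> {model}")
```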

In [None]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

text = "The future of artificial intelligence is fascinating."

translation = translator(text)

print(f"Original (English): {text}")
print(f"Translated (German): {translation[0]['translation_text']}")

# Conclusion & Next Steps

Right on. You've successfully used the Hugging Face `pipeline` to perform seven different NLP tasks, plus a bonus:

1.  **Sentiment Analysis**: Classifying text as positive or negative.
2.  **Text Generation**: Creating new text from a prompt.
3.  **Zero-Shot Classification**: Categorizing text with custom labels on the fly.
4.  **Question Answering**: Extracting answers from a given context.
5.  **Fill-Mask**: Predicting missing words in a sentence.
6.  **Summarization**: Condensing long documents into key points.
7.  **Named Entity Recognition (NER)**: Identifying people, places, and organizations.
8.  **(Bonus) Translation**: Translating text between languages.

Definitely explore the [Hugging Face Hub](https://huggingface.co/models) to find thousands of other pre-trained models for any task imaginable.