#License and Attribution

This notebook was developed by Emilio Serrano, Full Professor at the Department of Artificial Intelligence, Universidad Polit√©cnica de Madrid (UPM), for educational purposes in UPM courses. Personal website: https://emilioserrano.faculty.bio/

üìò License: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)

You are free to: (1) Share ‚Äî copy and redistribute the material in any medium or format; (2) Adapt ‚Äî remix, transform, and build upon the material.

Under the following terms: (1) Attribution ‚Äî You must give appropriate credit, provide a link to the license, and indicate if changes were made; (2) NonCommercial ‚Äî You may not use the material for commercial purposes; (3) ShareAlike ‚Äî If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

üîó License details: https://creativecommons.org/licenses/by-nc-sa/4.0/

# Hugging Face

In this notebook, we'll explore the Hugging Face Transformers library ‚Äî one of the most powerful and user-friendly toolkits for modern Natural Language Processing (NLP) and generative AI. Our focus will be on using the high-level pipeline API, which allows you to perform complex tasks such as sentiment analysis, text generation, translation, summarization, and more with just a few lines of code.

You'll learn how to:

- Use the `pipeline()` function to apply pretrained models to a variety of NLP tasks

- Download and switch between top-performing models from the Hugging Face Model Hub

- Customize hyperparameters (e.g., temperature, max length) to control model behavior

- Identify key NLP task categories and match them with appropriate models

This hands-on introduction will help you become familiar with state-of-the-art models while building an understanding of what makes different tasks (classification, generation, question answering, etc.) unique. No deep ML coding required ‚Äî just curiosity and a few lines of Python!

Make sure to check out the official Hugging Face [documentation](https://huggingface.co/docs)  and  [course](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt) to go deeper.





# Installing the Transformers library of Hugging Face

You can run system commands by preceding them with the !

In [1]:
%pip install transformers
%pip install torch torchvision
import transformers

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


  from .autonotebook import tqdm as notebook_tqdm


In [None]:
from transformers import pipeline
import torch

# Check if MPS is available
print("MPS available:", torch.backends.mps.is_available())

# Pick device (MPS if available, else CPU)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print("Using device:", device)

MPS available: True
Using device: mps


# Pipeline function of the Transformers library


The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

There are three main steps involved when you pass some text to a pipeline:

*   The text is preprocessed into a format the model.
*   The preprocessed inputs are passed to the model.
*   The predictions of the model are post-processed, so you can make sense of them.

Some of the currently available pipelines are:
- `sentiment-analysis`: Classify the sentiment of a piece of text (e.g., positive, negative). Useful for analyzing opinions in reviews or social media.

- `zero-shot-classification`: Classify text into user-defined categories without any additional training. Great for flexible, on-the-fly topic classification.

- `text-generation``: Generate coherent, human-like text from a prompt using language models like GPT. Used in chatbots, creative writing, etc.

- `feature-extraction`: Convert text into vector embeddings. These numerical representations can be used for clustering, similarity search, or feeding into other ML models.

- `fill-mask`: Predict missing words in a sentence with a [MASK] token. Demonstrates how masked language models (like BERT) understand context.

- `ner`: Named Entity Recognition, Detect and classify named entities in text (like people, places, dates, organizations). Useful for information extraction.

- `question-answering`: Extract answers from a given context based on a natural language question. Often used in reading comprehension and knowledge retrieval.

- `summarization`: Produce a concise summary of a longer text while preserving key information. Ideal for news, reports, and document analysis.

- `translation`: Translate text between different languages using pretrained translation models.
    
    

## Sentiment analysis

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I really like the sentiment analysis problem")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9860339760780334}]

In [3]:
#with several sentences
classifier(["DL4NLP does works well", "DL4NLP does not work well"])

[{'label': 'POSITIVE', 'score': 0.9997263550758362},
 {'label': 'NEGATIVE', 'score': 0.9997612833976746}]

In [4]:
#in Spanish (remember that the automatically downloaded model is distilbert-base-uncased-finetuned-sst-2-english)
classifier(["El an√°lisis de sentimientos es entretenido.", "Odio el an√°lisis de sentimientos"])


[{'label': 'NEGATIVE', 'score': 0.6868619322776794},
 {'label': 'NEGATIVE', 'score': 0.988868236541748}]

## The model hub

The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task.

Go to the [Model Hub](https://huggingface.co/models) and  click on the corresponding tag on the left to display only the supported models for that task. You can refine your search for a model by clicking on the language tags, and pick a model that will generate text in another language. The Model Hub even contains *checkpoints* for multilingual models that support several languages.

Let us try a model for the "Text classification" task in "Spanish". Use the search box to find a model for "Sentiment Analysis". Check also number of downloads and "likes".


Note: a checkpoint is a model with the exact value of all its parameters.




In [6]:
classifier = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
classifier(["Me gusta buscar y descargar modelos de Hugging Face", "El problema del an√°lisis de sentimientos me parece aburrido"])


Device set to use mps:0


[{'label': '4 stars', 'score': 0.49561387300491333},
 {'label': '2 stars', 'score': 0.5113745331764221}]

In [5]:
classifier = pipeline("sentiment-analysis", model="finiteautomata/beto-sentiment-analysis")
classifier(["Me gusta buscar y descargar modelos de Hugging Face", "El problema del an√°lisis de sentimientos me parece aburrido"])


Device set to use mps:0


[{'label': 'POS', 'score': 0.9499416351318359},
 {'label': 'NEG', 'score': 0.9991242289543152}]

##Text generation

Now let‚Äôs see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text.




In [7]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("Natural Language Processing is ")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Natural Language Processing is \xa0an ongoing effort to understand and develop software with the same goal of providing a high level of abstraction in a language. This is done by using a set of tools, such as the.NET SDK, to build and test a set of languages.\nYou can download the project at github.\nHere are the project's various parts:\nThis project has a number of other components and tools that are available in the.NET Framework and Visual Basic.\nYou can find some of these in the documentation, and some of the documentation in the README.\nYou can also find some of these in the README.\nThe project has the following components:\nThis project includes a new tool called Visual Basic.NET Core (the latest version of the language), which is available for free. It is a collection of tools that help you write and test basic C# projects. The code for this tool is hosted on GitHub.\nThis project uses the.NET Framework and Visual Basic.NET Core. This includes the follow


Text generation involves randomness. Try several times for different results.

The pipeline also accepts parameters such as max_lenght and num_return_sequences

Try with another model from the [Model Hub](https://huggingface.co/models) (and check the "Hosted inference API" to try the model before downloading it).



In [8]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "Natural Language Processing is ",
    max_length=100,
    num_return_sequences=3,
)

Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'Natural Language Processing is vernacular for any language that is written by a human language, and the word is written using its own language.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'},
 {'generated_text': 'Natural Language Processing is vernacular, but it can be found in many languages, including Japanese, and in many languages and languages.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\

When selecting models from the Hugging Face Model Hub for text generation, be mindful of the number of parameters. Very large models (like LLaMA 13B+) can be too large for Google Colab (especially free-tier) and may lead to memory errors, slow execution, or crashes.

‚úÖ For smooth use in Colab (especially with limited RAM or no GPU), it is recommended:

- Choosing models with ‚â§ 500M parameters (e.g., distilgpt2, GPT2, opt-350m, mistralai/Mistral-7B-instruct on 8-bit quantized versions).

- You can filter by model size on the Model Hub using the ‚Äú# of parameters‚Äù tag.

- Look for keywords like ‚Äúdistil‚Äù, ‚Äútiny‚Äù, ‚Äúsmall‚Äù, or quantized versions when selecting a model for Colab.

If you have access to a paid Colab plan or a dedicated server with sufficient memory/GPU, you can use larger language models (LLMs) the same way ‚Äî the loading and usage process remains identical via the transformers library.

## Zero-shot Text Classification

You‚Äôve already seen how the model can classify a sentence as positive or negative using those two labels (positive and negative).

Now we need to classify texts that haven‚Äôt been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise.

For this use case, the zero-shot-classification pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don‚Äôt have to rely on the labels of the pretrained model.

**This is a great advance in out of the box tools for NLP!**

In [7]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This module is about the use of Deep Learning for Natural Language Processing ",
    candidate_labels=["education", "politics", "business"],
)


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps:0


{'sequence': 'This module is about the use of Deep Learning for Natural Language Processing ',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.45644259452819824, 0.3797537386417389, 0.1638035923242569]}

##Named entity recognition

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that involves identifying and classifying named entities in a text.  These entities can include people (PER), locations (LOC), organizations (ORG), dates, quantities, and more. Therefore, NER is a specialized type of word-level classification (or token classification).

For example, in the sentence: *Barack Obama was born in Hawaii* A NER model should recognize:

- "Barack Obama" ‚Üí Person (PER)

- "Hawaii" ‚Üí Location (LOC)

When using Hugging Face‚Äôs `pipeline` for NER, we often pass the argument `grouped_entities=True`. This option:

- Ensures that multi-token entities (like "New York City") are returned as a single grouped prediction, rather than separate predictions for each token.

- Helps make the output more readable and meaningful for downstream use.

Without grouped_entities=True, the pipeline might split "New York City" into three separate entities, even though they belong together.

In [10]:
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("Geoffrey Everest Hinton (born 6 December 1947) is a British-Canadian cognitive psychologist and computer scientist, most noted for his work on artificial neural networks. From 2013 to 2023, he divided his time working for Google (Google Brain) and the University of Toronto.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'entity_group': 'PER',
  'score': 0.99905187,
  'word': 'Geoffrey Everest Hinton',
  'start': 0,
  'end': 23},
 {'entity_group': 'MISC',
  'score': 0.99335974,
  'word': 'British',
  'start': 52,
  'end': 59},
 {'entity_group': 'MISC',
  'score': 0.9986369,
  'word': 'Canadian',
  'start': 60,
  'end': 68},
 {'entity_group': 'ORG',
  'score': 0.9985018,
  'word': 'Google',
  'start': 222,
  'end': 228},
 {'entity_group': 'ORG',
  'score': 0.9489272,
  'word': 'Google Brain',
  'start': 230,
  'end': 242},
 {'entity_group': 'ORG',
  'score': 0.996902,
  'word': 'University of Toronto',
  'start': 252,
  'end': 273}]

## Summarization
Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text.

In [11]:
summarizer = pipeline("summarization")
summarizer(
"""
Symbolic NLP (1950s ‚Äì early 1990s)
The premise of symbolic NLP is well-summarized by John Searle's Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it confronts.

1950s: The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[1] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted in America (though some research continued elsewhere, such as Japan and Europe[2]) until the late 1980s when the first statistical machine translation systems were developed.
1960s: Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?". Ross Quillian's successful work on natural language was demonstrated with a vocabulary of only twenty words, because that was all that would fit in a computer memory at the time.[3]
1970s: During the 1970s, many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, the first chatterbots were written (e.g., PARRY).
1980s: The 1980s and early 1990s mark the heyday of symbolic methods in NLP. Focus areas of the time included research on rule-based parsing (e.g., the development of HPSG as a computational operationalization of generative grammar), morphology (e.g., two-level morphology[4]), semantics (e.g., Lesk algorithm), reference (e.g., within Centering Theory[5]) and other areas of natural language understanding (e.g., in the Rhetorical Structure Theory). Other lines of research were continued, e.g., the development of chatterbots with Racter and Jabberwacky. An important development (that eventually led to the statistical turn in the 1990s) was the rising importance of quantitative evaluation in this period.[6]
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'summary_text': ' The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English . Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies . The 1980s and early 1990s mark the heyday of symbolic methods in NLP .'}]

## Question answering
The question-answering pipeline answers questions using information from a given contex.

In [13]:
from transformers import pipeline

question_answerer = pipeline("question-answering", model="deepset/roberta-base-squad2")
question_answerer(
    question="When does Geoffrey Everest Hinton worked at Google?",
    context="Geoffrey Everest Hinton (born 6 December 1947) is a British-Canadian cognitive psychologist and computer scientist, most noted for his work on artificial neural networks. From 2013 to 2023, he divided his time working for Google (Google Brain) and the University of Toronto.",
)

Device set to use mps:0


{'score': 0.827448271214962,
 'start': 176,
 'end': 188,
 'answer': '2013 to 2023'}

##Translation
Translation is one of the most historically important and challenging problems in Natural Language Processing (NLP). A default model if can be used when providing  a language pair in the task name (such as `translation_en_to_fr`), but the easiest way is to get the model you want to use on the [Model Hub](https://huggingface.co/models) after selecting a language.

Let us try English to Spanish.

**Note**: Check the models examples and API to find parameters. Import require libraries.

**Note 2**: sentencepiece is usually needed, if this error show up *"ValueError: This tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed in order to use this tokenizer."*, try *"!pip install sentencepiece"* and restart the kernel"

In [1]:
%pip install sentencepiece
import sentencepiece

Note: you may need to restart the kernel to use updated packages.


In [2]:
from transformers import pipeline
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
translator("What can I learn in Deep Learning for Natural Language Processing?")

  from .autonotebook import tqdm as notebook_tqdm
Device set to use mps:0


[{'translation_text': '¬øQu√© puedo aprender en el Aprendizaje Profundo para el Procesamiento Natural del Lenguaje?'}]

##Mask filling

Fill-mask is a classic NLP task where the model predicts missing words in a sentence, essentially ‚Äúfilling in the blanks.‚Äù

The `top_k` argument controls how many possibilities you want to be displayed.

Keep in mind that different models may use different mask tokens (e.g., `<mask>`), so it‚Äôs important to verify the correct mask token for each model you explore (you can check this in the model‚Äôs API widget).


In [3]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("Geoffrey Everest Hinton (born 6 December 1947) is a British-Canadian <mask>, most noted for his work on artificial neural networks.", top_k=5)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'score': 0.3618656396865845,
  'token': 33832,
  'token_str': ' physicist',
  'sequence': 'Geoffrey Everest Hinton (born 6 December 1947) is a British-Canadian physicist, most noted for his work on artificial neural networks.'},
 {'score': 0.33519676327705383,
  'token': 43027,
  'token_str': ' mathematician',
  'sequence': 'Geoffrey Everest Hinton (born 6 December 1947) is a British-Canadian mathematician, most noted for his work on artificial neural networks.'},
 {'score': 0.048829086124897,
  'token': 9744,
  'token_str': ' scientist',
  'sequence': 'Geoffrey Everest Hinton (born 6 December 1947) is a British-Canadian scientist, most noted for his work on artificial neural networks.'},
 {'score': 0.043861620128154755,
  'token': 5286,
  'token_str': ' academic',
  'sequence': 'Geoffrey Everest Hinton (born 6 December 1947) is a British-Canadian academic, most noted for his work on artificial neural networks.'},
 {'score': 0.04273000359535217,
  'token': 9338,
  'token_str': ' rese

##Bias
Pretrained language models like BERT learn patterns from vast amounts of text data, but this data often contains social biases.

As a result, when asked to fill in missing words, the model may reflect or even amplify stereotypes present in the training data.

Detecting and mitigating biases in AI is a very active area of research, aiming to make models more fair, ethical, and inclusive.





In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("Ta his man works as [MASK].")
print([r["token_str"] for r in result])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']


In [5]:
result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

['nurse', 'maid', 'teacher', 'waitress', 'prostitute']


In [8]:
classifier(
    "The president of Spain will be elected in the following months ",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'The president of Spain will be elected in the following months ',
 'labels': ['politics', 'business', 'education'],
 'scores': [0.9009101986885071, 0.07395102828741074, 0.025138691067695618]}

#Conclusions y Next Steps

In this notebook, we explored the powerful and user-friendly Hugging Face Transformers library, which enables quick access to state-of-the-art pretrained models for a variety of natural language processing tasks. We covered essential functionalities such as:

- Using the pipeline API to perform tasks like sentiment analysis, text generation, zero-shot classification, named entity recognition, summarization, question answering, translation, and mask filling.

- How to navigate the Model Hub to select and download pretrained models tailored to your needs.

- The importance of understanding and addressing biases present in pretrained models to ensure responsible AI use.

These tools dramatically simplify the process of integrating sophisticated language understanding and generation capabilities into your projects, even with minimal coding.

**Next Steps**

To deepen your mastery and apply these skills effectively, consider the following:

- Experiment with Fine-Tuning: Try fine-tuning pretrained models on your own datasets to tailor them to specific domains or tasks.

- Explore Advanced Pipelines: Look into more complex pipelines such as conversational AI, text-to-speech, or multi-modal models.

- Bias Mitigation: Dive deeper into research and techniques aimed at detecting and reducing bias in language models.

- Optimize for Deployment: Learn about model optimization, quantization, and serving models efficiently for real-world applications.

- Stay Updated:  the NLP and GenAI fields evolve rapidly ‚Äî regularly check the Hugging Face [documentation](https://huggingface.co/docs) and community resources to stay current.

In [9]:
import torch

print(torch.backends.mps.is_available())   # True if MPS is usable
print(torch.backends.mps.is_built()) 

True
True


In [None]:
%pip install transformers datasets numpy sentencepiece protobuf speechbrain soundfile librosa


Collecting datasets
  Downloading datasets-4.1.1-py3-none-any.whl.metadata (18 kB)
Collecting protobuf
  Using cached protobuf-6.32.1-cp39-abi3-macosx_10_9_universal2.whl.metadata (593 bytes)
Collecting speechbrain
  Downloading speechbrain-1.0.3-py3-none-any.whl.metadata (24 kB)
Collecting soundfile
  Downloading soundfile-0.13.1-py2.py3-none-macosx_11_0_arm64.whl.metadata (16 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-21.0.0-cp310-cp310-macosx_12_0_arm64.whl.metadata (3.3 kB)
Collecting dill<0.4.1,>=0.3.0 (from datasets)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting aiohttp!=4.0.0a0,!=4.0.0a1 (from fsspec[http]<=2025.9.0,>=2023.1.0->datasets)
  Downloading aiohttp-3.12.15-cp310-cp310-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting hyperpyyaml (from speechbrain)
  Downloading HyperPyYAML-1.2.2-py3-none-any.whl.meta

In [16]:
import torch
from transformers import AutoFeatureExtractor, AutoModelForSeq2SeqLM
from speechbrain.pretrained import EncoderDecoderASR
import soundfile as sf
import librosa


# For the Speech-Conversion model
from speechbrain.pretrained import SpeakerRecognition  # for speaker embeddings maybe, or use the ones bundled

# Model ID
MODEL_ID = "Amirhossein75/Speech-Conversion"

# Helper: pick device
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print("Using device:", device)

# Load model & processor
from transformers import AutoModelForSpeechSeq2Seq, AutoTokenizer

# The speech conversion model is encoder-decoder + vocoder + speaker embedding
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_ID).to(device)
processor = AutoFeatureExtractor.from_pretrained(MODEL_ID)  # or equivalent if different name

# Load source utterance (content) and reference (target speaker audio)
src_audio, src_sr = librosa.load("audio/yo.wav", sr=16000)  # forces 16kHz
ref_audio, ref_sr = librosa.load("audio/clau.wav", sr=16000)  # forces 16kHz


# If needed: resample to 16000 Hz, mono etc
# e.g. use librosa or torchaudio for resampling if necessary

# Prepare inputs
# Adapted from convert_once.py in the model repo
inputs = processor(src_audio, sampling_rate=src_sr, return_tensors="pt").input_values.to(device)
ref = processor(ref_audio, sampling_rate=ref_sr, return_tensors="pt").input_values.to(device)

# The model likely expects something like:
outputs = model.generate(speech=inputs, reference=ref)

# The output might be raw waveform, or representation to pass into vocoder
# If needed, run HiFiGAN vocoder
# e.g. vocoder = AutoModel.from_pretrained("microsoft/speecht5_hifigan") ...
# vocoder(...) ‚Üí waveform

# Save output
sf.write("converted.wav", outputs.cpu().numpy(), samplerate=16000)


Using device: mps


Some weights of SpeechT5ForSpeechToText were not initialized from the model checkpoint at Amirhossein75/Speech-Conversion and are newly initialized: ['speecht5.decoder.prenet.embed_tokens.weight', 'text_decoder_postnet.lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ValueError: The following `model_kwargs` are not used by the model: ['speech', 'reference'] (note: typos in the generate arguments will also show up in this list)

In [18]:
import torch
import librosa
import soundfile as sf

from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan
from speechbrain.pretrained import EncoderClassifier

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print("Using device:", device)

# 1. Load models
processor = SpeechT5Processor.from_pretrained("Amirhossein75/Speech-Conversion")
model = SpeechT5ForSpeechToSpeech.from_pretrained("Amirhossein75/Speech-Conversion").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

# 2. Load audios (force 16kHz)
src_audio, _ = librosa.load("audio/yo.wav", sr=16000)
ref_audio, _ = librosa.load("audio/clau.wav", sr=16000)

# 3. Encode source audio with processor
inputs = processor(audio=src_audio, sampling_rate=16000, return_tensors="pt").to(device)

# 4. Extract speaker embedding from reference audio
spkrec = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    run_opts={"device": "cpu"}
)
embedding = spkrec.encode_batch(torch.tensor(ref_audio).unsqueeze(0))
speaker_embeddings = embedding.to(device)

# 5. Convert speech
with torch.no_grad():
    speech = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)

# 6. Save result
sf.write("audio/converted.wav", speech.cpu().numpy(), samplerate=16000)

print("‚úÖ Saved converted audio to audio/converted.wav")


Using device: mps


  available_backends = torchaudio.list_audio_backends()
  wrapped_fwd = torch.cuda.amp.custom_fwd(fwd, cast_inputs=cast_inputs)


RuntimeError: expand(MPSFloatType{[1, 1, 1, 512]}, size=[-1, 1, -1]): the number of sizes provided (3) must be greater or equal to the number of dimensions in the tensor (4)