# Hugging Face

## What is Hugging Face and how does it differ from other ML libraries?

Hugging Face provides:
- Pre-trained transformer models for NLP/CV/Audio
- Model Hub for sharing and discovering models
- Datasets library for ML datasets
- Tools for training and fine-tuning models
- Inference API and model deployment
- AutoML capabilities with AutoTrain
- Spaces for ML app deployment

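Key differences from other frameworks:
- Focus on transformer architectures
- One of the largest public collections of pre-trained models
- Community features for sharing models, datasets, and demos
- A consistent API across model architectures (`from_pretrained`, the `Auto*` classes)
- Fine-tuning workflows that need little boilerplate
- Deployment options integrated with the Hub (Inference API, Spaces)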

## What kinds of models are available?

- Multimodal models 
    - Audio-Text-to-Text, Image-Text-to-Text, Visual Question Answering, Document Question Answering, Video-Text-to-Text, Any-to-Any
- Computer Vision models
    - Depth Estimation, Image Classification, Object Detection, Image Segmentation, Text-to-Image, Image-to-Text, Image-to-Image, Image-to-Video, Unconditional Image Generation, Video Classification, Text-to-Video, Zero-Shot Image Classification, Mask Generation, Zero-Shot Object Detection, Text-to-3D, Image-to-3D, Image Feature Extraction, Keypoint Detection
- Natural Language Processing models
    - Text Classification, Token Classification, Table Question Answering, Question Answering, Zero-Shot Classification, Translation, Summarization, Feature Extraction, Text Generation, Text2Text Generation, Fill-Mask, Sentence Similarity
- Audio models
    - Text-to-Speech, Text-to-Audio, Automatic Speech Recognition, Audio-to-Audio, Audio Classification, Voice Activity Detection
- Tabular models
    - Tabular Classification, Tabular Regression, Time Series Forecasting
- Reinforcement Learning models
    - Reinforcement Learning, Robotics
- Other models
    - Graph Machine Learning

## How do you install this?

`!pip install transformers datasets evaluate accelerate torch sentencepiece sacremoses -U`

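- transformers -> pre-trained models, tokenizers, and pipelines
- datasets -> loading and processing datasets
- evaluate -> metrics for evaluating models
- accelerate -> device placement and distributed training
- torch -> PyTorch, the deep learning backend used in this notebook
- sentencepiece -> subword tokenizer required by some models, such as the Helsinki-NLP translation model used below
- sacremoses -> text normalization and tokenization utilities that the same translation tokenizer can use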

In [1]:
# !pip install transformers datasets evaluate accelerate torch sentencepiece sacremoses -U
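
After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; `PACKAGES` and `installed_versions` are illustrative names, not part of any Hugging Face API.

```python
from importlib import metadata

# Package names taken from the install command above.
PACKAGES = ["transformers", "datasets", "evaluate", "accelerate",
            "torch", "sentencepiece", "sacremoses"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Recording these versions alongside a notebook makes it easier to reproduce its outputs later.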

## How can you use pipelines?

Pipelines are a high-level API for running pre-trained models on common tasks with minimal code. You pick a task and, optionally, a model; the pipeline then handles tokenization, inference, and post-processing.

Common arguments include `model="distilgpt2"` (which checkpoint to load) and, for text generation, `max_length=20` (maximum total length in tokens, prompt included) and `num_return_sequences=2` (how many candidate texts to return).

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis") # distilbert-base-uncased-finetuned-sst-2-english model is used by default
print(classifier(["I love Transformers!", "I hate bugs."]))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998069405555725}, {'label': 'NEGATIVE', 'score': 0.9967179894447327}]
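
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model id and revision hash printed in that log message; `PINNED` and `build_classifier` are illustrative names, not part of the library.

```python
# Model id and revision copied from the log message printed above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def build_classifier():
    """Create the sentiment pipeline with a fixed model and revision."""
    # Imported lazily so the settings above can be inspected
    # without loading the transformers library.
    from transformers import pipeline
    return pipeline(**PINNED)
```

Calling `build_classifier()` downloads the pinned checkpoint on first use and returns an object that is used exactly like `classifier` above.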


In [3]:
zero_shot = pipeline("zero-shot-classification")
print(zero_shot("This is a course about NLP models.", candidate_labels=["education", "technology", "sports"]))

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about NLP models.', 'labels': ['technology', 'education', 'sports'], 'scores': [0.9376305937767029, 0.05547419190406799, 0.006895218510180712]}
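
The zero-shot result is a plain dictionary whose `labels` and `scores` lists are aligned by position. A small sketch of pairing them up, using the values printed above; `label_scores` and `top_label` are illustrative names.

```python
# Result dict copied from the zero-shot output above.
result = {
    "sequence": "This is a course about NLP models.",
    "labels": ["technology", "education", "sports"],
    "scores": [0.9376305937767029, 0.05547419190406799, 0.006895218510180712],
}

# "labels" and "scores" are parallel lists, so zip pairs each label with its score.
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the best label by score rather than relying on the list order.
top_label = max(label_scores, key=label_scores.get)
print(top_label, round(label_scores[top_label], 3))  # technology 0.938
```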


In [4]:
generator = pipeline("text-generation", model="distilgpt2")
print(generator("Transformers are great for", max_length=20, num_return_sequences=2))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'Transformers are great for a variety of needs and you can find thousands of them out of the closet'}, {'generated_text': 'Transformers are great for creating exciting, entertaining content for your visitors, who are looking for more entertaining'}]


In [None]:
unmasker = pipeline("fill-mask", model='distilroberta-base')
print(unmasker("Hugging Face is a <mask> library.", top_k=2))

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.176646888256073, 'token': 481, 'token_str': ' free', 'sequence': 'Hugging Face is a free library.'}, {'score': 0.07091348618268967, 'token': 285, 'token_str': ' public', 'sequence': 'Hugging Face is a public library.'}]


In [6]:
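ner = pipeline("ner", aggregation_strategy="simple") # Named Entity Recognition; merges sub-word tokens into whole entities (replaces the deprecated grouped_entities=True)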
print(ner("Hugging Face is based in New York City."))

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'ORG', 'score': 0.8907568, 'word': 'Hugging Face', 'start': 0, 'end': 12}, {'entity_group': 'LOC', 'score': 0.9991805, 'word': 'New York City', 'start': 25, 'end': 38}]
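
The `start` and `end` values in the NER output are character offsets into the input string, so they can be used to slice or mark up the original text. A minimal sketch using the entities printed above; `annotate` is a hypothetical helper, not a library function.

```python
text = "Hugging Face is based in New York City."

# Entities copied from the NER output above (scores omitted for brevity).
entities = [
    {"entity_group": "ORG", "word": "Hugging Face", "start": 0, "end": 12},
    {"entity_group": "LOC", "word": "New York City", "start": 25, "end": 38},
]


def annotate(text, entities):
    """Wrap each entity span in brackets followed by its label."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        # Copy the untouched text between the previous entity and this one.
        pieces.append(text[cursor:ent["start"]])
        span = text[ent["start"]:ent["end"]]
        pieces.append(f"[{span}]({ent['entity_group']})")
        cursor = ent["end"]
    pieces.append(text[cursor:])  # trailing text after the last entity
    return "".join(pieces)


print(annotate(text, entities))
# [Hugging Face](ORG) is based in [New York City](LOC).
```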


In [7]:
qa = pipeline("question-answering")
print(qa(question="Where is Hugging Face based?", context="Hugging Face is based in New York City."))

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9694607853889465, 'start': 25, 'end': 38, 'answer': 'New York City'}


In [8]:
summarizer = pipeline("summarization")
# max_length=10 tokens is very short, so the summary below is cut off mid-word
print(summarizer("Hugging Face creates tools for NLP. These tools are widely used in AI and ML applications.", min_length=6, max_length=10))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Hugging Face creates tools for N'}]


In [9]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(translator("Hugging Face est une bibliothèque populaire pour le NLP."))

[{'translation_text': 'Hugging Face is a popular library for the NLP.'}]


## How do you load and use pre-trained models?

In [None]:
from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForQuestionAnswering,
    pipeline
)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

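# Basic tokenization and inference
inputs = tokenizer("Hello, world!", return_tensors="pt")  # dict of PyTorch tensors (input_ids, attention_mask, ...)
outputs = model(**inputs)  # outputs.last_hidden_state has shape (batch, tokens, hidden_size)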

# Using pipelines (high-level API)
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie!")

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
entities = ner("My name is Sarah and I live in London")

qa = pipeline("question-answering")
result = qa(question="Who was Jim Henson?",
           context="Jim Henson was a puppeteer")

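# Task-specific models: the pre-trained BERT encoder plus a task head.
# 'bert-base-uncased' does not include these heads, so they are randomly
# initialized here and need fine-tuning before their predictions are useful.
seq_clf_model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # number of output classes
)

ner_model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=9  # number of entity tags predicted per token
)

qa_model = AutoModelForQuestionAnswering.from_pretrained(
    'bert-base-uncased'
)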
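
Unlike a pipeline, a task-specific classification model returns raw logits (`outputs.logits`) rather than label/score dictionaries. The sketch below shows, in plain Python, the kind of conversion a classification pipeline performs for single-label tasks. The logits and the `id2label` mapping are hypothetical values chosen for illustration, not the output of any model above.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one input and a hypothetical two-label mapping.
logits = [-2.1, 3.4]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```

With a real model, the logits would come from `outputs.logits` and the mapping from `model.config.id2label`.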