
Hugging Face Datasets (datasets library)

Pretrained Pipelines (like sentiment analysis, NER, etc.)

Fine-tuning models (we’ll explore this a bit later)

Now let’s move into real use cases using Hugging Face pipelines, starting with sentiment analysis.



In [1]:
!pip install transformers datasets


Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.wh

In [2]:
from transformers import pipeline

# Load sentiment analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# Analyze a few sample texts
texts = [
    "I love the Hugging Face library!",
    "This product is terrible. I'm very disappointed.",
    "The weather is okay today."
]

results = sentiment_analyzer(texts)

for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Label: {result['label']}, Score: {result['score']:.4f}")
    print()


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Device set to use cpu


Text: I love the Hugging Face library!
Label: POSITIVE, Score: 0.9999

Text: This product is terrible. I'm very disappointed.
Label: NEGATIVE, Score: 0.9998

Text: The weather is okay today.
Label: POSITIVE, Score: 0.9998
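
The warning in the output above recommends naming a model and revision explicitly rather than relying on the task default. A minimal sketch of doing that, reusing the checkpoint name and revision printed in that warning; the constant names and the helper `load_sentiment_analyzer` are hypothetical and exist only for illustration:

```python
# Checkpoint and revision copied from the pipeline warning shown above.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def load_sentiment_analyzer(model_id=MODEL_ID, revision=REVISION):
    """Hypothetical helper: build a sentiment pipeline pinned to one checkpoint."""
    # Imported here so that defining the helper does not load the library.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_id, revision=revision)


# Example call (downloads the checkpoint on first use):
# sentiment_analyzer = load_sentiment_analyzer()
# print(sentiment_analyzer("I love the Hugging Face library!"))
```

Pinning both values means a later change to the task default should not silently change which model produces the results.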



Does tokenizing and embedding the input text happen implicitly?
I don't see an AutoTokenizer instantiated or its object used.


ChatGPT said:
Great observation! 🤓 Let's clarify what's happening under the hood.

✅ Yes, it does happen implicitly.
When you use this line:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

You’re not directly instantiating a tokenizer or model, but the pipeline does it for you behind the scenes. Here's what it does implicitly:

⚙️ What Happens Under the Hood
Model & Tokenizer Selection:

If you don’t specify a model, Hugging Face uses a default for the task.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Roughly what the pipeline loads for you (the default checkpoint named in the log output above)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Tokenization:

When you run:


classifier("I love Hugging Face!")

the pipeline internally does:

inputs = tokenizer("I love Hugging Face!", return_tensors="pt")

Embedding & Forward Pass:

Passes tokenized inputs to the model.

Gets logits (raw scores).

Applies softmax to get probabilities.

Selects the class with the highest score.
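
The last two steps above (softmax, then picking the highest-scoring class) can be sketched in plain Python without loading any model. The logits below are made-up numbers rather than real model output, and the label order in `id2label` is an assumption for illustration:

```python
import math


def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single input text (not real model output).
logits = [-4.2, 4.5]

# Assumed mapping from class index to label name.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)

# Index of the largest probability, i.e. the predicted class.
best = max(range(len(probs)), key=probs.__getitem__)

result = {"label": id2label[best], "score": probs[best]}
print(result)
```

The resulting dictionary has the same `label` and `score` keys that the earlier loop reads from each pipeline result.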


🔍 If You Want to See It Explicitly:
To take control, you can write the pipeline manually like this:


from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

text = "I love Hugging Face!"

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

probs = F.softmax(outputs.logits, dim=-1)
print(probs)
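
The manual version above stops at raw probabilities for a single text. A possible extension, sketched here under assumptions, handles a list of texts and maps each prediction to a label name through `model.config.id2label`. The helper name `classify` is hypothetical, and the `padding` and `truncation` settings are this sketch's choices, not necessarily what the pipeline uses internally:

```python
def classify(texts, model_id="distilbert-base-uncased-finetuned-sst-2-english"):
    """Hypothetical helper: return one {"label", "score"} dict per input text."""
    # Imported here so that defining the helper does not load the libraries.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # padding=True pads shorter texts so the batch forms one rectangular tensor;
    # truncation=True cuts texts that exceed the model's maximum length.
    inputs = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    probs = torch.softmax(logits, dim=-1)
    # Highest probability and its class index for every text in the batch.
    scores, class_ids = probs.max(dim=-1)

    return [
        {"label": model.config.id2label[int(i)], "score": float(s)}
        for i, s in zip(class_ids, scores)
    ]


# Example call (downloads the checkpoint on first use):
# for row in classify(["I love the Hugging Face library!", "The weather is okay today."]):
#     print(row)
```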
