<a href="https://colab.research.google.com/github/radve88/Learning-AI/blob/main/Huggingface_Lessons.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Excavating Hugging Face Transformers and BERT with ChatGPT's help

Hugging Face is primarily focused on deep learning models, especially transformers for tasks in NLP, vision, audio, and more. But let’s break it down:

1. For Traditional Machine Learning
You’ll still typically use scikit-learn, XGBoost, LightGBM, or CatBoost for things like:

Logistic regression

Decision trees

Random forests

SVMs

Clustering

Hugging Face doesn't aim to replace those — it complements them.

2. Where Hugging Face Shines
When you're dealing with:

Text classification or generation

Translation

Named entity recognition

Question answering

Summarization

Vision + language (e.g., CLIP, BLIP)

Audio + text (e.g., Whisper, Wav2Vec2)

It provides a unified, plug-and-play interface to state-of-the-art models through its transformers, datasets, and evaluate libraries.

3. Blending the Two
You can combine both worlds:

Use transformers from Hugging Face to extract features (embeddings).

Feed those features into sklearn models (like random forests or SVMs).

This works well when you want to add text-derived features to a model trained on structured (tabular) data.
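
One way to picture the blend is to place the embedding columns next to the structured columns and train a single scikit-learn model on the combined matrix. The sketch below is only an illustration under stated assumptions: the "embeddings" are random numbers standing in for transformer output, the two structured columns are hypothetical, and the labels are random, so the fitted model has no real meaning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

n_samples, hidden_size = 40, 768
# Stand-in for transformer embeddings (random numbers, NOT real BERT output)
text_embeddings = rng.normal(size=(n_samples, hidden_size))
# Hypothetical structured columns, e.g. a price and a rating
structured = rng.normal(size=(n_samples, 2))
# Random labels, used only so that fit() has something to learn from
labels = rng.integers(0, 2, size=n_samples)

# One row per sample: embedding columns followed by the structured columns
features = np.hstack([text_embeddings, structured])

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, labels)

print(features.shape)                    # (40, 770)
print(clf.predict(features[:3]).shape)   # (3,)
```

In a real project the random arrays would be replaced by actual embeddings and actual table columns, aligned row by row.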

So, yes — for traditional ML, stick to scikit-learn or related libraries, and use Hugging Face when you need state-of-the-art deep learning.

Want me to show you an example of combining Hugging Face transformers with scikit-learn?

You said:
Yes I will run it

ChatGPT said:
Awesome! Here’s a simple example where we:

Use Hugging Face to get BERT embeddings for text.

Use scikit-learn to train a logistic regression classifier on those embeddings.

Step-by-step example: Sentiment classification (positive/negative)

In [1]:
# Install required packages
!pip install transformers scikit-learn -q


Let's break these two lines down:


cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
embeddings.append(cls_embedding)
This is happening inside a loop that processes each text sample with a Hugging Face transformer model (like BERT or its variants). Here’s a breakdown of what’s happening:

🧠 Context: What is outputs.last_hidden_state?
After you tokenize the input and pass it through the model (model(**inputs)), the result is an outputs object.
One of its key attributes is:


outputs.last_hidden_state
This has shape:

(batch_size, sequence_length, hidden_size)
For example, if you're processing one sentence, and your model has:

sequence_length = 10 tokens (after tokenization),

hidden_size = 768 (standard for BERT), then:


outputs.last_hidden_state.shape = (1, 10, 768)
📌 The [CLS] Token
BERT-style tokenizers prepend a special [CLS] token to every input sequence.
Its final hidden state is commonly used as a summary of the whole sequence, although in a model that has not been fine-tuned it is only a rough one.

So the 0th position (i.e., [:, 0, :]) in the output corresponds to the [CLS] embedding for each sentence.
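
The indexing can be checked without loading a model. The sketch below uses a NumPy array as a stand-in for outputs.last_hidden_state (PyTorch tensors accept the same [:, 0, :] slice). The array is filled so that each token vector holds its own position index, which is an assumption made purely to make the selection visible.

```python
import numpy as np

batch_size, sequence_length, hidden_size = 2, 10, 768

# Stand-in for outputs.last_hidden_state: every token vector is filled with
# its own position index (real hidden states are learned floats).
positions = np.arange(sequence_length).reshape(1, sequence_length, 1)
last_hidden_state = np.broadcast_to(
    positions, (batch_size, sequence_length, hidden_size)
).astype(np.float32)

# Position 0 of every sentence in the batch, i.e. the [CLS] slot
cls_vectors = last_hidden_state[:, 0, :]

print(last_hidden_state.shape)  # (2, 10, 768)
print(cls_vectors.shape)        # (2, 768)
print(np.unique(cls_vectors))   # [0.]  only position-0 values were selected
```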

🔍 Now this part:

cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
Step by step:

outputs.last_hidden_state[:, 0, :]

Take the [CLS] token embedding for each sentence in the batch.

Since you're likely running a loop over one sentence at a time, this gives shape (1, hidden_size).

.squeeze()

Removes every dimension of size 1, which here is the batch dimension because there is only one example.

Resulting shape: (hidden_size,), i.e., a single vector for the sentence.

.numpy()

Converts the tensor to a NumPy array so you can append and work with it like traditional ML features.

embeddings.append(cls_embedding)

Add the resulting sentence embedding to your list of all sentence embeddings.
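
The steps above can be traced on a dummy tensor. This is a minimal sketch that assumes PyTorch is installed and uses random values in place of real model output. It also shows what happens if more than one sentence is passed at once: there is no size-1 dimension left, so squeeze() changes nothing.

```python
import torch

hidden_size = 768

# Random stand-ins for outputs.last_hidden_state (not real model output)
one_sentence = torch.randn(1, 10, hidden_size)
three_sentences = torch.randn(3, 10, hidden_size)

cls_one = one_sentence[:, 0, :]        # shape (1, 768)
vector = cls_one.squeeze().numpy()     # shape (768,), now a NumPy array

# With a batch larger than one there is no size-1 dimension to drop,
# so squeeze() leaves the shape unchanged.
cls_three = three_sentences[:, 0, :].squeeze().numpy()

print(tuple(cls_one.shape))   # (1, 768)
print(vector.shape)           # (768,)
print(cls_three.shape)        # (3, 768)
```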

🧠 Why this matters
This sentence vector (from [CLS]) is a compressed semantic representation of the whole sentence, and is often used as a feature vector for traditional ML models like:

Logistic regression

SVM

Random forest
Which is what you're probably doing in this hybrid Hugging Face + scikit-learn example.








In [3]:


# Imports
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

# Sample dataset (you can replace this with a real one)
texts = [
    "I love this movie!", "This film was terrible.",
    "Absolutely fantastic acting!", "I hated the plot.",
    "What a great experience.", "Worst movie ever made."
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Load pretrained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Switch to evaluation mode (disables dropout); gradients are turned off
# separately by torch.no_grad() inside get_embeddings
model.eval()

# Function to get embeddings
def get_embeddings(texts):
    with torch.no_grad():
        embeddings = []
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
            outputs = model(**inputs)
            # Use [CLS] token's embedding as sentence representation
            cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
            embeddings.append(cls_embedding)
        return np.array(embeddings)

# Extract features
X = get_embeddings(texts)
y = np.array(labels)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
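
The report above rests on only two test sentences (support of 1 per class), so the perfect scores say little about how the classifier would generalize. With a larger dataset, stratified cross-validation gives a steadier estimate. The sketch below is an assumption-laden illustration: X is a synthetic stand-in for the output of get_embeddings, built from two Gaussian clusters rather than real BERT vectors, and the choice of 5 folds is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for X = get_embeddings(texts): two classes whose
# feature means differ slightly. These are NOT real BERT embeddings.
n_per_class, hidden_size = 30, 768
X = np.vstack([
    rng.normal(loc=0.0, size=(n_per_class, hidden_size)),
    rng.normal(loc=0.3, size=(n_per_class, hidden_size)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Each fold keeps the same class balance as the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores.shape)    # (5,)  one accuracy value per fold
print(scores.mean())
```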



In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("Saavy is a beautiful boutique with elegant fashion!")
print(result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998558759689331}]
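
The log above warns against relying on the default model. A minimal sketch of pinning it follows, using the model name and revision that the log itself reports. The helper names build_classifier and to_binary_label are hypothetical, and the 1/0 mapping simply mirrors the label convention of the earlier logistic regression example. The import sits inside the function so that the parsing helper can be tried without downloading anything.

```python
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"  # revision reported in the pipeline log


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported here so that nothing is loaded until a classifier is needed
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)


def to_binary_label(result):
    """Map pipeline output to 1 (POSITIVE) or 0 (anything else)."""
    return [1 if item["label"] == "POSITIVE" else 0 for item in result]


# Output copied from the run above, used here instead of calling the model
sample_output = [{'label': 'POSITIVE', 'score': 0.9998558759689331}]
print(to_binary_label(sample_output))  # [1]
```

Calling build_classifier() downloads the model on first use. After that, to_binary_label(build_classifier()("some text")) would give labels in the same 1/0 form as the scikit-learn example.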


