<a href="https://colab.research.google.com/github/swalehaparvin/HuggingFace/blob/main/Hugging_face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sentiment analysis using a pre-trained DistilBERT model fine-tuned on the SST-2 dataset

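This notebook loads a tokenizer and a model that were fine-tuned specifically for sentiment classification. The checkpoint achieves a reported 91.3% accuracy on the SST-2 validation set.

The model assigns each input one of two labels, POSITIVE or NEGATIVE. Because there are only two classes, the probability of the predicted label is always at least 0.5.

The first code cell works as follows:

- The tokenizer converts the input text into a model-readable format; return_tensors="pt" requests PyTorch tensors.
- torch.no_grad() disables gradient calculation, which saves memory and time during inference.
- The model generates one raw prediction score (logit) per class.
- argmax selects the index of the highest logit.
- model.config.id2label maps that numerical ID to a human-readable label.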


In [None]:
!pip install transformers torch
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Load the tokenizer and model fine-tuned for sentiment classification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
model = DistilBertForSequenceClassification.from_pretrained(checkpoint)

# Tokenize the input and return PyTorch tensors
inputs = tokenizer("Hello, I like America", return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the class with the highest logit and map its ID to a label
predicted_class_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class_id])


In [None]:
!pip install transformers
from transformers import pipeline

# Checking dataset size
Load the dataset's metadata without downloading the data itself, then calculate and display the dataset size in megabytes.
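
The conversion itself is simple arithmetic. The sketch below uses a hypothetical helper, bytes_to_mb, that is not part of the datasets library. It also guards against a missing size, on the assumption that the metadata field can be empty for some datasets.

```python
def bytes_to_mb(num_bytes, ndigits=2):
    """Convert a byte count to megabytes, taking 1 MB as 1024 ** 2 bytes.

    Returns None when the size is unknown (hypothetical convention).
    """
    if num_bytes is None:
        return None
    return round(num_bytes / (1024 ** 2), ndigits)


print(bytes_to_mb(1_572_864))  # 1.5
print(bytes_to_mb(None))       # None
```

In the notebook, the helper could be called as bytes_to_mb(builder.info.dataset_size), where builder is whatever load_dataset_builder returned.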

In [None]:
!pip install datasets
# Import the function to load dataset metadata
from datasets import load_dataset_builder

# Initialize the dataset builder for the MMLU-Pro dataset
mmlu_builder = load_dataset_builder("TIGER-Lab/MMLU-Pro")

# Display the dataset metadata
print(mmlu_builder.info)

# Calculate and print the dataset size in MB (1 MB taken as 1024 ** 2 bytes)
dataset_size_mb = mmlu_builder.info.dataset_size / (1024 ** 2)
print(f"Dataset size: {round(dataset_size_mb, 2)} MB")


# Manipulating datasets
You will often need to manipulate a dataset before using it in an ML task. Two common manipulations are filtering and selecting (or slicing). The code cell in this section loads the English Wikipedia dataset as wikipedia and then:

- Filters the dataset for rows with the term "football" in the text column and saves the result as filtered.
- Selects a single example from the filtered dataset and saves it as example.
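
The filtering and selecting logic can be sketched on a plain list of dictionaries, with no download required. The rows below are made up, and mentions is a hypothetical helper that builds a case-insensitive row predicate. A plain "football" in row["text"] check is case-sensitive and would miss "Football".

```python
def mentions(term, column="text"):
    """Build a row predicate: True when `term` occurs in row[column], ignoring case."""
    needle = term.lower()

    def predicate(row):
        return needle in row[column].lower()

    return predicate


# Made-up rows standing in for Wikipedia articles.
rows = [
    {"text": "Football is played by two teams of eleven players."},
    {"text": "The violin is a bowed string instrument."},
    {"text": "The club's football stadium opened in 1923."},
]

is_football = mentions("football")

# Filtering keeps every row for which the predicate returns True.
filtered_rows = [row for row in rows if is_football(row)]

# Selecting picks rows by position; here, only the first match.
example_rows = filtered_rows[:1]

print(len(filtered_rows), example_rows[0]["text"])
```

The same predicate should work with the real dataset as wikipedia.filter(mentions("football")), since filter passes each row to the function as a dictionary by default.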

In [None]:
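from datasets import load_dataset

# Load the English Wikipedia dump (a large download, so this can take a while)
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")

# Keep only the documents whose text contains "football" (case-sensitive)
filtered = wikipedia.filter(lambda row: "football" in row["text"])

# Select the first matching document
example = filtered.select(range(1))
print(example[0]["text"])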

# Grammatical correctness
Text classification is the process of assigning an input text to a predefined category. Examples include sentiment analysis (positive or negative), spam detection (spam or not spam), and even checking for grammatical errors.
Explore the use of a text-classification pipeline for checking an input sentence for grammatical errors.


In [None]:
# Create a pipeline for grammar checking
grammar_checker = pipeline(
  task="text-classification",
  model="abdulmatinomotoso/English_Grammar_Checker"
)
# Check grammar of the input text
output = grammar_checker("I will walk dog")
print(output)

# Question Natural Language Inference
Another task under the text classification umbrella is Question Natural Language Inference, or QNLI. This checks if a premise contains enough information to answer a posed question, determining whether the answer can be found in the given text.
Create a text classification QNLI pipeline using the model "cross-encoder/qnli-electra-base" and save as classifier.
Use this classifier to determine if the text provides enough information to answer the question.
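
A cross-encoder scores a question and a passage as a pair, so it helps to keep the two texts separate rather than gluing them into one string. The sketch below only builds the inputs. It assumes the text-classification pipeline accepts dictionaries with text and text_pair keys; make_pair is a hypothetical helper and the second passage is made up.

```python
def make_pair(question, passage):
    """Package a question and a candidate passage as one text-pair input."""
    return {"text": question, "text_pair": passage}


question = "Where is the capital of France?"
passages = [
    "Brittany is known for its stunning coastline.",
    "Paris is the capital and largest city of France.",  # made-up second passage
]

# One input per (question, passage) pair.
pair_inputs = [make_pair(question, passage) for passage in passages]

for pair in pair_inputs:
    print(pair)
```

Each dictionary, or the whole list, could then be passed to the classifier in place of a single concatenated string.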

In [None]:
# Create the pipeline
classifier = pipeline(task="text-classification", model="cross-encoder/qnli-electra-base")
# Predict the output, passing the question and the passage as a text pair
output = classifier({
    "text": "Where is the capital of France?",
    "text_pair": "Brittany is known for its stunning coastline.",
})
print(output)

# Dynamic category assignment
Dynamic category assignment, also known as zero-shot classification, enables a model to classify text into categories supplied at prediction time, even though it was never trained on those categories.

- Build the pipeline and save it as classifier.
- Create a list of the labels "politics", "science", and "sports", and save it as categories.
- Predict the label of text using the classifier and the predefined categories.
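
The zero-shot pipeline returns a dictionary whose labels and scores lists are aligned and sorted from most to least likely. The sketch below works on a hand-written stand-in for such a result. The scores are made up, and labels_above is a hypothetical helper.

```python
# Hand-written stand-in for a zero-shot result; the scores are made up.
sample_output = {
    "sequence": "AI-powered robots assist in complex brain surgeries with precision.",
    "labels": ["science", "sports", "politics"],
    "scores": [0.91, 0.05, 0.04],
}


def labels_above(output, threshold):
    """Return (label, score) pairs whose score reaches the threshold."""
    return [
        (label, score)
        for label, score in zip(output["labels"], output["scores"])
        if score >= threshold
    ]


# The first entry is the top prediction because the lists are sorted by score.
top_label = sample_output["labels"][0]
top_score = sample_output["scores"][0]
confident = labels_above(sample_output, 0.5)

print(top_label, top_score, confident)
```

A threshold like this is useful when none of the candidate categories fits well and a low-scoring top label should be treated as "no match".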

In [None]:
text = "AI-powered robots assist in complex brain surgeries with precision."
# Create the pipeline
classifier = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")
# Create the categories list
categories = ["politics", "science", "sports"]
# Predict the output
output = classifier(text, categories)
# Print the top label and its score
print(f"Top Label: {output['labels'][0]} with score: {output['scores'][0]}")

# Summarizing long text
Summarization reduces large text into manageable content, helping readers quickly grasp key points from lengthy articles or documents.
There are two main types: extractive, which selects key sentences from the original text, and abstractive, which generates new sentences summarizing main ideas.
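
To illustrate the extractive approach, here is a toy summarizer in plain Python. It scores each sentence by the average frequency of its words and keeps the top-scoring sentences unchanged. This is a simplified sketch for illustration only, not what the pipeline in this notebook does (that model is abstractive), and the sample text is made up.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored by default.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


document = "Robots assist surgeons. Robots make surgery safer. The weather was mild."
summary = extractive_summary(document, num_sentences=1)
print(summary)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization. An abstractive model may instead produce wording that never appears in the source.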

In [None]:
# Create the summarization pipeline
summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum")
# Summarize the text
summary_text = summarizer(text)
# Compare the length
print(f"Original text length: {len(text)}")
print(f"Summary length: {len(summary_text[0]['summary_text'])}")

# Using min_length and max_length
The pipeline() function accepts two important parameters for summarization: min_length and max_length. They bound the length of the generated summary, measured in tokens rather than words, so you can make the summary shorter, longer, or keep it within a set range.
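
One way to choose these bounds is to scale them with the length of the input. The sketch below assumes you already know the input's token count; length_bounds and its default ratios are hypothetical and not part of transformers.

```python
def length_bounds(input_tokens, min_ratio=0.25, max_ratio=0.5):
    """Hypothetical helper: derive min_length and max_length from the input's token count."""
    if input_tokens < 1:
        raise ValueError("input_tokens must be positive")
    # Cap the summary at a fraction of the input, but allow at least 2 tokens.
    max_length = max(2, int(input_tokens * max_ratio))
    # Keep min_length at least 1 and strictly below max_length.
    min_length = max(1, min(int(input_tokens * min_ratio), max_length - 1))
    return {"min_length": min_length, "max_length": max_length}


bounds = length_bounds(200)
print(bounds)
```

The resulting dictionary could be unpacked into a summarizer call, for example summarizer(text, **bounds), on the assumption that the pipeline forwards these keyword arguments to generation at call time.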

In [None]:
# Create a short summarizer
short_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_length=1, max_length=10)
# Summarize the input text
short_summary_text = short_summarizer(text)
# Print the short summary
print(short_summary_text[0]["summary_text"])
# Repeat for a long summarizer
long_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_length=50, max_length=150)
long_summary_text = long_summarizer(text)
# Print the long summary
print(long_summary_text[0]["summary_text"])

# Tokenizing text with AutoTokenizer
The AutoTokenizer class simplifies text preparation by loading the tokenizer that matches a given model checkpoint. Cleaning, normalization, and tokenization are handled automatically, so the text is processed just as the model expects.
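
To show what the tokenization step involves, here is a toy version of WordPiece-style subword splitting in plain Python. It lowercases the text, separates punctuation, and then greedily matches the longest vocabulary piece. The vocabulary is tiny and made up, so the output is illustrative and not what the real tokenizer returns.

```python
import re

# Tiny made-up vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"robots", "smart", "lazi", "##er", "and", "humans", "!", "[UNK]"}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no match: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab=VOCAB):
    """Normalize (lowercase), split off punctuation, then apply WordPiece per word."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    return [piece for word in words for piece in wordpiece(word, vocab)]


tokens = tokenize("Robots smarter and humans lazier!")
print(tokens)
```

Words missing from the vocabulary are broken into smaller known pieces where possible, which is how subword tokenizers cope with rare words without an unbounded vocabulary.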

In [None]:
# Import necessary library for tokenization
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Split input text into tokens
tokens = tokenizer.tokenize("AI: Making robots smarter and humans lazier!")

# Display the tokenized output
print(f"Tokenized output: {tokens}")

# Using AutoClasses
In this cell, we combine the AutoModel and AutoTokenizer classes with the pipeline() function, which offers a nice balance of control and convenience.

- Download the model and tokenizer and save as my_model and my_tokenizer, respectively

- Create the pipeline and save as my_pipeline

- Predict the output using my_pipeline and save as output

In [None]:
# Download the model and tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
my_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
my_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Create the pipeline
my_pipeline = pipeline(task="sentiment-analysis", model=my_model, tokenizer=my_tokenizer)

# Predict the sentiment
output = my_pipeline("This course is pretty good, I guess.")
print(f"Sentiment using AutoClasses: {output[0]['label']}")

# Extracting text with PyPDF
PyPDF lets us extract text from PDFs, making it easy to work with multi-page documents like policy files.

In this exercise, you’ll load a PDF (here, AI Agents vs. Agentic AI.pdf), extract its content page by page, and combine it into a single string, preparing the text for a question-answering pipeline.
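
When combining pages, plain string concatenation can glue the last word of one page to the first word of the next. The sketch below shows one way to join page texts with a separator and skip pages that yield no text. join_pages is a hypothetical helper and the sample strings are made up.

```python
def join_pages(page_texts, separator="\n"):
    """Join per-page strings, skipping pages that yielded no text."""
    cleaned = [(text or "").strip() for text in page_texts]
    return separator.join(text for text in cleaned if text)


# Made-up page texts; the empty string and None stand in for image-only pages.
pages = ["Policy overview", "", None, "Leave rules"]
document = join_pages(pages)
print(document)
```

With pypdf, the helper could be fed directly from the reader, for example join_pages(page.extract_text() for page in reader.pages).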

In [None]:
!pip install pypdf
from pypdf import PdfReader

# Extract text from the PDF
reader = PdfReader("AI Agents vs. Agentic AI.pdf")

# Extract text from all pages
document_text = ""
for page in reader.pages:
    document_text += page.extract_text()

print(document_text)

# Building a Q&A pipeline

You’ll build a question-answering pipeline using Hugging Face to retrieve specific answers from the document.
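
The question-answering pipeline returns a dictionary with the answer text, a confidence score, and the character offsets of the answer within the context. The sketch below works on a hand-written stand-in for such a result. The values are made up, and answer_or_none with its threshold is a hypothetical helper.

```python
context = "The capital of France is Paris."

# Hand-written stand-in for a question-answering result; the score is made up.
sample_result = {"score": 0.98, "start": 25, "end": 30, "answer": "Paris"}


def answer_or_none(result, min_score=0.3):
    """Return the answer text, or None when the model's score is too low."""
    return result["answer"] if result["score"] >= min_score else None


# The start and end offsets index into the context string.
span = context[sample_result["start"]:sample_result["end"]]

print(answer_or_none(sample_result), span)
```

Extractive QA models always return some span, even when the document does not contain the answer, so checking the score before trusting the result is a reasonable safeguard.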

In [None]:
!pip install pypdf
from pypdf import PdfReader

# Extract text from the PDF
reader = PdfReader("AI Agents vs. Agentic AI.pdf")

# Extract text from all pages
document_text = ""
for page in reader.pages:
    document_text += page.extract_text()

# Load the question-answering pipeline
qa_pipeline = pipeline(task="question-answering", model="distilbert-base-cased-distilled-squad")

question = "What is Negative Sample Integration?"

# Get the answer from the QA pipeline
result = qa_pipeline(question=question, context=document_text)

# Print the answer
print(f"Answer: {result['answer']}")