<a href="https://colab.research.google.com/github/sushmita25dubey/IIT-BHU-Workshop-1st-Year/blob/main/Session_4_(DAY%201%20).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤗 Hugging Face Workshop

![](https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo-with-title.png)

## **1. Introduction to Hugging Face**

### What is Hugging Face?
Hugging Face is an open-source platform specializing in Natural Language Processing (NLP) and Machine Learning (ML). It provides pre-trained models for tasks like text generation, translation, summarization, and more. With Hugging Face, you can:
- Use pre-trained models from the Transformers library.
- Fine-tune models on your own dataset.
- Share and discover models via the Hugging Face Hub.

Hugging Face is a powerful tool for both beginners and experts in AI/ML!

## **2. Setting up the environment**

### Understand & install the libraries

Before we install the libraries, let's understand their purpose:

- **Transformers:** Provides pre-trained models and tools to fine-tune or deploy them for NLP tasks like text classification, translation, and summarization.
- **Datasets:** Simplifies downloading and preparing large datasets, particularly those commonly used in NLP tasks.
- **Accelerate:** Lets the same PyTorch training code run on a CPU, a single GPU, multiple GPUs, or TPUs, with support for mixed precision and distributed training setups.
- **Huggingface-hub:** Provides tools to interact with the Hugging Face Hub, allowing users to download, upload, and manage models and datasets.

In [None]:
# Install the Hugging Face Transformers and Datasets libraries
!pip install transformers datasets --quiet

# Optional: Install Accelerate for faster training
!pip install accelerate --quiet

# Install Hugging Face Hub
!pip install huggingface-hub --quiet
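
To confirm that the installs succeeded, a quick check like the following may help. It is a minimal sketch using only the Python standard library; the helper name `installed_versions` is our own and is not part of any Hugging Face package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The four libraries installed above, by their pip distribution names
libraries = ["transformers", "datasets", "accelerate", "huggingface-hub"]
for name, ver in installed_versions(libraries).items():
    print(f"{name}: {ver or 'not installed'}")
```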

## **3. Getting Your Hugging Face Token**

### Why Do You Need a Token?
To access private or gated models, or to push your own models to the Hugging Face Hub, you need an authentication token. Public models can be downloaded without one.

### Steps to Get Your Token:
1. Go to the [Hugging Face website](https://huggingface.co/).
2. Create an account or log in.
3. Click on your profile picture in the top right corner.
4. Select **Access Tokens**.
5. Click **New Token** and create a token with the required permissions.
6. Copy the token to your clipboard.

## **4. Logging into Hugging Face in Google Colab**

Run the following code cell and paste your token when prompted:

In [None]:
from huggingface_hub import notebook_login

# Log in to Hugging Face
notebook_login()
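
In scripts or automated runs where an interactive prompt is not available, one alternative is to read the token from an environment variable and pass it to `huggingface_hub.login`. The sketch below assumes the token is stored in a variable named `HF_TOKEN`; `login_from_env` is a hypothetical wrapper written for this example, not a library function.

```python
import os


def login_from_env(env=None, var_name="HF_TOKEN"):
    """Log in with a token taken from an environment mapping.

    Returns True if a token was found and passed to login(), else False.
    """
    env = os.environ if env is None else env
    token = env.get(var_name)
    if not token:
        print(f"{var_name} is not set; skipping login.")
        return False
    # Imported here so the check above works even without the library installed
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` with no arguments reads from the real environment. Keeping the token out of the notebook itself avoids committing it by accident when the notebook is shared.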

## **5. Using Hugging Face Transformers**

### Example 1: Using Pipelines

Pipelines make it easy to use pre-trained models for common tasks by abstracting the complexities of model loading, tokenization, and inference. Here’s how they work:

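1. **Model Selection:** If you do not pass a `model` argument, the pipeline function loads a default pre-trained model for the specified task (e.g., summarization, sentiment analysis, translation). Naming a model explicitly, as the examples in this section do, keeps results reproducible.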
2. **Data Preparation:** It processes the input text (e.g., tokenizing it) to make it compatible with the model's requirements.
3. **Inference Execution:** The pipeline runs the input through the model to generate predictions or outputs.
4. **Output Formatting:** It converts raw model outputs (like tensors) into a user-friendly format (like plain text or structured data).

Pipelines eliminate the need for manual setup, such as loading models, tokenizers, and writing preprocessing or postprocessing code. This makes them ideal for beginners exploring NLP tasks and for quick prototyping.
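
Step 4 can be made concrete with a small, library-free sketch. For a classification task, the raw outputs are logits (one unnormalized score per class); a softmax turns them into probabilities, and the highest one is mapped to a label name. The function `format_classification` and the example logits are illustrative assumptions, not the Transformers implementation.

```python
import math


def format_classification(logits, id2label):
    """Convert raw class logits into a {'label', 'score'} dict."""
    # Subtract the max logit for numerical stability before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the most probable class
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Made-up logits for a two-class sentiment model
print(format_classification([-2.0, 3.0], {0: "NEGATIVE", 1: "POSITIVE"}))
```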

---

For more details, check out these resources:
- [Hugging Face Documentation on Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines)
- [Video: "The pipeline function"](https://www.youtube.com/watch?v=tiZFewofSLM)

---

Here are several types of pipelines:

#### 1. Summarization Pipeline

This is used for generating a summary of a given text.

In [None]:
from transformers import pipeline

# Load a summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

In [None]:
# Summarize some text
text = """
Artificial Intelligence (AI) is a field of computer science that emphasizes the creation of intelligent machines
that work and react like humans. Some of the activities computers with AI are designed for include learning, reasoning,
problem-solving, perception, and language understanding. The ultimate goal of AI is to create systems that can function
intelligently and independently.
"""

summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print("\nSummary:\n")
print(summary[0]['summary_text'])

#### 2. Sentiment Analysis Pipeline

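This pipeline classifies the sentiment of a given text. The model used below was fine-tuned on SST-2, so it returns only `POSITIVE` or `NEGATIVE` (there is no neutral class), along with a confidence score.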

In [None]:
from transformers import pipeline

# Load a sentiment analysis pipeline with a specific model
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
# Analyze sentiment of some text
text = "I love this new phone! It's so fast and has a great camera."

sentiment = sentiment_analyzer(text)
print("\nSentiment Analysis:\n")
print(sentiment)
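
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. Because this model has no neutral class, one possible workaround is to treat low-confidence predictions as uncertain. The sketch below assumes a cutoff of 0.75, which is an arbitrary value chosen for illustration; `with_uncertain` is a hypothetical helper and the sample results are hand-written, not real model output.

```python
def with_uncertain(results, threshold=0.75):
    """Relabel predictions whose score falls below the threshold."""
    relabeled = []
    for item in results:
        label = item["label"] if item["score"] >= threshold else "UNCERTAIN"
        relabeled.append({"label": label, "score": item["score"]})
    return relabeled


# Hand-written results in the same shape the pipeline returns
sample = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.58},
]
print(with_uncertain(sample))
```

With the pipeline loaded above, the same helper could be applied to a batch, for example `with_uncertain(sentiment_analyzer(["Great phone.", "It is a phone."]))`.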

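#### 3. Text Classification Pipeline (Zero-Shot)

This pipeline classifies text into categories (e.g., spam detection, topic classification). The example uses zero-shot classification, which scores the text against candidate labels you supply at inference time, so the model does not need to have been trained on those specific labels.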

In [None]:
from transformers import pipeline

# Load a zero-shot classification pipeline with a specific model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

In [None]:
# Classify text into labels
text = "I am learning Python programming and NLP."

labels = ["Technology", "Science", "Education"]
classification = classifier(text, candidate_labels=labels)
print("\nText Classification:\n")
print(classification)
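
The zero-shot pipeline returns a dictionary with the input `sequence`, the candidate `labels`, and their `scores`, with labels ordered from highest to lowest score. A small helper (hypothetical, named `top_label` here) can pull out the best match; the sample result below is hand-written for illustration, not real model output.

```python
def top_label(result):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(result["labels"], result["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Hand-written result in the same shape the zero-shot pipeline returns
sample = {
    "sequence": "I am learning Python programming and NLP.",
    "labels": ["Education", "Technology", "Science"],
    "scores": [0.6, 0.3, 0.1],
}
label, score = top_label(sample)
print(f"Best label: {label} ({score:.2f})")
```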

#### 4. Translation Pipeline

This pipeline is used for translating text from one language to another.

In [None]:
from transformers import pipeline

# Load a translation pipeline with a specific model
translator = pipeline("translation_en_to_fr", model="t5-small")

In [None]:
# Translate some text
text = "Hello, how are you today?"

translation = translator(text)
print("\nTranslation (English to French):\n")
print(translation[0]['translation_text'])

#### 5. Named Entity Recognition (NER) Pipeline

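This pipeline identifies and classifies entities (like names, locations, and organizations) in a given text. The model used below was trained on CoNLL-2003, so it recognizes persons, organizations, locations, and miscellaneous entities but does not tag dates. By default the pipeline returns one entry per token, so a single name can be split across several entries.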

In [None]:
from transformers import pipeline

# Load a NER pipeline with a specific model
ner_tagger = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

In [None]:
# Recognize entities in the text
text = "Elon Musk, the CEO of Tesla, visited Paris in 2023."

entities = ner_tagger(text)
print("\nNamed Entity Recognition:\n")
for entity in entities:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}")
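
To get one entry per entity rather than one per token, the pipeline accepts an `aggregation_strategy` argument; with `"simple"`, adjacent tokens of the same entity are merged and each result carries an `entity_group` key instead of `entity`. The sketch below is one way to use it. The helper names `format_entities` and `tag_entities` are our own, and the sample data is hand-written rather than real model output.

```python
def format_entities(entities):
    """Format grouped NER results as 'word (GROUP)' strings."""
    return [f"{e['word']} ({e['entity_group']})" for e in entities]


def tag_entities(text):
    """Run NER with token grouping enabled (downloads the model on first use)."""
    # Imported lazily so format_entities can be used without transformers
    from transformers import pipeline

    grouped_ner = pipeline(
        "ner",
        model="dbmdz/bert-large-cased-finetuned-conll03-english",
        aggregation_strategy="simple",
    )
    return format_entities(grouped_ner(text))


# Hand-written example in the grouped output shape
sample = [
    {"entity_group": "PER", "word": "Elon Musk", "score": 0.99},
    {"entity_group": "LOC", "word": "Paris", "score": 0.99},
]
print(format_entities(sample))
```

Calling `tag_entities("Elon Musk, the CEO of Tesla, visited Paris in 2023.")` would run the real model on the sentence used above.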

#### 6. Question Answering Pipeline
This pipeline is used to answer questions based on a provided context.

In [None]:
from transformers import pipeline

# Load a question answering pipeline with a specific model
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

In [None]:
# Context and question
context = """
Hugging Face is a company that develops machine learning tools and offers an open-source platform for training,
deploying, and sharing machine learning models. It was founded in 2016 by Clément Delangue, Julien Chaumond, and Thomas Wolf.
"""
question = "Who founded Hugging Face?"

# Answer the question based on context
answer = qa_pipeline(question=question, context=context)
print("\nQuestion Answering:\n")
print(f"Answer: {answer['answer']}")

### Example 2: Custom Tokenization and Model Usage

Custom tokenization and model usage provide greater control over the NLP pipeline, making it suitable for advanced use cases where specific model configurations or inputs are required. Unlike pipelines, this method lets you directly manage each component, giving flexibility for fine-tuning, custom preprocessing, and interpreting intermediate outputs.

Use cases for custom tokenization include:
- **Custom preprocessing:** When your data requires specific cleaning or formatting steps before tokenization.
- **Model-specific adjustments:** To modify parameters or inputs for a particular task or dataset.
- **Debugging or analysis:** To inspect intermediate outputs like tokenized inputs or logits for better interpretability.
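
As an illustration of the custom preprocessing case, the sketch below strips HTML tags and collapses whitespace before the text would be handed to a tokenizer. The cleaning rules and the name `clean_text` are assumptions chosen for the example; real data may need different steps.

```python
import html
import re


def clean_text(raw):
    """Remove HTML tags, decode entities, and normalize whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    decoded = html.unescape(no_tags)
    # \s also matches the non-breaking space produced by &nbsp;
    return re.sub(r"\s+", " ", decoded).strip()


raw = "<p>Hugging&nbsp;Face   makes <b>NLP</b> easier.</p>\n"
print(clean_text(raw))
```

The cleaned string can then be tokenized as usual, for example `tokenizer(clean_text(raw), return_tensors="pt")`.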

---

Here's an example of how to manually tokenize and use a pre-trained model:

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

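# Define model and tokenizer
# This DistilBERT checkpoint was fine-tuned on SST-2 and has a two-class
# (negative/positive) sequence classification head
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)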

In [None]:
# Input text
text = """
Computer programming or coding is the composition of sequences of instructions, called programs, that computers can follow to perform tasks.
It involves designing and implementing algorithms, step-by-step specifications of procedures, by writing code in one or more programming languages.
Programmers typically use high-level programming languages that are more easily intelligible to humans than machine code, which is directly
executed by the central processing unit. Proficient programming usually requires expertise in several different subjects, including knowledge of the
application domain, details of programming languages and generic code libraries, specialized algorithms, and formal logic.
"""

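# Tokenize input (truncate anything beyond the model's maximum length)
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Pass inputs through the model
outputs = model(**inputs)

# Get predictions: pick the class with the highest logit
logits = outputs.logits
predicted_id = logits.argmax(dim=-1).item()
# Use the label names stored in the model config rather than a hard-coded map
print("Sentiment:", model.config.id2label[predicted_id])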

## **6. Explore More**

The Hugging Face Hub offers a variety of pre-trained models for tasks like text classification, image generation, and speech recognition. You can search for models using:

- **Tags:** Filter models by task, language, or framework.
- **Search Bar:** Enter keywords like "text generation" or "sentiment analysis" to find relevant models.
- **Filters:** Refine your search by dataset, license type, or size of the model.
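
A similar search can be done from code with `HfApi.list_models` in `huggingface_hub`. This is a hedged sketch: `find_models` is a hypothetical wrapper, the `api` parameter exists only so a stand-in object can be supplied for testing, and the real call needs network access to reach the Hub.

```python
def find_models(task, keyword=None, limit=5, api=None):
    """Return up to `limit` model ids on the Hub matching a task tag and keyword."""
    if api is None:
        # Imported lazily; requires the huggingface-hub package
        from huggingface_hub import HfApi

        api = HfApi()
    models = api.list_models(filter=task, search=keyword, limit=limit)
    return [model.id for model in models]
```

For example, `find_models("summarization", keyword="bart")` would list a few summarization models whose names contain "bart".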

Check out the [Hugging Face Hub](https://huggingface.co/models) for more models to try!

More resources to try out:
1. [Hugging Face's NLP Course](https://huggingface.co/learn/nlp-course/en/chapter2/1)
2. [Open Source AI Cookbook](https://huggingface.co/learn/cookbook/index)