# **Hugging Face 🤗**

## **What's Covered?**
1. Introduction to Hugging Face
2. 🤗 Transformers

## **Introduction to Hugging Face**

### **What is Hugging Face?**

Hugging Face is a company and an open-source community for machine learning projects, best known for its NLP tools.

Hugging Face makes advanced AI, especially with large language models, accessible and practical for everyone. It provides the building blocks (pre-trained models, datasets) and the tools (libraries, platform) to quickly build, experiment with, and deploy AI applications without reinventing the wheel.

### **Hugging Face Hub**
The **Hugging Face Hub** is a web-based platform that hosts the following:
1. **Models Hub:** Thousands of pre-trained models (BERT, GPT, T5, etc.) for tasks like sentiment analysis, translation, summarization, and more. Go to [https://huggingface.co/models](https://huggingface.co/models), pick a task you're interested in (like "text-generation"), and open a popular model. On the model page, you'll find:
    - Model card (what a model does)
    - Usage code
    - License
    - Downloads
    - Metrics
    - Files (like pytorch_model.bin, config.json, tokenizer.json)
2. **Datasets Hub:** Access and share a vast collection of datasets. High-quality data is the fuel for AI models, and Hugging Face offers a convenient way to find and utilize ready-to-use datasets for various tasks.
3. **Spaces:** Create and host interactive demos of your machine learning models or applications directly in your browser. This allows for easy sharing and showcasing of your work.
4. **Other Tools:** It also provides various other tools and services for things like automatic model training (AutoTrain) and **Inference APIs** for easy deployment.
5. **Vision, Audio, and Multimodal Models:** The Hub is no longer limited to text; it also hosts models for images, audio, and multimodal tasks.
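
As a rough illustration of how the files listed on a model page can be used from code, the sketch below downloads a model's `config.json` with the `huggingface_hub` package and summarizes it. It assumes `huggingface_hub` is installed (it is a dependency of `transformers`) and that network access is available. The helper names `summarize_config` and `fetch_config` are made up for this example.

```python
import json


def summarize_config(config: dict) -> str:
    """Build a one-line summary from the contents of a model's config.json."""
    model_type = config.get("model_type", "unknown")
    # "architectures" may be missing or null in some configs.
    architectures = ", ".join(config.get("architectures") or []) or "unspecified"
    return f"type={model_type}; architectures={architectures}"


def fetch_config(repo_id: str) -> dict:
    """Download config.json for a Hub repository and parse it.

    The import is done lazily so the helper above works without the package.
    """
    from huggingface_hub import hf_hub_download

    # hf_hub_download returns the local path of the (cached) file.
    path = hf_hub_download(repo_id=repo_id, filename="config.json")
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `summarize_config(fetch_config("gpt2"))` would fetch the file once and reuse the cached copy afterwards.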


### **Key libraries**
1. `🤗 Transformers:` This is the flagship library. It provides a unified API for working with state-of-the-art "Transformer" models (the "T" in GPT, BERT, etc.). These models are incredibly powerful for understanding and generating human language. With `transformers`, you can load, use, and fine-tune these models with just a few lines of code, regardless of whether you're using PyTorch or TensorFlow.
2. `🤗 Datasets:` This library simplifies the process of loading, processing, and sharing datasets for machine learning. It's highly efficient for handling large datasets.
3. `🤗 Evaluate:` A library for evaluating models with standard metrics such as accuracy, F1, and BLEU.
4. `🤗 Tokenizers:` Before a language model can understand text, it needs to break it down into smaller pieces called "tokens." The tokenizers library provides highly optimized tokenizers for various models, ensuring your data is in the right format for the models to consume.
5. `🤗 Accelerate:` A library that helps you train your models on different hardware setups (e.g., multiple GPUs, distributed training) with minimal code changes.
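
To make the idea of tokens concrete, here is a deliberately simplified sketch: it splits text into words and punctuation marks and maps each piece to an integer ID. This is not how the `tokenizers` library works internally (real tokenizers use subword algorithms such as WordPiece or BPE with a learned vocabulary). The vocabulary and both function names are invented for illustration.

```python
import re

# Hypothetical, hand-written vocabulary; real vocabularies are learned from data.
VOCAB = {"[UNK]": 0, "hugging": 1, "face": 2, "is": 3, "great": 4, "!": 5}


def toy_tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text: str) -> list[int]:
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in toy_tokenize(text)]


tokens = toy_tokenize("Hugging Face is great!")
ids = toy_encode("Hugging Face is really great!")
print(tokens)  # ['hugging', 'face', 'is', 'great', '!']
print(ids)     # [1, 2, 3, 0, 4, 5]  ("really" is not in the vocabulary)
```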

### **Open Source and Framework Agnostic**
- Pre-trained models and easy-to-use APIs shorten the development cycle, because you rarely need to train a model from scratch.
- The open-source nature fosters a collaborative environment, leading to faster advancements and more robust solutions.
- It supports popular deep learning frameworks like PyTorch and TensorFlow, giving developers flexibility.

## **🤗 Transformers**

### **What is 🤗 Transformers?**
🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs and carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:

**📝 Natural Language Processing:** text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.  
**🖼️ Computer Vision:** image classification, object detection, and segmentation.  
**🗣️ Audio:** automatic speech recognition and audio classification.  
**🐙 Multimodal:** table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.  

🤗 Transformers supports framework interoperability between PyTorch, TensorFlow, and JAX. This provides the flexibility to use a different framework at each stage of a model’s life: train a model in three lines of code in one framework, and load it for inference in another. Models can also be exported to formats like ONNX and TorchScript for deployment in production environments.

### **Installation**
1. `transformers` can run on top of either PyTorch or TensorFlow. You need at least one of them installed.
2. For PyTorch
```
! pip install torch
```
3. For TensorFlow
```
! pip install tensorflow
! pip install tf-keras
```
4. Installing transformers
```
! pip install transformers
```
5. Installing tokenizers for fast tokenization
```
! pip install tokenizers
```
6. Installing datasets for efficient data loading and processing
```
! pip install datasets
```
7. Installing accelerate for easy distributed training
```
! pip install accelerate
```

In [1]:
# ! pip install tensorflow
# ! pip install tf-keras

In [2]:
# ! pip install transformers
# ! pip install tokenizers
# ! pip install datasets
# ! pip install accelerate

### **Verifying Installation**

In [3]:
import transformers

print(transformers.__version__)



4.44.0
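
Beyond printing the version, it can help to confirm which backends and companion libraries are importable. The following is a minimal sketch using only the standard library; the `installed` helper is a name chosen for this example, and the package list simply mirrors the installation steps above.

```python
import importlib.util


def installed(package: str) -> bool:
    """Return True if the top-level package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


# Report which backends and companion libraries are present.
for name in ["torch", "tensorflow", "transformers", "tokenizers", "datasets", "accelerate"]:
    status = "installed" if installed(name) else "missing"
    print(f"{name}: {status}")
```

At least one of `torch` or `tensorflow` should show as installed before running the pipelines in the next section.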


## **Pipelines**

### **What is pipeline()?**
The `pipeline()` function makes it simple to use any model from the `Hub` for inference on language, computer vision, speech, and multimodal tasks. Even if you don’t have experience with a specific modality or aren’t familiar with the underlying code behind the models, you can still use them for inference with `pipeline()`.

It is the easiest way to start using pre-trained Hugging Face models.

It's a high-level API that abstracts away the complexity of tokenization, model loading, and post-processing, allowing you to perform common tasks with just a few lines of code.

### **Common Tasks Supported**
- sentiment-analysis
- text-generation
- ner
- summarization
- translation
- question-answering
- fill-mask (predicting missing words)
- zero-shot-classification (classifying text without specific training examples)
- ... and many more!

Explore more on:  
https://huggingface.co/docs/transformers/main_classes/pipelines
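
Tasks differ in what they expect as input and what they return. As one example, the sketch below uses `zero-shot-classification`, which takes candidate labels at call time and returns parallel `labels` and `scores` lists. The helper names are hypothetical, and running `zero_shot` downloads a default model, so it needs network access and an installed backend.

```python
def best_label(result: dict) -> tuple[str, float]:
    """Return the highest-scoring (label, score) pair from a zero-shot result."""
    pairs = zip(result["labels"], result["scores"])
    return max(pairs, key=lambda pair: pair[1])


def zero_shot(text: str, candidate_labels: list[str]) -> tuple[str, float]:
    """Classify text against labels the model was never explicitly trained on."""
    from transformers import pipeline  # lazy import: needs transformers and a backend

    classifier = pipeline(task="zero-shot-classification")
    # The result is a dict with "sequence", "labels", and "scores" keys.
    result = classifier(text, candidate_labels=candidate_labels)
    return best_label(result)
```

For instance, `zero_shot("The laptop stopped working after 30 minutes.", ["complaint", "praise"])` would return the more likely of the two labels together with its score.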

### **Pipeline syntax**
1. Start by importing `pipeline`
```python
from transformers import pipeline
```
2. Specify the inference task
```python
classifier = pipeline(task="text-classification")
```
3. Pass the input to the `pipeline()`
```python
classifier(input_text)
```
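
Putting the three steps together, a minimal end-to-end sketch might look like the following. It names the model explicitly rather than relying on the default, and it assumes `transformers` plus one backend are installed. `describe` and `classify` are illustrative helper names, not part of the library.

```python
def describe(result: dict) -> str:
    """Format one text-classification result such as {'label': ..., 'score': ...}."""
    return f"{result['label']} ({result['score']:.2%})"


def classify(texts: list[str]) -> list[str]:
    """Run sentiment classification on a list of texts and format the results."""
    from transformers import pipeline  # lazy import: needs transformers and a backend

    classifier = pipeline(
        task="text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    # A list input yields one result dict per text.
    return [describe(result) for result in classifier(texts)]
```

Calling `classify(["I love this!", "This is terrible."])` triggers the model download on first use and returns one formatted string per input.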

### **Behind the Scenes**
- Loads tokenizer
- Loads model
- Handles pre/post-processing
- Gives results
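
The steps above can be sketched by hand for a text-classification model. This is an approximation of what a pipeline does, not its actual implementation: it assumes PyTorch is the backend and a single-label classifier whose scores go through a softmax. The function names are chosen for this example.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits: list[float], id2label: dict[int, str]) -> dict:
    """Pick the most likely class and report it in a pipeline-like form."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


def classify_manually(text: str, model_name: str) -> dict:
    """Tokenize, run the model, and post-process the output."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)  # load tokenizer
    model = AutoModelForSequenceClassification.from_pretrained(model_name)  # load model
    inputs = tokenizer(text, return_tensors="pt", truncation=True)  # pre-process
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()  # forward pass
    return postprocess(logits, model.config.id2label)  # post-process
```

The pure-Python `softmax` and `postprocess` helpers are separated out so the post-processing step can be read and checked without downloading a model.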

### **Key Points**
- The first time you run a pipeline for a specific model, it downloads the model weights (anywhere from a few hundred MB to several GB).
- Subsequent runs will use the cached version.
- You can specify a particular model within the pipeline if you don't want the default.
- The output format of the pipeline varies depending on the task.
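
To see where cached downloads end up, the sketch below resolves the Hub cache directory from environment variables. The precedence shown (`HF_HUB_CACHE`, then `HF_HOME/hub`, then `~/.cache/huggingface/hub`) is a simplification: it ignores `XDG_CACHE_HOME`, which the real library also consults. `hub_cache_dir` is a helper written for this example, not a library function.

```python
import os
from pathlib import Path


def hub_cache_dir(env: dict[str, str]) -> Path:
    """Resolve the Hub cache directory from a mapping of environment variables.

    Simplified precedence: HF_HUB_CACHE, then HF_HOME/hub,
    then ~/.cache/huggingface/hub.
    """
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


cache = hub_cache_dir(dict(os.environ))
print(f"Models are expected under: {cache}")
```

Setting `HF_HOME` before starting Python is a common way to move the cache to a disk with more space.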

## **Default Model List**

### **Natural Language Processing**
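For Natural Language Processing, the following models are the defaults (or common choices) for the respective tasks. Defaults can change between library versions, so check the warning that `pipeline()` prints when no model is specified.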

- Text classification: `distilbert-base-uncased-finetuned-sst-2-english`
- Token classification: `dbmdz/bert-large-cased-finetuned-conll03-english`
- Text summarization: `sshleifer/distilbart-cnn-12-6`
- Question answering: `distilbert-base-cased-distilled-squad`
- Text generation: `gpt2`
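- Text similarity: `sentence-transformers/all-mpnet-base-v2` (used through the separate `sentence-transformers` library; `pipeline()` has no text-similarity task)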
- Translation: `t5-base`
- Fill mask: `distilroberta-base`

### **Computer Vision**
The default models for computer vision tasks are as follows:

- Image classification: `google/vit-base-patch16-224`
- Object detection: `facebook/detr-resnet-50`
- Segmentation: `facebook/detr-resnet-50-panoptic`
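
Because defaults may differ between library versions, one option is to pin the model for each task explicitly. The sketch below does this with a small lookup table built from a subset of the checkpoints listed above. `PINNED_MODELS`, `pipeline_kwargs`, and `build` are names invented for this example.

```python
# Task names paired with checkpoints from the lists above; adjust as needed.
PINNED_MODELS = {
    "text-classification": "distilbert-base-uncased-finetuned-sst-2-english",
    "summarization": "sshleifer/distilbart-cnn-12-6",
    "question-answering": "distilbert-base-cased-distilled-squad",
    "text-generation": "gpt2",
    "fill-mask": "distilroberta-base",
    "image-classification": "google/vit-base-patch16-224",
    "object-detection": "facebook/detr-resnet-50",
}


def pipeline_kwargs(task: str) -> dict:
    """Return the keyword arguments that pin a task to a specific model."""
    if task not in PINNED_MODELS:
        raise ValueError(f"No pinned model for task: {task!r}")
    return {"task": task, "model": PINNED_MODELS[task]}


def build(task: str):
    """Create a pipeline for the task using the pinned model."""
    from transformers import pipeline  # lazy import: needs transformers and a backend

    return pipeline(**pipeline_kwargs(task))
```

With this in place, `build("summarization")` always loads the same checkpoint, regardless of what the installed version treats as the default.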

## **Practical Natural Language Processing Use Cases**

### **Loading the text data**

In [4]:
with open("text/email.txt", encoding="utf-8") as f:
    email = f.read()

print(email)

Congratulations Alice – Welcome to the GenAI Internship Program!

Dear Alice,

Congratulations! 🎉

We are thrilled to inform you that you have been selected for the GenAI Internship Program, starting on 25th of this month. Your application stood out among thousands, and we’re excited to have you on board as part of this prestigious program.

The official offer letters will be shared with all selected candidates on 20th of this month. Please keep an eye on your inbox and reach out in case you do not receive it by the end of that day.

We look forward to your active participation and can’t wait to see the incredible work you’ll do during this internship!

Best regards,
Program Coordinator
GenAI Internship Team



In [5]:
with open("text/product_review.txt", encoding="utf-8") as f:
    product_review = f.read()

print(product_review)

I bought this product from Flipkart website.
This product is very worst and replacement policy is very bad. Even I went to their New Delhi support center.
I used this laptop only for 30 minute and suddenly it turn off and it will never turn on.
And Flipkart website does not replace this product. I should have gone for better brands like Apple or Alienware.
