<a href="https://colab.research.google.com/github/rmanicav/Generative-AI-LLM/blob/main/my_hug_face_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Goal: learn and understand how to use Hugging Face for:
1) Sentiment Analysis – Classify text sentiment (positive/negative).
2) Text Summarization – Generate concise summaries of long text.
3) Question Answering – Answer questions from a given context.
4) Named Entity Recognition (NER) – Extract entities like names, dates, and organizations.
5) Text Generation – Generate coherent text given a prompt.
6) Image Classification – Classify objects in an image.
7) Object Detection – Detect and localize objects in an image.
8) Image Segmentation – Segment different objects in an image.
9) Translation – Translate text between languages.
10) Zero-Shot Classification – Classify text without task-specific training.
11) Image Captioning – Generate descriptive captions for images.

In [None]:
!pip install transformers

import requests
from io import BytesIO
from transformers import pipeline
from PIL import Image, ImageDraw

import matplotlib.pyplot as plt





In [None]:
#sentiment analysis
from transformers import pipeline

# Load pretrained sentiment analysis pipeline
sentiment_pipeline = pipeline(
    task="sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

text = "Thanks, Google, I didn’t know that already"
result = sentiment_pipeline(text)

print(result)

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9828611612319946}]
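
The pipeline returns a list with one dict per input, each holding a `label` and a `score`. A minimal sketch of reading that structure, using the result printed above. The helper name `confident_label` and the 0.9 threshold are illustrative assumptions, not part of the transformers API.

```python
# Result copied from the cell output above.
result = [{'label': 'POSITIVE', 'score': 0.9828611612319946}]


def confident_label(predictions, threshold=0.9):
    """Return the first prediction's label, or 'UNCERTAIN' below the threshold.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    top = predictions[0]
    return top['label'] if top['score'] >= threshold else 'UNCERTAIN'


print(confident_label(result))
```

The input sentence reads as sarcastic, yet the model scores it POSITIVE with high confidence, so a high score should not be taken as proof that the label is right.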


In [None]:
#Text Classification (same sentiment model, loaded via the text-classification task)
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

clf("This is terrible")


Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9996459484100342}]

In [None]:
#Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")

ner("Albert Einstein was born in Germany and worked in the USA.")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': np.float32(0.9994811),
  'word': 'Albert Einstein',
  'start': 0,
  'end': 15},
 {'entity_group': 'LOC',
  'score': np.float32(0.9995701),
  'word': 'Germany',
  'start': 28,
  'end': 35},
 {'entity_group': 'LOC',
  'score': np.float32(0.9997819),
  'word': 'USA',
  'start': 54,
  'end': 57}]
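
The `start` and `end` fields in the output above are character offsets into the input string. A small sketch, assuming the entities are copied by hand from that output (scores written as plain floats), that slices the text with those offsets and groups the spans by entity type:

```python
from collections import defaultdict

text = "Albert Einstein was born in Germany and worked in the USA."

# Entities copied from the output above, with scores as plain floats.
entities = [
    {'entity_group': 'PER', 'score': 0.9994811, 'word': 'Albert Einstein', 'start': 0, 'end': 15},
    {'entity_group': 'LOC', 'score': 0.9995701, 'word': 'Germany', 'start': 28, 'end': 35},
    {'entity_group': 'LOC', 'score': 0.9997819, 'word': 'USA', 'start': 54, 'end': 57},
]

by_type = defaultdict(list)
for ent in entities:
    # Slice the original text instead of trusting 'word', which can differ
    # from the source text after tokenization.
    span = text[ent['start']:ent['end']]
    by_type[ent['entity_group']].append(span)

print(dict(by_type))
```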

In [None]:
#Question Answering
qa = pipeline("question-answering")

qa({
    "question": "Where was Einstein born?",
    "context": "Albert Einstein was born in Germany in 1879."
})


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


{'score': 0.9846653938293457, 'start': 28, 'end': 35, 'answer': 'Germany'}
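
This is extractive question answering: the answer is a span of the context, and `start`/`end` are character offsets into it. A short check using the values printed above:

```python
context = "Albert Einstein was born in Germany in 1879."

# Result copied from the cell output above.
answer = {'score': 0.9846653938293457, 'start': 28, 'end': 35, 'answer': 'Germany'}

# Slicing the context with the returned offsets recovers the answer text.
extracted = context[answer['start']:answer['end']]
print(extracted)
```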

In [None]:
#Text Generation
generator = pipeline("text-generation")

generator(
    "Once upon a time",
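    max_length=30,  # not honored here: the default max_new_tokens=256 takes precedence (see the warning in the output)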
    num_return_sequences=1
)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'Once upon a time, you may have stumbled upon the following web site. http://welcome.welcome.org/\n\nThis is the web page for the new version of the C++ compiler.\n\nThe C++ compiler is based on the new C++11 Standard Edition. The C++ compiler is the compiler for the new C++11 Standard Edition. The C++ compiler is used for building C++ code.\n\nThe compiler is used for building C++ code. The compiler is used for debugging.\n\nThe compiler is used for debugging. The compiler is used for writing code.\n\nThe compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The compiler is used for writing code. The 
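
The warning above says `max_new_tokens` takes precedence over `max_length`, which is why the output runs far past 30 tokens and starts looping. A hedged sketch of one way to bound the output: pass `max_new_tokens` directly and, optionally, `no_repeat_ngram_size` to discourage repeated phrases. The helper names and default values here are illustrative assumptions; only the keyword arguments themselves are standard generation parameters.

```python
def build_generation_kwargs(max_new_tokens=30, no_repeat_ngram_size=3):
    """Collect generation arguments (hypothetical helper, defaults are illustrative)."""
    return {
        "max_new_tokens": max_new_tokens,              # caps only the newly generated tokens
        "no_repeat_ngram_size": no_repeat_ngram_size,  # blocks repeated n-grams of this size
        "num_return_sequences": 1,
    }


def generate_short(prompt, **overrides):
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model="openai-community/gpt2")
    return generator(prompt, **build_generation_kwargs(**overrides))


print(build_generation_kwargs())
# Example call (downloads the model on first use):
# generate_short("Once upon a time")
```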

In [None]:
#Summarization
from transformers import pipeline

summarizer = pipeline("summarization")

summarizer(
    "Hugging Face provides tools to build, train, and deploy NLP models easily.",
    max_length=30,
    min_length=10
)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
Your max_length is set to 30, but your input_length is only 19. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=9)


[{'summary_text': ' Hugging Face provides tools to build, train, and deploy NLP models easily . Hugging face provides tools for building, train and'}]
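
The warning above notes that `max_length=30` exceeds the 19-token input, and the summary ends up repeating the sentence. A rough sketch of choosing length bounds from the input size, mirroring the halving suggested in the warning. The function name, the `ratio` and `floor` defaults, and the rule for `min_length` are assumptions for illustration, not library behavior.

```python
def summary_length_bounds(input_tokens, ratio=0.5, floor=5):
    """Pick (min_length, max_length) for a summary from the input token count.

    Hypothetical heuristic: cap the summary at a fraction of the input,
    never below `floor`, and set min_length to half of max_length.
    """
    max_length = max(floor, int(input_tokens * ratio))
    min_length = max(1, max_length // 2)
    return min_length, max_length


# 19 is the input length reported in the warning above.
print(summary_length_bounds(19))
```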

In [None]:
#Translation (English → German)
from transformers import pipeline

translator = pipeline(
    "translation_en_to_de",
    model="Helsinki-NLP/opus-mt-en-de"
)

translator("what is your name?")

Device set to use cpu


[{'translation_text': 'Wie heißt du?'}]

In [None]:
#Fill-Mask
from transformers import pipeline

fill_mask = pipeline("fill-mask")

fill_mask("Hugging Face is creating a <mask> for NLP.")


No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


[{'score': 0.07472262531518936,
  'token': 17715,
  'token_str': ' prototype',
  'sequence': 'Hugging Face is creating a prototype for NLP.'},
 {'score': 0.02965584211051464,
  'token': 27663,
  'token_str': ' template',
  'sequence': 'Hugging Face is creating a template for NLP.'},
 {'score': 0.026139186695218086,
  'token': 1761,
  'token_str': ' platform',
  'sequence': 'Hugging Face is creating a platform for NLP.'},
 {'score': 0.01953568123281002,
  'token': 6655,
  'token_str': ' logo',
  'sequence': 'Hugging Face is creating a logo for NLP.'},
 {'score': 0.017980799078941345,
  'token': 25644,
  'token_str': ' mascot',
  'sequence': 'Hugging Face is creating a mascot for NLP.'}]

In [None]:
#Zero-Shot Classification
zero_shot = pipeline("zero-shot-classification")

zero_shot(
    "This course teaches deep learning with PyTorch",
    candidate_labels=["education", "finance", "health"]
)


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


{'sequence': 'This course teaches deep learning with PyTorch',
 'labels': ['education', 'health', 'finance'],
 'scores': [0.9821008443832397, 0.009528241120278835, 0.008370935916900635]}
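
The `labels` and `scores` lists in the output above are parallel and sorted by descending score, so the first label is the prediction. A short sketch that pairs them up, using the values copied from that output:

```python
# Result copied from the cell output above.
result = {
    'sequence': 'This course teaches deep learning with PyTorch',
    'labels': ['education', 'health', 'finance'],
    'scores': [0.9821008443832397, 0.009528241120278835, 0.008370935916900635],
}

# Pair each candidate label with its score.
ranked = dict(zip(result['labels'], result['scores']))
best_label = result['labels'][0]

print(best_label, ranked[best_label])
```

In this default single-label mode the scores are normalized across the candidate labels, which is why they add up to roughly 1.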

In [None]:
!pip install transformers torch pillow



In [None]:
#Image Classification
from PIL import Image
import requests

from transformers import pipeline

classifier = pipeline("image-classification")

image = Image.open(
    requests.get(
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
        stream=True
    ).raw
)

classifier(image)


No model was supplied, defaulted to google/vit-base-patch16-224 and revision 3f49326 (https://huggingface.co/google/vit-base-patch16-224).
Using a pipeline without specifying a model name and revision in production is not recommended.


Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
Device set to use cpu


[{'label': 'tabby, tabby cat', 'score': 0.2768687605857849},
 {'label': 'tiger cat', 'score': 0.2763676643371582},
 {'label': 'Egyptian cat', 'score': 0.14028184115886688},
 {'label': 'hay', 'score': 0.02531472034752369},
 {'label': 'wool, woolen, woollen', 'score': 0.019932903349399567}]
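
The top two scores above are nearly tied, so the top-1 label alone hides how uncertain the model is. A minimal sketch that measures the gap between the best and second-best predictions, using the top three entries copied from the output:

```python
# Top predictions copied from the cell output above.
predictions = [
    {'label': 'tabby, tabby cat', 'score': 0.2768687605857849},
    {'label': 'tiger cat', 'score': 0.2763676643371582},
    {'label': 'Egyptian cat', 'score': 0.14028184115886688},
]

# Sort defensively, then compare the best and second-best scores.
top, runner_up = sorted(predictions, key=lambda p: p['score'], reverse=True)[:2]
margin = top['score'] - runner_up['score']

print(f"{top['label']} vs {runner_up['label']}: margin {margin:.4f}")
```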

In [None]:
#Object Detection
detector = pipeline("object-detection")

detector(image)

No model was supplied, defaulted to facebook/detr-resnet-50 and revision 1d5f47b (https://huggingface.co/facebook/detr-resnet-50).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at facebook/detr-resnet-50 were not used when initializing DetrForObjectDetection: ['model.backbone.conv_encoder.model.layer1.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing DetrForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DetrForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use cpu


[{'score': 0.9973955154418945,
  'label': 'cat',
  'box': {'xmin': 156, 'ymin': 31, 'xmax': 385, 'ymax': 146}},
 {'score': 0.999180018901825,
  'label': 'cat',
  'box': {'xmin': 145, 'ymin': 132, 'xmax': 429, 'ymax': 341}}]
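
Each detection carries a `box` with pixel coordinates `xmin`, `ymin`, `xmax`, `ymax`. A small sketch, using the two boxes copied from the output above, that computes each box's area and how much the two overlap (intersection over union). The helper names are hypothetical, not part of the pipeline.

```python
# Boxes copied from the cell output above.
box_a = {'xmin': 156, 'ymin': 31, 'xmax': 385, 'ymax': 146}
box_b = {'xmin': 145, 'ymin': 132, 'xmax': 429, 'ymax': 341}


def box_area(box):
    """Area in pixels of an axis-aligned box."""
    return (box['xmax'] - box['xmin']) * (box['ymax'] - box['ymin'])


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (0.0 if disjoint)."""
    inter_w = min(a['xmax'], b['xmax']) - max(a['xmin'], b['xmin'])
    inter_h = min(a['ymax'], b['ymax']) - max(a['ymin'], b['ymin'])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    union = box_area(a) + box_area(b) - intersection
    return intersection / union


print(box_area(box_a), box_area(box_b), round(box_iou(box_a, box_b), 4))
```

A low overlap like this one suggests the two detections cover different cats and are not duplicates of the same object.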

In [None]:
#Image Segmentation
segmenter = pipeline("image-segmentation")

segments = segmenter(image)
segments


No model was supplied, defaulted to facebook/detr-resnet-50-panoptic and revision d53b52a (https://huggingface.co/facebook/detr-resnet-50-panoptic).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at facebook/detr-resnet-50-panoptic were not used when initializing DetrForSegmentation: ['detr.model.backbone.conv_encoder.model.layer1.0.downsample.1.num_batches_tracked', 'detr.model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'detr.model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'detr.model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing DetrForSegmentation from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DetrForSegmentation from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


`label_ids_to_fuse` unset. No instance will be fused.


[{'score': 0.997536,
  'label': 'cat',
  'mask': <PIL.Image.Image image mode=L size=622x412>},
 {'score': 0.999448,
  'label': 'cat',
  'mask': <PIL.Image.Image image mode=L size=622x412>}]

In [None]:
#Image Captioning
captioner = pipeline("image-to-text")

captioner(image)


No model was supplied, defaulted to ydshieh/vit-gpt2-coco-en and revision 5bebf1e (https://huggingface.co/ydshieh/vit-gpt2-coco-en).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


[{'generated_text': 'a cat laying on a pile of hay '}]