<a href="https://colab.research.google.com/github/sourabhjsr88/databricks-basics/blob/main/Welcome_To_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Playing with Transformers from Hugging Face

Google Colab comes with torch and tensorflow preinstalled. If you are not using Colab, you will have to install them yourself.

In [1]:
!pip install transformers datasets
!pip install torch
!pip install tensorflow
!pip install sentencepiece
!pip install sacremoses


Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1


# Text-Classification Pipeline

Because no model name is passed, the pipeline falls back to its default, which is https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english
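The log printed by the cell in this section warns that relying on the default model is not recommended in production, and it names the model and revision it resolved to. A minimal sketch of pinning both is shown here; the helper name `build_classifier` is hypothetical, and the model id and revision are copied from that log rather than chosen independently.

```python
# Model id and revision as reported in this notebook's log output.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_classifier():
    """Create a sentiment-analysis pipeline pinned to a fixed model and revision."""
    # Imported lazily so the constants can be inspected without loading transformers.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)
```

Calling `build_classifier()` downloads the pinned checkpoint on first use, so later changes to the library's default should not silently change the predictions.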

In [5]:

from transformers import pipeline

# Initialize a text classification pipeline
# (the import must come first, otherwise `pipeline` is undefined on a fresh kernel)
classifier = pipeline("sentiment-analysis")

# Sample text for classification
text_to_classify = "I love using Hugging Face Transformers! It's so powerful."

# Perform sentiment analysis
results = classifier(text_to_classify)

# Print the results
print(results)

# Example with multiple texts
texts = [
    "This movie was fantastic!",
    "I really disliked that product.",
    "The weather is neutral today."
]

multi_results = classifier(texts)
print(multi_results)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9998114705085754}]
[{'label': 'POSITIVE', 'score': 0.9998781681060791}, {'label': 'NEGATIVE', 'score': 0.9996312856674194}, {'label': 'NEGATIVE', 'score': 0.9955588579177856}]
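Note that the third sentence, "The weather is neutral today.", is labeled NEGATIVE with high confidence. The default checkpoint is fine-tuned on SST-2, a two-class task, so it has no neutral label to choose. The sketch below pairs each input with its prediction to make this easier to spot; the predictions are copied from the output above instead of being recomputed, and the `rows` and `labels_seen` names are illustrative.

```python
texts = [
    "This movie was fantastic!",
    "I really disliked that product.",
    "The weather is neutral today.",
]

# Predictions copied from the notebook output above.
multi_results = [
    {"label": "POSITIVE", "score": 0.9998781681060791},
    {"label": "NEGATIVE", "score": 0.9996312856674194},
    {"label": "NEGATIVE", "score": 0.9955588579177856},
]

# Pair every input text with its predicted label and a rounded score.
rows = [
    {"text": text, "label": result["label"], "score": round(result["score"], 4)}
    for text, result in zip(texts, multi_results)
]

for row in rows:
    print(f"{row['label']:<8} {row['score']:.4f}  {row['text']}")

# Only two distinct labels appear, even for the neutral sentence.
labels_seen = sorted({row["label"] for row in rows})
print(labels_seen)
```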


# NER Pipeline

NER stands for Named Entity Recognition. Let's break down the NER pipeline code:

from transformers import pipeline: This line imports the pipeline function from the transformers library, which is a high-level API for using pre-trained models for various tasks.

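ner_pipeline = pipeline("ner", grouped_entities=True): This initializes a Named Entity Recognition (NER) pipeline. grouped_entities=True merges the sub-word tokens that belong to one entity into a single result, so 'Larry Page' comes back as one PER entity instead of separate token-level tags. Newer transformers releases deprecate this flag in favor of aggregation_strategy="simple".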


ner_results = ner_pipeline(text_for_ner): This calls the pipeline on the input text. The pipeline tokenizes the text, runs its underlying pre-trained model, and returns the named entities it finds.

print(ner_results): Finally, this line prints the results returned by the ner_pipeline. The output is a list of dictionaries, one per identified entity, with the keys entity_group (the entity type, e.g., 'ORG' for organization, 'PER' for person, 'LOC' for location), word (the entity text), start and end (character offsets in the original text), and score (the model's confidence).



In [6]:
from transformers import pipeline

# Initialize a NER pipeline
ner_pipeline = pipeline("ner", grouped_entities=True)

# Sample text for NER
text_for_ner = "Google was founded by Larry Page and Sergey Brin in Menlo Park, California."

# Perform NER
ner_results = ner_pipeline(text_for_ner)

# Print the results
print(ner_results)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


[{'entity_group': 'ORG', 'score': np.float32(0.9986947), 'word': 'Google', 'start': 0, 'end': 6}, {'entity_group': 'PER', 'score': np.float32(0.99899), 'word': 'Larry Page', 'start': 22, 'end': 32}, {'entity_group': 'PER', 'score': np.float32(0.9962962), 'word': 'Sergey Brin', 'start': 37, 'end': 48}, {'entity_group': 'LOC', 'score': np.float32(0.9791672), 'word': 'Menlo Park', 'start': 52, 'end': 62}, {'entity_group': 'LOC', 'score': np.float32(0.9986546), 'word': 'California', 'start': 64, 'end': 74}]
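The start and end values in this output are character offsets into the input string, so each entity can be recovered by slicing. The sketch below groups the entities by type using only the standard library; the entities are copied from the output above, with scores rounded to plain floats for readability, and the name `entities_by_type` is illustrative.

```python
from collections import defaultdict

text_for_ner = "Google was founded by Larry Page and Sergey Brin in Menlo Park, California."

# Entities copied from the notebook output above (scores rounded to plain floats).
ner_results = [
    {"entity_group": "ORG", "score": 0.9987, "word": "Google", "start": 0, "end": 6},
    {"entity_group": "PER", "score": 0.9990, "word": "Larry Page", "start": 22, "end": 32},
    {"entity_group": "PER", "score": 0.9963, "word": "Sergey Brin", "start": 37, "end": 48},
    {"entity_group": "LOC", "score": 0.9792, "word": "Menlo Park", "start": 52, "end": 62},
    {"entity_group": "LOC", "score": 0.9987, "word": "California", "start": 64, "end": 74},
]

entities_by_type = defaultdict(list)
for entity in ner_results:
    # Slice the original text with the reported character offsets.
    span = text_for_ner[entity["start"]:entity["end"]]
    entities_by_type[entity["entity_group"]].append(span)

print(dict(entities_by_type))
```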




# Question-Answering Pipeline

In [7]:
from transformers import pipeline

# Initialize a question-answering pipeline
qa_pipeline = pipeline("question-answering")

# Sample context and question
context = "The capital of France is Paris. Paris is also known for its Eiffel Tower."
question = "What is the capital of France?"

# Perform question answering
qa_results = qa_pipeline(question=question, context=context)

# Print the results
print(qa_results)

question2 = "What is Paris also known for?"
qa_results2 = qa_pipeline(question=question2, context=context)
print(qa_results2)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


{'score': 0.9544874608982354, 'start': 25, 'end': 30, 'answer': 'Paris'}
{'score': 0.8990920658980031, 'start': 60, 'end': 72, 'answer': 'Eiffel Tower'}
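The question-answering pipeline is extractive: the answer is a span of the context, and start and end are character offsets into it (end exclusive). A small sketch that checks this against the two results above; the results are copied from the output, and the helper `answer_span` is a hypothetical name.

```python
context = "The capital of France is Paris. Paris is also known for its Eiffel Tower."

# Results copied from the notebook output above.
qa_results = {"score": 0.9544874608982354, "start": 25, "end": 30, "answer": "Paris"}
qa_results2 = {"score": 0.8990920658980031, "start": 60, "end": 72, "answer": "Eiffel Tower"}


def answer_span(context: str, result: dict) -> str:
    """Return the slice of the context that the result's offsets point to."""
    return context[result["start"]:result["end"]]


for result in (qa_results, qa_results2):
    print(answer_span(context, result), round(result["score"], 3))
```

Because the answer is always cut from the context, the pipeline cannot produce an answer the context does not contain.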


# Text-Generation Pipeline

In [8]:
from transformers import pipeline

# Initialize a text generation pipeline
generator = pipeline("text-generation")

# Sample prompt for text generation
prompt = "Once upon a time, in a land far, far away,"

# Generate text
generated_text = generator(prompt, max_length=50, num_return_sequences=1)

# Print the generated text
print(generated_text[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Once upon a time, in a land far, far away, and in a country far away, the man who is the most eminent man was a man who lived before the days of Solomon. He was the Lord of hosts, the God that brought down down in his days the angels of God, and he was the God who built up the temple of God in Jerusalem. He was the God who brought down all the nations, and he was the Lord of all the nations who gave their lives for the salvation of the earth. He was the God who created the heavens and the earth and, as a sacrifice of his own blood, died under the control of the Lord. He was the God who gave up the children of Israel and brought them to the land of Canaan. He was the God who gave the land to the Israelites, and he was the God who set them free from all oppression from the fatherland of Egypt. He was the God who commanded them to be the first to come to the land of Canaan, and he was the God who made them the first to come to the land of Canaan. He was the God who gave them the land to t
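The generated text runs far past 50 tokens because, as the warning in the log above reports, both max_new_tokens (=256) and max_length (=50) were set and max_new_tokens took precedence. One way to bound the output is to pass max_new_tokens explicitly, which limits only the newly generated tokens and does not count the prompt. The sketch below assumes this approach; the value 40 and the helper name `generate` are illustrative and not part of the original run.

```python
# Cap the number of newly generated tokens explicitly instead of using max_length.
GENERATION_KWARGS = {
    "max_new_tokens": 40,  # illustrative value, not taken from the notebook
    "num_return_sequences": 1,
}


def generate(prompt: str) -> str:
    """Run the default text-generation pipeline with the settings above."""
    # Imported lazily so the settings can be inspected without loading a model.
    from transformers import pipeline

    generator = pipeline("text-generation")
    outputs = generator(prompt, **GENERATION_KWARGS)
    return outputs[0]["generated_text"]
```

Calling `generate("Once upon a time, in a land far, far away,")` should then return the prompt followed by at most 40 new tokens.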

# Translation Pipeline

from transformers import pipeline: This line imports the pipeline function from the transformers library, which is a high-level abstraction designed to easily use pre-trained models for various tasks.

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de"): This initializes a translation pipeline.

- "translation" specifies that we want to perform a translation task.
- model="Helsinki-NLP/opus-mt-en-de" explicitly loads a pre-trained model from the Hugging Face Hub, specifically the Helsinki-NLP Opus-MT model for English to German translation. This ensures the translation will be from English to German.

text_to_translate = "Hello, how are you today?": This line defines a string variable text_to_translate containing the English text that will be translated.
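The model name encodes the language pair: the trailing en-de means English to German. Helsinki-NLP publishes Opus-MT checkpoints under the same naming pattern for other pairs, so a small helper can build the id. This is a sketch under that assumption; the helper names are hypothetical, and not every language pair has a published checkpoint, so the id should be checked on the Hub before use.

```python
def opus_mt_model_id(source_lang: str, target_lang: str) -> str:
    """Build a Hub id following the Helsinki-NLP/opus-mt-<source>-<target> pattern."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


def build_translator(source_lang: str, target_lang: str):
    """Create a translation pipeline for the given pair (assumes the checkpoint exists)."""
    # Imported lazily so the id helper works without loading transformers.
    from transformers import pipeline

    return pipeline("translation", model=opus_mt_model_id(source_lang, target_lang))


print(opus_mt_model_id("en", "de"))
```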



In [9]:
from transformers import pipeline

# Initialize a translation pipeline (e.g., English to German)
# You might need to specify a model if the default doesn't suit your needs
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

# Sample text for translation
text_to_translate = "Hello, how are you today?"

# Perform translation
translated_text = translator(text_to_translate)

# Print the translated text
print(translated_text[0]['translation_text'])

# Example with multiple texts
texts_to_translate = [
    "I love programming.",
    "The weather is beautiful."
]

multi_translated_texts = translator(texts_to_translate)
print(multi_translated_texts)

Device set to use cuda:0


Hallo, wie geht's dir heute?
[{'translation_text': 'Ich liebe Programmieren.'}, {'translation_text': 'Das Wetter ist wunderschön.'}]


# Image Classifier
Let's move from NLP (Natural Language Processing) to image processing with computer vision.

In [None]:
import io

import requests
from PIL import Image
from transformers import pipeline

# Initialize an image classification pipeline
# No model name is passed, so the default is used (e.g. 'google/vit-base-patch16-224')
image_classifier = pipeline("image-classification")

# Sample image URL
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-finetuned-flickr30k-racc.jpg"

# Download the image; buffering the bytes keeps response.content reusable later
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # Raise an exception for HTTP errors
image = Image.open(io.BytesIO(response.content))

# Perform image classification
classification_results = image_classifier(image)

# Print the results
print(classification_results)

# You can also classify local images,
# for example after saving the downloaded image to disk first:
# with open("sample_cat.jpg", "wb") as f:
#     f.write(response.content)
# local_image = Image.open("sample_cat.jpg")
# local_classification_results = image_classifier(local_image)
# print(local_classification_results)
