# Hugging Face Transformers

**Introduction to Hugging Face**
Hugging Face is a leading AI and NLP company that provides open-source tools, models, and platforms for machine learning (ML) applications. It is widely known for Transformers, a library that enables easy access to pre-trained deep learning models for tasks like text generation, translation, classification, and more. Hugging Face has democratized NLP and AI research by making powerful models accessible to developers and researchers.

**Key Components of Hugging Face**
1. 🤗 Transformers – A library for state-of-the-art NLP models like BERT, GPT, and T5.
2. Datasets – A library for loading, processing, and sharing datasets for training ML models efficiently.
3. Tokenizers – Optimized tokenization tools for processing text data.
4. Hugging Face Hub – A repository for sharing and hosting models, datasets, and applications.
5. Inference API – Cloud-based API for running models without requiring local GPU resources.
6. AutoTrain – No-code ML training for text, vision, and tabular data.

**Applications of Hugging Face**

* Natural Language Processing (NLP): Sentiment analysis, chatbots, summarization, question answering.
* Computer Vision: Image classification, object detection, image generation.
* Speech Processing: Speech-to-text, voice recognition.
* Code Generation & Understanding: AI-assisted coding (e.g., StarCoder, Code Llama).
* Healthcare & Biomedicine: Medical NLP, research on clinical notes.
* Search & Recommendation Systems: AI-powered search engines and content filtering.

Hugging Face simplifies ML development by offering ready-to-use models and tools, making it a go-to platform for AI innovation.


In [3]:
import os
import torch
import pandas as pd
import soundfile as sf
from transformers import pipeline
from datasets import load_dataset
from IPython.display import Audio
from diffusers import DiffusionPipeline


## Sentiment Analysis

In [4]:
# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("We are very happy to include pipeline into the transformers repository.")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9978194236755371}]
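
The log above recommends naming the model and revision explicitly rather than relying on the default. A minimal sketch of one way to do that, reusing the model ID and revision hash printed in the log. The `PINNED_MODELS` table and the `make_pipeline` helper are hypothetical names introduced here for illustration, not part of the library:

```python
# Hypothetical helper: pin each task to an explicit model and revision
# so results do not change if the library's default model changes.
PINNED_MODELS = {
    # Model ID and revision copied from the log output above.
    "sentiment-analysis": (
        "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        "af0f99b",
    ),
}


def make_pipeline(task):
    """Build a pipeline for `task` from its pinned model and revision."""
    # Imported lazily: creating the pipeline downloads the model weights.
    from transformers import pipeline

    model, revision = PINNED_MODELS[task]
    return pipeline(task, model=model, revision=revision)


# Usage (downloads the model on first call):
# classifier = make_pipeline("sentiment-analysis")
# print(classifier("We are very happy to include pipeline into the transformers repository."))
```

Keeping the pins in one table makes it easy to review or update them in a single place.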


## Named Entity Recognition (NER)

In [6]:
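# aggregation_strategy="simple" merges sub-word tokens into whole entities
# (it replaces the deprecated grouped_entities=True argument)
ner = pipeline("ner", aggregation_strategy="simple")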
result = ner("Barack Obama was the 44th president of the United States.")
print(result)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER', 'score': 0.99918306, 'word': 'Barack Obama', 'start': 0, 'end': 12}, {'entity_group': 'LOC', 'score': 0.9986908, 'word': 'United States', 'start': 43, 'end': 56}]
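
The `start` and `end` fields in the output above are character offsets into the input string. A small sketch, using the entities copied from that output, of how they can be used to slice the original text and group the spans by entity type (the variable names are illustrative):

```python
from collections import defaultdict

text = "Barack Obama was the 44th president of the United States."

# Entities copied from the pipeline output above.
entities = [
    {"entity_group": "PER", "score": 0.99918306, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.9986908, "word": "United States", "start": 43, "end": 56},
]

# Slice the input with each entity's character offsets and
# collect the resulting spans under their entity type.
by_type = defaultdict(list)
for entity in entities:
    span = text[entity["start"]:entity["end"]]
    by_type[entity["entity_group"]].append(span)

print(dict(by_type))  # {'PER': ['Barack Obama'], 'LOC': ['United States']}
```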


### Question Answering with Context

In [7]:
question_answerer = pipeline("question-answering")
result = question_answerer(question="who was the 44th president of the United States?", context="Barack Obama was the 44th president of the United States.")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


{'score': 0.9884544014930725, 'start': 0, 'end': 12, 'answer': 'Barack Obama'}


### Text Summarization

In [9]:
summarizer = pipeline("summarization")
result = summarizer("""The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square,
              measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure
              in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height
              of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft).
              Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.""")
print(result)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building . It was the first structure to reach a height of 300 metres . It is now taller than the Chrysler Building by 5.2 metres (17 ft) Excluding transmitters, it is the second tallest free-standing structure in France .'}]


### Translation

In [10]:
translator = pipeline("translation_en_to_fr")
result = translator("Hey How are you")
print(result)

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on google-t5/t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'translation_text': 'Hey Comment êtes-vous?'}]


### Zero-Shot Classification

In [11]:

classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(result)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.8445982933044434, 0.11197470128536224, 0.04342701658606529]}
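
In the result above, `labels` and `scores` are parallel lists ordered from most to least likely. A short sketch, working on a copy of that result, of how to pair them up and read off the top prediction:

```python
# Result copied from the zero-shot output above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445982933044434, 0.11197470128536224, 0.04342701658606529],
}

# Pair each label with its score; the first pair is the top prediction.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]
print(top_label, round(top_score, 3))  # education 0.845

# With the default single-label setting the scores sum to roughly 1.
print(round(sum(result["scores"]), 6))  # 1.0
```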


### Text Generation

In [12]:
generator = pipeline("text-generation")
result = generator("In this course, we will teach you how to")
print(result)

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to do your job as a freelancer and how best to apply the techniques to your creative work to make business, life and finances more productive. We will build upon the lessons in this course and make a'}]


### Image Generation

In [None]:
# Use half precision on a GPU; fall back to full precision on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
on_gpu = device == "cuda"

# Load Stable Diffusion pipeline
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1",
                                         torch_dtype=torch.float16 if on_gpu else torch.float32,
                                         use_safetensors=True,
                                         variant="fp16" if on_gpu else None)
pipe = pipe.to(device)

# Define the prompt
text = """I am a happy follower of the Ethiopian Orthodox Church. I want a picture of me outside a church in traditional Ethiopian clothing, on a nice sunny day with a blue sky,
          while dreaming of AI that will have a positive impact on my church
"""

# Generate the image
image = pipe(prompt=text).images[0]
image

Fetching 13 files: 100%|██████████| 13/13 [01:37<00:00,  7.49s/it]
Loading pipeline components...: 100%|██████████| 6/6 [00:00<00:00,  8.51it/s]
  0%|          | 0/50 [00:00<?, ?it/s]

### Text-to-Speech

In [None]:
# Load text-to-speech model
synthesizer = pipeline("text-to-speech", "microsoft/speecht5_tts")

# Load speaker embedding
embedding_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embedding_dataset[1000]["xvector"]).unsqueeze(0)  # Use a valid index

# Generate speech
speech = synthesizer(
    "Hi to an artificial intelligence engineer, on the way to mastery.",
    forward_params={"speaker_embeddings": speaker_embeddings},
)

# Save and play audio
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
Audio("speech.wav")

# Hugging Face Tokenizers

In [10]:
import os
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import AutoTokenizer

# Load environment variables
load_dotenv()

True

In [11]:
# Get Hugging Face token
hf_token = os.getenv('HF_TOKEN')
if not hf_token:
    raise ValueError("HF_TOKEN is not set in the environment variables.")

# Log in to Hugging Face
login(hf_token, add_to_git_credential=True)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B', trust_remote_code=True)


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [12]:
# Convert the text to token IDs
text = "I am excited to show Tokenizers in action to my LLM Engineers"
tokens = tokenizer.encode(text)
tokens


[128000,
 40,
 1097,
 12304,
 311,
 1501,
 9857,
 12509,
 304,
 1957,
 311,
 856,
 445,
 11237,
 49796]

In [13]:
len(tokens)

15

In [16]:
# Convert the token IDs back to text
tokenizer.decode(tokens)

'<|begin_of_text|>I am excited to show Tokenizers in action to my LLM Engineers'

In [19]:
# Decode each token ID separately to see the individual pieces
tokenizer.batch_decode(tokens)

['<|begin_of_text|>',
 'I',
 ' am',
 ' excited',
 ' to',
 ' show',
 ' Token',
 'izers',
 ' in',
 ' action',
 ' to',
 ' my',
 ' L',
 'LM',
 ' Engineers']
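
The per-token strings above carry their own leading spaces, so concatenating them reproduces the full decoded text. A small sketch using a copy of that output (no tokenizer needed):

```python
# Per-token strings copied from the batch_decode output above.
pieces = ['<|begin_of_text|>', 'I', ' am', ' excited', ' to', ' show', ' Token',
          'izers', ' in', ' action', ' to', ' my', ' L', 'LM', ' Engineers']

# Joining the pieces with no separator rebuilds the decoded string.
rebuilt = "".join(pieces)
print(rebuilt)

# After the special token, a piece without a leading space either opens
# the text ('I') or continues the previous word ('izers', 'LM').
no_leading_space = [p for p in pieces[1:] if not p.startswith(" ")]
print(no_leading_space)  # ['I', 'izers', 'LM']
```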

### Instruct Variants of Models
- Many models have a variant that has been trained for use in chats.
- These variants are typically labelled with the word "Instruct" at the end of the model name.
- They have been trained to expect prompts in a particular format that includes system, user, and assistant messages.
- The tokenizer's utility method *apply_chat_template* converts the familiar list-of-messages format into the right input prompt for the model.


In [20]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [22]:
messages = [
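    # Both messages use the "user" role here; a system prompt would normally be sent with "role": "system"
    {"role": "user", "content": "You are a helpful assistant."},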
    {"role": "user", "content": "What is the capital of France?"}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>




#### Other models
* Phi-3 (Microsoft)
* Qwen2 (Alibaba)
* StarCoder2 (BigCode), for code generation

In [29]:
PHI3_MODEL_NAME = "microsoft/phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"                    
STARTCODER_MODEL_NAME = "bigcode/starcoder2-3b"

In [26]:
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME, trust_remote_code=True)
text = "I am excited to show Tokenizers in action to my LLM Engineers"
print(tokenizer.encode(text))
print()
print(phi3_tokenizer.encode(text))
tokens = phi3_tokenizer.encode(text)
print(phi3_tokenizer.decode(phi3_tokenizer.encode(text)))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 49796]

[306, 626, 24173, 304, 1510, 25159, 19427, 297, 3158, 304, 590, 365, 26369, 10863, 414]
I am excited to show Tokenizers in action to my LLM Engineers


In [27]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



<|user|>
You are a helpful assistant.<|end|>
<|user|>
What is the capital of France?<|end|>
<|assistant|>
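
The Phi-3 prompt above follows a simple pattern: a role tag on its own line, the message content, an `<|end|>` marker, and finally an opening `<|assistant|>` tag. The sketch below is a hand-written approximation of that layout, inferred only from the printed output. `render_phi3_style` is a hypothetical helper, not the library's implementation, and real code should keep using `apply_chat_template`, since the actual template is defined by each model's tokenizer configuration:

```python
def render_phi3_style(messages, add_generation_prompt=True):
    """Approximate the Phi-3 prompt layout shown in the output above."""
    parts = []
    for message in messages:
        # Each turn: role tag on its own line, the content, then an end marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = render_phi3_style(messages)
print(prompt)
```

Comparing this with the Llama 3.1 output shows why the template matters: the same message list becomes a very different string for each model family.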



In [30]:
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME, trust_remote_code=True)
print(tokenizer.encode(text))
print()
print(qwen2_tokenizer.encode(text))
tokens = qwen2_tokenizer.encode(text)
print(qwen2_tokenizer.decode(qwen2_tokenizer.encode(text)))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 49796]

[40, 1079, 12035, 311, 1473, 9660, 12230, 304, 1917, 311, 847, 444, 10994, 48696]
I am excited to show Tokenizers in action to my LLM Engineers
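
The same sentence produces a different ID sequence, and a different token count, for each tokenizer. A quick sketch comparing the three encodings printed above (the IDs are copied from those outputs, and the dictionary keys are just labels):

```python
text = "I am excited to show Tokenizers in action to my LLM Engineers"

# Token IDs copied from the outputs above.
encodings = {
    "Llama 3.1": [128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 49796],
    "Phi-3": [306, 626, 24173, 304, 1510, 25159, 19427, 297, 3158, 304, 590, 365, 26369, 10863, 414],
    "Qwen2": [40, 1079, 12035, 311, 1473, 9660, 12230, 304, 1917, 311, 847, 444, 10994, 48696],
}

# Compare the number of tokens each tokenizer needs for the same text.
n_words = len(text.split())
counts = {name: len(ids) for name, ids in encodings.items()}
for name, count in counts.items():
    print(f"{name}: {count} tokens for {n_words} words")
```

The Llama count includes the `<|begin_of_text|>` token (ID 128000) that its tokenizer prepends, so the totals are not directly comparable without accounting for special tokens.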


In [31]:
startcode2_tokenizer = AutoTokenizer.from_pretrained(STARTCODER_MODEL_NAME, trust_remote_code=True)
code = """
def hello_world(person):
    print(f"Hello, {person}")
    """
tokens = startcode2_tokenizer.encode(code)
for token in tokens:
    print(f"{token}={startcode2_tokenizer.decode([token])}")



222=

610=def
17966= hello
100=_
5879=world
45=(
6427=person
731=):
303=
   
1489= print
45=(
107=f
39="
8302=Hello
49=,
320= {
6427=person
8531=}")
294=
    
