<font size="6">**Hugging Face** - A Comprehensive Guide</font>

**Requirements**

For this tutorial, the following libraries are needed:
- Throughout the whole tutorial, we will be using the `transformers` library.
- For fine-tuning, either `pytorch` or `tensorflow` is required. (This notebook is implemented with `pytorch`.)
- To push the fine-tuned model to the Hugging Face Hub, the `huggingface_hub` library is required.
- The `datasets` and `evaluate` libraries are used to load data and compute metrics in the Datasets and fine-tuning sections.

In [1]:
#%pip install transformers
#%pip install torch
#%pip install huggingface_hub
#%pip install datasets evaluate pandas

# Core Components of Hugging Face
## Transformers Library

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Define the model_name, taken from the Hugging Face Hub
model_name = "cardiffnlp/tweet-topic-21-multi"

# We call the model class
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# We call the tokenizer class
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("text-classification", model = model, tokenizer = tokenizer)

# We can easily run the model by submitting an input
input_  = "I've been waiting for this tutorial all my life!"
output_ = classifier(input_)
print(output_)

[{'label': 'learning_&_educational', 'score': 0.46964123845100403}]


## Tokenizers
A tokenizer converts text into the numerical representation (token ids) that the model expects. We can load the tokenizer and apply it to a text.
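To make the idea concrete before loading a real tokenizer, here is a minimal sketch of the text-to-ids mapping. The vocabulary, the `toy_encode` function, and the whitespace splitting are all hypothetical simplifications; real Hugging Face tokenizers use learned subword vocabularies, but they return a dictionary of the same shape.

```python
# Toy illustration only: a hypothetical whitespace tokenizer with a tiny,
# hand-made vocabulary. This is not how the library is implemented.
VOCAB = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3,
         "using": 4, "a": 5, "transformer": 6, "is": 7, "simple": 8}

def toy_encode(text, max_length=8):
    """Map text to token ids and build the matching attention mask."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    tokens = tokens[:max_length]                      # naive truncation
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]
    mask = [1] * len(ids)                             # 1 = real token
    padding = max_length - len(ids)
    ids += [VOCAB["<pad>"]] * padding                 # pad to a fixed length
    mask += [0] * padding                             # 0 = padding, ignored by the model
    return {"input_ids": ids, "attention_mask": mask}

encoded = toy_encode("Using a transformer is simple")
print(encoded)
```

The attention mask has a 1 for every real token and a 0 for every padding position, which is the same convention the real tokenizers below follow.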

In [4]:
# Define the model_name, taken from the Hugging Face Hub
model_name = "cardiffnlp/tweet-topic-21-multi"

# We call the model class
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# We call the tokenizer class
tokenizer = AutoTokenizer.from_pretrained(model_name)


sequence = "Using a transformer sequence is quite simple"
tokenized_sequence = tokenizer(sequence)

# We obtain a dictionary with the token ids and an attention_mask that marks which positions are real tokens (1) and which are padding (0).
print("Tokenized_sequence: \n", tokenized_sequence)

# We get the tokens. tokenize() does not add special tokens; for this model, ids 0 and 2 in input_ids are the beginning- and end-of-sequence tokens.
tokens = tokenizer.tokenize(sequence)
print("Tokens: \n", tokens)

#We get the corresponding ids for each token
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens_ids: \n", tokens_ids)

# We decode the ids back into the original text.
decoded_string = tokenizer.decode(tokens_ids)
print(decoded_string)

Tokenized_sequence: 
 {'input_ids': [0, 36949, 10, 40878, 13931, 16, 1341, 2007, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Tokens: 
 ['Using', 'Ġa', 'Ġtransformer', 'Ġsequence', 'Ġis', 'Ġquite', 'Ġsimple']
Tokens_ids: 
 [36949, 10, 40878, 13931, 16, 1341, 2007]
Using a transformer sequence is quite simple


In [5]:
model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
# We call the model class
model = AutoModelForSequenceClassification.from_pretrained(model_name)

#We call the tokenizer class
tokenizer = AutoTokenizer.from_pretrained(model_name)

sequence = "Using a transformer sequence is quite simple"
tokenized_sequence = tokenizer(sequence)

# We obtain a dictionary with the token ids and an attention_mask that marks which positions are real tokens (1) and which are padding (0).
print("Tokenized_sequence: \n", tokenized_sequence)

# We get the tokens. tokenize() does not add special tokens; ids 101 and 102 in input_ids are [CLS] and [SEP], which mark the start and end of the sequence.
tokens = tokenizer.tokenize(sequence)
print("Tokens: \n", tokens)

#We get the corresponding ids for each token
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens_ids: \n", tokens_ids)

# We decode the ids back into the original text.
decoded_string = tokenizer.decode(tokens_ids)
print(decoded_string)

Tokenized_sequence: 
 {'input_ids': [101, 32935, 169, 99662, 10165, 30265, 10124, 31324, 16205, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Tokens: 
 ['Using', 'a', 'transform', '##er', 'sequence', 'is', 'quite', 'simple']
Tokens_ids: 
 [32935, 169, 99662, 10165, 30265, 10124, 31324, 16205]
Using a transformer sequence is quite simple
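The two outputs above split the same sentence differently because the models use different subword schemes: the `Ġ` prefix in the first output marks a token that was preceded by a space (byte-level BPE), while the `##` prefix in the second marks a piece that continues the previous one (WordPiece). The following is a minimal sketch of the greedy longest-match-first splitting typically used for WordPiece at inference time. The vocabulary and the `wordpiece` helper are hypothetical; real vocabularies are learned from data and contain tens of thousands of entries.

```python
# Hypothetical mini-vocabulary for illustration only.
VOCAB = {"using", "a", "transform", "##er", "sequence", "is", "quite", "simple"}

def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter candidate
        if piece is None:
            return [unk]                       # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

tokens = [p for w in "using a transformer".split() for p in wordpiece(w, VOCAB)]
print(tokens)
```

Because "transformer" is not in the toy vocabulary but "transform" and "##er" are, the word is split into two pieces, mirroring the `'transform', '##er'` pair in the real output.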


## Datasets

In [6]:
from datasets import load_dataset
import pandas as pd

# Load a dataset by name
dataset = load_dataset('squad')

# The dataset object is now a DatasetDict with predefined splits
print(dataset)  # Show the available splits with their features and row counts
pd.DataFrame.from_dict(dataset["train"])

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"{'text': ['Saint Bernadette Soubirous'], 'answ..."
1,5733be284776f4190066117f,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,"{'text': ['a copper statue of Christ'], 'answe..."
2,5733be284776f41900661180,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,"{'text': ['the Main Building'], 'answer_start'..."
3,5733be284776f41900661181,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,{'text': ['a Marian place of prayer and reflec...
4,5733be284776f4190066117e,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,{'text': ['a golden statue of the Virgin Mary'...
...,...,...,...,...,...
87594,5735d259012e2f140011a09d,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",In what US state did Kathmandu first establish...,"{'text': ['Oregon'], 'answer_start': [229]}"
87595,5735d259012e2f140011a09e,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",What was Yangon previously known as?,"{'text': ['Rangoon'], 'answer_start': [414]}"
87596,5735d259012e2f140011a09f,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",With what Belorussian city does Kathmandu have...,"{'text': ['Minsk'], 'answer_start': [476]}"
87597,5735d259012e2f140011a0a0,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",In what year did Kathmandu create its initial ...,"{'text': ['1975'], 'answer_start': [199]}"


# Getting Started with Hugging Face

## Using Pre-trained Models
The Hugging Face `transformers` library is one of the most popular NLP libraries on GitHub, with over 115k stars. It offers a wide variety of models for different tasks. You can browse all supported tasks in the [pipeline documentation](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.model).

The most common ones are `text classification`, `QA`, and `translation`, among others.

Our first example uses a **sentiment analysis** (text classification) model to infer the sentiment of an input text.

The `pipeline()` function is a high-level API that allows users to easily apply complex models to real-world problems.
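Conceptually, a text-classification pipeline chains three stages: tokenize the input, run the model to get one raw score (logit) per label, and post-process those scores into a label and a probability. The sketch below covers only the last stage, using the standard library. The logits and the `id2label` mapping are made-up values for illustration, and `postprocess` is a hypothetical helper, not the library's own code.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pick the most probable label, in the same shape a classifier pipeline returns."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Assumed values: three raw scores and a hypothetical label mapping.
id2label = {0: "positive", 1: "neutral", 2: "negative"}
result = postprocess([2.0, 0.5, -1.0], id2label)
print(result)
```

This is why the `score` in the outputs below is always between 0 and 1: it is the probability assigned to the winning label, not a raw model output.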

In [7]:
# Define the model_name, taken from the Hugging Face Hub
model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"

# We call the model class
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# We call the tokenizer class
tokenizer = AutoTokenizer.from_pretrained(model_name)

# We define a pipeline object with the task to be performed, the selected model, and the selected tokenizer.
# If we initialize the pipeline with only the task name, a default model and tokenizer are used.
classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)
#classifier = pipeline("sentiment-analysis")

# We can easily run the model by submitting an input
input_  = "I've been waiting for this tutorial all my life!"
output_ = classifier(input_)
print(output_)

[{'label': 'positive', 'score': 0.56911301612854}]


In [8]:
# We can also submit a list of inputs to classify several texts at once
input_ = ["I've been waiting for this tutorial all my life!","I hate this..." ]
output_ = classifier(input_)
print(output_)

[{'label': 'positive', 'score': 0.56911301612854}, {'label': 'negative', 'score': 0.9502121806144714}]


## Fine-tuning Models

In [9]:
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
import evaluate

import pandas as pd

import numpy as np
#____________________________________________ 1. prepare dataset

model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Loading the dataset to train our model
dataset = load_dataset("mteb/tweet_sentiment_extraction")

Checking the dataset to be used. 

In [10]:
pd.DataFrame.from_dict(dataset["train"])

Unnamed: 0,id,text,label,label_text
0,cb774db0d1,"I`d have responded, if I were going",1,neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,0,negative
2,088c60f138,my boss is bullying me...,0,negative
3,9642c003ef,what interview! leave me alone,0,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...",0,negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,0,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,0,negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,2,positive
27479,ed167662a5,But it was worth it ****.,2,positive


In [11]:
#____________________________________________ 2. Load the pretrained tokenizer and apply it to the dataset -> encodings.
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)


#____________________________________________  3. Select small train and evaluation subsets to keep the demo fast.
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10))

#____________________________________________  4. Train the model
# Note: recent transformers releases rename `evaluation_strategy` to `eval_strategy`.
training_args = TrainingArguments(output_dir="trainer_output", evaluation_strategy="epoch")

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

#____________________________________________  5. Evaluate the model
trainer.evaluate()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.794825,0.3
2,No log,1.625867,0.2
3,No log,1.597192,0.2


{'eval_loss': 1.597192406654358,
 'eval_accuracy': 0.2,
 'eval_runtime': 0.3246,
 'eval_samples_per_second': 30.809,
 'eval_steps_per_second': 6.162,
 'epoch': 3.0}
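The `compute_metrics` function in the training cell takes the raw logits, picks the highest-scoring class per example, and compares it with the true label. The following is a minimal standard-library sketch of that same calculation; the logits and labels are invented for illustration, and `argmax` and `accuracy` are hypothetical helpers rather than the `evaluate` implementation.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up scores for five examples and three classes.
logits = [[0.1, 2.3, -0.5],
          [1.7, 0.2, 0.1],
          [0.0, 0.4, 0.9],
          [0.3, 0.2, 1.5],
          [2.0, 1.0, 0.0]]
labels = [1, 0, 2, 0, 1]

score = accuracy(logits, labels)
print(score)
```

With only 10 evaluation samples, as in the run above, each example moves the accuracy by 0.1, so values such as 0.2 or 0.3 are too noisy to say anything about model quality. The small subsets only keep the demo fast.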

In [12]:
trainer.save_model("Fine_Tuned_Models")

## Sharing Models

Hugging Face is a community-driven platform. This means we can all share our models or fine-tuned versions with the whole community.

To do so, we first need to log in to our Hugging Face account. The `huggingface_hub` library provides the `notebook_login` function, which lets us do this from a Jupyter Notebook.

In [13]:
# Replace with the actual path to your fine-tuned model
model_path = "Fine_Tuned_Models"  

# We get our model
finetuned_model = AutoModelForSequenceClassification.from_pretrained(model_path)


In [14]:
from huggingface_hub import notebook_login

notebook_login()


In [15]:
finetuned_model.push_to_hub("distilbert-base-multilingual-cased-sentiments-student-fine-tuned-data-camp")


CommitInfo(commit_url='https://huggingface.co/rfeers/distilbert-base-multilingual-cased-sentiments-student-fine-tuned-data-camp/commit/6e00368ebca1c2b68fd32b387338c948ea71aed8', commit_message='Upload DistilBertForSequenceClassification', commit_description='', oid='6e00368ebca1c2b68fd32b387338c948ea71aed8', pr_url=None, pr_revision=None, pr_num=None)

# Use Cases and Applications

With Hugging Face, most NLP tasks follow the same three simple steps, which take fewer than five lines of code:
1. Define a pipeline object for the task (optionally with a specific model and tokenizer).
2. Define the input text or prompt.
3. Run the pre-trained model on the input and inspect the output.

## Text Classification
Text classification is a fundamental task in natural language processing (NLP) where a piece of text is assigned to one or more categories. This can be used for a variety of applications such as spam detection, sentiment analysis, topic labeling, and more. 

In [16]:
from transformers import pipeline

# Load the pre-trained text classification model.
classifier = pipeline("text-classification",model='lxyuan/distilbert-base-multilingual-cased-sentiments-student')

# Input to be classified
input_ = "I absolutely love the transformers library!"

# Perform classification
output_ = classifier(input_)

# Observe the result
print(output_)


[{'label': 'positive', 'score': 0.9909080266952515}]


## Text Generation
The process of generating text is a fascinating aspect of NLP where a model produces human-like text. 

It has a wide range of applications from creating chatbot responses to generating creative writing. 

**The core idea is to train a model on a large corpus of text**, enabling it to learn patterns, styles, and structures of language. 
As you can imagine, the most expensive part is precisely the training of the model. 
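To illustrate the "learn patterns from a corpus, then generate" idea at the smallest possible scale, here is a toy sketch that counts which word follows which and then extends a prompt one word at a time. This is a hypothetical bigram model, not how GPT-2 works internally: large language models replace the counting table with a neural network, but the generate-one-token-then-repeat loop is the same idea.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus for illustration.
corpus = "the cat sat on the mat . the cat sat on the sofa ."

# "Training": count how often each word follows each other word.
counts = defaultdict(Counter)
words = corpus.split()
for current, following in zip(words, words[1:]):
    counts[current][following] += 1

def generate(prompt, max_new_tokens=5):
    """Greedily append the most frequent follower of the last word."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break                                   # unseen word: nothing to predict
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)

generated = generate("the", max_new_tokens=3)
print(generated)
```

The `max_new_tokens` argument plays a role similar to the length limit passed to the real generator: it bounds how far the model continues the prompt.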

So let’s see how we can run a text generation task with fewer than five lines of code!


In [17]:
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

prompt = "In a world dominated by AI,"

generated_text = generator(prompt, max_length=50)[0]['generated_text']

print(generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a world dominated by AI, the problem is often ignored, but these efforts prove remarkably effective. In a study published in February 2014 in the journal Nature Computational Biology, researchers at Harvard Medical School identified the next step to build a software system so


## Question Answering

Question Answering, commonly referred to as QA, is a field in NLP focused on building systems that automatically answer questions posed by humans in natural language.

QA systems are widely used in various applications, such as virtual assistants, customer support, and information retrieval systems.

QA systems can be broadly categorized into two types:
- **Open-domain QA:** Answers questions based on a broad range of knowledge, often sourced from the internet or large databases.
- **Closed-domain QA:** Focuses on a specific domain, like medicine or law, and answers questions from a limited dataset.

These systems typically use a combination of natural language understanding to interpret the question and information retrieval to find relevant answers.

In [18]:
from transformers import pipeline

qa_pipeline = pipeline('question-answering', model='distilbert-base-uncased-distilled-squad')

context = """Paris is the capital and most populous city of France. The city has an area of 105 square kilometers and a population of 2,140,526 residents."""
question = "What is the population of Paris?"

answer = qa_pipeline(question=question, context=context)
print(answer)


{'score': 0.9549079537391663, 'start': 121, 'end': 130, 'answer': '2,140,526'}
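The model used here is extractive: rather than writing an answer, it selects a span of the context. The `start` and `end` fields in the output above are character offsets into the context string, which the short check below confirms by slicing. The `result` dictionary is copied from the output above instead of being recomputed, so the snippet runs without downloading the model.

```python
context = """Paris is the capital and most populous city of France. The city has an area of 105 square kilometers and a population of 2,140,526 residents."""

# Copied from the pipeline output above.
result = {"score": 0.9549079537391663, "start": 121, "end": 130, "answer": "2,140,526"}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```

These offsets are handy when you want to highlight the answer inside the original passage rather than just display it.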


## Translation
The final use case is translation. Machine translation is a subfield of computational linguistics that focuses on translating text or speech from one language to another using software. With the advent of deep learning, machine translation has made significant strides, particularly with Neural Machine Translation (NMT) models, which use large neural networks.

Modern NMT systems, unlike traditional rule-based or statistical translation models, learn to translate by training on large datasets of bilingual text. They use sequence-to-sequence architectures, where one part of the network encodes the source text and another decodes it into the target language, often with impressive fluency and accuracy.

A simple example using Hugging Face would be:

In [19]:
from transformers import pipeline

# Load the translation pipeline for English to German
translator = pipeline('translation_en_to_de')

# Text to translate from English to German
text_to_translate = "This is a great day for science!"

# Perform the translation
translation = translator(text_to_translate, max_length=40)

# Print the translated text
print(translation[0]['translation_text'])

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Dies ist ein großer Tag für die Wissenschaft!
