# Hugging Face: A Deep Dive

Hugging Face is a technology company that has significantly influenced natural language processing (NLP) and machine learning. Best known for its open-source Transformers library, Hugging Face has made state-of-the-art NLP models and tools accessible to researchers and developers alike.

## Outline

Hugging Face boasts a plethora of components designed to facilitate advanced research and practical applications. For the scope of our discussion, we will concentrate on four key components that form the foundation of many NLP tasks:

* **Tokenizers**: The building blocks of NLP, tokenizers convert text into sequences of integer token IDs that models can process.
* **Models**: Serving as the core of the platform, Hugging Face provides access to a vast library of pre-trained models, from GPT to Qwen and beyond.
* **Datasets**: Data is the lifeblood of machine learning. We'll discuss Hugging Face's datasets library, which offers a standardized way to access, process, and utilize a diverse range of datasets tailored for NLP tasks.
* **Trainers**: Trainers in Hugging Face provide a unified interface for training models efficiently, simplifying the model training process.

Last, we’ll focus on two exciting libraries created by Hugging Face, which allow you to use parameter-efficient fine-tuning and reinforcement learning when developing transformer models. Let’s get started.

## Tokenizers

Tokenization involves breaking down text into smaller units, often referred to as tokens, which can be as small as characters or as long as words. Hugging Face offers a dedicated library for tokenization that is both robust and efficient.

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(f"The number of distinct tokens that bert-base-uncased uses is {tokenizer.vocab_size}")

The number of distinct tokens that bert-base-uncased uses is 30522


Let’s see them in action! Here we load the tokenizer used for a BERT model, in this case using a pre-trained model with the name bert-base-uncased. Uncased refers to the model being trained using a tokenizer that does not distinguish between upper and lower case.

Here we see that bert-base-uncased uses 30,522 distinct tokens. This vocabulary is composed of individual characters, whole words, and word parts.

In [None]:
sentence = "I heart Generative AI."
tokens = tokenizer.tokenize(sentence)
print(tokens)

['i', 'heart', 'genera', '##tive', 'ai', '.']


We will tokenize the sentence "I heart Generative AI."

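We use the tokenizer.tokenize method and print the result, which shows that "I heart Generative AI." was split into 6 tokens, including one for the final period. The text was lowercased, and one of the words, generative, was split into two tokens: genera and ##tive. The ## prefix marks a piece that continues the previous token.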
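To make the splitting behavior concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting, the general idea behind BERT's WordPiece tokenization. It is not the library's implementation: `TOY_VOCAB`, `split_words`, and `wordpiece` are hypothetical names, and the vocabulary contains only the pieces needed for the example sentence.

```python
import string

# Hypothetical toy vocabulary: just enough entries to cover the example sentence.
# The real bert-base-uncased vocabulary has 30,522 entries.
TOY_VOCAB = {"i", "heart", "genera", "##tive", "ai", ".", "[UNK]"}


def split_words(text):
    """Lowercase the text and split on whitespace, isolating punctuation marks."""
    words, current = [], ""
    for ch in text.lower():
        if ch.isspace():
            if current:
                words.append(current)
            current = ""
        elif ch in string.punctuation:
            if current:
                words.append(current)
            words.append(ch)
            current = ""
        else:
            current += ch
    if current:
        words.append(current)
    return words


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_sentence = "I heart Generative AI."
toy_tokens = [p for w in split_words(toy_sentence) for p in wordpiece(w, TOY_VOCAB)]
print(toy_tokens)
```

Because "generative" is not in the toy vocabulary but "genera" and "##tive" are, the word is split into those two pieces, mirroring the output of the real tokenizer above.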

In [None]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[1045, 2540, 11416, 6024, 9932, 1012]


We can also see the actual token values as numbers that the model uses internally, starting with 1045.
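Conceptually, converting tokens to IDs is a dictionary lookup in the vocabulary, and the reverse lookup recovers the tokens. The sketch below uses only the six token/ID pairs printed above as a stand-in vocabulary; `toy_vocab`, `id_to_token`, and `merge_pieces` are hypothetical names, not part of the library.

```python
# Token/ID pairs copied from the outputs shown above. The real vocabulary
# has 30,522 entries, so this six-entry dictionary is only a stand-in.
toy_vocab = {"i": 1045, "heart": 2540, "genera": 11416, "##tive": 6024, "ai": 9932, ".": 1012}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

tokens = ["i", "heart", "genera", "##tive", "ai", "."]
ids = [toy_vocab[t] for t in tokens]          # tokens -> IDs
recovered = [id_to_token[i] for i in ids]     # IDs -> tokens


def merge_pieces(pieces):
    """Glue ##-prefixed continuation pieces back onto the preceding piece."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


print(ids)
print(merge_pieces(recovered))
```

Note that the round trip returns lowercased text: an uncased tokenizer discards the original capitalization.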

One notable feature of Hugging Face’s tokenizers is their speed. By leveraging a programming language called Rust under the hood, the library ensures rapid tokenization, even for vast amounts of text.

## Models

Let’s look at accessing the diverse and expansive collection of open-source models available on Hugging Face. There you’ll find many foundation models, such as the GPT-OSS models, the Qwen models, and Google’s Gemma models. You’ll also find fine-tuned variants of these models that the community has made!

In [None]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

model_name = 'textattack/bert-base-uncased-imdb'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

Let’s download a model from Hugging Face and use it for sentiment analysis!

Here's a quick demonstration using a pre-trained model to classify the sentiment of the sentence "I love Generative AI". First we load a model and a tokenizer that were pre-trained by the community and are available on Hugging Face. In this case we are using the BERT model published by the user textattack, which does not differentiate between upper and lower case and has been fine-tuned on a movie review sentiment dataset known as IMDb.

In [None]:
sentence = "I love Generative AI"
inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=1)
    prediction = torch.argmax(probabilities, dim=1).item()

if prediction == 1:
    print(f'The sentiment is positive with a probability of {probabilities[0][1]:.2f}')
else:
    print(f'The sentiment is negative with a probability of {probabilities[0][0]:.2f}')

The sentiment is positive with a probability of 0.89


To get the prediction, we go through a few steps:

1. First we tokenize the input using the tokenizer.
2. Next, we run the model inside torch.no_grad(), which disables gradient tracking. We are only using the model for prediction, not updating it.
3. We then apply softmax to the model's output logits to get probabilities for negative vs. positive sentiment.
4. We take the class with the higher probability. With only two classes, the winner is the one that is more than 50% likely, and we print it along with its probability.

So, "I love Generative AI" has a positive sentiment according to this model, with a probability of 89%.
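The softmax and argmax steps can be sketched in plain Python, without PyTorch. The logits below are invented for illustration and were chosen so that the positive class comes out near the 0.89 reported above; they are not the model's actual outputs. The convention that index 1 means positive follows the code above.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [-1.0, 1.1]  # hypothetical scores for [negative, positive]
probs = softmax(toy_logits)

# argmax: index of the largest probability
pred = max(range(len(probs)), key=probs.__getitem__)

label_names = {0: "negative", 1: "positive"}
print(f"The sentiment is {label_names[pred]} with a probability of {probs[pred]:.2f}")
```

Softmax does not change which class has the highest score, so argmax over the logits would give the same prediction. Softmax is applied here because we also want to report a probability.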

## Datasets

Let’s now turn to accessing datasets. Hugging Face's "Datasets" library is designed to expedite and simplify the process of accessing, preprocessing, and managing vast amounts of data for AI projects.

In [None]:
from datasets import load_dataset

imdb_dataset = load_dataset('imdb')

How easy is it? The datasets library provides an efficient way to access a myriad of datasets, including the popular IMDb dataset, which contains movie reviews. In this example, we will load the IMDb dataset and display one of its reviews.

Let's start by importing load_dataset from datasets. Next we call load_dataset with the string 'imdb' to create a dataset object. Look, just one line!

In [None]:
review_number = 42
review_text = imdb_dataset['train'][review_number]['text']
review_label = imdb_dataset['train'][review_number]['label']

print(f"Review text: {review_text}")
print(f"Label: {'Positive' if review_label == 1 else 'Negative'}")

Label: Negative


There are a lot of reviews in here. We’ll look at number 42 from the train split of the dataset, since 42 is a great number. To access the text of a review, we use the key named text. The label is stored as either 0 or 1 and is obtained using the label key.

Running this, we see the output. The movie review says, "WARNING: this review contains spoilers." It continues, “With a cast like this, you wonder whether or not the actors and actresses know exactly what they were getting into.” It actually contains a sarcastic remark about the movie Close Encounters of the Third Kind, which I thought was a great movie. Bom-Bom-bom-BOM_BOM! Still, in this dataset, the review was labelled as negative, which I agree with.
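The indexing pattern used above (split name, then row number, then column name) can be illustrated with a plain-Python stand-in. This is only a sketch of the access pattern: `toy_dataset` and `describe` are hypothetical, the two reviews are invented placeholders rather than rows from IMDb, and a real dataset object offers far more than a nested dictionary.

```python
# Stand-in for a loaded dataset: split name -> list of rows -> column name.
# The two reviews are invented placeholders, not rows from the real IMDb data.
toy_dataset = {
    "train": [
        {"text": "A wonderful, heartfelt film.", "label": 1},
        {"text": "Two hours I will never get back.", "label": 0},
    ],
}

# Same convention as the code above: 1 is positive, anything else negative.
LABEL_NAMES = {0: "Negative", 1: "Positive"}


def describe(dataset, split, index):
    """Return the text and human-readable label of one row."""
    row = dataset[split][index]
    return row["text"], LABEL_NAMES[row["label"]]


text, label = describe(toy_dataset, "train", 1)
print(f"Review text: {text}")
print(f"Label: {label}")
```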

An important feature of the datasets library is its efficiency. Built on top of Apache Arrow, it allows for lightning-fast operations, ensuring that even large datasets can be processed seamlessly without hogging memory resources. This becomes particularly advantageous when working with extensive corpora or when performing large-scale data analysis.

## Trainer

Hugging Face also makes training a model easier! The Trainer class offers a streamlined solution for training and fine-tuning machine learning models. It encapsulates much of the complexity associated with training loops, evaluation, and optimization.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Define the pre-trained model name
model_name = "distilbert-base-uncased"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Define a function to tokenize the dataset
def tokenize_function(examples):
    # Tokenize the text, pad to the max length, and truncate
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Load the IMDb dataset
imdb_dataset = load_dataset('imdb')

# Select a smaller subset of the dataset for faster training
subset_size = 1000 # Define the size of the subset
small_train_dataset = imdb_dataset["train"].shuffle(seed=42).select(range(subset_size))
small_test_dataset = imdb_dataset["test"].shuffle(seed=42).select(range(subset_size))


# Apply the tokenization function to the subset datasets
tokenized_datasets = {
    "train": small_train_dataset.map(tokenize_function, batched=True),
    "test": small_test_dataset.map(tokenize_function, batched=True)
}

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let’s start with loading the distilbert-base-uncased pre-trained model as well as the tokenizer it uses.

Next we create a function called tokenize_function, which takes a batch of reviews and converts them to tokens that the language model will understand. Notice that we are truncating long reviews as well as padding shorter ones, so every example ends up the same length. We then load the IMDb movie review dataset. Notice that we take a 1,000-example subset of each split for demonstration purposes; you are welcome to train on the whole dataset. Finally, we map each subset to a new dataset containing the tokenized, padded, and truncated reviews.
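To show what padding and truncation do to a sequence of token IDs, here is a minimal sketch. It is a simplification under stated assumptions: `pad_and_truncate` is a hypothetical helper, the pad ID of 0 and the small `max_length` values are chosen for illustration, and the special tokens a real tokenizer adds are left out.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    clipped = token_ids[:max_length]            # truncation: drop anything past max_length
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad      # padding: fill the remainder with pad_id
    attention_mask = [1] * len(clipped) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Token IDs from the tokenizer example earlier in this article.
example_ids = [1045, 2540, 11416, 6024, 9932, 1012]

padded_ids, padded_mask = pad_and_truncate(example_ids, max_length=8)
clipped_ids, clipped_mask = pad_and_truncate(example_ids, max_length=4)

print(padded_ids, padded_mask)
print(clipped_ids, clipped_mask)
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the prediction.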

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    save_strategy="epoch",
    report_to='none', # Disable reporting to experiment tracking services
)

With that all set, we define some training arguments. Here we set an output directory for the results, the learning rate, the per-device batch sizes for training and evaluation, and the number of epochs to train for. We also save a checkpoint at the end of every epoch and disable reporting to experiment tracking services. The dataset splits used for training and for the final evaluation are not set here; they are passed to the Trainer in the next cell. Being able to refer to the splits by name is useful if your dataset comes with them. If it doesn’t, you’ll need to split it manually.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)

trainer.train()

TrainOutput(global_step=48, training_loss=0.5464539925257365, metrics={'train_runtime': 153.2866, 'train_samples_per_second': 19.571, 'train_steps_per_second': 0.313, 'total_flos': 397402195968000.0, 'train_loss': 0.5464539925257365, 'epoch': 3.0})

Last, we run trainer.train(). Behind the scenes a training loop is being run, and in this case we are fine-tuning an existing model. We can be assured that if this model training fails, it will not be for a faulty implementation of our training loop, but for some other reason.
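For intuition about what a training loop does, here is a toy version in plain Python: it loops over epochs and mini-batches, computes a loss, computes a gradient, and updates a parameter. This is a sketch of the general pattern only, not the Trainer's implementation. The one-weight "model", the data, and the hyperparameters are all invented for illustration.

```python
# Toy data for y = 3x; the "model" is a single weight w, initially 0.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
learning_rate = 0.01
batch_size = 4
num_epochs = 3

epoch_losses = []
for epoch in range(num_epochs):
    batch_losses = []
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Forward pass: mean squared error of the current weight on this batch.
        loss = sum((w * x - y) ** 2 for x, y in batch) / len(batch)
        # Backward pass: derivative of that loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        # Optimizer step: move w against the gradient.
        w -= learning_rate * grad
        batch_losses.append(loss)
    epoch_losses.append(sum(batch_losses) / len(batch_losses))
    print(f"epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}, w = {w:.4f}")
```

A real fine-tuning loop additionally handles details such as device placement, learning-rate scheduling, and checkpointing, which is the complexity a ready-made trainer hides.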

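As it trains, we can watch its progress. When intermediate losses are logged, a decreasing loss is a sign that our optimization of the model is working. With the 1,000-review subset the run is short: the output above reports 48 steps, about 153 seconds of runtime, and an average training loss of roughly 0.55. Training on the full dataset would take many more steps. Oh the anticipation!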
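The step count in the TrainOutput above can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_steps` is a hypothetical helper, not a library function.

```python
import math


def total_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run, keeping the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# The run above: 1,000 examples, batch size 64, 3 epochs.
subset_steps = total_steps(1000, 64, 3)

# The same settings on the full 25,000-review IMDb train split.
full_steps = total_steps(25_000, 64, 3)

print(subset_steps, full_steps)
```

With 1,000 examples and a batch size of 64, each epoch takes 16 steps, so 3 epochs give the 48 steps reported as global_step.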
