# Hugging Face
Hugging Face is best known for its open-source Transformers library, which has made state-of-the-art NLP models and tools accessible to both researchers and developers. Its ecosystem offers everything from tokenizers, which help computers make sense of text, to a huge variety of ready-to-go language models and a large collection of datasets suited to language tasks.

## Key Components
- Tokenizers: These convert text into a format that models can understand. They work like a translator, breaking the words we use into smaller parts and mapping each part to a number that computers can work with.
- Models: Hugging Face offers a vast library of pre-trained models, including popular ones like BERT and GPT. These act as the brain for NLP tasks, making predictions based on patterns learned from the data they were trained on.
- Datasets: The datasets library provides a standardized way to access and use datasets tailored for NLP tasks. Datasets are like textbooks for machine learning models: collections of information that models study to learn and improve.
- Trainers: Trainers provide a unified interface for training models efficiently. They are the coaches for models, helping them get better at their tasks through practice and guidance. Hugging Face Trainers implement the PyTorch training loop for you, so you can focus on other aspects of working on the model.

## Hugging Face Tokenizers
Hugging Face tokenizers break text down into smaller, manageable pieces called tokens. They are easy to use and remarkably fast, thanks to an implementation written in the Rust programming language.

Tokenization: A critical pre-processing step in NLP that breaks text down into smaller units called tokens, which can be words, characters, or subwords, to make it easier to analyze.

Tokens: These are the pieces you get after cutting up text during tokenization, kind of like individual Lego blocks that can be words, parts of words, or even single letters. These tokens are converted to numerical values for models to understand.

Pre-trained Model: A ready-made model that has already been trained on a large amount of data.

Uncased: The model treats uppercase and lowercase letters as the same, because text is lowercased before it is tokenized.

Numerical Representation: The tokens are converted into numerical values that the model uses internally.

In [3]:
!pip install -qU --user transformers




In [4]:
from transformers import BertTokenizer

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# See how many tokens are in the vocabulary
tokenizer.vocab_size
# 30522


In [None]:
# Tokenize the sentence
tokens = tokenizer.tokenize("I heart Generative AI")

# Print the tokens
print(tokens)
# ['i', 'heart', 'genera', '##tive', 'ai']

# Show the token ids assigned to each token
print(tokenizer.convert_tokens_to_ids(tokens))
# [1045, 2540, 11416, 6024, 9932]
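
To see where pieces like `##tive` come from, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece tokenizers such as BERT's use. The vocabulary `TOY_VOCAB` and the helper names are made up for illustration. The real `bert-base-uncased` vocabulary has different ids, and the real tokenizer also handles punctuation and special tokens, which this sketch skips.

```python
# Toy vocabulary: hypothetical, NOT the real bert-base-uncased vocabulary or ids.
TOY_VOCAB = {"[UNK]": 0, "i": 1, "heart": 2, "genera": 3, "##tive": 4, "ai": 5}


def wordpiece_tokenize(word, vocab):
    """Split one lowercase word using greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase (the 'uncased' part), split on whitespace, then split each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


tokens = tokenize("I heart Generative AI", TOY_VOCAB)
ids = [TOY_VOCAB[token] for token in tokens]
print(tokens)
# ['i', 'heart', 'genera', '##tive', 'ai']
print(ids)
# [1, 2, 3, 4, 5]
```

Because "generative" is not in the toy vocabulary as a whole word, the loop falls back to the longest prefix it knows (`genera`) and then matches the rest as a continuation piece (`##tive`).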


## Hugging Face Models
Hugging Face models provide a quick way to get started using models trained by the community. The library includes state-of-the-art models such as BERT, GPT-2, and RoBERTa, which have set benchmarks in different NLP domains.

In [None]:
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load a pre-trained sentiment analysis model
model_name = "textattack/bert-base-uncased-imdb"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the input sequence
tokenizer = BertTokenizer.from_pretrained(model_name)
inputs = tokenizer("I love Generative AI", return_tensors="pt")

# Make prediction
with torch.no_grad():
    outputs = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(outputs, dim=1)
    predicted_class = torch.argmax(probabilities)

# Display sentiment result
if predicted_class == 1:
    print(f"Sentiment: Positive ({probabilities[0][1] * 100:.2f}%)")
else:
    print(f"Sentiment: Negative ({probabilities[0][0] * 100:.2f}%)")
# Sentiment: Positive (88.68%)
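
The softmax and argmax steps in the cell above can be sketched in plain Python to show what they do to the model's raw scores. The logits below are made-up values chosen for illustration, not real model output, and `softmax` is a hand-written helper rather than the torch function.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits in the order [negative, positive]; not real model output.
logits = [-1.0, 1.0]
probabilities = softmax(logits)

# argmax: index of the largest probability
predicted_class = probabilities.index(max(probabilities))

label = "Positive" if predicted_class == 1 else "Negative"
message = f"Sentiment: {label} ({probabilities[predicted_class] * 100:.2f}%)"
print(message)
# Sentiment: Positive (88.08%)
```

Softmax only rescales the scores, so the class with the largest logit is always the class with the largest probability. The probabilities are useful mainly for reporting how confident the prediction is.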

## Hugging Face Datasets
The Hugging Face Datasets library is a powerful tool for managing a variety of data types, like text and images, efficiently and easily. It is fast and memory-efficient, which makes it well suited to big projects. It is designed to simplify the process of accessing, pre-processing, and managing large datasets for machine learning projects.

Purpose of the Library: The library provides a unified API that allows users to access a wide variety of datasets, including text, audio, and image datasets, making it easier to work with data in machine learning.

Efficiency: Built on Apache Arrow, the library is optimized for speed and memory efficiency, allowing for fast data processing even with large datasets.

Tokenization: The library works together with tokenizers, for example by applying a tokenization function to every record, which is essential for preparing text data for model input. Datasets can also be easily indexed and sliced to extract and display relevant information.

Reproducibility and Versioning: The library promotes reproducibility in research by allowing users to access specific versions of datasets and even contribute their own datasets.

IMDb dataset: A dataset of movie reviews that can be used to train a machine learning model to understand human sentiments.

Apache Arrow: A software framework built around a columnar in-memory data format that allows for fast data processing.

In [None]:
from datasets import load_dataset
from IPython.display import HTML, display

# Load the IMDB dataset, which contains movie reviews
# and sentiment labels (positive or negative)
dataset = load_dataset("imdb")

# Fetch a review from the training set
review_number = 42
sample_review = dataset["train"][review_number]

display(HTML(sample_review["text"][:450] + "..."))
# WARNING: This review contains SPOILERS. Do not read if you don't want some points revealed to you before you watch the
# film.
# 
# With a cast like this, you wonder whether or not the actors and actresses knew exactly what they were getting into. Did they
# see the script and say, `Hey, Close Encounters of the Third Kind was such a hit that this one can't fail.' Unfortunately, it does.
# Did they even think to check on the director's credentials...

if sample_review["label"] == 1:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")
# Sentiment: Negative
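
The cell above reads a single record by its index. As a rough mental model of how a column-oriented table can serve both whole rows and whole columns, here is a small sketch. `ToyColumnarTable` and its three reviews are hypothetical stand-ins invented for illustration. They are not the Datasets API and not real IMDb data.

```python
from collections import Counter


class ToyColumnarTable:
    """Hypothetical stand-in that stores data column by column."""

    def __init__(self, columns):
        self.columns = columns  # maps column name -> list of values

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # A column name returns the whole column.
            return self.columns[key]
        # An integer index assembles one row from every column.
        return {name: values[key] for name, values in self.columns.items()}


# Made-up reviews; label 1 means positive and 0 means negative.
train = ToyColumnarTable({
    "text": ["Loved it.", "Dull and far too long.", "A real surprise."],
    "label": [1, 0, 1],
})

row = train[1]
print(row)
# {'text': 'Dull and far too long.', 'label': 0}

label_counts = Counter(train["label"])
print(label_counts[1], label_counts[0])
# 2 1
```

Reading one column, as in the label count, touches a single list instead of every record, which is part of why columnar layouts suit large datasets.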

## Hugging Face Trainers
The Hugging Face Trainer class simplifies the process of training and fine-tuning machine learning models. It offers a simplified approach to training generative AI models, making it easier to set up and run complex machine learning tasks. This tool wraps up the hard parts, like handling data and carrying out the training process, allowing us to focus on the big picture and achieve better outcomes with our AI endeavors.

Purpose of the Trainer: The Trainer class encapsulates the complexities of training loops, evaluation, and optimization, allowing users to focus on model performance rather than implementation details.

Tokenization Process: A function called tokenize_function is created to convert text reviews into tokens that the model can understand. It includes truncating long strings and padding shorter ones to ensure uniform input length.

Truncating: This refers to shortening longer pieces of text to fit a certain size limit.

Padding: Adding extra data to shorter texts to reach a uniform length for processing.
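
Truncating and padding can be sketched on plain lists of token ids. The helper `pad_or_truncate`, the pad id of 0, and the sample ids are assumptions made for illustration. A real tokenizer does this for you and, when truncating, also keeps its special end-of-sequence token, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]  # truncating: drop anything past the limit
    mask = [1] * len(ids)         # 1 marks a real token
    missing = max_length - len(ids)
    # padding: fill the remainder with pad_id, masked out with 0
    return ids + [pad_id] * missing, mask + [0] * missing


# Hypothetical token ids, chosen only to show the two cases.
short_ids, short_mask = pad_or_truncate([101, 1045, 2540, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
# [101, 1045, 2540, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)
# [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

The attention mask tells the model which positions hold real tokens and which are padding to ignore.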

Batches: Batches are small groups of data samples, usually of equal size, that the model looks at and learns from at each training step.

Batch Size: The number of data samples that the machine considers in one go during training.

Epochs: An epoch is one complete pass through the entire training dataset. The more epochs, the more times the model goes over the material to learn.

Dataset Splits: Dividing the dataset into parts for different uses, such as training the model and testing how well it works.
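
To make batches, batch size, and epochs concrete, the sketch below splits a list into batches and counts training steps. The helper names are hypothetical. The figure of 25,000 training examples is an assumption about the IMDb training split, and the step count assumes a single device with no gradient accumulation.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def training_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps), assuming one update per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


batches = list(iter_batches(list(range(10)), batch_size=4))
print(batches)
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Assumed: 25,000 training examples, batch size 64, 3 epochs.
steps_per_epoch, total_steps = training_steps(25_000, batch_size=64, num_epochs=3)
print(steps_per_epoch, total_steps)
# 391 1173
```

The last batch can be smaller than the rest when the dataset size is not a multiple of the batch size, which is why the step count rounds up.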

In [None]:
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


dataset = load_dataset("imdb")
tokenized_datasets = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    per_device_train_batch_size=64,
    output_dir="./results",
    learning_rate=2e-5,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
trainer.train()
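
For a feel of what a training loop does, here is a deliberately tiny sketch in plain Python. It fits a single weight to made-up data with mini-batch gradient descent. This is an illustration of the general loop structure (epochs, batches, gradient, update), not the Trainer's actual implementation, and every name and value in it is hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def train(data, num_epochs, batch_size, learning_rate):
    """Fit y = w * x by mini-batch gradient descent on mean squared error."""
    w = 0.0  # the single model parameter, starting from zero
    for _ in range(num_epochs):  # one epoch = one pass over all the data
        for batch in iter_batches(data, batch_size):
            # Gradient of the batch's mean squared error with respect to w
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            # Optimizer step: move w against the gradient
            w -= learning_rate * grad
    return w


# Made-up data that follows y = 3x exactly.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = train(data, num_epochs=20, batch_size=2, learning_rate=0.1)
print(f"learned weight: {w:.4f}")
```

The learned weight ends up very close to 3.0, the slope used to generate the data. The sketch leaves out shuffling, evaluation, and checkpointing, which a full training setup would normally include.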