# Hugging Face

## What is Hugging Face and how does it differ from other ML libraries?

Hugging Face provides:
- Pre-trained transformer models for NLP/CV/Audio
- Model Hub for sharing and discovering models
- Datasets library for ML datasets
- Tools for training and fine-tuning models
- Inference API and model deployment
- AutoML capabilities with AutoTrain
- Spaces for ML app deployment

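Key differences from other frameworks:
- Focus on transformer architectures
- A very large public collection of pre-trained models on the Hub
- Strong community and sharing features (model cards, discussions, Spaces)
- A standardized interface across models (`from_pretrained`, `Auto*` classes, pipelines)
- Simple fine-tuning workflows through the `Trainer` API
- Integrated deployment options (Inference API, Spaces)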

## Different kinds of models available:

- Multimodal models
    - Audio-Text-to-Text, Image-Text-to-Text, Visual Question Answering, Document Question Answering, Video-Text-to-Text, Any-to-Any
- Computer Vision models
    - Depth Estimation, Image Classification, Object Detection, Image Segmentation, Text-to-Image, Image-to-Text, Image-to-Image, Image-to-Video, Unconditional Image Generation, Video Classification, Text-to-Video, Zero-Shot Image Classification, Mask Generation, Zero-Shot Object Detection, Text-to-3D, Image-to-3D, Image Feature Extraction, Keypoint Detection
- Natural Language Processing models
    - Text Classification, Token Classification, Table Question Answering, Question Answering, Zero-Shot Classification, Translation, Summarization, Feature Extraction, Text Generation, Text2Text Generation, Fill-Mask, Sentence Similarity
- Audio models
    - Text-to-Speech, Text-to-Audio, Automatic Speech Recognition, Audio-to-Audio, Audio Classification, Voice Activity Detection
- Tabular models
    - Tabular Classification, Tabular Regression, Time Series Forecasting
- Reinforcement Learning models
    - Reinforcement Learning, Robotics
- Other models
    - Graph Machine Learning

## How do you install this?

`!pip install transformers datasets evaluate accelerate torch sentencepiece sacremoses -U`

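- transformers -> pre-trained models and pipelines for NLP, vision, and audio tasks
- datasets -> loading, processing, and streaming datasets
- evaluate -> evaluation metrics such as accuracy and F1
- accelerate -> running the same training code on CPU, a single GPU, or multiple devices
- torch -> PyTorch, the deep learning backend used in this notebook
- sentencepiece -> subword tokenizer required by some models (for example, the Marian translation models)
- sacremoses -> Moses-style tokenization and normalization, used by some translation tokenizers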

In [1]:
# !pip install transformers datasets evaluate accelerate torch sentencepiece sacremoses -U

## How can you use pipelines?

Pipelines are a high-level API for using pre-trained models for common NLP tasks. They are easy to use and require minimal code. You can select a task and a model, and then use the pipeline to perform the task.

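Common arguments include `model="distilgpt2"` when creating a pipeline (to choose the checkpoint) and call-time options such as `max_length=20` and `num_return_sequences=2` for text generation.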
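Conceptually, a pipeline chains three stages: preprocessing (tokenization), a model forward pass, and postprocessing (turning raw scores into labels). The toy sketch below mimics that flow with a hypothetical word lexicon in place of a neural network. The names `preprocess`, `forward`, `postprocess`, and `toy_pipeline` are illustrative and are not the `transformers` implementation.

```python
# Toy stand-in for a pipeline: NOT the transformers implementation.
POSITIVE_WORDS = {"love", "great", "good"}   # hypothetical lexicon
NEGATIVE_WORDS = {"hate", "bad", "awful"}    # hypothetical lexicon

def preprocess(text):
    """Tokenize: lowercase, split on whitespace, strip basic punctuation."""
    return [w.strip(".,!?") for w in text.lower().split()]

def forward(tokens):
    """Stand-in 'model': one raw score per class, ordered [negative, positive]."""
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    return [neg, pos]

def postprocess(scores):
    """Map the highest score to a label, as a real pipeline maps logits to labels."""
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return {"label": labels[best], "scores": scores}

def toy_pipeline(texts):
    return [postprocess(forward(preprocess(t))) for t in texts]

results = toy_pipeline(["I love Transformers!", "I hate bugs."])
print(results)
```

A real pipeline replaces `forward` with a neural network and `preprocess` with the model's own tokenizer, but the call shape (a list of texts in, a list of dictionaries out) is the same as in the cells that follow.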

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis") # distilbert-base-uncased-finetuned-sst-2-english model is used by default
print(classifier(["I love Transformers!", "I hate bugs."]))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998069405555725}, {'label': 'NEGATIVE', 'score': 0.9967179894447327}]


In [3]:
zero_shot = pipeline("zero-shot-classification")
print(zero_shot("This is a course about NLP models.", candidate_labels=["education", "technology", "sports"]))

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about NLP models.', 'labels': ['technology', 'education', 'sports'], 'scores': [0.9376302361488342, 0.05547460913658142, 0.006895212456583977]}


In [4]:
generator = pipeline("text-generation", model="distilgpt2")
print(generator("Transformers are great for", max_length=20, num_return_sequences=2))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'Transformers are great for debugging, testing and testing their performance. We highly recommend you look at this'}, {'generated_text': 'Transformers are great for their simplicity and clarity. They can take a while to be refined. I'}]


In [5]:
unmasker = pipeline("fill-mask", model='distilroberta-base')
print(unmasker("Hugging Face is a <mask> library.", top_k=2))

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.17664726078510284, 'token': 481, 'token_str': ' free', 'sequence': 'Hugging Face is a free library.'}, {'score': 0.07091275602579117, 'token': 285, 'token_str': ' public', 'sequence': 'Hugging Face is a public library.'}]


In [6]:
ner = pipeline("ner", aggregation_strategy="simple") # Named Entity Recognition; replaces the deprecated grouped_entities=True
print(ner("Hugging Face is based in New York City."))

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'ORG', 'score': 0.89075655, 'word': 'Hugging Face', 'start': 0, 'end': 12}, {'entity_group': 'LOC', 'score': 0.9991805, 'word': 'New York City', 'start': 25, 'end': 38}]


In [7]:
qa = pipeline("question-answering")
print(qa(question="Where is Hugging Face based?", context="Hugging Face is based in New York City."))

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9694607853889465, 'start': 25, 'end': 38, 'answer': 'New York City'}


In [8]:
summarizer = pipeline("summarization")
print(summarizer("Hugging Face creates tools for NLP. These tools are widely used in AI and ML applications.", min_length=6, max_length=10))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Hugging Face creates tools for N'}]


In [9]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(translator("Hugging Face est une bibliothèque populaire pour le NLP."))

[{'translation_text': 'Hugging Face is a popular library for the NLP.'}]


## How do you load and use pre-trained models?

In [10]:
from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForQuestionAnswering,
    pipeline
)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Basic tokenization and inference
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)

# Using pipelines (high-level API)
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie!")
display(result)

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
entities = ner("My name is Sarah and I live in London")
display(entities)

qa = pipeline("question-answering")
result = qa(question="Who was Jim Henson?",
           context="Jim Henson was a puppeteer")
display(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998775720596313}]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'I-PER',
  'score': 0.9982994,
  'index': 4,
  'word': 'Sarah',
  'start': 11,
  'end': 16},
 {'entity': 'I-LOC',
  'score': 0.9983972,
  'index': 9,
  'word': 'London',
  'start': 31,
  'end': 37}]

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.8786673545837402, 'start': 15, 'end': 26, 'answer': 'a puppeteer'}

In [17]:
import torch

# Task-specific model class: this checkpoint already includes a fine-tuned classification head
checkpoint = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
classifier = AutoModelForSequenceClassification.from_pretrained(checkpoint)
inputs = tokenizer("I love this movie", return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = classifier(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Display results
labels = ["Negative", "Positive"]
result = {labels[i]: float(predictions[0][i]) for i in range(len(labels))}
print(result)

{'Negative': 0.0001234115188708529, 'Positive': 0.9998766183853149}
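The probabilities above come from applying softmax to the model's raw logits. As a minimal pure-Python sketch of that step (the logit values here are hypothetical, not the model's actual outputs):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-4.2, 4.8]                          # hypothetical [negative, positive] logits
probs = softmax(logits)
labels = ["Negative", "Positive"]
print({label: round(p, 4) for label, p in zip(labels, probs)})
```

A large gap between the two logits is what produces a probability close to 1 for one class.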


In [18]:
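# Loading a base checkpoint into a task-specific class adds a new, randomly
# initialized classification head (see the warning below), so ner_model must be
# fine-tuned before its predictions are meaningful.
ner_model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=9
)
# This call still uses the `ner` pipeline created earlier (a fine-tuned checkpoint), not ner_model
entities = ner("My name is Sarah and I live in London")
display(entities)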

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'entity': 'I-PER',
  'score': 0.9982994,
  'index': 4,
  'word': 'Sarah',
  'start': 11,
  'end': 16},
 {'entity': 'I-LOC',
  'score': 0.9983972,
  'index': 9,
  'word': 'London',
  'start': 31,
  'end': 37}]

In [19]:
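# As above: the question-answering head of qa_model is newly initialized and needs fine-tuning
qa_model = AutoModelForQuestionAnswering.from_pretrained(
    'bert-base-uncased'
)
# This call still uses the `qa` pipeline created earlier (a fine-tuned checkpoint), not qa_model
result = qa(question="Who was Jim Henson?",
           context="Jim Henson was a puppeteer")
display(result)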

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.8786673545837402, 'start': 15, 'end': 26, 'answer': 'a puppeteer'}

## How do you handle tokenization?

In [21]:
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [22]:
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [23]:
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
display(tokenizer)

# Basic tokenization
tokens = tokenizer.tokenize("Hello, how are you?")
display(tokens)
input_ids = tokenizer.encode("Hello, how are you?")
display(input_ids)

# Batch tokenization
inputs = tokenizer(
    ["Hello, how are you?", "I'm fine, thanks!"],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"  # or 'tf' for TensorFlow
)
display(inputs)

# Access different components
input_ids = inputs['input_ids']
display(input_ids)
attention_mask = inputs['attention_mask']
display(attention_mask)
token_type_ids = inputs['token_type_ids']  # for some models
display(token_type_ids)

# Special tokens
cls_token_id = tokenizer.cls_token_id
sep_token_id = tokenizer.sep_token_id
pad_token_id = tokenizer.pad_token_id
print(cls_token_id, sep_token_id, pad_token_id)

# Decode tokens back to text
text = tokenizer.decode(input_ids[0])
display(text)

long_text = "This is a very long text that exceeds the model's maximum input length. " * 50

# Handle long sequences
inputs = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True
)
display(inputs)

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

['hello', ',', 'how', 'are', 'you', '?']

[101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

{'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102,    0],
        [ 101, 1045, 1005, 1049, 2986, 1010, 4283,  999,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])}

tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102,    0],
        [ 101, 1045, 1005, 1049, 2986, 1010, 4283,  999,  102]])

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]])

101 102 0


'[CLS] hello, how are you? [SEP] [PAD]'

{'input_ids': [[101, 2023, 2003, 1037, 2200, 2146, 3793, 2008, 23651, 1996, 2944, 1005, 1055, 4555, 7953, 3091, 1012, 2023, 2003, 1037, ... (output truncated)
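With `stride` and `return_overflowing_tokens=True`, a long input is split into several windows that overlap, so text near a window boundary appears in two chunks. The sketch below shows the idea in pure Python. `chunk_with_stride` is a hypothetical helper, and it ignores the special tokens (`[CLS]`, `[SEP]`) that the real tokenizer adds to each window.

```python
def chunk_with_stride(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride                # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break                             # this window reached the end of the input
        start += step
    return chunks

ids = list(range(10))                         # stand-in token ids 0..9
chunks = chunk_with_stride(ids, max_length=4, stride=2)
for chunk in chunks:
    print(chunk)
```

Each window after the first repeats the last `stride` tokens of the previous one, which helps tasks such as question answering when an answer straddles a boundary.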

## How do you implement fine-tuning?

In [None]:
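from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
import datasets

# Sentiment classification needs a model with a classification head (2 labels for IMDB)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)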

# Load dataset
dataset = datasets.load_dataset('imdb')

# Prepare data
def preprocess_function(examples):
    # Truncate only; DataCollatorWithPadding pads each batch to its longest sequence
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=512
    )

tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True
)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=DataCollatorWithPadding(tokenizer)
)

# Train model
trainer.train()

# Save model and tokenizer together so the directory can be reloaded later
trainer.save_model("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
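`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of a fixed maximum. A minimal pure-Python sketch of that idea follows. `pad_batch` is a hypothetical helper, not the library's code, and the token ids are the ones printed in the tokenization section.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],           # "Hello, how are you?"
    [101, 1045, 1005, 1049, 2986, 1010, 4283, 999, 102],      # "I'm fine, thanks!"
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch means short batches stay short, which usually saves computation compared with padding every example to 512 tokens.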

## How do you work with the datasets library?

In [None]:
from datasets import (
    load_dataset,
    Dataset,
    DatasetDict,
    Features,
    Value,
    ClassLabel
)

# Load existing dataset
dataset = load_dataset('imdb')
squad_dataset = load_dataset('squad')

# Create custom dataset
data = {
    'text': ["Hello", "World"],
    'label': [0, 1]
}

features = Features({
    'text': Value('string'),
    'label': ClassLabel(num_classes=2, names=['neg', 'pos'])
})

dataset = Dataset.from_dict(data, features=features)

# Dataset operations
# Filter
filtered = dataset.filter(lambda x: len(x['text']) > 100)

# Map
def uppercase(example):
    return {'text': example['text'].upper()}

# Without batched=True the function receives one row at a time, so .upper() is called on a string
dataset = dataset.map(uppercase)

# Shuffle and select
shuffled = dataset.shuffle(seed=42)
subset = dataset.select(range(min(100, len(dataset))))  # select() fails on out-of-range indices

# Split dataset
train_test = dataset.train_test_split(test_size=0.2)

# Save and load
dataset.save_to_disk('path/to/dataset')
loaded = Dataset.load_from_disk('path/to/dataset')

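# Stream large datasets ('large_dataset' is a placeholder for a real dataset ID)
streamed_dataset = load_dataset('large_dataset', split='train', streaming=True)
for example in streamed_dataset:
    process_example(example)  # placeholder for your own per-example logic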
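A function passed to `map` sees different inputs depending on `batched`: one row (a dict of values) by default, or a dict of lists when `batched=True`. The sketch below illustrates the two shapes on plain dictionaries, without the `datasets` library. `uppercase_batched` is a hypothetical name used only for this illustration.

```python
def uppercase(example):
    # Non-batched: `example` is one row, e.g. {"text": "Hello", "label": 0}
    return {"text": example["text"].upper()}

def uppercase_batched(batch):
    # Batched: `batch` maps each column to a list, e.g. {"text": ["Hello", "World"], ...}
    return {"text": [t.upper() for t in batch["text"]]}

row = {"text": "Hello", "label": 0}
batch = {"text": ["Hello", "World"], "label": [0, 1]}

print(uppercase(row))
print(uppercase_batched(batch))
```

Calling `.upper()` directly on `batch["text"]` would fail because it is a list, which is why the batched version iterates over the column.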

## How do you handle model evaluation?

In [None]:
import torch
from transformers import Trainer, TrainerCallback
from evaluate import load

# Load evaluation metric
metric = load("accuracy")
f1_metric = load("f1")

# Custom evaluation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(-1)
    return metric.compute(predictions=predictions, references=labels)

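# Add to trainer (model, training_args, and tokenized_dataset come from the fine-tuning section)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics
)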

# Custom evaluation loop (eval_dataloader is assumed to be a torch DataLoader whose batches include "labels")
model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
        predictions = outputs.logits.argmax(-1)
        metric.add_batch(
            predictions=predictions,
            references=batch["labels"]
        )

final_score = metric.compute()

# Custom callback for logging
class EvaluationCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics, **kwargs):
        print(f"Step {state.global_step}: {metrics}")

trainer.add_callback(EvaluationCallback())
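To make the numbers returned by the `accuracy` and `f1` metrics concrete, here is a minimal pure-Python sketch of both for binary labels. The helper names and the sample predictions are made up for illustration, and this is not the `evaluate` implementation.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def binary_f1(predictions, references, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

preds = [1, 0, 1, 1, 0]   # made-up model predictions
refs = [1, 0, 0, 1, 1]    # made-up reference labels

acc = accuracy(preds, refs)
f1 = binary_f1(preds, refs)
print(acc, f1)
```

Here three of five predictions are correct (accuracy 0.6), and with two true positives, one false positive, and one false negative, precision and recall are both 2/3, so F1 is also 2/3.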

## How do you implement custom training loops?

In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import get_scheduler

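# Pick a device and move the model to it (train_dataloader is assumed to yield dicts of tensors)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Initialize optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)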

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# Training loop
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        print(f"Loss: {loss.item()}")
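The `"linear"` schedule scales the base learning rate by a factor that rises from 0 to 1 during warmup and then decays linearly to 0. The sketch below computes that factor in pure Python. `linear_schedule_factor` is a hypothetical helper written for illustration, and the step counts are example values.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)                 # linear warmup
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return max(0.0, remaining / decay_steps)                   # linear decay to zero

base_lr = 5e-5
for step in (0, 50, 100, 550, 1000):
    factor = linear_schedule_factor(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, base_lr * factor)
```

With `num_warmup_steps=0`, as in the loop above, the factor starts at 1 and only the decay phase applies.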

## How do you use Accelerate for distributed training?

In [None]:
import torch
from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import set_seed

# Initialize accelerator, then seed every process for reproducibility
accelerator = Accelerator()
set_seed(42)
logger = get_logger(__name__)

# Prepare model, dataloaders, optimizer
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

# Training loop
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        optimizer.zero_grad()

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)

# Save model
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
accelerator.save(unwrapped_model.state_dict(), "./model_state.pt")
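In multi-device runs, the usual setup is data parallelism: each process computes gradients on its own shard of the batch and the gradients are averaged before the optimizer step. The toy sketch below shows, for a one-parameter model in pure Python, that averaging shard gradients matches the full-batch gradient when shards are equal in size. It is a conceptual illustration only and uses no Accelerate code. `grad_mse` and the data are made up.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
num_processes = 2

# Each "process" sees an equal-sized, interleaved shard of the batch
shards = [(xs[i::num_processes], ys[i::num_processes]) for i in range(num_processes)]
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]

averaged = sum(local_grads) / num_processes   # what gradient averaging across processes produces
full = grad_mse(w, xs, ys)                    # single-process gradient on the whole batch
print(local_grads, averaged, full)
```

This equivalence is why the same training loop can run on one device or several with only the `accelerator.prepare` and `accelerator.backward` changes shown above.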

## How do you deploy models?

In [None]:
# Local deployment with pipeline
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="fine_tuned_model",
    tokenizer="fine_tuned_model"
)

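# Export to ONNX (transformers.onnx is the legacy exporter; newer releases recommend Optimum)
from pathlib import Path
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.onnx import FeaturesManager, export

tokenizer = AutoTokenizer.from_pretrained("fine_tuned_model")
model = AutoModelForSequenceClassification.from_pretrained("fine_tuned_model")

# Look up the ONNX config class for this architecture and task
_, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(
    model, feature="sequence-classification"
)
onnx_config = onnx_config_cls(model.config)
export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("model.onnx")
)

# Run with ONNX Runtime (inputs must be NumPy arrays keyed by the exported input names)
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
onnx_inputs = dict(tokenizer("I love this movie", return_tensors="np"))
outputs = session.run(None, onnx_inputs)

# Export to TorchScript (torchscript=True makes the model return tuples, which tracing requires)
import torch

ts_model = AutoModelForSequenceClassification.from_pretrained(
    "fine_tuned_model", torchscript=True
).eval()
encoded = tokenizer("I love this movie", return_tensors="pt")
traced_model = torch.jit.trace(
    ts_model,
    (encoded["input_ids"], encoded["attention_mask"])
)
torch.jit.save(traced_model, "model.pt")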

# Gradio interface
import gradio as gr

def predict(text):
    results = classifier(text)
    return results[0]['label'], results[0]['score']

iface = gr.Interface(
    fn=predict,
    inputs="text",
    outputs=["label", "number"]
)
iface.launch()

## How do you use the Hub?

In [None]:
from huggingface_hub import (
    HfApi,
    Repository,
    create_repo,
    upload_file
)

# Initialize API
api = HfApi()

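# Create repository (requires authentication, e.g. via `huggingface-cli login`)
create_repo("username/my-model")

# Clone repository (Repository is git-based and deprecated in recent huggingface_hub releases)
repo = Repository(local_dir="path/to/local/repo", clone_from="username/my-model")
repo.git_pull()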

# Push to hub
api.upload_file(
    path_or_fileobj="model.pt",
    path_in_repo="model.pt",
    repo_id="username/my-model"
)

# Upload model
model.push_to_hub("username/my-model")
tokenizer.push_to_hub("username/my-model")

# Download from hub
api.snapshot_download(
    repo_id="username/my-model",
    revision="main"
)

# Model cards
from huggingface_hub import ModelCard

card_content = """
---
language: en
tags:
- sentiment-analysis
- bert
---
# Model Card for my-model
"""

card = ModelCard(card_content)
card.push_to_hub("username/my-model")

## How do you use AutoTrain?

In [None]:
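# NOTE: illustrative pseudocode. The `AutoTrain` class and the methods used below do not
# appear to match the published `autotrain-advanced` package; check the AutoTrain docs
# for the current CLI and Python interface before adapting this.
from autotrain.cli import AutoTrain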

# Initialize AutoTrain project
project = AutoTrain(
    project_name="my_project",
    task="text_classification",
    model_name="bert-base-uncased",
    training_data="path/to/train.csv",
    validation_data="path/to/valid.csv"
)

# Configure training
project.config(
    num_epochs=3,
    learning_rate=2e-5,
    batch_size=16,
    max_seq_length=128
)

# Start training
project.train()

# Get best model
best_model = project.get_best_model()

# Deploy model
project.deploy()

## How do you implement text generation?

In [None]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Basic generation
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Advanced generation parameters
outputs = model.generate(
    **inputs,
    max_length=100,
    num_beams=5,
    no_repeat_ngram_size=2,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    do_sample=True,
    num_return_sequences=3
)

# Using pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
results = generator(
    "Once upon a time",
    max_length=100,
    num_return_sequences=3
)
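The sampling parameters used above (`temperature`, `top_k`, `top_p`) all reshape the next-token distribution before a token is drawn. The pure-Python sketch below shows one simplified way they can combine. `filter_candidates` and `sample_next_token` are hypothetical helpers, the logits are made up, and the library's own processing may differ in detail.

```python
import math
import random

def filter_candidates(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return (token_ids, probabilities) that survive temperature, top-k, and top-p."""
    scaled = [x / temperature for x in logits]        # temperature < 1 sharpens the distribution
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely, then keep at most top_k of them
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)              # renormalize over the survivors

    # Keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]

def sample_next_token(logits, rng, **kwargs):
    """Draw one token id from the filtered candidates, weighted by probability."""
    kept, weights = filter_candidates(logits, **kwargs)
    return rng.choices(kept, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]                        # made-up scores for a 4-token vocabulary
top2, _ = filter_candidates(logits, top_k=2)
nucleus, _ = filter_candidates(logits, top_p=0.5)
token = sample_next_token(logits, random.Random(0), temperature=0.7, top_k=3, top_p=0.95)
print(top2, nucleus, token)
```

Lower `top_k` or `top_p` values restrict sampling to the most likely tokens, trading diversity for coherence.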