The machine learning landscape is constantly evolving—but rarely has a shift felt as sweeping as the rise of large language models. LLMs haven’t just advanced the field; they’ve dominated it, capturing the imagination of both researchers and the general public in unprecedented ways.

Traditionally, NLP models were narrow in scope—trained from scratch for specific tasks like sentiment analysis, translation, or named entity recognition. Each task required its own dataset, its own architecture, its own evaluation pipeline.

That changed with the introduction of transformer architectures and large-scale pretraining. Together, they ushered in the era of LLMs.

Instead of building task-specific models, researchers began training enormous models—sometimes with hundreds of billions of parameters—on massive corpora of text. These models learned general-purpose language understanding, which could later be fine-tuned or prompted to perform a wide variety of downstream tasks.

It wasn’t just an architectural shift—it was a shift in mindset.

But do LLMs actually understand language? Not in the way humans do. They rely on statistical patterns in their training data and are prone to hallucinations and bias. They also require significant computational resources, and text must first be converted into a form the model can learn from.

## Pipeline Magic
Let's start with some Hugging Face magic. We'll begin with pipeline(), since it assumes minimal prior knowledge.

In [None]:
from transformers import pipeline


In [None]:
classifier = pipeline("sentiment-analysis")
classifier("I love to hate this course")

Since we didn't supply a model, the pipeline fell back to its default checkpoint, distilbert/distilbert-base-uncased-finetuned-sst-2-english. Under the hood, it tokenized the sentence, passed the tokens to the model, and turned the model's output into a label and a score.

In [None]:
translator = pipeline("text2text-generation", model="google/flan-t5-base")
translator("Translate English to German: I love apples.")

Let's say we want to classify text into categories the model was never explicitly trained on. This is called zero-shot classification.

In [None]:
clf = pipeline("zero-shot-classification")
clf("This movie was so inspiring!", candidate_labels=["sports", "politics", "entertainment"])


In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

In [None]:
fill = pipeline("fill-mask")
fill("The capital of France is <mask>.", top_k=2)

## A Bit More About Models
We’re about to dig into what happens behind the scenes in the deceptively simple pipeline() function. But first, let’s take a moment to understand the different types of models it might use. There are three main architectural variants of Transformer models: encoder, decoder, and encoder-decoder.

## Encoder Models
Also called autoencoding models.

The attention layers in encoder models have access to the entire input sequence at once (bidirectional attention).

These models are useful when global understanding of the text is needed.

They are typically trained to predict masked words, using both left and right context.

Examples: bert-base-uncased, roberta-base

Use cases: Text classification, sentence similarity, token classification (e.g., named entity recognition)
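
To make the masked-word training objective concrete, here is a minimal sketch of how a single training example could be built. The helper `mask_tokens` and the hand-picked position are illustrative assumptions, not part of any library; real pretraining chooses the masked positions at random.

```python
# Toy illustration of building a masked-language-modeling training example.
MASK = "[MASK]"  # BERT-style mask token


def mask_tokens(tokens, positions):
    """Replace the tokens at `positions` with MASK.

    Returns (masked_input, labels). `labels` holds the original token at each
    masked position and None everywhere else (positions the loss would skip).
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in positions:
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


tokens = ["the", "capital", "of", "france", "is", "paris"]
masked, labels = mask_tokens(tokens, positions=[3])
print(masked)  # ['the', 'capital', 'of', '[MASK]', 'is', 'paris']
print(labels)  # [None, None, None, 'france', None, None]
```

The model sees the words on both sides of the masked slot, which is exactly what bidirectional attention allows.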

## Decoder Models
Also called autoregressive models.

These models have access only to the tokens that come before the current one (unidirectional or causal attention).

They are trained using causal language modeling, where the task is to predict the next token given the previous ones.

Examples: gpt2, gpt-neo-125M

Use cases: Text generation, code completion, chatbots
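
The difference between bidirectional and causal attention can be pictured as a mask over token positions. The sketch below is a plain-Python illustration; the function name `build_attention_mask` is made up for this example and does not come from the transformers library.

```python
def build_attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix of 0s and 1s.

    Entry [i][j] is 1 when position i is allowed to attend to position j.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


full_mask = build_attention_mask(4, causal=False)   # encoder-style: every token sees every token
causal_mask = build_attention_mask(4, causal=True)  # decoder-style: only the current and earlier tokens

for row in causal_mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each row of the causal mask reveals one more token than the previous row, which is what lets the model be trained to predict the next token without seeing it.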

## Encoder-Decoder Models
Also called sequence-to-sequence models.

The encoder reads the entire input, producing a hidden representation of it.

The decoder then generates output one token at a time, using both the encoder’s output and its own previously generated tokens.

These models are often trained by corrupting the input and asking the model to reconstruct or translate it.

Examples: t5-small, facebook/bart-large-cnn

Use cases: Summarization, translation, question answering, and any task that involves generating one text based on another
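
The encode-once, decode-step-by-step loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `encode`, `next_token`, and the tiny word-for-word dictionary exist only to show the control flow, where a real model would use learned weights and hidden-state vectors.

```python
EOS = "<eos>"  # marker that ends generation

# Hypothetical word-for-word lookup standing in for learned weights.
TOY_DICTIONARY = {"i": "ich", "love": "liebe", "you": "dich"}


def encode(source_tokens):
    # Stand-in for the encoder: a real one returns hidden-state vectors.
    return tuple(source_tokens)


def next_token(encoder_output, generated):
    """Choose the next output token from the encoder output and the tokens generated so far."""
    position = len(generated)
    if position >= len(encoder_output):
        return EOS
    return TOY_DICTIONARY[encoder_output[position]]


def generate(source_tokens, max_new_tokens=10):
    encoder_output = encode(source_tokens)  # the input is read once, in full
    generated = []
    for _ in range(max_new_tokens):         # the output is produced one token at a time
        token = next_token(encoder_output, generated)
        if token == EOS:
            break
        generated.append(token)
    return generated


print(generate(["i", "love", "you"]))  # ['ich', 'liebe', 'dich']
```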

## What Makes These Models Effective?
Attention.
The attention mechanism allows the model to focus on the most relevant parts of the input while making predictions. Whether it's understanding the context of a word in a sentence or generating the next token in a sequence, attention plays a central role in making Transformers so powerful.
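
As a rough illustration of "focusing on the most relevant parts", here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification: real Transformers first project the inputs into queries, keys, and values with learned matrices and use several attention heads, whereas this example reuses the raw input vectors for all three roles.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # how much attention each position receives
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up token vectors
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and a token gives more weight to tokens whose vectors point in a similar direction to its own.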


## Tokenization
✨ The First Magical Part of pipeline():
The first thing that makes the pipeline() function so magical is the tokenizer.

Models don’t understand words — they understand numbers. So to perform any computational task, we first need to convert text into numbers. That’s where tokenization comes in.


🔹 What happens during tokenization?
We split the sentence into smaller parts, called tokens.

Then, we map those tokens to unique integers using the model’s vocabulary.

There’s also some additional processing, like adding special tokens to mark the beginning and end of a sentence (for example, [CLS] and [SEP] in BERT).
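
The three steps above can be sketched with a toy tokenizer. The vocabulary and the function `toy_tokenize` are invented for illustration; real tokenizers use vocabularies with tens of thousands of subword entries and more careful splitting rules.

```python
# Hypothetical miniature vocabulary, for illustration only.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "apples": 4, "are": 5, "sour": 6, ".": 7}


def toy_tokenize(text):
    # Split: lowercase, detach periods, then split on whitespace.
    tokens = text.lower().replace(".", " .").split()
    # Special tokens: mark the beginning and end of the sequence.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Map to integers: look each token up, falling back to [UNK] for unknown words.
    ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    return tokens, ids


toy_tokens, toy_ids = toy_tokenize("Apples are sour.")
print(toy_tokens)  # ['[CLS]', 'apples', 'are', 'sour', '.', '[SEP]']
print(toy_ids)     # [2, 4, 5, 6, 7, 3]
```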


For tokenization, Hugging Face provides the AutoTokenizer class and its from_pretrained() method. When no checkpoint is specified, pipeline() falls back to a default one, as we saw above.

In [None]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


The above code downloads the files associated with the tokenizer and caches them. Let's take a toy sentence:

In [None]:
sentence = ['apples are sour. ']
tokens = tokenizer(sentence)

In [None]:
tokens

Say we want to tokenize multiple sentences at once.

In [None]:
sentences = ['apples are sour. ', 'We have promises to keep.']
tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
print(tokens)

In Feedforward Neural Networks (FFNNs), the input size must be fixed. However, in natural language processing, input sentences can vary in length. To handle this variability, we typically use padding and truncation.

For models like RNNs or Transformers, we pad shorter sentences with special tokens (like [PAD]) so that all sentences in a batch have the same length. We may also truncate longer sentences to a specified maximum length to fit within model constraints or memory limits.

Most modern language models define their own maximum input length (e.g., 512 tokens for BERT). Therefore, we often rely on the tokenizer to handle padding and truncation automatically, ensuring inputs are formatted correctly for the model. Finally, return_tensors specifies the type of tensor you want to get back; "pt" requests PyTorch tensors.
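
Padding and truncation can be sketched by hand on lists of token ids. The function `pad_and_truncate` and the padding id of 0 are assumptions made for this example. It is also simpler than a real tokenizer, which keeps the closing special token when it truncates; this sketch just cuts off the tail.

```python
PAD_ID = 0  # assumed id of the padding token


def pad_and_truncate(batch, max_length):
    """Cut every id list to at most max_length, then pad to the longest remaining list."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return input_ids, attention_mask


batch = [[2, 4, 5, 3], [2, 8, 9, 10, 11, 12, 3]]  # made-up token ids
input_ids, attention_mask = pad_and_truncate(batch, max_length=6)
print(input_ids)       # [[2, 4, 5, 3, 0, 0], [2, 8, 9, 10, 11, 12]]
print(attention_mask)  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

The attention mask plays the same role as the attention_mask field in the tokenizer output printed earlier: it tells the model which positions are padding.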

### AutoModel
No need to dive deep yet; we’ll explore the internals of the tokenizer later. For now, let’s take a look at the next component of the pipeline magic: the model.

We’ll start with AutoModel. It gives us access to the base model, which returns raw hidden states (not final predictions). It’s useful when you want to build custom heads on top or just understand what’s going on inside.

Let’s see it in action.

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [None]:
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)

Transformer models are large, and so are their outputs. The output has shape (batch_size, sequence_length, hidden_size). Our batch size is 2 because we passed two sentences. The sequence length is 8, the length of the longer tokenized sentence, to which the shorter one was padded. And 768 is the hidden size, the dimension of the vector the model produces for each token; more about that later.

What can we do with these outputs? Say we turn this into a classification task: we want to label sentences as positive or negative. For that, we need a classification head on top of the base model. As you guessed, we'll use Hugging Face magic.

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**tokens)

In [None]:
print(outputs.logits)
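
To get a feel for how hidden states become logits, here is a deliberately simplified sketch of a classification head. It assumes a single linear layer applied to the first token's vector; the head inside the actual checkpoint may contain additional layers, and the weights and hidden states below are made-up numbers.

```python
def classification_head(hidden_states, weight, bias):
    """Map (batch, seq_len, hidden) hidden states to (batch, num_labels) logits.

    Simplified: uses only the first token's vector and one linear layer.
    """
    logits = []
    for sequence in hidden_states:
        first = sequence[0]  # vector at the first ([CLS]) position
        logits.append([
            sum(w * h for w, h in zip(row, first)) + b
            for row, b in zip(weight, bias)
        ])
    return logits


# Made-up example: batch of 2, sequence length 3, hidden size 2.
hidden_states = [
    [[1.0, 2.0], [0.0, 0.0], [0.5, 0.5]],
    [[-1.0, 0.5], [1.0, 1.0], [0.0, 2.0]],
]
weight = [[1.0, 0.0], [0.0, 1.0]]  # shape (num_labels, hidden)
bias = [0.0, 0.5]

toy_logits = classification_head(hidden_states, weight, bias)
print(toy_logits)  # [[1.0, 2.5], [-1.0, 1.0]]
```

The sequence dimension disappears: each sentence ends up with one score per label, matching the (batch_size, num_labels) shape of outputs.logits.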

We have logits, the raw, unnormalized scores the model outputs for each label. To interpret them as probabilities, we pass them through a softmax.

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
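
For intuition, the same softmax can be written out in plain Python. The logits below are made up, and the `id2label` dictionary is a hypothetical mapping used only to show how the largest probability is turned into a label.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift by the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # hypothetical mapping for this example
toy_logits = [2.0, -1.0]                   # made-up scores for one sentence

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print([round(p, 3) for p in probs], id2label[best])
```

Softmax preserves the ordering of the logits, so the label with the highest logit is always the label with the highest probability.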

In [None]:
model.config.id2label


Matching the higher probability in each row against id2label, the first sentence is negative and the second one is positive.