<a href="https://colab.research.google.com/github/rksab/NLP/blob/main/Hugging_face_Course_Notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The machine learning landscape is constantly evolving—but rarely has a shift felt as sweeping as the rise of large language models. LLMs haven’t just advanced the field; they’ve dominated it, capturing the imagination of both researchers and the general public in unprecedented ways.

Traditionally, NLP models were narrow in scope—trained from scratch for specific tasks like sentiment analysis, translation, or named entity recognition. Each task required its own dataset, its own architecture, its own evaluation pipeline.

That changed with the introduction of transformer architectures and large-scale pretraining. Together, they ushered in the era of LLMs.

Instead of building task-specific models, researchers began training enormous models—sometimes with hundreds of billions of parameters—on massive corpora of text. These models learned general-purpose language understanding, which could later be fine-tuned or prompted to perform a wide variety of downstream tasks.

It wasn’t just an architectural shift—it was a shift in mindset.

But do LLMs actually understand language? Not in the way humans do. They rely on statistical patterns in their training data and are prone to hallucinations and bias. They also require significant computational resources, and text must first be converted into a form (tokens) that a model can learn from.

Let's start with some Hugging Face magic. We'll begin with the pipeline function, since it assumes minimal prior knowledge.

In [None]:
from transformers import pipeline


In [None]:
classifier = pipeline("sentiment-analysis")
classifier("I love to hate this course")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9996345043182373}]

Since we didn't supply a model, the pipeline fell back to its default checkpoint, distilbert/distilbert-base-uncased-finetuned-sst-2-english. Under the hood, it tokenized the sentence, passed the tokens to the model, and converted the model's output into a label and a score.
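
The last of those steps, turning the model's raw outputs (logits) into a label and a score, can be sketched without downloading anything. This is a minimal illustration, not the pipeline's actual code: the logit values are made up, and the `id2label` mapping is an assumption that mirrors the two labels a binary sentiment model reports.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label mapping for a two-class sentiment model.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits for one sentence (not real model output).
logits = [4.2, -3.7]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

The score is simply the probability of the winning class, which is why a confident prediction shows a value close to 1.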

In [None]:
translator = pipeline("text2text-generation", model="google/flan-t5-base")
translator("Translate English to German: I love apples.")

Device set to use cpu


[{'generated_text': 'Ich liebe Apples.'}]

Suppose we want to classify text without training a model on examples of our own labels. This is known as zero-shot classification.

In [None]:
clf = pipeline("zero-shot-classification")
clf("This movie was so inspiring!", candidate_labels=["sports", "politics", "entertainment"])


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


{'sequence': 'This movie was so inspiring!',
 'labels': ['entertainment', 'sports', 'politics'],
 'scores': [0.984281063079834, 0.008338303305208683, 0.007380666211247444]}
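
The default checkpoint here, facebook/bart-large-mnli, is a natural language inference (NLI) model. Roughly, each candidate label is placed into a hypothesis sentence such as "This example is entertainment.", the model scores how strongly the input entails each hypothesis, and those scores are normalized across the labels. Here is a minimal sketch of that idea. The function names and the entailment logits are hypothetical stand-ins, and the template string is assumed to match the pipeline's default.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    largest = max(entailment_logits)
    exps = [math.exp(x - largest) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


candidate_labels = ["sports", "politics", "entertainment"]
hypotheses = build_hypotheses(candidate_labels)

# Made-up entailment logits, one per hypothesis (not real model output).
fake_logits = [-1.5, -1.6, 3.3]

ranking = rank_labels(candidate_labels, fake_logits)
print(hypotheses)
print(ranking)
```

Because the labels only appear inside the hypothesis text, any set of labels can be swapped in without retraining.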

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M")
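# Note: max_length=30 does not take effect here. The pipeline's default
# max_new_tokens takes precedence (see the warning in the output), so the
# generated text runs well past 30 tokens.
generator(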
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'In this course, we will teach you how to use the new JavaScript language for data visualization. Through practice and discussion, you will explore different data visualization techniques and learn how to make beautiful, informative plots to communicate your research.\n\n### 2.2.2. Data Manipulation and Modeling\n\nData manipulation and modeling are crucial parts of data science and machine learning. In this course, we will teach you how to use the new JavaScript language for data manipulation and modeling. Through practice and discussion, you will explore different data manipulation techniques and learn how to use different models to analyze and explore your data.\n\n### 2.2.3. Exploring and Analyzing Data\n\nIn this course, we will teach you how to explore and analyze data using the new JavaScript language. Through practice and discussion, you will explore your data and learn how to summarize, visualize, and present your data in a meaningful way.\n\n### 2.2.4. Dev
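
A text-generation model produces its continuation one token at a time, feeding each prediction back in as context, and a cap on new tokens is what stops the loop. The toy sketch below shows that loop only. `NEXT_WORD` and `predict_next` are hypothetical stand-ins for a trained model, and this toy predictor looks only at the last word, whereas a real decoder conditions on the whole context.

```python
# Hypothetical next-word table standing in for a trained decoder.
NEXT_WORD = {
    "we": "will",
    "will": "teach",
    "teach": "you",
    "you": "how",
}


def predict_next(context):
    """Toy predictor: receives the full context but only uses the last word."""
    return NEXT_WORD.get(context[-1])


def generate(prompt_words, max_new_tokens):
    """Append one predicted word at a time, up to max_new_tokens words."""
    words = list(prompt_words)
    for _ in range(max_new_tokens):
        next_word = predict_next(words)
        if next_word is None:  # nothing to predict: stop early
            break
        words.append(next_word)  # the prediction becomes part of the context
    return words


generated = generate(["we"], max_new_tokens=3)
print(" ".join(generated))
```

Raising or lowering `max_new_tokens` changes only how many times the loop runs, which is why the cap that actually applies determines the length of the output.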

In [None]:
fill = pipeline("fill-mask")
fill("The capital of France is <mask>.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.27037179470062256,
  'token': 2201,
  'token_str': ' Paris',
  'sequence': 'The capital of France is Paris.'},
 {'score': 0.05588330328464508,
  'token': 12790,
  'token_str': ' Lyon',
  'sequence': 'The capital of France is Lyon.'}]

Let's dig into what the pipeline function has been abstracting away. First, a bit about models. Transformer models come in three architectural variants: encoder, decoder, and encoder-decoder models.

Encoder Models: also called auto-encoding models. The attention layers have access to the whole input, which makes these models useful when a global understanding of the text is needed. They are typically pretrained to predict a masked word, using the words that come both before and after it. Examples: bert-base-uncased, roberta-base. Useful for text classification, sentence similarity, token classification, etc.

Decoder Models: also called autoregressive models. At each position they have access only to the words that come before, and they are trained to predict the next word (causal language modeling). Examples: gpt2, gpt-neo-125M. Useful for text generation, code completion, etc.

Encoder-Decoder Models: the encoder reads the full input, and the decoder generates the output sequentially, conditioned on the encoded input. Example: google/flan-t5-base, used in the translation example above. Useful for summarization and translation.
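
One way to picture the difference between these variants is through the attention mask, which records which positions each token is allowed to look at. The sketch below builds both masks as plain lists, where 1 means "may attend" and 0 means "blocked". The function names are made up for illustration and are not part of any library. In an encoder-decoder model, the encoder side uses the full mask and the decoder side uses the causal one, while also attending to the encoder's output.

```python
def encoder_mask(n):
    """Bidirectional mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def decoder_mask(n):
    """Causal mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "capital", "of", "France"]

for name, mask in [
    ("encoder", encoder_mask(len(tokens))),
    ("decoder", decoder_mask(len(tokens))),
]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>8}: {row}")
```

In the decoder mask, the row for "capital" has ones only for "The" and "capital", which is what prevents the model from peeking at the word it is about to predict.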