### 📚 Key Concepts in Pretraining

- **Why Pretrain?**
  - Pretraining allows models to learn from massive unlabeled data.
  - It boosts performance across diverse NLP tasks with minimal task-specific supervision.
  - Pretraining data is abundant (no labelling needed) and varied, which is what drives the gains across diverse tasks. Finetuning data, by contrast, is carefully curated and labelled; producing it is expensive, so far less of it is available.
  - Mathematically, finetuning starts gradient descent from the weights learnt during pretraining rather than from a random initialization. The starting point matters a great deal for first-order gradient descent methods, because on a non-convex loss it influences which minimum they converge to.
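
The point about starting weights can be illustrated with a toy one-dimensional analogy. This is only a sketch: the function `loss`, the learning rate, and the two starting points are made up for illustration and have nothing to do with a real model. The curve has a deeper minimum on the left and a shallower one on the right, and plain gradient descent ends up in whichever basin it starts in.

```python
def loss(x):
    # Hypothetical non-convex "loss" with two minima (illustrative only).
    return x**4 - 3 * x**2 + x


def grad(x):
    # Analytic derivative of loss(x).
    return 4 * x**3 - 6 * x + 1


def descend(x0, lr=0.01, steps=1000):
    """Plain first-order gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Stand-in for a "pretrained" start that lies in the basin of the deeper minimum.
good_start = descend(-2.0)
# Stand-in for an unlucky random start in the basin of the shallower minimum.
poor_start = descend(2.0)

print(round(good_start, 3), round(loss(good_start), 3))
print(round(poor_start, 3), round(loss(poor_start), 3))
```

Both runs use the same optimizer and the same number of steps; only the initialization differs, and the final loss differs with it.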

- **Subword Modeling**
  - Fixed word vocabularies are limiting: novel words become `<UNK>`.
  - Subword tokenization (e.g., Byte-Pair Encoding, WordPiece) helps handle rare and misspelled words.
  - Words are split into meaningful subword units, improving generalization.
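
To make the subword idea concrete, here is a minimal sketch of Byte-Pair Encoding learned from word counts. It is a simplified toy, not the tokenizer of any particular library: the corpus, the `</w>` end-of-word marker, and the function names `learn_bpe`, `merge_pair`, and `segment` are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges by repeatedly fusing the most frequent adjacent pair."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical counts).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
print(segment("lowest", merges))  # "lowest" never appears in the corpus
```

Because segmentation falls back to smaller pieces (ultimately single characters), an unseen word such as `lowest` is still represented instead of collapsing to `<UNK>`.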

- **Three Pretraining Architectures**
  - **Encoder-only** (e.g., BERT): Good for understanding tasks like classification.
  - **Decoder-only** (e.g., GPT): Suited for generation tasks like text completion.
  - **Encoder-Decoder** (e.g., T5, BART): Ideal for translation and summarization.

- **Distributional Semantics**
  - “You shall know a word by the company it keeps”: this idea motivates learning word meaning from context.
  - This principle underlies models like word2vec and modern transformers.
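
A count-based sketch shows the principle at its simplest: represent each word by the words that appear near it, then compare those context vectors. This is not how word2vec or transformers are trained; the tiny corpus, the window size, and the function names are assumptions chosen only to illustrate the idea.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell on the market",
]
vecs = cooccurrence_vectors(sentences)
print(cosine(vecs["cat"], vecs["dog"]))     # shared contexts, high similarity
print(cosine(vecs["cat"], vecs["stocks"]))  # no shared contexts
```

Words that keep similar company (`cat`, `dog`) end up with similar vectors, while words used in unrelated contexts do not.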

- **In-Context Learning**
  - Large models can learn tasks from examples in the input without updating weights.
  - This is a key capability of models like GPT-3 and beyond.
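
In practice, in-context learning amounts to packing labelled examples into the prompt itself. The sketch below only builds such a few-shot prompt as a string; it does not call any model, and the `Review:` / `Sentiment:` template and the function name `build_few_shot_prompt` are hypothetical choices for illustration.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (text, label) demonstrations followed by an unlabelled query."""
    blocks = [instruction] if instruction else []
    for text, label in examples:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The final block is left open so the model completes the label.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


demos = [
    ("A wonderful, heartfelt film.", "positive"),
    ("Two hours I will never get back.", "negative"),
]
prompt = build_few_shot_prompt(
    demos,
    "The plot dragged but the acting was superb.",
    instruction="Classify the sentiment of each review.",
)
print(prompt)
```

The model's weights stay fixed; the "training data" for the task lives entirely in the input text.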

- **Scaling Laws**
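  - Larger models, more data, and more compute generally yield better performance (lower loss), provided none of the three becomes the bottleneck.
  - Pretraining benefits scale fairly predictably with resources; empirical studies report that loss falls roughly as a power law in each.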
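
One commonly reported way to write this down is a power law in each resource, valid when the other two are not the limiting factor. The symbols are not from these notes and are introduced here only for illustration: $N$ is the number of parameters, $D$ the dataset size in tokens, $C$ the training compute, and $X_c$, $\alpha_X$ are constants fitted empirically for each resource.

$$
L(X) \approx \left(\frac{X_c}{X}\right)^{\alpha_X}, \qquad X \in \{N, D, C\}
$$

Read this as a rough empirical trend rather than an exact law: doubling a resource multiplies the loss by about $2^{-\alpha_X}$, so gains are steady but diminishing.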

---
<br>
<br>

The **pretraining task** is crucial because it shapes what each Transformer architecture learns to do best. Here are the key details:



### 🧠 Transformer Pretraining Architectures & Their Tasks

#### 1. **Encoder-only** (e.g., BERT)
- **Architecture**: Only the encoder stack is used.
- **Pretraining Task**:  
  - **Masked Language Modeling (MLM)**: Encoder self-attention is bidirectional, so every position can attend to the tokens on both sides of it. Causal language modeling is therefore not a sensible objective for an encoder-only architecture: the model could simply look at the next token it is asked to predict. Instead, random tokens are masked, and the model predicts them from the full surrounding context.
- **Best For**:  
  - Representation learning and embedding generation, plus understanding tasks such as classification, sentiment analysis, and named entity recognition (NER).
- **Tip**: If you plan to use BERT, consider RoBERTa instead: it has essentially the same architecture but a better training recipe.
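
The masking step can be sketched in a few lines. This is a simplified illustration, not BERT's exact procedure: BERT selects about 15% of tokens and replaces only most of them with `[MASK]` (the rest become a random token or stay unchanged), whereas this sketch always substitutes `[MASK]`. The function name `mask_tokens` and the use of `None` for "ignored by the loss" are assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels): masked input tokens and the originals to predict."""
    rng = random.Random(seed)  # seeded for reproducibility
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # position does not contribute to the loss
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3)
print(inputs)
print(labels)
```

The encoder sees the whole (partially masked) sequence at once, so each prediction can use context from both the left and the right.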

#### 2. **Decoder-only** (e.g., GPT)
- **Architecture**: Only the decoder stack is used.
- **Pretraining Task**:  
  - **Causal Language Modeling (CLM)**: Predict the next token given previous ones (left-to-right).
- **Best For**:  
  - Generation tasks like story writing, code generation, dialogue, and text completion.
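
Two ingredients make next-token prediction work: the targets are the inputs shifted by one position, and a causal mask stops each position from attending to later ones. The sketch below shows both with plain lists; the function names are hypothetical, and real implementations operate on tensors of token IDs.

```python
def next_token_pairs(tokens):
    """Inputs are all tokens but the last; targets are the same sequence shifted left by one."""
    return tokens[:-1], tokens[1:]


def causal_mask(n):
    """n x n lower-triangular mask: row i may attend to columns 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "cat", "sat", "down"]
inputs, targets = next_token_pairs(tokens)
for position, target in enumerate(targets):
    # At each position the model sees only the prefix up to that position.
    print(inputs[: position + 1], "->", target)

for row in causal_mask(len(inputs)):
    print(row)
```

With the mask in place, every position in a sequence yields a training example in a single forward pass, without the model ever seeing the token it has to predict.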

#### 3. **Encoder-Decoder** (e.g., T5, BART)
- **Architecture**: Full Transformer with both encoder and decoder.
- **Pretraining Tasks**:
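  - **T5**: Span corruption. Contiguous spans of the input are replaced with sentinel tokens, and the decoder generates the missing spans. Every task, in pretraining and downstream (e.g., translation, summarization, question answering), is cast in a text-to-text format.
  - **BART**: Denoising autoencoder. The input is corrupted and the model is trained to reconstruct the original text.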
- **Best For**:  
  - Sequence-to-sequence tasks like translation, summarization, and question answering.
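
One corrupt-and-reconstruct objective for encoder-decoder models is span corruption, which the T5 paper uses for pretraining. Below is a minimal sketch under stated assumptions: the spans are passed in by hand rather than sampled at random, the function name `span_corrupt` is hypothetical, and the sentinel naming `<extra_id_N>` follows the convention of the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Build (encoder_input, decoder_target) for span corruption.

    `spans` is a sorted list of non-overlapping (start, end) index pairs,
    with `end` exclusive. Each span is replaced by one sentinel token in the
    input; the target lists each sentinel followed by the tokens it hides.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # stand-in for the dropped span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # what the decoder must generate
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
enc_in, dec_out = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(enc_in))
print(" ".join(dec_out))
```

The encoder reads the corrupted sentence with full bidirectional context, and the decoder generates only the missing pieces, which keeps targets short.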

---

### 🔑 Takeaway

- The **pretraining task** determines what the model learns:
  - MLM → deep understanding of context.
  - CLM → fluent generation.
  - Denoising / text-to-text → flexible input-output mapping.

Pretraining benefits:

- Unlabelled data is plentiful and easy to obtain.
- The data is varied, which makes the model more general.
- Careful data selection is still needed so that garbage does not go in.

### 🧠 Finetuning Insights

When we finetune a pretrained model on custom data, we do not want it to lose the knowledge it gained during pretraining. Updating all of the parameters by default risks overwriting much of the valuable information learnt in that expensive training step. We would therefore prefer to tune only a small number of parameters, changing the model just enough to incorporate expertise from our data.
This is where parameter-efficient finetuning (PEFT) comes in.
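
As one concrete example of the idea, LoRA (a popular PEFT method, chosen here for illustration; these notes do not name a specific technique) freezes the pretrained weight matrix `W` and learns a low-rank update `B @ A` alongside it. The sketch below uses plain Python lists instead of a tensor library, and the function names are hypothetical.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x); only A and B would be trained."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full finetuning of W versus LoRA factors A and B."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


W = [[1.0, 2.0], [3.0, 4.0]]  # pretrained weights (frozen)
A = [[0.5, -0.5]]             # rank 1: shape (1, d_in)
B = [[0.0], [0.0]]            # zero-initialized, so the model starts unchanged
print(lora_forward(W, A, B, [1.0, 1.0]))
print(lora_param_counts(d_out=768, d_in=768, rank=8))
```

Because `B` starts at zero, the adapted model initially behaves exactly like the pretrained one, and the number of trainable parameters grows with the rank rather than with the full size of `W`.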