# Let us start by fetching data

### go to notebook 1-GetData.ipynb

# Let us now create the datasets and vocab for pre-training

### go to notebook 2-PreparePre-TrainingDataset.ipynb

# GPT (Generative Pre-Trained Transformer)

<img src="../assets/module_3/gpt_architecture.png" >

source: https://commons.wikimedia.org/wiki/File:Full_GPT_architecture.png

## Go to model_exercise1.py
- Go through the modules
- Complete the tasks marked as "to-do"

# Pre-train the character LM
### Use notebook 3-TrainCharLM.ipynb

## Generate names
### Use notebook 4-Sample.ipynb

## What did the model learn?
### Explore that in 5-Embeddings.ipynb

# Prompt the model
- to generate names starting from 'a'
- to generate names starting from 'kr'

## Can you prompt the model to generate names ending with 'a'?

# Fine-tune the pre-trained model

## task = endswith_a

### Use notebook 6-PrepareFine-TuningDatasets.ipynb to generate all fine-tuning datasets

### Then proceed to the trainer notebook

## Great! We fine-tuned the pre-trained model on the task endswith_a

### Try generating a few such names from the model

## But is it efficient?

# Let us fine-tune for a classification task

## Gender Classification: Given a name, predict whether the gender is male or female

We already have a generative pre-trained model that was trained on names.

It understands the structure of names, which should help us classify by "transferring" what it has learned.

<img src="../assets/module_3/task-agnostic_sft.png">
source=https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

## Go to model_exercise2.py

### Make the appropriate changes to the model and use the trainer to fine-tune

## Train the full network, or train only the classifier head
- Compare the number of trainable parameters
- Compare the performance

# Adapter: a PEFT technique

## What is an adapter?

## A lightweight, task-specific module plugged in at the end of each layer
<img src="../assets/module_3/adapter.png">
source=https://aclanthology.org/D19-1165.pdf

- Layer Norm is applied to the input of the adapter. This makes the adapter pluggable anywhere, irrespective of variations in the activation distributions/patterns
    - This parametrized normalization layer allows the module to learn the activation pattern of the layer it is injected into
    
- The inner dimension of these two projections is the only knob to tune
    - This allows us to adjust the capacity of the adapter module easily, depending on the complexity of the target task
    
- Residual connection allows the adapter to represent a no-op if necessary
- Multiple task-specific adapters can be trained simultaneously
- During inference, an adapter can be plugged in to turn the model into the task-specific function

## Go to model_exercise3.py

### Make the appropriate changes to the model and use the trainer to fine-tune

## Experiment with different adapter sizes for our gender classification task
- Compare the parameter efficiency and the performance against the full model tuning

# Low Rank Adaptation (LoRA)

## Inspiration
Learned over-parametrized models in fact reside on a low intrinsic dimension

Hence, the change in weights during model adaptation also has a low “intrinsic rank”, which leads to LoRA

<img src="../assets/module_3/lora.png">
source=https://arxiv.org/pdf/2106.09685.pdf

### The authors show that for GPT-3 175B, a very low rank of 1 or 2 suffices even when the full rank is as high as 12,288, making LoRA both storage and compute efficient
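
To put the rank claim in numbers, here is a rough count for a single square weight matrix. It assumes the LoRA update is factored as B (d_out x r) times A (r x d_in); the helper name `lora_param_counts` is ours, for illustration only.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out        # full fine-tuning updates every entry of W
    lora = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
    return full, lora

# 12,288 is the full rank quoted above for GPT-3 175B; r = 2 is the quoted low rank.
full, lora = lora_param_counts(12288, 12288, r=2)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

For r = 2 this gives 49,152 trainable values instead of 150,994,944 for that one matrix, a 3072x reduction.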


### Simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction

# LoRA advantages

### Adapters add more layers (even though they are small), which increases latency
### Large neural networks rely on hardware parallelism to keep the latency low, and adapter layers have to be processed sequentially. This makes a difference in the online inference setting where the batch size is typically as small as one.


<img src="../assets/module_3/lora_vs_adapter.png">

## No additional inference latency

### When deployed in production, we can explicitly compute and store W = W0 + BA and perform inference as usual

### When we need to switch to another downstream task, we can recover W0 by subtracting BA and then adding a different B′A′, a quick operation with very little memory overhead

<img src="../assets/module_3/lora_equation.png">
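
The merge and task-switch steps can be sketched with tiny integer matrices. This is a toy illustration using nested lists; the helper names are ours, and a real implementation would use tensors.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_scaled(a, b, sign=1):
    """Return a + sign * b, element-wise."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

w0 = [[1, 2, 3], [4, 5, 6]]  # frozen pre-trained weights, shape (2 x 3)
b_task1 = [[1], [2]]         # B, shape (2 x r) with r = 1
a_task1 = [[1, 0, -1]]       # A, shape (r x 3)

delta_task1 = matmul(b_task1, a_task1)
merged = add_scaled(w0, delta_task1)  # W = W0 + BA, used for inference as usual

# Switch tasks: subtract BA to recover W0, then add a different B'A'.
recovered = add_scaled(merged, delta_task1, sign=-1)
b_task2, a_task2 = [[0], [1]], [[2, 2, 2]]
merged_task2 = add_scaled(recovered, matmul(b_task2, a_task2))

print(merged, recovered, merged_task2)
```

With integers the recovery of W0 is exact. With floating-point weights, subtracting BA recovers W0 only up to rounding error, so keeping a copy of W0 is an alternative if memory allows.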

## A Generalization of Full Fine-tuning

### When applying LoRA to all weight matrices and training all biases, we roughly recover the expressiveness of full fine-tuning by setting the LoRA rank r to the rank of the pre-trained weight matrices

### As we increase the number of trainable parameters, training LoRA roughly converges to training the original model, while adapter-based methods converge to an MLP

# LoRA for GPT

### There are two linear layers in Attention
- one projects the input into the query, key and value matrices
- the other projects the output of attention

### There are two linear layers in MLP

### LoRA can be applied to any weight matrices to achieve parameter efficiency

## Let us apply it to the weight matrices of these four linear layers

### The paper uses a random Gaussian initialization for A and zero for B, so ∆W = BA is zero at the beginning of training

### They then scale ∆Wx by α/r, where α is a constant in r. Let us use α=1
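
A minimal sketch of one LoRA-augmented linear layer with this initialization and scaling. The class name `LoRALinear`, the unit standard deviation of the Gaussian, and the list-based math are our assumptions for illustration; they are not the contents of model_exercise4.py.

```python
import random

class LoRALinear:
    """Frozen weight W0 plus a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, w0, r, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(w0), len(w0[0])
        self.w0 = w0  # frozen, shape (d_out x d_in)
        # A: random Gaussian init (std of 1.0 is an arbitrary choice here)
        self.a = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(r)]
        # B: zero init, so BA = 0 at the start of training
        self.b = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    @staticmethod
    def _matvec(m, v):
        return [sum(w * x for w, x in zip(row, v)) for row in m]

    def __call__(self, x):
        base = self._matvec(self.w0, x)                         # W0 x
        update = self._matvec(self.b, self._matvec(self.a, x))  # B (A x)
        return [h + self.scale * u for h, u in zip(base, update)]

w0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
layer = LoRALinear(w0, r=2, alpha=1.0)
x = [1.0, 0.5, -1.0]

frozen_out = LoRALinear._matvec(w0, x)
lora_out = layer(x)
print(lora_out == frozen_out)  # identical before any training step
```

Because B starts at zero, the layer initially reproduces the pre-trained model exactly, and training only has to learn the deviation from it.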

## Go to model_exercise4.py

### Make the appropriate changes to the model and use the trainer to fine-tune

## Play Around

Compare parameter efficiency

Compare performance

You can compare the latency too

# Prompt Tuning

## In-context learning via hard prompts

<img src="../assets/module_3/icl.png">

### They are called hard prompts because they use actual tokens, which are not differentiable (think of “hard” as static or set in stone)

### The problem here is that the output of our LLM is highly dependent on how we construct the prompt.

### Hard prompts are compute-hungry: much of the context window (block size) is spent on them, leaving much less for the actual input

## What if we could learn our prompts?

## Prompt tuning and prefix tuning solve this with soft prompts: vectors attached to the input embeddings that we train while keeping the pre-trained LM frozen

### They are called soft prompts because they are differentiable, continuous vectors that we can optimize/learn

### In prompt tuning we add soft prompts only at the input, and in prefix tuning we prefix soft prompts at each decoder layer
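
A minimal sketch of the input-side step in prompt tuning: the soft prompt vectors are simply prepended to the (frozen) token embeddings before the sequence enters the model. All sizes and names here (`num_prompt_vectors`, `embed_with_soft_prompt`) are hypothetical and chosen for illustration.

```python
import random

rng = random.Random(0)
d_model, vocab_size, num_prompt_vectors = 4, 10, 3  # hypothetical sizes

def random_vectors(count, dim):
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(count)]

token_embedding = random_vectors(vocab_size, d_model)      # frozen, from pre-training
soft_prompt = random_vectors(num_prompt_vectors, d_model)  # the only trainable parameters

def embed_with_soft_prompt(token_ids):
    """Look up token embeddings and prepend the soft prompt vectors."""
    token_vectors = [token_embedding[t] for t in token_ids]
    return soft_prompt + token_vectors

inputs = embed_with_soft_prompt([1, 5, 2])
print(len(inputs))  # 3 soft prompt vectors + 3 token vectors
```

The sequence grows by `num_prompt_vectors` positions, so soft prompts also consume part of the context, just far fewer positions than a typical hard prompt.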

### Today we will explore prompt-tuning

## Prompt Tuning: a parameter efficient fine-tuning technique

<img src="../assets/module_3/prompt_tuning.png">

source=https://aclanthology.org/2021.emnlp-main.243.pdf

## Let us experiment with prompt tuning using the task: last_char

### Here, given a word, the model must output its last character

### We take up only one task here, although, as the original paper shows, one can perform multi-task learning with different prefix vectors for each task

### The config parameter "prompt_vocab_size" specifies the total number of prefix vectors to be learned

- In our case it is the same as the number of prefix vectors per task, because we have only one task

### We treat this as language generation task and hence use the default "lm_head"

## That's it! Tune your prompts

### Go to model_exercise5.py and complete the tasks

### train, compare, and sample from your model

## Scaling of Prompt-tuning

<img src="../assets/module_3/prompt_tuning_scaling.png">

# Instruction Tuning

## Can models learn to follow instructions?


## Can models perform tasks described purely via instructions?


## Can models perform unseen tasks?

## What is instruction tuning?

<img src="../assets/module_3/instruction_tuning_overview.png">

source=https://arxiv.org/pdf/2109.01652.pdf

## How does instruction tuning compare against other methods?

<img src="../assets/module_3/instruction_tuning_comparison.png">

## Using various ways to demonstrate a task

<img src="../assets/module_3/instruction_tuning_templates.png">

## Scaling Laws

<img src="../assets/module_3/instruction_tuning_scaling.png">

### For smaller models, all of their capacity is probably used up learning to do the tasks shown

### Larger models use some of their capacity to perform the tasks. But they have some remaining capacity to learn to follow instructions too. This helps them generalize to new tasks

## Prompt-tuning works better on instruction-tuned models

<img src="../assets/module_3/instruction_tuning_prompt_tuning.png">

## More so in low-resource settings

# Role of Instructions

## A possibility: performance gains are due to multi-task fine-tuning and not due to instructions

<img src="../assets/module_3/role_of_instructions.png">

# Let us Instruction-tune our pre-trained model 

### Startswith
- St0{tanu}
- Sgu0{gurleen}

### Endswith
- Edu0{paddu}
- Ene0{arianne}
- En0{parthiban}

### Gender Classification
- G{priyadarsini0=F}
- G{naran0=M}

### Indian Classification ('I' is Indian, 'O' is Other)
- C{shafeeque0=I}
- C{jullian0=O}
- C{vineeta0=I}

### 'S','E','G','C' are the instructions

## Note that classification tasks are also converted to language modeling/generation tasks
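
The example strings listed above follow two patterns, which can be reproduced with a small formatter. The function names are ours; the '0' is copied from the examples, where it appears to act as a separator between the instruction part and the name or label.

```python
def generation_example(instruction, fragment, name):
    """Startswith / Endswith: instruction letter, the fragment, '0', then the name in braces."""
    return f"{instruction}{fragment}0{{{name}}}"

def classification_example(instruction, name, label):
    """Gender / Indian classification: the name, '0', '=' and the label, all inside braces."""
    return f"{instruction}{{{name}0={label}}}"

examples = [
    generation_example("S", "t", "tanu"),
    generation_example("E", "du", "paddu"),
    classification_example("G", "naran", "M"),
    classification_example("C", "jullian", "O"),
]
print(examples)
# ['St0{tanu}', 'Edu0{paddu}', 'G{naran0=M}', 'C{jullian0=O}']
```

Since every task becomes a plain string, the same language-modeling objective and the same `lm_head` can be used for all four tasks.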

## Go ahead! Try it out!

### Also use generate method to sample from model

# Ok enough with the baby models !!

## I want to play with some large models

# Let us inject some common sense into the GPT-2 models

https://inklab.usc.edu/CommonGen/

Run the script 7-GetCommongenData.py

## Go to 8-SampleGPT.ipynb
- Try out different GPT models
- Try out different prompts

## Go to model_exercise6.py and complete the tasks
- Use trainer 9-Fine-TuneGPT.ipynb to prompt-tune
- Sample from tuned model

# Challenges and Limitations

## Gargantuan datasets
- The size of modern pre-training datasets renders it impractical for any individual to read or conduct quality assessments on the encompassed documents thoroughly
- Datasets need to be cleaned and quality checked thoroughly before going into pre-training
    - Cannot afford to train multiple times

- Near Duplicates
    - degrade performance
    - sometimes lead to memorization by models
- Benchmark Data Contamination
    - leads to inflation of performance metrics
    - when ChatGPT was asked to generate instances of academic benchmarks, it turned out to have memorized some test splits too
- Personally Identifiable Information
    - models typically need to memorize training data for strong performance, so PII in the corpus can be memorized too

- Pre-training Domain Mixtures
    - a mixture of domains is beneficial for transferability and generalizability
        - But how much data from each source is needed for better downstream performance?
    - heterogeneity of data sources is reported to matter more than data quality
        - this motivates smaller yet more diverse pre-training datasets
- Fine-Tuning Task Mixtures
    - How to balance tasks in a multi-task fine-tuning setup?
    - negative task transfer, where learning multiple tasks at once hinders the learning of some specific tasks
    - catastrophic forgetting of previous tasks when learning new tasks
    - right proportion depends on the downstream end goals

# Tokenizer Reliance
- Tokenizers are generally trained in an unsupervised manner
- more tokens (sub-word tokenization) lead to higher computational cost
    - but sub-word tokens are necessary to handle rare and out-of-vocabulary words

## Tight coupling between pre-training data and tokenizer
- discrepancies between the data that a tokenizer and a model have been trained on can lead to glitch tokens
    - cause unexpected model behavior as their corresponding embeddings are essentially untrained
    - needs re-training of tokenizer when the pre-training corpus is changed
- Different languages require different numbers of tokens to express the same meaning
    - interoperability becomes a challenge in multi-lingual settings
    - this also tends to be unfair, because prompts in some languages consume a larger part of the context than in others
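
A crude way to see the imbalance is to count UTF-8 bytes, which is what a byte-level tokenizer would produce before any merges are applied. This is only a proxy (real sub-word tokenizers merge frequent byte sequences), and the word pair is our own example.

```python
def byte_token_count(text):
    """Number of tokens if every UTF-8 byte were its own token."""
    return len(text.encode("utf-8"))

greetings = {
    "English": "hello",
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",  # "namaste" written in Devanagari
}
counts = {language: byte_token_count(word) for language, word in greetings.items()}
print(counts)
# {'English': 5, 'Hindi': 18}
```

Each Devanagari code point takes three bytes in UTF-8, so the Hindi greeting starts out more than three times as long as the English one at the byte level.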

# Power Law of Scaling
- model performance scales as a power law with model size, dataset size, and the amount of compute used for training
- Unsustainable
- state-of-the-art results are essentially “bought” by spending massive computational resources

- when selecting a model size, the computation resources for later usage (inference) should be considered, not just the one-time training costs
    - it has been shown that many LLMs are undertrained
- train a smaller model more intensively upfront to offset larger inference costs in the future
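
The training-versus-inference trade-off can be illustrated with the commonly used rules of thumb of roughly 6 FLOPs per parameter per training token and roughly 2 FLOPs per parameter per generated token. These are approximations, and all model sizes and token counts below are hypothetical; the sketch says nothing about whether the two models reach the same quality.

```python
def training_flops(num_params, num_tokens):
    """Rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * num_params * num_tokens

def inference_flops(num_params, num_tokens):
    """Rule of thumb: about 2 FLOPs per parameter per generated token."""
    return 2 * num_params * num_tokens

def lifetime_flops(num_params, train_tokens, served_tokens):
    return (training_flops(num_params, train_tokens)
            + inference_flops(num_params, served_tokens))

# Hypothetical numbers, chosen only to illustrate the trade-off.
served = 10 * 10**12                                      # tokens served after deployment
large = lifetime_flops(10 * 10**9, 200 * 10**9, served)   # 10B params, 200B training tokens
small = lifetime_flops(2 * 10**9, 2000 * 10**9, served)   # 2B params, 2T training tokens

print(f"large: {large:.2e} FLOPs, small: {small:.2e} FLOPs")
```

In this toy setting the smaller model costs twice as much to train but is cheaper over its lifetime, because inference cost scales with the parameter count for every served token.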

# Pre-Training Objectives
- choice of PTO heavily influences the model’s data efficiency during pre-training
- which in turn can reduce the number of iterations required
- the pre-training objective is typically a function of
    - architecture
    - input/target construction
    - masking strategy

<img src="../assets/module_3/masking.png">

source=https://arxiv.org/pdf/2307.10169.pdf

# Parallelism Strategies
- divide-and-conquer strategy
- model parallelism
    - waiting times
    - underutilized resources
- pipeline parallelism
    - combine with data parallelism
    - data divided into minibatches

# Fine-Tuning Overhead
- Fine-tuning entire LLMs requires the same amount of memory as pre-training
- When adapting an LLM via full-model fine-tuning, an individual copy of the model must be stored (consuming data storage) and loaded (expending memory allocation, etc.) for each task
- Parameter-efficient fine-tuning of LLMs still requires computing full forward/backward passes throughout the whole network
- Fine-tuning an LLM, even with PEFT methods, still requires full gradient computation


# High Inference Latency
- LLM inference latencies remain high because of low parallelizability and large memory footprints
- Quantization helps a great deal
- Mixture of Experts
    - a set of experts (modules), each with unique weights
    - a router (or gating) network, which determines which expert module processes an input
- Cascading
    - refers to the idea of employing differently-sized models for different queries
- Decoding Strategies
    - can greatly impact the computational cost of performing inference
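
The Mixture of Experts item above can be sketched as top-1 routing: a router scores the experts, and only the best-scoring expert processes the input. This is a toy illustration with made-up experts and router weights, assuming a softmax gate; production systems often route to more than one expert and add load-balancing terms.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_top1(x, router_weights, experts):
    """Route x to the single expert with the highest gate value."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    gates = softmax(scores)
    chosen = max(range(len(gates)), key=lambda i: gates[i])
    expert_out = experts[chosen](x)  # only the chosen expert runs
    return chosen, [gates[chosen] * v for v in expert_out]

# Two toy experts, each standing in for a module with its own weights.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-v for v in x],
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row of router weights per expert

chosen, y = moe_top1([3.0, 1.0], router_weights, experts)
print(chosen, y)
```

The latency benefit comes from the skipped experts: total parameters grow with the number of experts, but the compute per input stays close to that of a single expert.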

# Limited Context Length
- Limited context lengths are a barrier to handling long inputs well, which applications like novel or textbook writing and summarization require
- Length Generalization
    - models see short sequences during training but should generalize to longer ones during inference
- Positional Embeddings play a big role in constraining or generalizing to different lengths
    - ALiBi
    - Learned vs Fixed Embeddings
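
As an example of a fixed scheme aimed at length generalization, ALiBi adds a linear penalty to the attention scores instead of adding positional embeddings to the input. The sketch below assumes the geometric slope schedule used for power-of-two head counts and folds the causal mask into the same matrix; function names are ours.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: the geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores: -slope * distance, with future positions masked."""
    return [[-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

The bias depends only on the distance between query and key positions, so the same function can be evaluated for sequence lengths that were never seen in training.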

# Prompt Brittleness
- Models are very sensitive to syntax of prompts
    - length
    - blanks
    - ordering of examples
- Even semantics
    - wording
    - selection of examples
    - instructions

<img src="../assets/module_3/prompt_brittleness.png">

# Hallucinations
- How to measure hallucinations?
- we can distinguish between intrinsic and extrinsic hallucinations
    - intrinsic: the generated text logically contradicts the source content
    - extrinsic: we cannot verify the output correctness from the provided source
- Retrieval Augmentation
    - mitigates hallucinations by grounding the model's input in external knowledge

# Misaligned Behavior
- Harmful/abusive/toxic/biased content
- Not aligned to user's query

- Instruction Tuning
- Using Human Feedback during pre-training/fine-tuning

# Outdated Knowledge

- Model Editing
    - bug-fixing
        - locate the bug
        - apply the update
    - meta-learning
        - uses external model to update the weights
- Retrieval Augmentation

## Papers presenting novel LLMs often lack controlled experiments, likely due to the prohibitive costs of training enough models

## Parallelism strategies designed to distribute the training process across many accelerators are typically non-deterministic, rendering LLM training irreproducible