# Parameter Efficient Fine-tuning (PEFT)

- In full fine-tuning, every parameter of the LLM is updated, which produces a complete new copy of the model. PEFT updates only a small set of parameters.

- Depending on the technique, either a small subset of the original parameters is updated, or the original parameters are frozen and a small number of new parameters or layers are added and trained.

- PEFT can often be performed on a single GPU, whereas full fine-tuning needs memory for the gradients and optimizer states of every parameter, much as pre-training does.

- PEFT is less prone to catastrophic forgetting, because most or all of the original LLM parameters are left unchanged.

- PEFT saves memory and is very flexible: instead of storing a full new copy of the model for each fine-tuned task, only the small set of trained parameters is stored.

## Tradeoffs of PEFT 

- Parameter Efficiency 

- Model performance (ideally close to full fine-tuning) and inference cost

- Training speed (fewer trainable parameters can shorten training)

- Memory Efficiency

## Techniques of PEFT

There are three main classes of Parameter Efficient Fine-tuning (PEFT) techniques:

### 1. Selective 

Only a selected subset of the original LLM parameters is updated during fine-tuning; all remaining parameters stay frozen.
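As a minimal sketch of the selective idea, the snippet below describes a toy model as a plain dictionary of parameter names and sizes, marks a subset as trainable, and reports the trainable fraction. The parameter names, the sizes, and the rule "train only the bias terms" are illustrative assumptions, not taken from any specific model or library.

```python
# Toy "model": parameter name -> number of scalar values (hypothetical sizes).
toy_model = {
    "layer0.attention.weight": 4096,
    "layer0.attention.bias": 64,
    "layer1.ffn.weight": 8192,
    "layer1.ffn.bias": 128,
}


def select_trainable(param_sizes, keyword):
    """Return {name: is_trainable}; only names containing `keyword` are trained."""
    return {name: keyword in name for name in param_sizes}


def trainable_fraction(param_sizes, trainable_flags):
    """Fraction of all scalar parameters that would receive updates."""
    total = sum(param_sizes.values())
    trainable = sum(
        size for name, size in param_sizes.items() if trainable_flags[name]
    )
    return trainable / total


flags = select_trainable(toy_model, keyword="bias")  # assumed selection rule
fraction = trainable_fraction(toy_model, flags)
print(f"trainable: {fraction:.2%} of parameters")
```

In a real framework the same decision is usually expressed by switching gradient computation on or off per parameter; the dictionary of flags here only stands in for that mechanism.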

### 2. Additive

New trainable components are added to the LLM while the model's original parameters stay frozen. There are two common ways to do this:

- Adapters - add new trainable layers to the model architecture.

- Soft Prompts - add trainable parameters to the input prompt. This is also called Prompt Tuning (not to be confused with Prompt Engineering).
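The notes above only say that an adapter "adds a new layer". As a hedged illustration, the sketch below assumes one common adapter design: a small bottleneck (project down, apply a non-linearity, project back up) whose output is added to the layer's hidden vector. The sizes, the helper names `matvec` and `adapter`, and the zero initialization of the up-projection are assumptions made for this example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapter(hidden, w_down, w_up):
    """Bottleneck adapter: hidden + W_up * relu(W_down * hidden)."""
    bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


hidden_size, bottleneck_size = 8, 2  # hypothetical sizes
rng = random.Random(0)

# New trainable weights; the frozen LLM weights are not touched at all.
w_down = [
    [rng.gauss(0.0, 0.1) for _ in range(hidden_size)]
    for _ in range(bottleneck_size)
]
# Zero initialization (an assumption): the adapter starts as an identity map.
w_up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

hidden = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
output = adapter(hidden, w_down, w_up)
adapter_params = 2 * hidden_size * bottleneck_size
print(output == hidden, adapter_params)
```

Because the bottleneck is much narrower than the hidden size, the adapter adds only `2 * hidden_size * bottleneck_size` trainable values per insertion point in this sketch.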

### 3. Reparameterization

Re-parameterize the model weights during training using low-rank matrices. LoRA (Low-Rank Adaptation) is the most widely used technique of this kind.

## LoRA (Low-Rank Adaptation of LLMs) - Reparameterization Technique

- Aims to reduce the number of parameters that need to be updated during fine-tuning, making the process more parameter-efficient.

- During fine-tuning, the original weight matrices are frozen. For each adapted weight matrix, LoRA adds a pair of much smaller trainable matrices whose product has the same shape as the original matrix but a low rank. Only these two small matrices are updated, so the number of trained parameters is much smaller.

- For inference, the product of the two low-rank matrices is added to the frozen original weights. This lets the model adapt to the new task, often with performance close to full fine-tuning, while only a fraction of the parameters is trained.

- Because the pre-trained weights stay frozen, the model can adapt to new tasks while preserving the knowledge learned during pre-training, which helps avoid catastrophic forgetting.

- This is particularly useful for large language models (LLMs) where the number of parameters can be in the billions, making full fine-tuning computationally expensive and time-consuming.
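The points above can be sketched in a few lines of plain Python. Here `W` is a frozen weight matrix of shape d x k, and `B` (d x r) and `A` (r x k) are the two small trainable matrices; the effective weight is `W` plus the product of `B` and `A`. The tiny shapes, the zero initialization of `B`, and the helper names are assumptions chosen for illustration, and the 512 x 64 matrix with rank 8 is just an example shape for the parameter count.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_param_counts(d, k, r):
    """Parameters in a full d x k weight versus its rank-r LoRA factors."""
    return d * k, r * (d + k)


rng = random.Random(0)
d, k, r = 6, 4, 2  # tiny hypothetical shapes so the example runs instantly

W = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]   # frozen, d x k
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r
A = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k

# Effective weight used in the forward pass; W itself is never modified.
W_effective = add(W, matmul(B, A))

full, lora = lora_param_counts(d=512, k=64, r=8)
print(W_effective == W, full, lora)
```

With `B` starting at zero, the adapted model initially behaves exactly like the pre-trained one, and training only moves `A` and `B`. For the 512 x 64 example, the two factors hold 4,608 values instead of 32,768, roughly an 86% reduction in trained parameters for that matrix.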

## Prompt Tuning - Additive Technique

- LoRA is an efficient way to adapt a model without re-training every one of its parameters.

- Prompt Tuning adds a small set of trainable vectors to the input of the existing LLM without changing the model's weights at all.

### Is it related to Prompt Engineering? 

- The name resembles "Prompt Engineering", but the technique is completely different. Prompt Engineering means manually refining the text of the prompt, for example by adding one or a few worked examples (one-shot or few-shot inference).

- Prompt Engineering has drawbacks: writing and refining prompts takes a lot of manual effort, and the examples used for one-shot or few-shot inference consume the limited context window.
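For contrast with Prompt Tuning, here is a minimal sketch of what few-shot Prompt Engineering looks like in practice: worked examples are written by hand and concatenated in front of the new input. The sentiment task, the example texts, and the `Review:` / `Sentiment:` template are all hypothetical.

```python
def build_few_shot_prompt(examples, query):
    """Concatenate worked examples, then the new input, into one prompt string."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


examples = [
    ("I loved this movie.", "Positive"),
    ("The plot was dull.", "Negative"),
]
prompt = build_few_shot_prompt(examples, "The acting was superb.")
print(prompt)
```

Every added example lengthens the prompt, which is how few-shot inference runs into the context window limit mentioned above.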

### What is Prompt Tuning? 

- Prompt Tuning prepends a set of additional trainable tokens, called Soft Prompts, to the input prompt.

- During supervised training, the values of these trainable tokens are optimized for the task.

- Soft prompts are virtual tokens: each one is a vector of the same length as the embedding vectors of the regular input tokens.

- Unlike regular tokens, soft prompts are not tied to fixed words in the vocabulary. They can take any value in the embedding space, and supervised training determines those values.

- In full fine-tuning, all the model weights are updated. In Prompt Tuning, the original LLM weights are left untouched, and only the embedding vectors of the soft prompts are updated to optimize the model's performance.

- We can train a different set of soft prompts for each task and swap them at inference time: to switch tasks, prepend the input prompt with the soft prompt trained for that task.

- Soft prompts (trainable tokens) take up very little space on disk, which makes Prompt Tuning an efficient and flexible fine-tuning strategy.
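The mechanics described above can be sketched as follows. Each task gets its own small list of trainable vectors, and switching tasks at inference only changes which list is prepended to the input embeddings. The embedding size, the number of virtual tokens, the task names, and the helper names are hypothetical, and the frozen LLM itself is left out of the sketch.

```python
import random

embedding_size = 4       # hypothetical; real models use much larger vectors
num_virtual_tokens = 3   # hypothetical soft prompt length
rng = random.Random(0)


def new_soft_prompt():
    """Create trainable virtual-token vectors, one per virtual token."""
    return [
        [rng.gauss(0.0, 0.02) for _ in range(embedding_size)]
        for _ in range(num_virtual_tokens)
    ]


# One soft prompt per task; the LLM weights are shared and stay frozen.
soft_prompts = {"summarize": new_soft_prompt(), "classify": new_soft_prompt()}


def prepend_soft_prompt(task, input_embeddings):
    """Return the sequence fed to the frozen LLM: soft prompt + input tokens."""
    return soft_prompts[task] + input_embeddings


# Stand-in for the embeddings of a 5-token input prompt.
input_embeddings = [[0.0] * embedding_size for _ in range(5)]
sequence = prepend_soft_prompt("summarize", input_embeddings)
trainable_values = num_virtual_tokens * embedding_size
print(len(sequence), trainable_values)
```

Only `num_virtual_tokens * embedding_size` values are trained and stored per task in this sketch, which is why soft prompts are so small on disk.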