#### Introduction to Fine-Tuning
- Fine-tuning a large language model (LLM) is the process of retraining a pre-trained model (or a foundation model) for a specific task, format, dataset or tone.
- Full Fine-Tuning: This approach updates all the model's parameters, offering high adaptability but requiring substantial computational resources and posing risks like catastrophic forgetting, where the model loses previously learned information.
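- Parameter-Efficient Fine-Tuning (PEFT): Techniques like Low-Rank Adaptation (LoRA) train only a small number of parameters (a subset of the existing weights or a small set of newly added ones) while the rest of the model stays frozen. This significantly reduces computational demands while largely maintaining performance, making fine-tuning more accessible.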

#### LoRA: Low Rank Adaptation
- LoRA (Low-Rank Adaptation) fine-tunes LLMs by adding trainable low-rank matrices to frozen model weights, drastically reducing training cost.
- Instead of updating all parameters, LoRA learns only a small number of parameters, making it efficient and memory-friendly.
- Compared to full fine tuning, LoRA is much more efficient in terms of training time and hardware resource utilization.
- In a Transformer, full fine-tuning would update the weight matrices of every block:
  - Attention layer: Wq, Wk, Wv, Wo; feed-forward layer: W1, W2. Call them collectively W. LoRA keeps W frozen.
- PEFT (Parameter-Efficient Fine-Tuning): in LoRA-style PEFT, the whole pre-trained model is frozen and a small set of trainable parameters is added to it.
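- The rank r of the added matrices is a hyperparameter. It can take any value, but 8 and 16 are common; a small rank keeps the number of trainable parameters low, which also helps avoid overfitting.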
- Ex: ![image.png](attachment:dfc3a208-467f-4afe-a388-b074261c0429.png)
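
The LoRA idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training implementation: the layer size (512), the rank `r = 8`, the scaling factor `alpha`, and the names `W`, `A`, `B`, and `lora_forward` are all assumptions chosen for the example. The frozen weight `W` is left untouched, and only the two small matrices `A` and `B` would receive gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one 512x512 weight matrix, LoRA rank 8, scaling alpha 16.
d_out, d_in, r, alpha = 512, 512, 8, 16

W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init: the update starts at 0


def lora_forward(x):
    """Apply W plus the low-rank update (alpha / r) * B @ A without forming B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))               # a batch of 4 input vectors
y = lora_forward(x)

full_params = W.size                         # parameters full fine-tuning would update
lora_params = A.size + B.size                # parameters LoRA trains instead
print(f"full: {full_params}, LoRA: {lora_params}, ratio: {lora_params / full_params:.4f}")
```

Because `B` starts at zero, the adapted layer initially produces exactly the same output as the frozen layer, and the trainable parameter count grows with `r * (d_in + d_out)` rather than `d_in * d_out`.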

#### Quantization Basics
- Quantization reduces model size and inference cost by converting high-precision weights (e.g., float32) to lower-precision formats like int8 or 4-bit representations, enabling faster and more efficient deployment on resource-constrained devices.
- NF4 (NormalFloat4) is a 4-bit quantization method optimized for neural network weights that typically follow a normal distribution. It maps values into 16 non-uniform bins centered around zero, providing higher precision near zero where most weights lie.
- NF4 is widely used in large language model compression strategies like QLoRA, allowing models to be fine-tuned and deployed with significantly reduced memory and compute requirements without substantial accuracy loss.
- NF4 is supported by the bitsandbytes library, which integrates with Hugging Face Transformers, so 4-bit quantized models can be loaded and run in PyTorch pipelines.
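
To make the two ideas above concrete, here is a minimal NumPy sketch under stated assumptions. The first half shows plain symmetric int8 quantization with one scale per tensor. The second half builds a simplified 16-level codebook from evenly spaced quantiles of a standard normal distribution, which illustrates why non-uniform levels are denser near zero. It is not the actual NF4 table used by bitsandbytes (which, among other differences, includes an exact zero level), and all function names here are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def quantize_absmax_int8(w):
    """Symmetric int8 quantization: integers in [-127, 127] plus one float scale."""
    scale = max(float(np.abs(w).max()), 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale


def normal_quantile_codebook(bits=4):
    """2**bits levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n   # midpoints avoid the infinite 0 and 1 quantiles
    levels = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return levels / np.abs(levels).max()


def quantize_codebook(w, levels):
    """Store the index of the nearest codebook level plus one absmax scale."""
    scale = max(float(np.abs(w).max()), 1e-12)
    idx = np.abs(w[..., None] / scale - levels).argmin(axis=-1).astype(np.uint8)
    return idx, scale


def dequantize_codebook(idx, scale, levels):
    return levels[idx] * scale


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1024).astype(np.float32)  # toy "weights"

q8, s8 = quantize_absmax_int8(w)
levels = normal_quantile_codebook(bits=4)
idx4, s4 = quantize_codebook(w, levels)

err8 = np.abs(w - dequantize_int8(q8, s8)).mean()
err4 = np.abs(w - dequantize_codebook(idx4, s4, levels)).mean()
gaps = np.diff(levels)  # spacing between neighbouring 4-bit levels
print(f"int8 mean error: {err8:.6f}, 4-bit mean error: {err4:.6f}")
print(f"gap at centre: {gaps[7]:.3f}, gap at edge: {gaps[0]:.3f}")
```

The 4-bit version stores only an index from 0 to 15 per weight, so it trades some reconstruction error for a much smaller footprint, and the level spacing is tighter around zero than at the tails.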

#### MCQ
- 