## Llama 2 7B
- 28 GB storage in FP32 (32-bit precision)
- Reduces to ~4 GB if stored in 4-bit precision, in "GGUF" format
- small enough to run locally?!
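
The numbers above are just parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming a nominal 7 billion parameters and counting only the weights (no activations, no KV cache, no quantization metadata); the helper name `weight_size_gb` is made up for illustration:

```python
# Rough weight-storage estimate: parameters x bits per parameter.
# Ignores quantization metadata (scales, zero points), activations and the KV cache.
N_PARAMS = 7e9  # nominal "7B"; the real parameter count differs slightly


def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Storage needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_size_gb(N_PARAMS, bits):.1f} GB")
```

Pure 4-bit storage comes out at 3.5 GB. Real 4-bit files are usually somewhat larger because they also store per-block scaling factors, which is likely where the "~4 GB" figure comes from.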


##### Hugging Face has a Llama-2-7B GGUF model
- Performance degradation? Check out the Hugging Face Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard (see also https://huggingface.co/open-llm-leaderboard)


#### Fine-tuning a quantized model

- benefits of fine-tuning a quantized model:
    - recover the accuracy lost to quantization (a)
    - tailor your model for specific use-cases and applications (b)
    

#### Fine-tune with quantization-aware training (QAT)

- Fine-tune the model in such a way that its quantized version will perform optimally.
    - Not compatible with Post-Training Quantization (PTQ) techniques.
    - The linear quantization method you learned in this course is an example of Post-Training Quantization.
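
To make the PTQ vs. QAT difference concrete, here is a minimal NumPy sketch (function names are made up for illustration). `linear_quantize` is symmetric per-tensor linear quantization, the kind of step PTQ applies once after training. `fake_quantize` quantizes and immediately dequantizes; QAT, as it is commonly implemented, puts an operation like this in the forward pass during fine-tuning so the model learns weights that survive the rounding. The backward pass (usually a straight-through estimator) is not shown.

```python
import numpy as np


def linear_quantize(x: np.ndarray, n_bits: int = 8):
    """Symmetric per-tensor linear quantization (assumes n_bits <= 8 and x not all zeros)."""
    q_max = 2 ** (n_bits - 1) - 1              # 127 for 8 bits
    scale = np.abs(x).max() / q_max            # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -q_max, q_max).astype(np.int8)
    return q, scale


def fake_quantize(x: np.ndarray, n_bits: int = 8) -> np.ndarray:
    """Quantize, then dequantize: the output is float again but carries
    the rounding error that a real low-bit tensor would have."""
    q, scale = linear_quantize(x, n_bits)
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)   # stand-in for a weight matrix

w_fq = fake_quantize(w)
max_err = np.abs(w - w_fq).max()
print("max abs rounding error:", max_err)
print("half a quantization step:", np.abs(w).max() / 127 / 2)
```

The rounding error stays within half a quantization step, which is the error QAT trains the model to tolerate.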
    
    
#### Parameter-efficient fine-tuning (PEFT)
- Significantly reduce the number of trainable parameters of a model while keeping performance close to that of full fine-tuning
    - PEFT + QLoRA
    - https://pytorch.org/blog/finetune-llms (good)
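
A rough sketch of why PEFT cuts the trainable parameter count, using a LoRA-style adapter (the low-rank method that QLoRA builds on). This is an illustration in plain NumPy, not the PEFT library's API: the layer size, rank `r` and scaling `alpha` are hypothetical values, and the base weight is kept in float32 here for simplicity, whereas QLoRA would hold it frozen in 4-bit.

```python
import numpy as np

d_in, d_out, r = 4096, 4096, 8   # hypothetical layer size and adapter rank
alpha = 16                        # hypothetical LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in)).astype(np.float32)           # frozen base weight
A = rng.normal(scale=0.01, size=(r, d_in)).astype(np.float32)   # trainable
B = np.zeros((d_out, r), dtype=np.float32)                      # trainable, starts at zero


def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = x W^T + (alpha / r) * x A^T B^T; only A and B would be updated in training."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


full_params = W.size
lora_params = A.size + B.size
print(f"full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (r={r}): {lora_params:,} trainable parameters "
      f"({100 * lora_params / full_params:.2f}% of the layer)")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen base layer, and training only moves the two small matrices.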



### LLM.int8() paper:

- ####  1. Vector-Wise Quantization

    - When transformers do matrix multiplication (a key operation in deep learning models), we want to reduce the precision of the numbers (for example from 16-bit or 32-bit down to 8-bit, as covered above) to save memory and compute. However, if we quantize the whole matrix to 8-bit with a single scaling factor, one large value stretches the range and the many smaller values lose precision, leading to performance loss.

    - Key Idea: Instead of applying one scaling factor to the entire matrix, vector-wise quantization applies a separate scaling factor to each row of the input matrix X and to each column of the weight matrix W. This way, each row or column keeps the precision of its own values without being overly influenced by the largest values elsewhere in the matrix.

    - How it works in practice: Let's say you have a matrix X of size s×h (where s is the sequence length and h is the hidden dimension). Each row of this matrix represents one token in the sequence, and each column represents a feature. Instead of scaling the entire matrix with one number, vector-wise quantization scales each row of X with its own normalization constant C_x, and each column of W (the weight matrix) with its own constant C_w.
    

- The formula for quantization is: $X_{\text{int8}} = \left\lfloor \frac{127}{\max\left(\left| X_i \right|\right)} \cdot X_i \right\rceil$, where $\lfloor \cdot \rceil$ means rounding to the nearest integer. This scales each element of X_i into the 8-bit range [−127, 127] using the largest absolute value in that row (or column), so that value maps to ±127.


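- So the matrix product can be represented as (if we have to put this in math form):

    - $\hat{C} = \frac{1}{C_x \otimes C_w} \cdot \hat{X} \cdot \hat{W}$

    where:

    - C_hat is the dequantized result, an approximation of the full-precision product XW.
    - C_x and C_w are the row and column scaling factors (127 divided by the largest absolute value of that row or column). Their outer product gives one factor per output element, and the division is element-wise.
    - X_hat and W_hat are the int8-quantized versions of the original matrices.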
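
The vector-wise scheme described above can be sketched in NumPy and compared against a single scale per matrix. This is a simulation under stated assumptions, not the paper's implementation: the helper names are made up, the shapes are arbitrary, the integer matmul is emulated with int32 arrays, and one large value is planted in X to show the effect.

```python
import numpy as np


def absmax_scale(a: np.ndarray, axis=None) -> np.ndarray:
    """Scaling factor 127 / max|a| along `axis` (None = one scale for the whole matrix)."""
    return 127.0 / np.abs(a).max(axis=axis, keepdims=True)


def quantize(a: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Scale into [-127, 127] and round to the nearest integer."""
    return np.round(a * scale).astype(np.int8)


def int8_matmul(xq: np.ndarray, wq: np.ndarray) -> np.ndarray:
    """Emulate an int8 matmul with int32 accumulation."""
    return xq.astype(np.int32) @ wq.astype(np.int32)


rng = np.random.default_rng(0)
s, h, o = 8, 64, 16                       # sequence length, hidden dim, output dim
X = rng.normal(size=(s, h)).astype(np.float32)
W = rng.normal(size=(h, o)).astype(np.float32)
X[0, 0] = 50.0                            # a single large value in one row

exact = X @ W

# Tensor-wise: one scale for all of X, one for all of W.
cx, cw = absmax_scale(X), absmax_scale(W)
C_tensor = int8_matmul(quantize(X, cx), quantize(W, cw)) / (cx * cw)

# Vector-wise: one scale per row of X and per column of W.
Cx = absmax_scale(X, axis=1)              # shape (s, 1)
Cw = absmax_scale(W, axis=0)              # shape (1, o)
# Broadcasting (s, 1) * (1, o) forms the outer product C_x (x) C_w.
C_vector = int8_matmul(quantize(X, Cx), quantize(W, Cw)) / (Cx * Cw)

err_tensor = np.abs(exact - C_tensor).mean()
err_vector = np.abs(exact - C_vector).mean()
print("mean abs error, tensor-wise:", err_tensor)
print("mean abs error, vector-wise:", err_vector)
```

With vector-wise scaling the large value only hurts its own row, so the average error drops. That row is still quantized coarsely, though, which is the gap mixed-precision decomposition is meant to close.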


- ####  2. Mixed-Precision Decomposition 

    - As transformers get larger (around 6.7 billion parameters and above), certain "outliers" start to appear in the matrix. These outliers are large, systematic values, and quantizing them to 8-bit along with the rest of the matrix causes severe performance degradation.

    - Mixed-precision decomposition separates these outliers and processes them in higher precision (16-bit), while keeping the rest of the matrix in 8-bit. This way, most of the matrix stays in memory-efficient 8-bit, but the outliers are handled with greater precision to avoid performance loss.

- #### How it works in reality: 

    1) Identify Outliers: In the matrix multiplication, outliers are dimensions (or features) that have very large values (up to about 20 times larger than other values). According to the paper, these outliers first appear in about 25% of transformer layers and affect around 0.1% of the features.

    2) Separate Outliers: The matrix multiplication is decomposed into two parts: one containing the outlier feature dimensions (those columns of X and the matching rows of W), and the other containing the regular values. Outliers are processed with 16-bit precision, while regular values are processed with 8-bit precision.

    3) Perform Mixed-Precision Matrix Multiplication: For the outliers, the matrix multiplication is done in 16-bit floating point (FP16) precision. For the regular values, it is done in 8-bit precision using vector-wise quantization (there is a formula that nicely summarizes this; see the paper).
    
    4) Recombine the Results: After the matrix multiplication, the outputs from the 8-bit and 16-bit operations are combined to give the final result.

    5) Why it works: By isolating the outliers and processing them in 16-bit, the method avoids the performance degradation caused by quantizing these critical values to 8-bit. Meanwhile, more than 99.9% of the matrix is still processed in 8-bit, maintaining the memory and computational efficiency.
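
The five steps above can be sketched in NumPy as follows. This is a simulation, not the paper's kernel: the function names are made up, the `threshold=6.0` default is an assumption about how an outlier is detected, the int8 matmul is emulated with int32 arrays, and a single outlier feature dimension is planted by hand.

```python
import numpy as np


def quantize_absmax(a: np.ndarray, axis: int):
    """Vector-wise int8 quantization; returns the int8 values and the scales."""
    scale = 127.0 / np.abs(a).max(axis=axis, keepdims=True)
    return np.round(a * scale).astype(np.int8), scale


def int8_vectorwise_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    Xq, cx = quantize_absmax(X, axis=1)        # one scale per row of X
    Wq, cw = quantize_absmax(W, axis=0)        # one scale per column of W
    return (Xq.astype(np.int32) @ Wq.astype(np.int32)) / (cx * cw)


def mixed_precision_matmul(X: np.ndarray, W: np.ndarray, threshold: float = 6.0):
    # 1) Identify outlier feature dimensions: columns of X with any |value| >= threshold.
    outlier = (np.abs(X) >= threshold).any(axis=0)           # shape (h,)
    # 2) Split X by columns and W by the matching rows.
    #    (Assumes at least one regular dimension remains.)
    X_out, W_out = X[:, outlier], W[outlier, :]
    X_reg, W_reg = X[:, ~outlier], W[~outlier, :]
    # 3) Outliers in FP16, everything else in vector-wise int8.
    C_out = X_out.astype(np.float16) @ W_out.astype(np.float16)
    C_reg = int8_vectorwise_matmul(X_reg, W_reg)
    # 4) Recombine the two partial results.
    return C_out.astype(np.float32) + C_reg.astype(np.float32), outlier


rng = np.random.default_rng(0)
s, h, o = 8, 64, 16
X = rng.normal(size=(s, h)).astype(np.float32)
W = rng.normal(size=(h, o)).astype(np.float32)
X[:, 3] += 30.0                                # plant one systematic outlier dimension

exact = X @ W
C_mixed, outlier = mixed_precision_matmul(X, W)
C_int8 = int8_vectorwise_matmul(X, W)

err_mixed = np.abs(exact - C_mixed).mean()
err_int8 = np.abs(exact - C_int8).mean()
print("outlier dimensions:", np.flatnonzero(outlier))
print("mean abs error, int8 only:", err_int8)
print("mean abs error, mixed precision:", err_mixed)
```

Here the outlier dimension inflates every row's scale in the int8-only path, while the mixed path keeps it out of the quantized part, so only 1 of 64 dimensions is handled in FP16.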

In conclusion: vector-wise quantization improves the precision of 8-bit matrix multiplication by using a separate scaling factor for each row of X and each column of W. Mixed-precision decomposition handles large, systematic outliers in higher precision (16-bit) while keeping the rest of the matrix in 8-bit, so that performance is maintained even for large transformer models. According to the paper, this combined approach allows transformer models with up to 175 billion parameters to run with reduced memory requirements and no loss in performance.