# Pre-training LLMs

Let us look at how the LLMs we use today are trained, along with the memory and compute requirements involved.

## Training Process of LLMs

Training an LLM from scratch is called **pre-training**.

- A vast amount of mostly unstructured data, ranging from gigabytes to terabytes or even petabytes (GB, TB, PB), is gathered for training the LLM.

- This data is used to train the LLM. During training, the model learns the patterns in the data it is given.

- The model weights are updated during training so as to minimize a loss function.

- Before training, the data needs to be cleaned and prepared. It is then converted into tokens, each token is assigned an ID, and embeddings are created for the tokens.
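
The tokens-to-IDs-to-embeddings step above can be sketched as follows. This is a toy illustration, not how production LLMs do it: it assumes whitespace tokenization and randomly initialized vectors, whereas real models use subword tokenizers and learn their embeddings during training. The helper names (`build_vocab`, `encode`, `make_embedding_table`) are hypothetical.

```python
import random

def build_vocab(corpus):
    """Assign an integer ID to each unique whitespace-separated token."""
    vocab = {}
    for text in corpus:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Convert a string into a list of token IDs (assumes all tokens are known)."""
    return [vocab[token] for token in text.lower().split()]

def make_embedding_table(vocab_size, dim, seed=0):
    """Create one small random vector per token ID (stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

corpus = ["the model learns patterns", "the data is cleaned first"]
vocab = build_vocab(corpus)
token_ids = encode("the model learns patterns", vocab)
embeddings = make_embedding_table(len(vocab), dim=4)

# Look up one embedding vector per token ID.
vectors = [embeddings[i] for i in token_ids]
print(token_ids)  # [0, 1, 2, 3]
```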

> Pre-training requires a large amount of compute and memory to train the model well.

## CUDA: Out of Memory Error

When loading a model onto an NVIDIA GPU, we often receive the error "CUDA: Out of Memory Error".

CUDA stands for Compute Unified Device Architecture. It is NVIDIA's platform for running general-purpose computation on GPUs, and deep learning frameworks rely on it to train models.

Why do we face this error? Because LLMs are so large, the GPU memory available to CUDA is exhausted and this error is thrown.

> So the problem is this: LLMs need a large amount of memory, and when they do not fit in GPU memory we cannot leverage CUDA, which plays a major role in training our model.

## Understanding Parameters and Memory in terms of LLM

As we already know, the more parameters a model has, the more capable and powerful the LLM generally is.

Consider a model stored at 32-bit full precision:

- 1 parameter = 4 bytes of memory
- 1 billion parameters = 4 * 1 billion bytes = 4 * 10^9 bytes = 4 GB

In addition, training needs memory for the Adam optimizer states, the gradients, and temporary activations, which can take roughly 20 extra bytes per parameter.

Total size per parameter = 4 bytes + 20 bytes = 24 bytes.

### Key Calculations 

For a model with 1 billion parameters:

Memory needed to store the model: 4 GB @ 32-bit full precision

Memory needed to train the model: 24 GB @ 32-bit full precision

So we need about 24 GB of GPU RAM to train such a model.
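
The arithmetic above can be reproduced with a short sketch. It assumes the figures used in this section (4 bytes per parameter at 32-bit precision, roughly 20 extra bytes per parameter during training, and 1 GB = 10^9 bytes). The function names are hypothetical, and real memory use depends on batch size, sequence length, and optimizer choice.

```python
GB = 10**9  # decimal gigabyte, as used in the calculation above

def storage_gb(num_params, bytes_per_param=4):
    """Memory needed just to hold the weights."""
    return num_params * bytes_per_param / GB

def training_gb(num_params, bytes_per_param=4, extra_bytes_per_param=20):
    """Weights plus optimizer states, gradients, and activations (rough estimate)."""
    return num_params * (bytes_per_param + extra_bytes_per_param) / GB

def fits_on_gpu(required_gb, gpu_memory_gb):
    """True if the estimated requirement fits in the given GPU memory."""
    return required_gb <= gpu_memory_gb

one_billion = 10**9
print(storage_gb(one_billion))   # 4.0
print(training_gb(one_billion))  # 24.0

# A hypothetical 16 GB GPU can store the model but not train it at full precision.
print(fits_on_gpu(storage_gb(one_billion), 16))   # True
print(fits_on_gpu(training_gb(one_billion), 16))  # False
```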

## Techniques to reduce Memory Consumption

### Quantization

Quantization is a technique that reduces the memory needed to store the weights. By default, weights are usually stored in FP32 (32-bit floating point) format.

With quantization, we reduce the precision, for example to FP16 (16-bit floating point). This cuts the storage needed for the weights by about 50%.
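
To make the trade-off concrete, the sketch below rounds a single value to FP32 and FP16 using Python's standard `struct` module, whose `f` and `e` format codes are IEEE 754 single and half precision. It only illustrates the storage size and rounding error of one number and is not a quantization pipeline. The helper names are hypothetical.

```python
import struct

def to_fp32_and_back(value):
    """Round a Python float to 32-bit single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def to_fp16_and_back(value):
    """Round a Python float to 16-bit half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

weight = 3.141592653589793
fp32 = to_fp32_and_back(weight)
fp16 = to_fp16_and_back(weight)

# Bytes per stored value: 4 for FP32, 2 for FP16.
print(struct.calcsize("<f"), struct.calcsize("<e"))  # 4 2

# FP16 keeps fewer significant digits, so the rounding error is larger.
print(fp32, abs(fp32 - weight))
print(fp16, abs(fp16 - weight))  # fp16 is 3.140625
```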

> Did you know? BFLOAT16 is a 16-bit floating-point format developed by Google Brain. The FLAN-T5 model was pre-trained with BFLOAT16.

### Goals of Optimization

- Reduces the memory required to store the weights and to train the model.

- Projects the original 32-bit floating-point values onto lower-precision values.
- Quantization-aware training (QAT) learns the quantization scaling factors during training. Modern deep learning libraries support QAT.
- The example below shows how the memory needed for the same model changes after quantization.

    | 32-bit precision | 16-bit precision | 8-bit precision |
    | :--: | :--: | :--: |
    | 4 GB of GPU RAM | 2 GB of GPU RAM | 1 GB of GPU RAM |
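
The idea of projecting 32-bit values onto a lower-precision range with a scaling factor can be sketched as below. This is a minimal illustration under stated assumptions: it uses symmetric 8-bit quantization with a single scaling factor computed directly from the weights, which is closer to post-training quantization. QAT would instead learn the scaling factor during training. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one scaling factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]

weights = [0.50, -1.27, 0.03, 1.00]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

print(quantized)  # [50, -127, 3, 100]
print(restored)   # close to the original weights, within scale / 2 of each

# Each value now needs 1 byte instead of 4, matching the 4 GB -> 1 GB row above.
print(len(weights) * 4, "bytes at 32-bit vs", len(weights) * 1, "bytes at 8-bit")
```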