# Memory Estimator 

This notebook provides examples of how to use the memory_estimator API
to estimate the amount of GPU memory consumed when fine-tuning in Training Hub.
It covers:
1. How the package's primary class is implemented
2. How it can be subclassed for further extensions
3. How it can be used both via class instantiation and via a convenience function

Tips on how LLM memory usage is calculated and how it can be reduced are mentioned where relevant.

## Setup

In [None]:
from training_hub import BasicEstimator, OSFTEstimator, OSFTEstimatorExperimental, estimate

The estimate depends on several key factors that the user should provide. These are:

#### The Pre-Trained Model to be Fine-Tuned

In [None]:
model_path = "ibm-granite/granite-3.3-2b-instruct"  

#### The Number and Size of Your GPUs

The given default values will assume you are training on 2x L40s, each containing 48 GB of memory.

In [None]:
num_gpus = 2
gpu_memory = 48 * (2**30) # 48 GB in bytes

#### The Maximum Number of Tokens You'll Place Onto a GPU

Note that in Training Hub, minibatches are constructed so that
the number of tokens on a GPU never exceeds this value.

In [None]:
max_tokens_per_gpu = 8192

#### The Unfreeze Rank Ratio

This is the OSFT parameter that determines what proportion of the parameters can be updated
during the OSFT fine-tuning step. Setting this to 1/3 should give you an estimate similar to SFT's,
and setting this to 1 should give you an estimate about twice as large as SFT's.

In [None]:
unfreeze_rank_ratio = 0.25

## Profiler Overview

At a lower level, the profiling module provides a class `BasicEstimator` that implements the memory estimation for training an LLM normally (via SFT).

The estimator computes this estimate in its `estimate` method through the following procedure:

1. Calculate the memory needed to store the model parameters (`_calc_model_params`)

2. Calculate the memory needed to store the model's gradients (`_calc_gradients`)

3. Calculate the memory needed to store the model's optimizer states (`_calc_optimizer`)
    - The values of Steps 1-3 are proportional to the number of parameters in the model.
    - This estimator assumes the AdamW optimizer, which stores 2 optimizer parameters per model parameter
        - Some non-Adam optimizers use only 1 optimizer parameter, although Training Hub uses AdamW by default

4. Calculate the memory needed to store the intermediate activations within the model (`_calc_intermediate_activations`)
    - This value is the product of the number of tokens being passed onto a GPU, the number of layers in the model, and the model's hidden dimensionality

5. Calculate the memory needed to store the activated output of the model (`_calc_outputs`)
    - This value is the product of the number of tokens being passed onto a GPU and the vocabulary size of the model.

6. Calculate any additional memory the model might use (this value is 0 for SFT) (`_calc_additional`)

7. Sum up the memory calculated in Steps 1-6

8. Apply multipliers representing possible overhead to get the lower bound (1x), expected value (1.1x), and upper bound (1.3x) for the memory usage of this model (`_apply_overhead`)

Note that Training Hub assumes that all of the above values are stored in Float32 (4 bytes per tensor entry).
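
The eight steps above can be sketched in plain Python. This is a minimal illustration of the arithmetic as described, not Training Hub's implementation: the function name `sketch_estimate`, its arguments, and the model dimensions passed to it are hypothetical round numbers rather than values read from a model config, and the sketch ignores how memory is divided across GPUs.

```python
BYTES_PER_VALUE = 4  # Float32, as assumed in the note above


def sketch_estimate(num_params, tokens_per_gpu, num_layers, hidden_size, vocab_size):
    """Return (lower, expected, upper) memory in bytes for an SFT-style run."""
    params = num_params * BYTES_PER_VALUE          # Step 1: model parameters
    gradients = num_params * BYTES_PER_VALUE       # Step 2: one gradient per parameter
    optimizer = 2 * num_params * BYTES_PER_VALUE   # Step 3: AdamW keeps 2 states
    # Step 4: tokens x layers x hidden dimensionality
    activations = tokens_per_gpu * num_layers * hidden_size * BYTES_PER_VALUE
    # Step 5: tokens x vocabulary size
    outputs = tokens_per_gpu * vocab_size * BYTES_PER_VALUE
    additional = 0                                 # Step 6: nothing extra for SFT
    # Step 7: sum of all components
    total = params + gradients + optimizer + activations + outputs + additional
    # Step 8: overhead multipliers for the three bounds
    return total * 1.0, total * 1.1, total * 1.3


# Hypothetical dimensions for a model of roughly 2B-parameter scale.
lower, expected, upper = sketch_estimate(
    num_params=2_500_000_000,
    tokens_per_gpu=8192,
    num_layers=40,
    hidden_size=2048,
    vocab_size=49_152,
)
for name, value in [("lower", lower), ("expected", expected), ("upper", upper)]:
    print(f"{name}: {value / 2**30:.1f} GiB")
```

In this sketch, Steps 1-3 together cost 16 bytes per parameter, so with these illustrative numbers they dominate the total, while the activation and output terms scale with the number of tokens per GPU.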


## Basic SFT Estimation

In [None]:
my_sft_estimator = BasicEstimator(
    num_gpus=num_gpus,
    gpu_memory=gpu_memory,
    model_path=model_path,
    max_tokens_per_gpu=max_tokens_per_gpu,
    verbose=2,
)

sft_lower_bound, sft_expected, sft_upper_bound = my_sft_estimator.estimate()

## OSFT Estimation and Subclassing
Training Hub plans to implement a wide variety of different methods for training LLMs,
with OSFT having been recently implemented.

Because the estimator is implemented as a class whose individual memory components are
computed by separate methods, and because LLM training methods tend to consume memory in
similar ways, we can create new estimators by subclassing `BasicEstimator`
and overriding the methods for the relevant pieces of the memory computation
with formulas that are more accurate for that training method.

For example, the estimator for OSFT is implemented as the subclass `OSFTEstimator`.
On top of some under-the-hood changes, its main adjustment is overriding `_calc_model_params`
to use the U, Sigma, and V matrices obtained through SVD calculation instead of the typical
model weight matrix.

In [None]:
my_osft_estimator = OSFTEstimator(
    num_gpus=num_gpus,
    gpu_memory=gpu_memory,
    model_path=model_path,
    max_tokens_per_gpu=max_tokens_per_gpu,
    verbose=2,
    unfreeze_rank_ratio=unfreeze_rank_ratio,
)

osft_lower_bound, osft_expected, osft_upper_bound = my_osft_estimator.estimate()
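
The subclassing pattern described in this section can be sketched with toy stand-in classes. `ToyEstimator` and `ToyExtraBufferEstimator` are hypothetical and are not part of Training Hub. Only the method names `_calc_model_params` and `_calc_additional` mirror the ones listed in the overview, and the formulas are placeholders chosen for illustration.

```python
class ToyEstimator:
    """Toy stand-in for a base estimator; NOT Training Hub's BasicEstimator."""

    def __init__(self, num_params, bytes_per_value=4):
        self.num_params = num_params
        self.bytes_per_value = bytes_per_value

    def _calc_model_params(self):
        return self.num_params * self.bytes_per_value

    def _calc_additional(self):
        return 0  # the base class reserves no extra memory

    def estimate(self):
        # Sum whatever the component methods return, so subclasses only
        # need to override the components that differ.
        return self._calc_model_params() + self._calc_additional()


class ToyExtraBufferEstimator(ToyEstimator):
    """Hypothetical subclass that overrides a single component."""

    def _calc_additional(self):
        # Placeholder formula: one extra Float32 buffer the size of the model.
        return self.num_params * self.bytes_per_value


print(ToyEstimator(1_000).estimate())             # 4000
print(ToyExtraBufferEstimator(1_000).estimate())  # 8000
```

Because `estimate` only calls the component methods, the subclass changes the result without reimplementing the summation.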

## OSFT Estimation with Liger Kernels
`BasicEstimator` includes support for Liger Kernels. Liger Kernels aim to drastically
speed up the time needed to fine-tune LLM models as well as reduce the memory footprint
of the fine-tuning process.

Empirically, the main memory optimization of Liger Kernels is to recalculate the activated outputs
of the model rather than storing them directly on the GPU for future use. This can drastically
reduce the memory footprint when training with very large batch sizes.

For the purposes of this estimator, enabling Liger Kernels will force `_calc_outputs` to always be 0.

In Training Hub, OSFT uses Liger Kernels by default.
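
To give a sense of scale for the term that Liger Kernels remove, the output memory from Step 5 of the overview can be computed directly. This is a sketch of the arithmetic only: `output_bytes` is a hypothetical helper, and the vocabulary size is an illustrative round number rather than a value read from the model.

```python
def output_bytes(tokens_per_gpu, vocab_size, use_liger=False, bytes_per_value=4):
    """Memory for the model's activated outputs; treated as 0 with Liger enabled."""
    if use_liger:
        return 0
    return tokens_per_gpu * vocab_size * bytes_per_value


# Compare the output term with and without Liger as tokens per GPU grow.
for tokens in (8192, 32768, 131072):
    stored = output_bytes(tokens, vocab_size=49_152)
    with_liger = output_bytes(tokens, vocab_size=49_152, use_liger=True)
    print(f"{tokens:>6} tokens: {stored / 2**30:.1f} GiB stored, "
          f"{with_liger / 2**30:.1f} GiB with Liger")
```

The stored term grows linearly with the number of tokens per GPU, which is why the saving matters most for large batches.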

In [None]:
my_liger_estimator = OSFTEstimator(
    num_gpus=num_gpus,
    gpu_memory=gpu_memory,
    model_path=model_path,
    max_tokens_per_gpu=max_tokens_per_gpu,
    verbose=2,
    use_liger=True,
    unfreeze_rank_ratio=unfreeze_rank_ratio,
)

liger_lower_bound, liger_expected, liger_upper_bound = my_liger_estimator.estimate()

## Perform Estimation with the Convenience Function

For higher-level usage, rather than instantiating an estimator object directly,
you can use the convenience function `estimate`. It accepts the standard initialization
arguments for your estimator along with the training method you want to estimate for,
and returns the estimation bounds immediately.

To specify the estimation type, you can pass in `"sft"` to the `training_method` argument to
estimate for SFT, or `"osft"` to estimate for OSFT.

In [None]:
conv_sft_lower_bound, conv_sft_expected, conv_sft_upper_bound = estimate(
    training_method="sft",
    num_gpus=num_gpus,
    gpu_memory=gpu_memory,
    model_path=model_path,
    max_tokens_per_gpu=max_tokens_per_gpu,
    verbose=2,
)

In [None]:
conv_osft_lower_bound, conv_osft_expected, conv_osft_upper_bound = estimate(
    training_method="osft",
    num_gpus=num_gpus,
    gpu_memory=gpu_memory,
    model_path=model_path,
    max_tokens_per_gpu=max_tokens_per_gpu,
    verbose=2,
    unfreeze_rank_ratio=unfreeze_rank_ratio,
)

In [None]:
conv_liger_lower_bound, conv_liger_expected, conv_liger_upper_bound = estimate(
    training_method="osft",
    num_gpus=num_gpus,
    gpu_memory=gpu_memory,
    model_path=model_path,
    max_tokens_per_gpu=max_tokens_per_gpu,
    verbose=2,
    use_liger=True,
    unfreeze_rank_ratio=unfreeze_rank_ratio,
)