# Estimate Memory Requirement

The `deepspeed` library provides several functions for estimating the memory required to train a model. The estimates are useful even if you are not using DeepSpeed.

### Estimate Based on Actual Model

This requires loading the actual model, which takes time and memory.

In [16]:
# Stage 1 and 2
import os
from transformers import AutoModel
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
# Read the Hugging Face access token from the environment; never hard-code it in a notebook.
model = AutoModel.from_pretrained("meta-llama/Llama-2-13b-chat-hf",
                                  token=os.environ["HF_TOKEN"])
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)

# Stage 3
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 12852M total params.
  per CPU  |  per GPU |   Options
  287.27GB |  23.94GB | offload_optimizer=cpu 
   71.82GB | 239.39GB | offload_optimizer=none
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 12852M total params, 163M largest layer params.
  per CPU  |  per GPU |   Options
  323.17GB |   0.61GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  323.17GB |   0.61GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  287.27GB |  24.55GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  287.27GB |  24.55GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.92GB | 216.06GB | offload_param=none, offload_optimizer=none, zero_init=1
   71.82GB | 216.06GB | offload_param=none, offload_optimizer=none, zero_init=0
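
The rows above can be reproduced with plain bytes-per-parameter arithmetic. The following is a minimal hand-written sketch, not DeepSpeed's own code: the coefficients (2, 4, 16 and 18 bytes per parameter, a 1.5x CPU buffer factor) are assumptions chosen to match the printed single-node tables, the helper names `zero2_bytes` and `zero3_bytes` are hypothetical, and the exact counts 12,852,024,320 and 163,840,000 are assumed values consistent with the rounded "12852M" and "163M" shown above. It also assumes the tables report binary gigabytes (2^30 bytes) and covers a single node only.

```python
GIB = 2**30  # assumption: the "GB" column is in units of 2**30 bytes


def zero2_bytes(total_params, num_gpus=1, offload_optimizer=True, buffer_factor=1.5):
    """Return (cpu_bytes, gpu_bytes) for ZeRO stage 1 and 2 on one node."""
    if offload_optimizer:
        gpu = 2 * total_params
        cpu = total_params * max(4 * num_gpus, 16) * buffer_factor
    else:
        gpu = 4 * total_params + 16 * total_params / num_gpus
        cpu = total_params * 4 * num_gpus * buffer_factor
    return cpu, gpu


def zero3_bytes(total_params, largest_layer_params, num_gpus=1,
                offload_param=True, offload_optimizer=True,
                zero_init=True, buffer_factor=1.5):
    """Return (cpu_bytes, gpu_bytes) for ZeRO stage 3 on one node."""
    largest_layer = 4 * largest_layer_params
    if not offload_optimizer:
        # The tables list no row with parameter offload but without
        # optimizer offload, so this branch ignores offload_param.
        gpu = largest_layer + 18 * total_params / num_gpus
        resident = largest_layer_params if zero_init else total_params
        return resident * 4 * num_gpus * buffer_factor, gpu
    if offload_param:
        gpu, per_param = largest_layer, 18
    else:
        gpu, per_param = largest_layer + 2 * total_params / num_gpus, 16
    if not zero_init:
        per_param = max(4 * num_gpus, per_param)
    return total_params * per_param * buffer_factor, gpu


def fmt(cpu, gpu):
    """Format a (cpu_bytes, gpu_bytes) pair like one table row."""
    return f"{cpu / GIB:8.2f}GB | {gpu / GIB:7.2f}GB"


TOTAL_PARAMS = 12_852_024_320        # assumed exact count behind "12852M"
LARGEST_LAYER_PARAMS = 163_840_000   # assumed exact count behind "163M"

print("stage 1 and 2")
for offload in (True, False):
    print(fmt(*zero2_bytes(TOTAL_PARAMS, offload_optimizer=offload)))

print("stage 3")
for offload_param, offload_optimizer in ((True, True), (False, True), (False, False)):
    for zero_init in (True, False):
        print(fmt(*zero3_bytes(TOTAL_PARAMS, LARGEST_LAYER_PARAMS,
                               offload_param=offload_param,
                               offload_optimizer=offload_optimizer,
                               zero_init=zero_init)))
```

Under these assumptions the printed pairs agree with the two tables above to two decimal places, which suggests that on a single GPU the dominant cost without offloading is roughly 18 to 20 bytes per parameter.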


### Estimate Based on Theoretical Values

This does not require loading the actual model, but you need to know the model's total parameter count and, for stage 3, the parameter count of its largest layer.
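
If the parameter counts are not published, they can often be derived from the architecture alone. The sketch below does this for a Llama-style decoder without a language-model head. The function name `llama_param_counts` is hypothetical, and the hyperparameters (vocabulary 32000, hidden size 5120, MLP size 13824, 40 blocks) are assumed values for Llama-2-13B that do not appear in this document. It also assumes that "largest layer" means the single weight matrix with the most parameters.

```python
def llama_param_counts(vocab_size, hidden_size, intermediate_size, num_layers):
    """Return (total_params, largest_layer_params) for a Llama-style
    decoder with full multi-head attention, no biases, and no LM head."""
    embedding = vocab_size * hidden_size
    attention = 4 * hidden_size * hidden_size   # q, k, v and o projections
    mlp = 3 * hidden_size * intermediate_size   # gate, up and down projections
    norms = 2 * hidden_size                     # two RMSNorm weights per block
    per_block = attention + mlp + norms
    total = embedding + num_layers * per_block + hidden_size  # + final RMSNorm
    largest = max(embedding,
                  hidden_size * hidden_size,
                  hidden_size * intermediate_size)
    return total, largest


# Assumed Llama-2-13B hyperparameters (not taken from this document).
total, largest = llama_param_counts(vocab_size=32000, hidden_size=5120,
                                    intermediate_size=13824, num_layers=40)
print(f"{total // 10**6}M total params, {largest // 10**6}M largest layer params")
```

With these assumed values the result agrees with the "12852M total params, 163M largest layer params" line that the model-based estimate reports, so rounded inputs such as `13e9` and `163e6` are close enough for planning purposes.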

In [19]:
# Hypothetical estimate: stage 1 and 2
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_cold
for i in [1, 2, 4, 8]:
    print(i, "GPU:")
    estimate_zero2_model_states_mem_needs_all_cold(total_params=13e9, num_gpus_per_node=i, num_nodes=1)
    print("")

1 GPU:
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 13000M total params.
  per CPU  |  per GPU |   Options
  290.57GB |  24.21GB | offload_optimizer=cpu 
   72.64GB | 242.14GB | offload_optimizer=none

2 GPU:
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 2 GPUs per node.
SW: Model with 13000M total params.
  per CPU  |  per GPU |   Options
  290.57GB |  24.21GB | offload_optimizer=cpu 
  145.29GB | 145.29GB | offload_optimizer=none

4 GPU:
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 13000M total params.
  per CPU  |  per GPU |   Options
  290.57GB |  24.21GB | offload_optimizer=cpu 
  290.57GB |  96.86GB | offload_optimizer=none

8 GPU:
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 13000M total params.
  

In [20]:
# Hypothetical estimate: stage 3
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_cold
for i in [1, 2, 4, 8]:
    print(i, "GPU:")
    estimate_zero3_model_states_mem_needs_all_cold(total_params=13e9, largest_layer_params=163e6,
                                                   num_gpus_per_node=i, num_nodes=1)
    print("")

1 GPU:
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 13000M total params, 163M largest layer params.
  per CPU  |  per GPU |   Options
  326.89GB |   0.61GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  326.89GB |   0.61GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  290.57GB |  24.82GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  290.57GB |  24.82GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.91GB | 218.54GB | offload_param=none, offload_optimizer=none, zero_init=1
   72.64GB | 218.54GB | offload_param=none, offload_optimizer=none, zero_init=0

2 GPU:
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 2 GPUs per node.
SW: Model with 13000M total params, 163M largest layer params.
  per CPU  |  per GPU |   Options
  326.89GB |   0.61GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  326.89GB |