## GPU Footprint

This notebook measures the GPU memory required to load models with different numbers of parameters. The measurements appear in the cell outputs.

Source: https://github.com/hwchase17/langchain/blob/master/langchain/llms/huggingface_pipeline.py

In [1]:
import torch
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
import GPUtil
from numba import cuda
import gc


In [2]:
# View GPU utilization when nothing is running on it
device = -1  # cpu
if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
    device = 0  # first GPU
# Get all available GPUs
gpus = GPUtil.getGPUs()
GPUtil.showUtilization(all=True)

| ID | Name     | Serial        | UUID                                     || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  0 | Tesla T4 | 1323921034295 | GPU-634f99e4-90c8-d992-a802-6d4246152a1c ||       22C |        0% |           0% ||      15360MB |         5MB |     14923MB || Enabled      | Disabled       |


In [3]:
# Look at the GPU utilization when kernels are loaded
torch.ones((1, 1)).to(device)
GPUtil.showUtilization(all=True)

| ID | Name     | Serial        | UUID                                     || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  0 | Tesla T4 | 1323921034295 | GPU-634f99e4-90c8-d992-a802-6d4246152a1c ||       23C |        6% |           3% ||      15360MB |       481MB |     14447MB || Enabled      | Disabled       |


Loading the CUDA kernels alone uses 481MB of GPU memory. Let's see what happens when the models are loaded.
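The kernel overhead is the difference between the two "Memory used" readings above. A minimal sketch of that subtraction follows. It assumes snapshots shaped like the objects returned by `GPUtil.getGPUs()` (with `id` and `memoryUsed` attributes); the helper names `used_memory_mb` and `overhead_mb` are hypothetical, and the stand-in objects carry the values from the two tables rather than live readings.

```python
from types import SimpleNamespace


def used_memory_mb(gpus, gpu_id=0):
    """Return the 'Memory used' value (in MB) for the GPU with the given id."""
    for gpu in gpus:
        if gpu.id == gpu_id:
            return gpu.memoryUsed
    raise ValueError(f"no GPU with id {gpu_id}")


def overhead_mb(before, after, gpu_id=0):
    """Memory (in MB) that appeared on one GPU between two snapshots."""
    return used_memory_mb(after, gpu_id) - used_memory_mb(before, gpu_id)


# Stand-ins for two GPUtil.getGPUs() snapshots, filled with the readings
# from the tables above (5MB idle, 481MB after the first tensor is created).
idle = [SimpleNamespace(id=0, memoryUsed=5)]
kernels_loaded = [SimpleNamespace(id=0, memoryUsed=481)]

print(overhead_mb(idle, kernels_loaded))  # 476
```

On a live machine the two lists would come from calling `GPUtil.getGPUs()` before and after the first tensor is moved to the device.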

In [4]:
# Model selection
# A dictionary of models and their corresponding tasks
models = {
    "bloom-1b7": {
        "task": "text-generation",
        "model": "bigscience/bloom-1b7",
        "extra_args": {"max_new_tokens": 100},
    },
    "bloomz-1b7": {
        "task": "text-generation",
        "model": "bigscience/bloomz-1b7",
        "extra_args": {"max_new_tokens": 500, "temperature": 0.9},
    },
    "gpt2": {
        "task": "text-generation",
        "model": "gpt2",
        "extra_args": {"max_new_tokens": 30},
    },
    "gpt2-large": {
        "task": "text-generation",
        "model": "gpt2-large",
        "extra_args": {"max_new_tokens": 50},
    },
    "mt0-large": {
        "task": "text2text-generation",
        "model": "bigscience/mt0-large",
        "extra_args": {},
    },
    "opt-1b3": {
        "task": "text-generation",
        "model": "facebook/opt-1.3b",
        "extra_args": {"max_new_tokens": 100},
    },
    "bloom-3b": {
        "task": "text-generation",
        "model": "bigscience/bloom-3b",
        "extra_args": {"max_new_tokens": 100},
    },
    "bloom-7b1": {
        "task": "text-generation",
        "model": "bigscience/bloom-7b1",
        "extra_args": {"max_new_tokens": 100},
    },
    "gpt-6b": {
        "task": "text-generation",
        "model": "EleutherAI/gpt-j-6B",
        "extra_args": {"max_new_tokens": 100},
    }
}


In [5]:
def load_model_on_gpu(model_id):
    """
    Load the model identified by model_id onto the selected device and return it.
    """
    # Map each task to the corresponding transformers auto class
    auto_classes = {
        "text-generation": AutoModelForCausalLM,
        "text2text-generation": AutoModelForSeq2SeqLM,
    }

    task = models[model_id]["task"]
    model_name = models[model_id]["model"]
    auto_class = auto_classes[task]
    print(f"Using a {task} pipeline with {model_name}")

    # Return the model so the caller decides when its memory is released
    return auto_class.from_pretrained(model_name).to(device)


def show_utilization_models(model_ids):
    """
    Load each model in model_ids, show the GPU utilization, then free the memory.
    """
    for model_id in model_ids:
        model = load_model_on_gpu(model_id)
        GPUtil.showUtilization(all=True)
        # Drop the reference first so the cached blocks can actually be released
        del model
        gc.collect()
        torch.cuda.empty_cache()
        cuda.get_current_device().reset()
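If a model does not fit, the load raises and the cleanup at the end of the loop body never runs. One possible way to guard against that is sketched below. It is not part of the original notebook: `measure_footprints` and its three callable parameters are hypothetical names, injected so that the loop can be exercised without a GPU. The sketch also assumes that an out-of-memory failure surfaces as a `RuntimeError`.

```python
def measure_footprints(model_ids, load_fn, read_used_mb, cleanup_fn):
    """Load each model, record the GPU memory used, and always clean up.

    load_fn(model_id) must return the loaded model, read_used_mb() must
    return the current memory reading, and cleanup_fn() must free GPU memory.
    """
    results = {}
    for model_id in model_ids:
        model = None
        try:
            model = load_fn(model_id)
            results[model_id] = read_used_mb()
        except RuntimeError as err:  # assumption: OOM is raised as RuntimeError
            results[model_id] = None
            print(f"{model_id}: failed to load ({err})")
        finally:
            # Runs on success and on failure, so memory is released either way
            del model
            cleanup_fn()
    return results


# Dry run with fake callables; no GPU or model download is involved.
def fake_load(model_id):
    if model_id == "too-big":
        raise RuntimeError("out of memory")
    return object()


cleanups = []
results = measure_footprints(
    ["small", "too-big"],
    load_fn=fake_load,
    read_used_mb=lambda: 9011,
    cleanup_fn=lambda: cleanups.append("done"),
)
print(results)  # {'small': 9011, 'too-big': None}
```

In the notebook, `load_fn` could be a loader that returns the model, `read_used_mb` could read `GPUtil.getGPUs()[0].memoryUsed`, and `cleanup_fn` could call `gc.collect()` followed by `torch.cuda.empty_cache()`.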

In [6]:
model_ids_1 = ["bloom-1b7", "bloomz-1b7", "gpt2-large", "opt-1b3", "mt0-large"]
show_utilization_models(model_ids_1)

Using a text-generation pipeline with bigscience/bloom-1b7
| ID | Name     | Serial        | UUID                                     || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  0 | Tesla T4 | 1320220104182 | GPU-e038c2e6-e540-d3ae-f932-68e51a562fd3 ||       33C |       20% |          59% ||      15360MB |      9011MB |      5917MB || Enabled      | Disabled       |
Using a text-generation pipeline with bigscience/bloomz-1b7
| ID | Name     | Serial        | UUID                                     || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |
---------------------------------------------------------------------------------------------------------

In [7]:
model_ids_2 = ["bloom-3b"]
show_utilization_models(model_ids_2)

Using a text-generation pipeline with bigscience/bloom-3b
| ID | Name     | Serial        | UUID                                     || GPU temp. | GPU util. | Memory util. || Memory total | Memory used | Memory free || Display mode | Display active |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  0 | Tesla T4 | 1320220104182 | GPU-e038c2e6-e540-d3ae-f932-68e51a562fd3 ||       35C |       17% |          94% ||      15360MB |     14443MB |       485MB || Enabled      | Disabled       |


In [None]:
model_ids_3 = ["gpt-6b", "bloom-7b1"]
show_utilization_models(model_ids_3)

Using a text-generation pipeline with EleutherAI/gpt-j-6B


Downloading (…)lve/main/config.json:   0%|          | 0.00/930 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/24.2G [00:00<?, ?B/s]

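The 3B model takes around 94% of the GPU memory, so the 6B and 7B models cannot be loaded onto this GPU in fp32 precision: at 4 bytes per parameter, 6 billion parameters need about 24GB for the weights alone, which matches the 24.2G download size above and exceeds the 15,360MB available.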
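The arithmetic behind this can be sketched as a weights-only estimate. The sketch rests on several assumptions: parameter counts are the nominal ones implied by the model names, the "MB" in the tables are treated as 2**20 bytes, and the 481MB kernel overhead measured earlier is added as a constant. The function names are hypothetical. Measured usage is higher than this estimate (bloom-3b showed 14,443MB), so a `True` here means only that the weights alone would fit.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_mb(num_params, precision="fp32"):
    """Memory needed for the weights alone, in MB (assumed to be 2**20 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**20


def fits(num_params, precision="fp32", total_mb=15360, overhead_mb=481):
    """Lower-bound check: do the weights plus kernel overhead fit on the GPU?"""
    return weight_memory_mb(num_params, precision) + overhead_mb <= total_mb


# Nominal parameter counts taken from the model names (approximate).
nominal_params = {"bloom-3b": 3.0e9, "gpt-j-6B": 6.0e9, "bloom-7b1": 7.1e9}

for name, num_params in nominal_params.items():
    estimate = round(weight_memory_mb(num_params))
    print(f"{name}: ~{estimate}MB of fp32 weights, fits: {fits(num_params)}")
```

Passing `precision="fp16"` halves the weight estimate. Whether the larger models would actually load in half precision on this GPU was not tested in this notebook.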

## Conclusion
This notebook loads models onto the GPU and checks how much GPU memory each one uses. The cluster has Tesla T4 16GB GPUs, of which 15,360MB are reported as usable. Loading the CUDA kernels takes roughly 480MB of that. Among the recorded results, bloom-1b7 used 9,011MB and bloom-3b used 14,443MB (94%). The models with no results could not be loaded within the resources of this environment.