<a href="https://colab.research.google.com/github/rahiakela/small-language-models-fine-tuning/blob/main/domain-specific-small-language-models/02-running-inference/02_accelerating_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Accelerating inference for GPT-Neo with DeepSpeed

This notebook introduces the [DeepSpeed](https://github.com/microsoft/DeepSpeed) library and uses it to accelerate inference with the [GPT-Neo model](https://github.com/EleutherAI/gpt-neo) on text generation tasks. It runs on the Colab free tier with hardware acceleration (GPU) enabled.

Install the dependencies that are missing from the Colab VM: DeepSpeed and Hugging Face Accelerate.

In [None]:
!pip install -q deepspeed accelerate

Before loading the model, let's define a custom function to be used for benchmarking (latency measurement).

In [None]:
from time import perf_counter
import numpy as np

def measure_latency(model, tokenizer, payload, device, generation_args=None):
    # Avoid a mutable default argument; an empty dict means default generation settings
    generation_args = generation_args or {}
    input_ids = tokenizer(payload, return_tensors="pt").input_ids.to(device)
    latencies = []
    # Do GPU warm up before benchmarking
    for _ in range(2):
        _ = model.generate(input_ids, **generation_args)
    # Runs used for measuring the latency
    for _ in range(20):
        start_time = perf_counter()
        _ = model.generate(input_ids, **generation_args)
        latency = perf_counter() - start_time
        latencies.append(latency)

    # Convert the statistics from seconds to milliseconds
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)

    # The backslash is escaped so the printed text stays "+\-" without an invalid escape sequence
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\\- {time_std_ms:.2f};", time_p95_ms
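
To make the reported statistics concrete, the following sketch reproduces the summary step on a fixed list of latencies using only the standard library. The helper name `summarize_latencies` and the sample values are hypothetical and not part of the notebook. `statistics.quantiles` with `method="inclusive"` stands in for `np.percentile`, since both interpolate linearly between sorted samples by default.

```python
import statistics

def summarize_latencies(latencies_s):
    """Summarize latencies given in seconds; returned values are in milliseconds."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile
    p95_s = statistics.quantiles(latencies_s, n=100, method="inclusive")[94]
    return {
        "p95_ms": 1000 * p95_s,
        "avg_ms": 1000 * statistics.fmean(latencies_s),
        # Population standard deviation, which is also the default of np.std
        "std_ms": 1000 * statistics.pstdev(latencies_s),
    }

# Hypothetical sample: 18 runs of 1 s plus two slow runs of 2 s
sample = [1.0] * 18 + [2.0] * 2
summary = summarize_latencies(sample)
print(summary)
```

In this sample the two slow runs raise the average only to 1100 ms, while the P95 lands at 2000 ms. This is why the benchmark reports a tail percentile alongside the mean and standard deviation.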


Download the base GPT-Neo 2.7B model in half precision, together with its tokenizer, from the Hugging Face Hub.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "EleutherAI/gpt-neo-2.7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                          torch_dtype=torch.float16,
                                          device_map="auto")

In [4]:
print(f"model is loaded on device {model.device.type}")

model is loaded on device cuda
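
A rough estimate helps explain why the model is loaded in half precision on a memory-constrained GPU. The sketch below assumes the nominal parameter count implied by the model name (2.7 billion) and counts only the weights. Activations, the key/value cache, and framework overhead are ignored, so real usage will be higher. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

# Nominal parameter count taken from the model name (an approximation)
NUM_PARAMS = 2.7e9

fp32_gib = weight_memory_gib(NUM_PARAMS, 4)  # float32: 4 bytes per parameter
fp16_gib = weight_memory_gib(NUM_PARAMS, 2)  # float16: 2 bytes per parameter
print(f"float32 weights: ~{fp32_gib:.1f} GiB")
print(f"float16 weights: ~{fp16_gib:.1f} GiB")
```

Under these assumptions, `torch.float16` halves the weight footprint compared with the default single precision.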


Run inference with the downloaded model to verify that everything works as expected.

In [6]:
example = "The story so far: in the beginning, the universe was created."

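input_ids = tokenizer(example, return_tensors="pt").input_ids.to(model.device)
# generate() returns token IDs (prompt followed by the continuation), not logits
output_ids = model.generate(input_ids,
                            do_sample=True,
                            num_beams=1,
                            min_length=128,
                            max_new_tokens=128,
                            pad_token_id=50256)

# Decode only the newly generated tokens, skipping the prompt
print(f"prediction: \n \n {tokenizer.decode(output_ids[0].tolist()[len(input_ids[0]):])}")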

prediction: 
 
  The Big Bang Theory states that at a very early stage in a protons, and electrons were created out of the vaccuum of space. The universe expands exponentially, creating enormous amounts of matter and energy, and eventually, stars begin to form. Eventually, when the universe is about 13.7 billion years old, planets form. Then, life evolves on these planets, developing multicellular life and complex life forms, such as humans.

In the second law of thermodynamics, everything eventually cools down. Eventually, the universe (with the vast majority of species) goes into a rapid cooling down to about a millionth of


Benchmark the vanilla model using the previously defined `measure_latency` function.



In [7]:
generation_args = dict(do_sample=True,
                      max_length=300,
                      pad_token_id=50256,
                      use_cache=True
)
vanilla_results = measure_latency(model, tokenizer, example,
                                  model.device, generation_args)

print(f"Vanilla model: {vanilla_results[0]}")

Vanilla model: P95 latency (ms) - 12297.61378600016; Average latency (ms) - 10779.98 +\- 721.81;


## Optimize model with DeepSpeed

Let's now optimize the base GPT-Neo 2.7B model for GPU inference with DeepSpeed. Here, the choice of which of the original model's layers to replace is left to DeepSpeed itself.

In [8]:
import os

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9999'
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"
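
The variables set above are the ones PyTorch's environment-variable rendezvous reads. Here they describe a "distributed" job that consists of a single process on one machine, presumably because DeepSpeed initializes a distributed backend even when only one GPU is used. The sketch below expresses the same settings as a helper; the name `set_single_process_env` is hypothetical. It uses `setdefault` so that values already provided by a launcher are not overwritten.

```python
import os

def set_single_process_env(env=os.environ, port="9999"):
    """Fill in torch.distributed-style variables for one process on one machine."""
    defaults = {
        "MASTER_ADDR": "localhost",  # rendezvous host
        "MASTER_PORT": port,         # rendezvous port (any free port)
        "RANK": "0",                 # global index of this process
        "LOCAL_RANK": "0",           # index of this process on this machine
        "WORLD_SIZE": "1",           # total number of processes
    }
    for key, value in defaults.items():
        # setdefault keeps any value that is already present
        env.setdefault(key, value)
    return {key: env[key] for key in defaults}

# Demonstrate on a plain dict so the real environment is left untouched
print(set_single_process_env(env={}))
```

Calling the helper without arguments applies the settings to the real process environment, which is what the notebook cell does directly.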

In [9]:
import deepspeed

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
print(f"model is loaded on device {ds_model.module.device}")

model is loaded on device cuda:0


Print the optimized model's architecture to verify that some of the original model's layers have been replaced with DeepSpeed's optimized kernel implementations.

In [10]:
ds_model

InferenceEngine(
  (module): GPTNeoForCausalLM(
    (transformer): GPTNeoModel(
      (wte): Embedding(50257, 2560)
      (wpe): Embedding(2048, 2560)
      (drop): Dropout(p=0.0, inplace=False)
      (h): ModuleList(
        (0-31): 32 x DeepSpeedGPTInference(
          (attention): DeepSpeedSelfAttention(
            (qkv_func): QKVGemmOp()
            (score_context_func): SoftmaxContextOp()
            (linear_func): LinearOp()
            (vector_matmul_func): VectorMatMulOp()
          )
          (mlp): DeepSpeedMLP(
            (mlp_gemm_func): MLPGemmOp(
              (pre_rms_norm): PreRMSNormOp()
            )
            (vector_matmul_func): VectorMatMulOp()
            (fused_gemm_gelu): GELUGemmOp()
            (residual_add_func): ResidualAddOp(
              (vector_add): VectorAddOp()
            )
          )
          (layer_norm): LayerNormOp()
        )
      )
      (ln_f): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): Linear(in_feat
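
Rather than reading the printout by eye, the replacement can also be checked programmatically. The sketch below counts submodules by class name. The helper name `count_module_types` is hypothetical, and it only assumes that the object exposes a `modules()` iterator, as `torch.nn.Module` does.

```python
from collections import Counter

def count_module_types(model):
    """Count submodules by class name; `model` needs a torch.nn.Module-style modules() method."""
    return Counter(type(module).__name__ for module in model.modules())

# In the notebook (assumes `ds_model` from the cell above):
# counts = count_module_types(ds_model)
# print(counts["DeepSpeedGPTInference"])  # 32 would match the printout above
```

A count of zero for the DeepSpeed layer class would suggest that kernel injection did not take effect.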

Run inference with the optimized model to verify that everything works as expected.

In [13]:
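input_ids = tokenizer(example, return_tensors="pt").input_ids.to(model.device)
# generate() returns token IDs (prompt followed by the continuation), not logits
output_ids = ds_model.generate(input_ids,
                               do_sample=True,
                               num_beams=1,
                               min_length=128,
                               max_new_tokens=128,
                               pad_token_id=50256,
                               use_cache=False
                              )
# Decode the full sequence, prompt included
print(tokenizer.decode(output_ids[0].tolist()))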

The story so far: in the beginning, the universe was created. It was a hot universe called the multiverse. There were many universes where it appeared that matter and energy were everywhere; there was nothing but energy, but in the multiverse all energy was made out of matter and was everywhere. This way, nothing that we know of today would exist or at least wouldn't be allowed to exist.

Back then, matter was energy that existed as particles, and the energy of matter was negative, and negative energy was bad.

Things were like this for some time. Then suddenly, there was an explosion, we're not exactly sure what happened, perhaps the explosion created something, an intelligent


Now benchmark the DeepSpeed-optimized model using the previously defined `measure_latency` function.

In [12]:
generation_args = dict(do_sample=True,
                       max_length=300,
                       pad_token_id=50256,
                       use_cache=True)
ds_results = measure_latency(ds_model, tokenizer, example,
                             ds_model.module.device, generation_args)

print(f"DeepSpeed model: {ds_results[0]}")

DeepSpeed model: P95 latency (ms) - 8276.489515550087; Average latency (ms) - 8207.76 +\- 71.77;
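
To put the two benchmark outputs side by side, the sketch below computes the speedup from the numbers printed above. The values are copied by hand from those outputs, and the helper name `speedup` is hypothetical. In the notebook itself, `vanilla_results[1] / ds_results[1]` would give the P95 ratio directly.

```python
# Values copied from the two benchmark outputs above (milliseconds)
vanilla = {"p95": 12297.61378600016, "avg": 10779.98}
deepspeed_opt = {"p95": 8276.489515550087, "avg": 8207.76}

def speedup(baseline_ms, optimized_ms):
    """Ratio above 1 means the optimized model is faster."""
    return baseline_ms / optimized_ms

p95_speedup = speedup(vanilla["p95"], deepspeed_opt["p95"])
avg_speedup = speedup(vanilla["avg"], deepspeed_opt["avg"])
print(f"P95 speedup: {p95_speedup:.2f}x; average speedup: {avg_speedup:.2f}x")
```

For this run the ratio is roughly 1.49x at P95 and 1.31x on average, and the standard deviation also drops from 721.81 ms to 71.77 ms. Because sampling is enabled, the number of generated tokens can vary between runs, so these figures are best read as indicative rather than exact.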
