# Whisper for Inferentia2

This sample shows how to compile and run Whisper models of different sizes on Inferentia2. It uses the Hugging Face weights:
  - Tiny: https://huggingface.co/openai/whisper-tiny
  - Small: https://huggingface.co/openai/whisper-small
  - Medium: https://huggingface.co/openai/whisper-medium
  - Large-v3: https://huggingface.co/openai/whisper-large-v3

Since the largest model has only 1.5B parameters, it fits in a single NeuronCore when cast to bf16. Whisper is an encoder-decoder model, so the strategy is to compile the two components separately and then plug them back into the original model structure. After that, both the encoder and the decoder run accelerated on inf2.
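
As a rough sanity check of the single-core claim, the weight footprint can be estimated from the parameter count. The sketch below assumes 2 bytes per bf16 parameter and a budget of 16 GiB of device memory per NeuronCore (an assumption, not something measured in this notebook). It ignores activations and compiler scratch space.

```python
# Rough estimate of the weight footprint of Whisper large-v3 in bf16.
# Assumptions (not measured here): ~1.5e9 parameters, 2 bytes per bf16
# value, and a 16 GiB memory budget per NeuronCore.
BYTES_PER_BF16 = 2
GIB = 1024 ** 3


def weight_footprint_gib(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Return the size of the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


large_v3_gib = weight_footprint_gib(1.5e9)
core_memory_gib = 16  # assumed per-core budget
print(f"weights: {large_v3_gib:.2f} GiB of {core_memory_gib} GiB")
```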

You can use the smallest instance, inf2.xlarge, for this experiment. To achieve higher throughput by launching multiple copies of the model to serve clients in parallel, a larger instance is recommended, such as:

**Instance**: EC2 inf2.24xlarge  
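
One way to serve clients in parallel is to keep several model copies and hand out requests round-robin. The sketch below only shows the dispatch logic. `RoundRobinPool` and `make_fake_replica` are hypothetical names, the replicas are plain callables standing in for compiled model copies, and how each copy is placed on a NeuronCore is deliberately left out.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class RoundRobinPool:
    """Dispatch requests across model replicas in round-robin order."""

    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._cycle = itertools.cycle(range(len(self._replicas)))
        self._lock = threading.Lock()
        # one worker thread per replica so several copies can run at once
        self._executor = ThreadPoolExecutor(max_workers=len(self._replicas))

    def submit(self, request):
        with self._lock:  # serialize access to the shared iterator
            idx = next(self._cycle)
        return self._executor.submit(self._replicas[idx], request)

    def shutdown(self):
        self._executor.shutdown(wait=True)


def make_fake_replica(replica_id):
    """Stand-in replica: tags each request with the replica id."""
    return lambda request: (replica_id, request)


pool = RoundRobinPool([make_fake_replica(i) for i in range(4)])
futures = [pool.submit(f"audio_{n}") for n in range(8)]
results = [f.result() for f in futures]
pool.shutdown()
print(results)
```

In a real deployment each replica would wrap its own loaded copy of the patched model, so the number of replicas is bounded by the number of cores on the instance.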

Follow the [instructions on this page to set up the environment](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). I also recommend running your experiments in the following container (DLC):

**Deep Learning Container**: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuronx:1.13.1-neuronx-py310-sdk2.18.1-ubuntu20.04

This will guarantee you're using the exact same libraries I used in this experiment.

Also, make sure you install the following libraries in your environment. Pay attention to the transformers version: newer versions will not work.

### Requirements
 - transformers==4.36.2
 - soundfile==0.12.1
 - datasets==2.18.0
 - librosa==0.10.1

In [None]:
%pip install -U transformers==4.36.2 datasets==2.18.0 soundfile==0.12.1 librosa==0.10.1

In [None]:
import os
import types
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration


# please, start by selecting the desired model size
suffix="tiny"
#suffix="small"
#suffix="medium"
#suffix="large-v3"
model_id=f"openai/whisper-{suffix}"

# this will load the tokenizer + two copies of the model. cpu_model will be used later for results comparison
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id, torchscript=True)
cpu_model = WhisperForConditionalGeneration.from_pretrained(model_id, torchscript=True)

# Load a sample from the dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# sample #3 is ~9.9seconds and produces 33 output tokens + pad token
sample = dataset[3]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

batch_size=1
# this is the maximum number of tokens the model will be able to decode
# for the sample #3 we selected above, this is enough. If you're planning to
# process longer samples, you need to adjust it accordingly.
max_dec_len = 64
# num_mel_bins,d_model --> these parameters were copied from config.json (found in the HF repo)
# we need them to correctly generate dummy inputs during compilation
# note: compare with == here; `suffix in "tiny"` would be a substring test
if suffix == "tiny":
    dim_enc,dim_dec=80,384
elif suffix == "small":
    dim_enc,dim_dec=80,768
elif suffix == "medium":
    dim_enc,dim_dec=80,1024
elif suffix == "large-v3":
    dim_enc,dim_dec=128,1280
else:
    raise ValueError(f"Unsupported model size: {suffix}")
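
The hard-coded values above mirror two fields of each checkpoint's configuration, `num_mel_bins` and `d_model`. As an alternative sketch, the same numbers could be read from the configuration text directly. The JSON string below is a hand-written excerpt holding only the two `tiny` values listed above, not the full file, and `read_dims` is a hypothetical helper.

```python
import json

# Hand-written excerpt with the two fields used in this notebook (tiny model).
config_text = '{"num_mel_bins": 80, "d_model": 384}'


def read_dims(config_json: str):
    """Return (dim_enc, dim_dec) from a Whisper config JSON string."""
    cfg = json.loads(config_json)
    return cfg["num_mel_bins"], cfg["d_model"]


dims = read_dims(config_text)
print(dims)
```

Once the model is loaded, `model.config.num_mel_bins` and `model.config.d_model` should expose the same fields, which avoids the per-size branch entirely.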

In [None]:
import types
import torch.nn.functional as F
from transformers.modeling_outputs import BaseModelOutputWithPastAndCrossAttentions,BaseModelOutput

# Now we need to simplify both the encoder and decoder forward methods to make them
# compilable. Note that these methods overwrite the original ones but
# keep backward compatibility. Also, we'll use a new attribute "forward_neuron"
# to invoke the model on inf2
def enc_f(self, input_features, attention_mask, **kwargs):
    if hasattr(self, 'forward_neuron'):
        out = self.forward_neuron(input_features, attention_mask)['last_hidden_state']
    else:
        out = self.forward_(input_features, attention_mask, return_dict=False)[0]
    return BaseModelOutput(last_hidden_state=out)

def dec_f(self, input_ids, attention_mask=None, encoder_hidden_states=None, **kwargs):
    out = None        
    if attention_mask is not None and encoder_hidden_states is None:
        # this is a workaround to align the input parameters for NeuronSDK tracer
        # None values are not allowed during compilation
        encoder_hidden_states, attention_mask = attention_mask,encoder_hidden_states
    inp = [input_ids, encoder_hidden_states]
    
    # here we pad the input to max_dec_len
    if inp[0].shape[1] > self.max_length:
        raise ValueError(f"Decoded sequence length {inp[0].shape[1]} exceeds the supported maximum of {self.max_length}")
    pad_size = torch.as_tensor(self.max_length - inp[0].shape[1])
    inp[0] = F.pad(inp[0], (0, pad_size), "constant", processor.tokenizer.pad_token_id)
    
    if hasattr(self, 'forward_neuron'):
        out = self.forward_neuron(*inp)['last_hidden_state']
    else:        
        out = self.forward_(input_ids=inp[0], encoder_hidden_states=inp[1], return_dict=False, use_cache=False)[0]        
    # unpad the output
    last_hidden_state = out[:, :input_ids.shape[1], :]
    
    return BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=last_hidden_state)
    
# save a backup of the original forward method
if not hasattr(model.model.encoder, 'forward_'): model.model.encoder.forward_ = model.model.encoder.forward
if not hasattr(model.model.decoder, 'forward_'): model.model.decoder.forward_ = model.model.decoder.forward

# overwrite with the new methods
model.model.encoder.forward = types.MethodType(enc_f, model.model.encoder)
model.model.decoder.forward = types.MethodType(dec_f, model.model.decoder)

model.model.decoder.max_length = max_dec_len
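
The patching pattern used in this cell (back up the original method, bind a replacement with `types.MethodType`, and route to `forward_neuron` when it exists) can be seen in isolation on a toy class. `ToyModule` and `patched_forward` are hypothetical stand-ins, not part of transformers.

```python
import types


class ToyModule:
    """Stand-in for an encoder/decoder module with a forward method."""

    def forward(self, x):
        return x + 1


def patched_forward(self, x, **kwargs):
    # Prefer the accelerated callable when present, otherwise fall back
    # to the saved original method.
    if hasattr(self, "forward_neuron"):
        return self.forward_neuron(x)
    return self.forward_(x)


module = ToyModule()
# keep a backup of the original bound method (only once)
if not hasattr(module, "forward_"):
    module.forward_ = module.forward
module.forward = types.MethodType(patched_forward, module)

before = module.forward(1)                # falls back to the original method
module.forward_neuron = lambda x: x * 10  # pretend this is the compiled model
after = module.forward(1)                 # now routed to the "compiled" callable
print(before, after)
```

Because the fallback stays in place, the same patched model can run on CPU before compilation and on the accelerator afterwards.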

In [None]:
# warmup model
y1 = model.generate(input_features)

## Trace Encoder
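
The dummy encoder input uses 3000 time frames, and the decoder is later traced with 1500 encoder positions. A quick derivation of both numbers, assuming Whisper's usual front end (16 kHz audio, 30-second windows, a hop of 160 samples between mel frames) and a stride-2 convolution inside the encoder:

```python
SAMPLE_RATE = 16_000   # Hz, assumed Whisper input rate
CHUNK_SECONDS = 30     # assumed window length
HOP_LENGTH = 160       # assumed samples between consecutive mel frames
CONV_STRIDE = 2        # assumed stride of the encoder's second conv layer

mel_frames = SAMPLE_RATE * CHUNK_SECONDS // HOP_LENGTH
encoder_positions = mel_frames // CONV_STRIDE
print(mel_frames, encoder_positions)
```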

In [None]:
import os
import torch
import torch_neuronx

model_filename=f"whisper_{suffix}_{batch_size}_neuron_encoder.pt"
if not os.path.isfile(model_filename):
    inp = (torch.zeros([1, dim_enc, 3000], dtype=torch.float32), torch.zeros([1, dim_enc], dtype=torch.int64))
    if hasattr(model.model.encoder, 'forward_neuron'): del model.model.encoder.forward_neuron
    neuron_encoder = torch_neuronx.trace(
        model.model.encoder, 
        inp,
        compiler_args='--model-type=transformer --enable-saturate-infinity --auto-cast=all', 
        compiler_workdir='./enc_dir',      
        inline_weights_to_neff=False)
    neuron_encoder.save(model_filename)
    model.model.encoder.forward_neuron = neuron_encoder
else:
    model.model.encoder.forward_neuron = torch.jit.load(model_filename)

## Trace decoder
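
The decoder is traced with a fixed `[1, max_dec_len]` token shape, which is why the patched forward pads its input and then slices the output back. The idea can be sketched without torch using plain lists. `pad_to_length`, `run_fixed_shape` and `fake_decoder` are hypothetical names, and the token ids and pad id are arbitrary.

```python
def pad_to_length(token_ids, max_length, pad_id):
    """Right-pad a list of token ids to exactly max_length entries."""
    if len(token_ids) > max_length:
        raise ValueError(f"Sequence too long. Max: {max_length}")
    return token_ids + [pad_id] * (max_length - len(token_ids))


def run_fixed_shape(token_ids, max_length, pad_id, fixed_fn):
    """Pad, call a fixed-shape function, then drop the outputs for padding."""
    padded = pad_to_length(token_ids, max_length, pad_id)
    out = fixed_fn(padded)         # always sees max_length positions
    return out[: len(token_ids)]   # unpad: keep only the real positions


def fake_decoder(ids):
    """Stand-in for the traced decoder: one output per input position."""
    return [i * 2 for i in ids]


tokens = [5, 6, 7]  # arbitrary example ids
result = run_fixed_shape(tokens, max_length=8, pad_id=0, fixed_fn=fake_decoder)
print(result)
```

Right-padding is a reasonable fit here because the decoder's self-attention is causal, so real positions never attend to the padding that follows them.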

In [None]:
import torch
import torch_neuronx

model_filename=f"whisper_{suffix}_{batch_size}_neuron_decoder.pt"
if not os.path.isfile(model_filename):
    inp = (torch.zeros([1, max_dec_len], dtype=torch.int64), torch.zeros([1, 1500, dim_dec], dtype=torch.float32))
    if hasattr(model.model.decoder, 'forward_neuron'): del model.model.decoder.forward_neuron
    neuron_decoder = torch_neuronx.trace(
        model.model.decoder, 
        inp,
        compiler_args='--model-type=transformer --enable-saturate-infinity  --auto-cast=all',
        compiler_workdir='./dec_dir',      
        inline_weights_to_neff=True)
    neuron_decoder.save(model_filename)
    model.model.decoder.forward_neuron = neuron_decoder
else:
    model.model.decoder.forward_neuron = torch.jit.load(model_filename)
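
Both tracing cells follow the same compile-once pattern: reuse the saved artifact when it exists, otherwise build it and save it. A generic sketch with hypothetical `build`, `save` and `load` callables (in the cells above these roles are played by `torch_neuronx.trace`, `.save` and `torch.jit.load`):

```python
import os
import pickle
import tempfile


def load_or_build(path, build, save, load):
    """Return a cached artifact from `path`, building and saving it on a miss."""
    if os.path.isfile(path):
        return load(path)
    artifact = build()
    save(artifact, path)
    return artifact


# Toy stand-ins: "compilation" is counted so cache hits are visible.
calls = {"build": 0}


def fake_build():
    calls["build"] += 1
    return {"weights": [1, 2, 3]}


def save_pickle(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "artifact.pkl")
    first = load_or_build(path, fake_build, save_pickle, load_pickle)
    second = load_or_build(path, fake_build, save_pickle, load_pickle)

print(calls["build"], first == second)
```

Note that the file names used in this notebook encode only the model size and batch size, so after changing `max_dec_len` the saved decoder would need to be deleted (or the value added to the file name) to force a recompile.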

## Test

In [None]:
# warmup inf2 model
y1 = model.generate(input_features)

In [7]:
import time
t=time.time()
y1 = model.generate(input_features)
print(f"Elapsed inf2: {time.time()-t}")
t=time.time()
y2 = cpu_model.generate(input_features)
print(f"Elapsed cpu: {time.time()-t}")
print(f"Tokens inf2: {y1}")
print(f"Tokens cpu: {y2}")
t1 = processor.batch_decode(y1, skip_special_tokens=True)
t2 = processor.batch_decode(y2, skip_special_tokens=True)
print(f"Out inf2: {t1}")
print(f"Out cpu: {t2}")

Elapsed inf2: 0.06917572021484375
Elapsed cpu: 1.6093239784240723
Tokens inf2: tensor([[50258, 50259, 50359, 50363,   634,   575, 12525, 22618,  1968,  6144,
         35617, 20084,  1756,   311,   589,   307,   534, 10281,   934,   439,
           293,   393,  4411,   294,   309,   457,   707,   295, 26916,   286,
           392,  6628,    13, 50257]])
Tokens cpu: tensor([[50258, 50259, 50359, 50363,   634,   575, 12525, 22618,  1968,  6144,
         35617, 20084,  1756,   311,   589,   307,   534, 10281,   934,   439,
           293,   393,  4411,   294,   309,   457,   707,   295, 26916,   286,
           392,  6628,    13, 50257]])
Out inf2: [" He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca."]
Out cpu: [" He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca."]
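
A single timed call is sensitive to noise. A slightly more robust sketch repeats the call after a warmup and reports the median. `benchmark` is a hypothetical helper and `fn` is any zero-argument callable, for example `lambda: model.generate(input_features)`. The toy workload below only stands in for a real call.

```python
import statistics
import time


def benchmark(fn, warmup=2, iters=10):
    """Return the median latency of fn() in seconds over `iters` timed runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Toy workload standing in for a generate() call.
median_s = benchmark(lambda: sum(range(10_000)))
print(f"median latency: {median_s * 1000:.3f} ms")

# Speedup implied by the single run printed above (cpu time / inf2 time).
speedup = 1.6093239784240723 / 0.06917572021484375
print(f"speedup: {speedup:.1f}x")
```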


## Pipeline Mode

In [8]:
import torch
import torch_neuronx
from datasets import load_dataset
from transformers import pipeline, WhisperProcessor

pipe = pipeline(
  "automatic-speech-recognition",
  model=model_id,
  chunk_length_s=30,
)
pipe.model = model
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[3]["audio"]

# we can also return timestamps for the predictions
prediction = pipe(sample.copy(), batch_size=1, return_timestamps=True)["chunks"]
for p in prediction:
    print(p)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


{'timestamp': (0.0, 6.4), 'text': " He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can"}
{'timestamp': (6.4, 9.4), 'text': ' discover in it but little of Rocky Ithaca.'}
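
The timestamped chunks are plain dictionaries, so they are easy to post-process. As one possible illustration, the sketch below turns them into SRT-style subtitle text. `format_timestamp` and `chunks_to_srt` are hypothetical helpers, the two chunks are copied from the output above, and both ends of every timestamp are assumed to be numbers.

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks):
    """Turn pipeline-style chunks into SRT subtitle text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]  # assumes neither value is None
        text = chunk["text"].strip()
        blocks.append(f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}")
    return "\n\n".join(blocks)


chunks = [
    {"timestamp": (0.0, 6.4), "text": " He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can"},
    {"timestamp": (6.4, 9.4), "text": " discover in it but little of Rocky Ithaca."},
]
srt_text = chunks_to_srt(chunks)
print(srt_text)
```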
