# Whisper for Inferentia2

This sample shows how to compile and run Whisper models of different sizes on Inferentia2. It uses the Hugging Face weights:
  - Tiny: https://huggingface.co/openai/whisper-tiny
  - Small: https://huggingface.co/openai/whisper-small
  - Medium: https://huggingface.co/openai/whisper-medium
  - Large-v3: https://huggingface.co/openai/whisper-large-v3

Since the largest model has only 1.5B parameters, it fits on a single NeuronCore when cast to bf16. Whisper is an encoder-decoder model, so the strategy is to compile the components separately (encoder, decoder, and output projection) and then plug them back into the original model structure. After that, both the encoder and the decoder run accelerated on inf2.
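As a rough check of the single-core claim, the sketch below estimates the size of the weights alone when stored in bf16 versus fp32. It ignores activations and compiler scratch space, and the 16 GiB per-core budget is an assumption (an Inferentia2 chip's 32 GB of device memory shared by its two NeuronCores), so verify the figure for your instance type.

```python
# Weights-only memory estimate; activations and runtime buffers are ignored.
GIB = 2**30

def weight_footprint_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the model weights in GiB."""
    return num_params * bytes_per_param / GIB

NUM_PARAMS = 1.5e9      # largest Whisper model, as stated above
CORE_MEMORY_GIB = 16    # assumption: memory available to one NeuronCore

bf16_gib = weight_footprint_gib(NUM_PARAMS, 2)  # 2 bytes per bf16 value
fp32_gib = weight_footprint_gib(NUM_PARAMS, 4)  # 4 bytes per fp32 value

print(f"bf16 weights: {bf16_gib:.2f} GiB, fp32 weights: {fp32_gib:.2f} GiB")
print(f"fits in one core (bf16): {bf16_gib < CORE_MEMORY_GIB}")
```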

You can use the smallest instance, inf2.xlarge, for this experiment. To achieve higher throughput by launching multiple copies of the model to serve clients in parallel, use a larger instance such as ml.inf2.24xlarge or trn1.32xlarge.

Follow [the instructions on this page](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx) to set up the environment. We recommend running your experiments in the following **Deep Learning Container** (DLC): 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuronx:1.13.1-neuronx-py310-sdk2.19.1-ubuntu20.04

This guarantees that you use exactly the same libraries as this experiment.

Also make sure you install the additional libraries listed below. Pay attention to the transformers version: newer versions might not work.

## Install Dependencies
This tutorial requires the following pip packages:

- `transformers==4.36.2`
- `soundfile==0.12.1`
- `datasets==2.18.0`
- `librosa==0.10.1`

In [1]:
import transformers
transformers.__version__

  _torch_pytree._register_pytree_node(


'4.36.2'

In [1]:
# `!export` runs in a throwaway subshell and does not affect the kernel; %env does
%env NEURON_RT_NUM_CORES=2
%env NEURON_RT_VISIBLE_CORES=0,1

In [1]:
import os
os.environ['NEURON_RT_NUM_CORES']= "2"
os.environ["NEURON_RT_VISIBLE_CORES"] = "0,1"  # indices of the NeuronCores to use
import types
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# please, start by selecting the desired model size
#suffix="tiny"
#suffix="small"
#suffix="medium"
suffix="large-v3"
model_id=f"openai/whisper-{suffix}"

# this loads two processors and three copies of the model: model and model_2 run on
# inf2 (one per NeuronCore); cpu_model is kept for comparing results later
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id, torchscript=True)
processor_2 = WhisperProcessor.from_pretrained(model_id)
model_2 = WhisperForConditionalGeneration.from_pretrained(model_id, torchscript=True)
cpu_model = WhisperForConditionalGeneration.from_pretrained(model_id, torchscript=True)

# Load a sample from the dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# sample #3 is ~9.9seconds and produces 33 output tokens + pad token
sample = dataset[3]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

# output_attentions is required if you want to return word timestamps
# if you don't need timestamps, just set this to False and get some better latency
output_attentions = True

batch_size = 1
# this is the maximum number of tokens the model will be able to decode
# for sample #3 selected above, this is enough. If you're planning to
# process longer samples, adjust it accordingly.
max_dec_len = 128
# num_mel_bins, d_model --> these parameters are read from the model config (config.json in the HF repo)
# we need them to correctly generate dummy inputs during compilation
dim_enc = model.config.num_mel_bins
dim_dec = model.config.d_model
print(f'Dim enc: {dim_enc}; Dim dec: {dim_dec}')

  _torch_pytree._register_pytree_node(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of WhisperForConditionalGeneration were not initialized from the model checkpoint at openai/whisper-large-v3 and are newly initialized: ['proj_out.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Dim enc: 128; Dim dec: 1280
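The dummy inputs used for tracing later in this notebook have fixed shapes (`[1, dim_enc, 3000]` for the encoder and `[1, 1500, dim_dec]` for the encoder output fed to the decoder). The sketch below shows where 3000 and 1500 come from. The sampling rate, hop length and encoder stride are assumptions taken from the standard Whisper feature extractor and architecture; they are not stated in this notebook.

```python
# Derive the fixed tensor shapes from Whisper's 30-second input window.
SAMPLING_RATE = 16_000  # Hz (assumption: Whisper feature-extractor default)
CHUNK_SECONDS = 30      # audio is padded or trimmed to 30 s
HOP_LENGTH = 160        # samples between mel frames (assumption)
ENCODER_STRIDE = 2      # the encoder halves the time axis (assumption)

def whisper_shapes(num_mel_bins, d_model, batch_size=1):
    """Return the encoder input and output shapes for one 30 s chunk."""
    mel_frames = SAMPLING_RATE * CHUNK_SECONDS // HOP_LENGTH
    enc_frames = mel_frames // ENCODER_STRIDE
    return {
        "encoder_input": (batch_size, num_mel_bins, mel_frames),
        "encoder_output": (batch_size, enc_frames, d_model),
    }

# Values printed above for large-v3: num_mel_bins=128, d_model=1280
print(whisper_shapes(num_mel_bins=128, d_model=1280))
```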


In [2]:
# cell for verifying the behaviour of inference.py
import os
os.environ['NEURON_RT_NUM_CORES']='1'
import types
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration, WhisperConfig

# please, start by selecting the desired model size
#suffix="tiny"
#suffix="small"
#suffix="medium"
suffix="large-v3"
model_id=f"openai/whisper-{suffix}"

# this loads the processor and builds the model structure from its config only
# (no pretrained weights are loaded here)
processor = WhisperProcessor.from_pretrained(model_id)
config = WhisperConfig.from_pretrained(model_id)
model = WhisperForConditionalGeneration(config)

max_dec_len = 128

# Load a sample from the dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# sample #3 is ~9.9seconds and produces 33 output tokens + pad token
sample = dataset[3]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

output_attentions = True

batch_size = 1
# this is the maximum number of tokens the model will be able to decode
# for sample #3 selected above, this is enough. If you're planning to
# process longer samples, adjust it accordingly.
max_dec_len = 128
# num_mel_bins, d_model --> these parameters are read from the model config (config.json in the HF repo)
# we need them to correctly generate dummy inputs during compilation
dim_enc = model.config.num_mel_bins
dim_dec = model.config.d_model
print(f'Dim enc: {dim_enc}; Dim dec: {dim_dec}')


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Dim enc: 128; Dim dec: 1280


In [2]:
import types
import torch.nn.functional as F
from transformers.modeling_outputs import BaseModelOutputWithPastAndCrossAttentions,BaseModelOutput

# Now we need to simplify both the encoder and decoder forward methods to make them
# compilable. Note that these methods override the original ones but
# keep backward compatibility. We also use a new attribute, "forward_neuron",
# to invoke the model on inf2
def enc_f(self, input_features, attention_mask, **kwargs):
    if hasattr(self, 'forward_neuron'):
        out = self.forward_neuron(input_features, attention_mask)
    else:
        out = self.forward_(input_features, attention_mask, return_dict=True)
    return BaseModelOutput(**out)

def dec_f(self, input_ids, attention_mask=None, encoder_hidden_states=None, **kwargs):
    output = None        
    if attention_mask is not None and encoder_hidden_states is None:
        # this is a workaround to align the input parameters for NeuronSDK tracer
        # None values are not allowed during compilation
        encoder_hidden_states, attention_mask = attention_mask,encoder_hidden_states
    inputs = [input_ids, encoder_hidden_states]
    
    # pad the input to max_dec_len
    if inputs[0].shape[1] > self.max_length:
        raise Exception(f"The decoded sequence is not supported. Max: {self.max_length}")
    pad_size = torch.as_tensor(self.max_length - inputs[0].shape[1])
    inputs[0] = F.pad(inputs[0], (0, pad_size), "constant", processor.tokenizer.pad_token_id)
    
    if hasattr(self, 'forward_neuron'):
        output = self.forward_neuron(*inputs)
    else:
        # output_attentions is required if you want timestamps
        output = self.forward_(input_ids=inputs[0], encoder_hidden_states=inputs[1], return_dict=True, use_cache=False, output_attentions=output_attentions)
    # unpad the output
    output['last_hidden_state'] = output['last_hidden_state'][:, :input_ids.shape[1], :]
    # neuron compiler doesn't like tuples as values of dicts, so we stack them into tensors
    # also, we need to average axis=2 given we're not using cache (use_cache=False)
    # that way, to avoid an issue with the pipeline we change the shape from:
    #  bs,num selected,num_tokens,1500 --> bs,1,num_tokens,1500
    # I suspect there is a bug in the HF pipeline code that doesn't support use_cache=False for
    # word timestamps, that's why we need that.
    if output.get('attentions') is not None:
        output['attentions'] = torch.stack([torch.mean(o[:, :, :input_ids.shape[1], :input_ids.shape[1]], axis=2, keepdim=True) for o in output['attentions']])
    if output.get('cross_attentions') is not None:
        output['cross_attentions'] = torch.stack([torch.mean(o[:, :, :input_ids.shape[1], :], axis=2, keepdim=True) for o in output['cross_attentions']])
    return BaseModelOutputWithPastAndCrossAttentions(**output)

def proj_out_f(self, inp):
    pad_size = torch.as_tensor(self.max_length - inp.shape[1], device=inp.device)
    # pad the input to max_dec_len
    if inp.shape[1] > self.max_length:
        raise Exception(f"The decoded sequence is not supported. Max: {self.max_length}")
    x = F.pad(inp, (0,0,0,pad_size), "constant", processor.tokenizer.pad_token_id)
    
    if hasattr(self, 'forward_neuron'):
        out = self.forward_neuron(x)
    else:
        out = self.forward_(x)
    # unpad the output before returning
    out = out[:, :inp.shape[1], :]
    return out
    
if not hasattr(model.model.encoder, 'forward_'): model.model.encoder.forward_ = model.model.encoder.forward
if not hasattr(model.model.decoder, 'forward_'): model.model.decoder.forward_ = model.model.decoder.forward
if not hasattr(model.proj_out, 'forward_'): model.proj_out.forward_ = model.proj_out.forward

model.model.encoder.forward = types.MethodType(enc_f, model.model.encoder)
model.model.decoder.forward = types.MethodType(dec_f, model.model.decoder)
model.proj_out.forward = types.MethodType(proj_out_f, model.proj_out)

model.model.decoder.max_length = max_dec_len
model.proj_out.max_length = max_dec_len
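The decoder and projection wrappers above share one idea: a compiled model only accepts the shape it was traced with, so the input is right-padded to `max_length`, the model runs at that fixed shape, and the output is cut back to the real length. Here is a minimal sketch of that pattern using plain Python lists in place of tensors; `PAD_ID`, `pad_to_length` and `fixed_shape_call` are illustrative names only.

```python
PAD_ID = 0  # hypothetical pad token id for this illustration

def pad_to_length(ids, max_length, pad_id=PAD_ID):
    """Right-pad a token list to exactly max_length entries."""
    if len(ids) > max_length:
        raise ValueError(f"Sequence of {len(ids)} tokens exceeds max {max_length}")
    return ids + [pad_id] * (max_length - len(ids))

def fixed_shape_call(fn, ids, max_length):
    """Call fn on a padded copy of ids and keep only the outputs for real tokens."""
    padded = pad_to_length(ids, max_length)
    out = fn(padded)       # fn always sees max_length tokens
    return out[:len(ids)]  # drop the positions that correspond to padding

# Stand-in for a compiled model: one output per input position.
double = lambda tokens: [2 * t for t in tokens]
print(fixed_shape_call(double, [5, 7, 9], max_length=8))
```

Right-padding works for the decoder because causal self-attention keeps earlier positions from attending to the padded positions that follow them.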

In [3]:
if not hasattr(model_2.model.encoder, 'forward_'): model_2.model.encoder.forward_ = model_2.model.encoder.forward
if not hasattr(model_2.model.decoder, 'forward_'): model_2.model.decoder.forward_ = model_2.model.decoder.forward
if not hasattr(model_2.proj_out, 'forward_'): model_2.proj_out.forward_ = model_2.proj_out.forward

model_2.model.encoder.forward = types.MethodType(enc_f, model_2.model.encoder)
model_2.model.decoder.forward = types.MethodType(dec_f, model_2.model.decoder)
model_2.proj_out.forward = types.MethodType(proj_out_f, model_2.proj_out)

model_2.model.decoder.max_length = max_dec_len
model_2.proj_out.max_length = max_dec_len

In [3]:
# warmup model
y1 = model.generate(input_features)
# y2 = model_2.generate(input_features)



## Trace Encoder

In [None]:
suffix, batch_size

In [2]:
device_ids = [0,1]

In [4]:
# typed out by hand from the reference sample
import os
import torch
import torch_neuronx

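# create the output directory if needed so that save() below has somewhere to write
os.makedirs("neuron_model", exist_ok=True)
model_filename = f"neuron_model/whisper_{suffix}_{batch_size}_neuron_encoder.pt"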

if not os.path.isfile(model_filename):
    inputs = (
        torch.zeros([1, dim_enc, 3000],dtype=torch.float32),
        torch.zeros([1, dim_enc], dtype=torch.int64))
    if hasattr(model.model.encoder, "forward_neuron"): del model.model.encoder.forward_neuron
    neuron_encoder = torch_neuronx.trace(
        model.model.encoder,
        inputs,
        compiler_args="--model-type=transformer --auto-cast=all --auto-cast-type=bf16",
        compiler_workdir="./enc_dir",
        inline_weights_to_neff=False)
    neuron_encoder.save(model_filename)
    model.model.encoder.forward_neuron = neuron_encoder
else:
    with torch_neuronx.experimental.neuron_cores_context(start_nc=0, nc_count=1):
        model.model.encoder.forward_neuron = torch.jit.load(model_filename)
    # model.model.encoder.forward_neuron = torch_neuronx.DataParallel(torch.jit.load(model_filename), device_ids, set_dynamic_batching=False)
with torch_neuronx.experimental.neuron_cores_context(start_nc=1, nc_count=1):
    model_2.model.encoder.forward_neuron = torch.jit.load(model_filename)

## Trace decoder

In [5]:
import torch
import torch_neuronx

model_filename=f"neuron_model/whisper_{suffix}_{batch_size}_{max_dec_len}_neuron_decoder.pt"
# model is loaded on NeuronCore 0 and model_2 on NeuronCore 1

if not os.path.isfile(model_filename):
    inputs = (torch.zeros([1, max_dec_len], dtype=torch.int64), torch.zeros([1, 1500, dim_dec], dtype=torch.float32))
    if hasattr(model.model.decoder, 'forward_neuron'): del model.model.decoder.forward_neuron
    neuron_decoder = torch_neuronx.trace(
        model.model.decoder, 
        inputs,
        compiler_args='--model-type=transformer --auto-cast=all --auto-cast-type=bf16',
        compiler_workdir='./dec_dir',      
        inline_weights_to_neff=True)
    neuron_decoder.save(model_filename)
    model.model.decoder.forward_neuron = neuron_decoder
else:
    with torch_neuronx.experimental.neuron_cores_context(start_nc=0, nc_count=1):
        # model.model.decoder.forward_neuron = torch_neuronx.DataParallel(torch.jit.load(model_filename), device_ids, set_dynamic_batching=False)
        model.model.decoder.forward_neuron = torch.jit.load(model_filename)
with torch_neuronx.experimental.neuron_cores_context(start_nc=1, nc_count=1):
    model_2.model.decoder.forward_neuron = torch.jit.load(model_filename)

## Trace Projection Output

In [6]:
import torch
import torch_neuronx

model_filename=f"neuron_model/whisper_{suffix}_{batch_size}_{max_dec_len}_neuron_proj.pt"
if not os.path.isfile(model_filename):
    inputs = torch.zeros([1, max_dec_len, dim_dec], dtype=torch.float32)
    if hasattr(model.proj_out, 'forward_neuron'): del model.proj_out.forward_neuron
    neuron_proj = torch_neuronx.trace(
        model.proj_out, 
        inputs,
        compiler_args='--model-type=transformer --auto-cast=all --auto-cast-type=bf16',
        compiler_workdir='./proj_out_dir',      
        inline_weights_to_neff=True)
    neuron_proj.save(model_filename)
    model.proj_out.forward_neuron = neuron_proj
else:
    # model.proj_out.forward_neuron = torch_neuronx.DataParallel(torch.jit.load(model_filename), device_ids, set_dynamic_batching=False)
    with torch_neuronx.experimental.neuron_cores_context(start_nc=0, nc_count=1):
        model.proj_out.forward_neuron = torch.jit.load(model_filename)
with torch_neuronx.experimental.neuron_cores_context(start_nc=1, nc_count=1):
    model_2.proj_out.forward_neuron = torch.jit.load(model_filename)
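The three tracing cells above repeat the same "load the compiled file if it exists, otherwise compile and save it" logic. A generic helper could express that once. The sketch below is library-agnostic and uses text files as stand-ins; `load_or_build` is a hypothetical name. With the Neuron SDK, `build` would wrap the `torch_neuronx.trace(...)` call, `load` would be `torch.jit.load`, and `save` would call the traced module's `.save(path)`.

```python
import os
import tempfile

def load_or_build(path, build, load, save):
    """Return (artifact, was_built): load from path if present, else build and save."""
    if os.path.isfile(path):
        return load(path), False
    artifact = build()
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # make sure the output directory exists
    save(artifact, path)
    return artifact, True

# Text-file stand-ins for the compiled model, to keep the demo self-contained.
def _write(text, path):
    with open(path, "w") as f:
        f.write(text)

def _read(path):
    with open(path) as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "neuron_model", "demo.txt")
    first = load_or_build(target, lambda: "compiled", _read, _write)   # builds
    second = load_or_build(target, lambda: "compiled", _read, _write)  # loads
    print(first, second)
```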

## Test

In [9]:
# warmup inf2 model
y1 = model.generate(input_features)

In [10]:
y1

tensor([[50258, 50259, 50360, 50364,   634,   575, 12525, 22618,  1968,  6144,
         35617,  1456,   397,   266,   311,   589,   307,   534, 10281,   934,
           439,    11,   293,   393,  4411,   294,   309,   457,   707,   295,
         33301,   286,   392,  6628,    13, 50257]])

In [11]:
torch.set_num_threads(1)

In [10]:
import time
t = time.time()
y1 = model.generate(input_features)
print(f"Elapsed inf2: {time.time()-t}")
t = time.time()
y2 = cpu_model.generate(input_features)
print(f"Elapsed cpu: {time.time()-t}")
print(f"Tokens inf2: {y1}")
print(f"Tokens cpu: {y2}")
t1 = processor.batch_decode(y1, skip_special_tokens=True)
t2 = processor.batch_decode(y2, skip_special_tokens=True)
print(f"Out inf2: {t1}")
print(f"Out cpu: {t2}")

Elapsed inf2: 8.855262994766235
Elapsed cpu: 17.03957748413086
Tokens inf2: tensor([[50258, 50259, 50360, 50364,   634,   575, 12525, 22618,  1968,  6144,
         35617,  1456,   397,   266,   311,   589,   307,   534, 10281,   934,
           439,    11,   293,   393,  4411,   294,   309,   457,   707,   295,
         33301,   286,   392,  6628,    13, 50257]])
Tokens cpu: tensor([[50258, 50259, 50360, 50364,   634,   575, 12525, 22618,  1968,  6144,
         35617,  1456,   397,   266,   311,   589,   307,   534, 10281,   934,
           439,    11,   293,   393,  4411,   294,   309,   457,   707,   295,
         33301,   286,   392,  6628,    13, 50257]])
Out inf2: [" He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca."]
Out cpu: [" He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca."]
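Comparing two printed token tensors by eye is error-prone, and a bf16 cast can change a token without that being obvious. The sketch below compares two token sequences programmatically and computes the speedup from the timings printed above. It works on plain lists (for example `y1[0].tolist()`); with tensors, `torch.equal(y1, y2)` gives a yes/no answer. `first_mismatch` and `speedup` are illustrative helpers.

```python
def first_mismatch(a, b):
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a prefix of the other: they differ at the shorter length.
    return None if len(a) == len(b) else min(len(a), len(b))

def speedup(baseline_s, accelerated_s):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_s / accelerated_s

# Timings copied from the output above.
elapsed_inf2 = 8.855262994766235
elapsed_cpu = 17.03957748413086
print(f"generate() speedup on inf2: {speedup(elapsed_cpu, elapsed_inf2):.2f}x")

# First few tokens of both outputs above, as lists.
tokens_inf2 = [50258, 50259, 50360, 50364, 634, 575]
tokens_cpu = [50258, 50259, 50360, 50364, 634, 575]
print("first mismatch:", first_mismatch(tokens_inf2, tokens_cpu))
```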


## Pipeline Mode

In [5]:
import torch
import time
# import torch_neuronx
from datasets import load_dataset
from transformers import pipeline, WhisperProcessor

model_id = "openai/whisper-large-v3"
cpu_pipe = pipeline(
  "automatic-speech-recognition",
  model=model_id,
  chunk_length_s=30
)
# cpu_pipe.model = cpu_model
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[3]["audio"]

# we can also return timestamps for the predictions
## For Whisper, return_timestamps can be True, False or "word" ("char" applies only to CTC models)
t = time.time()
prediction = cpu_pipe(sample.copy(), batch_size=1, return_timestamps="word")["chunks"]
print(f"Elapsed: {time.time()-t}")
for p in prediction:
    print(p)

Device set to use cpu
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


Elapsed: 7.2860918045043945
{'text': ' He', 'timestamp': (0.0, 0.56)}
{'text': ' has', 'timestamp': (0.56, 0.76)}
{'text': ' grave', 'timestamp': (0.76, 1.06)}
{'text': ' doubts', 'timestamp': (1.06, 1.42)}
{'text': ' whether', 'timestamp': (1.42, 1.88)}
{'text': ' Sir', 'timestamp': (1.88, 2.28)}
{'text': ' Frederick', 'timestamp': (2.28, 2.62)}
{'text': " Leighton's", 'timestamp': (2.62, 3.2)}
{'text': ' work', 'timestamp': (3.2, 3.46)}
{'text': ' is', 'timestamp': (3.46, 3.68)}
{'text': ' really', 'timestamp': (3.68, 4.02)}
{'text': ' Greek', 'timestamp': (4.02, 4.62)}
{'text': ' after', 'timestamp': (4.62, 4.98)}
{'text': ' all,', 'timestamp': (4.98, 5.5)}
{'text': ' and', 'timestamp': (5.5, 6.16)}
{'text': ' can', 'timestamp': (6.16, 6.32)}
{'text': ' discover', 'timestamp': (6.32, 6.74)}
{'text': ' in', 'timestamp': (6.74, 7.02)}
{'text': ' it', 'timestamp': (7.02, 7.22)}
{'text': ' but', 'timestamp': (7.22, 7.38)}
{'text': ' little', 'timestamp': (7.38, 7.76)}
{'text': ' of', 't

In [33]:
import torch
import torch_neuronx
from datasets import load_dataset
from transformers import pipeline, WhisperProcessor

if not output_attentions:
    raise Exception("Word timestamp not supported. Please set output_attentions=True and recompile the model")

pipe = pipeline(
  "automatic-speech-recognition",
  model=model_id,
  chunk_length_s=30
)
pipe.model = model
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[3]["audio"]

# we can also return timestamps for the predictions
## For Whisper, return_timestamps can be True, False or "word" ("char" applies only to CTC models)
t=time.time()
prediction = pipe(sample.copy(), batch_size=1, return_timestamps="word")["chunks"]
print(f"Elapsed: {time.time()-t}")
for p in prediction:
    print(p)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


Elapsed: 1.8384795188903809
{'text': ' He', 'timestamp': (0.64, 0.64)}
{'text': ' has', 'timestamp': (0.64, 0.76)}
{'text': ' grave', 'timestamp': (0.76, 1.02)}
{'text': ' doubts', 'timestamp': (1.02, 1.46)}
{'text': ' whether', 'timestamp': (1.46, 1.78)}
{'text': ' Sir', 'timestamp': (1.78, 2.22)}
{'text': ' Frederick', 'timestamp': (2.22, 2.66)}
{'text': " Leighton's", 'timestamp': (2.66, 3.1)}
{'text': ' work', 'timestamp': (3.1, 3.44)}
{'text': ' is', 'timestamp': (3.44, 3.7)}
{'text': ' really', 'timestamp': (3.7, 4.06)}
{'text': ' Greek', 'timestamp': (4.06, 4.68)}
{'text': ' after', 'timestamp': (4.68, 4.94)}
{'text': ' all,', 'timestamp': (4.94, 5.42)}
{'text': ' and', 'timestamp': (5.42, 6.1)}
{'text': ' can', 'timestamp': (6.1, 6.36)}
{'text': ' discover', 'timestamp': (6.36, 6.76)}
{'text': ' in', 'timestamp': (6.76, 7.0)}
{'text': ' it', 'timestamp': (7.0, 7.22)}
{'text': ' but', 'timestamp': (7.22, 7.4)}
{'text': ' little', 'timestamp': (7.4, 7.82)}
{'text': ' of', 'timest

## Performance benchmark

In [7]:
import soundfile as sf
import time
import concurrent.futures
from typing import Dict, Any

def process_with_model(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Run speech recognition on one sample with model 1."""
    input_features = processor(
        sample["audio"]["array"], 
        sampling_rate=sample["audio"]["sampling_rate"], 
        return_tensors="pt"
    ).input_features
    
    start_time = time.time()
    generated = model.generate(input_features)
    inference_time = time.time() - start_time
    
    transcription = processor.batch_decode(generated, skip_special_tokens=True)
    print("Inference time: ", inference_time)
    return {
        "model": "model_1",
        "inference_time": inference_time,
        "transcription": transcription
    }

def process_with_model_2(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Run speech recognition on one sample with model 2."""
    input_features = processor_2(
        sample["audio"]["array"], 
        sampling_rate=sample["audio"]["sampling_rate"], 
        return_tensors="pt"
    ).input_features
    
    start_time = time.time()
    generated = model_2.generate(input_features)
    inference_time = time.time() - start_time
    
    transcription = processor_2.batch_decode(generated, skip_special_tokens=True)
    print("Inference time: ", inference_time)
    return {
        "model": "model_2",
        "inference_time": inference_time,
        "transcription": transcription
    }

In [8]:
def parallel_inference():
    total_start_time = time.time()
    total_samples = 0
    futures = []
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
            for sample in dataset_iter:
                total_samples += 1
                if total_samples % 2 == 0:
                    # keep the Future object that will hold the result
                    future = executor.submit(process_with_model, sample)
                    futures.append(future)
                else:
                    future = executor.submit(process_with_model_2, sample)
                    futures.append(future)
            
            # wait until all tasks have completed
            for future in concurrent.futures.as_completed(futures):
                try:
                    # get the task result (raises here if the task failed)
                    result = future.result()
                    # process the result as needed
                    print(f"Model: {result['model']}, Inference time: {result['inference_time']:.4f}s")
                    # print(f"Transcription: {result['transcription']}")
                except Exception as exc:
                    print(f"Task generated an exception: {exc}")            
    except StopIteration:
        pass
    
    total_time = time.time() - total_start_time
    print(f"Total processing time: {total_time:.4f}s")
    print(f"Total samples processed: {total_samples}")
    return total_time
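The alternating dispatch above is hard-coded for two models. A sketch of the same round-robin idea for any number of workers follows; `round_robin_map` is a hypothetical helper and the lambdas stand in for `process_with_model` and `process_with_model_2`.

```python
import concurrent.futures
from itertools import cycle

def round_robin_map(workers, items):
    """Dispatch items to workers in turn, one thread per worker; results keep input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(workers)) as executor:
        futures = [executor.submit(worker, item)
                   for worker, item in zip(cycle(workers), items)]
        return [f.result() for f in futures]

# Stand-ins for process_with_model / process_with_model_2.
worker_a = lambda x: ("model_1", x * 2)
worker_b = lambda x: ("model_2", x * 2)
print(round_robin_map([worker_a, worker_b], [1, 2, 3]))
```

Note that a thread pool does not pin a worker function to a thread. If one model is slower than the other, two threads may end up calling the same model at the same time. A dedicated queue and thread per model would avoid that, at the cost of more code.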

In [9]:
dataset = load_dataset('MLCommons/peoples_speech', "microset", split='train', streaming=True)
dataset_iter = iter(dataset)

Resolving data files:   0%|          | 0/804 [00:00<?, ?it/s]

In [10]:
%%time
parallel_inference()

Inference time:  4.084258794784546
Model: model_1, Inference time: 4.0843s
Inference time:  4.208552598953247
Model: model_2, Inference time: 4.2086s
Inference time:  2.32397198677063
Model: model_2, Inference time: 2.3240s
Inference time:  2.65513277053833
Model: model_1, Inference time: 2.6551s
Inference time:  2.3822758197784424
Model: model_2, Inference time: 2.3823s
Inference time:  2.5961320400238037
Model: model_1, Inference time: 2.5961s
Inference time:  2.4981768131256104
Model: model_2, Inference time: 2.4982s
Inference time:  3.8852834701538086
Model: model_1, Inference time: 3.8853s
Inference time:  2.8445322513580322
Model: model_2, Inference time: 2.8445s
Inference time:  2.201923370361328
Model: model_2, Inference time: 2.2019s
Inference time:  10.19257640838623
Model: model_1, Inference time: 10.1926s
Inference time:  7.156795978546143
Model: model_1, Inference time: 7.1568s
Inference time:  2.614192008972

475.11402463912964

In [None]:
%%time

start_time = time.time()
dataset = load_dataset('MLCommons/peoples_speech', "microset", split='train', streaming=True)
for sample in dataset:
    input_features = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt").input_features
    t = time.time()
    y1 = model.generate(input_features)
    print(f"Elapsed inf2: {time.time()-t}")
    t = time.time()
    # print(f"Tokens inf2: {y1}")
    t1 = processor.batch_decode(y1, skip_special_tokens=True)
    print(t1)
end_time = time.time()
print(end_time - start_time)

## Deploy compiled models on SageMaker

In [13]:
!tar -czvf model.tar.gz -C neuron_model .

./
./whisper_large-v3_1_neuron_encoder.pt
./whisper_large-v3_1_128_neuron_proj.pt
./whisper_large-v3_1_128_neuron_decoder.pt
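The `tar -C neuron_model .` form places the compiled files at the root of the archive rather than under a `neuron_model/` directory. For environments without a shell, a rough equivalent with Python's standard `tarfile` module could look like the sketch below. `make_model_archive` and `list_archive` are illustrative names, and the demo uses empty placeholder files.

```python
import os
import tarfile
import tempfile

def make_model_archive(source_dir, archive_path="model.tar.gz"):
    """Pack the files in source_dir at the archive root, similar to `tar -czf ... -C source_dir .`."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(source_dir)):
            tar.add(os.path.join(source_dir, name), arcname=name)
    return archive_path

def list_archive(archive_path):
    """Return the member names stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "neuron_model")
    os.makedirs(src)
    for fname in ("encoder.pt", "decoder.pt"):
        open(os.path.join(src, fname), "wb").close()  # empty placeholder files
    archive = make_model_archive(src, os.path.join(tmp, "model.tar.gz"))
    names = list_archive(archive)
print(names)
```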


In [43]:
import os
os.environ["AWS_REGION"] = "us-west-2"

In [14]:
import boto3
import sagemaker
from sagemaker import Model, serializers, deserializers

boto_session = boto3.Session(region_name="us-west-2")
sess = sagemaker.Session(boto_session=boto_session)
role = sagemaker.get_execution_role(sagemaker_session=sess)
sess_bucket = sess.default_bucket()



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [15]:
print(f'sagemaker role arn: {role}')
print(f'sagemaker bucket: {sess_bucket}')
print(f'sagemaker session region: {sess.boto_region_name}')

sagemaker role arn: arn:aws:iam::392304288222:role/EC2SageMakerPlayGroundRole
sagemaker bucket: sagemaker-us-west-2-392304288222
sagemaker session region: us-west-2


In [16]:
from sagemaker.s3 import S3Uploader

prefix = "inf2_compiled_whisper_model"
s3_model_path = f"s3://{sess_bucket}/{prefix}"

# upload model.tar.gz
s3_model_uri = S3Uploader.upload(
    local_path="model.tar.gz", desired_s3_uri=s3_model_path,
    sagemaker_session=sess
)
print(f"model artifacts uploaded to {s3_model_uri}")

model artifacts uploaded to s3://sagemaker-us-west-2-392304288222/inf2_compiled_whisper_model/model.tar.gz


### Create the inference code

You can find the most suitable image at https://github.com/aws/deep-learning-containers/blob/master/available_images.md.

In [17]:
sagemaker_role = "arn:aws:iam::392304288222:role/service-role/AmazonSageMaker-ExecutionRole-20250130T094469"

In [18]:
s3_model_uri = "s3://sagemaker-us-west-2-392304288222/inf2_compiled_whisper_model/model.tar.gz"

In [19]:
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer

# Define serializers and deserializer
audio_serializer = DataSerializer(content_type="audio/x-audio")
deserializer = JSONDeserializer()

ecr_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuronx:2.5.1-neuronx-py310-sdk2.21.0-ubuntu22.04"
# for hugging face container
# ecr_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference-neuronx:2.1.2-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04"


pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=sagemaker_role,
    source_dir="code",
    entry_point="inference.py",
    image_uri=ecr_image,
    model_server_workers=1,
    sagemaker_session=sess,
    env={
        "chunk_length_s":"30",
        'MMS_MAX_REQUEST_SIZE': '2000000000',
        'MMS_MAX_RESPONSE_SIZE': '2000000000',
        'MMS_DEFAULT_RESPONSE_TIMEOUT': '900'
    }
)

pytorch_model._is_compiled_model = True
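The `code/inference.py` entry point referenced above is not shown in this notebook. As a purely hypothetical sketch, the request and response handlers of such a script might look like this. The handler names (`input_fn`, `output_fn`, and the omitted `model_fn` and `predict_fn`) follow the SageMaker PyTorch serving convention; the bodies are assumptions, not the author's implementation.

```python
import json

SUPPORTED_CONTENT_TYPE = "audio/x-audio"  # matches the DataSerializer defined above

def input_fn(request_body, request_content_type):
    """Validate the request and hand the raw audio bytes to predict_fn."""
    if request_content_type != SUPPORTED_CONTENT_TYPE:
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return bytes(request_body)

def output_fn(prediction, response_content_type="application/json"):
    """Serialize the prediction so that a JSON deserializer can parse it on the client."""
    return json.dumps(prediction)

# model_fn(model_dir) and predict_fn(data, model) are omitted: they would load the
# compiled *.pt files from model_dir and run the transcription, which depends on
# details this notebook does not show.

print(output_fn({"text": " He has grave doubts"}))
```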

In [36]:
from sagemaker import Predictor
predictor = Predictor(
    endpoint_name="pytorch-inference-neuronx-ml-inf2-2025-03-19-11-30-08-598",
    sagemaker_session=sess,
    serializer=audio_serializer,
    deserializer=deserializer
)

In [25]:
%%time

predictor = pytorch_model.deploy(
    instance_type="ml.inf2.8xlarge",
    initial_instance_count=1,
    serializer=audio_serializer,
    deserializer=deserializer
)
print(predictor.endpoint_name)

----------------!pytorch-inference-neuronx-ml-inf2-ml-in-2025-03-19-09-03-40-434
CPU times: user 15min 9s, sys: 17.2 s, total: 15min 27s
Wall time: 24min


In [32]:
%pip install soundfile

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com

[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [32]:
from datasets import load_dataset
import soundfile as sf
import time

In [35]:
sample

{'id': '07282016HFUUforum_SLASH_07-28-2016_HFUUforum_DOT_mp3_00335.flac',
 'audio': {'path': '07282016HFUUforum_SLASH_07-28-2016_HFUUforum_DOT_mp3_00335.flac',
  'array': array([-8.28857422e-02, -6.21337891e-02, -4.33654785e-02, ...,
          3.05175781e-05, -6.83593750e-03, -7.62939453e-03]),
  'sampling_rate': 16000},
 'duration_ms': 14800,
 'text': "are actually farming so that we can then bring back and collect our tax rate on a true level it's just you know they file the paper there's no follow through in the taxation collection so"}

In [34]:
%%time

start_time = time.time()
dataset = load_dataset('MLCommons/peoples_speech', "microset", split='train', streaming=True)
for sample in dataset:
    input_features = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt").input_features
    t = time.time()
    y1 = model.generate(input_features)
    print(f"Elapsed inf2: {time.time()-t}")
    # print(f"Tokens inf2: {y1}")
    t1 = processor.batch_decode(y1, skip_special_tokens=True)
    print(t1)
end_time = time.time()
print(end_time - start_time)



Resolving data files:   0%|          | 0/804 [00:00<?, ?it/s]

Elapsed inf2: 2.4147889614105225
[" I wanted to just share a few things, but I'm gonna not share as much as I wanted to share because we are starting late. I'd like to get this thing going so we all get home at a decent hour. This election is very important to us."]
Elapsed inf2: 1.9875483512878418
[" we support agriculture to the tune of 0.4%. Oh no wait, I made a mistake. This year they lowered it from 0.4% to 0.38%. And in the same breath they're saying food"]
Elapsed inf2: 1.4311702251434326
[" So it doesn't feel very secure to me as a farmer to hear that. And my family and I, we've been farming here since 1993."]
Elapsed inf2: 1.494741678237915
[' Last year we produced 21,000 pounds of food for the community on 2,000 square feet. So unless we were that efficient to produce that much food,']
Elapsed inf2: 1.7296266555786133
[' commons here in the spirit of being able to grow our food here locally. So we started as an organization actually in 2009, but really in 2010. March of 2010,

Time: 659.0221, 654.232

In [38]:
import json

audio_path = "sample_audio.wav"
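response = predictor.predict(data=audio_path)

# JSONDeserializer already parses the response body; decode again only if a JSON string comes back
result = json.loads(response) if isinstance(response, str) else response
print(result)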