Voxtral runner with raw audio erroring out

Following the README at https://github.com/pytorch/executorch/tree/main/examples/models/voxtral/README.md, I ran:

```
./cmake-out/examples/models/voxtral/voxtral_runner \
 --model_path ../optimum-executorch/voxtral/model.pte \
 --tokenizer_path tekken.json \
 --prompt "What can you tell me about this audio?" \
 --audio_path audio_input.bin \
 --processor_path voxtral_preprocessor.pte
```

Where audio_input.bin is generated by the following command:

```
ffmpeg -i audio.mp3 -f f32le -acodec pcm_f32le audio_input.bin
```
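As a sanity check on the raw file, here is a minimal sketch that derives the sample count, duration, and number of 30-second chunks from the file size. The 16 kHz mono sample rate and the 30-second chunk length are assumptions on my part (the ffmpeg command above sets neither `-ar` nor `-ac`, so the file keeps whatever rate and channel layout the source MP3 has), and `describe_raw_f32` is a hypothetical helper, not part of ExecuTorch.

```python
import math
import os

FLOAT32_BYTES = 4  # f32le stores each sample as a 4-byte little-endian float


def describe_raw_f32(num_bytes, sample_rate=16000, chunk_seconds=30):
    """Return (n_samples, duration_s, n_chunks) for a headerless f32le file.

    sample_rate and chunk_seconds are ASSUMED values, not read from the file:
    raw PCM carries no header, so the file itself cannot tell us either one.
    """
    if num_bytes % FLOAT32_BYTES != 0:
        raise ValueError("file size is not a whole number of float32 samples")
    n_samples = num_bytes // FLOAT32_BYTES
    duration_s = n_samples / sample_rate
    # Number of fixed-length chunks needed to cover the whole clip.
    n_chunks = math.ceil(n_samples / (sample_rate * chunk_seconds))
    return n_samples, duration_s, n_chunks


if __name__ == "__main__":
    path = "audio_input.bin"
    if os.path.exists(path):
        print(describe_raw_f32(os.path.getsize(path)))
```

If those assumptions hold, a file with the 529920 floats reported in the log below is a bit over 30 seconds long and so spans two chunks.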

I downloaded the first sample from https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples

Here's the error from voxtral_runner:

```
I 00:00:00.000925 executorch:cpuinfo_utils.cpp:71] Reading file /sys/devices/soc0/image_version
I 00:00:00.000964 executorch:cpuinfo_utils.cpp:87] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.000969 executorch:cpuinfo_utils.cpp:167] Number of efficient cores 4
I 00:00:00.000971 executorch:multimodal.cpp:292] Resetting threadpool with num threads = 6
I tokenizers:tekken.cpp:88] Loading Tekken tokenizer from: tekken.json
I tokenizers:tekken.cpp:117] Tekken version: v7, vocab_size: 131072, special_tokens: 1000
I tokenizers:tekken.cpp:123] Loading special tokens from JSON
I tokenizers:tekken.cpp:287] Initialized 1000 special tokens (1000 defined, 0 placeholders)
I tokenizers:tekken.cpp:140] Loading 130072 vocabulary tokens
I tokenizers:tekken.cpp:227] Processing 130072 vocabulary entries (limit: 130072)
I tokenizers:tekken.cpp:260] Built vocabulary with 130072 tokens
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1757106050.257168 15371296 re2.cc:237] Error parsing '([^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+|[^\r\n\p{L}\p{N}]?[\p{...': invalid perl operator: (?!
E tokenizers:re2_regex.cpp:26] Failed to compile regex: ([^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid$
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:tekken.cpp:186] Tekken tokenizer loaded successfully. Vocab size: 131072, BOS: 1, EOS: 2
I 00:00:00.220129 executorch:llm_runner_helper.cpp:48] Loaded tekken tokenizer
I 00:00:00.220152 executorch:llm_runner_helper.cpp:270] Reading metadata from model
I 00:00:00.267935 executorch:llm_runner_helper.cpp:133] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:00.267953 executorch:llm_runner_helper.cpp:133] Metadata: use_kv_cache = 1
I 00:00:00.267955 executorch:llm_runner_helper.cpp:131] Method get_max_context_len not found, using the default value 128
I 00:00:00.267957 executorch:llm_runner_helper.cpp:133] Metadata: get_max_context_len = 128
I 00:00:00.267959 executorch:llm_runner_helper.cpp:133] Metadata: get_max_seq_len = 2048
I 00:00:00.267960 executorch:llm_runner_helper.cpp:131] Method enable_dynamic_shape not found, using the default value 0
I 00:00:00.267961 executorch:llm_runner_helper.cpp:133] Metadata: enable_dynamic_shape = 0
I 00:00:00.267963 executorch:llm_runner_helper.cpp:144] Setting kMaxContextLen to kMaxSeqLen value: 2048
I 00:00:03.546667 executorch:multimodal.cpp:175] Loaded .bin file: audio_input.bin, 529920 floats
I 00:00:03.546686 executorch:multimodal.cpp:183] Processing audio through processor module...
I 00:00:03.582118 executorch:multimodal.cpp:206] Processed audio tensor shape: [1, 128, 6000]
I 00:00:03.582489 executorch:multimodal.cpp:231] Created processed Audio: batch_size=1, n_bins=128, n_frames=6000
I 00:00:03.585696 executorch:multimodal.cpp:346] Starting generation...
I 00:00:03.585715 executorch:multimodal_runner.cpp:88] RSS after loading model: 0.000000 MiB (0 if unsupported)
E 00:00:03.830725 executorch:et_view.cpp:110] Check failed (self.numel() == out.numel()): self.numel(): 3840000, out.numel(): 1920000
E 00:00:03.830743 executorch:method.cpp:1355] KernelCall failed at instruction 0:7 in operator executorch_prim::et_view.default: 0x12
E 00:00:03.830745 executorch:method.cpp:1361] arg 0 with type id 1
E 00:00:03.830747 executorch:method.cpp:1361] arg 1 with type id 8
E 00:00:03.830748 executorch:method.cpp:1361] arg 2 with type id 1
E 00:00:03.830753 executorch:multimodal.cpp:349] Failed to generate with multimodal runner
```

I'm guessing that the following check is the culprit. The input tensor has exactly twice as many elements as the output buffer:

```
 Check failed (self.numel() == out.numel()): self.numel(): 3840000, out.numel(): 1920000
```
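To make the mismatch concrete, here is a rough arithmetic sketch using only numbers from the log. It assumes the tensor reaching the view op scales linearly with the number of audio frames, which I have not verified against the exported graph; `implied_frames` is a hypothetical helper.

```python
# Values copied from the runner log above.
N_FRAMES = 6000          # "Processed audio tensor shape: [1, 128, 6000]"
SELF_NUMEL = 3_840_000   # elements in the tensor passed to et_view
OUT_NUMEL = 1_920_000    # elements in the preallocated output tensor


def implied_frames(self_numel, out_numel, n_frames):
    """Return how many input frames the output buffer would accommodate.

    ASSUMPTION: the element count at the view op is proportional to the
    number of audio frames fed to the model.
    """
    if self_numel % n_frames != 0:
        raise ValueError("self_numel is not a multiple of n_frames")
    elems_per_frame = self_numel // n_frames
    if out_numel % elems_per_frame != 0:
        raise ValueError("out_numel is not a multiple of elems_per_frame")
    return out_numel // elems_per_frame


ratio = SELF_NUMEL / OUT_NUMEL
frames = implied_frames(SELF_NUMEL, OUT_NUMEL, N_FRAMES)
print(f"input/output element ratio: {ratio}")
print(f"frames the output buffer would fit: {frames}")
```

Under that assumption, the output buffer is sized for half the frames the runner supplied, which would be consistent with the model being exported for a fixed-length input while the runner passed the full clip as a single [1, 128, 6000] tensor.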

Voxtral runner with raw audio erroring out #14025
