@FIR-708: Added code to print out the profile information from runtime #3

atrivedi-tsavoritesi · 2025-05-27T04:59:07Z

This change adds the following:

1. Add std=c++20 so that the profiler from the runtime (the one used by aot-tests) can be used.
2. Add initialization and dumping of the profile information.

There is one crash that still needs debugging, but the results look as follows:

register_backend: registered backend Tsavorite (1 devices)
register_device: registered device Tsavorite (txe)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
load_backend: failed to find ggml_backend_init in ./libggml-tsavorite.so
load_backend: failed to find ggml_backend_init in ./libggml-cpu.so
build: 4826 (5255890) with gcc (GCC) 13.3.0 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any

TXE Device MEMORY Summary total 134217728 and free 134217728
llama_model_load_from_file_impl: using device Tsavorite (txe) - 128 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 201 tensors from /tsi/akapoor/ggml/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Tiny Llama v0.3 FP32
llama_model_loader: - kv 3: general.size_label str = 1.1B
llama_model_loader: - kv 4: general.license str = apache-2.0
llama_model_loader: - kv 5: general.dataset.count u32 = 3
llama_model_loader: - kv 6: general.dataset.0.name str = SlimPajama 627B
llama_model_loader: - kv 7: general.dataset.0.organization str = Cerebras
llama_model_loader: - kv 8: general.dataset.0.repo_url str = https://huggingface.co/cerebras/SlimP...
llama_model_loader: - kv 9: general.dataset.1.name str = Starcoderdata
llama_model_loader: - kv 10: general.dataset.1.organization str = Bigcode
llama_model_loader: - kv 11: general.dataset.1.repo_url str = https://huggingface.co/bigcode/starco...
llama_model_loader: - kv 12: general.dataset.2.name str = Oasst_Top1_2023 08 25
llama_model_loader: - kv 13: general.dataset.2.version str = 08-25
llama_model_loader: - kv 14: general.dataset.2.organization str = OpenAssistant
llama_model_loader: - kv 15: general.dataset.2.repo_url str = https://huggingface.co/OpenAssistant/...
llama_model_loader: - kv 16: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 17: llama.block_count u32 = 22
llama_model_loader: - kv 18: llama.context_length u32 = 2048
llama_model_loader: - kv 19: llama.embedding_length u32 = 2048
llama_model_loader: - kv 20: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 21: llama.attention.head_count u32 = 32
llama_model_loader: - kv 22: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 24: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 25: general.file_type u32 = 0
llama_model_loader: - kv 26: llama.vocab_size u32 = 32003
llama_model_loader: - kv 27: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 28: tokenizer.ggml.model str = llama
llama_model_loader: - kv 29: tokenizer.ggml.pre str = default
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,32003] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 31: tokenizer.ggml.scores arr[f32,32003] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,32003] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 35: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - type f32: 201 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = all F32
print_info: file size = 4.10 GiB (32.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6
load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 2048
print_info: n_layer = 22
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 5632
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1B
print_info: model params = 1.10 B
print_info: general.name = Tiny Llama v0.3 FP32
print_info: vocab type = SPM
print_info: n_vocab = 32003
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: EOT token = 32002 '<|im_end|>'
print_info: UNK token = 0 '<unk>'
print_info: PAD token = 32000 '[PAD]'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: EOG token = 32002 '<|im_end|>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)

TXE Device MEMORY Summary total 134217728 and free 134217728
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/23 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4196.40 MiB
..........................................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 12288
llama_init_from_model: n_ctx_per_seq = 12288
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_pre_seq (12288) > n_ctx_train (2048) -- possible training context overflow
[2018-03-09 12:35:52.849854] 272:273 [ info] :: </proj/work/atrivedi/workspace/05_25_2025/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:129> TXE resource allocation request processed successfully.
llama_kv_cache_init: kv_size = 12288, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 22, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 264.00 MiB
llama_init_from_model: KV self size = 264.00 MiB, K (f16): 132.00 MiB, V (f16): 132.00 MiB
llama_init_from_model: CPU output buffer size = 0.12 MiB
ggml_backend_tsavorite_buffer_type_alloc_buffer is called from llama data Loader

ANoop Allocating memory from tsi_alloc with size 8400896

Allocating memory from tsi_alloc with size 8400896 starting memory 0xfffe83cb3080

Address of Newly Created BUffer 0xfffe83cb3080 and size 8400896
llama_init_from_model: tsavorite compute buffer size = 8.01 MiB
llama_init_from_model: CPU compute buffer size = 808.01 MiB
llama_init_from_model: graph nodes = 710
llama_init_from_model: graph splits = 179 (with bs=512), 93 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 12288
main: llama threadpool init, n_threads = 4
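main: model was trained on only 2048 context tokens (12288 specified)

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
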
sampler seed: 2241389473
sampler params:
repeat_last_n = 5, repeat_penalty = 1.500, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 12288
top_k = 50, top_p = 0.900, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 12288, n_batch = 1024, n_predict = 10, n_keep = 1

my cat’s name is Luna.
I’m a software

llama_perf_sampler_print: sampling time = 207.24 ms / 16 runs ( 12.95 ms per token, 77.21 tokens per second)
llama_perf_context_print: load time = 85299.65 ms
llama_perf_context_print: prompt eval time = 61885.34 ms / 6 tokens (10314.22 ms per token, 0.10 tokens per second)
llama_perf_context_print: eval time = 528257.38 ms / 9 runs (58695.26 ms per token, 0.02 tokens per second)
llama_perf_context_print: total time = 613808.39 ms / 15 tokens

LLAMA SP Profiling Results:
----------------------------------------------------------------------------------------------------
   Calls    Total(ms)    T/call     Self(ms)  Function
----------------------------------------------------------------------------------------------------
       0        0.000     0.000        0.000  [  0%] LLAMA SP Main
   21600    23152.000     1.072        0.000  └─ [  4%] RuntimeHostShim::awaitCommandListCompletion
   21280    33107.073     1.556    33107.073  └─ [  5%] [ txe_mult_blob ]
     320      500.108     1.563      500.108  └─ [  0%] [ txe_add_blob ]
       1       87.000    87.000       87.000  └─ [  0%] RuntimeHostShim::initialize
   21600       25.000     0.001       25.000  └─ [  0%] RuntimeHostShim::loadBlob
   21600       11.000     0.001       11.000  └─ [  0%] RuntimeHostShim::finalizeCommandList
   21600        7.000     0.000        7.000  └─ [  0%] RuntimeHostShim::createCommandList
   86400        6.000     0.000        6.000  └─ [  0%] RuntimeHostShim::getShmemManager
   21600        4.000     0.000        4.000  └─ [  0%] RuntimeHostShim::deallocate
   21601        3.000     0.000        3.000  └─ [  0%] RuntimeHostShim::allocate
   21600        3.000     0.000        3.000  └─ [  0%] RuntimeHostShim::launchBlob
   21600        3.000     0.000        3.000  └─ [  0%] RuntimeHostShim::addCommandToList
   21600        0.000     0.000        0.000  └─ [  0%] RuntimeHostShim::unloadBlob
====================================================================================================
       0   613845.000     0.000   613845.000  [100%] TOTAL
====================================================================================================
TXE_ADD Operation, total tensor: 10 Number of Kernel Call: 320 Number of tensor got spilt: 10 Min Num of Elem 2048 Max Num of Elem 2048

TXE_SUB Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

TXE_MULT Operation, total tensor: 450 Number of Kernel Call: 21280 Number of tensor got spilt: 450 Min Num of Elem 2048 Max Num of Elem 12288

TXE_DIV Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

TXE_SQRT Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

TXE_NEG Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

TXE_ABS Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

TXE_SIN Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

TXE_SIGMOID Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
terminate called after throwing an instance of 'std::runtime_error'
what(): Profiler not initialized!
Stack trace:
./llama-cli(tsi::runtime::utils::ScopedProfiler<true>::ScopedProfiler(tsi::runtime::utils::ProfileLocation)+0x8c) [0x478d04]
/usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin/aot-tests/lib/libTsavRTFPGAShimCAPI.so(tsi_finalize+0xb0) [0xffff93ce83d8]
libggml-tsavorite.so(+0x57f0) [0xffff945c57f0]
libggml-tsavorite.so(+0x7bb4) [0xffff945c7bb4]
libggml-base.so(ggml_backend_free+0x28) [0xffff944d89d8]
libllama.so(ggml_backend_deleter::operator()(ggml_backend*)+0x18) [0xffff949aa7a0]
libllama.so(std::unique_ptr<ggml_backend, ggml_backend_deleter>::~unique_ptr()+0x50) [0xffff949df3bc]
libllama.so(void std::_Destroy<std::unique_ptr<ggml_backend, ggml_backend_deleter> >(std::unique_ptr<ggml_backend, ggml_backend_deleter>*)+0x14) [0xffff94a0d32c]
libllama.so(void std::_Destroy_aux<false>::__destroy<std::unique_ptr<ggml_backend, ggml_backend_deleter>*>(std::unique_ptr<ggml_backend, ggml_backend_deleter>*, std::unique_ptr<ggml_backend, ggml_backend_deleter>*)+0x20) [0xffff94a0d1a8]
libllama.so(void std::_Destroy<std::unique_ptr<ggml_backend, ggml_backend_deleter>*>(std::unique_ptr<ggml_backend, ggml_backend_deleter>*, std::unique_ptr<ggml_backend, ggml_backend_deleter>*)+0x1c) [0xffff94a0cc28]
libllama.so(std::vector<std::unique_ptr<ggml_backend, ggml_backend_deleter>, std::allocator<std::unique_ptr<ggml_backend, ggml_backend_deleter> > >::~vector()+0x40) [0xffff94a0b0d4]
libllama.so(llama_context::~llama_context()+0x78) [0xffff94a07ee0]
libllama.so(llama_free+0x24) [0xffff94a05d3c]
./llama-cli() [0x472488]
./llama-cli() [0x478f60]
./llama-cli() [0x476300]
./llama-cli() [0x471074]
/lib/libc.so.6(+0x22104) [0xffff933a2104]
/lib/libc.so.6(__libc_start_main+0x9c) [0xffff933a21e4]
./llama-cli() [0x46c1b0]

./run_platform_test.sh: line 55: 559 Aborted ./llama-cli -p "my cat’s name" -m /tsi/akapoor/ggml/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin#



This change does the following:

1. Removes the changes from examples/main/main.cpp.
2. Adds the profiling changes to ggml-tsavorite.cpp.

…gml-org#16038) Initalizing RESERVED_NAME in is_reserved_name() is not thread safe and leads to corrupted memory when used from multiple threads as can be seen in the asan trace below. This fixes the initialization to make it thread-safe. #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565 #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802 #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, 
std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) 
const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> 
const&)> const&) chat.cpp:762
#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
...

==45482==Register values:
x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738
x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001
x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001
x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)

Thread T5 created by T0 here:
#0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
...

==45482==ABORTING

atrivedi-tsavoritesi requested review from DashingR, akapoor3518, dineshReddy6381, dmpatra, gkethamallax, mmankal and reach2shaunak May 27, 2025 04:59

atrivedi-tsavoritesi self-assigned this May 27, 2025

dmpatra approved these changes May 27, 2025

@FIR-708: Fixed the crash and moved profiling to inside tsavorite

5511ddb

This change does the following:

1. Removes the changes from examples/main/main.cpp
2. Adds the profiling change to ggml-tsavorite.cpp

atrivedi-tsavoritesi merged commit 7e5f682 into FIR-699 May 27, 2025

atrivedi-tsavoritesi deleted the FIR-708 branch May 27, 2025 20:17


3 participants
