@FIR-1063 - llama.cpp/ggml/tsavorite support for sakanaAI Model #76

akapoor3518 · 2025-11-05T17:08:48Z

To address the Sakana issue, I've implemented a temporary fix. With this change, prompt responses now work correctly for Tiny, Gemma (4, 16, 32), and Sakana on both the Tsavorite and CPU backends.
The fix identifies RESHAPE, TRANSPOSE, and VIEW operations. If one of these ops has a source (SRC) tensor, the op that produces that source is offloaded to the CPU. As a result, for Sakana, some ADD operations (specifically those acting as sources for RESHAPE or VIEW) are now offloaded to the CPU, while the remaining ADD ops continue to run on Tsavorite.
To unblock the team, we will commit this fix. I will work on a longer-term solution as my priorities allow.

I am currently testing on FPGA

FPGA
*** exit status: 0 ***
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# tar -zxvf tsi-ggml-0.2.0.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_sigmoid_16.blob
tsi-ggml/blobs/txe_silu_16.blob
tsi-ggml/blobs/txe_swiglu_16.blob
tsi-ggml/blobs/txe_rms_norm_16.blob
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/blobs/txe_swiglu.blob
tsi-ggml/blobs/txe_rms_norm.blob
tsi-ggml/blobs/txe_soft_max.blob
tsi-ggml/blobs/txe_add_16.blob
tsi-ggml/blobs/txe_sub_16.blob
tsi-ggml/blobs/txe_mult_16.blob
tsi-ggml/blobs/txe_div_16.blob
tsi-ggml/blobs/txe_abs_16.blob
tsi-ggml/blobs/txe_neg_16.blob
tsi-ggml/blobs/txe_sqrt_16.blob
tsi-ggml/blobs/txe_sqr_16.blob
tsi-ggml/blobs/txe_inv_16.blob
tsi-ggml/blobs/txe_sin_16.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ls -lrt
-rwxr--r-- 1 root root 47540 Jan 1 1970 vecadd_blob_tvu_main.so
-rwxr--r-- 1 root root 47540 Jan 1 1970 vecadd_blob_tvu_main.blob
-rwxr-xr-x 1 root root 389008 Jan 1 1970 tsi_txe_kernel.bin
-rwxr-xr-x 1 root root 821 Jan 1 1970 tsi_shutdown.sh
-rwxr-xr-x 1 root root 617 Jan 1 1970 tsi_env.sh
-rw-r--r-- 1 root root 207 Jan 1 1970 tsiShutdown.service
-rwxr-xr-x 1 root root 1231 Jan 1 1970 tnApcMgr_run.sh
-rw-r--r-- 1 root root 239 Jan 1 1970 tnApcMgr.service
-rwxr-xr-x 1 root root 327040 Jan 1 1970 tnApcMgr
-rwxr-xr-x 1 root root 80784 Jan 1 1970 sys-diagtool
-rwxr-xr-x 1 root root 1962 Jan 1 1970 run_platform_test.sh
-rwxr-xr-x 1 root root 1557 Jan 1 1970 run_llama_cli.sh
-rwxr-xr-x 1 root root 81256 Jan 1 1970 recvFromHost
-rw-r--r-- 1 root root 546 Jan 1 1970 platform_layout.json
-rwxr-xr-x 1 root root 154944 Jan 1 1970 UAP
drwx------ 7 101006 100003 504 Jan 1 1970 aot-tests
drwxr-xr-x 3 root root 808 Jan 1 1970 tsi-ggml-orig
drwxr-xr-x 3 root root 808 Mar 9 14:02 tsi-ggml
-rw-r--r-- 1 root root 16601481 Nov 5 2025 tsi-ggml-0.2.0.tz
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh

is Luna.

llama_perf_sampler_print: sampling time = 107.93 ms / 11 runs ( 9.81 ms per token, 101.91 tokens per second)
llama_perf_context_print: load time = 86881.95 ms
llama_perf_context_print: prompt eval time = 44125.64 ms / 6 tokens ( 7354.27 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 50143.09 ms / 4 runs (12535.77 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 137167.17 ms / 10 tokens

=== GGML Perf Summary ===
Op             Target    Runs TSI_KERNEL-RUN    Total us    Avg us
ADD            OPU        484            694     1718635   3550.90
MUL            OPU        495            710      943556   1906.17
RMS_NORM       OPU        495            495      997128   2014.40
MUL_MAT        CPU       8167              0   521220669  63820.33
CONT           CPU       1379              0     1672561   1212.88
RESHAPE        CPU       1253              0       16040     12.80
VIEW           CPU       2010              0        2970      1.48
PERMUTE        CPU       1506              0        2556      1.70
TRANSPOSE      CPU        413              0         783      1.90
GET_ROWS       CPU         84              0       18691    222.51
SET_ROWS       CPU       1557              0      597034    383.45
SOFT_MAX       CPU        586              0     1002339   1710.48
ROPE           CPU       1482              0      101612     68.56
GLU            OPU        242            347     1088490   4497.89

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    46.4750   46.4750     37.2300  [3.33e-02%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     5.1040    5.1040      5.1040  └─ [3.66e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     2.4600    2.4600      1.6300  └─ [1.77e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.8300    0.4150      0.8300    └─ [5.96e-04%] tsi::runtime::executeWithTimeout
1     1.6810    1.6810      1.6810  └─ [1.21e-03%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1310 1014.8600 0.7747 0.0000 [7.28e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1310 1.36e+05 103.6987 1.36e+05 └─ [97.48%] TXE 0 Idle
215 196.9513 0.9161 196.9513 └─ [1.41e-01%] [ txe_swiglu ]
225 140.6387 0.6251 140.6387 └─ [1.01e-01%] [ txe_rms_norm ]
440 133.3295 0.3030 133.3295 └─ [9.57e-02%] [ txe_mult ]
430 126.4495 0.2941 126.4495 └─ [9.07e-02%] [ txe_add ]

[Thread] OPU (cumulative over all threads)

##########
lrwxrwxrwx 1 root root 36 Mar 12 2018 Tiny-Llama-v0.3-FP32-1.1B-F16.gguf -> Tiny-Llama-v0.3-FP32-1.1B-F16:latest
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh "ヨーロッパには何人の国がありますか" 10 SakanaAI-TinySwallow-1.5B-Instruct-F32:latest

ANOOP Calling ggml_backend_tsavorite_reg
ヨーロッパには 45

llama_perf_sampler_print: sampling time = 931.55 ms / 70 runs ( 13.31 ms per token, 75.14 tokens per second)
llama_perf_context_print: load time = 522881.40 ms
llama_perf_context_print: prompt eval time = 495618.08 ms / 60 tokens ( 8260.30 ms per token, 0.12 tokens per second)
llama_perf_context_print: eval time = 199627.62 ms / 9 runs (22180.85 ms per token, 0.05 tokens per second)
llama_perf_context_print: total time = 723601.15 ms / 69 tokens

=== GGML Perf Summary ===
Op             Target    Runs TSI_KERNEL-RUN    Total us    Avg us
ADD            CPU      11802              0      529967     44.90
ADD            OPU       2576           5762    11165138   4334.29
MUL            OPU       2622           5867     7685276   2931.07
RMS_NORM       OPU       2622           2622     5023000   1915.71
MUL_MAT        CPU      45495              0  3835513054  84306.25
CONT           CPU       7607              0     6103730    802.38
RESHAPE        CPU       6730              0       15790      2.35
VIEW           CPU      10151              0        8373      0.82
PERMUTE        CPU       8313              0        8930      1.07
TRANSPOSE      CPU       2131              0        2826      1.33
GET_ROWS       CPU        356              0       80435    225.94
SET_ROWS       CPU       8229              0      254505     30.93
SOFT_MAX       CPU       3615              0     3176380    878.67
ROPE           CPU       6875              0      636442     92.57
GLU            OPU       1288           2881     9329732   7243.58

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    53.9420   53.9420     44.3180  [7.43e-03%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     5.1850    5.1850      5.1850  └─ [7.14e-04%] tsi::runtime::TsavRTFPGA::initializeQueues
1     2.4930    2.4930      1.6540  └─ [3.43e-04%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.8390    0.4195      0.8390    └─ [1.16e-04%] tsi::runtime::executeWithTimeout
1     1.9460    1.9460      1.9460  └─ [2.68e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

10004 6879.7890 0.6877 0.0000 [9.48e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
10004 7.15e+05 71.5179 7.15e+05 └─ [98.56%] TXE 0 Idle
1873 2679.1265 1.4304 2679.1265 └─ [3.69e-01%] [ txe_swiglu ]
3815 946.7990 0.2482 946.7990 └─ [1.30e-01%] [ txe_mult ]
3746 918.4545 0.2452 918.4545 └─ [1.27e-01%] [ txe_add ]
570 800.3795 1.4042 800.3795 └─ [1.10e-01%] [ txe_rms_norm ]

[Thread] OPU (cumulative over all threads)

1     5.1420    5.1420      4.9490  [7.08e-04%] [Thread] OPU 
1     0.1930    0.1930      0.1930  └─ [2.66e-05%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

10004 5465.9730 0.5464 5343.0030 [7.53e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
10004 122.9700 0.0123 122.9700 └─ [1.69e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

10004 11013.0930 1.1009 259.8990 [ 1.52%] [Thread] tsi::runtime::TsavRT::processResponses
10004 10753.1940 1.0749 10753.1940 └─ [ 1.48%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    91.4580   91.4580     68.5330  [1.26e-02%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    22.9250   22.9250     22.9250  └─ [3.16e-03%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

10006 210.4630 0.0210 210.4630 [2.90e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

10004 1800.3680 0.1800 1800.3680 [2.48e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

10004 401.7860 0.0402 401.7860 [5.54e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

10004 748.1970 0.0748 748.1970 [1.03e-01%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

10004 101.4360 0.0101 101.4360 [1.40e-02%] [Thread] tsi::runtime::TsavRT::deallocate

-   7.26e+05    0.0000    7.26e+05  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9483


FPGA CPU

root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh "ヨーロッパには何人の国がありますか" 10 SakanaAI-TinySwallow-1.5B-Instruct-F32:latest none

ANOOP Calling ggml_backend_tsavorite_reg
ヨーロッパには 45

llama_perf_sampler_print: sampling time = 932.43 ms / 70 runs ( 13.32 ms per token, 75.07 tokens per second)
llama_perf_context_print: load time = 46228.02 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 164685.49 ms / 9 runs (18298.39 ms per token, 0.05 tokens per second)
llama_perf_context_print: total time = 193636.26 ms / 10 tokens

=== GGML Perf Summary ===
Op             Target    Runs TSI_KERNEL-RUN    Total us    Avg us
ADD            CPU      19495              0      975511     50.04
MUL            CPU       7718              0      572598     74.19
RMS_NORM       CPU       8163              0      299411     36.68
MUL_MAT        CPU      35395              0  3096367955  87480.38
CPY            CPU        122              0      133717   1096.04
RESHAPE        CPU       8242              0       33716      4.09
VIEW           CPU      10324              0       15942      1.54
PERMUTE        CPU       6099              0        9819      1.61
GET_ROWS       CPU        458              0       73570    160.63
SET_ROWS       CPU       7842              0      139176     17.75
ROPE           CPU       7186              0      528814     73.59
FLASH_ATTN_EXT CPU       4855              0     8201764   1689.34
GLU            CPU       4197              0     2337854    557.03

##########
POSIX LOGS
[akapoor@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "ヨーロッパには何人の国がありますか" -m /proj/rel/sw/ggml/models/SakanaAI-TinySwallow-1.5B-Instruct-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 30 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn

ANOOP Calling ggml_backend_tsavorite_reg
ヨーロッパには 45 の国があります。

ただし、国境の変更や再編成により、この数

llama_perf_sampler_print: sampling time = 315.10 ms / 49 runs ( 6.43 ms per token, 155.50 tokens per second)
llama_perf_context_print: load time = 12221.21 ms
llama_perf_context_print: prompt eval time = 10035.68 ms / 19 tokens ( 528.19 ms per token, 1.89 tokens per second)
llama_perf_context_print: eval time = 17384.59 ms / 29 runs ( 599.47 ms per token, 1.67 tokens per second)
llama_perf_context_print: total time = 29940.84 ms / 48 tokens

=== GGML Perf Summary ===
Op             Target    Runs TSI_KERNEL-RUN    Total us    Avg us
ADD            CPU     136214              0      372414      2.73
ADD            OPU      24416          25388    34265322   1403.40
MUL            OPU      24852          25842    15672949    630.65
RMS_NORM       OPU      24852          24852    16474278    662.90
MUL_MAT        CPU     439954              0   661593306   1503.78
CONT           CPU      93635              0     4786123     51.11
RESHAPE        CPU     121538              0       28315      0.23
VIEW           CPU     208741              0       27167      0.13
PERMUTE        CPU     169804              0       30144      0.18
TRANSPOSE      CPU      41690              0        7460      0.18
GET_ROWS       CPU       3590              0        7322      2.04
SET_ROWS       CPU      84933              0       62283      0.73
SOFT_MAX       CPU      43515              0     1392158     31.99
ROPE           CPU      95391              0      476439      4.99
GLU            OPU      12208          12694    22051834   1806.34

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    24.5350   24.5350      7.5330  [8.18e-02%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    16.8630   16.8630      3.8580  └─ [5.62e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1    11.6510   11.6510     11.6510    └─ [3.88e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     1.2780    1.2780      1.2780    └─ [4.26e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0760    0.0760      0.0690    └─ [2.53e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0070    0.0070      0.0070      └─ [2.33e-05%] tsi::runtime::executeWithTimeout
1     0.1390    0.1390      0.1390  └─ [4.63e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     5.1620    5.1620      4.6630  [1.72e-02%] [Thread] tsi::runtime::TsavRT::finalize
1     0.4920    0.4920      0.0480  └─ [1.64e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.4440    0.4440      0.0700    └─ [1.48e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.3440    0.3440      0.3440      └─ [1.15e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0300    0.0300      0.0270      └─ [1.00e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0030    0.0030      0.0030        └─ [1.00e-05%] tsi::runtime::executeWithTimeout
2     0.0070    0.0035      0.0070  └─ [2.33e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

8388 2886.7670 0.3442 273.2030 [ 9.62%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
16776 2610.8050 0.1556 2610.8050 └─ [ 8.70%] tsi::runtime::executeWithTimeout
8388 2.7590 3.29e-04 2.7590 └─ [9.20e-03%] LOAD_BLOB Command Execution
8388 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148533888[0x801...
8388 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

8388 1598.2810 0.1905 298.9620 [ 5.33%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
16776 1295.0250 0.0772 1295.0250 └─ [ 4.32%] tsi::runtime::executeWithTimeout
8388 4.2940 5.12e-04 4.2940 └─ [1.43e-02%] UNLOAD_BLOB Command Execution
8388 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148533888[0x8...
8388 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

8390 2438.3930 0.2906 73.4990 [ 8.13%] [Thread] tsi::runtime::TsavRT::processResponses
8390 2364.8940 0.2819 2364.8940 └─ [ 7.88%] tsi::runtime::executeWithTimeout

[Thread] OPU (cumulative over all threads)

1     0.0920    0.0920      0.0630  [3.07e-04%] [Thread] OPU 
1     0.0290    0.0290      0.0290  └─ [9.67e-05%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

8388 137.2170 0.0164 125.0650 [4.57e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
8388 12.1520 0.0014 12.1520 └─ [4.05e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

8390 23.0920 0.0028 23.0920 [7.70e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

8388 35.0850 0.0042 35.0850 [1.17e-01%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

8388 2837.5130 0.3383 2837.5130 [ 9.46%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

8388 10.7640 0.0013 10.7640 [3.59e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 29997.8200    0.0000  29997.8200  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9993

[akapoor@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "ヨーロッパには何人の国がありますか" -m /proj/rel/sw/ggml/models/SakanaAI-TinySwallow-1.5B-Instruct-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 30 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn

ANOOP Calling ggml_backend_tsavorite_reg
ヨーロッパには 45 の国があります。

ただし、国境の変更や再編成により、この数

llama_perf_sampler_print: sampling time = 322.77 ms / 49 runs ( 6.59 ms per token, 151.81 tokens per second)
llama_perf_context_print: load time = 9632.12 ms
llama_perf_context_print: prompt eval time = 7444.34 ms / 19 tokens ( 391.81 ms per token, 2.55 tokens per second)
llama_perf_context_print: eval time = 11306.47 ms / 29 runs ( 389.88 ms per token, 2.56 tokens per second)
llama_perf_context_print: total time = 21281.04 ms / 48 tokens

=== GGML Perf Summary ===
Op             Target    Runs TSI_KERNEL-RUN    Total us    Avg us
ADD            CPU     237341              0      993761      4.19
MUL            CPU      87628              0      416268      4.75
RMS_NORM       CPU      90071              0      225576      2.50
MUL_MAT        CPU     342193              0   633988009   1852.72
CPY            CPU       1691              0        8609      5.09
RESHAPE        CPU     171297              0       35733      0.21
VIEW           CPU     211084              0       22456      0.11
PERMUTE        CPU     125522              0       13009      0.10
GET_ROWS       CPU       4905              0        9540      1.94
SET_ROWS       CPU      86704              0       56701      0.65
ROPE           CPU      95240              0      439377      4.61
FLASH_ATTN_EXT CPU      48691              0     3749784     77.01
GLU            CPU      47226              0     2027012     42.92

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    24.1640   24.1640      7.9220  [1.13e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    16.1150   16.1150      3.8080  └─ [7.55e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1    10.8210   10.8210     10.8210    └─ [5.07e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     1.4100    1.4100      1.4100    └─ [6.61e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0760    0.0760      0.0690    └─ [3.56e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0070    0.0070      0.0070      └─ [3.28e-05%] tsi::runtime::executeWithTimeout
1     0.1270    0.1270      0.1270  └─ [5.95e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     2.2510    2.2510      1.9480  [1.06e-02%] [Thread] tsi::runtime::TsavRT::finalize
1     0.2830    0.2830      0.0460  └─ [1.33e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.2370    0.2370      0.0470    └─ [1.11e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.1660    0.1660      0.1660      └─ [7.78e-04%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0240    0.0240      0.0210      └─ [1.12e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0030    0.0030      0.0030        └─ [1.41e-05%] tsi::runtime::executeWithTimeout
2     0.0200    0.0100      0.0200  └─ [9.37e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

2    10.8820    5.4410      0.1010  [5.10e-02%] [Thread] tsi::runtime::TsavRT::processResponses
2    10.7810    5.3905     10.7810  └─ [5.05e-02%] tsi::runtime::executeWithTimeout

========================================================================================================================
- 21333.6950 0.0000 21333.6950 [100.00%] TOTAL

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.6667

[akapoor@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 30 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn

ANOOP Calling ggml_backend_tsavorite_reg
is Luna.
I'm a cat person and I love my cat Luna. She is a very cute cat and I love her

llama_perf_sampler_print: sampling time = 67.94 ms / 36 runs ( 1.89 ms per token, 529.86 tokens per second)
llama_perf_context_print: load time = 23988.12 ms
llama_perf_context_print: prompt eval time = 1606.61 ms / 6 tokens ( 267.77 ms per token, 3.73 tokens per second)
llama_perf_context_print: eval time = 7504.90 ms / 29 runs ( 258.79 ms per token, 3.86 tokens per second)
llama_perf_context_print: total time = 31567.85 ms / 35 tokens

=== GGML Perf Summary ===
Op             Target    Runs TSI_KERNEL-RUN    Total us    Avg us
ADD            CPU      68857              0      447926      6.51
MUL            CPU      59608              0      367592      6.17
RMS_NORM       CPU      62212              0      205387      3.30
MUL_MAT        CPU     268652              0   422290079   1571.89
CPY            CPU       1718              0        7318      4.26
RESHAPE        CPU     144715              0       61375      0.42
VIEW           CPU     160633              0       20828      0.13
PERMUTE        CPU      97730              0       10773      0.11
GET_ROWS       CPU       4439              0        9266      2.09
SET_ROWS       CPU      61566              0       44231      0.72
ROPE           CPU      74540              0      301755      4.05
FLASH_ATTN_EXT CPU      38327              0     3268596     85.28
GLU            CPU      34713              0      905402     26.08

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    22.1270   22.1270      4.7850  [7.00e-02%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    17.1720   17.1720      3.7600  └─ [5.43e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1    12.3340   12.3340     12.3340    └─ [3.90e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.9300    0.9300      0.9300    └─ [2.94e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.1480    0.1480      0.1410    └─ [4.68e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0070    0.0070      0.0070      └─ [2.21e-05%] tsi::runtime::executeWithTimeout
1     0.1700    0.1700      0.1700  └─ [5.38e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     1.6720    1.6720      1.2440  [5.29e-03%] [Thread] tsi::runtime::TsavRT::finalize
1     0.4050    0.4050      0.0390  └─ [1.28e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.3660    0.3660      0.0490    └─ [1.16e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.2900    0.2900      0.2900      └─ [9.17e-04%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0270    0.0270      0.0230      └─ [8.54e-05%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0040    0.0040      0.0040        └─ [1.26e-05%] tsi::runtime::executeWithTimeout
2     0.0230    0.0115      0.0230  └─ [7.27e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

2    12.6060    6.3030      0.1390  [3.99e-02%] [Thread] tsi::runtime::TsavRT::processResponses
2    12.4670    6.2335     12.4670  └─ [3.94e-02%] tsi::runtime::executeWithTimeout

========================================================================================================================
- 31621.2470 0.0000 31621.2470 [100.00%] TOTAL

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.6667

[akapoor@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 30 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn

ANOOP Calling ggml_backend_tsavorite_reg
is Luna.
I'm a cat person and I love my cat Luna. She is a very cute cat and I love her

llama_perf_sampler_print: sampling time = 74.78 ms / 36 runs ( 2.08 ms per token, 481.40 tokens per second)
llama_perf_context_print: load time = 3009.39 ms
llama_perf_context_print: prompt eval time = 2410.34 ms / 6 tokens ( 401.72 ms per token, 2.49 tokens per second)
llama_perf_context_print: eval time = 14009.46 ms / 29 runs ( 483.08 ms per token, 2.07 tokens per second)
llama_perf_context_print: total time = 17100.80 ms / 35 tokens

=== GGML Perf Summary ===
Op             Target    Runs TSI_KERNEL-RUN    Total us    Avg us
ADD            OPU      19184          19394    31088495   1620.54
MUL            OPU      19620          19835    13235653    674.60
RMS_NORM       OPU      19620          19620    14978110    763.41
MUL_MAT        CPU     344994              0   479866487   1390.94
CONT           CPU      72517              0     4061080     56.00
RESHAPE        CPU     108256              0       54067      0.50
VIEW           CPU     161091              0       21013      0.13
PERMUTE        CPU     130732              0       23040      0.18
TRANSPOSE      CPU      28537              0        6656      0.23
GET_ROWS       CPU       3482              0        8464      2.43
SET_ROWS       CPU      63729              0       55527      0.87
SOFT_MAX       CPU      33348              0     2730224     81.87
ROPE           CPU      73331              0      357391      4.87
GLU            OPU       9592           9697    16964720   1768.63

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    21.8830   21.8830      4.7060  [1.28e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    17.0220   17.0220      4.0730  └─ [9.92e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1    11.7460   11.7460     11.7460    └─ [6.85e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     1.0420    1.0420      1.0420    └─ [6.07e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.1610    0.1610      0.0960    └─ [9.39e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0650    0.0650      0.0650      └─ [3.79e-04%] tsi::runtime::executeWithTimeout
1     0.1550    0.1550      0.1550  └─ [9.04e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     3.6490    3.6490      3.1130  [2.13e-02%] [Thread] tsi::runtime::TsavRT::finalize
1     0.5280    0.5280      0.0500  └─ [3.08e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.4780    0.4780      0.0700    └─ [2.79e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.3810    0.3810      0.3810      └─ [2.22e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0270    0.0270      0.0240      └─ [1.57e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0030    0.0030      0.0030        └─ [1.75e-05%] tsi::runtime::executeWithTimeout
2     0.0080    0.0040      0.0080  └─ [4.66e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

5210 2360.9790 0.4532 183.9320 [13.76%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
10420 2174.7240 0.2087 2174.7240 └─ [12.68%] tsi::runtime::executeWithTimeout
5210 2.3230 4.46e-04 2.3230 └─ [1.35e-02%] LOAD_BLOB Command Execution
5210 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
5210 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

5210 1103.4970 0.2118 195.4310 [ 6.43%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
10420 904.9970 0.0869 904.9970 └─ [ 5.28%] tsi::runtime::executeWithTimeout
5210 3.0690 5.89e-04 3.0690 └─ [1.79e-02%] UNLOAD_BLOB Command Execution
5210 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
5210 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

5212 1411.7130 0.2709 48.0710 [ 8.23%] [Thread] tsi::runtime::TsavRT::processResponses
5212 1363.6420 0.2616 1363.6420 └─ [ 7.95%] tsi::runtime::executeWithTimeout

[Thread] OPU (cumulative over all threads)

1     0.0630    0.0630      0.0400  [3.67e-04%] [Thread] OPU 
1     0.0230    0.0230      0.0230  └─ [1.34e-04%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

5210 88.1890 0.0169 80.9660 [5.14e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
5210 7.2230 0.0014 7.2230 └─ [4.21e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

5212 18.1380 0.0035 18.1380 [1.06e-01%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

5210 23.0340 0.0044 23.0340 [1.34e-01%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

5210 1780.2040 0.3417 1780.2040 [10.38%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

5210 7.1500 0.0014 7.1500 [4.17e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 17153.7190    0.0000  17153.7190  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9990

[akapoor@ws01 llama.cpp]$
[akapoor@ws01 llama.cpp]$
[akapoor@ws01 llama.cpp]$
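
The columns in the profile above appear to be related in a simple way: `T/call` looks like `Total(ms)` divided by `Calls`, `Self(ms)` looks like `Total(ms)` minus the totals of the direct child rows, and the bracketed percentage looks like `Total(ms)` over the `TOTAL` row. This is inferred from the numbers in the log, not from the profiler source, so treat it as an assumption. A minimal Python sketch that rechecks the `TsavRTPosix::loadBlob` row (the helper name `derived` is hypothetical):

```python
# Values copied from the Posix-run profile above.
TOTAL_MS = 17153.7190            # "TOTAL" row
LOAD_BLOB_CALLS = 5210           # Calls for TsavRTPosix::loadBlob
LOAD_BLOB_TOTAL_MS = 2360.9790   # Total(ms) for TsavRTPosix::loadBlob
# Total(ms) of the rows listed directly under loadBlob:
# executeWithTimeout and "LOAD_BLOB Command Execution".
LOAD_BLOB_CHILD_TOTALS_MS = [2174.7240, 2.3230]


def derived(calls, total_ms, child_totals_ms, grand_total_ms):
    """Recompute T/call, Self(ms) and the percentage column.

    Assumes (not confirmed by the profiler source) that Self is the
    row total minus its direct children, and that the percentage is
    the row total relative to the TOTAL row.
    """
    t_per_call = total_ms / calls
    self_ms = total_ms - sum(child_totals_ms)
    percent = 100.0 * total_ms / grand_total_ms
    return t_per_call, self_ms, percent


t_per_call, self_ms, percent = derived(
    LOAD_BLOB_CALLS, LOAD_BLOB_TOTAL_MS, LOAD_BLOB_CHILD_TOTALS_MS, TOTAL_MS
)
# Compare against the logged row: 0.4532, 183.9320, [13.76%]
print(f"T/call={t_per_call:.4f}  Self(ms)={self_ms:.4f}  share={percent:.2f}%")
```

Under the same assumption, the `unloadBlob`, `processResponses` and `finalizeCommandList` rows also reproduce their logged `Self(ms)` values, which suggests that most of the time attributed to these functions is spent inside `executeWithTimeout`.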

ggml/src/ggml-backend.cpp

@FIR-1063 - llama.cpp/ggml/tsavorite support for sakanaAI Model

dbe335a

akapoor3518 requested review from Nithyanand-G, atrivedi-tsavoritesi, dineshReddy6381, dmpatra, gkethamallax, mikeuhler and mmankal as code owners November 5, 2025 17:08

atrivedi-tsavoritesi reviewed Nov 5, 2025

ggml/src/ggml-backend.cpp (review thread resolved)

Address Ashish's comment

0841f09

atrivedi-tsavoritesi approved these changes Nov 5, 2025

akapoor3518 merged commit 1ec2a7b into master Nov 5, 2025
