@FIR-1063 - llama.cpp/ggml/tsavorite support for sakanaAI Model #76
To address the Sakana issue, I've implemented a temporary fix. With this change, prompt responses now work correctly for Tiny, Gemma (4, 16, 32), and Sakana on both the Tsavorite and CPU backends.
The fix identifies RESHAPE, TRANSPOSE, and VIEW operations. If one of these ops has a source (SRC) tensor, it is offloaded to the CPU. As a result, for Sakana, some ADD operations (specifically those acting as sources for RESHAPE or VIEW) are now offloaded to the CPU, while the remaining ADD ops continue to run on Tsavorite.
To unblock the team, we will commit this fix. I will work on a longer-term solution as a priority.
I am currently testing on FPGA.
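For reviewers who want a concrete picture of the routing rule described above, here is a minimal sketch. It does not use the real ggml structures: `Op`, `Target`, `Node`, `is_layout_op`, and `assign_targets` are hypothetical names invented for illustration, and the graph is a flat vector in which each node stores the index of at most one source. The sketch assumes that both the layout op and its direct producer are marked for the CPU, and that only the direct source is considered, not a longer chain; the actual change in this PR may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for ggml ops and backends (not the real ggml types).
enum class Op { ADD, MUL, RMS_NORM, RESHAPE, TRANSPOSE, VIEW };
enum class Target { OPU, CPU };

struct Node {
    Op  op;
    int src;  // index of the source node in the graph, or -1 if there is none
};

// RESHAPE, TRANSPOSE and VIEW are the ops the temporary fix looks for.
static bool is_layout_op(Op op) {
    return op == Op::RESHAPE || op == Op::TRANSPOSE || op == Op::VIEW;
}

// Returns one target per node. Every node starts on the OPU; a layout op
// that has a source is moved to the CPU together with that direct source.
std::vector<Target> assign_targets(const std::vector<Node>& graph) {
    std::vector<Target> targets(graph.size(), Target::OPU);
    for (std::size_t i = 0; i < graph.size(); ++i) {
        const Node& n = graph[i];
        if (!is_layout_op(n.op) || n.src < 0) {
            continue;  // not a layout op, or no source tensor to inspect
        }
        const std::size_t src = static_cast<std::size_t>(n.src);
        assert(src < graph.size());
        targets[i]   = Target::CPU;  // the RESHAPE/TRANSPOSE/VIEW itself
        targets[src] = Target::CPU;  // e.g. an ADD feeding a RESHAPE
    }
    return targets;
}
```

With a graph such as `{ADD, RESHAPE(src = 0), ADD, MUL(src = 2)}`, the first ADD and the RESHAPE come back as `Target::CPU` while the second ADD and the MUL stay on `Target::OPU`. That matches the split between `ADD CPU` and `ADD OPU` rows that appears in the Sakana perf summaries in the logs.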
FPGA
*** exit status: 0 ***
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# tar -zxvf tsi-ggml-0.2.0.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_sigmoid_16.blob
tsi-ggml/blobs/txe_silu_16.blob
tsi-ggml/blobs/txe_swiglu_16.blob
tsi-ggml/blobs/txe_rms_norm_16.blob
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/blobs/txe_swiglu.blob
tsi-ggml/blobs/txe_rms_norm.blob
tsi-ggml/blobs/txe_soft_max.blob
tsi-ggml/blobs/txe_add_16.blob
tsi-ggml/blobs/txe_sub_16.blob
tsi-ggml/blobs/txe_mult_16.blob
tsi-ggml/blobs/txe_div_16.blob
tsi-ggml/blobs/txe_abs_16.blob
tsi-ggml/blobs/txe_neg_16.blob
tsi-ggml/blobs/txe_sqrt_16.blob
tsi-ggml/blobs/txe_sqr_16.blob
tsi-ggml/blobs/txe_inv_16.blob
tsi-ggml/blobs/txe_sin_16.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ls -lrt
-rwxr--r-- 1 root root 47540 Jan 1 1970 vecadd_blob_tvu_main.so
-rwxr--r-- 1 root root 47540 Jan 1 1970 vecadd_blob_tvu_main.blob
-rwxr-xr-x 1 root root 389008 Jan 1 1970 tsi_txe_kernel.bin
-rwxr-xr-x 1 root root 821 Jan 1 1970 tsi_shutdown.sh
-rwxr-xr-x 1 root root 617 Jan 1 1970 tsi_env.sh
-rw-r--r-- 1 root root 207 Jan 1 1970 tsiShutdown.service
-rwxr-xr-x 1 root root 1231 Jan 1 1970 tnApcMgr_run.sh
-rw-r--r-- 1 root root 239 Jan 1 1970 tnApcMgr.service
-rwxr-xr-x 1 root root 327040 Jan 1 1970 tnApcMgr
-rwxr-xr-x 1 root root 80784 Jan 1 1970 sys-diagtool
-rwxr-xr-x 1 root root 1962 Jan 1 1970 run_platform_test.sh
-rwxr-xr-x 1 root root 1557 Jan 1 1970 run_llama_cli.sh
-rwxr-xr-x 1 root root 81256 Jan 1 1970 recvFromHost
-rw-r--r-- 1 root root 546 Jan 1 1970 platform_layout.json
-rwxr-xr-x 1 root root 154944 Jan 1 1970 UAP
drwx------ 7 101006 100003 504 Jan 1 1970 aot-tests
drwxr-xr-x 3 root root 808 Jan 1 1970 tsi-ggml-orig
drwxr-xr-x 3 root root 808 Mar 9 14:02 tsi-ggml
-rw-r--r-- 1 root root 16601481 Nov 5 2025 tsi-ggml-0.2.0.tz
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh
is Luna.
llama_perf_sampler_print: sampling time = 107.93 ms / 11 runs ( 9.81 ms per token, 101.91 tokens per second)
llama_perf_context_print: load time = 86881.95 ms
llama_perf_context_print: prompt eval time = 44125.64 ms / 6 tokens ( 7354.27 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 50143.09 ms / 4 runs (12535.77 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 137167.17 ms / 10 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD OPU 484 694 1718635 3550.90
MUL OPU 495 710 943556 1906.17
RMS_NORM OPU 495 495 997128 2014.40
MUL_MAT CPU 8167 0 521220669 63820.33
CONT CPU 1379 0 1672561 1212.88
RESHAPE CPU 1253 0 16040 12.80
VIEW CPU 2010 0 2970 1.48
PERMUTE CPU 1506 0 2556 1.70
TRANSPOSE CPU 413 0 783 1.90
GET_ROWS CPU 84 0 18691 222.51
SET_ROWS CPU 1557 0 597034 383.45
SOFT_MAX CPU 586 0 1002339 1710.48
ROPE CPU 1482 0 101612 68.56
GLU OPU 242 347 1088490 4497.89
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
1310 1014.8600 0.7747 0.0000 [7.28e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1310 1.36e+05 103.6987 1.36e+05 └─ [97.48%] TXE 0 Idle
215 196.9513 0.9161 196.9513 └─ [1.41e-01%] [ txe_swiglu ]
225 140.6387 0.6251 140.6387 └─ [1.01e-01%] [ txe_rms_norm ]
440 133.3295 0.3030 133.3295 └─ [9.57e-02%] [ txe_mult ]
430 126.4495 0.2941 126.4495 └─ [9.07e-02%] [ txe_add ]
[Thread] OPU (cumulative over all threads)
##########
lrwxrwxrwx 1 root root 36 Mar 12 2018 Tiny-Llama-v0.3-FP32-1.1B-F16.gguf -> Tiny-Llama-v0.3-FP32-1.1B-F16:latest
./run_llama_cli.sh "ヨーロッパには何人の国がありますか" 10 SakanaAI-TinySwallow-1.5B-Instruct-F32:latest
ANOOP Calling ggml_backend_tsavorite_reg
ヨーロッパには 45
llama_perf_sampler_print: sampling time = 931.55 ms / 70 runs ( 13.31 ms per token, 75.14 tokens per second)
llama_perf_context_print: load time = 522881.40 ms
llama_perf_context_print: prompt eval time = 495618.08 ms / 60 tokens ( 8260.30 ms per token, 0.12 tokens per second)
llama_perf_context_print: eval time = 199627.62 ms / 9 runs (22180.85 ms per token, 0.05 tokens per second)
llama_perf_context_print: total time = 723601.15 ms / 69 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 11802 0 529967 44.90
ADD OPU 2576 5762 11165138 4334.29
MUL OPU 2622 5867 7685276 2931.07
RMS_NORM OPU 2622 2622 5023000 1915.71
MUL_MAT CPU 45495 0 3835513054 84306.25
CONT CPU 7607 0 6103730 802.38
RESHAPE CPU 6730 0 15790 2.35
VIEW CPU 10151 0 8373 0.82
PERMUTE CPU 8313 0 8930 1.07
TRANSPOSE CPU 2131 0 2826 1.33
GET_ROWS CPU 356 0 80435 225.94
SET_ROWS CPU 8229 0 254505 30.93
SOFT_MAX CPU 3615 0 3176380 878.67
ROPE CPU 6875 0 636442 92.57
GLU OPU 1288 2881 9329732 7243.58
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
10004 6879.7890 0.6877 0.0000 [9.48e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
10004 7.15e+05 71.5179 7.15e+05 └─ [98.56%] TXE 0 Idle
1873 2679.1265 1.4304 2679.1265 └─ [3.69e-01%] [ txe_swiglu ]
3815 946.7990 0.2482 946.7990 └─ [1.30e-01%] [ txe_mult ]
3746 918.4545 0.2452 918.4545 └─ [1.27e-01%] [ txe_add ]
570 800.3795 1.4042 800.3795 └─ [1.10e-01%] [ txe_rms_norm ]
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
10004 5465.9730 0.5464 5343.0030 [7.53e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
10004 122.9700 0.0123 122.9700 └─ [1.69e-02%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
10004 11013.0930 1.1009 259.8990 [ 1.52%] [Thread] tsi::runtime::TsavRT::processResponses
10004 10753.1940 1.0749 10753.1940 └─ [ 1.48%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
10006 210.4630 0.0210 210.4630 [2.90e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)
10004 1800.3680 0.1800 1800.3680 [2.48e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
10004 401.7860 0.0402 401.7860 [5.54e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)
10004 748.1970 0.0748 748.1970 [1.03e-01%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
10004 101.4360 0.0101 101.4360 [1.40e-02%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.9483
FPGA CPU
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh "ヨーロッパには何人の国がありますか" 10 SakanaAI-TinySwallow-1.5B-Instruct-F32:latest none
ANOOP Calling ggml_backend_tsavorite_reg
ヨーロッパには 45
llama_perf_sampler_print: sampling time = 932.43 ms / 70 runs ( 13.32 ms per token, 75.07 tokens per second)
llama_perf_context_print: load time = 46228.02 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 164685.49 ms / 9 runs (18298.39 ms per token, 0.05 tokens per second)
llama_perf_context_print: total time = 193636.26 ms / 10 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 19495 0 975511 50.04
MUL CPU 7718 0 572598 74.19
RMS_NORM CPU 8163 0 299411 36.68
MUL_MAT CPU 35395 0 3096367955 87480.38
CPY CPU 122 0 133717 1096.04
RESHAPE CPU 8242 0 33716 4.09
VIEW CPU 10324 0 15942 1.54
PERMUTE CPU 6099 0 9819 1.61
GET_ROWS CPU 458 0 73570 160.63
SET_ROWS CPU 7842 0 139176 17.75
ROPE CPU 7186 0 528814 73.59
FLASH_ATTN_EXT CPU 4855 0 8201764 1689.34
GLU CPU 4197 0 2337854 557.03
##########
POSIX LOGS
[akapoor@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "ヨーロッパには何人の国がありますか" -m /proj/rel/sw/ggml/models/SakanaAI-TinySwallow-1.5B-Instruct-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 30 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
ANOOP Calling ggml_backend_tsavorite_reg
ヨーロッパには 45 の国があります。
ただし、国境の変更や再編成により、この数
llama_perf_sampler_print: sampling time = 315.10 ms / 49 runs ( 6.43 ms per token, 155.50 tokens per second)
llama_perf_context_print: load time = 12221.21 ms
llama_perf_context_print: prompt eval time = 10035.68 ms / 19 tokens ( 528.19 ms per token, 1.89 tokens per second)
llama_perf_context_print: eval time = 17384.59 ms / 29 runs ( 599.47 ms per token, 1.67 tokens per second)
llama_perf_context_print: total time = 29940.84 ms / 48 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 136214 0 372414 2.73
ADD OPU 24416 25388 34265322 1403.40
MUL OPU 24852 25842 15672949 630.65
RMS_NORM OPU 24852 24852 16474278 662.90
MUL_MAT CPU 439954 0 661593306 1503.78
CONT CPU 93635 0 4786123 51.11
RESHAPE CPU 121538 0 28315 0.23
VIEW CPU 208741 0 27167 0.13
PERMUTE CPU 169804 0 30144 0.18
TRANSPOSE CPU 41690 0 7460 0.18
GET_ROWS CPU 3590 0 7322 2.04
SET_ROWS CPU 84933 0 62283 0.73
SOFT_MAX CPU 43515 0 1392158 31.99
ROPE CPU 95391 0 476439 4.99
GLU OPU 12208 12694 22051834 1806.34
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
8388 2886.7670 0.3442 273.2030 [ 9.62%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
16776 2610.8050 0.1556 2610.8050 └─ [ 8.70%] tsi::runtime::executeWithTimeout
8388 2.7590 3.29e-04 2.7590 └─ [9.20e-03%] LOAD_BLOB Command Execution
8388 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148533888[0x801...
8388 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
8388 1598.2810 0.1905 298.9620 [ 5.33%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
16776 1295.0250 0.0772 1295.0250 └─ [ 4.32%] tsi::runtime::executeWithTimeout
8388 4.2940 5.12e-04 4.2940 └─ [1.43e-02%] UNLOAD_BLOB Command Execution
8388 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148533888[0x8...
8388 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
8390 2438.3930 0.2906 73.4990 [ 8.13%] [Thread] tsi::runtime::TsavRT::processResponses
8390 2364.8940 0.2819 2364.8940 └─ [ 7.88%] tsi::runtime::executeWithTimeout
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
8388 137.2170 0.0164 125.0650 [4.57e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
8388 12.1520 0.0014 12.1520 └─ [4.05e-02%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
8390 23.0920 0.0028 23.0920 [7.70e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
8388 35.0850 0.0042 35.0850 [1.17e-01%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
8388 2837.5130 0.3383 2837.5130 [ 9.46%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
8388 10.7640 0.0013 10.7640 [3.59e-02%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.9993
[akapoor@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "ヨーロッパには何人の国がありますか" -m /proj/rel/sw/ggml/models/SakanaAI-TinySwallow-1.5B-Instruct-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 30 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
ANOOP Calling ggml_backend_tsavorite_reg
ヨーロッパには 45 の国があります。
ただし、国境の変更や再編成により、この数
llama_perf_sampler_print: sampling time = 322.77 ms / 49 runs ( 6.59 ms per token, 151.81 tokens per second)
llama_perf_context_print: load time = 9632.12 ms
llama_perf_context_print: prompt eval time = 7444.34 ms / 19 tokens ( 391.81 ms per token, 2.55 tokens per second)
llama_perf_context_print: eval time = 11306.47 ms / 29 runs ( 389.88 ms per token, 2.56 tokens per second)
llama_perf_context_print: total time = 21281.04 ms / 48 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 237341 0 993761 4.19
MUL CPU 87628 0 416268 4.75
RMS_NORM CPU 90071 0 225576 2.50
MUL_MAT CPU 342193 0 633988009 1852.72
CPY CPU 1691 0 8609 5.09
RESHAPE CPU 171297 0 35733 0.21
VIEW CPU 211084 0 22456 0.11
PERMUTE CPU 125522 0 13009 0.10
GET_ROWS CPU 4905 0 9540 1.94
SET_ROWS CPU 86704 0 56701 0.65
ROPE CPU 95240 0 439377 4.61
FLASH_ATTN_EXT CPU 48691 0 3749784 77.01
GLU CPU 47226 0 2027012 42.92
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
========================================================================================================================
- 21333.6950 0.0000 21333.6950 [100.00%] TOTAL
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.6667
[akapoor@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 30 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
ANOOP Calling ggml_backend_tsavorite_reg
is Luna.
I'm a cat person and I love my cat Luna. She is a very cute cat and I love her
llama_perf_sampler_print: sampling time = 67.94 ms / 36 runs ( 1.89 ms per token, 529.86 tokens per second)
llama_perf_context_print: load time = 23988.12 ms
llama_perf_context_print: prompt eval time = 1606.61 ms / 6 tokens ( 267.77 ms per token, 3.73 tokens per second)
llama_perf_context_print: eval time = 7504.90 ms / 29 runs ( 258.79 ms per token, 3.86 tokens per second)
llama_perf_context_print: total time = 31567.85 ms / 35 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 68857 0 447926 6.51
MUL CPU 59608 0 367592 6.17
RMS_NORM CPU 62212 0 205387 3.30
MUL_MAT CPU 268652 0 422290079 1571.89
CPY CPU 1718 0 7318 4.26
RESHAPE CPU 144715 0 61375 0.42
VIEW CPU 160633 0 20828 0.13
PERMUTE CPU 97730 0 10773 0.11
GET_ROWS CPU 4439 0 9266 2.09
SET_ROWS CPU 61566 0 44231 0.72
ROPE CPU 74540 0 301755 4.05
FLASH_ATTN_EXT CPU 38327 0 3268596 85.28
GLU CPU 34713 0 905402 26.08
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
========================================================================================================================
- 31621.2470 0.0000 31621.2470 [100.00%] TOTAL
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.6667
[akapoor@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 30 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
ANOOP Calling ggml_backend_tsavorite_reg
is Luna.
I'm a cat person and I love my cat Luna. She is a very cute cat and I love her
llama_perf_sampler_print: sampling time = 74.78 ms / 36 runs ( 2.08 ms per token, 481.40 tokens per second)
llama_perf_context_print: load time = 3009.39 ms
llama_perf_context_print: prompt eval time = 2410.34 ms / 6 tokens ( 401.72 ms per token, 2.49 tokens per second)
llama_perf_context_print: eval time = 14009.46 ms / 29 runs ( 483.08 ms per token, 2.07 tokens per second)
llama_perf_context_print: total time = 17100.80 ms / 35 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD OPU 19184 19394 31088495 1620.54
MUL OPU 19620 19835 13235653 674.60
RMS_NORM OPU 19620 19620 14978110 763.41
MUL_MAT CPU 344994 0 479866487 1390.94
CONT CPU 72517 0 4061080 56.00
RESHAPE CPU 108256 0 54067 0.50
VIEW CPU 161091 0 21013 0.13
PERMUTE CPU 130732 0 23040 0.18
TRANSPOSE CPU 28537 0 6656 0.23
GET_ROWS CPU 3482 0 8464 2.43
SET_ROWS CPU 63729 0 55527 0.87
SOFT_MAX CPU 33348 0 2730224 81.87
ROPE CPU 73331 0 357391 4.87
GLU OPU 9592 9697 16964720 1768.63
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
5210 2360.9790 0.4532 183.9320 [13.76%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
10420 2174.7240 0.2087 2174.7240 └─ [12.68%] tsi::runtime::executeWithTimeout
5210 2.3230 4.46e-04 2.3230 └─ [1.35e-02%] LOAD_BLOB Command Execution
5210 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
5210 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
5210 1103.4970 0.2118 195.4310 [ 6.43%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
10420 904.9970 0.0869 904.9970 └─ [ 5.28%] tsi::runtime::executeWithTimeout
5210 3.0690 5.89e-04 3.0690 └─ [1.79e-02%] UNLOAD_BLOB Command Execution
5210 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
5210 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
5212 1411.7130 0.2709 48.0710 [ 8.23%] [Thread] tsi::runtime::TsavRT::processResponses
5212 1363.6420 0.2616 1363.6420 └─ [ 7.95%] tsi::runtime::executeWithTimeout
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
5210 88.1890 0.0169 80.9660 [5.14e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
5210 7.2230 0.0014 7.2230 └─ [4.21e-02%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
5212 18.1380 0.0035 18.1380 [1.06e-01%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
5210 23.0340 0.0044 23.0340 [1.34e-01%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
5210 1780.2040 0.3417 1780.2040 [10.38%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
5210 7.1500 0.0014 7.1500 [4.17e-02%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.9990