@FIR-979 - llama.cpp update to latest SDK(sdk-r.0.1.9) #54
Tested on FPGA and POSIX.
FPGA Result
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# tar -zxvf tsi-ggml-0.0.8.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# cd /tsi-ggml
-sh: cd: /tsi-ggml: No such file or directory
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# cd tsi-ggml
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# ls
blobs libggml-cpu.so libllama.so
ggml.sh libggml-tsavorite.so llama-cli
libggml-base.so libggml.so simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# ./ggml.sh
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# ./simple-backend-tsi
load_model: using TSavorite backend
Calculating mem_size 384 2 and creating ggml context
Creating input Tensor
Creating Backend Buffer
Loading Input Tensor Data to Backend Buffer
Bringing tensor data from Backend buffer and printing 32 tensor data:
[ 1.10 2.30 3.20 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 ]
main: compute buffer size: 0.2500 KB
Under Test case for compute API creating build_graph
Compute Done
operation type: add, num of elements 32
compute is also done
Index 0: expected bits 400ccccd, actual bits 400ccccd
Index 1: expected bits 40900000, actual bits 40900000
Index 2: expected bits 40d00000, actual bits 40d00000
Index 3: expected bits 41000000, actual bits 41000000
Index 4: expected bits 41200000, actual bits 41200000
Index 5: expected bits 41400000, actual bits 41400000
Index 6: expected bits 41600000, actual bits 41600000
Index 7: expected bits 41800000, actual bits 41800000
Index 8: expected bits 41900000, actual bits 41900000
Index 9: expected bits 41a00000, actual bits 41a00000
Index 10: expected bits 41b00000, actual bits 41b00000
Index 11: expected bits 41c00000, actual bits 41c00000
Index 12: expected bits 41d00000, actual bits 41d00000
Index 13: expected bits 41e00000, actual bits 41e00000
Index 14: expected bits 41f00000, actual bits 41f00000
Index 15: expected bits 42000000, actual bits 42000000
Index 16: expected bits 42080000, actual bits 42080000
Index 17: expected bits 42100000, actual bits 42100000
Index 18: expected bits 42180000, actual bits 42180000
Index 19: expected bits 42200000, actual bits 42200000
Index 20: expected bits 42280000, actual bits 42280000
Index 21: expected bits 42300000, actual bits 42300000
Index 22: expected bits 42380000, actual bits 42380000
Index 23: expected bits 42400000, actual bits 42400000
Index 24: expected bits 42480000, actual bits 42480000
Index 25: expected bits 42500000, actual bits 42500000
Index 26: expected bits 42580000, actual bits 42580000
Index 27: expected bits 42600000, actual bits 42600000
Index 28: expected bits 42680000, actual bits 42680000
Index 29: expected bits 42700000, actual bits 42700000
Index 30: expected bits 42780000, actual bits 42780000
Index 31: expected bits 42800000, actual bits 42800000
TEST CASE PASSED
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
========================================================================================================================
- 2080.0950 0.0000 2080.0950 [100.00%] TOTAL
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.3333
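The `expected bits` / `actual bits` lines in the `simple-backend-tsi` output above compare results by their 32-bit float bit patterns rather than by printed decimal values. The following is a minimal sketch of that kind of check, not the test's actual implementation; the helper names `float_to_bits`, `bits_to_float`, and `compare_bitwise` are hypothetical and only serve to show how the hex values in the log map to floats.

```python
import struct


def float_to_bits(value: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of value as an integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def compare_bitwise(expected, actual):
    """Return the indices at which the float32 bit patterns differ."""
    return [
        i
        for i, (e, a) in enumerate(zip(expected, actual))
        if float_to_bits(e) != float_to_bits(a)
    ]


if __name__ == "__main__":
    # Bit patterns copied from the log above (indices 0, 1 and 31).
    for index, bits in [(0, 0x400CCCCD), (1, 0x40900000), (31, 0x42800000)]:
        print(f"Index {index}: bits {bits:08x} -> {bits_to_float(bits)}")
```

Comparing bit patterns makes the pass criterion exact: two results that print identically at two decimal places can still differ in their low-order bits, and this check would flag them.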
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# cd ..
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# ./run_llama_cli.sh
is Luna.
llama_perf_sampler_print: sampling time = 109.04 ms / 11 runs ( 9.91 ms per token, 100.88 tokens per second)
llama_perf_context_print: load time = 24766.44 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 49392.86 ms / 4 runs (12348.22 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 61805.17 ms / 5 tokens
=== GGML Perf Summary ===
Op Runs Total us Avg us
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
356 243.2140 0.6832 0.0000 [3.80e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
356 61035.1231 171.4470 61035.1231 └─ [95.43%] TXE 0 Idle
180 48.7143 0.2706 48.7143 └─ [7.62e-02%] [ txe_mult ]
176 47.9831 0.2726 47.9831 └─ [7.50e-02%] [ txe_add ]
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
356 288.4650 0.8103 278.2360 [4.51e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
356 10.2290 0.0287 10.2290 └─ [1.60e-02%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
356 408.2510 1.1468 23.7530 [6.38e-01%] [Thread] tsi::runtime::TsavRT::processResponses
356 384.4980 1.0801 384.4980 └─ [6.01e-01%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
358 37.2500 0.1041 37.2500 [5.82e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)
356 147.2580 0.4136 147.2580 [2.30e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
356 31.5090 0.0885 31.5090 [4.93e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)
356 28.4240 0.0798 28.4240 [4.44e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
356 5.4670 0.0154 5.4670 [8.55e-03%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.6313
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin#
POSIX Result
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I'm a cat
llama_perf_sampler_print: sampling time = 28.20 ms / 16 runs ( 1.76 ms per token, 567.40 tokens per second)
llama_perf_context_print: load time = 13464.16 ms
llama_perf_context_print: prompt eval time = 3526.06 ms / 6 tokens ( 587.68 ms per token, 1.70 tokens per second)
llama_perf_context_print: eval time = 7123.48 ms / 9 runs ( 791.50 ms per token, 1.26 tokens per second)
llama_perf_context_print: total time = 20618.80 ms / 15 tokens
=== GGML Perf Summary ===
Op Runs Total us Avg us
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
1315 1513.7180 1.1511 50.9540 [ 6.68%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
2630 1462.0330 0.5559 1462.0330 └─ [ 6.45%] tsi::runtime::executeWithTimeout
1315 0.7310 5.56e-04 0.7310 └─ [3.22e-03%] LOAD_BLOB Command Execution
1315 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148008576[0x800...
1315 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
1315 413.5820 0.3145 47.6210 [ 1.82%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
2630 365.3530 0.1389 365.3530 └─ [ 1.61%] tsi::runtime::executeWithTimeout
1315 0.6080 4.62e-04 0.6080 └─ [2.68e-03%] UNLOAD_BLOB Command Execution
1315 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148008576[0x8...
1315 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
1317 504.8030 0.3833 9.7470 [ 2.23%] [Thread] tsi::runtime::TsavRT::processResponses
1317 495.0560 0.3759 495.0560 └─ [ 2.18%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
1315 19.6310 0.0149 17.8480 [8.66e-02%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1315 1.7830 0.0014 1.7830 └─ [7.86e-03%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
1317 6.7330 0.0051 6.7330 [2.97e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
1315 5.9530 0.0045 5.9530 [2.63e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
1315 942.2370 0.7165 942.2370 [ 4.16%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
1315 2.3790 0.0018 2.3790 [1.05e-02%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.9987
[akapoor@wssw01 llama.cpp]$
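The `tokens per second` figures on the `llama_perf_context_print` lines follow directly from the reported time and token count. The sketch below recomputes the eval throughput for both runs from the numbers in the logs above; `tokens_per_second` is a hypothetical helper, not part of llama.cpp. The comparison is only rough, since the FPGA run went through `run_llama_cli.sh` (whose arguments are not shown here) and reports 4 eval runs against 9 on POSIX.

```python
def tokens_per_second(total_ms: float, tokens: int) -> float:
    """Convert a total time in milliseconds and a token count to tokens per second."""
    if total_ms <= 0:
        # Matches the "inf tokens per second" case reported for a 0.00 ms prompt eval.
        return float("inf")
    return tokens * 1000.0 / total_ms


# Eval figures copied from the two logs above.
fpga_eval = tokens_per_second(49392.86, 4)
posix_eval = tokens_per_second(7123.48, 9)

if __name__ == "__main__":
    print(f"FPGA  eval: {fpga_eval:.2f} tokens/s")
    print(f"POSIX eval: {posix_eval:.2f} tokens/s")
    print(f"POSIX / FPGA ratio: {posix_eval / fpga_eval:.1f}x")
```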