@akapoor3518

Tested on FPGA and POSIX.

FPGA Result
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# tar -zxvf tsi-ggml-0.0.8.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# cd /tsi-ggml
-sh: cd: /tsi-ggml: No such file or directory
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# cd tsi-ggml
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# ls
blobs libggml-cpu.so libllama.so
ggml.sh libggml-tsavorite.so llama-cli
libggml-base.so libggml.so simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# ./ggml.sh
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# ./simple-backend-tsi
load_model: using TSavorite backend

Calculating mem_size 384 2 and creating ggml context

Creating input Tensor

Creating Backend Buffer

Loading Input Tensor Data to Backend Buffer

Bringing tensor data from Backend buffer and printing 32 tensor data:
[ 1.10 2.30 3.20 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 ]
main: compute buffer size: 0.2500 KB

Under Test case for compute API creating build_graph

Compute Done

operation type: add, num of elements 32

compute is also done
Index 0: expected bits 400ccccd, actual bits 400ccccd
Index 1: expected bits 40900000, actual bits 40900000
Index 2: expected bits 40d00000, actual bits 40d00000
Index 3: expected bits 41000000, actual bits 41000000
Index 4: expected bits 41200000, actual bits 41200000
Index 5: expected bits 41400000, actual bits 41400000
Index 6: expected bits 41600000, actual bits 41600000
Index 7: expected bits 41800000, actual bits 41800000
Index 8: expected bits 41900000, actual bits 41900000
Index 9: expected bits 41a00000, actual bits 41a00000
Index 10: expected bits 41b00000, actual bits 41b00000
Index 11: expected bits 41c00000, actual bits 41c00000
Index 12: expected bits 41d00000, actual bits 41d00000
Index 13: expected bits 41e00000, actual bits 41e00000
Index 14: expected bits 41f00000, actual bits 41f00000
Index 15: expected bits 42000000, actual bits 42000000
Index 16: expected bits 42080000, actual bits 42080000
Index 17: expected bits 42100000, actual bits 42100000
Index 18: expected bits 42180000, actual bits 42180000
Index 19: expected bits 42200000, actual bits 42200000
Index 20: expected bits 42280000, actual bits 42280000
Index 21: expected bits 42300000, actual bits 42300000
Index 22: expected bits 42380000, actual bits 42380000
Index 23: expected bits 42400000, actual bits 42400000
Index 24: expected bits 42480000, actual bits 42480000
Index 25: expected bits 42500000, actual bits 42500000
Index 26: expected bits 42580000, actual bits 42580000
Index 27: expected bits 42600000, actual bits 42600000
Index 28: expected bits 42680000, actual bits 42680000
Index 29: expected bits 42700000, actual bits 42700000
Index 30: expected bits 42780000, actual bits 42780000
Index 31: expected bits 42800000, actual bits 42800000

TEST CASE PASSED
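The check above compares raw float32 bit patterns rather than printed decimals. A minimal Python sketch of that kind of comparison follows. It is not the test's actual code: the helper names are hypothetical, and the assumption that index 0 is 1.10 added to itself is made only for illustration, since the log does not show the second operand.

```python
import struct

def f32_bits(x: float) -> int:
    """Round x to float32 and return its raw 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Assumption for illustration: element 0 (1.10) is added to itself.
a0 = bits_f32(f32_bits(1.10))      # 1.10 as stored in float32
sum_bits = f32_bits(a0 + a0)       # doubling is exact in binary floating point
print(f"index 0: {sum_bits:08x}")

# Decode a few of the logged patterns back into values.
for bits in (0x400ccccd, 0x40900000, 0x42800000):
    print(f"{bits:08x} -> {bits_f32(bits)!r}")
```

Comparing bit patterns is stricter than comparing rounded decimals, so a pass here means the results match bit for bit.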

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    19.5840   19.5840     18.4450  [9.41e-01%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     0.5580    0.5580      0.5580  └─ [2.68e-02%] tsi::runtime::TsavRTFPGA::initializeQueues
1     0.4340    0.4340      0.4340  └─ [2.09e-02%] tsi::runtime::TsavRT::initialize
1     0.1470    0.1470      0.0980  └─ [7.07e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.0490    0.0245      0.0490    └─ [2.36e-03%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1     0.1860    0.1860      0.0000  [8.94e-03%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1     5.0024    5.0024      5.0024  └─ [2.40e-01%] TXE 0 Idle
1     0.0959    0.0959      0.0959  └─ [4.61e-03%] [ txe_add ]

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

1     0.1260    0.1260      0.1180  [6.06e-03%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1     0.0080    0.0080      0.0080  └─ [3.85e-04%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

1     0.2430    0.2430      0.2330  [1.17e-02%] [Thread] tsi::runtime::TsavRT::processResponses
1     0.0100    0.0100      0.0100  └─ [4.81e-04%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    53.9940   53.9940     53.3080  [ 2.60%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1     0.6860    0.6860      0.6860  └─ [3.30e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] OPU (cumulative over all threads)

1     0.3420    0.3420      0.3420  [1.64e-02%] [Thread] OPU 

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

3     0.0980    0.0327      0.0980  [4.71e-03%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

1     0.7790    0.7790      0.7790  [3.75e-02%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

1     0.1990    0.1990      0.1990  [9.57e-03%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

1     0.0370    0.0370      0.0370  [1.78e-03%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

1     0.0210    0.0210      0.0210  [1.01e-03%] [Thread] tsi::runtime::TsavRT::deallocate

========================================================================================================================
- 2080.0950 0.0000 2080.0950 [100.00%] TOTAL

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.3333
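The bracketed percentages in this table appear to be each row's Total(ms) divided by the TOTAL row. The sketch below checks that reading against three rows copied from the output above. The function name is hypothetical and the interpretation is an assumption, not taken from the profiler's source.

```python
def share_percent(total_ms: float, grand_total_ms: float) -> float:
    """Return a row's share of the grand total, in percent."""
    return 100.0 * total_ms / grand_total_ms

GRAND_TOTAL_MS = 2080.0950  # TOTAL row of the simple-backend-tsi run

# Total(ms) values copied from the table above.
rows = {
    "TsavRTFPGA::initialize": 19.5840,  # logged as 9.41e-01%
    "TsavRTFPGA::finalize": 53.9940,    # logged as 2.60%
    "txe_add": 0.0959,                  # logged as 4.61e-03%
}

shares = {name: share_percent(ms, GRAND_TOTAL_MS) for name, ms in rows.items()}
for name, pct in shares.items():
    print(f"{name:24s} {pct:.3g}%")
```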

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# cd ..
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# ./run_llama_cli.sh
is Luna.

llama_perf_sampler_print: sampling time = 109.04 ms / 11 runs ( 9.91 ms per token, 100.88 tokens per second)
llama_perf_context_print: load time = 24766.44 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 49392.86 ms / 4 runs (12348.22 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 61805.17 ms / 5 tokens

=== GGML Perf Summary ===
Op Runs Total us Avg us

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    20.8360   20.8360     19.4370  [3.26e-02%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     0.6810    0.6810      0.6810  └─ [1.06e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     0.5200    0.5200      0.5200  └─ [8.13e-04%] tsi::runtime::TsavRT::initialize
1     0.1980    0.1980      0.1480  └─ [3.10e-04%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.0500    0.0250      0.0500    └─ [7.82e-05%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

356 243.2140 0.6832 0.0000 [3.80e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
356 61035.1231 171.4470 61035.1231 └─ [95.43%] TXE 0 Idle
180 48.7143 0.2706 48.7143 └─ [7.62e-02%] [ txe_mult ]
176 47.9831 0.2726 47.9831 └─ [7.50e-02%] [ txe_add ]

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

356 288.4650 0.8103 278.2360 [4.51e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
356 10.2290 0.0287 10.2290 └─ [1.60e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

356 408.2510 1.1468 23.7530 [6.38e-01%] [Thread] tsi::runtime::TsavRT::processResponses
356 384.4980 1.0801 384.4980 └─ [6.01e-01%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    76.3370   76.3370     57.5970  [1.19e-01%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    18.7400   18.7400     18.7400  └─ [2.93e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

358 37.2500 0.1041 37.2500 [5.82e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] OPU (cumulative over all threads)

1     5.0620    5.0620      5.0620  [7.91e-03%] [Thread] OPU 

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

356 147.2580 0.4136 147.2580 [2.30e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

356 31.5090 0.0885 31.5090 [4.93e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

356 28.4240 0.0798 28.4240 [4.44e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

356 5.4670 0.0154 5.4670 [8.55e-03%] [Thread] tsi::runtime::TsavRT::deallocate

- 63956.4440    0.0000  63956.4440  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.6313

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin#
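In the llama-cli run on FPGA, the `TXE 0 Idle` row dominates the total while the two kernels account for a small fraction. The sketch below works that out from the numbers in the table above. Reading `TXE 0 Idle` as time the TXE spent waiting is an assumption based on the label alone.

```python
TOTAL_MS = 63956.4440      # TOTAL row of the FPGA llama-cli run
IDLE_MS = 61035.1231       # "TXE 0 Idle" row

# (calls, total ms) per kernel, copied from the table above.
KERNELS = {
    "txe_mult": (180, 48.7143),
    "txe_add": (176, 47.9831),
}

kernel_calls = sum(calls for calls, _ in KERNELS.values())
kernel_ms = sum(ms for _, ms in KERNELS.values())

idle_pct = 100.0 * IDLE_MS / TOTAL_MS
kernel_pct = 100.0 * kernel_ms / TOTAL_MS
avg_kernel_ms = kernel_ms / kernel_calls

print(f"kernel calls:     {kernel_calls}")
print(f"kernel time:      {kernel_ms:.4f} ms ({kernel_pct:.3f}% of total)")
print(f"avg per call:     {avg_kernel_ms:.4f} ms")
print(f"TXE 0 Idle share: {idle_pct:.2f}%")
```

The kernel call count matches the 356 calls reported for `awaitCommandListCompletion`, which suggests one kernel launch per command list in this run.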

POSIX Result
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I'm a cat

llama_perf_sampler_print: sampling time = 28.20 ms / 16 runs ( 1.76 ms per token, 567.40 tokens per second)
llama_perf_context_print: load time = 13464.16 ms
llama_perf_context_print: prompt eval time = 3526.06 ms / 6 tokens ( 587.68 ms per token, 1.70 tokens per second)
llama_perf_context_print: eval time = 7123.48 ms / 9 runs ( 791.50 ms per token, 1.26 tokens per second)
llama_perf_context_print: total time = 20618.80 ms / 15 tokens

=== GGML Perf Summary ===
Op Runs Total us Avg us

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    15.8340   15.8340      7.7810  [6.98e-02%] [Thread] tsi::runtime::TsavRTPosix::initialize
1     7.9120    7.9120      0.2610  └─ [3.49e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1     6.6080    6.6080      6.6080    └─ [2.91e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.9050    0.9050      0.9050    └─ [3.99e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.1380    0.1380      0.1280    └─ [6.09e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0100    0.0100      0.0100      └─ [4.41e-05%] tsi::runtime::executeWithTimeout
1     0.1410    0.1410      0.1410  └─ [6.22e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     4.5580    4.5580      2.9040  [2.01e-02%] [Thread] tsi::runtime::TsavRT::finalize
1     1.6440    1.6440      0.0570  └─ [7.25e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     1.5870    1.5870      0.0950    └─ [7.00e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     1.4340    1.4340      1.4340      └─ [6.32e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0580    0.0580      0.0420      └─ [2.56e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0160    0.0160      0.0160        └─ [7.06e-05%] tsi::runtime::executeWithTimeout
2     0.0100    0.0050      0.0100  └─ [4.41e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

1315 1513.7180 1.1511 50.9540 [ 6.68%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
2630 1462.0330 0.5559 1462.0330 └─ [ 6.45%] tsi::runtime::executeWithTimeout
1315 0.7310 5.56e-04 0.7310 └─ [3.22e-03%] LOAD_BLOB Command Execution
1315 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148008576[0x800...
1315 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

1315 413.5820 0.3145 47.6210 [ 1.82%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
2630 365.3530 0.1389 365.3530 └─ [ 1.61%] tsi::runtime::executeWithTimeout
1315 0.6080 4.62e-04 0.6080 └─ [2.68e-03%] UNLOAD_BLOB Command Execution
1315 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148008576[0x8...
1315 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

1317 504.8030 0.3833 9.7470 [ 2.23%] [Thread] tsi::runtime::TsavRT::processResponses
1317 495.0560 0.3759 495.0560 └─ [ 2.18%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

1315 19.6310 0.0149 17.8480 [8.66e-02%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1315 1.7830 0.0014 1.7830 └─ [7.86e-03%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

1317 6.7330 0.0051 6.7330 [2.97e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] OPU (cumulative over all threads)

1     0.0530    0.0530      0.0530  [2.34e-04%] [Thread] OPU 

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

1315 5.9530 0.0045 5.9530 [2.63e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1315 942.2370 0.7165 942.2370 [ 4.16%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

1315 2.3790 0.0018 2.3790 [1.05e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 22672.1370    0.0000  22672.1370  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9987

[akapoor@wssw01 llama.cpp]$
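The `llama_perf_context_print` eval lines from the two runs can be compared by recomputing ms per token and tokens per second. The sketch below does this; the regex and function name are hypothetical. The FPGA run was started through `run_llama_cli.sh`, whose arguments are not shown, and the runs generated different numbers of tokens, so the ratio is only a rough indication.

```python
import re

# Matches the "eval time" line but not the "prompt eval time" line.
EVAL_RE = re.compile(
    r"llama_perf_context_print:\s+eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) runs"
)

def eval_rate(line: str) -> tuple[float, float]:
    """Return (ms per token, tokens per second) from an eval-time line."""
    m = EVAL_RE.search(line)
    if m is None:
        raise ValueError("not an eval-time line")
    total_ms, runs = float(m.group(1)), int(m.group(2))
    ms_per_token = total_ms / runs
    # A 0.00 ms measurement yields an infinite rate, as in the FPGA prompt-eval line.
    tokens_per_s = 1000.0 / ms_per_token if ms_per_token > 0 else float("inf")
    return ms_per_token, tokens_per_s

fpga_line = ("llama_perf_context_print: eval time = 49392.86 ms / 4 runs "
             "(12348.22 ms per token, 0.08 tokens per second)")
posix_line = ("llama_perf_context_print: eval time = 7123.48 ms / 9 runs "
              "( 791.50 ms per token, 1.26 tokens per second)")

fpga_ms, fpga_tps = eval_rate(fpga_line)
posix_ms, posix_tps = eval_rate(posix_line)

print(f"FPGA : {fpga_ms:.2f} ms/token, {fpga_tps:.4f} tokens/s")
print(f"POSIX: {posix_ms:.2f} ms/token, {posix_tps:.4f} tokens/s")
print(f"FPGA/POSIX ms-per-token ratio: {fpga_ms / posix_ms:.1f}")
```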

@dineshReddy6381 left a comment

Approved

@akapoor3518 merged commit d799844 into master on Sep 24, 2025
@akapoor3518 deleted the FIR-979 branch on September 24, 2025 17:42