@akapoor3518
Tested on POSIX and FPGA.

##########
LOG AT FPGA
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# mv tsi-ggml tsi-ggml-old-sept4
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin#
*** file: tsi-ggml-0.0.7.tz
$ sz -vv tsi-ggml-0.0.7.tz
Sending: tsi-ggml-0.0.7.tz
Bytes Sent:14067152 BPS:89461

Transfer complete

*** exit status: 0 ***
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# tar -zxvf tsi-ggml-0.0.7.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# cd tsi-ggml
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin/tsi-ggml# ls
blobs libggml-cpu.so libllama.so
ggml.sh libggml-tsavorite.so llama-cli
libggml-base.so libggml.so simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin/tsi-ggml# cd ..
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# vi run_llama_cli.sh
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# ./run_llama_cli.sh
is Luna.
I'm a cat

llama_perf_sampler_print: sampling time = 229.68 ms / 16 runs ( 14.36 ms per token, 69.66 tokens per second)
llama_perf_context_print: load time = 118627.98 ms
llama_perf_context_print: prompt eval time = 43300.26 ms / 6 tokens ( 7216.71 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 197338.34 ms / 9 runs (21926.48 ms per token, 0.05 tokens per second)
llama_perf_context_print: total time = 316277.84 ms / 15 tokens

=== GGML Perf Summary ===
Op Runs Total us Avg us
ADD 440 1715409 3898.66
MUL 670 2136430 3188.70
RMS_NORM 1455 88723 60.98
MUL_MAT 7265 684468244 94214.49
CPY 1220 75759 62.10
CONT 465 4497 9.67
RESHAPE 1804 18618 10.32
VIEW 1538 2511 1.63
PERMUTE 1506 2396 1.59
TRANSPOSE 397 872 2.20
GET_ROWS 93 34568 371.70
SOFT_MAX 634 90675 143.02
ROPE 1486 113372 76.29
UNARY 220 926676 4212.16
-> SILU 220 926676 4212.16

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1 30627.6100 30627.6100     34.9140  [11.16%] [Thread] OPU 
1 30592.6960 30592.6960  30572.8270  └─ [11.14%] tsi::runtime::TsavRTFPGA::initialize
1     8.7030    8.7030      8.7030    └─ [3.17e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     7.9080    7.9080      7.9080    └─ [2.88e-03%] tsi::runtime::TsavRT::initialize
1     3.2580    3.2580      2.6770    └─ [1.19e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.5810    0.2905      0.5810      └─ [2.12e-04%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1860 2043.4190 1.0986 0.0000 [7.44e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1860 4.80e+05 258.1167 4.80e+05 └─ [174.86%] TXE 0 Idle
990 868.7119 0.8775 868.7119 └─ [3.16e-01%] [ txe_mult ]
220 486.2079 2.2100 486.2079 └─ [1.77e-01%] [ txe_silu ]
650 397.5625 0.6116 397.5625 └─ [1.45e-01%] [ txe_add ]

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

1860 1097.9780 0.5903 1063.8030 [4.00e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1860 34.1750 0.0184 34.1750 └─ [1.24e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

1860 2435.0450 1.3092 302.1230 [8.87e-01%] [Thread] tsi::runtime::TsavRT::processResponses
1860 2132.9220 1.1467 2132.9220 └─ [7.77e-01%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    59.0930   59.0930     46.2950  [2.15e-02%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    12.7980   12.7980     12.7980  └─ [4.66e-03%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

1861 151.9130 0.0816 151.9130 [5.53e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

1860 729.6860 0.3923 729.6860 [2.66e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

1860 129.0960 0.0694 129.0960 [4.70e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

1860 144.8350 0.0779 144.8350 [5.28e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

1860 30.7080 0.0165 30.7080 [1.12e-02%] [Thread] tsi::runtime::TsavRT::deallocate

-   2.75e+05    0.0000    2.75e+05  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.7347

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin#
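
To make the FPGA perf summary above easier to read at a glance, here is a minimal Python sketch that turns the "Total us" column into per-op shares. The numbers are copied from the log; the names `FPGA_OPS` and `time_shares` are made up for illustration and are not part of ggml or the TSI runtime. It assumes the "Total us" values can simply be summed across ops, which the log itself does not confirm.

```python
# Rows copied from the "GGML Perf Summary" in the FPGA log: op -> (runs, total_us).
# The "-> SILU" row is left out because it repeats the UNARY row.
FPGA_OPS = {
    "ADD": (440, 1715409),
    "MUL": (670, 2136430),
    "RMS_NORM": (1455, 88723),
    "MUL_MAT": (7265, 684468244),
    "CPY": (1220, 75759),
    "CONT": (465, 4497),
    "RESHAPE": (1804, 18618),
    "VIEW": (1538, 2511),
    "PERMUTE": (1506, 2396),
    "TRANSPOSE": (397, 872),
    "GET_ROWS": (93, 34568),
    "SOFT_MAX": (634, 90675),
    "ROPE": (1486, 113372),
    "UNARY": (220, 926676),
}


def time_shares(ops):
    """Return each op's fraction of the summed 'Total us' column (hypothetical helper)."""
    total = sum(total_us for _, total_us in ops.values())
    return {name: total_us / total for name, (_, total_us) in ops.items()}


shares = time_shares(FPGA_OPS)

# Print the three most expensive ops by share of summed op time.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{name:10s} {share:7.2%}")
```

On these numbers MUL_MAT accounts for more than 99% of the summed op time, far ahead of MUL, ADD and SILU.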

########
LOG AT POSIX
$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 100 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I'm a cat person and I love my cat Luna. She is a very cute cat and I love her fur. She is a very smart cat and she can do many things. She can climb trees and she can jump over furniture. She is a very loyal cat and she always protects her owner. She is a very loving cat and she always cares for her owner. She is a very kind and loving cat. She

llama_perf_sampler_print: sampling time = 283.74 ms / 106 runs ( 2.68 ms per token, 373.58 tokens per second)
llama_perf_context_print: load time = 4338.00 ms
llama_perf_context_print: prompt eval time = 3584.60 ms / 6 tokens ( 597.43 ms per token, 1.67 tokens per second)
llama_perf_context_print: eval time = 55487.47 ms / 99 runs ( 560.48 ms per token, 1.78 tokens per second)
llama_perf_context_print: total time = 60182.25 ms / 105 tokens

=== GGML Perf Summary ===
Op Runs Total us Avg us
ADD 4400 8490859 1929.74
MUL 6700 10622988 1585.52
RMS_NORM 17158 61077 3.56
MUL_MAT 79134 118307809 1495.03
CPY 15803 45947 2.91
CONT 8212 7395 0.90
RESHAPE 32642 13517 0.41
VIEW 30531 4583 0.15
PERMUTE 30428 5335 0.18
TRANSPOSE 6968 1220 0.18
GET_ROWS 1091 2699 2.47
SOFT_MAX 8663 211878 24.46
ROPE 16760 84704 5.05
UNARY 2200 5109963 2322.71
-> SILU 2200 5109963 2322.71
is

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    18.2390   18.2390      0.2090  [2.96e-02%] [Thread] OPU 
1    18.0300   18.0300      8.5340  └─ [2.93e-02%] tsi::runtime::TsavRTPosix::initialize
1     9.2870    9.2870      1.1550    └─ [1.51e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1     6.9500    6.9500      6.9500      └─ [1.13e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     1.1120    1.1120      1.1120      └─ [1.80e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0700    0.0700      0.0610      └─ [1.14e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0090    0.0090      0.0090        └─ [1.46e-05%] tsi::runtime::executeWithTimeout
1     0.2090    0.2090      0.2090    └─ [3.39e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     3.5150    3.5150      3.2750  [5.70e-03%] [Thread] tsi::runtime::TsavRT::finalize
1     0.2330    0.2330      0.0450  └─ [3.78e-04%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.1880    0.1880      0.0740    └─ [3.05e-04%] tsi::runtime::TsavRT::executeSyncCommand
1     0.0830    0.0830      0.0830      └─ [1.35e-04%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0310    0.0310      0.0280      └─ [5.03e-05%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0030    0.0030      0.0030        └─ [4.87e-06%] tsi::runtime::executeWithTimeout
2     0.0070    0.0035      0.0070  └─ [1.14e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

13830 12611.0150 0.9119 475.4130 [20.46%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
27660 12130.6360 0.4386 12130.6360 └─ [19.68%] tsi::runtime::executeWithTimeout
13830 4.9660 3.59e-04 4.9660 └─ [8.06e-03%] LOAD_BLOB Command Execution
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2181038720[0x820...
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

13830 3917.9920 0.2833 462.9980 [ 6.36%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
27660 3448.0140 0.1247 3448.0140 └─ [ 5.59%] tsi::runtime::executeWithTimeout
13830 6.9800 5.05e-04 6.9800 └─ [1.13e-02%] UNLOAD_BLOB Command Execution
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2181038720[0x8...
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

13832 5354.6700 0.3871 79.7010 [ 8.69%] [Thread] tsi::runtime::TsavRT::processResponses
13832 5274.9690 0.3814 5274.9690 └─ [ 8.56%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

13830 200.3660 0.0145 183.4950 [3.25e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
13830 16.8710 0.0012 16.8710 └─ [2.74e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

13831 49.8910 0.0036 49.8910 [8.09e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

13830 60.3770 0.0044 60.3770 [9.80e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

13830 6980.3560 0.5047 6980.3560 [11.33%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

13830 21.5560 0.0016 21.5560 [3.50e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 61632.0460    0.0000  61632.0460  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9995

[akapoor@wssw01 llama.cpp]$
[akapoor@wssw01 llama.cpp]$
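
For a rough side-by-side of the two logs, the sketch below divides the FPGA "Avg us" per op by the POSIX "Avg us" for the same op. The values are copied from the two GGML Perf Summary tables; the dictionary and variable names are illustrative only. Treat the ratios as indicative: the FPGA run produced 9 eval tokens against 99 on POSIX, and the arguments inside `run_llama_cli.sh` are not shown, so the two runs may not be configured identically.

```python
# "Avg us" per op, copied from the two GGML Perf Summary tables in this comment.
FPGA_AVG_US = {"ADD": 3898.66, "MUL": 3188.70, "MUL_MAT": 94214.49, "SILU": 4212.16}
POSIX_AVG_US = {"ADD": 1929.74, "MUL": 1585.52, "MUL_MAT": 1495.03, "SILU": 2322.71}

# Ratio > 1 means the op took longer per run in the FPGA log than in the POSIX log.
slowdown = {op: FPGA_AVG_US[op] / POSIX_AVG_US[op] for op in FPGA_AVG_US}

for op, ratio in sorted(slowdown.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:8s} FPGA/POSIX avg latency ratio: {ratio:6.2f}x")
```

ADD, MUL and SILU come out at roughly 2x, while MUL_MAT is about 63x slower per run in the FPGA log.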

@akapoor3518 akapoor3518 merged commit 8def574 into master Sep 4, 2025
@atrivedi-tsavoritesi atrivedi-tsavoritesi deleted the FIR-938 branch September 4, 2025 18:38