@FIR-938 - LLama.cpp-GGML: Enable Support for 4D Tensor Data (Rank 4) #49

akapoor3518 · 2025-09-04T18:28:04Z

Tested on POSIX and FPGA; logs from both runs follow.

##########
LOG AT FPGA
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# mv tsi-ggml tsi-ggml-old-sept4
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin#
*** file: tsi-ggml-0.0.7.tz
$ sz -vv tsi-ggml-0.0.7.tz
Sending: tsi-ggml-0.0.7.tz
Bytes Sent:14067152 BPS:89461

Transfer complete

*** exit status: 0 ***
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# tar -zxvf tsi-ggml-0.0.7.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# cd tsi-ggml
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin/tsi-ggml# ls
blobs libggml-cpu.so libllama.so
ggml.sh libggml-tsavorite.so llama-cli
libggml-base.so libggml.so simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin/tsi-ggml# cd ..
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# vi run_llama_cli.sh
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# ./run_llama_cli.sh
is Luna.
I'm a cat

llama_perf_sampler_print: sampling time = 229.68 ms / 16 runs ( 14.36 ms per token, 69.66 tokens per second)
llama_perf_context_print: load time = 118627.98 ms
llama_perf_context_print: prompt eval time = 43300.26 ms / 6 tokens ( 7216.71 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 197338.34 ms / 9 runs (21926.48 ms per token, 0.05 tokens per second)
llama_perf_context_print: total time = 316277.84 ms / 15 tokens

=== GGML Perf Summary ===
Op           Runs   Total us     Avg us
ADD           440    1715409    3898.66
MUL           670    2136430    3188.70
RMS_NORM     1455      88723      60.98
MUL_MAT      7265  684468244   94214.49
CPY          1220      75759      62.10
CONT          465       4497       9.67
RESHAPE      1804      18618      10.32
VIEW         1538       2511       1.63
PERMUTE      1506       2396       1.59
TRANSPOSE     397        872       2.20
GET_ROWS       93      34568     371.70
SOFT_MAX      634      90675     143.02
ROPE         1486     113372      76.29
UNARY         220     926676    4212.16
  -> SILU     220     926676    4212.16

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1 30627.6100 30627.6100     34.9140  [11.16%] [Thread] OPU
1 30592.6960 30592.6960  30572.8270  └─ [11.14%] tsi::runtime::TsavRTFPGA::initialize
1     8.7030    8.7030      8.7030    └─ [3.17e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     7.9080    7.9080      7.9080    └─ [2.88e-03%] tsi::runtime::TsavRT::initialize
1     3.2580    3.2580      2.6770    └─ [1.19e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.5810    0.2905      0.5810      └─ [2.12e-04%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1860 2043.4190 1.0986 0.0000 [7.44e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1860 4.80e+05 258.1167 4.80e+05 └─ [174.86%] TXE 0 Idle
990 868.7119 0.8775 868.7119 └─ [3.16e-01%] [ txe_mult ]
220 486.2079 2.2100 486.2079 └─ [1.77e-01%] [ txe_silu ]
650 397.5625 0.6116 397.5625 └─ [1.45e-01%] [ txe_add ]

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

1860 1097.9780 0.5903 1063.8030 [4.00e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1860 34.1750 0.0184 34.1750 └─ [1.24e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

1860 2435.0450 1.3092 302.1230 [8.87e-01%] [Thread] tsi::runtime::TsavRT::processResponses
1860 2132.9220 1.1467 2132.9220 └─ [7.77e-01%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    59.0930   59.0930     46.2950  [2.15e-02%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    12.7980   12.7980     12.7980  └─ [4.66e-03%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

1861 151.9130 0.0816 151.9130 [5.53e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

1860 729.6860 0.3923 729.6860 [2.66e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

1860 129.0960 0.0694 129.0960 [4.70e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

1860 144.8350 0.0779 144.8350 [5.28e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

1860 30.7080 0.0165 30.7080 [1.12e-02%] [Thread] tsi::runtime::TsavRT::deallocate

-   2.75e+05    0.0000    2.75e+05  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.7347

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin#

########
LOG AT POSIX
$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 100 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I'm a cat person and I love my cat Luna. She is a very cute cat and I love her fur. She is a very smart cat and she can do many things. She can climb trees and she can jump over furniture. She is a very loyal cat and she always protects her owner. She is a very loving cat and she always cares for her owner. She is a very kind and loving cat. She

llama_perf_sampler_print: sampling time = 283.74 ms / 106 runs ( 2.68 ms per token, 373.58 tokens per second)
llama_perf_context_print: load time = 4338.00 ms
llama_perf_context_print: prompt eval time = 3584.60 ms / 6 tokens ( 597.43 ms per token, 1.67 tokens per second)
llama_perf_context_print: eval time = 55487.47 ms / 99 runs ( 560.48 ms per token, 1.78 tokens per second)
llama_perf_context_print: total time = 60182.25 ms / 105 tokens

=== GGML Perf Summary ===
Op           Runs   Total us     Avg us
ADD          4400    8490859    1929.74
MUL          6700   10622988    1585.52
RMS_NORM    17158      61077       3.56
MUL_MAT     79134  118307809    1495.03
CPY         15803      45947       2.91
CONT         8212       7395       0.90
RESHAPE     32642      13517       0.41
VIEW        30531       4583       0.15
PERMUTE     30428       5335       0.18
TRANSPOSE    6968       1220       0.18
GET_ROWS     1091       2699       2.47
SOFT_MAX     8663     211878      24.46
ROPE        16760      84704       5.05
UNARY        2200    5109963    2322.71
  -> SILU    2200    5109963    2322.71
is

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    18.2390   18.2390      0.2090  [2.96e-02%] [Thread] OPU 
1    18.0300   18.0300      8.5340  └─ [2.93e-02%] tsi::runtime::TsavRTPosix::initialize
1     9.2870    9.2870      1.1550    └─ [1.51e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1     6.9500    6.9500      6.9500      └─ [1.13e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     1.1120    1.1120      1.1120      └─ [1.80e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0700    0.0700      0.0610      └─ [1.14e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0090    0.0090      0.0090        └─ [1.46e-05%] tsi::runtime::executeWithTimeout
1     0.2090    0.2090      0.2090    └─ [3.39e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     3.5150    3.5150      3.2750  [5.70e-03%] [Thread] tsi::runtime::TsavRT::finalize
1     0.2330    0.2330      0.0450  └─ [3.78e-04%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.1880    0.1880      0.0740    └─ [3.05e-04%] tsi::runtime::TsavRT::executeSyncCommand
1     0.0830    0.0830      0.0830      └─ [1.35e-04%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0310    0.0310      0.0280      └─ [5.03e-05%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0030    0.0030      0.0030        └─ [4.87e-06%] tsi::runtime::executeWithTimeout
2     0.0070    0.0035      0.0070  └─ [1.14e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

13830 12611.0150 0.9119 475.4130 [20.46%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
27660 12130.6360 0.4386 12130.6360 └─ [19.68%] tsi::runtime::executeWithTimeout
13830 4.9660 3.59e-04 4.9660 └─ [8.06e-03%] LOAD_BLOB Command Execution
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2181038720[0x820...
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

13830 3917.9920 0.2833 462.9980 [ 6.36%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
27660 3448.0140 0.1247 3448.0140 └─ [ 5.59%] tsi::runtime::executeWithTimeout
13830 6.9800 5.05e-04 6.9800 └─ [1.13e-02%] UNLOAD_BLOB Command Execution
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2181038720[0x8...
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

13832 5354.6700 0.3871 79.7010 [ 8.69%] [Thread] tsi::runtime::TsavRT::processResponses
13832 5274.9690 0.3814 5274.9690 └─ [ 8.56%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

13830 200.3660 0.0145 183.4950 [3.25e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
13830 16.8710 0.0012 16.8710 └─ [2.74e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

13831 49.8910 0.0036 49.8910 [8.09e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

13830 60.3770 0.0044 60.3770 [9.80e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

13830 6980.3560 0.5047 6980.3560 [11.33%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

13830 21.5560 0.0016 21.5560 [3.50e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 61632.0460    0.0000  61632.0460  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9995

[akapoor@wssw01 llama.cpp]$
[akapoor@wssw01 llama.cpp]$
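
For anyone comparing the two runs, the "GGML Perf Summary" tables above can be reduced to a per-op share of the summed op time. The following is a rough sketch, not part of this change: the function names `parse_perf_summary` and `op_shares` are made up for illustration, and it assumes the summary keeps the four-column row layout shown in the logs. The sample rows are copied from the POSIX summary.

```python
# Sketch only: rows copied verbatim from the POSIX "GGML Perf Summary" above.
POSIX_SUMMARY = """\
Op Runs Total us Avg us
ADD 4400 8490859 1929.74
MUL 6700 10622988 1585.52
RMS_NORM 17158 61077 3.56
MUL_MAT 79134 118307809 1495.03
CPY 15803 45947 2.91
CONT 8212 7395 0.90
RESHAPE 32642 13517 0.41
VIEW 30531 4583 0.15
PERMUTE 30428 5335 0.18
TRANSPOSE 6968 1220 0.18
GET_ROWS 1091 2699 2.47
SOFT_MAX 8663 211878 24.46
ROPE 16760 84704 5.05
UNARY 2200 5109963 2322.71
-> SILU 2200 5109963 2322.71
"""


def parse_perf_summary(text):
    """Return {op: (runs, total_us)} for top-level rows only (hypothetical helper)."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Top-level rows have exactly four fields: op, runs, total us, avg us.
        # The header (6 fields) and "-> SILU" breakdown rows (5 fields) are
        # skipped, so SILU time is not counted a second time on top of UNARY.
        if len(parts) != 4 or not parts[1].isdigit():
            continue
        rows[parts[0]] = (int(parts[1]), int(parts[2]))
    return rows


def op_shares(rows):
    """Return {op: fraction of the summed 'Total us' column} (hypothetical helper)."""
    total = sum(total_us for _, total_us in rows.values())
    return {op: total_us / total for op, (_, total_us) in rows.items()}


if __name__ == "__main__":
    shares = op_shares(parse_perf_summary(POSIX_SUMMARY))
    top = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:4]
    for op, share in top:
        print(f"{op:10s} {share:6.1%}")
```

On the POSIX rows, MUL_MAT comes out as the largest entry at a little over 80% of the summed op time. The same helper should work on the FPGA table if its rows are pasted in place of the sample.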

@FIR-938 - LLama.cpp-GGML: Enable Support for 4D Tensor Data (Rank 4)

e000fae

akapoor3518 requested review from Nithyanand-G, atrivedi-tsavoritesi, dineshReddy6381, dmpatra and gkethamallax September 4, 2025 18:28

dmpatra approved these changes Sep 4, 2025

atrivedi-tsavoritesi approved these changes Sep 4, 2025

Updated the Rank

9240a18

akapoor3518 merged commit 8def574 into master Sep 4, 2025

atrivedi-tsavoritesi deleted the FIR-938 branch September 4, 2025 18:38
