@FIR-938 - LLama.cpp-GGML: Enable Support for 4D Tensor Data (Rank 4) #49
Tested on POSIX and FPGA.
##########
LOG AT FPGA
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# mv tsi-ggml tsi-ggml-old-sept4
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin#
*** file: tsi-ggml-0.0.7.tz
$ sz -vv tsi-ggml-0.0.7.tz
Sending: tsi-ggml-0.0.7.tz
Bytes Sent:14067152 BPS:89461
Transfer complete
*** exit status: 0 ***
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# tar -zxvf tsi-ggml-0.0.7.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# cd tsi-ggml
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin/tsi-ggml# ls
blobs libggml-cpu.so libllama.so
ggml.sh libggml-tsavorite.so llama-cli
libggml-base.so libggml.so simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin/tsi-ggml# cd ..
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# vi run_llama_cli.sh
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin# ./run_llama_cli.sh
is Luna.
I'm a cat
llama_perf_sampler_print: sampling time = 229.68 ms / 16 runs ( 14.36 ms per token, 69.66 tokens per second)
llama_perf_context_print: load time = 118627.98 ms
llama_perf_context_print: prompt eval time = 43300.26 ms / 6 tokens ( 7216.71 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 197338.34 ms / 9 runs (21926.48 ms per token, 0.05 tokens per second)
llama_perf_context_print: total time = 316277.84 ms / 15 tokens
=== GGML Perf Summary ===
Op Runs Total us Avg us
ADD 440 1715409 3898.66
MUL 670 2136430 3188.70
RMS_NORM 1455 88723 60.98
MUL_MAT 7265 684468244 94214.49
CPY 1220 75759 62.10
CONT 465 4497 9.67
RESHAPE 1804 18618 10.32
VIEW 1538 2511 1.63
PERMUTE 1506 2396 1.59
TRANSPOSE 397 872 2.20
GET_ROWS 93 34568 371.70
SOFT_MAX 634 90675 143.02
ROPE 1486 113372 76.29
UNARY 220 926676 4212.16
-> SILU 220 926676 4212.16
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
1860 2043.4190 1.0986 0.0000 [7.44e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1860 4.80e+05 258.1167 4.80e+05 └─ [174.86%] TXE 0 Idle
990 868.7119 0.8775 868.7119 └─ [3.16e-01%] [ txe_mult ]
220 486.2079 2.2100 486.2079 └─ [1.77e-01%] [ txe_silu ]
650 397.5625 0.6116 397.5625 └─ [1.45e-01%] [ txe_add ]
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
1860 1097.9780 0.5903 1063.8030 [4.00e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1860 34.1750 0.0184 34.1750 └─ [1.24e-02%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
1860 2435.0450 1.3092 302.1230 [8.87e-01%] [Thread] tsi::runtime::TsavRT::processResponses
1860 2132.9220 1.1467 2132.9220 └─ [7.77e-01%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
1861 151.9130 0.0816 151.9130 [5.53e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)
1860 729.6860 0.3923 729.6860 [2.66e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
1860 129.0960 0.0694 129.0960 [4.70e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)
1860 144.8350 0.0779 144.8350 [5.28e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
1860 30.7080 0.0165 30.7080 [1.12e-02%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.7347
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv35_08_22_2025/bin#
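The per-token figures in the `llama_perf_*` lines of the FPGA log follow directly from the elapsed time and the token count. The sketch below re-derives them as a sanity check. The numbers are copied from the log above; the helper name `tokens_per_second` and the dictionary layout are illustrative assumptions, not part of llama.cpp.

```python
def tokens_per_second(total_ms: float, tokens: int) -> float:
    """Throughput implied by an elapsed time in milliseconds and a token count."""
    return tokens / (total_ms / 1000.0)


# (elapsed ms, token or run count) pairs copied from the FPGA llama_perf lines.
fpga_perf = {
    "sampling": (229.68, 16),
    "prompt eval": (43300.26, 6),
    "eval": (197338.34, 9),
}

for stage, (ms, n) in fpga_perf.items():
    print(
        f"{stage}: {ms / n:.2f} ms per token, "
        f"{tokens_per_second(ms, n):.2f} tokens per second"
    )
```

Running the same helper on the POSIX `llama_perf_*` lines gives a like-for-like view of throughput, since the two runs generated different numbers of tokens.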
########
LOG AT POSIX
$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 100 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I'm a cat person and I love my cat Luna. She is a very cute cat and I love her fur. She is a very smart cat and she can do many things. She can climb trees and she can jump over furniture. She is a very loyal cat and she always protects her owner. She is a very loving cat and she always cares for her owner. She is a very kind and loving cat. She
llama_perf_sampler_print: sampling time = 283.74 ms / 106 runs ( 2.68 ms per token, 373.58 tokens per second)
llama_perf_context_print: load time = 4338.00 ms
llama_perf_context_print: prompt eval time = 3584.60 ms / 6 tokens ( 597.43 ms per token, 1.67 tokens per second)
llama_perf_context_print: eval time = 55487.47 ms / 99 runs ( 560.48 ms per token, 1.78 tokens per second)
llama_perf_context_print: total time = 60182.25 ms / 105 tokens
=== GGML Perf Summary ===
Op Runs Total us Avg us
ADD 4400 8490859 1929.74
MUL 6700 10622988 1585.52
RMS_NORM 17158 61077 3.56
MUL_MAT 79134 118307809 1495.03
CPY 15803 45947 2.91
CONT 8212 7395 0.90
RESHAPE 32642 13517 0.41
VIEW 30531 4583 0.15
PERMUTE 30428 5335 0.18
TRANSPOSE 6968 1220 0.18
GET_ROWS 1091 2699 2.47
SOFT_MAX 8663 211878 24.46
ROPE 16760 84704 5.05
UNARY 2200 5109963 2322.71
-> SILU 2200 5109963 2322.71
is
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
13830 12611.0150 0.9119 475.4130 [20.46%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
27660 12130.6360 0.4386 12130.6360 └─ [19.68%] tsi::runtime::executeWithTimeout
13830 4.9660 3.59e-04 4.9660 └─ [8.06e-03%] LOAD_BLOB Command Execution
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2181038720[0x820...
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
13830 3917.9920 0.2833 462.9980 [ 6.36%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
27660 3448.0140 0.1247 3448.0140 └─ [ 5.59%] tsi::runtime::executeWithTimeout
13830 6.9800 5.05e-04 6.9800 └─ [1.13e-02%] UNLOAD_BLOB Command Execution
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2181038720[0x8...
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
13832 5354.6700 0.3871 79.7010 [ 8.69%] [Thread] tsi::runtime::TsavRT::processResponses
13832 5274.9690 0.3814 5274.9690 └─ [ 8.56%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
13830 200.3660 0.0145 183.4950 [3.25e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
13830 16.8710 0.0012 16.8710 └─ [2.74e-02%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
13831 49.8910 0.0036 49.8910 [8.09e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
13830 60.3770 0.0044 60.3770 [9.80e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
13830 6980.3560 0.5047 6980.3560 [11.33%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
13830 21.5560 0.0016 21.5560 [3.50e-02%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.9995
[akapoor@wssw01 llama.cpp]$
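To see where the time goes in each run, the two "GGML Perf Summary" tables can be reduced to a share per op. The sketch below uses the "Total us" column copied from the logs above. It assumes the per-op totals do not overlap, so their sum approximates the profiled op time; the logs do not state this. The variable and function names are illustrative only.

```python
# "Total us" per op, copied from the two GGML Perf Summary tables.
# The SILU row is a breakdown of UNARY, so it is not counted twice.
fpga_total_us = {
    "ADD": 1715409, "MUL": 2136430, "RMS_NORM": 88723, "MUL_MAT": 684468244,
    "CPY": 75759, "CONT": 4497, "RESHAPE": 18618, "VIEW": 2511,
    "PERMUTE": 2396, "TRANSPOSE": 872, "GET_ROWS": 34568,
    "SOFT_MAX": 90675, "ROPE": 113372, "UNARY": 926676,
}
posix_total_us = {
    "ADD": 8490859, "MUL": 10622988, "RMS_NORM": 61077, "MUL_MAT": 118307809,
    "CPY": 45947, "CONT": 7395, "RESHAPE": 13517, "VIEW": 4583,
    "PERMUTE": 5335, "TRANSPOSE": 1220, "GET_ROWS": 2699,
    "SOFT_MAX": 211878, "ROPE": 84704, "UNARY": 5109963,
}

# "Avg us" for MUL_MAT, copied from the same tables.
mul_mat_avg_us = {"FPGA": 94214.49, "POSIX": 1495.03}


def share(totals: dict, op: str) -> float:
    """Fraction of the summed per-op time spent in one op."""
    return totals[op] / sum(totals.values())


for name, totals in (("FPGA", fpga_total_us), ("POSIX", posix_total_us)):
    print(f"{name}: MUL_MAT is {share(totals, 'MUL_MAT'):.1%} of summed op time")

ratio = mul_mat_avg_us["FPGA"] / mul_mat_avg_us["POSIX"]
print(f"MUL_MAT average time per call, FPGA vs POSIX: {ratio:.1f}x")
```

The totals are not directly comparable across the two runs because the FPGA run produced 9 tokens and the POSIX run 99; the share within each run and the average time per call are the like-for-like figures.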