@akapoor3518 akapoor3518 commented Oct 22, 2025

Validated on POSIX and FPGA.
[akapoor@wssw01 ggml]$ ls -lrt
total 1373444
drwxr-xr-x 3 akapoor tsiusers 4096 May 5 13:06 vendor
drwxr-xr-x 2 akapoor tsiusers 8192 May 12 13:46 lib
-rw-r--r-- 1 atrivedi tsiusers 10930369 May 25 12:04 tsi-ggml-0.0.1.tz
-rw-r--r-- 1 atrivedi tsiusers 14485488 Jun 5 11:11 tsi-ggml-0.0.2.tz
-rw-r--r-- 1 atrivedi tsiusers 14476846 Jun 18 12:10 tsi-ggml-0.0.3.tz
-rw-r--r-- 1 akapoor tsiusers 14481013 Jul 2 10:18 tsi-ggml-0.0.4.tz
-rw-r--r-- 1 atrivedi tsiusers 14066216 Jul 11 09:47 tsi-ggml-0.0.5.tz
-rw-r--r-- 1 atrivedi tsiusers 14066294 Aug 15 15:17 tsi-ggml-0.0.6.tz
-rw-r--r-- 1 atrivedi tsiusers 14067550 Sep 12 15:26 tsi-ggml-0.0.7.tz
-rw-r--r-- 1 akapoor tsiusers 16576955 Sep 24 12:31 tsi-ggml-0.0.8.tz
drwxrwxrwx 6 atrivedi tsiusers 4096 Oct 2 10:01 models_bf16
-rw-r--r-- 1 akapoor tsiusers 16593627 Oct 6 22:48 tsi-ggml-0.0.9.tz
-rw-r--r-- 1 akapoor tsiusers 16594215 Oct 10 16:19 tsi-ggml-0.0.10.tz
drwxrwxrwx 2 akapoor tsiusers 4096 Oct 16 14:47 models
-rw-r--r-- 1 kraza tsiusers 1237843968 Oct 16 14:56 modelsclear
-rw-r--r-- 1 akapoor tsiusers 16599289 Oct 22 16:24 tsi-ggml-0.2.0.tz
lrwxrwxrwx 1 akapoor tsiusers 39 Oct 22 16:24 tsi-ggml-aws-latest.tz -> /aws/proj/rel/sw/ggml/tsi-ggml-0.2.0.tz
lrwxrwxrwx 1 akapoor tsiusers 35 Oct 22 16:24 tsi-ggml-latest.tz -> /proj/rel/sw/ggml/tsi-ggml-0.2.0.tz
[akapoor@wssw01 ggml]$

###########
POSIX LOG
[akapoor@wssw01 llama.cpp]$
[akapoor@wssw01 llama.cpp]$
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I

llama_perf_sampler_print: sampling time = 15.42 ms / 12 runs ( 1.29 ms per token, 778.01 tokens per second)
llama_perf_context_print: load time = 5562.28 ms
llama_perf_context_print: prompt eval time = 4478.28 ms / 6 tokens ( 746.38 ms per token, 1.34 tokens per second)
llama_perf_context_print: eval time = 3905.12 ms / 5 runs ( 781.02 ms per token, 1.28 tokens per second)
llama_perf_context_print: total time = 9484.59 ms / 11 tokens
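
The per-token figures in the `llama_perf_*` lines follow from the totals: ms per token is total ms divided by the count, and tokens per second is 1000 times the count divided by total ms. A minimal sketch for cross-checking them from a saved log is below; `parse_perf` and `PERF_RE` are hypothetical helper names, not part of llama.cpp, and the pattern assumes the line layout shown in this log.

```python
import re

# Matches lines such as:
#   llama_perf_context_print: eval time = 3905.12 ms / 5 runs ( ... )
# Lines without a "/ <count>" part (for example "load time") are skipped.
PERF_RE = re.compile(
    r"llama_perf_\w+:\s+(?P<name>[\w ]+?)\s+time\s*=\s*"
    r"(?P<ms>[\d.]+) ms\s*/\s*(?P<n>\d+) (?:tokens|runs)"
)


def parse_perf(log_text):
    """Return {phase: (total_ms, count, ms_per_item, items_per_second)}."""
    stats = {}
    for m in PERF_RE.finditer(log_text):
        ms, n = float(m["ms"]), int(m["n"])
        stats[m["name"].strip()] = (ms, n, ms / n, 1000.0 * n / ms)
    return stats


if __name__ == "__main__":
    # Two lines copied from the POSIX log above.
    sample = (
        "llama_perf_context_print: prompt eval time = 4478.28 ms / 6 tokens\n"
        "llama_perf_context_print: eval time = 3905.12 ms / 5 runs\n"
    )
    for phase, (ms, n, per_item, per_sec) in parse_perf(sample).items():
        print(f"{phase}: {per_item:.2f} ms per token, {per_sec:.2f} tokens per second")
```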

=== GGML Perf Summary ===
Op         Target  Runs   TSI_KERNEL-RUN  Total us   Avg us
ADD        OPU     704    914             957122     1359.55
MUL        OPU     720    935             507975     705.52
RMS_NORM   OPU     720    720             435028     604.21
MUL_MAT    CPU     12663  0               2093600    1653.32
CONT       CPU     2660   0               138887     52.21
RESHAPE    CPU     4044   0               1669       0.41
VIEW       CPU     5995   0               678        0.11
PERMUTE    CPU     4734   0               844        0.18
TRANSPOSE  CPU     1088   0               269        0.25
GET_ROWS   CPU     134    0               324        2.42
SET_ROWS   CPU     2502   0               1921       0.77
SOFT_MAX   OPU     352    14784           8219517    23350.90
ROPE       CPU     2650   0               13404      5.06
GLU        OPU     352    457             560127     1591.27

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    18.5280   18.5280      7.4160  [1.60e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    10.9000   10.9000      1.8580  └─ [9.41e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1     8.0250    8.0250      8.0250    └─ [6.93e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.9560    0.9560      0.9560    └─ [8.25e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0610    0.0610      0.0530    └─ [5.27e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0080    0.0080      0.0080      └─ [6.91e-05%] tsi::runtime::executeWithTimeout
1     0.2120    0.2120      0.2120  └─ [1.83e-03%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1    52.4180   52.4180     51.7980  [4.53e-01%] [Thread] tsi::runtime::TsavRT::finalize
1     0.6120    0.6120      0.0520  └─ [5.28e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.5600    0.5600      0.0950    └─ [4.84e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.4190    0.4190      0.4190      └─ [3.62e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0460    0.0460      0.0430      └─ [3.97e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0030    0.0030      0.0030        └─ [2.59e-05%] tsi::runtime::executeWithTimeout
2     0.0080    0.0040      0.0080  └─ [6.91e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

9210 1615.4930 0.1754 229.6150 [13.95%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
18420 1384.8900 0.0752 1384.8900 └─ [11.96%] tsi::runtime::executeWithTimeout
9210 0.9880 1.07e-04 0.9880 └─ [8.53e-03%] LOAD_BLOB Command Execution
9210 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
9210 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

9210 1514.3190 0.1644 272.3920 [13.07%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
18420 1240.2900 0.0673 1240.2900 └─ [10.71%] tsi::runtime::executeWithTimeout
9210 1.6370 1.78e-04 1.6370 └─ [1.41e-02%] UNLOAD_BLOB Command Execution
9210 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
9210 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

9212 1680.7330 0.1825 58.8010 [14.51%] [Thread] tsi::runtime::TsavRT::processResponses
9212 1621.9320 0.1761 1621.9320 └─ [14.00%] tsi::runtime::executeWithTimeout

[Thread] OPU (cumulative over all threads)

1     0.0720    0.0720      0.0450  [6.22e-04%] [Thread] OPU 
1     0.0270    0.0270      0.0270  └─ [2.33e-04%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

###########
FPGA LOG
rwx------ 7 101006 100003 504 Jan 1 1970 aot-tests
drwxr-xr-x 4 root root 952 Mar 9 12:35 tsi-ggml
drwxr-xr-x 4 root root 952 Mar 9 12:36 tsi-ggml-orig
-rw-r--r-- 1 root root 16598931 Oct 22 2025 tsi-ggml-0.2.0.tz
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv39_10_19_2025/bin# ./run_llama_cli.sh
is Luna.

llama_perf_sampler_print: sampling time = 108.05 ms / 11 runs ( 9.82 ms per token, 101.80 tokens per second)
llama_perf_context_print: load time = 53827.25 ms
llama_perf_context_print: prompt eval time = 42797.28 ms / 6 tokens ( 7132.88 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 48400.95 ms / 4 runs (12100.24 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 102362.41 ms / 10 tokens

=== GGML Perf Summary ===
Op         Target  Runs   TSI_KERNEL-RUN  Total us   Avg us
ADD        OPU     484    694             1528861    3158.80
MUL        OPU     495    710             966104     1951.73
RMS_NORM   OPU     495    495             1120119    2262.87
MUL_MAT    CPU     8227   0               510991926  62111.57
CONT       CPU     1329   0               1720169    1294.33
RESHAPE    CPU     1329   0               15171      11.42
VIEW       CPU     1847   0               2506       1.36
PERMUTE    CPU     1552   0               2970       1.91
TRANSPOSE  CPU     307    0               696        2.27
GET_ROWS   CPU     83     0               18355      221.14
SET_ROWS   CPU     1644   0               538218     327.38
SOFT_MAX   CPU     629    0               1007642    1601.97
ROPE       CPU     1580   0               99573      63.02
GLU        OPU     242    347             1051099    4343.38

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    28.7510   28.7510     24.8580  [2.75e-02%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     2.7170    2.7170      2.7170  └─ [2.60e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     0.6230    0.6230      0.6230  └─ [5.96e-04%] tsi::runtime::TsavRT::initialize
1     0.5530    0.5530      0.5000  └─ [5.29e-04%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.0530    0.0265      0.0530    └─ [5.07e-05%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1310 1007.8860 0.7694 0.0000 [9.64e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1310 1.01e+05 77.1745 1.01e+05 └─ [96.71%] TXE 0 Idle
215 201.5619 0.9375 201.5619 └─ [1.93e-01%] [ txe_swiglu ]
225 136.1007 0.6049 136.1007 └─ [1.30e-01%] [ txe_rms_norm ]
440 128.8312 0.2928 128.8312 └─ [1.23e-01%] [ txe_mult ]
430 124.7104 0.2900 124.7104 └─ [1.19e-01%] [ txe_add ]

[Thread] OPU (cumulative over all threads)

1     4.8840    4.8840      4.6440  [4.67e-03%] [Thread] OPU 
1     0.2400    0.2400      0.2400  └─ [2.30e-04%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

1310 781.7130 0.5967 765.6000 [7.48e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1310 16.1130 0.0123 16.1130 └─ [1.54e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

1310 1485.2700 1.1338 46.5140 [ 1.42%] [Thread] tsi::runtime::TsavRT::processResponses
1310 1438.7560 1.0983 1438.7560 └─ [ 1.38%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    75.0060   75.0060     58.2750  [7.18e-02%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    16.7310   16.7310     16.7310  └─ [1.60e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

1312 66.6430 0.0508 66.6430 [6.38e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

1310 339.9690 0.2595 339.9690 [3.25e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

1310 68.7070 0.0524 68.7070 [6.57e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

1310 98.5160 0.0752 98.5160 [9.42e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

1310 14.8400 0.0113 14.8400 [1.42e-02%] [Thread] tsi::runtime::TsavRT::deallocate

-   1.05e+05    0.0000    1.05e+05  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.7774

root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv39_10_19_2025/bin#
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv39_10_19_2025/bin#
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv39_10_19_2025/bin#
Terminating...
Thanks for using picocom
akapoor@fpga4:/proj/work/akapoor$
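
To compare the two runs, it can help to total the `Total us` column of each `GGML Perf Summary` by target. In the FPGA table, `MUL_MAT` on the CPU accounts for roughly 98% of the summed op time. The sketch below is one way to compute that from a saved log; `parse_perf_summary` and `time_share_by_target` are hypothetical helper names, and the parser assumes each data row has exactly six whitespace-separated fields with `OPU` or `CPU` in the second column, as in the logs here.

```python
def parse_perf_summary(text):
    """Collect the data rows of a GGML Perf Summary table.

    Assumed row layout: Op, Target, Runs, TSI_KERNEL-RUN, Total us, Avg us.
    The header and all other log lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[1] in ("OPU", "CPU"):
            op, target, runs, kernel_runs, total_us, avg_us = parts
            rows.append({
                "op": op,
                "target": target,
                "runs": int(runs),
                "kernel_runs": int(kernel_runs),
                "total_us": int(total_us),
                "avg_us": float(avg_us),
            })
    return rows


def time_share_by_target(rows):
    """Return {target: fraction of the summed 'Total us' column}."""
    total = sum(r["total_us"] for r in rows)
    per_target = {}
    for r in rows:
        per_target[r["target"]] = per_target.get(r["target"], 0) + r["total_us"]
    return {t: us / total for t, us in per_target.items()}


if __name__ == "__main__":
    # Header plus two rows copied from the FPGA table above.
    sample = """Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD OPU 484 694 1528861 3158.80
MUL_MAT CPU 8227 0 510991926 62111.57
"""
    for target, share in time_share_by_target(parse_perf_summary(sample)).items():
        print(f"{target}: {share:.2%} of summed op time")
```

The summed op time need not match the wall-clock `total time` line, so the shares are best read as relative weights only.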

@atrivedi-tsavoritesi atrivedi-tsavoritesi left a comment

Hi Anoop,

Don't we need to change the default case in CMakeLists as well? And what about the ggml-tsi-kernel repo?

Thanks
Ashish

@akapoor3518
Author

Done

@akapoor3518 akapoor3518 changed the title @FIR-1039 - llama.cpp: new release of 0.2.0 with sync with MLIR SDK 0… @FIR-1039 - llama.cpp: new release of 0.2.0 with sync with MLIR SDK 0.2.0 Oct 22, 2025
@akapoor3518 akapoor3518 merged commit 541a89b into master Oct 22, 2025