@FIR-737: Added another endpoint, llama-cli, to invoke directly via URL #11
This commit has two changes.
Make sure to read the contributing guidelines before submitting a PR
The test results below were captured from the browser on FPGA2, without any markers.
URL used: http://10.50.0.112:5003/llama-cli?model=tiny-llama&backend=tSavorite&tokens=5&prompt=Hello+How+are+you
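For reference, the same request can also be issued from a script. This is a minimal sketch, assuming the endpoint simply returns the generated text in the response body; the host, port, and parameter names are taken from the URL above.

```python
# Hypothetical scripted invocation of the new /llama-cli endpoint.
# Host, port, and query parameters are copied from the browser URL above;
# the endpoint is assumed to return plain text.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://10.50.0.112:5003/llama-cli"

params = {
    "model": "tiny-llama",
    "backend": "tSavorite",
    "tokens": 5,
    "prompt": "Hello How are you",
}

# Generation on the FPGA is slow (see the perf numbers below), so use a long timeout.
with urlopen(f"{BASE_URL}?{urlencode(params)}", timeout=600) as resp:
    print(resp.status)
    print(resp.read().decode("utf-8", errors="replace"))
```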
```
/usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin/run_platform_test.sh "Hello How are you" 5 tinyllama-vo-5m-para.gguf tSavorite
Check if tnApcMgr is running; if it is not, uncomment below line and execute the run_platform_test.sh script.
Running on v0.1.1.tsv31_06_06_2025
register_backend: registered backend Tsavorite (1 devices)
register_device: registered device Tsavorite (txe)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
load_backend: failed to find ggml_backend_init in /usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin/tsi-ggml/libggml-tsavorite.so
load_backend: failed to find ggml_backend_init in /usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin/tsi-ggml/libggml-cpu.so
build: 5473 (a7b7e46) with gcc (GCC) 13.3.0 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
TXE Device MEMORY Summary total 134217728 and free 134217728
llama_model_load_from_file_impl: using device Tsavorite (txe) - 128 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 201 tensors from /tsi/akapoor/ggml/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = llama
llama_model_loader: - kv   1: general.type str = model
llama_model_loader: - kv   2: general.name str = Tiny Llama v0.3 FP32
llama_model_loader: - kv   3: general.size_label str = 1.1B
llama_model_loader: - kv   4: general.license str = apache-2.0
llama_model_loader: - kv   5: general.dataset.count u32 = 3
llama_model_loader: - kv   6: general.dataset.0.name str = SlimPajama 627B
llama_model_loader: - kv   7: general.dataset.0.organization str = Cerebras
llama_model_loader: - kv   8: general.dataset.0.repo_url str = https://huggingface.co/cerebras/SlimP...
llama_model_loader: - kv   9: general.dataset.1.name str = Starcoderdata
llama_model_loader: - kv  10: general.dataset.1.organization str = Bigcode
llama_model_loader: - kv  11: general.dataset.1.repo_url str = https://huggingface.co/bigcode/starco...
llama_model_loader: - kv  12: general.dataset.2.name str = Oasst_Top1_2023 08 25
llama_model_loader: - kv  13: general.dataset.2.version str = 08-25
llama_model_loader: - kv  14: general.dataset.2.organization str = OpenAssistant
llama_model_loader: - kv  15: general.dataset.2.repo_url str = https://huggingface.co/OpenAssistant/...
llama_model_loader: - kv  16: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv  17: llama.block_count u32 = 22
llama_model_loader: - kv  18: llama.context_length u32 = 2048
llama_model_loader: - kv  19: llama.embedding_length u32 = 2048
llama_model_loader: - kv  20: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv  21: llama.attention.head_count u32 = 32
llama_model_loader: - kv  22: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv  23: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv  24: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv  25: general.file_type u32 = 0
llama_model_loader: - kv  26: llama.vocab_size u32 = 32003
llama_model_loader: - kv  27: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv  28: tokenizer.ggml.model str = llama
llama_model_loader: - kv  29: tokenizer.ggml.pre str = default
llama_model_loader: - kv  30: tokenizer.ggml.tokens arr[str,32003] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv  31: tokenizer.ggml.scores arr[f32,32003] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  32: tokenizer.ggml.token_type arr[i32,32003] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  33: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv  34: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv  35: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv  36: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv  37: general.quantization_version u32 = 2
llama_model_loader: - type f32: 201 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = all F32
print_info: file size   = 4.10 GiB (32.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 2048
print_info: n_embd           = 2048
print_info: n_layer          = 22
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5632
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 2048
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 1.10 B
print_info: general.name     = Tiny Llama v0.3 FP32
print_info: vocab type       = SPM
print_info: n_vocab          = 32003
print_info: n_merges         = 0
print_info: BOS token        = 1 ''
print_info: EOS token        = 2 ''
print_info: EOT token        = 32002 '<|im_end|>'
print_info: UNK token        = 0 ''
print_info: PAD token        = 32000 '[PAD]'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 ''
print_info: EOG token        = 32002 '<|im_end|>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
TXE Device MEMORY Summary total 134217728 and free 134217728
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/23 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4196.40 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 12288
llama_context: n_ctx_per_seq = 12288
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (12288) > n_ctx_train (2048) -- possible training context overflow
[2018-03-09 12:37:20.831859] 277:278 [info] :: TXE resource allocation request processed successfully.
llama_context: CPU output buffer size = 0.12 MiB
llama_kv_cache_unified: CPU KV buffer size = 264.00 MiB
llama_kv_cache_unified: size = 264.00 MiB ( 12288 cells, 22 layers, 1 seqs), K (f16): 132.00 MiB, V (f16): 132.00 MiB
ggml_backend_tsavorite_buffer_type_alloc_buffer is called from llama data Loader
ANoop Allocating memory from tsi_alloc with size 15732736
Allocating memory from tsi_alloc with size 15732736 starting memory 0xfffea3cb3080
Address of Newly Created BUffer 0xfffea3cb3080 and size 15732736
llama_context: tsavorite compute buffer size = 15.00 MiB
llama_context: CPU compute buffer size = 808.01 MiB
llama_context: graph nodes  = 798
llama_context: graph splits = 223 (with bs=512), 137 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 12288
main: llama threadpool init, n_threads = 4
main: model was trained on only 2048 context tokens (12288 specified)

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |

sampler seed: 1594714430
sampler params:
	repeat_last_n = 5, repeat_penalty = 1.500, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 12288
	top_k = 50, top_p = 0.900, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 12288, n_batch = 1024, n_predict = 5, n_keep = 1

my cat’s name is Luna.

llama_perf_sampler_print: sampling time = 163.23 ms / 11 runs ( 14.84 ms per token, 67.39 tokens per second)
llama_perf_context_print: load time = 125697.32 ms
llama_perf_context_print: prompt eval time = 94708.24 ms / 6 tokens (15784.71 ms per token, 0.06 tokens per second)
llama_perf_context_print: eval time = 142173.74 ms / 4 runs (35543.44 ms per token, 0.03 tokens per second)
llama_perf_context_print: total time = 268116.82 ms / 10 tokens
TXE_ADD Operation, total tensor: 5 Number of Kernel Call: 160 Number of tensor got spilt: 5 Min Num of Elem 2048 Max Num of Elem 2048
TXE_SUB Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_MULT Operation, total tensor: 225 Number of Kernel Call: 14080 Number of tensor got spilt: 225 Min Num of Elem 2048 Max Num of Elem 12288
TXE_DIV Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SQRT Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_NEG Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_ABS Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SIN Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SIGMOID Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SILU Operation, total tensor: 110 Number of Kernel Call: 18920 Number of tensor got spilt: 110 Min Num of Elem 5632 Max Num of Elem 33792
[2018-03-09 12:41:22.453747] 277:278 [info] :: TXE resource release request processed successfully.
```
```
GGML Tsavorite Profiling Results:
------------------------------------------------------------------------------------------------------------------------
     Calls    Total(ms)     T/call    Self(ms)  Function
------------------------------------------------------------------------------------------------------------------------
     33160    47905.000      1.445       0.000  [20%] RuntimeHostShim::awaitCommandListCompletion
     18920    29773.936      1.574   29773.936    └─ [12%] [ txe_silu ]
     14080    21871.588      1.553   21871.588    └─ [ 9%] [ txe_mult ]
       160      251.353      1.571     251.353    └─ [ 0%] [ txe_add ]
     33160        0.541      0.000       0.541    └─ [ 0%] TXE 0 Idle
         1      204.000    204.000      26.000  [ 0%] GGML Tsavorite
         1      178.000    178.000     178.000    └─ [ 0%] RuntimeHostShim::initialize
         1       87.000     87.000      87.000  [ 0%] RuntimeHostShim::finalize
     33160       50.000      0.002      50.000  [ 0%] RuntimeHostShim::loadBlob
     33160       40.000      0.001      40.000  [ 0%] RuntimeHostShim::finalizeCommandList
     33160        7.000      0.000       7.000  [ 0%] RuntimeHostShim::createCommandList
     33160        5.000      0.000       5.000  [ 0%] RuntimeHostShim::deallocate
     33161        4.000      0.000       4.000  [ 0%] RuntimeHostShim::allocate
     33160        4.000      0.000       4.000  [ 0%] RuntimeHostShim::addCommandToList
     33160        1.000      0.000       1.000  [ 0%] RuntimeHostShim::unloadBlob
    113720        0.000      0.000       0.000  [ 0%] RuntimeHostShim::getShmemManager
     33160        0.000      0.000       0.000  [ 0%] RuntimeHostShim::launchBlob
========================================================================================================================
    412163   241743.000      0.587  241743.000  [100%] TOTAL
========================================================================================================================
```
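For context, the per-token latency and tokens-per-second figures in the llama_perf_context_print lines above follow directly from the reported totals. A small sketch reproducing that arithmetic, with the numbers copied from the log:

```python
# Sanity check of the llama_perf_context_print arithmetic from the log above:
# per-token latency = total time / token count; throughput = 1000 / per-token ms.
prompt_eval_ms, prompt_tokens = 94708.24, 6   # "prompt eval time" line
eval_ms, eval_runs = 142173.74, 4             # "eval time" line

for label, total_ms, n in [("prompt eval", prompt_eval_ms, prompt_tokens),
                           ("eval", eval_ms, eval_runs)]:
    per_token = total_ms / n
    tps = 1000.0 / per_token
    print(f"{label}: {per_token:.2f} ms per token, {tps:.2f} tokens per second")
# Output matches the log: 15784.71 ms/token (0.06 t/s) and 35543.44 ms/token (0.03 t/s).
```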