Releases: ggml-org/llama.cpp
Releases · ggml-org/llama.cpp
b5723
CUDA: add conv_2d_transpose (#14287) * CUDA: add conv_2d_transpose * remove direct include of cuda_fp16 * Review: add brackets for readability, remove ggml_set_param and add asserts
b5722
b5721
vocab : prevent tokenizer overflow (#14301) * vocab : prevent stack overflow in tokenize * vocab : return error instead of aborting on oversized token count * vocab : INT32_MIN from llama_tokenize on overflow
b5720
sycl: add usage of enqueue_functions extension (#14244) * Add header and namespace to use enqueue_functions extension * Convert submit and parallel_for to use new extension in convert.cpp * Convert submit and parallel_for to use extension in ggml-sycl.cpp * Convert submit and parallel_for to use extension in gla.cpp * Convert submit and parallel_for in mmq.cpp * Convert submit and parallel_for in mmvq.cpp * Convert submit and parallel_for in remaining files * Convert all simple parallel_for to nd_launch from enqueue_functions extension * Wrapping extension in general function Create a general function that enable the enqueue_functions extension if it is enable in the compiler, otherwise call the general SYCL function to launch kernels. --------- Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b5719
Implement GGML_CPU_ALL_VARIANTS for PowerPC (#14286) * Add PowerPC feature detection and scoring * ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC * ggml-cpu: Delay some initializations until function is called When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU. --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5718
llama : improve sep token handling (#14272)
b5717
cuda : synchronize graph capture and cublas handle destruction (#14288) Workarounds an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread
b5716
ggml : fix repack work size for mul_mat_id (#14292) ggml-ci
b5715
ggml: Update KleidiAI to v1.9.0 (#14277)
b5714
model : more uniform output id handling (#14275) * model : more uniform output id handling ggml-ci * cont : revert n_outputs < n_tokens optimization ggml-ci * cont : fix out_ids initialization ggml-ci