
Hey there, I’m VJHack 👋

Fast, hermetic builds with Bazel · LLM inference & optimization


I'm a software engineer passionate about builds and AI.
With Bazel, there's no reason not to have lightning-fast, cross-platform builds.
I believe everyone should be able to run language models on consumer hardware, and I'm deeply interested in inference and performance optimization.


Major Contributions

ggml‑org / llama.cpp

  • #11223 – Top‑σ sampler | Paper – Implements the Top‑σ sampling algorithm from the paper Top-nσ: Not All Logits Are You Need, a novel alternative to Top‑k/Top‑p for LLM decoding that keeps the sampling space stable even at high temperatures (a minimal sketch follows this list).
  • #11180 #11116 – Restructures the gguf PyPI package to avoid installing multiple top-level packages and to prevent conflicts with an existing scripts directory.
  • – Fixes memory alignment issues in quantized KV-cache allocations, improving stability for int4 models.
  • #9527 – Updates response_format to match OpenAI's new structured output schema (see the example request after this list).
  • #9484 – Adds a command-line argument (--no-context-shift) to disable context shift during infinite text generation.
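
For a flavor of how Top‑nσ works, here is a minimal NumPy sketch of the idea from the paper (illustrative only, not the llama.cpp implementation): keep only the tokens whose logit lies within n standard deviations of the maximum logit, then sample from the renormalized distribution. Because temperature rescales the maximum and the standard deviation by the same factor, the set of surviving tokens stays the same at any temperature.

```python
import numpy as np

def top_n_sigma_sample(logits: np.ndarray, n: float = 1.0,
                       temperature: float = 1.0,
                       rng: np.random.Generator | None = None) -> int:
    """Illustrative top-nσ sampling: threshold at max - n*std, then sample."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    # Keep only logits within n standard deviations of the maximum.
    threshold = scaled.max() - n * scaled.std()
    masked = np.where(scaled >= threshold, scaled, -np.inf)
    # Softmax over the surviving tokens and draw one token index.
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```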
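
And a hedged example of the structured-output request shape that #9527 aligns with. The server address, schema, and field values below are placeholders for illustration, assuming a llama-server instance running locally:

```python
import json
import urllib.request

# Placeholder schema and local server address, for illustration only.
payload = {
    "messages": [{"role": "user", "content": "Name a city and its country."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed llama-server address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # The reply's message content is constrained to match the schema above.
    print(json.load(resp)["choices"][0]["message"]["content"])
```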

ggml‑org / llama.vim

  • #15 – Adds a local cache for FIM completions to reduce server calls. Uses a SHA-256 hash of the prompt state as the key; the default size is 250 (configurable), with a random eviction policy (a rough sketch follows this list).
  • #18 – Optimizes FIM cache by retaining suggestions when the user continues typing the same text.
  • #21 – Updates the info message to show cache-specific metrics on cache hits (C: current/size | t: total time). Also reduces cache size by storing only the completion content.
  • #24 – Minimizes server-client payloads by filtering out unused response fields. Applies to both ring_update() and main FIM calls, keeping only essential fields like content and timings.
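
A rough Python sketch of the caching idea behind #15 (the real plugin is Vim script; the class and method names below are made up for illustration): completions are keyed by a SHA-256 hash of the prompt state, and once the cache is full a random entry is evicted.

```python
import hashlib
import random

class FIMCache:
    """Illustrative prompt-keyed completion cache with random eviction."""

    def __init__(self, max_size: int = 250):
        self.max_size = max_size
        self.entries: dict[str, str] = {}

    @staticmethod
    def _key(prefix: str, suffix: str) -> str:
        # Hash the prompt state (text around the cursor) into a fixed-size key.
        return hashlib.sha256(f"{prefix}\x1f{suffix}".encode()).hexdigest()

    def get(self, prefix: str, suffix: str) -> str | None:
        return self.entries.get(self._key(prefix, suffix))

    def put(self, prefix: str, suffix: str, completion: str) -> None:
        if len(self.entries) >= self.max_size:
            # Random eviction: drop an arbitrary cached completion.
            self.entries.pop(random.choice(list(self.entries)))
        self.entries[self._key(prefix, suffix)] = completion
```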
