diff --git a/README.md b/README.md
index 40d64116c..8972e5b56 100644
--- a/README.md
+++ b/README.md
@@ -108,8 +108,8 @@ For more details about using ``QEfficient`` via Cloud AI 100 Apps SDK, visit [Li
 
 ## Documentation
 
-* [Quick Start Guide](https://quic.github.io/efficient-transformers/source/quick_start.html#)
-* [Python API](https://quic.github.io/efficient-transformers/source/hl_api.html)
+* [Quick Start Guide](https://quic.github.io/efficient-transformers/source/quick_start.html)
+* [QEFF API](https://quic.github.io/efficient-transformers/source/qeff_autoclasses.html)
 * [Validated Models](https://quic.github.io/efficient-transformers/source/validate.html)
 * [Models coming soon](https://quic.github.io/efficient-transformers/source/validate.html#models-coming-soon)
 
diff --git a/docs/source/supported_features.rst b/docs/source/supported_features.rst
index 4177b451f..9715da982 100644
--- a/docs/source/supported_features.rst
+++ b/docs/source/supported_features.rst
@@ -30,6 +30,8 @@ Supported Features
      - Enables execution with FP8 precision, significantly improving performance and reducing memory usage for computational tasks.
    * - Prefill caching
      - Enhances inference speed by caching key-value pairs for shared prefixes, reducing redundant computations and improving efficiency.
+   * - On Device Sampling
+     - Enables sampling operations to be executed directly on the QAIC device rather than the host CPU for QEffForCausalLM models. This enhancement significantly reduces host-device communication overhead and improves inference throughput and scalability. Refer `sample script `_ for more **details**.
    * - Prompt-Lookup Decoding
      - Speeds up text generation by using overlapping parts of the input prompt and the generated text, making the process faster without losing quality. Refer `sample script `_ for more **details**.
    * - :ref:`PEFT LoRA support `
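
For context on the On Device Sampling row added above, here is a minimal usage sketch against the QEfficient Python API. `QEFFAutoModelForCausalLM.from_pretrained`, `.compile`, and `.generate` are real entry points in the library, but the `qaic_config` key shown (`include_sampler`) is an assumption inferred from the feature description; the sample script referenced in the table is the authoritative example.

```python
# Hypothetical sketch of enabling on-device sampling with QEfficient.
# The qaic_config key below is an assumption based on the feature
# description in supported_features.rst, not a confirmed API contract.
from transformers import AutoTokenizer
from QEfficient import QEFFAutoModelForCausalLM

model_card = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_card)

# Assumed flag: ask the exported model to perform sampling on the QAIC
# device instead of returning full logits to the host at every step.
model = QEFFAutoModelForCausalLM.from_pretrained(
    model_card,
    qaic_config={"include_sampler": True},
)
model.compile(num_cores=16)  # compile the QPC for Cloud AI 100

# With on-device sampling, only the chosen token ids cross the
# host-device boundary per decode step, cutting transfer overhead.
model.generate(tokenizer=tokenizer, prompts=["What is on-device sampling?"])
```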
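The table also describes Prompt-Lookup Decoding. Below is a framework-agnostic sketch of the candidate-proposal step that description refers to (matching the most recent n-gram of the sequence against earlier occurrences in the prompt and generated text); it is an illustration of the idea only, not QEfficient's implementation.

```python
def prompt_lookup_candidates(tokens, max_ngram=3, num_candidates=5):
    """Propose draft tokens by matching the trailing n-gram of `tokens`
    against an earlier occurrence in the same sequence (prompt + output).
    The tokens that followed that earlier occurrence become draft
    candidates, which the target model then verifies in a single pass."""
    for n in range(max_ngram, 0, -1):
        tail = tokens[-n:]
        # Scan backwards for the most recent earlier occurrence of the tail.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == tail:
                follow = tokens[start + n:start + n + num_candidates]
                if follow:
                    return follow
    return []  # no overlap found; fall back to ordinary decoding

# Example: after [2, 3] earlier in the sequence came 4, so propose [4, 2, 3].
print(prompt_lookup_candidates([1, 2, 3, 4, 2, 3]))  # -> [4, 2, 3]
```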