Releases: microsoft/onnxruntime-genai

v0.7.1

22 Apr 02:20
efab081

Release Notes

  • Added AMD Quark quantizer support #1207
  • Added Gemma 3 to the model builder #1359
  • Updated the Phi-3 Python Q&A example to be consistent with the C++ example #1392
  • Updated Microsoft.Extensions.AI.Abstractions to 9.4.0-preview.1.25207.5 #1388
  • Added an OnnxRuntimeGenAIChatClient constructor that accepts a Config #1364
  • Improved and fixed TopK/TopP sampling #1363
  • Switched the order of softmax in the CPU Top K implementation #1354
  • Updated the custom NuGet packaging logic #1377
  • Updated pybind, fixed the rpath on macOS, and added a nullptr check #1367
  • Converted tokens to a list for concatenation to accommodate a breaking API change in the tokenizer #1358

v0.7.0

28 Mar 16:58
8a48d7b

Release Notes

We are excited to announce the release of onnxruntime-genai version 0.7.0. Below are the key updates included in this release:

  1. Support for a wider variety of QNN NPU models (such as DeepSeek R1).
  2. Removed the onnxruntime-genai static library. All language bindings now interface with onnxruntime-genai through the onnxruntime-genai shared library.
    • All return types from the onnxruntime-genai Python package are now NumPy arrays.
    • Previously, tokenizer.encode returned a Python list. This broke examples/python/model-qa.py, which used '+' to concatenate two lists; np.concatenate must be used instead, as shown in the sketch after this list.
  3. Abstracted execution-provider-specific code into separate shared libraries (for example, onnxruntime-genai-cuda for CUDA and onnxruntime-genai-dml for DirectML). This allows, for instance, the onnxruntime-genai-cuda package to be used on machines without CUDA.
  4. Support for multi-modal models (text, speech, and vision) such as Phi-4 multimodal.
  5. Added an IChatClient implementation to the onnxruntime-genai C# bindings.
  6. Exposed the model type through the Python bindings.
  7. Code and performance improvements for the DML EP.
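
A minimal sketch of the tokenizer change in item 2, assuming the 0.7.0 Python package (where tokenizer.encode returns a NumPy array); the model path and prompt markers are placeholders:

```python
import numpy as np
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)

prompt_tokens = tokenizer.encode("<|user|>\nHello<|end|>\n")
reply_tokens = tokenizer.encode("<|assistant|>\n")

# '+' no longer concatenates: both values are NumPy arrays, not Python
# lists, so join token sequences with np.concatenate instead.
all_tokens = np.concatenate([prompt_tokens, reply_tokens])
```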

This release also includes several bug fixes that resolve issues reported by users.

v0.6.0

14 Feb 18:07
97d44f6

Release Notes

We are excited to announce the release of onnxruntime-genai version 0.6.0. Below are the key updates included in this release:

  1. Support for contextual (continuous) decoding, which allows users to carry out multi-turn, conversation-style generation (see the sketch after this list).
  2. Support for new models such as DeepSeek R1, AMD OLMo, IBM Granite, and others.
  3. Python 3.13 wheels have been introduced.
  4. Support for generation with models sourced from Qualcomm's AI Hub. This work also includes publishing a NuGet package, Microsoft.ML.OnnxRuntimeGenAI.QNN, for the QNN EP.
  5. Support for the WebGPU EP.
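
A minimal sketch of a multi-turn loop using continuous decoding, assuming the current Python API (og.Generator with append_tokens); the model path and chat-template markers are placeholders that depend on the model:

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)
generator = og.Generator(model, params)

# Continuous decoding: one generator keeps its KV cache across turns,
# so each turn only appends the new user tokens.
for turn in ["What is ONNX Runtime?", "How does GenAI extend it?"]:
    prompt = f"<|user|>\n{turn}<|end|>\n<|assistant|>\n"  # model-specific template
    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()
        print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
    print()
```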

This release also includes performance improvements to optimize memory usage and speed. In addition, there are several bug fixes that resolve issues reported by users.

v0.5.2

26 Nov 18:05
27bcf6c

Release Notes

Patch release 0.5.2 adds:

  • Fixes for bugs #1074 and #1092 via PRs #1065 and #1070
  • Fixed the NuGet sample in the package README to show correct disposal of objects
  • Added extra validation via PRs #1050 and #1066

Features in 0.5.0:

  • Support for MultiLoRA
  • Support for multi-frame for Phi-3 vision and Phi-3.5 vision models
  • Support for the Phi-3 MoE model
  • Support for NVIDIA Nemotron model
  • Support for the Qwen model
  • Addition of the Set Terminate feature, which allows users to cancel mid-generation (see the sketch after this list)
  • Soft capping support for Group Query Attention
  • Extended quantization support to embedding and LM head layers
  • Mac support in published packages
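
A minimal sketch of cancelling mid-generation with Set Terminate. It assumes the feature is exposed in Python through the generator's set_runtime_option method with the "terminate_session" key; verify the exact key and behavior against this release's API reference:

```python
import threading
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)
generator = og.Generator(model, og.GeneratorParams(model))
generator.append_tokens(tokenizer.encode("Write a very long story."))

# Assumed API: setting "terminate_session" to "1" from another thread
# requests cancellation of the in-flight generation loop.
timer = threading.Timer(5.0, lambda: generator.set_runtime_option("terminate_session", "1"))
timer.start()
try:
    while not generator.is_done():
        generator.generate_next_token()
finally:
    timer.cancel()
```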

Known issues

  • Models running with DirectML do not support batching
  • Python 3.13 is not supported in this release

v0.5.1

13 Nov 21:26
e8cd6bc

Release Notes

In addition to the features in the 0.5.0 release, this release adds:

  • Added the ability to choose the execution provider and modify its options at runtime
  • Fixed a data-leakage bug in the KV caches

Features in 0.5.0:

  • Support for MultiLoRA
  • Support for multi-frame for Phi-3 vision and Phi-3.5 vision models
  • Support for the Phi-3 MoE model
  • Support for NVIDIA Nemotron model
  • Support for the Qwen model
  • Addition of the Set Terminate feature, which allows users to cancel mid-generation
  • Soft capping support for Group Query Attention
  • Extended quantization support to embedding and LM head layers
  • Mac support in published packages

Known issues

  • Models running with DirectML do not support batching
  • Python 3.13 is not supported in this release

v0.5.0

08 Nov 19:43
826f6aa

Release Notes

  • Support for MultiLoRA (see the sketch after this list)
  • Support for multi-frame for Phi-3 vision and Phi-3.5 vision models
  • Support for the Phi-3 MoE model
  • Support for NVIDIA Nemotron model
  • Support for the Qwen model
  • Addition of the Set Terminate feature, which allows users to cancel mid-generation
  • Soft capping support for Group Query Attention
  • Extended quantization support to embedding and LM head layers
  • Mac support in published packages
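
A minimal sketch of the MultiLoRA support, assuming the og.Adapters API documented for this feature; the model path, adapter paths, and adapter names are placeholders:

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)

# Load several adapters once; each name is used to activate it later.
adapters = og.Adapters(model)
adapters.load("adapters/travel.onnx_adapter", "travel")    # placeholder
adapters.load("adapters/medical.onnx_adapter", "medical")  # placeholder

generator = og.Generator(model, og.GeneratorParams(model))
generator.set_active_adapter(adapters, "travel")  # route this request through one adapter

generator.append_tokens(tokenizer.encode("Plan a weekend in Kyoto."))
while not generator.is_done():
    generator.generate_next_token()
```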

Known issues

  • Models running with DirectML do not support batching
  • Python 3.13 is not supported in this release

v0.4.0

22 Aug 20:26
b77e768

Release Notes

  • Support for new models such as Qwen 2, Llama 3.1, Gemma 2, and Phi-3 Small on CPU
  • Support for building models that were already quantized with AWQ or GPTQ
  • Performance improvements for Intel and Arm CPUs
  • Packaging and language bindings
    • Added Java bindings (build from source)
    • Separated onnxruntime.dll and DirectML.dll out of the GenAI package to improve usability
    • Published packages for Windows Arm
    • Support for Android (build from source)

v0.3.0

21 Jun 21:23
964eb65

Release Notes

  • Phi-3 vision model support for the DML EP.
  • Addressed a DML memory leak and crashes on long prompts.
  • Addressed crashes and slowness in the CPU EP's Group Query Attention on long prompts, caused by integer overflow.
  • Added the import library for the Windows C API package.
  • Addressed a bug in get_output('logits') so that it returns the logits for the entire prompt rather than only for the last generated token (see the sketch after this list).
  • Addressed a crash when querying the device type of the model.
  • Added .NET Standard 2.0 compatibility.
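
A minimal sketch of inspecting prompt logits via get_output. It is written against the present-day Python loop (generator.append_tokens); the 0.3.0-era loop used compute_logits instead, so treat this as illustrative:

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)
generator = og.Generator(model, og.GeneratorParams(model))

generator.append_tokens(tokenizer.encode("The capital of France is"))
generator.generate_next_token()

# After the prompt-processing step, 'logits' covers every prompt position,
# i.e. shape (batch, prompt_length, vocab_size), not just the final token.
logits = generator.get_output("logits")
print(logits.shape)
```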

ONNX Runtime GenAI v0.3.0-rc2

30 May 17:24
d536387
Pre-release

Release Notes

  • Added support for the Phi-3-Vision model.
  • Added support for the Phi-3-Small model.
  • Removed usage of std::filesystem to avoid runtime issues when incompatible symbols are loaded from libstdc++ and libstdc++fs.