Releases · turboderp/exllamav2
0.0.21
- Support for Granite architecture
- Support for GPT2 architecture
- Support for banned strings in the streaming generator (see the sketch below)
- A bit more work on multimodal support (still unfinished)
- A few bugfixes and other minor changes
- Windows wheels for PyTorch 2.2.0 are included below to work around an apparent (likely temporary) issue in PyTorch. See #434 and pytorch/pytorch#125109
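A minimal sketch of the banned-strings feature. The model path is a placeholder, and the `banned_strings` argument to `begin_stream_ex()` is an assumption based on the repo's streaming examples, not a confirmed signature:

```python
# Sketch: streaming generation with banned strings. When a banned string
# starts to appear in the output, the generator is expected to rewind and
# resample rather than emit it. Model path is hypothetical; banned_strings
# on begin_stream_ex() is assumed from the repo's examples.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/exl2-model"  # hypothetical path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

input_ids = tokenizer.encode("Write a short product description.")
generator.begin_stream_ex(input_ids, settings,
                          banned_strings = ["as an AI", "in conclusion"])

# Stream until EOS or a token budget is reached
max_new_tokens = 200
generated_tokens = 0
while True:
    res = generator.stream_ex()
    print(res["chunk"], end = "")
    generated_tokens += 1
    if res["eos"] or generated_tokens == max_new_tokens: break
```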
Full Changelog: v0.0.20...v0.0.21
0.0.20
- Adds Phi3 support
- Wheels compiled for PyTorch 2.3.0
- ROCm 6.0 wheels
Full Changelog: v0.0.19...v0.0.20
0.0.19
- More accurate Q4 cache using groupwise rotations
- Better prompt ingestion speed when using flash-attn
- Minor fixes related to issues quantizing Llama 3
- New, more robust optimizer
- Fixes a bug in long-sequence inference with GPTQ models
Full Changelog: v0.0.18...v0.0.19
0.0.18
- Support for Command-R-plus
- Fix for pre-AVX2 CPUs
- VRAM optimizations for quantization
- Very preliminary multimodal support
- Various other small fixes and optimizations
Full Changelog: v0.0.17...v0.0.18
0.0.17
Mostly just minor fixes and support for DBRX models.
Full Changelog: v0.0.16...v0.0.17
0.0.16
- Adds support for Cohere models
- N-gram decoding (see the sketch after this list)
- A few bugfixes
- Lots of optimizations
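N-gram decoding drafts upcoming tokens from repeating patterns already present in the context and verifies them in a single forward pass, which can speed up repetitive or extractive outputs without a separate draft model. A minimal sketch, assuming the `speculative_ngram` flag on the streaming generator (the flag name and model path are assumptions based on the repo's examples):

```python
# Sketch: enabling n-gram speculative decoding on the streaming generator.
# The speculative_ngram attribute is assumed; model path is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/exl2-model"  # hypothetical path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.speculative_ngram = True  # draft tokens from n-gram matches in context

settings = ExLlamaV2Sampler.Settings()
generator.begin_stream_ex(tokenizer.encode("List the first ten primes:"), settings)

for _ in range(200):
    res = generator.stream_ex()
    print(res["chunk"], end = "")
    if res["eos"]: break
```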
Full Changelog: v0.0.15...v0.0.16
0.0.15
- Adds Q4 cache mode (see the sketch below)
- Support for StarCoder2
- Minor optimizations and a couple of bugfixes
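The Q4 cache stores keys and values quantized to four bits, cutting K/V cache VRAM to roughly a quarter of FP16. A minimal sketch, assuming `ExLlamaV2Cache_Q4` works as a drop-in replacement for `ExLlamaV2Cache` (the model path is hypothetical):

```python
# Sketch: loading a model with the Q4 cache. ExLlamaV2Cache_Q4 is used
# like ExLlamaV2Cache; only the storage format of the K/V cache changes.
# Model path is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

config = ExLlamaV2Config()
config.model_dir = "/path/to/exl2-model"  # hypothetical path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy = True)  # Q4 instead of FP16 cache
model.load_autosplit(cache)
```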
Full Changelog: v0.0.14...v0.0.15
0.0.14
Adds support for Qwen1.5 and Gemma architectures.
Various fixes and optimizations.
Full Changelog since 0.0.13: v0.0.13...v0.0.14
0.0.13.post2
Full Changelog: 0.0.13.post1...0.0.13.post2
0.0.13.post1
Fixes inference on models with vocab sizes that are not multiples of 32