Releases: huggingface/text-generation-inference

v3.1.1

04 Mar 17:15
c34bd9d

Full Changelog: v3.1.0...v3.1.1

v3.1.0

31 Jan 13:26
463228e

Important changes

DeepSeek R1 is fully supported on both AMD and NVIDIA!

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1
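
Once the server is up, it can be queried through TGI's OpenAI-compatible Messages API. A minimal sketch (prompt and max_tokens are illustrative):

curl localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 256
    }'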

Full Changelog: v3.0.2...v3.1.0

v3.0.2

24 Jan 11:16
b70f29d

TL;DR

New transformers backend supporting flash attention at roughly the same performance as native TGI, for all models not officially supported directly in TGI. Congrats @Cyrilvallez

New models unlocked: Cohere2, OLMo, OLMo2, Helium.
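
As a sketch, serving one of these newly unlocked models looks the same as serving any natively supported one; assuming the transformers fallback is picked automatically when no native implementation exists, no extra flag is needed (the model id is illustrative):

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.0.2 --model-id allenai/OLMo-2-1124-7B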

Full Changelog: v3.0.1...v3.0.2

v3.0.1

11 Dec 20:13
bb9095a

Summary

Patch release to handle a few older models and corner cases.

Full Changelog: v3.0.0...v3.0.1

v3.0.0

09 Dec 20:22
8f326c9

TL;DR

Major new release, centered on prefill chunking and bringing large performance improvements; see the benchmarks and details below.

[Benchmark chart: benchmarks_v3]

Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking

Full Changelog: v2.4.1...v3.0.0

v2.4.1

22 Nov 17:35
d2ed52f

Notable changes

  • Choose input/total tokens automatically based on available VRAM (see the sketch after this list)
  • Support for Qwen2 VL
  • Decreased latency of very large batches (> 128 requests)
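
A minimal sketch of the first change: with v2.4.1 the --max-input-tokens and --max-total-tokens flags can simply be left out, and TGI derives suitable values from the available VRAM (the model id is illustrative):

# No explicit token limits: TGI picks the input/total token budget from free VRAM.
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.4.1 --model-id Qwen/Qwen2-VL-7B-Instruct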

Full Changelog: v2.4.0...v2.4.1

v2.4.0

25 Oct 21:14
0a655a0

Notable changes

  • Experimental prefill chunking (PREFILL_CHUNKING=1; see the sketch after this list)
  • Experimental FP8 KV cache support
  • Greatly decreased latency for large batches (> 128 requests)
  • Faster MoE kernels and support for GPTQ-quantized MoE
  • Faster implementation of Mllama
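
A minimal sketch of enabling the experimental features above. PREFILL_CHUNKING=1 comes from these notes; the FP8 KV cache flag name is an assumption and may differ:

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    -e PREFILL_CHUNKING=1 \
    ghcr.io/huggingface/text-generation-inference:2.4.0 \
    --model-id $model --kv-cache-dtype fp8_e5m2  # kv-cache flag assumed, check --help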

v2.3.1

03 Oct 13:01
a094729

Important changes

  • Added support for Mllama (Llama 3.2 vision models), flash-attention based and unpadded.
  • FP8 performance improvements
  • MoE performance improvements
  • BREAKING CHANGE: when using tools, models could previously answer with a notify_error tool call whose content was the error; they will now fall back to regular generation instead (see the request sketch after this list).
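
For context, a typical tools request is sketched below (function name and schema are illustrative). Previously, a model that could not comply might answer with a notify_error tool call; from v2.3.1 on it produces a regular text completion instead:

curl localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                }
            }
        }]
    }'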

Full Changelog: v2.3.0...v2.3.1

v2.3.0

20 Sep 16:20
169178b

Important changes

  • Renamed HUGGINGFACE_HUB_CACHE to HF_HOME, to harmonize environment variables across the HF ecosystem. As a result, data locations moved from /data/models-.... to /data/hub/models-.... on the Docker image.

  • Prefix caching by default! To help with long-running queries, TGI uses prefix caching to reuse pre-existing queries in the KV cache and speed up TTFT. This should be completely transparent for most users; however, it required an intense rewrite of internals, so bugs can potentially exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that flashinfer doesn't support). A sketch of opting out follows this list.

  • Lots of performance improvements with Marlin and quantization.
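
Since prefix caching and the new kernels are on by default, reverting to the previous behaviour is a matter of environment variables. A minimal sketch; the variable names follow TGI's documentation of this era, but treat them as assumptions:

# Disable prefix caching and fall back to flashdecoding attention (names assumed).
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    -e USE_PREFIX_CACHING=0 -e ATTENTION=flashdecoding \
    ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id $model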

v2.2.0

23 Jul 16:30

Notable changes

  • Llama 3.1 support (including 405B, with FP8 support in many mixed configurations: FP8, AWQ, GPTQ, FP8+FP16).
  • Gemma2 softcap support
  • Deepseek v2 support.
  • Lots of internal reworks/cleanup (allowing for cool features)
  • Lots of AWQ/GPTQ work with Marlin kernels (everything should be faster by default)
  • Flash decoding support (via the FLASH_DECODING=1 environment variable, which should enable some nice improvements in the future; see the sketch after this list)
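
A minimal sketch of turning flash decoding on (the model id is illustrative):

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    -e FLASH_DECODING=1 \
    ghcr.io/huggingface/text-generation-inference:2.2.0 --model-id meta-llama/Meta-Llama-3.1-8B-Instruct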

Full Changelog: v2.1.1...v2.2.0