What's Changed
- other: sync main with dev by @rebel-eunji in #580
- feature(sub-block): emit sub-block-granular KV cache events by @rebel-jaehwang in #553
- fix(core): account for draft model memory in speculative decoding by @junstar92 in #546
- other(ci): dispatch Merge CI on push to dev by @rebel-jinhwan in #570
- fix(lora): sync with vLLM 0.18.0 and update LoRA tests by @junstar92 in #504
- fix(pooling): align with vLLM 0.18 PoolingMetadata + LLM API by @rebel-jinhwan in #582
- other: trigger fsw-integration nightly e2e via repository_dispatch by @rebel-jinhwan in #545
- fix(test): fix e2e pytest CI failures caused by LoRA OOM and platform path mismatch by @rebel-jinhwan in #586
- fix: handle chunked prefill with speculative decoding by @junstar92 in #554
- fix(kv_connector): prevent double-finalizing the KV connector when using spec decoding by @rebel-jinhwan in #575
- other: log KV cache layout, warm-up phases, rbln backend invocations by @rebel-jaehwang in #589
- refactor: compile optimum models internally by @rebel-eunji in #538
- feature(core): "rbln" device tensor by @rebel-jonghewk in #548
- fix: rbln_config name(device) in from_optimum by @rebel-seinpark in #598
- fix(model): add moe custom op args - scoring_func by @rebel-thkim in #590
- other: auto-update optimum-rbln to 0.10.4a0 by @rebel-develop in #600
- feature: add no_export_fallback mode by @rebel-daeyang in #601
- fix: ci log level by @rebel-seinpark in #607
- fix(disagg_encoder): handle mixed input scenario and fix potential memory leak by @rebel-yskim in #608
- fix: allow the block size to be omitted for enc & enc-dec models by @rebel-eunji in #602
- fix(platform): set device attrs at class definition for spawn compatibility by @rebel-jaehwang in #609
- fix: sub-block cache copy compat with device tensor by @rebel-jaehwang in #611
- refactor: deduplicate _without_outlier with TypeVar by @rebel-eunji in #606
- fix(whisper): remove INVALID_TOKEN workaround and add missing feature by @rebel-eunji in #597
- fix(sampler): default temperature to 1.0 and use torch.zeros to avoid NaN logits by @rebel-eunji in #615
- other: auto-update optimum-rbln to 0.10.4a1 by @rebel-develop in #617
- feature: add vmemory performance metrics by @rebel-daeyang in #587
- fix(sampler): decouple greedy and topk-topp sampling by @rebel-eunji in #571
- other: auto-update optimum-rbln to 0.10.4rc0 by @rebel-develop in #620
- fix: add warmup config and fix logits dtype casting in apply_temperature by @rebel-eunji in #621
- fix(sampler): guard padded-row sampling tensors against torch.empty garbage by @rebel-eunji in #623
- other(platform): default max_num_seqs to 1 by @rebel-eunji in #625
- other: auto-update optimum-rbln to 0.10.4 by @rebel-develop in #633
- release: v0.10.4 by @rebel-seinpark in #634
New Contributors
- @rebel-daeyang made their first contribution in #601
Full Changelog: v0.10.3...v0.10.4