
Support FP8-E5M2 KV Cache #2279

Merged: 43 commits, Jan 29, 2024

Conversation

zhaoyang-star
Contributor

@zhaoyang-star zhaoyang-star commented Dec 27, 2023

Quantizing the KV cache to FP8 can reduce its memory usage and thus boost throughput. The implementation uses an FP8 data type for the KV cache and has been tested on A100.

The following accuracy results were measured with WarzardCoder-34B.

| Dataset | Baseline (KV Cache FP16) | KV Cache FP8 E5M2 | KV Cache FP8 E4M3 |
| --- | --- | --- | --- |
| HumanEval-Python-EN | 68.293% | 65.854% (↓ 2.439%) | 67.683% (↓ 0.61%) |
| HumanEval-Python-CN | 59.146% | 59.146% (=) | 59.756% (↑ 0.61%) |

| LLaMA-7B | Baseline (KV Cache FP16) | KV Cache FP8 E5M2 | Speedup |
| --- | --- | --- | --- |
| Offline throughput (tokens/sec) | 1514.35 | 2265.89 | 1.49x |

Usage:

    from vllm import LLM, SamplingParams
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")
    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
  • Throughput: Offline throughput increases because the KV cache can hold roughly twice as many tokens in the same memory. If there are enough concurrent online requests, online throughput improves as well.
  • Latency: The paged-attention kernel may get slower because of the quantize/dequantize steps for the cache, especially with fp8-e4m3, so we use fp8-e5m2 as the default.
  • Accuracy: We used HumanEval to evaluate the impact of FP8 and found that both E5M2 and E4M3 are acceptable. In general, use E4M3 if you want higher accuracy, but be aware that E4M3 also increases latency, since E4M3 may cost more cycles than E5M2 when casting from fp16/bf16/float (a quick comparison sketch follows below).
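
For a concrete feel of the accuracy difference between the two formats, here is a small round-trip sketch (not part of this PR; it assumes PyTorch >= 2.1, which exposes torch.float8_e5m2 and torch.float8_e4m3fn):

    import torch

    # Cast an fp16 tensor to each FP8 variant and back, then compare.
    x = torch.randn(4096, dtype=torch.float16) * 4.0
    for fp8_dtype in (torch.float8_e5m2, torch.float8_e4m3fn):
        y = x.to(fp8_dtype).to(torch.float16)
        rel_err = ((x - y).abs() / x.abs().clamp_min(1e-3)).max()
        print(f"{fp8_dtype}: max relative error = {rel_err.item():.4f}")
    # E4M3 keeps one extra mantissa bit (finer rounding) but has a much smaller
    # dynamic range (~±448) than E5M2 (~±57344), which is why E4M3 usually
    # wants a per-tensor scaling factor.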

@irasin
Contributor

irasin commented Dec 27, 2023

LGTM, I was wondering about the performance improvement.
And can we run the fp8 intrinsic on Volta/Ampere/Ada arch or is it just Hopper only?

@irasin
Contributor

irasin commented Dec 27, 2023

Also, which of E5M2 and E4M3 should we use for better precision and performance? I guess this may depend on the specific model.

@casper-hansen
Contributor

This seriously looks good. Is RTN used for the kv-cache quantization?

@zhaoyang-star
Contributor Author

LGTM, I was wondering about the performance improvement. And can we run the fp8 intrinsic on Volta/Ampere/Ada arch or is it just Hopper only?

It is not limited to Hopper. Volta/Ampere both work and have been tested. The fp8 intrinsic uses ASM directly for the data type conversion on Hopper, while it uses bit operations on pre-Hopper GPUs.
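
As a rough illustration of that pre-Hopper path, the sketch below emulates the fp16 -> E5M2 conversion with plain bit operations in NumPy (a simplified truncating version for illustration only; the actual intrinsic also handles rounding and NaN):

    import numpy as np

    def fp16_to_e5m2_bits(x: np.ndarray) -> np.ndarray:
        # fp16 is 1 sign + 5 exponent + 10 mantissa bits; E5M2 shares the same
        # sign/exponent layout, so keeping the top byte drops 8 mantissa bits.
        return (x.astype(np.float16).view(np.uint16) >> 8).astype(np.uint8)

    def e5m2_bits_to_fp16(b: np.ndarray) -> np.ndarray:
        # Re-expand the stored byte to fp16 by zero-filling the low mantissa bits.
        return (b.astype(np.uint16) << 8).view(np.float16)

    x = np.array([0.1, 1.5, -3.0, 1000.0], dtype=np.float16)
    print(e5m2_bits_to_fp16(fp16_to_e5m2_bits(x)))  # values truncated to 2 mantissa bits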

@zhaoyang-star
Contributor Author

zhaoyang-star commented Dec 29, 2023

RTN

RoundToNearest is not used in this implementation. The implementation uses the CUDA fp8 intrinsics, such as __nv_cvt_fp8_to_halfraw and __nv_cvt_bfloat16raw_to_fp8. I think the CUDA fp8 intrinsics are more general than RTN, as they are supported on both Hopper and pre-Hopper GPUs.

@zhaoyang-star zhaoyang-star changed the title from "[WIP] Support FP8 KV Cache" to "Support FP8 KV Cache" on Dec 29, 2023
@zhaoyang-star
Contributor Author

zhaoyang-star commented Dec 29, 2023

Below are tested on A100-40GB:

Offline throughput:

[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_throughput.py --input-len 1024 --output-len 1024 --model /models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/
Namespace(backend='vllm', dataset=None, dtype='auto', enforce_eager=False, hf_max_batch_size=None, input_len=1024, max_model_len=None, model='/models/huggingface/LLM/llama-7B-hf/', n=1, num_prompts=1000, output_len=1024, quantization=None, seed=0, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 05:45:54 llm_engine.py:74] Initializing an LLM engine with config: model='/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=None, seed=0)
INFO 12-29 05:46:12 llm_engine.py:230] # GPU blocks: 2802, # CPU blocks: 512
INFO 12-29 05:46:17 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 05:46:17 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 05:46:31 model_runner.py:449] Graph capturing finished in 14 secs.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [22:08<00:00,  1.33s/it]
Throughput: 0.75 requests/s, 1541.35 tokens/s
[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_throughput.py --input-len 1024 --output-len 1024 --model /models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/ --kv-cache-dtype="fp8"
Namespace(backend='vllm', dataset=None, dtype='auto', enforce_eager=False, hf_max_batch_size=None, input_len=1024, kv_cache_dtype='fp8', max_model_len=None, model='/models/huggingface/LLM/llama-7B-hf/', n=1, num_prompts=1000, output_len=1024, quantization=None, seed=0, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 06:16:00 llm_engine.py:74] Initializing an LLM engine with config: model='/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=torch.uint8, seed=0)
INFO 12-29 06:16:13 llm_engine.py:230] # GPU blocks: 5605, # CPU blocks: 1024
INFO 12-29 06:16:21 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 06:16:21 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 06:16:41 model_runner.py:449] Graph capturing finished in 20 secs.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [15:03<00:00,  1.11it/s]
Throughput: 1.11 requests/s, 2265.89 tokens/s
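
The jump from 2802 to 5605 GPU blocks is consistent with halving the cache element size. A quick back-of-the-envelope check for LLaMA-7B (32 layers, 32 KV heads, head size 128, vLLM's default block size of 16 tokens):

    # Bytes per KV-cache block (key + value) for LLaMA-7B.
    num_layers, num_kv_heads, head_size, block_size = 32, 32, 128, 16

    def block_bytes(elem_bytes: int) -> int:
        return 2 * num_layers * num_kv_heads * head_size * block_size * elem_bytes

    print(block_bytes(2) / 2**20)  # fp16 cache: 8.0 MiB per block
    print(block_bytes(1) / 2**20)  # fp8 (uint8 storage): 4.0 MiB per block
    # The same memory budget therefore holds about 2x as many blocks,
    # matching the 2802 -> 5605 change in the logs above.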

Latency:

[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_latency.py --input-len 1024 --output-len 1024 --model /shared/models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/
Namespace(batch_size=8, dtype='auto', enforce_eager=False, input_len=1024, kv_cache_dtype=None, model='/shared/models/huggingface/LLM/llama-7B-hf/', n=1, num_iters=3, output_len=1024, profile=False, profile_result_dir=None, quantization=None, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 07:01:41 llm_engine.py:74] Initializing an LLM engine with config: model='/shared/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=None, seed=0)
INFO 12-29 07:01:53 llm_engine.py:230] # GPU blocks: 2802, # CPU blocks: 512
INFO 12-29 07:01:55 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 07:01:55 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 07:02:01 model_runner.py:449] Graph capturing finished in 6 secs.
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:56<00:00, 18.78s/it]
Avg latency: 18.779154599333804 seconds
[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_latency.py --input-len 1024 --output-len 1024 --model /shared/models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/ --kv-cache-dtype="fp8"
Namespace(batch_size=8, dtype='auto', enforce_eager=False, input_len=1024, kv_cache_dtype='fp8', model='/shared/models/huggingface/LLM/llama-7B-hf/', n=1, num_iters=3, output_len=1024, profile=False, profile_result_dir=None, quantization=None, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 07:13:48 llm_engine.py:74] Initializing an LLM engine with config: model='/shared/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=torch.uint8, seed=0)
INFO 12-29 07:13:55 llm_engine.py:230] # GPU blocks: 5605, # CPU blocks: 1024
INFO 12-29 07:13:57 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 07:13:57 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 07:14:02 model_runner.py:449] Graph capturing finished in 5 secs.
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:52<00:00, 17.37s/it]
Avg latency: 17.37384683514635 seconds

@zhaoyang-star
Contributor Author

zhaoyang-star commented Dec 29, 2023

@WoosukKwon @zhuohan123 The PR is ready for review. Could you please take some time to review the code? Thanks a lot.

@seanxcwang

seanxcwang commented Jan 3, 2024

(WeCom screenshot of the error attached.)
I got an error when testing on this branch. Adding AT_DISPATCH_CASE(at::ScalarType::Byte, __VA_ARGS__) at csrc/dispatch_utils.h:12 may not be a good way to fix this error.

@zhaoyang-star
Contributor Author

zhaoyang-star commented Jan 3, 2024

@seanxcwang Thanks for your feedback. We need to handle the torch.uint8 dtype in the cache ops (copy, swap). I will fix it ASAP.

@zhaoyang-star
Contributor Author

zhaoyang-star commented Jan 3, 2024

(WeCom screenshot) I got an error when testing on this branch. Adding AT_DISPATCH_CASE(at::ScalarType::Byte, __VA_ARGS__) at csrc/dispatch_utils.h:12 may not be a good way to fix this error.

Fixed. @seanxcwang could you please use the latest PR to test? Thanks again.

@seanxcwang

@zhaoyang-star I have used the new PR for testing; no other errors were found.

@zhaoyang-star
Contributor Author

@zhuohan123 @WoosukKwon The PR is ready for review. Could you please take time to review the code?

@junior-zsy

@zhuohan123 @WoosukKwon The PR is ready for review. Could you please take time to review the code?

I hope it can be merged, which is very useful for large models

@zhaoyang-star
Contributor Author

@tjtanaa @hongxiayang We use the CUDA Math API, such as __nv_cvt_fp8_to_halfraw, to do the data type conversion, so I guess it will fail when running on AMD GPUs. I think there are corresponding functions in HIP. We could support it in a follow-up PR.

@HaiShaw HaiShaw (Contributor) left a comment

Some comments below:

  1. E4M3 is the only FP8 type commonly used (and needed) during inference or the model forward path; using E5M2 in the forward path is rare.
  2. Most FP8 serving and inference setups come with scaled tensor quantization, for either parameters or activations (the KV cache being part of the latter). Using saturate-to-finite without scaling isn't common in practice and may incur performance issues in general (see the sketch below).
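
For readers unfamiliar with the term, a per-tensor scaled quantization scheme of the kind referred to above looks roughly like this (a generic sketch assuming PyTorch >= 2.1, not code from this PR):

    import torch

    E4M3_MAX = 448.0  # largest finite magnitude of torch.float8_e4m3fn

    def quantize_e4m3_scaled(x: torch.Tensor):
        # Map the tensor onto the finite E4M3 range before casting.
        scale = x.abs().max().clamp_min(1e-12) / E4M3_MAX
        return (x / scale).to(torch.float8_e4m3fn), scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.to(torch.float32) * scale

    x = torch.randn(1024) * 100.0
    q, scale = quantize_e4m3_scaled(x)
    print((dequantize(q, scale) - x).abs().max())  # small reconstruction error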

@@ -220,6 +220,8 @@ def _paged_attention(
) -> torch.Tensor:
output = torch.empty_like(query)

enable_fp8_kv_cache = key_cache.dtype == torch.uint8

Contributor

Would this unnecessarily invalidate 8-bit KV cache formats other than FP8?

Collaborator

+1 Can we get this from model config?

Contributor Author

+1 Can we get this from model config?

Sure, have fixed.
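
For illustration, the config-driven check being discussed is roughly of this shape (hypothetical helper name, not the PR's exact code):

    def use_fp8_kv_cache(kv_cache_dtype: str) -> bool:
        # Decide from the configured cache dtype string rather than from
        # key_cache.dtype, so other 8-bit formats stored as torch.uint8
        # are not misclassified as FP8.
        return kv_cache_dtype == "fp8"

    print(use_fp8_kv_cache("fp8"))   # True
    print(use_fp8_kv_cache("auto"))  # False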

@zhaoyang-star
Contributor Author

  • E4M3 is the only FP8 type commonly used (and needed) during inference or the model forward path; using E5M2 in the forward path is rare.

Thanks for your review.

  1. The main reason E4M3 is not used is that E4M3 is much slower than E5M2 on pre-Hopper GPUs. For example, using benchmarks/benchmark_latency.py with --input-len 1024 --output-len 1024 on A100-40GB, E4M3 is about 70% slower than FP16! The E4M3->half conversion needs more bit operations on pre-Hopper GPUs, while it is a single assembly instruction, cvt.rn.f16x2.e4m3x2, on Hopper GPUs. So I made E5M2 the default fp8 data type.

| LLaMA-7B | Baseline (KV Cache FP16) | KV Cache FP8-E5M2 | KV Cache FP8-E4M3 |
| --- | --- | --- | --- |
| Latency (sec) | 18.78 | 17.37 | 31.77 |

  2. Yes, E4M3 (data range [-448., 448.]) does need a scaling param to avoid accuracy loss. E5M2 may not need a scaling param.
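
For reference, the finite ranges quoted above can be checked directly (assuming PyTorch >= 2.1 for the float8 dtypes):

    import torch

    for dt in (torch.float8_e5m2, torch.float8_e4m3fn):
        fi = torch.finfo(dt)
        print(dt, "max finite =", fi.max)
    # E5M2: (2 - 2**-2) * 2**15 = 57344 (keeps IEEE-style inf/NaN encodings)
    # E4M3 ("fn" variant): 1.75 * 2**8 = 448 (no inf; the top mantissa code is NaN)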

type=str,
choices=['fp8', None],
default=None,
help='Data type for kv cache storage.')
Collaborator

Suggested change
help='Data type for kv cache storage.')
help='Data type for kv cache storage. If None, will use model data type.')

Contributor Author

Fixed.

@zhaoyang-star
Contributor Author

@zhuohan123 Thanks for the review. I applied your suggestion.

@zhuohan123 zhuohan123 (Collaborator) left a comment

Thanks! Left final two minor comments (I hope these are really the final comments). Can you merge with the main branch and let's see how CI goes?

vllm/config.py Outdated
"""

def __init__(
self,
block_size: int,
gpu_memory_utilization: float,
swap_space: int,
cache_dtype_str: str,
Collaborator

Let's just call it cache_dtype? The _str suffix seems unnecessary to me.

Contributor Author

Sure. Fixed.

@@ -36,6 +36,7 @@ def __init__(
rank: int,
distributed_init_method: str,
lora_config: Optional[LoRAConfig] = None,
cache_config: Optional[CacheConfig] = None,
Collaborator

This change is weird. Originally we set the cache_config in self.init_cache_engine() (as in line 60 below). This change introduces two cache_config objects, which is super confusing.

The reason we delay the initialization of the cache_config is that cache_config includes the number of KV blocks, which can only be known after memory profiling.

To make things more clear, I think we can just feed in kv_cache_dtype here.

Contributor Author

Your suggestion works for me. Changed cache_config: Optional[CacheConfig] = None to kv_cache_dtype: Optional[str] = "auto".
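
A minimal sketch of the resulting shape of the worker (hypothetical and heavily simplified): only the dtype string is known at construction time, and the full CacheConfig is still attached later in init_cache_engine once profiling has determined the block counts.

    from typing import Optional

    class Worker:
        def __init__(self, kv_cache_dtype: Optional[str] = "auto") -> None:
            self.kv_cache_dtype = kv_cache_dtype
            self.cache_config = None  # set after memory profiling

        def init_cache_engine(self, cache_config) -> None:
            # cache_config carries the number of GPU/CPU blocks, which is only
            # known after profiling, so it arrives here rather than in __init__.
            self.cache_config = cache_config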

zhaoyang-star and others added 3 commits January 28, 2024 17:30
num_layers: int,
num_heads: int,
head_size: int,
cache_dtype: Optional[Union[str, torch.dtype]],
Collaborator

Would it be better to make the type Union[str, torch.dtype] here? Based on the implementation below, if this is None, the first set of if conditions at the beginning of the function will always end with a ValueError, right? So None is not really an option.

Contributor Author

Yes. It is better to use Union[str, torch.dtype] for cache_dtype. I will modify it soon.
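
To make the discussion concrete, the resolution logic is roughly of this shape (a simplified sketch, not the file's exact code): a str is mapped to a torch dtype, a torch.dtype passes through, and anything else raises ValueError, so None is indeed never a valid value.

    from typing import Union
    import torch

    _STR_TO_DTYPE = {"half": torch.half, "bfloat16": torch.bfloat16,
                     "float": torch.float, "fp8": torch.uint8}

    def resolve_cache_dtype(cache_dtype: Union[str, torch.dtype],
                            model_dtype: torch.dtype) -> torch.dtype:
        if isinstance(cache_dtype, str):
            if cache_dtype == "auto":
                return model_dtype
            return _STR_TO_DTYPE[cache_dtype]  # "fp8" is stored as uint8
        if isinstance(cache_dtype, torch.dtype):
            return cache_dtype
        raise ValueError(f"Invalid cache dtype: {cache_dtype}")

    print(resolve_cache_dtype("fp8", torch.float16))  # torch.uint8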

@ymwangg
Contributor

ymwangg commented Jan 31, 2024

Hi @zhaoyang-star, thanks for the great work! What sampling parameters did you use to get the HumanEval pass@1 score? I recently found I need to set frequency_penalty=0.1 to reproduce pass@1 = 0.53 for CodeLlama34B-Python. Pure greedy sampling without a frequency penalty only gives pass@1 = 0.40.

NikolaBorisov pushed a commit to deepinfra/vllm that referenced this pull request Jan 31, 2024
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
@zhaoyang-star zhaoyang-star deleted the fp8_cache branch February 1, 2024 08:55
@zhaoyang-star
Contributor Author

Hi @zhaoyang-star, thanks for the great work! What sampling parameters did you use to get the HumanEval pass@1 score? I recently found I need to set frequency_penalty=0.1 to reproduce pass@1 = 0.53 for CodeLlama34B-Python. Pure greedy sampling without a frequency penalty only gives pass@1 = 0.40.

I used a fine-tuned model based on the open-sourced WarzardCoder-34B. Sorry, the sampling parameters were not recorded, and I have not evaluated it with greedy sampling.

@enochlev

enochlev commented Feb 6, 2024

Some benchmarks generating 100 tokens:

There is a 5-15% boost in performance, where most of the gains happen with long input prompts.

I added an AWQ model with kv_fp8 as well.

(benchmark chart attached)

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
alexm-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Feb 13, 2024
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
@HaiShaw
Contributor

HaiShaw commented Feb 16, 2024

Hi @zhaoyang-star, thanks for the great work! What sampling parameters did you use to get the HumanEval pass@1 score? I recently found I need to set frequency_penalty=0.1 to reproduce pass@1 = 0.53 for CodeLlama34B-Python. Pure greedy sampling without a frequency penalty only gives pass@1 = 0.40.

I used a fine-tuned model based on the open-sourced WarzardCoder-34B. Sorry, the sampling parameters were not recorded, and I have not evaluated it with greedy sampling.

@zhaoyang-star Is it possible to share the test configuration / parameters for the following table? Thanks.
(screenshot of the HumanEval accuracy table from the PR description)

@zhaoyang-star
Contributor Author

@HaiShaw Thanks for your attention. The main configuration I used is as follows. Note that the WarzardCoder-34B I used is fine-tuned for internal use, so it may not be possible to open-source it.

I have seen your RFC #2461 about fp8 e4m3 with scaling factors. It is great work! I think fp8 with scaling factors will see a smaller accuracy drop than the current implementation in this PR.

{
    "max_tokens": 2048,
    "temperature": 0.2,
    "use_beam_search": false,
    "top_p": 1,
    "top_k": -1,
    "ignore_eos": false,
    "presence_penalty": 1.2,
    "frequency_penalty": 1.0
}
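
For anyone trying to reproduce this, those fields map directly onto the SamplingParams shown in the latency logs above, e.g.:

    from vllm import SamplingParams

    sampling_params = SamplingParams(
        max_tokens=2048,
        temperature=0.2,
        use_beam_search=False,
        top_p=1,
        top_k=-1,
        ignore_eos=False,
        presence_penalty=1.2,
        frequency_penalty=1.0,
    )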

@HaiShaw
Contributor

HaiShaw commented Feb 17, 2024

@zhaoyang-star Thanks for the info on the WarzardCoder-34B testing parameters.
By the way, we are working on enabling fp8 e4m3 with scaling factors on AMD chips.

@Time-Limit

Hello. Does the FP8 KV cache need a calibration dataset? How do I specify this dataset?

@zhaoyang-star
Contributor Author

zhaoyang-star commented Mar 6, 2024

Does the FP8 KV cache need a calibration dataset? How do I specify this dataset?

@Time-Limit The fp8-e5m2 in vLLM has no scaling factors, so a calibration dataset is not needed. The docs describe how to use this feature. Please feel free to reach out to me if you run into any trouble.

@HaiShaw
Contributor

HaiShaw commented Mar 6, 2024

Hello. Does the FP8 KV cache need a calibration dataset? How do I specify this dataset?

There is a reference to the quantizer tool in #2461.
Btw, we will send a pull request for an fp8 KV cache with scaling factors soon.

The short answer is that the quantizer and its utility would enable you to quantize and compute scaling factors over your assigned calibration dataset (e.g. cnnmail, or your domain-specific data).
Also, see RFC: FP8 Quantization Schema in vLLM #3218.
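
For intuition, such a calibration pass boils down to tracking a running amax per tensor over the calibration data and turning it into a scaling factor (a generic sketch, not the quantizer tool itself; the tensor names and data here are made up):

    import torch

    E4M3_MAX = 448.0

    def calibrate_kv_scales(kv_batches):
        # kv_batches: iterable of dicts mapping tensor name -> captured K/V tensor.
        amax = {}
        for batch in kv_batches:
            for name, t in batch.items():
                m = t.detach().abs().max()
                amax[name] = m if name not in amax else torch.maximum(amax[name], m)
        return {name: (m / E4M3_MAX).item() for name, m in amax.items()}

    # Dummy data standing in for K/V activations captured on a calibration set.
    batches = [{"layer0.k": torch.randn(8, 128), "layer0.v": torch.randn(8, 128)}
               for _ in range(4)]
    print(calibrate_kv_scales(batches))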
