
Running out of memory with TheBloke/CodeLlama-7B-AWQ #1479

Closed
bonuschild opened this issue Oct 26, 2023 · 20 comments

Comments

@bonuschild

Test on llm-vscode-inference-server

I use the llm-vscode-inference-server project, which builds on vllm, to load the model weights from CodeLlama-7B-AWQ with this command:

python api_server.py --trust-remote-code --model ../CodeLlama-7B-AWQ --quantization awq --dtype half --max-model-len 512

And the output:

WARNING 10-26 12:34:54 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:34:54 llm_engine.py:72] Initializing an LLM engine with config: model='../CodeLlama-7B-AWQ', tokenizer='../CodeLlama-7B-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:34:54 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.

Then, after about 5 minutes, it outputs:

INFO 10-26 12:39:51 llm_engine.py:207] # GPU blocks: 793, # CPU blocks: 512
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 12.00 GiB total capacity; 8.49 GiB already allocated; 1.53 GiB free; 8.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I had set PYTORCH_CUDA_ALLOC_CONF before executing the run command above, but still got the error:

 set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:100

Test on vllm

I simply change the command to:

python3 -m vllm.entrypoints.openai.api_server --model TheBloke/CodeLlama-7B-Python-AWQ --quantization awq --dtype half

Without --dtype half it raises an error like:

ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

With --dtype half, the output is:

WARNING 10-26 12:44:31 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:44:31 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-7B-AWQ/', tokenizer='./CodeLlama-7B-AWQ/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:44:31 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.

then the error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.38 GiB (GPU 0; 12.00 GiB total capacity; 4.17 GiB already allocated; 5.94 GiB free; 4.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

System Resources Usage

Before I execute the command, my RTX 3060 VRAM usage is 1.5/12 GB; after executing it, usage rises to 6.0/12 GB, and then the OutOfMemoryError above is thrown after about 5 minutes.

Question

I'm just confused: the AWQ model is only <=4 GB, so why can it not run on an NVIDIA RTX 3060 with 12 GB of VRAM?
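
A rough back-of-envelope sketch (assuming CodeLlama-7B's usual config of 32 transformer layers and hidden size 4096, with an fp16 KV cache; this is editorial arithmetic, not vLLM internals) suggests the preallocated KV cache, rather than the ~4 GB of weights, is what fills the card:

# Back-of-envelope KV-cache arithmetic (a sketch, not vLLM internals).
layers = 32
hidden = 4096              # width of each per-layer K (or V) vector
bytes_per_elem = 2         # fp16
kv_per_token = 2 * layers * hidden * bytes_per_elem      # K and V for every layer
print(f"KV cache per token: {kv_per_token / 2**20:.2f} MiB")                 # ~0.50 MiB
print(f"One 16384-token sequence: {kv_per_token * 16384 / 2**30:.1f} GiB")   # ~8.0 GiB

On top of the weights, vLLM also preallocates cache blocks up to --gpu-memory-utilization (0.9 by default), so VRAM usage climbs well past the size of the weights; lowering --max-model-len and --gpu-memory-utilization, as the replies below suggest, is the usual workaround.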

@amir-in-a-cynch

amir-in-a-cynch commented Oct 26, 2023

When testing on vllm, did you try --max-model-len 512? It looks from your output that it went to 16384.


@casper-hansen
Contributor

Isn't it --max_model_len or am I mistaken? Btw, the 7B model should definitely fit into 512 context.


@bonuschild
Author

Isn't it --max_model_len or am I mistaken? Btw, the 7B model should definitely fit into 512 context.

I've re-tested this on an A100 instead of the RTX 3060, and it ends up occupying about 20+ GB of VRAM! Why is that?
I use the command:

python api_server.py --model path/to/7b-awq/model --port 8000 -q awq --dtype half --trust-remote-code

That is so weird...

@SupreethRao99

Seconding @bonuschild's error output: trying to run Mistral-7B on a T4 with 16 GB VRAM after I quantised it with AWQ still causes CUDA OOM errors.

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=2048)
llm = LLM(model="zephyr-7b-beta-awq", quantization="awq", dtype="float16")
outputs = llm.generate(prompts, sampling_params)
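
For what it's worth, here is a minimal variant of the snippet above (a sketch, not verified on a T4) that reins the engine in through the knobs discussed later in this thread; gpu_memory_utilization is a regular LLM constructor argument, and max_model_len is assumed to be forwarded to the engine arguments, mirroring the --max-model-len CLI flag:

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=2048)
llm = LLM(
    model="zephyr-7b-beta-awq",
    quantization="awq",
    dtype="float16",
    gpu_memory_utilization=0.8,  # leave headroom instead of the 0.9 default
    max_model_len=4096,          # instead of the 32k default baked into the checkpoint
)
outputs = llm.generate(prompts, sampling_params)  # `prompts` as in the original snippet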

@bonuschild
Author

@amir-in-a-cynch @casper-hansen @SupreethRao99 @tmm1

I use the AWQ model made by @TheBloke from https://huggingface.co/TheBloke/CodeLlama-7B-AWQ

According to its instructions, I should run this command:

python -m vllm.entrypoints.api_server --model TheBloke/CodeLlama-7B-AWQ --quantization awq

but it raises a datatype-not-supported error:

ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

So I change the command and rerun:

python -m vllm.entrypoints.openai.api_server --model path/to/CodeLlama-7B-AWQ -q awq --dtype half

and it raises an OOM error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 12.00 GiB total capacity; 3.38 GiB already allocated; 6.99 GiB free; 3.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This time it seems there really isn't enough VRAM...

So why does a 4 GB AWQ model require more than 12 GB of VRAM to run?

  • My GPU card is an RTX 3060 (12 GB VRAM).
  • I've tested on an A100 with 40 GB VRAM, and running this model with vLLM actually takes about 22 GB of VRAM.

@slobodaapl

Can confirm this is an issue; I tried on an A100 with stock vLLM (no fork) and am facing the same problem.

@thr3a

thr3a commented Nov 3, 2023

Same issue

  • Python 3.11
  • Docker nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
  • transformers==4.35.0
  • torch==2.0.1+cu118
  • torchvision==0.15.2+cu118

@manishiitg

Try adding --gpu-memory-utilization 0.8; this worked for me.

@slobodaapl

I think I found a potential issue and solution. This is specifically because of how vLLM works.

Setting 'max_batch_tokens' (I think that's the name) too high causes the KV cache to be too big; it directly influences the GPU memory occupied. Try setting your max_batch_tokens to around 32k while keeping everything else the same.

This fixed it for me.

@SupreethRao99

@slobodaapl, could you be more exact about the parameter that should be changed? I don't seem to find anything similar to max_batch_tokens in the arguments for SamplingParams or the LLM class.

Thank you!
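
As a hedged pointer (a sketch, not checked against every vLLM version): the knob is an engine argument rather than a SamplingParams field. On the server it is the --max-num-batched-tokens flag shown in the next reply; with the offline LLM class, extra keyword arguments are assumed to be forwarded to the engine arguments:

from vllm import LLM

llm = LLM(
    model="zephyr-7b-beta-awq",
    quantization="awq",
    dtype="float16",
    max_num_batched_tokens=4096,  # assumed to be forwarded to the engine arguments
    gpu_memory_utilization=0.8,
)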

@slobodaapl

@SupreethRao99 Sorry, I had to find it. This is how I start my vLLM OpenAI server:

python -m vllm.entrypoints.openai.api_server \
        --served-model $MODEL_ID \
        --model $MODEL_ID \
        --tensor-parallel-size 4 \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768

When I reduced max-num-batched-tokens down to 32768 from the high number I had previously, I no longer experienced CUDA memory errors. Try setting yours low as well and see if it helps.

@bonuschild
Author

@SupreethRao99 Sorry, I had to find it. This is how I start my vLLM OpenAI server:

python -m vllm.entrypoints.openai.api_server \
        --served-model $MODEL_ID \
        --model $MODEL_ID \
        --tensor-parallel-size 4 \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768

When I reduced max-num-batched-tokens down to 32768 from the high number I had previously, I no longer experienced CUDA memory errors. Try setting yours low as well and see if it helps.

I used the command as you provided, but it still costs 21 GB of VRAM when loading a 7B-AWQ model :(

python -m vllm.entrypoints.openai.api_server \
        --model $MODEL_ID \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768 \
        -q awq --dtype half --trust-remote-code

@slobodaapl

For those with very limited VRAM, try setting the batched tokens to about 4-8k and combine it with the memory-limit parameter at about 0.8.

Also, try the non-quantised version with this first. It seems vLLM uses extra memory to do some kind of operation on the model when loading it quantised.

@gesanqiu
Contributor

gesanqiu commented Nov 9, 2023

Besides --max-num-batched-tokens and --gpu-memory-utilization, I also limit --max-num-seqs.

@demegire

For anybody stumbling here, be sure to check max_seq_len; for some reason the default was 32768 in TheBloke/zephyr-7B-beta-AWQ.
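
One way to check what length vLLM will pick up by default is to read it out of the checkpoint's config (a sketch; it assumes the checkpoint exposes max_position_embeddings, which vLLM uses unless --max-model-len overrides it):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("TheBloke/zephyr-7B-beta-AWQ")
print(cfg.max_position_embeddings)  # 32768 for this checkpoint, per the comment above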

@bonuschild
Author

@demegire Agreed, and one needs to find out the correct max sequence length, which is normally not mentioned in the model card :)

@Jaykumaran

!python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --dtype half \
    --max-num-batched-tokens 4096 \
    --max-model-len 256 \
    --quantization awq \
    --tensor-parallel-size 1 \
    --port 8010 | grep -q "Uvicorn running" & npx localtunnel --port 8010

This worked on a T4 in Colab with CUDA 12.2.
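
A quick smoke test against that server (a sketch; it assumes the server is reachable on localhost:8010 and that the model name matches the --model argument above):

import requests

resp = requests.post(
    "http://localhost:8010/v1/completions",
    json={
        "model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
        "prompt": "def fibonacci(n):",
        "max_tokens": 64,
        "temperature": 0.0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])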

@Jaykumaran

Jaykumaran commented Dec 24, 2023

Does anyone have an idea about using RAGAS evaluation with a vLLM server for Hugging Face models? I was successfully serving my model, but the ragas metrics evaluation does not recognise the vLLM server and always asks for an OpenAI API key.

from ragas import evaluate
from ragas.metrics import faithfulness

result = evaluate(
    fiqa_eval["baseline"].select(range(5)),  # showing only 5 for demonstration
    metrics=[faithfulness],
)

result

OpenAIKeyNotFound: OpenAI API key not found! Seems like your trying to use Ragas metrics with OpenAI endpoints. Please set 'OPENAI_API_KEY' environment variable
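
One workaround worth trying (a sketch, untested and version-dependent): ragas routes its metric calls through the OpenAI client, so pointing that client at the vLLM OpenAI-compatible server and supplying a dummy key may satisfy the check. Whether a given ragas version honours OPENAI_API_BASE is an assumption to verify.

import os

os.environ["OPENAI_API_KEY"] = "EMPTY"                      # placeholder; vLLM does not check it
os.environ["OPENAI_API_BASE"] = "http://localhost:8010/v1"  # the vLLM server from the earlier comment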

@hmellor hmellor closed this as completed Mar 25, 2024