
Running out of memory with TheBloke/CodeLlama-7B-AWQ #1479

Closed
bonuschild opened this issue Oct 26, 2023 · 20 comments

Comments

@bonuschild

Test on llm-vscode-inference-server

I use the llm-vscode-inference-server project, which builds on vllm, to load the model weights from CodeLlama-7B-AWQ with this command:

python api_server.py --trust-remote-code --model ../CodeLlama-7B-AWQ --quantization awq --dtype half --max-model-len 512

And the output:

WARNING 10-26 12:34:54 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:34:54 llm_engine.py:72] Initializing an LLM engine with config: model='../CodeLlama-7B-AWQ', tokenizer='../CodeLlama-7B-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:34:54 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.

Then, after about 5 minutes, it outputs:

INFO 10-26 12:39:51 llm_engine.py:207] # GPU blocks: 793, # CPU blocks: 512
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 12.00 GiB total capacity; 8.49 GiB already allocated; 1.53 GiB free; 8.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I had set PYTORCH_CUDA_ALLOC_CONF before executing the run command above, but still got the error:

 set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:100

Test on vllm

I simply change the command to:

python3 -m vllm.entrypoints.openai.api_server --model TheBloke/CodeLlama-7B-Python-AWQ --quantization awq --dtype half

Without --dtype half it raises an error like:

ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

With --dtype half, the output is:

WARNING 10-26 12:44:31 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:44:31 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-7B-AWQ/', tokenizer='./CodeLlama-7B-AWQ/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:44:31 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.

then the error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.38 GiB (GPU 0; 12.00 GiB total capacity; 4.17 GiB already allocated; 5.94 GiB free; 4.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

System Resources Usage

Before I execute the command, my RTX 3060 VRAM usage is 1.5/12 GB; after executing it, usage rises to 6.0/12 GB, and then the OutOfMemoryError above is thrown after about 5 minutes.

Question

I'm just confused: the AWQ model is only <=4 GB, so why can it not run on an NVIDIA RTX 3060 with 12 GB of VRAM?
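
A rough back-of-envelope sketch (assuming CodeLlama-7B's usual config of 32 transformer layers and hidden size 4096, with an fp16 KV cache; this is editorial arithmetic, not vLLM internals) suggests the preallocated KV cache, rather than the ~4 GB of weights, is what fills the card:

# Back-of-envelope KV-cache arithmetic (a sketch, not vLLM internals).
layers = 32
hidden = 4096              # width of each per-layer K (or V) vector
bytes_per_elem = 2         # fp16
kv_per_token = 2 * layers * hidden * bytes_per_elem      # K and V for every layer
print(f"KV cache per token: {kv_per_token / 2**20:.2f} MiB")                 # ~0.50 MiB
print(f"One 16384-token sequence: {kv_per_token * 16384 / 2**30:.1f} GiB")   # ~8.0 GiB

On top of the weights, vLLM also preallocates cache blocks up to --gpu-memory-utilization (0.9 by default), so VRAM usage climbs well past the size of the weights; lowering --max-model-len and --gpu-memory-utilization, as the replies below suggest, is the usual workaround.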

@amir-in-a-cynch

amir-in-a-cynch commented Oct 26, 2023

When testing on vllm, did you try --max-model-len 512? It looks from your output that it went to 16384.


@casper-hansen
Contributor

Isn't it --max_model_len or am I mistaken? Btw, the 7B model should definitely fit into 512 context.


@bonuschild
Author

Isn't it --max_model_len or am I mistaken? Btw, the 7B model should definitely fit into 512 context.

I've re-tested this on an A100 instead of the RTX 3060, and it ends up occupying about 20+ GB of VRAM! Why is that?
I use the command:

python api_server.py --model path/to/7b-awq/model --port 8000 -q awq --dtype half --trust-remote-code

That is so weird...

@SupreethRao99

Seconding @bonuschild's error output: trying to run Mistral-7B on a T4 with 16 GB VRAM after I quantised it with AWQ still causes CUDA OOM errors.

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=2048)
llm = LLM(model="zephyr-7b-beta-awq", quantization="awq", dtype="float16")
outputs = llm.generate(prompts, sampling_params)
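
For what it's worth, here is a minimal variant of the snippet above (a sketch, not verified on a T4) that reins the engine in through the knobs discussed later in this thread; gpu_memory_utilization is a regular LLM constructor argument, and max_model_len is assumed to be forwarded to the engine arguments, mirroring the --max-model-len CLI flag:

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=2048)
llm = LLM(
    model="zephyr-7b-beta-awq",
    quantization="awq",
    dtype="float16",
    gpu_memory_utilization=0.8,  # leave headroom instead of the 0.9 default
    max_model_len=4096,          # instead of the 32k default baked into the checkpoint
)
outputs = llm.generate(prompts, sampling_params)  # `prompts` as in the original snippet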

@bonuschild
Author

@amir-in-a-cynch @casper-hansen @SupreethRao99 @tmm1

I use the AWQ model made by @TheBloke from https://huggingface.co/TheBloke/CodeLlama-7B-AWQ

According to its instructions, I should run this command:

python -m vllm.entrypoints.api_server --model TheBloke/CodeLlama-7B-AWQ --quantization awq

but it raises a datatype-not-supported error:

ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

So I change the command and rerun:

python -m vllm.entrypoints.openai.api_server --model path/to/CodeLlama-7B-AWQ -q awq --dtype half

and it raises an OOM error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 12.00 GiB total capacity; 3.38 GiB already allocated; 6.99 GiB free; 3.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This time it seems there really isn't enough VRAM...

So why does a 4 GB AWQ model require more than 12 GB of VRAM to run?

  • My GPU card is an RTX 3060 (12 GB VRAM).
  • I've tested on an A100 with 40 GB VRAM, and running this model with vLLM actually takes about 22 GB of VRAM.

@slobodaapl

Can confirm this is an issue; I tried on an A100 with stock vLLM (no fork) and am facing the same problem.

@thr3a

thr3a commented Nov 3, 2023

Same issue

  • Python 3.11
  • Docker nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
  • transformers==4.35.0
  • torch==2.0.1+cu118
  • torchvision==0.15.2+cu118

@manishiitg

Try adding --gpu-memory-utilization 0.8; this worked for me.

@slobodaapl

I think I found a potential issue and solution. This is specifically because of how vLLM works.

Setting 'max_batch_tokens' (I think that's the name) too high causes the KV cache to be too big; it directly influences the GPU memory occupied. Try setting your max_batch_tokens to around 32k while keeping everything else the same.

This fixed it for me.

@SupreethRao99

@slobodaapl, could you be more exact about the parameter that should be changed? I don't seem to find anything similar to max_batch_tokens in the arguments for SamplingParams or the LLM class.

Thank you!
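
As a hedged pointer (a sketch, not checked against every vLLM version): the knob is an engine argument rather than a SamplingParams field. On the server it is the --max-num-batched-tokens flag shown in the next reply; with the offline LLM class, extra keyword arguments are assumed to be forwarded to the engine arguments:

from vllm import LLM

llm = LLM(
    model="zephyr-7b-beta-awq",
    quantization="awq",
    dtype="float16",
    max_num_batched_tokens=4096,  # assumed to be forwarded to the engine arguments
    gpu_memory_utilization=0.8,
)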

@slobodaapl

@SupreethRao99 Sorry, I had to find it. This is how I start my vLLM OpenAI server:

python -m vllm.entrypoints.openai.api_server \
        --served-model $MODEL_ID \
        --model $MODEL_ID \
        --tensor-parallel-size 4 \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768

When I reduced max-num-batched-tokens down to 32768 from the high number I had previously, I no longer experienced CUDA memory errors. Try setting yours low as well and see if it helps.

@bonuschild
Author

@SupreethRao99 Sorry, I had to find it. This is how I start my vLLM OpenAI server:

python -m vllm.entrypoints.openai.api_server \
        --served-model $MODEL_ID \
        --model $MODEL_ID \
        --tensor-parallel-size 4 \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768

When I reduced max-num-batched-tokens down to 32768 from the high number I had previously, I no longer experienced CUDA memory errors. Try setting yours low as well and see if it helps.

I used the command as you provided, but it still costs 21 GB of VRAM when loading a 7B-AWQ model :(

python -m vllm.entrypoints.openai.api_server \
        --model $MODEL_ID \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768 \
        -q awq --dtype half --trust-remote-code

@slobodaapl

For those with very limited VRAM, try setting the batched tokens to about 4-8k and combine it with the memory-limit parameter at about 0.8.

Also, try the non-quantised version with this first. It seems vLLM uses extra memory to do some kind of operation on the model when loading it quantised.

@gesanqiu
Contributor

gesanqiu commented Nov 9, 2023

Besides --max-num-batched-tokens and --gpu-memory-utilization, I also limit --max-num-seqs.

@demegire

For anybody stumbling here, be sure to check max_seq_len; for some reason the default was 32768 in TheBloke/zephyr-7B-beta-AWQ.
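
One way to check what length vLLM will pick up by default is to read it out of the checkpoint's config (a sketch; it assumes the checkpoint exposes max_position_embeddings, which vLLM uses unless --max-model-len overrides it):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("TheBloke/zephyr-7B-beta-AWQ")
print(cfg.max_position_embeddings)  # 32768 for this checkpoint, per the comment above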

@bonuschild
Author

@demegire Agreed, and one needs to find out the correct max sequence length, which is normally not mentioned in the model card :)

@Jaykumaran

!python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --dtype half \
    --max-num-batched-tokens 4096 \
    --max-model-len 256 \
    --quantization awq \
    --tensor-parallel-size 1 \
    --port 8010 | grep -q "Uvicorn running" & npx localtunnel --port 8010

This worked on a T4 in Colab with CUDA 12.2.
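
A quick smoke test against that server (a sketch; it assumes the server is reachable on localhost:8010 and that the model name matches the --model argument above):

import requests

resp = requests.post(
    "http://localhost:8010/v1/completions",
    json={
        "model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
        "prompt": "def fibonacci(n):",
        "max_tokens": 64,
        "temperature": 0.0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])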

@Jaykumaran

Jaykumaran commented Dec 24, 2023

Does anyone have an idea about using RAGAS evaluation with a vLLM server for Hugging Face models? I was successfully serving my model, but the ragas metrics evaluation does not recognise the vLLM server and always asks for an OpenAI API key.

from ragas import evaluate
from ragas.metrics import faithfulness

result = evaluate(
    fiqa_eval["baseline"].select(range(5)),  # showing only 5 for demonstration
    metrics=[faithfulness],
)

result

OpenAIKeyNotFound: OpenAI API key not found! Seems like your trying to use Ragas metrics with OpenAI endpoints. Please set 'OPENAI_API_KEY' environment variable
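
One workaround worth trying (a sketch, untested and version-dependent): ragas routes its metric calls through the OpenAI client, so pointing that client at the vLLM OpenAI-compatible server and supplying a dummy key may satisfy the check. Whether a given ragas version honours OPENAI_API_BASE is an assumption to verify.

import os

os.environ["OPENAI_API_KEY"] = "EMPTY"                      # placeholder; vLLM does not check it
os.environ["OPENAI_API_BASE"] = "http://localhost:8010/v1"  # the vLLM server from the earlier comment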

@hmellor hmellor closed this as completed Mar 25, 2024