Your current environment
For some reason, I can't run `collect_env.py` in my environment. Sorry about that. :) But I'm sure this problem has nothing to do with the environment.
My environment:
vllm: 0.7.3
cuda: 12.4
transformers: 4.50.1
trl: 0.15.2
🐛 Describe the bug
My code using vLLM:
```python
from vllm import LLM, SamplingParams

llm = LLM(model=model_name_or_path,
          dtype="float16",
          tensor_parallel_size=tensor_parallel_size,
          max_num_seqs=batch_size,
          max_model_len=None if max_model_len == -1 else max_model_len,
          gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
outputs = llm.generate(prompts, sampling_params)
```
When I try to generate a response from Qwen2.5-7B-Instruct, I hit a ValueError raised by this line:

```
ValueError: Sliding window for some but all layers is not supported. This model uses sliding window but `max_window_layers` = 28 is less than `num_hidden_layers` = 28. Please open an issue to discuss this feature.
```
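For reference, the fields that check looks at can be read straight off the fine-tuned checkpoint's config (a quick sanity check, assuming the checkpoint loads with transformers; `model_name_or_path` is the same variable as in my snippet above):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(model_name_or_path)
print(cfg.use_sliding_window,   # True after my fine-tune
      cfg.sliding_window,       # window size from the Qwen2.5 config
      cfg.max_window_layers,    # 28
      cfg.num_hidden_layers)    # 28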
The model I'm using was fine-tuned with the trl library and FlashAttention-2, with sliding window enabled.
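For context, the sliding-window fields end up in the saved checkpoint roughly like this (a simplified, hypothetical sketch of my training setup; `output_dir` and the trl specifics are placeholders):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical sketch: enable sliding-window attention in the Qwen2.5 config
# before fine-tuning, so the saved config.json carries use_sliding_window: true.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
config.use_sliding_window = True  # disabled by default in Qwen2.5
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    config=config,
    attn_implementation="flash_attention_2",  # FlashAttention-2, as in my run
    torch_dtype="auto",
)
# ... trl fine-tuning happens here (details omitted) ...
model.save_pretrained(output_dir)  # output_dir is a placeholder path
```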
Note that `max_window_layers` (28) is not actually less than `num_hidden_layers` (28) here, so the values don't even match the condition the message describes. There also seems to be a TODO tag on that line; is this check intended to fire in this case?
I'm also curious why everything works fine when I train with trl and vLLM, but vLLM throws this ValueError when I run inference with the fine-tuned model.
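In the meantime, if I'm reading the engine arguments right, vLLM's `disable_sliding_window` flag should avoid this code path by turning off sliding-window attention and capping the context to the window size. A minimal sketch of the workaround I'm considering (not yet verified for long-context quality):

```python
from vllm import LLM

# Sketch of a possible workaround: disable sliding-window attention so the
# Qwen2 sliding-window check never fires; vLLM then caps max_model_len to
# the window size. Variables are the same ones from my snippet above.
llm = LLM(model=model_name_or_path,
          dtype="float16",
          tensor_parallel_size=tensor_parallel_size,
          max_num_seqs=batch_size,
          disable_sliding_window=True)
```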
Before submitting a new issue...