
Potential issue with load_model function in trainer checkpoints for all models #176

Open

Description

@AmazingK2k3

Search before asking

  • I have searched the Multimodal Maestro issues and found no similar bug report.

Bug

Hello,

I was personally testing the zero-shot object detection Colab notebook in my AWS environment and noticed that the Qwen model was being loaded across multiple GPUs by the code below, rather than onto the single device I specified, even after setting the CUDA device:

from maestro.trainer.models.qwen_2_5_vl.checkpoints import load_model, OptimizationStrategy

MODEL_ID_OR_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"
MIN_PIXELS = 512 * 28 * 28
MAX_PIXELS = 2048 * 28 * 28

processor, model = load_model(
    model_id_or_path=MODEL_ID_OR_PATH,
    device="cuda:0",
    optimization_strategy=OptimizationStrategy.NONE,
    min_pixels=MIN_PIXELS,
    max_pixels=MAX_PIXELS,
)
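For what it's worth, this is how I confirmed where the weights actually landed (a minimal check in plain PyTorch; model is the object returned by load_model above):

# Collect every device that holds at least one parameter. With the bug
# present, this prints several CUDA devices instead of just cuda:0.
devices = {p.device for p in model.parameters()}
print(devices)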

I browsed through the checkpoints file for the Qwen model and believe this is the cause: regardless of the device parameter passed to the function, the model's device_map is hardcoded to "auto".

https://github.com/roboflow/maestro/blob/develop/maestro/trainer/models/qwen_2_5_vl/checkpoints.py#L81C2-L103C28

# maestro/trainer/models/qwen_2_5_vl/checkpoints.py (excerpt from load_model)

        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_id_or_path,
            revision=revision,
            trust_remote_code=True,
            device_map="auto",  # hardcoded: shards the model across all visible GPUs
            quantization_config=bnb_config,
            torch_dtype=torch.bfloat16,
            cache_dir=cache_dir,
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()
    else:
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_id_or_path,
            revision=revision,
            trust_remote_code=True,
            device_map="auto",  # hardcoded here as well, ignoring the device argument
            torch_dtype=torch.bfloat16,
            cache_dir=cache_dir,
        )
        model.to(device)  # runs only after the weights are already dispatched across GPUs

which overrides the device argument passed by the caller; the trailing model.to(device) does not undo the multi-GPU dispatch.
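A possible fix (an untested sketch on my side) would be to forward the caller's device to from_pretrained instead of the hardcoded "auto", since transformers accepts a device string for device_map:

# Untested sketch: respect the caller's device instead of always sharding.
# Falling back to "auto" when no device is given is my assumption, not
# existing maestro behavior.
device_map = device if device is not None else "auto"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id_or_path,
    revision=revision,
    trust_remote_code=True,
    device_map=device_map,  # e.g. "cuda:0"
    torch_dtype=torch.bfloat16,
    cache_dir=cache_dir,
)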

I am fairly confident this is a bug, but let me know if the issue is on my side.

Environment

  • maestro[qwen_2_5_vl]==1.1.0rc2
  • AWS environment with 4x A10 GPUs
  • Without setting os.environ["CUDA_VISIBLE_DEVICES"] = "0", I was unable to prevent the model from loading on all 4 GPUs (workaround snippet below).
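The only workaround that worked for me is hiding the extra GPUs before anything CUDA-related is imported (the variable must be set before torch initializes):

import os

# Workaround: expose only GPU 0, so device_map="auto" has a single
# device to dispatch to. Must run before importing torch or maestro.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from maestro.trainer.models.qwen_2_5_vl.checkpoints import load_model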

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

Labels

bug (Something isn't working), enhancement (New feature or request)
