
checkpoint.initial_load_in_hf should overwrite everything and load from hf weights. #1900

@wukaixingxp

Bug description

I have an existing checkpoint folder and set initial_load_in_hf: true in the YAML config like this. When I run python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml, I get the error step-1 not found. In the log I saw these warnings:

[0] WARNING checkpoint.initial_load_path is provided but the checkpoint.folder exists. Checkpointer will use the checkpoints from the checkpoint.folder checkpoint.
[0] WARNING checkpoint.initial_load_in_hf is True but the checkpoint.folder exists. Checkpointer will not load from HF safetensors

Looking closer, I noticed that the checkpointer checks whether the checkpoint folder for the current run, located at {--job.dump_folder}/{--checkpoint.folder}, is not empty (at this line). Since checkpoint.folder is checkpoints by default, it checks whether the checkpoints folder exists and, if it does, tries to load from that folder, completely ignoring the initial_load_in_hf: true setting.

I hope we can change this so that when initial_load_in_hf=True, it loads from the HF weights no matter whether checkpoint.folder exists. This is more user-friendly, because the user has explicitly configured initial_load_in_hf=True and expects the program to load from the HF weights.
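To make the request concrete, here is a minimal sketch of the precedence I am describing. It is not the actual checkpointer code: the function names, the string labels, and the reduction to two boolean inputs are my own simplifications, and the "current" version is only inferred from the two warnings above.

```python
# Hypothetical sketch; none of these names come from the real checkpointer.

def current_load_source(folder_exists: bool, initial_load_in_hf: bool) -> str:
    """Precedence as the warnings describe it: an existing folder always wins."""
    if folder_exists:
        # initial_load_in_hf is ignored here, which is the behavior reported above.
        return "checkpoint_folder"
    if initial_load_in_hf:
        return "hf_safetensors"
    return "other"


def proposed_load_source(folder_exists: bool, initial_load_in_hf: bool) -> str:
    """Requested precedence: an explicit initial_load_in_hf=True wins."""
    if initial_load_in_hf:
        return "hf_safetensors"
    if folder_exists:
        return "checkpoint_folder"
    return "other"


if __name__ == "__main__":
    # Compare both orderings for every combination of the two inputs.
    for folder_exists in (False, True):
        for in_hf in (False, True):
            print(
                f"folder_exists={folder_exists}, initial_load_in_hf={in_hf}: "
                f"current={current_load_source(folder_exists, in_hf)}, "
                f"proposed={proposed_load_source(folder_exists, in_hf)}"
            )
```

The two orderings only differ in the case from this report, where the folder exists and initial_load_in_hf is True.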

Versions

Latest main

Labels

question: Further information is requested
