Description
Bug description
I have an existing checkpoint folder and set initial_load_in_hf: true in the YAML config. When I run python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml, I get the error step-1 not found. The log shows these warnings:
[0] WARNING checkpoint.initial_load_path is provided but the checkpoint.folder exists. Checkpointer will use the checkpoints from the checkpoint.folder checkpoint.
[0] WARNING checkpoint.initial_load_in_hf is True but the checkpoint.folder exists. Checkpointer will not load from HF safetensors
Looking closer, I noticed that the checkpointer first checks whether the checkpoint folder for the current run, located at {--job.dump_folder}/{--checkpoint.folder}, is non-empty. Since checkpoint.folder defaults to checkpoints, it checks whether the checkpoints folder exists and, if it does, tries to load from it, completely ignoring the initial_load_in_hf: true setting.
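To make the precedence concrete, here is a rough sketch of the behavior as I understand it from the warnings. It is a simplified model, not the actual checkpointer code: the function name resolve_load_source and the returned labels are hypothetical, and only the config keys come from the log above.

```python
import os
import tempfile


def resolve_load_source(dump_folder, folder="checkpoints",
                        initial_load_path=None, initial_load_in_hf=False):
    """Hypothetical model of the observed precedence (not the real implementation)."""
    ckpt_dir = os.path.join(dump_folder, folder)
    # Observed behavior: an existing checkpoint folder wins over everything else,
    # so initial_load_path and initial_load_in_hf are both ignored.
    if os.path.isdir(ckpt_dir):
        return ("checkpoint_folder", ckpt_dir)
    # Only when the folder is absent are the initial_load_* settings honored.
    if initial_load_in_hf:
        return ("hf_safetensors", initial_load_path)
    if initial_load_path is not None:
        return ("initial_load_path", initial_load_path)
    return ("none", None)


with tempfile.TemporaryDirectory() as dump:
    # No checkpoints folder yet: the HF setting is respected.
    before = resolve_load_source(dump, initial_load_path="/path/to/hf",
                                 initial_load_in_hf=True)
    # Once the folder exists, the same config silently switches source.
    os.makedirs(os.path.join(dump, "checkpoints"))
    after = resolve_load_source(dump, initial_load_path="/path/to/hf",
                                initial_load_in_hf=True)
    print(before[0], after[0])
```

With the folder present, the loader then looks for a step directory inside it, which would explain the step-1 not found error when the folder has no matching checkpoint.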
I hope we can change this so that when initial_load_in_hf=True, the checkpointer loads from the HF weights no matter whether checkpoint.folder exists or not. This is more user-friendly, because the user has explicitly configured initial_load_in_hf=True and expects the program to load from HF weights.
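A minimal sketch of the precedence I am asking for, under the same caveats: resolve_load_source_proposed and the returned labels are hypothetical names for illustration, not an existing API, and the real change would live in the checkpointer's load logic.

```python
import os
import tempfile


def resolve_load_source_proposed(dump_folder, folder="checkpoints",
                                 initial_load_path=None, initial_load_in_hf=False):
    """Hypothetical model of the requested behavior (not existing code)."""
    # Requested change: an explicit initial_load_in_hf=True is checked first,
    # so it applies whether or not the checkpoint folder exists.
    if initial_load_in_hf:
        return ("hf_safetensors", initial_load_path)
    ckpt_dir = os.path.join(dump_folder, folder)
    # Without the HF flag, keep resuming from an existing checkpoint folder.
    if os.path.isdir(ckpt_dir):
        return ("checkpoint_folder", ckpt_dir)
    if initial_load_path is not None:
        return ("initial_load_path", initial_load_path)
    return ("none", None)


with tempfile.TemporaryDirectory() as dump:
    os.makedirs(os.path.join(dump, "checkpoints"))
    # The folder exists, but the explicit HF setting still decides the source.
    source, path = resolve_load_source_proposed(
        dump, initial_load_path="/path/to/hf", initial_load_in_hf=True)
    print(source, path)
```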
Versions
Latest main