Incremental pretraining cannot resume after an interruption #10
Comments
Setting peft_path is enough to resume training.
Thanks. I tested it, but adding the peft_path parameter still fails with an error.
Once peft_path is set, the resume_from_checkpoint logic can be commented out; I'll change that shortly.
For LoRA training, resume with the peft_path parameter; for full-parameter training, resume with resume_from_checkpoint. A sketch of what the former amounts to is shown below.
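A minimal sketch of resuming via peft_path, assuming the script re-attaches the saved LoRA adapter to the base model before the Trainer starts; the model name and checkpoint path here are placeholders, not this repo's actual arguments:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model as usual (placeholder name).
base_model = AutoModelForCausalLM.from_pretrained("base-model-name")

# Re-attach the LoRA adapter saved by the interrupted run.
# is_trainable=True keeps the adapter weights updatable, so optimization
# continues from the saved LoRA state instead of a fresh initialization.
model = PeftModel.from_pretrained(
    base_model,
    "output_dir/checkpoint-8000",  # placeholder for the peft_path argument
    is_trainable=True,
)
```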
After using peft_path, the log reads: {'loss': 2.1882, 'learning_rate': 1.1111111111111112e-08, 'epoch': 0.0}. This still looks like it restarted from scratch.
It is continuing training; you can tell from the loss.
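One way to verify that a run truly resumed is to compare the newly logged steps against the checkpoint's saved training state: HF Trainer writes global_step and epoch into trainer_state.json inside each checkpoint directory. A small inspection sketch (the path is a placeholder):

```python
import json

# Read the training state saved with the checkpoint (placeholder path).
with open("output_dir/checkpoint-8000/trainer_state.json") as f:
    state = json.load(f)

print("global_step:", state["global_step"])
print("epoch:", state["epoch"])
# The last log entry may be a loss or an eval record, so use .get().
print("last logged loss:", state["log_history"][-1].get("loss"))
```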
you can pass
Resuming after a failure raises an error again. I have already removed the --overwrite_output_dir flag. Could you tell me what the cause might be? The run takes quite a long time, so any interruption currently means starting over from scratch.
raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at checkpoint-8000
The files do exist under the checkpoint-8000 directory (see the sketch after this list):
adapter_config.json
adapter_model.bin
optimizer.pt
rng_state_0.pth
rng_state_1.pth
rng_state_2.pth
rng_state_3.pth
scaler.pt
scheduler.pt
trainer_state.json
training_args.bin
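The file listing points at the likely cause: a LoRA checkpoint stores adapter_model.bin, while the Trainer's resume validation (an assumption about the transformers version in use here) looks for full model weights such as pytorch_model.bin and raises the quoted ValueError when it finds none. A rough sketch of that check:

```python
import os

# Sketch of the validation that produces the quoted error, assuming
# transformers checks for full model weights in the checkpoint directory.
WEIGHTS_NAME = "pytorch_model.bin"  # a LoRA checkpoint has adapter_model.bin instead

def check_checkpoint(resume_from_checkpoint: str) -> None:
    if not os.path.isfile(os.path.join(resume_from_checkpoint, WEIGHTS_NAME)):
        # adapter_model.bin alone does not satisfy this check, which is why
        # LoRA runs must resume via peft_path rather than resume_from_checkpoint.
        raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
```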