
Incremental pretraining cannot resume after an interruption #10

Closed
charryshi opened this issue Jun 13, 2023 · 7 comments
Labels
question (Further information is requested)

Comments

@charryshi

Resuming the run after a failure raises an error, even though I have already removed the --overwrite_output_dir argument. Could you tell me what the cause might be? Training takes quite a long time, so right now any interruption means starting over from scratch.

raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint-8000

The files under the checkpoint-8000 directory do exist:
adapter_config.json
adapter_model.bin
optimizer.pt
rng_state_0.pth
rng_state_1.pth
rng_state_2.pth
rng_state_3.pth
scaler.pt
scheduler.pt
trainer_state.json
training_args.bin

charryshi added the question label on Jun 13, 2023
@shibing624 (Owner)

Set peft_path to resume training.
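
Concretely, that means reloading the saved LoRA adapter on top of the base model before training starts again. A minimal sketch, assuming the standard peft API; the base model name and checkpoint path below are placeholders, not the repo's exact code:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder base model id; in practice, the same base model the run started from.
base_model = AutoModelForCausalLM.from_pretrained("your-base-model")
# Reload the saved adapter weights from the checkpoint directory;
# is_trainable=True keeps them trainable so the run continues from the
# adapter state instead of a fresh LoRA init.
model = PeftModel.from_pretrained(base_model, "outputs-pt-v1/checkpoint-8000", is_trainable=True)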

@charryshi (Author)

Thanks. I tested adding the peft_path argument, but it still fails with this error:
File "/home/xxx/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2128, in _load_from_checkpoint
raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at outputs-pt-v1/checkpoint-8000
I have already set --peft_path ~/MedicalGPT/scripts/outputs-pt-v1/checkpoint-8000, and the files under that directory are:
adapter_config.json
adapter_model.bin
optimizer.pt
rng_state_0.pth
rng_state_1.pth
rng_state_2.pth
rng_state_3.pth
scaler.pt
scheduler.pt
trainer_state.json
training_args.bin

@shibing624 (Owner)

With peft_path set, the resume_from_checkpoint logic can be commented out; I'll update the code shortly.

@shibing624 (Owner)

For LoRA training, resume with the peft_path argument; for full-parameter training, resume with resume_from_checkpoint.
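
In other words, the two options map onto different mechanics. A hedged sketch only; the function and variable names below are illustrative, not the repo's actual code:

from transformers import Trainer

def resume_training(trainer: Trainer, lora: bool, checkpoint_dir: str):
    if lora:
        # LoRA run: the checkpoint only holds adapter weights (adapter_model.bin),
        # so Trainer cannot restore a full checkpoint; instead the adapter is
        # reloaded via --peft_path before the Trainer is built, and training
        # restarts from those adapter weights.
        return trainer.train()
    # Full-parameter run: the checkpoint directory holds the model weights plus
    # optimizer.pt, scheduler.pt and trainer_state.json, so Trainer can restore
    # everything and continue from the saved global step.
    return trainer.train(resume_from_checkpoint=checkpoint_dir)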

@yangcm1986

For LoRA training, resume with the peft_path argument; for full-parameter training, resume with resume_from_checkpoint.

After using peft_path, the log looks like this:
Peft from pre-trained model: /root/autodl-tmp/finetune-sft/outputs-sft-v2/checkpoint-32500

{'loss': 2.1882, 'learning_rate': 1.1111111111111112e-08, 'epoch': 0.0}
{'loss': 0.9015, 'learning_rate': 1.1111111111111112e-07, 'epoch': 0.0}
{'loss': 1.0434, 'learning_rate': 2.2222222222222224e-07, 'epoch': 0.0}
0%| | 22/36000 [02:20<63:07:57, 6.32s/it]

This still looks like it is starting from scratch.

@shibing624 (Owner)

It is continuing training; you can tell from the loss.

@SelenoChannel

For LoRA training, resume with the peft_path argument; for full-parameter training, resume with resume_from_checkpoint.

After using peft_path, the log looks like this: Peft from pre-trained model: /root/autodl-tmp/finetune-sft/outputs-sft-v2/checkpoint-32500

{'loss': 2.1882, 'learning_rate': 1.1111111111111112e-08, 'epoch': 0.0}
{'loss': 0.9015, 'learning_rate': 1.1111111111111112e-07, 'epoch': 0.0}
{'loss': 1.0434, 'learning_rate': 2.2222222222222224e-07, 'epoch': 0.0}
0%| | 22/36000 [02:20<63:07:57, 6.32s/it]

This still looks like it is starting from scratch.

You can pass resume_from_checkpoint=True to the trainer to skip the previous steps; see huggingface/transformers#24274.
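
A minimal sketch of that suggestion, assuming the training script has already built a Trainer; apart from the transformers API names, everything below is a placeholder:

from transformers import Trainer
from transformers.trainer_utils import get_last_checkpoint

def train_with_resume(trainer: Trainer, output_dir: str):
    # Find the newest checkpoint-* folder under output_dir (None if there is none).
    last_ckpt = get_last_checkpoint(output_dir)
    # resume_from_checkpoint=True auto-detects the same folder; passing the
    # explicit path (or None for a fresh run) restores the optimizer, scheduler
    # and global step, so the step counter continues instead of restarting at 0.
    return trainer.train(resume_from_checkpoint=last_ckpt)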
