Multi GPU cannot save adapter_model.bin #483
Comments
|
This is due to a change in transformers (huggingface/transformers@357f281). You may downgrade to
I've added the following snippet, which re-saves the PEFT adapter on every checkpoint and drops the Trainer's full-model file:

```python
import os

import transformers
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from transformers.trainer_callback import TrainerCallback


class SavePeftModelCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        checkpoint_folder = os.path.join(
            args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}"
        )

        # Save the PEFT adapter weights explicitly.
        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        # Remove the pytorch_model.bin the Trainer wrote for this checkpoint.
        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


# ...

trainer = transformers.Trainer(
    ...,  # other arguments unchanged
    callbacks=[SavePeftModelCallback],
)
```

With this code, the checkpoint directory looks like below. It seems the implementation in
|
With this I always find adapter_model.bin to be 443 bytes.
|
@maekawataiki Thank you very much for your help. It works for me now! Starting from the main branch of https://github.com/tloen/alpaca-lora, I made the following changes:
Finally I got the files, and they work well. BTW, to fix the issue "FileNotFoundError: [Errno 2] No such file or directory: 'xxx/checkpoint-200/pytorch_model.bin'" in a multi-GPU run, I revised the snippet a little bit:
Thanks
|
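The revised snippet itself isn't quoted above. One plausible revision (my assumption, not necessarily the poster's actual change) is to make the pytorch_model.bin cleanup tolerant of the file already being gone, so that when several ranks run `on_save` the later ones don't crash on the race between the `exists()` check and the removal:

```python
import os
import tempfile


def remove_if_exists(path):
    """Remove path, tolerating another rank having already removed it."""
    try:
        os.remove(path)
    except FileNotFoundError:
        # Another process won the race; nothing left to do.
        pass


# Demo: calling it twice must not raise, unlike a bare os.remove.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "pytorch_model.bin")
open(demo_path, "w").close()
remove_if_exists(demo_path)
remove_if_exists(demo_path)  # second call is a no-op
still_there = os.path.exists(demo_path)
```

Calling `remove_if_exists(pytorch_model_path)` inside the callback, instead of `if os.path.exists(...): os.remove(...)`, closes the window in which two ranks both pass the existence check.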
Me too, did you solve it? @KKcorps
|
The 443-byte adapter_model.bin is addressed in #446 and #334.
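For context (my reading of those issues, not stated above): a ~443-byte adapter_model.bin is what you get when an empty state dict is serialized, i.e. the LoRA weights never made it into the saved dict. A rough stdlib-only illustration of why an empty dict produces a tiny file:

```python
import io
import pickle

# Serializing an empty state dict yields only a few bytes of pickle framing;
# torch.save adds zip-archive overhead on top, landing in the few-hundred-byte
# range -- consistent with the 443-byte adapter_model.bin symptom.
buf = io.BytesIO()
pickle.dump({}, buf)
empty_size = buf.getbuffer().nbytes
```

So a quick sanity check after training is simply `os.path.getsize("adapter_model.bin")`: anything in the hundreds of bytes means the adapter weights are missing.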
I got the following result.
Inside the checkpoint:
|
Hi, I'm trying to run finetune.py on 6 GPUs:

```shell
WORLD_SIZE=6 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nproc_per_node=6 --master_port=1234 finetune.py \
    --base_model='./llama-7b-hf' \
    --num_epochs=3 \
    --cutoff_len=512 \
    --group_by_length \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=64 \
    --batch_size=384
```

And I commented out L263~L269 in finetune.py, based on: #446 (comment)
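If I recall the script correctly, the region being commented out is the `model.state_dict` monkey-patch that makes the Trainer serialize only the LoRA weights. I can't quote finetune.py verbatim here, so this dummy-class sketch (all names hypothetical, with a stand-in for peft's `get_peft_model_state_dict`) only shows the general rebinding pattern involved:

```python
class DummyModel:
    def state_dict(self):
        # Stand-in for a real model's full state dict.
        return {"base.weight": 1.0, "lora_A.weight": 2.0}


def get_adapter_state_dict(model, full_state):
    # Stand-in for peft's get_peft_model_state_dict: keep only LoRA entries.
    return {k: v for k, v in full_state.items() if "lora" in k}


model = DummyModel()
old_state_dict = model.state_dict  # keep a handle on the original method

# Rebind state_dict so that anything saving the model (e.g. Trainer.save_model)
# serializes only the adapter weights instead of the full base model.
model.state_dict = (
    lambda self, *args, **kwargs: get_adapter_state_dict(self, old_state_dict())
).__get__(model, DummyModel)

adapter_only = model.state_dict()
```

Commenting this patch out means the Trainer goes back to saving full state dicts, which interacts with how checkpoints are written in multi-GPU runs.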
And I got the following issues:
{'loss': 0.8493, 'learning_rate': 1.8384401114206126e-05, 'epoch': 2.88}
{'loss': 0.8812, 'learning_rate': 1.0027855153203342e-05, 'epoch': 2.94}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 459/459 [2:04:05<00:00, 12.51s/it]
Each of the six processes prints the same warning and then crashes with the same traceback:

The intermediate checkpoints of PEFT may not be saved correctly, using TrainerCallback to save adapter_model.bin in corresponding folders, here are some examples huggingface/peft#96

Traceback (most recent call last):
  File "/home/rick/llm/llm-training-data/github/alpaca-lora/finetune.py", line 283, in <module>
    fire.Fire(train)
  File "/home/rick/anaconda3/envs/alpaca-lora/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/rick/anaconda3/envs/alpaca-lora/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/rick/anaconda3/envs/alpaca-lora/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/rick/llm/llm-training-data/github/alpaca-lora/finetune.py", line 273, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/rick/anaconda3/envs/alpaca-lora/lib/python3.10/site-packages/transformers/trainer.py", line 1696, in train
    return inner_training_loop(
  File "/home/rick/anaconda3/envs/alpaca-lora/lib/python3.10/site-packages/transformers/trainer.py", line 2094, in _inner_training_loop
    self._load_best_model()
  File "/home/rick/anaconda3/envs/alpaca-lora/lib/python3.10/site-packages/transformers/trainer.py", line 2291, in _load_best_model
    self._issue_warnings_after_load(load_result)
UnboundLocalError: local variable 'load_result' referenced before assignment
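The UnboundLocalError comes from `Trainer._load_best_model`: as far as I can tell, `load_result` is only assigned on the branch where the checkpoint's pytorch_model.bin exists, and a PEFT checkpoint contains adapter_model.bin instead. A minimal sketch of that failure pattern (simplified, not the actual transformers source):

```python
import os


def load_best_model(best_model_checkpoint):
    # Simplified sketch of the Trainer._load_best_model logic:
    # load_result is only bound when the full weights file exists.
    best_model_path = os.path.join(best_model_checkpoint, "pytorch_model.bin")
    if os.path.exists(best_model_path):
        load_result = "loaded"
    # A PEFT checkpoint only has adapter_model.bin, so the branch is skipped
    # and the next line raises UnboundLocalError.
    return load_result
```

Running this against a checkpoint folder without pytorch_model.bin raises exactly the error above, which is why the workarounds here either re-save the adapter via a `TrainerCallback` or avoid having the Trainer reload weights at the end of training.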