
Some issues in training #5

Closed · VicZlq opened this issue Oct 30, 2023 · 2 comments

@VicZlq commented Oct 30, 2023

Thank you greatly for your excellent work. As I tried to reproduce the training process, I encountered the following problem and wondered if you have run into it too:

Traceback (most recent call last):
  File "/storage/zhaoliuqing/code/VisorGPT/train/pretrain.py", line 121, in <module>
    main()
  File "/storage/zhaoliuqing/code/VisorGPT/train/pretrain.py", line 117, in main
    trainer.train_and_validate(args)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 56, in train_and_validate
    worker(args.local_rank, None, args, model_for_training, model_for_dataloader)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 638, in worker
    trainer.train(args, gpu_id, rank, train_loader, model_for_training, optimizer, scheduler)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 110, in train
    loss = self.forward_propagation(batch, model)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 160, in forward_propagation
    loss_info = model(src, tgt, seg)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/models/model.py", line 33, in forward
    emb = self.embedding(src, seg)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/embeddings/embedding.py", line 27, in forward
    emb = embedding(src, seg)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/embeddings/word_embedding.py", line 27, in forward
    emb = self.embedding(src)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

Looking forward to your reply!

@Sierkinhane
Collaborator

Hi, I did not encounter such a problem, but from the error message it seems that you should put all of the tensors on the same device, rather than some on CUDA and some on CPU.
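For reference, a minimal sketch of that advice in plain PyTorch (the names below are illustrative, not VisorGPT's actual code): both the module holding the embedding weights and the index tensor passed to it must live on the same device before the forward pass.

```python
import torch

# Pick one device and move everything to it.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the model's word embedding and input ids.
embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=8).to(device)
src = torch.randint(0, 100, (2, 5)).to(device)

# No device mismatch: weights and indices both live on `device`.
emb = embedding(src)
```

If only one of the two `.to(device)` calls is made, `embedding(src)` raises exactly the "Expected all tensors to be on the same device" error from the traceback above.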

@VicZlq
Author

VicZlq commented Oct 31, 2023

Thank you very much for your reply!
I did some debugging. In trainer.py, line 56:

if args.deepspeed:
    worker(args.local_rank, None, args, model_for_training, model_for_dataloader)

Here, args.local_rank = None.

This then reaches line 559:

# Get logger
args.logger = init_logger(args)
if args.deepspeed:
    import deepspeed
    deepspeed.init_distributed(dist_backend=args.backend)
    rank = dist.get_rank()
    gpu_id = proc_id
elif args.dist_train:
    rank = gpu_ranks[proc_id]
    gpu_id = proc_id

so gpu_id = proc_id = None.
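One illustrative way to guard against a missing rank (a hypothetical helper, not part of TencentPretrain) is to fall back to the LOCAL_RANK environment variable, which launchers such as `deepspeed` and `torchrun` set for each worker process:

```python
import os

def resolve_local_rank(cli_local_rank=None):
    """Prefer the --local_rank CLI value when given; otherwise fall
    back to the LOCAL_RANK environment variable set by distributed
    launchers, defaulting to 0 for single-process runs."""
    if cli_local_rank is not None:
        return cli_local_rank
    return int(os.environ.get("LOCAL_RANK", 0))
```

With a guard like this, `proc_id` would never be None, and the model would be moved to a concrete GPU instead of silently staying on CPU.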

Here are my startup command parameters:
"--deepspeed",
"--deepspeed_config", "/storage/zhaoliuqing/code/VisorGPT/train/models/deepspeed_config.json",
"--dataset_path", "/storage/zhaoliuqing/code/VisorGPT/train/visorgpt_dagger_train_seq.pt",
"--vocab_path", "/storage/zhaoliuqing/code/VisorGPT/train/models/google_uncased_en_coord_vocab.txt",
"--config_path", "/storage/zhaoliuqing/code/VisorGPT/train/models/gpt2/config.json",
"--output_model_path", "/storage/zhaoliuqing/code/VisorGPT/train/models/visorgpt_dagger_train_seq.bin",
"--world_size", "2",
"--gpu_ranks", "0","1",
"--total_steps", "200000",
"--save_checkpoint_steps", "10000",
"--report_steps", "100",
"--learning_rate", "5e-5",
"--batch_size", "16"
I only changed the values for world_size and gpu_ranks.
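For context, the referenced deepspeed_config.json typically looks something like the sketch below. This is a guess at a minimal config, not the repo's actual file; DeepSpeed requires train_batch_size to equal micro batch size × gradient accumulation × world size (16 × 1 × 2 = 32 for this launch command):

```json
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 16,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true }
}
```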

@VicZlq VicZlq closed this as completed Nov 24, 2023