
Some issues in training #5

Closed · VicZlq opened this issue Oct 30, 2023 · 2 comments

@VicZlq commented Oct 30, 2023

Thank you greatly for your excellent work. As I tried to reproduce the training process, I encountered the following problem and wondered if you have run into it too:

Traceback (most recent call last):
  File "/storage/zhaoliuqing/code/VisorGPT/train/pretrain.py", line 121, in <module>
    main()
  File "/storage/zhaoliuqing/code/VisorGPT/train/pretrain.py", line 117, in main
    trainer.train_and_validate(args)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 56, in train_and_validate
    worker(args.local_rank, None, args, model_for_training, model_for_dataloader)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 638, in worker
    trainer.train(args, gpu_id, rank, train_loader, model_for_training, optimizer, scheduler)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 110, in train
    loss = self.forward_propagation(batch, model)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 160, in forward_propagation
    loss_info = model(src, tgt, seg)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/models/model.py", line 33, in forward
    emb = self.embedding(src, seg)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/embeddings/embedding.py", line 27, in forward
    emb = embedding(src, seg)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/embeddings/word_embedding.py", line 27, in forward
    emb = self.embedding(src)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

Looking forward to your reply!

@Sierkinhane
Collaborator

Hi, I did not encounter such a problem, but from the error message it seems that you should put all of the tensors on the same device, rather than some on CUDA and some on CPU.
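For reference, a minimal sketch of that advice in plain PyTorch (the names below are illustrative, not VisorGPT's actual code): both the module holding the embedding weights and the index tensor passed to it must live on the same device before the forward pass.

```python
import torch

# Pick one device and move everything to it.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the model's word embedding and input ids.
embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=8).to(device)
src = torch.randint(0, 100, (2, 5)).to(device)

# No device mismatch: weights and indices both live on `device`.
emb = embedding(src)
```

If only one of the two `.to(device)` calls is made, `embedding(src)` raises exactly the "Expected all tensors to be on the same device" error from the traceback above.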

@VicZlq
Author

VicZlq commented Oct 31, 2023

Thank you very much for your reply!
I did some debugging. In trainer.py, line 56:

if args.deepspeed:
    worker(args.local_rank, None, args, model_for_training, model_for_dataloader)

Here, args.local_rank = None.

This then reaches line 559:

# Get logger
args.logger = init_logger(args)
if args.deepspeed:
    import deepspeed
    deepspeed.init_distributed(dist_backend=args.backend)
    rank = dist.get_rank()
    gpu_id = proc_id
elif args.dist_train:
    rank = gpu_ranks[proc_id]
    gpu_id = proc_id

so gpu_id = proc_id = None.
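One illustrative way to guard against a missing rank (a hypothetical helper, not part of TencentPretrain) is to fall back to the LOCAL_RANK environment variable, which launchers such as `deepspeed` and `torchrun` set for each worker process:

```python
import os

def resolve_local_rank(cli_local_rank=None):
    """Prefer the --local_rank CLI value when given; otherwise fall
    back to the LOCAL_RANK environment variable set by distributed
    launchers, defaulting to 0 for single-process runs."""
    if cli_local_rank is not None:
        return cli_local_rank
    return int(os.environ.get("LOCAL_RANK", 0))
```

With a guard like this, `proc_id` would never be None, and the model would be moved to a concrete GPU instead of silently staying on CPU.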

Here are my startup command parameters:
"--deepspeed",
"--deepspeed_config", "/storage/zhaoliuqing/code/VisorGPT/train/models/deepspeed_config.json",
"--dataset_path", "/storage/zhaoliuqing/code/VisorGPT/train/visorgpt_dagger_train_seq.pt",
"--vocab_path", "/storage/zhaoliuqing/code/VisorGPT/train/models/google_uncased_en_coord_vocab.txt",
"--config_path", "/storage/zhaoliuqing/code/VisorGPT/train/models/gpt2/config.json",
"--output_model_path", "/storage/zhaoliuqing/code/VisorGPT/train/models/visorgpt_dagger_train_seq.bin",
"--world_size", "2",
"--gpu_ranks", "0","1",
"--total_steps", "200000",
"--save_checkpoint_steps", "10000",
"--report_steps", "100",
"--learning_rate", "5e-5",
"--batch_size", "16"
I only changed the values for world_size and gpu_ranks.
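For context, the referenced deepspeed_config.json typically looks something like the sketch below. This is a guess at a minimal config, not the repo's actual file; DeepSpeed requires train_batch_size to equal micro batch size × gradient accumulation × world size (16 × 1 × 2 = 32 for this launch command):

```json
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 16,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true }
}
```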

@VicZlq VicZlq closed this as completed Nov 24, 2023