Training error on Windows #126

Closed
sounansu opened this issue Jun 5, 2019 · 9 comments
@sounansu commented Jun 5, 2019

I can already run demo.py on my Windows 10 PC.

Next, I tried to run the benchmark evaluation on my PC, but the error below occurred.

Please, can someone help me?

(CenterNet) F:\Users\sounansu\Anaconda3\CenterNet\src>python test.py ctdet --exp_id coco_dla --keep_res --load_model ../models/ctdet_coco_dla_2x.pth
Keep resolution testing.
training chunk_sizes: [32]
The output will be saved to  F:\Users\sounansu\Anaconda3\CenterNet\src\lib\..\..\exp\ctdet\coco_dla
heads {'hm': 80, 'wh': 2, 'reg': 2}
Namespace(K=100, aggr_weight=0.0, agnostic_ex=False, arch='dla_34', aug_ddd=0.5, aug_rot=0, batch_size=32, cat_spec_wh=False, center_thresh=0.1, chunk_sizes=[32], data_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..\\data', dataset='coco', debug=0, debug_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..\\exp\\ctdet\\coco_dla\\debug', debugger_theme='white', demo='', dense_hp=False, dense_wh=False, dep_weight=1, dim_weight=1, down_ratio=4, eval_oracle_dep=False, eval_oracle_hm=False, eval_oracle_hmhp=False, eval_oracle_hp_offset=False, eval_oracle_kps=False, eval_oracle_offset=False, eval_oracle_wh=False, exp_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..\\exp\\ctdet', exp_id='coco_dla', fix_res=False, flip=0.5, flip_test=False, gpus=[0], gpus_str='0', head_conv=256, heads={'hm': 80, 'wh': 2, 'reg': 2}, hide_data_time=False, hm_hp=True, hm_hp_weight=1, hm_weight=1, hp_weight=1, input_h=512, input_res=512, input_w=512, keep_res=True, kitti_split='3dop', load_model='../models/ctdet_coco_dla_2x.pth', lr=0.000125, lr_step=[90, 120], master_batch_size=32, mean=array([[[0.40789655, 0.44719303, 0.47026116]]], dtype=float32), metric='loss', mse_loss=False, nms=False, no_color_aug=False, norm_wh=False, not_cuda_benchmark=False, not_hm_hp=False, not_prefetch_test=False, not_rand_crop=False, not_reg_bbox=False, not_reg_hp_offset=False, not_reg_offset=False, num_classes=80, num_epochs=140, num_iters=-1, num_stacks=1, num_workers=4, off_weight=1, output_h=128, output_res=128, output_w=128, pad=31, peak_thresh=0.2, print_iter=0, rect_mask=False, reg_bbox=True, reg_hp_offset=True, reg_loss='l1', reg_offset=True, resume=False, root_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..', rot_weight=1, rotate=0, save_all=False, save_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..\\exp\\ctdet\\coco_dla', scale=0.4, scores_thresh=0.1, seed=317, shift=0.1, std=array([[[0.2886383 , 0.27408165, 0.27809834]]], dtype=float32), task='ctdet', test=False, test_scales=[1.0], trainval=False, val_intervals=5, vis_thresh=0.3, wh_weight=0.1)
==> initializing coco 2017 val data.
loading annotations into memory...
Done (t=0.70s)
creating index...
index created!
Loaded val 5000 samples
Creating model...
loaded ../models/ctdet_coco_dla_2x.pth, epoch 230
coco_dlaTHCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.6_035809\conda\conda-bld\pytorch_1556683229598\work\torch/csrc/generic/StorageSharing.cpp line=245 error=71 : operation not supported
Traceback (most recent call last):
  File "test.py", line 126, in <module>
    prefetch_test(opt)
  File "test.py", line 69, in prefetch_test
    for ind, (img_id, pre_processed_images) in enumerate(data_loader):
  File "f:\users\sounansu\anaconda3\envs\centernet\lib\site-packages\torch\utils\data\dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "f:\users\sounansu\anaconda3\envs\centernet\lib\site-packages\torch\utils\data\dataloader.py", line 469, in __init__
    w.start()
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "f:\users\sounansu\anaconda3\envs\centernet\lib\site-packages\torch\multiprocessing\reductions.py", line 231, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at C:\w\1\s\tmp_conda_3.6_035809\conda\conda-bld\pytorch_1556683229598\work\torch/csrc/generic/StorageSharing.cpp:245

(CenterNet) F:\Users\sounansu\Anaconda3\CenterNet\src>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
@xingyizhou (Owner)

It seems the problem is in the PyTorch dataloader. Make sure your PyTorch is installed correctly, i.e., you can run other PyTorch projects. For benchmark testing, you can disable the multi-process dataloader with --not_prefetch_test. Note that this will slow down the testing.
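
For background, a minimal sketch of the underlying mechanism (generic PyTorch, not CenterNet's actual loader code): on Windows, DataLoader workers are spawned processes, so the dataset object must be pickled and sent to each worker, and CUDA storages cannot be shared across processes at all, which is why the first traceback above fails inside storage._share_cuda_(). With num_workers=0, loading stays in the main process and both problems disappear.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(16, dtype=torch.float32).view(8, 2))

# num_workers > 0: on Windows each worker is a *spawned* process, so the
# dataset must be picklable and must not hold CUDA tensors (sharing CUDA
# storage across processes is unsupported there -- cuda runtime error 71).
worker_loader = DataLoader(dataset, batch_size=4, num_workers=4)

# num_workers = 0: everything runs in the main process -- slower, but no
# pickling and no cross-process CUDA sharing is required.
main_loader = DataLoader(dataset, batch_size=4, num_workers=0)

for (batch,) in main_loader:
  print(batch.shape)  # torch.Size([4, 2])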

@sounansu (Author) commented Jun 9, 2019

Thank you!
I can run test.py with the command below.

python test.py ctdet --not_prefetch_test --exp_id coco_dla --keep_res --load_model ..\models\ctdet_coco_dla_2x.pth

But then I tried to run training with main.py, as below.

(CenterNet) F:\Users\sounansu\Anaconda3\CenterNet\src>python main.py ctdet --exp_id coco_dla --batch_size 32 --master_batch 15 --lr 1.25e-4  --gpus 0
Fix size testing.
training chunk_sizes: [15]
The output will be saved to  F:\Users\sounansu\Anaconda3\CenterNet\src\lib\..\..\exp\ctdet\coco_dla
heads {'hm': 80, 'wh': 2, 'reg': 2}
Namespace(K=100, aggr_weight=0.0, agnostic_ex=False, arch='dla_34', aug_ddd=0.5, aug_rot=0, batch_size=32, cat_spec_wh=False, center_thresh=0.1, chunk_sizes=[15], data_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..\\data', dataset='coco', debug=0, debug_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..\\exp\\ctdet\\coco_dla\\debug', debugger_theme='white', demo='', dense_hp=False, dense_wh=False, dep_weight=1, dim_weight=1, down_ratio=4, eval_oracle_dep=False, eval_oracle_hm=False, eval_oracle_hmhp=False, eval_oracle_hp_offset=False, eval_oracle_kps=False, eval_oracle_offset=False, eval_oracle_wh=False, exp_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..\\exp\\ctdet', exp_id='coco_dla', fix_res=True, flip=0.5, flip_test=False, gpus=[0], gpus_str='0', head_conv=256, heads={'hm': 80, 'wh': 2, 'reg': 2}, hide_data_time=False, hm_hp=True, hm_hp_weight=1, hm_weight=1, hp_weight=1, input_h=512, input_res=512, input_w=512, keep_res=False, kitti_split='3dop', load_model='', lr=0.000125, lr_step=[90, 120], master_batch_size=15, mean=array([[[0.40789655, 0.44719303, 0.47026116]]], dtype=float32), metric='loss', mse_loss=False, nms=False, no_color_aug=False, norm_wh=False, not_cuda_benchmark=False, not_hm_hp=False, not_prefetch_test=False, not_rand_crop=False, not_reg_bbox=False, not_reg_hp_offset=False, not_reg_offset=False, num_classes=80, num_epochs=140, num_iters=-1, num_stacks=1, num_workers=4, off_weight=1, output_h=128, output_res=128, output_w=128, pad=31, peak_thresh=0.2, print_iter=0, rect_mask=False, reg_bbox=True, reg_hp_offset=True, reg_loss='l1', reg_offset=True, resume=False, root_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..', rot_weight=1, rotate=0, save_all=False, save_dir='F:\\Users\\sounansu\\Anaconda3\\CenterNet\\src\\lib\\..\\..\\exp\\ctdet\\coco_dla', scale=0.4, scores_thresh=0.1, seed=317, shift=0.1, std=array([[[0.2886383 , 0.27408165, 0.27809834]]], dtype=float32), task='ctdet', test=False, test_scales=[1.0], trainval=False, val_intervals=5, vis_thresh=0.3, wh_weight=0.1)
Creating model...
Setting up data...
==> initializing coco 2017 val data.
loading annotations into memory...
Done (t=0.68s)
creating index...
index created!
Loaded val 5000 samples
==> initializing coco 2017 train data.
loading annotations into memory...
Done (t=18.41s)
creating index...
index created!
Loaded train 118287 samples
Starting training...
ctdet/coco_dlaTraceback (most recent call last):
  File "main.py", line 102, in <module>
    main(opt)
  File "main.py", line 70, in main
    log_dict_train, _ = trainer.train(epoch, train_loader)
  File "F:\Users\sounansu\Anaconda3\CenterNet\src\lib\trains\base_trainer.py", line 119, in train
    return self.run_epoch('train', epoch, data_loader)
  File "F:\Users\sounansu\Anaconda3\CenterNet\src\lib\trains\base_trainer.py", line 61, in run_epoch
    for iter_id, batch in enumerate(data_loader):
  File "f:\users\sounansu\anaconda3\envs\centernet\lib\site-packages\torch\utils\data\dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "f:\users\sounansu\anaconda3\envs\centernet\lib\site-packages\torch\utils\data\dataloader.py", line 469, in __init__
    w.start()
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'get_dataset.<locals>.Dataset'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "F:\Users\sounansu\Anaconda3\envs\CenterNet\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

An EOFError occurred.
I also tried running with the --not_prefetch_test option, but an error occurred there too.

@zzundark commented Jun 10, 2019

Same issue... is there any solution?

@zzundark

> Same issue... is there any solution?

It works with --num_workers 0, but then I can't use multi-process data loading on Windows...

@sounansu (Author) commented Jun 15, 2019

Thank you, zzundark!
I can train with the command below!

python main.py ctdet --exp_id coco_dla --batch_size 11 --master_batch 11 --lr 1.25e-4 --gpus 0 --num_workers 0

I have only one RTX 2070 with 8 GB of memory, so I changed the batch size and master batch to 11.
(I tried 12, but it ran out of memory.)

I will try res_18; it needs less training time.

@heartInsert

> Thank you, zzundark! I can train with the command below!
>
> python main.py ctdet --exp_id coco_dla --batch_size 11 --master_batch 11 --lr 1.25e-4 --gpus 0 --num_workers 0

Yes, num_workers must be 0; even when it is 1, it goes wrong.

@Ai-is-light

@sounansu why must num_workers be 0?

@TomsonBoylett

I have it training on Windows with num_workers > 0.

Basically, the error boils down to: Can't pickle local object 'get_dataset.<locals>.Dataset'.

From the Python pickle docs: only classes that are defined at the top level of a module are picklable.
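
A minimal illustration of that rule (an editor's example, not code from this repo):

import pickle

class TopLevel:        # importable as <module>.TopLevel, so picklable
  pass

def make_local():
  class Local:         # qualified name is 'make_local.<locals>.Local'
    pass
  return Local

pickle.dumps(TopLevel)         # works
try:
  pickle.dumps(make_local())
except (AttributeError, pickle.PicklingError) as e:
  print(e)                     # Can't pickle local object 'make_local.<locals>.Local'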

The Dataset class is defined dynamically in src\lib\datasets\dataset_factory.py, based on the options you pass on the command line.

def get_dataset(dataset, task):
  # Defined inside the function body, so the class's qualified name is
  # 'get_dataset.<locals>.Dataset' and pickle cannot look it up by name.
  class Dataset(dataset_factory[dataset], _sample_factory[task]):
    pass
  return Dataset

Notice that this class is defined inside a function, not at the top level! So to fix this I made a small workaround:

# Defined at module top level, so DataLoader workers can pickle it.
class MyDataset(dataset_factory['mydataset'], _sample_factory['ctdet']):
  pass

def get_dataset(dataset, task):
  if dataset == 'mydataset' and task == 'ctdet':
    return MyDataset
  # Fall back to the original dynamic class for other combinations.
  class Dataset(dataset_factory[dataset], _sample_factory[task]):
    pass
  return Dataset

Messy, but it works.
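
For what it's worth, the same idea generalizes (an editor's sketch over the existing dataset_factory and _sample_factory dicts; the name _dataset_classes is hypothetical): build every (dataset, task) class once at import time and bind it to a module-level name, so each class is importable and therefore picklable.

_dataset_classes = {}
for _ds in dataset_factory:
  for _task in _sample_factory:
    _name = 'Dataset_{}_{}'.format(_ds, _task)
    # type() gives the class __qualname__ == _name, and the globals()
    # binding makes it importable from this module -- which is exactly
    # what pickle needs to serialize the class by reference.
    _cls = type(_name, (dataset_factory[_ds], _sample_factory[_task]), {})
    globals()[_name] = _cls
    _dataset_classes[(_ds, _task)] = _cls

def get_dataset(dataset, task):
  return _dataset_classes[(dataset, task)]

This assumes every dataset/sample pairing is a valid mixin combination, which the original factory already implies.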

@kap2403 commented May 11, 2023

Please explain the procedure for the CenterNet dataloader.
