
Stuck in the process of compiling C++ extensions #1

Closed
WoodsGao opened this issue Oct 17, 2019 · 11 comments
Labels
help wanted Extra attention is needed

Comments

@WoodsGao

CUDA VERSION:9.0
Python VERSION:3.6.8
Pytorch VERSION:1.2.0

I downloaded the tmm17 dataset and pre-trained model from Google Drive and used the command

sudo python eval.py -d 0 logs/tmm/config.yaml logs/tmm/checkpoint_latest.pth.tar

to evaluate the tmm17 dataset, but after outputting

Let's use 1 GPU(s)!

, the program produced no further output. When I interrupted it, I could see that it was stuck in the torch.utils.cpp_extension.load function.
Is there a problem with this operation?

This is the complete output:

{   'io': {   'augmentation_level': 2,
              'datadir': 'data/tmm17',
              'dataset': 'TMM17',
              'focal_length': 1,
              'logdir': 'logs/',
              'num_vpts': 1,
              'num_workers': 4,
              'resume_from': 'logs/ultimate-suw-3xlr-fixdata',
              'tensorboard_port': 0,
              'validation_debug': -1,
              'validation_interval': 24000},
    'model': {   'backbone': 'stacked_hourglass',
                 'batch_size': 6,
                 'cat_vpts': True,
                 'conic_6x': False,
                 'depth': 4,
                 'fc_channel': 1024,
                 'im2col_step': 32,
                 'multires': <BoxList: [0.0013457768043554, 0.0051941870036646, 0.02004838034795, 0.0774278195486317, 0.299564810864565]>,
                 'num_blocks': 1,
                 'num_stacks': 1,
                 'num_steps': 4,
                 'output_stride': 4,
                 'smp_multiplier': 2,
                 'smp_neg': 1,
                 'smp_pos': 1,
                 'smp_rnd': 3,
                 'upsample_scale': 1},
    'optim': {   'amsgrad': True,
                 'lr': 3e-05,
                 'lr_decay_epoch': 365,
                 'max_epoch': 400,
                 'name': 'Adam',
                 'weight_decay': 3e-05}}
Let's use 1 GPU(s)!
^CTraceback (most recent call last):
  File "eval.py", line 179, in <module>
    main()
  File "eval.py", line 83, in main
    model, C.model.output_stride, C.model.upsample_scale
  File "/workspace/neurvps/neurvps/models/vanishing_net.py", line 23, in __init__
    self.anet = ApolloniusNet(output_stride, upsample_scale)
  File "/workspace/neurvps/neurvps/models/vanishing_net.py", line 95, in __init__
    self.conv1 = ConicConv(32, 64)
  File "/workspace/neurvps/neurvps/models/conic.py", line 19, in __init__
    bias=bias,
  File "/workspace/neurvps/neurvps/models/deformable.py", line 132, in __init__
    DCN = load_cpp_ext("DCN")
  File "/workspace/neurvps/neurvps/models/deformable.py", line 29, in load_cpp_ext
    build_directory=tar_dir,
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 649, in load
    is_python_module)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 822, in _jit_compile
    baton.wait()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/file_baton.py", line 49, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt
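The traceback ends inside torch's file baton, which spins as long as a lock file exists so that only one process compiles the extension at a time. If a previous run crashed and left the lock behind, every later run waits forever. Below is a minimal sketch of that mechanism, simplified from the idea in torch/utils/file_baton.py (not the exact PyTorch implementation; class and parameter names here are illustrative):

```python
import os
import time

class FileBatonSketch:
    """Simplified file-based baton: the first acquirer creates the lock file;
    everyone else spins in wait() until that file disappears."""

    def __init__(self, lock_path, wait_seconds=0.01):
        self.lock_path = lock_path
        self.wait_seconds = wait_seconds

    def try_acquire(self):
        try:
            # O_EXCL makes creation atomic: open fails if the lock already exists.
            fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL)
            os.close(fd)
            return True
        except FileExistsError:
            return False

    def wait(self):
        # This is the kind of loop the KeyboardInterrupt above landed in:
        # if the lock file is never removed (e.g. a crashed build left it
        # behind), this spins forever.
        while os.path.exists(self.lock_path):
            time.sleep(self.wait_seconds)

    def release(self):
        os.remove(self.lock_path)
```

This explains why deleting the stale lock file (as suggested further down in the thread) unblocks the build.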
@zhou13 zhou13 added the help wanted Extra attention is needed label Oct 18, 2019
@zhou13 (Owner) commented Oct 18, 2019

torch.utils.cpp_extension.load is the function that compiles the C++/CUDA code. With the provided information, I cannot see what the problem is. Do you have any CPU load while the program is stuck? Maybe you can test other PyTorch code that uses dynamic compilation. Or you could comment out

warnings.simplefilter("ignore")
to see if you could get more warnings.

@zhou13 (Owner) commented Nov 2, 2019

Feel free to reopen this issue if you have more clues and updates.

@zhou13 zhou13 closed this as completed Nov 2, 2019
@yashnsn commented Mar 12, 2021

Hello,
I was trying to run inference with the PyTorch version of the StyleGAN2 model and am hitting the same issue. Please help me out if you found a solution.

@Agrechka

> I was trying to infer stylegan2 pytorch version model but getting the same issue..

Did you fix the issue? I have the same problem. Thanks.

@KellyYutongHe

@Agrechka and @yashnsn, I found a solution if you still need it: go to your .cache directory and delete the lock file for your C++ extension (it is likely under ~/.cache/torch_extensions/something), and you should be able to run it again.

If you can't find your cache directory, you can run python -m pdb your_program.py, break at .../lib/python3.X/site-packages/torch/utils/cpp_extension.py line 1179 (specifically the line containing baton = FileBaton(os.path.join(build_directory, 'lock'))), and then print build_directory. That is the cache directory for your program.

Hope this helps!
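The cleanup described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the default ~/.cache/torch_extensions cache location; the remove_stale_locks helper is hypothetical, not part of PyTorch:

```python
import glob
import os

def remove_stale_locks(cache_root):
    """Delete every leftover 'lock' file under a torch_extensions cache dir
    and return the paths that were removed."""
    removed = []
    pattern = os.path.join(cache_root, "**", "lock")
    for lock in glob.glob(pattern, recursive=True):
        os.remove(lock)
        removed.append(lock)
    return removed

# Typical default location (an assumption; print build_directory under pdb
# as described above to confirm yours):
# remove_stale_locks(os.path.expanduser("~/.cache/torch_extensions"))
```

Removing only the lock files is gentler than deleting the whole cache directory, since the already-compiled extension objects are kept and only the stale baton is cleared.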

@SamchungHwang

I removed the .cache directory, but the same issue occurs.

@lzaazl commented Dec 15, 2021

Exactly the same issue as yashnsn and Agrechka. Thank you so much, @KellyYutongHe!

@biphasic

@KellyYutongHe you're a hero

@JANVI2411

> go to your .cache directory, delete the lock file for your cpp extension (it is likely under the directory ~/.cache/torch_extensions/something) …

@KellyYutongHe Thank you so much!! You saved me a lot of time.

@AndyJZhao

> go to your .cache directory, delete the lock file for your cpp extension (it is likely under the directory ~/.cache/torch_extensions/something) …

Thanks for the great answer. Also, for those who have difficulty finding what the "something" is in ~/.cache/torch_extensions/something: I found it useful to evaluate the expression os.path.join(build_directory, 'lock') in a remote debug session (I use PyCharm remote debugging), which prints the full path. For me, the "something" happened to be spmm_0, so rm -rf ~/.cache/torch_extensions/spmm_0 fixed the problem.
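If attaching a debugger is inconvenient, the "something" directory can also be found by scanning the cache for leftover lock files. A small sketch, again assuming the default ~/.cache/torch_extensions location; the find_locked_extensions helper is hypothetical:

```python
import os

def find_locked_extensions(cache_root):
    """Return the names of extension build dirs that still hold a 'lock' file."""
    if not os.path.isdir(cache_root):
        return []
    return sorted(
        name
        for name in os.listdir(cache_root)
        if os.path.exists(os.path.join(cache_root, name, "lock"))
    )

# e.g. find_locked_extensions(os.path.expanduser("~/.cache/torch_extensions"))
# would list directories such as "spmm_0" that still hold a stale lock,
# telling you exactly which one to remove.
```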

@hhuang-code

> go to your .cache directory, delete the lock file for your cpp extension (it is likely under the directory ~/.cache/torch_extensions/something) …

It works!
