Cannot open shared object file #24

Closed
Vincentwei1021 opened this issue Feb 21, 2022 · 15 comments

Comments

@Vincentwei1021

Hi, I was trying to run the eva_finetune code with the provided docker 1.5 and I encountered the following issue:

Loading extension module utils...
Traceback (most recent call last):
  File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 506, in <module>
    main()
  File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 486, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args, config, ds_config, args.do_train)
  File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 139, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 110, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 198, in __init__
    util_ops = UtilsBuilder().load()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 176, in load
    return self.jit_load(verbose)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 204, in jit_load
    op_module = load(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
    return _jit_compile(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1362, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1752, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1101, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/xxx/.cache/torch_extensions/py38_cu102/utils/utils.so: cannot open shared object file: No such file or directory

where /home/xxx is my user home directory. I checked the path where the error occurred and found that there is no torch_extensions directory under it. Could you please help with this issue? Thanks in advance!
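
A quick way to confirm what the error message says is to look at the extension cache directly. The sketch below uses only the standard library and assumes PyTorch's usual behaviour of building JIT extensions under `TORCH_EXTENSIONS_DIR` when that variable is set, and under `~/.cache/torch_extensions` otherwise. The helper names `extensions_root` and `find_built_extensions` are hypothetical and are not part of torch or DeepSpeed.

```python
import os
from pathlib import Path


def extensions_root() -> Path:
    """Return the directory assumed to hold JIT-built torch extensions."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    # Assumed default location on Linux; adjust if your setup differs.
    return Path.home() / ".cache" / "torch_extensions"


def find_built_extensions(root: Path, name: str = "utils") -> list:
    """List every `<name>.so` found anywhere below `root` (empty if none)."""
    if not root.is_dir():
        return []
    return sorted(root.rglob(f"{name}.so"))


if __name__ == "__main__":
    root = extensions_root()
    hits = find_built_extensions(root)
    print(f"extension cache: {root}")
    print("utils.so found:" if hits else "utils.so not found")
    for path in hits:
        print(f"  {path}")
```

An empty result means the build never produced the shared object, so the import failure is a symptom of an earlier build problem rather than a wrong path.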

@Jiaxin-Wen
Member

What version of PyTorch is in your environment?

@Vincentwei1021
Author

What version of PyTorch is in your environment?
@XWwwwww

print(torch.__version__)
1.10.1+cu102
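
Since the thread goes on to compare torch and CUDA versions, it can help to report them all in one place. The following is a minimal sketch, not something taken from the EVA repository: `describe_environment` is a hypothetical helper, and it relies only on `torch.__version__` and `torch.version.cuda`, returning `None` for both when torch cannot be imported.

```python
import platform


def describe_environment() -> dict:
    """Collect the versions that matter when a torch extension fails to build."""
    info = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        import torch  # optional: the report still works if torch is missing
    except ImportError:
        return info
    info["torch"] = torch.__version__  # e.g. "1.10.1+cu102"
    info["cuda"] = torch.version.cuda  # CUDA version torch was built with
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```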

@t1101675
Member

Thanks for reporting this. We will check the environment.

@Hermes777

I encountered the same error. Here's my environment:
torch 1.10.1 + cu111 + gcc 9.3

@Hermes777

Hi. I upgraded torch to 1.10.2 and CUDA to cu113, and that solved the problem.

@Vincentwei1021
Author

Hi. I upgraded torch to 1.10.2 and CUDA to cu113, and that solved the problem.

@Hermes777 Hi, did the path /torch_extensions/py38_cu102/utils/utils.so exist after you upgraded torch and CUDA?

@Hermes777

Judging from "py38_cu102" in the path, you are using CUDA 10.2, which might not be compatible.
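
The directory name can be read mechanically and compared with the installed torch build. The sketch below assumes the `py<python>_cu<cuda>` naming seen in the paths quoted in this thread, with a single-digit Python major version and a single-digit CUDA minor version. Both helper names are hypothetical.

```python
import re
from typing import Optional, Tuple


def parse_build_tag(tag: str) -> Optional[Tuple[str, str]]:
    """Split a tag such as "py38_cu102" into ("3.8", "10.2").

    Returns None when the tag does not follow the assumed pattern.
    """
    match = re.fullmatch(r"py(\d)(\d+)_cu(\d+)(\d)", tag)
    if match is None:
        return None
    py_major, py_minor, cuda_major, cuda_minor = match.groups()
    return f"{py_major}.{py_minor}", f"{cuda_major}.{cuda_minor}"


def cuda_suffix(torch_version: str) -> Optional[str]:
    """Return the local build suffix of a torch version string, e.g. "cu102"."""
    _, sep, suffix = torch_version.partition("+")
    return suffix if sep else None


if __name__ == "__main__":
    print(parse_build_tag("py38_cu102"))  # ('3.8', '10.2')
    print(cuda_suffix("1.10.1+cu102"))    # cu102
```

If the CUDA part of the tag and the suffix of `torch.__version__` agree, the cache directory at least belongs to the torch build currently installed.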

@Hermes777

I checked the directory utils/, and it contains utils.so.

Remember to reinstall apex after you upgrade torch and CUDA.

@Vincentwei1021
Author

I checked the directory utils/, and it contains utils.so.

Remember to reinstall apex after you upgrade torch and CUDA.

Thanks so much. I will try it later.

@Vincentwei1021
Author

@t1101675 @XWwwwww Sorry, I'm still having this issue. I don't think the CUDA version is the problem, as I'm using CUDA 10.2, which your README suggests. I'm running in the provided docker image and was able to run the interactive scripts, but eva_finetune still gives me the error above. I checked the log and noticed the following while the extension module utils was being built:

Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: fatal: waitpid(113): No child processes

So the build of utils may have failed? Please help and give me some advice, thanks!

Jiaxin-Wen reopened this Apr 29, 2022
@Jiaxin-Wen
Member

Jiaxin-Wen commented Apr 29, 2022

@Vincentwei1021 We have uploaded the missing file to src/ds_fix/utils.so. Let me know if it works.

@Vincentwei1021
Author

@XWwwwww Hi, thanks for your reply. I tried using the file, and it gives the following error:
ImportError: /.cache/torch_extensions/py38_cu102/utils/utils.so: undefined symbol: _ZNK2at6Tensor6narrowElll

@Jiaxin-Wen
Member

@t1101675 @XWwwwww Sorry, I'm still having this issue. I don't think the CUDA version is the problem, as I'm using CUDA 10.2, which your README suggests. I'm running in the provided docker image and was able to run the interactive scripts, but eva_finetune still gives me the error above. I checked the log and noticed the following while the extension module utils was being built:

Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: fatal: waitpid(113): No child processes

So the build of utils may have failed? Please help and give me some advice, thanks!

I just noticed that you were already able to run the interactive scripts, which means the utils.so file has already been compiled automatically and saved in your cache path, right?
And the current error message is "ninja: fatal: waitpid(113): No child processes" rather than "cache/torch_extensions/py38_cu102/utils/utils.so: cannot open shared object file: No such file or directory"?

@Vincentwei1021
Author

@XWwwwww You are right. I ran the interactive scripts on a single-GPU cloud machine and the eva_finetune script on a multi-GPU cloud machine in distributed mode, both with the provided docker image. The interactive script builds utils.so without problems, but the finetune script fails with "ninja: fatal: waitpid(113): No child processes".

Anyway, I just tried copying the utils.so generated by the interactive script to my multi-GPU machine, and it works now. Thanks!
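
The copy step described above can be sketched as follows. This is only an illustration of the workaround reported in this comment, with assumptions: the helper `install_prebuilt_extension` is hypothetical, the `<cache_root>/<build_tag>/<name>/<name>.so` layout is taken from the paths quoted in this thread, and the prebuilt file is assumed to come from a machine with the same Python, torch, and CUDA build. A file built against a different torch may fail to load, for example with an undefined symbol error like the one reported earlier in this thread.

```python
import shutil
from pathlib import Path


def install_prebuilt_extension(prebuilt: Path, cache_root: Path,
                               build_tag: str = "py38_cu102",
                               name: str = "utils") -> Path:
    """Copy a prebuilt `<name>.so` into `<cache_root>/<build_tag>/<name>/`."""
    if not prebuilt.is_file():
        raise FileNotFoundError(f"prebuilt extension not found: {prebuilt}")
    target_dir = cache_root / build_tag / name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.so"
    shutil.copy2(prebuilt, target)  # copy2 also preserves file metadata
    return target


if __name__ == "__main__":
    # Hypothetical source path; replace it with wherever the working file lives.
    source = Path("utils.so")
    cache = Path.home() / ".cache" / "torch_extensions"
    if source.is_file():
        print(f"installed to {install_prebuilt_extension(source, cache)}")
    else:
        print(f"nothing to copy: {source} does not exist")
```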

@BaiMeiyingxue

spotted

@XWwwwww You are right. I ran the interactive scripts on a single-GPU cloud machine and the eva_finetune script on a multi-GPU cloud machine in distributed mode, both with the provided docker image. The interactive script builds utils.so without problems, but the finetune script fails with "ninja: fatal: waitpid(113): No child processes".

Anyway, I just tried copying the utils.so generated by the interactive script to my multi-GPU machine, and it works now. Thanks!

I ran into the same situation:

Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: fatal: waitpid(97011): No child processes

and then this error occurred:
....
  File "/home/zhangwenjuan/anaconda3/envs/evacuda11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/home/zhangwenjuan/anaconda3/envs/evacuda11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1317, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/home/zhangwenjuan/anaconda3/envs/evacuda11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1699, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "/home/zhangwenjuan/anaconda3/envs/evacuda11/lib/python3.8/imp.py", line 296, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'utils'

Could you give me some suggestions? Thank you!
