Cannot open shared object file #24
Comments
What version of PyTorch is in your environment?
Thanks for reporting this. We will check the environment.
I encountered the same error. Here's my environment:
Hi. I upgraded torch to 1.10.2 and CUDA to cu113, and that solved the problem.
@Hermes777 Hi, were you able to get the path /torch_extensions/py38_cu102/utils/utils.so after you upgraded torch and CUDA?
From the term "py38_cu102", you are evidently using CUDA 10.2, which might not be compatible.
I checked the utils/ directory, and it contains utils.so. Don't forget to reinstall apex after you upgrade torch and CUDA.
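For anyone unsure what the "py38_cu102" part of the path encodes, here is a minimal sketch that splits such a tag into a Python version and a CUDA version. The helper name parse_build_tag is hypothetical, and the tag layout (py<major><minor>_cu<CUDA digits>) is an assumption read off the path in this thread rather than a documented guarantee.

```python
import re
import sys


def parse_build_tag(tag):
    """Split a tag such as 'py38_cu102' into ('3.8', '10.2').

    Returns None when the tag does not follow the assumed
    py<major><minor>_cu<cuda digits> layout (for example a CPU-only tag).
    """
    match = re.fullmatch(r"py(\d)(\d+)_cu(\d+)(\d)", tag)
    if match is None:
        return None
    py_major, py_minor, cuda_major, cuda_minor = match.groups()
    return f"{py_major}.{py_minor}", f"{cuda_major}.{cuda_minor}"


if __name__ == "__main__":
    parsed = parse_build_tag("py38_cu102")
    running = f"{sys.version_info.major}.{sys.version_info.minor}"
    print("tag says (python, cuda):", parsed)
    print("running interpreter:", running)
```

The CUDA half can then be compared by hand against torch.version.cuda in the same environment; a tag of cu102 after an upgrade to cu113 would suggest the old environment is still the one being used.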
Thanks so much. I will try later.
@t1101675 @XWwwwww Sorry, I'm still having this issue. I don't think it is a CUDA version problem, as I'm using CUDA 10.2, which is suggested in your readme. I'm running with the provided docker and I was able to run the interactive scripts, but eva_finetune still gives me the aforementioned error. I checked the log and saw the message "Building extension module utils..." while the utils extension was being built, so the build of utils might have failed? Please help and give me some advice, thanks!
@Vincentwei1021 We have uploaded the missing file in
@XWwwwww Hi, thanks for your reply. I tried to use the file, and it gives the following error:
I just noticed that you have already been able to run the interactive scripts, which means the
@XWwwwww You are right. I ran the interactive scripts on a single-GPU cloud machine and the eva_finetune script on a multi-GPU cloud machine in distributed mode, both with the provided docker. The interactive script builds utils.so without problems, but the finetune script fails with ninja: fatal: waitpid(113): No child processes. Anyway, I just tried copying the utils.so generated by the interactive script to my multi-GPU machine, and it works well now. Thanks!
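The copy workaround described above can be sketched roughly as follows. The function name install_prebuilt_extension and its parameters are hypothetical. The destination layout (<extensions root>/py38_cu102/utils/utils.so) is taken from the error message in this issue. Whether the loader then picks up the copied file without rebuilding is what the commenter observed, not something this sketch guarantees, and the library presumably needs to come from a matching Python, torch, and CUDA setup.

```python
import shutil
from pathlib import Path


def install_prebuilt_extension(src_so, extensions_root,
                               build_tag="py38_cu102", name="utils"):
    """Copy a prebuilt <name>.so into <extensions_root>/<build_tag>/<name>/.

    The directory layout mirrors the path in the ImportError from this
    thread; build_tag and name are illustrative defaults.
    """
    src = Path(src_so)
    if not src.is_file():
        raise FileNotFoundError(f"prebuilt library not found: {src}")

    dest_dir = Path(extensions_root).expanduser() / build_tag / name
    dest_dir.mkdir(parents=True, exist_ok=True)  # create missing cache dirs

    dest = dest_dir / f"{name}.so"
    shutil.copy2(src, dest)  # copy2 also preserves timestamps and mode bits
    return dest


# Example call (paths are placeholders, not taken from the thread):
# install_prebuilt_extension("/path/to/utils.so", "~/.cache/torch_extensions")
```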
I ran into the same situation: the log shows "Building extension module utils..." and then this error occurred: Could you give me some suggestions? Thank you!
Hi, I was trying to run the eva_finetune code with the provided docker 1.5 and I encountered the following issue:
Loading extension module utils...
Traceback (most recent call last):
  File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 506, in <module>
    main()
  File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 486, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args, config, ds_config, args.do_train)
  File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 139, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 110, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 198, in __init__
    util_ops = UtilsBuilder().load()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 176, in load
    return self.jit_load(verbose)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 204, in jit_load
    op_module = load(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
    return _jit_compile(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1362, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1752, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1101, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/xxx/.cache/torch_extensions/py38_cu102/utils/utils.so: cannot open shared object file: No such file or directory
where /home/xxx is my user home directory. I checked the path where the error occurred and found that the torch_extensions directory does not exist under it. Could you please help with this issue? Thanks in advance!
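One way to make the check above repeatable is to walk the path from the error message and report the first piece that is missing. This is only a sketch; the helper name first_missing_component is hypothetical and it uses nothing beyond the standard library.

```python
from pathlib import Path


def first_missing_component(path):
    """Return the shallowest part of `path` that does not exist.

    Walks from the filesystem root down to `path` itself and returns the
    first ancestor (or the path) that is missing, or None if it all exists.
    """
    target = Path(path).expanduser()
    # Path.parents is ordered deepest-first, so reverse it to start at the root.
    ancestors = list(target.parents)[::-1]
    for candidate in ancestors + [target]:
        if not candidate.exists():
            return candidate
    return None


if __name__ == "__main__":
    # Path pattern taken from the ImportError above, with ~ for the home dir.
    so_path = "~/.cache/torch_extensions/py38_cu102/utils/utils.so"
    print("first missing component:", first_missing_component(so_path))
```

If this reports the torch_extensions directory itself, as described above, the extension was apparently never built on that machine. If only utils.so is missing, the build directory was created but the build presumably did not finish.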