Torch compile: libcuda.so cannot found #107960

Closed

sanchit-gandhi opened this issue Aug 25, 2023 · 14 comments
Labels
dependency issue oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@sanchit-gandhi

sanchit-gandhi commented Aug 25, 2023

🐛 Describe the bug

Using torch.compile on a Colab T4 GPU fails with a very cryptic error when running on the 2.1 nightly build.
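
Roughly, the setup is the following (a sketch; the checkpoint and compile options in the linked notebook may differ slightly):

import torch
from diffusers import AudioLDM2Pipeline

# load the AudioLDM2 pipeline and move it to the T4 GPU
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# compiling the UNet routes the first forward pass through the inductor backend
pipe.unet = torch.compile(pipe.unet)

audio = pipe("brazilian samba drums").audios[0]  # raises "libcuda.so cannot found!"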

Error logs
---------------------------------------------------------------------------
BackendCompilerFailed                     Traceback (most recent call last)
<ipython-input-8-3e6b92348d53> in <cell line: 1>()
----> 1 audio = pipe("brazilian samba drums").audios[0]

52 frames
/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py in decorate_context(*args, **kwargs)
    113     def decorate_context(*args, **kwargs):
    114         with ctx_factory():
--> 115             return func(*args, **kwargs)
    116 
    117     return decorate_context

/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/pipeline_audioldm2.py in __call__(self, prompt, audio_length_in_s, num_inference_steps, guidance_scale, negative_prompt, num_waveforms_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, generated_prompt_embeds, negative_generated_prompt_embeds, attention_mask, negative_attention_mask, max_new_tokens, return_dict, callback, callback_steps, cross_attention_kwargs, output_type)
    925 
    926                 # predict the noise residual
--> 927                 noise_pred = self.unet(
    928                     latent_model_input,
    929                     t,

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1519 
   1520     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1528 
   1529         try:

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py in _fn(*args, **kwargs)
    326             dynamic_ctx.__enter__()
    327             try:
--> 328                 return fn(*args, **kwargs)
    329             finally:
    330                 set_eval_frame(prior)

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1519 
   1520     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1528 
   1529         try:

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py in catch_errors(frame, cache_entry, frame_state)
    486 
    487         with compile_lock, _disable_current_modes():
--> 488             return callback(frame, cache_entry, hooks, frame_state)
    489 
    490     catch_errors._torchdynamo_orig_callable = callback  # type: ignore[attr-defined]

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py in _convert_frame(frame, cache_entry, hooks, frame_state)
    623         counters["frames"]["total"] += 1
    624         try:
--> 625             result = inner_convert(frame, cache_entry, hooks, frame_state)
    626             counters["frames"]["ok"] += 1
    627             return result

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py in _fn(*args, **kwargs)
    137         cleanup = setup_compile_debug()
    138         try:
--> 139             return fn(*args, **kwargs)
    140         finally:
    141             cleanup.close()

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py in _convert_frame_assert(frame, cache_entry, hooks, frame_state)
    378         )
    379 
--> 380         return _compile(
    381             frame.f_code,
    382             frame.f_globals,

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py in _compile(code, globals, locals, builtins, compiler_fn, one_graph, export, export_constraints, hooks, cache_size, frame, frame_state, compile_id)
    553     with compile_context(CompileContext(compile_id)):
    554         try:
--> 555             guarded_code = compile_inner(code, one_graph, hooks, transform)
    556             return guarded_code
    557         except (

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py in time_wrapper(*args, **kwargs)
    187             with torch.profiler.record_function(f"{key} (dynamo_timed)"):
    188                 t0 = time.time()
--> 189                 r = func(*args, **kwargs)
    190                 time_spent = time.time() - t0
    191             compilation_time_metrics[key].append(time_spent)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py in compile_inner(code, one_graph, hooks, transform)
    475         for attempt in itertools.count():
    476             try:
--> 477                 out_code = transform_code_object(code, transform)
    478                 orig_code_map[out_code] = code
    479                 break

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py in transform_code_object(code, transformations, safe)
   1026     propagate_line_nums(instructions)
   1027 
-> 1028     transformations(instructions, code_options)
   1029     return clean_and_assemble_instructions(instructions, keys, code_options)[1]
   1030 

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py in transform(instructions, code_options)
    442         try:
    443             with tracing(tracer.output.tracing_context):
--> 444                 tracer.run()
    445         except (exc.RestartAnalysis, exc.SkipFrame):
    446             raise

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py in run(self)
   2072 
   2073     def run(self):
-> 2074         super().run()
   2075 
   2076     def match_nested_cell(self, name, cell):

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py in run(self)
    722                     self.instruction_pointer is not None
    723                     and not self.output.should_exit
--> 724                     and self.step()
    725                 ):
    726                     pass

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py in step(self)
    686                 self.f_code.co_filename, self.lineno, self.f_code.co_name
    687             )
--> 688             getattr(self, inst.opname)(inst)
    689 
    690             return inst.opname != "RETURN_VALUE"

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py in RETURN_VALUE(self, inst)
   2160         )
   2161         log.debug("RETURN_VALUE triggered compile")
-> 2162         self.output.compile_subgraph(
   2163             self,
   2164             reason=GraphCompileReason(

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py in compile_subgraph(self, tx, partial_convert, reason)
    855             if count_calls(self.graph) != 0 or len(pass2.graph_outputs) != 0:
    856                 output.extend(
--> 857                     self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
    858                 )
    859 

/usr/lib/python3.10/contextlib.py in inner(*args, **kwds)
     77         def inner(*args, **kwds):
     78             with self._recreate_cm():
---> 79                 return func(*args, **kwds)
     80         return inner
     81 

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py in compile_and_call_fx_graph(self, tx, rv, root)
    955         )
    956 
--> 957         compiled_fn = self.call_user_compiler(gm)
    958         compiled_fn = disable(compiled_fn)
    959 

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py in time_wrapper(*args, **kwargs)
    187             with torch.profiler.record_function(f"{key} (dynamo_timed)"):
    188                 t0 = time.time()
--> 189                 r = func(*args, **kwargs)
    190                 time_spent = time.time() - t0
    191             compilation_time_metrics[key].append(time_spent)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py in call_user_compiler(self, gm)
   1022             unimplemented_with_warning(e, self.root_tx.f_code, msg)
   1023         except Exception as e:
-> 1024             raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
   1025                 e.__traceback__
   1026             ) from None

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py in call_user_compiler(self, gm)
   1007             if config.verify_correctness:
   1008                 compiler_fn = WrapperBackend(compiler_fn)
-> 1009             compiled_fn = compiler_fn(gm, self.example_inputs())
   1010             _step_logger()(logging.INFO, f"done compiler function {name}")
   1011             assert callable(compiled_fn), "compiler_fn did not return callable"

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/repro/after_dynamo.py in debug_wrapper(gm, example_inputs, **kwargs)
    115                     raise
    116         else:
--> 117             compiled_gm = compiler_fn(gm, example_inputs)
    118 
    119         return compiled_gm

/usr/local/lib/python3.10/dist-packages/torch/__init__.py in __call__(self, model_, inputs_)
   1566         from torch._inductor.compile_fx import compile_fx
   1567 
-> 1568         return compile_fx(model_, inputs_, config_patches=self.config)
   1569 
   1570     def get_compiler_config(self):

/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py in compile_fx(model_, example_inputs_, inner_compile, config_patches, decompositions)
   1148         tracing_context
   1149     ), compiled_autograd.disable():
-> 1150         return aot_autograd(
   1151             fw_compiler=fw_compiler,
   1152             bw_compiler=bw_compiler,

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/backends/common.py in compiler_fn(gm, example_inputs)
     53             # NB: NOT cloned!
     54             with enable_aot_logging(), patch_config:
---> 55                 cg = aot_module_simplified(gm, example_inputs, **kwargs)
     56                 counters["aot_autograd"]["ok"] += 1
     57                 return disable(cg)

/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py in aot_module_simplified(mod, args, fw_compiler, bw_compiler, partition_fn, decompositions, keep_inference_input_mutations, inference_compiler)
   3889 
   3890     with compiled_autograd.disable():
-> 3891         compiled_fn = create_aot_dispatcher_function(
   3892             functional_call,
   3893             full_args,

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py in time_wrapper(*args, **kwargs)
    187             with torch.profiler.record_function(f"{key} (dynamo_timed)"):
    188                 t0 = time.time()
--> 189                 r = func(*args, **kwargs)
    190                 time_spent = time.time() - t0
    191             compilation_time_metrics[key].append(time_spent)

/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py in create_aot_dispatcher_function(flat_fn, flat_args, aot_config)
   3427         # You can put more passes here
   3428 
-> 3429         compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
   3430         if aot_config.is_export:
   3431 

/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py in aot_wrapper_dedupe(flat_fn, flat_args, aot_config, compiler_fn, fw_metadata)
   2210 
   2211     if ok:
-> 2212         return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
   2213 
   2214     # export path: ban duplicate inputs for now, add later if requested.

/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py in aot_wrapper_synthetic_base(flat_fn, flat_args, aot_config, fw_metadata, needs_autograd, compiler_fn)
   2390     # Happy path: we don't need synthetic bases
   2391     if synthetic_base_info is None:
-> 2392         return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
   2393 
   2394     # export path: ban synthetic bases for now, add later if requested.

/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py in aot_dispatch_base(flat_fn, flat_args, aot_config, fw_metadata)
   1571         if torch._guards.TracingContext.get():
   1572             torch._guards.TracingContext.get().fw_metadata = fw_metadata
-> 1573         compiled_fw = compiler(fw_module, flat_args)
   1574 
   1575     # This boxed_call handling happens inside create_runtime_wrapper as well.

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py in time_wrapper(*args, **kwargs)
    187             with torch.profiler.record_function(f"{key} (dynamo_timed)"):
    188                 t0 = time.time()
--> 189                 r = func(*args, **kwargs)
    190                 time_spent = time.time() - t0
    191             compilation_time_metrics[key].append(time_spent)

/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py in fw_compiler_base(model, example_inputs, is_inference)
   1090             }
   1091 
-> 1092         return inner_compile(
   1093             model,
   1094             example_inputs,

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/repro/after_aot.py in debug_wrapper(gm, example_inputs, **kwargs)
     78             # Call the compiler_fn - which is either aot_autograd or inductor
     79             # with fake inputs
---> 80             inner_compiled_fn = compiler_fn(gm, example_inputs)
     81         except Exception as e:
     82             # TODO: Failures here are troublesome because no real inputs,

/usr/local/lib/python3.10/dist-packages/torch/_inductor/debug.py in inner(*args, **kwargs)
    226         def inner(*args, **kwargs):
    227             with DebugContext():
--> 228                 return fn(*args, **kwargs)
    229 
    230         return wrap_compiler_debug(inner, compiler_name="inductor")

/usr/lib/python3.10/contextlib.py in inner(*args, **kwds)
     77         def inner(*args, **kwds):
     78             with self._recreate_cm():
---> 79                 return func(*args, **kwds)
     80         return inner
     81 

/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py in newFunction(*args, **kwargs)
     52             @wraps(old_func)
     53             def newFunction(*args, **kwargs):
---> 54                 return old_func(*args, **kwargs)
     55 
     56             return newFunction

/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py in compile_fx_inner(gm, example_inputs, cudagraphs, num_fixed, is_backward, graph_id, cpp_wrapper, aot_mode, is_inference, boxed_forward_device_index, user_visible_outputs, layout_opt)
    339     }
    340 
--> 341     compiled_graph: CompiledFxGraph = fx_codegen_and_compile(
    342         *graph_args, **graph_kwargs  # type: ignore[arg-type]
    343     )

/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py in fx_codegen_and_compile(gm, example_inputs, cudagraphs, num_fixed, is_backward, graph_id, cpp_wrapper, aot_mode, is_inference, user_visible_outputs, layout_opt)
    563                     else:
    564                         context.output_strides.append(None)
--> 565             compiled_fn = graph.compile_to_fn()
    566 
    567             if graph.disable_cudagraphs:

/usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py in compile_to_fn(self)
    965             return AotCodeCache.compile(self, code, cuda=self.cuda)
    966         else:
--> 967             return self.compile_to_module().call
    968 
    969     def get_output_names(self):

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py in time_wrapper(*args, **kwargs)
    187             with torch.profiler.record_function(f"{key} (dynamo_timed)"):
    188                 t0 = time.time()
--> 189                 r = func(*args, **kwargs)
    190                 time_spent = time.time() - t0
    191             compilation_time_metrics[key].append(time_spent)

/usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py in compile_to_module(self)
    936         linemap = [(line_no, node.stack_trace) for line_no, node in linemap]
    937         key, path = PyCodeCache.write(code)
--> 938         mod = PyCodeCache.load_by_key_path(key, path, linemap=linemap)
    939         self.cache_key = key
    940         self.cache_path = path

/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py in load_by_key_path(cls, key, path, linemap)
   1137                 mod.__file__ = path
   1138                 mod.key = key
-> 1139                 exec(code, mod.__dict__, mod.__dict__)
   1140                 sys.modules[mod.__name__] = mod
   1141                 # another thread might set this first

/tmp/torchinductor_root/7n/c7nibse6wgoaypjbdhgrkgsgbfcxqvdo7jque3pr52vtf6vor5yi.py in <module>
   7289 
   7290 
-> 7291 async_compile.wait(globals())
   7292 del async_compile
   7293 

/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py in wait(self, scope)
   1416                     pbar.set_postfix_str(key)
   1417                 if isinstance(result, (Future, TritonFuture)):
-> 1418                     scope[key] = result.result()
   1419                     pbar.update(1)
   1420 

/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py in result(self)
   1275             return self.kernel
   1276         # If the worker failed this will throw an exception.
-> 1277         self.future.result()
   1278         kernel = self.kernel = _load_kernel(self.kernel_name, self.source_code)
   1279         latency = time() - t0

/usr/lib/python3.10/concurrent/futures/_base.py in result(self, timeout)
    456                     raise CancelledError()
    457                 elif self._state == FINISHED:
--> 458                     return self.__get_result()
    459                 else:
    460                     raise TimeoutError()

/usr/lib/python3.10/concurrent/futures/_base.py in __get_result(self)
    401         if self._exception:
    402             try:
--> 403                 raise self._exception
    404             finally:
    405                 # Break a reference cycle with the exception in self._exception

BackendCompilerFailed: backend='inductor' raised:
AssertionError: libcuda.so cannot found!


Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Minified repro

https://colab.research.google.com/drive/1XwD2UpPoi6RFLHA9tcXL7BdbOgkKvQ_7?usp=sharing

Versions

PyTorch version: 2.1.0.dev20230825+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.27.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.109+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 0
BogoMIPS: 4399.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32 KiB (1 instance)
L1i cache: 32 KiB (1 instance)
L2 cache: 256 KiB (1 instance)
L3 cache: 55 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0,1
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable; SMT Host state unknown
Vulnerability Meltdown: Vulnerable
Vulnerability Mmio stale data: Vulnerable
Vulnerability Retbleed: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-triton==2.1.0+e6216047b8
[pip3] torch==2.1.0.dev20230825+cu121
[pip3] torchaudio==2.1.0.dev20230825+cu121
[pip3] torchdata==0.6.1
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.15.2
[pip3] torchvision==0.15.2+cu118
[pip3] triton==2.0.0
[conda] Could not collect

cc @ezyang @msaroufim @wconstab @bdhirsh @anijain2305

@msaroufim
Member

I don't believe this is a torch.compile bug. The problematic line in your notebook is !pip install --quiet --upgrade --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121, and when I removed it things worked as expected.

When you pip-install the torch nightlies, you may need to make sure that LD_LIBRARY_PATH points to the directory containing your libcuda.so. When you install multiple versions of torch, you end up with more than one copy of the CUDA libraries, which is when this issue pops up. If setting the library path does not work, try setting up and activating a virtual environment to avoid these kinds of issues.

Also keep in mind that torch.compile() won't give you major speedups on a T4 GPU; it's much more optimized for Ampere and newer architectures.
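
For the virtual-environment route, a minimal sketch (outside Colab; the environment name here is arbitrary, and the index URL is the same nightly one from the notebook):

python -m venv torch-nightly
source torch-nightly/bin/activate
pip install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121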

@msaroufim msaroufim added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module dependency issue labels Aug 25, 2023
@sanchit-gandhi
Author

Hey @msaroufim - many thanks for the detailed response! However, I'm not sure this originates from having multiple versions of PyTorch installed.

If I uninstall the base version of PyTorch and then install the PyTorch nightly, I still get the libcuda.so error, despite only having one version of PyTorch installed. See the updated Colab: https://colab.research.google.com/drive/1XwD2UpPoi6RFLHA9tcXL7BdbOgkKvQ_7?usp=sharing#scrollTo=1uIwmR_FNZ21

If you're confident this is a Colab issue, I can raise it with the Google team! But it would be great to rule out a PyTorch compile issue before doing so!

@poly-mer

poly-mer commented Sep 7, 2023

Trying your Colab notebook, I could verify that the issue isn't from PyTorch, but I have found a workaround:

After installing the nightly build, running ldconfig -p | grep libcuda shows whether libcuda.so is in the dynamic linker's list of shared libraries:

	libcudart.so.11.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
	libcudart.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so

No libcuda.so to be found. We can run find /usr -name 'libcuda.so' to locate it, or check the environment variables with export:

/usr/local/cuda-11.8/compat/libcuda.so
/usr/local/cuda-11.8/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/lib64-nvidia/libcuda.so
. . .
declare -x LC_ALL="en_US.UTF-8"
declare -x LD_LIBRARY_PATH="/usr/lib64-nvidia"
declare -x LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
. . .

The stubs paths are not what we want here, so we manually add /usr/lib64-nvidia to the shared library cache:

ldconfig /usr/lib64-nvidia
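
Re-running the check from above should now list the driver library, roughly like this:

ldconfig -p | grep libcuda

	libcuda.so (libc6,x86-64) => /usr/lib64-nvidia/libcuda.so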

Verify inference both without and with compilation:

audio = pipe("brazilian samba drums").audios[0]

100% 200/200 [00:20<00:00, 10.95it/s]

The first call with compilation takes a long time (100% 200/200 [03:46<00:00, 20.34it/s]), but here's what subsequent runs look like:

audio = pipe("brazilian samba drums").audios[0]

100% 200/200 [00:09<00:00, 20.64it/s]

We see a big speed increase even on the T4 GPU (although the initial compilation does take quite a while). Keep in mind that all the shell commands are to be run with a preceding ! (e.g. !export).

@xuuyangg

What is the problem and how can I solve it? I don't understand.

@nishaanthkanna

@poly-mer's solution worked for me, though I saw no speedup.

These are the commands I ran, step by step:

!export LC_ALL="en_US.UTF-8"
!export LD_LIBRARY_PATH="/usr/lib64-nvidia"
!export LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
!ldconfig /usr/lib64-nvidia
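
(Note: in Colab, each ! command runs in its own throwaway subshell, so the !export lines above do not persist to later cells; most likely only the ldconfig line had an effect. A persistent alternative is the %env magic, e.g.:)

%env LD_LIBRARY_PATH=/usr/lib64-nvidia
!ldconfig /usr/lib64-nvidia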

@bhack
Contributor

bhack commented Nov 11, 2023

Same issue on the official PyTorch CUDA 2.1.0 Docker image.

@g-i-o-r-g-i-o

I'm exhausted.

@g-i-o-r-g-i-o

@poly-mer's solution worked for me, though I saw no speedup.

These are the commands I ran, step by step:

!export LC_ALL="en_US.UTF-8"
!export LD_LIBRARY_PATH="/usr/lib64-nvidia"
!export LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
!ldconfig /usr/lib64-nvidia

This works, thanks.

@florian98765

florian98765 commented Nov 23, 2023

I have the same problem, even with the preinstalled torch version.
The following code fails when running on a T4:

import torch

torch.set_default_device("cuda:0")
@torch.compile
def test(x):
  return torch.sin(x)

a = torch.zeros(100)
test(a)

After a long backtrace it reports:

BackendCompilerFailed: backend='inductor' raised:
AssertionError: libcuda.so cannot found!

Calling ldconfig as mentioned above seems to work around this problem:

!ldconfig /usr/lib64-nvidia

best wishes
Florian

@takuma-yoneda

I'm using Singularity with the docker://nvidia/cuda:12.1.0-devel-ubuntu22.04 image and the --nv option.
In this case, libcuda.so was at /.singularity.d/libs/libcuda.so, so I just needed to run:

ldconfig /.singularity.d/libs
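
The same two-step pattern should carry over to other container setups (a sketch; skip the CUDA toolkit stubs directories, which hold linker stubs rather than the real driver):

find / -name 'libcuda.so*' -not -path '*stubs*' 2>/dev/null   # locate the real driver library
ldconfig <directory-containing-libcuda.so>                    # e.g. /usr/lib64-nvidia or /.singularity.d/libs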

@bogdanmagometa

Do you guys know how to solve "libcuda.so cannot found" if I don't have root access? The CUDA driver itself works fine (checked with nvidia-smi and torch).
Can I somehow point torch.compile to libcuda.so?

@bogdanmagometa

Do you guys know how to solve "libcuda.so cannot found" if I don't have root access? The CUDA driver itself works fine (checked with nvidia-smi and torch). Can I somehow point torch.compile to libcuda.so?

Downgrading triton from 2.1.1 to 2.0.* in conda solved "libcuda.so cannot found" for me.
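
For reference, a pip sketch of that pin (the exact command inside a conda environment may differ):

pip install "triton==2.0.*"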

@lp20010415

(Quoting @poly-mer's workaround above, machine-translated into Chinese: check ldconfig -p | grep libcuda, then register the driver directory with ldconfig /usr/lib64-nvidia.)

Thanks, it works.

@mlazos
Contributor

mlazos commented Mar 15, 2024

Closing; feel free to reopen if needed.

@mlazos mlazos closed this as completed Mar 15, 2024