pip-installed Taichi crashes on Google colab kernels #235

Closed
znah opened this issue Oct 30, 2019 · 69 comments
Labels: stale (stale issues and PRs), welcome contribution

@znah
Contributor

znah commented Oct 30, 2019

Opening an empty CPU-backed notebook at https://colab.research.google.com and running the following code leads to a crash:

!apt install clang-7
!apt install clang-format
!pip install taichi-nightly
import taichi as ti

x, y = ti.var(ti.f32), ti.var(ti.f32)

@ti.layout
def xy():
  ti.root.dense(ti.ij, 16).place(x, y)

@ti.kernel
def laplace():
  for i, j in x:
    if (i + j) % 3 == 0:
      y[i, j] = 4.0 * x[i, j] - x[i - 1, j] - x[i + 1, j] - x[i, j - 1] - x[i, j + 1]
    else:
      y[i, j] = 0.0

for i in range(10):
  x[i, i + 1] = 1.0

laplace()

for i in range(10):
  print(y[i, i + 1])

And the relevant runtime logs say:

Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::operator()()
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::compile()
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Program::compile(taichi::Tlang::Kernel&)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::KernelCodeGen::compile(taichi::Tlang::Program&, taichi::Tlang::Kernel&)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::CPUCodeGen::lower_cpp()
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::irpass::lower(taichi::Tlang::IRNode*)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /lib/x86_64-linux-gnu/libc.so.6: abort
Oct 30, 2019, 3:47:15 PM | WARNING | /lib/x86_64-linux-gnu/libc.so.6: gsignal
Oct 30, 2019, 3:47:15 PM | WARNING | /lib/x86_64-linux-gnu/libc.so.6:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
Oct 30, 2019, 3:47:15 PM | WARNING | ***************************
Oct 30, 2019, 3:47:15 PM | WARNING | * Taichi Core Stack Trace *
Oct 30, 2019, 3:47:15 PM | WARNING | ***************************
Oct 30, 2019, 3:47:15 PM | WARNING | [E 10/30/19 14:47:15.371] Received signal 6 (Aborted)
Oct 30, 2019, 3:47:15 PM | WARNING | [I 10/30/19 14:47:15.340] [base.cpp:generate_binary@125] Compilation time: 2889.9 ms
Oct 30, 2019, 3:47:12 PM | WARNING | [T 10/30/19 14:47:12.056] [logging.cpp:Logger@67] Taichi core started. Thread ID = 122

Can you please provide some insight into the possible root of the problem, if you have any off the top of your head?

@yuanming-hu
Member

Thanks for reporting this. Taichi crashes during the AST lowering process on Google Colab. The same script runs fine offline though. It might be related to the use of C++ exceptions during AST lowering, but I currently don't have a clear idea of what's wrong...

@yuanming-hu
Member

In general, I think using Google colab for Taichi is a good idea. I'll dig deeper into this later.

More debug information:

lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.3 LTS
Release:	18.04
Codename:	bionic

@yuanming-hu
Member

yuanming-hu commented Oct 31, 2019

Update: I tested exception throwing and it works fine on colab. There may be some other reason.

@yuanming-hu yuanming-hu self-assigned this Oct 31, 2019
@znah
Contributor Author

znah commented Nov 4, 2019

I tried the 0.0.80 version; here is the error log:

[Release mode]
[T 11/04/19 10:03:46.767] [logging.cpp:Logger@67] Taichi core started. Thread ID = 154
[Taichi version 0.0.80, cpu only, commit 5ad67ce]
[I 11/04/19 10:03:46.779] [taichi_llvm_context.cpp:TaichiLLVMContext@59] Creating llvm context for arch: x86_64
Materializing layout...
[I 11/04/19 10:03:46.832] [codegen_llvm_x86.cpp:global_optimize_module_x86_64@93] Global optimization time: 40.946 ms
[I 11/04/19 10:03:46.834] [struct_llvm.cpp:operator()@277] Allocating data structure of size 2048
Initializing runtime with 4 elements
Runtime initialized.
[E 11/04/19 10:03:46.845] Received signal 6 (Aborted)


***********************************
* Taichi Compiler Stack Traceback *
***********************************

/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f27fde1ef20]
/lib/x86_64-linux-gnu/libc.so.6: gsignal
/lib/x86_64-linux-gnu/libc.so.6: abort
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x1f77728) [0x7f27f28eb728]
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendAssignStmt*)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::irpass::lower(taichi::Tlang::IRNode*)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::CPUCodeGen::lower_llvm()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::CPUCodeGen::lower()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::KernelCodeGen::compile(taichi::Tlang::Program&, taichi::Tlang::Kernel&)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Program::compile(taichi::Tlang::Kernel&)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::compile()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::operator()()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x9239cd) [0x7f27f12979cd]
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x7801ed) [0x7f27f10f41ed]
.........

@znah
Contributor Author

znah commented Nov 4, 2019

Here is the notebook where I try to install or build Taichi in colab kernel.

@znah
Contributor Author

znah commented Nov 4, 2019

Also, GPU version crashes for a different reason:

[Release mode]
Using CUDA Device [0]: Tesla K80
Device Compute Capability: 3.7
[T 11/04/19 16:17:15.299] [logging.cpp:Logger@67] Taichi core started. Thread ID = 176
[Taichi version 0.0.81, cuda 10.0, commit 54751054]
[E 11/04/19 16:17:15.317] [unified_allocator.cpp:UnifiedAllocator@24] GPU memory allocation failed.
[E 11/04/19 16:17:15.317] Received signal 6 (Aborted)
***********************************
* Taichi Compiler Stack Traceback *
***********************************
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f0f30e80f20]
/lib/x86_64-linux-gnu/libc.so.6: gsignal
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::UnifiedAllocator::UnifiedAllocator(unsigned long, bool)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::UnifiedAllocator::create()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Program::Program(taichi::Tlang::Arch)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x9680f9) [0x7f0f241bc0f9]
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x7c92dd) [0x7f0f2401d2dd]
...

@yuanming-hu
Member

Thanks for testing! I'll try to take a deeper look into this later today.

@znah
Contributor Author

znah commented Dec 5, 2019

Hi! Do you have any update on this? I'm now trying to build llvm and taichi on colab, but it takes a while...

@yuanming-hu
Member

Hi @znah,

Sorry I haven't got a chance to work on this. I think colab is a great place for using Taichi, however, it's also very hard to debug what's wrong...

A month ago, the crash happened during Taichi IR compilation. I couldn't reproduce this on any other environment.

If you could help investigate what's wrong, that would be great! It's also worth checking if the latest python wheels of Taichi still crash. You know, I'm somewhere on earth without access to google.

Thanks,
Yuanming

@yuanming-hu
Member

The GPU crash is due to a virtual memory allocation issue. We should first make sure the CPU version works.

@znah
Contributor Author

znah commented Dec 5, 2019

Here is the notebook where I try to build a dev version. I'm certainly doing something wrong, but I have a pre-built LLVM, so that we don't have to wait for it again.

@znah
Contributor Author

znah commented Dec 5, 2019

So I basically reproduced the same error with taichi built from source on colab. Where to go from here?

@yuanming-hu
Member

Thanks for the notebook! It seems that I don't have permission to access it yet, so I requested access. Could you approve? If we can build from source on colab, I think one thing to do is a debug build (cmake .. -DCMAKE_BUILD_TYPE="Debug"), then run the script with python under gdb and see exactly which line caused the error... If gdb is not supported on colab (since it's interactive), maybe it's better to use printf...

@yuanming-hu
Member

Thanks, I have access now! It's late in my place, but let me try doing a debug build now before I go to sleep.

@znah
Contributor Author

znah commented Dec 5, 2019

Thank you! I've actually started the debug build already. Waiting...

@yuanming-hu
Member

Oh, thanks! I'll continue working on this first thing tomorrow morning then. I hope the crashing reason is clear under the debug build. The notebook file you have shared is super useful! Let's see what will happen :-)

@znah
Contributor Author

znah commented Dec 5, 2019

Thanks again. No need to rush, I just wanted to make sure Taichi works in colab someday.
Meanwhile I have remote gdb in colab :)

@znah
Contributor Author

znah commented Dec 5, 2019

All I have so far:

(gdb) info line
Line 200 of "/content/taichi/taichi/transforms/lower_ast.cpp"
   starts at address 0x7fe6fdd1ad7c <taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendAssignStmt*)+348>
   and ends at 0x7fe6fdd1adb3 <taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendAssignStmt*)+403>.

(gdb) info args
this = 0x7ffed9e0bad8
assign = 0x3127fa0

(gdb) info local
expr = {
  expr = std::shared_ptr<taichi::Tlang::Expression> (use count 1767994415, weak count 795437154) = {get() = 0x30e9910}, const_value = false, atomic = false}
flattened = {stmts = std::vector of length 5, capacity 8 = {
    std::unique_ptr<taichi::Tlang::Stmt> = {get() = 0x7fe70b68db10}, 
    std::unique_ptr<taichi::Tlang::Stmt> = {get() = 0x7fe70b6b5510}, 
    std::unique_ptr<taichi::Tlang::Stmt> = {
      get() = 0x7fe70b6b89f0 <_rtld_global+2448>}, 
    std::unique_ptr<taichi::Tlang::Stmt> = {get() = 0x0}, 
    std::unique_ptr<taichi::Tlang::Stmt> = {get() = 0x7fe70b68db10}}}
(gdb) 

@yuanming-hu
Member

Thanks for the info! It might be due to shared pointer issues/memory corruption, but I need to dig deeper into this.

I'm making use of your notebook to build Taichi and diagnose. That's super helpful. Thank you for providing that.

It will also be helpful to have a stack backtrace when it crashes, i.e. bt in gdb, so that we know what exactly triggers the crash.
Is "remote gdb in colab" accessible to everyone or just Google people? :-)
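
The `bt` request above can also be scripted, which may help in an environment like Colab where an interactive gdb session is awkward. This is a minimal sketch, assuming gdb is installed on the machine. `gdb_backtrace_command`, `run_under_gdb` and `repro.py` are hypothetical names introduced here; `-batch`, `-ex` and `--args` are standard gdb options.

```python
import subprocess
import sys


def gdb_backtrace_command(script, python=sys.executable):
    """Build a gdb command line that runs `script` and prints a backtrace when it stops."""
    return [
        "gdb", "-batch",   # exit once the commands below have run
        "-ex", "run",      # start the Python process
        "-ex", "bt",       # print the stack after the process stops (e.g. on SIGABRT)
        "--args", python, script,
    ]


def run_under_gdb(script):
    """Run the script under gdb and return gdb's combined output (requires gdb)."""
    result = subprocess.run(
        gdb_backtrace_command(script),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the command here; call run_under_gdb("repro.py") to execute it.
    print(" ".join(gdb_backtrace_command("repro.py")))
```

In a notebook, the returned text can simply be printed from a cell, which avoids needing an interactive prompt.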

@yuanming-hu
Member

The program crashes when the IRModified() exception is thrown.

@znah
Contributor Author

znah commented Dec 6, 2019

Could the crash happen during stack unwinding (i.e. some destructor is not virtual...)?
I see quite a few compiler warnings, by the way.

here is the stack:

(gdb) bt
#0  0x00007f464844e6c2 in __GI___waitpid (pid=2903, 
    stat_loc=stat_loc@entry=0x7ffe2e7a0c08, options=options@entry=0)
    at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00007f46483b9067 in do_system (line=<optimized out>)
    at ../sysdeps/posix/system.c:149
#2  0x00007f463aea9ac3 in taichi::signal_handler (signo=6)
    at /content/taichi/taichi/core/logging.cpp:134
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f46483aa801 in __GI_abort () at abort.c:79
#6  0x00007f463c7bf348 in _Unwind_Resume ()
   from /content/taichi/build/taichi_core.so
#7  0x00007f463b1d6244 in taichi::Tlang::LowerAST::visit (this=0x7ffe2e7a1ca8, 
    assign=0x3e12840) at /content/taichi/taichi/transforms/lower_ast.cpp:209
#8  0x00007f463af9a7be in taichi::Tlang::FrontendAssignStmt::accept (
    this=0x3e12840, visitor=0x7ffe2e7a1ca8) at /content/taichi/taichi/ir.h:1567
#9  0x00007f463b1d4335 in taichi::Tlang::LowerAST::visit (this=0x7ffe2e7a1ca8, 
    stmt_list=0x27dd170) at /content/taichi/taichi/transforms/lower_ast.cpp:26
#10 0x00007f463afa49ee in taichi::Tlang::Block::accept (this=0x27dd170, 
    visitor=0x7ffe2e7a1ca8) at /content/taichi/taichi/ir.h:1455
#11 0x00007f463b1d420f in taichi::Tlang::LowerAST::run (node=0x27dd170)
    at /content/taichi/taichi/transforms/lower_ast.cpp:285
#12 0x00007f463b1d4075 in taichi::Tlang::irpass::lower (root=0x27dd170)
    at /content/taichi/taichi/transforms/lower_ast.cpp:298
#13 0x00007f463ae1af3d in taichi::Tlang::CPUCodeGen::lower_llvm (
    this=0x7ffe2e7a2a48) at /content/taichi/taichi/backends/codegen_x86.cpp:706
#14 0x00007f463ae1cccf in taichi::Tlang::CPUCodeGen::lower (
    this=0x7ffe2e7a2a48) at /content/taichi/taichi/backends/codegen_x86.cpp:825
#15 0x00007f463ae303a6 in taichi::Tlang::KernelCodeGen::compile (
    this=0x7ffe2e7a2a48, prog=..., kernel=...)
    at /content/taichi/taichi/backends/kernel.cpp:13
#16 0x00007f463afc3ff2 in taichi::Tlang::Program::compile (this=0x2b8c580, 
    kernel=...) at /content/taichi/taichi/program.cpp:28
#17 0x00007f463afbd9b0 in taichi::Tlang::Kernel::compile (this=0x3afe000)
    at /content/taichi/taichi/kernel.cpp:37
#18 0x00007f463afbda2d in taichi::Tlang::Kernel::operator() (this=0x3afe000)
    at /content/taichi/taichi/kernel.cpp:43
#19 0x00007f463b164663 in taichi::Tlang::SNode::write_float (this=0x3579870, 
    i=0, j=1, k=0, l=0, val=1) at /content/taichi/taichi/snode.cpp:136
#20 0x00007f463b1337de in pybind11::cpp_function::cpp_function<void, taichi::Tlang::SNode, int, int, int, int, double, pybind11::name, pybind11::is_method, pybind11::sibling>(void (taichi::Tlang::SNode::*)(int, int, int, int, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(taichi::Tlang::SNode*, int, int, int, int, double)#1}::operator()(taichi::Tlang::SNode*, int, int, int, int, double) const (this=0x29b5538, c=0x3579870, 
    args=1, args=1, args=1, args=1, args=1)
    at /usr/local/include/python3.6/pybind11/pybind11.h:78
#21 0x00007f463b13373a in pybind11::detail::argument_loader<taichi::Tlang::SNode*, int, int, int, int, double>::call_impl<void, pybind11::cpp_function::cpp_function<void, taichi::Tlang::SNode, int, int, int, int, double, pybind11::name, pybind11::is_method, pybind11::sibling>(void (taichi::Tlang::SNode::*)(int, int, int, int, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(taichi::Tlang::SNode*, int, int, int, int, double)#1}&, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, pybind11::detail::void_type>(pybind11::cpp_function::cpp_function<void, taichi::Tlang::SNode, int, int, int, int, double, pybind11::name, pybind11::is_method, pybind11::sibling>(void (taichi::Tlang::SNode::*)(int, int, int, int, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(taichi::Tlang::SNode*, int, int, int, int, double)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul>, pybind11::detail::void_type&&) (this=0x7ffe2e7a2ee8, f=...)
    at /usr/local/include/python3.6/pybind11/cast.h:1935
#22 0x00007f463b132ea6 in pybind11::detail::argument_loader<taichi::Tlang::SNode*, int, int, int, int, double>::call<void, pybind11::detail::void_type, pybind11::cpp_function::cpp_function<void, taichi::Tlang::SNode, int, int, int, int, double, pybind11::name, pybind11::is_method, pybind11::sibling>(void (taichi::Tlang::SNode::*)(int, int, int, int, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(taichi::Tlang::SNode*, int, int, int, int, double)#1}&> (this=0x7ffe2e7a2ee8, f=...)
    at /usr/local/include/python3.6/pybind11/cast.h:1917

@znah
Contributor Author

znah commented Dec 6, 2019

You can use gdb right in colab, just run the last cell, and it will give you a little prompt (with chars replaced by * :)

@yuanming-hu
Member

Thanks for the info x2. In the AST lowering pass, the transformer walks over the AST and modifies it, which might corrupt the call stack in some way. Then the program crashes during exception handling. I'll dig a bit more into it.

@yuanming-hu
Member

One possibility is that some node between the leaf node and the root (i.e. on the stack) gets deleted...
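
For readers unfamiliar with the pattern being discussed, here is a toy Python sketch of a lowering pass that modifies the tree and then throws an exception to restart the walk. It is an illustration only: the list-of-strings "IR", the `frontend:`/`lowered:` prefixes and the restart loop are assumptions made for this sketch, not Taichi's actual C++ code. Only the name `IRModified` comes from the thread.

```python
class IRModified(Exception):
    """Raised by the pass after it changes the tree, to request a fresh walk."""


def lower_once(block):
    """Lower the first frontend statement found, then signal the modification."""
    for i, stmt in enumerate(block):
        if isinstance(stmt, list):            # nested block: recurse into it
            lower_once(stmt)
        elif stmt.startswith("frontend:"):    # toy stand-in for a frontend AST node
            block[i] = "lowered:" + stmt[len("frontend:"):]
            raise IRModified()


def lower(root):
    """Restart the walk after every modification until nothing is left to lower."""
    passes = 0
    while True:
        passes += 1
        try:
            lower_once(root)
        except IRModified:
            continue  # the exception unwound through every nested lower_once call
        return passes


if __name__ == "__main__":
    ir = ["frontend:assign", ["frontend:assign", "lowered:load"]]
    print(lower(ir), ir)
```

The point of the sketch is that each modification unwinds the whole chain of nested visit calls, so in the C++ equivalent the stack unwinder is exercised on every lowering step.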

@znah
Contributor Author

znah commented Dec 6, 2019

I'm trying to debug by adding printf's here and there. You can edit files right in colab, but they need to have a *.py extension :/ (so I copy the .cpp as .py, edit it, and copy it back)

@znah
Contributor Author

znah commented Dec 6, 2019

But my C++ debugging skills are quite rusty.

@yuanming-hu
Member

I used %%writefile to add some printfs this morning and found that the crash happens while throwing IRModified. I also tried to prevent nodes on the stack from being deleted, yet that doesn't fix the problem...

@znah
Contributor Author

znah commented Dec 6, 2019

We may try to use some clang instrumentation, like https://clang.llvm.org/docs/AddressSanitizer.html

@znah
Contributor Author

znah commented Dec 6, 2019

Fun fact: building and running with AddressSanitizer makes the example work :/
(AddressSanitizer found a lot of leaks btw)

@calpa

calpa commented Jan 9, 2020

Even when I use the minimal example (https://github.com/taichi-dev/taichi/blob/master/examples/minimal.py) to create a simple notebook, the session crashes.

!pip install taichi-nightly

import taichi as ti


@ti.kernel
def p():
  print(42)


p()

Error message: Your session crashed for an unknown reason. View runtime logs

@znah
Contributor Author

znah commented Feb 24, 2020

FINALLY!!!! I identified the problem!
Colab kernels have a libtcmalloc library installed and env variable LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 set.
Somehow it causes libstdc++ to use libunwind instead of libgcc_s for stack unwinding on exceptions. For some reason this causes an abort while unwinding complex calls.

Running
LD_PRELOAD= python t.py
(where t.py is some taichi program) works, even on GPU kernels.
I'm looking for a way to make it work inside colab cells as well.
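
The command-line workaround above can be approximated from inside a notebook by starting a child Python process with `LD_PRELOAD` emptied. This is a minimal sketch rather than the workaround that was eventually published; `env_without_preload` and `run_without_preload` are hypothetical helper names, and `t.py` is the script name used in the comment above.

```python
import os
import subprocess
import sys


def env_without_preload(env=None):
    """Return a copy of the environment with LD_PRELOAD emptied, like `LD_PRELOAD= cmd`."""
    clean = dict(os.environ if env is None else env)
    clean["LD_PRELOAD"] = ""
    return clean


def run_without_preload(script):
    """Run `script` in a child Python process that does not preload tcmalloc."""
    return subprocess.run([sys.executable, script], env=env_without_preload())


if __name__ == "__main__":
    print("LD_PRELOAD in this process:", os.environ.get("LD_PRELOAD", "<unset>"))
    # run_without_preload("t.py")  # uncomment once t.py exists
```

Note that changing `os.environ` inside the already running kernel does not help: the dynamic loader reads `LD_PRELOAD` when a process starts, so the kernel process itself keeps tcmalloc loaded and only new child processes see the emptied value.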

@yuanming-hu
Member

WOW!!!!!! FINALLY!!!!!!! This is a really tricky problem to pinpoint - thank you so much for debugging this!!

I guess this will cause other programs that use exceptions to crash on Colab (and I guess the fact that Google does not use C++ exceptions makes this problem more deeply hidden...)

@znah
Contributor Author

znah commented Feb 24, 2020

It's even trickier. I suspect some ABI incompatibility between clang and libunwind that manifests itself only when unwinding complex virtual calls. So probably only a few programs are affected.

@znah
Contributor Author

znah commented Feb 25, 2020

I made a workaround. It's pretty ugly, but it makes Taichi run in Colab notebook cells!
https://colab.research.google.com/github/znah/notebooks/blob/master/taichi_colab.ipynb
https://twitter.com/zzznah/status/1232321076014788608

@yuanming-hu
Member

Very cool!! What do you think could be a systematic way to solve this? Recompiling Taichi using gcc instead of clang might cause other problems. Would it be possible to override LD_PRELOAD in Colab somewhere through the Colab GUI?

@znah
Contributor Author

znah commented Feb 25, 2020

The real way to rectify this issue is to fix a bug somewhere in either clang, or in (nongnu) libunwind, or in tcmalloc. I don't feel capable of doing this. I'll discuss potential solutions with the Colab team.

@yuanming-hu
Member

I don't think I'm able to fix that bug either. Maybe some help from the Colab team would help. Thank you so much for making everything here happen! :-)

@znah
Contributor Author

znah commented Feb 25, 2020 via email

@yuanming-hu
Member

Oh no.. I'll take a look later today. Thanks for reporting this!

@znah
Contributor Author

znah commented Feb 26, 2020

Interesting observation from the Colab team: Taichi works when using tcmalloc_minimal instead of tcmalloc. Relevant bits of documentation:

To use TCMalloc, just link TCMalloc into your application via the "-ltcmalloc" linker flag.

You can use TCMalloc in applications you didn't compile yourself, by using LD_PRELOAD:

   $ LD_PRELOAD="/usr/lib/libtcmalloc.so"

LD_PRELOAD is tricky, and we don't necessarily recommend this mode of usage.

TCMalloc includes a heap checker and heap profiler as well.

If you'd rather link in a version of TCMalloc that does not include the heap profiler and checker (perhaps to reduce binary size for a static binary), you can link in libtcmalloc_minimal instead.

also this

NOTE: When compiling with programs with gcc, that you plan to link
with libtcmalloc, it's safest to pass in the flags

 -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free

when compiling.  gcc makes some optimizations assuming it is using its
own, built-in malloc; that assumption obviously isn't true with
tcmalloc.  In practice, we haven't seen any problems with this, but
the expected risk is highest for users who register their own malloc
hooks with tcmalloc (using gperftools/malloc_hook.h).  The risk is
lowest for folks who use tcmalloc_minimal (or, of course, who pass in
the above flags :-) ).

I'm continuing the investigation.
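
To experiment with the tcmalloc_minimal observation, one could compute an alternative `LD_PRELOAD` value for a child process. The sketch below is hypothetical: `prefer_minimal_tcmalloc` is a name invented here, and it assumes the minimal library is installed next to the full one under the same version suffix, which the thread does not confirm.

```python
import os
import re


def prefer_minimal_tcmalloc(preload):
    """Rewrite an LD_PRELOAD value so full tcmalloc entries point at tcmalloc_minimal."""
    # LD_PRELOAD entries may be separated by colons or whitespace.
    entries = [e for e in re.split(r"[:\s]+", preload) if e]
    swapped = []
    for entry in entries:
        directory, name = os.path.split(entry)
        if name.startswith("libtcmalloc.so"):
            # Assumption: the minimal variant sits in the same directory, same suffix.
            name = name.replace("libtcmalloc.so", "libtcmalloc_minimal.so", 1)
        swapped.append(os.path.join(directory, name))
    return ":".join(swapped)


if __name__ == "__main__":
    current = os.environ.get("LD_PRELOAD", "")
    print("current :", current or "<empty>")
    print("proposed:", prefer_minimal_tcmalloc(current) or "<empty>")
```

The result only matters for processes started afterwards with that environment; check that the rewritten path actually exists before relying on it.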

@znah
Contributor Author

znah commented Mar 3, 2020

Every version >0.5.2 crashes on Colab (0.5.2 works fine):

Invalid bitcode signature
Program aborted due to an unhandled Error:
Invalid bitcode signature[W 03/03/20 17:49:10.868] [llvm_context.cpp:module_from_bitcode_file@187] Bitcode loading error message:
[E 03/03/20 17:49:10.868] [llvm_context.cpp:module_from_bitcode_file@189] Bitcode /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/runtime_x64.bc load failure.
***********************************
* Taichi Compiler Stack Traceback *
***********************************
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f60d8c16f20]
/lib/x86_64-linux-gnu/libc.so.6: gsignal
/lib/x86_64-linux-gnu/libc.so.6: abort
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: llvm::StringError::StringError(std::error_code, llvm::Twine const&)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::module_from_bitcode_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::LLVMContext*)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::TaichiLLVMContext::clone_runtime_module()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::TaichiLLVMContext::get_init_module()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::StructCompilerLLVM::StructCompilerLLVM(taichi::lang::Program*, taichi::lang::Arch)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::StructCompiler::make(taichi::lang::Program*, taichi::lang::Arch)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::Program::materialize_layout()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::layout(std::function<void ()> const&)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0xd41d59) [0x7f60caf0ad59]
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0xb7449d) [0x7f60cad3d49d]

@yuanming-hu
Member

Sorry about that. The bitcode loading issue should be fixed in v0.5.6. The buildbots are currently working on compiling/releasing the new version.
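
The thread does not say what caused the "Invalid bitcode signature" error, but a quick sanity check when seeing it is whether the `.bc` file starts with the LLVM bitcode magic bytes at all. A minimal sketch, with `looks_like_bitcode` and `check_bitcode_file` as hypothetical helper names; the path is the one from the log above.

```python
import os

RAW_MAGIC = b"BC\xc0\xde"            # start of a raw LLVM bitcode stream
WRAPPER_MAGIC = b"\xde\xc0\x17\x0b"  # 0x0B17C0DE stored little-endian (bitcode wrapper header)


def looks_like_bitcode(data):
    """Return True if the first four bytes match a known LLVM bitcode signature."""
    return data[:4] in (RAW_MAGIC, WRAPPER_MAGIC)


def check_bitcode_file(path):
    """Read the first four bytes of `path` and test them against the signatures."""
    with open(path, "rb") as f:
        return looks_like_bitcode(f.read(4))


if __name__ == "__main__":
    path = "/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/runtime_x64.bc"
    if os.path.exists(path):
        print(path, "has a bitcode signature:", check_bitcode_file(path))
    else:
        print("file not found:", path)
```

A valid signature does not prove the file is loadable (the bitcode may still come from an incompatible LLVM version), so this only separates a corrupted or wrong file from other causes.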

@znah znah mentioned this issue May 9, 2020
@github-actions

Warning: The issue has been out-of-update for 50 days, marking stale.

@github-actions github-actions bot added the stale stale issues and PRs label May 22, 2020
@znah
Contributor Author

znah commented Dec 15, 2021

I'd like to reopen this issue. The problem is still there, and I think supporting colab environment would greatly increase Taichi user adoption.

@znah
Contributor Author

znah commented Dec 15, 2021

@yuanming-hu WDYT?

@yuanming-hu
Member

Hi @znah, thanks for keeping an eye on this! I do believe supporting colab is very important. One solution is to completely remove exceptions from Taichi. Let me check with people tomorrow and see if that is possible!

@ppwwyyxx
Contributor

FYI and off-topic: this opinion from pytorch author: https://twitter.com/soumithchintala/status/1451213207750721538 may lead the maintainers to reconsider whether it's a good idea to "auto-close stale issues". I personally agree with his opinion.
What's more valid (and also used in projects I maintained) is to auto-close invalid issues (e.g. those missing necessary information).

@bobcao3 bobcao3 reopened this Dec 16, 2021
@yuanming-hu
Member

yuanming-hu commented Dec 16, 2021

@ppwwyyxx Thanks for pointing this out. I agree that closing stale issues using bots is not a good idea, and will prevent further misuse like this.

@znah After some searching, it turns out that we are now blocked at #1059 - if we can remove all C++ exceptions (which I believe is necessary), then the system will not involve libunwind and we can run Taichi on colab. It may take some time for people (@sjwsl and @lin-hitonami) to fully remove throw IRModified etc. - if you'd like to help that would be awesome!
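
As an illustration of what removing `throw IRModified` could look like, here is a toy Python sketch in which the pass reports a modification through its return value instead of an exception. The list-of-strings "IR" and the function names are assumptions made for this sketch; it is not the change that was made in Taichi.

```python
def lower_once(block):
    """Lower the first frontend statement found; return True if the tree changed."""
    for i, stmt in enumerate(block):
        if isinstance(stmt, list):            # nested block: propagate the flag upward
            if lower_once(stmt):
                return True
        elif stmt.startswith("frontend:"):    # toy stand-in for a frontend AST node
            block[i] = "lowered:" + stmt[len("frontend:"):]
            return True
    return False


def lower(root):
    """Repeat the walk until a full pass makes no change; return the number of passes."""
    passes = 1
    while lower_once(root):
        passes += 1
    return passes


if __name__ == "__main__":
    ir = ["frontend:assign", ["frontend:assign", "lowered:load"]]
    print(lower(ir), ir)
```

The control flow is the same restart-after-change loop, but every return is an ordinary function return, so no stack unwinder is involved.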

@epi-morphism

Hi, is this still being worked on? The workaround posted at https://colab.research.google.com/github/znah/notebooks/blob/master/taichi_colab.ipynb no longer works, so I'd love it if this was implemented, since my local machine doesn't have enough horsepower to try taichi out. I see the issue that blocked progress on this was fixed and closed.

@k-ye
Member

k-ye commented Feb 8, 2022

cc @mzmzm @strongoier

@strongoier
Contributor

Hi @epi-morphism. I just tried pip install taichi and ran some simple code snippets, and everything seems to work fine. We don't need any workaround now. Could you provide your code snippets if you run into any problems?

@epi-morphism

@strongoier I stand corrected, it appears the 'minimal' taichi code I was using was incorrect (though the lack of error messages makes things a bit hard to decipher). Apologies for pinging you all, seems to work well now :) Excited to try taichi out

@strongoier
Contributor

@epi-morphism No worries. Hope you enjoy it :-)

I'll close this issue because Taichi works on Google colab now. Feel free to open a new issue if you run into other problems.
