pip-installed Taichi crashes on Google colab kernels #235
Comments
Thanks for reporting this. Taichi crashes during the AST lowering process on Google Colab. The same script runs fine offline though. It might be related to the use of C++ exceptions during AST lowering, however I currently don't have a clear idea what's wrong... |
In general, I think using Google colab for Taichi is a good idea. I'll dig deeper into this later. More debug information:
|
Update: I tested exception throwing and it works fine on Colab. There may be some other reason. |
I tried the 0.0.80 version, here is the error log :
|
Here is the notebook where I try to install or build Taichi in colab kernel. |
Also, GPU version crashes for a different reason:
|
Thanks for testing! I'll try to take a deeper look into this later today. |
Hi! Do you have any update on this? I'm now trying to build llvm and taichi on colab, but it takes a while... |
Hi @znah, Sorry I haven't had a chance to work on this. I think Colab is a great place for using Taichi, however, it's also very hard to debug what's wrong... A month ago, the crash happened during Taichi IR compilation. I couldn't reproduce this in any other environment. If you could help investigate what's wrong, that would be great! It's also worth checking if the latest Python wheels of Taichi still crash. You know, I'm somewhere on earth without access to Google. Thanks, |
The GPU crash is due to a virtual memory allocation issue. We should first make sure the CPU version works. |
Here is the notebook where I try to build a dev version. I'm certainly doing something wrong, but I have a pre-built LLVM, so that we don't have to wait for it again. |
So I basically reproduced the same error with Taichi built from source on Colab. Where to go from here? |
Thanks for the notebook! It seems that I don't have permission to access it yet. I requested access. Could you approve? If we can build from source on colab, I think one thing to do is to do a debug build ( |
Thanks, I have access now! It's late in my place, but let me try doing a debug build now before I go to sleep. |
Thank you! I've actually started the debug build already. Waiting... |
Oh, thanks! I'll continue working on this first thing tomorrow morning then. I hope the crashing reason is clear under the debug build. The notebook file you have shared is super useful! Let's see what will happen :-) |
Thanks again. No need to rush, I just wanted to make sure Taichi works in colab someday. |
All I have so far:
|
Thanks for the info! It might be due to shared pointer issues/memory corruption, but I need to dig deeper into this. I'm making use of your notebook to build Taichi and diagnose. That's super helpful. Thank you for providing that. It will also be helpful to have a stack backtrace when it crashes, i.e. |
The program crashes when the IRModified() exception is thrown. |
Might the crash happen during stack unwinding (e.g. some destructor is not virtual...)? Here is the stack:
|
You can use gdb right in Colab: just run the last cell, and it will give you a little prompt (with chars replaced by * :) |
Thanks for the info x2. In the AST lowering pass, the transformer walks over the AST and modifies it, which might corrupt the call stack in some way. Then the program crashes during exception handling. I'll dig a bit more into it. |
One possibility is that some node between the leaf node and the root (i.e. on the stack) gets deleted... |
I'm trying to debug by adding printf's here and there. You can edit files right in colab, but they need to have *.py extension :/ (so I copy .cpp as .py edit and copy back) |
But my C++ debugging skills are quite rusty. |
I used %%writefile to add some printfs this morning and located the crash at the point where IRModified is thrown. I also tried to keep nodes on the stack from being deleted, yet that doesn't fix the problem... |
We may try to use some clang instrumentation, like https://clang.llvm.org/docs/AddressSanitizer.html |
Fun fact: building and running with AddressSanitizer makes the example work :/ |
Even when I use the minimal example (https://github.com/taichi-dev/taichi/blob/master/examples/minimal.py) to create a simple notebook, the session crashes:
!pip install taichi-nightly
import taichi as ti
@ti.kernel
def p():
    print(42)
p()
Error message: |
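Because a native crash takes the whole Colab kernel down with it, one way to keep the session alive while reproducing is to run the snippet in a child interpreter. The sketch below is not from the thread: `run_isolated` is a hypothetical helper name, and it uses only the standard library (`subprocess`, `signal`, and the interpreter's `-X faulthandler` option). It assumes a POSIX system, where a negative return code means the child was killed by that signal. Note that faulthandler only dumps Python-level frames, not the C++ stack.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=300):
    """Run `code` in a fresh interpreter so a native crash cannot kill the caller.

    Returns (returncode, signal_name, stderr). signal_name is None unless the
    child was terminated by a signal (POSIX reports that as a negative code).
    """
    proc = subprocess.run(
        # -X faulthandler dumps the Python traceback on SIGSEGV, SIGABRT, etc.
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    signal_name = None
    if proc.returncode < 0:
        try:
            signal_name = signal.Signals(-proc.returncode).name
        except ValueError:
            signal_name = f"signal {-proc.returncode}"
    return proc.returncode, signal_name, proc.stderr


# The minimal example from the comment above; running it requires Taichi
# to be installed in the child environment.
MINIMAL = """
import taichi as ti

@ti.kernel
def p():
    print(42)

p()
"""

if __name__ == "__main__":
    code, sig, err = run_isolated(MINIMAL)
    print("return code:", code, "signal:", sig)
    print(err[-2000:])  # tail of stderr, where faulthandler writes its dump
```

If the child dies, the notebook cell still returns, and the signal name plus the stderr tail can be pasted into a report instead of a bare "session crashed".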
FINALLY!!!! I identified the problem! Running |
WOW!!!!!! FINALLY!!!!!!! This is a really tricky problem to pinpoint - thank you so much for debugging this!! I guess this will cause other programs that use exceptions to crash on Colab (and I guess the fact that Google does not use C++ exceptions makes this problem more deeply hidden...) |
It's even trickier. I suspect some ABI incompatibility between |
I made a workaround. It's pretty ugly, but it makes Taichi run in Colab notebook cells! |
Very cool!! What do you think could be a systematic way to solve this? Recompiling Taichi using gcc instead of clang might cause other problems. Would it be possible to override |
The real way to rectify this issue is to fix a bug somewhere in either clang, or in (nongnu) libunwind, or in tcmalloc. I don't feel capable of doing this. I'll discuss potential solutions with the Colab team. |
I don't think I'm able to fix that bug either. Maybe some help from the Colab team would help. Thank you so much for making everything here happen! :-) |
By the way, while 0.5.2 works, 0.5.3 crashes for a different reason:
[W 02/25/20 17:47:41.893] [taichi_llvm_context.cpp:module_from_bitcode_file@186] Bitcode loading error message:
Invalid bitcode signature
|
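For context on the message above: an LLVM bitcode file starts with a four-byte magic number, either the raw stream magic (`'B' 'C' 0xC0 0xDE`) or the little-endian wrapper magic `0x0B17C0DE`. A quick way to see whether a given file even looks like bitcode is to inspect those bytes. This is a minimal sketch under stated assumptions: `bitcode_kind` is a hypothetical helper, the thread does not say which file Taichi was loading or why its signature was rejected, and passing this check does not prove the file is otherwise valid.

```python
import struct

RAW_MAGIC = b"BC\xc0\xde"                      # plain LLVM bitcode stream
WRAPPER_MAGIC = struct.pack("<I", 0x0B17C0DE)  # bitcode wrapper header


def bitcode_kind(path):
    """Classify a file by its first four bytes.

    Returns 'raw', 'wrapper', or None if neither magic number matches
    (including files shorter than four bytes).
    """
    with open(path, "rb") as f:
        head = f.read(4)
    if head == RAW_MAGIC:
        return "raw"
    if head == WRAPPER_MAGIC:
        return "wrapper"
    return None
```

A `None` result for a file that is supposed to be bitcode would be consistent with an "Invalid bitcode signature" error, for example if the file is truncated or is a different kind of file altogether.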
Oh no.. I'll take a look later today. Thanks for reporting this! |
Interesting observation from the Colab team: Taichi works when using
also this
I'm continuing the investigation. |
Every version >0.5.2 on Colab (0.5.2 works fine)
|
Sorry about that. The bitcode loading issue should be fixed in v0.5.6. The buildbots are currently working on compiling/releasing the new version. |
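Going only by the reports in this thread (0.5.2 works, later versions hit the bitcode error, and the fix is expected in v0.5.6), a notebook could check the installed version before running anything. This is a rough sketch: the helper names are hypothetical, the range bounds come from the comments above rather than from any confirmed changelog, and `parse_version` assumes plain dotted-integer version strings.

```python
from importlib import metadata


def parse_version(text):
    """Turn 'v0.5.6' or '0.5.6' into (0, 5, 6); assumes plain dotted integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def in_reported_range(version):
    """True for versions after 0.5.2 and before 0.5.6, per the comments above."""
    return (0, 5, 2) < parse_version(version) < (0, 5, 6)


def installed_taichi_version():
    """Return the installed 'taichi' distribution version, or None if absent."""
    try:
        return metadata.version("taichi")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_taichi_version()
    if version is None:
        print("taichi is not installed")
    elif in_reported_range(version):
        print(f"taichi {version} is in the range reported to fail on Colab")
    else:
        print(f"taichi {version} is outside the reported range")
```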
Warning: The issue has been out-of-update for 50 days, marking |
I'd like to reopen this issue. The problem is still there, and I think supporting colab environment would greatly increase Taichi user adoption. |
@yuanming-hu WDYT? |
Hi @znah, thanks for keeping an eye on this! I do believe supporting colab is very important. One solution is to completely remove exceptions from Taichi. Let me check with people tomorrow and see if that is possible! |
FYI and off-topic: this opinion from pytorch author: https://twitter.com/soumithchintala/status/1451213207750721538 may lead the maintainers to reconsider whether it's a good idea to "auto-close stale issues". I personally agree with his opinion. |
@ppwwyyxx Thanks for pointing this out. I agree that closing stale issues using bots is not a good idea, and will prevent further misuse like this. @znah After some searching, it turns out that we are now blocked at #1059 - if we can remove all C++ exceptions (which I believe is necessary), then the system will not involve |
Hi, is this still being worked on? The workaround posted https://colab.research.google.com/github/znah/notebooks/blob/master/taichi_colab.ipynb no longer works so I'd love if this was implemented since my local machine doesn't have enough horsepower to try taichi out. I see the issue that blocked progress on this was fixed and closed. |
Hi @epi-morphism. I just tried |
@strongoier I stand corrected, it appears the 'minimal' taichi code I was using was incorrect (though the lack of error messages makes things a bit hard to decipher). Apologies for pinging you all, seems to work well now :) Excited to try taichi out |
@epi-morphism No worries. Hope you enjoy it :-) I'll close this issue because Taichi works on Google colab now. Feel free to open a new issue if you meet other problems. |
Opening an empty CPU-backed notebook at https://colab.research.google.com and running the following code leads to a crash:
And the relevant runtime logs say:
Can you please provide some insight into the possible root of the problem, if you have any off the top of your head?