_layer_norm_fwd_1pass_kernel error #84
Comments
@chenwuchen Bumped into the same error.
I was facing the same issue. Upon investigating further, I realized that line 75 of the autotuner.py file in the triton package receives None for self.nargs in the dict unpacking on that line, which is what raises the 'NoneType' object is not a mapping error.
My take is that this may be related to how DataParallel drives the forward pass from multiple threads, but I'm not certain.
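A minimal sketch of what presumably goes wrong on that line: unpacking None with ** raises exactly this error. The variable names mirror the autotuner code; the values below are invented for illustration.

```python
# Illustration only: if self.nargs was never populated, the dict merge fails.
self_nargs = None                 # stands in for self.nargs
current = {"BLOCK_N": 64}         # stands in for the config kwargs being merged

full_nargs = {**self_nargs, **current}
# TypeError: 'NoneType' object is not a mapping
```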
But I was able to run with DataParallel by replacing line 75 of autotuner.py with a None-guarding version of that merge (the same modification is shown in full in a later comment).
Though I'm not sure whether training would behave as expected overall.
Needed to do the same. I am executing the trainer script from mamba_chat and got the same Triton error as above. I patched the package and installed it from source.

Get the package:

```bash
git clone https://github.com/openai/triton.git
cd triton
git checkout release/2.1.x
pip install cmake
```

Patch it by replacing

```python
full_nargs = {**self.nargs, **current}
```

with

```python
full_nargs = {}
if self.nargs:
    full_nargs.update(self.nargs)
if current:
    full_nargs.update(current)
```

then proceed to install the patched version (see the sketch that follows), and install the rest of the mamba dependencies as per normal. (I am using Python 3.11 in a conda environment and currently have training running on a pair of RTX 3090s.)
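The exact install command was cut off in the comment above; as an assumption, a Triton 2.1.x source checkout is typically installed from its python/ subdirectory, roughly:

```bash
# Assumed install step for a Triton source build (not verbatim from the comment above)
cd triton/python
pip install -e .   # builds and installs the patched triton in place of the PyPI wheel
```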
Hacking it that way could cause silent errors (especially if different args are passed concurrently to the JIT). If you're confident the rest of the forward pass is thread-safe and you must run in threaded mode, you could try running a single pass first to bootstrap the JIT before running it in parallel (see the sketch below). I don't think mamba has a config.pre_hook, which is the only thing that would be written per run after benchmarking is complete.
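A minimal sketch of that warm-up idea, assuming the standard mamba-ssm Mamba block and DataParallel; the model dimensions and input shapes below are hypothetical, not taken from the issue.

```python
import torch
from mamba_ssm import Mamba

# Hypothetical model config; the issue does not show the real one.
model = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda:0")

# Warm-up: one single-device forward pass so the Triton autotuner benchmarks
# and caches its kernel configs before any threaded (DataParallel) calls happen.
with torch.no_grad():
    model(torch.randn(1, 128, 256, device="cuda:0"))

# Only then replicate across GPUs; the replicas share the module-level Triton
# kernels, so the tuned configs should already be cached.
dp_model = torch.nn.DataParallel(model)
x = torch.randn(8, 128, 256, device="cuda:0")
y = dp_model(x)
```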
@PheelaV Can you provide more details about how to build Triton from source? Thanks!
@AndssY Sorry, I can't really. I was using Python 3.10 or 3.11 on Ubuntu 22.04 LTS, and following their readme instructions everything worked out. I think the only thing I had to do out of the ordinary was to check out the specific release branch required by the mamba dependencies (important). But this whole thing became redundant: I think Mamba was patched and everything suddenly started to work with just the mamba-ssm and causal-conv1d installs. I still kept the environment with a Triton built from source to be sure, but it was no longer necessary for me. Hope that helps at least a little bit. Good luck with getting it up and running. Feel free to DM me if you still struggle; I think I went through all the hoops one could meet.
@PheelaV Did you install according to the readme of mamba-chat? I will try Python 3.11 and install again following the mamba-chat readme. Thanks very much!
Has it been resolved?
Title: Error when running multi-GPU training with Mamba
Description:
I am experiencing an issue when running multi-GPU training with Mamba. Specifically, I am getting a TypeError: 'NoneType' object is not a mapping error when running the forward pass of the model. The error occurs when I try to run the model on multiple GPUs using the DataParallel module. However, when I run the model on a single GPU, everything works fine.
I have tried to reproduce the issue with a minimal example, but I was unable to do so. I have also checked the documentation and searched online for similar issues, but I couldn't find anything useful.
Here is the full traceback of the error:
I am using Python 3.10, PyTorch 1.12.1, causal_conv1d 1.1.1, mamba-ssm 1.1.1, and triton 2.1.0.