Port CUDA Kernels #8

Open · 2 tasks
xrsrke opened this issue Oct 25, 2023 · 6 comments
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)
xrsrke (Owner) commented Oct 25, 2023

Port training CUDA kernels from these libraries, and automatically replace modules in an existing 🤗 transformers model with their corresponding CUDA kernel versions.

Check out the following open-source projects and propose which CUDA kernels we should port. Then write a kernel builder that takes a kernel name and loads the corresponding kernel.

Implementation

class KernelBuilder:
    def load(self):
        # Build (or fetch a cached build of) the kernel and return a callable op
        pass


class FusedOp(KernelBuilder):
    def absolute_name(self):
        # NOTE: the absolute path to the kernel
        pass


fused_op = FusedOp().load()
outputs = fused_op(inputs)
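
For the load() step, a minimal sketch using PyTorch's JIT extension loader (torch.utils.cpp_extension.load); the kernel name and source file paths here are hypothetical:

import os
from torch.utils.cpp_extension import load


class FusedSoftmaxBuilder(KernelBuilder):
    # Hypothetical builder: the csrc directory and file names are placeholders
    def absolute_name(self):
        return os.path.join(os.path.dirname(__file__), "csrc", "fused_softmax")

    def load(self):
        src_dir = self.absolute_name()
        # JIT-compiles the C++/CUDA sources on first use and caches the build;
        # the returned module exposes whatever functions the .cpp file binds
        return load(
            name="fused_softmax",
            sources=[
                os.path.join(src_dir, "fused_softmax.cpp"),
                os.path.join(src_dir, "fused_softmax_kernel.cu"),
            ],
            verbose=False,
        )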

APIs

import torch
import torch.nn.functional as F
from pipegoose.nn.fusion import softmax

assert torch.allclose(softmax(x, dim=-1), F.softmax(x, dim=-1))
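
A rough sketch of the automatic module replacement could look like this (FusedGELU in the usage comment is a placeholder):

import torch.nn as nn


def replace_modules(model: nn.Module, mapping: dict) -> nn.Module:
    # Recursively swap supported submodules for their fused CUDA counterparts
    for name, child in model.named_children():
        if type(child) in mapping:
            setattr(model, name, mapping[type(child)]())
        else:
            replace_modules(child, mapping)
    return model


# e.g. model = replace_modules(model, {nn.GELU: FusedGELU})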

TODOs

xrsrke added the help wanted (Extra attention is needed) and good first issue (Good for newcomers) labels Oct 25, 2023
isamu-isozaki commented Oct 25, 2023

Hi! So I was thinking of porting Substation after reading the paper, but I have a question about how you want the CUDA kernels integrated. In Substation, they optimize by "generating" CUDA files specific to the dimensions of each kernel (from what I understand from looking at https://github.com/spcl/substation/blob/master/pytorch_module/test_softmax.py), so basically it produces CUDA code specific to each function, which is faster but may be messier.

bitsandbytes does it by just loading the .so files from a given location, and DeepSpeed does it by, I think, building the C++ extensions when doing pip install. So it's a bit slower but more general.

There might be more ways to do it, but which way do you think will work best for you?
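
For reference, the bitsandbytes-style path is roughly just this (the library name and exported symbol below are made up):

import ctypes
import os

# Hypothetical prebuilt shared library shipped with the package
_LIB_PATH = os.path.join(os.path.dirname(__file__), "libfused_ops.so")

# Load the shared object once and look up the exported kernel launcher
_lib = ctypes.CDLL(_LIB_PATH)
fused_softmax_forward = _lib.fused_softmax_forward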

isamu-isozaki commented
Just checked ColossalAI; it seems like they have functions called op_builders that they use to build certain CUDA libraries.

xrsrke (Owner, Author) commented Oct 26, 2023

@isamu-isozaki This is a good idea. What are the pros and cons of Substation? Which do you think we should use? If there are some operations that Substation is really good at, we could do both: use Substation for those and manually port the other kernels.

Also, for manually porting kernels, I think we should do something like this:

class KernelBuilder:
    def load(self):
        pass


class FusedOp(KernelBuilder):
    def absolute_name(self):
        # NOTE: the absolute path to the kernel
        pass


fused_op = FusedOp().load()
outputs = fused_op(inputs)

isamu-isozaki commented Oct 26, 2023

@xrsrke I think Substation's method is faster in general, but it requires generating a new CUDA file for each possible tensor shape. So the main disadvantage is that it's not clean, I think (my guess is that just changing the batch size would require a new CUDA script if we were to copy their approach).

I think the way you are doing it is similar to ColossalAI's and DeepSpeed's versions, which we can definitely do. I do remember that setting up ColossalAI is pretty troublesome compared to, say, DeepSpeed; I'm not sure why, but we can probably cross that bridge when we get there. This approach is more general but might be slightly slower than Substation. I think we can start with this approach and, if we want, extend to Substation later and build kernels specific to a certain input dimension.

isamu-isozaki commented Oct 26, 2023

Do you think this makes sense? I can check out Megatron-LM's approach, etc., if you want.

xrsrke (Owner, Author) commented Oct 27, 2023

@isamu-isozaki Could you try to benchmark the two approaches? Try fusing a softmax using Substation, then compare it against a manually ported kernel... (Also, it could be that some operations perform better with Substation while others are more efficient when written manually; we should take this into account while benchmarking.) Or maybe we should leave this as an experiment for later on, and for now just port these kernels.

Also, I've just added GPT-NeoX's kernel to the issue above.
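
As a rough starting point for that benchmark, something like this with torch.utils.benchmark, using the planned pipegoose.nn.fusion.softmax as the fused candidate (the input shape is arbitrary):

import torch
import torch.nn.functional as F
from torch.utils import benchmark

from pipegoose.nn.fusion import softmax as fused_softmax

x = torch.randn(32, 128, 4096, device="cuda")

# Baseline: eager softmax
t_eager = benchmark.Timer(
    stmt="F.softmax(x, dim=-1)",
    globals={"F": F, "x": x},
).timeit(100)

# Candidate: the fused/ported kernel behind pipegoose.nn.fusion.softmax
t_fused = benchmark.Timer(
    stmt="fused_softmax(x, dim=-1)",
    globals={"fused_softmax": fused_softmax, "x": x},
).timeit(100)

print(t_eager)
print(t_fused)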

xrsrke removed the help wanted (Extra attention is needed) label Nov 2, 2023
xrsrke added the help wanted (Extra attention is needed) label Dec 10, 2023