Port CUDA Kernels #8

Open · 2 tasks
xrsrke opened this issue Oct 25, 2023 · 6 comments
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)
xrsrke (Owner) commented Oct 25, 2023

Port training CUDA kernels from these libraries, and automatically replace modules in an existing 🤗 transformers model with their corresponding CUDA kernel versions.

Check out the following open-source projects and propose which CUDA kernels we should port. Then write a kernel builder that takes a kernel name and loads the corresponding kernel.

Implementation

class KernelBuilder:
    def load(self):
        # Build (or fetch a cached build of) the kernel and return a callable op
        pass


class FusedOp(KernelBuilder):
    def absolute_name(self):
        # NOTE: the absolute path to the kernel
        pass


fused_op = FusedOp().load()
outputs = fused_op(inputs)
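
For the load() step, a minimal sketch using PyTorch's JIT extension loader (torch.utils.cpp_extension.load); the kernel name and source file paths here are hypothetical:

import os
from torch.utils.cpp_extension import load


class FusedSoftmaxBuilder(KernelBuilder):
    # Hypothetical builder: the csrc directory and file names are placeholders
    def absolute_name(self):
        return os.path.join(os.path.dirname(__file__), "csrc", "fused_softmax")

    def load(self):
        src_dir = self.absolute_name()
        # JIT-compiles the C++/CUDA sources on first use and caches the build;
        # the returned module exposes whatever functions the .cpp file binds
        return load(
            name="fused_softmax",
            sources=[
                os.path.join(src_dir, "fused_softmax.cpp"),
                os.path.join(src_dir, "fused_softmax_kernel.cu"),
            ],
            verbose=False,
        )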

APIs

import torch
import torch.nn.functional as F
from pipegoose.nn.fusion import softmax

assert torch.allclose(softmax(x, dim=-1), F.softmax(x, dim=-1))
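
A rough sketch of the automatic module replacement could look like this (FusedGELU in the usage comment is a placeholder):

import torch.nn as nn


def replace_modules(model: nn.Module, mapping: dict) -> nn.Module:
    # Recursively swap supported submodules for their fused CUDA counterparts
    for name, child in model.named_children():
        if type(child) in mapping:
            setattr(model, name, mapping[type(child)]())
        else:
            replace_modules(child, mapping)
    return model


# e.g. model = replace_modules(model, {nn.GELU: FusedGELU})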

TODOs

xrsrke added the help wanted (Extra attention is needed) and good first issue (Good for newcomers) labels Oct 25, 2023
isamu-isozaki commented Oct 25, 2023

Hi! So I was thinking of porting Substation after reading the paper, but I have a question about how you want the CUDA kernels integrated. In Substation, they optimize by "generating" CUDA files specific to the dimensions of each kernel (from what I understand from looking at https://github.com/spcl/substation/blob/master/pytorch_module/test_softmax.py), so basically it produces CUDA code specific to each function, which is faster but may be messier.

bitsandbytes does it by just loading the .so files from a given location, and DeepSpeed does it by, I think, building the C++ extensions when doing pip install. So it's a bit slower but more general.

There might be more ways to do it, but which way do you think will work best for you?
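
For reference, the bitsandbytes-style path is roughly just this (the library name and exported symbol below are made up):

import ctypes
import os

# Hypothetical prebuilt shared library shipped with the package
_LIB_PATH = os.path.join(os.path.dirname(__file__), "libfused_ops.so")

# Load the shared object once and look up the exported kernel launcher
_lib = ctypes.CDLL(_LIB_PATH)
fused_softmax_forward = _lib.fused_softmax_forward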

isamu-isozaki commented
Just checked ColossalAI; it seems like they have functions called op_builders that they use to build certain CUDA libraries.

xrsrke (Owner, Author) commented Oct 26, 2023

@isamu-isozaki This is a good idea. What are the pros and cons of Substation? Which do you think we should use? If there are some operations that Substation is really good at, we could do both: use Substation for those and manually port the other kernels.

Also, for manually porting kernels, I think we should do something like this:

class KernelBuilder:
    def load(self):
        pass


class FusedOp(KernelBuilder):
    def absolute_name(self):
        # NOTE: the absolute path to the kernel
        pass


fused_op = FusedOp().load()
outputs = fused_op(inputs)

isamu-isozaki commented Oct 26, 2023

@xrsrke I think Substation's method is faster in general, but it requires generating a new CUDA file for each possible tensor shape. So the main disadvantage is that it's not clean, I think (my guess is that just changing the batch size would require a new CUDA script if we were to copy their approach).

I think the way you are doing it is similar to ColossalAI's and DeepSpeed's versions, which we can definitely do. I do remember that setting up ColossalAI is pretty troublesome compared to, say, DeepSpeed; I'm not sure why, but we can probably cross that bridge when we get there. This approach is more general but might be slightly slower than Substation. I think we can start with this approach and, if we want, extend to Substation later and build kernels specific to a certain input dimension.

isamu-isozaki commented Oct 26, 2023

Do you think this makes sense? I can check out Megatron-LM's approach, etc., if you want.

xrsrke (Owner, Author) commented Oct 27, 2023

@isamu-isozaki Could you try to benchmark the two approaches? Try fusing a softmax using Substation, then compare it against a manually ported kernel... (Also, it could be that some operations perform better with Substation while others are more efficient when written manually; we should take this into account while benchmarking.) Or maybe we should leave this as an experiment for later on, and for now just port these kernels.

Also, I've just added GPT-NeoX's kernel to the issue above.
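
As a rough starting point for that benchmark, something like this with torch.utils.benchmark, using the planned pipegoose.nn.fusion.softmax as the fused candidate (the input shape is arbitrary):

import torch
import torch.nn.functional as F
from torch.utils import benchmark

from pipegoose.nn.fusion import softmax as fused_softmax

x = torch.randn(32, 128, 4096, device="cuda")

# Baseline: eager softmax
t_eager = benchmark.Timer(
    stmt="F.softmax(x, dim=-1)",
    globals={"F": F, "x": x},
).timeit(100)

# Candidate: the fused/ported kernel behind pipegoose.nn.fusion.softmax
t_fused = benchmark.Timer(
    stmt="fused_softmax(x, dim=-1)",
    globals={"fused_softmax": fused_softmax, "x": x},
).timeit(100)

print(t_eager)
print(t_fused)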

xrsrke removed the help wanted (Extra attention is needed) label Nov 2, 2023
xrsrke added the help wanted (Extra attention is needed) label Dec 10, 2023