Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, and FP8 GEMM.
Topics: python, acceleration, deep-learning, high-performance, cuda, inference, pytorch, nvidia, triton, gpu-computing, quantization, gpu-kernels, kernel-fusion, llm, fp8, rotary-position-embedding, rmsnorm, llm-inference, operator-fusion, transformer-inference
Updated Apr 29, 2026 - Python
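As a rough illustration of the math the fused kernels compute, here is a minimal NumPy reference sketch. This is not the repository's code: the function names (`rms_norm`, `rope`, `gated_mlp`) and the interleaved even/odd RoPE pairing are assumptions based on the standard definitions of these operations; the actual Triton kernels fuse these steps into single GPU passes.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by the reciprocal root-mean-square over the last axis,
    # then apply a learned per-channel weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def rope(x, pos, base=10000.0):
    # Rotary position embedding: rotate consecutive (even, odd) channel
    # pairs by position-dependent angles. Pairing convention is assumed.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequencies
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def fused_rmsnorm_rope(x, weight, pos):
    # The fused kernel would compute both in one memory pass;
    # this reference simply composes them.
    return rope(rms_norm(x, weight), pos)

def gated_mlp(x, w_gate, w_up, w_down):
    # SwiGLU-style gated MLP: silu(x @ W_gate) * (x @ W_up), projected down.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

Fusing RMSNorm with RoPE (and the three gated-MLP matmuls) cuts redundant reads and writes of the activation tensor, which is typically the win operator fusion targets in memory-bound inference.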