Agent skills for GPU kernel development — writing, debugging, profiling, and optimizing CUDA and TileLang GPU kernels.
The .vscode directory is intentionally tracked as a reference to configure highlighting for humans.
TileLang skills created by Claude Opus 4.6 (probably) in Claude Code with the skill-creator plugin. Based on TileLang v0.1.9 docs and examples, validated on RTX PRO 6000 Blackwell GPUs (sm_120) with CUDA 13.1 and PyTorch 2.11.
| Skill | Description | Key Topics |
|---|---|---|
| writing-tilelang-kernels | Write TileLang GPU kernels from scratch or by adapting patterns | Kernel anatomy, templates (GEMM, elementwise, reduction), memory scopes, T.copy/T.gemm, dynamic shapes |
| debugging-tilelang-programs | Diagnose and fix errors in TileLang programs | Failure taxonomy, T.print, AutoDD, compute-sanitizer, numerical drift, race detection |
| profiling-tilelang-programs | Benchmark and profile TileLang kernels | do_bench backends, TFLOPS/bandwidth, ncu bottleneck diagnosis (pipe utilization, warp stalls), roofline |
| torch-profiling-tilelang-programs | Lightweight torch.profiler alternative to ncu/nsys for TileLang | torch.profiler setup, key_averages, chrome trace, roofline classification (IO / CUDA-core / tensor-core), launch overhead, memory profiling |
| optimizing-tilelang-programs | Optimize TileLang kernels for performance | Tile sizes, pipeline stages, threads, AutoTuner, epilogue fusion, swizzle, ncu-guided tuning |
| testing-fwd-bwd-kernels | Test kernels with forward and backward passes | torch.autograd.Function, compare_backward (not gradcheck), mixed-precision, atomicAdd, attention bwd |
The skills form a natural progression:
writing → debugging → profiling → optimizing
↓ ↓
torch-profiling testing-fwd-bwd (for differentiable ops)
- Write a kernel using templates from the writing skill
- Debug if it fails to compile or produces wrong results
- Profile to measure baseline performance — use
profiling-tilelang-programsfor the ncu/do_bench workflow, ortorch-profiling-tilelang-programswhen ncu/nsys aren't available and you want a quicktorch.profiler-based pass - Optimize to improve performance
- Test fwd+bwd if the kernel needs gradients
General-purpose CUDA development skill for debugging, profiling, and optimizing GPU kernels — independent of any specific framework. Originally from technillogue/ptx-isa-markdown. Includes scraped PTX ISA 9.1, CUDA Runtime API 13.1, and CUDA Driver API 13.1 documentation (640+ markdown files) for grep-based lookup.
| Skill | Description | Key Topics |
|---|---|---|
| cuda-programming | Debug, profile, and optimize CUDA kernels | compute-sanitizer, cuda-gdb, ncu/nsys profiling, NVTX, PTX ISA, coalescing, bank conflicts, inline PTX |