GPU Development Skills

Agent skills for GPU kernel development — writing, debugging, profiling, and optimizing CUDA and TileLang GPU kernels.

The .vscode directory is intentionally tracked as a reference to configure highlighting for humans.

TileLang Skills

TileLang skills created by Claude Opus 4.6 (probably) in Claude Code with the skill-creator plugin. Based on TileLang v0.1.9 docs and examples, validated on RTX PRO 6000 Blackwell GPUs (sm_120) with CUDA 13.1 and PyTorch 2.11.

Skill	Description	Key Topics
writing-tilelang-kernels	Write TileLang GPU kernels from scratch or by adapting patterns	Kernel anatomy, templates (GEMM, elementwise, reduction), memory scopes, T.copy/T.gemm, dynamic shapes
debugging-tilelang-programs	Diagnose and fix errors in TileLang programs	Failure taxonomy, T.print, AutoDD, compute-sanitizer, numerical drift, race detection
profiling-tilelang-programs	Benchmark and profile TileLang kernels	do_bench backends, TFLOPS/bandwidth, ncu bottleneck diagnosis (pipe utilization, warp stalls), roofline
torch-profiling-tilelang-programs	Lightweight torch.profiler alternative to ncu/nsys for TileLang	torch.profiler setup, key_averages, chrome trace, roofline classification (IO / CUDA-core / tensor-core), launch overhead, memory profiling
optimizing-tilelang-programs	Optimize TileLang kernels for performance	Tile sizes, pipeline stages, threads, AutoTuner, epilogue fusion, swizzle, ncu-guided tuning
testing-fwd-bwd-kernels	Test kernels with forward and backward passes	torch.autograd.Function, compare_backward (not gradcheck), mixed-precision, atomicAdd, attention bwd

Workflow

The skills form a natural progression:

writing → debugging → profiling → optimizing
                          ↓           ↓
            torch-profiling   testing-fwd-bwd (for differentiable ops)

Write a kernel using templates from the writing skill
Debug if it fails to compile or produces wrong results
Profile to measure baseline performance — use profiling-tilelang-programs for the ncu/do_bench workflow, or torch-profiling-tilelang-programs when ncu/nsys aren't available and you want a quick torch.profiler-based pass
Optimize to improve performance
Test fwd+bwd if the kernel needs gradients

CUDA Skills

General-purpose CUDA development skill for debugging, profiling, and optimizing GPU kernels — independent of any specific framework. Originally from technillogue/ptx-isa-markdown. Includes scraped PTX ISA 9.1, CUDA Runtime API 13.1, and CUDA Driver API 13.1 documentation (640+ markdown files) for grep-based lookup.

Skill	Description	Key Topics
cuda-programming	Debug, profile, and optimize CUDA kernels	compute-sanitizer, cuda-gdb, ncu/nsys profiling, NVTX, PTX ISA, coalescing, bank conflicts, inline PTX

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.vscode		.vscode
evals		evals
skills		skills
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GPU Development Skills

TileLang Skills

Workflow

CUDA Skills

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GPU Development Skills

TileLang Skills

Workflow

CUDA Skills

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages