Blackwell / CUDA 13 fork

Unofficial fork at smolix/mxnet — Blackwell (sm_120) port of MXNet 2.0.

Apache MXNet was archived on 2023-11-17. The upstream tree is frozen at CUDA 11 / cuDNN 8 / oneDNN v2 and does not build on Blackwell GPUs or modern CUDA toolchains. This fork carries the minimum set of patches needed to run existing MXNet code on current hardware. It is not an official Apache release.

Current version string: 2.0.0+cu13.bw.20260517 (<upstream-version>+cu<cuda-major>.bw.<YYYYMMDD>).

Why this fork exists

The primary goal is to run existing MXNet notebooks on Blackwell (RTX PRO 4000 / RTX 50-series / B100-class) hardware with the current CUDA 13 + cuDNN 9 + oneDNN v3 + NCCL 2.28 stack. The secondary goal is to keep the residual MXNet user community (legacy research code, frozen production pipelines, niche operators like _contrib_quantize_*) able to use current GPUs without a full rewrite to PyTorch / JAX. See issues.md for the open work list and priority-ordered triage.

What works

Blackwell sm_120 SASS / PTX (CUDA 13.0), plus sm_80 (Ampere), sm_86 (Ada), sm_89 (RTX 40), sm_90 (H100), and PTX 120 fallback in fatbin.
cuDNN 9.22 — including the rewritten v8 RNN path (LSTM / GRU / vanilla RNN, fwd + bwd). TF32 enabled by default on FP32 conv (mirrors PyTorch / TensorFlow defaults; ~2.87× speedup on sm_120 vs the legacy MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION=0 mode).
oneDNN v3.11 — full INT8 path (per-OC weight scales, fused conv/FC, fused sum, dequant-to-fp32 output).
NCCL 2.28 — single-process / multi-GPU.
INT8 quantization (quantize_net, _sg_onednn_conv, _sg_onednn_fully_connected).
fp16 and fp32 forward + backward training.
F16C CPU intrinsics for fast fp16 host (de)serialization.
DNNL subgraph fusion, the activation/eltwise/layer-norm/softmax stack, pooling, batch norm fwd+bwd, transpose, concat, where, masked softmax.

What is experimental or known-broken

bf16 on non-AVX-512-BF16 CPUs — oneDNN falls back to fp32 emulation; not fixable in software, test on Intel SPR or AMD Zen 4 / Granite Rapids.
Backward through quantized ops — forward inference is solid; backward through _sg_onednn_fully_connected and _sg_onednn_conv is unvalidated.
AMP (automatic mixed precision) subgraph — 6 known failures with inner_product primitive creation; investigation pending.
int8 quantized concat — test_pos_single_concat_pos_neg[*-data_shape1] fails with entire output channels zeroed; suspect oneDNN v3 uint8→int8 reorder semantics.
ONNX export / import — both tests/python/onnx/test_models.py and test_operators.py error at collect time; the ONNX path was not updated for MXNet 2.0 numpy ops.
See issues.md for the full open list (45 items).

System requirements

Linux x86_64 (tested on Ubuntu 22.04 / 24.04).
NVIDIA driver supporting CUDA 13 (R570+).
CUDA 13.0 toolkit.
cuDNN 9.22+ (cuDNN 9.22 has the best sm_120 heuristic coverage; earlier 9.x works but routes more conv shapes through generic fallback engines — notably depthwise 3×3 is ~7× faster on 9.22 vs 9.14). The release wheel bundles 9.22 under mxnet/lib/.
NCCL 2.28.3.
Python 3.10+ (3.11 / 3.12 / 3.13 are CI-tested).

Installation

pip install https://github.com/smolix/mxnet/releases/download/v2.0.0.cu13.bw.20260517-beta/mxnet-2.0.0+cu13.bw.20260517-cp311-cp311-linux_x86_64.whl

The wheel itself is 454 MB. pip will transitively pull nvidia-cudnn-cu13 (~1 GB) and nvidia-nccl-cu13 (~190 MB). The remaining CUDA 13 toolkit libs (libcudart.so.13, libcublas.so.13, libcufft.so.12, libcusolver.so.12, libcurand.so.10, libnvrtc.so.13) come from your system CUDA 13 install at /usr/local/cuda/ — NVIDIA has not yet published cu13 wheels for those on PyPI (the rest of the nvidia-*-cu13 packages are placeholder stubs at version 0.0.1 as of 2026-05-17). libmxnet.so's RUNPATH covers both locations.

Requires Python 3.11, Linux x86_64, NVIDIA driver R570+, and the CUDA 13 toolkit installed at /usr/local/cuda/ (e.g. apt install cuda-13).

To build from source see BUILDING.md. The short version is: clone with submodules, install libnccl-dev before invoking cmake, then cmake -DUSE_CUDA=ON -DCUDA_ARCH_LIST="12.0" ...

Acknowledgements

This fork builds on the work of the Apache MXNet community and its contributors. All upstream code is Apache 2.0; the Blackwell / CUDA 13 patches in this fork are likewise Apache 2.0. The original project history follows below — its build status badges, social links, and roadmap targets refer to the (now archived) upstream and are kept for historical reference.

Apache MXNet for Deep Learning (upstream — archived 2023-11-17)

Note: the sections below describe the original Apache MXNet 2.0 project. They are kept verbatim for historical context. Build status badges, mailing lists, Slack channels, and Twitter/Medium links refer to the archived upstream project and are not actively monitored.

Apache MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. At its core, MXNet contains a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer on top of that makes symbolic execution fast and memory efficient. MXNet is portable and lightweight, scalable to many GPUs and machines.

Apache MXNet is more than a deep learning project. It is a community on a mission of democratizing AI. It is a collection of blue prints and guidelines for building deep learning systems, and interesting insights of DL systems for hackers.

Licensed under an Apache-2.0 license.

Branch	Build Status
master
v1.x

Features

NumPy-like programming interface, and is integrated with the new, easy-to-use Gluon 2.0 interface. NumPy users can easily adopt MXNet and start in deep learning.
Automatic hybridization provides imperative programming with the performance of traditional symbolic programming.
Lightweight, memory-efficient, and portable to smart devices through native cross-compilation support on ARM, and through ecosystem projects such as TVM, TensorRT, OpenVINO.
Scales up to multi GPUs and distributed setting with auto parallelism through ps-lite, Horovod, and BytePS.
Extensible backend that supports full customization, allowing integration with custom accelerator libraries and in-house hardware without the need to maintain a fork.
Support for Python, Java, C++, R, Scala, Clojure, Go, Javascript, Perl, and Julia.
Cloud-friendly and directly compatible with AWS and Azure.

What's New

1.9.1 Release - MXNet 1.9.1 Release.
1.8.0 Release - MXNet 1.8.0 Release.
1.7.0 Release - MXNet 1.7.0 Release.
1.6.0 Release - MXNet 1.6.0 Release.
1.5.1 Release - MXNet 1.5.1 Patch Release.
1.5.0 Release - MXNet 1.5.0 Release.
1.4.1 Release - MXNet 1.4.1 Patch Release.
1.4.0 Release - MXNet 1.4.0 Release.
1.3.1 Release - MXNet 1.3.1 Patch Release.
1.3.0 Release - MXNet 1.3.0 Release.
1.2.0 Release - MXNet 1.2.0 Release.
1.1.0 Release - MXNet 1.1.0 Release.
1.0.0 Release - MXNet 1.0.0 Release.
0.12.1 Release - MXNet 0.12.1 Patch Release.
0.12.0 Release - MXNet 0.12.0 Release.
0.11.0 Release - MXNet 0.11.0 Release.
Apache Incubator - We are now an Apache Incubator project.
0.10.0 Release - MXNet 0.10.0 Release.
0.9.3 Release - First 0.9 official release.
0.9.1 Release (NNVM refactor) - NNVM branch is merged into master now. An official release will be made soon.
0.8.0 Release

Ecosystem News

Stay Connected

Channel	Purpose
Follow MXNet Development on Github	See what's going on in the MXNet project.
MXNet Confluence Wiki for Developers	MXNet developer wiki for information related to project development, maintained by contributors and developers. To request write access, send an email to send request to the dev list .
dev@mxnet.apache.org mailing list	The "dev list". Discussions about the development of MXNet. To subscribe, send an email to dev-subscribe@mxnet.apache.org .
discuss.mxnet.io	Asking & answering MXNet usage questions.
Apache Slack #mxnet Channel	Connect with MXNet and other Apache developers. To join the MXNet slack channel send request to the dev list .
Follow MXNet on Social Media	Get updates about new features and events.

Social Media

Keep connected with the latest MXNet news and updates.

Apache MXNet on Twitter

Contributor and user blogs about MXNet

Discuss MXNet on r/mxnet

Apache MXNet YouTube channel

Apache MXNet on LinkedIn

History

MXNet emerged from a collaboration by the authors of cxxnet, minerva, and purine2. The project reflects what we have learned from the past projects. MXNet combines aspects of each of these projects to achieve flexibility, speed, and memory efficiency.

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015

Name		Name	Last commit message	Last commit date
Latest commit History 11,960 Commits
.github		.github
3rdparty		3rdparty
benchmark		benchmark
cd		cd
ci		ci
cmake		cmake
config		config
contrib/tvmop		contrib/tvmop
cpp-package		cpp-package
docker		docker
docs		docs
example		example
include		include
licenses		licenses
plugin		plugin
python		python
src		src
tests		tests
tools		tools
.asf.yaml		.asf.yaml
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.cmakelintrc		.cmakelintrc
.codecov.yml		.codecov.yml
.git-blame-ignore-revs		.git-blame-ignore-revs
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.licenserc.yaml		.licenserc.yaml
.mxnet_root		.mxnet_root
BUILDING.md		BUILDING.md
CHANGELOG.md		CHANGELOG.md
CMakeLists.txt		CMakeLists.txt
CODEOWNERS		CODEOWNERS
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTORS.md		CONTRIBUTORS.md
DNNL_README.md		DNNL_README.md
LICENSE		LICENSE
NEWS.md		NEWS.md
NOTICE		NOTICE
PERFORMANCE_NOTES.md		PERFORMANCE_NOTES.md
README.md		README.md
SECURITY.md		SECURITY.md
bench_cudnn_sweep.py		bench_cudnn_sweep.py
bench_tf32_conv.py		bench_tf32_conv.py
bugreport.md		bugreport.md
bugreport_gluon_data.md		bugreport_gluon_data.md
conftest.py		conftest.py
cublaslt_scope.md		cublaslt_scope.md
cudnn_bump.md		cudnn_bump.md
doap.rdf		doap.rdf
issues.md		issues.md
prospector.yaml		prospector.yaml
pytest.ini		pytest.ini
rat-excludes		rat-excludes
readthedocs.yml		readthedocs.yml
snap.python		snap.python
tf32_audit.md		tf32_audit.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Blackwell / CUDA 13 fork

Why this fork exists

What works

What is experimental or known-broken

System requirements

Installation

Acknowledgements

Apache MXNet for Deep Learning (upstream — archived 2023-11-17)

Features

Contents

What's New

Ecosystem News

Stay Connected

Social Media

History

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Blackwell / CUDA 13 fork

Why this fork exists

What works

What is experimental or known-broken

System requirements

Installation

Acknowledgements

Apache MXNet for Deep Learning (upstream — archived 2023-11-17)

Features

Contents

What's New

Ecosystem News

Stay Connected

Social Media

History

About

Resources

License

Code of conduct

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages