Skip to content

smolix/mxnet

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11,960 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Blackwell / CUDA 13 fork

Unofficial fork at smolix/mxnet — Blackwell (sm_120) port of MXNet 2.0.

Apache MXNet was archived on 2023-11-17. The upstream tree is frozen at CUDA 11 / cuDNN 8 / oneDNN v2 and does not build on Blackwell GPUs or modern CUDA toolchains. This fork carries the minimum set of patches needed to run existing MXNet code on current hardware. It is not an official Apache release.

Current version string: 2.0.0+cu13.bw.20260517 (<upstream-version>+cu<cuda-major>.bw.<YYYYMMDD>).

Why this fork exists

The primary goal is to run existing MXNet notebooks on Blackwell (RTX PRO 4000 / RTX 50-series / B100-class) hardware with the current CUDA 13 + cuDNN 9 + oneDNN v3 + NCCL 2.28 stack. The secondary goal is to keep the residual MXNet user community (legacy research code, frozen production pipelines, niche operators like _contrib_quantize_*) able to use current GPUs without a full rewrite to PyTorch / JAX. See issues.md for the open work list and priority-ordered triage.

What works

  • Blackwell sm_120 SASS / PTX (CUDA 13.0), plus sm_80 (Ampere), sm_86 (Ada), sm_89 (RTX 40), sm_90 (H100), and PTX 120 fallback in fatbin.
  • cuDNN 9.22 — including the rewritten v8 RNN path (LSTM / GRU / vanilla RNN, fwd + bwd). TF32 enabled by default on FP32 conv (mirrors PyTorch / TensorFlow defaults; ~2.87× speedup on sm_120 vs the legacy MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION=0 mode).
  • oneDNN v3.11 — full INT8 path (per-OC weight scales, fused conv/FC, fused sum, dequant-to-fp32 output).
  • NCCL 2.28 — single-process / multi-GPU.
  • INT8 quantization (quantize_net, _sg_onednn_conv, _sg_onednn_fully_connected).
  • fp16 and fp32 forward + backward training.
  • F16C CPU intrinsics for fast fp16 host (de)serialization.
  • DNNL subgraph fusion, the activation/eltwise/layer-norm/softmax stack, pooling, batch norm fwd+bwd, transpose, concat, where, masked softmax.

What is experimental or known-broken

  • bf16 on non-AVX-512-BF16 CPUs — oneDNN falls back to fp32 emulation; not fixable in software, test on Intel SPR or AMD Zen 4 / Granite Rapids.
  • Backward through quantized ops — forward inference is solid; backward through _sg_onednn_fully_connected and _sg_onednn_conv is unvalidated.
  • AMP (automatic mixed precision) subgraph — 6 known failures with inner_product primitive creation; investigation pending.
  • int8 quantized concattest_pos_single_concat_pos_neg[*-data_shape1] fails with entire output channels zeroed; suspect oneDNN v3 uint8→int8 reorder semantics.
  • ONNX export / import — both tests/python/onnx/test_models.py and test_operators.py error at collect time; the ONNX path was not updated for MXNet 2.0 numpy ops.
  • See issues.md for the full open list (45 items).

System requirements

  • Linux x86_64 (tested on Ubuntu 22.04 / 24.04).
  • NVIDIA driver supporting CUDA 13 (R570+).
  • CUDA 13.0 toolkit.
  • cuDNN 9.22+ (cuDNN 9.22 has the best sm_120 heuristic coverage; earlier 9.x works but routes more conv shapes through generic fallback engines — notably depthwise 3×3 is ~7× faster on 9.22 vs 9.14). The release wheel bundles 9.22 under mxnet/lib/.
  • NCCL 2.28.3.
  • Python 3.10+ (3.11 / 3.12 / 3.13 are CI-tested).

Installation

pip install https://github.com/smolix/mxnet/releases/download/v2.0.0.cu13.bw.20260517-beta/mxnet-2.0.0+cu13.bw.20260517-cp311-cp311-linux_x86_64.whl

The wheel itself is 454 MB. pip will transitively pull nvidia-cudnn-cu13 (~1 GB) and nvidia-nccl-cu13 (~190 MB). The remaining CUDA 13 toolkit libs (libcudart.so.13, libcublas.so.13, libcufft.so.12, libcusolver.so.12, libcurand.so.10, libnvrtc.so.13) come from your system CUDA 13 install at /usr/local/cuda/ — NVIDIA has not yet published cu13 wheels for those on PyPI (the rest of the nvidia-*-cu13 packages are placeholder stubs at version 0.0.1 as of 2026-05-17). libmxnet.so's RUNPATH covers both locations.

Requires Python 3.11, Linux x86_64, NVIDIA driver R570+, and the CUDA 13 toolkit installed at /usr/local/cuda/ (e.g. apt install cuda-13).

To build from source see BUILDING.md. The short version is: clone with submodules, install libnccl-dev before invoking cmake, then cmake -DUSE_CUDA=ON -DCUDA_ARCH_LIST="12.0" ...

Acknowledgements

This fork builds on the work of the Apache MXNet community and its contributors. All upstream code is Apache 2.0; the Blackwell / CUDA 13 patches in this fork are likewise Apache 2.0. The original project history follows below — its build status badges, social links, and roadmap targets refer to the (now archived) upstream and are kept for historical reference.



banner

Apache MXNet for Deep Learning (upstream — archived 2023-11-17)

Note: the sections below describe the original Apache MXNet 2.0 project. They are kept verbatim for historical context. Build status badges, mailing lists, Slack channels, and Twitter/Medium links refer to the archived upstream project and are not actively monitored. GitHub release (latest SemVer) GitHub stars GitHub forks GitHub contributors GitHub issues good first issue GitHub pull requests by-label GitHub license Twitter Twitter Follow

Apache MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. At its core, MXNet contains a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer on top of that makes symbolic execution fast and memory efficient. MXNet is portable and lightweight, scalable to many GPUs and machines.

Apache MXNet is more than a deep learning project. It is a community on a mission of democratizing AI. It is a collection of blue prints and guidelines for building deep learning systems, and interesting insights of DL systems for hackers.

Licensed under an Apache-2.0 license.

Branch Build Status
master CentOS CPU Build Status CentOS GPU Build Status Clang Build Status
Edge Build Status Miscellaneous Build Status Sanity Build Status
Unix CPU Build Status Unix GPU Build Status Website Build Status
Windows CPU Build Status Windows GPU Build Status Documentation Status
v1.x CentOS CPU Build Status CentOS GPU Build Status Clang Build Status
Edge Build Status Miscellaneous Build Status Sanity Build Status
Unix CPU Build Status Unix GPU Build Status Website Build Status
Windows CPU Build Status Windows GPU Build Status Documentation Status

Features

  • NumPy-like programming interface, and is integrated with the new, easy-to-use Gluon 2.0 interface. NumPy users can easily adopt MXNet and start in deep learning.
  • Automatic hybridization provides imperative programming with the performance of traditional symbolic programming.
  • Lightweight, memory-efficient, and portable to smart devices through native cross-compilation support on ARM, and through ecosystem projects such as TVM, TensorRT, OpenVINO.
  • Scales up to multi GPUs and distributed setting with auto parallelism through ps-lite, Horovod, and BytePS.
  • Extensible backend that supports full customization, allowing integration with custom accelerator libraries and in-house hardware without the need to maintain a fork.
  • Support for Python, Java, C++, R, Scala, Clojure, Go, Javascript, Perl, and Julia.
  • Cloud-friendly and directly compatible with AWS and Azure.

Contents

What's New

Ecosystem News

Stay Connected

Channel Purpose
Follow MXNet Development on Github See what's going on in the MXNet project.
MXNet Confluence Wiki for Developers MXNet developer wiki for information related to project development, maintained by contributors and developers. To request write access, send an email to send request to the dev list .
dev@mxnet.apache.org mailing list The "dev list". Discussions about the development of MXNet. To subscribe, send an email to dev-subscribe@mxnet.apache.org .
discuss.mxnet.io Asking & answering MXNet usage questions.
Apache Slack #mxnet Channel Connect with MXNet and other Apache developers. To join the MXNet slack channel send request to the dev list .
Follow MXNet on Social Media Get updates about new features and events.

Social Media

Keep connected with the latest MXNet news and updates.

Apache MXNet on Twitter

Contributor and user blogs about MXNet

reddit Discuss MXNet on r/mxnet

Apache MXNet YouTube channel

Apache MXNet on LinkedIn

History

MXNet emerged from a collaboration by the authors of cxxnet, minerva, and purine2. The project reflects what we have learned from the past projects. MXNet combines aspects of each of these projects to achieve flexibility, speed, and memory efficiency.

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015

About

Blackwell support (CUDA 13)

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • C++ 48.6%
  • Python 34.7%
  • Jupyter Notebook 7.6%
  • Cuda 6.1%
  • CMake 0.9%
  • Shell 0.7%
  • Other 1.4%