Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
Updated Sep 8, 2024 - CUDA
Algorithms implemented in CUDA + resources about GPGPU
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm
Companion code for the bilibili video course "Introduction to CUDA 12.x Parallel Programming (C++ Edition)"
GPU-accelerated fusion of visual odometry and IMU data using an Unscented Kalman Filter (UKF). Developed in C++ with CUDA, cuBLAS, and cuSOLVER, this system targets real-time state and covariance estimation for robotics and autonomous system applications.
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with the MMA PTX instruction.
Lab exercise for the Parallel Processing course at NTUA on CUDA programming
Matrix Exponential Approximation using CUDA
CUDA kernel functions
Generalized Orthogonal Least-Squares in CUDA
Newton's and Halley's Method for the Matrix Polar Decomposition using CUDA
Nonnegative matrix factorizations using CUDA
An MNIST handwritten digit classifier written from scratch in CUDA C
GPGPU Inverse Distance Weighting using matrix vector multiplication
A library to extract DCT hashes with CUDA