Fast inference engine for Transformer models
Tuned OpenCL BLAS
Serial and parallel implementations of matrix multiplication
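For context on entries like this one, the core computation is C = alpha*A*B + beta*C. Below is a minimal illustrative sketch (not taken from any repository listed here) of a serial triple loop and an OpenMP-parallel counterpart; parallelizing over rows of C is the simplest correct strategy because each row is computed independently.

```cpp
#include <cstddef>

// Naive serial GEMM: C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n; all row-major and contiguous.
void gemm_serial(std::size_t m, std::size_t n, std::size_t k,
                 float alpha, const float* A, const float* B,
                 float beta, float* C) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}

// Parallel variant: rows of C are independent, so the outer loop can be
// distributed across threads with one pragma (compile with -fopenmp).
void gemm_parallel(std::size_t m, std::size_t n, std::size_t k,
                   float alpha, const float* A, const float* B,
                   float beta, float* C) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```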
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
DGEMM on KNL, achieving 75% of MKL performance
Manually optimized GEMM (GEneral Matrix Multiply) operation; there is still a long way to go.
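The usual first step in hand-optimizing GEMM is cache blocking, so tiles of A, B, and C are reused while resident in cache; register tiling and vectorization come after. A sketch of the blocking idea only (the tile size here is an illustrative guess, not a tuned value, and this is not code from the repository above):

```cpp
#include <algorithm>
#include <cstddef>

// Tile edge, chosen so three BS x BS tiles fit comfortably in cache;
// 64 is an illustrative placeholder, not a tuned constant.
constexpr std::size_t BS = 64;

// Cache-blocked GEMM sketch: C += A * B, all row-major.
void gemm_blocked(std::size_t m, std::size_t n, std::size_t k,
                  const float* A, const float* B, float* C) {
    for (std::size_t ii = 0; ii < m; ii += BS)
        for (std::size_t pp = 0; pp < k; pp += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                // Multiply one tile pair; the i-p-j inner order keeps
                // rows of B streaming and the current row of C in cache.
                for (std::size_t i = ii; i < std::min(ii + BS, m); ++i)
                    for (std::size_t p = pp; p < std::min(pp + BS, k); ++p) {
                        const float a = A[i * k + p];
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += a * B[p * n + j];
                    }
}
```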
CUDA GEMM convolution implementation
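GEMM-based convolution typically lowers the input with an im2col transform so that the convolution becomes one large matrix multiply. A simplified host-side sketch of that transform for a single channel, stride 1, no padding (a hypothetical helper for illustration, not code from the repository above):

```cpp
#include <cstddef>
#include <vector>

// im2col for one H x W channel and an R x S filter, stride 1, no padding.
// Each output pixel's receptive field becomes one column, so convolution
// reduces to a (1 x R*S) times (R*S x OH*OW) GEMM.
std::vector<float> im2col(const float* img, std::size_t H, std::size_t W,
                          std::size_t R, std::size_t S) {
    const std::size_t OH = H - R + 1, OW = W - S + 1;
    std::vector<float> cols(R * S * OH * OW);
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t s = 0; s < S; ++s)
            for (std::size_t y = 0; y < OH; ++y)
                for (std::size_t x = 0; x < OW; ++x)
                    cols[(r * S + s) * OH * OW + y * OW + x] =
                        img[(y + r) * W + (x + s)];
    return cols;
}
```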
Development of deep learning inference code using OpenCL kernel functions.
Multiple GEMM operators built with CUTLASS to support LLM inference.
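For orientation, CUTLASS exposes GEMM as a device-level template instantiated with element types and layouts; the sketch below follows the pattern of the CUTLASS 2.x quickstart. The float types, row-major layouts, and the assumption that A, B, and C are preallocated device pointers are illustrative choices, not details of the repository above (compile with nvcc).

```cpp
#include <cutlass/gemm/device/gemm.h>

// Single-precision GEMM built from CUTLASS's device-level template.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::RowMajor,   // A
    float, cutlass::layout::RowMajor,   // B
    float, cutlass::layout::RowMajor>;  // C

// A, B, C are device pointers; lda/ldb/ldc are leading dimensions.
cutlass::Status run_gemm(int M, int N, int K, float alpha,
                         const float* A, int lda,
                         const float* B, int ldb,
                         float beta, float* C, int ldc) {
    Gemm gemm_op;
    return gemm_op({{M, N, K},       // problem size
                    {A, lda},        // tensor ref for A
                    {B, ldb},        // tensor ref for B
                    {C, ldc},        // source C (beta accumulation)
                    {C, ldc},        // destination D
                    {alpha, beta}}); // epilogue scalars
}
```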
Low Precision Arithmetic for Convolutional Neural Network Inference
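Low-precision inference commonly maps to a GEMM with int8 operands and int32 accumulation, followed by a separate requantization step back to int8. A minimal sketch of the accumulation core only (symmetric quantization assumed; requantization omitted; not code from the repository above):

```cpp
#include <cstddef>
#include <cstdint>

// Quantized GEMM core in the style used for int8 CNN inference:
// int8 operands with int32 accumulation to avoid overflow. A later
// requantization step (not shown) maps the int32 result back to int8.
void gemm_s8s8s32(std::size_t m, std::size_t n, std::size_t k,
                  const std::int8_t* A, const std::int8_t* B,
                  std::int32_t* C) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            std::int32_t acc = 0;
            for (std::size_t p = 0; p < k; ++p)
                acc += static_cast<std::int32_t>(A[i * k + p]) *
                       static_cast<std::int32_t>(B[p * n + j]);
            C[i * n + j] = acc;
        }
}
```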
My experiments with convolution
OpenMP Matrix Multiplication Offloading Playground
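OpenMP offloading moves the same triple loop onto an accelerator with target directives. A minimal sketch, assuming a compiler built with offloading support (e.g. clang or nvc++); without it, the loop simply runs on the host:

```cpp
#include <cstddef>

// Matrix multiply offloaded via OpenMP target directives: data is
// mapped to the device, and the two outer loops are collapsed and
// distributed across teams of device threads.
void matmul_offload(std::size_t m, std::size_t n, std::size_t k,
                    const float* A, const float* B, float* C) {
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: A[0:m*k], B[0:k*n]) map(tofrom: C[0:m*n])
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}
```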
Coursework for Programming on New Architecture 1 (GPU), autumn 2021