codingonion/awesome-cuda-tensorrt-fpga

Awesome-CUDA-TensorRT-FPGA

🔥🔥🔥 This repository lists some awesome public NVIDIA CUDA, cuBLAS, cuDNN, TensorRT, AMD ROCm and FPGA projects.

Contents

Awesome List

Learning Resources

Frameworks

  • CUDA and TensorRT Framework

    • GPU Interface

      • CPP Version
        • CCCL : CUDA C++ Core Libraries. The concept for the CUDA C++ Core Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers.

        • HIP : C++ Heterogeneous-Compute Interface for Portability. HIP is a C++ runtime API and kernel language that allows developers to create portable applications for AMD and NVIDIA GPUs from a single source code. rocmdocs.amd.com/projects/HIP/

      • Python Version
      • Rust Version
      • Julia Version
    • Scientific Computing Framework

      • cuBLAS : Basic Linear Algebra on NVIDIA GPUs. NVIDIA cuBLAS is a GPU-accelerated library for accelerating AI and HPC applications. It includes several API extensions for providing drop-in industry standard BLAS APIs and GEMM APIs with support for fusions that are highly optimized for NVIDIA GPUs. The cuBLAS library also contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution with additional tuning for the best performance.

      • CUTLASS : CUDA Templates for Linear Algebra Subroutines.

      • MatX : MatX - GPU-Accelerated Numerical Computing in Modern C++. An efficient C++17 GPU numerical computing library with Python-like syntax. nvidia.github.io/MatX

      • CuPy : NumPy & SciPy for GPU. cupy.dev

      • GenericLinearAlgebra.jl : Generic numerical linear algebra in Julia.

      • custos-math : This crate provides CUDA, OpenCL, CPU (and Stack) based matrix operations using custos.

    • Machine Learning Framework

      • cuDNN : The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.

      • PyTorch : Tensors and Dynamic neural networks in Python with strong GPU acceleration. pytorch.org

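      • PaddlePaddle : PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning). www.paddlepaddle.org/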

      • flashlight/flashlight : A C++ standalone library for machine learning. fl.readthedocs.io/en/latest/

      • NVlabs/tiny-cuda-nn : Lightning fast C++/CUDA neural network framework.

      • yhwang-hub/dl_model_infer : This is a C++ version of an AI inference library. Currently it only supports inference of TensorRT models. The follow-up plan is to support C++ inference for frameworks such as OpenVINO, NCNN, and MNN. Pre- and post-processing come in two versions, C++ and CUDA; the CUDA version is recommended. This repository provides accelerated deployment cases of popular deep learning CV models, and the CUDA C code supports dynamic-batch image processing, inference, decoding, and NMS.

    • AI Inference Framework

      • C Implementation
        • llm.c : LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation.

        • llama2.c : Inference Llama 2 in one file of pure C. Train the Llama 2 LLM architecture in PyTorch then inference it with one simple 700-line C file (run.c).

      • CPP Implementation
        • TensorRT : NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. developer.nvidia.com/tensorrt

        • TensorRT-LLM : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. nvidia.github.io/TensorRT-LLM

        • gemma.cpp : gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma foundation models from Google.

        • llama.cpp : Inference of LLaMA model in pure C/C++.

        • whisper.cpp : High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model.

        • ChatGLM.cpp : C++ implementation of ChatGLM-6B and ChatGLM2-6B.

        • MegEngine/InferLLM : InferLLM is a lightweight LLM model inference framework that mainly references and borrows from the llama.cpp project.

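        • DeployAI/nndeploy : nndeploy is an end-to-end model deployment framework. Built around multi-backend inference and directed-acyclic-graph-based model deployment, it aims to give users a cross-platform, easy-to-use, high-performance model deployment experience. nndeploy-zh.readthedocs.io/zh/latest/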

        • zjhellofss/KuiperInfer (a self-made deep learning inference framework) : Implement a high-performance deep learning inference library step by step from scratch, with support for inference of models such as Llama, UNet, YOLOv5, and ResNet.

        • skeskinen/llama-lite : Embeddings focused small version of Llama NLP model.

        • Const-me/Whisper : High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model.

        • wangzhaode/ChatGLM-MNN : Pure C++, Easy Deploy ChatGLM-6B.

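        • ztxz16/fastllm : A pure C++ large-model library with no third-party dependencies and with CUDA acceleration support. It currently supports the Chinese-developed large models ChatGLM-6B and MOSS, and can run ChatGLM-6B smoothly on Android devices.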

        • davidar/eigenGPT : Minimal C++ implementation of GPT2.

        • Tlntin/Qwen-TensorRT-LLM : Inference acceleration for Qwen-7B-Chat using TRT-LLM.

        • FeiGeChuanShu/trt2023 : NVIDIA TensorRT Hackathon 2023 second-round topic: building and optimizing the Tongyi Qianwen Qwen-7B model with TensorRT-LLM.

        • TRT2022/trtllm-llama : ☢️ TensorRT 2023 second round: inference acceleration and optimization of the Llama model based on TensorRT-LLM.

      • Mojo Implementation
      • Rust Implementation
      • Zig Implementation

        • llama2.zig : Inference Llama 2 in one file of pure Zig.

        • renerocksai/gpt4all.zig : ZIG build for a terminal-based chat client for an assistant-style large language model with ~800k GPT-3.5-Turbo Generations based on LLaMa.

        • EugenHotaj/zig_inference : Neural Network Inference Engine in Zig.

      • Go Implementation
        • Ollama : Get up and running with Llama 2, Mistral, Gemma, and other large language models. ollama.com

        • go-skynet/LocalAI : 🤖 Self-hosted, community-driven, local OpenAI-compatible API. Drop-in replacement for OpenAI running LLMs on consumer-grade hardware. Free Open Source OpenAI alternative. No GPU required. LocalAI is an API to run ggml compatible models: llama, gpt4all, rwkv, whisper, vicuna, koala, gpt4all-j, cerebras, falcon, dolly, starcoder, and many others. localai.io

      • LLM Deployment Engine
        • vllm-project/vllm : A high-throughput and memory-efficient inference and serving engine for LLMs. vllm.readthedocs.io

        • MLC LLM : Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. mlc.ai/mlc-llm

        • Lamini : The LLM engine for rapidly customizing models 🦙.

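        • datawhalechina/self-llm : "A Guide to Using Open-Source Large Models": quickly deploy open-source large models in a Linux environment; a deployment tutorial tailored to users in China.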

      • LLM Inference Benchmark
    • Multi-GPU Framework

    • Robotics Framework

      • Cupoch : Robotics with GPU computing.
    • Web3 Framework

      • Tachyon : Modular ZK(Zero Knowledge) backend accelerated by GPU.

      • ICICLE : ICICLE is a library for ZK acceleration using CUDA-enabled GPUs.

  • HDL and FPGA Frameworks

    • C HDL

      • LiteX : The LiteX framework provides a convenient and efficient infrastructure to create FPGA Cores/SoCs, to explore various digital design architectures, and to create full FPGA-based systems.
    • Scala HDL

    • Rust HDL

      • Veryl : A Modern Hardware Description Language.

      • RustHDL : A framework for writing FPGA firmware using the Rust Programming Language.

      • VHDL-LS/rust_hdl : This repository contains a fast VHDL language server and analysis library written in Rust.

      • yupferris/kaze : An HDL embedded in Rust. kaze provides an API to describe Modules composed of Signals, which can then be used to generate Rust simulator code or Verilog modules.

      • dalance/sv-parser : SystemVerilog parser library fully compliant with IEEE 1800-2017.

      • dalance/svls : SystemVerilog language server.

      • dalance/svlint : SystemVerilog linter.

      • vivekmalneedi/veridian : A SystemVerilog Language Server.

      • zachjs/sv2v : SystemVerilog to Verilog conversion.

    • Python HDL

      • nMigen : A modern hardware definition language and toolchain based on Python.

      • Migen : A Python toolbox for building complex digital hardware.

      • MyHDL : MyHDL is a free, open-source package for using Python as a hardware description and verification language.

      • Magma : Magma is a hardware design language embedded in Python.

      • PyRTL : PyRTL provides a collection of classes for pythonic register-transfer level design, simulation, tracing, and testing suitable for teaching and research.

      • Veriloggen : A Mixed-Paradigm Hardware Construction Framework.

      • HWT : VHDL/Verilog/SystemC code generator and simulator API written in Python/C++.

      • HDL21 : Analog Hardware Description Library in Python.

Applications

Blogs

Videos

Jobs and Interview
