Fast Inference of MoE Models with CPU-GPU Orchestration
LLM chatbot example using OpenVINO with RAG (Retrieval Augmented Generation).
Script which performs RAG and uses a local LLM for Q&A.
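For readers new to the pattern the two entries above describe, here is a minimal RAG sketch: embed a document set, retrieve the best match by cosine similarity, and answer with a local generative model. The toy documents, model names, and prompt format are illustrative assumptions, not the stack (e.g. OpenVINO runtimes) those repositories actually use.

```python
# Minimal RAG sketch with placeholder models and a toy document set.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

documents = [
    "OpenVINO is a toolkit for optimizing and deploying AI inference.",
    "Retrieval Augmented Generation grounds LLM answers in retrieved text.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

# Any locally runnable causal LM works here; distilgpt2 is just a small stand-in.
generator = pipeline("text-generation", model="distilgpt2")

def answer(question: str) -> str:
    # Retrieve the single most similar document to ground the answer.
    q_emb = embedder.encode(question, convert_to_tensor=True)
    best = util.cos_sim(q_emb, doc_embeddings).argmax().item()
    prompt = f"Context: {documents[best]}\nQuestion: {question}\nAnswer:"
    return generator(prompt, max_new_tokens=50)[0]["generated_text"]

print(answer("What does RAG do?"))
```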
Script which takes a .wav audio file, performs speech-to-text using OpenAI Whisper, and then uses Llama 3 to generate a summary and action points from the transcript.
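A compact sketch of that transcribe-then-summarize flow, assuming the openai-whisper package and a small stand-in summarizer rather than the Llama 3 setup the entry describes:

```python
# Transcribe a .wav file with Whisper, then summarize the transcript.
# The summarization model is a placeholder; the actual repo prompts Llama 3
# to produce both a summary and action points.
import whisper
from transformers import pipeline

# Transcribe the audio file (path is illustrative).
model = whisper.load_model("base")
transcript = model.transcribe("meeting.wav")["text"]

# Summarize; truncate long transcripts to stay within the model's input limit.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
summary = summarizer(transcript[:3000], max_length=120, min_length=30)
print(summary[0]["summary_text"])
```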