# RAG System for Audio Content with Triton Inference Server
Transcribe, index, and query audio content with AI-powered search
Features • Quick Start • Usage • Architecture • Documentation
- Overview
- Features
- Demo
- Requirements
- Quick Start
- Usage
- Architecture
- Project Structure
- Configuration
- Development
- Troubleshooting
- Roadmap
- Contributing
- License
- Acknowledgments
## Overview

Audio RAG is a production-ready system for indexing and querying audio content using Retrieval-Augmented Generation (RAG). Built on NVIDIA Triton Inference Server, it provides:
- 🎯 Speech-to-Text - Whisper-powered transcription with timestamps
- 🔍 Semantic Search - BGE-M3 multilingual embeddings with Qdrant vector DB
- 🤖 AI Answers - LLM-generated responses with source citations
- 🚀 Production Ready - Docker-based deployment with Triton inference server
Perfect for podcasts, interviews, lectures, and any audio content you want to make searchable and queryable.
## Features

| Feature | Description |
|---|---|
| 🎤 Audio Transcription | Whisper ASR with word-level timestamps |
| 🌍 Multilingual | Support for 99+ languages via BGE-M3 embeddings |
| 📊 Vector Search | Qdrant-powered similarity search |
| 🔄 Reranking | BGE-Reranker for improved relevance |
| 💬 LLM Answers | Contextual answers with citations |
| 🐳 Docker Ready | One-command deployment |
| ⚡ Triton Powered | Scalable inference serving |
| 🎛️ Configurable | Hydra-based configuration |
## Demo

Ingest a podcast:

```bash
python main.py triton-ingest-podcast \
  --source my-podcast \
  --audio-file ./podcast.mp3
```

```json
{
  "status": "success",
  "chunks_count": 15,
  "source_id": "my-podcast"
}
```

Ask a question:

```bash
python main.py triton-ask "What was discussed about exchange rates?"
```

```
Resolved question: What was discussed about exchange rates?

The current exchange rate as of May 5, 2022 is 75.50 rubles per US dollar.

Citations:
• my-podcast [0:18] score=0.847
  "На сегодняшний день 5 мая 2022 года курс доллара составляет
   75 рублей и 50 копеек к российскому рублю."
```
## Requirements

**Minimum:**

| Requirement | Minimum |
|---|---|
| Python | 3.9+ |
| Docker | 20.10+ |
| Docker Compose | 2.0+ |
| RAM | 4 GB |
| Disk | 10 GB |
| CPU | 2 cores |

**Recommended:**

| Requirement | Recommended |
|---|---|
| RAM | 8 GB+ |
| CPU | 4 cores+ |
| GPU | NVIDIA with CUDA (optional) |
| Disk | 20 GB SSD |
## Quick Start

### 1. Install

```bash
# Clone the repository
git clone <repo-url>
cd audio-rag

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate    # Linux/macOS
# or: .venv\Scripts\activate  # Windows

# Install dependencies
pip install -e .
```

### 2. Configure

```bash
# Copy environment template
cp .env.example .env
# Edit if needed (defaults work for local development)
# nano .env
```

### 3. Start Services

```bash
# Build and start all services (Qdrant + Triton)
docker-compose up -d

# Wait for models to load (3-10 min on first run)
# Models are downloaded from Hugging Face (~2-3 GB total)
docker-compose logs -f triton
```

Expected output when all models are ready:

```
✓ successfully loaded 'asr_whisper'
✓ successfully loaded 'bge_embedder'
✓ successfully loaded 'reranker'
✓ successfully loaded 'ingest_bls'
✓ successfully loaded 'query_bls'
✓ successfully loaded 'llm_qwen'
```

Press `Ctrl+C` to stop following the logs.

### 4. Verify

```bash
# Check Triton server health
curl http://localhost:8000/v2/health/ready
# Check Qdrant health
curl http://localhost:6333/collections
```

Both should return HTTP 200. You're ready to go! 🎉
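The same checks can be scripted. A minimal sketch using the `tritonclient` and `qdrant-client` Python packages (installing them locally is an assumption; adjust hosts and ports to match your `.env`):

```python
# Minimal readiness probe for the local stack (assumes default ports).
import tritonclient.http as httpclient
from qdrant_client import QdrantClient

triton = httpclient.InferenceServerClient(url="localhost:8000")
print("Triton ready:", triton.is_server_ready())
print("Whisper ready:", triton.is_model_ready("asr_whisper"))

qdrant = QdrantClient(host="localhost", port=6333)
print("Collections:", [c.name for c in qdrant.get_collections().collections])
```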
## Usage

### Ingest Audio

Import audio content into the vector database:
```bash
python main.py triton-ingest-podcast \
  --source <source-id> \
  --audio-file ./path/to/audio.mp3
```

Parameters:
| Parameter | Description |
|---|---|
| `--source` | Unique identifier for the audio source |
| `--audio-file` | Path to the audio file (MP3, WAV, M4A, etc.) |
Output:

```json
{
  "status": "success",
  "chunks_count": 15,
  "source_id": "my-podcast"
}
```

### Ask a Question (Text)

```bash
python main.py triton-ask "What topics were discussed?"
```

### Ask a Question (Audio)

```bash
python main.py triton-ask-audio --question-audio-file ./question.mp3
```

Response format:
```
Resolved question transcript: [Transcribed question]

[LLM-generated answer based on context]

Citations:
- source_id [start:end] score=0.XXX
  "Relevant text snippet from audio..."
```
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        TRITON SERVER                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │ ASR Whisper │   │ BGE-M3 Emb. │   │  Reranker   │        │
│  │  (Audio →   │   │   (Text →   │   │ (Relevance  │        │
│  │    Text)    │   │   Vectors)  │   │  Scoring)   │        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
│                                                             │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │ Ingest BLS  │   │  Query BLS  │   │  LLM Qwen   │        │
│  │ (Ingestion  │   │   (Query    │   │  (Answer    │        │
│  │  Pipeline)  │   │  Pipeline)  │   │ Generation) │        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
                     ┌─────────────────┐
                     │     Qdrant      │
                     │  Vector Store   │
                     └─────────────────┘
```
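Under the hood, `ingest_bls` and `query_bls` are Triton Python-backend models that orchestrate the other models via Business Logic Scripting (BLS). A stripped-down illustration of the pattern; the tensor names (`TEXT`, `EMBEDDING`) are placeholders, not the repo's actual contract:

```python
# model_repo/<bls_model>/1/model.py (illustrative skeleton)
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Forward an input tensor to a downstream model...
            text = pb_utils.get_input_tensor_by_name(request, "TEXT")
            infer_request = pb_utils.InferenceRequest(
                model_name="bge_embedder",
                requested_output_names=["EMBEDDING"],
                inputs=[text],
            )
            infer_response = infer_request.exec()
            # ...and return its output as this model's own output.
            embedding = pb_utils.get_output_tensor_by_name(infer_response, "EMBEDDING")
            responses.append(pb_utils.InferenceResponse(output_tensors=[embedding]))
        return responses
```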
**Ingestion pipeline:**

```
Audio File ──▶ Whisper ASR ──▶ Transcript
                                   │
                                   ▼
                         Chunking (120 words)
                                   │
                                   ▼
                          BGE-M3 Embeddings
                                   │
                                   ▼
                             Qdrant Store
```
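Chunking is a sliding window over words, with the defaults (120-word chunks, 24-word overlap) set in `conf/config.yaml`. A self-contained sketch of the idea; the actual implementation in `service.py` may differ:

```python
# Sliding-window word chunking (defaults: 120 words, 24-word overlap).
def chunk_text(text: str, chunk_words: int = 120, overlap_words: int = 24) -> list[str]:
    words = text.split()
    step = chunk_words - overlap_words
    return [
        " ".join(words[i : i + chunk_words])
        for i in range(0, max(len(words) - overlap_words, 1), step)
    ]

print(len(chunk_text("word " * 300)))  # 3 chunks for a 300-word transcript
```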
**Query pipeline:**

```
Question ──▶ BGE-M3 Embed ──▶ Qdrant Search
                                    │
                                    ▼
                                Reranker
                                    │
                                    ▼
                                LLM Qwen
                                    │
                                    ▼
                           Answer + Citations
```
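End to end, a query is a single Triton inference call against `query_bls`. A hedged sketch with `tritonclient`; the tensor names `QUESTION` and `ANSWER` are placeholders, so check the model's `config.pbtxt` for the real contract:

```python
# One-shot query against the query_bls pipeline (tensor names are placeholders).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

question = np.array([b"What topics were discussed?"], dtype=np.object_)
inp = httpclient.InferInput("QUESTION", question.shape, "BYTES")
inp.set_data_from_numpy(question)

result = client.infer(model_name="query_bls", inputs=[inp])
print(result.as_numpy("ANSWER")[0].decode("utf-8"))
```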
### Models

| Model | Type | Size | Purpose |
|---|---|---|---|
| Whisper | ASR | tiny/base/small/medium | Speech-to-text transcription |
| BGE-M3 | Embedding | 568M | Multilingual text embeddings (1024 dim) |
| BGE-Reranker-v2-m3 | Reranker | 560M | Cross-encoder relevance scoring |
| Qwen2.5-0.5B-Instruct | LLM | 0.5B | Contextual answer generation |
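To sanity-check the 1024-dimensional embeddings outside Triton, BGE-M3 can be loaded locally with the FlagEmbedding package (an extra dependency, and an assumption here; the repo's `embedders/bge.py` may wrap it differently):

```python
# Local BGE-M3 dense embeddings; expect shape (2, 1024).
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)  # CPU-friendly
output = model.encode(
    ["What was discussed about exchange rates?", "Курс доллара к рублю"],
    max_length=512,
)
print(output["dense_vecs"].shape)  # (2, 1024)
```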
## Project Structure

```
audio-rag/
├── audio_rag/                    # Main Python package
│   ├── embedders/                # Text embedding implementations
│   │   ├── bge.py                # BGE-M3 local embedder
│   │   ├── triton_bge.py         # BGE-M3 Triton client
│   │   └── hashing.py            # Deterministic embedder (testing)
│   ├── stores/                   # Vector store implementations
│   │   ├── qdrant_store.py       # Qdrant vector database
│   │   └── jsonl_store.py        # JSONL file-based store
│   ├── service.py                # Core RAG business logic
│   ├── cli.py                    # Command-line interface
│   ├── config.py                 # Configuration loader
│   ├── factories.py              # Component factories
│   └── utils/                    # Utilities
│       └── logging.py            # Logging configuration
│
├── model_repo/                   # Triton model repository
│   ├── asr_whisper/              # Whisper ASR model
│   ├── bge_embedder/             # BGE-M3 embedding model
│   ├── reranker/                 # BGE reranker model
│   ├── ingest_bls/               # Ingestion orchestration
│   ├── query_bls/                # Query orchestration
│   └── llm_qwen/                 # LLM answer generation
│
├── conf/                         # Hydra configuration
│   └── config.yaml               # Main configuration
│
├── tests/                        # Test suite
│   ├── test_mvp.py               # Integration tests
│   ├── test_triton_client.py     # Triton client tests
│   └── test_audio_workflow.py    # Audio workflow tests
│
├── docker-compose.yml            # Docker services
├── Dockerfile.triton             # Triton container
├── .env.example                  # Environment template
├── pyproject.toml                # Project metadata
└── requirements-*.txt            # Dependencies
```
## Configuration

### Environment Variables

| Variable | Description | Default |
|---|---|---|
| `QDRANT_HOST` | Qdrant server hostname | `localhost` |
| `QDRANT_PORT` | Qdrant server port | `6333` |
| `TRITON_SERVER` | Set to `"true"` inside Triton | `false` |
| `AUDIO_RAG_ASR_MODEL_SIZE` | Whisper model (tiny/base/small/medium) | `tiny` |
| `AUDIO_RAG_ASR_DEVICE` | ASR device (cpu/cuda) | `cpu` |
| `AUDIO_RAG_LLM_MODEL` | LLM model name | `Qwen/Qwen2.5-0.5B-Instruct` |
| `AUDIO_RAG_LLM_DEVICE` | LLM device (cpu/cuda) | `cpu` |
| `AUDIO_RAG_LLM_MAX_TOKENS` | Max tokens for the response | `512` |
| `AUDIO_RAG_USE_LLM` | Enable/disable the LLM | `true` |
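In client code these resolve through ordinary environment lookups, with the defaults above as fallbacks; a minimal sketch of the pattern (the project's actual loader lives in `config.py`):

```python
# Environment-driven settings, mirroring the table above.
import os

qdrant_host = os.environ.get("QDRANT_HOST", "localhost")
qdrant_port = int(os.environ.get("QDRANT_PORT", "6333"))
use_llm = os.environ.get("AUDIO_RAG_USE_LLM", "true").lower() == "true"
```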
### Hydra Configuration

Edit `conf/config.yaml` for advanced settings:

```yaml
chunking:
  chunk_words: 120     # Words per chunk
  overlap_words: 24    # Overlap between chunks

retrieval:
  default_top_k: 5     # Number of results to retrieve

qdrant:
  collection_name: audio_rag_chunks
  vector_size: 1024    # BGE-M3 embedding dimension

bge:
  model_name: BAAI/bge-m3
  device: cpu
  max_length: 512
```
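Because the configuration is Hydra/OmegaConf-based, it can also be read programmatically; a small sketch with `omegaconf`, using field names from the YAML above:

```python
# Read the Hydra config without launching a Hydra app.
from omegaconf import OmegaConf

cfg = OmegaConf.load("conf/config.yaml")
print(cfg.chunking.chunk_words)      # 120
print(cfg.qdrant.collection_name)    # audio_rag_chunks
```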
### GPU Support

For GPU acceleration, install the NVIDIA Container Toolkit and update `docker-compose.yml`:

```yaml
services:
  triton:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      AUDIO_RAG_ASR_DEVICE: cuda
      AUDIO_RAG_LLM_DEVICE: cuda
```
## Development

### Running Tests

```bash
# Start Qdrant (required for tests)
docker-compose up -d qdrant

# Run all tests
pytest tests/ -v

# Run specific test
pytest tests/test_mvp.py -v

# Run with coverage
pytest tests/ --cov=audio_rag
```
pip install -e ".[dev]"
# Format code
black audio_rag/ tests/
# Lint
flake8 audio_rag/ tests/
# Type check
mypy audio_rag/# Build Triton image
### Docker Operations

```bash
# Build Triton image
docker-compose build triton

# Rebuild without cache
docker-compose build --no-cache triton

# View logs
docker-compose logs -f triton

# Stop all services
docker-compose down

# Remove volumes (clears all data)
docker-compose down -v
```

## Troubleshooting

### Models Fail to Load

**Symptoms:** Triton logs show model loading errors or timeouts.
**Solutions:**

1. Check disk space (models require ~3 GB):

   ```bash
   df -h
   ```

2. Check the internet connection (models download from Hugging Face):

   ```bash
   curl -I https://huggingface.co
   ```

3. Check the logs for errors:

   ```bash
   docker-compose logs triton | grep -i error
   ```

4. Restart with a clean state:

   ```bash
   docker-compose down -v && docker-compose up -d
   ```
### Cannot Connect to Qdrant

**Symptoms:** `ConnectionRefusedError: [Errno 111] Connection refused`

**Solutions:**

1. Check that Qdrant is running:

   ```bash
   docker-compose ps qdrant
   ```

2. Verify Qdrant health:

   ```bash
   curl http://localhost:6333/collections
   ```

3. Check the environment:
   - Inside Docker: `QDRANT_HOST=qdrant`
   - Local client: `QDRANT_HOST=localhost`
### Out of Memory

**Symptoms:** Container crashes or becomes unresponsive.

**Solutions:**

1. Use smaller models:

   ```yaml
   AUDIO_RAG_ASR_MODEL_SIZE: tiny
   AUDIO_RAG_LLM_MODEL: Qwen/Qwen2.5-0.5B-Instruct
   ```

2. Disable the LLM:

   ```yaml
   AUDIO_RAG_USE_LLM: "false"
   ```

3. Increase the Docker memory allocation (4 GB+ recommended).
### Slow Model Loading

**Symptoms:** Models take more than 15 minutes to load.

This is normal on the first run: the models (~3 GB) are downloaded from Hugging Face. Subsequent starts are faster.

Check download progress:

```bash
docker-compose logs triton | grep "Loading"
```

## Roadmap

- Multiple audio format support (WAV, M4A, FLAC)
- Batch ingestion for multiple files
- REST API endpoint
- Improved error messages
- Telegram bot integration
- Web UI for podcast management
- Multi-language UI
- RAG evaluation metrics
- Custom embedding model support
- Audio segment playback
- Speaker diarization
- Real-time transcription
- Cloud deployment guides
## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run the tests (`pytest tests/ -v`)
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
### Code Style

- Follow PEP 8 guidelines
- Use type hints
- Write docstrings for all functions
- Keep functions under 50 lines
- Add tests for new features
### Reporting Bugs

When reporting a bug, please include:
- Python version
- Docker version
- Operating system
- Steps to reproduce
- Expected vs actual behavior
- Relevant logs
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- NVIDIA Triton Inference Server - Scalable model serving
- OpenAI Whisper - Speech recognition
- BAAI BGE-M3 - Multilingual embeddings
- Qwen Team - Qwen LLM models
- Qdrant - Vector database
- Hydra - Configuration framework