# Audio RAG System with Whisper and Milvus Lite

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system for audio files using:
- **Whisper** for audio transcription
- **Milvus Lite** for vector storage and retrieval
- **OpenAI text-embedding-3-small** for embeddings
- **OpenAI GPT** for generation

## Features:
- Transcribe MP3 audio files using Whisper
- Store transcriptions with embeddings in Milvus Lite
- Query audio content using RAG
- Retrieve relevant audio segments based on queries


In [None]:
# Install required packages
# Note: ffmpeg must be installed at system level (not via pip)
# On macOS: brew install ffmpeg
# On Linux: sudo apt-get install ffmpeg
# On Windows: Download from https://ffmpeg.org/download.html
%pip install openai-whisper pymilvus openai pydub



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Check if ffmpeg is installed at system level
import subprocess
import shutil

def check_ffmpeg():
    """Check if ffmpeg binary is available in system PATH."""
    ffmpeg_path = shutil.which("ffmpeg")
    if ffmpeg_path:
        # Get version to confirm it's working
        try:
            result = subprocess.run(
                ["ffmpeg", "-version"],
                capture_output=True,
                text=True,
                timeout=5
            )
            if result.returncode == 0:
                version_line = result.stdout.split('\n')[0]
                print(f"✓ ffmpeg is installed: {ffmpeg_path}")
                print(f"  {version_line}")
                return True
        except Exception as e:
            print(f"✗ ffmpeg found but not working: {e}")
            return False
    else:
        print("✗ ffmpeg is NOT installed at system level")
        print("\nTo install ffmpeg on macOS:")
        print("  1. Install Homebrew if you don't have it: /bin/bash -c \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\"")
        print("  2. Install ffmpeg: brew install ffmpeg")
        print("\nOr on Linux:")
        print("  sudo apt-get update && sudo apt-get install -y ffmpeg")
        print("\nOr on Windows:")
        print("  Download from https://ffmpeg.org/download.html or use: choco install ffmpeg")
        return False

# Check ffmpeg availability
has_ffmpeg = check_ffmpeg()
if not has_ffmpeg:
    print("\n⚠️  Please install ffmpeg before proceeding with audio transcription!")


In [None]:
# Import necessary libraries
import os
import whisper
from pymilvus import MilvusClient
from openai import OpenAI
from pathlib import Path
import json
from typing import List, Dict, Tuple

# Set up OpenAI API key (set this as an environment variable or replace with your key)
# Option 1: Set environment variable before running: export OPENAI_API_KEY="your-key-here"
# Option 2: Uncomment and set your API key below:
os.environ["OPENAI_API_KEY"] = ""

# Initialize OpenAI client
openai_client = OpenAI()


In [3]:
# Initialize Whisper model
print("Loading Whisper model...")
whisper_model = whisper.load_model("base")  # Options: tiny, base, small, medium, large
print("Whisper model loaded successfully!")


Loading Whisper model...
Whisper model loaded successfully!


In [4]:
# Function to transcribe audio file
def transcribe_audio(audio_path: str, include_timestamps: bool = True) -> Dict:
    """
    Transcribe an audio file using Whisper.
    
    Args:
        audio_path: Path to the audio file (MP3, WAV, etc.)
        include_timestamps: Whether to include word-level timestamps
    
    Returns:
        Dictionary with transcription and metadata
    """
    print(f"Transcribing: {audio_path}")
    
    # Transcribe audio
    result = whisper_model.transcribe(
        audio_path,
        word_timestamps=include_timestamps
    )
    
    return {
        "text": result["text"],
        "segments": result.get("segments", []),
        "language": result.get("language", "unknown"),
        "audio_path": audio_path
    }

# Test transcription (replace with your audio file path)
# audio_file = "path/to/your/audio.mp3"
# transcription = transcribe_audio(audio_file)
# print(f"Transcription: {transcription['text']}")


In [5]:
# Function to chunk text with overlap
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[Dict]:
    """
    Split text into chunks with overlap.
    
    Args:
        text: Text to chunk
        chunk_size: Size of each chunk in characters
        overlap: Number of characters to overlap between chunks
    
    Returns:
        List of chunks with metadata
    """
    chunks = []
    start = 0
    
    while start < len(text):
        end = start + chunk_size
        chunk_text = text[start:end].strip()
        
        if chunk_text:
            chunks.append({
                "text": chunk_text,
                "start_char": start,
                "end_char": end,
                "chunk_id": len(chunks)
            })
        
        # Move start position accounting for overlap
        start = end - overlap
        if start >= len(text):
            break
    
    return chunks

# Example
# text = "This is a long text that needs to be chunked..."
# chunks = chunk_text(text, chunk_size=100, overlap=20)
# print(f"Created {len(chunks)} chunks")


In [6]:
# Function to create embeddings using OpenAI
def create_embeddings(texts: List[str], batch_size: int = 100) -> List[List[float]]:
    """
    Create embeddings for a list of texts using OpenAI text-embedding-3-small.
    
    Args:
        texts: List of texts to embed
        batch_size: Number of texts to process in each batch
    
    Returns:
        List of embedding vectors
    """
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        print(f"Creating embeddings for batch {i//batch_size + 1}/{(len(texts) + batch_size - 1)//batch_size}")
        
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    
    return all_embeddings

# Example
# texts = ["Hello world", "How are you?"]
# embeddings = create_embeddings(texts)
# print(f"Created {len(embeddings)} embeddings, each with {len(embeddings[0])} dimensions")


In [7]:
# Initialize Milvus Lite client
collection_name = "audio_rag_collection"
milvus_client = MilvusClient(uri="./audio_rag_milvus.db")

# Get embedding dimension (text-embedding-3-small has 1536 dimensions)
# Let's create a test embedding to get the dimension
test_embedding = create_embeddings(["test"])[0]
embedding_dim = len(test_embedding)
print(f"Embedding dimension: {embedding_dim}")

# Create collection if it doesn't exist
if not milvus_client.has_collection(collection_name):
    milvus_client.create_collection(
        collection_name=collection_name,
        dimension=embedding_dim,
        metric_type="COSINE"
    )
    print(f"Created collection: {collection_name}")
else:
    print(f"Collection {collection_name} already exists")


  from pkg_resources import DistributionNotFound, get_distribution


Creating embeddings for batch 1/1
Embedding dimension: 1536
Collection audio_rag_collection already exists


In [22]:
# Function to process and store audio file in Milvus
def process_audio_file(audio_path: str, chunk_size: int = 500, overlap: int = 50) -> Dict:
    """
    Transcribe audio, chunk text, create embeddings, and store in Milvus.
    
    Args:
        audio_path: Path to audio file
        chunk_size: Size of text chunks
        overlap: Overlap between chunks
    
    Returns:
        Dictionary with processing results
    """
    # Step 1: Transcribe audio
    # Convert to Path object and resolve to absolute path for better reliability
    audio_path_obj = Path(audio_path).expanduser().resolve()
    
    if not audio_path_obj.exists():
        raise FileNotFoundError(f"Audio file not found: {audio_path_obj}")
    
    transcription = transcribe_audio(str(audio_path_obj))
    print(f"Transcription completed. Length: {len(transcription['text'])} characters")
    
    # Step 2: Chunk the transcription
    chunks = chunk_text(transcription['text'], chunk_size=chunk_size, overlap=overlap)
    print(f"Created {len(chunks)} chunks")
    
    # Step 3: Create embeddings for chunks
    chunk_texts = [chunk['text'] for chunk in chunks]
    embeddings = create_embeddings(chunk_texts)
    print(f"Created {len(embeddings)} embeddings")
    
    # Step 4: Prepare data for Milvus
    audio_filename = audio_path_obj.name
    
    # Get current collection count to generate unique IDs
    # Note: In production, consider using UUID or hash-based IDs
    try:
        collection_stats = milvus_client.get_collection_stats(collection_name)
        next_id = collection_stats.get('row_count', 0)
    except:
        next_id = 0
    
    data_to_insert = []
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        data_to_insert.append({
            "id": next_id + i,
            "vector": embedding,
            "text": chunk['text'],
            "audio_file": audio_filename,
            "chunk_id": chunk['chunk_id'],
            "start_char": chunk['start_char'],
            "end_char": chunk['end_char']
        })
    
    # Step 5: Insert into Milvus
    insert_result = milvus_client.insert(
        collection_name=collection_name,
        data=data_to_insert
    )
    
    print(f"Inserted {insert_result['insert_count']} chunks into Milvus")
    
    return {
        "audio_file": audio_filename,
        "chunks_count": len(chunks),
        "insert_count": insert_result['insert_count'],
        "transcription": transcription['text']
    }

# Example usage:
# result = process_audio_file("path/to/your/audio.mp3")


In [18]:
# Function to search for relevant chunks
def search_chunks(query: str, top_k: int = 5) -> List[Dict]:
    """
    Search for relevant chunks using semantic similarity.
    
    Args:
        query: Search query text
        top_k: Number of top results to return
    
    Returns:
        List of relevant chunks with metadata
    """
    # Create embedding for query
    query_embedding = create_embeddings([query])[0]
    
    # Search in Milvus
    results = milvus_client.search(
        collection_name=collection_name,
        data=[query_embedding],
        limit=top_k,
        output_fields=["text", "audio_file", "chunk_id", "start_char", "end_char"]
    )
    
    # Format results
    retrieved_chunks = []
    for hits in results:
        for hit in hits:
            retrieved_chunks.append({
                "text": hit['entity']['text'],
                "audio_file": hit['entity']['audio_file'],
                "chunk_id": hit['entity']['chunk_id'],
                "similarity": 1 - hit['distance'],  # Convert distance to similarity
                "metadata": {
                    "start_char": hit['entity']['start_char'],
                    "end_char": hit['entity']['end_char']
                }
            })
    
    return retrieved_chunks

# Example
# query = "What was discussed about AI?"
# results = search_chunks(query, top_k=3)
# for i, result in enumerate(results, 1):
#     print(f"\nResult {i} (Similarity: {result['similarity']:.3f}):")
#     print(f"Audio: {result['audio_file']}")
#     print(f"Text: {result['text'][:200]}...")


In [19]:
# Function to answer questions using RAG
def rag_query(question: str, top_k: int = 3, model: str = "gpt-4o-mini") -> Dict:
    """
    Answer a question using Retrieval-Augmented Generation.
    
    Args:
        question: Question to answer
        top_k: Number of relevant chunks to retrieve
        model: OpenAI model to use for generation
    
    Returns:
        Dictionary with answer and sources
    """
    # Step 1: Retrieve relevant chunks
    retrieved_chunks = search_chunks(question, top_k=top_k)
    
    # Step 2: Build context from retrieved chunks
    context = "\n\n".join([
        f"[From {chunk['audio_file']}, Chunk {chunk['chunk_id']}]:\n{chunk['text']}"
        for chunk in retrieved_chunks
    ])
    
    # Step 3: Create prompt
    prompt = f"""Use the following pieces of context from audio transcriptions to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
{context}

Question: {question}

Answer:"""
    
    # Step 4: Generate answer using OpenAI
    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on audio transcriptions."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=500
    )
    
    answer = response.choices[0].message.content
    
    # Step 5: Return answer with sources
    return {
        "question": question,
        "answer": answer,
        "sources": retrieved_chunks,
        "num_sources": len(retrieved_chunks)
    }

# Example
# result = rag_query("What was the main topic discussed?")
# print(f"Question: {result['question']}")
# print(f"\nAnswer: {result['answer']}")
# print(f"\nSources ({result['num_sources']}):")
# for i, source in enumerate(result['sources'], 1):
#     print(f"{i}. {source['audio_file']} (Similarity: {source['similarity']:.3f})")


## Example Usage

Now let's use the system to process audio files and query them:


In [11]:
%pip install ffmpeg



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [23]:
# First, let's verify the file exists
import os
from pathlib import Path

audio_file_path = '/Users/tanmaydhote/Downloads/1.mp3'

# Check if file exists
audio_path_obj = Path(audio_file_path).expanduser().resolve()
print(f"Checking file: {audio_path_obj}")
print(f"File exists: {audio_path_obj.exists()}")
if audio_path_obj.exists():
    print(f"File size: {audio_path_obj.stat().st_size / 1024 / 1024:.2f} MB")
    print(f"File is readable: {os.access(audio_path_obj, os.R_OK)}")
    
    # Now process it
    result = process_audio_file(audio_file_path)
    print(f"\nProcessed: {result['audio_file']}")
    print(f"Chunks created: {result['chunks_count']}")
    print(f"\nFirst 500 characters of transcription:")
    print(result['transcription'][:500])
else:
    print(f"\n❌ File not found! Please check the path.")
    print(f"Current working directory: {os.getcwd()}")


Checking file: /Users/tanmaydhote/Downloads/1.mp3
File exists: True
File size: 12.60 MB
File is readable: True
Transcribing: /Users/tanmaydhote/Downloads/1.mp3




Transcription completed. Length: 10261 characters
Created 23 chunks
Creating embeddings for batch 1/1
Created 23 embeddings
Inserted 23 chunks into Milvus

Processed: 1.mp3
Chunks created: 23

First 500 characters of transcription:
 The neat thing about working in machine learning is that every few years, somebody invents something crazy that makes you totally reconsider what's possible. Like models that can play go or generate hyper-realistic faces. And today, the mind-blowing discovery that's rocking everyone's world is a type of neural network called a transformer. Transformers are models that can translate text, write poems and op-eds, and even generate computer code. These are going to be used in biology to solve the 


In [26]:
# Query the audio RAG system
question = "What are transformers?"

result = rag_query(question, top_k=3)
# 
print("=" * 80)
print(f"Question: {result['question']}")
print("=" * 80)
print(f"\nAnswer:\n{result['answer']}")
print("\n" + "=" * 80)
print(f"\nSources ({result['num_sources']}):")
print("=" * 80)
for i, source in enumerate(result['sources'], 1):
     print(f"\n[{i}] {source['audio_file']} (Similarity: {source['similarity']:.3f})")
     print(f"    {source['text'][:200]}...")


Creating embeddings for batch 1/1
Question: What are transformers?

Answer:
Transformers are a type of neural network architecture that are very effective for analyzing complicated data types like images, videos, audio, and text. They were developed in 2017 by researchers at Google and the University of Toronto, initially designed for translation tasks. Unlike recurrent neural networks, transformers can be efficiently parallelized, allowing for the training of very large models. They are the basis for popular machine learning models like BERT, GPT-3, and T5.


Sources (3):

[1] 1.mp3 (Similarity: 0.392)
    hese are going to be used in biology to solve the protein-folding problem. Transformers are like this magical machine learning hammer that seems to make every problem into an owl. If you've heard of t...

[2] 1.mp3 (Similarity: 0.395)
    going to tell you about what transformers are, how they work, and why they've been so impactful. Let's get to it. So what is a transformer? It's a

In [None]:
# Process multiple audio files from a directory
def process_audio_directory(directory_path: str, chunk_size: int = 500, overlap: int = 50):
    """
    Process all audio files in a directory.
    
    Args:
        directory_path: Path to directory containing audio files
        chunk_size: Size of text chunks
        overlap: Overlap between chunks
    """
    audio_extensions = {'.mp3', '.wav', '.m4a', '.ogg', '.flac', '.mp4'}
    directory = Path(directory_path)
    
    audio_files = [f for f in directory.iterdir() if f.suffix.lower() in audio_extensions]
    
    print(f"Found {len(audio_files)} audio files")
    
    results = []
    for audio_file in audio_files:
        try:
            result = process_audio_file(str(audio_file), chunk_size=chunk_size, overlap=overlap)
            results.append(result)
            print(f"✓ Processed: {audio_file.name}")
        except Exception as e:
            print(f"✗ Error processing {audio_file.name}: {e}")
    
    print(f"\nTotal processed: {len(results)}/{len(audio_files)}")
    return results

# Example
# audio_directory = "/path/to/audio/files"
# results = process_audio_directory(audio_directory)


## Summary

This notebook implements a complete Audio RAG system:

### Workflow:
1. **Transcription**: Use Whisper to transcribe MP3 (or other audio) files to text
2. **Chunking**: Split transcriptions into smaller, overlapping chunks for better retrieval
3. **Embedding**: Generate embeddings using OpenAI `text-embedding-3-small` model
4. **Storage**: Store embeddings and metadata in Milvus Lite vector database
5. **Retrieval**: Search for relevant chunks using cosine similarity
6. **Generation**: Use OpenAI GPT to generate answers based on retrieved context

### Key Functions:
- `transcribe_audio()`: Transcribe audio files using Whisper
- `chunk_text()`: Split text into chunks with overlap
- `create_embeddings()`: Generate embeddings using OpenAI
- `process_audio_file()`: Complete pipeline to process and store audio
- `search_chunks()`: Retrieve relevant chunks for a query
- `rag_query()`: Answer questions using RAG

### Features:
- Supports multiple audio formats (MP3, WAV, M4A, OGG, FLAC, MP4)
- Batch processing of multiple audio files
- Semantic search with similarity scores
- Source attribution for answers
- No LangChain dependency - pure OpenAI and Milvus Lite
