A web application that streams microphone audio in 20ms chunks via WebSocket to a Python backend with Advanced Multi-Feature VAD using RMS Energy, Spectral Flatness, and Zero Crossing Rate.
- Backend: Python with Advanced Signal Processing VAD
- VAD Algorithm: RMS Energy + Spectral Flatness + Zero Crossing Rate
- Frontend: TypeScript with Web Audio API
- Communication: WebSocket real-time streaming
- Audio Processing: 20ms chunks at 44.1kHz → 16kHz for VAD
- RMS Energy: Root Mean Square energy with adaptive noise floor estimation
- Spectral Flatness: Wiener entropy to distinguish tonal (speech) vs noisy content
- Zero Crossing Rate: Detects speech patterns vs noise characteristics
- Adaptive Thresholds: Automatically adjusts to background noise
- Hangover Logic: 240ms minimum speech duration for stability
- Majority Voting: Requires 2/3 features to agree for speech detection
- Feature Smoothing: Moving average over 10 frames for stability
- Spectral Analysis: Additional centroid and rolloff for context
- Measures overall signal power
- Adaptive noise floor estimation
- Robust to different microphone sensitivities
- Lower values = More tonal content (speech)
- Higher values = More noisy content (background noise)
- Excellent for distinguishing speech from noise
- Based on geometric vs arithmetic mean of spectrum
- Speech has characteristic ZCR patterns
- Noise typically has different ZCR characteristics
- Helps distinguish voiced vs unvoiced speech
- Python 3.8+
- pip package manager
- Modern web browser with Web Audio API support
- Install Python dependencies:
pip install -r requirements.txt- Compile TypeScript frontend:
npm install
npm run build- Start the Python VAD server:
python3 server.py- Start the frontend server (in another terminal):
python3 serve_frontend.py- Open your browser and go to
http://localhost:3002
- Click "Connect to Python VAD" to establish WebSocket connection
- Click "Start Recording" to begin streaming microphone audio
- Audio will be sent in 20ms chunks to the Python backend
- Check the Python server console to see detailed VAD analysis:
- 🎤 SPEECH DETECTED with RMS, ZCR, and Spectral Flatness values
- 🔇 SILENCE detection with noise floor information
- Feature-by-feature decision breakdown
- Confidence scores and voting results
- Sample Rate: 44.1kHz (frontend) → 16kHz (VAD processing)
- Chunk Duration: 20ms
- Audio Format: Float32 PCM → Base64 → JSON WebSocket
- VAD Features: RMS Energy, Spectral Flatness, Zero Crossing Rate
- Decision Logic: Majority voting (2/3 features must agree)
- Hangover: 240ms for stable detection
- Adaptation: First 50 frames used for noise floor estimation
- Smoothing: 10-frame moving average for all features
- Window Function: Hann window for spectral analysis
2025-09-30T10:43:53.023Z - INFO - Frame 1 (2025-09-30T10:43:53.023Z):
2025-09-30T10:43:53.023Z - INFO - Buffer length: 10924
2025-09-30T10:43:53.023Z - INFO - Audio samples: 882
2025-09-30T10:43:53.023Z - INFO - 🎤 SPEECH DETECTED - Confidence: 0.667
2025-09-30T10:43:53.023Z - INFO - RMS Energy: 0.023456, ZCR: 0.125
2025-09-30T10:43:53.023Z - INFO - Spectral Flatness: 0.234
2025-09-30T10:43:53.023Z - INFO - Spectral Centroid: 1250.5Hz
2025-09-30T10:43:53.023Z - INFO - Speech Votes: 2/3
2025-09-30T10:43:53.023Z - INFO - Feature Decisions: RMS=True, ZCR=True, Flatness=False
2025-09-30T10:43:53.023Z - INFO - ---
You can modify VAD parameters in server.py:
self.rms_threshold = 0.01
self.zcr_threshold = 0.1
self.spectral_flatness_threshold = 0.3
self.hangover_frames = 12 # ~240ms
self.min_speech_frames = 3- Accuracy: >90% voice activity detection
- Robustness: Handles noise, echo, and poor audio quality
- Latency: <50ms processing time
- Adaptability: Automatically adjusts to different environments
- Scalability: Can handle multiple concurrent connections
- Formula: √(Σ(x²)/N)
- Purpose: Measures overall signal power
- Adaptive: Noise floor estimation from first 50 frames
- Formula: (∏|X(k)|)^(1/N) / (Σ|X(k)|)/N
- Purpose: Distinguishes tonal vs noisy content
- Range: 0 (pure tone) to 1 (white noise)
- Speech: Typically 0.1-0.4 (more tonal)
- Noise: Typically 0.5-1.0 (more noisy)
- Formula: Σ(sign(x[i]) ≠ sign(x[i+1])) / N
- Purpose: Detects signal characteristics
- Speech: Variable patterns based on phonemes
- Noise: Different statistical properties