s260o/VoiceAssistant
Voice Assistant

A multifunctional AI voice assistant that integrates a local LLM (Ollama), speech-to-text (Whisper), and text-to-speech (VoiceVox) for a seamless voice interaction experience. It supports information retrieval via web search, application launching, and media playback control.

Overview

This project is a Python-based voice assistant designed to run locally on Windows. It features a GUI for visual feedback and a robust backend for handling voice commands. The assistant can:

  • Understand natural language queries in Japanese.
  • Perform hybrid searches (Wikipedia + DuckDuckGo + Specialized Sites like Qiita/Zenn).
  • Launch local applications (Notepad, Calculator, Browser, etc.).
  • Speak responses using high-quality TTS (VoiceVox).

Usage Flow

  1. Launch: Run main.py (after ensuring prerequisites are met). The GUI will appear.
  2. Speak: The system automatically detects voice activity (VAD). Speak your command or question clearly.
    • Example: "今日のニュースを教えて" (Tell me today's news)
    • Example: "メモ帳を開いて" (Open Notepad)
  3. Transcribe: The audio is converted to text using faster-whisper.
  4. Think: The AI (Ollama running Llama 3.2) analyzes the intent.
    • If a search is needed, it queries the web first.
    • If a tool is needed (Open App, Music), it executes the tool.
  5. Reply: The AI generates a concise response in Japanese.
  6. Speak Back: The response is read aloud using VoiceVox.
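The endpointing in step 2 can be sketched as a small frame-counting state machine. This is an illustrative reimplementation, not the project's actual code: it assumes webrtcvad has already reduced each 30 ms frame to a voiced/unvoiced boolean, and the `endpoint` helper name is hypothetical.

```python
def endpoint(frames, start_voiced_frames=5, end_silence_frames=33):
    """Find the first utterance in a sequence of per-frame booleans
    (True = voiced). Returns (start, end) frame indices, where end is
    the first frame of the closing silence, or None if speech never
    starts. end_silence_frames ~ end_silence_duration_ms / frame_ms
    (1000 / 30 ≈ 33 with the sample config above)."""
    voiced_run = 0    # consecutive voiced frames before speech starts
    silence_run = 0   # consecutive silent frames after speech starts
    start = None
    for i, voiced in enumerate(frames):
        if start is None:
            # Waiting for speech: require start_voiced_frames in a row.
            voiced_run = voiced_run + 1 if voiced else 0
            if voiced_run >= start_voiced_frames:
                start = i - start_voiced_frames + 1
        else:
            # In speech: require end_silence_frames of silence to stop.
            silence_run = 0 if voiced else silence_run + 1
            if silence_run >= end_silence_frames:
                return (start, i - end_silence_frames + 1)
    return None if start is None else (start, len(frames))
```

Raising `start_voiced_frames` makes triggering less sensitive to short noises; raising `end_silence_frames` tolerates longer pauses mid-sentence.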

Technical Stack

  • Language: Python 3.10+
  • GUI: Tkinter (Standard Python GUI)
  • Speech-to-Text (STT): faster-whisper (Optimized Whisper implementation)
  • Large Language Model (LLM): Ollama running llama3.2:3b
  • Text-to-Speech (TTS): VoiceVox (Local HTTP Server)
  • Audio I/O: sounddevice, soundfile
  • Voice Activity Detection: webrtcvad
  • Search: duckduckgo_search, wikipedia
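As a rough sketch of how the LLM step talks to Ollama over HTTP, the following uses only the standard library and Ollama's /api/chat endpoint; the project itself may use a client library, the system prompt default is a placeholder, and the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # matches the config.json default

def build_chat_payload(history, user_text,
                       system_prompt="...", model="llama3.2:3b"):
    """Assemble the message list Ollama expects, oldest turn first.
    history holds prior {"role", "content"} dicts, capped by max_turns."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, "stream": False}

def chat(payload):
    """POST the payload to /api/chat and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With `"stream": False`, Ollama returns a single JSON object instead of newline-delimited chunks, which keeps the parsing trivial.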

Configuration

All configurable settings are stored in config.json. The // comments in the listing below are annotations only; the actual file must be plain JSON, which does not allow comments.

```jsonc
{
    "audio": {
        "sample_rate": 16000,          // Audio sample rate
        "frame_ms": 30,                // Frame duration for VAD
        "vad_mode": 3,                 // VAD aggressiveness (0-3)
        "start_voiced_frames": 5,      // Frames to trigger speech start
        "end_silence_duration_ms": 1000 // Silence duration to end speech
    },
    "whisper": {
        "model_size": "medium",        // Model size (tiny, base, small, medium, large-v2)
        "device": "cuda",              // "cuda" for GPU, "cpu" for CPU
        "compute_type": "int8"         // Quantization (float16, int8)
    },
    "ollama": {
        "base_url": "http://localhost:11434", // Ollama API URL
        "model": "llama3.2:3b",        // Model tag
        "max_turns": 10                // Context history limit
    },
    "voicevox": {
        "base_url": "http://localhost:50021", // VoiceVox API URL
        "speaker_id": 3                // Speaker ID (3 = Zundamon Normal)
    },
    "prompts": {
        "system_prompt": "...",        // Main persona prompt
        "intent_router_prompt": "..."  // Search intent classification prompt
    }
}
```
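Loading this file is a plain `json.load`; a minimal sketch with fallback defaults follows. The `DEFAULTS` values mirror the sample above, and `load_config` is an illustrative name, not necessarily the project's API.

```python
import json

# Fallbacks for settings missing from config.json (values from the sample).
DEFAULTS = {
    "audio": {"sample_rate": 16000, "frame_ms": 30, "vad_mode": 3,
              "start_voiced_frames": 5, "end_silence_duration_ms": 1000},
    "whisper": {"model_size": "medium", "device": "cuda",
                "compute_type": "int8"},
}

def load_config(path="config.json"):
    """Read config.json and overlay it onto the defaults, per section."""
    merged = {k: dict(v) for k, v in DEFAULTS.items()}
    with open(path, encoding="utf-8") as f:
        user = json.load(f)  # plain JSON: no // comments in the real file
    for section, values in user.items():
        merged.setdefault(section, {}).update(values)
    return merged
```

Section-level merging lets a user override only `whisper.device`, say, without restating the rest of that section.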

Important Notes

  1. Ollama: Must be installed and running on port 11434.
    • Ensure you have pulled the model: ollama pull llama3.2:3b
  2. VoiceVox: Must be installed and running (Engine) on port 50021.
  3. GPU: A CUDA-capable GPU is highly recommended for faster-whisper and Ollama for acceptable latency. If using CPU, change whisper.device to "cpu" in config.json (will be significantly slower).
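A quick preflight check for notes 1 and 2 might look like this sketch. It assumes the default ports from config.json; /api/tags and /version are the standard list-models and version endpoints of Ollama and the VoiceVox engine, and `preflight` is an illustrative name.

```python
import urllib.request

def is_up(url, timeout=2.0):
    """Return True if an HTTP GET to url succeeds, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        return False

def preflight():
    """Check that both backend servers respond before starting the GUI."""
    checks = {
        "Ollama":   is_up("http://localhost:11434/api/tags"),
        "VoiceVox": is_up("http://localhost:50021/version"),
    }
    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'NOT REACHABLE'}")
    return all(checks.values())
```

Running `preflight()` before `main.py` starts listening makes a missing backend an explicit error rather than a silent hang.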

License

This project is licensed under the MIT License.
