A tool for converting difficult-to-understand speech (such as heavily accented English) into a clear, pleasant voice, with disfluencies removed along the way. The tool can:
- Extract audio from YouTube videos or local files
- Transcribe speech with high accuracy using OpenAI's Whisper (locally)
- Remove speech disfluencies (um, uh, false starts, repetitions)
- Convert to natural-sounding speech using high-quality neural TTS
- Maintain the original timing and pacing of the conversation
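As a rough illustration of the middle of that pipeline (the helper function and file names below are hypothetical, not the tool's actual API), transcription and disfluency removal look roughly like this:

```python
# Illustrative sketch only; hypothetical helper, not the tool's real API.
import re
import whisper  # openai-whisper, runs locally

def remove_disfluencies(text: str) -> str:
    """Drop common fillers such as 'um', 'uh', 'erm' with a simple regex pass."""
    return re.sub(r"\b(um+|uh+|erm+)\b[,.]?\s*", "", text, flags=re.IGNORECASE).strip()

model = whisper.load_model("base")       # same default model as the config example below
result = model.transcribe("input.mp3")   # hypothetical input file

# Keep each segment's start/end so synthesized speech can follow the original pacing.
segments = [
    {"start": s["start"], "end": s["end"], "text": remove_disfluencies(s["text"])}
    for s in result["segments"]
]
```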
To run the tool you will need:
- Python 3.8 or higher
- FFmpeg
- 4GB+ RAM for transcription
- 500MB+ disk space for models
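Before installing, you can check that the basics are already in place:

```bash
# Verify Python and FFmpeg are installed and on the PATH
python3 --version   # should report 3.8 or higher
ffmpeg -version
```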
Run the provided setup script, which will install all required dependencies and download the necessary models:

```bash
# Make the setup script executable
chmod +x setup.sh

# Run the setup script
./setup.sh
```
If you prefer to install manually:

1. Create a virtual environment:

   ```bash
   python3 -m venv .venv
   source .venv/bin/activate
   ```

2. Install the Python dependencies:

   ```bash
   pip install yt-dlp pydub tqdm openai-whisper piper-tts torch
   ```

3. Download the TTS voice model:

   ```bash
   piper-download --voice en_US-lessac-medium
   ```
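As an optional sanity check (not part of the tool itself), confirm that Whisper imports and can list its models:

```bash
# Should print the available Whisper model names (tiny, base, small, ...)
python -c "import whisper; print(whisper.available_models())"
```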
To enhance a YouTube video:

```bash
python enhanced_speech_tool.py -yt "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
```

To enhance a local audio file:

```bash
python enhanced_speech_tool.py -f "path/to/audio/file.mp3"
```
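A direct audio URL can also be passed with `-u` (the URL below is only a placeholder):

```bash
python enhanced_speech_tool.py -u "https://example.com/talk.mp3"
```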
The following command-line options are available:

- `-yt`, `--youtube`: YouTube URL
- `-f`, `--file`: Local audio file path
- `-u`, `--url`: Direct audio URL
- `-c`, `--config`: Path to configuration file
- `-v`, `--voice`: Voice to use for synthesis
- `--no-disfluencies`: Remove disfluencies (um, uh, etc.)
- `--simplify`: Simplify language for easier understanding
- `-o`, `--output-dir`: Output directory
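Options can be combined. For example, to strip fillers, use a different voice, and set the output directory:

```bash
python enhanced_speech_tool.py -yt "https://www.youtube.com/watch?v=YOUR_VIDEO_ID" \
    --no-disfluencies -v en_GB-alba-medium -o enhanced_audio
```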
Several high-quality voices are available through Piper TTS:
- `en_US-lessac-medium`: Clear American English (default)
- `en_GB-alba-medium`: British English
- `en_US-ryan-high`: Male American voice
- `en_AU-sydney-medium`: Australian English
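Non-default voices need their models downloaded before use. Assuming the same `piper-download` command from the setup step accepts any of the voice names above:

```bash
piper-download --voice en_GB-alba-medium
python enhanced_speech_tool.py -f "path/to/audio/file.mp3" -v en_GB-alba-medium
```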
You can customize the tool's behavior by creating a configuration file. Example:
```json
{
  "output_dir": "enhanced_audio",
  "whisper_model": "base",
  "device": null,
  "remove_disfluencies": true,
  "simplify_language": false,
  "tts_engine": "piper",
  "voice": "en_US-lessac-medium",
  "maintain_timing": true
}
```
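Save the file anywhere and point the tool at it with `-c` (the path below is just an example):

```bash
python enhanced_speech_tool.py -f "path/to/audio/file.mp3" -c config/custom.json
```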
The project is laid out as follows:

```
enhanced_speech_tool/
├── enhanced_speech_tool.py   # Main script
├── setup.sh                  # Setup script
├── config/                   # Configuration files
│   └── default.json          # Default configuration
├── src/                      # Source code
│   ├── audio_extractor.py    # Audio extraction module
│   ├── transcriber.py        # Speech transcription module
│   ├── text_processor.py     # Text processing module
│   ├── speech_synthesizer.py # Speech synthesis module
│   ├── audio_mixer.py        # Audio mixing module
│   └── config.py             # Configuration module
└── enhanced_audio/           # Output directory
```
Known limitations:
- Whisper transcription may be imperfect for heavy accents or poor-quality audio
- Neural TTS voices require downloading models (~200MB per voice)
- Processing long audio files (>30 minutes) can take significant time
This tool uses the following open-source libraries:
- OpenAI Whisper for transcription
- Piper TTS for speech synthesis
- yt-dlp for YouTube audio extraction
- PyDub for audio processing