A Python tool for creating text-to-speech (TTS) training datasets by recording audio for text phrases. This tool provides an interactive command-line interface for recording audio samples with automatic silence removal, audio validation, and progress tracking.
- High-quality audio recording using `sounddevice` (22.05 kHz sample rate)
- Flexible input formats - supports both `wavfile|text` and plain text formats
- Automatic silence removal with fade in/out to prevent clicks
- Audio validation - automatic testing of recorded audio quality
- Progress tracking - real-time display of recording progress
- Auto-save - saves progress on interruption (Ctrl+C)
- Re-record option - easily re-record any phrase
- Skip phrases - skip unwanted phrases without recording
- Escape mode - save current recording and exit gracefully
- Random filenames - generates unique 8-character filenames for audio files
- Python 3.6+
- `librosa` - Audio processing
- `soundfile` - Audio file I/O
- `sounddevice` - Audio recording
- `numpy` - Numerical operations
- A working microphone/audio input device
- Terminal/command-line interface
- Sufficient disk space for audio files
- Install the required Python packages:

      pip install librosa soundfile sounddevice numpy

- Make the script executable (optional):

      chmod +x dataset_creator_final.py

Edit the `CONFIG` dictionary in the `main()` function to set your paths:

    CONFIG = {
        'input_text': 't1.txt',   # Input text file
        'output_text': 't.txt',   # Output dataset file
        'wave_path': 'audio/',    # Directory to save WAV files
    }

Then run the script:

    python dataset_creator_final.py

Or, if made executable:

    ./dataset_creator_final.py

- Prepare input file: Create a text file with phrases to record (see Input Formats below)
- Run the script: Execute `dataset_creator_final.py`
- Record phrases: For each phrase:
  - Press Enter to start recording
  - Speak the phrase
  - Press Enter to stop recording
  - Choose an action (see Recording Controls below)
- Review output: The tool automatically validates audio and saves the dataset
The tool automatically detects and supports two input formats.

Plain text, one phrase per line:

    Hello world
    This is a test
    How are you today?

`wavfile|text`, one entry per line:

    dummy.wav|Hello world
    dummy.wav|This is a test
    dummy.wav|How are you today?

Note: In `wavfile|text` format, the tool extracts only the text portion (after the `|`). The file name before the `|` is ignored but can be useful for reference.
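The detection and extraction described above could look roughly like the following. This is a minimal sketch, not the tool's actual code: the function names `detect_format` and `load_phrases`, the label `wavfile_text`, and the rule "every non-empty line contains a `|`" are all assumptions made for illustration.

```python
def detect_format(lines):
    """Guess the input format from the non-empty lines (assumed heuristic)."""
    non_empty = [ln for ln in lines if ln.strip()]
    if non_empty and all("|" in ln for ln in non_empty):
        return "wavfile_text"
    return "plain_text"


def load_phrases(lines):
    """Return (format, phrases), dropping any 'wavfile|' prefix from each line."""
    fmt = detect_format(lines)
    phrases = []
    for ln in lines:
        ln = ln.strip()
        if not ln:
            continue  # ignore blank lines
        if fmt == "wavfile_text":
            # Keep only the part after the first '|'.
            ln = ln.split("|", 1)[1].strip()
        phrases.append(ln)
    return fmt, phrases
```

Splitting on the first `|` only means a phrase that itself contains a `|` is kept intact after the file name.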
The output dataset file contains one entry per line in the format:
    audio/abc12345.wav|Hello world
    audio/def67890.wav|This is a test
    audio/ghi11111.wav|How are you today?

Where:
- `audio/` is the `wave_path` directory
- `abc12345.wav` is a randomly generated 8-character filename
- `|` separates the audio file path from the text
- The text is the original phrase
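To illustrate the output line layout, here is a small sketch of how an entry might be assembled. The helper names `random_wav_name` and `dataset_line` are hypothetical, and the choice of lowercase letters and digits for the filename is an assumption; the tool may generate names differently.

```python
import os
import secrets
import string

# Assumed alphabet for the random part of the filename.
ALPHABET = string.ascii_lowercase + string.digits


def random_wav_name(length=8):
    """Return a random name such as 'abc12345.wav'."""
    stem = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return stem + ".wav"


def dataset_line(wave_path, wav_name, text):
    """Format one output entry as '<wave_path><file>|<text>'."""
    return f"{os.path.join(wave_path, wav_name)}|{text}"
```

A real run would also want to check that the generated name does not already exist in `wave_path` before saving.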
After recording a phrase, you'll be prompted with these options:
- `y` or `yes` - Save this recording and move to the next phrase
- `n` or `no` - Record again (re-record the same phrase)
- `s` or `skip` - Skip this phrase (don't save, move to the next)
- `e` or `escape` - Save the current recording and exit (saves all progress)
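One simple way to handle these responses is a lookup table that maps both the short and long forms to an action. The sketch below is illustrative only; `ACTIONS`, `parse_choice`, and the action names are hypothetical and not taken from the script.

```python
# Hypothetical action names; short and long forms map to the same action.
ACTIONS = {
    "y": "save", "yes": "save",
    "n": "rerecord", "no": "rerecord",
    "s": "skip", "skip": "skip",
    "e": "escape", "escape": "escape",
}


def parse_choice(raw):
    """Map user input to an action name, or None if it is not recognised."""
    return ACTIONS.get(raw.strip().lower())
```

Returning `None` for unknown input lets the caller simply prompt again.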
Press Ctrl+C at any time to:
- Save all recorded files to the output dataset
- Exit gracefully
- Preserve all progress made so far
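The save-on-interrupt behaviour can be sketched with a `try`/`except KeyboardInterrupt`/`finally` block. This is an assumed structure, not the tool's actual implementation: `record_all`, `record_one`, and `save_dataset` are hypothetical names standing in for the recording loop, the per-phrase recorder, and the dataset writer.

```python
def record_all(phrases, record_one, save_dataset):
    """Record each phrase and save whatever was collected, even on Ctrl+C.

    record_one(text) returns a WAV path, or None if the phrase was skipped.
    save_dataset(entries) writes the collected 'path|text' lines.
    Both callables are hypothetical stand-ins.
    """
    entries = []
    try:
        for text in phrases:
            wav = record_one(text)
            if wav is not None:
                entries.append(f"{wav}|{text}")
    except KeyboardInterrupt:
        print("\nInterrupted, saving progress...")
    finally:
        # Runs on normal completion and on interruption alike.
        save_dataset(entries)
    return entries
```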
- Silence Removal: Automatically trims silence from the beginning and end of recordings
  - Threshold: 0.005 (configurable)
  - Padding: 50ms before/after detected audio
- Click Prevention: Applies fade in/out (10ms) to prevent audio clicks
- Format Standardization:
  - Sample rate: 22.05 kHz
  - Format: WAV (PCM_16)
  - Channels: Mono
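The trimming and fade steps above can be approximated with NumPy as follows. This is a sketch under stated assumptions: it treats the 0.005 threshold as an absolute-amplitude cutoff and uses a linear fade, whereas the tool's `remove_silence()` may use a different detection method (for example an energy-based one). The function names `trim_silence` and `apply_fades` are hypothetical.

```python
import numpy as np


def trim_silence(audio, sample_rate=22050, threshold=0.005, pad_ms=50):
    """Trim leading/trailing samples whose absolute value is below threshold."""
    loud = np.flatnonzero(np.abs(audio) > threshold)
    if loud.size == 0:
        return audio  # nothing above the threshold; leave untouched
    pad = int(sample_rate * pad_ms / 1000)
    start = max(loud[0] - pad, 0)
    end = min(loud[-1] + 1 + pad, len(audio))
    return audio[start:end]


def apply_fades(audio, sample_rate=22050, fade_ms=10):
    """Apply a linear fade-in and fade-out of fade_ms each (assumed shape)."""
    n = min(int(sample_rate * fade_ms / 1000), len(audio) // 2)
    out = audio.astype(np.float32, copy=True)
    if n > 0:
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        out[:n] *= ramp
        out[-n:] *= ramp[::-1]
    return out
```

Keeping the 50ms padding before applying the fade means the fade mostly affects the padded region and not the speech itself.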
After recording, the tool automatically:
- Checks recording duration
- Validates audio amplitude
- Detects potential clipping
- Measures energy levels
- Warns about very short/long recordings
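A validation pass along these lines might look like the sketch below. The function name `validate_audio` and every threshold (`min_s`, `max_s`, `clip_level`, `min_peak`, `min_rms`) are illustrative assumptions; the source does not state the values the tool uses.

```python
import numpy as np


def validate_audio(audio, sample_rate=22050, min_s=0.5, max_s=15.0,
                   clip_level=0.99, min_peak=0.01, min_rms=0.005):
    """Return a list of warning strings; an empty list means no issues found.

    All thresholds are illustrative assumptions, not the tool's values.
    """
    if len(audio) == 0:
        return ["empty recording"]
    warnings = []
    samples = np.asarray(audio, dtype=np.float64)
    duration = len(samples) / sample_rate
    peak = float(np.max(np.abs(samples)))
    rms = float(np.sqrt(np.mean(samples ** 2)))  # simple energy measure
    if duration < min_s:
        warnings.append(f"very short recording ({duration:.2f}s)")
    if duration > max_s:
        warnings.append(f"very long recording ({duration:.2f}s)")
    if peak < min_peak:
        warnings.append("very low amplitude")
    if peak >= clip_level:
        warnings.append("possible clipping")
    if rms < min_rms:
        warnings.append("low energy")
    return warnings
```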
Example session:

    Final Dataset Creator for TTS Training
    ==================================================
    Input file: t1.txt
    Output file: t.txt
    Audio directory: audio/
    ==================================================
    Detected format: plain text (one per line)
    Loaded 3 phrases from plain_text format
    Progress: 1/3 | Saved WAVs: 0
    ============================================================
    Hello world
    ============================================================
    Press Enter to start recording...
    Recording... Press Enter to stop.
    (Recording...)
    Press Enter to stop recording...
    Recording completed successfully
    Recording duration: 2.34 seconds
    Options:
    y - Save this recording
    n - Record again
    s - Skip this text (don't save, move to next)
    e - Escape (save all data and exit)
    What would you like to do? (y/n/s/e): y
    Saved as: abc12345.wav (2.34s)
    Added to dataset: abc12345.wav
    Total recorded so far: 1

After running the tool, you'll have:
    project/
    ├── dataset_creator_final.py
    ├── t1.txt          # Input file
    ├── t.txt           # Output dataset file
    └── audio/          # Audio files directory
        ├── abc12345.wav
        ├── def67890.wav
        └── ...

- Check microphone permissions: Ensure your system allows microphone access
- Check audio device: Verify your microphone is connected and working
- Test with system audio tools: Try recording with other applications first
- Low volume: Check microphone input levels in system settings
- Clipping detected: Reduce microphone input gain
- Very short duration: Ensure you're speaking clearly and loudly enough
- Input file: Ensure `t1.txt` (or your configured input file) exists
- Output directory: The tool creates `audio/` automatically, but ensure write permissions
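The file checks above can be scripted before starting a long recording session. The following is a minimal sketch using only the standard library; `check_paths` is a hypothetical helper, not part of the tool, and it creates the output directory if it is missing.

```python
import os


def check_paths(input_text, wave_path):
    """Return a list of problems with the configured paths (empty if none)."""
    problems = []
    if not os.path.isfile(input_text):
        problems.append(f"input file not found: {input_text}")
    try:
        os.makedirs(wave_path, exist_ok=True)
    except OSError as exc:
        problems.append(f"cannot create {wave_path}: {exc}")
    else:
        if not os.access(wave_path, os.W_OK):
            problems.append(f"no write permission for {wave_path}")
    return problems
```

For the default configuration this would be called as `check_paths('t1.txt', 'audio/')`.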
If you get import errors, install missing packages:

    pip install librosa soundfile sounddevice numpy

- Sample Rate: 22.05 kHz (configurable in `FinalAudioRecorder.__init__`)
- Audio Format: WAV, PCM_16, Mono
- Silence Threshold: 0.005 (configurable in `remove_silence()`)
- Fade Duration: 10ms
- Padding: 50ms before/after detected audio

This tool is part of the shramVoice project. Modify and use as needed for your TTS training workflow.
To modify the tool:
- Edit the `CONFIG` dictionary in `main()` for paths
- Adjust `sample_rate` in `FinalAudioRecorder.__init__()` for different sample rates
- Modify the `remove_silence()` threshold for different silence detection sensitivity