Data to generate acoustic model adaptation prompts/transcripts from CMU Arctic prompts


Sphinx 4 Acoustic Model Adaptation Data
by Brian Romanowski

This project contains data used during the CMU Sphinx 4 acoustic model
adaptation process.  It includes nearly all utterances from the
CMU ARCTIC project, about 1108 more than the 20 prompts provided in
the CMU adaptation tutorial.

A blog post describes some of the issues surrounding acoustic model
adaptation in Sphinx that are out of scope for this narrow project.

Provided are the four files required during the adaptation process:
* prompts for speakers
* transcriptions of the to-be-recorded speech
* pronunciation dictionary entries for transcribed words
* miscellaneous bookkeeping information
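
As a rough illustration of how a prompt could be turned into a transcription entry, here is a minimal Python 3 sketch. It is not the included Python 2.7 script. It assumes that prompt lines look like `( utt_id "Text." )` and that transcription lines look like `<s> words </s> (utt_id)`, which is how I understand the ARCTIC and adaptation tutorial formats; the function names are hypothetical.

```python
import re

# Assumed prompt line layout: ( utt_id "Sentence text." )
PROMPT_RE = re.compile(r'^\(\s*(\S+)\s+"(.*)"\s*\)\s*$')


def parse_prompt(line):
    """Split one prompt line into (utterance id, raw sentence text)."""
    match = PROMPT_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized prompt line: %r" % line)
    return match.group(1), match.group(2)


def normalize(text):
    """Lowercase the text and keep only letter/apostrophe runs as words."""
    words = re.findall(r"[a-z']+", text.lower())
    # Drop stray quote marks that are not part of a word.
    return [w.strip("'") for w in words if w.strip("'")]


def transcription_line(utt_id, words):
    """Format one transcription entry (assumed layout, see lead-in)."""
    return "<s> {} </s> ({})".format(" ".join(words), utt_id)


if __name__ == "__main__":
    utt_id, text = parse_prompt('( utt_0001 "Hello there, world." )')
    print(transcription_line(utt_id, normalize(text)))
```

The utterance id doubles as the bookkeeping entry, and the normalized words are the ones that need pronunciation dictionary entries.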

An additional file shows the pronunciation alternatives and the
dictionary words inline with the transcriptions.  The original idea
was that a human annotator could quickly delete the unused
pronunciations, but there are too many for this to be practical.
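
To give a sense of what inlining pronunciation alternatives involves, here is a minimal Python 3 sketch. It assumes CMU-dictionary-style lines (`WORD  PH1 PH2 ...`, with alternatives marked as `WORD(2)` and comment lines starting with `;;;`). The bracketed output layout and the function names are hypothetical, not the format of the file in this project.

```python
import re
from collections import defaultdict

# Alternative pronunciations are assumed to be marked as WORD(2), WORD(3), ...
ALT_RE = re.compile(r"^(.+?)\((\d+)\)$")


def load_pronunciations(lines):
    """Map each lowercase word to the list of its pronunciations."""
    prons = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # no phones on this line
        head, phones = parts
        match = ALT_RE.match(head)
        word = (match.group(1) if match else head).lower()
        prons[word].append(phones.split())
    return prons


def annotate(words, prons):
    """Attach every known pronunciation to each word (hypothetical layout)."""
    annotated = []
    for word in words:
        alternatives = [" ".join(p) for p in prons.get(word, [])]
        annotated.append("{} [{}]".format(word, " | ".join(alternatives)))
    return annotated


if __name__ == "__main__":
    sample = ["READ  R EH1 D", "READ(2)  R IY1 D", "A  AH0"]
    print(annotate(["read", "a"], load_pronunciations(sample)))
```

Words with several alternatives quickly dominate the output, which is consistent with the observation that pruning them by hand does not scale.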

Note that 4 ARCTIC prompts containing numerals were left out of
the generated prompt data, because the mapping from numerals to
spoken words is not always clear.
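
A filter of this kind might look like the following Python 3 sketch. The function name and the sample utterances are hypothetical, and the included script may detect numerals differently.

```python
import re

DIGIT_RE = re.compile(r"\d")


def split_by_numerals(prompts):
    """Separate (utt_id, text) pairs into those without and with digits."""
    kept, skipped = [], []
    for utt_id, text in prompts:
        if DIGIT_RE.search(text):
            skipped.append((utt_id, text))
        else:
            kept.append((utt_id, text))
    return kept, skipped


if __name__ == "__main__":
    # Hypothetical sample data, not actual ARCTIC prompts.
    sample = [
        ("utt_0001", "The ship sailed at dawn."),
        ("utt_0002", "It happened in 1908."),
    ]
    kept, skipped = split_by_numerals(sample)
    print(len(kept), "kept;", len(skipped), "skipped")
```

Returning the skipped prompts as well makes it easy to review them and, if desired, spell out the numerals by hand later.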

The Python 2.7 code used to generate the adaptation prompt data is 
included.  To use it, you'll have to edit filesystem paths and 
download the CMU pronunciation dictionary and the original CMU 
ARCTIC prompts.

Please let me know about the bugs you find!
Feel free to fork the code to generate prompts for other corpora!