A web app to help with the pronunciation of Turkish words and phrases
- Install dependencies:
- Lint source code:
- Preprocess data:
- Start development server:
- Build and generate the app page to the
- Serve the generated page in the
How does it work?
Preprocessing steps -
- Words that do not appear in a standard English dictionary are filtered out of CMUdict.
- From the filtered CMUdict entries, a reverse mapping (from one pronunciation to possibly multiple words) is generated.
- The raw English word frequency data file is parsed.
- When several words share a pronunciation, only the most frequently used one is kept in the reverse mapping.
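
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names are hypothetical, the frequency file is assumed to be a CSV with a header row and `word,count` columns, and CMUdict is assumed to be in its plain-text form (`WORD  PH1 PH2 ...`, comment lines starting with `;;;`, alternative pronunciations marked as `WORD(1)`). For brevity the sketch builds the reverse mapping and drops the less frequent homophones in a single pass.

```typescript
// Hypothetical helper: parse "word,count" rows (header row assumed) into a lookup table.
function parseFrequency(csv: string): Map<string, number> {
  const frequency = new Map<string, number>();
  for (const row of csv.trim().split("\n").slice(1)) {
    const [word, count] = row.split(",");
    if (word === undefined || count === undefined) continue; // skip malformed rows
    frequency.set(word.trim().toLowerCase(), Number(count));
  }
  return frequency;
}

// Hypothetical helper: build a pronunciation -> word mapping, keeping for each
// pronunciation only the word with the highest usage frequency.
function buildReverseMapping(
  cmudictLines: string[],
  englishWords: Set<string>,
  frequency: Map<string, number>,
): Map<string, string> {
  const reverse = new Map<string, string>();
  for (const line of cmudictLines) {
    if (line.startsWith(";;;") || line.trim() === "") continue; // comment or blank line
    const [rawWord, ...phonemes] = line.trim().split(/\s+/);
    if (!rawWord) continue;
    // Drop the variant marker, e.g. "READ(1)" -> "read".
    const word = rawWord.replace(/\(\d+\)$/, "").toLowerCase();
    if (!englishWords.has(word)) continue; // not in the English dictionary
    const key = phonemes.join(" ");
    const current = reverse.get(key);
    const wordFrequency = frequency.get(word) ?? 0;
    if (current === undefined || wordFrequency > (frequency.get(current) ?? 0)) {
      reverse.set(key, word);
    }
  }
  return reverse;
}
```

Keying the mapping by the space-joined phoneme string keeps lookups simple, at the cost of treating stress digits (such as the `1` in `AY1`) as part of the key; whether the project strips them is not stated here.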
Pronunciation algorithm -
- All possible syllable combinations are generated from the input Turkish word.
- The letters in each syllable are transcribed using their alternatives in the CMUdict phonetic alphabet.
- Each transcription is looked up in the reverse mapping file.
- If no match is found for a syllable, simple translations are applied to each letter.
- The results are sorted prioritizing:
  - the ones with the most English word matches
  - the ones that fit natural Turkish hyphenation
- The 10 best results are returned.
Example for the input "bahadır":
- Syllable combinations: ['bah', 'ad', 'ır'], ['ba', 'had', 'ır'], ['bah', 'a', 'dır'], ['ba', 'ha', 'dır']
- Phonetic transcriptions: [[['B', 'AA', 'HH'], ['AA', 'D'], ['AH0', 'R']], ... (all combinations) ... ]
- (3, 4).
- Results before sorting: ['baah-odd-er', 'bah-hud-er', 'baah-uh-derr', 'bah-huh-derr']
- Results after sorting: ['bah-hud-er', 'bah-huh-derr', 'baah-odd-er', 'baah-uh-derr']
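
The syllable generation and sorting steps can be illustrated with a short sketch. It rests on assumptions that the README does not state: the four splits listed above for "bahadır" each contain exactly one vowel per syllable, so the sketch enumerates every split with that property; "natural hyphenation" is approximated by the common Turkish rule that a new syllable takes the last consonant before its vowel; and the `Candidate` shape, the function names, and the exact tie-breaking are hypothetical. It is not meant to reproduce the precise ordering shown above.

```typescript
const TURKISH_VOWELS = new Set(["a", "e", "ı", "i", "o", "ö", "u", "ü"]);

type SplitState = { done: string[]; current: string; hasVowel: boolean };

// Enumerate every way to cut a lowercase word into chunks with exactly one vowel each.
function syllableSplits(word: string): string[][] {
  let states: SplitState[] = [{ done: [], current: "", hasVowel: false }];
  for (const ch of word) {
    const isVowel = TURKISH_VOWELS.has(ch);
    const next: SplitState[] = [];
    for (const s of states) {
      // Extend the current chunk unless that would give it a second vowel.
      if (!(isVowel && s.hasVowel)) {
        next.push({ done: s.done, current: s.current + ch, hasVowel: s.hasVowel || isVowel });
      }
      // Close the current chunk (only if it already has its vowel) and start a new one.
      if (s.hasVowel) {
        next.push({ done: [...s.done, s.current], current: ch, hasVowel: isVowel });
      }
    }
    states = next;
  }
  // Discard splits whose final chunk has no vowel.
  return states.filter((s) => s.hasVowel).map((s) => [...s.done, s.current]);
}

// Approximate natural hyphenation: each new syllable takes the last preceding consonant.
function naturalHyphenation(word: string): string[] {
  const chunks: string[] = [];
  let current = "";
  let hasVowel = false;
  for (const ch of word) {
    if (TURKISH_VOWELS.has(ch) && hasVowel) {
      const last = current.slice(-1);
      if (TURKISH_VOWELS.has(last)) {
        chunks.push(current); // two adjacent vowels: cut between them
        current = ch;
      } else {
        chunks.push(current.slice(0, -1)); // move the last consonant to the new syllable
        current = last + ch;
      }
    } else {
      current += ch;
      hasVowel = hasVowel || TURKISH_VOWELS.has(ch);
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Hypothetical result shape: a split plus the number of syllables matched to English words.
interface Candidate { syllables: string[]; englishMatches: number }

// Sort by match count first, then prefer the naturally hyphenated split; keep the top `limit`.
function rankCandidates(word: string, candidates: Candidate[], limit = 10): Candidate[] {
  const natural = naturalHyphenation(word).join("-");
  const fits = (c: Candidate): number => (c.syllables.join("-") === natural ? 1 : 0);
  return [...candidates]
    .sort((a, b) => b.englishMatches - a.englishMatches || fits(b) - fits(a))
    .slice(0, limit);
}
```

The number of splits grows quickly with the number of consonants between vowels, which is one reason to cap the output at a fixed number of results.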
User interface -
- Consists of a single Next.js statically-generated page with no back-end.
- The reverse mapping file is loaded into the client app, so the algorithm runs in the browser.
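
As a rough illustration of what running the lookup in the browser might involve, the sketch below transcribes a syllable and looks it up in an already-loaded mapping. Everything here is an assumption made for illustration: the mapping is taken to be a plain object parsed from JSON and keyed by space-joined phonemes, the names are hypothetical, and the letter table contains only the six letters that appear in the "bahadır" example above, with one phoneme each rather than the full set of alternatives.

```typescript
// Partial, illustrative table taken from the "bahadır" example; the project's
// real table covers the whole alphabet and offers alternatives per letter.
const LETTER_TO_PHONEME: Record<string, string> = {
  b: "B",
  a: "AA",
  h: "HH",
  d: "D",
  "ı": "AH0",
  r: "R",
};

// Assumed shape of the loaded reverse mapping: pronunciation key -> English word.
type ReverseMapping = Record<string, string>;

// Turn a syllable into a lookup key, or undefined if a letter is not in the table.
function syllableToKey(syllable: string): string | undefined {
  const phonemes: string[] = [];
  for (const letter of syllable) {
    const phoneme = LETTER_TO_PHONEME[letter];
    if (phoneme === undefined) return undefined;
    phonemes.push(phoneme);
  }
  return phonemes.join(" ");
}

// Look up each syllable; undefined marks a syllable with no English match,
// which is where the per-letter fallback translation would apply.
function lookupSyllables(syllables: string[], mapping: ReverseMapping): (string | undefined)[] {
  return syllables.map((syllable) => {
    const key = syllableToKey(syllable);
    return key === undefined ? undefined : mapping[key];
  });
}
```

Because the lookup is a plain object access, it needs no server round trip once the mapping file has been downloaded.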
- Pronunciation dictionary data source: Carnegie Mellon Pronouncing Dictionary
- Word frequency data source: English Word Frequency dataset on Kaggle
- Text-to-speech API: Voice RSS
- Icons: Freepik on Flaticon
- NPM packages: Next.js, React, Blueprint