All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Added whitelisting capabilities to `postprocess`. (#152)
- Added whitelists for Dutch, English, Greek, Latin, Korean, and Spanish. (#158, etc.)
- Logged dialect configuration if specified. (#133)
- Added typing to big scrape code. (#140)
- Added argparse to allow limiting 'big scrape' to individual languages with the `--restriction` flag. (#154)
- Added Manchu (`mnc`). (#185)
- Added Polabian (`pox`). (#186)
- Added `aar`, `bdq`, `jje`, and `lsi`. (#202)
- Added `tyv` to `languagecodes.py`. (#203, #205)
- Added `bcl`, `egl`, `izh`, `ltg`, `azg`, `kir` and `mga` to `languagecodes.py`. (#205)
- Added `nep` to `languagecodes.py`. (#206)
- Added Ingrian (`izh`). (#215)
- Added French phoneme list and filtered TSV file. (#213, #217)
- Added Corsican (`cos`). (#222)
- Added Middle Korean (`okm`). (#223)
- Added Middle Irish (`mga`). (#224)
- Added Old Portuguese (`opt`). (#225)
- Added Serbo-Croatian phoneme list and filtered TSV files. (#227)
- Added Tuvan (`tyv`). (#228)
- Added Shan (`shn`) with custom extraction. (#229)
- Added Northern Kurdish (`kmr`). (#243)
- Added a script to facilitate the creation of a `.phones` file. (#246)
- Added IPA validity checks for phonemes. (#248)
- Split multiple pronunciations joined by tilde in `eng_us_phonetic`.
- Added Italian phoneme list and filtered TSV file. (#260, #261)
- Added Adyghe phone list and filtered TSV file. (#262, #263)
- Added Bulgarian phoneme list and filtered TSV file. (#264, #267)
- Added Icelandic phoneme list and filtered TSV file. (#269, #270)
- Added Slovenian phoneme list and filtered TSV file. (#271, #273)
- Added normalization to `list_phones.py`. Corrected errors relating to `ipapy`. (#275)
- Added Welsh .phones lists and filtered TSV files. (#274, #276)
- Improved printing in the README table. (#145)
- Renamed data directory to `data`. (#147)
- Split `may` into Latin and Arabic files. (#164)
- Split `pan` into Gurmukhi and Shahmukhī. (#169)
- Split `uig` into Perso-Arabic and Cyrillic. (#173)
- Only allowed Latin spellings in Maltese lexicon. (#166)
- Split `mon` into Cyrillic and Mongol Bichig. (#179)
- Merged whitelist.py into 'big scrape' script. src scrape.py now checks for
  the existence of a whitelist file during the scrape to create a second,
  filtered TSV. The new TSV is placed under `tsv/*_filtered.tsv`. (#154)
- Updated `generate_summary.py` to reflect presence of 'filtered' TSV. (#154)
- Split Imperial Aramaic (`arc`) into three scripts properly. (#187)
- Flattened data directory structure. (#194)
- Updated Georgian (`geo`) to take advantage of upstream bot-based consistency fixes. (#138)
- Split `arm` into Eastern and Western dialects. (#197)
- Rescraped files with new whitelists. (#199)
- Updated logging statements for consistency. (#196)
- Renamed the `.whitelist` file extension to `.phones`. (#207)
- Split `ban` into Latin and Balinese scripts. (#214)
- Split `kir` into Cyrillic and Arabic. (#216)
- Split Latin (`lat`) into its dialects. (#233)
- Added MyPy coverage for `wikipron`, `tests` and `data` directories. (#247)
- Modified paths in `codes.py`, `scrape.py`, and `split.py`. (#251, #256)
- Modified config flags in `languages.json` and `scrape.py`. (#258)
- Moved `list_phones.py` to parent directory. (#265, #266)
- Fixed path issue with phonetic whitelisted files. (#195)
- Added positive flags for stress, syllable boundaries, tones, segment to `cli.py`. (#141)
- Added positive flags for space skipping to `cli.py`. (#257)
- Added two Vietnamese dialects to `languages.json`. (#139)
- Handled additional language codes. (#132, #148)
- Added `--no-skip-spaces-word` and `--no-skip-spaces-pron` flags. (#135)
- Allowed ASCII apostrophes (0x27) in spellings. (#172)
- Added Vietnamese extraction function. (#181)
- Modified pron selector in Latin extraction function. (#183)
- Added `--no-tone` flag. (#188)
- Customized extractor and new scraped prons for `khb`. (#219)
- Added `tests/test_data` directory containing two tests. (#226, #251)
- Added HTTP User-Agent header to API calls to Wiktionary. (#234)
- Added support for Python 3.9. (#240)
- Added black style formatting to `.circleci/config.yml`. (#242)
- Added logging for scraping a language with `--dialect` specified that requires its custom extraction logic. (#245)
- Improved CircleCI workflow with orbs. (#249)
- Added `test_split.py` to `tests/test_data`. (#256)
- Handled Cantonese for scraping. (#277)
- Renamed arguments to positive statements in `wikipron/config.py` and edited `_get_process_pron` function accordingly. (#141, #257)
- Changed testing values used in `tests/test_config.py` in order to accommodate the addition of positive flags. (#141)
- Specified UTF-8 encoding in handling text files. (#221)
- Moved previous contents of `tests` into `tests/test_wikipron`. (#226)
- Updated the package version numbers in requirements.txt to their latest according to PyPI. (#239)
- Moved Wiktionary querying functions from `test_languagecodes.py` to `codes.py`. (#205)
- Added the extraction function for Mandarin Chinese and its scraped data. (#124)
- Integrated Wortschatz frequencies. (#122)
- Updated the Japanese extraction function and Japanese data. (#129)
- Updated all scraped Wiktionary data and frequency data. (#127, #128)
- Generalized the splitting script in the big scrape. (#123)
- Moved small file removal to `generate_summary.py`. (#119)
- Updated Russian data. (#115)
- Avoided and logged error in case of pron processing failure. (#130)
- Handled Japanese. (#109, #114)
- Handled Latin, for which the actual graphemes cannot be the Wiktionary page titles and have to come from within the page. (#92, #93)
- Handled Thai, whose pronunciations are embedded in HTML tables. (#90)
- Handled Khmer, whose pronunciations are embedded in HTML tables. (#88)
- IPA segmentation using spaces by default, with the `--no-segment` flag to optionally turn it off. (#69, #79, #83, #89, #100)
- Added TSV files for all Wiktionary languages with over 100 entries. (#61, #76, #95, #97, #103, #104)
- Resolved Wiktionary language names for languages with at least 100 pronunciation entries. (#52, #55)
- Removed duplicate <word, pronunciation> pairs in the persisted data. (#85, #111, #116)
- Split Welsh into Northern Wales and Southern dialects in the persisted data. (#110)
- Factored out casefolding. (#102)
- Split Serbo-Croatian into Cyrillic and Latin TSVs. (#96)
- Generalized word and pronunciation extraction. (#88)
- Removed the timeout in smoke tests. (#107)
- Removed the `output` option. (#82)
- Removed the `require_dialect_label` option. (#77)
- Skipped pronunciations with a dash. (#106)
- Skipped empty pronunciations in scraping. (#59)
- Updated the `<li>` XPath selector for an optional layer of `<span>` to cover previously unhandled languages (e.g., Korean). (#50)
- Updated the `<li>` XPath selector for `title="wikipedia:<language> phonology"` to cover previously unhandled languages (e.g., Estonian and Slovak). (#49)
- Avoided using `exec` to retrieve the version string. Used `pkg_resources` instead. (#63)
- Fixed import bug. (#45)
First release.