Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased

Under `data/`

Added

Added whitelisting capabilities to postprocess. (#152)
Added whitelists for Dutch, English, Greek, Latin, Korean, and Spanish. (#158, etc.)
Logged dialect configuration if specified. (#133)
Added typing to big scrape code. (#140)
Added argparse to allow limiting 'big scrape' to individual languages with --restriction flag. (#154)
Added Manchu (mnc). (#185)
Added Polabian (pox). (#186)
Added aar, bdq, jje, and lsi. (#202)
Added tyv to languagecodes.py (#203, #205)
Added bcl, egl, izh, ltg, azg, kir and mga to languagecodes.py. (#205)
Added nep to languagecodes.py. (#206)
Added Ingrian (izh). (#215)
Added French phoneme list and filtered TSV file. (#213, #217)
Added Coriscan (cos). (#222)
Added Middle Korean (okm). (#223)
Added Middle Irish (mga). (#224)
Added Old Portuguese (opt). (#225)
Added Serbo-Croatian phoneme list and filtered TSV files. (#227)
Added Tuvan (tyv). (#228)
Added Shan (shn) with custom extraction. (#229)
Added Northern Kurdish (kmr). (#243)
Added a script to facilitate the creation of a .phones file. (#246)
Added IPA validity checks for phonemes. (#248)
Split multiple pronunciations joined by tilde in eng_us_phonetic.
Added Italian phoneme list and filtered TSV file. (#260, #261)
Added Adyghe phone list and filtered TSV file. (#262, #263)
Added Bulgarian phoneme list and filtered TSV file. (#264, #267)
Added Icelandic phoneme list and filtered TSV file. (#269, #270)
Added Slovenian phoneme list and filtered TSV file. (#271, #273)
Added normalization to list_phones.py. Corrected errors relating to ipapy (#275)
Added Welsh .phones lists and filtered TSV files. (#274, #276)

Changed

Improved printing in the README table. (#145)
Renamed data directory data. (#147)
Split may into Latin and Arabic files. (#164)
Split pan into Gurmukhi and Shahmukhī. (#169)
Split uig into Perso-Arabic and Cyrillic. (#173)
Only allowed Latin spellings in Maltese lexicon. (#166).
Split mon into Cyrillic and Mongol Bichig (#179).
Merged whitelist.py into 'big scrape' script. src scrape.py now checks for existence of whitelist file during scrape to create second filtered TSV. New TSV placed under tsv/\*\_filtered.tsv. (#154).
Updated generate_summary.py to reflect presence of 'filtered' tsv. (#154)
Imperial Aramaic (arc) split into three scripts properly. (#187)
Flattened data directory structure. (#194)
Updated Georgian (geo) to take advantage of upstream bot-based consistency fixes. (#138)
Split arm into Eastern and Western dialects. (#197)
Rescraped files with new whitelists. (#199)
Updated logging statements for consistency. (#196)
Renamed .whitelist file extension name as .phones. (#207)
Split ban into Latin and Balinese scripts. (#214)
Split kir into Cyrillic and Arabic. (#216)
Split Latin (lat) into its dialects. (#233)
Added MyPy coverage for wikipron, tests and data directories. (#247)
Modified paths in codes.py, scrape.py, and split.py. (#251, #256)
Modified config flags in languages.json and scrape.py. (#258)
Moved list_phones.py to parent direcetory. (#265, #266)

Fixed

Fixed path issue with phonetic whitelisted files. (#195)

Under `wikipron/` and Elsewhere

Added

Added positive flags for stress, syllable boundaries, tones, segment to cli.py. (#141)
Added positive flags for space skipping to cli.py. (#257)
Added two Vietnamese dialects to languages.json. (#139)
Handled additional language codes. (#132, #148)
Added --no-skip-spaces-word and --no-skip-spaces-pron flag. (#135)
Allowed ASCII apostrophes (0x27) in spellings. (#172).
Added Vietnamese extraction function. (#181).
Modified pron selector in Latin extraction function. (#183).
Added --no-tone flag. (#188)
Customized extractor and new scraped prons for khb. (#219)
Added tests/test_data directory containing two tests. (#226, #251)
Added HTTP User-Agent header to API calls to Wiktionary. (#234)
Added support for python 3.9 (#240)
Added black style formatting to .circleci/config.yml. (#242)
Added logging for scraping a language with --dialect specified that requires its custom extraction logic. (#245)
Improved CircleCI workflow with orbs. (#249)
Added test_split.py to tests/test_data. (#256)
Handled Cantonese for scraping. (#277)

Changed

Renamed arguments to positive statements in wikipron/config.py and edited _get_process_pron function accordingly. (#141, #257)
Changed testing values used in tests/test_config.py in order to accomodate the addition of positive flags. (#141)
Specified UTF-8 encoding in handling text files. (#221)
Moved previous contents of tests into tests/test_wikipron (#226)
Updated the packages version numbers in requirements.txt to their latest according to PyPI (#239)

Deprecated

Removed

Moved Wiktionary querying functions from test_languagecodes.py to codes.py (#205)

Fixed

Security

[1.1.0] - 2020-03-03

Added

Added the extraction function for Mandarin Chinese and its scraped data. (#124)
Integrated Wortschatz frequencies. (#122)

Changed

Updated the Japanese extraction function and Japanese data. (#129)
Updated all scraped Wiktionary data and frequency data. (#127, #128)
Generalized the splitting script in the big scrape. (#123)
Moved small file removal to generate_summary.py. (#119)
Updated Russian data. (#115)

Fixed

Avoided and logged error in case of pron processing failure. (#130)

[1.0.0] - 2019-11-29

Added

Handled Japanese. (#109, #114)
Handled Latin, for which the actual graphemes cannot be the Wiktionary page titles and have to come from within the page. (#92, #93)
Handled Thai, whose pronunciations are embedded in HTML tables. (#90)
Handled Khmer, whose pronunciations are embedded in HTML tables. (#88)
IPA segmentation using spaces by default, with the --no-segment flag to optionally turn it off. (#69, #79, #83, #89, #100)
Added TSV files for all Wiktionary languages with over 100 entries. (#61, #76, #95, #97, #103, #104)
Resolved Wiktionary language names for languages with at least 100 pronunciation entries. (#52, #55)

Changed

Removed duplicate <word, pronunciation> pairs in the persisted data. (#85, #111, #116)
Split Welsh into Northern Wales and Southern dialects in the persisted data. (#110)
Factored out casefolding. (#102)
Split Serbo-Croatian into Cyrillic and Latin TSVs. (#96)
Generalized word and pronunciation extraction. (#88)

Removed

Removed the timeout in smoke tests. (#107)
Removed the output option. (#82)
Removed the require_dialect_label option. (#77)

Fixed

Skipped pronunciations with a dash. (#106)
Skipped empty pronunciations in scraping. (#59)
Updated the <li> XPath selector for an optional layer of <span> to cover previously unhandled languages (e.g., Korean). (#50)
Updated the <li> XPath selector for title="wikipedia:<language> phonology" to cover previously unhandled languages (e.g., Estonian and Slovak). (#49)

Security

Avoided using exec to retrieve the version string. Used pkg_resources instead. (#63)

[0.1.1] - 2019-08-14

Fixed

Fixed import bug. (#45)

[0.1.0] - 2019-08-14

First release.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CHANGELOG.md

CHANGELOG.md

Changelog

Unreleased

Under `data/`

Added

Changed

Fixed

Under `wikipron/` and Elsewhere

Added

Changed

Deprecated

Removed

Fixed

Security

[1.1.0] - 2020-03-03

Added

Changed

Fixed

[1.0.0] - 2019-11-29

Added

Changed

Removed

Fixed

Security

[0.1.1] - 2019-08-14

Fixed

[0.1.0] - 2019-08-14

Files

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

Changelog

Unreleased

Under data/

Added

Changed

Fixed

Under wikipron/ and Elsewhere

Added

Changed

Deprecated

Removed

Fixed

Security

[1.1.0] - 2020-03-03

Added

Changed

Fixed

[1.0.0] - 2019-11-29

Added

Changed

Removed

Fixed

Security

[0.1.1] - 2019-08-14

Fixed

[0.1.0] - 2019-08-14

Under `data/`

Under `wikipron/` and Elsewhere