Improve pt-br wordlist #63

drebs · 2019-05-26T23:11:02Z

The pt-br wordlist was first introduced in 7743ed5. This version differs
from that one as follows:

- 9-character words are introduced.
- suffix removal is done after accounting for popularity.
- less frequent words that differ only in the last character are
  removed.

The current pt-br wordlist was generated as follows:

1. Download a dump of Portuguese Wikipedia pages, process all pages
   and determine the frequency of each word.
2. Start from /usr/share/dict/brazilian and filter out:
   - words not matching /^[a-z]+$/,
   - words shorter than 4 characters, and
   - words longer than 9 characters.
3. Sort the remaining words by their pt Wikipedia frequencies.
4. Take the top 30K words (just because after filtering we still get
   roughly the amount we need).
5. Filter out:
   - all words that are a suffix of any other word in the list.
   - less frequent words that differ only by the last character.
6. Take the 7776 most frequent words.
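
For readers who want to reproduce something similar, the steps above could be sketched roughly as below. This is not the script used for this PR: the function names (`count_frequencies`, `build_wordlist`), the tokenization, and the tie-breaking are assumptions made for illustration. It also assumes that "differ only by the last character" means words of equal length sharing everything but the final letter, and that the suffix filter runs before the last-character filter.

```python
import re
from collections import Counter


def count_frequencies(text):
    """Step 1 (simplified): count lowercase word occurrences in plain text.

    Assumption: the Wikipedia dump has already been reduced to plain text.
    """
    # [^\W\d_]+ matches runs of letters, including accented ones.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def build_wordlist(dictionary_words, freq, top_n=30000, size=7776):
    """Steps 2 to 6: filter, rank, prune and truncate the candidate words."""
    # Step 2: keep words of 4 to 9 characters made only of a-z
    # (fullmatch is used as the equivalent of /^[a-z]+$/).
    candidates = {
        w for w in dictionary_words
        if re.fullmatch(r"[a-z]+", w) and 4 <= len(w) <= 9
    }

    # Step 3: most frequent first; ties broken alphabetically (assumption).
    ranked = sorted(candidates, key=lambda w: (-freq.get(w, 0), w))

    # Step 4: keep only the top_n most frequent candidates.
    ranked = ranked[:top_n]

    # Step 5a: drop every word that is a proper suffix of another word.
    suffixes = {w[i:] for w in ranked for i in range(1, len(w))}
    ranked = [w for w in ranked if w not in suffixes]

    # Step 5b: among words that share everything but the last character,
    # keep only the most frequent one (ranked is already in that order).
    seen_stems = set()
    kept = []
    for w in ranked:
        stem = w[:-1]
        if stem in seen_stems:
            continue
        seen_stems.add(stem)
        kept.append(w)

    # Step 6: final truncation to the target list size.
    return kept[:size]
```

Calling `build_wordlist(words, freq)` with the lines of /usr/share/dict/brazilian as `words` and the counter from `count_frequencies` as `freq` would give a list of at most 7776 entries. That target is 6^5, the number of outcomes of five six-sided dice rolls.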

No further curation was made.


ulif · 2019-05-28T23:34:01Z

Nice, thank you!

drebs force-pushed the wordlist-pt-br branch from 08d75e4 to a678f3e · May 26, 2019 23:13

ulif merged commit 9c101be into ulif:master May 28, 2019
