Improve pt-br wordlist #63

Merged (1 commit) on May 28, 2019

@drebs (Contributor) commented May 26, 2019

The pt-br wordlist was first introduced in 7743ed5. This version differs
from that one as follows:

  • 9-character words are now included.
  • Suffix removal is done after accounting for popularity.
  • Less frequent words that differ from a more frequent word only in
    the last character are removed.

The current pt-br wordlist was generated as follows:

  1. Download a dump of Portuguese Wikipedia pages, process all pages,
    and determine the frequency of each word.
  2. Start from /usr/share/dict/brazilian and filter out:
    • words not matching /^[a-z]+$/,
    • words shorter than 4 characters, and
    • words longer than 9 characters.
  3. Sort remaining words using pt Wikipedia frequencies.
  4. Take the top 30K words (an arbitrary cutoff: after the filtering in
    step 5 it still leaves roughly the number of words we need).
  5. Filter out:
    • all words that are a suffix of any other word in the list.
    • less frequent words that differ only by the last character.
  6. Take the 7776 most frequent words.

No further curation was done.
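
The steps above could be sketched roughly as follows. This is not the script used for this PR, only a minimal illustration under a few assumptions: the Wikipedia word frequencies from step 1 are taken as a ready-made mapping `freq`, "differ only in the last character" is read as "same length and same leading characters", and the two filters of step 5 are applied in the order listed. All function names are hypothetical.

```python
import re

# Equivalent to /^[a-z]+$/ when used with fullmatch().
WORD_RE = re.compile(r"[a-z]+")


def basic_filter(words, min_len=4, max_len=9):
    """Step 2: keep lowercase ASCII words of 4 to 9 characters."""
    return [w for w in words
            if WORD_RE.fullmatch(w) and min_len <= len(w) <= max_len]


def rank_by_frequency(words, freq):
    """Step 3: most frequent first; ties broken alphabetically (assumption)."""
    return sorted(set(words), key=lambda w: (-freq.get(w, 0), w))


def drop_suffix_words(ranked):
    """Step 5a: drop every word that is a proper suffix of another word."""
    suffixes = {w[i:] for w in ranked for i in range(1, len(w))}
    return [w for w in ranked if w not in suffixes]


def drop_last_char_variants(ranked):
    """Step 5b: among words sharing everything but the last character,
    keep only the most frequent one (input must be sorted by frequency)."""
    seen_stems = set()
    kept = []
    for w in ranked:
        stem = w[:-1]
        if stem not in seen_stems:
            seen_stems.add(stem)
            kept.append(w)
    return kept


def build_wordlist(dict_words, freq, top_n=30_000, size=7776):
    """Steps 2 to 6 chained together."""
    candidates = rank_by_frequency(basic_filter(dict_words), freq)[:top_n]
    candidates = drop_suffix_words(candidates)
    candidates = drop_last_char_variants(candidates)
    return candidates[:size]
```

In this sketch, `dict_words` would be the lines of /usr/share/dict/brazilian and `freq` a word-to-count mapping built from the Wikipedia dump. Note that the suffix filter ignores frequency: a word is dropped whenever it is a suffix of any remaining candidate, even a less frequent one.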


@ulif merged commit 9c101be into ulif:master on May 28, 2019

@ulif (Owner) commented May 28, 2019

Nice, thank you!
