Parse and Filter Dictionaries #9

Closed
SSoelvsten opened this issue Jul 4, 2022 · 0 comments
Labels
📁 dict Libre words...

Comments

SSoelvsten commented Jul 4, 2022

We need a Rust program that does the following:

  1. Parse a dictionary file.
  2. Filter out all words that (a) include illegal letters or (b) are too long.
  3. Filter out "bad" words.
  4. Add all the valid prefixes and suffixes to each word.
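
Steps 2 and 3 could be sketched roughly as below. This is only an illustration: the names `is_legal` and `filter_words` are made up, the legal alphabet is assumed to be the ASCII letters, and both the maximum length and the deny list are assumed to be supplied by the caller, since the issue does not fix any of these.

```rust
use std::collections::HashSet;

/// Step 2: reject words with letters outside the (assumed) alphabet a-z/A-Z
/// or with more than `max_len` letters.
fn is_legal(word: &str, max_len: usize) -> bool {
    let len = word.chars().count();
    len > 0 && len <= max_len && word.chars().all(|c| c.is_ascii_alphabetic())
}

/// Steps 2 and 3 combined: keep legal words that are not on the deny list.
/// The deny list is expected to hold lowercase entries.
fn filter_words<'a>(words: &[&'a str], bad: &HashSet<&str>, max_len: usize) -> Vec<&'a str> {
    words
        .iter()
        .copied()
        .filter(|w| is_legal(w, max_len))
        .filter(|w| !bad.contains(w.to_ascii_lowercase().as_str()))
        .collect()
}
```

Returning borrowed slices keeps the filter allocation-free per word, which may matter when sweeping a whole dictionary.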

To this end, we should reuse the .dic and .aff files that ship with LibreOffice or similar. We first need to parse the .aff file and create the finite state automata for all of its rules, skipping any rule that introduces non-alphabetic letters. Then we can go over the entire dictionary in a single sweep and produce a stream of all possible words.

.dic files

Each entry is of the form ([a-zA-Z]*)(/.*)?, i.e. a word optionally followed by a slash and a list of symbols. For example, the British dictionary includes

absurdity/MS

The M and S are symbols that refer to prefix and suffix rules in the .aff file.
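
A minimal sketch of splitting such an entry into the word and its symbols. The function name `parse_dic_line` is hypothetical, and the sketch assumes that every symbol is a single character (as in the example) and that anything after the first whitespace can be ignored.

```rust
/// Splits a `.dic` entry such as "absurdity/MS" into the word and its symbols.
/// Assumes single-character symbols; text after whitespace is ignored.
fn parse_dic_line(line: &str) -> (&str, Vec<char>) {
    let line = line.trim();
    match line.split_once('/') {
        Some((word, rest)) => {
            let flags = rest.chars().take_while(|c| !c.is_whitespace()).collect();
            (word, flags)
        }
        // No slash: the word carries no prefix or suffix symbols.
        None => (line, Vec::new()),
    }
}
```

Lines that are not words at all (for instance a leading entry count, if the file has one) would come out as a "word" without symbols and should then be rejected by the letter filter.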

.aff files

For each symbol A we have two or more lines. The first one is of the form

_FX A Y N

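where _FX is either PFX (prefix) or SFX (suffix), Y is the cross-product marker (Y or N, i.e. whether prefixes and suffixes may be combined), and the final N is the number of rule lines that follow for A. The rule lines can then look as follows.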

PFX A 0 re [^e] 
PFX A 0 re- e 

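That is, prefix re if the start of the word matches the regex [^e], i.e. if the whole word matches [^e].* and so does not begin with e. Otherwise, when the word begins with e, prefix re- instead. The 0 in the third field means that nothing is stripped from the word. A more interesting rule is the S suffix rule.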
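
The two kinds of lines described so far could be parsed along these lines. The type `AffLine` and the function `parse_aff_line` are illustrative names only, and the sketch assumes single-character symbols and that a rule line always carries a condition (so that four fields mean a header and five or more mean a rule).

```rust
/// One line of an affix block, under the field layout described above.
#[derive(Debug, PartialEq)]
enum AffLine {
    /// `_FX A Y N`: symbol, cross-product marker, number of rules that follow.
    Header { prefix: bool, flag: char, cross: bool, count: usize },
    /// `_FX A strip add condition`: one rewrite rule for the symbol.
    Rule { prefix: bool, flag: char, strip: String, add: String, cond: String },
}

/// Returns `None` for lines that are neither PFX nor SFX lines.
fn parse_aff_line(line: &str) -> Option<AffLine> {
    let f: Vec<&str> = line.split_whitespace().collect();
    let prefix = match *f.first()? {
        "PFX" => true,
        "SFX" => false,
        _ => return None,
    };
    let flag = f.get(1)?.chars().next()?;
    match f.len() {
        4 => Some(AffLine::Header {
            prefix,
            flag,
            cross: f[2] == "Y",
            count: f[3].parse().ok()?,
        }),
        n if n >= 5 => Some(AffLine::Rule {
            prefix,
            flag,
            // "0" in the strip field means "strip nothing".
            strip: if f[2] == "0" { String::new() } else { f[2].to_string() },
            add: f[3].to_string(),
            cond: f[4].to_string(),
        }),
        _ => None,
    }
}
```

Keeping the condition as a plain string here leaves the choice of automaton representation open for a later step.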

SFX S Y 9
SFX S y ies [^aeiou]y 
SFX S 0 s [aeiou]y 
SFX S 0 es [sxz] 
SFX S 0 es [cs]h 
SFX S 0 s [^cs]h 
SFX S 0 s [ae]u 
SFX S 0 x [ae]u 
SFX S 0 s [^ae]u 
SFX S 0 s [^hsuxyz]

Again, the first line gives the number of rules that follow. The next line says that for any word ending in y after a non-vowel, the y should be replaced with ies (e.g. academy turns into academies). The other rules append to the entire word, since their strip field is 0 and they do not remove any suffix of the given word.
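
To make the semantics of the A and S rules concrete, here is a rough sketch of applying already-parsed rules to a word. It is not the automaton-based design proposed above: conditions are checked position by position instead, only literal letters, `.`, and bracketed sets such as `[^aeiou]` are supported, and the names `Rule`, `compile`, `apply`, and `expand` are all hypothetical.

```rust
/// A single affix rule; field names are illustrative, not from the issue.
struct Rule<'a> {
    prefix: bool,   // true for PFX, false for SFX
    strip: &'a str, // letters removed from the word ("" when the file says 0)
    add: &'a str,   // letters added in their place
    cond: &'a str,  // condition such as "[^aeiou]y"
}

/// Splits a condition into one item per letter position:
/// (negated, letters of the set). "." accepts any letter.
fn compile(cond: &str) -> Vec<(bool, Vec<char>)> {
    let mut items = Vec::new();
    let mut chars = cond.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '[' => {
                let negated = chars.peek() == Some(&'^');
                if negated {
                    chars.next();
                }
                // take_while also consumes the closing ']'.
                let set: Vec<char> = chars.by_ref().take_while(|&x| x != ']').collect();
                items.push((negated, set));
            }
            '.' => items.push((true, Vec::new())), // "not in the empty set" = anything
            _ => items.push((false, vec![c])),
        }
    }
    items
}

/// Applies `rule` to `word` if the condition holds at the relevant end.
fn apply(rule: &Rule, word: &str) -> Option<String> {
    let items = compile(rule.cond);
    let letters: Vec<char> = word.chars().collect();
    if items.len() > letters.len() {
        return None;
    }
    // Prefix conditions look at the start of the word, suffix ones at the end.
    let window = if rule.prefix {
        &letters[..items.len()]
    } else {
        &letters[letters.len() - items.len()..]
    };
    let ok = items.iter().zip(window).all(|((neg, set), l)| set.contains(l) != *neg);
    if !ok {
        return None;
    }
    if rule.prefix {
        word.strip_prefix(rule.strip).map(|rest| format!("{}{}", rule.add, rest))
    } else {
        word.strip_suffix(rule.strip).map(|rest| format!("{}{}", rest, rule.add))
    }
}

/// Collects every word produced by the rules of one symbol.
fn expand(rules: &[Rule], word: &str) -> Vec<String> {
    rules.iter().filter_map(|r| apply(r, word)).collect()
}
```

With the S rules above, `expand` on academy yields only academies, because the final `[^hsuxyz]` rule rejects words ending in y. With the A rules, enter yields re-enter, which the non-alphabetic filter described earlier would then discard.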
