Ouiche is a fast spell checker in C++ with a Radix Tree and a dynamic Damerau-Levenshtein table
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
src
subject
test
.gitattributes
.gitignore
AUTHORS
CMakeLists.txt
LICENSE
Makefile
README
README.md
configure

README.md

Ouiche

Ouiche is a fast spell-checker in C++, designed to be efficient by using a compact Radix Trie with a dynamic Damerau-Levenshtein distance to compute the word suggestions.

Peter — Et moi pour les gonzesses je suis super d’accord avec toi. Mais pour la bouffe je vois pas ce que tu veux dire. Tu aurais envie de manger quoi exactement ?

Steven — Ben je sais pas, par exemple une quiche lorraine.

Peter — Une ouiche.

Steven — Quoi ?

Peter — On dit « une ouiche lorraine ».

Steven — Tu es sûr ?

— La Classe Américaine - http://ouich.es/#ouiche

Build instructions

To compile Ouiche, use:

$ ./configure
$ make

This will create two binaries in the main folder, TextMiningCompiler and TextMiningApp.

Ouiche is released under the Beerware License. See the LICENSE file for more information.

Usage

  • TextMiningCompiler is used to compile the Radix Trie from a dictionary
  • TextMiningApp takes the compiled dictionary and correct the words given in the standard input.

Example

./TextMiningCompiler words.txt dict.bin
./TextMiningApp dict.bin
> approx 0 cats
[{"word":"cats","freq":815877,"distance":0}]
> approx 1 bachibousouk
[{"word":"bachibouzouk","freq":8998,"distance":1}]
> approx 2 analphabaite
[{"word":"analphabete","freq":84674,"distance":2}]
> approx 3 aoviaiateur
[{"word":"aviateur","freq":194553,"distance":3},
 {"word":"naviagateur","freq":530,"distance":3},
 {"word":"affiliateur","freq":382,"distance":3},
 {"word":"avigateur","freq":336,"distance":3}]

FAQ

What are the main design choices of ouiche?

The compiler reads the words from the input and build a dynamic Radix Trie. This Trie is stored on the heap, which allows us to add words during the runtime. Once this dynamic tree is built, we compact it so that it is represented contiguously in memory. This reduces its memory usage since we're only using offsets instead of raw pointers. It also allows for a faster deserialization since we only mmap the file in memory and use the structure as-is.

The app uses mmap(2) to fetch the trie in memory. Then, for every word asked, it constructs a Damerau-Levenshtein table with this words. This dynamic table is able to be "fed" with a character, and "rollbacked" to a previous state. This allows us to feed the characters of the trie as they come, until the DL table has no way to output a distance with a smaller size than the maximum distance, in which case we try another branch of the trie by resetting the DL table to the previous distance.

For instance, with a radix trie like this one:

+---+  chien   +-------+
| 5 | <------- |   1   |
+---+          +-------+
                 |
                 | nich
                 v
+---+  ien     +-------+
| 4 | <------- |   2   |
+---+          +-------+
                 |
                 | e
                 v
               +-------+
               |   3   |
               +-------+

We would get this table for the first word:

      c  h  i  e  n
   0  1  2  3  4  5
n  1  1  2  3  4 oo
i  2  2  2  2  3  4
c  3  2  3  3  3  4
h  4  3  2  3  4  4
e  5 oo  3  3  3  4

Then we rollback to size 4:

      c  h  i  e  n
   0  1  2  3  4  5
n  1  1  2  3  4 oo
i  2  2  2  2  3  4
c  3  2  3  3  3  4
h  4  3  2  3  4  4

Then we feed the other characters:

      c  h  i  e  n
   0  1  2  3  4  5
n  1  1  2  3  4 oo
i  2  2  2  2  3  4
c  3  2  3  3  3  4
h  4  3  2  3  4  4
i  5 oo  3  2  3  4
e  6 oo oo  3  2  3
n  7 oo oo oo  3  2

Then we rollback to 0, and we feed the other branch:

      c  h  i  e  n
   0  1  2  3  4  5
c  1  0  1  2  3 oo
h  2  1  0  1  2  3
i  3  2  1  0  1  2
e  4  3  2  1  0  1
n  5 oo  3  2  1  0

Note that we don't have to compute the values of all the cases, the ones marked with a oo are not necessary since we do have a maximum distance, hence we only need to compute the "diagonal window" of size 2 * max_dist + 1.

With this table, we are able to efficiently get the distances without having to compute useless data.

How was ouiche tested?

Ouiche was tested using both unit testing for the Damerau-Levenshtein distance and manual testing using some tool binaries.

We performed some diffs of our output with the reference program to ensure the results were correct.

Have you found cases where the Damerau-Levenshtein distance is not accurate?

There are a lot of cases where this distance is not adapted. For instance, it does not take into account the position of letters on the keyboard, therefore it could not detect that "piocje" is like "ouiche" with the right hand shifted to the right. There are also cases where using the default weights for all the editions is not really accurate: a transposition of two adjacent characters is more common than a deletion: "golbal" probably means "global" and not "gobal".

Why did you implement a Radix Trie in your project?

A Radix Trie is a memory efficient trie, which allowed us to reduce the amount of indirections and access to the characters in an efficient manner. This also reduced the memory consumption of the program, when compared to a regular trie.

Given only a string, could we automatically find an accurate max distance?

We could guess a good distance according to the length of the input string. A value of length / 5 would give good results since it allows for a mistake every 5 characters.

Do you have any ideas for performance improvements?

We could probably have a more compact representation of the radix trie, since most of the bits are unused in the alphabet we're working on. We also could parallelize the search, since the queries are completely independant and never change the same structures, and have no shared state.

What is missing to your spell-checker for it to be state-of-the-art?

State-of-the art spell checkers are able to take the keyboard position of the letters, the grammar of the sentence, the position in the sentence, the punctuation and the odds of doing a specific mistake into account. We are really far from that in this basic implementation.