The goal is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.
Norvig's solution is quite simple and does not use neural approaches. This project grew out of the question of how much more effective modern neural approaches are than older probabilistic ones.
-
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources.
name | train | test |
---|---|---|
default | 120000 | 7600 |
-
Google's query wellformedness dataset was created by crowdsourcing well-formedness annotations for 25,100 queries from the Paralex corpus.
name | train | valid |
---|---|---|
default | 17500 | 3750 |
The approach is to generate modifications of a word, with an optional probability parameter (default: 0.4) that determines the likelihood of applying a modification to the word. The modification types (all edit distance 1) are deletion, transposition, replacement, and insertion, and each type is equally likely.
Here's a brief overview of what each modification does:
- Deletion: Randomly deletes one character from the word, unless the word has only one character.
- Transposition: Randomly selects a character in the word (except the first and last characters) and swaps it with its adjacent character.
- Replacement: Randomly selects a character in the word and replaces it with a randomly chosen alphabet character.
- Insertion: Randomly inserts a randomly chosen alphabet character at a random position in the word.
Only edit distance 1 is used because it is the most common type of misspelling, whereas edit distance 2 can cause large deviations from the original word.
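The four modifications described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the lowercase ASCII alphabet, the choice to swap with the right-hand neighbour in a transposition, and the handling of very short words are all assumptions made for the example.

```python
import random
import string

# Assumption: the "alphabet" is lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def delete_char(word, rng):
    # Single-character (or empty) words are returned unchanged.
    if len(word) <= 1:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def transpose_chars(word, rng):
    # Pick an interior character (not first, not last) and swap it with
    # its right-hand neighbour (assumption). Needs at least 3 characters.
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def replace_char(word, rng):
    # Replace one character with a random letter (may equal the original).
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def insert_char(word, rng):
    # Insert a random letter at any position, including both ends.
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(ALPHABET) + word[i:]


MODIFICATIONS = (delete_char, transpose_chars, replace_char, insert_char)


def misspell(word, probability=0.4, rng=random):
    # With the given probability, apply one equally likely modification.
    if rng.random() >= probability:
        return word
    return rng.choice(MODIFICATIONS)(word, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "he asked with a deep voice"
    print(" ".join(misspell(w, rng=rng) for w in sentence.split()))
```

Passing a seeded `random.Random` instance as `rng` makes the generated misspellings reproducible, which is convenient when building a fixed training set.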
Solution | Recall | Precision |
---|---|---|
Norvig solution | 0.844 | 0.895 |
t5-small spellchecker | 0.917 | 0.948 |
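The exact definition of recall and precision behind these numbers is not given here. One plausible word-level reading, shown purely as an assumption, is: recall is the share of misspelled words that were corrected to the reference word, and precision is the share of words changed by the model that ended up matching the reference. The sketch below uses a hypothetical `precision_recall` helper and assumes that the source, prediction, and reference sentences tokenize to aligned word positions.

```python
def precision_recall(sources, predictions, targets):
    # Assumed definitions (not confirmed by the project):
    #   recall    = correctly fixed words / misspelled words
    #   precision = correctly fixed words / words the model changed
    misspelled = changed = fixed = 0
    for src, pred, tgt in zip(sources, predictions, targets):
        # Assumption: whitespace tokens line up position by position.
        for s, p, t in zip(src.split(), pred.split(), tgt.split()):
            if s != t:
                misspelled += 1
            if p != s:
                changed += 1
                if p == t:
                    fixed += 1
    precision = fixed / changed if changed else 0.0
    recall = fixed / misspelled if misspelled else 0.0
    return precision, recall


if __name__ == "__main__":
    p, r = precision_recall(
        ["he askd with a deep vocie"],   # noisy input
        ["he asked with a deep vocal"],  # model output (one wrong fix)
        ["he asked with a deep voice"],  # reference
    )
    print(p, r)
```

Under these definitions a model that changes correct words lowers precision without affecting recall, while a model that leaves misspellings untouched lowers recall only.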
- Increase the edit distance depending on the length of the word.
- Train on more data.
You can try your own example using the Hugging Face Inference API:
he askd with a deep vocie -> he asked with a deep voice