Context-sensitive Spelling Correction

The goal is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.

Animation of model inference

The solution is to train an encoder-decoder transformer on a seq2seq spelling correction task.

Norvig's solution is quite simple and uses no neural approaches. This project grew out of the question of how much more effective modern neural approaches are than older probabilistic ones.
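For reference, the probabilistic baseline can be sketched roughly as follows. This is a simplified illustration in the spirit of Norvig's corrector, not the exact code used for the comparison: the toy `WORDS` corpus is an assumption made for the example, and only candidates at edit distance 1 are considered.

```python
from collections import Counter
import string

# Toy corpus, purely illustrative; a real baseline counts words in a large text.
WORDS = Counter("he asked with a deep voice and she asked again".split())
TOTAL = sum(WORDS.values())


def probability(word):
    """Relative frequency of the word in the corpus."""
    return WORDS[word] / TOTAL


def edits1(word):
    """All strings that are one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that occur in the corpus."""
    return {w for w in words if w in WORDS}


def correct(word):
    """Pick the most frequent known candidate, falling back to the word itself."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=probability)


print(" ".join(correct(w) for w in "he askd with a deep vocie".split()))
```

Each word is corrected in isolation here, so the surrounding context plays no role. That is the main limitation the seq2seq model is meant to address.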

The model was trained on two datasets:

Ag_news

AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources.

Data Splits

name	train	test
default	120000	7600

Google wellformed query

Google's query wellformedness dataset was created by crowdsourcing well-formedness annotations for 25,100 queries from the Paralex corpus.

Data Splits

name	train	valid
default	17500	3750

How to create spelling mistakes?

Each word is modified with a configurable probability (default: 0.4), which sets the likelihood that a modification is applied to that word. The modification types (all of edit distance 1) are deletion, transposition, replacement, and insertion. Each type is equally likely.

Here's a brief overview of what each modification does:

Deletion: Randomly deletes one character from the word, unless the word has only one character.
Transposition: Randomly selects a character in the word (except the first and last characters) and swaps it with its adjacent character.
Replacement: Randomly selects a character in the word and replaces it with a randomly chosen alphabet character.
Insertion: Randomly inserts a randomly chosen alphabet character at a random position in the word.
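The four modifications could be sketched as below. This is a minimal illustration rather than the code in `src`: the function names, the lowercase-only alphabet, the choice to swap with the right-hand neighbour, and splitting lines on whitespace are all assumptions.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumption: lowercase Latin letters only


def delete_char(word, rng):
    """Delete one random character, unless the word has a single character."""
    if len(word) <= 1:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def transpose_chars(word, rng):
    """Swap a random interior character with its right-hand neighbour."""
    if len(word) < 3:  # no interior character to pick
        return word
    i = rng.randrange(1, len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def replace_char(word, rng):
    """Replace one random character with a random letter."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def insert_char(word, rng):
    """Insert a random letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(ALPHABET) + word[i:]


MODIFICATIONS = [delete_char, transpose_chars, replace_char, insert_char]


def corrupt_word(word, rng, prob=0.4):
    """With probability `prob`, apply one modification chosen uniformly at random."""
    if rng.random() < prob:
        return rng.choice(MODIFICATIONS)(word, rng)
    return word


def corrupt_line(line, rng, prob=0.4):
    """Corrupt each whitespace-separated word of a line independently."""
    return " ".join(corrupt_word(w, rng, prob) for w in line.split())


rng = random.Random(0)  # fixed seed so the run is reproducible
print(corrupt_line("he asked with a deep voice", rng))
```

Pairing each corrupted line with its original gives (input, target) examples for seq2seq training. Note that a replacement may pick the same letter again, so a "modified" word is occasionally unchanged.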

Only edit distance 1 is used because edit distance 2 can cause large deviations from the original word, and edit distance 1 is the most common type of misspelling.

Metrics

Solution	Recall	Precision
Norvig solution	0.844	0.895
t5-small spellchecker	0.917	0.948
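This README does not spell out how recall and precision are computed. One common word-level convention is sketched below, as an assumption rather than the exact evaluation code: recall is the share of misspelled words that were fixed, and precision is the share of changed words that ended up correct. The function name and the requirement that all three lines have the same number of words are illustrative choices.

```python
def precision_recall(noisy, predicted, reference):
    """Word-level precision and recall for one corrected line.

    Assumes the three lines split into the same number of words.
    """
    n, p, r = noisy.split(), predicted.split(), reference.split()
    if not (len(n) == len(p) == len(r)):
        raise ValueError("lines must contain the same number of words")
    errors = sum(a != c for a, c in zip(n, r))    # words that were misspelled
    changes = sum(a != b for a, b in zip(n, p))   # words the model touched
    fixed = sum(a != c and b == c for a, b, c in zip(n, p, r))  # correct fixes
    precision = fixed / changes if changes else 1.0
    recall = fixed / errors if errors else 1.0
    return precision, recall


print(precision_recall(
    "he askd with a deep vocie",
    "he asked with a deep vocie",
    "he asked with a deep voice",
))
```

In this example one of two misspellings is fixed and nothing else is changed, so precision is 1.0 and recall is 0.5.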

Further improvement

Increase the edit distance depending on the length of the word.
Train on more data.
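The first idea could look roughly like this. It is only a sketch: the length threshold of 8, the helper names, and the use of insertion as the only edit type (to keep the example short) are all hypothetical.

```python
import random
import string


def max_edits_for(word, long_word_length=8):
    """Hypothetical rule: allow up to 2 edits for long words, 1 otherwise."""
    return 2 if len(word) >= long_word_length else 1


def random_insert(word, rng):
    """Insert a random lowercase letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]


def corrupt_scaled(word, rng):
    """Apply between 1 and max_edits_for(word) random insertions."""
    n_edits = rng.randint(1, max_edits_for(word))
    for _ in range(n_edits):
        word = random_insert(word, rng)
    return word


rng = random.Random(0)
print(corrupt_scaled("voice", rng), corrupt_scaled("correction", rng))
```

Short words still receive a single edit, so they stay recognisable, while longer words can absorb a second edit without drifting too far from the original.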

Example

You can try your own example using the HuggingFace Inference API.

he askd with a deep vocie -> he asked with a deep voice
