Merge 86f45b2 into a159ce2
Jason3S committed Jan 12, 2022
2 parents a159ce2 + 86f45b2 commit c398750
Showing 3 changed files with 134 additions and 4 deletions.
20 changes: 18 additions & 2 deletions rfc/rfc-0001 suggestions/README.md
@@ -1,5 +1,12 @@
# Suggestion Lists

Suggestion lists are useful for addressing common mistakes, such as those noted in [Wikipedia:Lists of common misspellings - Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings).

The idea is also to make it easier for companies to define a list of forbidden terms together with suggested replacements.

Below is a proposal for two ways to define suggestions.
The intention is to implement both; since `flagWords` is easier to do, it might get done first.

## Flag Words

The idea is to enhance the definition of `flagWords` to allow for suggestions.
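As an illustration only (the exact configuration shape is an assumption of this sketch, not a settled API), an enhanced `flagWords` list might look like:

```yaml
# Hypothetical enhanced flagWords: plain entries and inline suggestion entries.
flagWords:
  - crap                       # forbidden word, no suggestions (current behavior)
  - boaut->boat, bout, about   # forbidden word with suggested replacements
```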
@@ -78,14 +85,23 @@ amature->armature, amateur
boaut->boat, bout, about
```

<!--- cspell:enable -->

Validation:

```regexp
/^(\p{L}+)\s*->\s*(\p{L}+)(?:,\s*(\p{L}+))*$/gmu
/^((?:\p{L}\p{M}*)+)\s*->\s*((?:\p{L}\p{M}*)+)(?:,\s*((?:\p{L}\p{M}*)+))*$/gmu
```

![image](https://user-images.githubusercontent.com/3740137/149126237-455c6674-ed1f-4dd8-8136-083531d2c63b.png)
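For illustration, a minimal JavaScript sketch of parsing a line in this format (the helper name `parseFlagWordLine` is invented here; note that a repeated capture group in a JavaScript regexp retains only its last match, so this sketch captures the whole suggestion list as one group and splits it):

```javascript
// Hypothetical helper: parse one `word->sug1, sug2` line into its parts.
const LINE_RE = /^((?:\p{L}\p{M}*)+)\s*->\s*((?:\p{L}\p{M}*)+(?:,\s*(?:\p{L}\p{M}*)+)*)$/u;

function parseFlagWordLine(line) {
  const m = line.match(LINE_RE);
  if (!m) return undefined; // not a valid suggestion line
  const [, word, rest] = m;
  return { word, suggestions: rest.split(',').map((s) => s.trim()) };
}

console.log(parseFlagWordLine('boaut->boat, bout, about'));
// → { word: 'boaut', suggestions: [ 'boat', 'bout', 'about' ] }
```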

<!--- cspell:enable -->
### Dictionary Definition

```yaml
dictionaryDefinitions:
  - name: en-us-suggestions
    path: ./en-us-suggestions.txt.gz
    type: suggestions
```

<!--- cspell:ignore acadmic accension -->
15 changes: 13 additions & 2 deletions rfc/rfc-0001 suggestions/src/config-definitions.ts
@@ -1,6 +1,17 @@
import type { DictionaryDefinitionPreferred, BaseSetting } from '@cspell/cspell-types';
import type { DictionaryDefinitionPreferred, BaseSetting, DictionaryId, DictionaryPath } from '@cspell/cspell-types';

export interface DictionaryDefinitionSuggestions extends Omit<DictionaryDefinitionPreferred, 'type'> {
interface ChangesToBase {
    type: 'suggestions' | 'words';
}

export interface DictionaryDefinitionSuggestions extends Omit<DictionaryDefinitionPreferred, 'type'>, ChangesToBase {
    /** The name of the dictionary */
    name: DictionaryId;

    /** Path to the file. */
    path: DictionaryPath;

    /** The type of dictionary */
    type: 'suggestions';
}
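For illustration, a plain-data value matching this interface (using the values from the YAML example in the README; the variable name is invented here):

```javascript
// Sketch: a dictionary definition conforming to DictionaryDefinitionSuggestions.
const suggestionsDictionary = {
  name: 'en-us-suggestions',
  path: './en-us-suggestions.txt.gz',
  type: 'suggestions', // distinguishes this dictionary from ordinary word lists
};

console.log(suggestionsDictionary.type);
// → suggestions
```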

103 changes: 103 additions & 0 deletions rfc/rfc-0002 improve dictionary suggestions/README.md
@@ -0,0 +1,103 @@
# Improving Dictionary Suggestions

The `cspell-trie-lib` package currently handles making suggestions on dictionaries.

It walks the trie using a modified weighted Levenshtein algorithm. The weights are currently biased towards English and do not lend themselves well to other languages. See [RFC | Ways to improve dictionary suggestions. · Issue #2249 · streetsidesoftware/cspell](https://github.com/streetsidesoftware/cspell/issues/2249).

This proposal is to allow weights to be defined with the `DictionaryDefinition`.

## Defining Weights / Costs

There are 4 types of edit operations:

- insert - inserts a character
- delete - deletes a character
- replace - replaces a character
- swap - swaps two adjacent characters. Swap is singled out because it is a common spelling mistake; otherwise it would be counted as 2 edits.

In the current implementation: `1 edit = 100 cost`. This was done to allow for partial edits without the need for decimal numbers.

| Field | Description |
| ----------- | ----------------------------------------------------------------------------------- |
| map | For conciseness, a `map` can contain multiple sets separated by <code>&#124;</code> |
| insert | The cost to insert a character from the map into a word |
| delete | The cost to delete a character from the map from a word |
| replace | The cost to replace a character in a set with another from the same set |
| swap | The cost to swap any two characters in the same set |
| description | A comment about why a cost is defined |

#### Example of costs:

```yaml
costs:
  - description: Accented Vowel Letters
    map: 'aáâäãå|eéêë|iíîï|oóôöõ|uúûü|yÿ'
    insert: 50
    delete: 50
    replace: 10
  - description: Vowels
    map: 'aáâäãåeéêëiíîïoóôöõuúûüyÿ'
    insert: 50
    delete: 50
    replace: 25 # Replacing one vowel with another is cheap
    swap: 25 # Swapping vowels is cheap
  - description: Multi Character example
    map: 'ß(ss)|œ(ae)|f(ph)'
    replace: 10
  - description: Appending / Removing Accent Marks
    map: '\u0651' # Shadda
    insert: 10
    delete: 10
  - description: Arabic Vowels
    map: '\u064f\u0648\u064e\u0627\u0650\u064a\u0652' # Damma, Wāw, Fatha, Alif, Kasra, Ya', Sukūn
    insert: 20
    delete: 20
    replace: 20
  - description: Keyboard Adjacency
    map: 'qwas|aszx|wesd|sdxc|erdf|dfcv|rtfg|fgvb|tygh|ghbn|yuhj|hjnm|uijk|jkm|iokl|opl'
    replace: 50 # make it cheaper to replace nearby keyboard characters
```
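For illustration, a sketch of how such cost entries might be consulted when computing a replace cost (simplified to single-character sets; the names `costMaps`, `replaceCost`, and `DEFAULT_COST` are invented here, and the vowel map is reduced to plain vowels for brevity):

```javascript
// Sketch: find the cheapest replace cost for a pair of characters.
// A pair gets a discounted cost when both characters appear in the same set
// of some map; otherwise it costs a full edit (1 edit = 100 cost).
const DEFAULT_COST = 100;
const costMaps = [
  { map: 'aáâäãå|eéêë|iíîï|oóôöõ|uúûü|yÿ', replace: 10 }, // accented letter groups
  { map: 'aeiouy', replace: 25 }, // any vowel for any other vowel
];

function replaceCost(a, b) {
  let best = DEFAULT_COST;
  for (const { map, replace } of costMaps) {
    for (const set of map.split('|')) {
      if (set.includes(a) && set.includes(b)) best = Math.min(best, replace);
    }
  }
  return best;
}

console.log(replaceCost('a', 'á'), replaceCost('a', 'e'), replaceCost('a', 'z'));
// → 10 25 100
```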

<!---
cspell:ignore aáâäãå eéêë iíîï oóôöõ uúûü yÿ
cspell:ignore aáâäãåeéêëiíîïoóôöõuúûüyÿ
cspell:ignore Shadda Damma Fatha Alif Kasra Sukūn
cspell:ignore aszx dfcv erdf fgvb ghbn hjnm iokl qwas rtfg sdxc tygh uijk wesd yuhj
-->

# The Algorithm

The current algorithm uses a Levenshtein-like algorithm to calculate the edit cost. This is different from Hunspell, which tries to morph the misspelled word in many possible ways to see if it exists in the dictionary. This can be very expensive, therefore it is not used.

## A two-step process

The current suggestion mechanism comes up with a list of suggestions in a single pass.

The proposal here is to change the algorithm slightly: first come up with a coarse-grained list very quickly, then refine it with the more expensive weighted algorithm.

Note: the coarse-grained algorithm needs to be very fast because it must cull through millions of nodes in the trie. It should NOT visit all possible words in a trie, because the word compounding allowed by some languages effectively means that the number of words is infinite.

The current algorithm walks the trie in a depth first manner deciding to not go deeper when the `edit_count` exceeds the `max_edit_count`. Deeper in this case could also mean linking to a compound root. As it walks, the `max_edit_count` is adjusted based upon the candidates found. Quickly finding a group of candidates can help reduce the search time, which is why a depth first search is preferred.
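The walk described above can be sketched as follows (greatly simplified: plain unweighted Levenshtein costs, a fixed `maxEdits` instead of one adjusted from candidates found, and no compound-root links; all names are invented for this illustration):

```javascript
// Build a tiny trie: each node has `children` and an optional `isWord` flag.
function buildTrie(words) {
  const root = { children: {} };
  for (const word of words) {
    let node = root;
    for (const ch of word) node = node.children[ch] ??= { children: {} };
    node.isWord = true;
  }
  return root;
}

// Depth-first walk: each trie edge extends one row of the Levenshtein table,
// and a branch is abandoned once every cell in its row exceeds maxEdits.
function suggest(trieRoot, word, maxEdits) {
  const results = [];
  const firstRow = Array.from({ length: word.length + 1 }, (_, i) => i);

  function walk(node, ch, prevRow, prefix) {
    const row = [prevRow[0] + 1];
    for (let i = 1; i <= word.length; i++) {
      const replace = prevRow[i - 1] + (word[i - 1] === ch ? 0 : 1);
      row.push(Math.min(row[i - 1] + 1, prevRow[i] + 1, replace));
    }
    if (node.isWord && row[word.length] <= maxEdits) {
      results.push({ word: prefix, cost: row[word.length] });
    }
    // Prune: if every cell exceeds maxEdits, no deeper word can recover.
    if (Math.min(...row) <= maxEdits) {
      for (const [c, child] of Object.entries(node.children)) {
        walk(child, c, row, prefix + c);
      }
    }
  }

  for (const [ch, child] of Object.entries(trieRoot.children)) {
    walk(child, ch, firstRow, ch);
  }
  return results.sort((a, b) => a.cost - b.cost);
}

console.log(suggest(buildTrie(['boat', 'bout', 'about', 'goat']), 'boaut', 1));
// → [ { word: 'boat', cost: 1 }, { word: 'bout', cost: 1 } ]
```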

# Notes

## Unicode and Accents

The current dictionary compiler normalizes all unicode strings using `.normalize('NFC')`[^1]. It might be necessary to allow storage of decomposed characters as well.

Even though it is composed by default, the dictionaries still contain accent marks.

Node REPL example:

```js
> x = 'ą́'
'ą́'
> x.split('')
[ 'a', '̨', '́' ]
> x.normalize('NFC').split('')
[ 'ą', '́' ]
```

Notice that even though `ą́` was normalized, it still contains a separate combining accent mark.

[^1]: [String.prototype.normalize - JavaScript | MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize)
