Merge 86f45b2 into a159ce2

streetsidesoftware · Jan 12, 2022 · c398750
2 parents a159ce2 + 86f45b2
commit c398750
Showing 3 changed files with 134 additions and 4 deletions.
diff --git a/rfc/rfc-0001 suggestions/README.md b/rfc/rfc-0001 suggestions/README.md
@@ -1,5 +1,12 @@
 # Suggestion Lists
 
+Suggestion lists are useful for addressing common mistakes, as noted in [Wikipedia:Lists of common misspellings - Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings).
+
+The idea is to also make it easier for companies to define a list of forbidden terms with a list of suggested replacements.
+
+Below is a proposal for two ways to define suggestions.
+The intention is to implement both. Since `flagWords` is easier to do, it might get done first.
+
 ## Flag Words
 
 The idea is to enhance the definition of `flagWords` to allow for suggestions.
@@ -78,14 +85,23 @@ amature->armature, amateur
 boaut->boat, bout, about
 ```
 
+<!--- cspell:enable -->
+
 Validation:
 
 ```regexp
-/^(\p{L}+)\s*->\s*(\p{L}+)(?:,\s*(\p{L}+))*$/gmu
+/^((?:\p{L}\p{M}*)+)\s*->\s*((?:\p{L}\p{M}*)+)(?:,\s*((?:\p{L}\p{M}*)+))*$/gmu
 ```
 
 ![image](https://user-images.githubusercontent.com/3740137/149126237-455c6674-ed1f-4dd8-8136-083531d2c63b.png)
 
-<!--- cspell:enable -->
+### Dictionary Definition
+
+```yaml
+dictionaryDefinitions:
+  - name: en-us-suggestions
+    path: ./en-us-suggestions.txt.gz
+    type: suggestions
+```
 
 <!--- cspell:ignore acadmic accension -->
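
Note on the validation pattern in the README change above: in JavaScript a repeated capture group keeps only its last match, so the pattern can validate a line but cannot by itself return every suggestion. The following is a minimal sketch, not part of cspell, of one way to both validate and split a `word->suggestion, suggestion` line. The names `parseSuggestionLine` and `SuggestionEntry` are hypothetical and used only for illustration.

```typescript
// Hypothetical helper, not part of cspell: validates one line of a
// suggestions file and splits it into the word and its suggestions.

// A "word" is one or more letters, each optionally followed by combining marks.
const wordPattern = String.raw`(?:\p{L}\p{M}*)+`;

// Capture the whole comma-separated list as a single group, then split it,
// because a repeated capture group would only keep the last suggestion.
const linePattern = new RegExp(
    String.raw`^(${wordPattern})\s*->\s*(${wordPattern}(?:,\s*${wordPattern})*)$`,
    'u'
);

interface SuggestionEntry {
    word: string;
    suggestions: string[];
}

function parseSuggestionLine(line: string): SuggestionEntry | undefined {
    const match = linePattern.exec(line.trim());
    if (!match) return undefined; // the line does not pass validation
    const [, word, list] = match;
    return { word, suggestions: list.split(/,\s*/u) };
}
```

For example, `parseSuggestionLine('boaut->boat, bout, about')` yields the word `boaut` with the three suggestions `boat`, `bout`, and `about`, while a line without `->` yields `undefined`.
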
diff --git a/rfc/rfc-0001 suggestions/src/config-definitions.ts b/rfc/rfc-0001 suggestions/src/config-definitions.ts
@@ -1,6 +1,17 @@
-import type { DictionaryDefinitionPreferred, BaseSetting } from '@cspell/cspell-types';
+import type { DictionaryDefinitionPreferred, BaseSetting, DictionaryId, DictionaryPath } from '@cspell/cspell-types';
 
-export interface DictionaryDefinitionSuggestions extends Omit<DictionaryDefinitionPreferred, 'type'> {
+interface ChangesToBase {
+    type: 'suggestions' | 'words';
+}
+
+export interface DictionaryDefinitionSuggestions extends Omit<DictionaryDefinitionPreferred, 'type'>, ChangesToBase {
+    /** The name of the dictionary */
+    name: DictionaryId;
+
+    /** Path to the file. */
+    path: DictionaryPath;
+
+    /** The type of dictionary */
     type: 'suggestions';
 }
 

diff --git a/rfc/rfc-0002 improve dictionary suggestions/README.md b/rfc/rfc-0002 improve dictionary suggestions/README.md
@@ -0,0 +1,103 @@
+# Improving Dictionary Suggestions
+
+The `cspell-trie-lib` package currently handles making suggestions on dictionaries.
+
+It walks the trie using a modified weighted Levenshtein algorithm. The weights are currently tuned for English and do not lend themselves well to other languages. See [RFC | Ways to improve dictionary suggestions. · Issue #2249 · streetsidesoftware/cspell](https://github.com/streetsidesoftware/cspell/issues/2249).
+
+This proposal is to allow weights to be defined with the `DictionaryDefinition`.
+
+## Defining Weights / Costs
+
+There are 4 types of edit operations:
+
+- insert - inserts a character
+- delete - deletes a character
+- replace - replaces a character
+- swap - swaps two adjacent characters. Swap is singled out because it is a common spelling mistake; otherwise it would count as 2 edits.
+
+In the current implementation: `1 edit = 100 cost`. This was done to allow for partial edits without the need for decimal numbers.
+
+| Field       | Description                                                                         |
+| ----------- | ----------------------------------------------------------------------------------- |
+| map         | For conciseness, a `map` can contain multiple sets separated by <code>&#124;</code> |
+| insert      | The cost to insert a character from the map into a word                             |
+| delete      | The cost to delete a character from the map from a word                             |
+| replace     | The cost to replace a character in a set with another from the same set             |
+| swap        | The cost to swap any two characters in the same set                                 |
+| description | A comment about why a cost is defined                                               |
+
+#### Example of costs:
+
+```yaml
+costs:
+  - description: Accented Vowel Letters
+    map: 'aáâäãå|eéêë|iíîï|oóôöõ|uúûü|yÿ'
+    insert: 50
+    delete: 50
+    replace: 10
+  - description: Vowels
+    map: 'aáâäãåeéêëiíîïoóôöõuúûüyÿ'
+    insert: 50
+    delete: 50
+    replace: 25 # Replacing one vowel with another is cheap
+    swap: 25 # Swapping vowels is cheap
+  - description: Multi Character example
+    map: 'ß(ss)|œ(ae)|f(ph)'
+    replace: 10
+  - description: Appending / Removing Accent Marks
+    map: "\u0651" # Shadda (double quotes so that YAML processes the escape)
+    insert: 10
+    delete: 10
+  - description: Arabic Vowels
+    map: "\u064f\u0648\u064e\u0627\u0650\u064a\u0652" # Damma, Wāw, Fatha, Alif, Kasra, Ya', Sukūn
+    insert: 20
+    delete: 20
+    replace: 20
+  - description: Keyboard Adjacency
+    map: 'qwas|aszx|wesd|sdxc|erdf|dfcv|rtfg|fgvb|tygh|ghbn|yuhj|hjnm|uijk|jkm|iokl|opl'
+    replace: 50 # make it cheaper to replace near-by keyboard characters
+```
+
+<!---
+  cspell:ignore aáâäãå eéêë iíîï oóôöõ uúûü yÿ
+  cspell:ignore aáâäãåeéêëiíîïoóôöõuúûüyÿ
+  cspell:ignore Shadda Damma Fatha Alif Kasra Sukūn
+  cspell:ignore aszx dfcv erdf fgvb ghbn hjnm iokl qwas rtfg sdxc tygh uijk wesd yuhj
+-->
+
+# The Algorithm
+
+The current algorithm uses a Levenshtein-like algorithm to calculate the edit cost. This is different from Hunspell, which tries to morph the misspelled word in many possible ways to see if it exists in the dictionary. That approach can be very expensive, therefore it is not used.
+
+## A two step process
+
+The current suggestion mechanism comes up with a list of suggestions in a single pass.
+
+The proposal here is to change the algorithm slightly to come up with a coarse-grained list very quickly and then to refine it with a more expensive weighted algorithm.
+
+Note: the coarse-grained algorithm needs to be very fast because it needs to cull through millions of nodes in the trie. It should NOT visit all possible words in a trie, because the word compounding allowed by some languages effectively means that the number of words is infinite.
+
+The current algorithm walks the trie depth first, deciding not to go deeper when the `edit_count` exceeds the `max_edit_count`. Deeper in this case could also mean linking to a compound root. As it walks, the `max_edit_count` is adjusted based upon the candidates found. Quickly finding a group of candidates can help reduce the search time, which is why a depth-first search is preferred.
+
+# Notes
+
+## Unicode and Accents
+
+The current dictionary compiler normalizes all unicode strings using `.normalize('NFC')`[^1]. It might be necessary to allow storage of decomposed characters for
+
+Even though it is composed by default, the dictionaries still contain accent marks.
+
+Node REPL example:
+
+```js
+> x = 'ą́'
+'ą́'
+> x.split('')
+[ 'a', '̨', '́' ]
+> x.normalize('NFC').split('')
+[ 'ą', '́' ]
+```
+
+Notice that even though `ą́` was normalized, it still contained a separate accent mark.
+
+[^1]: [String.prototype.normalize - JavaScript | MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize)
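
Note on the "Defining Weights / Costs" section of the RFC above: the following is a minimal sketch, not the `cspell-trie-lib` implementation, of how a `map` string might be split into sets and used to look up a replace cost. The names `CostDefinition`, `parseMap`, `replaceCost`, and `exampleCosts` are hypothetical. Picking the cheapest applicable definition when several overlap is an assumption; the RFC does not say how overlapping definitions combine.

```typescript
// Hypothetical shape of one entry in `costs`, mirroring the fields in the RFC table.
interface CostDefinition {
    description?: string;
    map: string;
    insert?: number;
    delete?: number;
    replace?: number;
    swap?: number;
}

// From the RFC: 1 edit = 100 cost.
const BASE_COST = 100;

// Split a map into sets on `|`. Within a set, `(ss)` is one multi-character
// item; anything else is a single code point.
function parseMap(map: string): string[][] {
    return map
        .split('|')
        .map((set) => Array.from(set.matchAll(/\(([^)]+)\)|./gu), (m) => m[1] ?? m[0]));
}

// Assumption: when several definitions apply, the cheapest one wins.
function replaceCost(a: string, b: string, defs: CostDefinition[]): number {
    if (a === b) return 0;
    let best = BASE_COST;
    for (const def of defs) {
        if (def.replace === undefined) continue;
        const inSameSet = parseMap(def.map).some((set) => set.includes(a) && set.includes(b));
        if (inSameSet) best = Math.min(best, def.replace);
    }
    return best;
}

// A subset of the example costs from the RFC.
const exampleCosts: CostDefinition[] = [
    { description: 'Accented Vowel Letters', map: 'aáâäãå|eéêë|iíîï|oóôöõ|uúûü|yÿ', insert: 50, delete: 50, replace: 10 },
    { description: 'Vowels', map: 'aáâäãåeéêëiíîïoóôöõuúûüyÿ', insert: 50, delete: 50, replace: 25, swap: 25 },
    { description: 'Multi Character example', map: 'ß(ss)|œ(ae)|f(ph)', replace: 10 },
];
```

With these definitions, `replaceCost('a', 'e', exampleCosts)` is 25 (both are vowels but belong to different accent sets), `replaceCost('f', 'ph', exampleCosts)` is 10, and `replaceCost('a', 'z', exampleCosts)` falls back to the base cost of 100.
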