doc: Document Parsing RFC (#2682)

streetsidesoftware · Apr 9, 2022 · fe8e40b
1 parent d464fd2

Showing 2 changed files with 194 additions and 0 deletions.
diff --git a/rfc/rfc-0003 parsing files/README.md b/rfc/rfc-0003 parsing files/README.md
@@ -0,0 +1,123 @@
+# RFC Parsing files before spell checking
+
+In its current form, the spell checker doesn't understand the context of what it is spell checking.
+
+This RFC proposes to use a parser to add context.
+
+In its simplest form, a parser will ingest the document and output an iterable list of transformed text with the associated context.
+
+The parser is responsible for two things:
+
+1. Transform the document text
+1. Annotate the transformed text with context (scope).
+
+# Transformation
+
+Parsers transform the document text to prepare it for spell checking. This is a solution for the following issues:
+
+- [Markdown > ignore formatting when spell checking Markdown · Issue #2672 · streetsidesoftware/cspell](https://github.com/streetsidesoftware/cspell/issues/2672)
+- [Latex special character codes · Issue #361 · streetsidesoftware/cspell](https://github.com/streetsidesoftware/cspell/issues/361)
+
+By providing the context / scope of the transformed text, powerful configurations become possible; see [Context and Scope](#context-and-scope).
+In addition, parsers will enable the spell checker to work on a wider range of documents than the existing spell checker, possibly unlocking PDF and other proprietary formats.
+
+## Source Maps
+
+A SourceMap is used to map a piece of transformed text back to its original text.
+This is necessary in order to report the correct location of a spelling issue.
+
+An empty source map indicates that it was a 1:1 transformation.
+The values in a source map are number pairs (even, odd) relative to the beginning of each
+string segment.
+
+- even - offset in the source text
+- odd - offset in the transformed text
+
+Offsets start at 0.
+
+Example:
+
+- Original text: `Grand Caf\u00e9 Bj\u00f8rvika`
+- Transformed text: `Grand Café Bjørvika`
+- Map: [9, 9, 15, 10, 18, 13, 24, 14]
+
+**Map Explained:**
+
+| offset | original    | offset | transformed |
+| ------ | ----------- | ------ | ----------- |
+| 0-9    | `Grand Caf` | 0-9    | `Grand Caf` |
+| 9-15   | `\u00e9`    | 9-10   | `é`         |
+| 15-18  | ` Bj`       | 10-13  | ` Bj`       |
+| 18-24  | `\u00f8`    | 13-14  | `ø`         |
+| 24-29  | `rvika`     | 14-19  | `rvika`     |
+
+Notice that the starting `0, 0` and ending `29, 19` pairs are implied and do not need to be included in the map.
+
+<!--- cspell:ignore Bjørvika rvika --->
+
+# Context and Scope
+
+Parsers provide context by adding scope to the text they transform. This is similar to the scope added by colorizer parsers.
+The scope is used by the spell checker to select appropriate configuration.
+
+Scope is a list of strings from global to local:
+
+**Example Scopes**
+
+- `['source.ts', 'meta.var.expr.ts', 'string.quoted.single.ts', 'text.transformed']`
+- `'source.ts meta.var.expr.ts string.quoted.single.ts text.transformed'` - as a string separated by spaces
+- `'text'`
+- `'text.html.markdown meta.paragraph.markdown markup.bold.markdown'`
+- `'text.html.markdown markup.heading.markdown heading.1.markdown entity.name.section.markdown'`
+
+In a CSS-selector-like fashion, scope is used to apply matching configuration. This is a very powerful option: it allows the end user to apply different configuration based upon the context. For example, in an i18n translation file, the translation keys would use the code splitter and English, while the strings would be case sensitive and use the word splitter.
+
+This also addresses the following types of common requests:
+
+- To check only comments
+- To check only strings
+- To check only code
+- To check code in English and template strings in another language.
+
+```yaml
+documentSettings:
+  - when:
+      # Source code should ignore case and use the code splitter
+      scope: source, code
+    use:
+      splitter: code # the text will be divided based upon code splitting rules
+      caseSensitive: false
+  - when:
+      scope: text
+    use:
+      splitter: word
+      caseSensitive: true
+  - when:
+      scope: text.transformed
+    use:
+      splitter: word # the text will be divided upon word boundaries
+  - when:
+      scope: text.transformed.word
+    use:
+      splitter: none # the text will not be divided into words before searching the dictionary.
+  - when:
+      # Match against TypeScript and JSON double quote strings.
+      scope: source.ts string.quoted.double, string.quoted.double.json
+    use:
+      locale: 'es' # enable Spanish on strings
+      caseSensitive: true
+```
+
+# Delegation
+
+Some document types, like Markdown, contain sections that are best left to another parser. It should be possible to delegate that responsibility. For example, when the Markdown parser encounters a JavaScript code block, it returns the code block marked as needing further processing, with `.js` as a possible file extension and `javascript` as the file type.
+
+# Parsers as Plug-ins
+
+This RFC proposes that parsers are added as a plug-in with their own possible configuration. This allows for the spell checker to be easily extended to support new file types while also allowing 3rd parties to offer custom parsers overriding the built-in ones.
+
+# Parser Configuration
+
+Each parser should have a default configuration and, where appropriate, be configurable.
+
+For example, the Markdown parser could have a setting to strip formatting as asked for in [#2672 - treat `**R**peat` as `Repeat`](https://github.com/streetsidesoftware/cspell/issues/2672). Code parsers could have settings to ignore imports and variable declarations.
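
The "Context and Scope" section of the README above says scope selects configuration "in a CSS selector like fashion" but does not define the matching rules. The following is only a sketch of one possible interpretation, borrowed from TextMate-style scope selectors: commas separate alternatives, spaces mean "in this order, from global to local", and a selector part matches a scope name by dotted prefix. The names `matchesPart` and `matchesScope` are hypothetical and are not part of the RFC.

```typescript
/**
 * Hypothetical helper: does a single scope name (e.g. `string.quoted.double.ts`)
 * match a selector part (e.g. `string.quoted`)? Matching is by dotted prefix.
 */
function matchesPart(scopeName: string, part: string): boolean {
    return scopeName === part || scopeName.startsWith(part + '.');
}

/**
 * Hypothetical matcher: `selector` is a `when.scope` value from the YAML example,
 * `scope` is the list of scope names from global to local.
 * Assumption: each comma-separated alternative must match as an ordered
 * subsequence of the scope list.
 */
function matchesScope(selector: string, scope: string[]): boolean {
    return selector.split(',').some((alternative) => {
        const parts = alternative.trim().split(/\s+/).filter((p) => p.length > 0);
        if (parts.length === 0) return false;
        let next = 0; // index of the next selector part still to be matched
        for (const scopeName of scope) {
            if (next < parts.length && matchesPart(scopeName, parts[next])) next++;
        }
        return next === parts.length;
    });
}

const exampleScope = ['source.ts', 'meta.var.expr.ts', 'string.quoted.double.ts', 'text.transformed'];
const sourceOrCode = matchesScope('source, code', exampleScope);
const doubleQuoted = matchesScope('source.ts string.quoted.double, string.quoted.double.json', exampleScope);
const wholeWord = matchesScope('text.transformed.word', exampleScope);
```

Under these assumptions, `sourceOrCode` and `doubleQuoted` are `true` and `wholeWord` is `false`. How the spell checker should merge the settings when several `when` entries match the same text is left open by the README, so this sketch does not address it.
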
diff --git a/rfc/rfc-0003 parsing files/src/types.ts b/rfc/rfc-0003 parsing files/src/types.ts
@@ -0,0 +1,71 @@
+export interface ParsedText {
+    text: string;
+    start: number;
+    end: number;
+    /**
+     * The source map is used to support text transformations.
+     *
+     * See: {@link SourceMap}
+     */
+    map?: SourceMap;
+    delegate?: DelegateInfo;
+}
+
+export interface TextSegment {
+    text: string;
+    start: number;
+    end: number;
+}
+
+/**
+ * A SourceMap is used to map a piece of transformed text back to its original text.
+ * This is necessary in order to report the correct location of a spelling issue.
+ * An empty source map indicates that it was a 1:1 transformation.
+ *
+ * The values in a source map are number pairs (even, odd) relative to the beginning of each
+ * string segment.
+ * - even - offset in the source text
+ * - odd - offset in the transformed text
+ *
+ * Offsets start at 0.
+ *
+ * Example:
+ *
+ * - Original text: `Grand Caf\u00e9 Bj\u00f8rvika`
+ * - Transformed text: `Grand Café Bjørvika`
+ * - Map: [9, 9, 15, 10, 18, 13, 24, 14]
+ *
+ * | offset | original    | offset | transformed |
+ * | ------ | ----------- | ------ | ----------- |
+ * | 0-9    | `Grand Caf` | 0-9    | `Grand Caf` |
+ * | 9-15   | `\u00e9`    | 9-10   | `é`         |
+ * | 15-18  | ` Bj`       | 10-13  | ` Bj`       |
+ * | 18-24  | `\u00f8`    | 13-14  | `ø`         |
+ * | 24-29  | `rvika`     | 14-19  | `rvika`     |
+ *
+ * <!--- cspell:ignore Bjørvika rvika --->
+ */
+export type SourceMap = number[];
+
+/**
+ * DelegateInfo is used by a parser to delegate parsing a subsection of a document to
+ * another parser. The following information is used by the spell checker to match
+ * the parser.
+ */
+export interface DelegateInfo {
+    /**
+     * Proposed virtual file name including the extension.
+     * Example: `./README.md.js`
+     */
+    filename: string;
+    /**
+     * Proposed file extension
+     * Example: `.js`
+     */
+    extension: string;
+    /**
+     * Filetype to use
+     * Example: `javascript`
+     */
+    fileType: string;
+}
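
To make the `SourceMap` pairs documented in `types.ts` above more concrete, here is a minimal sketch of how an offset in the transformed text might be mapped back to the source text. The function name `toSourceOffset` is hypothetical, and two behaviours are assumptions rather than part of the RFC: text after the last pair is treated as 1:1, and an offset that falls inside a segment whose length changed is reported as the start of that segment.

```typescript
type SourceMap = number[];

/**
 * Hypothetical helper: map an offset in the transformed text back to an
 * offset in the source text. The map is a flat list of
 * (source offset, transformed offset) pairs; the starting pair (0, 0) is implied.
 */
function toSourceOffset(map: SourceMap | undefined, offset: number): number {
    if (!map || map.length === 0) return offset; // empty map: 1:1 transformation
    let srcBase = 0; // source offset where the current segment starts
    let dstBase = 0; // transformed offset where the current segment starts
    for (let i = 0; i + 1 < map.length; i += 2) {
        const src = map[i];
        const dst = map[i + 1];
        if (dst > offset) {
            // The offset lies inside the segment [dstBase, dst).
            const sameLength = src - srcBase === dst - dstBase;
            // Unchanged segment: keep the relative position.
            // Changed segment (assumption): report the segment start.
            return sameLength ? srcBase + (offset - dstBase) : srcBase;
        }
        srcBase = src;
        dstBase = dst;
    }
    // Past the last pair: assume the remaining text is 1:1.
    return srcBase + (offset - dstBase);
}

// The example from the SourceMap documentation above.
const exampleMap: SourceMap = [9, 9, 15, 10, 18, 13, 24, 14];
// `Café` occupies 6-10 in the transformed text.
const cafeStart = toSourceOffset(exampleMap, 6);
const cafeEnd = toSourceOffset(exampleMap, 10);
```

With the example map, `cafeStart` is 6 and `cafeEnd` is 15, which is the range of `Caf\u00e9` in the original text, so a spelling issue found in the transformed word could be reported against the right source location.
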