doc: Document Parsing RFC (#2682)
Jason3S committed Apr 9, 2022
1 parent d464fd2 commit fe8e40b
Showing 2 changed files with 194 additions and 0 deletions.
123 changes: 123 additions & 0 deletions rfc/rfc-0003 parsing files/README.md
@@ -0,0 +1,123 @@
# RFC Parsing files before spell checking

In its current form, the spell checker doesn't understand the context of what it is spell checking.

This RFC proposes to use a parser to add context.

In its simplest form, a parser will ingest the document and output an iterable list of transformed text with the associated context.

The parser is responsible for two things:

1. Transform the document text
1. Annotate the transformed text with context (scope).
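As a rough illustration of these two responsibilities, a parser could be modelled as something that takes the document and returns an iterable of text segments, each carrying its scope. The `Parser` and `ScopedText` names and the `scope` field below are assumptions made for this sketch; they are not part of the types proposed in this RFC.

```typescript
// Hypothetical shapes, for illustration only.
interface ScopedText {
    /** The (possibly transformed) text to spell check. */
    text: string;
    /** Start offset of the segment in the original document. */
    start: number;
    /** End offset (exclusive) of the segment in the original document. */
    end: number;
    /** Scope from global to local, e.g. ['text']. */
    scope: string[];
}

interface Parser {
    name: string;
    parse(content: string, filename: string): Iterable<ScopedText>;
}

// The simplest possible parser: no transformation, one segment, scope `text`.
const plainTextParser: Parser = {
    name: 'plaintext',
    parse(content) {
        return [{ text: content, start: 0, end: content.length, scope: ['text'] }];
    },
};
```

A real parser would split the document into many segments and assign a more specific scope to each one.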

# Transformation

Parsers transform the document text to prepare it for spell checking. This is a solution for the following issues:

- [Markdown > ignore formatting when spell checking Markdown · Issue #2672 · streetsidesoftware/cspell](https://github.com/streetsidesoftware/cspell/issues/2672)
- [Latex special character codes · Issue #361 · streetsidesoftware/cspell](https://github.com/streetsidesoftware/cspell/issues/361)

By providing the context / scope of the transformed text, powerful configurations become possible; see [Context and Scope](#context-and-scope).
In addition, parsers will enable the spell checker to work on a wider range of documents than it can today, possibly unlocking PDF and other proprietary formats.

## Source Maps

A SourceMap is used to map a transformed piece of text back to its original text.
This is necessary in order to report the correct location of a spelling issue.

An empty source map indicates that it was a 1:1 transformation.
The values in a source map are number pairs (even, odd) relative to the beginning of each
string segment.

- even - offset in the source text
- odd - offset in the transformed text

Offsets start at 0.

Example:

- Original text: `Grand Caf\u00e9 Bj\u00f8rvika`
- Transformed text: `Grand Café Bjørvika`
- Map: [9, 9, 15, 10, 18, 13, 24, 14]

**Map Explained:**

| offset | original | offset | transformed |
| ------ | ----------- | ------ | ----------- |
| 0-9 | `Grand Caf` | 0-9 | `Grand Caf` |
| 9-15 | `\u00e9` | 9-10 | `é` |
| 15-18 | ` Bj` | 10-13 | ` Bj` |
| 18-24 | `\u00f8` | 13-14 | `ø` |
| 24-29 | `rvika` | 14-19 | `rvika` |

Notice that the starting `0, 0` and ending `29, 19` pairs were not necessary.
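To show how such a map might be consumed, here is a minimal sketch that converts an offset in the transformed text back to an offset in the original text. The function name `mapOffsetToSource` is hypothetical. The sketch assumes that a segment whose source and transformed lengths are equal is a 1:1 copy, that any offset inside a segment of unequal length maps to the start of that segment, and that the text after the last pair is 1:1.

```typescript
type SourceMap = number[];

/**
 * Map an offset in the transformed text back to an offset in the original text.
 * An empty map is treated as a 1:1 transformation.
 */
function mapOffsetToSource(map: SourceMap, offset: number): number {
    let srcStart = 0; // implied starting pair `0, 0`
    let dstStart = 0;
    for (let i = 0; i < map.length; i += 2) {
        const srcEnd = map[i];
        const dstEnd = map[i + 1];
        if (offset < dstEnd) {
            // Equal lengths: the segment was copied unchanged, so offsets move together.
            const isOneToOne = srcEnd - srcStart === dstEnd - dstStart;
            return isOneToOne ? srcStart + (offset - dstStart) : srcStart;
        }
        srcStart = srcEnd;
        dstStart = dstEnd;
    }
    // Past the last pair: the trailing segment is assumed to be 1:1.
    return srcStart + (offset - dstStart);
}

// The example from above. The original text contains literal backslash escapes.
const original = 'Grand Caf\\u00e9 Bj\\u00f8rvika';
const exampleMap: SourceMap = [9, 9, 15, 10, 18, 13, 24, 14];

// The second word occupies 11-19 in the transformed text.
const wordStart = mapOffsetToSource(exampleMap, 11); // 16
const wordEnd = mapOffsetToSource(exampleMap, 19); // 29
console.log(original.slice(wordStart, wordEnd));
```

Mapping both ends of a word this way yields the range to report in the original document. An end offset that falls inside a transformed segment would need extra handling, which this sketch leaves out.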

<!--- cspell:ignore Bjørvika rvika --->

# Context and Scope

Parsers provide context by adding scope to the text they transform. This is similar to the scope added by colorizer parsers.
The scope is used by the spell checker to select appropriate configuration.

Scope is a list of strings from global to local:

**Example Scopes**

- `['source.ts', 'meta.var.expr.ts', 'string.quoted.single.ts', 'text.transformed']`
- `'source.ts meta.var.expr.ts string.quoted.single.ts text.transformed'` - as a string separated by spaces
- `'text'`
- `'text.html.markdown meta.paragraph.markdown markup.bold.markdown'`
- `'text.html.markdown markup.heading.markdown heading.1.markdown entity.name.section.markdown'`

In a CSS-selector-like fashion, scope is used to apply matching configuration. This is a very powerful option. It allows the end user to apply different configuration based upon the context. For example, in an i18n translation file, the translation keys could use the code splitter and English, while the strings would be case sensitive and use the word splitter.

This also addresses the following types of common requests:

- To check only comments
- To check only strings
- To check only code
- To check code in English and template strings in another language.

```yaml
documentSettings:
  - when:
      # Source code should ignore case and use the code splitter
      scope: source, code
    use:
      splitter: code # the text will be divided based upon code splitting rules
      caseSensitive: false
  - when:
      scope: text
    use:
      splitter: word
      caseSensitive: true
  - when:
      scope: text.transformed
    use:
      splitter: word # the text will be divided upon word boundaries
  - when:
      scope: text.transformed.word
    use:
      splitter: none # the text will not be divided into words before searching the dictionary.
  - when:
      # Match against TypeScript and JSON double quote strings.
      scope: source.ts string.quoted.double, string.quoted.double.json
    use:
      locale: 'es' # enable Spanish on strings
      caseSensitive: true
```
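The RFC does not yet define how a `scope` selector is matched, so the following is only a sketch of one possible interpretation. It assumes that a comma separates alternatives, that space-separated parts must match scope entries in order from global to local (like CSS descendant selectors), and that a part matches an entry when it equals the entry or is a dot-delimited prefix of it. The function names are hypothetical.

```typescript
/** Does one selector part (e.g. `string.quoted`) match one scope entry (e.g. `string.quoted.double.ts`)? */
function matchesPart(part: string, scopeEntry: string): boolean {
    return scopeEntry === part || scopeEntry.startsWith(part + '.');
}

/** Does any comma-separated alternative of the selector match the scope? */
function matchesSelector(selector: string, scope: string[]): boolean {
    return selector.split(',').some((alternative) => {
        const parts = alternative
            .trim()
            .split(/\s+/)
            .filter((p) => p.length > 0);
        if (parts.length === 0) return false;
        // Walk the scope from global to local, consuming selector parts in order.
        let index = 0;
        for (const entry of scope) {
            if (matchesPart(parts[index], entry)) {
                index += 1;
                if (index === parts.length) return true;
            }
        }
        return false;
    });
}

const tsString = ['source.ts', 'meta.var.expr.ts', 'string.quoted.double.ts'];

matchesSelector('source.ts string.quoted.double, string.quoted.double.json', tsString); // true
matchesSelector('source, code', tsString); // true
matchesSelector('text', tsString); // false
```

Under these assumptions several `when` entries can match the same text, so the checker would also need a rule for combining them, for example letting later or more specific entries override earlier ones.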

# Delegation

Some document types, like Markdown, contain sections that are best left to another parser. It should be possible to delegate that responsibility. For example, when the Markdown parser encounters a JavaScript code block, it returns the code block marked as needing further processing, along with `.js` as a possible file extension and `javascript` as the file type.
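A minimal sketch of what delegation could look like, assuming a Markdown parser that only recognizes fenced code blocks with a language tag. The interfaces mirror the types proposed in this RFC; `findDelegatedCodeBlocks` and the `knownLanguages` table are hypothetical and exist only for illustration.

```typescript
interface DelegateInfo {
    filename: string;
    extension: string;
    fileType: string;
}

interface ParsedText {
    text: string;
    start: number;
    end: number;
    delegate?: DelegateInfo;
}

// Hypothetical lookup from a fence language tag to delegate details.
const knownLanguages: Record<string, { extension: string; fileType: string }> = {
    js: { extension: '.js', fileType: 'javascript' },
    javascript: { extension: '.js', fileType: 'javascript' },
    ts: { extension: '.ts', fileType: 'typescript' },
};

// Built with repeat() so this example does not contain a literal code fence.
const FENCE = '`'.repeat(3);

/** Find fenced code blocks and return them marked for delegation. */
function findDelegatedCodeBlocks(markdown: string, filename: string): ParsedText[] {
    const codeBlock = new RegExp(FENCE + '(\\w+)\\n([\\s\\S]*?)' + FENCE, 'g');
    const result: ParsedText[] = [];
    let match: RegExpExecArray | null;
    while ((match = codeBlock.exec(markdown)) !== null) {
        const lang = match[1];
        const body = match[2];
        const info = knownLanguages[lang];
        if (!info) continue; // unknown language: nothing to delegate to
        // The body starts after the opening fence, the language tag, and the newline.
        const start = match.index + FENCE.length + lang.length + 1;
        result.push({
            text: body,
            start,
            end: start + body.length,
            delegate: { filename: filename + info.extension, ...info },
        });
    }
    return result;
}

const sample = 'Intro\n' + FENCE + 'js\nconst x = 1;\n' + FENCE + '\n';
const blocks = findDelegatedCodeBlocks(sample, './README.md');
console.log(blocks[0].delegate);
```

The spell checker would then use the `delegate` information to pick a parser for the block, and offsets reported by that parser would be shifted by `start` to land in the original document.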

# Parsers as Plug-ins

This RFC proposes that parsers be added as plug-ins, each with its own possible configuration. This allows the spell checker to be easily extended to support new file types while also allowing third parties to offer custom parsers that override the built-in ones.

# Parser Configuration

Each parser should have a default configuration as well as the ability (if appropriate) to be configurable.

For example, the Markdown parser could have a setting to strip formatting as asked for in [#2672 - treat `**R**epeat` as `Repeat`](https://github.com/streetsidesoftware/cspell/issues/2672). Code parsers could have settings to ignore imports and variable declarations.
71 changes: 71 additions & 0 deletions rfc/rfc-0003 parsing files/src/types.ts
@@ -0,0 +1,71 @@
export interface ParsedText {
    text: string;
    start: number;
    end: number;
    /**
     * The source map is used to support text transformations.
     *
     * See: {@link SourceMap}
     */
    map?: SourceMap;
    delegate?: DelegateInfo;
}

export interface TextSegment {
    text: string;
    start: number;
    end: number;
}

/**
 * A SourceMap is used to map a transformed piece of text back to its original text.
 * This is necessary in order to report the correct location of a spelling issue.
 * An empty source map indicates that it was a 1:1 transformation.
 *
 * The values in a source map are number pairs (even, odd) relative to the beginning of each
 * string segment.
 * - even - offset in the source text
 * - odd - offset in the transformed text
 *
 * Offsets start at 0.
 *
 * Example:
 *
 * - Original text: `Grand Caf\u00e9 Bj\u00f8rvika`
 * - Transformed text: `Grand Café Bjørvika`
 * - Map: [9, 9, 15, 10, 18, 13, 24, 14]
 *
 * | offset | original    | offset | transformed |
 * | ------ | ----------- | ------ | ----------- |
 * | 0-9    | `Grand Caf` | 0-9    | `Grand Caf` |
 * | 9-15   | `\u00e9`    | 9-10   | `é`         |
 * | 15-18  | ` Bj`       | 10-13  | ` Bj`       |
 * | 18-24  | `\u00f8`    | 13-14  | `ø`         |
 * | 24-29  | `rvika`     | 14-19  | `rvika`     |
 *
 * <!--- cspell:ignore Bjørvika rvika --->
 */
export type SourceMap = number[];

/**
 * DelegateInfo is used by a parser to delegate parsing a subsection of a document to
 * another parser. The following information is used by the spell checker to match
 * the parser.
 */
export interface DelegateInfo {
    /**
     * Proposed virtual file name including the extension.
     * Example: `./README.md.js`
     */
    filename: string;
    /**
     * Proposed file extension
     * Example: `.js`
     */
    extension: string;
    /**
     * Filetype to use
     * Example: `javascript`
     */
    fileType: string;
}
