Scanner should recognize non-ASCII punctuation chars #53

SilverRainZ · 2024-03-16T07:02:09Z

Hi stsewd, thanks for your awesome rst parser!

I found that this parser does not work well when parsing documentation written in CJK languages.
For example, :strong:`text`。 (followed by a Chinese full stop 。, the equivalent of the English .) is valid inline markup (rst2pseudoxml accepts it), but tree-sitter-rst does not recognize it correctly.

How to reproduce

$ echo ':strong:`text`。' > example.rst
$ rst2pseudoxml example.rst
<document source="example.rst">
    <paragraph>
        <strong>
            text
        。
$ tree-sitter p example.rst
(document [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 8]
    (role [0, 0] - [0, 8]))
  (paragraph [0, 8] - [0, 17]))
example.rst        0.03 ms         607 bytes/ms (ERROR [0, 0] - [0, 8])

How to fix

According to Inline markup recognition rules:

Inline markup start-strings must start a text block or be immediately preceded by

whitespace,

one of the ASCII characters - : / ' " < ( [ {

or a similar non-ASCII punctuation character. [18]

Inline markup end-strings must end a text block or be immediately followed by

whitespace,

one of the ASCII characters - . , : ; ! ? \ / ' " ) ] } >

or a similar non-ASCII punctuation character. [19]

I made a PR (#10) for this, but it is not a good fix.
Docutils provides regexes for matching these non-ASCII punctuation characters. As far as I understand, matching them in src/tree_sitter_rst/chars.c::is_{start,end}_char should fix this issue.
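To make the idea concrete, here is a minimal sketch of an end-character check that accepts both the ASCII characters from the rules above and a few non-ASCII punctuation code points. It is not the code in chars.c: the function name `sketch_is_end_char`, both array names, and the five CJK code points are assumptions chosen for illustration, and a real table would have to be derived from the Docutils punctuation data. Whitespace and end-of-text-block handling are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII characters that may follow an inline markup end-string,
 * copied from the recognition rules quoted above. */
static const char ASCII_END_CHARS[] = "-.,:;!?\\/'\")]}>";

/* Illustrative subset only (an assumption, not the Docutils list):
 * a few CJK punctuation code points. */
static const int32_t EXTRA_END_CHARS[] = {
    0x3001, /* IDEOGRAPHIC COMMA */
    0x3002, /* IDEOGRAPHIC FULL STOP */
    0xFF01, /* FULLWIDTH EXCLAMATION MARK */
    0xFF0C, /* FULLWIDTH COMMA */
    0xFF1F, /* FULLWIDTH QUESTION MARK */
};

/* Hypothetical helper: true if code point `c` may directly follow
 * an inline markup end-string. */
bool sketch_is_end_char(int32_t c) {
    /* First check the ASCII characters from the spec. */
    for (size_t i = 0; ASCII_END_CHARS[i] != '\0'; i++) {
        if (c == (int32_t)ASCII_END_CHARS[i]) {
            return true;
        }
    }
    /* Then check the non-ASCII punctuation subset. */
    size_t n = sizeof(EXTRA_END_CHARS) / sizeof(EXTRA_END_CHARS[0]);
    for (size_t i = 0; i < n; i++) {
        if (c == EXTRA_END_CHARS[i]) {
            return true;
        }
    }
    return false;
}
```

The argument is an `int32_t` because a tree-sitter scanner sees the next character as a decoded code point, so 。 arrives as 0x3002 rather than as three UTF-8 bytes.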


SilverRainZ · 2024-03-19T05:51:47Z

Just FYI, I am working on this by generating C char arrays from docutils.utils.punctuation_chars and replacing the valid_chars inside the is_{start,end}_char functions.

I will file a PR soon.
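A generated table can grow to hundreds of code points, so a sorted array with binary search is one reasonable way to query it. The sketch below only illustrates that lookup. The layout of the real generated array is not shown in this thread, so `SKETCH_PUNCT`, `sketch_in_table`, and `sketch_is_punct` are hypothetical names, and the six entries are a hand-picked subset rather than generator output.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a generated table. Entries must be sorted ascending
 * for the binary search below to work. Hand-picked subset. */
static const int32_t SKETCH_PUNCT[] = {
    0x00A1, /* INVERTED EXCLAMATION MARK */
    0x00BF, /* INVERTED QUESTION MARK */
    0x2026, /* HORIZONTAL ELLIPSIS */
    0x3001, /* IDEOGRAPHIC COMMA */
    0x3002, /* IDEOGRAPHIC FULL STOP */
    0xFF0C, /* FULLWIDTH COMMA */
};

/* Binary search over a sorted code point table. */
bool sketch_in_table(const int32_t *table, size_t len, int32_t c) {
    size_t lo = 0;
    size_t hi = len; /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid] == c) {
            return true;
        }
        if (table[mid] < c) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return false;
}

/* Hypothetical wrapper that a scanner function could call. */
bool sketch_is_punct(int32_t c) {
    size_t len = sizeof(SKETCH_PUNCT) / sizeof(SKETCH_PUNCT[0]);
    return sketch_in_table(SKETCH_PUNCT, len, c);
}
```

A linear scan would also work for a table this small; the sorted layout only pays off once the generator emits the full character set.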

ref: stsewd/tree-sitter-rst#53

The `punctuation_chars.h` header file is auto-generated from `gen_punctuation_chars.py`. I also add a test case "Unicode Punctuation Chars":

before:

```
inline_markup:
✗ Unicode Punctuation Chars

1 failure:

correct / expected / unexpected

1. Unicode Punctuation Chars:

(document (paragraph) (paragraph) (paragraph) (paragraph)) (paragraph (emphasis)) (paragraph (emphasis) (strong)) (paragraph (emphasis)) (paragraph (emphasis)))
```

after:

```
inline_markup:
✓ Unicode Punctuation Chars
```

Any comments are welcome.

Close #53.

---------

Co-authored-by: Santos Gallegos <stsewd@proton.me>

stsewd added the bug Something isn't working label Mar 16, 2024

SilverRainZ mentioned this issue Mar 19, 2024

Recognize non-ASCII punctuation chars #54

Merged

SilverRainZ added a commit to SilverRainZ/dotfiles that referenced this issue Mar 31, 2024

nvim: Overwrite default rst parser

1294758

ref: stsewd/tree-sitter-rst#53

stsewd closed this as completed in #54 Mar 31, 2024

jmacmahon mentioned this issue Apr 3, 2024

punctuation_chars.h compilation errors on Windows (MSVC) + Clang #55

Closed
