Scanner should recognize non-ASCII punctuation chars #53

Closed
SilverRainZ opened this issue Mar 16, 2024 · 1 comment · Fixed by #54
Labels
bug Something isn't working

@SilverRainZ
Contributor

Hi stsewd, thanks for your awesome rst parser!

I found that this parser does not work so well when parsing documentation written in CJK.
For example, :strong:`text`。 (followed by a Chinese full stop 。; the English equivalent is .) is valid inline markup (rst2pseudoxml accepts it), but tree-sitter-rst does not recognize it correctly.

How to reproduce

$ echo ':strong:`text`。' > example.rst
$ rst2pseudoxml example.rst
<document source="example.rst">
    <paragraph>
        <strong>
            text
        。
$ tree-sitter p example.rst
(document [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 8]
    (role [0, 0] - [0, 8]))
  (paragraph [0, 8] - [0, 17]))
example.rst        0.03 ms         607 bytes/ms (ERROR [0, 0] - [0, 8])

How to fix

According to Inline markup recognition rules:

Inline markup start-strings must start a text block or be immediately preceded by

  • whitespace,
  • one of the ASCII characters - : / ' " < ( [ {
  • or a similar non-ASCII punctuation character. [18]

Inline markup end-strings must end a text block or be immediately followed by

  • whitespace,
  • one of the ASCII characters - . , : ; ! ? \ / ' " ) ] } >
  • or a similar non-ASCII punctuation character. [19]
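The rules above can be sketched in a few lines of Python. This is only an illustration, not how Docutils or tree-sitter-rst implement it: the function names are made up, and the category sets assume that "similar non-ASCII punctuation" means the Unicode general categories Ps, Pi, Pf, Pd, Po before a start-string and Pe, Pi, Pf, Pd, Po after an end-string, which is my reading of footnotes [18] and [19].

```python
import unicodedata

# ASCII characters listed in the recognition rules quoted above.
ASCII_BEFORE_START = set("-:/'\"<([{")
ASCII_AFTER_END = set("-.,:;!?\\/'\")]}>")

# Assumption: Unicode general categories standing in for
# "a similar non-ASCII punctuation character".
CATS_BEFORE_START = {"Ps", "Pi", "Pf", "Pd", "Po"}
CATS_AFTER_END = {"Pe", "Pi", "Pf", "Pd", "Po"}


def may_precede_start_string(ch: str) -> bool:
    """Return True if `ch` may sit right before an inline markup start-string."""
    if ch.isspace() or ch in ASCII_BEFORE_START:
        return True
    # Only non-ASCII characters are checked by category; ASCII is
    # restricted to the explicit list above.
    return ord(ch) > 0x7F and unicodedata.category(ch) in CATS_BEFORE_START


def may_follow_end_string(ch: str) -> bool:
    """Return True if `ch` may sit right after an inline markup end-string."""
    if ch.isspace() or ch in ASCII_AFTER_END:
        return True
    return ord(ch) > 0x7F and unicodedata.category(ch) in CATS_AFTER_END
```

Under these assumptions the Chinese full stop 。 (U+3002, category Po) is accepted after an end-string, which is the case from the reproduction above.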

I have made a PR (#10) for this, but it is not a good fix.
Docutils provides some regexes for matching these non-ASCII punctuation characters. As far as I understand, matching them in src/tree_sitter_rst/chars.c::is_{start,end}_char should fix this issue.

@stsewd stsewd added the bug Something isn't working label Mar 16, 2024
@SilverRainZ
Contributor Author

Just FYI, I am working on this by generating a C char array from docutils.utils.punctuation_chars and using it to replace the valid_chars inside the is_{start,end}_char functions.

I will file a PR soon.
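A rough sketch of what such a generator could look like. To stay self-contained it does not import docutils; instead it derives the code points from `unicodedata`, assuming the relevant characters are the non-ASCII ones in categories Pe, Pi, Pf, Pd and Po within the Basic Multilingual Plane. The function names and the array name `non_ascii_end_chars` are hypothetical and are not the ones used in the actual PR.

```python
import unicodedata


def collect_code_points(categories, limit=0x10000):
    """Collect non-ASCII code points below `limit` whose category is in `categories`."""
    return [
        cp for cp in range(0x80, limit)
        if unicodedata.category(chr(cp)) in categories
    ]


def emit_c_array(name, code_points, per_line=8):
    """Render the code points as the source text of a C int32_t array."""
    lines = [f"static const int32_t {name}[] = {{"]
    for i in range(0, len(code_points), per_line):
        chunk = code_points[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{cp:04X}" for cp in chunk) + ",")
    lines.append("};")
    return "\n".join(lines)


# Assumed category set for characters allowed after an end-string.
end_chars = collect_code_points({"Pe", "Pi", "Pf", "Pd", "Po"})
header = emit_c_array("non_ascii_end_chars", end_chars)
```

Because the code points come out in ascending order, the C side could look them up with a binary search instead of a linear scan.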

SilverRainZ added a commit to SilverRainZ/dotfiles that referenced this issue Mar 31, 2024
stsewd added a commit that referenced this issue Mar 31, 2024
The `punctuation_chars.h` header file is auto-generated by `gen_punctuation_chars.py`.
I also added a test case, "Unicode Punctuation Chars":

before:

```
  inline_markup:
    ✗ Unicode Punctuation Chars

1 failure:

correct / expected / unexpected

  1. Unicode Punctuation Chars:

    (document
      (paragraph)
      (paragraph)
      (paragraph)
      (paragraph))
      (paragraph
        (emphasis))
      (paragraph
        (emphasis)
        (strong))
      (paragraph
        (emphasis))
      (paragraph
        (emphasis)))
```

after:

```
  inline_markup:
    ✓ Unicode Punctuation Chars
```

Any comments are welcome.

Close #53.

---------

Co-authored-by: Santos Gallegos <stsewd@proton.me>