Don't allow [ or ] in XML names. #187

Merged 1 commit on Jun 20, 2023

@jgm jgm commented Jun 19, 2023

This is an example of a DOCTYPE that was not being parsed correctly
before:

```
<!DOCTYPE language[
  <!ENTITY nmtoken "[\-\w\d\.:_]+">
  <!ENTITY entref  "(#[0-9]+|#[xX][0-9A-Fa-f]+|&nmtoken;);">
]>
```

xml-conduit was parsing `language[` as the root element name.

I have kept this PR to the smallest possible change, because I don't
want to break anything inadvertently. However, the current parser is
still far from correct. As I understand it, the only ASCII punctuation
allowed in element names is `_`, `-`, and `.` (in addition, `:` can be
used for a namespace, but that is supported separately in this
parser). The current parser would accept things like `<foo~bar>`.
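
For reference, here is a minimal sketch of the `Name` production from the XML 1.0 (Fifth Edition) specification, written in Python rather than Haskell. It is not xml-conduit's code and not the change made in this PR; the helper names `is_xml_name` and `take_name` are hypothetical and exist only for illustration.

```python
# Sketch of the XML 1.0 (Fifth Edition) Name production.
# Illustration only; this is not how xml-conduit implements name parsing.

# Inclusive code point ranges allowed as the first character of a Name.
NAME_START_RANGES = [
    (0x3A, 0x3A),  # ":"
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # "_"
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Extra ranges allowed only after the first character.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),  # "-" and "."
    (0x30, 0x39),  # 0-9
    (0xB7, 0xB7),  # middle dot
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml_name(s):
    """True if the whole string matches NameStartChar NameChar*."""
    return (
        bool(s)
        and is_name_start_char(s[0])
        and all(is_name_char(c) for c in s[1:])
    )


def take_name(s):
    """Return the longest prefix of s that is a valid Name ('' if none)."""
    if not s or not is_name_start_char(s[0]):
        return ""
    end = 1
    while end < len(s) and is_name_char(s[end]):
        end += 1
    return s[:end]
```

Under these rules `language[` and `foo~bar` are not names, and a parser that consumes name characters greedily stops before the `[`, so `take_name("language[")` yields `language`. Note that the sketch accepts `:` anywhere in a name, as the base specification does; a namespace-aware parser would treat the colon separately.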

jgm commented Jun 19, 2023

The CI failures seem unrelated to my change.

jgm added a commit to jgm/skylighting that referenced this pull request Jun 20, 2023
This is to work around a bug in xml-conduit:
snoyberg/xml#187

k0ral commented Jun 20, 2023

Looks good to me, thank you. I've opened #188 to keep track of the remaining gap with XML names specification.

I'll have a look at the CI issue.

@k0ral k0ral merged commit 70e4200 into snoyberg:master Jun 20, 2023
6 of 12 checks passed

jgm commented Jun 20, 2023

Thanks. It would be great to have a release with this fix!


k0ral commented Jun 21, 2023

Released as 1.9.1.3.

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Oct 30, 2023
0.14
* Add rWeakDeliminators field to Rule. [API change]
* Make WordDetect sensitive to weakDeliminator. This fixes parsing of
  floats beginning with '0.' in C (#174).
* Add debiancontrol syntax (#173).

0.13.4.1
* Update syntax definitions: ada, bash, cmake, css, html, isocpp, java,
  javascript, kotlin, latex, makefile, markdown, php, python, qml, r, sass,
  scss, typescript, zsh.
* Don't require word boundary at end of Int, Float, HlCHex, HlCOct
  (#170). KDE does not. This fixes things like 7L in R.

0.13.4
* Add dosbat syntax (MS DOS batch file) (#169).
* Derive Bounded Instance for TokenType (#168, Pavan Pikhi). Add Bounded to
  the derived instances for the TokenType type. This allows consumers to
  use [minBound .. maxBound] to generate a list of all token types when
  writing a Style.
* Require xml-conduit >= 1.9.1.3. This fixes a bug that prevents parsing
  certain DOCTYPE declarations, e.g. in agda.xml.
* Updated cmake syntax definition.

0.13.3
* Add gap language (#167).
* Update syntax definitions.
* Add patches for agda.xml and dtd.xml, to work around a bug in xml-conduit:
  snoyberg/xml#187
* Store compiled regexes in RE (#166, Jonathan Coates). This changes the RE
  type to (lazily) compile the regex when constructed, rather than in the
  tokenizer. This allows us to avoid re-compiling regexes for each separate
  tokenize call, instead sharing them globally. We try to hide the
  internals of this, exposing the previous interface (RE { reString,
  reCaseSensitive }) with pattern synonyms.
* ConTeXt: fix handling of spaces in non-normal tokens (Albert
  Krewinkel). This ensures that multiple spaces won't be collapsed into a
  single space.

0.13.2.1
* Update tango style for new token types (#164). The original tango style
  didn't have colors defined for many token types that have been added
  since it was added. This commit updates the style to support them. Thanks
  to @danbraswell for providing the values needed.