Don't convert ascii to unicode in URLs #10042

Open
cscheid opened this issue Jun 17, 2024 · 4 comments
Labels: enhancement (New feature or request), pandoc, yaml-validation (Issues with YAML validation and autocompletion in quarto)


Collaborator

cscheid commented Jun 17, 2024

This happens in metadata, but it's likely that it also happens in other parts.

Repro:

---
format: html
value: https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto
---

{{< meta value >}}

(This is, ultimately, the same bug that caused us to have to reimplement shortcodes: {{< foo https://pr-350--quartodoc.netlify.app >}} would destroy the URL in the past.)

Collaborator Author

cscheid commented Jun 17, 2024

Quarto really ought to do this automatically ahead of Pandoc.

But I really do feel that this is a Pandoc "bug" in that its smart ASCII-to-unicode processing is way too eager in the presence of URLs.

In Quarto 1.5, we have an (admittedly fairly gross) syntax for "escaping" arbitrary Pandoc content through its Markdown representation. Consider this:

---
format: html
value: xn--oh.no
value2: '`Str "xn--oh.no"`{=pandoc-native}'
---

{{< meta value >}}

{{< meta value2 >}}

Collaborator Author

cscheid commented Jun 17, 2024

Notably, my workaround is format-agnostic (because it produces an actual pandoc.Str entry in the metadata object). In contrast, @mcanouil's suggestion in #10021 only works for specific formats.

@machow If you need 1.5 to ship pristine URLs across metadata, and you know that they're URLs, you can use that syntax.

We should have a transparent mechanism for this, but the {=pandoc-native} trick should get you going.
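
For anyone generating front matter programmatically, here is a minimal Python sketch of building that escaped value. The helper names `pandoc_native_str` and `yaml_single_quoted` are hypothetical (they are not part of Quarto or Pandoc), and the escaping rules assume the text is plain ASCII that only needs backslashes and double quotes escaped inside the native `Str` literal:

```python
def pandoc_native_str(text: str) -> str:
    """Wrap text as a Pandoc-native Str inside a raw inline (hypothetical helper)."""
    # A backtick would terminate the inline code span early, so refuse it here.
    if "`" in text:
        raise ValueError("backticks are not supported by this simple sketch")
    # Escape backslashes first, then double quotes, for the quoted Str literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'`Str "{escaped}"`{{=pandoc-native}}'


def yaml_single_quoted(value: str) -> str:
    """Quote a value as a YAML single-quoted scalar (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


if __name__ == "__main__":
    # Prints a line in the same shape as the value2 entry shown earlier in the thread.
    print("value2: " + yaml_single_quoted(pandoc_native_str("xn--oh.no")))
```

This only builds the string; whether it survives rendering still depends on Quarto 1.5 handling `{=pandoc-native}` as described above.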

cscheid self-assigned this Jun 17, 2024
cscheid added the enhancement, yaml-validation, and pandoc labels Jun 17, 2024
cscheid added this to the v1.6 milestone Jun 17, 2024
Collaborator

cderv commented Jun 18, 2024

Re-posting below, for context, the explanation of why Pandoc converts -- to an en-dash.

This is all due to Pandoc's Markdown reader when the +smart extension is set, which is the default for from: markdown.

So we could also opt out of this extension in our qmd reader (from: markdown-smart), and this would never happen.

That would have other impacts on the output, though (especially for TeX ligatures in LaTeX PDF output).

From #10021 (comment)

How the smart extension causes -- to be read as Unicode by the Markdown reader

Nothing should be turning -- into en-dashes. (Maybe?) Pandoc is doing that,

Just want to add additional information on this.

This is Pandoc. It has a +smart extension that does this. See https://pandoc.org/MANUAL.html#extension-smart

Interpret straight quotes as curly quotes, --- as em-dashes, -- as en-dashes, and ... as ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.”

This extension is activated by default and affects how things are written in the output. HTML is among the formats where en-dashes are used.

With smart extension:
❯ quarto pandoc --from markdown --to html
pr--450
pr-450
^Z
<p>pr–450 pr-450</p>

Without smart extension:
❯ quarto pandoc --from markdown-smart --to html
pr--450
pr-450
^Z
<p>pr--450 pr-450</p>

Note the two dashes without smart enabled.

This all happens in the Markdown reader!

With smart extension:
❯ quarto pandoc --from markdown --to native
pr--450
^Z
[ Para [ Str "pr\8211\&450" ] ]

Without smart extension:
❯ quarto pandoc --from markdown-smart --to native
pr--450
^Z
[ Para [ Str "pr--450" ] ]
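
To see why this breaks punycode hosts in particular, here is a rough Python sketch. It assumes the dash handling can be approximated by plain string replacement; it is an illustration only, not Pandoc's actual reader logic, and `smart_dashes` is a hypothetical name:

```python
EN_DASH = "\u2013"  # code point 8211, the \8211 in the native output above
EM_DASH = "\u2014"


def smart_dashes(text: str) -> str:
    """Approximate the dash rules quoted from the manual (illustration only)."""
    # Replace the longer run first so "---" does not become an en-dash plus "-".
    return text.replace("---", EM_DASH).replace("--", EN_DASH)


if __name__ == "__main__":
    url = "https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto"
    mangled = smart_dashes(url)
    print(mangled)
    # The "xn--" punycode prefix no longer appears, so the host is no longer valid.
    print("xn--" in mangled)
```

Single hyphens are left alone in this sketch, so only the `xn--` prefix of the host changes, which is enough to make the URL point nowhere.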

Why does it happen with metadata fields?

Because they are parsed as Markdown values by Pandoc.
From https://pandoc.org/MANUAL.html#extension-yaml_metadata_block:

Metadata can contain lists and objects (nested arbitrarily), but all string scalars will be interpreted as Markdown.

There was a related issue in the past where internally using Pandoc's new pandoc-native raw block feature was the way to go.

Collaborator

mcanouil commented Jun 18, 2024

If pandoc-native is the way, then I think the following part (and subsequent parts) of the codebase for href might need refactoring:
