Certain urls get mangled when rendering with python kernel #10021

Closed
machow opened this issue Jun 14, 2024 · 21 comments
Labels: html (issues with HTML and related web technology (html/css/scss)), support (a request for support)

Comments

@machow

machow commented Jun 14, 2024

This occurred when running a jupyter kernel.

example.qmd

---
jupyter: python3
---

[a link](https://pr-350–quartodoc.netlify.app/api/Auto.html#quartodoc.Auto)

The link target becomes https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto.

Here's a 1 minute screencast demo'ing: https://www.loom.com/share/ef148ccc145e482abb17c72d62ecd9aa

quarto version

Using quarto v1.5.45, but checked on several earlier versions of quarto also

quarto check
Quarto 1.5.45
[✓] Checking versions of quarto binary dependencies...
      Pandoc version 3.2.0: OK
      Dart Sass version 1.70.0: OK
      Deno version 1.41.0: OK
      Typst version 0.11.0: OK
[✓] Checking versions of quarto dependencies......OK
[✓] Checking Quarto installation......OK
      Version: 1.5.45
      Path: /Applications/quarto/bin

[✓] Checking tools....................OK
      TinyTeX: (not installed)
      Chromium: (not installed)

[✓] Checking LaTeX....................OK
      Tex:  (not detected)

[✓] Checking basic markdown render....OK

[✓] Checking Python 3 installation....OK
      Version: 3.10.2
      Path: /Users/machow/.virtualenvs/quartodoc/bin/python3
      Jupyter: 5.5.0
      Kernels: REDACTED
[✓] Checking Jupyter engine render....OK

[✓] Checking R installation...........OK
      Version: 4.1.2
      Path: /Library/Frameworks/R.framework/Resources
      LibPaths:
        - /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library
      knitr: 1.39
      rmarkdown: 2.14

[✓] Checking Knitr engine render......OK


@machow
Author

machow commented Jun 14, 2024

After a bit of tinkering, it feels like it has something to do with two hyphens around a number, or something like that.

This also gets mangled: https://a-3–website.com

@mcanouil
Collaborator

Thanks for the report.

The engine has nothing to do with this behaviour, as you can see the same with every engine.

Why do you have a unicode character in the URL?

image

@mcanouil
Collaborator

The HTML produced is correct (from Pandoc and Quarto):

Quarto document:

---
format: html
minimal: true
engine: markdown
---

[a link](https://pr-350–quartodoc.netlify.app/api/Auto.html#quartodoc.Auto)

HTML:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-99.9.9">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">


<title>index</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
div.columns{display: flex; gap: min(4vw, 1.5em);}
div.column{flex: auto; overflow-x: auto;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
ul.task-list li input[type="checkbox"] {
  width: 0.8em;
  margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */ 
  vertical-align: middle;
}
.display.math{display: block; text-align: center; margin: 0.5rem auto;}
</style>




</head>

<body>





<p><a href="https://pr-350–quartodoc.netlify.app/api/Auto.html#quartodoc.Auto">a link</a></p>





</body></html>

Now, if you open the HTML in a browser, the browser has issues.

The whole issue is the unicode character in the URL.

Screen.Recording.2024-06-15.at.19.50.57.mov

That's not a Quarto issue/bug.

mcanouil added the support and html labels Jun 15, 2024
@cscheid
Collaborator

cscheid commented Jun 17, 2024

@machow Could you have copied+pasted this URL somewhere, or maybe your text editor replaced the - with a unicode dash of some kind?
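One way to check whether a URL picked up a Unicode dash is to list the non-ASCII characters in its hostname. The sketch below is only an illustration: the helper name `non_ascii_in_host` is made up for this example and is not part of Quarto or quartodoc.

```python
import unicodedata
from urllib.parse import urlsplit


def non_ascii_in_host(url: str) -> list[tuple[int, str, str]]:
    """Return (index, character, Unicode name) for each non-ASCII character in the host."""
    host = urlsplit(url).hostname or ""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(host)
        if ord(ch) > 127
    ]


# URL as pasted in the issue (en dash, U+2013) versus the URL from _quarto.yml (two hyphens).
print(non_ascii_in_host("https://pr-350\u2013quartodoc.netlify.app/api/Auto.html"))
print(non_ascii_in_host("https://pr-350--quartodoc.netlify.app/api/Auto.html"))
```

The first call reports an EN DASH at index 6 of the hostname; the second returns an empty list.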

@cscheid
Collaborator

cscheid commented Jun 17, 2024

In any case, I think @mcanouil's diagnosis is correct, and this isn't a Quarto bug. I'm going to close this, but we should reopen if I'm wrong.

cscheid closed this as not planned Jun 17, 2024
@machow
Author

machow commented Jun 17, 2024

Hello, I think the endash is an unfortunate misdirection. @mcanouil's response explains well how an endash might get handled. But it doesn't explain this:

  • the output link was: https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto

How did things like the leading xn characters happen?

I think an important piece here is that I copied the link with the endash from the quartodoc interlinks filter log (while debugging), but the input in _quarto.yml doesn't have an endash. So I'm unsure how this URL in the linked quartodoc docs config (https://pr-350--quartodoc.netlify.app/) gained an endash. But the thing I'm more confused about is the mangled URL, with xn and ju9h mixed in.

The endash comes from pandoc.utils.stringify?

edit: wait, no it's there in the metadata before stringify is called.

When I log the URL in a filter, the endash seems to be added by pandoc.utils.stringify? Is this a pandoc or a quarto issue (or user error 😅)?

filter.lua

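function Meta(meta)
  -- Log each interlinks source URL before and after stringification
  for _, v in pairs(meta.interlinks.sources) do
    quarto.log.warning(v.url)                     -- metadata value as Pandoc parsed it
    local prefix = pandoc.utils.stringify(v.url)  -- plain-string form
    quarto.log.warning(prefix)
  end
end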

test.qmd

---
filters:
  - filter.lua
interlinks:
  sources:
    - url: "https://pr-350--quartodoc.netlify.app/"
---

I am some content

@mcanouil
Collaborator

mcanouil commented Jun 17, 2024

So, your URL comes from YAML.

You can see the transformation to an endash with the code below.

---
format: html
value: https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto
---

{{< meta value >}}

And if you don't want any markdown transformation:

---
format: html
value: "`https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto`{=html}"
---

{{< meta value >}}

I am not sure Quarto can do something here.

Maybe you could have a small Lua filter that ensures the YAML keys are treated as is.

I don't know how Quarto treats URLs in YAML, but I think it's done via TypeScript and not Lua.

@machow
Author

machow commented Jun 17, 2024

I think this is a clear quarto issue and needs to be reopened. Either in terms of needing to communicate to people that URLs in _quarto.yml might be transformed in a way (as you showed) that might break them, or to fix.

edit: URLs are used frequently in quarto-web's _quarto.yml. Unless I have screwed something up to cause this (which is totally possible 😬), I suspect people are going to expect to be able to put urls in there.

edit2: to restate the second part of my concern. It's clear how an endash could mess things up. My concern was that...

  • this url in _quarto.yml: https://pr-350--quartodoc.netlify.app/api/Auto.html#quartodoc.Auto
  • became this url: https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto

Note that xn is prepended to the url. (totally my bad for including an endash in the initial url I shared 😞).

@mcanouil
Collaborator

mcanouil commented Jun 17, 2024

Sorry, I'm confused.

How does your https://pr-350--quartodoc.netlify.app/api/Auto.html#quartodoc.Auto become https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto?

The only thing I can reproduce is the endash.

image
---
format: html
interlinks:
  sources:
    - url: https://pr-350--quartodoc.netlify.app/api/Auto.html#quartodoc.Auto
filters:
  - filter.lua
---


{{< meta interlinks.sources.1.url >}}

@machow
Author

machow commented Jun 17, 2024

Please see my initial issue, which includes instructions to reproduce in a jupyter kernel (I suspect the engine is important? but could be wrong).

#10021 (comment)

@mcanouil
Collaborator

mcanouil commented Jun 17, 2024

FYI, for href in Quarto the trick (contrary to using raw span like I showed) was basically to implement escaping internally: 6e5976f

@mcanouil
Collaborator

mcanouil commented Jun 17, 2024

Please see my initial issue, which includes instructions to reproduce in a jupyter kernel (I suspect the engine is important? but could be wrong).

#10021 (comment)

That's because the URL is not valid in the first place, as it contains an endash.
That's not Quarto or Pandoc. This is done by the browsers, as shown in #10021 (comment).

I'm really confused here.

  • Pandoc will transform -- in YAML into an endash.
  • Browsers do not like unicode in URLs and in the case of endash create a very weird URL while the HTML code is "correct".

@machow
Copy link
Author

machow commented Jun 17, 2024

Browsers do not like unicode in URLs and in the case of endash create a very weird URL while the HTML code is "correct".

Ahhh, this is so helpful, thanks! I didn't notice that in the source it's "okay", but has the mixed in characters when hovered over.

So I think the issue of _quarto.yml config producing endashes is the big one for me and consumers of quartodoc.

@mcanouil
Collaborator

mcanouil commented Jun 17, 2024

In your case, is it not possible to write a small Lua filter that adds the raw block/span code? Or one that escapes some characters, as done in 6e5976f?

cscheid reopened this Jun 17, 2024
@cscheid
Collaborator

cscheid commented Jun 17, 2024

Ok:

@mcanouil
Collaborator

mcanouil commented Jun 17, 2024

The "Punycode" (I did not know the term) is done by the browser, since the emitted HTML is fine, so Pandoc is only responsible for endash, emdash, etc.
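To see where the `xn--` prefix and the `ju9h` suffix come from, here is a minimal sketch using Python's built-in `punycode` codec. The helper names `to_ace_label` and `to_ace_host` are made up for this example, and the sketch skips the normalisation and validation steps that real IDNA processing in a browser performs.

```python
def to_ace_label(label: str) -> str:
    """Leave ASCII labels alone; otherwise return 'xn--' plus the Punycode form."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ace_host(host: str) -> str:
    """Apply to_ace_label to every dot-separated label of a hostname."""
    return ".".join(to_ace_label(part) for part in host.split("."))


# Hostname with an en dash (U+2013), as produced from "--" in the YAML value.
print(to_ace_host("pr-350\u2013quartodoc.netlify.app"))
# -> xn--pr-350quartodoc-ju9h.netlify.app
```

The en dash is dropped from the readable part of the label and re-encoded as the trailing `ju9h`, which matches the link target reported at the top of the issue.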

@mcanouil
Collaborator

mcanouil commented Jun 17, 2024

Maybe a more general fix would be like the one for href keys and would consist of detecting URLs/paths in YAML and escaping characters. (A bit tricky though)
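As a rough illustration of the detection half of that idea, the sketch below walks an already-parsed metadata structure and wraps URL-like strings in the raw inline syntax shown earlier in the thread. It is written in Python rather than Lua or TypeScript, and it is not how Quarto implements this: `protect_urls` and `URL_RE` are hypothetical names, the URL pattern is deliberately naive, and the `{=html}` wrapper only makes sense for HTML output.

```python
import re

# Naive pattern for "looks like an http(s) URL" (an assumption of this sketch).
URL_RE = re.compile(r"^https?://\S+$")


def protect_urls(value):
    """Recursively wrap URL-like strings as `...`{=html} so Markdown leaves them alone."""
    if isinstance(value, dict):
        return {key: protect_urls(item) for key, item in value.items()}
    if isinstance(value, list):
        return [protect_urls(item) for item in value]
    if isinstance(value, str) and URL_RE.match(value):
        return f"`{value}`{{=html}}"
    return value


meta = {"interlinks": {"sources": [{"url": "https://pr-350--quartodoc.netlify.app/"}]}}
print(protect_urls(meta))
```

Strings that do not look like URLs, and non-string values, pass through unchanged.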

@cscheid
Collaborator

cscheid commented Jun 17, 2024

OK. I'm going to close this and reopen an issue with a better repro.

cscheid closed this as not planned Jun 17, 2024
@mcanouil
Collaborator

So, your URL comes from YAML.

You can see the transformation to an endash with the code below.

---
format: html
value: https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto
---

{{< meta value >}}

And if you don't want any markdown transformation:

---
format: html
value: "`https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto`{=html}"
---

{{< meta value >}}

I am not sure Quarto can do something here.

Maybe you could have a small Lua filter that ensures the YAML keys are treated as is.

I don't know how Quarto treats URLs in YAML, but I think it's done via TypeScript and not Lua.

@cscheid the above shows the issue 😉 (in a new issue indeed)

@cscheid
Collaborator

cscheid commented Jun 17, 2024

See #10042

@cderv
Collaborator

cderv commented Jun 18, 2024

Nothing should be turning -- into en-dashes. (Maybe?) Pandoc is doing that,

Just want to add additional information on this.

This is Pandoc. It has a +smart extension that does this. See https://pandoc.org/MANUAL.html#extension-smart

Interpret straight quotes as curly quotes, --- as em-dashes, -- as en-dashes, and ... as ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.”

This extension is activated by default and impacts how things are written in the output. HTML is among the formats where en-dashes are used.

With smart extension:

❯ quarto pandoc --from markdown --to html
pr--450
pr-450
^Z
<p>pr–450 pr-450</p>

Without smart extension:

❯ quarto pandoc --from markdown-smart --to html
pr--450
pr-450
^Z
<p>pr--450 pr-450</p>

Note the two dashes without smart enabled.
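For readers without Pandoc at hand, the dash part of that behaviour can be imitated in a few lines of Python. This is only a simplified emulation of the rule quoted from the manual (`---` to em-dash, `--` to en-dash), not Pandoc's implementation. The function name `smart_dashes` is made up for this example, and it ignores the contexts, such as code spans, where Pandoc leaves dashes alone.

```python
EM_DASH = "\u2014"
EN_DASH = "\u2013"


def smart_dashes(text: str) -> str:
    """Replace '---' with an em dash first, then any remaining '--' with an en dash."""
    text = text.replace("---", EM_DASH)
    return text.replace("--", EN_DASH)


print(smart_dashes("pr--450"))  # double hyphen becomes an en dash
print(smart_dashes("pr-450"))   # single hyphen is left alone
print(smart_dashes("https://pr-350--quartodoc.netlify.app/"))
```

The last call shows how a hostname containing `--` ends up with U+2013 in it once the string is treated as Markdown text.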

This all happens in the Markdown reader!

With smart extension:

❯ quarto pandoc --from markdown --to native
pr--450
^Z
[ Para [ Str "pr\8211\&450" ] ]

Without smart extension:

❯ quarto pandoc --from markdown-smart --to native
pr--450
^Z
[ Para [ Str "pr--450" ] ]

Why does it happen with metadata fields?

Because they are parsed as Markdown values by Pandoc.
From https://pandoc.org/MANUAL.html#extension-yaml_metadata_block

Metadata can contain lists and objects (nested arbitrarily), but all string scalars will be interpreted as Markdown.

We already had issues of this type in the past, and internally using the new pandoc-native raw block feature from pandoc was the way to handle them.

Hope it helps understand. I'll also post in the other issue.


4 participants