URL in front matters yaml/toml/json are wrongly picked as markdown URLs #128

Closed
codingepaduli opened this issue Oct 29, 2020 · 6 comments
Labels: duplicate (This issue was already asked, please comment on the original one), wontfix (This will not be fixed, see reason in the comments)

Comments

@codingepaduli

codingepaduli commented Oct 29, 2020

I use the Hugo static site generator to write articles and generate web pages. Hugo uses front matter in YAML or JSON format.

So the following is a valid markdown page:

---
type: "p5js"
title: "Title"
description: "description"
date: 2020-09-11
externalJS: ['https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.1.9/p5.js']
customJS: ["/static/coding/web/p5js/basics.js"]
---

# Title
article ....

When markdown-link-check is run against the file above, the extracted link is wrong (it includes the trailing `']` from the front matter) and is therefore reported as dead:

*** Running: markdown-link-check --config .github/link_checker_config.json content/coding/web/p5js/basics.md

FILE: content/coding/web/p5js/basics.md

ERROR: 1 dead links found!
[✖] https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.1.9/p5.js']
[✓] https://p5js.org/assets/learn/coordinate-system-and-shapes/images/drawing-03.svg

3 links checked.
[✖] https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.1.9/p5.js'] → Status: 403
*** ERROR: Something went wrong - see the errors above...
@NicolasMassart
Contributor

NicolasMassart commented Nov 2, 2020

Cloudflare has anti-bot protection (it requires a cookie, or else solving a captcha), so we can't test these URLs. You may need to exclude them by adding a pattern to the config file: "pattern": "^http(s)?://cdnjs.cloudflare.com"

Marked as duplicate of #109

Let me know if the workaround is enough for now, until we fix #109. Thanks.
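
The exclusion suggested above can be sketched as follows. This is only an illustration: it assumes the config file uses the `ignorePatterns` layout described in the markdown-link-check README, and it uses Python's `re` module as a stand-in for the JavaScript regex engine the tool actually runs on. The helper `is_ignored` is hypothetical and not part of the tool.

```python
import json
import re

# Assumed config layout: a list of {"pattern": ...} objects under "ignorePatterns".
config = {
    "ignorePatterns": [
        {"pattern": "^http(s)?://cdnjs.cloudflare.com"}
    ]
}


def is_ignored(url, config):
    """Hypothetical helper: True if any ignore pattern matches the URL."""
    return any(re.search(entry["pattern"], url) for entry in config["ignorePatterns"])


# Print the JSON that would go into the config file.
print(json.dumps(config, indent=2))
print(is_ignored("https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.1.9/p5.js", config))
print(is_ignored("https://www.joomla.org/", config))
```

Note that the dots in the pattern are unescaped, so they match any character; writing `cdnjs\\.cloudflare\\.com` in the JSON would be stricter.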

@NicolasMassart added the duplicate and wontfix labels and removed the wontfix label on Nov 2, 2020
@codingepaduli
Author

codingepaduli commented Nov 3, 2020

The workaround is enough, thanks.
But how can I check whether a site has anti-bot protection? Is there a way?

For example, I'm experiencing the same issue on another link:
[✖] https://www.nationalgeographic.com/science/phenomena/2015/04/24/when-hubble-stared-at-nothing-for-100-hours/ → Status: 0
*** ERROR: Something went wrong - see the errors above...

Can I use a tool like curl to detect anti-bot protection?

curl -I https://www.nationalgeographic.com/science/phenomena/2015/04/24/when-hubble-stared-at-nothing-for-100-hours/
HTTP/2 200 
access-control-allow-credentials: true
access-control-allow-origin: *
content-type: text/html
server: Apache/2.4.18 (Ubuntu)
x-frame-options: SAMEORIGIN
x-akamai-path-stats: [3:19060:3940]
cache-control: max-age=1
expires: Tue, 03 Nov 2020 18:53:00 GMT
date: Tue, 03 Nov 2020 18:52:59 GMT

@NicolasMassart
Contributor

NicolasMassart commented Nov 6, 2020

It's not easy. You can first try visiting the site with a browser in which you block all cookies and disable all JavaScript. It should display a captcha or an error if the site is protected.
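
A scripted version of the same idea might look like the sketch below: a plain request that sends no cookies and runs no JavaScript, followed by a rough guess at whether the response looks like a bot challenge. The function names, the `User-Agent` string, and the list of "suspicious" status codes are all assumptions made for illustration. The heuristic can be wrong in both directions, and a status of 0 is used here only to mean "no HTTP response arrived".

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Plain GET with no cookies and no JavaScript; returns (status, body start).

    A status of 0 means no HTTP response arrived (timeout, DNS or TLS failure).
    """
    request = urllib.request.Request(url, headers={"User-Agent": "link-probe-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read(4096).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403.
        return err.code, err.read(4096).decode("utf-8", "replace")
    except OSError:
        # Covers URLError, timeouts, and connection failures.
        return 0, ""


def looks_blocked(status, body):
    """Heuristic guess only: flags responses that often accompany bot protection."""
    if status in (403, 429, 503):
        return True
    return "captcha" in body.lower()
```

Usage would be along the lines of `looks_blocked(*fetch_status(url))`. A `False` result does not prove the site is unprotected, since some protections only trigger on repeated requests.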

But in your cases, for https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.1.9/p5.js or https://www.nationalgeographic.com/science/phenomena/2015/04/24/when-hubble-stared-at-nothing-for-100-hours/, it seems that the site is simply very slow to answer and the test times out.
Try running the test again; sometimes it works better once their site is warm...

> mocha -g cloudflare -R spec --exit



  markdown-link-check
[
  LinkCheckResult {
    link: 'https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.1.9/p5.js',
    statusCode: 200,
    err: null,
    status: 'alive'
  }
] null
    ✓ cloudflare (96ms)


  1 passing (112ms)

Otherwise you will have to increase the timeout, but we are still trying to figure out how to merge PR #129.
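
Until a timeout option is available in the tool, the "try again" and "longer timeout" advice can be combined in a small wrapper script. The sketch below is a hypothetical illustration, not how markdown-link-check works internally: `fetch` is any callable taking `(url, timeout)` and returning an HTTP status, with 0 assumed to mean that no response arrived in time.

```python
def check_with_retries(fetch, url, attempts=3, timeout=5.0, backoff=2.0):
    """Retry a slow URL, multiplying the timeout by `backoff` after each failure.

    Returns the first non-zero status, or 0 if every attempt got no response.
    """
    for _ in range(attempts):
        status = fetch(url, timeout)
        if status != 0:
            return status
        # No response in time: give the site longer on the next attempt.
        timeout *= backoff
    return 0
```

Passing the fetch function in as a parameter keeps the retry logic independent of the HTTP library and makes it easy to try out with a fake.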

I love this Hubble picture by the way, nice gravitational lenses.

@codingepaduli
Author

codingepaduli commented Nov 11, 2020

Thank you for your reply. Sorry to reopen, but I really think there is a bug when checking markdown files with YAML or JSON front matter.

Steps to reproduce:

Create 2 files with the same link:

---
type: "html"
title: "Good"
description: "Good"
date: 2020-11-11
---

# Title
[https://www.joomla.org/](https://www.joomla.org/ "joomla")

and

---
type: "html"
title: "Evil"
description: "Evil"
date: 2020-11-11
link: "https://www.joomla.org/"
---

# Title

The output is:

*** Running: markdown-link-check --config .github/link_checker_config.json good.md

FILE: good.md
[✓] https://www.joomla.org/

1 links checked.

*** Running: markdown-link-check --config .github/link_checker_config.json evil.md

FILE: evil.md

ERROR: 1 dead links found!
[✖] https://www.joomla.org/"

1 links checked.
[✖] https://www.joomla.org/" → Status: 404
*** ERROR: Something went wrong - see the errors above...

I also tried with different links, and got the same behavior.
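
The trailing quote in the output above is consistent with bare-URL autolinking, where a URL runs until the next whitespace and only a few trailing punctuation characters are trimmed. The sketch below is a deliberately simplified approximation of that rule, written for illustration. It is not the code used by markdown-link-check or its parser, and `naive_bare_urls` is a hypothetical name.

```python
import re

# Simplified approximation: a bare URL runs until whitespace or "<",
# then a small set of trailing punctuation characters is trimmed.
BARE_URL = re.compile(r"https?://[^\s<]+")
TRAILING = "?!.,:*_~"


def naive_bare_urls(text):
    """Return every bare URL found in text, quotes and brackets included."""
    return [match.group(0).rstrip(TRAILING) for match in BARE_URL.finditer(text)]


print(naive_bare_urls('link: "https://www.joomla.org/"'))
# prints ['https://www.joomla.org/"']
```

Because `"`, `'` and `]` are not in the trimmed set, they stay attached to the URL, which matches both the `https://www.joomla.org/"` and the `p5.js']` results reported in this thread.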

@NicolasMassart
Contributor

Sorry, your first questions were not clear enough, and the fact that the links were really not valid did not help.
Now it's different.
Markdown link search clearly doesn't work well with front matter, because front matter is not markdown: it can be YAML, TOML or JSON. So it has to be taken into account.
Thanks for not giving up on this one!

@NicolasMassart changed the title from "Quoted URL in Hugo front matters are invalids" to "URL in front matters yaml/toml/json are wrongly picked as markdown URLs" on Nov 12, 2020
@NicolasMassart
Contributor

After reproducing and analyzing the issue, it clearly appears to be an issue with the markdown parser https://github.com/markdown-it/markdown-it. It's a known limitation that follows from design choices they made. See markedjs/marked#485
Anyway, it's probably not something to handle at the markdown-link-check level, but at least at the https://github.com/tcort/markdown-link-extractor level. I will create an issue there.
Thanks.
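
In the meantime, one possible workaround on the user side is to strip the front matter before the file reaches the link checker. The sketch below assumes Hugo-style delimiters (`---` for YAML, `+++` for TOML, a leading `{ ... }` object for JSON). `strip_front_matter` is a hypothetical helper and not part of markdown-link-check or markdown-link-extractor.

```python
import json
import re

# YAML (---) or TOML (+++) front matter at the very start of the file.
FRONT_MATTER = re.compile(
    r"\A(?:---\r?\n.*?\r?\n---|\+\+\+\r?\n.*?\r?\n\+\+\+)[ \t]*(?:\r?\n|\Z)",
    re.DOTALL,
)


def strip_front_matter(markdown):
    """Return the markdown body with any leading front matter removed."""
    if markdown.startswith("{"):
        # JSON front matter: let the JSON decoder find where the object ends.
        try:
            _, end = json.JSONDecoder().raw_decode(markdown)
        except json.JSONDecodeError:
            return markdown
        return markdown[end:].lstrip("\r\n")
    return FRONT_MATTER.sub("", markdown, count=1)


evil = '---\ntype: "html"\ntitle: "Evil"\nlink: "https://www.joomla.org/"\n---\n\n# Title\n'
print(repr(strip_front_matter(evil)))
```

The stripped text could then be written to a temporary file and passed to the checker. The trade-off is that links inside the front matter are no longer checked at all.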
