UTF8 characters cause valid links to be detected as broken #234

Open
matkoniecz opened this issue Aug 5, 2021 · 3 comments

Comments

@matkoniecz

I prepared a test case at https://github.com/matkoniecz/broken-link-checker-local-utf8

blc https://matkoniecz.github.io/broken-link-checker-local-utf8 -r

See https://matkoniecz.github.io/broken-link-checker-local-utf8/ - both links work, but the one with UTF-8 characters gets BLC_UNKNOWN/HTTP_undefined errors

mateusz@grima:~$ blc https://matkoniecz.github.io/broken-link-checker-local-utf8 -r
Getting links from: https://matkoniecz.github.io/broken-link-checker-local-utf8
├───OK─── https://matkoniecz.github.io/broken-link-checker-local-utf8/test%20space.html
└─BROKEN─ https://matkoniecz.github.io/broken-link-checker-local-utf8/test_zażółć.html (BLC_UNKNOWN)
Finished! 2 links found. 1 broken.

Getting links from: https://matkoniecz.github.io/broken-link-checker-local-utf8/test%20space.html
└─BROKEN─ https://matkoniecz.github.io/broken-link-checker-local-utf8/test_zażółć.html (HTTP_undefined)
Finished! 2 links found. 1 excluded. 1 broken.

Finished! 4 links found. 1 excluded. 2 broken.
Elapsed time: 1 second

Sorry if I have misunderstood something, but as far as I know UTF-8 works in links in practice.

UTF-8 characters may be represented differently internally, but browsers seem to be 100% fine with links that include letters like https://en.wikipedia.org/wiki/Ogonek

Sanity check: https://stackoverflow.com/questions/22357509/can-urls-have-utf-8-characters

Even DNS supports UTF-8 characters (with some workarounds and restrictions) https://en.wikipedia.org/wiki/Internationalized_domain_name
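
To illustrate the "internally different" part: browsers percent-encode non-ASCII path characters as UTF-8 bytes before sending the request. Below is a minimal Python sketch of that conversion. It is not taken from blc, and `to_ascii_url` is a hypothetical helper name used only for this illustration. I have not confirmed that a missing encoding step is what causes the BLC_UNKNOWN result.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Hypothetical helper for illustration only; not part of blc.
    """
    parts = urlsplit(url)
    # Keep "/" and "%" so path separators and already-encoded
    # sequences such as %20 are left untouched.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    # The host name is left as-is; internationalized domain names
    # need a separate (IDNA) conversion that is not shown here.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = "https://matkoniecz.github.io/broken-link-checker-local-utf8/test_zażółć.html"
print(to_ascii_url(url))
# https://matkoniecz.github.io/broken-link-checker-local-utf8/test_za%C5%BC%C3%B3%C5%82%C4%87.html
```

The already-encoded link from the test case (`test%20space.html`) passes through this helper unchanged, which matches the fact that it is the only link reported as OK.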

replaces LukasHechenberger/broken-link-checker-local#50

@rezaalavi

I have the same problem with websites in Chinese and Thai.
Although the links exist, the program reports a BLC_UNKNOWN error.

@mayrsascha

I have the same problem with grave and acute accents, which are very common in Romance languages and appear in other languages too. For example https://www.iswatersafetodrink.in/Italy/Cantù

@matkoniecz
Author

Can I do anything so that the first step, "needs confirmation", can be dropped?
