Bug: Crawling non-ASCII characters (URL) #4

safesploit · 2022-11-15T23:15:02Z

When crawling the Japanese Wikipedia (ja.wikipedia.org/wiki/メインページ), the following percent-encoded URL is indexed instead of the readable form:
https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
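For reference, a minimal sketch showing that the indexed string is just the percent-encoded UTF-8 form of the page title. The variable names are illustrative and not taken from the crawler's code.

```php
<?php
// The URL as it was indexed (percent-encoded UTF-8 bytes).
$indexed = 'https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8';

// urldecode() turns each %XX escape back into its raw byte,
// which yields the readable Japanese title.
$decoded = urldecode($indexed);

echo $decoded, PHP_EOL;
```

Both strings point at the same page; only the decoded one is readable when displayed in search results.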


dehlirious · 2023-03-31T06:32:01Z

Hey, I've written this up and it works, but am I missing anything?

I tested it and it works fine. I also tested a URL containing a ` character (the only thing not covered by htmlspecialchars), and it didn't break.

I've also noticed that HTML tags are removed from URL titles (if the title says "<b>Hi", it results in "Hi"), which is kind of an issue depending on the circumstances; I'd rather they be processed with htmlspecialchars than removed. Anyway,
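To illustrate the difference between the two approaches to titles, here is a small sketch. It assumes the crawler currently does something equivalent to strip_tags(); the thread does not show the actual title-handling code, so treat `$title` and both calls as hypothetical.

```php
<?php
// Hypothetical page title containing markup.
$title = '<b>Hi';

// Removing tags discards the markup entirely.
$stripped = strip_tags($title);

// Escaping keeps the original characters visible but harmless in HTML output.
$escaped = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

echo $stripped, PHP_EOL; // Hi
echo $escaped, PHP_EOL;  // &lt;b&gt;Hi
```

Escaping preserves what the page author actually wrote, while stripping silently changes the title.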

Line 88 of crawl-manual insert:
$url = htmlspecialchars(urldecode($url), ENT_QUOTES, "UTF-8");
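A hedged sketch of how that line behaves on a few inputs. The helper name `display_url` and the example.com URLs are made up for illustration; only the htmlspecialchars/urldecode call mirrors the line above.

```php
<?php
// Hypothetical wrapper around the line proposed above.
function display_url(string $url): string
{
    return htmlspecialchars(urldecode($url), ENT_QUOTES, 'UTF-8');
}

// Percent-encoded Japanese path is decoded to readable text.
$a = display_url('https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3');

// A backtick (%60) passes through unchanged; decoded < and > are escaped.
$b = display_url('https://example.com/a%60b?q=%3Cb%3E');

// Caveat: urldecode() can produce bytes that are not valid UTF-8.
// With only ENT_QUOTES set, htmlspecialchars() then returns an empty string.
$c = display_url('https://example.com/%FF');

// Adding ENT_SUBSTITUTE replaces invalid sequences with U+FFFD instead.
$d = htmlspecialchars(urldecode('%FF'), ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $a, PHP_EOL, $b, PHP_EOL;
```

Two things may be worth checking: urldecode() also converts `+` to a space (rawurldecode() does not), and the escaped value is meant for HTML output, so storing it in the index means any later escaping would double-encode it.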