New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unicode URLs, added ISO date template #1769

Merged
merged 1 commit into from Nov 24, 2014

Conversation

Projects
None yet
3 participants
@theirix
Contributor

theirix commented Jun 10, 2014

Store in hashmap strings parsed from percent-encoded unicode URLs.
Added internationalized 'isodate' template besides of locale-dependent date.

We use wkhtmltopdf in cross-language web project with cyrillic and english urls. This patch fixes behaviour of wkhtmltopdf with percent-encoded urls. They are gathered from body as unicode strings but are read from argument list as percent-encoded strings.

Another portion of patch added a template 'isodate' because webserver system locale where wkhtmltopdf is launched is not always a good choose.

@ashkulz

This comment has been minimized.

Member

ashkulz commented Jul 2, 2014

@theirix: can you describe the problem you are trying to solve for the "Fix unicode URLs" changes with an example?

@theirix

This comment has been minimized.

Contributor

theirix commented Jul 2, 2014

@ashkulz It is described a little in commit message, let me explain more. We have some interconnected wiki pages with cyrillic links. Because it's a wiki page url has these unicode characters too. Wiki content is scraped for links, some links are extracted as UTF-8 strings and placed into map. The same links are passed to the wkhtmltopdf as arguments percent-encoding. So input should be percent-decoded first - this is what this patch does. If all links in document are latin-1 then this problem can't be seen.

@ashkulz

This comment has been minimized.

Member

ashkulz commented Jul 3, 2014

Can you give an example of this, so I can understand it clearly? Do you have a sample HTML so that I can understand it?

@ashkulz ashkulz added the NeedInfo label Jul 3, 2014

@theirix

This comment has been minimized.

Contributor

theirix commented Jul 5, 2014

@ashkulz, please take a look to this example - http://omniverse.ru/public/wkhtmltopdf-test.zip
It contains three files, main page with local hyperlinks to other two pages - TestPage and ТестоваяСтраница (TestPage in Russian).

I converted it in FreeBSD with

wkhtmltopdf main.html TestPage.html ТестоваяСтраница.html test.pdf"

and in Linux with

wkhtmltopdf main.html TestPage.html "file://`pwd`/%D0%A2%D0%B5%D1%81%D1%82%D0%BE%D0%B2%D0%B0%D1%8F%D0%A1%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0.html" test.pdf"

It seems like Linux's QT can't use unicode urls without percent-encoding them.

Original and patched wkhtmltopdf successfully build a pdf with three pages but original can't deduce link from first page to unicode-named last page (ТестоваяСтраница). Patched version does.

@ashkulz

This comment has been minimized.

Member

ashkulz commented Nov 17, 2014

Can you check if this is still relevant, as I suspect 6f2cafe would have fixed the unicode issues in the command line?

@theirix

This comment has been minimized.

Contributor

theirix commented Nov 18, 2014

No, version with this commit (0.12.2-dev-85cdf36 Debian binary) does not fix this problem. Cyrillic link from the first page of resulting PDF is still unclickable. I checked Linux only because FreeBSD because port compilation takes a lot of time.

@ashkulz

This comment has been minimized.

Member

ashkulz commented Nov 21, 2014

Ok, just reviewed the code: I think that the following would do equally well, what do you think?

diff --git a/src/lib/pdfconverter.cc b/src/lib/pdfconverter.cc
index 162741c..34b14f1 100644
--- a/src/lib/pdfconverter.cc
+++ b/src/lib/pdfconverter.cc
@@ -559,9 +559,10 @@ void PdfConverterPrivate::findLinks(QWebFrame * frame, QVector<QPair<QWebElement
            QUrl href(h);
            if (href.isEmpty()) continue;
            href=frame->baseUrl().resolved(href);
-           if (urlToPageObj.contains(href.toString(QUrl::RemoveFragment))) {
+           QString key = QUrl::fromPercentEncoding(href.toString(QUrl::RemoveFragment).toLocal8Bit());
+           if (urlToPageObj.contains(key)) {
                if (ulocal) {
-                   PageObject * p = urlToPageObj[href.toString(QUrl::RemoveFragment)];
+                   PageObject * p = urlToPageObj[key];
                    QWebElement e;
                    if (!href.hasFragment())
                        e = p->page->mainFrame()->findFirstElement("body");
@@ -757,7 +758,7 @@ void PdfConverterPrivate::tocLoaded(bool ok) {
        for (int d=0; d < objects.size(); ++d) {
            if (!objects[d].loaderObject || objects[d].loaderObject->skip) continue;
            if (objects[d].settings.isTableOfContent) continue;
-           urlToPageObj[ objects[d].page->mainFrame()->url().toString(QUrl::RemoveFragment) ] = &objects[d];
+           urlToPageObj[ QUrl::fromPercentEncoding(objects[d].page->mainFrame()->url().toString(QUrl::RemoveFragment).toLocal8Bit()) ] = &objects[d];
        }

        for (int d=0; d < objects.size(); ++d) {

If it works for you, please rebase your PR on the latest master and resubmit it with an entry in the changelog (like I did for 86ce2a1).

@ashkulz ashkulz removed the NeedInfo label Nov 21, 2014

Fix unicode URLs
Store in hashmap strings parsed from percent-encoded unicode URLs.
Patch fixes behaviour of wkhtmltopdf with percent-encoded urls.
They are gathered from body as unicode strings but are read from
argument list as percent-encoded strings.
@theirix

This comment has been minimized.

Contributor

theirix commented Nov 23, 2014

@ashkulz , I checked your version of patch and it works perfectly on my test. I had updated changelog and rebased pull request. Please check it out.

@ashkulz ashkulz merged commit 7486408 into wkhtmltopdf:master Nov 24, 2014

@ashkulz ashkulz added the Merged label Nov 24, 2014

@ashkulz

This comment has been minimized.

Member

ashkulz commented Nov 24, 2014

Thanks a lot, and sorry for the delay 😄

@ashkulz

This comment has been minimized.

Member

ashkulz commented Jan 4, 2015

A release candidate 0.12.2-rc-71e97c1 is available, which has this change. Please report back if you encounter an issue with the above build.

@ashkulz ashkulz added this to the 0.12.2 milestone Jan 4, 2015

@ashkulz

This comment has been minimized.

Member

ashkulz commented Jan 10, 2015

0.12.2 has been released, which includes changes related to this issue.

@vicancy

This comment has been minimized.

vicancy commented Jun 26, 2017

Hi @ashkulz is the latest version containing the fix? I can still repro the issue with the latest 0.12.4:

  1. Create a localized file, e.g. ТестоваяСтраница.html
  2. Create the main file main.html containing:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>title</title>
</head>
<body>
<p>Ссылка на <a href="%D0%A2%D0%B5%D1%81%D1%82%D0%BE%D0%B2%D0%B0%D1%8F%D0%A1%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0.html">
	ТестоваяСтраница</a></p>
</body>
</html>
  1. Convert: wkhtmltopdf main.html ТестоваяСтраница.html test.pdf
  2. test.pdf is generated, however the link is not correctly resolved, it still link to an external link.
@theirix

This comment has been minimized.

Contributor

theirix commented Jul 2, 2017

@vicancy , I just installed a 0.12.4 version from the https://github.com/wkhtmltopdf/wkhtmltopdf/releases/tag/0.12.4 page to the Debian 9 docker container. Your test works fine. Do you use binary from the github or from the OS repository?

@vicancy

This comment has been minimized.

vicancy commented Jul 3, 2017

Hi @theirix I tried the github version however with no luck 😢 Is the fix has specific requirements on OS region or locale settings? I am using en-US as the default locale, doubt if that matters.

@theirix

This comment has been minimized.

Contributor

theirix commented Jul 3, 2017

Maybe your locale is not UTF-8? Could you please post the locale and locale -a output? What is your OS?

Mine locale on Debian 9 is:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

On a Debian system UTF-8 locale can be added via localedef -i en_US -c -f UTF-8 en_US.UTF-8 call. Reproducible test - https://gist.github.com/theirix/0876a5047a7da5da961444d2fd78c2e7

@vicancy

This comment has been minimized.

vicancy commented Jul 6, 2017

@theirix I am on Windows... 😭

@theirix

This comment has been minimized.

Contributor

theirix commented Jul 6, 2017

@vicancy I'm afraid I cannot check it quickly. I'll try to check it this Saturday.

@theirix

This comment has been minimized.

Contributor

theirix commented Jul 11, 2017

@vicancy this issue is really reproducible on Windows (tried 7 x64) even with a Russian locale. So my fix doesn't work for Windows. Unfortunately I have no enough expertise in Windows and QT to fix this quickly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment