Special language characters not rendered to PDF #13

Closed
honzik1 opened this issue May 6, 2015 · 13 comments

Comments

@honzik1

honzik1 commented May 6, 2015

First, I really appreciate your work on this great Maven plugin. Thank you.

The following Czech sentence, used for encoding tests and written in Markdown, is not fully rendered in the generated PDF.

Input:
příliš žluťoučký kůň úpěl ďábelské ódy
PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY

Output:
píliš žluouký k úpl ábelské ódy
PÍLIŠ ŽLUOUKÝ K ÚPL ÁBELSKÉ ÓDY

@honzik1
Author

honzik1 commented May 6, 2015

The plugin configuration from the POM file follows:

<plugin>
    <groupId>se.natusoft.tools.doc.markdowndoc</groupId>
    <artifactId>markdowndoc-maven-plugin</artifactId>
    <version>1.3.8</version>
    <executions>
        <execution>
            <id>generate-docs</id>
            <goals>
                <goal>doc</goal>
            </goals>
            <phase>install</phase>
            <configuration>
                <generatorOptions>
                    <generator>pdf</generator>
                    <inputPaths>${basedir}/src/.*.md</inputPaths>
                </generatorOptions>
                <pdfGeneratorOptions>
                    <resultFile>${basedir}/readme.pdf</resultFile>
                    <pageSize>A4</pageSize>
                    <title>Readme</title>
                    <subject>Knowledge Base</subject>
                    <keywords/>
                    <version>1.0</version>
                    <author>...</author>
                    <copyright>Copyright © 2015 ...</copyright>
                    <hideLinks>false</hideLinks>
                    <unorderedListItemPrefix>• </unorderedListItemPrefix>
                    <firstLineParagraphIndent>false</firstLineParagraphIndent>
                    <backgroundColor>255:255:255</backgroundColor>
                    <blockQuoteColor>128:128:128</blockQuoteColor>
                    <codeColor>0:0:0</codeColor>
                    <generateTitlePage>true</generateTitlePage>
                    <generateTOC>true</generateTOC>
                </pdfGeneratorOptions>
            </configuration>
        </execution>
    </executions>
</plugin>

@tombensve
Owner

Hmm, is that outside of UTF-8? HTML of course supports any character encoding, and I think iText (the PDF library) does too. So the best solution is for the user to be able to specify which character encoding to use. I should actually have thought about that from the beginning.

I will add that feature. I'm currently adding some other new features now that I figured out how to not render section numbers. I will thereby also make section numbers a setting for the PDFGenerator.


Also, I did not see this issue until now. GitHub refuses to email me any notifications at all for any of my own repositories. I do get notifications from other repos I'm watching, but never from my own. I have contacted GitHub about this. I do have every "send me mail" setting enabled and a correct mail address.

@honzik1
Author

honzik1 commented May 28, 2015

Czech national characters are in UTF-8 (see http://www.utf8-chartable.de/unicode-utf8-table.pl - block: Latin Extended-A)
e.g.:

U+0158  Ř  c5 98   LATIN CAPITAL LETTER R WITH CARON
U+0159  ř  c5 99   LATIN SMALL LETTER R WITH CARON
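
The byte values in the table above can be checked with a few lines of Java. This is only a minimal sketch: the class name `CaronBytes` and the helper `hex` are made up for illustration, the windows-1250 charset is assumed to be available in the JDK, and the characters are written as Unicode escapes so the example does not itself depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CaronBytes {
    /** Formats the bytes of the text in the given charset as lowercase hex, e.g. "c5 98". */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the running JDK ships the windows-1250 (CP1250) charset.
        Charset cp1250 = Charset.forName("windows-1250");
        // U+0158 and U+0159 are R WITH CARON, capital and small.
        for (String s : new String[] {"\u0158", "\u0159"}) {
            System.out.println(String.format("U+%04X", (int) s.charAt(0))
                    + "  UTF-8: " + hex(s, StandardCharsets.UTF_8)
                    + "  CP1250: " + hex(s, cp1250));
        }
    }
}
```

The same character is two bytes in UTF-8 but a single byte in CP1250, which is why a file saved in one encoding and read as the other cannot come out intact.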

@tombensve
Owner

Does this problem apply to both the generated PDF and HTML, or only one of the two?

@honzik1
Author

honzik1 commented May 28, 2015

OK, sorry about that. The MD file had the wrong encoding. I have now converted it to UTF-8, and it behaves completely differently. I'm attaching the output generated from the MD file in UTF-8 and in CP1250.

UTF-8

HTML:

příliš žluťoučký ků�? úpěl ďábelské ódy 
P�?ÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL Ď�?BELSKÉ ÓDY

PDF:

p™-li luouÄk k pÄ›l Äbelsk© dy 
PLI LU¤OUÄK K®‡ šPÄšL ÄŽBELSK‰ “DY

CP1250

HTML:

p��li� �lu�ou�k� k�� �p�l ��belsk� �dy
P��LI� �LU�OU�K� K�� �P�L ��BELSK� �DY

PDF:

píliš žluouký k úpl ábelské ódy
PÍLIŠ ŽLUOUKÝ K ÚPL ÁBELSKÉ ÓDY
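
Garbling of this kind is what typically happens when bytes written in one encoding are decoded with another. The following is a rough sketch of that mechanism only, not a reproduction of what MarkdownDoc or iText does internally; the class `MojibakeDemo` and the method `misread` are hypothetical names, and windows-1250 is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Encodes text with one charset and decodes the bytes with another, as a misconfigured tool would. */
    static String misread(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250"); // assumed available in the JDK
        // First word of the test sentence, written with escapes to keep this file ASCII-only.
        String word = "p\u0159\u00edli\u0161";

        // CP1250 bytes read as UTF-8: invalid byte sequences become U+FFFD replacement characters.
        System.out.println(misread(word, cp1250, StandardCharsets.UTF_8));
        // UTF-8 bytes read as CP1250: each accented letter turns into two unrelated characters.
        System.out.println(misread(word, StandardCharsets.UTF_8, cp1250));
    }
}
```

What the console shows depends on its own encoding, so the printed text may differ between Windows and macOS terminals.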

@tombensve
Owner

Just so that I have something correct to test with, can you please send me your test .md file? Mail it to tommy@natusoft.se.


@tombensve
Owner

I get a slightly different result than you, but I'm on a Mac. I assume you are running Windows.

For HTML I get a result that is identical to the source. I've tested in Safari, Chrome and Opera, and it looks the same in all.

For PDF, however, I got something that at first looked correct, but then I noticed that some of the characters were missing in the result.

I've googled and looked in the iText book (I'm using iText to create PDFs) and came to the conclusion that it is not only up to the character set; the font used must also support the characters. MarkdownDoc currently hardcodes the Helvetica font. Making all fonts available in the configuration would make the config rather messy, so I decided against it. However, for the next version I have come up with another solution that lets users select their choice of fonts for PDF files. It will support a JSON file similar in functionality to a CSS file. That is, it will allow you to specify the font, size, and colors for all formats. It will be possible to have several of these, and you specify the one you want to use when generating. Apparently iText can also import and use external .ttf fonts, so I will also make this file take paths to .ttf files.
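
One way to see why only some letters vanish: the standard Helvetica font in a PDF is normally used with the WinAnsi (windows-1252) encoding, which contains some Czech letters but not others. The sketch below assumes that this is what happens here, which the thread does not confirm. It involves no iText at all; it simply drops every character that windows-1252 cannot encode. `WinAnsiFilter` and `keepEncodable` are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class WinAnsiFilter {
    /** Drops every character that the given charset cannot encode, keeping the rest unchanged. */
    static String keepEncodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder kept = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (encoder.canEncode(c)) kept.append(c);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Assumption: Helvetica is limited to the windows-1252 (WinAnsi) repertoire.
        Charset winAnsi = Charset.forName("windows-1252");
        // The lowercase test sentence from this issue, written with escapes (ASCII-only source).
        String input = "p\u0159\u00edli\u0161 \u017elu\u0165ou\u010dk\u00fd k\u016f\u0148 "
                     + "\u00fap\u011bl \u010f\u00e1belsk\u00e9 \u00f3dy";
        System.out.println(keepEncodable(input, winAnsi));
    }
}
```

Under that assumption the filtered sentence has the same gaps as the PDF output reported at the top of this issue, which points at the font's character repertoire rather than the file encoding alone.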

Unfortunately I can't tell you when this version will be available, since it depends a lot on when I have time to work on it. It is first in the queue of things I plan to do :-).

@honzik1
Author

honzik1 commented Jun 1, 2015

Thank you for your response. Yes, I'm on Windows. We have used iText on several projects and I've never faced this kind of problem. If I find some time, I can try to look into it and create some pull requests. I'm glad you've done a project like this.

@tombensve
Owner

Can it be a platform difference? It does sound strange, however. I've worked with Java since 1.0 and I've never run into platform differences before (other than web applications and IE, which is only available on Windows :-)).

But then again, Apple is in some ways worse than Microsoft when it comes to quality. They do good hardware, but suck at software. But nowadays the Mac JDK comes from Oracle, not Apple.

When it comes to iText and character encoding it only reads and writes from streams, not readers/writers. The latter supports character encodings. I'm not finding any other way to tell iText what encoding the input has. Maybe it can figure it out itself.
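
For reference, this is how plain Java attaches an encoding to a byte stream: the stream is wrapped in a reader created with an explicit charset. It is a generic `java.io` sketch, not an iText API, and whether MarkdownDoc reads its input this way is an assumption; `ExplicitCharsetRead` and `readAll` are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    /** Reads all text from a byte stream, decoding with an explicitly chosen charset. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A short Czech word as UTF-8 bytes, standing in for the contents of a .md file.
        byte[] utf8 = "k\u016f\u0148".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8));
    }
}
```

Without the explicit charset argument, Java versions before 18 fall back to the platform default, which is one plausible source of a Windows versus Mac difference.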

Please take a look at the code yourself. It sounds like you have more iText experience than I have. This code is my only use of iText. If you are not familiar with Groovy, don't worry, this was my first Groovy project and it is using very few Groovy features, I was still kind of stuck in Java when I wrote this :-).

@hilton

hilton commented Jul 9, 2017

The problem is probably that iText doesn't do the font substitution that web browsers do. I've solved this before by using fonts for the PDF that have extensive character set support. Check fonts like the Microsoft standard ones, Source Sans Pro and some of the open-source ones.

@tombensve
Owner

Thanks hilton for waking me up! I've forgotten about this old hard to solve problem, it was from 2015!

For last summer's coding project I made version 2.0, currently 2.0.1. It replaced iText with PDFBox. PDFBox is a very different creature and gives me as a developer more options and control over the output. This version also added support for what I call MSS (Markdown Style Sheet), in the format of a JSON document. With the MSS it is possible to make use of external ttf fonts. The MarkdownDoc manual makes use of such a font.

This might also solve honzik1's problem by using an appropriate external ttf font that supports the characters of his language.

In Swedish we have only 3 "strange" characters, but they are available in UTF-8, ISO-8859-1, CP1252, and a few more. Thereby we seldom have such problems with fonts not containing them. But because they are available in several encodings, where the codes for these characters are not the same, they are 3 truly expensive characters in the amount of trouble they cause :-)

@honzik1
Author

honzik1 commented Jul 10, 2017

Hi, I missed the new releases. I tried 2.0.1, but it is not possible to generate the PDF. It failed because of unknown characters (even with fonts from my Czech Windows 7).

I am posting the rewritten project so you can see. The Arial and Times fonts failed on unknown characters with UTF-8 encoding, and with CP1250 encoding they failed on:
Caused by: java.lang.IllegalStateException: subset is empty
    at org.apache.fontbox.ttf.TTFSubsetter.writeToStream(TTFSubsetter.java:938)

Strangely, ARIALUNI.TTF fails with a Java heap space error, even with set MAVEN_OPTS=-Xmx2G && mvn install

Kind Regards,
Jan
docs.zip

@tombensve
Owner

Jan, I took your docs.zip and ran it to see clearly for myself what happened. That helped. I concluded the problem was in PDFBox. The PDFBox site now had a version 2.0.6, and I was using PDFBox 2.0.2. So I upgraded MarkdownDoc to 2.0.2 and bumped PDFBox to 2.0.6. After that your docs.zip example generated a result that, at least to me, looked OK. PDFBox still warns about the "subset is empty" but no longer breaks on it.

The new 2.0.2 version of MarkdownDoc is up on GitHub and Bintray, so you can just bump the version number in your docs.zip pom and try it yourself.

If you wonder why your images are tilted, it is because of this setting in my.mss:

 ...
"document": {
  "pageFormat": "A4",
  "color": "black",
  "background": "white",
  "family": "ArialMT",
  "size": 10,
  "style": "Normal",

  "image": {
    "imgScalePercent": 50.0,
    "imgAlign": "MIDDLE",
    "imgRotateDegrees": -1.0 <--- change to 0 for no tilt, or just remove line.
  },
 ...

I assume you copied this from the MSS file of the MarkdownDoc manual. I rotate the editor picture slightly with this.

Let me know how it worked for you.

/Tommy
