Special language characters not rendered to PDF #13
Comments
pom file plugin configuration follows:
|
Hmm, is that outside of UTF-8? HTML of course supports any character encoding, and I think iText (the PDF library) does too. So the best solution is for the user to be able to specify which character encoding to use. I should actually have thought about that from the beginning. I will add that feature. I'm currently adding some other new features, now that I have figured out how to not render section numbers. I will thereby also make section numbers a setting for the PDFGenerator. Also, I did not see this issue until now. GitHub refuses to email me any notifications at all for any of my own repositories. I do get notifications from other repos I'm watching, but never from my own. I have contacted GitHub about this. I do have every "send me mail" setting enabled and a correct mail address. |
Czech national characters are in UTF-8 (see http://www.utf8-chartable.de/unicode-utf8-table.pl - block: Latin-1 Supplement)
|
Does this problem apply to both the generated PDF and HTML, or only one of the two? |
OK, sorry about that. I had the wrong encoding in the MD file. I have now corrected it to UTF-8, and it behaves totally differently. I'm attaching the output generated from the MD file in UTF-8 and in CP1250.
UTF-8
HTML:
PDF:
CP1250
PDF:
|
Just so that I have something correct to test with, can you please send me your test .md file? Mail it to tommy@natusoft.se.
|
I get a slightly different result than you, but I'm on a Mac. I assume you are running Windows. For HTML I get a result that is identical to the source. I've tested in Safari, Chrome, and Opera, and it looks the same in all of them. For PDF, however, I got something that at first looked correct, but then I noticed that some of the characters were missing in the result. I've googled and looked in the iText book (I'm using iText to create PDFs) and came to the conclusion that it is not only up to the character set; the font used must also support the characters. MarkdownDoc currently hardcodes the Helvetica font. Making all fonts available in the configuration would make the config rather messy, so I decided against it. However, for the next version I have come up with another solution for allowing users to select their choice of fonts for PDF files. It will support a JSON file that will be similar in functionality to a CSS file. That is, it will allow you to specify the font, size, and colors for all formats. It will be possible to have several of these, and you specify the one you want to use when generating. Apparently iText can also import and use external .ttf fonts, so I will also make this file take paths to .ttf files. Unfortunately I can't tell you when this version will be available, since it depends a lot on when I have time to work on it. It is first in the queue of things I plan to do :-). |
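To illustrate why the font matters, here is a minimal sketch (not MarkdownDoc or iText code). It assumes that a non-embedded Helvetica can only show roughly the windows-1252 ("WinAnsi") repertoire, and it uses the JDK's windows-1252 charset as a stand-in for that font coverage. The class name `WinAnsiCoverage` and the method `dropUnsupported` are made up for this example. The Czech test sentence from the original report is written with Unicode escapes so that the source compiles regardless of the file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class WinAnsiCoverage {
    // Assumption: windows-1252 approximates what a non-embedded Helvetica can display.
    private static final Charset WIN_ANSI = Charset.forName("windows-1252");

    /** Returns the text with every character that windows-1252 cannot encode removed. */
    public static String dropUnsupported(String text) {
        CharsetEncoder encoder = WIN_ANSI.newEncoder();
        StringBuilder kept = new StringBuilder();
        // Iterating by char is enough here: the Czech letters are all in the BMP.
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (encoder.canEncode(c)) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // The lowercase Czech test sentence from the report, as Unicode escapes.
        String input = "p\u0159\u00edli\u0161 \u017elu\u0165ou\u010dk\u00fd "
                + "k\u016f\u0148 \u00fap\u011bl \u010f\u00e1belsk\u00e9 \u00f3dy";
        System.out.println(dropUnsupported(input));
    }
}
```

Under this assumption, the letters that survive (such as š and ž) are the ones windows-1252 contains, and the ones that disappear (such as ř, ť, č, ů, ň, ě, ď) are the ones it lacks, which is consistent with a font coverage problem rather than a broken UTF-8 source file.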
Thank you for your response. Yes, I'm on Windows. We have used iText on several projects and I've never faced this kind of problem. If I find some time, I can try to check it out and create some pull requests. I'm glad you've done a project like this. |
Can it be a platform difference? It does sound strange, however. I've worked with Java since 1.0 and I've never run into platform differences before (other than web applications and IE, which is only available on Windows :-)). But then again, Apple is in some ways worse than Microsoft when it comes to quality. They do good hardware, but suck at software. Nowadays, though, the Mac JDK comes from Oracle, not Apple. When it comes to iText and character encoding, it only reads and writes from streams, not readers/writers. The latter support character encodings. I'm not finding any other way to tell iText what encoding the input has. Maybe it can figure it out itself. Please take a look at the code yourself. It sounds like you have more iText experience than I have. This code is my only use of iText. If you are not familiar with Groovy, don't worry; this was my first Groovy project and it uses very few Groovy features. I was still kind of stuck in Java when I wrote this :-). |
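The stream versus reader distinction can be sketched with plain JDK classes (this is not iText or MarkdownDoc code, and the class name `StreamDecoding` and method `readAll` are made up for the example). A stream carries raw bytes, and only the reader wrapped around it decides which charset turns those bytes into characters. The sketch encodes a short Czech word as CP1250 (windows-1250) and then decodes the same bytes once with the right charset and once as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamDecoding {
    /** Reads the whole stream, decoding its bytes with the given charset. */
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Rethrow unchecked to keep the demo's call sites simple.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250");
        String original = "k\u016f\u0148"; // a Czech word, written as Unicode escapes
        byte[] bytes = original.getBytes(cp1250);

        String right = readAll(new ByteArrayInputStream(bytes), cp1250);
        String wrong = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);

        System.out.println("decoded as CP1250 matches: " + right.equals(original));
        System.out.println("decoded as UTF-8 matches:  " + wrong.equals(original));
    }
}
```

Decoding with the wrong charset does not fail loudly: `InputStreamReader` substitutes replacement characters for byte sequences it cannot decode, so a mismatch between the file's encoding and the assumed one tends to surface only as damaged text in the output.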
The problem is probably that iText doesn't do the font substitution that web browsers do. I've solved this before by using fonts for the PDF that have extensive character set support. Check fonts like the Microsoft standard ones, Source Sans Pro, and some of the open-source ones. |
Thanks hilton for waking me up! I had forgotten about this old, hard-to-solve problem; it was from 2015! As last summer's coding project I made version 2.0, currently 2.0.1. It replaced iText with PDFBox. PDFBox is a very different creature and gives me as a developer more options and control over the output. This version also added support for what I call MSS (Markdown Style Sheet), in the format of a JSON document. With the MSS it is possible to make use of external ttf fonts. The MarkdownDoc manual makes use of such a font. This might also solve honzik1's problem, by using an appropriate external ttf font that supports the characters of his language. In Swedish we have only 3 "strange" characters, but they are available in UTF-8, ISO-8859-1, CP1252, and a few more. Therefore we seldom have problems with fonts not containing them. But because they are available in several encodings, where the codes for these characters are not the same, they are 3 truly expensive characters in the amount of trouble they cause :-) |
Hi, I missed the new releases. I tried 2.0.1, but it is not possible to generate a PDF. It failed because of unknown characters (even with fonts from my Czech Windows 7). I am posting the rewritten project so you can see. The Arial and Times fonts failed on unknown characters with UTF-8 encoding. Strangely, ARIALUNI.TTF fails with "Java heap space", even with set MAVEN_OPTS=-Xmx2G && mvn install. Kind regards,
Jan, I took your docs.zip and ran it to see clearly for myself what happened. That helped. I concluded the problem was in PDFBox. The PDFBox site now had a version 2.0.6, and I was using version 2.0.2. So I upgraded MarkdownDoc to 2.0.2 and bumped PDFBox to 2.0.6. After that, your docs.zip example generated a result that, at least to me, looked OK. PDFBox still warns about the "subset is empty" but no longer breaks on it. The new 2.0.2 version of MarkdownDoc is up on GitHub and Bintray, so you can just bump the version number in your docs.zip pom and try it yourself. If you wonder why your images are tilted, it is because the my.mss at
I assume you copied this from MarkdownDoc's manual MSS file. I rotate the editor picture slightly with this. Let me know how it worked for you. /Tommy |
First, I really appreciate your work on this great Maven plugin. Thank you.
The following Czech sentence for encoding tests, written in MD, is not fully rendered in the PDF.
Input:
příliš žluťoučký kůň úpěl ďábelské ódy
PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY
Output:
píliš žluouký k úpl ábelské ódy
PÍLIŠ ŽLUOUKÝ K ÚPL ÁBELSKÉ ÓDY