Conversion Issues

Tyler Clemens edited this page Nov 19, 2013 · 13 revisions

Description: This document catalogs issues found when converting PDFs to HTML with pdf2htmlEX. For each issue it gives examples, an estimate of how often the issue occurs, and the steps taken to reproduce it.

Capitalization of letters is off at times

Frequency (low-medium)
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf

How it was reproduced: Have not been able to reproduce this yet using InDesign.

Word selection is broken due to spans that are inserted to correct for kerning

Frequency (medium-high)

pdf2htmlEX applies a width and a margin to spans to correct for kerning.
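To make the effect concrete, here is a minimal sketch of why such a span interferes with word selection. The markup in `LINE` is hypothetical: it is modeled on the description above (an empty span carrying a width and a margin), not copied from real pdf2htmlEX output, and the class name and pixel values are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: an empty offset span with a width and a negative
# margin sits in the middle of the word "Conversion".
LINE = (
    '<div class="line">Conver'
    '<span style="display:inline-block;width:2px;margin-left:-3px"></span>'
    'sion issues</div>'
)


class TextRuns(HTMLParser):
    """Collect each uninterrupted run of text found between tags."""

    def __init__(self):
        super().__init__()
        self.runs = []

    def handle_data(self, data):
        self.runs.append(data)


parser = TextRuns()
parser.feed(LINE)
parser.close()
runs = parser.runs

print(runs)           # ['Conver', 'sion issues']
print("".join(runs))  # Conversion issues
```

The concatenated text is still correct, but the word is stored as two separate text runs with an element between them. That split is what a reader application sees when it tries to select a whole word.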


*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf

How it was reproduced: Many different fonts were used to create text in InDesign. Less traditional fonts at large font sizes seem to exhibit the behavior more often. More traditional fonts at smaller sizes don't seem to exhibit it at all.

Word selection is broken due to lack of spaces at the end of divs

Frequency (low)

This is not a problem on the Kindle.


*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf

How it was reproduced: Two text boxes were aligned vertically in InDesign without spaces at the end of the first and the beginning of the second.

Word selection is broken due to placement of divs

Frequency (low)


*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf

How it was reproduced: Have not been able to reproduce this in InDesign.

Justification of text is slightly off making the text look ragged right

Frequency (medium-high): This problem occurs every time text in the PDF is justified. Sometimes the output looks close to justified, and other times it is significantly off.

This problem occurs because we use the optimize text command-line option to remove some spans that interfere with word selection. Optimize text reduces the number of spans in a line and adjusts the letter spacing and word spacing of the entire line to account for this reduction. It's an imperfect approximation.
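The following sketch illustrates how replacing per-gap offsets with a single per-line spacing value can leave line ends misaligned. It is not pdf2htmlEX's actual algorithm. The function names, the sample widths, and especially the snapping of the spacing to a coarse `step` are assumptions chosen only to show one way such an approximation can drift.

```python
def exact_width(word_widths, gaps):
    """Line width when every gap keeps its own exact size."""
    return sum(word_widths) + sum(gaps)


def uniform_gap(gaps, step=0.5):
    """One spacing value for the whole line, snapped to a coarse step.

    The snapping is a hypothetical source of error used for illustration.
    """
    mean = sum(gaps) / len(gaps)
    return round(mean / step) * step


def approx_width(word_widths, gaps, step=0.5):
    """Line width when every gap is replaced by the single uniform value."""
    return sum(word_widths) + len(gaps) * uniform_gap(gaps, step)


# Two made-up justified lines; both are exactly 300 units wide in the PDF.
line_a = ([60, 80, 70, 50], [13.0, 14.0, 13.0])
line_b = ([90, 40, 75, 63], [10.0, 11.0, 11.0])

for words, gaps in (line_a, line_b):
    print(exact_width(words, gaps), approx_width(words, gaps))
# 300.0 300.5
# 300.0 299.5
```

In this made-up example both lines end at the same position in the source, but after the per-line approximation one runs slightly long and the other slightly short, which reads as a ragged right edge.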


*SS taken from HTML converted Generation Kill.pdf
*SS taken from HTML converted clifford.pdf

How it was reproduced: Created a large block of generated "Lorem Ipsum" text in InDesign. When this text was exported as a PDF and converted, it showed the justification issue.

Text selection is broken because of a word split by a space in a span

Frequency (low)

pdf2htmlEX guesses when to insert a space in its offset spans. It guesses based on the width of a space and the kerning of characters. If a false positive occurs, a word will be broken by a space character.
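A width-based guess of this kind can be sketched as follows. This is a simplified illustration, not pdf2htmlEX's code: the function name `guess_spaces`, the `threshold` ratio, and the sample gap values are all assumptions.

```python
def guess_spaces(glyphs, space_width, threshold=0.5):
    """Rebuild text from (character, gap_after) pairs.

    A space is inserted whenever the gap after a character is at least
    `threshold * space_width`. Both the rule and the default threshold
    are hypothetical stand-ins for the real heuristic.
    """
    out = []
    for char, gap_after in glyphs:
        out.append(char)
        if gap_after >= threshold * space_width:
            out.append(" ")
    return "".join(out).rstrip()


# Made-up measurements for the text "fire in". The gap after the first "i"
# is unusually wide (for example, loose kerning in a display font).
glyphs = [
    ("f", 0.2), ("i", 1.6), ("r", 0.1), ("e", 3.0),
    ("i", 0.1), ("n", 0.0),
]

print(guess_spaces(glyphs, space_width=3.0))  # fi re in
```

The real word break after "e" is detected correctly, but the wide kerning gap inside "fire" also crosses the threshold, so the word is split. Raising the threshold would avoid this false positive at the cost of missing narrow genuine spaces.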


*SS taken from HTML converted Fire-in-My-Belly-TEST-RGB-LINKED.pdf

How it was reproduced: Have not been able to reproduce this using InDesign.

Text selection is broken because of a word split by a space character

Frequency (low)

pdf2htmlEX guesses when to insert spaces between characters when it reduces spans with optimize text. It guesses based on the width of a space and the kerning of characters. When this guess produces a false positive, an extra space appears in the text output, sometimes breaking up words.


*SS taken from HTML converted GS-26-pdftk.pdf

How it was reproduced: Have not been able to reproduce this using InDesign.

Text alignment is off

Frequency (low-medium)


*SS taken from HTML converted clifford.pdf
*SS taken from HTML converted clifford.pdf

Text does not line up with background images

Frequency (low-medium)


*SS taken from HTML converted Minecraft.pdf

Text that does not appear in the PDF appears in the HTML output

Frequency (low)


**SS taken from HTML converted

PDFs Referenced