
Conversion Issues

Tyler Clemens edited this page Nov 14, 2013 · 13 revisions

Description: This document catalogs issues found when converting PDFs to HTML with pdf2htmlEX. It gives examples of each problem, taken from each PDF in which it was found, and includes an estimate of how frequently each problem occurs.

Capitalization of letters is off at times

Frequency: low to medium
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf

Word selection is broken due to spans that are inserted to correct for kerning

pdf2htmlEX applies a width and a margin to spans to correct for kerning.


*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf

Word selection is broken due to lack of spaces at the end of divs

This is not a problem on the Kindle.


*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf

Word selection is broken due to placement of divs


*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf

Justification of text is slightly off, making the text look ragged right

This happens because we use the optimize text command-line option to remove some spans that interfere with word selection. Optimize text reduces the number of spans in a line and adjusts the letter spacing and word spacing of the entire line to account for the reduction. It's an imperfect approximation.
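
To illustrate why a single line-wide spacing value can only approximate per-glyph positioning, here is a minimal Python sketch. It is not pdf2htmlEX's actual algorithm: the function names, the averaging rule, and the numbers are all assumptions made for illustration. It replaces the individual offsets between glyphs with their average and compares the resulting glyph positions.

```python
def positions(advances, gaps):
    """Return the x position of each glyph start.

    advances: width of each glyph; gaps: extra offset after each glyph
    except the last (so len(gaps) == len(advances) - 1).
    """
    xs = [0.0]
    for advance, gap in zip(advances, gaps):
        xs.append(xs[-1] + advance + gap)
    return xs


def uniform_gaps(gaps):
    """Hypothetical simplification: one averaged spacing for the whole line."""
    average = sum(gaps) / len(gaps)
    return [average] * len(gaps)


# Made-up glyph widths and per-gap offsets (arbitrary units).
advances = [5.0, 5.0, 5.0, 5.0]
gaps = [2.0, 0.0, 1.0]

exact = positions(advances, gaps)
approx = positions(advances, uniform_gaps(gaps))
drift = [a - e for a, e in zip(approx, exact)]

print(exact)   # positions with one offset per gap
print(approx)  # positions with a single line-wide spacing
print(drift)   # per-glyph difference between the two
```

In this toy model the second glyph moves by one unit even though the last glyph still lands in the same place. The sketch does not attempt to reproduce the ragged right edge itself; it only shows that collapsing many offsets into one value changes where glyphs are drawn.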


*SS taken from HTML converted Generation Kill.pdf
*SS taken from HTML converted clifford.pdf

Text selection is broken because of a word split by a space in a span

pdf2htmlEX guesses when to insert a space in its offset spans. It guesses based on the width of a space and the kerning of characters. If a false positive occurs, a word will be broken by a space character.
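
A guess of this kind can be sketched in Python as a simple threshold test. This is a hypothetical illustration, not pdf2htmlEX's implementation: the function name `guess_spaces`, the `threshold` parameter, and the sample measurements are all assumptions. A gap before a glyph is treated as a space when it reaches some fraction of the font's space width.

```python
def guess_spaces(glyphs, space_width, threshold=0.5):
    """Join glyphs, inserting a space wherever a gap looks like a word break.

    glyphs: list of (character, gap_before) pairs, where gap_before is the
    horizontal offset in front of the character, in the same units as
    space_width.
    threshold: hypothetical fraction of space_width at or above which a gap
    is treated as a space.
    """
    out = []
    for char, gap_before in glyphs:
        if out and gap_before >= threshold * space_width:
            out.append(" ")
        out.append(char)
    return "".join(out)


SPACE_WIDTH = 0.28  # made-up space width, in em

# "in to": a genuine word gap in front of "t".
two_words = [("i", 0.0), ("n", 0.0), ("t", 0.28), ("o", 0.0)]
# "into" with ordinary kerning: the gap in front of "t" is small.
tight_word = [("i", 0.0), ("n", 0.0), ("t", 0.02), ("o", -0.01)]
# "into" with loose spacing: the gap is wide, but it is not a space.
loose_word = [("i", 0.0), ("n", 0.0), ("t", 0.15), ("o", 0.0)]

print(guess_spaces(two_words, SPACE_WIDTH))   # space inserted, correctly
print(guess_spaces(tight_word, SPACE_WIDTH))  # no space inserted
print(guess_spaces(loose_word, SPACE_WIDTH))  # space inserted: false positive
```

The third case is the failure described above: a wide but legitimate gap inside a word crosses the threshold, so the word is split. Raising the threshold would reduce false positives at the cost of missing narrow real spaces.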


*SS taken from HTML converted Fire-in-My-Belly-TEST-RGB-LINKED.pdf

Text selection is broken because of a word split by a space character

pdf2htmlEX guesses when to insert spaces between characters when it reduces spans with optimize text. It guesses based on the width of a space and the kerning of characters. When this guess produces a false positive, an extra space appears in the text output, sometimes breaking up words.


*SS taken from HTML converted GS-26-pdftk.pdf

Text alignment is off


*SS taken from HTML converted clifford.pdf
*SS taken from HTML converted clifford.pdf

Text does not line up with background images


*SS taken from HTML converted Minecraft.pdf

Text that does not appear in the PDF appears in the HTML output


PDFs Referenced
