
Conversion Issues

Tyler Clemens edited this page Nov 14, 2013 · 13 revisions

Description: This document catalogs issues found when converting PDFs to HTML with pdf2htmlEX. For each issue it gives examples and an estimate of how often the issue occurs.

Capitalization of letters is off at times

Frequency (low-medium)
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf

Word selection is broken due to spans that are inserted to correct for kerning

Frequency (medium-high)

pdf2htmlEX applies a width and a margin to spans to correct for kerning. When one of these spans falls inside a word, selecting the word stops at the span.
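
One way to estimate how often this happens in a converted page is to count the offset spans that sit between two non-space characters. The sketch below is illustrative only: it assumes offset spans can be recognized by a class name (here `_`, which is an assumption about the generated markup, as is the sample HTML), and `SplitWordFinder` and `count_split_words` are hypothetical helper names.

```python
from html.parser import HTMLParser


class SplitWordFinder(HTMLParser):
    """Count offset spans that sit between two non-space characters."""

    def __init__(self, offset_class="_"):
        super().__init__()
        self.offset_class = offset_class  # assumed marker class for offset spans
        self.stack = []        # one bool per open <span>: is it an offset span?
        self.prev_char = " "   # last text character seen so far
        self.pending = False   # an offset span just closed right after a letter
        self.splits = 0

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(self.offset_class in classes)

    def handle_endtag(self, tag):
        if tag != "span" or not self.stack:
            return
        if self.stack.pop():
            # Only a span that directly follows a non-space character can split a word.
            self.pending = not self.prev_char.isspace()

    def handle_data(self, data):
        if not data:
            return
        if any(self.stack):
            # Text inside an offset span (for example a literal space).
            self.prev_char = data[-1]
            return
        if self.pending and not data[0].isspace():
            self.splits += 1
        self.pending = False
        self.prev_char = data[-1]


def count_split_words(html, offset_class="_"):
    finder = SplitWordFinder(offset_class)
    finder.feed(html)
    finder.close()
    return finder.splits


# Hypothetical converted line: the first span splits "Kerning", the second follows a space.
sample = '<div>Ker<span class="_ _0"></span>ning is <span class="_ _1"></span>fine</div>'
print(count_split_words(sample))
```

Dividing the count by the number of words on the page would give a rough per-page frequency for this issue.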


*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf

Word selection is broken due to lack of spaces at the end of divs

Frequency (low)

This is not a problem on the Kindle.


*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf

Word selection is broken due to placement of divs

Frequency (low)


*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf

Justification of text is slightly off, making the text look ragged right

Frequency (medium-high): This problem occurs every time text in the PDF is justified. Sometimes the output looks close to justified, and other times it is significantly off.

This problem occurs because we use the --optimize-text command line option to remove some spans that interfere with word selection. Optimize text reduces the number of spans in a line and adjusts the letter spacing and word spacing of the entire line to account for this reduction. It's an imperfect approximation.
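
The effect of replacing per-gap offsets with one line-wide spacing value can be illustrated with a toy model. This is not pdf2htmlEX's actual algorithm: the function name `uniform_word_spacing`, the `precision` parameter, and the idea that the spacing value is rounded are all assumptions made for illustration, and letter spacing is ignored.

```python
def uniform_word_spacing(gaps, precision=3):
    """Replace per-gap offsets with a single rounded word-spacing value.

    gaps: extra space the PDF placed between consecutive words on one line.
    Returns the uniform spacing and the horizontal drift after each gap
    (position with uniform spacing minus the original position).
    """
    if not gaps:
        raise ValueError("need at least one gap")
    spacing = round(sum(gaps) / len(gaps), precision)
    drift = []
    original = 0.0
    for i, gap in enumerate(gaps, start=1):
        original += gap
        drift.append(i * spacing - original)
    return spacing, drift


# Words inside the line move even when the line width is preserved.
print(uniform_word_spacing([2.0, 6.0, 4.0]))

# With coarse rounding, the last drift value is nonzero: the right edge
# no longer lines up with the other lines.
print(uniform_word_spacing([1.0, 1.0, 2.0], precision=1))
```

In this model the last entry of the drift list is the error at the right margin, which is the value that would show up as a ragged edge.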


*SS taken from HTML converted Generation Kill.pdf
*SS taken from HTML converted clifford.pdf

Text selection is broken because of a word split by a space in a span

Frequency (low)

pdf2htmlEX guesses when to insert a space in its offset spans. It guesses based on the width of a space and the kerning of characters. If a false positive occurs, a word will be broken by a space character.
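
A minimal sketch of this kind of guess, and of how it can produce a false positive, is shown below. The rule used here (treat an offset as a space when it reaches a fraction of the space width) and the `threshold` value are assumptions for illustration, not the rule pdf2htmlEX actually applies; `join_glyph_runs` is a hypothetical name.

```python
def join_glyph_runs(runs, offsets, space_width, threshold=0.6):
    """Join text runs, inserting a space where the offset looks like one.

    runs: consecutive pieces of text on one line.
    offsets: horizontal gap before each run after the first, in font units.
    threshold: assumed fraction of space_width that counts as a space.
    """
    if len(offsets) != len(runs) - 1:
        raise ValueError("need exactly one offset between each pair of runs")
    parts = [runs[0]]
    for offset, run in zip(offsets, runs[1:]):
        if offset >= threshold * space_width:
            parts.append(" ")
        parts.append(run)
    return "".join(parts)


# Ordinary kerning inside "Fire" is small, so only the real space is kept.
print(join_glyph_runs(["Fi", "re", "in"], [-20, 250], space_width=250))

# A wide gap inside the word is mistaken for a space (false positive).
print(join_glyph_runs(["Fi", "re", "in"], [180, 250], space_width=250))
```

Raising the threshold removes this false positive but risks dropping narrow real spaces, which is why any fixed cutoff is a trade-off.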


*SS taken from HTML converted Fire-in-My-Belly-TEST-RGB-LINKED.pdf

Text selection is broken because of a word split by a space character

Frequency (low)

pdf2htmlEX guesses when to insert spaces between characters when it reduces spans with optimize text. It guesses based on the width of a space and the kerning of characters. When this guess is a false positive, an extra space appears in the text output, sometimes breaking up words.
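
To find these extra spaces in converted output, one option is to compare the converted text against the words of the original PDF. The sketch below assumes a list of reference words is available from some other text extraction of the PDF (not shown here), ignores punctuation, and uses a hypothetical helper name, `find_split_words`.

```python
def find_split_words(converted_text, reference_words):
    """Return adjacent token pairs that look like one reference word split in two.

    A pair is reported when the two tokens joined together form a reference
    word and at least one of the tokens is not itself a reference word.
    """
    reference = {word.lower() for word in reference_words}
    tokens = converted_text.split()
    hits = []
    for left, right in zip(tokens, tokens[1:]):
        joined = (left + right).lower()
        both_known = left.lower() in reference and right.lower() in reference
        if joined in reference and not both_known:
            hits.append((left, right))
    return hits


reference = ["The", "generation", "was", "killed"]
print(find_split_words("The gen eration was killed", reference))
```

Pairs where both halves are themselves real words (for example "in" and "to") are skipped on purpose, so this check undercounts rather than overcounts.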


*SS taken from HTML converted GS-26-pdftk.pdf

Text alignment is off

Frequency (low-medium)


*SS taken from HTML converted clifford.pdf
*SS taken from HTML converted clifford.pdf

Text does not line up with background images

Frequency (low-medium)


*SS taken from HTML converted Minecraft.pdf

Text that does not appear in the PDF appears in the HTML output

Frequency (low)


**SS taken from HTML converted

PDFs Referenced
