Conversion Issues
Description: This document catalogs issues found when converting PDFs to HTML with pdf2htmlEX. It gives examples of each problem and an estimate of how often each one occurs.
Frequency (low-medium)
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf
Frequency (medium-high)
pdf2htmlEX applies a width and a margin to spans to correct for kerning.
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
*SS taken from HTML converted clifford.pdf
Frequency (low)
This is not a problem on the Kindle.
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
Frequency (low)
*SS taken from HTML converted MFT-TEST-ASSEMBLED-LINKED-RGB.pdf
Frequency (medium-high): This problem occurs every time text in the PDF is justified. Sometimes the output looks close to justified, and other times it is significantly off.
This problem occurs because we use the --optimize-text command-line option to remove some spans that interfere with word selection. Optimize text reduces the number of spans in a line and adjusts the letter spacing and word spacing of the entire line to account for this reduction. It is an imperfect approximation.
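To make the approximation concrete, here is a minimal sketch of what happens when a line of individually positioned words is collapsed into one span with a single word-spacing value. It is an illustration only, not pdf2htmlEX's actual algorithm: the function names, the word widths, and the positions are all hypothetical, and letter spacing is ignored for simplicity.

```python
def fit_word_spacing(word_widths, space_width, pdf_last_word_x):
    """Pick one word-spacing value so that the last word of the merged
    line starts where the PDF placed it (hypothetical fitting rule)."""
    gaps = len(word_widths) - 1
    if gaps <= 0:
        return 0.0
    # Where the last word would start with no extra word spacing.
    natural_last_x = sum(word_widths[:-1]) + gaps * space_width
    return (pdf_last_word_x - natural_last_x) / gaps


def merged_positions(word_widths, space_width, word_spacing):
    """Start x of each word when the whole line is a single span."""
    positions, x = [], 0.0
    for width in word_widths:
        positions.append(x)
        x += width + space_width + word_spacing
    return positions


# Hypothetical justified line: three words with uneven gaps in the PDF.
word_widths = [30.0, 20.0, 40.0]
space_width = 5.0
pdf_positions = [0.0, 42.0, 70.0]

spacing = fit_word_spacing(word_widths, space_width, pdf_positions[-1])
merged = merged_positions(word_widths, space_width, spacing)
errors = [m - p for m, p in zip(merged, pdf_positions)]
print(spacing, merged, errors)
```

In this made-up line the first and last words land exactly where the PDF put them, but the middle word ends up 2 units to the left. A single spacing value cannot reproduce gaps that differ from one another, which is why the justified text drifts.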
*SS taken from HTML converted Generation Kill.pdf
*SS taken from HTML converted clifford.pdf
Frequency (low)
pdf2htmlEX guesses when to insert a space in its offset spans. The guess is based on the width of a space and the kerning of the characters. When the guess is a false positive, a word is broken by a space character.
*SS taken from HTML converted Fire-in-My-Belly-TEST-RGB-LINKED.pdf
Frequency (low)
pdf2htmlEX guesses when to insert spaces between characters when it reduces spans with optimize text. The guess is based on the width of a space and the kerning of the characters. When the guess is a false positive, an extra space appears in the text output, sometimes breaking up words.
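The kind of guess described here can be sketched as a simple gap test: insert a space whenever the horizontal gap between two glyphs is at least some fraction of a space's width. This is an assumed heuristic for illustration. The function name, the glyph tuples, and the 0.5 threshold are hypothetical, and pdf2htmlEX's real rule may differ.

```python
def text_with_guessed_spaces(glyphs, space_width, threshold=0.5):
    """Rebuild text from positioned glyphs, guessing where spaces go.

    glyphs is a list of (char, x_start, x_end) tuples in reading order.
    A space is inserted when the gap to the previous glyph is at least
    threshold * space_width (an assumed rule, not pdf2htmlEX's own).
    """
    pieces = []
    for i, (char, x_start, _x_end) in enumerate(glyphs):
        if i > 0:
            gap = x_start - glyphs[i - 1][2]
            if gap >= threshold * space_width:
                pieces.append(" ")
        pieces.append(char)
    return "".join(pieces)


# Hypothetical glyph boxes for the word "Tom", set tightly and loosely.
tight = [("T", 0.0, 10.0), ("o", 11.0, 17.0), ("m", 18.0, 28.0)]
loose = [("T", 0.0, 10.0), ("o", 13.0, 19.0), ("m", 20.0, 30.0)]

tight_text = text_with_guessed_spaces(tight, space_width=5.0)
loose_text = text_with_guessed_spaces(loose, space_width=5.0)
print(repr(tight_text), repr(loose_text))
```

With the loosely kerned glyphs, the gap after the "T" crosses the threshold and the word comes out as "T om", which is the false positive described above. Raising the threshold would avoid this break but would also risk missing real spaces.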
*SS taken from HTML converted GS-26-pdftk.pdf
Frequency (low-medium)
*SS taken from HTML converted clifford.pdf
*SS taken from HTML converted clifford.pdf
Frequency (low-medium)
*SS taken from HTML converted Minecraft.pdf
Frequency (low)
**SS taken from HTML converted
PDFs Referenced













