Invalid characters in the veraPDF output #644

prwheatley · 2016-11-25T20:21:57Z

Some PDF files cause veraPDF to include invalid characters in its output (the default mrr output). Hardly devastating, but it's a bit of a pain because it causes XMLStarlet to abort parsing the output. Here are some examples:

This has an invalid character in the title tag extracted from the PDF:
http://web.archive.org/web/20080511210957/http://www.plymouth.gov.uk/5th_december_2007.pdf
This is a similar source, but this time shows a series of invalid characters in the title tags in the Pages feature extract:
http://web.archive.org/web/20071030162909/http://www.somersetpct.nhs.uk/about_us/board_meetings/November_2006/Papers/8%20(D)%20PEC%20Terms%20of%20Reference%20and%20Chair.pdf

This has an invalid character in the description of a rule in the validation output:
http://web.archive.org/web/20060930024642/http://www.tvcs.org.uk/pdfs/06leaflet1.pdf

These seem to be the two sources of invalid characters that I've come across so far, other than this:

http://web.archive.org/web/20071031010646/http://www.somersetpct.nhs.uk/about_us/board_meetings/December_2006/Papers/11%20(H)%20Risk%20Management%20Strategy%20and%20Policy%20Appendix%206%20-%20IPEC.pdf
This one is reported by XMLStarlet as invalid, but XMLStarlet has no problems parsing it. As far as I can see there are no problems with it, but it may be worth further investigation with a decent XML tool, so I thought I'd include it anyway.
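
As a second opinion on a report like this, a minimal sketch using the expat-based parser in Python's standard library is shown below. The function name `check_well_formed` is hypothetical, and the sketch only tests well-formedness; it does not validate against any schema and says nothing about why XMLStarlet flags this particular file.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_data):
    """Return None if xml_data parses, else (line, column, message).

    xml_data may be str or bytes; bytes let the parser honour the
    encoding named in the XML declaration.
    """
    try:
        ET.fromstring(xml_data)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the failure
        line, column = err.position
        return line, column, str(err)
    return None
```

Reading the veraPDF report as bytes and passing it to this function gives the position of the first character the parser rejects, which can then be compared with what XMLStarlet reports.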

The test data sets I'm passing on to Carl would be useful for double-checking that no similar bugs remain once fixes have been applied (I can send them on if helpful). Carl mentioned adding these data sets to his automated tests. Depending on the outcome of the last example (and any others we find), it might be useful to validate the output for one of these large corpora as part of the automated tests, to pick up issues of this kind in future.
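
One way such an automated check could look is sketched below: it scans report text for any character outside the `Char` production of XML 1.0 and returns its location. The names `find_invalid_xml_chars` and `_INVALID_XML_CHAR` are hypothetical, and this is only an assumption about what a useful test might do, not how veraPDF's own test suite works.

```python
import re

# Anything NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return a list of (line, column, code point) for each invalid character.

    Lines and columns are 1-based. Lines are split on "\n" only, because
    str.splitlines() would also split on form feeds and similar control
    characters, which are exactly the characters being searched for.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in _INVALID_XML_CHAR.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            hits.append((lineno, match.start() + 1, code_point))
    return hits
```

A test over a corpus could run this on every generated report and fail if any list is non-empty. Note that it only catches raw characters; an invalid numeric character reference such as `&#1;` would need a parser-based check as well.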

prwheatley · 2016-12-23T16:51:43Z

Fixed and tested - thanks!

bdoubrov assigned BezrukovM Nov 28, 2016

bdoubrov assigned bdoubrov and unassigned BezrukovM Dec 13, 2016

bdoubrov added the ready label Dec 13, 2016

bdoubrov assigned shem-sergey Dec 14, 2016

prwheatley closed this as completed Dec 23, 2016

carlwilson removed the ready label Dec 23, 2016
