Invalid characters in the veraPDF output #644

Closed
prwheatley opened this issue Nov 25, 2016 · 1 comment
prwheatley commented Nov 25, 2016

For some PDF files, veraPDF includes invalid characters in its output (that's the default mrr output). Hardly devastating, but it's a bit of a pain because it causes XMLStarlet to abort parsing the output. Here are some examples:

This has an invalid character in the title tag extracted from the PDF:
http://web.archive.org/web/20080511210957/http://www.plymouth.gov.uk/5th_december_2007.pdf
This is a similar source, but this time shows a series of invalid characters in the title tags in the Pages feature extract:
http://web.archive.org/web/20071030162909/http://www.somersetpct.nhs.uk/about_us/board_meetings/November_2006/Papers/8%20(D)%20PEC%20Terms%20of%20Reference%20and%20Chair.pdf

This has an invalid character in the description of a rule in the validation output:
http://web.archive.org/web/20060930024642/http://www.tvcs.org.uk/pdfs/06leaflet1.pdf

These seem to be the two sources of invalid characters that I've come across so far, other than this:

http://web.archive.org/web/20071031010646/http://www.somersetpct.nhs.uk/about_us/board_meetings/December_2006/Papers/11%20(H)%20Risk%20Management%20Strategy%20and%20Policy%20Appendix%206%20-%20IPEC.pdf
This one is reported by XMLStarlet as invalid, but XMLStarlet has no problems parsing it. As far as I can see there are no problems with it, but it may be worth further investigation with a decent XML tool, so I thought I'd include it anyway.
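
To pin down exactly which characters are tripping the parser, something like the following could help. It is only a sketch: it flags any character outside the `Char` production of XML 1.0, which is an assumption about what "invalid" means here, since the thread does not say which code points veraPDF emitted. The function names are made up for illustration.

```python
import re

# Anything outside the XML 1.0 "Char" production is not allowed in a document:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), ord(m.group())) for m in _INVALID_XML_CHARS.finditer(text)]


def strip_invalid_xml_chars(text, replacement=""):
    """Return text with every forbidden character replaced (dropped by default)."""
    return _INVALID_XML_CHARS.sub(replacement, text)
```

Running `find_invalid_xml_chars` over a title string pulled from one of the PDFs above would show the offset and code point of each offending character, which is usually enough to tell a stray control character from an encoding problem.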

The test data sets I'm passing on to Carl would be useful for double-checking that no similar bugs remain once fixes have been applied (I can send them on if helpful). Carl mentioned adding these datasets to his automated tests. Depending on the outcome of the last example (and any others we find), it might be useful to validate the output on one of these large corpora as part of the automated tests to pick up these kinds of issues in the future.
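
As a rough idea of what that automated check might look like, here is a minimal sketch that tries to parse every saved report and collects the ones that are not well-formed. It assumes the veraPDF reports have already been written to a directory as individual XML files; the directory layout, the `*.xml` pattern and the function names are hypothetical, and Python's standard-library parser stands in for whatever XML tool the test suite would really use.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def well_formedness_error(xml_bytes):
    """Return the parser's error message, or None if the bytes parse as XML."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return str(exc)
    return None


def check_reports(report_dir, pattern="*.xml"):
    """Map each report that fails to parse to its error message."""
    failures = {}
    for path in sorted(Path(report_dir).glob(pattern)):
        error = well_formedness_error(path.read_bytes())
        if error is not None:
            failures[str(path)] = error
    return failures
```

A test over a corpus would then just assert that `check_reports(...)` returns an empty dict, and the error messages (which include line and column) point straight at the offending output.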

@prwheatley
Author

Fixed and tested - thanks!
