Can't handle JBIG2 pdf images #19

Open
bholtdwyer opened this issue Dec 2, 2022 · 2 comments

Comments

@bholtdwyer

When I run the training step on some PDFs of historical Indian census files, I get the following error:

Extracting text line images from ../data/district_reports/raw_pdfs/1981/27582_1981_MAI.pdf, page 3
Error reading image
com.sun.pdfview.PDFParseException: Unknown coding method:JBIG2Decode

and then

java.lang.NullPointerException
	at com.sun.pdfview.font.TTFFont.getOutline(TTFFont.java:170)
	at com.sun.pdfview.font.CIDFontType2.getOutline(CIDFontType2.java:270)
	at com.sun.pdfview.font.OutlineFont.getGlyph(OutlineFont.java:130)
	at com.sun.pdfview.font.PDFFont.getCachedGlyph(PDFFont.java:308)
	at com.sun.pdfview.font.PDFFontEncoding.getGlyphFromCMap(PDFFontEncoding.java:155)
	at com.sun.pdfview.font.PDFFontEncoding.getGlyphs(PDFFontEncoding.java:115)
	at com.sun.pdfview.font.PDFFont.getGlyphs(PDFFont.java:274)
	at com.sun.pdfview.PDFTextFormat.doText(PDFTextFormat.java:269)
	at com.sun.pdfview.PDFParser.iterate(PDFParser.java:752)
	at com.sun.pdfview.BaseWatchable.run(BaseWatchable.java:101)
	at java.base/java.lang.Thread.run(Thread.java:834)

I think what's going on here is that the PDF contains JBIG2-compressed images, and the PDF library the program uses (com.sun.pdfview) has no decoder for the JBIG2Decode filter.
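
One way to check which files are affected before training is to look for the filter name from the exception in the raw PDF bytes. The sketch below is only a heuristic under that assumption: it does not parse the PDF, and it would miss a filter name written with `#` hex escapes. The class name `Jbig2Scan` is made up for illustration and is not part of the project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Jbig2Scan {
    // Filter name reported in the PDFParseException message.
    private static final byte[] NEEDLE =
            "/JBIG2Decode".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the raw bytes contain the literal name /JBIG2Decode. */
    static boolean containsJbig2(byte[] data) {
        outer:
        for (int i = 0; i + NEEDLE.length <= data.length; i++) {
            for (int j = 0; j < NEEDLE.length; j++) {
                if (data[i + j] != NEEDLE[j]) {
                    continue outer; // mismatch, try the next start offset
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a path to a PDF file; the whole file is read into memory.
        for (String arg : args) {
            byte[] data = Files.readAllBytes(Paths.get(arg));
            String verdict = containsJbig2(data)
                    ? "uses JBIG2Decode"
                    : "no JBIG2Decode name found";
            System.out.println(arg + "\t" + verdict);
        }
    }
}
```

Running it over the directory of raw PDFs should list the files that are likely to trigger the same error.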

@taineleau

Maybe try converting the images to a different format first?
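
A minimal sketch of one way to do that conversion, assuming Poppler's `pdftoppm` is installed and on the PATH, and assuming the training step can read page images instead of the PDF itself (neither is confirmed in this thread). It renders each page to a PNG, so the PDF library never has to decode JBIG2. The class name `RasterizePdf`, the 300 dpi setting, and the output prefix are illustrative choices.

```java
import java.io.IOException;
import java.util.List;

class RasterizePdf {
    /** Builds a pdftoppm command that writes one PNG per page at the given resolution. */
    static List<String> command(String pdf, String outPrefix, int dpi) {
        // -r sets the resolution in DPI; -png selects PNG output instead of PPM.
        return List.of("pdftoppm", "-r", Integer.toString(dpi), "-png", pdf, outPrefix);
    }

    /** Runs the command, forwarding its output to this process, and returns its exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: RasterizePdf <input.pdf> <output-prefix>");
            System.exit(2);
        }
        // 300 dpi is an assumed default, not a value taken from the project.
        System.exit(run(command(args[0], args[1], 300)));
    }
}
```

For the file in the log above, the first argument would be `../data/district_reports/raw_pdfs/1981/27582_1981_MAI.pdf`; the second is any prefix for the generated page images.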

@bholtdwyer
Author

bholtdwyer commented Jun 6, 2023 via email
