```java
// Fails here: the constructor throws the NullPointerException for this PDF
ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader(
        "file:\\C:\\Users\\test\\sample1.pdf",
        PdfDocumentReaderConfig.builder()
                .build());
var documents = pdfReader.get();
for (Document document : documents) {
    System.out.println(document.getContent());
}
```
Expected behavior
It should read each paragraph from the sample1.pdf file.
Exception
```
java.lang.NullPointerException: Cannot invoke "org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode.getFirstChild()" because "bookmark" is null
    at org.springframework.ai.reader.pdf.config.ParagraphManager.generateParagraphs(ParagraphManager.java:131) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
    at org.springframework.ai.reader.pdf.config.ParagraphManager.<init>(ParagraphManager.java:82) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
    at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.<init>(ParagraphPdfDocumentReader.java:109) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
    at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.<init>(ParagraphPdfDocumentReader.java:92) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
```
Thanks for reporting this. Parsing PDFs is challenging, and ParagraphPdfDocumentReader cannot be used on every PDF; when it cannot, another parsing strategy is needed. ParagraphPdfDocumentReader relies on a PDF object called the 'outline'. The other options in Spring AI are PagePdfDocumentReader and TikaDocumentReader. I would also suggest looking into https://developer.adobe.com/document-services/apis/pdf-extract/
All that said, there should be no NPE. This will be fixed by adding a check on this.document.getDocumentCatalog().getDocumentOutline(): if it returns null, the reader will report that ParagraphPdfDocumentReader cannot be used for the provided PDF because it contains no document outline.
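The null check described above could look roughly like the following. This is a minimal, self-contained sketch, not the actual Spring AI or PDFBox code: `Pdf`, `Catalog`, `Outline`, and `requireOutline` are hypothetical stand-ins for the PDFBox document, catalog, and outline objects. The only point is that a missing outline is detected up front and reported with a descriptive exception, rather than surfacing later as a NullPointerException.

```java
import java.util.List;

class OutlineGuardSketch {

    // Hypothetical stand-ins for the PDFBox types; NOT the real API.
    record Outline(List<String> bookmarkTitles) {}
    record Catalog(Outline outline) {}            // outline is null when the PDF has no bookmarks
    record Pdf(String name, Catalog catalog) {}

    // Returns the outline, or fails fast with a descriptive message
    // instead of letting a null outline cause an NPE further down.
    static Outline requireOutline(Pdf pdf) {
        Outline outline = pdf.catalog().outline();
        if (outline == null) {
            throw new IllegalArgumentException(
                    "PDF '" + pdf.name() + "' contains no document outline; "
                            + "a paragraph-based reader cannot be used for it.");
        }
        return outline;
    }

    public static void main(String[] args) {
        // A PDF with an outline passes the check.
        Pdf withOutline = new Pdf("with-outline.pdf",
                new Catalog(new Outline(List.of("Chapter 1"))));
        System.out.println("Bookmarks: " + requireOutline(withOutline).bookmarkTitles());

        // A PDF without an outline is rejected with a clear message.
        Pdf withoutOutline = new Pdf("sample1.pdf", new Catalog(null));
        try {
            requireOutline(withoutOutline);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

A caller that receives this exception can then fall back to one of the other readers mentioned in the reply, such as PagePdfDocumentReader or TikaDocumentReader.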
Bug description
ParagraphPdfDocumentReader throws a NullPointerException when reading sample1.pdf:
https://github.com/spring-projects-experimental/spring-ai/blob/main/document-readers/pdf-reader/src/test/resources/sample1.pdf
Environment
Spring Boot version: 3.1.4
Spring AI version: 0.7.0-SNAPSHOT
Java version: openjdk version "17.0.2" 2022-01-18
Steps to reproduce
Add the spring-ai-pdf-document-reader dependency, version 0.7.0-SNAPSHOT, to pom.xml.
Then run the code to read paragraphs shown at the top of this report.