Paragraph Pdf Document Reader issue #59

Closed
radhakrishna67 opened this issue Oct 23, 2023 · 1 comment

radhakrishna67 commented Oct 23, 2023

Bug description
ParagraphPdfDocumentReader throws a NullPointerException when reading sample1.pdf:
https://github.com/spring-projects-experimental/spring-ai/blob/main/document-readers/pdf-reader/src/test/resources/sample1.pdf

Environment
Spring Boot version: 3.1.4
Spring AI version: 0.7.0-SNAPSHOT
Java version: openjdk version "17.0.2" 2022-01-18

Steps to reproduce
Add the spring-ai-pdf-document-reader dependency, version 0.7.0-SNAPSHOT, to pom.xml:

`

    <dependency>
        <groupId>org.springframework.experimental.ai</groupId>
        <artifactId>spring-ai-pdf-document-reader</artifactId>
        <version>0.7.0-SNAPSHOT</version>
    </dependency>

`

Code to read paragraphs:
`

    ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader(
            "file:\\C:\\Users\\test\\sample1.pdf",
            PdfDocumentReaderConfig.builder()
                    .build());

    // Per the stack trace below, the exception is thrown by the constructor above,
    // before get() is reached.
    var documents = pdfReader.get();

    for (Document document : documents) {
        System.out.println(document.getContent());
    }

`

Expected behavior
It should read each paragraph from the sample1.pdf file.

Exception
`

  java.lang.NullPointerException: Cannot invoke "org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode.getFirstChild()" because "bookmark" is null
at org.springframework.ai.reader.pdf.config.ParagraphManager.generateParagraphs(ParagraphManager.java:131) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
at org.springframework.ai.reader.pdf.config.ParagraphManager.<init>(ParagraphManager.java:82) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.<init>(ParagraphPdfDocumentReader.java:109) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]
at org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader.<init>(ParagraphPdfDocumentReader.java:92) ~[spring-ai-pdf-document-reader-0.7.0-20231019.142632-5.jar:0.7.0-SNAPSHOT]

`

@markpollack markpollack added this to the 0.8.0 milestone Dec 4, 2023
markpollack (Member) commented

Thanks for reporting this. Parsing PDFs is challenging, and ParagraphPdfDocumentReader will not work on every PDF: it relies on a PDF object called the 'outline', so a PDF without one needs a different parsing strategy. The other options in Spring AI are PagePdfDocumentReader and TikaDocumentReader. I would also suggest looking into https://developer.adobe.com/document-services/apis/pdf-extract/

All that said, there should be no NPE. This will be fixed by adding a null check on this.document.getDocumentCatalog().getDocumentOutline(): if it returns null, the reader will report that ParagraphPdfDocumentReader cannot be used for the provided PDF because it contains no document outline.
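
A minimal sketch of that kind of guard, not the actual fix. To keep it runnable without PDFBox on the classpath, `Outline` and `Pdf` are hypothetical stand-ins for the PDFBox outline and document objects, and `requireOutline`, `chooseStrategy`, and the exception message are illustrative names and wording only.

```java
import java.util.List;

public class OutlineGuardSketch {

    /** Hypothetical stand-in for a PDF outline (the bookmark tree). */
    record Outline(List<String> bookmarkTitles) {}

    /** Hypothetical stand-in for a loaded PDF; outline is null when the PDF has no bookmarks. */
    record Pdf(String name, Outline outline) {}

    /**
     * Fails fast with a descriptive message instead of letting a null
     * outline surface later as a NullPointerException.
     */
    static Outline requireOutline(Pdf pdf) {
        Outline outline = pdf.outline();
        if (outline == null) {
            throw new IllegalArgumentException(
                    "No document outline found in " + pdf.name()
                    + "; a paragraph-based reader cannot be used for this PDF.");
        }
        return outline;
    }

    /** Caller-side choice: paragraph-based when an outline exists, page-based otherwise. */
    static String chooseStrategy(Pdf pdf) {
        return pdf.outline() != null ? "paragraph" : "page";
    }

    public static void main(String[] args) {
        Pdf withOutline = new Pdf("book.pdf", new Outline(List.of("Chapter 1", "Chapter 2")));
        Pdf withoutOutline = new Pdf("sample1.pdf", null);

        System.out.println(chooseStrategy(withOutline));
        System.out.println(chooseStrategy(withoutOutline));

        try {
            requireOutline(withoutOutline);
        } catch (IllegalArgumentException e) {
            // The caller sees a clear reason and can fall back to another reader.
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, a caller can catch the exception (or test for the outline up front) and switch to a page-based reader such as PagePdfDocumentReader.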

@tzolov tzolov closed this as completed in 7b38cc3 Jan 13, 2024