Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: PDF Ingestion bug when Grobid is unable to parse the reference PDF #103

Conversation

gjreda
Copy link
Collaborator

@gjreda gjreda commented Jun 6, 2023

This fixes two underlying bugs:

  1. When Grobid is unable to parse a PDF, it prints an error message to stdout, rather than raising an exception or returning an http error status code. Printing this error message to stdout breaks sidecar communication since we rely on stdout for data passing.
  2. If a PDF could not be parse, it is not included in the References response from the sidecar. This adds Reference objects to the response even if we could not parse the PDF.

@gjreda gjreda marked this pull request as ready for review June 6, 2023 19:44
@gjreda gjreda requested review from sehyod and cguedes June 6, 2023 19:44
@gjreda gjreda linked an issue Jun 6, 2023 that may be closed by this pull request
The TXT file is named as {pdf_filename}_{error_code}.txt.
"""
txt_files = list(self.grobid_output_dir.glob("*.txt"))
logger.info(f"Found {len(txt_files)} txt files from Grobid parsing errors")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this logger writes to?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has it disabled (I had to do that a while back so the sidecar can work). #104 re-enables it and sets it up as a file logger, rather than to stdout.

title: str
abstract: str
contents: str
title: Optional[str] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sehyod we will need to change the TS type after this merge.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The title was already optional (cf https://github.com/refstudio/refstudio/pull/95/files#diff-68572da928b5651e3f8dda9d2b291815dc0cdf0e0ff5dc40ab3dc81215b760feL9) because the sidecar was already returning an null field for some pdfs. @gjreda I don't know if that was an expected behaviour, if it's not I can send you an example pdf file for which the returned title is null

@sergioramos sergioramos changed the title Fix PDF Ingestion bug when Grobid is unable to parse the reference PDF fix: PDF Ingestion bug when Grobid is unable to parse the reference PDF Jun 7, 2023
@sergioramos sergioramos merged commit f909337 into main Jun 7, 2023
7 checks passed
@sergioramos sergioramos deleted the 79-pdf-ingest-does-not-work-properly-with-non-scholarly-documents branch June 7, 2023 09:49
@hammer
Copy link
Contributor

hammer commented Jun 7, 2023

I'm still seeing brittle behavior after this fix...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PDF ingest does not work properly with non-scholarly documents
5 participants