fix: PDF Ingestion bug when Grobid is unable to parse the reference PDF #103

gjreda · 2023-06-06T19:39:08Z

PDF ingest does not work properly with non-scholarly documents #79

This fixes two underlying bugs:

1. When Grobid is unable to parse a PDF, it prints an error message to stdout rather than raising an exception or returning an HTTP error status code. Because the sidecar relies on stdout for data passing, that printed message breaks sidecar communication.
2. If a PDF could not be parsed, it was not included in the References response from the sidecar. This PR adds Reference objects to the response even when the PDF could not be parsed.
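The first bug can be illustrated with a minimal sketch. It assumes a hypothetical `noisy_parse` function standing in for the Grobid client call, and it uses the standard library's `contextlib.redirect_stdout` rather than the PR's `HiddenPrints` helper, whose implementation is not shown in this thread. This approach only captures writes made through Python's `sys.stdout`, not output written by child processes.

```python
import contextlib
import io


def noisy_parse(path: str) -> dict:
    # Hypothetical stand-in for a client call that prints an error
    # message instead of raising an exception.
    print(f"[ERROR] could not parse {path}")
    return {"source": path, "parsed": False}


def parse_quietly(path: str) -> dict:
    # Send anything printed during the call to a throwaway buffer,
    # so stdout stays reserved for the data payload.
    with contextlib.redirect_stdout(io.StringIO()):
        return noisy_parse(path)
```

With the call wrapped this way, the caller can write only its serialized response to stdout, and the stray error message never reaches the process reading it.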

cguedes · 2023-06-07T08:31:58Z

python/sidecar/ingest.py

+        The TXT file is named as {pdf_filename}_{error_code}.txt.
+        """
+        txt_files = list(self.grobid_output_dir.glob("*.txt"))
+        logger.info(f"Found {len(txt_files)} txt files from Grobid parsing errors")


Where does this logger write to?

gjreda · 2023-06-07

This PR has it disabled (I had to do that a while back so the sidecar could work). #104 re-enables it and sets it up as a file logger rather than logging to stdout.
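For readers following along, a file logger of the kind described in this reply might look roughly like the sketch below. The function name `get_file_logger`, the log format, and the log path are assumptions for illustration; the actual setup in #104 is not shown in this thread.

```python
import logging
from pathlib import Path


def get_file_logger(name: str, log_path: Path) -> logging.Logger:
    # Hypothetical helper: returns a logger that writes only to a file.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not pass records up to root handlers, which may write to the console.
    logger.propagate = False
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Because no stream handler is attached and propagation is off, calls such as `logger.info(...)` end up in the log file and leave stdout untouched.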

cguedes · 2023-06-07T08:34:39Z

python/sidecar/typing.py

-    title: str
-    abstract: str
-    contents: str
+    title: Optional[str] = None


@sehyod we will need to change the TS type after this merge.

sehyod · 2023-06-07

The title was already optional (cf. https://github.com/refstudio/refstudio/pull/95/files#diff-68572da928b5651e3f8dda9d2b291815dc0cdf0e0ff5dc40ab3dc81215b760feL9) because the sidecar was already returning a null field for some PDFs. @gjreda I don't know whether that was expected behaviour. If it's not, I can send you an example PDF for which the returned title is null.
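To make the type change concrete, here is a minimal sketch of how optional fields let unparsed PDFs still appear in the response. It uses a plain dataclass as a stand-in for the sidecar's `Reference` type; the `source_filename` field and the `build_references` helper are hypothetical names, not taken from the PR.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class Reference:
    # Hypothetical stand-in for the sidecar's Reference type.
    source_filename: str
    title: Optional[str] = None
    abstract: Optional[str] = None
    contents: Optional[str] = None


def build_references(
    pdf_paths: List[Path], parsed: Dict[str, Dict[str, str]]
) -> List[Reference]:
    # Emit one Reference per input PDF. PDFs with no parsed fields
    # still get an entry, with the optional fields left as None.
    references = []
    for pdf in pdf_paths:
        fields = parsed.get(pdf.name, {})
        references.append(Reference(source_filename=pdf.name, **fields))
    return references
```

A consumer on the TypeScript side would then need to treat these fields as nullable, which is the follow-up change mentioned above.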

hammer · 2023-06-07T13:56:13Z

I'm still seeing brittle behavior after this fix...

gjreda added 4 commits June 6, 2023 10:56

This should have been included in #83

04a0eda

Merge branch 'main' into 79-pdf-ingest-does-not-work-properly-with-non-scholarly-documents

cafd307

Move client.process inside of HiddenPrints

84b2aa5

Add unparsed PDFs to final Reference response

0512749

gjreda marked this pull request as ready for review June 6, 2023 19:44

gjreda requested review from sehyod and cguedes June 6, 2023 19:44

gjreda linked an issue Jun 6, 2023 that may be closed by this pull request

PDF ingest does not work properly with non-scholarly documents #79

Closed

cguedes reviewed Jun 7, 2023

cguedes approved these changes Jun 7, 2023

cguedes reviewed Jun 7, 2023

sergioramos changed the title ~~Fix PDF Ingestion bug when Grobid is unable to parse the reference PDF~~ fix: PDF Ingestion bug when Grobid is unable to parse the reference PDF Jun 7, 2023

sergioramos merged commit f909337 into main Jun 7, 2023
7 checks passed

sergioramos deleted the 79-pdf-ingest-does-not-work-properly-with-non-scholarly-documents branch June 7, 2023 09:49
