Changing page variable to in-memory stream #102

Merged
merged 1 commit into from
Aug 23, 2019

Conversation

phutelmyer
Contributor

Describe the change
Passing the "data" variable directly to pdfminer caused errors when extracting text from a PDF. The error occurs only when "extract_text" is set to True in the ScanPdf options in backend.yml. This change replaces "data" with "pdf_io", an in-memory stream over the same data object. pdfminer can read the stream, whereas it could not read the raw data object.
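
The difference between the two objects can be illustrated with a minimal sketch. It is not the ScanPdf code itself: the sample bytes and the helper `extract_text_from_bytes` are hypothetical, and the helper assumes the pdfminer.six distribution, whose `extract_text_to_fp` expects a file-like input.

```python
import io


def extract_text_from_bytes(data: bytes) -> str:
    """Hypothetical helper: extract text from PDF bytes via an in-memory stream.

    Assumes pdfminer.six is installed; the import is local so the rest of
    this sketch runs without it.
    """
    from pdfminer.high_level import extract_text_to_fp

    pdf_io = io.BytesIO(data)  # file-like wrapper over the raw bytes
    out = io.StringIO()        # collects the extracted text
    extract_text_to_fp(pdf_io, out)
    return out.getvalue()


# Stand-in for the scanner's "data" variable (not a complete PDF).
data = b"%PDF-1.4 placeholder bytes"
pdf_io = io.BytesIO(data)

# Raw bytes expose no file interface; the stream supports read() and seek().
print(hasattr(data, "read"), hasattr(pdf_io, "read"))  # False True

header = pdf_io.read(8)  # read like a file
pdf_io.seek(0)           # rewind so a parser can start from the beginning
```

Wrapping the bytes in `io.BytesIO` copies nothing to disk and gives a parser the `read` and `seek` methods it expects from an open file.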

Describe testing procedures
The change was made on a local instance. The previous error no longer appeared, and no new errors were observed. Logging was used to inspect the output, which shows text being extracted correctly.

Sample output
Text extraction noted in header.header:

"scan": {
  "entropy": {
    "elapsed": 0.000155,
    "entropy": 4.953766942682617
  },
  "hash": {
    "elapsed": 0.000296,
    "md5": "039f829a06d43589415673da5ec04431",
    "sha1": "05b0afac91db59a9de83a2d8b9886586dd76e383",
    "sha256": "d42488f5ff2444ce075f19c89751ab8f092d9d31c6bae7643b10f3a9550fd5ac",
    "ssdeep": "192:8lsOIwGOp0/KydxQYBfluX/NtVuvpAyO0aAszFVtLuWNg:8lsOIwGdXDubylzaAs1Lucg"
  },
  "header": {
    "elapsed": 0.000124,
    "header": "he pdf995 suite of products - Pdf995, PdfEdit995, "
  },
  "url": {
    "elapsed": 0.003095,
    "urls": [
      "www.pdf995.com",
      "http://www.wired.com/\",whichisnothuman-centereddata.In"
    ]
  },
  "yara": {
    "elapsed": 0.000207,
    "flags": [
      "compiling_error"
    ]
  }
}

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of and tested my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

@jshlbrd jshlbrd merged commit ace5ae9 into target:master Aug 23, 2019