[Cross-posted from the Forum/Suggestion] Implement a way to integrate (original image file, detected text) → searchable PDF #660

Closed
Wikinaut opened this issue Jan 13, 2017 · 59 comments


@Wikinaut
Contributor

A user on the forum (https://groups.google.com/forum/#!topic/tesseract-ocr/vvMldrkcuOQ) has asked:

I have a scanned PDF and want to make a searchable PDF from it.
First I generate a black-and-white multipage TIFF, and with tesseract I can make a searchable PDF.
But is it somehow possible to integrate the original PDF images?
The generated TIFF does not have the same quality as the original (the scanned image may be in color).

How to reproduce:

  1. Take a one-page in.pdf with a colored background and convert it to the image in.ppm.
  2. Preprocess: unpaper in.ppm in-cleaned.ppm
  3. Run OCR, for example: tesseract in-cleaned.ppm out -l deu+eng --oem 2 pdf txt
  4. The output file out.pdf now has a blotchy background (from the unpaper step above).

[Screenshot: 20170113-10 09 17_auswahl]

Is there any way to "feed in" the original in.ppm as the image, so that it is used instead of in-cleaned.ppm when creating out.pdf?

What is wanted is the original input image plus the OCR layer, so that the output looks like this:
[Screenshot: 20170113-10 12 22_auswahl]

@jbarlow83

This is a complicated way of asking for an option to send one image through OCR and insert a different image in the output PDF.

tesseract --pdf-image original.png cleaned.png -l eng --oem 2 pdf  # not implemented, could work like this

I know this was requested before and I believe @jbreiden said it would be added to the PDF renderer at some point.

@jbreiden
Contributor

I'm very reluctant to make Tesseract PDF generation fancy. I wonder if we can do an image swap like this outside of Tesseract, using one of the PDF manipulation toolkits.

@jbarlow83

jbarlow83 commented Jan 14, 2017 via email

@Wikinaut
Contributor Author

Wikinaut commented Jan 14, 2017

@jbreiden It's the last feature that is really missing.
The new algorithm is already a boost in quality. I reach up to 100% OCR accuracy here (for --oem 2 -l deu+eng), including these beastly "Umlauts" äöüÄÖÜß....

If this helps, I will donate some mBTC for implementing it right now. Just post your receiving address.

@Wikinaut
Contributor Author

@jbarlow83 Some background info. As you know, I recently wanted to try your OCRmyPDF because I found the interesting --clean option (source: https://media.readthedocs.org/pdf/ocrmypdf/latest/ocrmypdf.pdf ), which "does not alter the final output" and would have solved my problem:

--clean
    uses unpaper to clean up pages before OCR, but does not alter the final output. This makes it less likely that OCR will try to find text in background noise.
--clean-final
    uses unpaper to clean up pages before OCR and inserts the page into the final output. You will want to review each page to ensure that unpaper did not remove something important.

Unfortunately, this does not work with tesseract 4 at present.

So I looked for bug reports about whether tesseract could pass the original input image through to the output, and filed the present issue.

@jbreiden
Contributor

Really? That's interesting, qpdf is very well written. Maybe the right thing to do is to allow Tesseract to produce a multi-page PDF with invisible symbolic text only, no images. Then another tool (perhaps an enhanced qpdf tool) would merge and composite two PDFs together: one being the original image-only PDF, the other an invisible-text-only PDF. What do you think, @jbarlow83? Please point me at the relevant qpdf API calls if you happen to know them.

@jbarlow83

I think invisible-text-only output would be far more useful for developers who integrate tesseract, or for anyone who wants to do something fancy. It would still make sense to keep the existing OCR-with-image option, of course. As a plus, it should be easier to suppress the image than to add a different one.

OCRmyPDF (which I maintain) uses Ghostscript to rasterize and then runs one of its two PDF renderers. One uses Tesseract hOCR and provides more features but is not as good at producing the OCR text layer as Tesseract PDF, so I also provide Tesseract PDF. If Tesseract could produce an invisible-text-only PDF, I could offer all the features for both and work towards phasing out the hOCR renderer. When possible, I already graft the text layer onto the existing PDF instead of constructing a new one.

In addition to OCRmyPDF, pdftk multibackground could merge an OCR layer onto an existing PDF (by "watermarking"), so there is at least one other supported tool out there that should work out of the box. There are some other tools that wrap tesseract for use with PDFs as well.

In writing this I've made a case for not using qpdf, because other tools should be able to do the job with an invisible-text PDF. For interest's sake, though, here is example code that inverts black and white for all images; clearly this is close to how one would replace an image outright.

@jbreiden
Contributor

This sounds reasonable to me. I'll try to find time over this coming week to make an experimental invisible-text-only PDF that we can play with. All the other pieces of the puzzle are there; for example, Leptonica already ships with an images->pdf tool that avoids transcoding for PNG, JP2K, and JPEG. It would be cool to use qpdf for the merge step because it is already so useful for linearizing. But it's great that there are more options. The qpdf author is extremely friendly in my experience, in case we eventually chat with him. Oh, I now vaguely remember that PDFBox had something for merging as well, but I've never tried it and can't find it at the moment.

@jbreiden
Contributor

Here's an experimental PDF pair, image-only and text-only. Let the merging begin!

images.pdf
text.pdf

@jbreiden jbreiden added the PDF label Jan 18, 2017
@jbreiden
Contributor

jbreiden commented Jan 18, 2017

This works brilliantly. I will implement it for real if someone promises that they will use it. Also, what do we call the configuration option? My best idea so far to describe a PDF that has invisible text only is 'naked'. I'm sure someone has a better idea.

$ time pdftk text.pdf multibackground images.pdf output full.pdf
real	0m0.253s

Actually this works better the other way around, for preserving the bookmarks and things like that.

 pdftk  images.pdf multibackground text.pdf output full.pdf
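
A minimal sketch of scripting the second form above (images.pdf as the base, so its bookmarks survive), assuming pdftk is installed and on PATH. The helper names pdftk_merge_command and merge_with_pdftk are hypothetical; only the pdftk arguments come from this thread.

```python
import shutil
import subprocess


def pdftk_merge_command(image_pdf, text_pdf, output_pdf):
    """Build the pdftk call that keeps image_pdf as the base document."""
    return ["pdftk", image_pdf, "multibackground", text_pdf,
            "output", output_pdf]


def merge_with_pdftk(image_pdf, text_pdf, output_pdf):
    """Run the merge; raises if pdftk is missing or exits non-zero."""
    if shutil.which("pdftk") is None:
        raise FileNotFoundError("pdftk not found on PATH")
    subprocess.run(pdftk_merge_command(image_pdf, text_pdf, output_pdf),
                   check=True)


# Show the command line without running it.
print(" ".join(pdftk_merge_command("images.pdf", "text.pdf", "full.pdf")))
```

Calling merge_with_pdftk("images.pdf", "text.pdf", "full.pdf") would then run the same command as typed above.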

@jbreiden
Contributor

jbreiden commented Jan 18, 2017

Implementation complete and under review by Ray. @jbarlow83 this is a good time to look at the samples above and make sure they meet your needs.

tesseract -c naked_pdf=true HelloWorld.png HelloWorld pdf

@jbarlow83

Looks really good @jbreiden.

Works great in pdftk. No display issues and PDF syntax looks fine.

PyPDF2 is also capable of merging. It has no equivalent of "multibackground", so pages must be merged manually. Here is how to merge one page:

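import PyPDF2 as pypdf

# Uses the PyPDF2 API as it existed when this comment was written.
with open('text.pdf', 'rb') as f_text, open('images.pdf', 'rb') as f_image:
    pdf_text = pypdf.PdfFileReader(f_text)
    pdf_image = pypdf.PdfFileReader(f_image)

    # Pages are zero-indexed, so [1] selects the second page of each file.
    page_text = pdf_text.pages[1]
    page_image = pdf_image.pages[1]

    # Draw the image page onto the text page: rotation 0, scale 1.0, no offset.
    page_text.mergeRotatedScaledTranslatedPage(page_image, 0, 1.0, 0, 0, expand=False)

    out = pypdf.PdfFileWriter()
    out.addPage(page_text)

    # Write while the inputs are still open; page data is read on demand.
    with open('pypdfmerge.pdf', 'wb') as o:
        out.write(o)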

For reference, pdfbox did not work out of the box. As far as I can tell the closest command in pdfbox is

java -jar pdfbox-app-2.0.2.jar OverlayPDF images.pdf text.pdf pdfboxoverlay.pdf

However, pdfbox takes the unusual approach of rasterizing the overlay PDF as a bitmap and drawing it on top of the base page, making it useless regardless of image/text order. (I suppose when you go to the trouble of implementing a full PDF renderer in Java, you feel compelled to use it even when it's not strictly needed.)

@jbarlow83

jbarlow83 commented Jan 18, 2017

I don't know about calling it a naked PDF because there's nothing exciting to see in it. It's more of a phantom or spectral apparition PDF, having form without substance.

ocr_text_only would do, or suppress_images? Not nearly as fun, but practical.

@jbreiden
Contributor

Spectral writing. Perhaps a kind of ghost script, if you will.

@Shreeshrii
Collaborator

How about text_only_pdf ?

@jbreiden is it also possible to use a .pdf file as input to tesseract directly?

@amitdo
Collaborator

amitdo commented Jan 19, 2017

pdf_invisible_text_layer_only, plus a config file pdfinvisible (or maybe pdf0)

@jbarlow83

@Shreeshrii PDF is a very complex vector-based file format. Tesseract works only on images. It is much easier to write PDFs that use a limited set of PDF features than read arbitrary PDFs. Have a look at OCRmyPDF (which I develop) - it addresses the details of using tesseract to apply OCR to PDFs.

@Wikinaut
Contributor Author

@jbreiden @jbarlow83 @amitdo For your information: I just built the whole toolchain from the git repos (tesseract, ocrmypdf, unpaper), and have Ghostscript version 9.20 ready in a dedicated Debian 9 "OCR VM" on my Qubes OS system.

Please let me know what (if anything) you want me to test. I have time to test and want to help you.

@jbreiden
Contributor

Hmmm, an invisible text layer, invisible text, let's see ... iText? Anyway, I'll pick something. There is zero chance that a PDF rasterizer will ever be part of Tesseract or Leptonica. In theory one could write a PDF image extractor for Leptonica, but there isn't really enough motivation to do so.

@jbreiden
Contributor

jbreiden commented Jan 20, 2017

Ray will eventually merge this patch, but it is hard to predict when. I am posting here for anyone who is impatient or excited.

--- api/pdfrenderer.cpp	2016-12-13 14:43:24.000000000 -0800
+++ api/pdfrenderer.cpp	2017-01-19 14:50:56.000000000 -0800
@@ -178,10 +178,12 @@
  * PDF Renderer interface implementation
  **********************************************************************/
 
-TessPDFRenderer::TessPDFRenderer(const char* outputbase, const char *datadir)
+TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
+                                 bool textonly)
     : TessResultRenderer(outputbase, "pdf") {
   obj_  = 0;
   datadir_ = datadir;
+  textonly_ = textonly;
   offsets_.push_back(0);
 }
 
@@ -326,7 +328,11 @@
   pdf_str.add_str_double("", prec(width));
   pdf_str += " 0 0 ";
   pdf_str.add_str_double("", prec(height));
-  pdf_str += " 0 0 cm /Im1 Do Q\n";
+  pdf_str += " 0 0 cm";
+  if (!textonly_) {
+    pdf_str += " /Im1 Do";
+  }
+  pdf_str += " Q\n";
 
   int line_x1 = 0;
   int line_y1 = 0;
@@ -832,6 +838,7 @@
 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
   size_t n;
   char buf[kBasicBufSize];
+  char buf2[kBasicBufSize];
   Pix *pix = api->GetInputImage();
   char *filename = (char *)api->GetInputName();
   int ppi = api->GetSourceYResolution();
@@ -840,6 +847,9 @@
   double width = pixGetWidth(pix) * 72.0 / ppi;
   double height = pixGetHeight(pix) * 72.0 / ppi;
 
+  snprintf(buf2, sizeof(buf2), "XObject << /Im1 %ld 0 R >>\n", obj_ + 2);
+  const char *xobject = (textonly_) ? "" : buf2;
+
   // PAGE
   n = snprintf(buf, sizeof(buf),
                "%ld 0 obj\n"
@@ -850,19 +860,18 @@
                "  /Contents %ld 0 R\n"
                "  /Resources\n"
                "  <<\n"
-               "    /XObject << /Im1 %ld 0 R >>\n"
+               "    %s"
                "    /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
                "    /Font << /f-0-0 %ld 0 R >>\n"
                "  >>\n"
                ">>\n"
                "endobj\n",
                obj_,
-               2L,            // Pages object
-               width,
-               height,
-               obj_ + 1,      // Contents object
-               obj_ + 2,      // Image object
-               3L);           // Type0 Font
+               2L,  // Pages object
+               width, height,
+               obj_ + 1,  // Contents object
+               xobject,   // Image object
+               3L);       // Type0 Font
   if (n >= sizeof(buf)) return false;
   pages_.push_back(obj_);
   AppendPDFObject(buf);
@@ -899,13 +908,15 @@
   objsize += strlen(b2);
   AppendPDFObjectDIY(objsize);
 
-  char *pdf_object;
-  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
-    return false;
+  if (!textonly_) {
+    char *pdf_object = nullptr;
+    if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
+      return false;
+    }
+    AppendData(pdf_object, objsize);
+    AppendPDFObjectDIY(objsize);
+    delete[] pdf_object;
   }
-  AppendData(pdf_object, objsize);
-  AppendPDFObjectDIY(objsize);
-  delete[] pdf_object;
   return true;
 }
 

--- api/renderer.h	2016-11-07 07:44:03.000000000 -0800
+++ api/renderer.h	2017-01-19 14:50:56.000000000 -0800
@@ -186,7 +186,7 @@
  public:
   // datadir is the location of the TESSDATA. We need it because
   // we load a custom PDF font from this location.
-  TessPDFRenderer(const char *outputbase, const char *datadir);
+  TessPDFRenderer(const char* outputbase, const char* datadir, bool textonly);
 
  protected:
   virtual bool BeginDocumentHandler();
@@ -196,20 +196,20 @@
  private:
   // We don't want to have every image in memory at once,
   // so we store some metadata as we go along producing
-  // PDFs one page at a time. At the end that metadata is
+  // PDFs one page at a time. At the end, that metadata is
   // used to make everything that isn't easily handled in a
   // streaming fashion.
   long int obj_;                     // counter for PDF objects
   GenericVector<long int> offsets_;  // offset of every PDF object in bytes
   GenericVector<long int> pages_;    // object number for every /Page object
   const char *datadir_;              // where to find the custom font
+  bool textonly_;                    // skip images if set
   // Bookkeeping only. DIY = Do It Yourself.
   void AppendPDFObjectDIY(size_t objectsize);
   // Bookkeeping + emit data.
   void AppendPDFObject(const char *data);
   // Create the /Contents object for an entire page.
-  static char* GetPDFTextObjects(TessBaseAPI* api,
-                                 double width, double height);
+  char* GetPDFTextObjects(TessBaseAPI* api, double width, double height);
   // Turn an image into a PDF object. Only transcode if we have to.
   static bool imageToPDFObj(Pix *pix, char *filename, long int objnum,
                           char **pdf_object, long int *pdf_object_size);

--- api/tesseractmain.cpp	2016-12-15 15:28:37.000000000 -0800
+++ api/tesseractmain.cpp	2017-01-19 14:50:56.000000000 -0800
@@ -337,8 +337,10 @@
 
     api->GetBoolVariable("tessedit_create_pdf", &b);
     if (b) {
-      renderers->push_back(
-          new tesseract::TessPDFRenderer(outputbase, api->GetDatapath()));
+      bool textonly;
+      api->GetBoolVariable("textonly_pdf", &textonly);
+      renderers->push_back(new tesseract::TessPDFRenderer(
+          outputbase, api->GetDatapath(), textonly));
     }
 
     api->GetBoolVariable("tessedit_write_unlv", &b);

--- ccmain/tesseractclass.cpp	2017-01-19 11:57:09.000000000 -0800
+++ ccmain/tesseractclass.cpp	2017-01-19 18:15:57.000000000 -0800
@@ -391,6 +391,8 @@
                   this->params()),
       BOOL_MEMBER(tessedit_create_pdf, false, "Write .pdf output file",
                   this->params()),
+      BOOL_MEMBER(textonly_pdf, false, "Invisible text only for PDF",
+                  this->params()),
       STRING_MEMBER(unrecognised_char, "|",
                     "Output char for unidentified blobs", this->params()),
       INT_MEMBER(suspect_level, 99, "Suspect marker level", this->params()),

--- ccmain/tesseractclass.h	2017-01-19 11:57:09.000000000 -0800
+++ ccmain/tesseractclass.h	2017-01-19 16:31:04.000000000 -0800
@@ -1027,6 +1027,7 @@
   BOOL_VAR_H(tessedit_create_hocr, false, "Write .html hOCR output file");
   BOOL_VAR_H(tessedit_create_tsv, false, "Write .tsv output file");
   BOOL_VAR_H(tessedit_create_pdf, false, "Write .pdf output file");
+  BOOL_VAR_H(textonly_pdf, false, "Invisible text only for PDF");
   STRING_VAR_H(unrecognised_char, "|",
                "Output char for unidentified blobs");
   INT_VAR_H(suspect_level, 99, "Suspect marker level");
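
The renderer code in the patch context computes the page size as pixels * 72.0 / ppi, using the resolution Tesseract reports for the input image. A small sketch of that arithmetic follows; the function name page_size_points and the sample dimensions are illustrative assumptions. It suggests why a text-only PDF and a separately built image PDF line up only when both resolve to the same size in points.

```python
def page_size_points(width_px, height_px, ppi):
    """Convert pixel dimensions to PDF points (1 point = 1/72 inch)."""
    return (width_px * 72.0 / ppi, height_px * 72.0 / ppi)


# Hypothetical scan: 2480 x 3508 pixels at 300 ppi.
original = page_size_points(2480, 3508, 300)

# The same page upscaled 2x and tagged with twice the resolution.
upscaled = page_size_points(4960, 7016, 600)

print(original)
print(original == upscaled)
```

If the upscaled copy kept the original 300 ppi tag, its page would come out twice as wide and tall in points, and the text layer would no longer sit over the image.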

@RNCTX
Contributor

RNCTX commented Jan 20, 2017

@Shreeshrii http://kiirani.com/2013/03/22/tesseract-pdf.html

The PDF/invisible text output you guys are implementing works quite well for me using OSX 'Preview', apart from a little jerkiness depending on scaling, of course.

This is quite a big deal, in my opinion, as it will allow those who have, for instance... legal documents containing notary stamps in color, or in my use-case aviation emergency manuals with color-coded pages, to keep their original copies unmodified from their scanners, but modify them in a clean way into searchable documents. Thanks for this.

@Shreeshrii
Collaborator

Thanks for info on pdf to images conversion for use with tesseract.

I usually use ghostscript for the purpose e.g.

gs -dNOPAUSE -dBATCH  -r300x300 -sDEVICE=tiffg4  -dFirstPage=168  -dLastPage=174 -sOutputFile=sample%03d.tif ./sample.pdf

I will give the other suggestions a try (including a new one suggested by zdenop in the forum- https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/vvMldrkcuOQ/xLES3_ZoEwAJ )

@jbreiden Thanks, Jeff, for this invisible text output pdf which can be merged with the original pdf.

@jbreiden
Contributor

pdfimages from poppler-utils will do image extraction as well. And pdfium offers API calls for image extraction. I am sure there are many others. Have fun.

Wikinaut added a commit to Wikinaut/tesseract that referenced this issue Jan 20, 2017
@amitdo
Collaborator

amitdo commented Jan 20, 2017

> Ray will eventually merge this patch, but it is hard to predict when. I am posting here for anyone who is impatient or excited.

I suggest merging this to master now. Ray can modify it later if needed.

Wikinaut added a commit to Wikinaut/tesseract that referenced this issue Jan 20, 2017
@zdenop
Contributor

zdenop commented Jan 20, 2017

merged to master.
@Wikinaut: try master now.

@Wikinaut
Contributor Author

Wikinaut commented Jan 20, 2017

@zdenop effa574 does not work: it breaks tesseract [UPDATE:] and creates broken files. Who tested that patch, and how?

@Wikinaut
Contributor Author

Wikinaut commented Jan 21, 2017

Why *.jpg again (step 1)? Never ever use JPEG for files containing text.
Please don't recommend JPEG to the masses. Use PNG, PPM, or TIFF...

@Wikinaut
Contributor Author

I already developed code for this using -c textonly_pdf=1, thanks

@jbreiden
Contributor

jbreiden commented Jan 21, 2017

An image-only PDF file is a bag of images. If the bag is holding a bunch of JPEG images, extract them as-is. Don't convert. Don't recompress. Just empty the PDF bag and get your images out. If it is holding JPEG2000, then just get those out. Same with PNG.
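
One way to empty the bag without re-encoding, sketched under the assumption that poppler's pdfimages is installed and supports its -all option (which writes embedded JPEG and JPEG2000 streams in their native format). The helper name and the file names are hypothetical.

```python
import subprocess


def pdfimages_extract_command(pdf_path, output_prefix):
    """Build a pdfimages call that keeps embedded image formats where it can."""
    return ["pdfimages", "-all", pdf_path, output_prefix]


def extract_images(pdf_path, output_prefix):
    """Run the extraction; files are written next to output_prefix."""
    subprocess.run(pdfimages_extract_command(pdf_path, output_prefix),
                   check=True)


# Show the command line without running it.
print(" ".join(pdfimages_extract_command("in.pdf", "page")))
```

This differs from rendering tools such as pdftoppm or Ghostscript, which rasterize each page at a chosen resolution instead of copying the stored images.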

@Wikinaut
Contributor Author

Yes and no. Why can't tesseract do this (pass the "bunch of input images" through)?

@jbreiden
Contributor

Let's shift this discussion back to the forum. Please re-ask your most recent question there; I don't follow exactly what you are asking.

@Wikinaut
Contributor Author

Wikinaut commented Jan 21, 2017

Please elaborate on your step
"Extract the images from the PDF file (don't render!). For this example, we'll assume jpeg."

I use
pdftoppm -aa yes -r 400 -scale-to-x 2000 -scale-to-y 2800 in.pdf image

@zdenop
Contributor

zdenop commented Jan 21, 2017

The C-API should be fixed now. Thanks for finding this, @Wikinaut.
@jbreiden capi.cpp and capi.h are the C-API for tesseract, which is used by tesseract wrappers (Python etc.).
@Wikinaut as Jeff pointed out, please move this discussion back to the tesseract user forum.

@Jmuccigr

Jmuccigr commented May 4, 2017

Was there a final resolution to this request for putting back in the original images? @Wikinaut?

@jbreiden
Contributor

jbreiden commented May 4, 2017

Yes. The final solution was to implement tesseract -c textonly_pdf=1
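
A minimal sketch of invoking that option from a script, assuming a tesseract build that knows textonly_pdf. The helper name textonly_pdf_command is hypothetical; the argument layout mirrors the commands shown earlier in this thread.

```python
import subprocess


def textonly_pdf_command(image_path, output_base, lang=None):
    """Build a tesseract call that writes output_base.pdf without the image."""
    cmd = ["tesseract", "-c", "textonly_pdf=1"]
    if lang:
        cmd += ["-l", lang]
    # The trailing "pdf" selects the PDF renderer config.
    cmd += [image_path, output_base, "pdf"]
    return cmd


def run_textonly_ocr(image_path, output_base, lang=None):
    """Run OCR; raises CalledProcessError if tesseract exits non-zero."""
    subprocess.run(textonly_pdf_command(image_path, output_base, lang),
                   check=True)


# Show the command line without running it.
print(" ".join(textonly_pdf_command("in-cleaned.ppm", "text", "deu+eng")))
```

The resulting text.pdf holds only the invisible text layer and is meant to be merged with a PDF built from the original, uncleaned image.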

@Jmuccigr

Jmuccigr commented May 4, 2017

Yeah, that doesn't work for me: Could not set option: textonly_pdf=1

I'm using version 3.05.00 installed via homebrew.

@Wikinaut
Contributor Author

Wikinaut commented May 4, 2017

@Jmuccigr I am definitely not happy with the current implementation, and decided some months ago to stay silent and let other users come back with the issue (hoping that my original proposal, passing the original input image through without transcoding it, will be implemented in forthcoming versions).

@amitdo
Collaborator

amitdo commented May 4, 2017

> I'm using version 3.05.00

The textonly_pdf parameter is only available on the HEAD (4.00)

@Jmuccigr

Jmuccigr commented May 4, 2017

@Wikinaut, yeah, my workflow at some point involves adding OCR'ed text to an optimized PDF. Having the OCR step degrade the quality of that PDF kind of spoils it.

@Shreeshrii
Collaborator

> The textonly_pdf parameter is only available on the HEAD (4.00)

@zdenop Please backport for 3.05. Thanks!

@zdenop
Contributor

zdenop commented May 5, 2017

done.

@Shreeshrii
Collaborator

Thanks, @zdenop. Please also make a 3.05.01 release with the latest commit in 3.05 branch so that all these enhancements are easily accessible.

@Jmuccigr

Jmuccigr commented Jun 5, 2017

Just getting back to this now that 3.05.01 has hit homebrew and wanted to say that it seems to be working.

I've tested it out by running text-only tesseract on a 2x version of an image - which tends to give better results if the original dpi is too low - and then combining that text-only PDF with a PDF made from the original image, which keeps the file size down.
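
Since 3.05.00 rejects the option and 3.05.01 accepts it, a script can check the version banner before relying on textonly_pdf. The sketch below assumes the banner printed by tesseract --version starts with a dotted version number; the function names are hypothetical, and the check is only approximate for early 4.00 development builds from before the patch was merged.

```python
import re


def parse_tesseract_version(banner):
    """Extract (major, minor, patch) from a `tesseract --version` banner."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        raise ValueError("no version number found in: %r" % banner)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def supports_textonly_pdf(banner):
    """True for 3.05.01 and later, per the reports in this thread."""
    return parse_tesseract_version(banner) >= (3, 5, 1)


print(supports_textonly_pdf("tesseract 3.05.00"))
print(supports_textonly_pdf("tesseract 3.05.01"))
```

Older releases print the banner on stderr rather than stdout, so a caller capturing it with subprocess may need to read both streams.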

@gsauthof

gsauthof commented May 1, 2018

FWIW, I created a small command line utility pdfmerge as a frontend to the merge functionality (equivalent to the pdftk multibackground command) in the Python packages PyPDF2 and pdfrw.

@wrznr

wrznr commented Oct 24, 2019

From version 8.4.0 on, qpdf has the options --overlay/--underlay for easy merging of image-only and text PDFs. E.g.,

$ qpdf image.pdf --underlay text.pdf -- image_txt.pdf
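
The same call can be wrapped for batch use. This is a minimal sketch that assumes qpdf 8.4.0 or later is on PATH; the helper names are hypothetical and the arguments mirror the command above.

```python
import subprocess


def qpdf_underlay_command(image_pdf, text_pdf, output_pdf):
    """Build the qpdf call that places text_pdf beneath each page of image_pdf."""
    # The bare "--" ends the --underlay option group.
    return ["qpdf", image_pdf, "--underlay", text_pdf, "--", output_pdf]


def merge_with_qpdf(image_pdf, text_pdf, output_pdf):
    """Run the merge; raises CalledProcessError if qpdf exits non-zero."""
    subprocess.run(qpdf_underlay_command(image_pdf, text_pdf, output_pdf),
                   check=True)


# Show the command line without running it.
print(" ".join(qpdf_underlay_command("image.pdf", "text.pdf", "image_txt.pdf")))
```

Here image.pdf stays the primary input, which plays the same role as putting images.pdf first in the pdftk multibackground call earlier in the thread.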

@amitdo
Collaborator

amitdo commented Oct 24, 2019

@wrznr, thanks for the info.

@Jeankree

> Implementation complete and under review by Ray. @jbarlow83 this is a good time to look at the samples above and make sure they meet your needs.
>
> tesseract -c naked_pdf=true HelloWorld.png HelloWorld pdf

Hello,
I tried this today (tesseract v5.3.0.20221214) but was not able to run it. I always get this error:
Could not set option: naked_pdf=true
My command:
tesseract -c naked_pdf=true Duerer_Image.jpg Duerer_wText pdf
Was this option disabled in version 5?

My first goal was to understand how it works and what exactly it does (for merging image and text-only PDF files).

Thank you!

@amitdo
Collaborator

amitdo commented Feb 27, 2023

You didn't read the whole thread. The parameter name was changed to textonly_pdf.

@Jeankree

> You didn't read the whole thread. The parameter name was changed to textonly_pdf.

Oh sorry! I thought it was another parameter! (I already know textonly_pdf)
And: of course, I did read the whole thread, and now I read it again to be sure; I did not see any mention about changing the name of this parameter. This is why I thought it could have been different and have another use, but now, I understand that naked_pdf=true was only a temporary name…
Thank you for the clarification!

@amitdo
Collaborator

amitdo commented Feb 28, 2023

I should have written: "Did you read the whole thread?" or just omitted the sentence.
