[Cross-posted from the Forum/Suggestion] Implement a way to integrate (original image file, detected text) →searchable PDF #660

Closed
Wikinaut opened this Issue Jan 13, 2017 · 53 comments

Wikinaut commented Jan 13, 2017

https://groups.google.com/forum/#!topic/tesseract-ocr/vvMldrkcuOQ has asked:

I have a scanned PDF and want to make a searchable PDF from it.
First I generate a black-and-white multipage TIFF, and with tesseract I can make a searchable PDF from that.
But is it somehow possible to integrate the original PDF images?
The generated TIFF does not have the same quality as the original (for example, the scanned image may be in color).

How to reproduce:

  1. Assume a one-page in.pdf with a colored background, converted to the image in.ppm
  2. Preprocess: unpaper in.ppm in-cleaned.ppm
  3. Process with (for example): tesseract in-cleaned.ppm out -l deu+eng --oem 2 pdf txt
  4. The mixed output file out.pdf from tesseract now has a blotchy background (from the unpaper step above)

[Screenshot: 20170113-10 09 17_auswahl]

Is there any way to feed in the original in.ppm as the image, so that it is used instead of in-cleaned.ppm when creating out.pdf?

What is wanted is the original input image plus the OCR layer, so that the output looks like this:
[Screenshot: 20170113-10 12 22_auswahl]

jbarlow83 commented Jan 14, 2017

This is a complicated way of asking for an option to send one image through OCR and insert a different image in the output PDF.

tesseract --pdf-image original.png cleaned.png -l eng --oem 2 pdf  # not implemented, could work like this

I know this was requested before and I believe @jbreiden said it would be added to the PDF renderer at some point.

jbreiden commented Jan 14, 2017

I'm very reluctant to make Tesseract PDF generation fancy. I wonder if we can do an image swap like this outside of Tesseract, using one of the PDF manipulation toolkits.

jbarlow83 commented Jan 14, 2017

Wikinaut commented Jan 14, 2017

@jbreiden It's the last really missing issue.
The new algorithm is already a boost in quality. I reach here up to 100% OCR quality (for --oem 2 -l deu+eng) including these beasty "Umlauts" äöüÄÖÜß....

If this helps, I will donate some mBTC for implementing it just right now. Just post your receiving address.

Wikinaut commented Jan 14, 2017

@jbarlow83 Some background info. As you know, I recently wanted to try your OCRmyPDF because I found the interesting --clean option (source: https://media.readthedocs.org/pdf/ocrmypdf/latest/ocrmypdf.pdf ), which "does not alter the final output" and would have solved my problem:

  • --clean
    uses unpaper to clean up pages before OCR, but does not alter the final output. This makes it less likely that OCR will try to find text in background noise.
  • --clean-final
    uses unpaper to clean up pages before OCR and inserts the page into the final output. You will want to review each page to ensure that unpaper did not remove something important.

Unfortunately, this does not work with tesseract 4 at present.

So I looked for existing bug reports on whether tesseract could pass the original input image to the output, and then filed the present issue.

jbreiden commented Jan 14, 2017

Really? That's interesting, qpdf is very well written. Maybe the right thing to do is allow Tesseract to produce a multi-page PDF with invisible symbolic text PDF only, no images. Then another tool (perhaps an enhanced qpdf tool) would merge and composite two PDFs together. One being the original image-only PDF, and the other an invisible-text-only PDF. What do you think, @jbarlow83? Please point me at the relevant qpdf API calls if you happen to know them.

jbarlow83 commented Jan 15, 2017

I think invisible-text-only output would be far more useful for developers who integrate tesseract or anyone who wants to do something fancy. It would still make sense to keep the existing OCR-with-image option, of course. As a plus, it should be easier to suppress the image than to add a different one.

OCRmyPDF (which I maintain) uses Ghostscript to rasterize and then runs one of its two PDF renderers. One uses Tesseract hOCR and provides more features but is not as good at producing the OCR text layer as Tesseract PDF, so I also provide Tesseract PDF. If Tesseract could produce an invisible-text-only PDF, I could offer all the features for both and work towards phasing out the hOCR renderer. When possible, I already graft the text layer onto the existing PDF instead of constructing a new one.

In addition to OCRmyPDF, pdftk multibackground could merge an OCR layer onto an existing PDF (by "watermarking"), so there is at least one other supported tool out there that should work out of the box. There are some other tools that wrap tesseract for use with PDFs as well.

In writing this I've made a case for not using qpdf, because other tools should be able to do the job with an invisible-text PDF, but for interest's sake here is example code that inverts black and white for all images; clearly this is close to how one would replace an image outright.

jbreiden commented Jan 15, 2017

This sounds reasonable to me. I'll try to find time over this coming week to make an experimental invisible-text-only PDF that we can play with. All the other pieces of the puzzle are there; for example, Leptonica already ships with an images->pdf tool that avoids transcoding for PNG, JP2K, and JPEG. It would be cool to use qpdf for the merge step because it is already so useful for linearizing. But it's great that there are more options. The qpdf author is extremely friendly in my experience, in case we eventually chat with him. Oh, I now vaguely remember that PDFBox had something for merging as well, but I've never tried it and can't find it at the moment.

jbreiden commented Jan 17, 2017

Here's an experimental PDF pair, image-only and text-only. Let the merging begin!

images.pdf
text.pdf

jbreiden added the PDF label Jan 18, 2017

jbreiden commented Jan 18, 2017

This works brilliantly. I will implement for real if someone promises that they will use it. Also, what do we call the configuration option? My best idea so far to describe a PDF that has invisible text only is 'naked'. I'm sure someone has a better idea.

$ time pdftk text.pdf multibackground images.pdf output full.pdf
real	0m0.253s

Actually this works better the other way around, for preserving the bookmarks and things like that.

 pdftk  images.pdf multibackground text.pdf output full.pdf
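
For scripting this merge, a minimal Python sketch follows. It only wraps the pdftk invocation shown above; the function names `multibackground_cmd` and `merge` and the file names are illustrative assumptions, and pdftk must be installed for `merge` to do anything.

```python
import subprocess


def multibackground_cmd(images_pdf, text_pdf, output_pdf):
    """Return the pdftk argv that lays text_pdf behind the pages of images_pdf.

    images_pdf comes first, matching the second command above, which
    the comment reports as better for preserving bookmarks.
    """
    return ["pdftk", images_pdf, "multibackground", text_pdf,
            "output", output_pdf]


def merge(images_pdf, text_pdf, output_pdf):
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(multibackground_cmd(images_pdf, text_pdf, output_pdf),
                   check=True)


# Show the command that would be run, without running it.
print(" ".join(multibackground_cmd("images.pdf", "text.pdf", "full.pdf")))
```

Keeping the command construction separate from the `subprocess.run` call makes it possible to inspect or log the command without pdftk being present.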

jbreiden commented Jan 18, 2017

Implementation complete and under review by Ray. @jbarlow83 this is a good time to look at the samples above and make sure they meet your needs.

tesseract -c naked_pdf=true HelloWorld.png HelloWorld pdf

jbarlow83 commented Jan 18, 2017

Looks really good @jbreiden.

Works great in pdftk. No display issues and PDF syntax looks fine.

PyPDF2 is also capable of merging. It has no equivalent of "multibackground", but pages can be merged manually. Here is how to merge one page:

import PyPDF2 as pypdf

# Open both inputs; the files must stay open until the output is written.
pdf_text = pypdf.PdfFileReader(open('text.pdf', 'rb'))
pdf_image = pypdf.PdfFileReader(open('images.pdf', 'rb'))

# pages is zero-indexed, so index 1 is the second page of each file.
page_text = pdf_text.pages[1]
page_image = pdf_image.pages[1]

# Merge the image page into the text page: no rotation, scale 1.0, no offset.
page_text.mergeRotatedScaledTranslatedPage(page_image, 0, 1.0, 0, 0, expand=False)

out = pypdf.PdfFileWriter()
out.addPage(page_text)
with open('pypdfmerge.pdf', 'wb') as o:
    out.write(o)

For reference, pdfbox did not work out of the box. As far as I can tell the closest command in pdfbox is

java -jar pdfbox-app-2.0.2.jar OverlayPDF images.pdf text.pdf pdfboxoverlay.pdf

However, pdfbox takes the unusual approach of rasterizing the overlay PDF as a bitmap and drawing it on top of the base page, making it useless regardless of image/text order. (I suppose when you go to the trouble of implementing a full PDF renderer in Java you feel compelled to use it even when it's not strictly needed.)

jbarlow83 commented Jan 18, 2017

I don't know about calling it a naked PDF because there's nothing exciting to see in it. It's more of a phantom or spectral apparition PDF, having form without substance.

ocr_text_only would do, or suppress_images? Not nearly as fun, but practical.

jbreiden commented Jan 18, 2017

Spectral writing. Perhaps a kind of ghost script, if you will.

Shreeshrii commented Jan 19, 2017

How about text_only_pdf ?

@jbreiden is it also possible to use a .pdf file as input to tesseract directly?

amitdo commented Jan 19, 2017

pdf_invisible_text_layer_only
+
a config file pdfinvisible (or maybe pdf0)

jbarlow83 commented Jan 19, 2017

@Shreeshrii PDF is a very complex vector-based file format. Tesseract works only on images. It is much easier to write PDFs that use a limited set of PDF features than read arbitrary PDFs. Have a look at OCRmyPDF (which I develop) - it addresses the details of using tesseract to apply OCR to PDFs.

Wikinaut commented Jan 19, 2017

@jbreiden @jbarlow83 @amitdo info: I just built the whole toolchain from their git repos (tesseract, ocrmypdf, unpaper), and have ghostscript version 9.20 ready in a dedicated debian 9 "OCR VM" on my Qubes OS system.

Please let me know what (if anything) you want me to test. I have time to test and want to help you.

jbreiden commented Jan 19, 2017

Hmmm, an invisible text layer, invisible text, let's see ... iText? Anyway, I'll pick something. There is zero chance that a PDF rasterizer will ever be part of Tesseract or Leptonica. In theory one could write a PDF image extractor for Leptonica, but there isn't really enough motivation to do so.

jbreiden commented Jan 20, 2017

Ray will eventually merge this patch, but it is hard to predict when. I am posting here for anyone who is impatient or excited.

--- api/pdfrenderer.cpp	2016-12-13 14:43:24.000000000 -0800
+++ api/pdfrenderer.cpp	2017-01-19 14:50:56.000000000 -0800
@@ -178,10 +178,12 @@
  * PDF Renderer interface implementation
  **********************************************************************/
 
-TessPDFRenderer::TessPDFRenderer(const char* outputbase, const char *datadir)
+TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
+                                 bool textonly)
     : TessResultRenderer(outputbase, "pdf") {
   obj_  = 0;
   datadir_ = datadir;
+  textonly_ = textonly;
   offsets_.push_back(0);
 }
 
@@ -326,7 +328,11 @@
   pdf_str.add_str_double("", prec(width));
   pdf_str += " 0 0 ";
   pdf_str.add_str_double("", prec(height));
-  pdf_str += " 0 0 cm /Im1 Do Q\n";
+  pdf_str += " 0 0 cm";
+  if (!textonly_) {
+    pdf_str += " /Im1 Do";
+  }
+  pdf_str += " Q\n";
 
   int line_x1 = 0;
   int line_y1 = 0;
@@ -832,6 +838,7 @@
 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
   size_t n;
   char buf[kBasicBufSize];
+  char buf2[kBasicBufSize];
   Pix *pix = api->GetInputImage();
   char *filename = (char *)api->GetInputName();
   int ppi = api->GetSourceYResolution();
@@ -840,6 +847,9 @@
   double width = pixGetWidth(pix) * 72.0 / ppi;
   double height = pixGetHeight(pix) * 72.0 / ppi;
 
+  snprintf(buf2, sizeof(buf2), "XObject << /Im1 %ld 0 R >>\n", obj_ + 2);
+  const char *xobject = (textonly_) ? "" : buf2;
+
   // PAGE
   n = snprintf(buf, sizeof(buf),
                "%ld 0 obj\n"
@@ -850,19 +860,18 @@
                "  /Contents %ld 0 R\n"
                "  /Resources\n"
                "  <<\n"
-               "    /XObject << /Im1 %ld 0 R >>\n"
+               "    %s"
                "    /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
                "    /Font << /f-0-0 %ld 0 R >>\n"
                "  >>\n"
                ">>\n"
                "endobj\n",
                obj_,
-               2L,            // Pages object
-               width,
-               height,
-               obj_ + 1,      // Contents object
-               obj_ + 2,      // Image object
-               3L);           // Type0 Font
+               2L,  // Pages object
+               width, height,
+               obj_ + 1,  // Contents object
+               xobject,   // Image object
+               3L);       // Type0 Font
   if (n >= sizeof(buf)) return false;
   pages_.push_back(obj_);
   AppendPDFObject(buf);
@@ -899,13 +908,15 @@
   objsize += strlen(b2);
   AppendPDFObjectDIY(objsize);
 
-  char *pdf_object;
-  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
-    return false;
+  if (!textonly_) {
+    char *pdf_object = nullptr;
+    if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
+      return false;
+    }
+    AppendData(pdf_object, objsize);
+    AppendPDFObjectDIY(objsize);
+    delete[] pdf_object;
   }
-  AppendData(pdf_object, objsize);
-  AppendPDFObjectDIY(objsize);
-  delete[] pdf_object;
   return true;
 }
 

--- api/renderer.h	2016-11-07 07:44:03.000000000 -0800
+++ api/renderer.h	2017-01-19 14:50:56.000000000 -0800
@@ -186,7 +186,7 @@
  public:
   // datadir is the location of the TESSDATA. We need it because
   // we load a custom PDF font from this location.
-  TessPDFRenderer(const char *outputbase, const char *datadir);
+  TessPDFRenderer(const char* outputbase, const char* datadir, bool textonly);
 
  protected:
   virtual bool BeginDocumentHandler();
@@ -196,20 +196,20 @@
  private:
   // We don't want to have every image in memory at once,
   // so we store some metadata as we go along producing
-  // PDFs one page at a time. At the end that metadata is
+  // PDFs one page at a time. At the end, that metadata is
   // used to make everything that isn't easily handled in a
   // streaming fashion.
   long int obj_;                     // counter for PDF objects
   GenericVector<long int> offsets_;  // offset of every PDF object in bytes
   GenericVector<long int> pages_;    // object number for every /Page object
   const char *datadir_;              // where to find the custom font
+  bool textonly_;                    // skip images if set
   // Bookkeeping only. DIY = Do It Yourself.
   void AppendPDFObjectDIY(size_t objectsize);
   // Bookkeeping + emit data.
   void AppendPDFObject(const char *data);
   // Create the /Contents object for an entire page.
-  static char* GetPDFTextObjects(TessBaseAPI* api,
-                                 double width, double height);
+  char* GetPDFTextObjects(TessBaseAPI* api, double width, double height);
   // Turn an image into a PDF object. Only transcode if we have to.
   static bool imageToPDFObj(Pix *pix, char *filename, long int objnum,
                           char **pdf_object, long int *pdf_object_size);

--- api/tesseractmain.cpp	2016-12-15 15:28:37.000000000 -0800
+++ api/tesseractmain.cpp	2017-01-19 14:50:56.000000000 -0800
@@ -337,8 +337,10 @@
 
     api->GetBoolVariable("tessedit_create_pdf", &b);
     if (b) {
-      renderers->push_back(
-          new tesseract::TessPDFRenderer(outputbase, api->GetDatapath()));
+      bool textonly;
+      api->GetBoolVariable("textonly_pdf", &textonly);
+      renderers->push_back(new tesseract::TessPDFRenderer(
+          outputbase, api->GetDatapath(), textonly));
     }
 
     api->GetBoolVariable("tessedit_write_unlv", &b);

--- ccmain/tesseractclass.cpp	2017-01-19 11:57:09.000000000 -0800
+++ ccmain/tesseractclass.cpp	2017-01-19 18:15:57.000000000 -0800
@@ -391,6 +391,8 @@
                   this->params()),
       BOOL_MEMBER(tessedit_create_pdf, false, "Write .pdf output file",
                   this->params()),
+      BOOL_MEMBER(textonly_pdf, false, "Invisible text only for PDF",
+                  this->params()),
       STRING_MEMBER(unrecognised_char, "|",
                     "Output char for unidentified blobs", this->params()),
       INT_MEMBER(suspect_level, 99, "Suspect marker level", this->params()),

--- ccmain/tesseractclass.h	2017-01-19 11:57:09.000000000 -0800
+++ ccmain/tesseractclass.h	2017-01-19 16:31:04.000000000 -0800
@@ -1027,6 +1027,7 @@
   BOOL_VAR_H(tessedit_create_hocr, false, "Write .html hOCR output file");
   BOOL_VAR_H(tessedit_create_tsv, false, "Write .tsv output file");
   BOOL_VAR_H(tessedit_create_pdf, false, "Write .pdf output file");
+  BOOL_VAR_H(textonly_pdf, false, "Invisible text only for PDF");
   STRING_VAR_H(unrecognised_char, "|",
                "Output char for unidentified blobs");
   INT_VAR_H(suspect_level, 99, "Suspect marker level");

RNCTX commented Jan 20, 2017

@Shreeshrii http://kiirani.com/2013/03/22/tesseract-pdf.html

The PDF/invisible text output you guys are implementing works quite well for me using OSX 'Preview' but for a little jerkiness depending on scaling, of course.

This is quite a big deal, in my opinion, as it will allow those who have, for instance... legal documents containing notary stamps in color, or in my use-case aviation emergency manuals with color-coded pages, to keep their original copies unmodified from their scanners, but modify them in a clean way into searchable documents. Thanks for this.

Shreeshrii commented Jan 20, 2017

Thanks for info on pdf to images conversion for use with tesseract.

I usually use ghostscript for the purpose e.g.

gs -dNOPAUSE -dBATCH  -r300x300 -sDEVICE=tiffg4  -dFirstPage=168  -dLastPage=174 -sOutputFile=sample%03d.tif ./sample.pdf

I will give the other suggestions a try (including a new one suggested by zdenop in the forum- https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/vvMldrkcuOQ/xLES3_ZoEwAJ )

@jbreiden Thanks, Jeff, for this invisible text output pdf which can be merged with the original pdf.

jbreiden commented Jan 20, 2017

pdfimages from poppler-utils will do image extraction as well. And pdfium offers API calls for image extraction. I am sure there are many others. Have fun.

Wikinaut added a commit to Wikinaut/tesseract that referenced this issue Jan 20, 2017

amitdo commented Jan 20, 2017

"Ray will eventually merge this patch, but it is hard to predict when. I am posting here for anyone who is impatient or excited."

I suggest merging this to master now. Ray can modify it later if needed.

Wikinaut added a commit to Wikinaut/tesseract that referenced this issue Jan 20, 2017

zdenop commented Jan 20, 2017

merged to master.
@Wikinaut: try master now.

Wikinaut commented Jan 20, 2017

@zdenop effa574 does not work: breaks tesseract [UPDATE:] and creates broken files. Who has tested that patch, and how ?

amitdo commented Jan 20, 2017

"effa574 does not work: breaks tesseract and creates broken files. Who has tested that patch, and how?"

I had the impression that Jeff tested it. Maybe I was wrong.

Wikinaut commented Jan 20, 2017

@amitdo with "it breaks" I mean, that the "normal" function of tesseract is broken, effa574 always creates a blank pdf.

Wikinaut commented Jan 20, 2017

And now, if I combine my original input PDF with the created output, I appear to have two text layers. Can anyone help? It's getting more and more complicated.

Let's go back to the roots:
Why not simply pass the original input image through to the output, inside tesseract?

My wish:

tesseract image.png image.ocr -c image_passthrough=1 pdf

which then creates

  • image.txt (with the ocr-ed text)
  • image.ocr.pdf (mixed-mode pdf with the original image.png and image.txt)

And this setting ( -c image_passthrough=1 ) should be the tesseract default, in my view.

jbarlow83 commented Jan 20, 2017

Tested. With effa574, tesseract -c textonly_pdf=1 works correctly and tesseract -c textonly_pdf=0 produces an invalid PDF.

The problem is a missing "/" in front of XObject.
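
To make the failure concrete, here is a small Python re-creation of the resource-dictionary string from the patch. It is an illustration only, not Tesseract code; `page_resources` and its parameters are made-up names. In PDF syntax a dictionary key must be a name object, which starts with "/", so a bare XObject is not a valid key. The text-only path emits no XObject entry at all, which would explain why only textonly_pdf=0 was affected.

```python
def page_resources(image_obj, font_obj, textonly, fixed=True):
    """Build a /Resources dictionary like the one in the patched renderer.

    fixed=False reproduces the string from the original patch, which
    lacks the leading "/" on XObject.
    """
    key = "/XObject" if fixed else "XObject"
    # In text-only mode no image is referenced, so the entry is empty.
    xobject = "" if textonly else f"{key} << /Im1 {image_obj} 0 R >>\n"
    return (
        "  /Resources\n"
        "  <<\n"
        f"    {xobject}"
        "    /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
        f"    /Font << /f-0-0 {font_obj} 0 R >>\n"
        "  >>\n"
    )


print(page_resources(7, 3, textonly=False, fixed=False))  # broken key
print(page_resources(7, 3, textonly=False))               # corrected key
print(page_resources(7, 3, textonly=True))                # no image entry
```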

jbarlow83 commented Jan 20, 2017

Fix in #667

Wikinaut commented Jan 20, 2017

@jbarlow83 works. But looking at capi.cpp, I think the corresponding change also has to be applied in capi.h; see
Wikinaut@5e80891#diff-1ff9fac4997a03321dc873248bcf1309

(I am not sure whether my patch is correct.)

jbreiden commented Jan 20, 2017

I clearly did not test well enough to find the embarrassing /XObject bug. Thank you for finding and fixing that, @jbarlow83. Regarding capi.h and capi.cpp, the development branch that I am sharing with @theraysmith doesn't have those files. I don't know what the story is with that; maybe Ray does.

@Wikinaut, there are problems with "send one image through OCR and insert a different image in the output PDF". Image management becomes difficult, especially considering that TIFF and PDF are multipage formats. On top of that, certain image formats require transcoding. I think the textonly_pdf approach makes a building block that plays well with other tools, and is the right way to go. It is super simple to implement, and is especially well suited for turning scanned PDFs into searchable PDFs while preserving metadata. That's something that increases in importance every single day.

Wikinaut commented Jan 20, 2017

@jbreiden regarding "image_passthrough", please allow me to explain my workflow, which in my view is quite common.

  • I already have mixed-mode multi-page PDFs ("input.pdf"), for example OCR-ed with the old tesseract.
  • I have found that the new LSTM mode is very much better, and I want to regenerate the text layer for all my archived PDFs...
  • ...without losing the high image quality of my existing scans.

With the new textonly_pdf mode I managed this, but it requires an additional ugly step, marked with (*):

  • split input.pdf into single pages for tesseract (using pdftoppm, gs, or whatever else)
  • for each $image, run tesseract input-$image.ppm textonly-$image -c textonly_pdf=1 pdf
  • pdftk textonly-*.pdf cat output "textonly.pdf"
  • (*) remove the text layer from input.pdf -> input-without-text.pdf
  • pdftk input-without-text.pdf multibackground textonly.pdf output new-mixed-mode.pdf
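
The command sequence in this list could be scripted roughly as follows. This is a sketch under assumptions: the function name `plan_commands` and the intermediate file names are illustrative, the page images are assumed to exist already (for example from pdftoppm), and the step marked (*) is left out because the thread names no tool for it.

```python
from pathlib import Path


def plan_commands(page_images, input_without_text_pdf, output_pdf):
    """Return the argv lists for the OCR and merge steps, in order."""
    commands = []
    textonly_pdfs = []
    for image in page_images:
        base = f"textonly-{Path(image).stem}"
        # tesseract appends ".pdf" to the output base name.
        commands.append(["tesseract", image, base,
                         "-c", "textonly_pdf=1", "pdf"])
        textonly_pdfs.append(base + ".pdf")
    # Concatenate the per-page text-only PDFs into one file.
    commands.append(["pdftk", *textonly_pdfs,
                     "cat", "output", "textonly.pdf"])
    # Combine the invisible text with the image-only pages.
    commands.append(["pdftk", input_without_text_pdf,
                     "multibackground", "textonly.pdf",
                     "output", output_pdf])
    return commands


for argv in plan_commands(["input-1.ppm", "input-2.ppm"],
                          "input-without-text.pdf", "new-mixed-mode.pdf"):
    print(" ".join(argv))
```

Each returned list can be handed to `subprocess.run(argv, check=True)` once the plan looks right.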

So it is still a very long way around with your new option.

Please, perhaps you can find a way to pass the input image through when it is a single page in a losslessly coded format:

  • gif (not really needed)
  • png; or
  • ppm, pbm, pgm (preferred)

I think it's possible.

jbreiden commented Jan 21, 2017

There is a much simpler way to achieve the same results. Remember, for most image formats Tesseract will not mangle the images in any way when creating a PDF file.

  1. Extract the images from the PDF file (don't render!). For this example, we'll assume jpeg.
  2. ls *.jpg | tesseract - result pdf

It is a little more complicated if you want what was described in the top-level comment, which is to OCR a different image than the one that ends up in the PDF file. For that, it would look like this.

  1. Extract the images from the PDF file (don't render!)
  2. Merge them into an image-only PDF, using something like converttopdf from leptonica-progs
  3. Apply your favorite image processing operations to the extracted images
  4. Generate a text-only PDF from the extracted images
  5. Merge your image-only and text-only PDF

Bottom line, you want your images completely unmolested during this process. No format conversion, no decompress + recompress cycles. Nothing. Hands off. And it will work for most normal starting points. And honestly, you can almost certainly outsource all these details to software written by @jbarlow83.

Wikinaut commented Jan 21, 2017

Why *.jpg again (step 1)? Never ever use JPEG for images of text.
Please don't recommend JPEG to the masses. Use PNG, PPM, or TIFF...

Wikinaut commented Jan 21, 2017

I already developed code for this using -c textonly_pdf=1, thanks

jbreiden commented Jan 21, 2017

An image-only PDF file is a bag of images. If the bag is holding a bunch of JPEG images, extract them as-is. Don't convert. Don't recompress. Just empty the PDF bag and get your images out. If it is holding JPEG2000, then just get those out. Same with PNG.

Wikinaut commented Jan 21, 2017

Yes and no, why can't tesseract do this (pass-through the "bunch of input images") ?

jbreiden commented Jan 21, 2017

Let's shift this discussion back to the forum. Please re-ask your most recent question there; I don't follow exactly what you are asking.

Wikinaut commented Jan 21, 2017

Please elaborate on your step
"Extract the images from the PDF file (don't render!). For this example, we'll assume jpeg."

I use
pdftoppm -aa yes -r 400 -scale-to-x 2000 -scale-to-y 2800 in.pdf image

zdenop commented Jan 21, 2017

The C-API should be fixed now. Thanks for finding this, @Wikinaut.
@jbreiden capi.cpp and capi.h are the C-API for tesseract, which is used by tesseract wrappers (python etc.)
@Wikinaut as pointed out by Jeff, please move this discussion back to the tesseract user forum.

Jmuccigr commented May 4, 2017

Was there a final resolution to this request for putting back in the original images? @Wikinaut?

jbreiden commented May 4, 2017

Yes. The final solution was to implement tesseract -c textonly_pdf=1

Jmuccigr commented May 4, 2017

Yeah, that doesn't work for me: Could not set option: textonly_pdf=1

I'm using version 3.05.00 installed via homebrew.

Wikinaut commented May 4, 2017

@Jmuccigr I am definitely not happy with the current implementation, and decided some months ago to stay silent and let other users come back with the issue (hoping that my original proposal, to pass the original input image through without transcoding it, will be implemented in forthcoming versions).

amitdo commented May 4, 2017

"I'm using version 3.05.00"

The textonly_pdf parameter is only available on the HEAD (4.00)

Jmuccigr commented May 4, 2017

@Wikinaut, yeah, my workflow at some point involves adding OCR'ed text to an optimized PDF. Having the OCR step degrade the quality of that PDF kind of spoils it.

Shreeshrii commented May 4, 2017

"The textonly_pdf parameter is only available on the HEAD (4.00)"

@zdenop Please backport for 3.05. Thanks!

zdenop commented May 5, 2017

done.

Shreeshrii commented May 5, 2017

Thanks, @zdenop. Please also make a 3.05.01 release with the latest commit in 3.05 branch so that all these enhancements are easily accessible.

Jmuccigr commented Jun 5, 2017

Just getting back to this now that 3.05.01 has hit homebrew and wanted to say that it seems to be working.

I've tested it out by running text-only tesseract on a 2x version of an image - which tends to give better results if the original dpi is too low - and then combining that text-only PDF with a PDF made from the original image, which keeps the file size down.
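
A quick way to sanity-check an approach like this is to confirm that the upscaled image still maps to the same page size. The patch earlier in this thread computes page dimensions as pixels * 72.0 / ppi, so a 2x image keeps its page size only if its recorded resolution is doubled as well. The sketch below just repeats that arithmetic; `page_size_points` and the sample numbers are assumptions for illustration.

```python
def page_size_points(width_px, height_px, ppi):
    """Page size in PDF points (1/72 inch), using the patch's formula."""
    return (width_px * 72.0 / ppi, height_px * 72.0 / ppi)


original = page_size_points(1275, 1650, 150)  # original scan
upscaled = page_size_points(2550, 3300, 300)  # 2x pixels, 2x ppi
mismatch = page_size_points(2550, 3300, 150)  # 2x pixels, ppi unchanged

print(original, upscaled, mismatch)
```

Some merge tools may rescale pages while combining them, so a mismatch is not necessarily fatal, but matching sizes avoids relying on that.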

gsauthof commented May 1, 2018

FWIW, I created a small command line utility pdfmerge as a frontend to the merge functionality (equivalent to the pdftk multibackground command) in the Python packages PyPDF2 and pdfrw.
