# OCR Demo: Full vs. Partial OCR of Pages
As of v1.19.1, PyMuPDF offers two ways to OCR a document page: **_full_** or **_partial_**. In both cases, a `TextPage` object is created, which can be used for text extraction and text search as usual. All of these text processing methods accept a new `textpage` parameter for referencing the OCR result.
* A **_full OCR_** renders the page as an image at the desired resolution and interprets it.
   - Any existing text on the page will be re-interpreted and delivered together with the text recognized in images on the page. All text will be assigned Tesseract's "GlyphLessFont" as its font.
   - Only the visible text will be present after OCR. Text covered by images will be lost.
   - Depending on the overall amount of text, the chosen DPI and the number of languages used on the page, the OCR may take up to around 2 seconds.
* A **_partial OCR_** interprets only the images displayed by the page.
   - Existing "normal" text will be kept with all its properties (font name, font size, position, etc.).
   - The OCR precision cannot (and need not) be influenced, because each image is always interpreted at the quality / resolution stored in the document. The `dpi` parameter is ignored.
   - The resulting `TextPage` will contain a **_mixture of normal and OCR text_**.
   - For obvious reasons, the execution time of this alternative tends to be much lower than that of a full OCR.

In [14]:
import fitz

if tuple(map(int, fitz.VersionBind.split("."))) < (1, 19, 1):
    raise ValueError("Need at least v1.19.1 of PyMuPDF")

# open a document with normal text and two overlapping images
doc = fitz.open("partial-ocr.pdf")
page = doc[0]

First make a **_full page OCR_**. Please take a look at the document and note the two little text lines. They are contained in a separate, non-transparent image, which covers some text of the larger image underneath it.

In [15]:
full_tp = page.get_textpage_ocr(flags=0, dpi=300, full=True)  # make the TextPage object. It does all the OCR.
# now look what we have got
print(page.get_text(textpage=full_tp))

PDF
PyMuPDF — the Python
soe urs ton na
luPDF
PyMuPDF Documentation
Release 1.19.0
Jorj X. McKie
Oct 17, 2021



Or blockwise output, getting rid of some of the unwanted linebreaks:

In [16]:
blocks = page.get_text("blocks", textpage=full_tp)
for b in blocks:
    print(b[4].replace("\n", " "))

PDF 
PyMuPDF — the Python 
soe urs ton na luPDF PyMuPDF Documentation Release 1.19.0 
Jorj X. McKie 
Oct 17, 2021 


Not very impressive either way: the original text (last 4 lines) was detected ok, but text in the pictures looks quite garbled ... not to our surprise, really.

Please note that the OCR process scans the page from top-left to bottom-right, which is therefore also the sequence of the extraction.

This is what we get when looking at details of each text span:

In [17]:
for block in page.get_text("dict", textpage=full_tp)["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"]).irect  # just to make it short in the display
            print(span["font"], bbox, span["text"])

GlyphLessFont IRect(222, 188, 297, 221) PDF
GlyphLessFont IRect(282, 273, 390, 298) PyMuPDF
GlyphLessFont IRect(389, 273, 411, 298)  —
GlyphLessFont IRect(410, 273, 453, 298)  the
GlyphLessFont IRect(452, 273, 540, 298)  Python
GlyphLessFont IRect(283, 300, 307, 336) soe
GlyphLessFont IRect(306, 300, 331, 336)  urs
GlyphLessFont IRect(330, 300, 363, 336)  ton
GlyphLessFont IRect(362, 300, 391, 336)  na
GlyphLessFont IRect(457, 300, 521, 336) luPDF
GlyphLessFont IRect(239, 348, 352, 373) PyMuPDF
GlyphLessFont IRect(351, 348, 538, 373)  Documentation
GlyphLessFont IRect(423, 382, 488, 396) Release
GlyphLessFont IRect(487, 382, 541, 396)  1.19.0
GlyphLessFont IRect(432, 478, 463, 496) Jorj
GlyphLessFont IRect(462, 478, 484, 496)  X.
GlyphLessFont IRect(483, 478, 540, 496)  McKie
GlyphLessFont IRect(470, 641, 489, 656) Oct
GlyphLessFont IRect(488, 641, 510, 656)  17,
GlyphLessFont IRect(509, 641, 538, 656)  2021



Let's see what a **_partial OCR_** can do for us:

In [18]:
partial_tp = page.get_textpage_ocr(flags=0, full=False)
# look at the result also here
print(page.get_text(textpage=partial_tp))

PyMuPDF Documentation
Release 1.19.0
Jorj X. McKie
Oct 17, 2021
=
PDF
PyMuPDF — the Python
bindings for MuPDF
Some text as line
1.
Some more text as line 2.



This is very much better ... and also faster.

Please note that the partial OCR `TextPage` stores its text in the following sequence:
1. Normal text
2. Image OCR text in the same sequence as the page displays those images

Looking again at span details:

In [19]:
for block in page.get_text("dict", textpage=partial_tp)["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            bbox = fitz.Rect(span["bbox"]).irect  # just to make it short
            print(span["font"], bbox, span["text"])

NimbusSanL-Bold IRect(237, 342, 541, 374) PyMuPDF Documentation
NimbusSanL-BoldItal IRect(422, 377, 541, 400) Release 1.19.0
NimbusSanL-Bold IRect(431, 474, 541, 497) Jorj X. McKie
NimbusSanL-Bold IRect(470, 639, 541, 655) Oct 17, 2021
GlyphLessFont IRect(284, 189, 297, 202) =
GlyphLessFont IRect(222, 188, 297, 221) PDF
GlyphLessFont IRect(282, 273, 390, 298) PyMuPDF
GlyphLessFont IRect(389, 273, 411, 298)  —
GlyphLessFont IRect(410, 273, 453, 298)  the
GlyphLessFont IRect(452, 273, 539, 298)  Python
GlyphLessFont IRect(301, 305, 394, 331) bindings
GlyphLessFont IRect(393, 305, 433, 331)  for
GlyphLessFont IRect(432, 305, 521, 331)  MuPDF
GlyphLessFont IRect(283, 302, 307, 310) Some
GlyphLessFont IRect(306, 302, 326, 310)  text
GlyphLessFont IRect(325, 302, 339, 310)  as
GlyphLessFont IRect(338, 302, 356, 310)  line
GlyphLessFont IRect(360, 302, 363, 310) 1.
GlyphLessFont IRect(283, 320, 307, 328) Some
GlyphLessFont IRect(306, 320, 333, 328)  more
GlyphLessFont IRect(332, 320, 350, 328

As mentioned, the normal text is contained in the textpage just as it would be without OCR: with its own font, font size, position information, etc. OCRed text, in contrast, appears with Tesseract's `GlyphLessFont`.
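Because the font name tells the two kinds of text apart, the spans of a partial OCR result can be split accordingly. The following is only a minimal sketch: `split_spans` and the hand-made `sample` dictionary are hypothetical names introduced for illustration, and the sketch assumes that OCR spans always carry the font name `GlyphLessFont`, as they do in the output above.

```python
OCR_FONT = "GlyphLessFont"  # font name seen on OCR spans in the output above


def split_spans(page_dict):
    """Split the span texts of a get_text("dict") result into (normal, ocr)."""
    normal, ocr = [], []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line["spans"]:
                target = ocr if span["font"] == OCR_FONT else normal
                target.append(span["text"])
    return normal, ocr


# Hand-made sample that mimics two spans from the output above
sample = {
    "blocks": [
        {"lines": [{"spans": [{"font": "NimbusSanL-Bold", "text": "Jorj X. McKie"}]}]},
        {"lines": [{"spans": [{"font": "GlyphLessFont", "text": "bindings"}]}]},
    ]
}
normal, ocr = split_spans(sample)
print("normal:", normal)
print("ocr:", ocr)
```

With a real page, the call would be `split_spans(page.get_text("dict", textpage=partial_tp))`.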

To process the text of a partial OCR output, it makes sense to sort it into some reading order first, for example:

In [20]:
blocks = page.get_text("blocks", textpage=partial_tp)
blocks.sort(key=lambda b: (b[3], b[0]))  # sort by bottom edge (y1), then by left edge (x0)
for b in blocks:
    print(b[4].replace("\n", " "))  # suppress pesky line breaks

= PDF 
Some text as line 1. 
Some more text as line 2. 
PyMuPDF — the Python bindings for MuPDF 
PyMuPDF Documentation 
Release 1.19.0 
Jorj X. McKie 
Oct 17, 2021 


Now the first 4 lines represent the OCR text, the last 4 the normal text. All text appears, even when covered by other objects.
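Sorting on exact coordinates can misorder blocks that sit on the same visual row but whose bottom edges differ by a point or two. One possible refinement, shown here only as a sketch, groups blocks into rows before sorting left to right. The function `reading_order`, its `tolerance` value and the sample tuples are assumptions made for illustration, not part of PyMuPDF.

```python
def reading_order(blocks, tolerance=3.0):
    """Order block tuples (x0, y0, x1, y1, text, ...) by row, then left to right.

    A block joins the current row if its bottom edge (y1) is within
    `tolerance` points of the bottom edge of the first block in that row.
    """
    rows = []
    for b in sorted(blocks, key=lambda b: b[3]):
        if rows and abs(b[3] - rows[-1][0][3]) <= tolerance:
            rows[-1].append(b)
        else:
            rows.append([b])
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b[0]))
    return ordered


# Hand-made blocks: "left" and "right" share a row, bottoms differ by 1 point
demo_blocks = [
    (300, 100, 400, 120, "right"),
    (50, 101, 150, 121, "left"),
    (50, 40, 150, 60, "top"),
]
for b in reading_order(demo_blocks):
    print(b[4])
```

A plain sort on `(b[3], b[0])` would put "right" before "left" in this sample, because its bottom edge is one point higher.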

## Performance
As noted earlier, the OCR work is done during `TextPage` creation. Even without OCR, textpage creation is the most time-consuming part of text processing.

Creating OCR textpages may easily take 100 to several thousand times longer. It should therefore happen only once per document page.

The new `textpage` parameter in all text processing methods allows referring to an existing textpage and will suppress creating another one.
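As a rough illustration of this reuse pattern, the sketch below creates the OCR textpage once and passes it to both an extraction and a search. The helper `extract_and_search` is a hypothetical name, and `FakePage` is a stand-in class that only imitates the three `Page` methods used, so that the sketch runs without a PDF or a Tesseract installation.

```python
def extract_and_search(page, needle, full=False):
    """Run one OCR pass, then reuse its textpage for extraction and search."""
    tp = page.get_textpage_ocr(flags=0, full=full)  # the only expensive call
    text = page.get_text("text", textpage=tp)       # reuses tp
    hits = page.search_for(needle, textpage=tp)     # reuses tp again
    return text, hits


class FakePage:
    """Stand-in for a PyMuPDF page (hypothetical, for demonstration only)."""

    def __init__(self):
        self.ocr_calls = 0

    def get_textpage_ocr(self, flags=0, full=False, dpi=72):
        self.ocr_calls += 1  # count how often the costly step runs
        return object()

    def get_text(self, option="text", textpage=None):
        return "Some text as line 1."

    def search_for(self, needle, textpage=None):
        return ["hit"] if needle in self.get_text() else []


fake = FakePage()
text, hits = extract_and_search(fake, "line 1")
print(fake.ocr_calls, text, hits)
```

With a real document, `page` would be something like `doc[0]`, and the returned hits would be rectangles instead of the placeholder strings used here.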

Here are some performance comparisons for our example page:

In [21]:
# normal text extraction - no OCR
%timeit page.get_textpage(flags=0)  # suppress image extraction

91.6 µs ± 597 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [22]:
# full page OCR
%timeit page.get_textpage_ocr(flags=0, full=True, dpi=300)

341 ms ± 6.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [23]:
# partial OCR
%timeit page.get_textpage_ocr(flags=0, full=False)

242 ms ± 5.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The above numbers illustrate how costly OCR textpages are:
* full OCR took roughly 3700 times longer in our case
* partial OCR still took roughly 2600 times longer.
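The slowdown factors can be recomputed from the `%timeit` means shown above. This is plain arithmetic on those three measurements; the variable names are chosen here for illustration only.

```python
# Mean run times from the %timeit outputs above, converted to seconds
plain_s = 91.6e-6        # get_textpage(flags=0), no OCR
full_ocr_s = 341e-3      # get_textpage_ocr(flags=0, full=True, dpi=300)
partial_ocr_s = 242e-3   # get_textpage_ocr(flags=0, full=False)

full_ratio = full_ocr_s / plain_s
partial_ratio = partial_ocr_s / plain_s

print(f"full OCR:    about {full_ratio:.0f} times slower")
print(f"partial OCR: about {partial_ratio:.0f} times slower")
```

The exact factors will vary with the machine, the page content and the Tesseract setup, so they are best read as orders of magnitude.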

In contrast, text processing with a **previously generated** textpage takes about the same time whether or not that textpage was created by OCR:

In [24]:
# normal textpage
normal_tp = page.get_textpage(flags=0)
%timeit page.get_text(textpage=normal_tp)

6.49 µs ± 463 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [25]:
# using full page OCR
%timeit page.get_text(textpage=full_tp)

6.7 µs ± 434 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [26]:
# using partial page OCR
%timeit page.get_text(textpage=partial_tp)

7.63 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
