# Demonstrate the Effect of the MuPDF Dehyphenation Flag
This notebook demonstrates the impact of MuPDF's `TEXT_DEHYPHENATE` flag on extracted text and on the computed text bounding boxes.

First we import PyMuPDF and create an in-memory PDF containing some hyphenated text.

In [83]:
import fitz
text = (
    "This is a longer text with hyphena-\n"
    "ted words. It will be extracted using\n"
    "different flags.\n"
    "This will show, how bit settings influ-\n"
    "ence text bbox computations."
)

doc = fitz.open()
page = doc.new_page()
rect = page.rect + (72, 72, -72, -72)  # move text away from the borders
page.insert_textbox(rect, text, fontsize=24)
pdfbytes = doc.tobytes()  # save this PDF to memory

# recycle as memory PDF
doc = fitz.open("pdf", pdfbytes)

# we now have a PDF to play with
page = doc[0]

We start with a simple extraction that sets no special flags. Under normal conditions, this should reproduce the text line by line, as it is laid out in the file:

In [84]:
# using no special flags
print(page.get_text("text", flags=0))

This is a longer text with hyphena-
ted words. It will be extracted using
different flags.
This will show, how bit settings influ-
ence text bbox computations.



Now extract using the dehyphenation option.

In [85]:
# using the dehyphenation flag
print(page.get_text("text", flags=fitz.TEXT_DEHYPHENATE))

This is a longer text with hyphenated words. It will be extracted using
different flags.
This will show, how bit settings influence text bbox computations.



This time, every line that ends with a hyphen has been joined with the line that follows it. As part of this, the hyphen character is removed.
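
The joining rule seen in this output can be imitated in a few lines of plain Python. This is only a sketch of the visible behavior, not MuPDF's actual implementation, and `join_hyphenated_lines` is a hypothetical helper name that is not part of PyMuPDF.

```python
def join_hyphenated_lines(text):
    """Join every line ending in '-' with the following line, dropping the hyphen.

    Hypothetical helper: it only mimics the output seen above and does not
    reflect how MuPDF works internally.
    """
    joined = []
    pending = None  # start of a line whose trailing hyphen was dropped
    for line in text.split("\n"):
        if pending is not None:
            line = pending + line  # glue the continuation onto the previous line
            pending = None
        if line.endswith("-"):
            pending = line[:-1]  # drop the hyphen and wait for the next line
        else:
            joined.append(line)
    if pending is not None:
        joined.append(pending + "-")  # nothing follows, so keep the hyphen
    return "\n".join(joined)


sample = (
    "This is a longer text with hyphena-\n"
    "ted words. It will be extracted using\n"
    "different flags.\n"
    "This will show, how bit settings influ-\n"
    "ence text bbox computations."
)
joined = join_hyphenated_lines(sample)
print(joined)
```

Applied to the sample text, the five input lines collapse into three, matching the line breaks of the dehyphenated extraction.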

We now investigate the influence on text bounding boxes.

In [86]:
# using no special flags
for block in page.get_text("blocks", flags=0):
    bbox1 = fitz.Rect(block[:4])  # the block bbox
    print(bbox1)
    print(block[4])  # the block text

Rect(72.0, 71.99998474121094, 457.4399719238281, 246.77297973632812)
This is a longer text with hyphena-
ted words. It will be extracted using
different flags.
This will show, how bit settings influ-
ence text bbox computations.



In [87]:
# using the dehyphenation flag
for block in page.get_text("blocks", flags=fitz.TEXT_DEHYPHENATE):
    bbox2 = fitz.Rect(block[:4])  # the block bbox
    print(bbox2)
    print(block[4])  # the block text

Rect(72.0, 71.99998474121094, 449.4479675292969, 246.77297973632812)
This is a longer text with hyphenated words. It will be extracted using
different flags.
This will show, how bit settings influence text bbox computations.



In both cases, the **"blocks"** output contains the same text as the corresponding simple text extraction. Now we compare the two bboxes:

In [88]:
bbox1 - bbox2  # compute coordinate differences

Rect(0.0, 0.0, 7.99200439453125, 0.0)
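
The same coordinate-wise difference can be reproduced without PyMuPDF from the two rectangles printed earlier. This is a plain-tuple sketch for illustration. The variable names are made up here, and the numbers are copied from the outputs above.

```python
# (x0, y0, x1, y1) values copied from the two "blocks" outputs above
bbox_plain = (72.0, 71.99998474121094, 457.4399719238281, 246.77297973632812)
bbox_dehyph = (72.0, 71.99998474121094, 449.4479675292969, 246.77297973632812)

# subtract coordinate by coordinate, as the Rect subtraction above does
diff = tuple(a - b for a, b in zip(bbox_plain, bbox_dehyph))

# only the right edge x1 moved, so the widths differ by the same amount
width_plain = bbox_plain[2] - bbox_plain[0]
width_dehyph = bbox_dehyph[2] - bbox_dehyph[0]

print(diff)
print(width_plain - width_dehyph)
```

Only the third component (the right edge `x1`) is nonzero, so the two rectangles share their left, top and bottom edges.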

The "dehyphenated" bbox **may have a smaller width**, because hyphens at line ends are no longer included in the computation.
Let us draw both rectangles to see what caused the difference in our case.

In [89]:
from PIL import Image

page.draw_rect(bbox1, color=(0, 0, 1), width=2)  # bbox without special flags, in blue
page.draw_rect(bbox2, color=(1, 0, 0), width=0.3)  # dehyphenated bbox, in red
pix = page.get_pixmap()  # render the page to an RGB pixmap
img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
img.show()
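
As a rough cross-check, the width difference of about 7.992 points is consistent with the width of a single hyphen glyph. The sketch below assumes that the text was inserted with the default Helvetica font, and that the hyphen's advance width in the standard Helvetica metrics is 333 units per 1000-unit em. Both are assumptions, not values taken from this notebook.

```python
# Assumption: standard Helvetica metrics give the hyphen an advance width
# of 333 units on a 1000-unit em square.
HYPHEN_ADVANCE_UNITS = 333
UNITS_PER_EM = 1000

fontsize = 24  # the font size used when the text was inserted

# glyph width in points = advance units / units per em * font size
hyphen_width = HYPHEN_ADVANCE_UNITS / UNITS_PER_EM * fontsize

# difference of the two right edges (x1) printed in the outputs above
observed = 457.4399719238281 - 449.4479675292969

print(hyphen_width, observed)
```

Under these assumptions the two numbers agree up to floating-point rounding, which suggests that the wider bbox extends to the right by exactly one trailing hyphen.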
