No ligature preservation in searches
Specifying ligatures in search needles makes no sense. Leaving the TEXT_PRESERVE_LIGATURES flag bit unset causes text to be found both with and without ligatures.
JorjMcKie committed Jul 15, 2024
1 parent 3a259dc commit 2414c80
Showing 5 changed files with 17 additions and 3 deletions.
2 changes: 1 addition & 1 deletion docs/app1.rst
@@ -286,7 +286,7 @@ Text Extraction Flags Defaults
 ========================= ==== ==== ===== === ==== ======= ===== ====== ======
 Indicator                 text html xhtml xml dict rawdict words blocks search
 ========================= ==== ==== ===== === ==== ======= ===== ====== ======
-preserve ligatures        1    1    1     1   1    1       1     1      1
+preserve ligatures        1    1    1     1   1    1       1     1      0
 preserve whitespace       1    1    1     1   1    1       1     1      1
 preserve images           n/a  1    1     n/a 1    1       n/a   0      0
 inhibit spaces            0    0    0     0   0    0       0     0      0
2 changes: 1 addition & 1 deletion docs/vars.rst
@@ -262,7 +262,7 @@ The following constants represent the default combinations of the above for text

.. py:data:: TEXTFLAGS_SEARCH
-`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_DEHYPHENATE`
+`TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_DEHYPHENATE`


.. _linkDest Kinds:
1 change: 0 additions & 1 deletion src/__init__.py
@@ -13312,7 +13312,6 @@ def width(self):
 TEXTFLAGS_RAWDICT = TEXTFLAGS_DICT
 
 TEXTFLAGS_SEARCH = (0
-        | TEXT_PRESERVE_LIGATURES
         | TEXT_PRESERVE_WHITESPACE
         | TEXT_MEDIABOX_CLIP
         | TEXT_DEHYPHENATE
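The effect of dropping one bit from the default search flags can be sketched with a plain bitmask. This is only an illustration: the `TextFlag` class, the `preserves_ligatures` helper, and the numeric bit values below are assumptions made for this sketch, not names taken from the PyMuPDF source; real code should read the constants from `pymupdf` itself.

```python
from enum import IntFlag


class TextFlag(IntFlag):
    # Hypothetical stand-ins for the TEXT_* constants. The bit values are
    # assumed for illustration only; read the real ones from pymupdf.
    PRESERVE_LIGATURES = 1
    PRESERVE_WHITESPACE = 2
    DEHYPHENATE = 16
    MEDIABOX_CLIP = 64


# Default search flags before the commit: ligature bit included.
SEARCH_OLD = (
    TextFlag.PRESERVE_LIGATURES
    | TextFlag.PRESERVE_WHITESPACE
    | TextFlag.MEDIABOX_CLIP
    | TextFlag.DEHYPHENATE
)

# Default search flags after the commit: ligature bit left out.
SEARCH_NEW = (
    TextFlag.PRESERVE_WHITESPACE
    | TextFlag.MEDIABOX_CLIP
    | TextFlag.DEHYPHENATE
)


def preserves_ligatures(flags: TextFlag) -> bool:
    """Return True if the ligature-preservation bit is set in flags."""
    return bool(flags & TextFlag.PRESERVE_LIGATURES)


print(preserves_ligatures(SEARCH_OLD))
print(preserves_ligatures(SEARCH_NEW))
# A caller can still opt back in by OR-ing the bit onto the new default.
print(preserves_ligatures(SEARCH_NEW | TextFlag.PRESERVE_LIGATURES))
```

Because the flags are combined with bitwise OR, a caller who wants the earlier behavior can restore it explicitly, which is what the new test in this commit does with `TEXTFLAGS_SEARCH | TEXT_PRESERVE_LIGATURES`.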
Binary file added tests/resources/text-find-ligatures.pdf
Binary file not shown.
15 changes: 15 additions & 0 deletions tests/test_textsearch.py
@@ -7,13 +7,15 @@
Text search with 'clip' parameter - clip rectangle contains two occurrences
of searched text. Confirm search locations are inside clip.
"""

import os

import pymupdf

scriptdir = os.path.abspath(os.path.dirname(__file__))
filename1 = os.path.join(scriptdir, "resources", "2.pdf")
filename2 = os.path.join(scriptdir, "resources", "github_sample.pdf")
+filename3 = os.path.join(scriptdir, "resources", "text-find-ligatures.pdf")


def test_search1():
@@ -35,3 +37,16 @@ def test_search2():
     assert len(rl) == 2
     for r in rl:
         assert r in clip
+
+
+def test_search3():
+    """Ensure we find text whether or not it contains ligatures."""
+    doc = pymupdf.open(filename3)
+    page = doc[0]
+    needle = "flag"
+    hits = page.search_for(needle, flags=pymupdf.TEXTFLAGS_SEARCH)
+    assert len(hits) == 2  # all occurrences found
+    hits = page.search_for(
+        needle, flags=pymupdf.TEXTFLAGS_SEARCH | pymupdf.TEXT_PRESERVE_LIGATURES
+    )
+    assert len(hits) == 1  # only found text without ligatures
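Why the two hit counts in test_search3 differ can be illustrated without a PDF. The sketch below is not how MuPDF extracts or matches text; it is a standard-library analogy that assumes a hypothetical page holding the word once with the single-character ligature U+FB02 and once with the plain letters "f" and "l". The `count_hits` helper and the `page_words` list are invented for this illustration.

```python
import unicodedata

# Hypothetical page text: one occurrence stored with the ligature
# character U+FB02, one stored as the separate letters "f" + "l".
page_words = ["\ufb02ag", "flag"]
needle = "flag"


def count_hits(words, needle, preserve_ligatures):
    """Count words containing needle, expanding ligatures unless preserved."""
    hits = 0
    for word in words:
        if not preserve_ligatures:
            # NFKC normalization maps compatibility characters such as
            # U+FB02 to their plain-letter equivalents ("fl").
            word = unicodedata.normalize("NFKC", word)
        if needle in word:
            hits += 1
    return hits


# Ligatures expanded: both occurrences match the plain-letter needle.
print(count_hits(page_words, needle, preserve_ligatures=False))  # 2
# Ligatures preserved: only the plain-letter occurrence matches.
print(count_hits(page_words, needle, preserve_ligatures=True))   # 1
```

The counts mirror the assertions in the test: a needle typed with ordinary letters only matches ligature text once the ligature has been expanded.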
