Unable to disable font ligatures in insert_htmlbox method #4985

kishaningithub · 2026-05-03T04:08:08Z

kishaningithub
May 3, 2026

Description of the bug

I would like to disable font ligatures when creating a page with the insert_htmlbox method.

I tried the following CSS to disable them, but it had no effect:

 body {
   font-family:  'Calibri';
   font-size: 12pt;
   font-feature-settings: "liga" 0, "calt" 0 !important;
   font-variant-ligatures: none !important;
 }

Image

In the image, notice that the letters "ti" in the word "Introduction" are merged into a single glyph.

Impact

If a user copies the word from the PDF and pastes it into another app, it becomes "Introducon" (notice that the characters "ti" are dropped).
This also affects the search_for() function: when I pass "Introduction" as the parameter, i.e. page.search_for("Introduction"), it is unable to find the word.

How to reproduce the bug

Create a PDF using the insert_htmlbox method with the Calibri font and the content "Introduction".

PyMuPDF version

1.27.2.3

Operating system

MacOS

Python version

3.14

Answered by kishaningithub on May 13, 2026 (the full answer is the last comment in this thread).

JorjMcKie · 2026-05-03T10:22:11Z

JorjMcKie
May 3, 2026
Maintainer

Parsing of the HTML and CSS syntax in the text and css parameters is done by our base library MuPDF. As documented, this parsing does not cover the full respective syntaxes, so you will always find unsupported language expressions.

PyMuPDF has no way of influencing this. All you can do is contact the MuPDF team in this Discord channel.

However, even if your request were implemented, it would not solve your problem, because "ti" is not an official ligature: there is no Unicode definition for it.
The problem is caused by the font itself.

Text searching in PyMuPDF has no problem with ligatures: they will be automatically decomposed, letting you successfully find e.g. "fi" - whether present as separate characters or as ligature "ﬁ".
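To illustrate the difference between an official ligature and the "ti" case, here is a minimal sketch using only the Python standard library. It is not how PyMuPDF decomposes ligatures internally; it only shows that Unicode defines a small block of Latin ligature code points (U+FB00 to U+FB06) with a compatibility decomposition, and that "ti" is not among them. The helper name `decompose` is made up for this example.

```python
import unicodedata


def decompose(text: str) -> str:
    """Replace compatibility characters (e.g. ligatures) by their plain equivalents."""
    return unicodedata.normalize("NFKC", text)


# The Latin ligatures that Unicode defines in the Alphabetic Presentation Forms block.
for code_point in range(0xFB00, 0xFB07):
    ligature = chr(code_point)
    print(f"U+{code_point:04X} {unicodedata.name(ligature)} -> {decompose(ligature)!r}")

# A word written with the single-character ligature U+FB01 becomes searchable text.
assert decompose("\ufb01nd") == "find"
```

Because a font-specific "ti" glyph has no such code point, nothing comparable can be looked up for it; the font itself would have to provide the mapping back to "t" and "i".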
I reproduced your observation with this snippet:

import pymupdf
from pathlib import Path

calibri = Path("calibri.ttf").read_bytes()

arch = pymupdf.Archive(calibri, "calibri")
css = "@font-face {font-family: sans-serif; src: url(calibri);}"
css += "* {font-family: sans-serif}"

doc = pymupdf.open()
page = doc.new_page()
rect = (200, 200, 400, 400)
text = "Introduction"
page.insert_htmlbox(rect, text, archive=arch, css=css)
print(page.get_text())  # extract with auto-replacing unknown Unicodes
print(page.get_text(flags=0))  # extract suppressing replacement
# doc.ez_save("test.pdf")

The output:

In other words: the font contains no back-translation information for this pseudo-ligature.

My recommendation: Use a different font!

0 replies

JorjMcKie · 2026-05-04T10:01:50Z

JorjMcKie
May 4, 2026
Maintainer

You can of course use page.insert_textbox(). This is a character-based output method and thus avoids the use of HarfBuzz. The result will be as you expect it.

0 replies

kishaningithub · 2026-05-04T10:55:39Z

kishaningithub
May 4, 2026
Author

@JorjMcKie My use case is the following:

I have a page (a table of contents page) created using the insert_htmlbox method, and the rest of the PDF's contents come from merging several other PDFs.

I would like to create links (preferably with the <a> tag) to other pages. I tried the following:

for toc_entry in chunk:
    title = toc_entry[1].strip()
    page_no = toc_entry[2]
    rl = page.search_for(title)
    print(f"toc_entry: {toc_entry}")
    if rl:
        link = {
            "kind": pymupdf.LINK_GOTO,
            "from": rl[0],
            "to": toc_entry[3]["to"],
            "page": page_no,
        }
        page.insert_link(link)

but the above did not work as expected, because search_for was not detecting strings like "Introduction".

Any pointers would be much appreciated.

0 replies

JorjMcKie · 2026-05-04T11:06:11Z

JorjMcKie
May 4, 2026
Maintainer

You could give up using .insert_htmlbox() altogether and use the Story object directly. The insert_htmlbox method also makes use of it under the hood.

If you use Story, you can let it create TOCs automatically. It also takes care of fitting content to page rectangles and much more. This would free you from the pesky problems above.

Using Story basically means that you prepare the complete content in one string, including the required HTML / CSS styling, tables, and any images to be embedded. Then use one of the Story methods to create the PDF output.

0 replies

JorjMcKie · 2026-05-04T11:08:01Z

JorjMcKie
May 4, 2026
Maintainer

BTW - You are aware of Calibri license restrictions that may apply to you? This is not a free font.

0 replies

kishaningithub · 2026-05-04T11:43:22Z

kishaningithub
May 4, 2026
Author

@JorjMcKie How can I solve the linking problem using the Story object? I could not get that working. Any pointers to examples would be great.

0 replies

JorjMcKie · 2026-05-04T12:17:12Z

JorjMcKie
May 4, 2026
Maintainer

Use a method like this one to get a PDF Document object in which the links contained in your source text have been correctly resolved.

0 replies

JorjMcKie · 2026-05-05T12:08:32Z

JorjMcKie
May 5, 2026
Maintainer

Let me know how you get on with this!

0 replies

kishaningithub · 2026-05-05T12:46:31Z

kishaningithub
May 5, 2026
Author

Definitely, @JorjMcKie. I am currently wondering whether there is a way to define the internal address in the href attribute of the <a> tag. The href attribute should specify both the page number and the "y" location on the page, so that when the user clicks the link on the first page it goes to the exact location.

3 replies

JorjMcKie May 5, 2026
Maintainer

The problem is not only the font, but more fundamentally the PDF standard itself. If I understand your use case correctly, you have multiple PDFs (which are made using the problem font Calibri) for which you want to create a PDF which represents an overall Table of Contents, right?

JorjMcKie May 5, 2026
Maintainer

When defining a PDF link to a different PDF, you cannot, by the PDF standard, specify the location on some page of the target PDF. Only the target page number is possible.

Another idea:
If the target PDFs already have their own TOC, then a much simpler solution is possible.

kishaningithub May 5, 2026
Author

@JorjMcKie First of all, thanks a lot for giving me such detailed support here!

To add context: all the different PDFs are merged into a single giant PDF using the insert_pdf method. The following code merges the individual (chapter-wise) PDFs into one. Since the individual PDFs already contain proper table of contents metadata, this code simply merges that into a master table of contents and puts it into the "giant PDF".

import os

import pymupdf


def merge_all_pdf_files(files: list[str], pdf_file: pymupdf.Document) -> None:
    master_toc = []
    for file in files:
        with pymupdf.open(file) as chapter_file:
            # Pages merged so far; shifts this chapter's page numbers.
            offset = pdf_file.page_count
            pdf_file.insert_pdf(chapter_file)
            for toc in chapter_file.get_toc(simple=False):
                title = toc[1].strip()
                exact_dest = toc[3]  # destination details of the entry
                exact_dest["page"] = exact_dest["page"] + offset
                if title:
                    master_toc.append([toc[0], title, toc[2] + offset, exact_dest])
        os.remove(file)  # delete the chapter file once it has been merged
    pdf_file.set_toc(master_toc)

So what we have is perfect table of contents metadata with the exact links. My challenge is rendering it as the first page of the PDF.

JorjMcKie · 2026-05-05T12:49:57Z

JorjMcKie
May 5, 2026
Maintainer

Let's move this to the Discussions tab first.

0 replies

JorjMcKie · 2026-05-05T21:32:33Z

JorjMcKie
May 5, 2026
Maintainer

I think you could simply do the following:

Make a new page in the resulting "giant" PDF. This new page could be the last page in the document at first. We can later correct this.
Then start writing the TOC lines into it. Add more pages if required.
When done, move this / these TOC pages to the front of the PDF. All references will still point to correct positions, because the mechanism is not dependent on page numbers but on page object references.

2 replies

JorjMcKie May 5, 2026
Maintainer

There is Document.move_page() for changing the sequence of pages.

kishaningithub May 6, 2026
Author

Thanks! What is the best practice for creating internal links? Should it be done programmatically via methods like insert_link, or via an <a href> tag when I am using the Story abstraction?

kishaningithub · 2026-05-13T07:38:46Z

kishaningithub
May 13, 2026
Author

@JorjMcKie I was able to solve this problem. Of course it needs some fine-tuning; I am just sharing the idea here, and it works :-)

The trick

When writing links, I use an actual URL with the following format:

https://example.org/{page_no},{point_x},{point_y}

I am able to get the above info by calling chapter_file.get_toc(simple=False)

After the save operation is done, I open the PDF again and rewrite the links acquired from page.get_links().

I open the same file again after save(output) because page.get_links() does not return the links directly after page.insert_htmlbox. I tried page.refresh(), but no luck there so far.
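As a rough sketch of how such placeholder URLs could be generated, the function below turns TOC entries into HTML lines suitable for insert_htmlbox. The function name `toc_to_html`, the indentation styling and the sample entry are made up for illustration, and the destination point is represented as a plain (x, y) tuple here.

```python
from html import escape


def toc_to_html(toc: list) -> str:
    """Render [level, title, page_no, dest] entries as links with placeholder URLs."""
    lines = []
    for level, title, page_no, dest in toc:
        x, y = dest["to"]  # destination point on the target page
        href = f"https://example.org/{page_no},{x},{y}"
        indent = (level - 1) * 1.5  # indent sub-levels a little
        lines.append(
            f'<p style="margin-left: {indent}em">'
            f'<a href="{href}">{escape(title)}</a></p>'
        )
    return "\n".join(lines)


# Hypothetical entry shaped like the output of get_toc(simple=False).
sample_toc = [[1, "Introduction", 1, {"to": (72.0, 36.0)}]]
toc_html = toc_to_html(sample_toc)
print(toc_html)
```

The resulting string would be passed as the text argument of page.insert_htmlbox().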

Code

# Run the following right after the document's save(output) call
with pymupdf.open(output) as chapter_file:
    # no_of_pages is the number of pages occupied by the table of contents
    for i in range(no_of_pages):
        page = chapter_file[i]
        for link in page.get_links():
            # Only touch the placeholder links created by
            # <a href="https://example.org/{page_no},{point_x},{point_y}">
            if link["kind"] != pymupdf.LINK_URI:
                continue
            page.delete_link(link)
            # extracts {page_no},{point_x},{point_y}
            location = TocEntry.decode_location(link["uri"].split("/")[-1])
            new_link = {
                "kind": pymupdf.LINK_GOTO,
                "from": link["from"],
                "page": location.page_no + no_of_pages - 1,
                "to": location.to,
            }
            # inserting the link with the "exact goto jump"
            page.insert_link(new_link)
    # Overwrite the same file
    chapter_file.saveIncr()

# Helper classes

from dataclasses import dataclass

import pymupdf


@dataclass
class Location:
    page_no: int
    to: pymupdf.Point


@dataclass
class TocEntry:
    level: int
    title: str
    page_no: int
    sno: int
    to: pymupdf.Point  # destination point on the target page (indexed as to[0], to[1])

    def encode_location(self) -> str:
        return f"{self.page_no},{self.to[0]},{self.to[1]}"

    @staticmethod
    def decode_location(encoded_location: str) -> Location:
        page_no, x, y = encoded_location.split(",")
        # Convert explicitly: the values arrive as strings taken from the URL.
        return Location(page_no=int(page_no), to=pymupdf.Point(float(x), float(y)))

1 reply

JorjMcKie May 13, 2026
Maintainer

Congratulations!
Thanks for sharing your approach here!
