Unable to disable font ligatures in insert_htmlbox method #4985
Replies: 12 comments 6 replies
Parsing the HTML and CSS syntax in the text and css parameters is done by our base library MuPDF. As documented, this parsing does not cover the full respective syntaxes, so you will always find unsupported language expressions. PyMuPDF has no way of influencing this. All you can do is contact the MuPDF team in this Discord channel.

However, even if your request were implemented, it would not solve your problem, because "ti" is not an official ligature: no Unicode definition exists for it.

Text searching in PyMuPDF has no problem with ligatures: they are automatically decomposed, letting you successfully find e.g. "fi", whether it is present as separate characters or as the ligature "fi".

import pymupdf
from pathlib import Path
calibri = Path("calibri.ttf").read_bytes()
arch = pymupdf.Archive(calibri, "calibri")
css = "@font-face {font-family: sans-serif; src: url(calibri);}"
css += "* {font-family: sans-serif}"
doc = pymupdf.open()
page = doc.new_page()
rect = (200, 200, 400, 400)
text = "Introduction"
page.insert_htmlbox(rect, text, archive=arch, css=css)
print(page.get_text()) # extract with auto-replacing unknown Unicodes
print(page.get_text(flags=0)) # extract suppressing replacement
# doc.ez_save("test.pdf")

The output:

In other words: the font contains no back-translation information for this pseudo-ligature. My recommendation: use a different font!
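
To make the "no official ligature" point concrete, here is a small sketch using only Python's standard `unicodedata` module (not PyMuPDF). It assumes that the Latin ligatures with their own Unicode code points are the ones in the range U+FB00 to U+FB06; the variable names are illustrative only. It shows that "fi" has a code point with a decomposition back to plain letters, while no code point in that range decomposes to "ti".

```python
import unicodedata

# U+FB01 is the standard "fi" ligature. Unicode assigns it a compatibility
# decomposition, so NFKC normalization turns it back into two letters.
fi_ligature = "\ufb01"
print(unicodedata.name(fi_ligature))
decomposed = unicodedata.normalize("NFKC", fi_ligature)
print(decomposed, len(decomposed))

# Map every Latin ligature code point in U+FB00..U+FB06 to its decomposition.
latin_ligatures = {
    chr(cp): unicodedata.normalize("NFKC", chr(cp))
    for cp in range(0xFB00, 0xFB07)
}

# None of them decomposes to "ti", so a "ti" glyph is purely a font feature.
has_ti = "ti" in latin_ligatures.values()
print(has_ti)
```

A font that draws "ti" as one glyph therefore has to supply its own mapping back to the two characters, which is what appears to be missing in this case.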
You can of course use |
@JorjMcKie My use case is the following: I have a page (table of contents page) created using the I would like to create links (preferably with

for toc_entry in chunk:
    title = toc_entry[1].strip()
    page_no = toc_entry[2]
    # search_for returns a list of rectangles where the title was found
    rl = page.search_for(title)
    print(f"toc_entry: {toc_entry}")
    if rl:
        link = {
            "kind": pymupdf.LINK_GOTO,
            "from": rl[0],  # use the first hit as the clickable area
            "to": toc_entry[3]["to"],
            "page": page_no,
        }
        page.insert_link(link)

but the above did not work as expected because Any pointers would be much appreciated.
You could give up using If you use Using Story basically means that you prepare the complete content in one string, including the required HTML / CSS styling, tables, and any images to be embedded. Then use one of the Story methods to create the PDF output.
BTW: are you aware of the Calibri license restrictions that may apply to you? This is not a free font.
@JorjMcKie How can I solve the linking problem using the Story object? I could not get that working. Any pointers to examples would be great.
Use a method like this one to get a PDF Document object in which the links contained in your source text have been correctly resolved.
Let me know how you get on with this!
Definitely, @JorjMcKie. I am currently wondering whether there is a way I can define the internal address in the
Let's move this to the Discussions tab first.
I think you could simply do the following:
@JorjMcKie I was able to solve this problem. Of course it still needs fine-tuning; I am just sharing the idea here, and it works :-)

The trick

When writing links, I use an actual URL with the following format:

https://example.org/{page_no},{point_x},{point_y}

I get the above information by calling chapter_file.get_toc(simple=False). After the save operation is done, I open the PDF again and rewrite the links acquired from page.get_links(). I reopen the same file after save(output) because page.get_links() does not return the links immediately after page.insert_htmlbox. I tried page.refresh(), but no luck so far.

Code

# Run the following right after the document has been saved via save(output)
with pymupdf.open(output) as chapter_file:
    for i in range(no_of_pages):
        page = chapter_file[i]
        links = page.get_links()
        for link in links:
            # delete the links created by <a href="https://example.org/{page_no},{point_x},{point_y}" />
            page.delete_link(link)
            if link["kind"] == pymupdf.LINK_URI:
                # extract {page_no},{point_x},{point_y} from the last URL segment
                location = TocEntry.decode_location(link["uri"].split("/")[-1])
                new_link = {
                    "kind": pymupdf.LINK_GOTO,
                    "from": link["from"],
                    # no_of_pages is the number of pages occupied by the table of contents
                    "page": location.page_no + no_of_pages - 1,
                    "to": location.to,
                }
                # insert the link with the exact "goto" jump target
                page.insert_link(new_link)
    # overwrite the same file
    chapter_file.saveIncr()

# Helper classes
from dataclasses import dataclass

import pymupdf


@dataclass
class Location:
    page_no: int
    to: pymupdf.Point


@dataclass
class TocEntry:
    level: int
    title: str
    page_no: int
    sno: int
    to: dict  # indexed as to[0] and to[1] below, so a point-like value is expected

    def encode_location(self) -> str:
        return f"{self.page_no},{self.to[0]},{self.to[1]}"

    @staticmethod
    def decode_location(encoded_location: str) -> Location:
        values = encoded_location.split(",")
        # convert the coordinates explicitly before building the Point
        return Location(page_no=int(values[0]), to=pymupdf.Point(float(values[1]), float(values[2])))
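
As part of the fine-tuning mentioned above, it may help to recognize only the placeholder links, since the loop as posted deletes every link on the page and tries to decode every URI link. The following standard-library sketch shows the encode/decode round trip with such a guard. `BASE_URL`, `encode_href`, and `decode_href` are hypothetical names, and plain floats stand in for `pymupdf.Point` so the snippet runs without PyMuPDF.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

BASE_URL = "https://example.org/"  # placeholder host used in the post


@dataclass
class Location:
    page_no: int
    x: float
    y: float


def encode_href(page_no: int, x: float, y: float) -> str:
    """Build the placeholder URL that is written into the HTML."""
    return f"{BASE_URL}{page_no},{x},{y}"


def decode_href(uri: str) -> Optional[Location]:
    """Return the encoded Location, or None if uri is not a placeholder link."""
    if not uri.startswith(BASE_URL):
        return None
    parts = urlsplit(uri).path.lstrip("/").split(",")
    if len(parts) != 3:
        return None
    try:
        return Location(int(parts[0]), float(parts[1]), float(parts[2]))
    except ValueError:
        return None


href = encode_href(12, 72.0, 145.5)
print(href)
print(decode_href(href))
print(decode_href("https://pymupdf.readthedocs.io/"))
```

In the rewrite loop, a link would then only be deleted and replaced when `decode_href(link["uri"])` returns a `Location`, leaving genuine external links untouched.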