Export text as text in page.getSVGimage() #580

Closed
cipri-tom opened this issue Aug 3, 2020 · 12 comments
Labels: enhancement, resolved (fixed / implemented / answered)


@cipri-tom

Is your feature request related to a problem? Please describe.
Sometimes I want to get the SVG of a page, as it allows me to analyse the contents a bit better than the PDF. I currently have a method of doing it, but I noticed that all glyphs of a font are done as a <path>, which is then referenced multiple times. This makes it a bit difficult to disambiguate between paths that correspond to text, and those which do not.
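To illustrate the ambiguity described above, here is a minimal sketch that separates glyph references from ordinary drawing paths. The sample SVG is hypothetical and heavily simplified, and the sketch assumes that every `<symbol>` wraps a glyph outline that is referenced via `<use>`; real output may be structured differently, which is exactly why this heuristic is fragile.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical, simplified SVG in the text-as-path style:
# one glyph outline defined once, referenced twice, plus one ordinary path.
SAMPLE_SVG = f"""<svg xmlns="{SVG_NS}" xmlns:xlink="{XLINK_NS}">
  <defs>
    <symbol id="font_1_36"><path d="M0 0L1 0L1 1Z"/></symbol>
  </defs>
  <use xlink:href="#font_1_36" x="10" y="20"/>
  <use xlink:href="#font_1_36" x="18" y="20"/>
  <path d="M0 50H100" stroke="black"/>
</svg>"""


def classify_paths(svg_text):
    """Count glyph references versus free-standing paths in an SVG string."""
    root = ET.fromstring(svg_text)
    symbols = list(root.iter(f"{{{SVG_NS}}}symbol"))
    # Assumption: every <symbol> holds a glyph outline.
    glyph_ids = {sym.get("id") for sym in symbols}
    glyph_uses = sum(
        1
        for use in root.iter(f"{{{SVG_NS}}}use")
        if use.get(f"{{{XLINK_NS}}}href", "").lstrip("#") in glyph_ids
    )
    # Paths nested inside a <symbol> are glyph outlines; the rest are drawings.
    glyph_paths = sum(len(list(sym.iter(f"{{{SVG_NS}}}path"))) for sym in symbols)
    all_paths = len(list(root.iter(f"{{{SVG_NS}}}path")))
    return {"glyph_uses": glyph_uses, "other_paths": all_paths - glyph_paths}


print(classify_paths(SAMPLE_SVG))
```

With text exported as `<text>` elements, no such guessing would be needed.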

Describe the solution you'd like
I noticed that it is possible to export to SVG while keeping the text as text by passing a flag to fz_new_svg_device(). That flag is currently hardcoded:

PyMuPDF/fitz/fitz.i

Lines 3198 to 3201 in 12d0201

dev = fz_new_svg_device(gctx, out,
                        tbounds.x1-tbounds.x0, // width
                        tbounds.y1-tbounds.y0, // height
                        FZ_SVG_TEXT_AS_PATH, 1);

Describe alternatives you've considered
Compile PyMuPDF from source and make the modification on my side. But I'm not sure which file to modify in PyMuPDF, as the source seems to be in a huge fitz.i file, which looks like the output of a compiler?

@JorjMcKie
Collaborator

Thanks for bringing this up!
Shouldn't be difficult to add the corresponding option to Page.getSVGimage(). Any suggestion for the option name? text_as_path=True?

If you want to generate a modified PyMuPDF source yourself, you need SWIG. Invoke it like so:

swig -python fitz.i

It will output fitz.py and fitz_wrap.c. All the other *.i files are automatically included by this. The repo also has a setup.py which (hopefully!) automatically does this.

@cipri-tom
Author

text_as_path=False sounds good to me 😁 . Unless there are some downsides in defaulting to False such as needing to embed font ?

Thanks for bringing up SWIG. I'll look into it !

@JorjMcKie
Collaborator

> If you want to generate a modified PyMuPDF source yourself, you need SWIG.

... and of course an installed MuPDF!

@JorjMcKie
Collaborator

> Unless there are some downsides in defaulting to False such as needing to embed font ?

Don't know yet. Should be easy to check that out.

@JorjMcKie
Collaborator

> Unless there are some downsides in defaulting to False such as needing to embed font ?

Change already done - easy peasy.

Indeed: using the text-as-text option creates a much smaller file, but correct display in a browser depends on the presence of the fonts named in the SVG.
As you have noticed, text-as-path in contrast synthesizes every single character from elementary draw commands. Here are example outputs rendered with a standard Firefox installation on Windows. MS Edge doesn't look any better, though.

Text

File size 72 KB.

[image: browser rendering of the text-as-text SVG]

Path

File size 652 KB.

[image: browser rendering of the text-as-path SVG]

Conclusion

The text-as-path option represents the current situation. Changing this would require people to change scripts if they need previous behaviour.
On the other hand: this function is probably rarely used.
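For reference, a minimal sketch of how the two flavours could be compared from a script. It assumes the option ends up being called `text_as_path`, as proposed earlier in this thread, and that `getSVGimage()` returns the SVG as a string. The helper name `svg_sizes` and the file name in the comment are hypothetical.

```python
def svg_sizes(page):
    """Return the UTF-8 byte size of both SVG flavours of one page.

    `page` is assumed to be a PyMuPDF page whose getSVGimage() accepts
    the text_as_path keyword discussed in this thread.
    """
    as_path = page.getSVGimage(text_as_path=True)   # glyphs drawn as paths
    as_text = page.getSVGimage(text_as_path=False)  # glyphs kept as text
    return {
        "path": len(as_path.encode("utf-8")),
        "text": len(as_text.encode("utf-8")),
    }


# Possible usage, assuming a PyMuPDF build that contains the change:
#   import fitz
#   doc = fitz.open("input.pdf")   # hypothetical file name
#   print(svg_sizes(doc[0]))
```

Keeping `text_as_path=True` as the default means existing scripts that call `getSVGimage()` without arguments keep their current output.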

@cipri-tom
Author

Oh, wow ! that "fixed width" font is anything but "fixed width" 😂 .

> Changing this would require people to change scripts if they need previous behaviour.

Indeed, I didn't realise. A quick search on GitHub shows that it isn't used much, but there are still a couple of cases, so it's not a good idea to introduce such a breaking change. It's OK to default to True then.

Thanks a lot for the speed of getting it!

@cipri-tom
Author

> If you want to generate a modified PyMuPDF source yourself, you need SWIG.
>
> ... and of course an installed MuPDF!

By installed MuPDF, do you mean from source? Or is there a muPDF-dev package that I can install more easily 😁 (Ubuntu 18.04)

I'd really like to experiment with this new text without waiting for the new release :">

@JorjMcKie
Collaborator

JorjMcKie commented Aug 3, 2020

> Oh, wow ! that "fixed width" font is anything but "fixed width" 😂 .

That's because the browser doesn't understand that the word "Courier" within the font's name should lead it to choose a fixed-width font. If you take the SVG source and replace that font name with just "Courier", everything immediately looks better. That is what I was referring to.
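The manual fix described above could be scripted roughly like this. This is only a sketch: it assumes the SVG carries the font name in a `font-family="..."` attribute, and the font name in the sample is made up for illustration.

```python
import re

FONT_FAMILY = re.compile(r'font-family="([^"]*)"')


def simplify_font_family(svg_text, keyword="Courier"):
    """Replace every font-family value containing `keyword` with `keyword`."""

    def _swap(match):
        name = match.group(1)
        if keyword.lower() in name.lower():
            return f'font-family="{keyword}"'
        return match.group(0)  # leave unrelated fonts untouched

    return FONT_FAMILY.sub(_swap, svg_text)


# Hypothetical font name, for illustration only.
sample = '<text font-family="ABCDEF+CourierNewPSMT" font-size="10">x</text>'
print(simplify_font_family(sample))
```

Whether the browser then picks a suitable fixed-width font still depends on the fonts installed on the viewing machine.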

> Or is there a muPDF-dev package that I can install more easily 😁 (Ubuntu 18.04).

Don't know if there is such a thing. It must be v1.17.0 at any rate. But go ahead and try installing by the recipe given in the repo folder "installation". It is an easy thing to do.
If you give me your Python version (again), I can also point you to a wheel to download a pre-version. Matter of minutes.

@cipri-tom
Author

> But go ahead and try installing by the recipe given in the repo folder "installation". It is an easy thing to do.

🤦 Can't believe I haven't checked the root of this repo for installation. Thank you!

> If you give me your Python version (again), I can also point you to a wheel to download a pre-version. Matter of minutes.

Ah, that would save me a bit of trouble. Thanks a lot ! It's Python 3.8.3 on an Ubuntu box, 64 bit I suppose.

@JorjMcKie
Collaborator

good luck - hope I haven't introduced some f**up

@cipri-tom
Author

Works beautifully so far 😁 . Thanks a lot !

@JorjMcKie added the "resolved (fixed / implemented / answered)" label on Aug 3, 2020