How to copy text/code from PDF with indentation (using PyMuPDF)? #509

Closed
utmcontent opened this issue May 19, 2020 · 1 comment

@utmcontent

I want to copy some code from a PDF and paste it into a text editor, but the code indentation is lost. Is it possible to generate a PDF containing code that can be copied and pasted with correct indentation, so we don't need to retype the indentation by hand?
Here is an example that cannot be copied and pasted with its indentation:
copy_instance
copy
exxxx
lss

@JorjMcKie
Collaborator

JorjMcKie commented May 19, 2020

Hm ... not out of the box.
First of all, you would have to use one of the Page.getText(option, flags=nnn) variants.

Second, it all depends on how the text in the PDF is encoded: if indentation is encoded as spaces, you are fine to just use the output of page.getText().
If tabs are used instead, modify the flags parameter so that whitespace is preserved (TEXT_PRESERVE_WHITESPACE).

If all else fails, use a method that also provides text position information. As program code generally uses a monospaced font, the start position of every text piece can be translated into a unique number of spaces to prefix it with. To make this a bit clearer (hopefully):

  • page.getText("dict")["blocks"] is a list of dictionaries.
  • Each item represents a text block (think of it as a paragraph). It contains a list of sub-dictionaries, each of which represents a line.
  • Each line in turn contains a list of sub-dictionaries called "spans". A span contains text with identical font properties, so for plain program code a line would contain just one span.
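
To illustrate the nesting described above, here is a minimal sketch that walks blocks, lines and spans. It runs on a hand-built dictionary standing in for the output of page.getText("dict"), so the sample values and the helper name iter_spans are assumptions made for illustration, not part of PyMuPDF. Newer PyMuPDF releases spell the method page.get_text.

```python
# Hand-built stand-in for the result of page.getText("dict").
# Real output carries more keys (font, size, color, ...); only those used here are shown.
page_dict = {
    "blocks": [
        {"type": 1, "bbox": (72.0, 20.0, 200.0, 80.0)},  # image block: no "lines" key
        {
            "type": 0,  # text block
            "lines": [
                {"bbox": (72.0, 100.0, 126.0, 112.0),
                 "spans": [{"text": "def f(x):", "bbox": (72.0, 100.0, 126.0, 112.0)}]},
                {"bbox": (96.0, 112.0, 168.0, 124.0),
                 "spans": [{"text": "return x + 1", "bbox": (96.0, 112.0, 168.0, 124.0)}]},
            ],
        },
    ]
}


def iter_spans(page_dict):
    """Yield (block_no, line_no, span) for every span of every text block."""
    for block_no, block in enumerate(page_dict["blocks"]):
        # Image blocks have no "lines" entry, so fall back to an empty list.
        for line_no, line in enumerate(block.get("lines", [])):
            for span in line["spans"]:
                yield block_no, line_no, span


for block_no, line_no, span in iter_spans(page_dict):
    # bbox is (x0, y0, x1, y1); x0 is what carries the indentation.
    print(block_no, line_no, span["bbox"][0], repr(span["text"]))
```

Note that the second line's text carries no leading spaces: its indentation is only visible in the larger x0 value.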

Step one: determine which text x-coordinate represents column 0. This is the minimum of the x0 coordinate of the line (or span) bboxes.
Step two: determine the (constant!) width of one character. Take any span, divide its bbox width by its character count.
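
The two steps could be sketched as follows. The helper names left_margin and char_width and the sample coordinates are assumptions for illustration. The sketch also assumes a monospaced font and non-empty spans.

```python
# Sample lines shaped like those inside a page.getText("dict") text block.
# Coordinates are made up: a monospaced font with 6-point-wide characters.
lines = [
    {"bbox": (72.0, 100.0, 126.0, 112.0),
     "spans": [{"text": "def f(x):", "bbox": (72.0, 100.0, 126.0, 112.0)}]},
    {"bbox": (96.0, 112.0, 168.0, 124.0),
     "spans": [{"text": "return x + 1", "bbox": (96.0, 112.0, 168.0, 124.0)}]},
]


def left_margin(lines):
    """Step one: the smallest x0 of all line bboxes is taken as column 0."""
    return min(line["bbox"][0] for line in lines)


def char_width(span):
    """Step two: bbox width divided by character count (monospaced font assumed)."""
    x0, _, x1, _ = span["bbox"]
    return (x1 - x0) / len(span["text"])


col0 = left_margin(lines)
width = char_width(lines[0]["spans"][0])
print(col0, width)
```

On real pages, it may be safer to measure a long span, since rounding in the bbox has less effect there.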

After this, loop through the spans and output each span["text"] prefixed with the correct number of spaces, as determined by the x0 coordinate of the span bbox.
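
A minimal sketch of that loop, assuming column 0 and the character width are already known (72.0 and 6.0 in this made-up sample). The function names are hypothetical, and this is not the attached script.

```python
def indent_for(x0, col0, width):
    """Translate a span's x0 coordinate into a number of leading spaces."""
    # round() absorbs small floating-point noise in the coordinates.
    return max(round((x0 - col0) / width), 0)


def render_spans(spans, col0, width):
    """Return each span's text prefixed with the spaces implied by its x0."""
    return [" " * indent_for(s["bbox"][0], col0, width) + s["text"] for s in spans]


# Made-up spans: 6-point-wide characters, left margin at x = 72.
spans = [
    {"text": "for i in range(3):", "bbox": (72.0, 100.0, 180.0, 112.0)},
    {"text": "if i:", "bbox": (96.0, 112.0, 126.0, 124.0)},
    {"text": "print(i)", "bbox": (120.0, 124.0, 168.0, 136.0)},
]

print("\n".join(render_spans(spans, col0=72.0, width=6.0)))
```

This treats every span as its own output line, which only holds when each line consists of a single span.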

Here is a script to start with:
code-printer.zip

Note:
I output text lines instead of spans, because program code may be colored (see pygments), which produces (see above) a separate span each time the color changes. So I concatenate the spans of each line ...
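
The per-line variant might look like the sketch below. It is an independent illustration on hand-built data, not the contents of code-printer.zip. The name render_lines and the sample values are assumptions.

```python
def render_lines(page_dict, col0, width):
    """Join the spans of each line, then indent the line by its bbox x0."""
    out = []
    for block in page_dict["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            # Colored code splits a line into several spans; glue them back together.
            text = "".join(span["text"] for span in line["spans"])
            indent = max(round((line["bbox"][0] - col0) / width), 0)
            out.append(" " * indent + text)
    return "\n".join(out)


# Made-up page: the second line is split into two spans, as syntax coloring would do.
page_dict = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"bbox": (72.0, 100.0, 126.0, 112.0),
                 "spans": [{"text": "def f(x):", "bbox": (72.0, 100.0, 126.0, 112.0)}]},
                {"bbox": (96.0, 112.0, 168.0, 124.0),
                 "spans": [
                     {"text": "return", "bbox": (96.0, 112.0, 132.0, 124.0)},
                     {"text": " x + 1", "bbox": (132.0, 112.0, 168.0, 124.0)},
                 ]},
            ],
        }
    ]
}

print(render_lines(page_dict, col0=72.0, width=6.0))
```

Using the line bbox rather than the span bboxes means the indentation is computed once per line, however many spans the coloring produced.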
