How to copy text/code from PDF with indentation(using pymupdf)? #509

utmcontent · 2020-05-19T09:04:36Z

I want to copy some code from a PDF and paste it into a text editor, but the code indentation is lost. Is it possible to generate a PDF containing code that can be copied and pasted with correct indentation, so we don't need to retype the indentation by hand?
Here is an example that cannot be copied and pasted with indentation:

JorjMcKie · 2020-05-19T11:32:05Z

Hm ... not out of the box.
First of all, you would have to use one of the Page.getText(option, flags=nnn) variants.

Second, it all depends on how the text in the PDF is encoded: if indentation is encoded as spaces, you are fine to just use the output of page.getText().
If tabs are used instead, modify the flags parameter such that white spaces are preserved (TEXT_PRESERVE_WHITESPACE).

If all else fails, use a method which also provides text position information. As program code generally uses a monospaced font, the start position of every text piece can be translated into a unique number of spaces to prefix it with. To make this a bit clearer (hopefully):

`page.getText("dict")["blocks"]` is a list of dictionaries.
Each item represents a text block (think of it as a paragraph). It contains a list of sub-dictionaries, each of which represents a line.
Each line in turn contains a list of sub-dictionaries called "spans". A span contains text with identical font properties. ~~So in case of a program, a line would just contain one span.~~
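The nesting described above can be walked with a few lines of Python. This is only a sketch: `iter_lines` and `sample_page` are names made up for illustration, and the sample dictionary is a hand-made stand-in with just the keys used here, not real PyMuPDF output. With a real document you would pass in the result of `page.getText("dict")` (newer PyMuPDF releases spell the method `get_text`).

```python
def iter_lines(page_dict):
    """Yield (x0, text) for each text line of a getText("dict")-shaped result."""
    for block in page_dict["blocks"]:
        # Image blocks have no "lines" entry, so fall back to an empty list.
        for line in block.get("lines", []):
            # A line may be split into several spans (e.g. by syntax coloring).
            text = "".join(span["text"] for span in line["spans"])
            yield line["bbox"][0], text


# Hand-made stand-in for page.getText("dict"); not real PyMuPDF output.
sample_page = {
    "blocks": [
        {"type": 1, "bbox": (72, 20, 200, 60)},  # image block, skipped
        {
            "type": 0,
            "lines": [
                {"bbox": (72, 100, 126, 112),
                 "spans": [{"text": "def f(x):"}]},
                {"bbox": (96, 114, 168, 126),
                 "spans": [{"text": "return "}, {"text": "x + 1"}]},
            ],
        },
    ]
}

for x0, text in iter_lines(sample_page):
    print(x0, repr(text))
```

Each yielded pair carries the left edge of the line and its joined text, which is the information needed to work out indentation.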

Step one: determine which text x-coordinate represents column 0. This is the minimum of the x0 coordinate of the line (or span) bboxes.
Step two: determine the (constant!) width of one character. Take any span, divide its bbox width by its character count.

After this, loop through the spans and output each span["text"] prefixed with the number of spaces determined by the x0 coordinate of the span bbox.
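The two steps and the final loop might look roughly like the sketch below. It is not the attached script: `reindent` and `sample_page` are hypothetical names, the sample dictionary is hand-made rather than real PyMuPDF output, and the code assumes a single monospaced font at one size. It works on whole lines (spans joined) and takes the character width from the longest line to reduce rounding error.

```python
def reindent(page_dict):
    """Rebuild leading spaces from x0 positions in a getText("dict")-shaped result."""
    lines = []  # entries are (x0, bbox width, text)
    for block in page_dict["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            text = "".join(span["text"] for span in line["spans"])
            if text.strip():
                x0, _, x1, _ = line["bbox"]
                lines.append((x0, x1 - x0, text))
    if not lines:
        return ""

    # Step one: column 0 is the smallest x0 on the page.
    left = min(x0 for x0, _, _ in lines)

    # Step two: character width = bbox width / character count.
    _, width, longest = max(lines, key=lambda item: len(item[2]))
    char_width = width / len(longest)

    # Prefix each line with the number of spaces implied by its x0.
    out = []
    for x0, _, text in lines:
        indent = round((x0 - left) / char_width)
        out.append(" " * indent + text)
    return "\n".join(out)


# Hand-made stand-in for page.getText("dict"); each character is 5 units wide.
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"bbox": (50, 100, 140, 110),
                 "spans": [{"text": "for i in range(3):"}]},
                {"bbox": (70, 112, 95, 122),
                 "spans": [{"text": "if i:"}]},
                {"bbox": (90, 124, 130, 134),
                 "spans": [{"text": "print"}, {"text": "(i)"}]},
            ],
        }
    ]
}

print(reindent(sample_page))
```

If the PDF already encodes indentation as leading spaces inside the text, the line bbox should start at column 0 and the computed prefix stays empty, so the existing spaces are kept as they are.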

Here is a script to start with:
code-printer.zip

Note:
I output text lines instead of spans because program code may be colored (see pygments), which produces (see above) a separate span each time the color changes. So I concatenate the spans for each line ...

utmcontent added the question label May 19, 2020

utmcontent assigned JorjMcKie May 19, 2020

JorjMcKie closed this as completed May 23, 2020
