Multi-column PDF file text extraction

Hello,
I am reaching out regarding my recent experience with pymupdf4llm. I have a PDF file that was created from a PowerPoint presentation, and I am attempting to extract specific text elements from it.

PDF content:
Text 1
    - sub text 1.1
    - sub text 1.2

Text 2
    - sub text 2.1
    - sub text 2.2

I am currently using the following code to read the PDF file:

```
import pymupdf4llm

# filename: path to the PDF exported from PowerPoint
all_pages_pdf = pymupdf4llm.to_markdown(filename, page_chunks=True)
for page in all_pages_pdf:
    page_number = page['metadata']['page']
    page_content = page['text']
    print(page_number)
    print(page_content)
```

Actual output with v0.0.10:
Text 1
Text 2

- sub text 1.1
- sub text 1.2

- sub text 2.1
- sub text 2.2

However, I am aiming for the following desired output:
Text 1
- sub text 1.1
- sub text 1.2

Text 2
- sub text 2.1
- sub text 2.2
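
The desired order corresponds to reading each column top to bottom before moving on to the next column. Below is a minimal sketch of that reordering. It assumes the page is laid out as two side-by-side columns of equal width and that text blocks are available as `(x0, y0, x1, y1, text)` tuples. The helper `order_blocks_by_column` and the sample coordinates are hypothetical illustrations, not part of pymupdf4llm.

```python
def order_blocks_by_column(blocks, page_width, n_columns=2):
    """Return block texts column by column, top to bottom within a column.

    blocks: iterable of tuples whose first five items are
            (x0, y0, x1, y1, text).
    Assumption: columns are equally wide and span the full page width.
    """
    column_width = page_width / n_columns

    def sort_key(block):
        x0, y0 = block[0], block[1]
        # Clamp so blocks outside the page bounds still land in a valid column.
        column = max(0, min(int(x0 // column_width), n_columns - 1))
        return (column, y0, x0)

    return [block[4] for block in sorted(blocks, key=sort_key)]


# Hypothetical block positions mimicking a two-column slide.
sample_blocks = [
    (50, 40, 300, 60, "Text 1"),
    (400, 40, 650, 60, "Text 2"),
    (80, 80, 300, 100, "- sub text 1.1"),
    (80, 110, 300, 130, "- sub text 1.2"),
    (430, 80, 650, 100, "- sub text 2.1"),
    (430, 110, 650, 130, "- sub text 2.2"),
]

ordered = order_blocks_by_column(sample_blocks, page_width=720)
print("\n".join(ordered))
```

With PyMuPDF, tuples of this shape could presumably come from `page.get_text("blocks")`, with `page.rect.width` as the page width. Whether a fixed equal-width split fits a real slide layout is an open assumption.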

I would appreciate any guidance or assistance in achieving the desired output.
Thank you for your attention to this matter.

multi column pdf file text extraction #78
