multi column pdf file text extraction #78

@sanketpatel91

Hello,
I am reaching out regarding my recent experience with pymupdf4llm. I have a PDF file that was created from a PowerPoint presentation, and I am attempting to extract specific text elements from it.

PDF content:
Text 1
- sub text 1.1
- sub text 1.2

Text 2
- sub text 2.1
- sub text 2.2

I am currently using the following code to read the PDF file:

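import pymupdf4llm

# filename is the path to the PDF exported from PowerPoint
all_pages_pdf = pymupdf4llm.to_markdown(filename, page_chunks=True)
for page in all_pages_pdf:
    page_number = page['metadata']['page']  # page number from the chunk metadata
    page_content = page['text']  # Markdown text extracted for this page
    print(page_number)
    print(page_content)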

Actual output with v0.0.10:
Text 1
Text 2

  • sub text 1.1

  • sub text 1.2

  • sub text 2.1

  • sub text 2.2

However, this is the output I want:
Text 1

  • sub text 1.1
  • sub text 1.2

Text 2

  • sub text 2.1
  • sub text 2.2
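
To make the desired reading order concrete, here is a minimal sketch in plain Python. It assumes the page has exactly two columns split at the horizontal middle, and that each text block is available as an (x0, y0, x1, y1, text) tuple (for example, derived from PyMuPDF's page.get_text("blocks")). The function name order_blocks_by_column and the coordinates below are hypothetical and not part of pymupdf4llm.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): bounding box plus the text of one block
Block = Tuple[float, float, float, float, str]


def order_blocks_by_column(blocks: List[Block], page_width: float) -> List[str]:
    """Return block texts left column first, then right column, each top to bottom."""
    middle = page_width / 2
    # Assign each block to a column by the horizontal center of its bounding box.
    left = [b for b in blocks if (b[0] + b[2]) / 2 < middle]
    right = [b for b in blocks if (b[0] + b[2]) / 2 >= middle]
    # Within a column, read from top to bottom (ascending y0).
    ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
    return [b[4] for b in ordered]


# Made-up coordinates that mimic the slide layout described above.
blocks = [
    (50, 40, 300, 60, "Text 1"),
    (400, 40, 650, 60, "Text 2"),
    (60, 80, 300, 100, "sub text 1.1"),
    (60, 110, 300, 130, "sub text 1.2"),
    (410, 80, 650, 100, "sub text 2.1"),
    (410, 110, 650, 130, "sub text 2.2"),
]

for text in order_blocks_by_column(blocks, page_width=720):
    print(text)
```

With these sample blocks, the loop prints Text 1 and its two sub-items first, followed by Text 2 and its two sub-items, which is the order shown above. A fixed split at the page middle is only an illustration; real slides may need column boundaries detected from the block positions.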

I would appreciate any guidance or assistance in achieving the desired output.
Thank you for your attention to this matter.
