When extracting text by line, how to extract table structures and content? #2658

nissansz · 2023-09-10T01:33:39Z

When extracting text line by line, how can I also extract table structures and their content?
Table1.pdf

For example, I would like to get the table structure as something like HTML code.

JorjMcKie · 2023-09-10T08:10:46Z

Maintainer

Using the example file above:

In [1]: import fitz
In [2]: doc=fitz.open("Table1.pdf")
In [3]: page=doc[0]
In [4]: tabs = page.find_tables()  # detect the tables
In [5]: len(tabs.tables)
Out[5]: 1
In [6]: tab = tabs[0]
In [7]: for e in tab.extract():
   ...:     print(e)
   ...:
['大撒大撒 1', 'we are1', '大丈夫です 1', '큰 스프레드 1', '特色 1']
['大撒大撒 2', 'we are2', '大丈夫です 2', None, '特色 2']
['大撒大撒 3', 'we are3', '大丈夫です 3', '큰 스프레드 3', '特色 3']
['大撒大撒 4', 'we are4', '大丈夫です 4', '큰 스프레드 4', '特色 4']
['大撒大撒 5', 'we are5', '大丈夫です 5', '큰 스프레드 5', None]
['大撒大撒 6', 'we are6', '大丈夫です 6', '큰 스프레드 6', '特色 6']
In [8]:
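
Since the question asked for something like HTML, here is a minimal sketch of turning the extracted rows into an HTML table. It assumes the rows have the shape shown above (a list of lists of strings, with `None` for empty cells). `rows_to_html` is a hypothetical helper written for this illustration, not part of PyMuPDF.

```python
from html import escape


def rows_to_html(rows, first_row_is_header=False):
    """Render rows (lists of str or None) as a plain HTML table string."""
    lines = ["<table>"]
    for index, row in enumerate(rows):
        # Use <th> only when the caller says the first row holds column names.
        tag = "th" if first_row_is_header and index == 0 else "td"
        # None marks an empty cell in the extracted rows; render it as empty text.
        cells = "".join(f"<{tag}>{escape(cell or '')}</{tag}>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Sample rows copied from the output above (shortened to two columns).
sample_rows = [
    ["大撒大撒 1", "we are1"],
    ["大撒大撒 2", None],
]
print(rows_to_html(sample_rows))
```

With a real document, the result of `tab.extract()` from the session above can be passed in place of `sample_rows`.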

1 reply

Gerlah Dec 13, 2023

Hi, I used the same code you supplied and received the error "AttributeError: 'Page' object has no attribute 'find_tables'". When I search fitz, I don't find such a method either. What could the problem be?

julian-smith-artifex-com · 2023-12-13T19:11:53Z

Maintainer

Closing this because, on Discord, the cause was found to be an old release of PyMuPDF.
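
For anyone hitting the same AttributeError, a quick way to check whether the installed release is recent enough is to test for the method directly. This is a sketch only: `supports_find_tables` and `check_installed` are hypothetical helper names, and the upgrade hint assumes a pip-based install. To my knowledge `find_tables` first shipped in the 1.23 series, but the check below does not rely on that.

```python
import importlib


def supports_find_tables(module):
    """Return True if the module's Page class has a find_tables method."""
    page_cls = getattr(module, "Page", None)
    return page_cls is not None and hasattr(page_cls, "find_tables")


def check_installed(name="fitz"):
    """Describe whether the installed PyMuPDF offers Page.find_tables()."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return f"{name} is not installed"
    # VersionBind may be missing in some releases, so fall back gracefully.
    version = getattr(module, "VersionBind", "unknown")
    if supports_find_tables(module):
        return f"PyMuPDF {version}: Page.find_tables() is available"
    return f"PyMuPDF {version}: no Page.find_tables(); try 'pip install --upgrade pymupdf'"


print(check_installed())
```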
