False result when finding bounding boxes for lines in blocks. #3581

Closed
khaled-hammoud opened this issue Jun 14, 2024 · 6 comments
Labels
not a bug not a bug / user error / unable to reproduce

Comments

@khaled-hammoud

Description of the bug

Hi, I am using the fitz module (PyMuPDF) to extract the bounding boxes around text. I extracted them and then plotted them with matplotlib to see how they look and how correct they are for my use case.

I opened the same PDF file in Foxit Editor, toggled the object editor for text, and selected all text in order to see how the bounding boxes are shown in Foxit. They are exactly what a human reader would expect. I then compared that with the result from matplotlib and fitz and found that the bboxes from fitz are almost the same, but there are two wrong cases:

1- Two columns are handled as if they were a single column (illustrated with one big and one small red arrow).
2- Three cells are handled as if they were a single cell (illustrated with three equal red arrows in the image).

Please see the illustration for the comparison. I tried to annotate it and to write down the legend and my problem. If any further information is needed, please ask.

Hint: Some of the non-Latin words used in the document are in an RTL language, specifically Arabic.

How to reproduce the bug

This is the PDF file:
North 02_Minieh_Record 01.pdf

This is my Python script, with the illustration below for comparison:

import matplotlib.pyplot as plt
import fitz

#auxiliary function to plot a closed polygon
def plot_poly(x1, y1, x2, y2, color = 'k', linewidth = 1):
    plt.plot(
        [x1, x2, x2, x1, x1], #four x vertices and closed
        [y1, y1, y2, y2, y1], #four y vertices and closed
        color = color,
        linewidth = linewidth
    )


pdf_file = "North 02_Minieh_Record 01.pdf"

with fitz.open(pdf_file) as doc:
    for page in doc:        
        dic = page.get_text('dict')

        #the text
        for block in dic['blocks']:
            if block['type'] == 0: #Text type
                for line in block['lines']:

                    X1, Y1, X2, Y2 = line['bbox'] #bbox of each line
                    plot_poly(X1, Y1, X2, Y2, color = 'r')

                    ''' #irrelevant for code now
                    if line['dir'][1] == -1: #Rotated text
                        angle = 90
                    else: #Normal text orientation
                        angle = 0
                    '''

                    for span in line['spans']:
                        size = span['size']
                        font = span['font']
                        text = span['text']
                        x0, y0 = span['origin']
                        plt.plot(x0, y0,'o', markersize = 1, color = 'b') #origin x, y of each span

        #the layout
        #this block is optional; it draws the table rules to make the plot easier to read
        drawings = page.get_drawings()
        for drawing in drawings:
            for item in drawing['items']:
                if item[0] != 're': #only rectangle items carry a 4-value rect
                    continue
                Xr1, Yr1, Xr2, Yr2 = item[1]
                width = Xr2 - Xr1
                height = Yr2 - Yr1
                #keep only thin rectangles, i.e. horizontal or vertical rules
                if (width > 1 and height < 1) or (width < 1 and height > 1):
                    plot_poly(Xr1, Yr1, Xr2, Yr2)

ax = plt.gca()
ax.invert_yaxis()
plt.show()
[Image: comparison]

PyMuPDF version

1.24.1

Operating system

Windows

Python version

3.9

@khaled-hammoud
Author

ADDED:

I have now used rawdict instead of dict as the option for page.get_text(), so the script gives better results for those two columns, but case number 2 is still there:

import matplotlib.pyplot as plt
import fitz

#auxiliary function to plot a closed polygon
def plot_poly(x1, y1, x2, y2, color = 'k', linewidth = 1):
    plt.plot(
        [x1, x2, x2, x1, x1], #four x vertices and closed
        [y1, y1, y2, y2, y1], #four y vertices and closed
        color = color,
        linewidth = linewidth
    )


pdf_file = "North 02_Minieh_Record 01.pdf"

with fitz.open(pdf_file) as doc:
    for page in doc:        
        dic = page.get_text('rawdict')

        #the text
        for block in dic['blocks']:
           
            if block['type'] == 0: #Text type
                for line in block['lines']:

                    X1, Y1, X2, Y2 = line['bbox'] #bbox of each line
                    #plot_poly(X1, Y1, X2, Y2, color = 'r') #no need for it now as previous code

                    ''' #irrelevant for code now
                    if line['dir'][1] == -1: #Rotated text
                        angle = 90
                    else: #Normal text orientation
                        angle = 0
                    '''

                    for span in line['spans']:
                        ascender = span['ascender']
                        descender = span['descender']
                        size = span['size']
                        font = span['font']
                        #text = span['text']

                        x0, y0 = span['origin']
                        plt.plot(x0, y0,'o', markersize = 1, color = 'b') #origin x, y of each span
                        
                        spx1, spy1, spx2, spy2 = span['bbox']
                        plot_poly(spx1, spy1, spx2, spy2, color = 'r')

        
        #the layout
        #this block is optional; it draws the table rules to make the plot easier to read
        drawings = page.get_drawings()
        for drawing in drawings:
            for item in drawing['items']:
                if item[0] != 're': #only rectangle items carry a 4-value rect
                    continue
                Xr1, Yr1, Xr2, Yr2 = item[1]
                width = Xr2 - Xr1
                height = Yr2 - Yr1
                #keep only thin rectangles, i.e. horizontal or vertical rules
                if (width > 1 and height < 1) or (width < 1 and height > 1):
                    plot_poly(Xr1, Yr1, Xr2, Yr2)
        

ax = plt.gca()
ax.invert_yaxis()
plt.show()

Case 2 remains a problem:
[Image: case2]

@JorjMcKie
Collaborator

I think there is a basic misconception here:
PyMuPDF text extraction does not care at all whether pieces of text are in table cells or not!
Whether text particles are regarded as being in the same line is decided based on criteria like font size, inter-character and inter-word distances, and so on.

In other words, your red arrows are not a bug.
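
To make the point concrete, here is a toy model of line building. It is only a sketch under stated assumptions and not MuPDF's actual algorithm: the helper name `group_into_lines`, the `y_tol` parameter and the rule "same baseline as the previous span means same line" are all hypothetical. What it shows is that table rules never enter the decision, only the positions of the text.

```python
def group_into_lines(spans, y_tol=1.0):
    """Toy line builder (hypothetical, NOT MuPDF's real logic).

    spans: dicts with "text" and "origin" (x, y), in content-stream order.
    A span joins the current line if its baseline y is within y_tol of the
    previous span's baseline; otherwise it starts a new line.
    """
    lines = []
    prev_y = None
    for span in spans:
        y = span["origin"][1]
        if prev_y is not None and abs(y - prev_y) <= y_tol:
            lines[-1].append(span["text"])
        else:
            lines.append([span["text"]])
        prev_y = y
    return lines


# Three cell texts on one baseline: the toy model returns a single line.
same_baseline = [
    {"text": "A", "origin": (50, 100)},
    {"text": "B", "origin": (150, 100)},
    {"text": "C", "origin": (250, 100)},
]
print(group_into_lines(same_baseline))  # [['A', 'B', 'C']]

# Middle text nudged down by 3 points: now three separate lines.
nudged = [
    {"text": "A", "origin": (50, 100)},
    {"text": "B", "origin": (150, 103)},
    {"text": "C", "origin": (250, 100)},
]
print(group_into_lines(nudged))  # [['A'], ['B'], ['C']]
```

No cell border appears anywhere in the input, which is the situation text extraction is in.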

JorjMcKie added the "not a bug" label Jun 14, 2024
@JorjMcKie
Collaborator

JorjMcKie commented Jun 14, 2024

It looks like you want to locate / extract text from table cells.
This is supported, but you have to identify / find the table via page.find_tables() and then use the table-related methods to extract text from single cells.
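
A minimal sketch of that route, assuming a PyMuPDF version that provides `page.find_tables()`. The helper names `table_rows` and `print_tables` are made up for illustration; `find_tables()`, the `.tables` list of the returned finder and `Table.extract()` are the PyMuPDF calls being relied on.

```python
def table_rows(page):
    """Return one entry per detected table: its rows as lists of cell texts.

    `page` is expected to be a PyMuPDF page. Cell entries come straight
    from Table.extract() and may be empty for cells without text.
    """
    finder = page.find_tables()  # table detection on this page
    return [table.extract() for table in finder.tables]


def print_tables(pdf_path):
    """Hypothetical driver: print every table row of every page."""
    import fitz  # PyMuPDF, imported here so table_rows() stays importable without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for rows in table_rows(page):
                for row in rows:
                    print(row)
```

Calling `print_tables("North 02_Minieh_Record 01.pdf")` would then list the cell texts row by row, provided the tables on the page are detected.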

@khaled-hammoud
Author

khaled-hammoud commented Jun 15, 2024

Thanks for the reminder! Yes, I am already familiar with page.find_tables(), but I am doing something else.

So if I understand you correctly, you mean that those three adjacent cells are treated as one cell because their text may share the same line y-coordinate and meet the other criteria, right?

In the illustration below, just for testing purposes, I used Foxit Editor to select the text of the middle cell and slide it vertically a little bit. Then I applied fitz to see the result: this time the text was handled as three separate cells instead of one single cell.

Can I, for example, pass some parameter to get_text(), like a distance tolerance between words, so that anything farther apart than that is handled as another cell?

[Image: Untitled]

@JorjMcKie
Collaborator

No, it seems I haven't been clear enough:
Text extraction knows nothing whatsoever about cells. All it knows about is text. Whether pieces of text are joined to form one span is decided independently of whether there are any lines, background colors, or anything else.
Just imagine that the page contains no lines, no background, nothing of that sort. Only text is present.

I made a picture for you. This is what text extraction sees:
[Image]
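
If one box per cell is needed without going through page.find_tables(), one possible post-processing step is to cut the extracted text at the vertical table rules yourself. The sketch below assumes character dicts shaped like `span['chars']` from `page.get_text('rawdict')` (keys `c` and `bbox`) and a list of rule x-coordinates collected separately, for example from `page.get_drawings()`. The helper `split_span_at_rules` is hypothetical and not part of PyMuPDF.

```python
from bisect import bisect_right


def split_span_at_rules(chars, rule_xs):
    """Group rawdict-style chars into columns bounded by vertical rules.

    chars:   list of {"c": str, "bbox": (x0, y0, x1, y1)}
    rule_xs: x-coordinates of vertical rules (assumed input)
    Returns (text, bbox) pairs, one per non-empty column, left to right.
    """
    rule_xs = sorted(rule_xs)
    columns = {}  # column index -> [text, [x0, y0, x1, y1]]
    for ch in chars:
        x0, y0, x1, y1 = ch["bbox"]
        # The column is the number of rules left of the char's centre.
        col = bisect_right(rule_xs, (x0 + x1) / 2)
        if col not in columns:
            columns[col] = [ch["c"], [x0, y0, x1, y1]]
        else:
            entry = columns[col]
            entry[0] += ch["c"]
            box = entry[1]
            box[0] = min(box[0], x0)
            box[1] = min(box[1], y0)
            box[2] = max(box[2], x1)
            box[3] = max(box[3], y1)
    return [(text, tuple(box)) for _, (text, box) in sorted(columns.items())]


# One span whose characters straddle two vertical rules at x=100 and x=200.
chars = [
    {"c": "A", "bbox": (10, 0, 20, 10)},
    {"c": "B", "bbox": (20, 0, 30, 10)},
    {"c": "C", "bbox": (110, 0, 120, 10)},
    {"c": "D", "bbox": (210, 0, 220, 12)},
]
cells = split_span_at_rules(chars, [100, 200])
print(cells)
```

Grouping is done by each character's horizontal centre, so the sketch does not depend on reading direction; within a column, characters simply keep the order in which they appear in the span.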

@khaled-hammoud
Author

Thank you so much, now it is clearer to me what you mean.
