query lines, words, paragraphs, blocks get error no text returned #249

Open
flashpixx opened this issue Feb 24, 2021 · 5 comments

Comments

@flashpixx

Hello,

I am trying to get the boxes of all lines, words, paragraphs, blocks and symbols, but on the second call I get the error "No text returned". I have written a method to iterate over all boxes:

    def _boxes(api, element: tesserocr.RIL) -> Iterable[ElementData]:
        l_boxes = api.GetComponentImages(element, text_only=True)
        for i, (im, box, _, _) in enumerate(l_boxes):
            # restrict recognition to the box of the current component
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            yield ElementData(
                id=self._uuid,  # self is defined outside this snippet
                index=i,
                text=api.GetUTF8Text().strip(),
                confidence=ConfidenceData(
                    mean=api.MeanTextConf(),
                    token=[
                        TokenConfidenceData(text=word, confidence=conf)
                        for word, conf in api.MapWordConfidences()
                    ],
                ),
                bounding_box=box,
            )

ElementData is a dataclass for storing the data, and api is the API reference. I call this function with:

with PyTessBaseAPI() as api:
    api.SetImage(image)

    for i in _boxes(api, RIL.TEXTLINE):
        ...  # store i in a database

    # At this point I get the error; the first loop works as expected.
    for i in _boxes(api, RIL.WORD):
        ...  # store i in a database

    for i in _boxes(api, RIL.PARA):
        ...  # store i in a database

    for i in _boxes(api, RIL.BLOCK):
        ...  # store i in a database

    for i in _boxes(api, RIL.SYMBOL):
        ...  # store i in a database

The input to SetImage is a PIL image of a single page (a JPEG file). The page image has a header, a footer with some text, multiple paragraphs, and multiple lines with words. How can I do this?

@flashpixx changed the title from "query" to "query lines, words, paragraphs, blocks get error no text returned" on Feb 24, 2021
@sirfz
Owner

sirfz commented Feb 25, 2021

I'm not an expert on Tesseract's API, but is it possible that the SetRectangle call limits the detection area to that box, so that the next call operates on the last SetRectangle area from the first _boxes call? Just something to look into.

@flashpixx
Author

Yes, I agree, but how can I reset the rectangle after each call?

@flashpixx
Author

flashpixx commented Mar 2, 2021

I have added the following before each loop:

api.SetRectangle(0, 0, *image.size)

image is the Pillow image instance, and size returns the width and height of the image in pixels. In general this works: I get boxes for words, symbols, lines, etc. However, the ordering does not seem to be correct. For example, the word boxes are not in the order of the original text, so although I get all the words, I cannot recreate the original text by concatenating them.

My goal is to get the whole text at different levels of box detail.
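
The workaround described above can be sketched as a small helper. This is only a minimal sketch under stated assumptions: `reset_rectangle` and `boxes_per_level` are hypothetical names, `api` is assumed to be a `PyTessBaseAPI` on which `SetImage` has already been called, and `image` is assumed to be the Pillow image that was passed to it.

```python
def reset_rectangle(api, image):
    """Point the API back at the whole page.

    Pillow's image.size is a (width, height) tuple in pixels.
    """
    width, height = image.size
    api.SetRectangle(0, 0, width, height)


def boxes_per_level(api, image, levels):
    """Return {level: [box, ...]}, resetting the rectangle before each level."""
    result = {}
    for level in levels:
        reset_rectangle(api, image)
        # GetComponentImages returns (image, box, block_id, paragraph_id)
        # tuples; box is a dict with x, y, w and h keys.
        components = api.GetComponentImages(level, True)
        result[level] = [box for _, box, _, _ in components]
    return result
```

With tesserocr this might be called as `boxes_per_level(api, image, [RIL.TEXTLINE, RIL.WORD])`. It only collects the boxes for each level and does nothing about the order in which they are returned.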

@bertsky
Contributor

bertsky commented Jul 2, 2021

I also see your use of SetRectangle as the culprit. The API doc says:

Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image.

You want to use that function before triggering layout analysis or recognition, not afterwards. Since you are already using the page iterator (via GetComponentImages), you only need to loop over results on all hierarchy levels. (I also recommend restructuring your main loop so that it follows the natural RIL recursion.)
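
Looping over the results at each level, without calling SetRectangle in between, might look roughly like the sketch below. It assumes `api` is a `PyTessBaseAPI` with the full page image already set, and that the iterator returned by `GetIterator` walks the results in reading order. `iter_level` and `collect_levels` are hypothetical helper names, not part of tesserocr.

```python
def iter_level(api, level):
    """Yield (text, confidence, bounding_box) for each element at `level`.

    Assumes recognition has already run on the full image, so the
    existing results are read back instead of being recomputed.
    """
    it = api.GetIterator()
    if it is None:  # assumed to mean that there are no results
        return
    while True:
        if not it.Empty(level):  # skip elements that carry no text
            yield (
                it.GetUTF8Text(level).strip(),
                it.Confidence(level),
                it.BoundingBox(level),  # (left, top, right, bottom)
            )
        if not it.Next(level):
            break


def collect_levels(api, levels):
    """Recognize once, then read every requested level from the same results."""
    api.Recognize()
    return {level: list(iter_level(api, level)) for level in levels}
```

A call such as `collect_levels(api, [RIL.BLOCK, RIL.PARA, RIL.TEXTLINE, RIL.WORD, RIL.SYMBOL])` would then return one list per level. Because every level is read from the same recognition pass, the rectangle never needs to be reset.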

The use case for SetRectangle across levels is if you have an external segmentation of the image into regions, paragraphs, lines or words. (Which would also entail using SetPageSegMode(PSM.SINGLE_COLUMN) / SINGLE_BLOCK / SINGLE_LINE / SINGLE_WORD.)
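
For that external-segmentation case, a minimal sketch could look like this. It assumes `regions` is a list of `(x, y, w, h)` tuples produced by some other tool, and that `psm` is the matching page segmentation mode (for example `PSM.SINGLE_LINE` for line boxes). `recognize_regions` is a hypothetical name.

```python
def recognize_regions(api, regions, psm):
    """OCR each externally supplied (x, y, w, h) region and return the texts."""
    api.SetPageSegMode(psm)
    texts = []
    for x, y, w, h in regions:
        # Each SetRectangle call clears earlier results and restricts
        # recognition to this region of the image already set on the API.
        api.SetRectangle(x, y, w, h)
        texts.append(api.GetUTF8Text().strip())
    return texts
```

Here SetRectangle is called before recognition is triggered for each region, which is the order recommended above.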

@bertsky
Contributor

bertsky commented Jul 2, 2021

@sirfz, again, the problem is already in the usage example of the current README:

tesserocr/README.rst

Lines 181 to 187 in 711cbab

boxes = api.GetComponentImages(RIL.TEXTLINE, True)
print('Found {} textline image components.'.format(len(boxes)))
for i, (im, box, _, _) in enumerate(boxes):
    # im is a PIL image object
    # box is a dict with x, y, w and h keys
    api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
    ocrResult = api.GetUTF8Text()
