query lines, words, paragraphs, blocks get error no text returned #249

flashpixx · 2021-02-24T22:15:24Z

Hello,

I am trying to get the boxes of all lines, words, paragraphs, blocks and symbols, but on the second call I get the error "No text returned". I have written a method to iterate over all boxes:

    def _boxes(api, element: tesserocr.RIL) -> Iterable[ElementData]:
        l_boxes = api.GetComponentImages(element, text_only=True)
        for i, (im, box, _, _) in enumerate(l_boxes):
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            yield ElementData(
                id=self._uuid,
                index=i,
                text = self._ocr.api.GetUTF8Text().strip(),
                confidence=ConfidenceData(
                    mean=self._ocr.api.MeanTextConf(),
                    token=[TokenConfidenceData(text=i[0], confidence=i[1]) for i in api.MapWordConfidences()]
                ),
                bounding_box=box
            )

ElementData is a dataclass for storing the data, and api is the API reference. I call this function with:

    with PyTessBaseAPI() as api:
        api.SetImage(image)

        for i in _boxes(api, RIL.TEXTLINE):
            pass  # store i in a database

        # >>! at this point I get an error; the first loop works as expected
        for i in _boxes(api, RIL.WORD):
            pass  # store i in a database

        for i in _boxes(api, RIL.PARA):
            pass  # store i in a database

        for i in _boxes(api, RIL.BLOCK):
            pass  # store i in a database

        for i in _boxes(api, RIL.SYMBOL):
            pass  # store i in a database

The input to SetImage is a PIL image of a single page (a JPEG file). The page has a header, a footer with some text, multiple paragraphs and multiple lines with words. How can I do this?


sirfz · 2021-02-25T15:17:03Z

I'm not an expert with tesseract's API, but is it possible that the SetRectangle call limits the detection area to that box, so that the next call operates on the last SetRectangle area from the first _boxes call? Just something to look into.

flashpixx · 2021-02-25T19:26:13Z

Yes, I agree, but how can I reset the rectangle after each call?

flashpixx · 2021-03-02T17:46:02Z

I have added this before each loop:

    api.SetRectangle(0, 0, *image.size)

image is the Pillow image instance, and size returns the width and height of the image in pixels. It works in general: I get boxes for words, symbols, lines, etc. But the ordering does not seem to be correct. For example, the word boxes are not in the order of the original text, so although I get all the words, I cannot recreate the original text by concatenating them.

My goal is to get the whole text in different box detail levels

bertsky · 2021-07-02T18:35:45Z

I also see your use of SetRectangle as the culprit. The API doc says:

Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image.

You want to use that function before triggering layout analysis or recognition, not afterwards. Since you are already using the page iterator (via GetComponentImages), you only need to loop over results on all hierarchy levels. (I also recommend restructuring your main loop so that it follows the natural RIL recursion.)
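To illustrate looping over the results on all hierarchy levels without touching SetRectangle, here is a minimal sketch. It assumes the result iterator returned by `api.GetIterator()` offers `GetUTF8Text(level)`, `Confidence(level)`, `BoundingBox(level)` and `Next(level)`, as tesserocr's iterator classes do. The helper names `walk`, `collect_levels` and `main` are made up for this example, and elements for which no text is returned are simply skipped.

```python
from typing import Any, Dict, Iterator, List


def walk(iterator: Any, level: Any) -> Iterator[Dict[str, Any]]:
    """Yield text, confidence and bounding box for each element at `level`."""
    if iterator is None:  # nothing was recognized
        return
    while True:
        try:
            text = iterator.GetUTF8Text(level)
        except RuntimeError:  # raised as "No text returned" for empty elements
            text = None
        if text is not None:
            yield {
                "text": text.strip(),
                "confidence": iterator.Confidence(level),
                "box": iterator.BoundingBox(level),  # (left, top, right, bottom)
            }
        if not iterator.Next(level):
            break


def collect_levels(api: Any, levels: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Recognize the full image once, then read the results at every level."""
    api.Recognize()  # one recognition pass; no SetRectangle afterwards
    # A fresh iterator per level, so each walk starts at the top of the page.
    return {name: list(walk(api.GetIterator(), level)) for name, level in levels.items()}


def main(path: str) -> None:
    # Imported here so the helpers above stay usable without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL

    levels = {"block": RIL.BLOCK, "para": RIL.PARA, "line": RIL.TEXTLINE,
              "word": RIL.WORD, "symbol": RIL.SYMBOL}
    with PyTessBaseAPI() as api:
        api.SetImageFile(path)
        for name, items in collect_levels(api, levels).items():
            print(name, len(items))
```

Calling `main("page.jpg")` (the file name is a placeholder) would print the number of elements found per level. Because all levels are read from the same recognition result, the elements should come back in the iterator's own order.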

The use case for SetRectangle across levels is if you have an external segmentation of the image into regions, paragraphs, lines or words. (Which would also entail using SetPageSegMode(PSM.SINGLE_COLUMN) / SINGLE_BLOCK / SINGLE_LINE / SINGLE_WORD.)
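For that external-segmentation case, a rough sketch could look like the following. It assumes the boxes come from some other tool as (x, y, width, height) tuples; the function names and the example coordinates are hypothetical. The page segmentation mode is set first, and SetRectangle is then called before each recognition, not after it.

```python
from typing import Any, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height


def recognize_regions(api: Any, boxes: Iterable[Box], psm: Any) -> List[Tuple[str, int]]:
    """Recognize each externally supplied box and return (text, mean confidence)."""
    api.SetPageSegMode(psm)  # e.g. PSM.SINGLE_LINE when the boxes are text lines
    results = []
    for x, y, w, h in boxes:
        api.SetRectangle(x, y, w, h)  # restrict recognition to this region
        text = api.GetUTF8Text().strip()  # triggers recognition for the region
        results.append((text, api.MeanTextConf()))
    return results


def main(path: str) -> None:
    # Imported here so recognize_regions stays usable without tesserocr installed.
    from tesserocr import PyTessBaseAPI, PSM

    # Hypothetical line boxes from an external segmentation step.
    line_boxes = [(50, 40, 900, 32), (50, 80, 900, 32)]
    with PyTessBaseAPI() as api:
        api.SetImageFile(path)
        for text, conf in recognize_regions(api, line_boxes, PSM.SINGLE_LINE):
            print(conf, text)
```

Here the order of the output is simply the order of the boxes passed in, so the caller controls the reading order.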

bertsky · 2021-07-02T22:31:23Z

@sirfz, again, the problem is already in the usage example of the current README:

tesserocr/README.rst, lines 181 to 187 in 711cbab:

    boxes = api.GetComponentImages(RIL.TEXTLINE, True)
    print('Found {} textline image components.'.format(len(boxes)))
    for i, (im, box, _, _) in enumerate(boxes):
        # im is a PIL image object
        # box is a dict with x, y, w and h keys
        api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
        ocrResult = api.GetUTF8Text()

flashpixx changed the title ~~query~~ query lines, words, paragraphs, blocks get error no text returned Feb 24, 2021
