docs: Add some new OCR tips
wmanley committed May 5, 2015
1 parent e24c3dc commit b53a6de
Showing 1 changed file with 23 additions and 0 deletions.
23 changes: 23 additions & 0 deletions docs/ocr.md
@@ -13,6 +13,29 @@
Tesseract uses a dictionary to help choose the correct word even if individual
characters were misread. This is helpful when reading real words but it can get
in the way when reading characters and words with a different structure.

## General tips

* Crop the OCR region tightly around the text, using the `region` parameter
of `ocr` and `match_text`.

## Matching some known text

* Use the `tesseract_user_words` or `tesseract_user_patterns` parameters to
`ocr` and `match_text` to tell the OCR engine what you're expecting.
* Use fuzzy matching to check whether the returned text matches what you were
expecting, e.g. with a function like:

        def fuzzy_match(string1, string2, threshold=0.8):
            import difflib
            return difflib.SequenceMatcher(None, string1, string2).ratio() >= threshold

### Example

Looking for the text "EastEnders":

    text = stbt.ocr(region=stbt.Region(52, 34, 120, 50))
    assert fuzzy_match(text, "EastEnders")
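
The effect of the threshold can be checked without a set-top box by feeding `fuzzy_match` strings that imitate OCR misreads. This is a minimal sketch: it repeats the function so that it runs on its own, and the misread strings are made-up illustrations, not real OCR output.

```python
import difflib


def fuzzy_match(string1, string2, threshold=0.8):
    """Return True if the two strings are at least `threshold` similar."""
    return difflib.SequenceMatcher(None, string1, string2).ratio() >= threshold


# Hypothetical OCR results for a region that shows "EastEnders".
exact = "EastEnders"
misread = "EastEnclers"            # "d" misread as "cl"
wrong_title = "Coronation Street"  # a different programme altogether

print(fuzzy_match(exact, "EastEnders"))        # True
print(fuzzy_match(misread, "EastEnders"))      # True (ratio is 18/21)
print(fuzzy_match(wrong_title, "EastEnders"))  # False
```

Raising the threshold makes the check stricter: with `threshold=0.9` the "EastEnclers" misread above would no longer match.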

## Matching serial numbers

For example, here is a code generated at random by one user's set-top box, for