Optical character recognition (OCR) of text in freeform fields #15

Open
rholbert opened this Issue Jan 4, 2013 · 10 comments

@rholbert

rholbert commented Jan 4, 2013

I'd like to see an enhancement added to the SDAPS project that would allow optical character recognition (OCR) of the text in freeform fields.

The Gamera Project may be a good place to start:

http://gamera.informatik.hsnr.de/

@benzea

Member

benzea commented Jan 5, 2013

Oh, I somehow expected this to come up at some point ... thanks for pointing out gamera, I did not know about it.

Some thoughts of what needs to be done for this:

  • Add a new textbox type to LaTeX (no idea how to do that in ODT right now)
    • Allow only one line of writing
    • Render tick marks at the top and bottom (only to force users to print characters properly)
  • Figure out gamera script:
    • Create gamera test/training data
    • Export some example data from SDAPS
    • Create a standalone gamera script
    • Hook up everything into SDAPS
  • Add correction support (text input); maybe some features to create training data
  • Wordlists? (i.e. often there will only be a few valid answers; using a word list could improve recognition and would allow categorizing for statistics)
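
The word-list idea in the last bullet can be sketched as follows. This is only an illustration and not code from SDAPS or the "ocr" branch: the function name `snap_to_wordlist`, the `cutoff` value, and the sample answers are assumptions, and the fuzzy matching comes from Python's standard `difflib` module rather than from any OCR engine.

```python
import difflib


def snap_to_wordlist(raw_text, wordlist, cutoff=0.6):
    """Return the word-list entry closest to raw_text, or None if none is close.

    raw_text is whatever the recognizer produced for one field; wordlist holds
    the answers considered valid for that field (hypothetical data).
    """
    # Compare case-insensitively, but return the entry as it is spelled in the list.
    by_lowercase = {word.lower(): word for word in wordlist}
    matches = difflib.get_close_matches(
        raw_text.strip().lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[matches[0]] if matches else None


if __name__ == "__main__":
    answers = ["Germany", "France", "Austria"]
    # "rn" misread for "m" is a typical recognition error.
    print(snap_to_wordlist("Gerrnany", answers))
    print(snap_to_wordlist("xyz", answers))
```

Returning `None` for poor matches keeps such fields available for manual correction instead of forcing them into a category.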

Gamera seems like a good starting point, but getting this whole thing working is quite a big chunk of work. I have no idea when or even how much time I can spend on this myself ...

@benzea

Member

benzea commented Feb 23, 2013

Step 1 is done. There is now an "ocr" branch that supports rendering text fields optimized for OCR. Obviously, the rest of the program does not understand this yet.

@rholbert

rholbert commented Feb 23, 2013

Cool!

@benzea

Member

benzea commented Apr 2, 2013

I did some work on the branch to interface with gamera. I don't seem to entirely understand it right now, but there is hope :-)

It seems to me that the default grouping doesn't work, but SDAPS can do grouping by itself. Also, for a start it seems easier to do the training using the gamera_gui program instead of something custom.

Important next steps:

  • Filter to remove the border around the textbox (maybe similar to QueXF)
  • Possibility to export non-empty fields from SDAPS into one large image that can be used for training.
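
The two steps above could look roughly like this. It is a minimal sketch, not the filter used by QueXF or anything in SDAPS: images are modelled as plain lists of rows (1 = ink, 0 = paper), and the names `strip_border`, `has_writing`, `export_training_sheet` as well as the `max_margin` and `line_fraction` thresholds are assumptions chosen for illustration.

```python
def strip_border(img, max_margin=3, line_fraction=0.8):
    """Clear rows/columns near the edge that are mostly ink (the printed box).

    img is a list of equally long rows with 1 for ink and 0 for paper.
    Returns a cleaned copy; the input is left untouched.
    """
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    edge_rows = set(range(min(max_margin, height))) | set(range(max(height - max_margin, 0), height))
    edge_cols = set(range(min(max_margin, width))) | set(range(max(width - max_margin, 0), width))
    for y in edge_rows:
        if sum(img[y]) >= line_fraction * width:
            out[y] = [0] * width
    for x in edge_cols:
        if sum(img[row][x] for row in range(height)) >= line_fraction * height:
            for row in range(height):
                out[row][x] = 0
    return out


def has_writing(img, min_ink_pixels=1):
    """True if the (border-free) field contains at least min_ink_pixels of ink."""
    return sum(map(sum, img)) >= min_ink_pixels


def export_training_sheet(fields, gap=1):
    """Stack all non-empty fields vertically into one large image."""
    cleaned = [strip_border(f) for f in fields]
    cleaned = [f for f in cleaned if has_writing(f)]
    if not cleaned:
        return []
    width = max(len(f[0]) for f in cleaned)
    sheet = []
    for field in cleaned:
        # Pad narrower fields with paper so every row has the same width.
        sheet.extend(row + [0] * (width - len(row)) for row in field)
        sheet.extend([0] * width for _ in range(gap))
    return sheet
```

A real implementation would work on the scanned bitmaps and tolerate skewed or broken border lines; the sketch only shows the order of operations: remove the box first, then decide whether a field is empty.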

Fun side fact: gamera seems to store the image plus its original location in the XML file for each character. I guess some munging could be necessary for privacy reasons if one wants to share the training data; otherwise the original strings could be rebuilt from the training data.
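
One way to picture that munging is sketched below. It does not model gamera's actual XML format: glyphs are represented as hypothetical dicts with `label`, `bitmap`, `x`, `y` and `source` keys, and `anonymize_glyphs` is an illustrative name. The idea is simply to drop the location data and shuffle the order so that the written strings cannot be reassembled.

```python
import random


def anonymize_glyphs(glyphs, seed=None):
    """Return shuffled copies of the glyph records without location data.

    Each record is a hypothetical dict; only "label" and "bitmap" are kept,
    so position and source image are no longer available to rebuild strings.
    """
    rng = random.Random(seed)
    stripped = [{"label": g["label"], "bitmap": g["bitmap"]} for g in glyphs]
    rng.shuffle(stripped)  # the original writing order is lost as well
    return stripped


if __name__ == "__main__":
    training = [
        {"label": "a", "bitmap": [[1]], "x": 10, "y": 5, "source": "sheet1.tif"},
        {"label": "b", "bitmap": [[1]], "x": 22, "y": 5, "source": "sheet1.tif"},
    ]
    for glyph in anonymize_glyphs(training):
        print(sorted(glyph))
```

The character bitmaps themselves are still handwriting samples, so this only prevents reconstructing the answers, not recognizing a writer's hand.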

@Narayane

Narayane commented Oct 27, 2014

Hi,
Is this enhancement still being worked on, or has it been given up?
Best regards.

@benzea

Member

benzea commented Oct 27, 2014

Unless something unexpected happens, it is unlikely that I will work on this anytime soon. So, I wouldn't hold my breath for this to happen.

@benzea

Member

benzea commented Nov 7, 2014

"Unexpected" would most likely mean some external code contribution, or someone wanting this enough to pay for the development work (either by me or a third party).

@Narayane

Narayane commented Dec 10, 2014

Hi,

Considering that paid development is an option, would it be possible for you (or a third party) to estimate the work remaining to complete this feature?

Thank you.

@benzea

Member

benzea commented Dec 10, 2014

Hi,

It is hard to say how much work it is overall. I expect that there is a
large gap between just getting something working and having good ways to
train any recognition engine that might get integrated. Then error
correction (e.g. using word lists) and ways to collect training data, do
the training, and share it.

You might also want to talk to Matthew Roy, see
http://article.gmane.org/gmane.comp.statistics.sdaps/173
I don't know how well tesseract might work, but it definitely is a
relatively quick way to get some support at least.
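
As a rough idea of what the quick tesseract route could look like, here is a sketch that shells out to the `tesseract` command-line tool for one cropped field. It is not taken from SDAPS or from the linked mailing list post. It assumes Tesseract 4 or later is installed and on the PATH (the page segmentation flag is spelled `-psm` in 3.x), and the function names and the `field.png` path are made up.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the command line for recognizing a single line of text.

    "stdout" makes tesseract print the result instead of writing a file;
    "--psm 7" tells it to treat the image as one text line.
    """
    return ["tesseract", str(image_path), "stdout", "--psm", "7", "-l", lang]


def recognize_line(image_path, lang="eng"):
    """Run tesseract on one cropped text field and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # "field.png" is a placeholder for an exported freeform field image.
    print(tesseract_command("field.png"))
```

This uses the stock models without any training, so recognition quality on handwriting is an open question, as noted above.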

I'll think about the matter some more over the next days/week (i.e. how
much work it would be). There are also some job changes coming up for me
very soon, which unfortunately might drastically change the chances of me
working on this.

Benjamin

@Narayane

Narayane commented Dec 10, 2014

Hi,

I would be very interested in your evaluation, so please keep me informed.

SDAPS seems to be a very good starting point for building a complete solution that meets the needs of one of my customers, but I absolutely need OCR in addition to OMR for it to be complete.

Another option would be to take over the code in the ocr branch and try to finish it myself, but that is certainly not the best choice in terms of time and money in my context.

Thank you.
