Scan Extractor

This repo documents the process for extracting structured data from images of text.

Step 1

Convert HEIC images to JPG: one_two.sh

If your images are already in JPG format, you can skip this step. If your images are in a format other than HEIC, you may be able to use ImageMagick's convert program.
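
As a rough illustration of the non-HEIC case, the sketch below builds one ImageMagick command per input file. It is not the contents of one_two.sh. The function name `build_convert_commands`, the directory names, and the `.png` default are assumptions made for the example. On ImageMagick 7 the program is usually invoked as `magick` rather than `convert`, which is why the program name is a parameter.

```python
from pathlib import Path


def build_convert_commands(src_dir, dst_dir, ext=".png", program="convert"):
    """Return one ImageMagick command (as an argument list) per matching file.

    ImageMagick chooses the output format from the output file's extension,
    so `convert in.png out.jpg` writes a JPG.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    commands = []
    # Sort for a stable order; compare extensions case-insensitively.
    for src in sorted(src_dir.iterdir()):
        if src.is_file() and src.suffix.lower() == ext.lower():
            dst = dst_dir / (src.stem + ".jpg")
            commands.append([program, str(src), str(dst)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once ImageMagick is installed and the output directory exists. Building the commands separately from running them makes it easy to print and review them first.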

Step 2

Straighten, dewarp, remove the paper background, and convert to black and white. I recommend using Scan Tailor.

If your images are really good, you might be able to use a variation on two_three.sh, which uses textcleaner by Fred Weinhaus.
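
To give a feel for what a textcleaner call might look like, here is a minimal sketch that assembles one invocation. It is not the contents of two_three.sh. The helper name `build_textcleaner_command` and the specific values (filter size 25, offset 10, sharpen 1) are illustrative assumptions; check the flags against the usage text of your copy of the script and tune the values on a sample page.

```python
def build_textcleaner_command(src, dst, filter_size=25, offset=10, sharpen=1):
    """Return a textcleaner invocation as an argument list.

    Flags used (values are illustrative, not tuned):
      -g          convert the image to grayscale
      -e stretch  stretch the contrast before cleaning
      -f, -o      filter size and offset used to remove background noise
      -s          amount of sharpening applied to the result
    """
    return [
        "textcleaner",
        "-g",
        "-e", "stretch",
        "-f", str(filter_size),
        "-o", str(offset),
        "-s", str(sharpen),
        str(src),
        str(dst),
    ]
```

The list can be run with `subprocess.run(cmd, check=True)`, assuming the textcleaner script and ImageMagick are both on your PATH.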

Step 3

Use Tesseract OCR to extract text from the images: three_four.sh

If you know your input has a limited character set, I have found that using tessedit_char_whitelist eliminates post-processing work that would otherwise be needed.
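
For example, a digits-only whitelist can be passed to the Tesseract command line with `-c`. The sketch below only assembles the command. It is not the contents of three_four.sh, and the helper name `build_tesseract_command` and the file names are assumptions for illustration.

```python
def build_tesseract_command(image, output_base, whitelist=None):
    """Return a tesseract invocation as an argument list.

    Tesseract writes its text to `<output_base>.txt`. When `whitelist` is
    given, recognition is restricted to those characters via the
    tessedit_char_whitelist config variable.
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if whitelist:
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


# Restrict recognition to digits plus a few separators.
digits_cmd = build_tesseract_command("page1.jpg", "page1", whitelist="0123456789.,-")
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems when the whitelist contains characters such as `$` or `*`.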

Step 4

Post-process the OCR output and generate a CSV: process.py
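
What the post-processing looks like depends entirely on your data. As a hedged sketch only, and not the logic of process.py, the code below assumes a hypothetical input where each OCR line holds one record with whitespace-separated fields. It drops blank lines and lines with the wrong number of fields, then writes the rest as CSV. The function names and the field count are made up for the example.

```python
import csv
import io


def ocr_text_to_rows(text, expected_fields=None):
    """Split OCR text into rows of whitespace-separated fields.

    Blank lines are skipped. If `expected_fields` is set, lines with a
    different number of fields are dropped as likely OCR noise.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if expected_fields is not None and len(fields) != expected_fields:
            continue
        rows.append(fields)
    return rows


def rows_to_csv(rows):
    """Serialize rows to a CSV string, quoting fields where necessary."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

In practice you would probably log the dropped lines instead of discarding them silently, so that misread records can be corrected by hand.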
