bookscanning

Some scripts and things for processing scanned books. These will probably only be useful to people that have a huge directory of TIFF files of the sort that ScanTailor puts out after processing a huge directory of JPEG files that a DIY Book Scanner produces. I wrote these to help process a book scanned at Noisebridge.

These steps assume you've got ImageMagick, Tesseract, pdftk, and, like, Perl and Ruby installed.

looptifftopdf.rb converts all those TIFFs to PDF files in a subdirectory called pdf/. Then I just went into that directory and ran pdftk, like pdftk *.pdf cat output full-book.pdf. That made a very large PDF, and I'm working on making it smaller.
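If you don't have the script handy, the convert-then-merge steps could be sketched roughly like this. This is not the actual contents of looptifftopdf.rb: the method names (pdf_path_for, convert_command, merge_command, convert_and_merge) are made up for illustration, and it assumes ImageMagick's convert command is on your PATH (ImageMagick 7 calls it magick) and that your TIFFs have zero-padded names so a plain sort puts the pages in order.

```ruby
require "fileutils"

# Hypothetical helper: where the PDF for a given TIFF should land.
def pdf_path_for(tiff_path, out_dir = "pdf")
  File.join(out_dir, File.basename(tiff_path, ".*") + ".pdf")
end

# Build the ImageMagick command for one page as an argument array,
# so no shell quoting is needed for odd filenames.
def convert_command(tiff_path, out_dir = "pdf")
  ["convert", tiff_path, pdf_path_for(tiff_path, out_dir)]
end

# Build the pdftk command that joins the per-page PDFs in sorted order.
def merge_command(pdf_paths, output = "full-book.pdf")
  ["pdftk", *pdf_paths.sort, "cat", "output", output]
end

# Convert every TIFF in tiff_dir, then merge the results into one PDF.
def convert_and_merge(tiff_dir = ".", out_dir = "pdf")
  tiffs = Dir.glob(File.join(tiff_dir, "*.{tif,tiff}")).sort
  return if tiffs.empty? # nothing to do, and pdftk would complain

  FileUtils.mkdir_p(out_dir)
  tiffs.each do |tiff|
    system(*convert_command(tiff, out_dir)) or abort("convert failed on #{tiff}")
  end

  pdfs = tiffs.map { |tiff| pdf_path_for(tiff, out_dir) }
  system(*merge_command(pdfs, File.join(out_dir, "full-book.pdf")))
end
```

Calling convert_and_merge from the directory full of TIFFs would leave the per-page PDFs and full-book.pdf together in pdf/. The command builders are split out so you can print what would run before actually running it.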

ocrthethings.rb runs through the same TIFFs with Tesseract and produces a final OCRed output that is pretty good. One weird thing is it had page numbers in it, which I don't think I need. The difficult part is that each one sits on its own line with a blank line after it, and since grep and sed work a line at a time by default, I couldn't easily strip out all of those lines at once. So I turned to Perl, and used this one-liner:

perl -0777 -pe 's/\n\d{1,3}\n\n//g' finaltext.txt > finaltext-nonums.txt

That'll work so long as each page number is on a line by itself, is between 1 and 3 digits long, is followed by a blank line, and you want to yank out the number, the newline before it, and the blank line after it each time.
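Since the other scripts are Ruby anyway, the OCR pass and the page-number cleanup could also be sketched in Ruby along these lines. Again, this is a guess at the shape, not what ocrthethings.rb actually does: tesseract_command and strip_page_numbers are hypothetical names, and the sketch assumes the tesseract binary is on your PATH.

```ruby
# Build the Tesseract command for one page as an argument array.
# Tesseract appends ".txt" to the output base on its own.
def tesseract_command(tiff_path, out_base)
  ["tesseract", tiff_path, out_base]
end

# Same pattern as the Perl one-liner: a newline, a 1-3 digit number
# alone on its line, then a blank line.
PAGE_NUMBER = /\n\d{1,3}\n\n/

# Remove every page-number match from the full OCR text.
# Ruby strings are already "slurped", so no special mode is needed.
def strip_page_numbers(text)
  text.gsub(PAGE_NUMBER, "")
end
```

For example, strip_page_numbers("end of a page.\n42\n\nStart of the next.") returns "end of a page.Start of the next." The match eats the newline before the number too, so the text on either side ends up joined directly; swap the empty string for "\n" if you'd rather keep a line break there. A four-digit number on its own line is left alone.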

TODO

Make PDF much smaller, with some kinda optimization along the way
(perhaps at cross purposes) add the OCR to the PDF with hOCR or something.
