thisisparker/bookscanning

bookscanning

Some scripts and things for processing scanned books. These will probably only be useful to people that have a huge directory of TIFF files of the sort that ScanTailor puts out after processing a huge directory of JPEG files that a DIY Book Scanner produces. I wrote these to help process a book scanned at Noisebridge.

These steps assume you've got ImageMagick, Tesseract, pdftk, and, like, Perl and Ruby installed.

looptifftopdf.rb converts all those TIFFs to PDF files in a subdirectory called /pdf/. Then I just went into that directory and ran pdftk, like pdftk *.pdf cat output full-book.pdf. That made a very large PDF, and I'm working on making it smaller.
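For a rough idea of what a TIFF-to-PDF loop like this can look like, here is a minimal sketch. It is not the contents of looptifftopdf.rb; the method names (`pdf_path_for`, `convert_command`, `convert_all`) are made up for illustration, and it assumes ImageMagick's `convert` command is on your PATH (newer ImageMagick installs may call it `magick` instead).

```ruby
require "fileutils"

# Hypothetical helper: map a TIFF path to the matching PDF path inside out_dir.
def pdf_path_for(tiff, out_dir = "pdf")
  File.join(out_dir, File.basename(tiff, ".*") + ".pdf")
end

# Build the ImageMagick command as an argument list, so odd file names
# don't need shell quoting.
def convert_command(tiff, out_dir = "pdf")
  ["convert", tiff, pdf_path_for(tiff, out_dir)]
end

# Convert every .tif/.tiff file in src_dir, in sorted order.
# Returns the list of TIFFs it tried to convert.
def convert_all(src_dir = ".", out_dir = "pdf")
  tiffs = Dir.glob(File.join(src_dir, "*.{tif,tiff}")).sort
  return [] if tiffs.empty?

  FileUtils.mkdir_p(out_dir)
  tiffs.each do |tiff|
    system(*convert_command(tiff, out_dir)) or warn "convert failed for #{tiff}"
  end
  tiffs
end
```

Calling `convert_all` from the directory of TIFFs would write one PDF per page into pdf/. If the file names are zero-padded, they sort in page order, which is also the order a shell glob like *.pdf hands them to pdftk.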

ocrthethings.rb runs through the same TIFFs with Tesseract and produces a final OCRed output that is pretty good. One weird thing is that it had page numbers in it, which I don't think I need. The difficult part is that each one sits on its own line surrounded by newlines, and since grep and sed work a line at a time, I couldn't just strip out all those lines at once. So I turned to Perl, and used this one-liner:

perl -0777 -pe 's/\n\d{1,3}\n\n//g' finaltext.txt > finaltext-nonums.txt

That'll work so long as the page numbers are surrounded by newlines, are between 1 and 3 digits long, have nothing else on the same line, and you want to yank out all three lines each time.
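If you'd rather stay in Ruby, the same substitution can be sketched like this. It uses the same pattern as the Perl one-liner and carries the same assumptions about how the page numbers are laid out; the method name `strip_page_numbers` and the sample text are made up for illustration.

```ruby
# Hypothetical helper: remove a newline, a 1-3 digit page number,
# and the two newlines after it, everywhere in the text.
def strip_page_numbers(text)
  text.gsub(/\n\d{1,3}\n\n/, "")
end

# Made-up sample: a page number sitting between two pages of text.
sample = "end of page one.\n\n12\n\nStart of page two.\n"
puts strip_page_numbers(sample)
```

For the real file, something like `File.write("finaltext-nonums.txt", strip_page_numbers(File.read("finaltext.txt")))` should do it, since `File.read` pulls in the whole file at once and the pattern can match across lines.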

TODO

  • Make PDF much smaller, with some kinda optimization along the way
  • (perhaps at cross purposes) add the OCR to the PDF with hOCR or something.
