uoregon-libraries/pdftotext

pdftotext

This repository is based on poppler 0.22.5. It contains only the modified pdftotext.cc file and a sample of the HTML output produced by running the modified tool on a PDF.

Changes

Adds a new option, -bbox-layout, which is similar to -bbox. Where -bbox produces only word coordinates, -bbox-layout also produces tags for flows, blocks, lines, and words; the blocks, lines, and words all include coordinates. This output is useful for producing ALTO-like XML for ingesting PDFs into our Historic Oregon Newspapers system.
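
As a rough illustration of how the nested output might be consumed downstream, the sketch below reads line and word boxes from a small inline sample using Python's standard `html.parser`. The sample markup is hypothetical, and the element names (`line`, `word`) and attribute names (`xMin`, `yMin`, `xMax`, `yMax`) are assumptions modeled on the existing -bbox output; check them against the sample HTML in this repository before relying on them. `LayoutParser` and `parse_layout` are illustrative names, not part of this repository.

```python
from html.parser import HTMLParser

# Hypothetical sample of -bbox-layout style output. Element and attribute
# names are assumptions; compare with the sample HTML in this repository.
SAMPLE = """
<page width="612.000000" height="792.000000">
  <flow>
    <block xMin="72.0" yMin="70.0" xMax="200.0" yMax="100.0">
      <line xMin="72.0" yMin="70.0" xMax="200.0" yMax="84.0">
        <word xMin="72.0" yMin="70.0" xMax="110.0" yMax="84.0">Historic</word>
        <word xMin="114.0" yMin="70.0" xMax="150.0" yMax="84.0">Oregon</word>
      </line>
    </block>
  </flow>
</page>
"""


class LayoutParser(HTMLParser):
    """Collects each line's bounding box together with its words."""

    def __init__(self):
        super().__init__()
        self.lines = []         # [{"bbox": (...), "words": [(text, bbox), ...]}]
        self._word_bbox = None  # set while inside a <word> element
        self._text = []

    @staticmethod
    def _bbox(attrs):
        # HTMLParser lowercases attribute names, so xMin arrives as "xmin".
        a = dict(attrs)
        return tuple(float(a[k]) for k in ("xmin", "ymin", "xmax", "ymax"))

    def handle_starttag(self, tag, attrs):
        if tag == "line":
            self.lines.append({"bbox": self._bbox(attrs), "words": []})
        elif tag == "word":
            self._word_bbox = self._bbox(attrs)
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a <word> element.
        if self._word_bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "word" and self._word_bbox is not None:
            if not self.lines:
                # Word found outside any <line>; keep it under a box-less line.
                self.lines.append({"bbox": None, "words": []})
            self.lines[-1]["words"].append(("".join(self._text), self._word_bbox))
            self._word_bbox = None


def parse_layout(html_text):
    """Return the list of lines (with word boxes) found in html_text."""
    parser = LayoutParser()
    parser.feed(html_text)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    for line in parse_layout(SAMPLE):
        print(line["bbox"], [text for text, _ in line["words"]])
```

A tolerant HTML parser is used here rather than a strict XML parser so the sketch does not depend on the exact document wrapper the tool emits. Grouping by block or flow could be added the same way if the ALTO-like conversion needs it.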

Licensing

The license is GPL v2, as specified by the version of poppler this code is based on. The license notice can be viewed in the source file, pdftotext.cc.

About

Custom addition to poppler-utils pdftotext
