Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
extract text content and layout information from PDF documents
C C++
branch: master

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.
test
util
.gitignore
README
hocr2xml
pdfocr
rpdf

README

rpdf extracts text and layout from PDF documents, using pdftohtml and
cuneiform.

version 0.3 (2010-06-05)

---------------------------------------------------------------------
INSTALLATION
---------------------------------------------------------------------

rpdf requires 

- pdftohtml (the Poppler version)
- cuneiform 0.9
- convert (ImageMagick)
- pdftk
- the Perl modules XML::Writer, XML::XPath, HTML::TreeBuilder

No further installation required. To test if it works, run

./rpdf test/simple.pdf
./rpdf test/ocr.pdf


---------------------------------------------------------------------
USAGE
---------------------------------------------------------------------

rpdf [-d int] [-p int] file.pdf

 -d int : debug level
 -p int : max. number of pages to parse

If -p is not specified, what happens depends on whether rpdf can use
pdftohtml or has to fall back on OCR.
Something went wrong with that request. Please try again.