Parse text contents from common file formats

vmamaev/doc_ripper

 
 


DocRipper is a lightweight Ruby wrapper that parses text contents from common file formats (currently .doc, .docx and .pdf) without requiring heavy dependencies such as an OCR library or OpenOffice/LibreOffice.

For simple parsing, you'll likely see a large performance improvement with DocRipper over solutions that rely on OpenOffice/LibreOffice for .doc/.docx conversion.

Need OCR support or in-image text parsing? Take a look at Docsplit.

Quickstart

  gem install doc_ripper

Specify a file to parse

  DocRipper::TextRipper.new('/path/to/file')

Return the file's text

  dr = DocRipper::TextRipper.new('/path/to/file')
  dr.text
  => "Document's text"
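Since only .doc, .docx and .pdf are listed as supported, it may help to filter paths by extension before handing them to DocRipper. The sketch below is one way to do that; `SUPPORTED_EXTENSIONS` and `supported_format?` are hypothetical names introduced here for illustration and are not part of the gem.

```ruby
# Hypothetical helper, not part of DocRipper: the extension list mirrors
# the formats named in this README (.doc, .docx, .pdf).
SUPPORTED_EXTENSIONS = %w[.doc .docx .pdf].freeze

# Returns true when the path ends in one of the listed extensions,
# ignoring case (so "report.PDF" is accepted).
def supported_format?(path)
  SUPPORTED_EXTENSIONS.include?(File.extname(path).downcase)
end

# Example file names are made up for this sketch.
paths = ['report.PDF', 'notes.docx', 'photo.png']
candidates = paths.select { |path| supported_format?(path) }
puts candidates.inspect
```

Each path in `candidates` could then be passed to `DocRipper::TextRipper.new` as shown above.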

If the file cannot be read, nil will be returned.

  dr = DocRipper::TextRipper.new('/path/to/missing/file')
  dr.text
  => nil
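Because `text` returns nil for unreadable files, calling code usually needs a nil check before using the result. A minimal sketch of that pattern follows; `text_or_fallback` and `FakeRipper` are hypothetical names used only for illustration. `FakeRipper` stands in for a `DocRipper::TextRipper` instance so the example runs without the gem installed.

```ruby
# Hypothetical helper, not part of DocRipper: returns the parsed text,
# or the given fallback when `text` returns nil.
def text_or_fallback(ripper, fallback = '')
  text = ripper.text
  text.nil? ? fallback : text
end

# Stand-in for DocRipper::TextRipper (assumption: only `text` is needed).
# With the gem installed, pass DocRipper::TextRipper.new('/path/to/file').
FakeRipper = Struct.new(:text)

readable = FakeRipper.new("Document's text")
missing  = FakeRipper.new(nil)

puts text_or_fallback(readable)
puts text_or_fallback(missing, '(no text extracted)')
```

Returning a fallback keeps later string operations from raising NoMethodError on nil; callers that need to distinguish a missing file from an empty document should check for nil directly instead.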

Dependencies
