pzaich/doc_ripper

DocRipper

Grab the text from common document formats with one command. DocRipper is an extremely lightweight Ruby wrapper that parses the text contents of common file formats (currently .doc, .docx, .pdf, .txt and .sketch) without requiring heavy dependencies such as an OCR library or OpenOffice/LibreOffice.

For simple parsing, you'll likely see a large performance improvement with DocRipper over solutions that rely on OpenOffice/LibreOffice for .doc/.docx conversion.

Need OCR support or in-image text parsing? Take a look at Docsplit.

Supported File Formats

.doc
.docx
.pdf
.txt
.sketch

File format   Supported?   Dependencies
.doc          x            Antiword
.docx         x
.pdf          x            Poppler-utils
.txt          x
.sketch       x            Sqlite3

Quickstart

  gem install doc_ripper

Pass the path of the file you want to read:

  require 'doc_ripper'

  DocRipper::rip('/path/to/file')

If the file cannot be read, nil will be returned.

  DocRipper::rip('/path/to/missing/file')
  => nil
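
Because rip returns nil for unreadable files, callers usually need a nil check before using the result. The sketch below wraps that check in a small helper. The name rip_or_default and its default: and ripper: parameters are hypothetical and not part of DocRipper; the ripper: parameter exists only so that a stand-in object can be passed in where the gem is not installed.

```ruby
# Hypothetical helper (not part of DocRipper): returns the ripped text,
# or a fallback string when rip returns nil.
def rip_or_default(path, default: '', ripper: nil)
  # Load the gem lazily so a stand-in object can be injected instead.
  ripper ||= begin
    require 'doc_ripper'
    DocRipper
  end

  text = ripper.rip(path)
  text.nil? ? default : text
end
```

With the gem installed, a call such as rip_or_default('/path/to/file.pdf', default: '(no text)') would yield either the document text or the fallback string.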

Want to raise an exception? Use #rip!

#rip! raises an exception if rip would return nil or if the file type isn't supported.

  # invalid file type
  DocRipper::rip!('/path/to/invalid/file.type')
  => DocRipper::UnsupportedFileType

  # missing file
  DocRipper::rip!('/path/to/missing/file.doc')
  => DocRipper::FileNotFound
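
When ripping several files at once, the exceptions raised by rip! can be collected instead of aborting the whole run. The sketch below is one way to do that. The name rip_all and its ripper: parameter are hypothetical, and it rescues StandardError broadly on the assumption that DocRipper's error classes descend from it, which this README does not confirm.

```ruby
# Hypothetical helper (not part of DocRipper): rips each path with rip!
# and separates the successes from the failures.
def rip_all(paths, ripper: nil)
  # Load the gem lazily so a stand-in object can be injected instead.
  ripper ||= begin
    require 'doc_ripper'
    DocRipper
  end

  texts = {}
  errors = {}
  paths.each do |path|
    begin
      texts[path] = ripper.rip!(path)
    rescue StandardError => e
      # Assumption: DocRipper's errors are StandardError subclasses.
      errors[path] = e.class.name
    end
  end
  [texts, errors]
end
```

The method returns two hashes keyed by path: one holding the extracted text and one holding the name of the exception class raised for each file that failed.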

Dependencies
