
Ruby document parsing toolkit with zero runtime dependencies. Parse PDFs, DOCX, XLSX, and images (with OCR) using a single, lightweight gem. Statically links MuPDF and Tesseract at compile time for hassle-free installation - no system libraries or external tools required.


scientist-labs/parsekit


parsekit


Native Ruby bindings for the parser-core Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX), images (with OCR), and more. Part of the ruby-nlp ecosystem.

Features

  • 📄 Document Parsing: Extract text from PDFs and Office documents (DOCX, XLSX)
  • 🖼️ OCR Support: Extract text from images using Tesseract OCR
  • 🚀 High Performance: Native Rust performance with Ruby convenience
  • 🔧 Unified API: Single interface for multiple document formats
  • 📦 Cross-Platform: Works on Linux, macOS, and Windows
  • 🧪 Well Tested: Comprehensive test suite with RSpec

Installation

Add this line to your application's Gemfile:

gem 'parsekit'

And then execute:

bundle install

Or install it yourself as:

gem install parsekit

Requirements

  • Ruby >= 3.0.0
  • Rust toolchain (stable)
  • C compiler (for linking)

That's it! ParseKit bundles all the libraries it needs, including Tesseract for OCR, so beyond the build tools listed above you don't need to install any system libraries.

Usage

Basic Usage

require 'parsekit'

# Parse a PDF file
text = ParseKit.parse_file("document.pdf")
puts text  # Extracted text from the PDF

# Parse an Excel file
text = ParseKit.parse_file("spreadsheet.xlsx")
puts text  # Extracted text from all sheets

# Parse binary data directly
file_data = File.binread("document.pdf")
text = ParseKit.parse_bytes(file_data)
puts text

# Parse with a Parser instance
parser = ParseKit::Parser.new
text = parser.parse_file("report.docx")
puts text
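
When several files need to be processed, it can help to collect failures instead of stopping at the first one. The sketch below is illustrative only: `extract_all` is a hypothetical helper, not part of ParseKit, and it assumes that parse errors are raised as `StandardError` subclasses. It accepts any object that responds to `parse_file`, such as a `ParseKit::Parser` instance.

```ruby
# Hypothetical helper (not part of ParseKit): extract text from several files
# with any object that responds to #parse_file.
# Returns [texts, failures]; both are hashes keyed by path.
def extract_all(paths, parser:)
  texts = {}
  failures = {}
  paths.each do |path|
    begin
      texts[path] = parser.parse_file(path)
    rescue StandardError => e
      # Assumption: parse failures surface as StandardError subclasses.
      failures[path] = e.message
    end
  end
  [texts, failures]
end
```

A call might look like `texts, failures = extract_all(Dir.glob("docs/*.pdf"), parser: ParseKit::Parser.new)`.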

Module-Level Convenience Methods

# Parse files directly
content = ParseKit.parse_file('document.pdf')

# Parse bytes
data = File.read('document.pdf', mode: 'rb')
content = ParseKit.parse_bytes(data.bytes)

# Check supported formats
formats = ParseKit.supported_formats
# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp"]

# Check if a file is supported
ParseKit.supports_file?('document.pdf')  # => true
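
The format list can also be used to pre-filter a batch of paths. The following is a minimal sketch, assuming that support is decided by file extension alone; `select_supported` is a hypothetical helper, and the gem's own `supports_file?` may use different logic.

```ruby
# Hypothetical helper (not part of ParseKit): keep only the paths whose
# extension appears in a list of format names, such as the array returned
# by ParseKit.supported_formats.
def select_supported(paths, formats)
  allowed = formats.map(&:downcase)
  paths.select do |path|
    # File.extname returns ".pdf"; strip the dot and ignore case.
    ext = File.extname(path).delete_prefix(".").downcase
    allowed.include?(ext)
  end
end
```

For example, `select_supported(Dir.glob("inbox/*"), ParseKit.supported_formats)` would drop files whose extension is not in the list.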

Configuration Options

# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict

Format-Specific Parsing

parser = ParseKit::Parser.new

# Direct access to format-specific parsers
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

image_data = File.read('image.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

excel_data = File.read('data.xlsx', mode: 'rb').bytes
excel_text = parser.parse_xlsx(excel_data)
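
Format-specific methods require the caller to know what the bytes contain. When no reliable file name is available, one option is to inspect the leading "magic" bytes. The sketch below uses the standard signatures for PDF, PNG, JPEG, and ZIP; `sniff_format` and `SIGNATURES` are hypothetical names, and this is not ParseKit's own detection logic.

```ruby
# Hypothetical helper (not part of ParseKit): guess a container type from
# the leading bytes of the data.
SIGNATURES = {
  "%PDF-".b             => :pdf,
  "\x89PNG\r\n\x1A\n".b => :png,
  "\xFF\xD8\xFF".b      => :jpeg,
  "PK\x03\x04".b        => :zip # DOCX, XLSX, and PPTX are ZIP archives
}.freeze

# Accepts either a String or an Array of byte values (0..255).
def sniff_format(data)
  bytes = data.is_a?(Array) ? data.pack("C*") : data.b
  SIGNATURES.each do |magic, format|
    return format if bytes.start_with?(magic)
  end
  nil # unknown signature
end
```

A `:zip` result only narrows the choice: telling DOCX from XLSX would require looking inside the archive, which this sketch does not attempt.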

Supported Formats

| Format     | Extensions                     | Method     | Notes                                 |
|------------|--------------------------------|------------|---------------------------------------|
| PDF        | .pdf                           | parse_pdf  | Text extraction via MuPDF             |
| Word       | .docx                          | parse_docx | Office Open XML format                |
| Excel      | .xlsx, .xls                    | parse_xlsx | Both modern and legacy formats        |
| PowerPoint | .pptx                          | parse_pptx | Text extraction from slides and notes |
| Images     | .png, .jpg, .jpeg, .tiff, .bmp | ocr_image  | OCR via bundled Tesseract             |
| JSON       | .json                          | parse_json | Pretty-printed output                 |
| XML/HTML   | .xml, .html                    | parse_xml  | Extracts text content                 |
| Text       | .txt, .csv, .md                | parse_text | With encoding detection               |
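
The table above can be read as a lookup from extension to method name. The sketch below encodes it as a hash and dispatches with `public_send`. `FORMAT_METHODS`, `parser_method_for`, and `dispatch_parse` are hypothetical names, and the sketch assumes every listed method accepts the same byte-array argument shown for `parse_pdf`, which this README only demonstrates for `parse_pdf`, `ocr_image`, and `parse_xlsx`.

```ruby
# Extension-to-method mapping copied from the Supported Formats table.
FORMAT_METHODS = {
  ".pdf"  => :parse_pdf,
  ".docx" => :parse_docx,
  ".xlsx" => :parse_xlsx, ".xls" => :parse_xlsx,
  ".pptx" => :parse_pptx,
  ".png"  => :ocr_image, ".jpg" => :ocr_image, ".jpeg" => :ocr_image,
  ".tiff" => :ocr_image, ".bmp" => :ocr_image,
  ".json" => :parse_json,
  ".xml"  => :parse_xml, ".html" => :parse_xml,
  ".txt"  => :parse_text, ".csv" => :parse_text, ".md" => :parse_text
}.freeze

# Returns the method name for a path, or nil if the extension is not listed.
def parser_method_for(path)
  FORMAT_METHODS[File.extname(path).downcase]
end

# Hypothetical dispatcher: calls the matching method on any parser-like object.
# Assumption: each method takes an array of bytes, as parse_pdf does.
def dispatch_parse(parser, path, bytes)
  method_name = parser_method_for(path)
  raise ArgumentError, "unsupported format: #{path}" if method_name.nil?

  parser.public_send(method_name, bytes)
end
```

In everyday use `parse_file` already selects the right parser; an explicit mapping like this is mainly useful when the bytes come from somewhere other than a file on disk.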

Performance

ParseKit is built with performance in mind:

  • Native Rust implementation for speed
  • Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations
  • Efficient memory usage with streaming where possible
  • Configurable size limits to prevent memory issues
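
A size limit can also be applied before a file is read into memory. The following is a minimal sketch in the spirit of the `max_size` option; `read_within_limit` and `DEFAULT_LIMIT` are hypothetical names, and how the gem enforces `max_size` internally is not described here.

```ruby
# Hypothetical pre-check (not part of ParseKit): refuse to read files above a
# byte limit, so oversized inputs are rejected before they reach the parser.
DEFAULT_LIMIT = 50 * 1024 * 1024 # 50 MB, matching the configuration example

def read_within_limit(path, limit: DEFAULT_LIMIT)
  size = File.size(path)
  if size > limit
    raise ArgumentError, "#{path} is #{size} bytes; limit is #{limit}"
  end

  File.binread(path) # binary string, suitable for byte-oriented parsing
end
```

Checking `File.size` first avoids allocating a large string only to discard it.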

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests.

To compile the Rust extension:

rake compile

To run tests with coverage:

rake dev:coverage

OCR Mode Configuration

By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode instead.

Using system Tesseract during installation:

gem install parsekit -- --no-default-features

For development with system Tesseract:

rake compile CARGO_FEATURES=""  # Disables bundled-tesseract feature

System Tesseract requirements:

  • macOS: brew install tesseract
  • Ubuntu/Debian: sudo apt-get install libtesseract-dev
  • Fedora/RHEL: sudo dnf install tesseract-devel

The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.

Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

  • Ruby Layer: Provides convenient API and format detection
  • Rust Layer: Implements high-performance parsing using:
    • MuPDF for PDF text extraction (statically linked)
    • tesseract-rs for OCR (with bundled Tesseract by default)
    • Pure Rust libraries for DOCX/XLSX parsing
    • Magnus for Ruby-Rust FFI bindings

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/parsekit.

License

The gem is available as open source under the terms of the MIT License.

Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.
