Vectorize Iris is a model-based extraction solution for RAG systems that ingest PDFs and other unstructured documents. It combines extraction and chunking into one process, so you get clean, usable text from complex documents without stitching separate tools together.
Documentation: docs.vectorize.io
- Why Iris?
- Quick Start
- Installation
- Features
- CLI Examples
- Configuration
- Documentation
- License
- Support
Traditional OCR tools struggle with complex layouts, poor scans, and structured data. Iris uses advanced AI to understand document structure and context, delivering:
- 📄 Universal format support - Works with all unstructured document types (PDFs, images, scans, and more)
- ✨ High accuracy - Handles poor quality scans and complex layouts
- 📊 Structure preservation - Maintains tables, lists, and formatting
- 🎯 Smart chunking - Semantic splitting for RAG pipelines
- 🔍 Metadata extraction - Extract specific fields using natural language
- 🚀 Simple API - One function call to extract text
- ⚡ Parallel processing - Process multiple documents simultaneously
- 🌐 URL support - Extract directly from HTTP/HTTPS URLs
- 📂 Batch processing - Process entire directories automatically
- 🔧 Multiple formats - Output as JSON, YAML, or plain text
- 🪶 Lightweight - Single binary CLI with no dependencies
- ☁️ Cloud-native - Serverless-ready APIs
- 🌍 Multi-lingual - 100+ languages including Hindi, Arabic, Chinese
- 🔌 Multi-platform - Python, Node.js, and CLI support
Choose your preferred tool:
from vectorize_iris import extract_text_from_file
result = extract_text_from_file('document.pdf')
print(result.text)

import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('document.pdf');
console.log(result.text);

vectorize-iris document.pdf

CLI:
curl -fsSL https://raw.githubusercontent.com/vectorize-io/vectorize-iris/refs/heads/main/install.sh | sh

Python:
pip install vectorize-iris

Node.js:
npm install @vectorize-io/iris

Extract clean, structured text from any document format.
Split documents into semantic chunks perfect for RAG pipelines:
- Markdown-aware chunking
- Configurable chunk sizes
- Preserves context across chunks
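To make the idea of Markdown-aware chunking concrete, here is a minimal sketch. It is not Iris's actual algorithm, which runs server-side and is not described here. The function name `chunk_markdown` is hypothetical, and measuring `chunk_size` in characters is an assumption made for the illustration.

```python
import re


def chunk_markdown(text: str, chunk_size: int = 512) -> list[str]:
    """Greedily pack paragraphs into chunks, breaking at Markdown headings.

    Illustrative only: chunk_size is counted in characters here, which is
    an assumption and may not match how Iris measures chunk size.
    """
    # Paragraphs and headings are separated by blank lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks: list[str] = []
    current = ""
    for block in blocks:
        starts_section = block.startswith("#")
        candidate = f"{current}\n\n{block}" if current else block
        # Close the current chunk at a heading or when the size cap is hit.
        if current and (starts_section or len(candidate) > chunk_size):
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than `chunk_size` is kept whole in this sketch rather than cut mid-sentence, so the cap is a target, not a hard limit.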
Extract structured data using JSON schemas (OpenAPI spec format recommended):
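# Assumes ExtractionOptions is exported by vectorize_iris alongside extract_text_from_file
from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file('invoice.pdf', options=ExtractionOptions(
    metadata_schemas=[{
        'id': 'invoice-data',
        'schema': {
            'invoice_number': 'string',
            'date': 'string',
            'total_amount': 'number',
            'vendor_name': 'string'
        }
    }]
))
# Returns structured JSON metadata

Guide the extraction with custom instructions: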
result = extract_text_from_file('document.pdf', options=ExtractionOptions(
    parsing_instructions='Focus on extracting tables and ignore headers/footers'
))

Beautiful terminal output with progress indicators:
vectorize-iris document.pdf

Output:
✨ Vectorize Iris Extraction
──────────────────────────────────────────────────
✓ Upload prepared
✓ File uploaded successfully
✓ Extraction started
✓ Extraction completed in 7s
─────────────────────────────────────────────────────────
📄 Extracted Text
─────────────────────────────────────────────────────────
Stats: 5536 chars • 1245 words • 89 lines
This is the extracted text from your PDF document.
All formatting and structure is preserved.
Tables, lists, and other elements are properly extracted.
Download and extract files directly from HTTP/HTTPS URLs:
vectorize-iris https://arxiv.org/pdf/2206.01062

vectorize-iris document.pdf -o json

Output:
{
  "success": true,
  "text": "This is the extracted text from your PDF document...",
  "chunks": null,
  "metadata": null
}

Pipe to jq:
vectorize-iris document.pdf -o json | jq -r '.text' > output.txt

Get only the extracted text:
vectorize-iris document.pdf -o text

Redirect directly to a file:
vectorize-iris document.pdf -o text > output.txt

Use -f to save output directly:
vectorize-iris document.pdf -o json -f output.json

Output:
✨ Vectorize Iris Extraction
──────────────────────────────────────────────────
✓ Upload prepared
✓ File uploaded successfully
✓ Extraction started
✓ Extraction completed in 7s
✓ Output written to output.json
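A file saved with `-o json -f` can then be read back from a script. The sketch below assumes the saved file has the same shape as the JSON output shown earlier (`success`, `text`, `chunks`, `metadata`). The helper name `load_result` is hypothetical, and treating a falsy `success` as an error is an assumption, since a failure payload is not shown here.

```python
import json
from pathlib import Path


def load_result(path):
    """Read a saved JSON result and return (text, chunks)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Assumption: a failed extraction is signalled by a falsy "success".
    if not data.get("success"):
        raise ValueError(f"extraction did not succeed: {path}")
    # "chunks" is null in the sample output, so normalize it to a list.
    return data["text"], data.get("chunks") or []
```

For example, `text, chunks = load_result("output.json")` would return the extracted text and an empty list when no chunking was requested.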
Process all files in a directory automatically:
vectorize-iris ./documents -f ./output

Output:
📦 Processing Directory
──────────────────────────────────────────────────
💡 Found 5 files to process
⚙️ Processing 1/5 - report-q1.pdf
✨ Vectorize Iris Extraction
──────────────────────────────────────────────────
✓ Upload prepared
✓ File uploaded successfully
✓ Extraction started
✓ Extraction completed in 8s
✓ Output written to output/report-q1.txt
⚙️ Processing 2/5 - report-q2.pdf
...
──────────────────────────────────────────────────
✨ Batch Processing Complete
✓ Successful: 5
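The CLI walks the directory for you. From Python, a comparable batch could be sketched with a thread pool, as below. This is an illustration under assumptions, not how the CLI is implemented: `extract_directory` is a hypothetical helper, and `extract` is any callable that maps a file path to extracted text, for instance a small wrapper around `extract_text_from_file` that returns `result.text`.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def extract_directory(src_dir, out_dir, extract: Callable[[str], str], max_workers: int = 4):
    """Run `extract` on every file in src_dir and write <stem>.txt to out_dir."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.iterdir() if p.is_file())

    def process(path: Path) -> Path:
        # Mirrors the naming in the CLI output above: report-q1.pdf -> report-q1.txt.
        # Files that share a stem would overwrite each other in this sketch.
        target = out / f"{path.stem}.txt"
        target.write_text(extract(str(path)), encoding="utf-8")
        return target

    # Extraction is network-bound, so threads are a reasonable fit here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, files))
```

Any exception raised by `extract` propagates out of `extract_directory`, so a caller that wants per-file error reporting would need to catch it inside `process`.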
With custom output format:
# Extract all PDFs to JSON
vectorize-iris ./documents -o json -f ./output
# Extract all files to plain text
vectorize-iris ./scans -o text -f ./extracted

vectorize-iris long-document.pdf --chunk-size 512

Splits documents at semantic boundaries, perfect for RAG pipelines.

vectorize-iris report.pdf --parsing-instructions "Extract only tables and numerical data, ignore narrative text"

Pass multiple metadata schemas and Iris will automatically classify which schema matches best:
vectorize-iris invoice.pdf \
  --metadata-schema 'invoice:{"invoice_number":"string","date":"string","total_amount":"number","vendor":"string"}' \
  --metadata-schema 'receipt:{"store_name":"string","date":"string","items":"array","total":"number"}' \
  --metadata-schema 'contract:{"parties":"array","effective_date":"string","terms":"string"}' \
  --metadata-schema 'cv:{"name":"string","contact_info":"object","skills":"array","experience":"array"}' \
  -o json

Output:
{
  "success": true,
  "text": "...",
  "metadata": "{\"invoice_number\":\"INV-2024-001\",\"date\":\"2024-01-15\",\"total_amount\":1250.00,\"vendor\":\"Acme Corp\"}",
  "metadataSchema": "invoice"
}

Iris automatically detected this was an invoice and extracted the relevant fields using the matching schema.
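Note that in the output above, `metadata` is a JSON-encoded string, not a nested object, so it needs a second decode. The sketch below shows both directions: building the `id:{...}` value passed to `--metadata-schema`, and decoding the result. The helper names `schema_flag` and `parse_metadata` are hypothetical, and the code assumes the output shape matches the sample exactly.

```python
import json


def schema_flag(schema_id: str, fields: dict) -> list[str]:
    """Build one `--metadata-schema 'id:{...}'` argument pair for the CLI."""
    compact = json.dumps(fields, separators=(",", ":"))
    return ["--metadata-schema", f"{schema_id}:{compact}"]


def parse_metadata(cli_output: str):
    """Return (matched schema id, metadata fields) from `-o json` output."""
    result = json.loads(cli_output)
    raw = result.get("metadata")
    # In the sample output "metadata" is itself a JSON string: decode it again.
    fields = json.loads(raw) if isinstance(raw, str) else raw
    return result.get("metadataSchema"), fields
```

The list returned by `schema_flag` can be appended to an argument list for `subprocess.run`, which avoids shell quoting problems with the embedded JSON.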
# Custom chunk size with metadata extraction
vectorize-iris document.pdf \
  --chunk-size 256 \
  --infer-metadata-schema \
  --parsing-instructions "Focus on extracting structured data" \
  -o yaml -f output.yaml

# Longer timeout for large documents
vectorize-iris large-document.pdf \
  --timeout 600 \
  --poll-interval 5

The CLI offers multiple ways to configure your credentials:
The easiest way to get started - opens your browser for authentication:
vectorize-iris configure

What happens:
- Opens your browser to the Vectorize platform
- Click "Authorize" to grant access
- Credentials are automatically saved to ~/.vectorize-iris/credentials
- Done! You're ready to extract
If you prefer not to use the browser, enter your credentials at an interactive prompt instead:
vectorize-iris configure --manual

You'll be asked to enter:
- Access Token
- Organization ID
Get these from platform.vectorize.io → Account → Org Settings → Access Tokens
For scripts and automation, pass credentials directly:
vectorize-iris configure --api-token "your-token" --org-id "your-org-id"

Alternatively, set credentials via environment variables (works for all clients):
export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"

For Python and Node.js clients, use environment variables or pass credentials programmatically:
Environment variables:
export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"

Python:
from vectorize_iris import VectorizeIrisClient
client = VectorizeIrisClient(
    api_token="your-token",
    org_id="your-org-id"
)

Node.js:
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('document.pdf', {
  apiToken: 'your-token',
  orgId: 'your-org-id'
});

For detailed documentation, API reference, and advanced features, see docs.vectorize.io.
License: MIT
- 📖 Documentation
- 💬 Community
- 🐛 Issues