A powerful and developer-friendly document parsing library optimized for RAG (Retrieval Augmented Generation) applications. Built on top of industry-standard Java libraries like Apache Tika, Apache POI, and PDFBox, TorchV Unstructured provides enhanced parsing capabilities with intelligent table structure recognition and content extraction.
- Intelligent Table Parsing: Advanced table structure analysis with proper cell merging detection
- Multi-format Support: Seamless handling of DOC, DOCX, PDF, and other document formats
- RAG-Optimized Output: Structured content extraction designed for AI/ML pipelines
- Markdown & HTML Export: Flexible output formats with preserved table structures
- Image Extraction: Automatic extraction and handling of embedded images
- Memory Efficient: Optimized for processing large documents with minimal memory footprint
<dependency>
<groupId>com.torchv.infra</groupId>
<artifactId>torchv-unstructured</artifactId>
<version>1.0.0</version>
</dependency>
implementation 'com.torchv.infra:torchv-unstructured:1.0.0'
import com.torchv.infra.unstructured.UnstructuredParser;
// Parse document to markdown (recommended for RAG)
String content = UnstructuredParser.toMarkdown("document.docx");
System.out.println(content);
// Parse document to markdown with HTML tables (preserving table structure)
String contentWithTables = UnstructuredParser.toMarkdownWithHtmlTables("document.docx");
System.out.println(contentWithTables);
import com.torchv.infra.unstructured.UnstructuredParser;
import java.io.File;
import java.util.List;
// Extract only tables from Word documents
List<String> tables = UnstructuredParser.extractTables("document.docx");
for (int i = 0; i < tables.size(); i++) {
System.out.println("Table " + (i + 1) + ":");
System.out.println(tables.get(i));
}
// Get structured result with more control
DocumentResult result = UnstructuredParser.toStructuredResult("document.docx");
if (result.isSuccess()) {
System.out.println("Content: " + result.getContent());
System.out.println("Tables: " + result.getTables());
}
import com.torchv.infra.unstructured.UnstructuredParser;
import com.torchv.infra.unstructured.util.UnstructuredUtils;
// Check if file format is supported
if (UnstructuredUtils.isSupportedFormat("document.docx")) {
String content = UnstructuredParser.toMarkdownWithHtmlTables("document.docx");
System.out.println("Parsing successful!");
} else {
System.out.println("Unsupported file format");
}
// Get all supported formats
List<String> supportedFormats = UnstructuredUtils.getSupportedFormats();
System.out.println("Supported formats: " + String.join(", ", supportedFormats));
- UnstructuredParser: Main entry class providing simple and unified API for all document parsing operations
- UnstructuredWord: Universal Word document parsing with auto-detection
- TikaAutoUtils: Generic document parsing with auto-detection (underlying implementation)
- WordTableParser: Specialized Word document table parser
- DocxTableParser: Advanced DOCX table structure analyzer
- ToMarkdownWithHtmlTableContentHandler: Converts documents to Markdown with HTML tables
- DocMarkdownWithHtmlTableContentHandler: Specialized DOC format handler
- DocXMarkdownWithHtmlTableContentHandler: Specialized DOCX format handler
- TableStructureAnalyzer: Intelligent table structure recognition
- CellMergeAnalyzer: Advanced cell merging detection
- HtmlTableBuilder: Clean HTML table generation
- FileMagicUtils: File type detection and validation
- ImageExtractParse: Embedded image extraction
import com.torchv.infra.unstructured.UnstructuredParser;
import com.torchv.infra.unstructured.core.DocumentResult;
// Optimized for RAG applications
public class RAGDocumentProcessor {
public DocumentChunk processDocument(String filePath) {
// Parse with table structure preservation for better context
String content = UnstructuredParser.toMarkdownWithHtmlTables(filePath);
// Extract tables separately for structured data processing
List<String> tables = UnstructuredParser.extractTables(filePath);
return new DocumentChunk(content, tables);
}
}
import com.torchv.infra.unstructured.UnstructuredParser;
import com.torchv.infra.unstructured.util.UnstructuredUtils;
public class BatchProcessor {
public void processBatch(List<String> filePaths) {
filePaths.parallelStream()
.filter(UnstructuredUtils::isSupportedFormat)
.forEach(this::processFile);
}
private void processFile(String filePath) {
try {
String content = UnstructuredParser.toMarkdownWithHtmlTables(filePath);
// Save or further process the content
saveProcessedContent(filePath, content);
} catch (Exception e) {
log.error("Failed to process file: {}", filePath, e);
}
}
}
import com.torchv.infra.unstructured.UnstructuredParser;
import com.torchv.infra.unstructured.util.UnstructuredUtils;
public class DocumentValidator {
public ProcessingResult validateAndProcess(String filePath) {
// Check file format
if (!UnstructuredUtils.isSupportedFormat(filePath)) {
return ProcessingResult.unsupportedFormat();
}
try {
String content = UnstructuredParser.toMarkdownWithHtmlTables(filePath);
List<String> tables = UnstructuredParser.extractTables(filePath);
return ProcessingResult.success(content, tables);
} catch (RuntimeException e) {
return ProcessingResult.error(e.getMessage());
}
}
}
- Structured Output: Clean, structured content extraction perfect for embedding generation
- Table Preservation: Maintains table relationships crucial for document understanding
- Metadata Rich: Extracts comprehensive document metadata for enhanced retrieval
- Simple API: Intuitive interfaces with sensible defaults
- Extensible: Plugin-based architecture for custom content handlers
- Production Ready: Battle-tested with comprehensive error handling
- Memory Efficient: Streaming processing for large documents
- Fast Processing: Optimized algorithms for quick parsing
- Scalable: Designed for high-throughput document processing
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create your feature branch (
git checkout -b feature/amazing-feature
) - Commit your changes (
git commit -m 'Add some amazing feature'
) - Push to the branch (
git push origin feature/amazing-feature
) - Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Apache Tika - Content analysis toolkit
- Apache POI - Java API for Microsoft Documents
- PDFBox - PDF document manipulation
- 📧 Email: xiaoymin@foxmail.com
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
Made with ❤️ by TorchV Team