TorchV Unstructured

A developer-friendly document parsing library optimized for RAG (Retrieval-Augmented Generation) applications. Built on top of established Java libraries such as Apache Tika, Apache POI, and PDFBox, TorchV Unstructured adds table structure recognition (including merged-cell detection) and content extraction on top of their parsing capabilities.

🚀 Key Features

Intelligent Table Parsing: Advanced table structure analysis with proper cell merging detection
Multi-format Support: Seamless handling of DOC, DOCX, PDF, and other document formats
RAG-Optimized Output: Structured content extraction designed for AI/ML pipelines
Markdown & HTML Export: Flexible output formats with preserved table structures
Image Extraction: Automatic extraction and handling of embedded images
Memory Efficient: Optimized for processing large documents with minimal memory footprint

📦 Installation

Maven

<dependency>
    <groupId>com.torchv.infra</groupId>
    <artifactId>torchv-unstructured</artifactId>
    <version>1.0.0</version>
</dependency>

Gradle

implementation 'com.torchv.infra:torchv-unstructured:1.0.0'

🔧 Quick Start

Basic Document Parsing

import com.torchv.infra.unstructured.UnstructuredParser;

// Parse document to markdown (recommended for RAG)
String content = UnstructuredParser.toMarkdown("document.docx");
System.out.println(content);

// Parse document to markdown with HTML tables (preserving table structure)
String contentWithTables = UnstructuredParser.toMarkdownWithHtmlTables("document.docx");
System.out.println(contentWithTables);

Advanced Table Extraction

import com.torchv.infra.unstructured.UnstructuredParser;
import com.torchv.infra.unstructured.core.DocumentResult;

import java.util.List;

// Extract only tables from Word documents
List<String> tables = UnstructuredParser.extractTables("document.docx");
for (int i = 0; i < tables.size(); i++) {
    System.out.println("Table " + (i + 1) + ":");
    System.out.println(tables.get(i));
}

// Get structured result with more control
DocumentResult result = UnstructuredParser.toStructuredResult("document.docx");
if (result.isSuccess()) {
    System.out.println("Content: " + result.getContent());
    System.out.println("Tables: " + result.getTables());
}

File Format Support

import com.torchv.infra.unstructured.UnstructuredParser;
import com.torchv.infra.unstructured.util.UnstructuredUtils;

import java.util.List;

// Check if file format is supported
if (UnstructuredUtils.isSupportedFormat("document.docx")) {
    String content = UnstructuredParser.toMarkdownWithHtmlTables("document.docx");
    System.out.println("Parsing successful!");
} else {
    System.out.println("Unsupported file format");
}

// Get all supported formats
List<String> supportedFormats = UnstructuredUtils.getSupportedFormats();
System.out.println("Supported formats: " + String.join(", ", supportedFormats));

🎯 Core Components

Unified Entry Point

UnstructuredParser: Main entry class providing a simple, unified API for all document parsing operations

Document Parsers

UnstructuredWord: Universal Word document parsing with auto-detection
TikaAutoUtils: Generic document parsing with auto-detection (underlying implementation)
WordTableParser: Specialized Word document table parser
DocxTableParser: Advanced DOCX table structure analyzer

Content Handlers

ToMarkdownWithHtmlTableContentHandler: Converts documents to Markdown with HTML tables
DocMarkdownWithHtmlTableContentHandler: Specialized DOC format handler
DocXMarkdownWithHtmlTableContentHandler: Specialized DOCX format handler

Table Analysis

TableStructureAnalyzer: Intelligent table structure recognition
CellMergeAnalyzer: Advanced cell merging detection
HtmlTableBuilder: Clean HTML table generation
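
To illustrate why merged-cell detection matters for HTML output, here is a minimal sketch of how detected spans could be rendered as colspan and rowspan attributes. It is not the library's HtmlTableBuilder; the class MergedCellHtmlSketch and its Cell type are hypothetical names introduced only for this example.

```java
import java.util.List;

// Hypothetical sketch, not part of TorchV Unstructured.
final class MergedCellHtmlSketch {

    // One logical cell: its text plus how many columns and rows it covers.
    static final class Cell {
        final String text;
        final int colSpan;
        final int rowSpan;

        Cell(String text, int colSpan, int rowSpan) {
            this.text = text;
            this.colSpan = colSpan;
            this.rowSpan = rowSpan;
        }
    }

    // Renders rows of cells as an HTML table, emitting span attributes only when needed.
    static String toHtml(List<List<Cell>> rows) {
        StringBuilder html = new StringBuilder("<table>");
        for (List<Cell> row : rows) {
            html.append("<tr>");
            for (Cell cell : row) {
                html.append("<td");
                if (cell.colSpan > 1) {
                    html.append(" colspan=\"").append(cell.colSpan).append('"');
                }
                if (cell.rowSpan > 1) {
                    html.append(" rowspan=\"").append(cell.rowSpan).append('"');
                }
                html.append('>').append(escape(cell.text)).append("</td>");
            }
            html.append("</tr>");
        }
        return html.append("</table>").toString();
    }

    // Escapes the characters that would otherwise break the markup.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

A plain Markdown table cannot express spans, which is presumably why the library offers Markdown output with embedded HTML tables.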

Utilities

FileMagicUtils: File type detection and validation
ImageExtractParse: Embedded image extraction
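
File type detection is commonly done by inspecting the first bytes of a file rather than trusting its extension. The sketch below shows that general technique for the three container signatures relevant here; it does not describe how FileMagicUtils is implemented, and MagicSniffer is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical sketch, not part of TorchV Unstructured.
final class MagicSniffer {

    // "%PDF"
    private static final byte[] PDF = {0x25, 0x50, 0x44, 0x46};
    // "PK\3\4": DOCX and other OOXML files are ZIP containers
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound file header, used by legacy DOC/XLS/PPT
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Classifies a header by its leading signature bytes.
    static String sniff(byte[] header) {
        if (startsWith(header, PDF)) return "pdf";
        if (startsWith(header, OLE2)) return "ole2";
        if (startsWith(header, ZIP)) return "zip";
        return "unknown";
    }

    // Reads at most the first 8 bytes of the file and classifies them.
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A signature only identifies the container: a "zip" result may be a DOCX, an XLSX, or an ordinary archive, so telling them apart requires looking inside the package.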

🔍 Advanced Usage

RAG Application Integration

import com.torchv.infra.unstructured.UnstructuredParser;

import java.util.List;

// Optimized for RAG applications.
// DocumentChunk is not imported here, so it is presumably an application-defined holder class.
public class RAGDocumentProcessor {

    public DocumentChunk processDocument(String filePath) {
        // Parse with table structure preservation for better context
        String content = UnstructuredParser.toMarkdownWithHtmlTables(filePath);

        // Extract tables separately for structured data processing
        List<String> tables = UnstructuredParser.extractTables(filePath);

        return new DocumentChunk(content, tables);
    }
}
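
The example above does not define DocumentChunk. A minimal sketch of such a holder, assuming it lives in the application rather than in the library, might look like the following. The splitByHeading method is an added illustration of one common RAG preprocessing step (splitting Markdown at headings before embedding); it is not something the library is known to provide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical application-side type, not part of TorchV Unstructured.
final class DocumentChunk {

    private final String content;
    private final List<String> tables;

    DocumentChunk(String content, List<String> tables) {
        this.content = content;
        this.tables = List.copyOf(tables); // defensive, immutable copy
    }

    String getContent() { return content; }

    List<String> getTables() { return tables; }

    // Splits the Markdown content into sections, starting a new section
    // at every ATX heading line ("# ", "## ", ... up to six '#').
    List<String> splitByHeading() {
        List<String> sections = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : content.split("\n", -1)) {
            if (line.matches("#{1,6}\\s.*")) {
                flush(current, sections);
            }
            current.append(line).append('\n');
        }
        flush(current, sections);
        return sections;
    }

    // Adds the buffered text as a section if it is not blank, then resets the buffer.
    private static void flush(StringBuilder buffer, List<String> sections) {
        String section = buffer.toString().trim();
        if (!section.isEmpty()) {
            sections.add(section);
        }
        buffer.setLength(0);
    }
}
```

This simple splitter treats any line that looks like a heading as a boundary, including one inside a fenced code block, so real pipelines usually need a proper Markdown parser.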

Batch Processing

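import com.torchv.infra.unstructured.UnstructuredParser;
import com.torchv.infra.unstructured.util.UnstructuredUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;

public class BatchProcessor {

    // Assumes SLF4J is on the classpath; any logger with the same call shape works.
    private static final Logger log = LoggerFactory.getLogger(BatchProcessor.class);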

    public void processBatch(List<String> filePaths) {
        filePaths.parallelStream()
                .filter(UnstructuredUtils::isSupportedFormat)
                .forEach(this::processFile);
    }

    private void processFile(String filePath) {
        try {
            String content = UnstructuredParser.toMarkdownWithHtmlTables(filePath);
            // Save or further process the content.
            // saveProcessedContent is an application-specific method (not shown).
            saveProcessedContent(filePath, content);
        } catch (Exception e) {
            log.error("Failed to process file: {}", filePath, e);
        }
    }
}
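
A parallel stream runs on the JVM-wide common ForkJoinPool by default, so a large batch competes with every other parallel stream in the process. Where that matters, a dedicated fixed-size pool is one alternative. The sketch below assumes nothing about the library: the parser is passed in as a plain Function, and BoundedBatchSketch is a hypothetical name.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch, not part of TorchV Unstructured.
final class BoundedBatchSketch {

    // Parses every path on a pool of at most `threads` workers and
    // returns path -> parsed content for the files that succeeded.
    static Map<String, String> run(List<String> paths,
                                   Function<String, String> parser,
                                   int threads) {
        Map<String, String> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (String path : paths) {
                pool.execute(() -> {
                    try {
                        results.put(path, parser.apply(path));
                    } catch (RuntimeException e) {
                        // One bad file should not abort the whole batch.
                        System.err.println("Failed to process " + path + ": " + e.getMessage());
                    }
                });
            }
        } finally {
            pool.shutdown(); // stop accepting work; queued tasks still run
        }
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
        return results;
    }
}
```

In an application, the parser argument could be a method reference such as UnstructuredParser::toMarkdownWithHtmlTables from the examples above.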

Error Handling and Validation

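import com.torchv.infra.unstructured.UnstructuredParser;
import com.torchv.infra.unstructured.util.UnstructuredUtils;

import java.util.List;

// ProcessingResult is not imported here, so it is presumably an application-defined result type.
public class DocumentValidator {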
    
    public ProcessingResult validateAndProcess(String filePath) {
        // Check file format
        if (!UnstructuredUtils.isSupportedFormat(filePath)) {
            return ProcessingResult.unsupportedFormat();
        }
        
        try {
            String content = UnstructuredParser.toMarkdownWithHtmlTables(filePath);
            List<String> tables = UnstructuredParser.extractTables(filePath);
            
            return ProcessingResult.success(content, tables);
        } catch (RuntimeException e) {
            return ProcessingResult.error(e.getMessage());
        }
    }
}
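
ProcessingResult is used above but not defined. A minimal sketch of such a result type, assuming it belongs to the application and mirrors the three factory calls in the example, could be written as follows. The Status enum and the getter names are hypothetical.

```java
import java.util.List;

// Hypothetical application-side type, not part of TorchV Unstructured.
final class ProcessingResult {

    enum Status { SUCCESS, UNSUPPORTED_FORMAT, ERROR }

    private final Status status;
    private final String content;
    private final List<String> tables;
    private final String errorMessage;

    private ProcessingResult(Status status, String content,
                             List<String> tables, String errorMessage) {
        this.status = status;
        this.content = content;
        this.tables = tables;
        this.errorMessage = errorMessage;
    }

    // Parsing succeeded: carries the content and the extracted tables.
    static ProcessingResult success(String content, List<String> tables) {
        return new ProcessingResult(Status.SUCCESS, content, List.copyOf(tables), null);
    }

    // The file was rejected before parsing.
    static ProcessingResult unsupportedFormat() {
        return new ProcessingResult(Status.UNSUPPORTED_FORMAT, null, List.of(), "Unsupported file format");
    }

    // Parsing was attempted and failed.
    static ProcessingResult error(String message) {
        return new ProcessingResult(Status.ERROR, null, List.of(), message);
    }

    boolean isSuccess() { return status == Status.SUCCESS; }

    Status getStatus() { return status; }

    String getContent() { return content; }

    List<String> getTables() { return tables; }

    String getErrorMessage() { return errorMessage; }
}
```

Keeping the constructor private and exposing only named factories means callers cannot build a result whose status and payload disagree.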

🌟 Why Choose TorchV Unstructured?

For RAG Applications

Structured Output: Clean, structured content extraction perfect for embedding generation
Table Preservation: Maintains table relationships crucial for document understanding
Metadata Rich: Extracts comprehensive document metadata for enhanced retrieval

For Developers

Simple API: Intuitive interfaces with sensible defaults
Extensible: Plugin-based architecture for custom content handlers
Production Ready: Battle-tested with comprehensive error handling

Performance Optimized

Memory Efficient: Streaming processing for large documents
Fast Processing: Optimized algorithms for quick parsing
Scalable: Designed for high-throughput document processing

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add some amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

Apache Tika - Content analysis toolkit
Apache POI - Java API for Microsoft Documents
PDFBox - PDF document manipulation

📞 Support

📧 Email: xiaoymin@foxmail.com
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

Made with ❤️ by TorchV Team
