Text and Metadata Extraction Library for Document Files with Text Similarity Comparison

undms

High-performance document text and metadata extraction library with similarity comparison, built with napi-rs for Node.js and Bun.

Installation

bun add undms
# or
npm install undms

Features

  • Multi-format extraction - Text, DOCX, XLSX, PDF, and images
  • Similarity comparison - Compare documents against reference texts using multiple algorithms
  • Rich metadata - Extract format-specific metadata (EXIF, PDF info, DOCX stats, etc.)
  • OCR support - Extract text from images using Tesseract
  • Parallel processing - Documents are processed concurrently for performance
  • TypeScript support - Full type definitions included

Supported Formats

| Format | MIME Type | Features |
| --- | --- | --- |
| Text | text/*, application/json, application/xml, etc. | Content + line/word/character counts |
| DOCX | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Paragraphs, tables, images, hyperlinks |
| XLSX | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Cell content, sheets, rows, columns |
| PDF | application/pdf | Text, title, author, subject, producer, page info |
| Images | image/jpeg, image/png, image/gif, image/bmp, image/tiff, image/webp | OCR text, EXIF data, GPS location |

Quick Start

import { extract, computeDocumentSimilarity, computeTextSimilarity } from 'undms';

const documents = [
  {
    name: 'report.txt',
    size: 1024,
    type: 'text/plain',
    lastModified: Date.now(),
    webkitRelativePath: '',
    buffer: Buffer.from('Document content here...'),
  },
];

const result = extract(documents);
console.log(result[0].documents[0].content);
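
In practice the buffer usually comes from a file on disk. The following is a minimal sketch of filling in the Document fields with Node's fs module. `documentFromFile` and `InputDocument` are hypothetical names, not part of undms (the interface mirrors the README's Document type and is renamed only to avoid clashing with the DOM's global Document), and the MIME type is supplied by the caller rather than detected.

```typescript
import { mkdtempSync, readFileSync, statSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { basename, join } from 'node:path';

// Mirrors the Document interface from the Type Definitions section.
interface InputDocument {
  name: string;
  size: number;
  type: string; // MIME type
  lastModified: number;
  webkitRelativePath: string;
  buffer: Buffer;
}

// Hypothetical helper (not part of undms): reads a file and fills in the
// Document fields. The MIME type is passed in; nothing is sniffed.
function documentFromFile(path: string, mimeType: string): InputDocument {
  const buffer = readFileSync(path);
  const stats = statSync(path);
  return {
    name: basename(path),
    size: buffer.length, // size in bytes, taken from the buffer itself
    type: mimeType,
    lastModified: Math.round(stats.mtimeMs),
    webkitRelativePath: '',
    buffer,
  };
}

// Example: wrap a small temporary file.
const dir = mkdtempSync(join(tmpdir(), 'undms-example-'));
const filePath = join(dir, 'report.txt');
writeFileSync(filePath, 'Document content here...');
const doc = documentFromFile(filePath, 'text/plain');
console.log(doc.name, doc.size);
```

The resulting object has the same shape as the entries in the `documents` array above, so it could be passed to `extract([doc])`.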

API Reference

extract(documents)

Extracts text and metadata from input documents. Results are grouped by MIME type.

Parameters:

  • documents - Array of Document objects

Returns: GroupedDocuments[]

const result = extract([
  {
    name: 'document.pdf',
    size: 1024,
    type: 'application/pdf',
    lastModified: Date.now(),
    webkitRelativePath: '',
    buffer: Buffer.from(pdfData), // pdfData: the raw PDF bytes, loaded elsewhere
  },
]);
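
Because results are grouped by MIME type, `result[i]` is a group rather than the i-th input document. A small sketch of flattening the groups into one list may help when the grouping is not needed. `flattenGroups` is a hypothetical helper, not part of undms, and the interfaces below are copied from the Type Definitions section with the `metadata` field left loosely typed.

```typescript
interface DocumentMetadata {
  name: string;
  size: number;
  processingTime: number;
  encoding: string;
  content: string;
  metadata?: unknown; // MetadataPayload in the README; not needed here
  error?: string;
}

interface GroupedDocuments {
  mimeType: string;
  documents: DocumentMetadata[];
}

type FlatDocument = DocumentMetadata & { mimeType: string };

// Hypothetical helper: one flat list, each entry tagged with its group's MIME type.
function flattenGroups(groups: GroupedDocuments[]): FlatDocument[] {
  return groups.flatMap((group) =>
    group.documents.map((doc) => ({ ...doc, mimeType: group.mimeType })),
  );
}
```

With this in place, `flattenGroups(extract(documents))` would yield one entry per extracted document, which can then be looked up by `name`.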

computeDocumentSimilarity(documents, referenceTexts, threshold?, method?)

Extracts text from the documents, then scores each document against the reference texts.

Parameters:

  • documents - Array of Document objects
  • referenceTexts - Candidate reference texts to compare against
  • threshold - Minimum score (0-100) to include a match (default: 30)
  • method - Similarity algorithm: 'jaccard', 'ngram', 'levenshtein', or 'hybrid' (default)

Returns: GroupedDocumentsWithSimilarity[]

const result = computeDocumentSimilarity(
  documents,
  ['reference text A', 'reference text B'],
  70,
  'hybrid',
);
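
Each returned document carries its own `similarityMatches` array, which can be empty when nothing reaches the threshold. As a sketch of consuming that result, the hypothetical helper below (not part of undms) picks the highest-scoring match per document. The interfaces are trimmed copies of the ones in Type Definitions, and keying by `name` is an assumption that document names are unique.

```typescript
interface SimilarityMatch {
  referenceIndex: number;
  similarityPercentage: number;
}

interface DocumentWithMatches {
  name: string;
  similarityMatches: SimilarityMatch[];
}

interface GroupWithMatches {
  mimeType: string;
  documents: DocumentWithMatches[];
}

// Hypothetical helper: best match per document name.
// Documents with no matches are skipped; duplicate names overwrite each other.
function topMatchPerDocument(groups: GroupWithMatches[]): Map<string, SimilarityMatch> {
  const top = new Map<string, SimilarityMatch>();
  for (const group of groups) {
    for (const doc of group.documents) {
      if (doc.similarityMatches.length === 0) continue;
      const best = doc.similarityMatches.reduce((a, b) =>
        b.similarityPercentage > a.similarityPercentage ? b : a,
      );
      top.set(doc.name, best);
    }
  }
  return top;
}
```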

computeTextSimilarity(sourceText, referenceTexts, threshold?, method?)

Computes similarity for plain text without file extraction.

Parameters:

  • sourceText - Source text to compare
  • referenceTexts - Candidate reference texts
  • threshold - Minimum score (0-100) to include a match (default: 30)
  • method - Similarity algorithm (default: 'hybrid')

Returns: SimilarityMatch[]

const matches = computeTextSimilarity(
  'alpha beta gamma',
  ['alpha beta gamma', 'different text'],
  80,
  'jaccard',
);
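
A SimilarityMatch only carries a `referenceIndex`, so the caller has to map it back to the text it refers to. The sketch below assumes the index is a zero-based position in the `referenceTexts` array that was passed in; the README does not state this explicitly. `resolveMatches` is a hypothetical helper, not part of undms.

```typescript
interface SimilarityMatch {
  referenceIndex: number;
  similarityPercentage: number;
}

interface ResolvedMatch {
  text: string;
  score: number;
}

// Hypothetical helper: pairs each match with its reference text, best first.
// Assumes referenceIndex is a zero-based index into referenceTexts;
// out-of-range indices are dropped rather than guessed at.
function resolveMatches(matches: SimilarityMatch[], referenceTexts: string[]): ResolvedMatch[] {
  return matches
    .filter((m) => m.referenceIndex >= 0 && m.referenceIndex < referenceTexts.length)
    .map((m) => ({ text: referenceTexts[m.referenceIndex], score: m.similarityPercentage }))
    .sort((a, b) => b.score - a.score);
}
```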

Type Definitions

Document

Input document interface.

interface Document {
  name: string;
  size: number;
  type: string; // MIME type
  lastModified: number;
  webkitRelativePath: string;
  buffer: Buffer;
}

DocumentMetadata

Extracted document result.

interface DocumentMetadata {
  name: string;
  size: number;
  processingTime: number;
  encoding: string;
  content: string;
  metadata?: MetadataPayload;
  error?: string;
}

GroupedDocuments

Documents grouped by MIME type.

interface GroupedDocuments {
  mimeType: string;
  documents: DocumentMetadata[];
}

GroupedDocumentsWithSimilarity

Grouped documents with similarity matches.

interface GroupedDocumentsWithSimilarity {
  mimeType: string;
  documents: DocumentMetadataWithSimilarity[];
}

DocumentMetadataWithSimilarity

Document metadata with similarity results.

interface DocumentMetadataWithSimilarity {
  name: string;
  size: number;
  processingTime: number;
  encoding: string;
  content: string;
  metadata?: MetadataPayload;
  error?: string;
  similarityMatches: SimilarityMatch[];
}

SimilarityMatch

Similarity comparison result.

interface SimilarityMatch {
  referenceIndex: number;
  similarityPercentage: number;
}

MetadataPayload

Complete metadata payload with format-specific fields.

interface MetadataPayload {
  text?: TextMetadata;
  docx?: DocxMetadata;
  xlsx?: XlsxMetadata;
  pdf?: PdfMetadata;
  image?: ImageMetadata;
}
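
Since every field of the payload is optional, consumers need to check which one is present before reading it. The sketch below shows one way to do that. `describeMetadata` is a hypothetical helper, the nested types are trimmed to the fields it uses, and it assumes at most one format-specific field is populated per document, which the README does not guarantee.

```typescript
// Trimmed copy of MetadataPayload: only the fields used below.
interface MetadataPayload {
  text?: { lineCount: number; wordCount: number };
  docx?: { paragraphCount: number; tableCount: number };
  xlsx?: { sheetCount: number; sheetNames: string[] };
  pdf?: { title?: string; pageCount: number };
  image?: { width: number; height: number };
}

// Hypothetical helper: one-line summary of whichever field is present.
// If several fields were set, the first one in this order wins.
function describeMetadata(metadata?: MetadataPayload): string {
  if (!metadata) return 'no metadata';
  if (metadata.text) {
    return `text: ${metadata.text.lineCount} lines, ${metadata.text.wordCount} words`;
  }
  if (metadata.docx) {
    return `docx: ${metadata.docx.paragraphCount} paragraphs, ${metadata.docx.tableCount} tables`;
  }
  if (metadata.xlsx) {
    return `xlsx: ${metadata.xlsx.sheetCount} sheets (${metadata.xlsx.sheetNames.join(', ')})`;
  }
  if (metadata.pdf) {
    return `pdf: ${metadata.pdf.title ?? 'untitled'}, ${metadata.pdf.pageCount} pages`;
  }
  if (metadata.image) {
    return `image: ${metadata.image.width}x${metadata.image.height}`;
  }
  return 'no metadata';
}
```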

TextMetadata

Text content statistics.

interface TextMetadata {
  lineCount: number;
  wordCount: number;
  characterCount: number;
  nonWhitespaceCharacterCount: number;
}

DocxMetadata

DOCX document statistics.

interface DocxMetadata {
  paragraphCount: number;
  tableCount: number;
  imageCount: number;
  hyperlinkCount: number;
}

XlsxMetadata

XLSX spreadsheet statistics.

interface XlsxMetadata {
  sheetCount: number;
  sheetNames: string[];
  rowCount: number;
  columnCount: number;
  cellCount: number;
}

PdfMetadata

PDF document information.

interface PdfMetadata {
  title?: string;
  author?: string;
  subject?: string;
  producer?: string;
  pageSize?: PdfPageSize;
  pageCount: number;
}

interface PdfPageSize {
  width: number;
  height: number;
}

ImageMetadata

Image file information.

interface ImageMetadata {
  width: number;
  height: number;
  format?: string;
  cameraMake?: string;
  cameraModel?: string;
  datetimeOriginal?: string;
  location: ImageLocation;
}

interface ImageLocation {
  latitude?: number;
  longitude?: number;
}

Similarity Methods

| Method | Description |
| --- | --- |
| jaccard | Set-based similarity using Jaccard index |
| ngram | N-gram token matching (default: trigrams) |
| levenshtein | Edit distance-based similarity |
| hybrid | Weighted combination of all methods (default) |

The similarity score is computed as a weighted blend of content similarity (80%) and metadata similarity (20%).
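
To make the Jaccard index and the 80/20 blend concrete, here is an illustrative sketch. It is not the undms implementation: the tokenizer (lowercased, whitespace-separated words) is an assumption, as is the choice to treat two empty texts as identical, and the function names are hypothetical. Both scores are on the same 0-100 scale as `similarityPercentage`.

```typescript
// Assumed tokenizer: lowercase, split on whitespace, drop empty tokens.
// undms may tokenize differently.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter((t) => t.length > 0));
}

// Jaccard index as a percentage: |A ∩ B| / |A ∪ B| * 100.
function jaccardPercentage(a: string, b: string): number {
  const setA = tokenize(a);
  const setB = tokenize(b);
  // Assumption: two empty texts count as identical.
  if (setA.size === 0 && setB.size === 0) return 100;
  const intersection = Array.from(setA).filter((t) => setB.has(t)).length;
  const union = setA.size + setB.size - intersection;
  return (intersection / union) * 100;
}

// Weighted blend described above: 80% content score, 20% metadata score.
function blendScores(contentScore: number, metadataScore: number): number {
  return 0.8 * contentScore + 0.2 * metadataScore;
}
```

For example, 'a b' and 'b c' share one token out of three distinct ones, so the Jaccard score is one third of 100; a content score of 100 blended with a metadata score of 50 gives 90.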

Error Handling

All functions handle errors gracefully:

  • Extraction errors - Returned in the error field of DocumentMetadata
  • Unsupported formats - Returns empty content with application/octet-stream encoding
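  • Similarity errors - Returns empty matches array

const result = extract(documents);
// Check every document in every group, not only the first one.
for (const group of result) {
  for (const doc of group.documents) {
    if (doc.error) console.error(`Extraction failed for ${doc.name}:`, doc.error);
  }
}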

Troubleshooting

  • OCR is slow - Large images take time; consider resizing before processing
  • Missing GPS data - Not all images contain EXIF location; location object exists but fields may be undefined
  • Empty PDF text - Some PDFs are image-based; OCR is not currently applied to PDFs
  • Unicode handling - All similarity methods support Unicode text
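
Two of the cases above can be checked in code. The sketch below uses hypothetical helpers that are not part of undms: `hasCoordinates` narrows an ImageLocation to one with both fields set, and `looksImageBased` flags a PDF result with pages but no text. The second check assumes `pageCount` is still reported when no text could be extracted, which the README does not confirm.

```typescript
interface ImageLocation {
  latitude?: number;
  longitude?: number;
}

// True only when both coordinates are present. Uses typeof rather than a
// truthiness check so that 0 (equator or prime meridian) still counts.
function hasCoordinates(
  location: ImageLocation,
): location is { latitude: number; longitude: number } {
  return typeof location.latitude === 'number' && typeof location.longitude === 'number';
}

// Trimmed shape of an extracted document: only the fields used below.
interface ExtractedPdfLike {
  content: string;
  error?: string;
  metadata?: { pdf?: { pageCount: number } };
}

// Heuristic: no error, at least one page, but no extracted text.
// Assumes pageCount is populated even when the text layer is empty.
function looksImageBased(doc: ExtractedPdfLike): boolean {
  const pages = doc.metadata?.pdf?.pageCount ?? 0;
  return !doc.error && pages > 0 && doc.content.trim().length === 0;
}
```

A document flagged by `looksImageBased` is a candidate for rendering its pages to images and running them through the image OCR path separately.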

Development

Requirements

  • Rust (latest stable)
  • Node.js 18+
  • pnpm

Build

pnpm build

Test

pnpm test

Benchmark

pnpm bench

License

MIT
