UNDMS

High-performance document text and metadata extraction library with similarity comparison, built with napi-rs for Node.js and Bun.

Installation

bun add undms
# or
npm install undms

Features

Multi-format extraction - Text, DOCX, XLSX, PDF, and images
Similarity comparison - Compare documents against reference texts using multiple algorithms
Rich metadata - Extract format-specific metadata (EXIF, PDF info, DOCX stats, etc.)
OCR support - Extract text from images using Tesseract
Parallel processing - Documents are processed concurrently for performance
TypeScript support - Full type definitions included

Supported Formats

Format	MIME Type	Features
Text	`text/*`, `application/json`, `application/xml`, etc.	Content + line/word/character counts
DOCX	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	Paragraphs, tables, images, hyperlinks
XLSX	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`	Cell content, sheets, rows, columns
PDF	`application/pdf`	Text, title, author, subject, producer, page info
Images	`image/jpeg`, `image/png`, `image/gif`, `image/bmp`, `image/tiff`, `image/webp`	OCR text, EXIF data, GPS location

Quick Start

import { extract, computeDocumentSimilarity, computeTextSimilarity } from 'undms';

const documents = [
  {
    name: 'report.txt',
    size: 1024,
    type: 'text/plain',
    lastModified: Date.now(),
    webkitRelativePath: '',
    buffer: Buffer.from('Document content here...'),
  },
];

const result = extract(documents);
console.log(result[0].documents[0].content);
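
The Quick Start fills in every Document field by hand. A small helper can derive `size` from the buffer so the two cannot drift apart. This is only a sketch: `toDocument` and `InputDocument` are hypothetical names that are not part of undms, and the interface is a local copy of the Document shape from the Type Definitions section, renamed so it does not clash with the DOM's global `Document` type.

```typescript
// Local mirror of the README's Document interface (renamed for this sketch).
interface InputDocument {
  name: string;
  size: number;
  type: string; // MIME type
  lastModified: number;
  webkitRelativePath: string;
  buffer: Buffer;
}

// Hypothetical helper, not part of undms: builds an input document from in-memory data.
function toDocument(name: string, type: string, data: Buffer | string): InputDocument {
  const buffer = typeof data === 'string' ? Buffer.from(data, 'utf8') : data;
  return {
    name,
    size: buffer.byteLength, // size in bytes, taken from the actual payload
    type,
    lastModified: Date.now(),
    webkitRelativePath: '',
    buffer,
  };
}

const sampleDoc = toDocument('report.txt', 'text/plain', 'Document content here...');
console.log(sampleDoc.name, sampleDoc.size);
```

For files on disk, the `data` argument could be the Buffer returned by Node's `fs.readFileSync(path)`.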

API Reference

`extract(documents)`

Extracts text and metadata from input documents. Results are grouped by MIME type.

Parameters:

documents - Array of Document objects

Returns: GroupedDocuments[]

const result = extract([
  {
    name: 'document.pdf',
    size: 1024,
    type: 'application/pdf',
    lastModified: Date.now(),
    webkitRelativePath: '',
    buffer: Buffer.from(pdfData), // pdfData: the raw PDF bytes, loaded by the caller
  },
]);
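
Because results are grouped by MIME type, callers that want one flat list have to unwrap the groups first. The sketch below relies only on the `GroupedDocuments` shape from the Type Definitions section. `flattenGroups`, the trimmed-down local interfaces, and the sample data are all made up for illustration and are not part of undms.

```typescript
// Minimal local mirrors of DocumentMetadata and GroupedDocuments:
// only the fields this sketch reads.
interface ExtractedDoc {
  name: string;
  content: string;
  error?: string;
}
interface DocGroup {
  mimeType: string;
  documents: ExtractedDoc[];
}

// Hypothetical helper: one flat entry per document, tagged with its group's MIME type.
function flattenGroups(groups: DocGroup[]): Array<ExtractedDoc & { mimeType: string }> {
  const flat: Array<ExtractedDoc & { mimeType: string }> = [];
  for (const group of groups) {
    for (const doc of group.documents) {
      flat.push({ ...doc, mimeType: group.mimeType });
    }
  }
  return flat;
}

// Made-up data in the shape extract() is documented to return.
const sampleGroups: DocGroup[] = [
  {
    mimeType: 'text/plain',
    documents: [
      { name: 'a.txt', content: 'alpha' },
      { name: 'b.txt', content: 'beta' },
    ],
  },
  { mimeType: 'application/pdf', documents: [{ name: 'c.pdf', content: 'gamma' }] },
];

for (const entry of flattenGroups(sampleGroups)) {
  console.log(`${entry.mimeType}\t${entry.name}`);
}
```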

`computeDocumentSimilarity(documents, referenceTexts, threshold?, method?)`

Extracts text from the documents, then scores each document against the reference texts.

Parameters:

documents - Array of Document objects
referenceTexts - Candidate reference texts to compare against
threshold - Minimum score (0-100) to include a match (default: 30)
method - Similarity algorithm: 'jaccard', 'ngram', 'levenshtein', or 'hybrid' (default)

Returns: GroupedDocumentsWithSimilarity[]

const result = computeDocumentSimilarity(
  documents,
  ['reference text A', 'reference text B'],
  70,
  'hybrid',
);
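
Each returned document carries its own `similarityMatches` array, so a common follow-up is to pick the strongest match per document. The sketch below assumes only the `GroupedDocumentsWithSimilarity` shape from the Type Definitions section. `bestMatchPerDocument`, the local interfaces, and the sample scores are hypothetical and not part of undms.

```typescript
// Minimal local mirrors of the documented result types.
interface Match {
  referenceIndex: number;
  similarityPercentage: number;
}
interface DocWithMatches {
  name: string;
  similarityMatches: Match[];
}
interface GroupWithMatches {
  mimeType: string;
  documents: DocWithMatches[];
}

// Hypothetical helper: the highest-scoring match for each document,
// or null when no reference text passed the threshold.
function bestMatchPerDocument(
  groups: GroupWithMatches[],
): Array<{ name: string; best: Match | null }> {
  const out: Array<{ name: string; best: Match | null }> = [];
  for (const group of groups) {
    for (const doc of group.documents) {
      let best: Match | null = null;
      for (const m of doc.similarityMatches) {
        if (best === null || m.similarityPercentage > best.similarityPercentage) {
          best = m;
        }
      }
      out.push({ name: doc.name, best });
    }
  }
  return out;
}

// Made-up result data for illustration.
const sampleSimilarity: GroupWithMatches[] = [
  {
    mimeType: 'text/plain',
    documents: [
      {
        name: 'a.txt',
        similarityMatches: [
          { referenceIndex: 0, similarityPercentage: 72.5 },
          { referenceIndex: 1, similarityPercentage: 91 },
        ],
      },
      { name: 'b.txt', similarityMatches: [] },
    ],
  },
];

console.log(bestMatchPerDocument(sampleSimilarity));
```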

`computeTextSimilarity(sourceText, referenceTexts, threshold?, method?)`

Computes similarity for plain text without file extraction.

Parameters:

sourceText - Source text to compare
referenceTexts - Candidate reference texts
threshold - Minimum score (0-100) to include a match (default: 30)
method - Similarity algorithm (default: 'hybrid')

Returns: SimilarityMatch[]

const matches = computeTextSimilarity(
  'alpha beta gamma',
  ['alpha beta gamma', 'different text'],
  80,
  'jaccard',
);
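
A `SimilarityMatch` identifies its reference text only by `referenceIndex`, so the caller has to map it back to the text. The sketch below assumes `referenceIndex` is a zero-based index into the `referenceTexts` array that was passed in. That reading follows from the field name but is not stated outright above. `rankMatches` and the sample scores are hypothetical.

```typescript
// Local mirror of the documented SimilarityMatch interface.
interface SimilarityMatch {
  referenceIndex: number;
  similarityPercentage: number;
}

// Hypothetical helper: pairs each match with its reference text, best score first.
// Assumption: referenceIndex is a zero-based index into referenceTexts.
function rankMatches(
  matches: SimilarityMatch[],
  referenceTexts: string[],
): Array<{ text: string; score: number }> {
  return matches
    .filter((m) => m.referenceIndex >= 0 && m.referenceIndex < referenceTexts.length)
    .map((m) => ({ text: referenceTexts[m.referenceIndex], score: m.similarityPercentage }))
    .sort((a, b) => b.score - a.score);
}

const sampleReferences = ['first reference', 'second reference', 'third reference'];
// Made-up matches in the shape computeTextSimilarity() is documented to return.
const sampleMatches: SimilarityMatch[] = [
  { referenceIndex: 2, similarityPercentage: 82 },
  { referenceIndex: 0, similarityPercentage: 95 },
];

console.log(rankMatches(sampleMatches, sampleReferences));
```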

Type Definitions

Document

Input document interface.

interface Document {
  name: string;
  size: number;
  type: string; // MIME type
  lastModified: number;
  webkitRelativePath: string;
  buffer: Buffer;
}

DocumentMetadata

Extracted document result.

interface DocumentMetadata {
  name: string;
  size: number;
  processingTime: number;
  encoding: string;
  content: string;
  metadata?: MetadataPayload;
  error?: string;
}

GroupedDocuments

Documents grouped by MIME type.

interface GroupedDocuments {
  mimeType: string;
  documents: DocumentMetadata[];
}

GroupedDocumentsWithSimilarity

Grouped documents with similarity matches.

interface GroupedDocumentsWithSimilarity {
  mimeType: string;
  documents: DocumentMetadataWithSimilarity[];
}

DocumentMetadataWithSimilarity

Document metadata with similarity results.

interface DocumentMetadataWithSimilarity {
  name: string;
  size: number;
  processingTime: number;
  encoding: string;
  content: string;
  metadata?: MetadataPayload;
  error?: string;
  similarityMatches: SimilarityMatch[];
}

SimilarityMatch

Similarity comparison result.

interface SimilarityMatch {
  referenceIndex: number;
  similarityPercentage: number;
}

MetadataPayload

Complete metadata payload with format-specific fields.

interface MetadataPayload {
  text?: TextMetadata;
  docx?: DocxMetadata;
  xlsx?: XlsxMetadata;
  pdf?: PdfMetadata;
  image?: ImageMetadata;
}
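
Since every field of `MetadataPayload` is optional, consumers need to check which format-specific block is present before reading it. The sketch below shows one way to do that. `summarizeMetadata` and `PayloadLike` are hypothetical names, the local type mirrors only the fields it reads, and the assumption that a document populates just the block matching its format is an inference from the type, not something stated above.

```typescript
// Minimal local mirror of MetadataPayload: only the fields this sketch reads.
interface PayloadLike {
  text?: { wordCount: number };
  docx?: { paragraphCount: number };
  xlsx?: { sheetCount: number };
  pdf?: { pageCount: number };
  image?: { width: number; height: number };
}

// Hypothetical helper: a one-line summary built from whichever block is present.
function summarizeMetadata(payload: PayloadLike | undefined): string {
  if (!payload) return 'no metadata';
  if (payload.text) return `text: ${payload.text.wordCount} words`;
  if (payload.docx) return `docx: ${payload.docx.paragraphCount} paragraphs`;
  if (payload.xlsx) return `xlsx: ${payload.xlsx.sheetCount} sheets`;
  if (payload.pdf) return `pdf: ${payload.pdf.pageCount} pages`;
  if (payload.image) return `image: ${payload.image.width}x${payload.image.height}`;
  return 'no metadata';
}

console.log(summarizeMetadata({ pdf: { pageCount: 3 } }));
console.log(summarizeMetadata(undefined));
```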

TextMetadata

Text content statistics.

interface TextMetadata {
  lineCount: number;
  wordCount: number;
  characterCount: number;
  nonWhitespaceCharacterCount: number;
}

DocxMetadata

DOCX document statistics.

interface DocxMetadata {
  paragraphCount: number;
  tableCount: number;
  imageCount: number;
  hyperlinkCount: number;
}

XlsxMetadata

XLSX spreadsheet statistics.

interface XlsxMetadata {
  sheetCount: number;
  sheetNames: string[];
  rowCount: number;
  columnCount: number;
  cellCount: number;
}

PdfMetadata

PDF document information.

interface PdfMetadata {
  title?: string;
  author?: string;
  subject?: string;
  producer?: string;
  pageSize?: PdfPageSize;
  pageCount: number;
}

interface PdfPageSize {
  width: number;
  height: number;
}

ImageMetadata

Image file information.

interface ImageMetadata {
  width: number;
  height: number;
  format?: string;
  cameraMake?: string;
  cameraModel?: string;
  datetimeOriginal?: string;
  location: ImageLocation;
}

interface ImageLocation {
  latitude?: number;
  longitude?: number;
}

Similarity Methods

Method	Description
`jaccard`	Set-based similarity using Jaccard index
`ngram`	N-gram token matching (default: trigrams)
`levenshtein`	Edit distance-based similarity
`hybrid`	Weighted combination of all methods (default)

The similarity score is computed as a weighted blend of content similarity (80%) and metadata similarity (20%).
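
To make the Jaccard method and the 80/20 blend concrete, here is an illustrative TypeScript sketch. It is not the library's implementation: undms computes its scores in native code, and its tokenization and its way of scoring metadata are not described here. The whitespace tokenizer, the handling of two empty texts, and the function names are all assumptions made for this example.

```typescript
// Assumption: tokens are lowercase, whitespace-separated words.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter((t) => t.length > 0));
}

// Jaccard index as a percentage: |A ∩ B| / |A ∪ B| * 100.
function jaccardPercent(a: string, b: string): number {
  const setA = tokenize(a);
  const setB = tokenize(b);
  // Two empty texts are treated as identical in this sketch.
  if (setA.size === 0 && setB.size === 0) return 100;
  let shared = 0;
  setA.forEach((token) => {
    if (setB.has(token)) shared += 1;
  });
  const union = setA.size + setB.size - shared;
  return (shared / union) * 100;
}

// Weighted blend as described above: 80% content, 20% metadata.
// Both inputs are percentages in the range 0-100.
function blendScores(contentScore: number, metadataScore: number): number {
  return 0.8 * contentScore + 0.2 * metadataScore;
}

console.log(jaccardPercent('alpha beta gamma', 'alpha beta gamma')); // identical token sets
console.log(jaccardPercent('alpha beta', 'beta gamma')); // 1 shared token out of 3
console.log(blendScores(100, 0));
```

With this tokenizer, word order and repetition do not affect the Jaccard score, which is the main practical difference from an edit-distance method such as `levenshtein`.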

Error Handling

Errors are reported through return values:

Extraction errors - Returned in the error field of DocumentMetadata
Unsupported formats - Returned with empty content and the encoding field set to application/octet-stream
Similarity errors - Returned as an empty matches array

const result = extract(documents);
if (result[0].documents[0].error) {
  console.error('Extraction failed:', result[0].documents[0].error);
}
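
The check above looks only at the first document of the first group. For batches, it may be more convenient to split every result into successes and failures. The sketch below assumes that a non-empty `error` string marks a failed extraction. `partitionByError`, the local interfaces, and the sample data are hypothetical and not part of undms.

```typescript
// Minimal local mirrors of DocumentMetadata and GroupedDocuments.
interface ExtractedDoc {
  name: string;
  content: string;
  error?: string;
}
interface DocGroup {
  mimeType: string;
  documents: ExtractedDoc[];
}

// Hypothetical helper: separates documents that extracted cleanly from those with an error.
function partitionByError(groups: DocGroup[]): { ok: ExtractedDoc[]; failed: ExtractedDoc[] } {
  const ok: ExtractedDoc[] = [];
  const failed: ExtractedDoc[] = [];
  for (const group of groups) {
    for (const doc of group.documents) {
      // Assumption: a non-empty error string means extraction failed.
      (doc.error ? failed : ok).push(doc);
    }
  }
  return { ok, failed };
}

// Made-up data in the shape extract() is documented to return.
const sampleBatch: DocGroup[] = [
  { mimeType: 'text/plain', documents: [{ name: 'a.txt', content: 'alpha' }] },
  {
    mimeType: 'application/pdf',
    documents: [{ name: 'broken.pdf', content: '', error: 'example error message' }],
  },
];

const outcome = partitionByError(sampleBatch);
for (const doc of outcome.failed) {
  console.error(`Extraction failed for ${doc.name}:`, doc.error);
}
```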

Troubleshooting

OCR is slow - Large images take time; consider resizing before processing
Missing GPS data - Not all images contain EXIF location; location object exists but fields may be undefined
Empty PDF text - Some PDFs are image-based; OCR is not currently applied to PDFs
Unicode handling - All similarity methods support Unicode text
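
For the missing GPS data case, a guard like the following keeps callers from treating a partially filled location as usable. It is a sketch against the `ImageLocation` shape from the Type Definitions section. `readCoordinates` and `ImageLocationLike` are hypothetical names, and the `null` check is a defensive assumption in case the native binding returns `null` instead of `undefined`.

```typescript
// Local mirror of the documented ImageLocation interface.
interface ImageLocationLike {
  latitude?: number;
  longitude?: number;
}

// Hypothetical helper: returns both coordinates, or null if either one is missing.
function readCoordinates(
  location: ImageLocationLike,
): { latitude: number; longitude: number } | null {
  const { latitude, longitude } = location;
  // Use == null (covers undefined and null) rather than a truthiness test:
  // 0 is a valid latitude or longitude.
  if (latitude == null || longitude == null) return null;
  return { latitude, longitude };
}

console.log(readCoordinates({})); // no EXIF location
console.log(readCoordinates({ latitude: 0, longitude: 0 })); // valid point on the equator
```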

Development

Requirements

Rust (latest stable)
Node.js 18+
pnpm

Build

pnpm build

Test

pnpm test

Benchmark

pnpm bench

License

MIT
