High-performance document text and metadata extraction library with similarity comparison, built with napi-rs for Node.js and Bun.
bun add undms
# or
npm install undms

- Multi-format extraction - Text, DOCX, XLSX, PDF, and images
- Similarity comparison - Compare documents against reference texts using multiple algorithms
- Rich metadata - Extract format-specific metadata (EXIF, PDF info, DOCX stats, etc.)
- OCR support - Extract text from images using Tesseract
- Parallel processing - Documents are processed concurrently for performance
- TypeScript support - Full type definitions included
| Format | MIME Type | Features |
|---|---|---|
| Text | `text/*`, `application/json`, `application/xml`, etc. | Content + line/word/character counts |
| DOCX | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Paragraphs, tables, images, hyperlinks |
| XLSX | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | Cell content, sheets, rows, columns |
| PDF | `application/pdf` | Text, title, author, subject, producer, page info |
| Images | `image/jpeg`, `image/png`, `image/gif`, `image/bmp`, `image/tiff`, `image/webp` | OCR text, EXIF data, GPS location |
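The table can be read as a lookup from MIME type to format family. The sketch below is a hypothetical helper (`formatFamily` is not part of undms) that mirrors only the types listed above. How the library itself dispatches on MIME type, and what the table's "etc." covers, is not documented here, so treat this as an illustration for pre-filtering input rather than a description of the internals.

```typescript
// Hypothetical helper, not part of undms: maps a MIME type to the format
// family listed in the supported-formats table.
type FormatFamily = 'text' | 'docx' | 'xlsx' | 'pdf' | 'image' | 'unsupported';

const DOCX_TYPE = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
const XLSX_TYPE = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet';
const IMAGE_TYPES = [
  'image/jpeg', 'image/png', 'image/gif', 'image/bmp', 'image/tiff', 'image/webp',
];
// Assumption: only the text-like types the table names explicitly.
const EXTRA_TEXT_TYPES = ['application/json', 'application/xml'];

function formatFamily(mimeType: string): FormatFamily {
  const mime = mimeType.trim().toLowerCase();
  if (mime === DOCX_TYPE) return 'docx';
  if (mime === XLSX_TYPE) return 'xlsx';
  if (mime === 'application/pdf') return 'pdf';
  if (IMAGE_TYPES.indexOf(mime) !== -1) return 'image';
  if (mime.indexOf('text/') === 0 || EXTRA_TEXT_TYPES.indexOf(mime) !== -1) return 'text';
  return 'unsupported';
}

console.log(formatFamily('text/markdown'));
console.log(formatFamily('video/mp4'));
```

A check like this can be used to skip files before building the input array, although the library reports unsupported formats on its own as well.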
import { extract, computeDocumentSimilarity, computeTextSimilarity } from 'undms';

const documents = [
  {
    name: 'report.txt',
    size: 1024,
    type: 'text/plain',
    lastModified: Date.now(),
    webkitRelativePath: '',
    buffer: Buffer.from('Document content here...'),
  },
];

const result = extract(documents);
console.log(result[0].documents[0].content);

Extracts text and metadata from input documents. Results are grouped by MIME type.
Parameters:
- `documents` - Array of `Document` objects
Returns: GroupedDocuments[]
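Because results come back grouped by MIME type, callers that want one list per input file need to flatten the groups. The following is a minimal sketch of that step. `flattenGroups` and `FlatDocument` are hypothetical names, the types are trimmed copies of the ones in this README, and the sample data is hand-written rather than real `extract` output.

```typescript
// Trimmed copies of this README's types (the optional `metadata` field is omitted).
interface DocumentMetadata {
  name: string;
  size: number;
  processingTime: number;
  encoding: string;
  content: string;
  error?: string;
}

interface GroupedDocuments {
  mimeType: string;
  documents: DocumentMetadata[];
}

// Hypothetical shape: a document with its group's MIME type copied onto it.
interface FlatDocument extends DocumentMetadata {
  mimeType: string;
}

// Hypothetical helper: turns the grouped result into one flat list.
function flattenGroups(groups: GroupedDocuments[]): FlatDocument[] {
  const flat: FlatDocument[] = [];
  for (const group of groups) {
    for (const doc of group.documents) {
      flat.push({ ...doc, mimeType: group.mimeType });
    }
  }
  return flat;
}

// Hand-written stand-in for a grouped result; all values are made up.
const sampleGroups: GroupedDocuments[] = [
  {
    mimeType: 'text/plain',
    documents: [
      { name: 'report.txt', size: 24, processingTime: 1, encoding: 'utf-8', content: 'Document content here...' },
    ],
  },
  {
    mimeType: 'application/pdf',
    documents: [
      { name: 'document.pdf', size: 1024, processingTime: 12, encoding: 'utf-8', content: '' },
    ],
  },
];

for (const doc of flattenGroups(sampleGroups)) {
  console.log(`${doc.name} (${doc.mimeType}): ${doc.content.length} characters`);
}
```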
const result = extract([
  {
    name: 'document.pdf',
    size: 1024,
    type: 'application/pdf',
    lastModified: Date.now(),
    webkitRelativePath: '',
    buffer: Buffer.from(pdfData),
  },
]);

Extracts documents and computes similarity against reference texts.
Parameters:
- `documents` - Array of `Document` objects
- `referenceTexts` - Candidate reference texts to compare against
- `threshold` - Minimum score (0-100) to include a match (default: 30)
- `method` - Similarity algorithm: `'jaccard'`, `'ngram'`, `'levenshtein'`, or `'hybrid'` (default)
Returns: GroupedDocumentsWithSimilarity[]
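Each returned document carries its own `similarityMatches` array, so a common follow-up is to pick the strongest match per document. The sketch below shows one way to do that. `bestMatchPerDocument` and `BestMatch` are hypothetical names, the types are trimmed to the fields used, and the sample values are invented.

```typescript
interface SimilarityMatch {
  referenceIndex: number;
  similarityPercentage: number;
}

// Trimmed to the fields this sketch reads.
interface DocumentWithMatches {
  name: string;
  similarityMatches: SimilarityMatch[];
}

interface GroupWithMatches {
  mimeType: string;
  documents: DocumentWithMatches[];
}

// Hypothetical result row: `best` is null when no match passed the threshold.
interface BestMatch {
  name: string;
  mimeType: string;
  best: SimilarityMatch | null;
}

function bestMatchPerDocument(groups: GroupWithMatches[]): BestMatch[] {
  const rows: BestMatch[] = [];
  for (const group of groups) {
    for (const doc of group.documents) {
      let best: SimilarityMatch | null = null;
      for (const match of doc.similarityMatches) {
        if (best === null || match.similarityPercentage > best.similarityPercentage) {
          best = match;
        }
      }
      rows.push({ name: doc.name, mimeType: group.mimeType, best });
    }
  }
  return rows;
}

// Hand-written stand-in for a similarity result; the scores are made up.
const sampleSimilarity: GroupWithMatches[] = [
  {
    mimeType: 'text/plain',
    documents: [
      {
        name: 'report.txt',
        similarityMatches: [
          { referenceIndex: 0, similarityPercentage: 72.5 },
          { referenceIndex: 1, similarityPercentage: 88 },
        ],
      },
      { name: 'notes.txt', similarityMatches: [] },
    ],
  },
];

for (const row of bestMatchPerDocument(sampleSimilarity)) {
  console.log(row.name, row.best ? row.best.similarityPercentage : 'no match');
}
```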
const result = computeDocumentSimilarity(
  documents,
  ['reference text A', 'reference text B'],
  70,
  'hybrid',
);

Computes similarity for plain text without file extraction.
Parameters:
- `sourceText` - Source text to compare
- `referenceTexts` - Candidate reference texts
- `threshold` - Minimum score (0-100) to include a match (default: 30)
- `method` - Similarity algorithm (default: `'hybrid'`)
Returns: SimilarityMatch[]
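A `SimilarityMatch` identifies its reference only by `referenceIndex`. The sketch below maps matches back to the reference strings and sorts them by score. It assumes `referenceIndex` is a zero-based index into the `referenceTexts` array that was passed in, which this README does not state explicitly. `resolveMatches` and `ResolvedMatch` are hypothetical names and the sample scores are invented.

```typescript
interface SimilarityMatch {
  referenceIndex: number;
  similarityPercentage: number;
}

interface ResolvedMatch {
  referenceText: string;
  similarityPercentage: number;
}

// Assumption: referenceIndex is a zero-based position in referenceTexts.
// Out-of-range indexes are dropped instead of producing undefined text.
function resolveMatches(matches: SimilarityMatch[], referenceTexts: string[]): ResolvedMatch[] {
  return matches
    .filter((m) => m.referenceIndex >= 0 && m.referenceIndex < referenceTexts.length)
    .map((m) => ({
      referenceText: referenceTexts[m.referenceIndex],
      similarityPercentage: m.similarityPercentage,
    }))
    .sort((a, b) => b.similarityPercentage - a.similarityPercentage);
}

const sampleReferences = ['alpha beta gamma', 'different text'];
// Hand-written matches; not real library output.
const sampleMatches: SimilarityMatch[] = [
  { referenceIndex: 1, similarityPercentage: 42.5 },
  { referenceIndex: 0, similarityPercentage: 91 },
];

for (const match of resolveMatches(sampleMatches, sampleReferences)) {
  console.log(`${match.similarityPercentage}% -> "${match.referenceText}"`);
}
```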
const matches = computeTextSimilarity(
  'alpha beta gamma',
  ['alpha beta gamma', 'different text'],
  80,
  'jaccard',
);

Input document interface.
interface Document {
  name: string;
  size: number;
  type: string; // MIME type
  lastModified: number;
  webkitRelativePath: string;
  buffer: Buffer;
}

Extracted document result.
interface DocumentMetadata {
  name: string;
  size: number;
  processingTime: number;
  encoding: string;
  content: string;
  metadata?: MetadataPayload;
  error?: string;
}

Documents grouped by MIME type.
interface GroupedDocuments {
  mimeType: string;
  documents: DocumentMetadata[];
}

Grouped documents with similarity matches.
interface GroupedDocumentsWithSimilarity {
  mimeType: string;
  documents: DocumentMetadataWithSimilarity[];
}

Document metadata with similarity results.
interface DocumentMetadataWithSimilarity {
  name: string;
  size: number;
  processingTime: number;
  encoding: string;
  content: string;
  metadata?: MetadataPayload;
  error?: string;
  similarityMatches: SimilarityMatch[];
}

Similarity comparison result.
interface SimilarityMatch {
  referenceIndex: number;
  similarityPercentage: number;
}

Complete metadata payload with format-specific fields.
interface MetadataPayload {
  text?: TextMetadata;
  docx?: DocxMetadata;
  xlsx?: XlsxMetadata;
  pdf?: PdfMetadata;
  image?: ImageMetadata;
}

Text content statistics.
interface TextMetadata {
  lineCount: number;
  wordCount: number;
  characterCount: number;
  nonWhitespaceCharacterCount: number;
}

DOCX document statistics.
interface DocxMetadata {
  paragraphCount: number;
  tableCount: number;
  imageCount: number;
  hyperlinkCount: number;
}

XLSX spreadsheet statistics.
interface XlsxMetadata {
  sheetCount: number;
  sheetNames: string[];
  rowCount: number;
  columnCount: number;
  cellCount: number;
}

PDF document information.
interface PdfMetadata {
  title?: string;
  author?: string;
  subject?: string;
  producer?: string;
  pageSize?: PdfPageSize;
  pageCount: number;
}

interface PdfPageSize {
  width: number;
  height: number;
}

Image file information.
interface ImageMetadata {
  width: number;
  height: number;
  format?: string;
  cameraMake?: string;
  cameraModel?: string;
  datetimeOriginal?: string;
  location: ImageLocation;
}

interface ImageLocation {
  latitude?: number;
  longitude?: number;
}

| Method | Description |
|---|---|
| `jaccard` | Set-based similarity using Jaccard index |
| `ngram` | N-gram token matching (default: trigrams) |
| `levenshtein` | Edit distance-based similarity |
| `hybrid` | Weighted combination of all methods (default) |
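To make the three base methods concrete, here is a textbook-style sketch of each, scaled to 0-100. These are not the library's implementations. Tokenization on whitespace, lowercasing, character-level trigrams compared with a Jaccard ratio, and normalizing edit distance by the longer string's length are all assumptions chosen for illustration.

```typescript
// Jaccard ratio of two sets as a percentage: |A ∩ B| / |A ∪ B| * 100.
function setOverlapPercent(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 100; // two empty inputs count as identical
  let shared = 0;
  a.forEach((item) => {
    if (b.has(item)) shared++;
  });
  return (shared / (a.size + b.size - shared)) * 100;
}

// Assumption: words are lowercase, whitespace-separated tokens.
function wordSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter((t) => t.length > 0));
}

function jaccardSimilarity(a: string, b: string): number {
  return setOverlapPercent(wordSet(a), wordSet(b));
}

// Assumption: n-grams are taken over characters (code points), not words.
function charNgrams(text: string, n: number): Set<string> {
  const chars = Array.from(text.toLowerCase());
  const grams = new Set<string>();
  // Inputs shorter than n contribute themselves as a single gram.
  if (chars.length > 0 && chars.length < n) grams.add(chars.join(''));
  for (let i = 0; i + n <= chars.length; i++) {
    grams.add(chars.slice(i, i + n).join(''));
  }
  return grams;
}

function ngramSimilarity(a: string, b: string, n = 3): number {
  return setOverlapPercent(charNgrams(a, n), charNgrams(b, n));
}

// Edit distance via the two-row dynamic-programming table,
// converted to a percentage of the longer input's length.
function levenshteinSimilarity(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  const longest = Math.max(s.length, t.length);
  if (longest === 0) return 100;
  let prev: number[] = [];
  for (let j = 0; j <= t.length; j++) prev.push(j);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return (1 - prev[t.length] / longest) * 100;
}

console.log(jaccardSimilarity('alpha beta gamma', 'alpha beta gamma'));
console.log(ngramSimilarity('abcd', 'abce').toFixed(2));
console.log(levenshteinSimilarity('kitten', 'sitting').toFixed(2));
```

The set-based methods ignore word order and repetition, while edit distance is sensitive to both, which is one reason to offer a combined method.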
The similarity score is computed as a weighted blend of content similarity (80%) and metadata similarity (20%).
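As a worked example of the 80/20 blend: a content similarity of 90 and a metadata similarity of 50 give 0.8 × 90 + 0.2 × 50 = 82. The sketch below encodes that arithmetic. `blendScore` and `meetsThreshold` are hypothetical names; it assumes both inputs are on the 0-100 scale and that the threshold comparison is inclusive, neither of which is spelled out above. How metadata similarity itself is measured is not described in this README.

```typescript
// Weights taken from the sentence above.
const CONTENT_WEIGHT = 0.8;
const METADATA_WEIGHT = 0.2;

// Assumption: both scores are percentages in the range 0-100.
function blendScore(contentScore: number, metadataScore: number): number {
  return CONTENT_WEIGHT * contentScore + METADATA_WEIGHT * metadataScore;
}

// Assumption: a score equal to the threshold is kept (inclusive comparison).
function meetsThreshold(score: number, threshold = 30): boolean {
  return score >= threshold;
}

const blended = blendScore(90, 50);
console.log(blended.toFixed(1), meetsThreshold(blended, 70));
```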
All functions handle errors gracefully:
- Extraction errors - Returned in the `error` field of `DocumentMetadata`
- Unsupported formats - Returns empty content with `application/octet-stream` encoding
- Similarity errors - Returns empty matches array
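Since failures are reported per document rather than for the whole call, it can help to split a result into successes and failures before using it. The following is a minimal sketch of that pattern. `partitionByError` is a hypothetical helper, the types are trimmed to the fields used, and the sample error message is invented.

```typescript
// Trimmed to the fields this sketch reads.
interface ExtractedDocument {
  name: string;
  content: string;
  error?: string;
}

interface ExtractedGroup {
  mimeType: string;
  documents: ExtractedDocument[];
}

interface Partitioned {
  succeeded: ExtractedDocument[];
  failed: ExtractedDocument[];
}

// A document counts as failed when its `error` field is a non-empty string.
function partitionByError(groups: ExtractedGroup[]): Partitioned {
  const outcome: Partitioned = { succeeded: [], failed: [] };
  for (const group of groups) {
    for (const doc of group.documents) {
      (doc.error ? outcome.failed : outcome.succeeded).push(doc);
    }
  }
  return outcome;
}

// Hand-written stand-in for an extraction result; the error text is made up.
const sampleOutcome: ExtractedGroup[] = [
  {
    mimeType: 'application/pdf',
    documents: [
      { name: 'ok.pdf', content: 'Quarterly report' },
      { name: 'broken.pdf', content: '', error: 'could not parse file' },
    ],
  },
];

const { succeeded, failed } = partitionByError(sampleOutcome);
console.log(`${succeeded.length} extracted, ${failed.length} failed`);
for (const doc of failed) {
  console.error(`${doc.name}: ${doc.error}`);
}
```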
const result = extract(documents);
if (result[0].documents[0].error) {
  console.error('Extraction failed:', result[0].documents[0].error);
}

- OCR is slow - Large images take time; consider resizing before processing
- Missing GPS data - Not all images contain EXIF location; the `location` object exists but its fields may be undefined
- Empty PDF text - Some PDFs are image-based; OCR is not currently applied to PDFs
- Unicode handling - All similarity methods support Unicode text
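The GPS note above means code reading `location` should check both fields before using them. The sketch below shows one defensive way to do so. `formatLocation` is a hypothetical helper, and it assumes the coordinates are signed decimal degrees, which this README does not confirm.

```typescript
interface ImageLocation {
  latitude?: number;
  longitude?: number;
}

// Returns a readable coordinate pair, or null when either field is missing.
// Assumption: values are signed decimal degrees (negative = south / west).
function formatLocation(location: ImageLocation): string | null {
  const { latitude, longitude } = location;
  if (latitude == null || longitude == null) return null;
  const northSouth = latitude >= 0 ? 'N' : 'S';
  const eastWest = longitude >= 0 ? 'E' : 'W';
  return `${Math.abs(latitude).toFixed(4)}° ${northSouth}, ${Math.abs(longitude).toFixed(4)}° ${eastWest}`;
}

console.log(formatLocation({ latitude: 48.8584, longitude: 2.2945 }));
console.log(formatLocation({}));
```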
- Rust (latest stable)
- Node.js 18+
- pnpm
pnpm build
pnpm test
pnpm bench

License: MIT
