mistral-ocr converts PDFs into Markdown and/or DOCX using mistral-ocr-latest.
The project exposes:
- a
mistral-ocrCLI - a reusable JavaScript/TypeScript API
Use it as an npm package:
npm install mistral-ocrFor local package development:
npm install
npm run buildRequired environment variable:
export MISTRAL_API_KEY=...Standard conversion to Markdown + DOCX with image extraction:
npx mistral-ocr convert ./document.pdfDefault outputs:
./document.md./document.docx./document-images/
Main options:
npx mistral-ocr convert ./document.pdf \
--output-dir ./out \
--markdown ./out/document.md \
--docx ./out/document.docx \
--images-dir ./out/images \
--model mistral-ocr-latestGenerate Markdown only:
npx mistral-ocr convert ./document.pdf --no-docxGenerate DOCX only:
npx mistral-ocr convert ./document.pdf --no-markdownBatch OCR conversion:
npx mistral-ocr batch ./doc-a.pdf ./doc-b.pdf --output-dir ./outBatch mode uses Mistral's Batch Inference endpoint for OCR, waits for the job by default, then writes one Markdown/DOCX pair per input PDF:
./out/doc-a.md./out/doc-a.docx./out/doc-a-images/./out/doc-b.md./out/doc-b.docx./out/doc-b-images/
Submit a batch job without waiting for results:
npx mistral-ocr batch ./doc-a.pdf ./doc-b.pdf --no-waitUseful batch options:
npx mistral-ocr batch ./doc-a.pdf ./doc-b.pdf \
--output-dir ./out \
--poll-interval 10 \
--timeout 1800 \
--no-docximport { convertPdf } from 'mistral-ocr';
const result = await convertPdf('./document.pdf', {
markdownPath: './out/document.md',
docxPath: './out/document.docx',
imageOutputDir: './out/images',
});
console.log(result.markdown);
console.log(result.docxBuffer?.length);Example without writing to disk:
import { convertPdf } from 'mistral-ocr';
const result = await convertPdf('./document.pdf', {
generateDocx: false,
logger: false,
});
console.log(result.markdown);Batch API:
import { convertPdfBatch, createOcrBatch, waitForOcrBatch } from 'mistral-ocr';
const batch = await convertPdfBatch(['./doc-a.pdf', './doc-b.pdf'], {
outputDir: './out',
generateDocx: false,
});
console.log(batch.job.id);
console.log(batch.entries.map((entry) => entry.markdownPath));
const submitted = await createOcrBatch(['./large-a.pdf', './large-b.pdf']);
const finished = await waitForOcrBatch(submitted.job.id);
console.log(finished.status);This library follows the format returned by the Mistral OCR API:
- text is returned as Markdown, page by page
- extracted images are first referenced as placeholders in the OCR Markdown, then remapped to local files when
imageOutputDiris provided - DOCX generation is intentionally lightweight and focuses on headings, paragraphs, and images
Practical implications:
- scanned PDFs, multi-column layouts, tables, figures, and captions are generally handled well by
mistral-ocr-latest - complex tables, equations, or very rich layouts remain most faithful in the raw Markdown produced by the model
- DOCX output does not try to perfectly reconstruct the original Word-style layout; it aims to produce a usable document
Official references:
- API OCR Mistral: https://docs.mistral.ai/capabilities/document_ai/basic_ocr/
- Mistral Batch Inference: https://docs.mistral.ai/capabilities/batch/
- latest public benchmark published for Mistral OCR: https://mistral.ai/news/mistral-ocr-3
convertPdf(input, options)convertPdfBatch(inputs, options)createOcrBatch(inputs, options)waitForOcrBatch(jobId, options)listOcrBatchOutputs(job, options)markdownToDocx(markdown, options)createMistralClient(apiKey?)buildMarkdownFromOcrResponse(ocrResponse, replacements?)extractImagesFromOcrResponse(ocrResponse)writeExtractedImages(images, imageOutputDir, referenceBaseDir?)
npm run build
node build/cli.js --helpPublishing is handled by GitHub Actions through npm Trusted Publishing.
Release flow:
npm version patch
git push origin master --follow-tagsThe publish job only runs for v* tags. Before publishing, CI verifies that:
- the Git tag matches
package.jsonexactly, for examplev0.1.1for version0.1.1 - the package version is not already present on npm
npm run verifypasses
For local testing in this workspace, the Mistral key can be loaded from ../top-ai-ideas-fullstack/.env.
Recommended test PDF:
New York illustrated(Library of Congress, 1878), 122 illustrated pages, public domain- source page: https://www.loc.gov/item/01014750/
- direct PDF: https://tile.loc.gov/storage-services/public/gdcmassbookdig/newyorkillustrat03newy/newyorkillustrat03newy.pdf
Useful commands:
npm run build
mkdir -p .scratch/mistral-ocr-tests
curl -L https://tile.loc.gov/storage-services/public/gdcmassbookdig/newyorkillustrat03newy/newyorkillustrat03newy.pdf -o .scratch/mistral-ocr-tests/new-york-illustrated.pdf
bash -lc 'set -a; source ../top-ai-ideas-fullstack/.env >/dev/null 2>&1; set +a; node build/cli.js convert .scratch/mistral-ocr-tests/new-york-illustrated.pdf --output-dir .scratch/mistral-ocr-tests/new-york-illustrated-out'
bash -lc 'set -a; source ../top-ai-ideas-fullstack/.env >/dev/null 2>&1; set +a; node build/cli.js convert CONTRIBUATION_AI_AERONAUTIQUE.pdf --output-dir .scratch/mistral-ocr-tests/contribution-out'
bash -lc 'set -a; source ../top-ai-ideas-fullstack/.env >/dev/null 2>&1; set +a; node build/cli.js batch CONTRIBUATION_AI_AERONAUTIQUE.pdf .scratch/mistral-ocr-tests/new-york-illustrated.pdf --output-dir .scratch/mistral-ocr-tests/batch-out --no-docx'