A collection of utilities for working with PDF files. Designed specifically for Deno, workers and other nodeless environments. However, it also works in Node.js and the browser.
unpdf ships with a redistribution of Mozilla's PDF.js built for serverless environments. Apart from some string replacements and mocks, unenv does the heavy lifting by converting Node.js-specific code to be platform-agnostic. See pdfjs.rollup.config.ts for all the details.
This library is also intended as a modern alternative to the unmaintained but still popular pdf-parse.
- Works in Node.js, browser and workers
- Includes serverless build of PDF.js (unpdf/pdfjs)
- Extract text and images from PDF files
- Opt-in to legacy PDF.js build
- Zero dependencies
The serverless build of PDF.js provided by unpdf is based on PDF.js v4.10.38.
You can use an official PDF.js build by using the definePDFJSModule method. This is useful if you want to use a specific version or a custom build of PDF.js.
Run the following command to add unpdf to your project.
# pnpm
pnpm add -D unpdf
# npm
npm install -D unpdf
# yarn
yarn add -D unpdf
import { extractText, getDocumentProxy } from 'unpdf'
// Either fetch a PDF file from the web...
const buffer = await fetch('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf')
  .then(res => res.arrayBuffer())
// ...or load it from the file system (requires `import { readFile } from 'node:fs/promises'`)
// const buffer = await readFile('./dummy.pdf')
// Then, load the PDF file into a PDF.js document
const pdf = await getDocumentProxy(new Uint8Array(buffer))
// Finally, extract the text from the PDF file
const { totalPages, text } = await extractText(pdf, { mergePages: true })
console.log(`Total pages: ${totalPages}`)
console.log(text)
unpdf provides helpful methods to work with PDF files, such as extractText and extractImages, which should cover most use cases. However, if you need more control over the PDF.js API, you can use the getResolvedPDFJS method to get the resolved PDF.js module.
Access the PDF.js API directly by calling getResolvedPDFJS:
import { getResolvedPDFJS } from 'unpdf'
const { version } = await getResolvedPDFJS()
Note
If no other PDF.js build was defined, the serverless build will always be used.
For example, you can use the getDocument method to load a PDF file and then use the getPage method to get a specific page. You can also use the getTextContent method to extract the text from the page.
import { getResolvedPDFJS } from 'unpdf'
const { getDocument } = await getResolvedPDFJS()
const data = Deno.readFileSync('dummy.pdf')
const doc = await getDocument(data).promise
console.log(await doc.getMetadata())
for (let i = 1; i <= doc.numPages; i++) {
const page = await doc.getPage(i)
const textContent = await page.getTextContent()
const contents = textContent.items.map(item => item.str).join(' ')
console.log(contents)
}
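In PDF.js, the items returned by getTextContent can also include marked-content entries without a str property when marked content is requested, and text items carry a hasEOL flag. The following is a minimal sketch, not part of unpdf, of how the item.str mapping above could be made type-safe and line-aware. The TextItemLike and MarkedContentLike types and the joinTextItems helper are hypothetical names that model only the fields used here.

```typescript
// Minimal local stand-ins for the PDF.js text content item shapes.
// Assumption: only the fields used below are modeled.
interface TextItemLike {
  str: string
  hasEOL: boolean
}

interface MarkedContentLike {
  type: string
}

type ContentItem = TextItemLike | MarkedContentLike

// Type guard: only real text items have a `str` property.
function isTextItem(item: ContentItem): item is TextItemLike {
  return 'str' in item
}

// Hypothetical helper: joins text items with spaces and starts a new line
// whenever an item is flagged as ending a line.
function joinTextItems(items: ContentItem[]): string {
  const lines: string[] = []
  let parts: string[] = []
  for (const item of items) {
    if (!isTextItem(item)) continue
    parts.push(item.str)
    if (item.hasEOL) {
      lines.push(parts.join(' '))
      parts = []
    }
  }
  if (parts.length > 0) lines.push(parts.join(' '))
  return lines.join('\n')
}
```

In the loop above, joinTextItems(textContent.items) could stand in for the map and join call if line breaks matter for your use case.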
Usually you don't need to worry about the PDF.js build. unpdf ships with a serverless build of the latest PDF.js version. However, if you want to use the official PDF.js version or the legacy build, you can define a custom PDF.js module.
Warning
Later PDF.js v4.x versions use Promise.withResolvers, which may not be supported in all environments, such as Node < 22. Consider using the bundled serverless build, which includes a polyfill, or an older version of PDF.js.
For example, if you want to use the official PDF.js build, you can do the following:
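import { definePDFJSModule, extractText, getDocumentProxy } from 'unpdf'
// Define the PDF.js build before using any other unpdf method
await definePDFJSModule(() => import('pdfjs-dist'))
// Now, you can use all unpdf methods with the official PDF.js build
// (`buffer` holds the PDF bytes, loaded as in the usage example above)
const pdf = await getDocumentProxy(new Uint8Array(buffer))
const { text } = await extractText(pdf)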
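Because of the Promise.withResolvers caveat noted in the warning, it can help to detect support at runtime before opting into the official build. The following is a minimal sketch under that assumption. The PdfjsBuild labels and both helper names are hypothetical and not part of unpdf.

```typescript
// Labels for the two build choices discussed above (hypothetical names).
type PdfjsBuild = 'official' | 'serverless'

// Returns true when the runtime provides Promise.withResolvers.
function hasPromiseWithResolvers(): boolean {
  const ctor = Promise as unknown as { withResolvers?: unknown }
  return typeof ctor.withResolvers === 'function'
}

// Hypothetical helper: prefer the official build only when the runtime
// supports the API it relies on; otherwise keep the bundled build.
function choosePdfjsBuild(supported: boolean = hasPromiseWithResolvers()): PdfjsBuild {
  return supported ? 'official' : 'serverless'
}
```

When choosePdfjsBuild() returns 'official', call definePDFJSModule(() => import('pdfjs-dist')) before any other unpdf method. Otherwise skip the call so the bundled serverless build is used.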
Defines a custom PDF.js build. Call this method before using any other unpdf method. If no custom build is defined, the serverless build will be used.
Type Declaration
function definePDFJSModule(pdfjs: () => Promise<PDFJS>): Promise<void>
Returns the resolved PDF.js module. If no other PDF.js build was defined, the serverless build will be used. This method is useful if you want to use the PDF.js API directly.
Type Declaration
function getResolvedPDFJS(): Promise<PDFJS>
Type Declaration
function getMeta(
data: DocumentInitParameters['data'] | PDFDocumentProxy,
): Promise<{
info: Record<string, any>
metadata: Record<string, any>
}>
Extracts all text from a PDF. If mergePages is set to true, the text of all pages will be merged into a single string. Otherwise, an array of strings for each page will be returned.
Type Declaration
function extractText(
data: DocumentInitParameters['data'] | PDFDocumentProxy,
options?: {
mergePages?: false
}
): Promise<{
totalPages: number
text: string[]
}>
function extractText(
data: DocumentInitParameters['data'] | PDFDocumentProxy,
options: {
mergePages: true
}
): Promise<{
totalPages: number
text: string
}>
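Since the type of text depends on mergePages, code that receives a result from elsewhere may need to handle both shapes. The following is a minimal sketch of that, using a local ExtractTextResult type that mirrors the declarations above. The type name and both helpers are hypothetical, and joining pages with a newline is an assumption rather than a statement about how unpdf merges pages.

```typescript
// Local mirror of the two result shapes declared above.
interface ExtractTextResult {
  totalPages: number
  text: string | string[]
}

// Hypothetical helper: always returns an array, wrapping merged text
// in a single-element array.
function toPages(result: ExtractTextResult): string[] {
  return Array.isArray(result.text) ? result.text : [result.text]
}

// Hypothetical helper: always returns one string.
// Assumption: pages are joined with a newline unless another separator is given.
function toMergedText(result: ExtractTextResult, separator = '\n'): string {
  return toPages(result).join(separator)
}
```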
Extracts images from a specific page of a PDF document, including necessary metadata such as width, height, and calculated color channels.
Note
This method will only work in Node.js and browser environments.
In order to use this method, make sure to meet the following requirements:
- Use the official PDF.js build (see above for details).
- Install the @napi-rs/canvas package if you are using Node.js. This package is required to render the PDF page as an image.
Type Declaration
interface ExtractedImageObject {
data: Uint8ClampedArray
width: number
height: number
channels: 1 | 3 | 4
key: string
}
function extractImages(
data: DocumentInitParameters['data'] | PDFDocumentProxy,
pageNumber: number,
): Promise<ExtractedImageObject[]>
Example
Note
The following example uses the sharp library to process and save the extracted images. You will need to install it with your preferred package manager.
import { readFile, writeFile } from 'node:fs/promises'
import sharp from 'sharp'
import { extractImages, getDocumentProxy } from 'unpdf'
async function extractPdfImages() {
const buffer = await readFile('./document.pdf')
const pdf = await getDocumentProxy(new Uint8Array(buffer))
// Extract images from page 1
const imagesData = await extractImages(pdf, 1)
console.log(`Found ${imagesData.length} images on page 1`)
// Process each image with sharp (optional)
let totalImagesProcessed = 0
for (const imgData of imagesData) {
const imageIndex = ++totalImagesProcessed
await sharp(imgData.data, {
raw: {
width: imgData.width,
height: imgData.height,
channels: imgData.channels
}
})
.png()
.toFile(`image-${imageIndex}.png`)
console.log(`Saved image ${imageIndex} (${imgData.width}x${imgData.height}, ${imgData.channels} channels)`)
}
}
extractPdfImages().catch(console.error)
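If you want to draw an extracted image on a canvas rather than save it with sharp, the pixel data usually has to be in RGBA order. The following is a minimal sketch of expanding 1-channel or 3-channel data to 4 channels. It redeclares ExtractedImageObject locally to stay self-contained, the toRGBA helper is a hypothetical name, and it assumes data is tightly packed, interleaved, row-major pixel data.

```typescript
// Local mirror of the ExtractedImageObject declaration above.
interface ExtractedImageObject {
  data: Uint8ClampedArray
  width: number
  height: number
  channels: 1 | 3 | 4
  key: string
}

// Hypothetical helper: expands grayscale (1) or RGB (3) data to RGBA (4).
// Assumption: `data` holds width * height * channels interleaved bytes.
function toRGBA(image: ExtractedImageObject): Uint8ClampedArray {
  const { data, width, height, channels } = image
  const pixelCount = width * height
  if (data.length !== pixelCount * channels) {
    throw new Error(`Expected ${pixelCount * channels} bytes, got ${data.length}`)
  }
  // Already RGBA: nothing to convert.
  if (channels === 4) return data
  const rgba = new Uint8ClampedArray(pixelCount * 4)
  for (let i = 0; i < pixelCount; i++) {
    const src = i * channels
    const dst = i * 4
    rgba[dst] = data[src]
    rgba[dst + 1] = channels === 1 ? data[src] : data[src + 1]
    rgba[dst + 2] = channels === 1 ? data[src] : data[src + 2]
    rgba[dst + 3] = 255 // fully opaque
  }
  return rgba
}
```

The length check fails early when the assumed layout does not match the data, which is safer than producing a distorted image.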
To render a PDF page as an image, you can use the renderPageAsImage method. This method will return an ArrayBuffer of the rendered image.
Note
This method will only work in Node.js and browser environments.
In order to use this method, make sure to meet the following requirements:
- Use the official PDF.js build (see above for details).
- Install the @napi-rs/canvas package if you are using Node.js. This package is required to render the PDF page as an image.
Type Declaration
declare function renderPageAsImage(
data: DocumentInitParameters['data'],
pageNumber: number,
options?: {
canvasImport?: () => Promise<typeof import('@napi-rs/canvas')>
/** @default 1.0 */
scale?: number
width?: number
height?: number
},
): Promise<ArrayBuffer>
Example
import { readFile, writeFile } from 'node:fs/promises'
import { definePDFJSModule, renderPageAsImage } from 'unpdf'
// Use the official PDF.js build
await definePDFJSModule(() => import('pdfjs-dist'))
const pdf = await readFile('./dummy.pdf')
const buffer = new Uint8Array(pdf)
const pageNumber = 1
const result = await renderPageAsImage(buffer, pageNumber, {
  canvasImport: () => import('@napi-rs/canvas'),
  scale: 2,
})
await writeFile('dummy-page-1.png', new Uint8Array(result))
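The type declaration does not state the image format of the returned ArrayBuffer; the example saves it with a .png extension. The following is a minimal sketch of a sanity check you could run before writing the file, assuming the output is meant to be a PNG. The isPng helper is a hypothetical name and not part of unpdf.

```typescript
// The fixed 8-byte signature that starts every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]

// Hypothetical helper: checks whether a buffer begins with the PNG signature.
function isPng(buffer: ArrayBuffer): boolean {
  const bytes = new Uint8Array(buffer)
  if (bytes.length < PNG_SIGNATURE.length) return false
  return PNG_SIGNATURE.every((byte, i) => bytes[i] === byte)
}
```

Calling isPng(result) on the value returned by renderPageAsImage would tell you whether the .png extension matches the actual content.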
MIT License © 2023-PRESENT Johann Schopplich