Extract and transform HTML with your own schema, powered by Standard Schema compatibility.
by @johnie
xscrape is a powerful HTML scraping library that combines the flexibility of query selectors with the safety of schema validation. It works with any validation library that implements the Standard Schema specification, including Zod, Valibot, ArkType, and Effect Schema.
- HTML Parsing: Extract data from HTML using query selectors powered by cheerio
- Universal Schema Support: Works with any Standard Schema compatible library
- Type Safety: Full TypeScript support with inferred types from your schemas
- Flexible Extraction: Support for nested objects, arrays, and custom transformation functions
- Error Handling: Comprehensive error handling with detailed validation feedback
- Custom Transformations: Apply post-processing transformations to validated data
- Default Values: Handle missing data gracefully through schema defaults
Install xscrape with your preferred package manager:
npm install xscrape
# or
pnpm add xscrape
# or
bun add xscrape
import { defineScraper } from 'xscrape';
import { z } from 'zod';
// Define your schema
const schema = z.object({
title: z.string(),
description: z.string(),
keywords: z.array(z.string()),
views: z.coerce.number(),
});
// Create a scraper
const scraper = defineScraper({
schema,
extract: {
title: { selector: 'title' },
description: { selector: 'meta[name="description"]', value: 'content' },
keywords: {
selector: 'meta[name="keywords"]',
value: (el) => el.attribs['content']?.split(',') || [],
},
views: { selector: 'meta[name="views"]', value: 'content' },
},
});
// Use the scraper
const { data, error } = await scraper(htmlString);
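Because types are inferred from the schema (the Type Safety feature above), the result needs no manual typing. With Zod, data corresponds to z.infer of the schema:

// `data` is typed from the schema, not declared by hand
type PageData = z.infer<typeof schema>;
// => { title: string; description: string; keywords: string[]; views: number }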
Extract basic metadata from an HTML page:
import { defineScraper } from 'xscrape';
import { z } from 'zod';
const scraper = defineScraper({
schema: z.object({
title: z.string(),
description: z.string(),
author: z.string(),
}),
extract: {
title: { selector: 'title' },
description: { selector: 'meta[name="description"]', value: 'content' },
author: { selector: 'meta[name="author"]', value: 'content' },
},
});
const html = `
<!DOCTYPE html>
<html>
<head>
<title>My Blog Post</title>
<meta name="description" content="An interesting blog post">
<meta name="author" content="John Doe">
</head>
<body>...</body>
</html>
`;
const { data, error } = await scraper(html);
// data: { title: "My Blog Post", description: "An interesting blog post", author: "John Doe" }
Use schema defaults to handle missing data gracefully:
const scraper = defineScraper({
schema: z.object({
title: z.string().default('Untitled'),
description: z.string().default('No description available'),
publishedAt: z.string().optional(),
views: z.coerce.number().default(0),
}),
extract: {
title: { selector: 'title' },
description: { selector: 'meta[name="description"]', value: 'content' },
publishedAt: { selector: 'meta[name="published"]', value: 'content' },
views: { selector: 'meta[name="views"]', value: 'content' },
},
});
// Even with incomplete HTML, you get sensible defaults
const { data } = await scraper('<html><head><title>Test</title></head></html>');
// data: { title: "Test", description: "No description available", views: 0 }
Extract multiple elements as arrays:
const scraper = defineScraper({
schema: z.object({
links: z.array(z.string()),
headings: z.array(z.string()),
}),
extract: {
links: [{ selector: 'a', value: 'href' }],
headings: [{ selector: 'h1, h2, h3' }],
},
});
const html = `
<html>
<body>
<h1>Main Title</h1>
<h2>Subtitle</h2>
<a href="/page1">Link 1</a>
<a href="/page2">Link 2</a>
</body>
</html>
`;
const { data } = await scraper(html);
// data: {
// links: ["/page1", "/page2"],
// headings: ["Main Title", "Subtitle"]
// }
Extract complex nested data structures:
const scraper = defineScraper({
schema: z.object({
title: z.string(),
socialMedia: z.object({
image: z.string().url(),
width: z.coerce.number(),
height: z.coerce.number(),
type: z.string(),
}),
}),
extract: {
title: { selector: 'title' },
socialMedia: {
selector: 'head',
value: {
image: { selector: 'meta[property="og:image"]', value: 'content' },
width: { selector: 'meta[property="og:image:width"]', value: 'content' },
height: { selector: 'meta[property="og:image:height"]', value: 'content' },
type: { selector: 'meta[property="og:type"]', value: 'content' },
},
},
},
});
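Run against a page with Open Graph tags, this produces a nested object (sample HTML and output are illustrative):

const html = `
<html>
  <head>
    <title>My Page</title>
    <meta property="og:image" content="https://example.com/image.png">
    <meta property="og:image:width" content="1200">
    <meta property="og:image:height" content="630">
    <meta property="og:type" content="article">
  </head>
</html>
`;

const { data } = await scraper(html);
// data: {
//   title: "My Page",
//   socialMedia: { image: "https://example.com/image.png", width: 1200, height: 630, type: "article" }
// }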
Apply custom logic to extracted values:
const scraper = defineScraper({
schema: z.object({
tags: z.array(z.string()),
publishedDate: z.date(),
readingTime: z.number(),
}),
extract: {
tags: {
selector: 'meta[name="keywords"]',
value: (el) => el.attribs['content']?.split(',').map(tag => tag.trim()) || [],
},
publishedDate: {
selector: 'meta[name="published"]',
value: (el) => new Date(el.attribs['content']),
},
readingTime: {
selector: 'article',
value: (el) => {
const text = el.text();
const wordsPerMinute = 200;
const wordCount = text.split(/\s+/).length;
return Math.ceil(wordCount / wordsPerMinute);
},
},
},
});
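An illustrative run (the sample HTML is deliberately small):

const html = `
<html>
  <head>
    <meta name="keywords" content="typescript, scraping, html">
    <meta name="published" content="2024-01-15">
  </head>
  <body><article>A very short article body.</article></body>
</html>
`;

const { data } = await scraper(html);
// data.tags: ["typescript", "scraping", "html"]
// data.publishedDate: new Date("2024-01-15")
// data.readingTime: 1 (5 words at 200 wpm, rounded up)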
Apply transformations to the validated data:
const scraper = defineScraper({
schema: z.object({
title: z.string(),
description: z.string(),
tags: z.array(z.string()),
}),
extract: {
title: { selector: 'title' },
description: { selector: 'meta[name="description"]', value: 'content' },
tags: {
selector: 'meta[name="keywords"]',
value: (el) => el.attribs['content']?.split(',') || [],
},
},
transform: (data) => ({
...data,
slug: data.title.toLowerCase().replace(/\s+/g, '-'),
tagCount: data.tags.length,
summary: data.description.substring(0, 100) + '...',
}),
});
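An illustrative run (sample HTML and output):

const html = `
<html>
  <head>
    <title>Getting Started With Xscrape</title>
    <meta name="description" content="A short guide to extracting structured data from HTML pages">
    <meta name="keywords" content="html,scraping,typescript">
  </head>
</html>
`;

const { data } = await scraper(html);
// data.slug: "getting-started-with-xscrape"
// data.tagCount: 3
// data.summary: "A short guide to extracting structured data from HTML pages..." (first 100 chars plus "...")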
import { z } from 'zod';
const schema = z.object({
title: z.string(),
price: z.coerce.number(),
inStock: z.boolean().default(false),
});
import * as v from 'valibot';
const schema = v.object({
title: v.string(),
price: v.pipe(v.string(), v.transform(Number)),
inStock: v.optional(v.boolean(), false),
});
import { type } from 'arktype';
const schema = type({
title: 'string',
price: 'number',
inStock: 'boolean = false',
});
import { Schema } from 'effect';
const schema = Schema.Struct({
title: Schema.String,
price: Schema.NumberFromString,
inStock: Schema.optionalWith(Schema.Boolean, { default: () => false }),
});
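Whichever library you choose, the resulting schema plugs into defineScraper unchanged. A sketch (the extract config and meta tags are illustrative):

const scraper = defineScraper({
  schema, // any of the Standard Schema objects above
  extract: {
    title: { selector: 'title' },
    // attribute values arrive as strings; how price becomes a number depends on
    // the schema you chose (e.g. Zod's z.coerce.number vs. a coercing value function)
    price: { selector: 'meta[name="price"]', value: 'content' },
    inStock: {
      selector: 'meta[name="in-stock"]',
      value: (el) => el.attribs['content'] === 'true',
    },
  },
});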
defineScraper(config) creates a scraper function with the specified configuration.
- config.schema: A Standard Schema compatible schema object
- config.extract: Extraction configuration object
- config.transform?: Optional post-processing function
Returns a scraper function that takes an HTML string and returns a Promise<{ data?: T, error?: unknown }>.
The extract object defines how to extract data from HTML:
type ExtractConfig = {
[key: string]: ExtractDescriptor | [ExtractDescriptor];
};
type ExtractDescriptor = {
selector: string;
value?: string | ((el: Element) => any) | ExtractConfig;
};
- selector: CSS selector used to find elements
- value: How to extract the value:
  - string: an attribute name (e.g., 'href', 'content')
  - function: a custom extraction function
  - object: a nested extraction configuration
  - undefined: extract the element's text content
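A single extract config exercising each kind (selectors are illustrative; the element API in the value functions follows the examples above):

const extract = {
  title: { selector: 'title' },                                    // undefined: text content
  canonical: { selector: 'link[rel="canonical"]', value: 'href' }, // string: attribute name
  tags: {
    selector: 'meta[name="keywords"]',
    value: (el) => el.attribs['content']?.split(',') || [],        // function: custom logic
  },
  social: {
    selector: 'head',
    value: {                                                       // object: nested config
      image: { selector: 'meta[property="og:image"]', value: 'content' },
    },
  },
};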
Wrap the descriptor in an array to extract multiple elements:
{
links: [{ selector: 'a', value: 'href' }]
}
xscrape provides comprehensive error handling:
const { data, error } = await scraper(html);
if (error) {
// Handle validation errors, extraction errors, or transform errors
console.error('Scraping failed:', error);
} else {
// Use the validated data
console.log('Extracted data:', data);
}
- Use Specific Selectors: Be as specific as possible with CSS selectors to avoid unexpected matches
- Handle Missing Data: Use schema defaults or optional fields for data that might not be present
- Validate URLs: Use URL validation in your schema for href attributes (see the sketch after this list)
- Transform Data Early: Use custom value functions rather than post-processing when possible
- Type Safety: Let TypeScript infer types from your schema for better developer experience
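A minimal sketch of the URL-validation tip; note that Zod's z.string().url() accepts only absolute URLs, so relative hrefs are resolved first (the base URL is a placeholder):

const scraper = defineScraper({
  schema: z.object({
    links: z.array(z.string().url()),
  }),
  extract: {
    links: [{
      selector: 'a',
      // resolve relative hrefs against a base URL before validation
      value: (el) => new URL(el.attribs['href'] ?? '', 'https://example.com').href,
    }],
  },
});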
- Web Scraping: Extract structured data from websites
- Meta Tag Extraction: Get social media and SEO metadata
- Content Migration: Transform HTML content to structured data
- Testing: Validate HTML structure in tests
- RSS/Feed Processing: Extract article data from HTML feeds
- xscrape uses cheerio for fast HTML parsing
- Schema validation is performed once after extraction
- Consider using streaming for large HTML documents
- Cache scrapers when processing many similar documents (see the sketch below)
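The caching tip amounts to building a scraper once and reusing it across a batch (htmlDocuments is a hypothetical array of HTML strings):

const scraper = defineScraper({ schema, extract });
// one scraper instance for the whole batch, instead of re-creating it per document
const results = await Promise.all(htmlDocuments.map((html) => scraper(html)));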
We welcome contributions! Please see our Contributing Guide for details.
MIT License. See the LICENSE file for details.
- cheerio - jQuery-like server-side HTML parsing
- Standard Schema - Universal schema specification
- Zod - TypeScript-first schema validation
- Valibot - Modular and type-safe schema library
- Effect - Maximum Type-safety (incl. error handling)
- ArkType - TypeScript's 1:1 validator, optimized from editor to runtime