xscrape is a powerful and flexible library for extracting and transforming data from HTML documents using user-defined schemas. It supports any validation library that implements the Standard Schema, allowing you to bring your own schema for robust, type-safe data validation.
- HTML Parsing: Extract data from HTML using CSS selectors with the help of cheerio.
- Flexible Schema Validation: Validate and transform extracted data with any validation library that implements the Standard Schema, such as Zod, Valibot, ArkType, and Effect Schema.
- Custom Transformations: Provide custom transformations for extracted attributes.
- Default Values: Define default values for missing data fields through your chosen schema library's features.
- Nested Field Support: Define and extract nested data structures from HTML elements.
To install this library, use your preferred package manager:

```sh
pnpm add xscrape
# or
npm install xscrape
```

You will also need to install your chosen schema validation library, for example, Zod:

```sh
pnpm add zod
# or
npm install zod
```
Below is an example of how to use xscrape with a Zod schema to extract and transform data from an HTML document.
```ts
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    description: z.string(),
    keywords: z.array(z.string()),
    views: z.coerce.number(),
  }),
  extract: {
    title: {
      selector: 'title',
    },
    description: {
      selector: 'meta[name="description"]',
      value: 'content',
    },
    keywords: {
      selector: 'meta[name="keywords"]',
      value(el) {
        return el.attribs['content']?.split(',');
      },
    },
    views: {
      selector: 'meta[name="views"]',
      value: 'content',
    },
  },
});

const html = `
<!DOCTYPE html>
<html>
<head>
  <meta name="description" content="An example description.">
  <meta name="keywords" content="typescript,html,parsing">
  <meta name="views" content="1234">
  <title>Example Title</title>
</head>
<body></body>
</html>
`;

const { data, error } = await scraper(html);

console.log(data);
// Outputs:
// {
//   title: 'Example Title',
//   description: 'An example description.',
//   keywords: ['typescript', 'html', 'parsing'],
//   views: 1234
// }
```
You can handle missing data by using the features of your chosen schema library, such as default values in Zod.
```ts
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({
    title: z.string().default('No title'),
    description: z.string().default('No description'),
    views: z.coerce.number().default(0),
  }),
  extract: {
    title: {
      selector: 'title',
    },
    description: {
      selector: 'meta[name="description"]',
      value: 'content',
    },
    views: {
      selector: 'meta[name="views"]',
      value: 'content',
    },
  },
});
```
xscrape also supports extracting nested data structures.
```ts
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    image: z
      .object({
        url: z.string().url(),
        width: z.coerce.number(),
        height: z.coerce.number(),
      })
      .default({ url: '', width: 0, height: 0 }),
  }),
  extract: {
    title: {
      selector: 'title',
    },
    image: {
      selector: 'head',
      value: {
        url: {
          selector: 'meta[property="og:image"]',
          value: 'content',
        },
        width: {
          selector: 'meta[property="og:image:width"]',
          value: 'content',
        },
        height: {
          selector: 'meta[property="og:image:height"]',
          value: 'content',
        },
      },
    },
  },
});
```
The `defineScraper` function accepts a configuration object with the following properties:

- `schema`: A schema object from any library that implements the Standard Schema interface. This schema defines the shape and validation rules for the extracted data.
- `extract`: An object that determines how fields are extracted from the HTML using CSS selectors.
- `transform` (optional): A function to apply custom transformations to the validated data.
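Because `schema` only needs to satisfy the Standard Schema interface, any object exposing a `~standard` property with a `validate` function qualifies. A minimal hand-rolled sketch, assuming the Standard Schema v1 shape (`version`, `vendor`, `validate`); in practice you would pass a Zod, Valibot, or ArkType schema, which already implements this interface:

```typescript
// A hand-rolled Standard Schema (v1) for a single numeric field.
// Illustration only -- the vendor name is hypothetical.
type StandardResult<T> =
  | { value: T }
  | { issues: ReadonlyArray<{ message: string }> };

const viewsSchema = {
  '~standard': {
    version: 1,
    vendor: 'hand-rolled',
    validate(input: unknown): StandardResult<number> {
      const n = Number(input);
      return Number.isFinite(n)
        ? { value: n }
        : { issues: [{ message: `Expected a number, got ${String(input)}` }] };
    },
  },
};

console.log(viewsSchema['~standard'].validate('1234')); // { value: 1234 }
console.log(viewsSchema['~standard'].validate('abc'));  // { issues: [...] }
```

A successful validation returns `{ value }`; a failure returns `{ issues }` with human-readable messages, which is how libraries like Zod surface errors through the same interface.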
Contributions are welcome! Please see the Contributing Guide for more information.
This project is licensed under the MIT License. See the LICENSE file for details.