
🕷️ xscrape

Extract and transform HTML with your own schema, powered by Standard Schema compatibility.
by @johnie


Overview

xscrape is a powerful HTML scraping library that combines the flexibility of CSS selectors with the safety of schema validation. It works with any validation library that implements the Standard Schema specification, including Zod, Valibot, ArkType, and Effect Schema.

Features

  • HTML Parsing: Extract data from HTML using CSS selectors, powered by cheerio
  • Universal Schema Support: Works with any Standard Schema compatible library
  • Type Safety: Full TypeScript support with inferred types from your schemas
  • Flexible Extraction: Support for nested objects, arrays, and custom transformation functions
  • Error Handling: Comprehensive error handling with detailed validation feedback
  • Custom Transformations: Apply post-processing transformations to validated data
  • Default Values: Handle missing data gracefully through schema defaults

Installation

Install xscrape with your preferred package manager:

npm install xscrape
# or
pnpm add xscrape
# or
bun add xscrape

Quick Start

import { defineScraper } from 'xscrape';
import { z } from 'zod';

// Define your schema
const schema = z.object({
  title: z.string(),
  description: z.string(),
  keywords: z.array(z.string()),
  views: z.coerce.number(),
});

// Create a scraper
const scraper = defineScraper({
  schema,
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    keywords: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',') || [],
    },
    views: { selector: 'meta[name="views"]', value: 'content' },
  },
});

// Use the scraper
const { data, error } = await scraper(htmlString);
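
Because the return type is inferred from the schema, data is fully typed. A small sketch, assuming htmlString contains the meta tags referenced above:

if (error) {
  console.error('Scraping failed:', error);
} else if (data) {
  // data is inferred as { title: string; description: string; keywords: string[]; views: number }
  console.log(data.title.toUpperCase()); // string
  console.log(data.views + 1);           // number, thanks to z.coerce.number()
}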

Usage Examples

Basic Extraction

Extract basic metadata from an HTML page:

import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    description: z.string(),
    author: z.string(),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    author: { selector: 'meta[name="author"]', value: 'content' },
  },
});

const html = `
<!DOCTYPE html>
<html>
<head>
  <title>My Blog Post</title>
  <meta name="description" content="An interesting blog post">
  <meta name="author" content="John Doe">
</head>
<body>...</body>
</html>
`;

const { data, error } = await scraper(html);
// data: { title: "My Blog Post", description: "An interesting blog post", author: "John Doe" }

Handling Missing Data

Use schema defaults to handle missing data gracefully:

const scraper = defineScraper({
  schema: z.object({
    title: z.string().default('Untitled'),
    description: z.string().default('No description available'),
    publishedAt: z.string().optional(),
    views: z.coerce.number().default(0),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    publishedAt: { selector: 'meta[name="published"]', value: 'content' },
    views: { selector: 'meta[name="views"]', value: 'content' },
  },
});

// Even with incomplete HTML, you get sensible defaults
const { data } = await scraper('<html><head><title>Test</title></head></html>');
// data: { title: "Test", description: "No description available", views: 0 }

Extracting Arrays

Extract multiple elements as arrays:

const scraper = defineScraper({
  schema: z.object({
    links: z.array(z.string()),
    headings: z.array(z.string()),
  }),
  extract: {
    links: [{ selector: 'a', value: 'href' }],
    headings: [{ selector: 'h1, h2, h3' }],
  },
});

const html = `
<html>
<body>
  <h1>Main Title</h1>
  <h2>Subtitle</h2>
  <a href="/page1">Link 1</a>
  <a href="/page2">Link 2</a>
</body>
</html>
`;

const { data } = await scraper(html);
// data: {
//   links: ["/page1", "/page2"],
//   headings: ["Main Title", "Subtitle"]
// }

Nested Objects

Extract complex nested data structures:

const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    socialMedia: z.object({
      image: z.string().url(),
      width: z.coerce.number(),
      height: z.coerce.number(),
      type: z.string(),
    }),
  }),
  extract: {
    title: { selector: 'title' },
    socialMedia: {
      selector: 'head',
      value: {
        image: { selector: 'meta[property="og:image"]', value: 'content' },
        width: { selector: 'meta[property="og:image:width"]', value: 'content' },
        height: { selector: 'meta[property="og:image:height"]', value: 'content' },
        type: { selector: 'meta[property="og:type"]', value: 'content' },
      },
    },
  },
});
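
For illustration, here is a hypothetical page with Open Graph tags and the nested result the configuration above is meant to produce:

const html = `
<html>
<head>
  <title>Product Page</title>
  <meta property="og:image" content="https://example.com/cover.png">
  <meta property="og:image:width" content="1200">
  <meta property="og:image:height" content="630">
  <meta property="og:type" content="website">
</head>
</html>
`;

const { data } = await scraper(html);
// data: {
//   title: "Product Page",
//   socialMedia: {
//     image: "https://example.com/cover.png",
//     width: 1200,
//     height: 630,
//     type: "website"
//   }
// }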

Custom Value Transformations

Apply custom logic to extracted values:

const scraper = defineScraper({
  schema: z.object({
    tags: z.array(z.string()),
    publishedDate: z.date(),
    readingTime: z.number(),
  }),
  extract: {
    tags: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',').map(tag => tag.trim()) || [],
    },
    publishedDate: {
      selector: 'meta[name="published"]',
      value: (el) => new Date(el.attribs['content']),
    },
    readingTime: {
      selector: 'article',
      value: (el) => {
        const text = el.text();
        const wordsPerMinute = 200;
        const wordCount = text.split(/\s+/).length;
        return Math.ceil(wordCount / wordsPerMinute);
      },
    },
  },
});

Post-Processing with Transform

Apply transformations to the validated data:

const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    description: z.string(),
    tags: z.array(z.string()),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    tags: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',') || [],
    },
  },
  transform: (data) => ({
    ...data,
    slug: data.title.toLowerCase().replace(/\s+/g, '-'),
    tagCount: data.tags.length,
    summary: data.description.substring(0, 100) + '...',
  }),
});
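
The transform runs after validation, so the result contains both the extracted fields and the derived ones. With hypothetical input values, a call might yield:

const { data } = await scraper(html);
// data: {
//   title: "My First Post",
//   description: "A short introduction to the blog",
//   tags: ["intro", "blog"],
//   slug: "my-first-post",
//   tagCount: 2,
//   summary: "A short introduction to the blog..."
// }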

Schema Library Examples

Zod

import { z } from 'zod';

const schema = z.object({
  title: z.string(),
  price: z.coerce.number(),
  inStock: z.boolean().default(false),
});

Valibot

import * as v from 'valibot';

const schema = v.object({
  title: v.string(),
  price: v.pipe(v.string(), v.transform(Number)),
  inStock: v.optional(v.boolean(), false),
});

ArkType

import { type } from 'arktype';

const schema = type({
  title: 'string',
  price: 'number',
  inStock: 'boolean = false',
});

Effect Schema

import { Schema } from 'effect';

const schema = Schema.Struct({
  title: Schema.String,
  price: Schema.NumberFromString,
  inStock: Schema.optionalWith(Schema.Boolean, { default: () => false }),
});
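
Any of these schemas can be passed to defineScraper in place of the Zod examples; the extraction configuration does not change. A minimal sketch using the Valibot schema above (the selectors here are hypothetical):

import { defineScraper } from 'xscrape';
import * as v from 'valibot';

const scraper = defineScraper({
  schema: v.object({
    title: v.string(),
    price: v.pipe(v.string(), v.transform(Number)),
    inStock: v.optional(v.boolean(), false),
  }),
  extract: {
    title: { selector: 'title' },
    price: { selector: 'meta[name="price"]', value: 'content' },
    inStock: {
      selector: 'meta[name="in-stock"]',
      value: (el) => el.attribs['content'] === 'true',
    },
  },
});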

API Reference

defineScraper(config)

Creates a scraper function with the specified configuration.

Parameters

  • config.schema: A Standard Schema compatible schema object
  • config.extract: Extraction configuration object
  • config.transform?: Optional post-processing function

Returns

A scraper function that takes an HTML string and returns Promise<{ data?: T, error?: unknown }>, where T is inferred from your schema.

Extraction Configuration

The extract object defines how to extract data from HTML:

type ExtractConfig = {
  [key: string]: ExtractDescriptor | [ExtractDescriptor];
};

type ExtractDescriptor = {
  selector: string;
  value?: string | ((el: Element) => any) | ExtractConfig;
};

Properties

  • selector: CSS selector to find elements
  • value: How to extract the value (each mode is illustrated after this list):
    • string: Attribute name (e.g., 'href', 'content')
    • function: Custom extraction function
    • object: Nested extraction configuration
    • undefined: Extract text content
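
As a quick illustration of the four modes in a single extract object (the selectors and field names are hypothetical):

const extract = {
  // undefined value: extract the element's text content
  heading: { selector: 'h1' },
  // string value: read an attribute
  canonical: { selector: 'link[rel="canonical"]', value: 'href' },
  // function value: compute a custom value from the element
  published: {
    selector: 'meta[name="published"]',
    value: (el) => new Date(el.attribs['content'] ?? ''),
  },
  // object value: nested extraction scoped to the matched element
  social: {
    selector: 'head',
    value: {
      image: { selector: 'meta[property="og:image"]', value: 'content' },
    },
  },
};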

Array Extraction

Wrap the descriptor in an array to extract multiple elements:

{
  links: [{ selector: 'a', value: 'href' }]
}

Error Handling

xscrape provides comprehensive error handling:

const { data, error } = await scraper(html);

if (error) {
  // Handle validation errors, extraction errors, or transform errors
  console.error('Scraping failed:', error);
} else {
  // Use the validated data
  console.log('Extracted data:', data);
}

Best Practices

  1. Use Specific Selectors: Be as specific as possible with CSS selectors to avoid unexpected matches
  2. Handle Missing Data: Use schema defaults or optional fields for data that might not be present
  3. Validate URLs: Use URL validation in your schema for href attributes (see the sketch after this list)
  4. Transform Data Early: Use custom value functions rather than post-processing when possible
  5. Type Safety: Let TypeScript infer types from your schema for better developer experience
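
For example, practice 3 can be expressed directly in the schema. A minimal sketch, with the selector kept deliberately specific per practice 1:

import { defineScraper } from 'xscrape';
import { z } from 'zod';

const linkScraper = defineScraper({
  schema: z.object({
    links: z.array(z.string().url()),
  }),
  extract: {
    // Only absolute URLs, so relative hrefs don't fail .url() validation
    links: [{ selector: 'a[href^="http"]', value: 'href' }],
  },
});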

Common Use Cases

  • Web Scraping: Extract structured data from websites
  • Meta Tag Extraction: Get social media and SEO metadata
  • Content Migration: Transform HTML content to structured data
  • Testing: Validate HTML structure in tests
  • RSS/Feed Processing: Extract article data from HTML feeds

Performance Considerations

  • xscrape uses cheerio for fast HTML parsing
  • Schema validation is performed once after extraction
  • Consider using streaming for large HTML documents
  • Cache scrapers when processing many similar documents (see the sketch below)
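
A simple way to apply the last point is to build the scraper once and reuse it for every document, rather than calling defineScraper per page. A sketch, where schema and extract stand in for any configuration shown earlier and the URLs are placeholders:

import { defineScraper } from 'xscrape';

// Define once, reuse for every page
const articleScraper = defineScraper({ schema, extract });

const urls = ['https://example.com/a', 'https://example.com/b'];

const results = await Promise.all(
  urls.map(async (url) => {
    const html = await (await fetch(url)).text();
    return articleScraper(html);
  }),
);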

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

MIT License. See the LICENSE file for details.

Related Projects

  • cheerio - jQuery-like server-side HTML parsing
  • Standard Schema - Universal schema specification
  • Zod - TypeScript-first schema validation
  • Valibot - Modular and type-safe schema library
  • Effect - Maximum Type-safety (incl. error handling)
  • ArkType - TypeScript's 1:1 validator, optimized from editor to runtime
