Releases: SHA888/kawat

v0.1.5

23 Apr 14:55
f22c96f

v0.1.5 - 2026-04-23

Added

  • kawat: Public facade crate
    • kawat::extract(): Single-call HTML → plain text extraction
    • fetch_url() / fetch_url_async(): URL fetching helpers
  • Golden test suite: 20 HTML fixtures compared against Python trafilatura 2.0
    • Similarity scoring via Jaccard + word-level containment
    • Average match: 91.2% (threshold: 70%)
  • CI/CD enhancements
    • cargo-about license certification (about.toml)
    • Security audit workflow with cargo-audit + RUSTSEC advisory tracking
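The two similarity measures named for the golden test suite can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function signatures, the whitespace tokenization, and the lowercase normalization are all assumptions, and the notes do not say how the two scores are combined into the reported match percentage.

```rust
use std::collections::HashSet;

/// Split text into a set of lowercase, whitespace-delimited words.
/// (Assumed tokenization; the real suite may normalize differently.)
fn word_set(text: &str) -> HashSet<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two word sets.
fn jaccard(a: &str, b: &str) -> f64 {
    let (sa, sb) = (word_set(a), word_set(b));
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0; // two empty texts are treated as identical
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

/// Word-level containment: fraction of reference words found in the candidate.
fn containment(reference: &str, candidate: &str) -> f64 {
    let (r, c) = (word_set(reference), word_set(candidate));
    if r.is_empty() {
        return 1.0; // nothing to contain
    }
    r.intersection(&c).count() as f64 / r.len() as f64
}
```

For the pair "the quick brown fox" and "the quick red fox", three of five distinct words are shared, so Jaccard gives 0.6, while three of the four reference words appear in the candidate, so containment gives 0.75.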

Changed

  • cascade.rs: Removed dead with_metadata branch (single default path)
  • containment(): Switched from substring to HashSet word-level matching

Fixed

  • Containment false positives from substring matching (e.g. "a" matching inside "article"), resolved by word-level matching
  • rustls-webpki vulnerability (RUSTSEC-2026-0104) via cargo update
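The containment fix is easiest to see by comparing the two checks side by side. The helpers below are hypothetical names written for illustration; they only show why a substring test reports a hit for "a" inside "article" while a HashSet of whole words does not.

```rust
use std::collections::HashSet;

/// Substring check: reports a hit whenever the characters appear anywhere.
fn contains_substring(haystack: &str, word: &str) -> bool {
    haystack.contains(word)
}

/// Word-level check: only whole whitespace-delimited tokens count.
fn contains_word(haystack: &str, word: &str) -> bool {
    let words: HashSet<&str> = haystack.split_whitespace().collect();
    words.contains(word)
}
```

With the haystack "article", the substring check accepts "a" and the word-level check rejects it; with "a short article", both accept it.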

v0.1.4

18 Apr 13:53
5e72029

Added

  • kawat-core: Core extraction orchestrator (0.1.4 milestone)
    • cascade::run(): Full extraction pipeline with steps 1, 5, 6, 8a
    • ExtractorOptions: Configuration struct mirroring trafilatura Extractor
    • Document: Extracted document model with metadata and body text
    • Size validation with min_extracted_size threshold
  • kawat-output: TXT output formatter
    • to_txt(): Plain text output with metadata header
    • to_txt_body_only(): Body text without headers
    • OutputFormat enum with #[non_exhaustive] for forward compatibility
    • OutputError custom error type using thiserror
  • CI/CD: GitHub Actions workflow with cargo-deny
    • License audit with explicit allow list
    • Security advisory checking
    • Dependency banning for duplicates
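Size validation against min_extracted_size might look roughly like the sketch below. Only the threshold field comes from these notes; the struct shape, the validate_size name, the character-based length, and the String error type are assumptions made for illustration.

```rust
/// Hypothetical subset of the options struct; only the size threshold is modeled.
struct ExtractorOptions {
    min_extracted_size: usize,
}

/// Accept the extracted text only if it holds at least
/// `min_extracted_size` characters after trimming surrounding whitespace.
fn validate_size<'a>(text: &'a str, opts: &ExtractorOptions) -> Result<&'a str, String> {
    let len = text.trim().chars().count();
    if len < opts.min_extracted_size {
        return Err(format!(
            "extracted text too short: {len} < {}",
            opts.min_extracted_size
        ));
    }
    Ok(text)
}
```

Counting characters rather than bytes keeps the threshold meaningful for non-ASCII text.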

Changed

  • Document::to_formatted_string() now dispatches to format-specific handlers
  • TXT output gates metadata header on with_metadata option (matches trafilatura behavior)
  • Pre-commit hooks include cargo-deny for local license checking
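Gating the TXT metadata header on with_metadata can be sketched as follows. The Document fields, the header layout, and the boolean parameter are hypothetical; the real to_txt() signature and header format are not shown in these notes.

```rust
/// Hypothetical, simplified document model; the real `Document` may differ.
struct Document {
    title: Option<String>,
    url: Option<String>,
    text: String,
}

/// Render plain text, emitting the metadata header only when
/// `with_metadata` is set.
fn to_txt(doc: &Document, with_metadata: bool) -> String {
    let mut out = String::new();
    if with_metadata {
        if let Some(title) = &doc.title {
            out.push_str(&format!("title: {title}\n"));
        }
        if let Some(url) = &doc.url {
            out.push_str(&format!("url: {url}\n"));
        }
        if !out.is_empty() {
            out.push('\n'); // blank line between header and body
        }
    }
    out.push_str(&doc.text);
    out
}
```

With with_metadata set to false, the output is the body text alone, which is the role the notes assign to to_txt_body_only().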

Fixed

  • Deprecated deny key removed from cargo-deny config
  • Added missing licenses: Unicode-3.0, Zlib

Release v0.1.3

07 Apr 00:14

Changes

Installation

From crates.io (manual publishing)

cargo install kawat

From source

cargo install --git https://github.com/SHA888/kawat.git --tag v0.1.3

Downloads

Check the assets below for pre-compiled binaries.

Note

Crates.io publishing is done manually to ensure quality control.

What's Changed

  • cargo(deps): update dom_smoothie requirement from 0.5 to 0.17 by @dependabot[bot] in #13
  • ci: bump codecov/codecov-action from 5 to 6 by @dependabot[bot] in #10
  • ci: bump actions/upload-artifact from 4 to 7 by @dependabot[bot] in #11
  • ci: bump peter-evans/create-pull-request from 6 to 8 by @dependabot[bot] in #12

Full Changelog: v0.1.2...v0.1.3

v0.1.2

26 Mar 08:37

Added

  • kawat-html: Complete HTML tree cleaning and tag normalization pipeline
    • tree_cleaning(): Remove 44 MANUALLY_CLEANED tags and strip 20 MANUALLY_STRIPPED tags
    • convert_tags(): Normalize HTML tags to internal catalog (h1-h6→head, b/strong/em/i→hi, a→ref, ul/ol→list, li→item, br→lb, blockquote→quote, del/s→del, code/pre→code)
    • convert_link(): Resolve relative URLs against base_url using standards-compliant URL resolution
    • _is_code_block(): Distinguish between inline code and code blocks
    • handle_textnode() + process_node(): Text extraction and normalization for all element types
    • link_density_test() and link_density_test_tables(): Link density filtering for content extraction
    • delete_by_link_density(): Remove high-density link elements with backtracking
  • kawat-extract: Custom KawatTree structure for lightweight HTML processing
    • KawatNode and KawatTree structs with full traversal and manipulation methods
    • HTML parsing with proper text/tail distinction
    • Integration with kawat-html transformations
    • 23 comprehensive unit tests
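The tag normalization listed for convert_tags() reduces to a lookup from HTML tag name to internal catalog name. The sketch below covers only that mapping, taken from the list above; the function name convert_tag and the pass-through of unknown tags are assumptions, and any attribute handling the real function performs is left out.

```rust
/// Map an HTML tag name to the internal tag catalog.
/// Tags without a mapping are returned unchanged (an assumption).
fn convert_tag(tag: &str) -> &str {
    match tag {
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" => "head",
        "b" | "strong" | "em" | "i" => "hi",
        "a" => "ref",
        "ul" | "ol" => "list",
        "li" => "item",
        "br" => "lb",
        "blockquote" => "quote",
        "del" | "s" => "del",
        "code" | "pre" => "code",
        other => other,
    }
}
```

A single match expression keeps the catalog in one place and lets the compiler flag unreachable arms if a tag is listed twice.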

Changed

  • Improved HTML processing pipeline with immutable-first design
  • Enhanced error handling with proper Result types throughout

Fixed

  • Lifetime syntax errors in tree.rs (explicit '_ lifetime parameters)
  • Test failures in convert_link and textnode modules
  • Inline code formatting (missing closing backtick)

Testing

  • 34 kawat-html unit tests (all passing)
  • 23 kawat-extract unit tests (all passing)
  • 57 total tests across all crates (all passing)
  • Pre-commit hooks: Rust Format, Clippy, Cargo Audit (all passing)

v0.1.1

24 Mar 07:17

v0.1.1 - 2026-03-24

Added

  • Comprehensive XPath evaluation engine with CSS selector fallback
  • Benchmark suite for XPath performance testing
  • Extensive test coverage (15+ tests) for all BODY_XPATH expressions
  • Support for complex XPath patterns including (article)[1]
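An indexed pattern such as (article)[1] selects the first match, because XPath positions are 1-based. The hypothetical helper below parses that shape into a name and a zero-based index that could then be applied to a list of matches from any selector engine; it is not the project's evaluator and handles nothing beyond the (name)[n] form.

```rust
/// Parse a pattern of the form `(name)[n]` into (name, zero-based index).
/// Returns None for any other shape, or for n == 0, since XPath
/// positions start at 1.
fn parse_indexed(expr: &str) -> Option<(&str, usize)> {
    let rest = expr.trim().strip_prefix('(')?;
    let (name, rest) = rest.split_once(")[")?;
    let position: usize = rest.strip_suffix(']')?.parse().ok()?;
    if name.is_empty() {
        return None;
    }
    // Convert the 1-based XPath position to a 0-based slice index.
    Some((name, position.checked_sub(1)?))
}
```

A caller could run the equivalent of a CSS selector for the name and then take the element at the returned index with slice::get, which yields None when there are too few matches.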

Changed

  • Updated dependencies: lru (0.12→0.16), quick-xml (0.37→0.39), scraper (0.22→0.26), criterion (0.5→0.8), reqwest (0.12→0.13)
  • Updated the reqwest TLS feature flag (rustls-tls → rustls)
  • Enhanced error handling in XPath evaluation

Fixed

  • XPath fallback test failures for (article)[1] pattern
  • CI/CD pipeline issues with deny.toml configuration
  • Codecov upload failures for protected branches
  • Security workflow permission handling

v0.1.0

21 Mar 21:04

  • Initial workspace import of kawat crates (core, extract, html, xpath, metadata, output, CLI, etc.).
  • Added date parsing regex fallback and chrono clock feature for htmldate-rs.
  • Implemented FromStr for OutputFormat and hardened CLI format parsing.
  • Resolved clippy warnings (unused imports/vars, format args) across crates.
  • Added .gitignore and .pre-commit-config.yaml (fmt, clippy, cargo-audit) and ensured hooks pass.
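Implementing FromStr for an output-format enum lets CLI arguments be parsed with str::parse. The sketch below is an illustration under assumptions: the variant names, the accepted spellings, and the String error type are hypothetical, and the #[non_exhaustive] attribute mirrors the one described for a later release.

```rust
use std::str::FromStr;

/// Hypothetical variants; the real enum may list different formats.
#[derive(Debug, PartialEq)]
#[non_exhaustive]
enum OutputFormat {
    Txt,
    Json,
}

impl FromStr for OutputFormat {
    type Err = String;

    /// Case-insensitive parsing that rejects unknown names with a message.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "txt" | "text" => Ok(OutputFormat::Txt),
            "json" => Ok(OutputFormat::Json),
            other => Err(format!("unknown output format: {other}")),
        }
    }
}
```

Returning an error for unknown names, rather than silently falling back to a default, is one way to read "hardened CLI format parsing".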