Releases: SHA888/kawat
v0.1.5
v0.1.5 - 2026-04-23
Added
- kawat: Public facade crate
- kawat::extract(): Single-call HTML → plain text extraction
- fetch_url()/fetch_url_async(): URL fetching helpers
- Golden test suite: 20 HTML fixtures compared against Python trafilatura 2.0
- Similarity scoring via Jaccard + word-level containment
- Average match: 91.2% (threshold: 70%)
- CI/CD enhancements
- cargo-about license certification (about.toml)
- Security audit workflow with cargo-audit + RUSTSEC advisory tracking
Changed
- cascade.rs: Removed dead with_metadata branch (single default path)
- containment(): Switched from substring to HashSet word-level matching
Fixed
- Word-level containment false positives ("a" matching inside "article")
- rustls-webpki vulnerability (RUSTSEC-2026-0104) via cargo update
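The similarity scoring and the containment fix above can be pictured with a small sketch. This is not the project's code: the function names, the tokenization (whitespace split, punctuation trimmed, lowercased) and the handling of empty inputs are all assumptions made for illustration. It shows Jaccard similarity over word sets, word-level containment via HashSet, and a substring variant that reproduces the "a" inside "article" false positive.

```rust
use std::collections::HashSet;

/// Split text into a set of lowercase words (assumed tokenization for this sketch).
fn word_set(text: &str) -> HashSet<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect()
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two word sets.
pub fn jaccard(a: &str, b: &str) -> f64 {
    let (a, b) = (word_set(a), word_set(b));
    let union = a.union(&b).count();
    if union == 0 {
        return 1.0; // assumption: two empty texts count as identical
    }
    a.intersection(&b).count() as f64 / union as f64
}

/// Word-level containment: share of `reference` words that occur in `candidate`.
pub fn containment(reference: &str, candidate: &str) -> f64 {
    let (r, c) = (word_set(reference), word_set(candidate));
    if r.is_empty() {
        return 1.0; // assumption: nothing to find means fully contained
    }
    r.intersection(&c).count() as f64 / r.len() as f64
}

/// Substring-based containment, kept only to illustrate the false positive.
pub fn containment_substring(reference: &str, candidate: &str) -> f64 {
    let words: Vec<&str> = reference.split_whitespace().collect();
    if words.is_empty() {
        return 1.0;
    }
    let hits = words.iter().filter(|w| candidate.contains(**w)).count();
    hits as f64 / words.len() as f64
}
```

With these definitions, containment_substring("a", "article") reports a full match because "a" occurs inside "article", while containment("a", "article") reports none, since the word sets share no element.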
v0.1.4
Added
- kawat-core: Core extraction orchestrator (0.1.4 milestone)
- cascade::run(): Full extraction pipeline with steps 1, 5, 6, 8a
- ExtractorOptions: Configuration struct mirroring trafilatura Extractor
- Document: Extracted document model with metadata and body text
- Size validation with min_extracted_size threshold
- kawat-output: TXT output formatter
- to_txt(): Plain text output with metadata header
- to_txt_body_only(): Body text without headers
- OutputFormat enum with #[non_exhaustive] for forward compatibility
- OutputError custom error type using thiserror
- CI/CD: GitHub Actions workflow with cargo-deny
- License audit with explicit allow list
- Security advisory checking
- Dependency banning for duplicates
Changed
- Document::to_formatted_string() now dispatches to format-specific handlers
- TXT output gates metadata header on with_metadata option (matches trafilatura behavior)
- Pre-commit hooks include cargo-deny for local license checking
Fixed
- Deprecated deny key removed from cargo-deny config
- Added missing licenses: Unicode-3.0, Zlib
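As a rough illustration of the TXT behavior described in this release (header gated on with_metadata, body-only output, and a minimum-size check), the sketch below uses simplified stand-in types. The struct fields, free-function signatures, header layout and the passes_size_check helper are hypothetical and are not taken from the kawat crates; only the names with_metadata, min_extracted_size, to_txt and to_txt_body_only come from the notes.

```rust
/// Hypothetical subset of the extraction options named in the notes.
pub struct ExtractorOptions {
    pub with_metadata: bool,
    pub min_extracted_size: usize,
}

/// Hypothetical document model: a little metadata plus body text.
pub struct Document {
    pub title: Option<String>,
    pub url: Option<String>,
    pub body: String,
}

/// Body text only, without any header.
pub fn to_txt_body_only(doc: &Document) -> String {
    doc.body.trim().to_string()
}

/// Plain text output; the metadata header is emitted only when
/// `with_metadata` is set. The header layout here is invented.
pub fn to_txt(doc: &Document, opts: &ExtractorOptions) -> String {
    let body = to_txt_body_only(doc);
    if !opts.with_metadata {
        return body;
    }
    let mut out = String::new();
    if let Some(title) = &doc.title {
        out.push_str(&format!("title: {}\n", title));
    }
    if let Some(url) = &doc.url {
        out.push_str(&format!("url: {}\n", url));
    }
    out.push_str("---\n");
    out.push_str(&body);
    out
}

/// Size validation: accept only bodies at least `min_extracted_size`
/// characters long (hypothetical helper).
pub fn passes_size_check(doc: &Document, opts: &ExtractorOptions) -> bool {
    doc.body.trim().chars().count() >= opts.min_extracted_size
}
```

Gating the header inside to_txt keeps callers from having to choose between two formatters; to_txt_body_only remains available when the header is never wanted.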
Release v0.1.3
Changes
Installation
From crates.io (manual publishing)
cargo install kawat
From source
cargo install --git https://github.com/SHA888/kawat.git --tag v0.1.3
Downloads
Check the assets below for pre-compiled binaries.
Note
Crates.io publishing is done manually to ensure quality control.
What's Changed
- cargo(deps): update dom_smoothie requirement from 0.5 to 0.17 by @dependabot[bot] in #13
- ci: bump codecov/codecov-action from 5 to 6 by @dependabot[bot] in #10
- ci: bump actions/upload-artifact from 4 to 7 by @dependabot[bot] in #11
- ci: bump peter-evans/create-pull-request from 6 to 8 by @dependabot[bot] in #12
Full Changelog: v0.1.2...v0.1.3
v0.1.2
Added
- kawat-html: Complete HTML tree cleaning and tag normalization pipeline
- tree_cleaning(): Remove 44 MANUALLY_CLEANED tags and strip 20 MANUALLY_STRIPPED tags
- convert_tags(): Normalize HTML tags to internal catalog (h1-h6→head, b/strong/em/i→hi, a→ref, ul/ol→list, li→item, br→lb, blockquote→quote, del/s→del, code/pre→code)
- convert_link(): Resolve relative URLs against base_url using standards-compliant URL resolution
- _is_code_block(): Distinguish between inline code and code blocks
- handle_textnode() + process_node(): Text extraction and normalization for all element types
- link_density_test() and link_density_test_tables(): Link density filtering for content extraction
- delete_by_link_density(): Remove high-density link elements with backtracking
- kawat-extract: Custom KawatTree structure for lightweight HTML processing
- KawatNode and KawatTree structs with full traversal and manipulation methods
- HTML parsing with proper text/tail distinction
- Integration with kawat-html transformations
- 23 comprehensive unit tests
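The tag normalization listed for convert_tags() amounts to a lookup from HTML tag names to the internal catalog. The sketch below encodes only the mappings given in these notes. The function name convert_tag, its signature, and the choice to return unlisted tags unchanged are assumptions for illustration; the real function operates on a tree rather than on a single tag name.

```rust
/// Map one HTML tag name to its internal catalog name, following the
/// mapping listed in the release notes. Tags outside that list are
/// returned unchanged (an assumption of this sketch).
pub fn convert_tag(tag: &str) -> &str {
    match tag {
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" => "head",
        "b" | "strong" | "em" | "i" => "hi",
        "a" => "ref",
        "ul" | "ol" => "list",
        "li" => "item",
        "br" => "lb",
        "blockquote" => "quote",
        "del" | "s" => "del",
        "code" | "pre" => "code",
        other => other,
    }
}
```

For example, convert_tag("h3") yields "head" and convert_tag("strong") yields "hi", while a tag such as "div" passes through untouched in this sketch.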
Changed
- Improved HTML processing pipeline with immutable-first design
- Enhanced error handling with proper Result types throughout
Fixed
- Lifetime syntax errors in tree.rs (explicit '_ lifetime parameters)
- Test failures in convert_link and textnode modules
- Inline code formatting (missing closing backtick)
Testing
- 34 kawat-html unit tests (all passing)
- 23 kawat-extract unit tests (all passing)
- 57 total tests across all crates (all passing)
- Pre-commit hooks: Rust Format, Clippy, Cargo Audit (all passing)
v0.1.1
v0.1.1 - 2026-03-24
Added
- Comprehensive XPath evaluation engine with CSS selector fallback
- Benchmark suite for XPath performance testing
- Extensive test coverage (15+ tests) for all BODY_XPATH expressions
- Support for complex XPath patterns including (article)[1]
Changed
- Updated dependencies: lru (0.12→0.16), quick-xml (0.37→0.39), scraper (0.22→0.26), criterion (0.5→0.8), reqwest (0.12→0.13)
- Improved TLS configuration in reqwest (rustls-tls → rustls)
- Enhanced error handling in XPath evaluation
Fixed
- XPath fallback test failures for (article)[1] pattern
- CI/CD pipeline issues with deny.toml configuration
- Codecov upload failures for protected branches
- Security workflow permission handling
v0.1.0
- Initial workspace import of kawat crates (core, extract, html, xpath, metadata, output, CLI, etc.).
- Added date parsing regex fallback and chrono clock feature for htmldate-rs.
- Implemented FromStr for OutputFormat and hardened CLI format parsing.
- Resolved clippy warnings (unused imports/vars, format args) across crates.
- Added .gitignore and .pre-commit-config.yaml (fmt, clippy, cargo-audit) and ensured hooks pass.
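A FromStr implementation for an output-format enum, as mentioned above, typically looks like the following sketch. Only the names OutputFormat and FromStr come from the notes: the Json variant, the accepted spellings, the case-insensitive matching and the ParseFormatError type are hypothetical stand-ins and may differ from what the crate does.

```rust
use std::error::Error;
use std::fmt;
use std::str::FromStr;

/// Output formats; #[non_exhaustive] lets variants be added later
/// without breaking downstream matches.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputFormat {
    Txt,
    Json, // assumed variant, for illustration only
}

/// Hypothetical error returned for an unrecognized format name.
#[derive(Debug, PartialEq, Eq)]
pub struct ParseFormatError(String);

impl fmt::Display for ParseFormatError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unknown output format: {}", self.0)
    }
}

impl Error for ParseFormatError {}

impl FromStr for OutputFormat {
    type Err = ParseFormatError;

    /// Parse a format name, ignoring case and surrounding whitespace.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "txt" | "text" => Ok(OutputFormat::Txt),
            "json" => Ok(OutputFormat::Json),
            _ => Err(ParseFormatError(s.to_string())),
        }
    }
}
```

A CLI can then call "txt".parse::<OutputFormat>() on the user's argument and surface the error message directly instead of panicking on unknown input.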