v0.6.0

@rocklambros released this 20 Feb 05:56
· 165 commits to main since this release

v0.6.0 — Security, Performance & Quality

Security

  • SSRF protection: URL fetching validates resolved IPs against private/reserved/loopback/link-local ranges
  • File size limits: 100 MB default, configurable via the --max-file-size CLI flag
  • Hardened filename sanitization: Strips null bytes, control characters, and path separators
  • YAML escape fix: Properly handles newlines and carriage returns in frontmatter values
  • CI permissions: Added contents: read to workflow
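
As an illustration of the SSRF check described above, here is a minimal sketch of validating resolved IPs before fetching a URL. The function names `is_public_address` and `assert_url_is_safe` are hypothetical and do not come from any2md; the sketch only assumes the standard-library `ipaddress` and `socket` modules, and the project's actual implementation may differ.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_public_address(ip_text: str) -> bool:
    """Return True only for addresses outside the blocked ranges."""
    # Drop an IPv6 scope id such as "%eth0" before parsing.
    ip = ipaddress.ip_address(ip_text.split("%")[0])
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def assert_url_is_safe(url: str) -> None:
    """Resolve the URL's host and raise ValueError if any resolved IP is blocked."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"unsupported URL: {url!r}")
    # getaddrinfo may return several addresses; every one must be public.
    for info in socket.getaddrinfo(parsed.hostname, parsed.port or 443):
        address = info[4][0]
        if not is_public_address(address):
            raise ValueError(
                f"{parsed.hostname} resolves to blocked address {address}"
            )
```

A caller would run `assert_url_is_safe(url)` before issuing the request. Note that a check of this kind only covers the addresses seen at validation time; a later DNS lookup by the HTTP client could return a different address unless the connection is pinned to the validated IP.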

Performance

  • Lazy imports: Converter modules loaded on-demand (~300-800ms startup improvement)
  • Pre-compiled regexes: All regex patterns compiled once at module level
  • PDF single-parse: Opens each PDF once with a context manager and passes the open document to pymupdf4llm
  • lxml parser: Faster HTML pre-cleaning via lxml instead of html.parser
  • Trafilatura-first HTML: Calls trafilatura.extract(output_format="markdown") directly and falls back to markdownify only when needed
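
The lazy-import item can be pictured with a small sketch. The registry `_CONVERTER_MODULES` and the function `load_converter` are hypothetical names, and standard-library modules stand in for real converter modules so the example runs on its own; this is not any2md's actual layout.

```python
import importlib
from types import ModuleType

# Hypothetical registry: file extension -> module that handles it.
# Standard-library modules stand in for real converter modules here.
_CONVERTER_MODULES = {
    ".json": "json",
    ".csv": "csv",
    ".html": "html.parser",
}


def load_converter(extension: str) -> ModuleType:
    """Import the handler module for an extension only when first requested."""
    try:
        module_name = _CONVERTER_MODULES[extension.lower()]
    except KeyError:
        raise ValueError(f"no converter registered for {extension!r}") from None
    # import_module consults sys.modules first, so repeat calls are cheap.
    return importlib.import_module(module_name)
```

With this shape, startup only pays for the registry dictionary; the cost of importing a heavy dependency is deferred until a file of that type is actually processed.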

Fixes

  • Fixed lettered list regex false positive on uppercase names (e.g. "A. Einstein")
  • Fixed CLI skip counter logic (early-return before converter call)
  • Moved SCRIPT_DIR inside main() to avoid module-level side effect
  • Narrowed exception handlers across all converters
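
To make the lettered-list false positive concrete, the sketch below shows one possible way to avoid matching names like "A. Einstein": accept only a lowercase marker letter. The release notes do not show the actual pattern, so the regex and the name `is_lettered_list_item` are assumptions for illustration only.

```python
import re

# Compiled once at import time rather than on every call.
# Assumed rule: a single lowercase letter, then "." or ")",
# then whitespace and at least one non-space character.
LETTERED_ITEM = re.compile(r"^\s*([a-z])[.)]\s+\S")


def is_lettered_list_item(line: str) -> bool:
    """Return True if the line looks like a lowercase lettered list item."""
    return LETTERED_ITEM.match(line) is not None
```

Under this assumed rule, "a. first item" and "  b) second item" count as list items, while "A. Einstein" does not because its marker letter is uppercase.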

Improvements

  • Shared build_frontmatter() helper eliminates 4-way frontmatter duplication
  • Shared read_text_with_fallback() helper for encoding detection
  • convert_url() convenience wrapper simplifies URL processing
  • Minimum version bounds on all dependencies
  • Added lxml>=5.0.0 as explicit dependency
  • Updated README: Python 3.10+ requirement, security section, feature table updates
  • GitHub metadata: CI workflow, CodeQL analysis, Dependabot, issue/PR templates
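
A shared frontmatter helper combined with the newline and carriage-return escaping mentioned under Security might look roughly like the sketch below. The signature of the real build_frontmatter() is not given in these notes, so `build_frontmatter_sketch`, `yaml_quote`, and the choice of double-quoted YAML scalars are all assumptions.

```python
def yaml_quote(value: str) -> str:
    """Render a string as a double-quoted YAML scalar on a single line."""
    escaped = (
        value.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def build_frontmatter_sketch(fields: dict[str, str]) -> str:
    """Build a YAML frontmatter block from already-validated keys (assumed)."""
    lines = ["---"]
    for key, value in fields.items():
        lines.append(f"{key}: {yaml_quote(str(value))}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Escaping line breaks keeps each value on one line, so a title containing a newline cannot terminate the scalar early or inject extra keys into the frontmatter.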

Install / Upgrade

pip install --upgrade any2md