Releases: weaiw/trove-ai

v1.1.0 — WeChat Channels (视频号) capture + smart generic extraction

31 May 01:25

✨ Highlights

🎬 WeChat Channels (视频号) capture

Links from channels.weixin.qq.com are now fetched and saved. Because Channels pages are JavaScript-rendered, they go through the new smart generic-extraction cascade, which renders the page with a headless browser when the raw HTML yields too little text, then pulls the main content from the rendered page.

🪶 Smart generic extraction cascade

Pages without a dedicated parser (视频号, CSDN, Juejin, Medium, SSPai, 36Kr, and any other site) now go through a three-stage pipeline for far cleaner main-content extraction:

  1. trafilatura extracts the article body from raw HTML (stable; strips nav/footer/ads);
  2. if the text is too short (a client-rendered page), the page is rendered with the bundled headless Chromium and re-extracted, keeping the longer result;
  3. otherwise it falls back to the original BeautifulSoup heuristic cleaner.

The downstream clean_to_markdown pipeline is unchanged.
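
The three stages above can be sketched roughly as follows. This is a minimal illustration, not Trove's actual code: the function name `extract_main_content`, the injected `extract` / `render` / `fallback` callables, and the 200-character threshold are all hypothetical. The notes do not say exactly when stage 3 runs, so the sketch assumes the fallback is used only when the first two stages produce no text.

```python
from typing import Callable, Optional

# Hypothetical threshold: the release notes do not state the real value.
MIN_CHARS = 200


def extract_main_content(
    url: str,
    raw_html: str,
    extract: Callable[[str], Optional[str]],  # e.g. an article-body extractor
    render: Callable[[str], str],             # e.g. a headless-browser renderer
    fallback: Callable[[str], str],           # e.g. a heuristic HTML cleaner
    min_chars: int = MIN_CHARS,
) -> str:
    """Sketch of a three-stage extraction cascade (assumed control flow)."""
    # Stage 1: try to extract the article body from the raw HTML.
    text = extract(raw_html) or ""

    # Stage 2: a very short result suggests a client-rendered page, so
    # render it, re-extract, and keep whichever result is longer.
    if len(text.strip()) < min_chars:
        rendered_text = extract(render(url)) or ""
        if len(rendered_text) > len(text):
            text = rendered_text

    # Stage 3 (assumption): if nothing usable was found, fall back to the
    # heuristic cleaner on the raw HTML.
    if not text.strip():
        text = fallback(raw_html)

    return text
```

In a real integration, `extract` could be `trafilatura.extract`, which returns `None` when it finds no content; the `or ""` guards cover that case. Passing the stages in as callables keeps the control flow testable without a browser or network access.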

📄 Article-scoped Q&A

On an article detail page, the assistant can now answer strictly from that one article: the whole article goes into context, and no library-wide vector search is run. A 📄 this-article / 📚 whole-library toggle appears in the chat box; the /r, /a, and /c commands still escalate to whole-library research/creation.
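
The interaction between the toggle and the slash commands could look something like the sketch below. The names `Scope` and `resolve_scope` are hypothetical, and the sketch assumes a command counts only when it is the first whitespace-delimited token of the message; the release notes do not describe how Trove actually parses it.

```python
from enum import Enum


class Scope(Enum):
    ARTICLE = "article"   # answer from the current article only
    LIBRARY = "library"   # whole-library research/creation


# Commands named in the release notes as escalating to the whole library.
ESCALATING_COMMANDS = ("/r", "/a", "/c")


def resolve_scope(message: str, toggle: Scope) -> Scope:
    """Pick the effective scope for one chat message (assumed logic)."""
    parts = message.strip().split(maxsplit=1)
    first_token = parts[0] if parts else ""
    # A leading /r, /a, or /c overrides the toggle and goes library-wide.
    if first_token in ESCALATING_COMMANDS:
        return Scope.LIBRARY
    # Otherwise respect whatever the user selected in the chat box.
    return toggle
```

For example, `resolve_scope("/r compare with other posts", Scope.ARTICLE)` yields `Scope.LIBRARY`, while a plain question keeps the toggle's value.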

🐛 Fixes

  • Generic web capture: restored the missing _extract_content / _extract_title / _extract_author / _extract_cover helpers. Their absence previously caused capture of any site without a dedicated parser to fail.
  • Xiaohongshu image proxying: restored the missing _proxy_url / _proxy_imgs_in_html helpers (and the hotlink-protected CDN list).

📦 Dependencies

  • Added trafilatura>=2.0.0,<3 and lxml_html_clean>=0.4.0.

Full changelog: see CHANGELOG.md