Releases: weaiw/trove-ai
v1.1.0 — WeChat Channels (视频号) capture + smart generic extraction
✨ Highlights
🎬 WeChat Channels (视频号) capture
Links from channels.weixin.qq.com are now fetched and saved. Because Channels pages are JavaScript-rendered, they ride the new smart generic-extraction cascade — Trove renders the page with a headless browser before pulling the main content.
🪶 Smart generic extraction cascade
Pages without a dedicated parser (视频号, CSDN, Juejin, Medium, SSPai, 36Kr, and any other site) now go through a three-stage pipeline for far cleaner main-content extraction:
- `trafilatura` extracts the article body from raw HTML (stable; strips nav/footer/ads);
- if the text is too short (a client-rendered page), the page is rendered with the bundled headless Chromium and re-extracted, keeping the longer result;
- otherwise it falls back to the original BeautifulSoup heuristic cleaner.
The downstream clean_to_markdown pipeline is unchanged.
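The three stages can be sketched roughly as below. This is a minimal illustration, not Trove's actual code: the function name `extract_cascade`, the 200-character threshold, and the three injected callables are all assumptions made for the example. The release notes also do not say exactly when the heuristic cleaner takes over; the sketch assumes it runs only when the earlier stages produce no text at all.

```python
from typing import Callable, Optional

# Assumed threshold: the release notes do not state what counts as "too short".
MIN_TEXT_CHARS = 200


def extract_cascade(
    url: str,
    raw_html: str,
    extract_main: Callable[[str], Optional[str]],   # e.g. a wrapper around trafilatura.extract
    render_page: Callable[[str], Optional[str]],    # hypothetical headless-browser render, returns HTML
    heuristic_clean: Callable[[str], str],          # hypothetical BeautifulSoup-based cleaner
    min_chars: int = MIN_TEXT_CHARS,
) -> str:
    """Return the main content of a page using a three-stage cascade."""
    # Stage 1: extract from the raw HTML as fetched.
    text = extract_main(raw_html) or ""

    # Stage 2: a very short result suggests a client-rendered page, so render
    # it in a browser, re-extract, and keep whichever result is longer.
    if len(text) < min_chars:
        rendered_html = render_page(url)
        if rendered_html:
            rendered_text = extract_main(rendered_html) or ""
            if len(rendered_text) > len(text):
                text = rendered_text

    # Stage 3 (assumed trigger): nothing was extracted, so fall back to the
    # heuristic cleaner on the raw HTML.
    if not text:
        return heuristic_clean(raw_html)
    return text
```

Passing the stages in as callables keeps the sketch runnable without a browser installed, and makes each stage easy to replace in tests.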
📄 Article-scoped Q&A
On an article detail page the assistant can now answer strictly from that one article (the whole article goes into context, with no library-wide vector search). A 📄 this-article / 📚 whole-library toggle appears in the chat box; /r /a /c still escalate to whole-library research/creation.
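The scope decision described above might look something like the following. It is a sketch under stated assumptions: `choose_scope`, `build_context`, the `toggle` values, and the `search_library` callable are hypothetical names, and the commands are matched only as the first token of the message.

```python
from typing import Callable, List, Optional

# Commands that, per the release notes, still escalate to the whole library.
ESCALATING_COMMANDS = {"/r", "/a", "/c"}


def choose_scope(message: str, article_text: Optional[str], toggle: str = "article") -> str:
    """Return "article" or "library" for a chat message."""
    tokens = message.strip().split(maxsplit=1)
    first_token = tokens[0] if tokens else ""
    # Slash commands always escalate, regardless of the toggle.
    if first_token in ESCALATING_COMMANDS:
        return "library"
    # Outside an article page, or with the toggle on whole-library, use the library.
    if article_text is None or toggle == "library":
        return "library"
    return "article"


def build_context(
    message: str,
    article_text: Optional[str],
    toggle: str,
    search_library: Callable[[str], List[str]],  # hypothetical vector-search function
) -> List[str]:
    """Collect the text passages handed to the assistant."""
    if choose_scope(message, article_text, toggle) == "article":
        # Article scope: the whole article goes into context, with no search.
        return [article_text]
    return search_library(message)
```

For example, `choose_scope("/r compare with my other notes", "article body")` returns `"library"`, while a plain question on the same page stays in `"article"` scope.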
🐛 Fixes
- Generic web capture: restored the missing `_extract_content`/`_extract_title`/`_extract_author`/`_extract_cover` helpers; their absence previously caused capture of any non-custom site to fail.
- Xiaohongshu image proxying: restored the missing `_proxy_url`/`_proxy_imgs_in_html` helpers (and the hotlink-protected CDN list).
📦 Dependencies
- Added `trafilatura>=2.0.0,<3` and `lxml_html_clean>=0.4.0`.
Full changelog: see CHANGELOG.md