feat: add Threads Trending Now as content source via Playwright scraping #3
- Create `threads/trending.py`: Playwright-based scraper for Threads trending topics and thread replies
- Modify `threads/threads_client.py`: add source config check, integrate trending scraper with fallback to user threads
- Update `.config.template.toml`: add source option (user/trending)

Agent-Logs-Url: https://github.com/thaitien280401-stack/RedditVideoMakerBot/sessions/01a85c1b-5157-4723-80f1-ca726e410a39
Co-authored-by: thaitien280401-stack <271128961+thaitien280401-stack@users.noreply.github.com>
…r setup

Agent-Logs-Url: https://github.com/thaitien280401-stack/RedditVideoMakerBot/sessions/01a85c1b-5157-4723-80f1-ca726e410a39
Co-authored-by: thaitien280401-stack <271128961+thaitien280401-stack@users.noreply.github.com>
Pull request overview
Adds a new Threads content source that scrapes “Trending now” via Playwright, enabling the pipeline to generate videos from trending topics when the official Threads API can’t provide search/trending data.
Changes:
- Added a Playwright-based trending scraper (`threads/trending.py`) that extracts trending topics, threads, and replies.
- Wired trending sourcing into `get_threads_posts()` via a new `[threads.thread].source` config option, with fallback to user-based sourcing on scrape failure.
- Replaced the repeated magic `200` title slicing with `_MAX_TITLE_LENGTH`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `utils/.config.template.toml` | Introduces `[threads.thread].source` to select between `user` and `trending`. |
| `threads/trending.py` | New Playwright scraper for trending topics, thread posts, and replies. |
| `threads/threads_client.py` | Adds trending integration + fallback, and centralizes the title length constant. |
```python
container = link_handle.evaluate_handle(
    """el => {
        let node = el;
        for (let i = 0; i < 10; i++) {
            node = node.parentElement;
            if (!node) return el.parentElement || el;
            const text = node.innerText || '';
            if (text.length > 30 && (
                node.getAttribute('role') === 'article' ||
                node.tagName === 'ARTICLE' ||
                node.dataset && node.dataset.testid
            )) {
                return node;
            }
        }
        return el.parentElement ? el.parentElement.parentElement || el.parentElement : el;
    }"""
)
raw = container.inner_text().strip() if container else ""
```
`_get_post_text()` uses `evaluate_handle()` but then calls `container.inner_text()`. In Playwright for Python, `evaluate_handle()` returns a `JSHandle`, which is not guaranteed to have an `inner_text()` method; convert via `container_handle.as_element()` (which may return `None`) before reading the text, or use `link_handle.evaluate()` to return a string directly. Otherwise the scraper will fail at runtime whenever the handle is not an `ElementHandle`.
```python
if source == "trending" and not POST_ID:
    content = _get_trending_content(
        max_comment_length=max_comment_length,
        min_comment_length=min_comment_length,
    )
    if content is not None:
        return content
    # Fallback: if trending fails, continue with user threads
    print_substep(
        "⚠️ Trending unavailable, falling back to user threads...",
        style="bold yellow",
    )
```
The `source == "trending"` branch can return `content` even when not enough replies were scraped (possibly zero). Downstream code (TTS/`engine_wrapper` + `final_video`) assumes at least one comment in comment mode, so the zero-replies case leaves `comment_0.png`/`.mp3` missing and breaks the pipeline. Pass `min_comments` into `_get_trending_content()` and, when not in `storymode`, only return when `len(comments) >= min_comments`; otherwise return `None` so the code falls back to user threads.
```python
sanitised = sanitize_text(reply_text)
if not sanitised or sanitised.strip() == "":
    continue

if len(reply_text) > max_comment_length:
    continue
if len(reply_text) < min_comment_length:
    continue

content["comments"].append(
    {
        "comment_body": reply_text,
        "comment_url": "",
        "comment_id": re.sub(
            r"[^\w\s-]", "", f"trending_reply_{idx}"
        ),
        "comment_author": f"@{reply_username}",
    }
)
```
In `_get_trending_content()`, the variable `sanitised = sanitize_text(reply_text)` is only used for the emptiness check, while `comment_body` stores the raw, unsanitized `reply_text`. If `sanitize_text()` strips emoji and special characters so TTS doesn't fail, storing the raw text means later steps (`add_periods`/`process_text`) will still hit unwanted characters. Use `sanitised` as `comment_body` (or explain clearly why keeping the raw text is safe).
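A hedged sketch of the suggested change: build the comment dict from the sanitized text. The stand-in `sanitize_text` below only approximates the repo's real helper, and `build_comment` is an illustrative name, not from the PR:

```python
import re

def sanitize_text(text):
    # Stand-in for the repo's sanitize_text: drop characters that
    # commonly break TTS (emoji, stray symbols). Approximation only.
    return re.sub(r"[^\w\s.,!?'-]", "", text)

def build_comment(reply_text, reply_username, idx,
                  min_comment_length, max_comment_length):
    """Return a comment dict built from the sanitized body, or None to skip."""
    sanitised = sanitize_text(reply_text).strip()
    if not sanitised:
        return None
    # Length limits now apply to the text that will actually be spoken.
    if not (min_comment_length <= len(sanitised) <= max_comment_length):
        return None
    return {
        "comment_body": sanitised,  # sanitized, not the raw reply_text
        "comment_url": "",
        "comment_id": f"trending_reply_{idx}",
        "comment_author": f"@{reply_username}",
    }
```

Filtering on `sanitised` also keeps the length checks consistent with what TTS will actually receive.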
```python
if not text or _contains_blocked_words(text):
    continue
title_candidate = text[:_MAX_TITLE_LENGTH]
if is_title_used(title_candidate):
    print_substep(
        f"Skipping trending topic already used for a video: {text[:50]}...",
        style="bold yellow",
    )
    continue
```
In the trending branch, `is_title_used()` checks `title_candidate = text[:_MAX_TITLE_LENGTH]`, but the actual `thread_title` may end up being `topic_title`. This skews deduplication: the same `topic_title` can produce videos repeatedly because the history is keyed by `thread_title`. Use the same value (e.g. `display_title`/`topic_title` when available) for both the dedup check and `thread_title`.
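One way to implement the suggestion, as a sketch; `choose_title` is a hypothetical helper, and the precedence of `topic_title` over the thread text is an assumption:

```python
def choose_title(topic_title, text, max_len=200):
    """Pick one canonical title, used for BOTH dedup and thread_title.

    Prefer the scraped topic_title when available, otherwise fall back
    to the thread text, truncated as the original code does.
    """
    base = topic_title or text
    return base[:max_len]
```

The trending branch would then call `is_title_used(title)` and assign `content["thread_title"] = title` with the same value, so the history keys and the dedup check can no longer diverge.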
The app currently only fetches posts from a specific user's Threads profile via the official API. This adds support for sourcing content from the Trending Now section on Threads' search page instead.
The official Threads API has no trending/search endpoint, so this uses Playwright (already a dependency) to scrape `threads.net/search`.

Changes

- `threads/trending.py` (new): Playwright-based scraper; `scrape_thread_replies()` fetches replies from thread pages (the official API only exposes the authenticated user's own threads).
- `threads/threads_client.py`: added a `_get_trending_content()` helper, wired into `get_threads_posts()` via the new `source` config check; falls back to user threads automatically if scraping fails; extracted a `_MAX_TITLE_LENGTH` constant to replace the repeated magic `200`.
- `utils/.config.template.toml`: added the `source` option to `[threads.thread]`.

Usage
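The original Usage snippet did not survive extraction; below is a minimal sketch of what the option plausibly looks like in `utils/.config.template.toml`, based on the key names and values (`user`/`trending`) described in this PR, with `user` assumed to be the default:

```toml
[threads.thread]
# "user" = fetch a specific profile via the official API (existing behaviour, assumed default)
# "trending" = scrape the Trending Now section with Playwright
source = "trending"
```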
Notes
Web scraping is inherently fragile: if Threads changes their DOM structure, the selectors in `trending.py` will need updating. The automatic fallback to user-based sourcing ensures the pipeline doesn't break silently.