fix: drop markdown_v2 access in BasicWebScraper.extract by obchain · Pull Request #60 · sentient-agi/OpenDeepSearch

obchain · 2026-05-19T06:17:26Z

What

Apply the same markdown_v2 → markdown cleanup that #52 lands in WebScraper.extract to BasicWebScraper.extract. The two paths had byte-identical broken accesses; this PR brings the second one in line.

Why

src/opendeepsearch/context_scraping/basic_web_scraper.py:50-51:

if result.success:
    extraction_result.raw_markdown_length = len(result.markdown_v2.raw_markdown)
    extraction_result.citations_markdown_length = len(result.markdown_v2.markdown_with_citations)

crawl4ai 0.5.x removed markdown_v2 and replaced its __getattr__ with a hard AttributeError (full trace already in #34). BasicWebScraper is currently dead — crawl4ai_scraper.py only imports the ExtractionConfig dataclass that lives in the same module — but the class is public and the bug is real for anyone reaching for it.

Closes #59

How

src/opendeepsearch/context_scraping/basic_web_scraper.py:

compute markdown_obj = getattr(result, 'markdown', None) or getattr(result, 'markdown_v2', None) once per result
gate the length bookkeeping on result.success and markdown_obj is not None
read raw_markdown / markdown_with_citations with getattr(..., '') or '' so a partially-populated MarkdownGenerationResult doesn't crash either
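
The three steps above can be sketched as a standalone helper. This is a minimal illustration, not the exact diff: the function name `markdown_lengths` and the `SimpleNamespace` stand-ins are hypothetical, and only the attribute names (`success`, `markdown`, `markdown_v2`, `raw_markdown`, `markdown_with_citations`) come from this PR.

```python
from types import SimpleNamespace


def markdown_lengths(result):
    """Return (raw_markdown_length, citations_markdown_length), or (0, 0).

    Hypothetical helper illustrating the defensive access pattern.
    """
    # Prefer the current attribute; fall back to the legacy name so older
    # crawl4ai builds keep working. getattr with a default swallows the
    # AttributeError raised by a removed attribute.
    markdown_obj = getattr(result, "markdown", None) or getattr(result, "markdown_v2", None)

    # Gate the bookkeeping on both success and a usable markdown payload.
    if not getattr(result, "success", False) or markdown_obj is None:
        return 0, 0

    # "or ''" covers fields that exist but are set to None.
    raw = getattr(markdown_obj, "raw_markdown", "") or ""
    cited = getattr(markdown_obj, "markdown_with_citations", "") or ""
    return len(raw), len(cited)


# Stand-in for a partially populated markdown result (citations missing).
partial = SimpleNamespace(
    success=True,
    markdown=SimpleNamespace(raw_markdown="# Title", markdown_with_citations=None),
)
partial_lengths = markdown_lengths(partial)

# Stand-in for a failed crawl with no markdown payload at all.
failed = SimpleNamespace(success=False, markdown=None)
failed_lengths = markdown_lengths(failed)
```

With these stand-ins, the partially populated result yields a raw length of 7 and a citations length of 0, and the failed result yields zeros instead of raising.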

Kept narrowly scoped. Did not delete the unused BasicWebScraper class itself — that's a separate cleanup the maintainers may want to weigh.

Testing

python3 -m py_compile src/opendeepsearch/context_scraping/basic_web_scraper.py — clean
grep -n markdown_v2 src/opendeepsearch/context_scraping/basic_web_scraper.py — only the intentional fallback line remains
Mirrors the manual repro path documented in #52 (fix: use crawl4ai result.markdown instead of removed markdown_v2): with crawl4ai 0.5.x and a fake result whose markdown_v2 accessor raises, the unpatched code lets the AttributeError propagate, while the patched code reads result.markdown.raw_markdown.
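
A rough version of that manual repro, with no crawl4ai install required, might look like the sketch below. `FakeResult` and `FakeMarkdown` are hypothetical stand-ins, and the error message text is invented for illustration; the real crawl4ai message may differ.

```python
class FakeMarkdown:
    """Stand-in for the markdown payload on a crawl result."""
    raw_markdown = "hello"
    markdown_with_citations = "hello [1]"


class FakeResult:
    """Stand-in for a crawl result whose markdown_v2 accessor raises."""
    success = True
    markdown = FakeMarkdown()

    @property
    def markdown_v2(self):
        # Assumed behavior: accessing the removed attribute raises AttributeError.
        raise AttributeError("markdown_v2 is no longer available")


result = FakeResult()

# Unpatched access: the AttributeError propagates to the caller.
try:
    len(result.markdown_v2.raw_markdown)
    unpatched_raised = False
except AttributeError:
    unpatched_raised = True

# Patched access: result.markdown is found first, so markdown_v2 is never touched.
markdown_obj = getattr(result, "markdown", None) or getattr(result, "markdown_v2", None)
patched_length = len(getattr(markdown_obj, "raw_markdown", "") or "")
```

Because `or` short-circuits, the fallback `getattr` only runs when `result.markdown` is missing or falsy, and even then the default argument absorbs the `AttributeError`.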

`BasicWebScraper.extract` reads `result.markdown_v2.raw_markdown` / `markdown_with_citations`. crawl4ai 0.5.x removed the `markdown_v2` attribute and raises `AttributeError` on access.

This is the same bug the `WebScraper` path hits (tracked in sentient-agi#34); `BasicWebScraper` is currently dead code (no callers: only the `ExtractionConfig` dataclass in the same module is imported elsewhere), but the broken access would bite anyone reaching for it.

Mirror the fix used in the `WebScraper` cleanup: pull the markdown payload via `getattr(result, 'markdown', None) or getattr(result, 'markdown_v2', None)`, then read the length fields defensively. Older crawl4ai builds that still expose `markdown_v2` keep working through the fallback.

Fixes sentient-agi#59

obchain wants to merge 1 commit into sentient-agi:main from obchain:fix/basic-webscraper-markdown-v2
