fix: drop markdown_v2 access in BasicWebScraper.extract#60

Open
obchain wants to merge 1 commit into sentient-agi:main from obchain:fix/basic-webscraper-markdown-v2


@obchain obchain commented May 19, 2026

What

Apply the same markdown_v2 → markdown cleanup that #52 lands in WebScraper.extract to BasicWebScraper.extract. The two paths had byte-identical broken accesses; this PR brings the second one in line.

Why

src/opendeepsearch/context_scraping/basic_web_scraper.py:50-51:

if result.success:
    extraction_result.raw_markdown_length = len(result.markdown_v2.raw_markdown)
    extraction_result.citations_markdown_length = len(result.markdown_v2.markdown_with_citations)

crawl4ai 0.5.x removed markdown_v2 and replaced its __getattr__ with a hard AttributeError (full trace already in #34). BasicWebScraper is currently dead — crawl4ai_scraper.py only imports the ExtractionConfig dataclass that lives in the same module — but the class is public and the bug is real for anyone reaching for it.

Closes #59

How

src/opendeepsearch/context_scraping/basic_web_scraper.py:

  • compute markdown_obj = getattr(result, 'markdown', None) or getattr(result, 'markdown_v2', None) once per result
  • gate the length bookkeeping on result.success and markdown_obj is not None
  • read raw_markdown / markdown_with_citations with getattr(..., '') or '' so a partially-populated MarkdownGenerationResult doesn't crash either
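
For reviewers who want the three steps above in one place, here is a minimal sketch rather than the literal diff. The helper name `markdown_lengths` and the `SimpleNamespace` stand-ins are hypothetical, and returning `(0, 0)` when the gate fails is an assumption made for illustration; the real method writes to `extraction_result` instead.

```python
from types import SimpleNamespace


def markdown_lengths(result):
    """Return (raw_length, citations_length) for a crawl result.

    Hypothetical helper: the real code sets fields on extraction_result.
    Returning (0, 0) when the gate fails is an assumption of this sketch.
    """
    # Prefer the new attribute; fall back to the old one if it still exists.
    # getattr with a default swallows AttributeError, so a removed
    # markdown_v2 accessor simply yields None here.
    markdown_obj = getattr(result, "markdown", None) or getattr(result, "markdown_v2", None)

    # Gate the bookkeeping on success and on having a markdown payload at all.
    if not getattr(result, "success", False) or markdown_obj is None:
        return 0, 0

    # A missing attribute or a None value both collapse to the empty string.
    raw = getattr(markdown_obj, "raw_markdown", "") or ""
    cited = getattr(markdown_obj, "markdown_with_citations", "") or ""
    return len(raw), len(cited)


# Stand-in results (not real crawl4ai objects).
new_style = SimpleNamespace(
    success=True,
    markdown=SimpleNamespace(raw_markdown="# Title", markdown_with_citations=None),
)
old_style = SimpleNamespace(
    success=True,
    markdown_v2=SimpleNamespace(raw_markdown="abc", markdown_with_citations="abc [1]"),
)
failed = SimpleNamespace(success=False, markdown=SimpleNamespace(raw_markdown="ignored"))
```

The trailing `or ''` matters because `getattr(..., '')` only covers a missing attribute; a field that is present but set to `None` would still break `len()` without it.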

Kept narrowly scoped. Did not delete the unused BasicWebScraper class itself — that's a separate cleanup the maintainers may want to weigh.

Testing

  • python3 -m py_compile src/opendeepsearch/context_scraping/basic_web_scraper.py — clean
  • grep -n markdown_v2 src/opendeepsearch/context_scraping/basic_web_scraper.py — only the intentional fallback line remains
  • Mirrors the manual repro path documented in #52 ("fix: use crawl4ai result.markdown instead of removed markdown_v2"): with crawl4ai 0.5.x and a fake result whose markdown_v2 accessor raises, the unpatched code re-raises, while the patched code reads result.markdown.raw_markdown.
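
The manual repro can be approximated without installing crawl4ai. The sketch below is an assumption-laden stand-in: `FakeResult` and `FakeMarkdown` are hypothetical classes, the error message is invented, and the two functions only imitate the before and after shape of the length bookkeeping.

```python
class FakeMarkdown:
    """Hypothetical stand-in for the markdown payload on a result."""
    raw_markdown = "# Hello"
    markdown_with_citations = "# Hello [1]"


class FakeResult:
    """Hypothetical stand-in for a crawl4ai 0.5.x result, not the real class."""
    success = True
    markdown = FakeMarkdown()

    @property
    def markdown_v2(self):
        # Imitates the removed attribute; the message text is invented.
        raise AttributeError("markdown_v2 is not available in this sketch")


def unpatched_raw_length(result):
    # Shape of the old access: touches markdown_v2 directly.
    return len(result.markdown_v2.raw_markdown)


def patched_raw_length(result):
    # Shape of the new access: prefer markdown, fall back defensively.
    markdown_obj = getattr(result, "markdown", None) or getattr(result, "markdown_v2", None)
    if not result.success or markdown_obj is None:
        return 0  # assumption of this sketch; the real code skips the bookkeeping
    return len(getattr(markdown_obj, "raw_markdown", "") or "")


def unpatched_raises(result):
    """Report whether the old access pattern fails on this result."""
    try:
        unpatched_raw_length(result)
    except AttributeError:
        return True
    return False
```

Calling `unpatched_raises(FakeResult())` should report the failure, while `patched_raw_length(FakeResult())` reads the length from `result.markdown.raw_markdown`.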

`BasicWebScraper.extract` reads `result.markdown_v2.raw_markdown` /
`markdown_with_citations`. crawl4ai 0.5.x removed the `markdown_v2`
attribute and raises `AttributeError` on access. This is the same bug
the `WebScraper` path hits (tracked in sentient-agi#34); `BasicWebScraper` is
currently dead code (no callers — only the `ExtractionConfig` dataclass
in the same module is imported elsewhere), but the broken access would
bite anyone reaching for it.

Mirror the fix used in the `WebScraper` cleanup: pull the markdown
payload via `getattr(result, 'markdown', None) or getattr(result,
'markdown_v2', None)`, then read the length fields defensively. Older
crawl4ai builds that still expose `markdown_v2` keep working through
the fallback.

Fixes sentient-agi#59

Development

Successfully merging this pull request may close these issues.

BasicWebScraper.extract carries the same markdown_v2 deprecation bug as #34
