v0.3.0

@svdC1 released this 24 May 22:59
· 5 commits to main since this release

Added

  • scrape_do.async_api sub-package: ScrapeDoAsyncAPIClient (backed by httpx.Client) and AsyncScrapeDoAsyncAPIClient (backed by httpx.AsyncClient) covering the full q.scrape.do surface: create_job, get_job, list_jobs, get_task, cancel_job, get_user_info, plus polling helpers wait_for_job and submit_and_wait. Typed status-code error routing with automatic retries on transient gateway errors (429 / 502 / 503 / 504) and per-request r_timeout / extensions escape hatches.

  • Polling configuration: PollingStrategy (configurable exponential backoff with jitter, plus attempt-count and wall-clock budgets) and the PollingFunction type alias for fully-custom cadences. Both share the same (attempt, elapsed, job) -> float signature so wait_for_job accepts either interchangeably.

  • SDK-native event hooks for the Async API: AsyncAPIEventHooks (sync) and AsyncAPIAsyncEventHooks (async). Lifecycle covers request / response / retry / poll; the poll hook receives a parsed JobDetails snapshot on every non-terminal polling iteration.

  • scrape_do.plugins sub-package — typed *Parameters models for the Amazon and Google plugin gateways with cross-field validation. Companion *AsyncPlugin adapters under scrape_do.async_api.models.plugins plug into JobCreationRequest.plugin via a discriminated union. Every adapter (and the AsyncPlugin union itself) is also re-exported from scrape_do.async_api so the typical import pattern is two lines: from scrape_do.async_api import AsyncScrapeDoAsyncAPIClient, AmazonPdpAsyncPlugin + from scrape_do.plugins import AmazonPdpParameters. Also adds public Google localization constants.

  • Typed Async-API exception hierarchy: AsyncAPIError (base) and per-status-code subclasses, AsyncAPIUnparsableResponseError for 2xx bodies the SDK can't parse, JobFailedError / JobCanceledError / TaskFailedError / TaskCanceledError for terminal lifecycle states, and JobTimeoutError for exhausted polling budgets. AsyncScrapeDoErrorMessage parses the gateway's {Error, Code} envelope.

  • ScrapeDoJSONErrorMessage — pydantic model for the structured JSON error envelope returned by the synchronous gateway. Exposes status_code / messages / url / possible_causes / error_type / error_code / contact, plus an is_auth_throttle property for detecting the auth-throttle case.

  • ScrapeDoResponse ergonomics: __repr__ / __str__ for REPL inspection, to_dict() and to_json(**kwargs) for serialization, and a fixed json(raw_response=False) that extracts the content key from the Scrape.do JSON envelope when present.

  • scrape_do.models.validators — public helpers for parameter cross-validation (check_geo_code, check_postal_code, check_geo_exclusion, screenshot / return-json / play-with-browser dependency rules, etc.) usable standalone without instantiating a parameters model.
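
The (attempt, elapsed, job) -> float signature mentioned under "Polling configuration" can be illustrated with a standalone sketch. It does not import the SDK: make_backoff, PollFn, and the base / factor / cap / jitter parameters are hypothetical names chosen for illustration, attempts are assumed to count from 0, and the real PollingStrategy may compute its delays differently.

```python
import random
from typing import Any, Callable, Optional

# Hypothetical stand-in for the (attempt, elapsed, job) -> float shape named above.
PollFn = Callable[[int, float, Any], float]


def make_backoff(
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
    jitter: float = 0.25,
    rng: Optional[random.Random] = None,
) -> PollFn:
    """Return a polling function that yields the next sleep, in seconds."""
    source = rng if rng is not None else random.Random()

    def next_delay(attempt: int, elapsed: float, job: Any) -> float:
        # Assumption: attempts are counted from 0. `elapsed` and `job` are
        # accepted to match the signature but ignored by this simple cadence.
        raw = min(cap, base * factor ** attempt)
        spread = raw * jitter
        return max(0.0, raw + source.uniform(-spread, spread))

    return next_delay


# With jitter disabled the schedule is deterministic.
plain = make_backoff(jitter=0.0)
schedule = [plain(n, 0.0, None) for n in range(6)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

A custom cadence that inspects elapsed or job (for example, polling faster while a job is still queued) would keep the same three-argument shape.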

Changed

  • APIResponseError now uses ScrapeDoJSONErrorMessage.try_from_response for body parsing instead of the legacy key-list lookup (detail, Error, errorMessage, message, Message). Error messages are richer and the "Unknown API Error" fallback prints status + body on separate lines.

  • Added typing_extensions>=4.0 as a direct runtime dependency.
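
As a rough illustration of why the backport matters on older interpreters, the sketch below imports Self and TypeAlias with a fallback to typing_extensions. The release notes do not show the import pattern the SDK itself uses, so the try/except form is an assumption, and RequestBuilder and Headers are hypothetical names that are not part of scrape_do.

```python
try:
    # Python 3.11+ ships both names in the standard library.
    from typing import Self, TypeAlias
except ImportError:
    # Older interpreters need the typing_extensions backport.
    from typing_extensions import Self, TypeAlias

# Hypothetical alias, written as a string so it is not evaluated at import time.
Headers: TypeAlias = "dict[str, str]"


class RequestBuilder:
    """Hypothetical fluent builder; Self keeps chained calls precisely typed."""

    def __init__(self) -> None:
        self.headers = {}

    def with_header(self, name: str, value: str) -> Self:
        self.headers[name] = value
        return self


builder = RequestBuilder().with_header("Accept", "text/html").with_header("X-Trace", "1")
print(builder.headers)  # {'Accept': 'text/html', 'X-Trace': '1'}
```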

Fixed

  • ScrapeDoFrame.url / ScrapeDoNetworkRequest.url relaxed from HttpUrl to str. Real-world iframes and network requests produce technically-valid but quirky URLs (e.g., ?feature=oembed?wmode=transparent) that pydantic-core's URL parser rejected, which blew up the whole response parse.

  • ScrapeDoResponse.cookies regex no longer captures structural whitespace after ; separators. Second-and-later cookie names previously came back with a phantom leading space.

  • ScrapeDoResponse constructor no longer crashes with JSONDecodeError when Scrape.do returns HTML instead of JSON under returnJSON=true — the failure is now properly routed through is_proxy_error.

  • RequestParameters.to_proxy_url now double-encodes the param string so values with URL-reserved characters (notably the JSON-string playWithBrowser payload) survive httpx's transparent decode of the proxy password during Basic auth header construction.

  • Python 3.9 / 3.10 compatibility restored. Source files importing Self / Unpack / TypeAlias from typing (only available in 3.11+ / 3.10+) now use typing_extensions. Previously the package raised ImportError at import time on 3.9 / 3.10 despite the trove classifiers claiming support.
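
The double-encoding fix to RequestParameters.to_proxy_url can be pictured with the standard library alone. This is a sketch under stated assumptions: the parameter string is invented, encode_once / encode_twice are hypothetical helpers, and unquote stands in for the single decode that the notes attribute to httpx. It does not reproduce the SDK's actual encoding code.

```python
from urllib.parse import quote, unquote

# Hypothetical parameter string; the JSON value imitates a playWithBrowser payload.
params = 'playWithBrowser=[{"Action":"Click","Selector":"#submit"}]&render=true'


def encode_once(value: str) -> str:
    # safe="" also escapes "/" so every reserved character is percent-encoded.
    return quote(value, safe="")


def encode_twice(value: str) -> str:
    # The second pass turns each "%" from the first pass into "%25".
    return quote(encode_once(value), safe="")


# Simulate a client that percent-decodes the password exactly once.
after_single = unquote(encode_once(params))
after_double = unquote(encode_twice(params))

print(after_single == params)                    # True: reserved characters are exposed again
print(after_double == encode_once(params))       # True: still percent-encoded
print("#" in after_single, "#" in after_double)  # True False
```

With a single pass, the one decode hands the raw JSON (including "#", "&", and quotes) back to whatever builds the header; with two passes, the same decode leaves a string that is still safely encoded.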

Internal

  • New scrape_do.async_api and scrape_do.plugins sub-package layout. Async-API helpers (_raise_for_status, _parse_response, _build_job_creation_request) live as module-level functions in scrape_do.async_api.client and are shared by both client classes.

  • New unit tests for scrape_do.async_api and models/response.py.

  • Integration coverage expanded from 22 → ~120 tests across the Sync API, Proxy Mode, and Async API surfaces. The new tests/integration/async_api/ suite exercises every endpoint, both client classes, polling helpers, event hooks, the render envelope, a live PlayWithBrowser action sequence, the typed-exception hierarchy, and 12 of the 15 *AsyncPlugin variants. The remaining three (google/trends, walmart/store, lowes/store) are unit-only; they hit upstream- or engine-side failures regardless of input.

  • Integration logging pipeline formalized around pytest.hookimpl-decorated setup / makereport / teardown hooks with per-test tokens stashed on item.stash; _validate_and_log_error_state consolidated into a response_trace fixture.

  • Unit test fixtures consolidated; new shared tests/unit/async_api/conftest.py for the Async-API unit suite plus tests/integration/async_api/conftest.py exposing live client fixtures, a tight fast_polling_strategy, best-effort cancel helpers, and a type-dispatched async_api_response_trace.

  • CI matrix expanded to Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 (fail-fast: false); lint job (ruff + mypy) split out and pinned to 3.13.

Full Changelog: v0.2.0...v0.3.0