Update push_data annotations to use JsonSerializable type #1191

Open
@vdusek

Description

Currently, we use these annotations for data / user_data in many places:

data: list[dict[str, Any]] | dict[str, Any]
data: dict[str, Any]

This works, but it isn't precise: Any admits every value, while we only accept JSON-serializable ones.

We've got this recursive alias:

from typing import TypeAlias, TypeVar, Union

J = TypeVar('J', bound='JsonSerializable')
JsonSerializable: TypeAlias = Union[
    list[J],
    dict[str, J],
    str,
    bool,
    int,
    float,
    None,
]

But if we use it for these variables:

data: list[dict[str, JsonSerializable]] | dict[str, JsonSerializable]
data: dict[str, JsonSerializable]

We run into variance-related errors, like this:

tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwright_crawler.py:450: error: Argument 1 to "__call__" of "PushDataFunction" has incompatible type "dict[str, str]"; expected "Union[list[dict[str, Union[list[Any], dict[str, Any], str, bool, int, float, None]]], dict[str, Union[list[Any], dict[str, Any], str, bool, int, float, None]]]"  [arg-type]
tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwright_crawler.py:450: note: "Dict" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwright_crawler.py:450: note: Consider using "Mapping" instead, which is covariant in the value type
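To illustrate why the checker rejects dict[str, str] here, the following sketch shows what invariance protects against. The function add_count and the variable names are made up for the example and are not part of the codebase:

```python
from typing import Union


def add_count(data: dict[str, Union[str, int]]) -> None:
    # Legal for the declared parameter type: the value type admits int.
    data['count'] = 1


names: dict[str, str] = {'name': 'example'}

# A type checker rejects this call because dict is invariant in its value
# type. At runtime it executes and leaves an int inside a dict[str, str].
add_count(names)  # type: ignore[arg-type]
```

A read-only parameter type such as Mapping has no item assignment in its interface, so the callee cannot do this, which is why it can be covariant in the value type.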

If we follow the suggestion and use Mapping and Sequence instead:

data: Sequence[Mapping[str, JsonSerializable]] | Mapping[str, JsonSerializable]

We end up with even more errors on the usage side, e.g.:

item = {'key': 'value', 'number': 42}
await dataset_client.push_data(item)

Error (dict[str, object] vs. Mapping[str, JsonSerializable]):

Argument 1 to "push_data" of "MemoryDatasetClient" has incompatible type "dict[str, object]"; expected "Union[Sequence[Mapping[str, Union[list[Any], dict[str, Any], str, bool, int, float, None]]], Mapping[str, Union[list[Any], dict[str, Any], str, bool, int, float, None]]]" Mypy[arg-type](https://mypy.readthedocs.io/en/latest/_refs.html#code-arg-type)

Is the JsonSerializable alias the right choice in this context, or should we adopt something different, and if so, how? The goal is precise JSON-serializable typing without variance errors or usage-side errors.

Labels

t-tooling: Issues with this label are in the ownership of the tooling team.
