Representing & checking Dataset schemas  #1900

@max-sixty

What would be the best way to canonically describe a dataset, which could be read by both humans and machines?

For example, frequently in our code we have docstrings which look something like:

def get_returns(security_ids):
    """
    Returns a mega-dimensional dataset which gives recent returns for a set of
    securities by:
    - Date
    - Return (raw / economic / smoothed / etc)
    - Scaling (constant / risk_scaled)
    - Span
    - Hedged vs Unhedged

    Dataset keys are security ids. All dimensions have coords.
    """

This helps when attempting to understand what code is doing while only reading it.
But this isn't consistent between docstrings and can't be read or checked by a machine.
Has anyone solved this problem, or does anyone have suggestions for resources out there?

Tangentially related to python/typing#513 (but our issues are less about types and dimension sizes, and more about the arrays within a dataset, their dimensions, and their names)
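
To make the question concrete, here is a minimal sketch of what a machine-checkable description could look like. It is one possible shape, not an existing API: `DatasetSchema`, `check_schema`, `fake_array`, and the dimension names in `RETURNS_SCHEMA` are all hypothetical, with the names guessed from the docstring above. The checker only assumes the dataset exposes a `data_vars` mapping whose values have `dims` and `coords`, and the stand-in objects let it run without any array library installed.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class DatasetSchema:
    """Hypothetical schema: the dimensions every array in the dataset must have."""
    dims: tuple
    require_coords: bool = True


def check_schema(dataset, schema):
    """Return a list of readable problems; an empty list means the dataset conforms."""
    problems = []
    for name, array in dataset.data_vars.items():
        # Dimension order is ignored; only the set of names is compared.
        if set(array.dims) != set(schema.dims):
            problems.append(f"{name}: dims {tuple(array.dims)} != {schema.dims}")
        if schema.require_coords:
            for dim in array.dims:
                if dim not in array.coords:
                    problems.append(f"{name}: dimension {dim!r} has no coords")
    return problems


# Assumed dimension names, taken loosely from the get_returns docstring.
RETURNS_SCHEMA = DatasetSchema(
    dims=("date", "return_type", "scaling", "span", "hedging")
)


def fake_array(dims, coords):
    """Stand-in for a labelled array, so the sketch runs on its own."""
    return SimpleNamespace(dims=tuple(dims), coords={c: [] for c in coords})


good = SimpleNamespace(
    data_vars={"sec_1": fake_array(RETURNS_SCHEMA.dims, RETURNS_SCHEMA.dims)}
)
bad = SimpleNamespace(
    data_vars={"sec_2": fake_array(("date", "span"), ("date",))}
)

print(check_schema(good, RETURNS_SCHEMA))
for problem in check_schema(bad, RETURNS_SCHEMA):
    print(problem)
```

A schema object like this could be referenced from the docstring and asserted in tests, so the human-readable and machine-readable descriptions would not drift apart.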
