core and backend pandera API internals rewrite (#913)
* [wip] execution backends

* implement container, field, component backends

* wip

* implement ArraySchema for pandas

* move coerce logic to backend

* implement index and multiindex

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* move error_formatters, check cleanup

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* [wip]

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* implement checks

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* built-in checks

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* add todo

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* handle core and backend mypy issues

* fix test_schemas, test_checks unit tests

* [wip] fix unit tests for core

* fix io tests

* fix strategies tests

* fix mypy tests

* fix dask tests

* fix modin tests

* fix pyspark tests

* fix io and checks

* fix pylint

* make mypy happy

* update tests

* update requirements file

* pylint ignore get_type_hints

* fix requirements, bump cache

* debug ci conda issue

* debug ci

* debug ci

* debug ci

* use mamba

* debug ci: use mamba and environment.yml

* revert sphinx-autodoc-typehints pin

* fix ci

* fix ci

* favor typing_extensions

* use new error_formatters

* debugging tests

* fix doc test

* pin frictionless, fix modin tests and docs

* fix setup

* test full matrix

* rewrite Hypothesis internals

* clean up hypothesis modules and functions

* fix hypothesis docstring

* delete old modules for io, schema inference and statistics

* delete schemas.py

* delete model.py and model_components.py

* delete *_accessor.py

* delete extensions.py module

* delete checks and hypotheses modules

* delete schema_components

* delete pandera/typing/config.py

* clean up import statements

* clean up imports

* clean up docs

* fix mypy tests

* clean up docs

* update try pandera notebook

* fix mypy

* fix typo

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
cosmicBboy committed Jan 24, 2023
1 parent e2bc5b9 commit 061f989
Showing 139 changed files with 7,177 additions and 5,728 deletions.
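The headline change: schema *specification* now lives under `pandera.core` (e.g. `pandera.core.pandas.container.DataFrameSchema`), while validation logic moves into per-framework execution backends, as the commit messages above describe ("implement container, field, component backends", "move coerce logic to backend"). Below is a minimal sketch of the registry pattern this implies — all names here (`BaseSchema`, `BaseBackend`, `register_backend`) are hypothetical illustrations, not pandera's actual internals:

```python
from typing import Any, Dict, Type

import pandas as pd


class BaseBackend:
    """Holds the validation logic for one dataframe framework."""

    def validate(self, check_obj: Any, schema: "BaseSchema") -> Any:
        raise NotImplementedError


class BaseSchema:
    """Holds the schema specification; delegates validation to a backend."""

    _backends: Dict[Type, Type[BaseBackend]] = {}

    @classmethod
    def register_backend(cls, obj_type: Type, backend: Type[BaseBackend]) -> None:
        cls._backends[obj_type] = backend

    def validate(self, check_obj: Any) -> Any:
        # dispatch on the type of the object being validated
        backend = self._backends[type(check_obj)]()
        return backend.validate(check_obj, schema=self)


class PandasContainerBackend(BaseBackend):
    """Pandas-specific validation; e.g. coercion logic lives here."""

    def validate(self, check_obj: pd.DataFrame, schema: BaseSchema) -> pd.DataFrame:
        # a real implementation would coerce dtypes and run schema checks
        return check_obj


BaseSchema.register_backend(pd.DataFrame, PandasContainerBackend)
```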
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/documentation-improvement.md
@@ -9,7 +9,7 @@ assignees: ''

#### Location of the documentation

-[this should provide the location of the documentation, e.g. "pandera.schemas.DataFrameSchema" or the URL of the documentation, e.g. "https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#column-validation"]
+[this should provide the location of the documentation, e.g. "pandera.core.pandas.container.DataFrameSchema" or the URL of the documentation, e.g. "https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#column-validation"]

**Note**: You can check the latest versions of the docs on `master` [here](https://pandera.readthedocs.io/en/latest/).

20 changes: 12 additions & 8 deletions .github/workflows/ci-tests.yml
@@ -97,7 +97,7 @@ jobs:
PYTEST_FLAGS: --cov=pandera --cov-report=term-missing --cov-report=xml --cov-append
HYPOTHESIS_FLAGS: -n=auto -q --hypothesis-profile=ci
strategy:
-fail-fast: false
+fail-fast: true
matrix:
os: ["ubuntu-latest", "macos-latest", "windows-latest"]
python-version: ["3.7", "3.8", "3.9", "3.10"]
@@ -121,7 +121,7 @@
uses: actions/cache@v2
env:
# Increase this value to reset cache if etc/environment.yml has not changed
-CACHE_NUMBER: 0
+CACHE_NUMBER: 1
with:
path: ~/conda_pkgs_dir
key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{ hashFiles('environment.yml') }}
@@ -139,15 +139,20 @@
with:
auto-update-conda: true
python-version: ${{ matrix.python-version }}
-mamba-version: "*"
+# mamba-version: "*"
+miniforge-version: latest
+miniforge-variant: Mambaforge
+use-mamba: true
activate-environment: pandera-dev
channels: conda-forge
-channel-priority: flexible
+channel-priority: true
use-only-tar-bz2: true

- name: Install Conda Deps [Latest]
if: ${{ matrix.pandas-version == 'latest' }}
-run: mamba install -c conda-forge pandas geopandas
+run: |
+  mamba install -c conda-forge asv pandas geopandas bokeh
+  mamba env update -n pandera-dev -f environment.yml
- name: Install Conda Deps
if: ${{ matrix.pandas-version != 'latest' }}
@@ -160,9 +165,8 @@

- name: Install Pip Deps
run: |
-python -m pip install -U pip
-python -m pip install -r requirements-dev.txt
-python -m pip install bokeh
+mamba install -c conda-forge asv pandas==${{ matrix.pandas-version }} geopandas bokeh
+mamba env update -n pandera-dev -f environment.yml
- run: |
conda info
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
.vscode
dask-worker-space
spark-warehouse
+docs/source/_contents

# Byte-compiled / optimized / DLL files
__pycache__/
6 changes: 5 additions & 1 deletion .pylintrc
@@ -29,4 +29,8 @@ disable=
too-many-ancestors,
too-many-lines,
too-few-public-methods,
-line-too-long
+line-too-long,
+ungrouped-imports,
+function-redefined,
+arguments-differ,
+no-self-use
5 changes: 2 additions & 3 deletions Makefile
@@ -21,9 +21,8 @@ requirements:
pip install -r requirements-dev.txt

docs:
-rm -rf docs/**/generated docs/**/methods docs/_build && \
-python -m sphinx -E "docs/source" "docs/_build" -W && \
-make -C docs doctest
+rm -rf docs/**/generated docs/**/methods docs/_build docs/source/_contents
+python -m sphinx -E "docs/source" "docs/_build" && make -C docs doctest

quick-docs:
python -m sphinx -E "docs/source" "docs/_build" -W && \
12 changes: 6 additions & 6 deletions README.md
@@ -44,8 +44,8 @@ This is useful in production-critical or reproducible research settings. With
[hypothesis testing](https://pandera.readthedocs.io/en/stable/hypothesis.html#hypothesis).
1. Seamlessly integrate with existing data analysis/processing pipelines
via [function decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators).
-1. Define schema models with the
-[class-based API](https://pandera.readthedocs.io/en/stable/schema_models.html#schema-models)
+1. Define dataframe models with the
+[class-based API](https://pandera.readthedocs.io/en/stable/dataframe_models.html#dataframe-models)
with pydantic-style syntax and validate dataframes using the typing syntax.
1. [Synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#data-synthesis-strategies)
from schema objects for property-based testing with pandas data structures.
@@ -155,18 +155,18 @@ print(validated_df)
# 4 9 -20.4 value_1
```

-## Schema Model
+## DataFrame Model

`pandera` also provides an alternative API for expressing schemas inspired
by [dataclasses](https://docs.python.org/3/library/dataclasses.html) and
-[pydantic](https://pydantic-docs.helpmanual.io/). The equivalent `SchemaModel`
+[pydantic](https://pydantic-docs.helpmanual.io/). The equivalent `DataFrameModel`
for the above `DataFrameSchema` would be:


```python
from pandera.typing import Series

-class Schema(pa.SchemaModel):
+class Schema(pa.DataFrameModel):

column1: Series[int] = pa.Field(le=10)
column2: Series[float] = pa.Field(lt=-1.2)
@@ -223,7 +223,7 @@ page or reach out to the maintainers and pandera community on
[column nullability](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#null-values-in-columns),
and [uniqueness](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#validating-the-joint-uniqueness-of-columns)
are first-class concepts.
-- Define [schema models](https://pandera.readthedocs.io/en/stable/schema_models.html) with the class-based API with
+- Define [dataframe models](https://pandera.readthedocs.io/en/stable/schema_models.html) with the class-based API with
[pydantic](https://pydantic-docs.helpmanual.io/)-style syntax and validate dataframes using the typing syntax.
- `check_input` and `check_output` [decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators-for-pipeline-integration)
enable seamless integration with existing code.
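For context, a short usage sketch of the renamed class-based API (data values are illustrative; `DataFrameModel.validate` is the documented entry point):

```python
import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)


df = pd.DataFrame({
    "column1": [5, 1, 3],
    "column2": [-2.0, -1.9, -20.4],
})

# returns the validated dataframe, or raises pa.errors.SchemaError
validated_df = Schema.validate(df)
print(validated_df)
```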
5 changes: 5 additions & 0 deletions docs/source/_static/default.css
@@ -91,3 +91,8 @@ div.sponsorship {
article .align-center, article .align-default {
text-align: left;
}

+/* make font size of API reference smaller */
+section[id^=pandera-] h1 {
+  font-size: 1.75em;
+}
20 changes: 10 additions & 10 deletions docs/source/checks.rst
@@ -10,7 +10,7 @@ Checks
Checking column properties
--------------------------

-:class:`~pandera.checks.Check` objects accept a function as a required argument, which is
+:class:`~pandera.core.checks.Check` objects accept a function as a required argument, which is
expected to take a ``pa.Series`` input and output a ``boolean`` or a ``Series``
of boolean values. For the check to pass, all of the elements in the boolean
series must evaluate to ``True``, for example:
@@ -53,15 +53,15 @@ For common validation tasks, built-in checks are available in ``pandera``.
"phone_number": Column(str, Check.str_matches(r'^[a-z0-9-]+$')),
})

-See the :class:`~pandera.checks.Check` API reference for a complete list of built-in checks.
+See the :class:`~pandera.core.checks.Check` API reference for a complete list of built-in checks.


.. _elementwise checks:

Vectorized vs. Element-wise Checks
------------------------------------

-By default, :class:`~pandera.checks.Check` objects operate on ``pd.Series``
+By default, :class:`~pandera.core.checks.Check` objects operate on ``pd.Series``
objects. If you want to make atomic checks for each element in the Column, then
you can provide the ``element_wise=True`` keyword argument:

@@ -106,29 +106,29 @@ with any null value are dropped.
If you want to check the properties of a pandas data structure while preserving
null values, specify ``Check(..., ignore_na=False)`` when defining a check.

-Note that this is different from the ``nullable`` argument in :class:`~pandera.schema_components.Column`
+Note that this is different from the ``nullable`` argument in :class:`~pandera.core.pandas.components.Column`
objects, which simply checks for null values in a column.

.. _column_check_groups:

Column Check Groups
-------------------

-:class:`~pandera.schema_components.Column` checks support grouping by a different column so that you
+:class:`~pandera.core.pandas.components.Column` checks support grouping by a different column so that you
can make assertions about subsets of the column of interest. This
-changes the function signature of the :class:`~pandera.checks.Check` function so that its
+changes the function signature of the :class:`~pandera.core.checks.Check` function so that its
input is a dict where keys are the group names and values are subsets of the
series being validated.

Specifying ``groupby`` as a column name, list of column names, or
-callable changes the expected signature of the :class:`~pandera.checks.Check`
+callable changes the expected signature of the :class:`~pandera.core.checks.Check`
function argument to:

``Callable[Dict[Any, pd.Series] -> Union[bool, pd.Series]``

where the dict keys are the discrete keys in the ``groupby`` columns.

-In the example below we define a :class:`~pandera.schemas.DataFrameSchema` with column checks
+In the example below we define a :class:`~pandera.core.pandas.container.DataFrameSchema` with column checks
for ``height_in_feet`` using a single column, multiple columns, and a more
complex groupby function that creates a new column ``age_less_than_15`` on the
fly.
@@ -309,12 +309,12 @@ want the resulting table for further analysis.
:skipif: SKIP_PANDAS_LT_V1

<Schema Column(name=var2, type=None)> failed series or dataframe validator 0:
-<Check _hypothesis_check: normality test>
+<Check normaltest: normality test>


Registering Custom Checks
-------------------------

``pandera`` now offers an interface to register custom checks functions so
-that they're available in the :class:`~pandera.checks.Check` namespace. See
+that they're available in the :class:`~pandera.core.checks.Check` namespace. See
:ref:`the extensions<extensions>` document for more information.
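To make the relocated references concrete, here is a sketch of a grouped check along the lines this page describes — with ``groupby``, the check function receives a dict mapping each group key to its sub-series (column names and data below are illustrative):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "height_in_feet": pa.Column(
        float,
        # with groupby, the check receives Dict[Any, pd.Series]:
        # group key -> subset of height_in_feet for that group
        pa.Check(
            lambda grouped: grouped["M"].mean() > grouped["F"].mean(),
            groupby="sex",
        ),
    ),
    "sex": pa.Column(str, pa.Check.isin(["M", "F"])),
})

df = pd.DataFrame({
    "height_in_feet": [6.1, 5.2, 5.9, 5.4],
    "sex": ["M", "F", "M", "F"],
})
schema.validate(df)
```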
4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -106,7 +106,7 @@
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns = []
+exclude_patterns = [".ipynb_checkpoints/*", "notebooks/try_pandera.ipynb"]

autoclass_content = "both"

@@ -200,7 +200,7 @@ def filter(self, record: pylogging.LogRecord) -> bool:
"Cannot resolve forward reference in type annotations of "
'"pandera.typing.DataFrame"',
"Cannot resolve forward reference in type annotations of "
'"pandera.schemas.DataFrameSchema',
'"pandera.core.pandas.container.DataFrameSchema',
"Cannot resolve forward reference in type annotations of "
'"pandera.typing.DataFrame.style"',
)
6 changes: 3 additions & 3 deletions docs/source/dask.rst
@@ -19,8 +19,8 @@ and :py:func:`~dask.dataframe.Series` objects directly. First, install
Then you can use pandera schemas to validate dask dataframes. In the example
-below we'll use the :ref:`class-based API <schema_models>` to define a
-:py:class:`SchemaModel` for validation.
+below we'll use the :ref:`class-based API <dataframe_models>` to define a
+:py:class:`~pandera.core.pandas.model.DataFrameModel` for validation.

.. testcode:: scaling_dask

@@ -31,7 +31,7 @@ below we'll use the :ref:`class-based API <schema_models>` to define a
from pandera.typing.dask import DataFrame, Series


-class Schema(pa.SchemaModel):
+class Schema(pa.DataFrameModel):
state: Series[str]
city: Series[str]
price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})
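A usage sketch continuing the ``Schema`` model above (data values are illustrative); with dask, validation is appended to the task graph and only runs on ``compute()``:

```python
import dask.dataframe as dd
import pandas as pd
import pandera as pa
from pandera.typing.dask import Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


ddf = dd.from_pandas(
    pd.DataFrame({
        "state": ["FL", "CA"],
        "city": ["Orlando", "San Francisco"],
        "price": [8, 12],
    }),
    npartitions=2,
)

# validation runs lazily when the dask graph is computed
print(Schema.validate(ddf).compute())
```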
2 changes: 1 addition & 1 deletion docs/source/data_format_conversion.rst
@@ -23,7 +23,7 @@ Consider this simple example:
import pandera as pa
from pandera.typing import DataFrame, Series

-class InSchema(pa.SchemaModel):
+class InSchema(pa.DataFrameModel):
str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
int_col: Series[int]
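Continuing the ``InSchema`` example, a sketch of how such a model is typically applied at function boundaries via ``@pa.check_types`` (data values are illustrative):

```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class InSchema(pa.DataFrameModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]


@pa.check_types
def transform(df: DataFrame[InSchema]) -> DataFrame[InSchema]:
    # input and output are validated against InSchema
    return df


transform(pd.DataFrame({"str_col": ["a", "b"], "int_col": [1, 2]}))
```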
16 changes: 8 additions & 8 deletions docs/source/data_synthesis_strategies.rst
@@ -45,7 +45,7 @@ Once you've defined a schema, it's easy to generate examples:


Note that here we've constrained the specific values in each column using
-:class:`~pandera.checks.Check` s in order to make the data generation process
+:class:`~pandera.core.checks.Check` s in order to make the data generation process
deterministic for documentation purposes.

Usage in Unit Tests
@@ -99,18 +99,18 @@ Now the ``test_processing_fn`` simply becomes an execution test, raising a
:class:`~pandera.errors.SchemaError` if ``processing_fn`` doesn't add
``column4`` to the dataframe.

-Strategies and Examples from Schema Models
-------------------------------------------
+Strategies and Examples from DataFrame Models
+---------------------------------------------

-You can also use the :ref:`class-based API<schema_models>` to generate examples.
-Here's the equivalent schema model for the above examples:
+You can also use the :ref:`class-based API<dataframe_models>` to generate examples.
+Here's the equivalent dataframe model for the above examples:

.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY

from pandera.typing import Series, DataFrame

-class InSchema(pa.SchemaModel):
+class InSchema(pa.DataFrameModel):
column1: Series[int] = pa.Field(eq=10)
column2: Series[float] = pa.Field(eq=0.25)
column3: Series[str] = pa.Field(eq="foo")
@@ -130,7 +130,7 @@ Here's the equivalent schema model for the above examples:
Checks as Constraints
---------------------

-As you may have noticed in the first example, :class:`~pandera.checks.Check` s
+As you may have noticed in the first example, :class:`~pandera.core.checks.Check` s
further constrain the data synthesized from a strategy. Without checks, the
``example`` method would simply generate any value of the specified type. You
can specify multiple checks on a column and ``pandera`` should be able to
@@ -222,7 +222,7 @@ register custom checks and define strategies that correspond to them.
Defining Custom Strategies
--------------------------

-All built-in :class:`~pandera.checks.Check` s are associated with a data
+All built-in :class:`~pandera.core.checks.Check` s are associated with a data
synthesis strategy. You can define your own data synthesis strategies by using
the :ref:`extensions API<extensions>` to register a custom check function with
a corresponding strategy.
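As a sketch of the strategies API being discussed (assuming ``hypothesis`` is installed; ``example`` is the documented entry point for synthesizing data from a schema or model):

```python
import pandera as pa
from pandera.typing import Series


class InSchema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(eq=10)
    column2: Series[float] = pa.Field(eq=0.25)
    column3: Series[str] = pa.Field(eq="foo")


# synthesize a dataframe that satisfies the schema's checks
df = InSchema.example(size=3)
print(df)
```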