core and backend pandera API internals rewrite (#913)
* [wip] execution backends

* implement container, field, component backends

* wip

* implement ArraySchema for pandas

* move coerce logic to backend

* implement index and multiindex

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* move error_formatters, check cleanup

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* [wip]

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* implement checks

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* built-in checks

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* add todo

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* handle core and backend mypy issues

* fix test_schemas, test_checks unit tests

* [wip] fix unit tests for core

* fix io tests

* fix strategies tests

* fix mypy tests

* fix dask tests

* fix modin tests

* fix pyspark tests

* fix io and checks

* fix pylint

* make mypy happy

* update tests

* update requirements file

* pylint ignore get_type_hints

* fix requirements, bump cache

* debug ci conda issue

* debug ci

* debug ci

* debug ci

* use mamba

* debug ci: use mamba and environment.yml

* revert sphinx-autodoc-typehints pin

* fix ci

* fix ci

* favor typing_extensions

* use new error_formatters

* debugging tests

* fix doc test

* pin frictionless, fix modin tests and docs

* fix setup

* test full matrix

* rewrite Hypothesis internals

* clean up hypothesis modules and functions

* fix hypothesis docstring

* delete old modules for io, schema inference and statistics

* delete schemas.py

* delete model.py and model_components.py

* delete *_accessor.py

* delete extensions.py module

* delete checks and hypotheses modules

* delete schema_components

* delete pandera/typing/config.py

* clean up import statements

* clean up imports

* clean up docs

* fix mypy tests

* clean up docs

* update try pandera notebook

* fix mypy

* fix typo

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
cosmicBboy committed Jan 24, 2023
1 parent e2bc5b9 commit 061f989
Showing 139 changed files with 7,177 additions and 5,728 deletions.
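The headline change: schema *specification* now lives under `pandera.core` (e.g. `pandera.core.pandas.container.DataFrameSchema`), while validation logic moves into per-framework execution backends, as the commit messages above describe ("implement container, field, component backends", "move coerce logic to backend"). Below is a minimal sketch of the registry pattern this implies — all names here (`BaseSchema`, `BaseBackend`, `register_backend`) are hypothetical illustrations, not pandera's actual internals:

```python
from typing import Any, Dict, Type

import pandas as pd


class BaseBackend:
    """Holds the validation logic for one dataframe framework."""

    def validate(self, check_obj: Any, schema: "BaseSchema") -> Any:
        raise NotImplementedError


class BaseSchema:
    """Holds the schema specification; delegates validation to a backend."""

    _backends: Dict[Type, Type[BaseBackend]] = {}

    @classmethod
    def register_backend(cls, obj_type: Type, backend: Type[BaseBackend]) -> None:
        cls._backends[obj_type] = backend

    def validate(self, check_obj: Any) -> Any:
        # dispatch on the type of the object being validated
        backend = self._backends[type(check_obj)]()
        return backend.validate(check_obj, schema=self)


class PandasContainerBackend(BaseBackend):
    """Pandas-specific validation; e.g. coercion logic lives here."""

    def validate(self, check_obj: pd.DataFrame, schema: BaseSchema) -> pd.DataFrame:
        # a real implementation would coerce dtypes and run schema checks
        return check_obj


BaseSchema.register_backend(pd.DataFrame, PandasContainerBackend)
```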
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/documentation-improvement.md
@@ -9,7 +9,7 @@ assignees: ''

#### Location of the documentation

-[this should provide the location of the documentation, e.g. "pandera.schemas.DataFrameSchema" or the URL of the documentation, e.g. "https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#column-validation"]
+[this should provide the location of the documentation, e.g. "pandera.core.pandas.container.DataFrameSchema" or the URL of the documentation, e.g. "https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#column-validation"]

**Note**: You can check the latest versions of the docs on `master` [here](https://pandera.readthedocs.io/en/latest/).

20 changes: 12 additions & 8 deletions .github/workflows/ci-tests.yml
@@ -97,7 +97,7 @@ jobs:
PYTEST_FLAGS: --cov=pandera --cov-report=term-missing --cov-report=xml --cov-append
HYPOTHESIS_FLAGS: -n=auto -q --hypothesis-profile=ci
strategy:
-fail-fast: false
+fail-fast: true
matrix:
os: ["ubuntu-latest", "macos-latest", "windows-latest"]
python-version: ["3.7", "3.8", "3.9", "3.10"]
@@ -121,7 +121,7 @@
uses: actions/cache@v2
env:
# Increase this value to reset cache if etc/environment.yml has not changed
-CACHE_NUMBER: 0
+CACHE_NUMBER: 1
with:
path: ~/conda_pkgs_dir
key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{ hashFiles('environment.yml') }}
@@ -139,15 +139,20 @@
with:
auto-update-conda: true
python-version: ${{ matrix.python-version }}
-mamba-version: "*"
+# mamba-version: "*"
+miniforge-version: latest
+miniforge-variant: Mambaforge
+use-mamba: true
activate-environment: pandera-dev
channels: conda-forge
-channel-priority: flexible
+channel-priority: true
use-only-tar-bz2: true

- name: Install Conda Deps [Latest]
if: ${{ matrix.pandas-version == 'latest' }}
-run: mamba install -c conda-forge pandas geopandas
+run: |
+  mamba install -c conda-forge asv pandas geopandas bokeh
+  mamba env update -n pandera-dev -f environment.yml
- name: Install Conda Deps
if: ${{ matrix.pandas-version != 'latest' }}
@@ -160,9 +165,8 @@

- name: Install Pip Deps
run: |
-python -m pip install -U pip
-python -m pip install -r requirements-dev.txt
-python -m pip install bokeh
+mamba install -c conda-forge asv pandas==${{ matrix.pandas-version }} geopandas bokeh
+mamba env update -n pandera-dev -f environment.yml
- run: |
conda info
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
.vscode
dask-worker-space
spark-warehouse
+docs/source/_contents

# Byte-compiled / optimized / DLL files
__pycache__/
6 changes: 5 additions & 1 deletion .pylintrc
@@ -29,4 +29,8 @@ disable=
too-many-ancestors,
too-many-lines,
too-few-public-methods,
-line-too-long
+line-too-long,
+ungrouped-imports,
+function-redefined,
+arguments-differ,
+no-self-use
5 changes: 2 additions & 3 deletions Makefile
@@ -21,9 +21,8 @@ requirements:
pip install -r requirements-dev.txt

docs:
-rm -rf docs/**/generated docs/**/methods docs/_build && \
-python -m sphinx -E "docs/source" "docs/_build" -W && \
-make -C docs doctest
+rm -rf docs/**/generated docs/**/methods docs/_build docs/source/_contents
+python -m sphinx -E "docs/source" "docs/_build" && make -C docs doctest

quick-docs:
python -m sphinx -E "docs/source" "docs/_build" -W && \
12 changes: 6 additions & 6 deletions README.md
@@ -44,8 +44,8 @@ This is useful in production-critical or reproducible research settings. With
[hypothesis testing](https://pandera.readthedocs.io/en/stable/hypothesis.html#hypothesis).
1. Seamlessly integrate with existing data analysis/processing pipelines
via [function decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators).
-1. Define schema models with the
-[class-based API](https://pandera.readthedocs.io/en/stable/schema_models.html#schema-models)
+1. Define dataframe models with the
+[class-based API](https://pandera.readthedocs.io/en/stable/dataframe_models.html#dataframe-models)
with pydantic-style syntax and validate dataframes using the typing syntax.
1. [Synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#data-synthesis-strategies)
from schema objects for property-based testing with pandas data structures.
@@ -155,18 +155,18 @@ print(validated_df)
# 4 9 -20.4 value_1
```

-## Schema Model
+## DataFrame Model

`pandera` also provides an alternative API for expressing schemas inspired
by [dataclasses](https://docs.python.org/3/library/dataclasses.html) and
-[pydantic](https://pydantic-docs.helpmanual.io/). The equivalent `SchemaModel`
+[pydantic](https://pydantic-docs.helpmanual.io/). The equivalent `DataFrameModel`
for the above `DataFrameSchema` would be:


```python
from pandera.typing import Series

-class Schema(pa.SchemaModel):
+class Schema(pa.DataFrameModel):

column1: Series[int] = pa.Field(le=10)
column2: Series[float] = pa.Field(lt=-1.2)
@@ -223,7 +223,7 @@ page or reach out to the maintainers and pandera community on
[column nullability](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#null-values-in-columns),
and [uniqueness](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#validating-the-joint-uniqueness-of-columns)
are first-class concepts.
-- Define [schema models](https://pandera.readthedocs.io/en/stable/schema_models.html) with the class-based API with
+- Define [dataframe models](https://pandera.readthedocs.io/en/stable/schema_models.html) with the class-based API with
[pydantic](https://pydantic-docs.helpmanual.io/)-style syntax and validate dataframes using the typing syntax.
- `check_input` and `check_output` [decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators-for-pipeline-integration)
enable seamless integration with existing code.
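For context, a short usage sketch of the renamed class-based API (data values are illustrative; `DataFrameModel.validate` is the documented entry point):

```python
import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)


df = pd.DataFrame({
    "column1": [5, 1, 3],
    "column2": [-2.0, -1.9, -20.4],
})

# returns the validated dataframe, or raises pa.errors.SchemaError
validated_df = Schema.validate(df)
print(validated_df)
```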
5 changes: 5 additions & 0 deletions docs/source/_static/default.css
@@ -91,3 +91,8 @@ div.sponsorship {
article .align-center, article .align-default {
text-align: left;
}

+/* make font size of API reference smaller */
+section[id^=pandera-] h1 {
+  font-size: 1.75em;
+}
20 changes: 10 additions & 10 deletions docs/source/checks.rst
@@ -10,7 +10,7 @@ Checks
Checking column properties
--------------------------

-:class:`~pandera.checks.Check` objects accept a function as a required argument, which is
+:class:`~pandera.core.checks.Check` objects accept a function as a required argument, which is
expected to take a ``pa.Series`` input and output a ``boolean`` or a ``Series``
of boolean values. For the check to pass, all of the elements in the boolean
series must evaluate to ``True``, for example:
@@ -53,15 +53,15 @@ For common validation tasks, built-in checks are available in ``pandera``.
"phone_number": Column(str, Check.str_matches(r'^[a-z0-9-]+$')),
})

-See the :class:`~pandera.checks.Check` API reference for a complete list of built-in checks.
+See the :class:`~pandera.core.checks.Check` API reference for a complete list of built-in checks.


.. _elementwise checks:

Vectorized vs. Element-wise Checks
------------------------------------

-By default, :class:`~pandera.checks.Check` objects operate on ``pd.Series``
+By default, :class:`~pandera.core.checks.Check` objects operate on ``pd.Series``
objects. If you want to make atomic checks for each element in the Column, then
you can provide the ``element_wise=True`` keyword argument:

@@ -106,29 +106,29 @@ with any null value are dropped.
If you want to check the properties of a pandas data structure while preserving
null values, specify ``Check(..., ignore_na=False)`` when defining a check.

-Note that this is different from the ``nullable`` argument in :class:`~pandera.schema_components.Column`
+Note that this is different from the ``nullable`` argument in :class:`~pandera.core.pandas.components.Column`
objects, which simply checks for null values in a column.

.. _column_check_groups:

Column Check Groups
-------------------

-:class:`~pandera.schema_components.Column` checks support grouping by a different column so that you
+:class:`~pandera.core.pandas.components.Column` checks support grouping by a different column so that you
can make assertions about subsets of the column of interest. This
-changes the function signature of the :class:`~pandera.checks.Check` function so that its
+changes the function signature of the :class:`~pandera.core.checks.Check` function so that its
input is a dict where keys are the group names and values are subsets of the
series being validated.

Specifying ``groupby`` as a column name, list of column names, or
-callable changes the expected signature of the :class:`~pandera.checks.Check`
+callable changes the expected signature of the :class:`~pandera.core.checks.Check`
function argument to:

``Callable[Dict[Any, pd.Series] -> Union[bool, pd.Series]``

where the dict keys are the discrete keys in the ``groupby`` columns.

-In the example below we define a :class:`~pandera.schemas.DataFrameSchema` with column checks
+In the example below we define a :class:`~pandera.core.pandas.container.DataFrameSchema` with column checks
for ``height_in_feet`` using a single column, multiple columns, and a more
complex groupby function that creates a new column ``age_less_than_15`` on the
fly.
@@ -309,12 +309,12 @@ want the resulting table for further analysis.
:skipif: SKIP_PANDAS_LT_V1

<Schema Column(name=var2, type=None)> failed series or dataframe validator 0:
-<Check _hypothesis_check: normality test>
+<Check normaltest: normality test>


Registering Custom Checks
-------------------------

``pandera`` now offers an interface to register custom checks functions so
-that they're available in the :class:`~pandera.checks.Check` namespace. See
+that they're available in the :class:`~pandera.core.checks.Check` namespace. See
:ref:`the extensions<extensions>` document for more information.
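To make the relocated references concrete, here is a sketch of a grouped check along the lines this page describes — with ``groupby``, the check function receives a dict mapping each group key to its sub-series (column names and data below are illustrative):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "height_in_feet": pa.Column(
        float,
        # with groupby, the check receives Dict[Any, pd.Series]:
        # group key -> subset of height_in_feet for that group
        pa.Check(
            lambda grouped: grouped["M"].mean() > grouped["F"].mean(),
            groupby="sex",
        ),
    ),
    "sex": pa.Column(str, pa.Check.isin(["M", "F"])),
})

df = pd.DataFrame({
    "height_in_feet": [6.1, 5.2, 5.9, 5.4],
    "sex": ["M", "F", "M", "F"],
})
schema.validate(df)
```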
4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -106,7 +106,7 @@
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns = []
+exclude_patterns = [".ipynb_checkpoints/*", "notebooks/try_pandera.ipynb"]

autoclass_content = "both"

@@ -200,7 +200,7 @@ def filter(self, record: pylogging.LogRecord) -> bool:
"Cannot resolve forward reference in type annotations of "
'"pandera.typing.DataFrame"',
"Cannot resolve forward reference in type annotations of "
'"pandera.schemas.DataFrameSchema',
'"pandera.core.pandas.container.DataFrameSchema',
"Cannot resolve forward reference in type annotations of "
'"pandera.typing.DataFrame.style"',
)
6 changes: 3 additions & 3 deletions docs/source/dask.rst
@@ -19,8 +19,8 @@ and :py:func:`~dask.dataframe.Series` objects directly. First, install
Then you can use pandera schemas to validate dask dataframes. In the example
-below we'll use the :ref:`class-based API <schema_models>` to define a
-:py:class:`SchemaModel` for validation.
+below we'll use the :ref:`class-based API <dataframe_models>` to define a
+:py:class:`~pandera.core.pandas.model.DataFrameModel` for validation.

.. testcode:: scaling_dask

@@ -31,7 +31,7 @@ below we'll use the :ref:`class-based API <schema_models>` to define a
from pandera.typing.dask import DataFrame, Series


-class Schema(pa.SchemaModel):
+class Schema(pa.DataFrameModel):
state: Series[str]
city: Series[str]
price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})
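A usage sketch continuing the ``Schema`` model above (data values are illustrative); with dask, validation is appended to the task graph and only runs on ``compute()``:

```python
import dask.dataframe as dd
import pandas as pd
import pandera as pa
from pandera.typing.dask import Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


ddf = dd.from_pandas(
    pd.DataFrame({
        "state": ["FL", "CA"],
        "city": ["Orlando", "San Francisco"],
        "price": [8, 12],
    }),
    npartitions=2,
)

# validation runs lazily when the dask graph is computed
print(Schema.validate(ddf).compute())
```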
2 changes: 1 addition & 1 deletion docs/source/data_format_conversion.rst
@@ -23,7 +23,7 @@ Consider this simple example:
import pandera as pa
from pandera.typing import DataFrame, Series

-class InSchema(pa.SchemaModel):
+class InSchema(pa.DataFrameModel):
str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
int_col: Series[int]
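Continuing the ``InSchema`` example, a sketch of how such a model is typically applied at function boundaries via ``@pa.check_types`` (data values are illustrative):

```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class InSchema(pa.DataFrameModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]


@pa.check_types
def transform(df: DataFrame[InSchema]) -> DataFrame[InSchema]:
    # input and output are validated against InSchema
    return df


transform(pd.DataFrame({"str_col": ["a", "b"], "int_col": [1, 2]}))
```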
16 changes: 8 additions & 8 deletions docs/source/data_synthesis_strategies.rst
@@ -45,7 +45,7 @@ Once you've defined a schema, it's easy to generate examples:


Note that here we've constrained the specific values in each column using
-:class:`~pandera.checks.Check` s in order to make the data generation process
+:class:`~pandera.core.checks.Check` s in order to make the data generation process
deterministic for documentation purposes.

Usage in Unit Tests
@@ -99,18 +99,18 @@ Now the ``test_processing_fn`` simply becomes an execution test, raising a
:class:`~pandera.errors.SchemaError` if ``processing_fn`` doesn't add
``column4`` to the dataframe.

-Strategies and Examples from Schema Models
-------------------------------------------
+Strategies and Examples from DataFrame Models
+---------------------------------------------

-You can also use the :ref:`class-based API<schema_models>` to generate examples.
-Here's the equivalent schema model for the above examples:
+You can also use the :ref:`class-based API<dataframe_models>` to generate examples.
+Here's the equivalent dataframe model for the above examples:

.. testcode:: data_synthesis_strategies
:skipif: SKIP_STRATEGY

from pandera.typing import Series, DataFrame

-class InSchema(pa.SchemaModel):
+class InSchema(pa.DataFrameModel):
column1: Series[int] = pa.Field(eq=10)
column2: Series[float] = pa.Field(eq=0.25)
column3: Series[str] = pa.Field(eq="foo")
@@ -130,7 +130,7 @@ Here's the equivalent schema model for the above examples:
Checks as Constraints
---------------------

-As you may have noticed in the first example, :class:`~pandera.checks.Check` s
+As you may have noticed in the first example, :class:`~pandera.core.checks.Check` s
further constrain the data synthesized from a strategy. Without checks, the
``example`` method would simply generate any value of the specified type. You
can specify multiple checks on a column and ``pandera`` should be able to
@@ -222,7 +222,7 @@ register custom checks and define strategies that correspond to them.
Defining Custom Strategies
--------------------------

-All built-in :class:`~pandera.checks.Check` s are associated with a data
+All built-in :class:`~pandera.core.checks.Check` s are associated with a data
synthesis strategy. You can define your own data synthesis strategies by using
the :ref:`extensions API<extensions>` to register a custom check function with
a corresponding strategy.
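As a sketch of the strategies API being discussed (assuming ``hypothesis`` is installed; ``example`` is the documented entry point for synthesizing data from a schema or model):

```python
import pandera as pa
from pandera.typing import Series


class InSchema(pa.DataFrameModel):
    column1: Series[int] = pa.Field(eq=10)
    column2: Series[float] = pa.Field(eq=0.25)
    column3: Series[str] = pa.Field(eq="foo")


# synthesize a dataframe that satisfies the schema's checks
df = InSchema.example(size=3)
print(df)
```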