add data types docs, fix dtype bug at DF level (#1178)
* add data types docs, fix dtype bug at DF level
* fix lint
* debug
* debug
* handle python<3.9 for TypedDict
* fix lint
* pin typeguard
* debug typeguard
* debug typeddict

---------

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
cosmicBboy committed May 7, 2023
1 parent 547aff1 commit a057df0
Showing 15 changed files with 418 additions and 12 deletions.
11 changes: 11 additions & 0 deletions docs/source/checks.rst
@@ -7,6 +7,17 @@
Checks
======

Checks are one of the fundamental constructs of pandera. They allow you to
specify properties of dataframes, columns, indexes, and series objects, which
are validated after data type validation/coercion and the core pandera checks
have run on the data.

.. important::

    You can learn more about how data type validation works in
    :ref:`dtype_validation`.


Checking column properties
--------------------------

5 changes: 5 additions & 0 deletions docs/source/dataframe_models.rst
@@ -355,6 +355,11 @@ Any dtypes supported by ``pandera`` can be used as type parameters for
:class:`~pandera.typing.Series` and :class:`~pandera.typing.Index`. There are,
however, a couple of gotchas.

.. important::

    You can learn more about how data type validation works in
    :ref:`dtype_validation`.

Dtype aliases
^^^^^^^^^^^^^

9 changes: 9 additions & 0 deletions docs/source/dataframe_schemas.rst
@@ -67,6 +67,12 @@ Similarly to pandas, the data type can be specified as:
* a pandera :class:`~pandera.dtypes.DataType`: it can also be an instance or a
class.

.. important::

    You can learn more about how data type validation works in
    :ref:`dtype_validation`.


:ref:`Column checks<checks>` allow for the DataFrame's values to be
checked against a user-provided function. ``Check`` objects also support
:ref:`grouping<grouping>` by a different column so that the user can make
@@ -124,6 +130,9 @@ nullable. In order to accept null values, you need to explicitly specify
1 1.0
2 NaN

To learn more about how the nullable check interacts with data type checks,
see :ref:`here <how_nullable_works>`.

.. _coerced:

Coercing Types on Columns
244 changes: 244 additions & 0 deletions docs/source/dtype_validation.rst
@@ -0,0 +1,244 @@
.. currentmodule:: pandera

.. _dtype_validation:

Data Type Validation
====================

The core utility of ``pandera`` is that it allows you to validate the types of
incoming raw data so that your data pipeline can fail early and avoid
propagating data corruption downstream to critical applications. These
applications may include analytics, statistical, and machine learning use
cases that rely on clean data.


How can I specify data types?
-----------------------------

With pandera schemas, there are multiple ways of specifying the data types of
columns, indexes, or even whole dataframes.

.. testcode:: dtype_validation

    import pandera as pa
    import pandas as pd

    # schema with datatypes at the column and index level
    schema_field_dtypes = pa.DataFrameSchema(
        {
            "column1": pa.Column(int),
            "column2": pa.Column(float),
            "column3": pa.Column(str),
        },
        index=pa.Index(int),
    )

    # schema with datatypes at the dataframe level, if all columns are the
    # same data type
    schema_df_dtypes = pa.DataFrameSchema(dtype=int)


The equivalent :py:class:`~pandera.api.pandas.model.DataFrameModel` would be:

.. testcode:: dtype_validation

    from pandera.typing import Series, Index

    class ModelFieldDtypes(pa.DataFrameModel):
        column1: Series[int]
        column2: Series[float]
        column3: Series[str]
        index: Index[int]

    class ModelDFDtypes(pa.DataFrameModel):
        class Config:
            dtype = int


Supported pandas datatypes
--------------------------

By default, pandera supports the validation of pandas dataframes, so pandera
schemas support any of the `data types <https://pandas.pydata.org/docs/user_guide/basics.html#dtypes>`__
that pandas supports:

- Built-in python types, e.g. ``int``, ``float``, ``str``, ``bool``, etc.
- `Numpy data types <https://numpy.org/doc/stable/user/basics.types.html>`__, e.g. ``numpy.int_``, ``numpy.bool_``, etc.
- Pandas-native data types, e.g. ``pd.StringDtype``, ``pd.BooleanDtype``, ``pd.DatetimeTZDtype``, etc.
- Any of the `string aliases <https://pandas.pydata.org/docs/user_guide/basics.html#dtypes>`__ supported by pandas.

We recommend using the built-in python datatypes for the common data types, but
it's really up to you how you want to express these types. Additionally, you
can also use the :ref:`pandera-defined datatypes <api-dtypes>` if you want.

For example, the following schema expresses the equivalent integer types in
six different ways:

.. testcode:: dtype_validation

    import numpy as np

    integer_schema = pa.DataFrameSchema(
        {
            "builtin_python": pa.Column(int),
            "string_alias": pa.Column("int"),
            "string_alias_64": pa.Column("int64"),
            "numpy_dtype": pa.Column(np.int64),
            "pandera_dtype": pa.Column(pa.Int),
            "pandera_dtype_64": pa.Column(pa.Int64),
        },
    )

.. note:: The default ``int`` type for Windows is 32-bit integers ``int32``.


Parameterized data types
------------------------

One thing to be aware of is the difference between declaring pure Python types
(i.e. classes) as the data type of a column vs. parameterized types, which, in
the case of pandas, are actually instances of special classes defined by pandas.
For example, using the object-based API, we can easily define a column as a
timezone-aware datatype:

.. testcode:: dtype_validation

    datetimeschema = pa.DataFrameSchema({
        "dt": pa.Column(pd.DatetimeTZDtype(unit="ns", tz="UTC"))
    })

However, since python's type annotations require types and not objects, to
express this same type with the class-based API, we need to use an
:py:class:`~typing.Annotated` type:

.. testcode:: dtype_validation

    try:
        from typing import Annotated  # python 3.9+
    except ImportError:
        from typing_extensions import Annotated

    class DateTimeModel(pa.DataFrameModel):
        dt: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]

Or alternatively, you can pass in the ``dtype_kwargs`` into
:py:func:`~pandera.api.pandas.model_components.Field`:

.. testcode:: dtype_validation

    class DateTimeModel(pa.DataFrameModel):
        dt: Series[pd.DatetimeTZDtype] = pa.Field(dtype_kwargs={"unit": "ns", "tz": "UTC"})

You can read more about the supported parameterized data types
:ref:`here <parameterized dtypes>`.


Data type coercion
------------------

Pandera is primarily a *validation* library: it only checks the schema metadata
or data values of the dataframe without changing anything about the dataframe
itself.

However, in many cases it's useful to *parse* the data, i.e. transform the
data values into the form specified by the pandera schema's data contract.
Currently, the only transformation pandera does is type coercion, which can be
enabled by passing the ``coerce=True`` argument to the schema or schema
component objects:

- :py:class:`~pandera.api.pandas.components.Column`
- :py:class:`~pandera.api.pandas.components.Index`
- :py:class:`~pandera.api.pandas.components.MultiIndex`
- :py:class:`~pandera.api.pandas.container.DataFrameSchema`
- :py:class:`~pandera.api.pandas.arrays.SeriesSchema`

If this argument is provided, instead of simply checking the columns/index(es)
for the correct types, calling ``schema.validate`` will attempt to coerce the
incoming dataframe values into the specified data types.

It will then apply the dataframe-, column-, and index-level checks to the
data, all of which are purely *validators*.


.. _how_nullable_works:

How data types interact with ``nullable``
------------------------------------------

The ``nullable`` argument, which can be specified at the column-, index-, or
``SeriesSchema``-level, is essentially a core pandera check. As such, it is
applied after the data type check/coercion step described in the previous
section. Therefore, datatypes that are inherently not nullable will fail even
if you specify ``nullable=True``, because pandera considers type checks a
first-class check that is distinct from any downstream check you may want
to apply to the data.


Support for the python ``typing`` module
----------------------------------------

*new in 0.15.0*

Pandera also supports a limited set of generic and special types from the
:py:mod:`typing` module for validating columns containing ``object`` values:

- ``typing.Dict[K, V]``
- ``typing.List[T]``
- ``typing.Tuple[T, ...]``
- ``typing.TypedDict``
- ``typing.NamedTuple``

For example:

.. testcode:: dtype_validation

    import sys
    from typing import Dict, List, Tuple, NamedTuple

    if sys.version_info >= (3, 9):
        from typing import TypedDict
    else:
        # use typing_extensions.TypedDict for python < 3.9 in order to support
        # run-time availability of optional/required fields
        from typing_extensions import TypedDict


    class PointDict(TypedDict):
        x: float
        y: float

    class PointTuple(NamedTuple):
        x: float
        y: float

    schema = pa.DataFrameSchema(
        {
            "dict_column": pa.Column(Dict[str, int]),
            "list_column": pa.Column(List[float]),
            "tuple_column": pa.Column(Tuple[int, str, float]),
            "typeddict_column": pa.Column(PointDict),
            "namedtuple_column": pa.Column(PointTuple),
        },
    )

    data = pd.DataFrame({
        "dict_column": [{"foo": 1, "bar": 2}],
        "list_column": [[1.0]],
        "tuple_column": [(1, "bar", 1.0)],
        "typeddict_column": [PointDict(x=2.1, y=4.8)],
        "namedtuple_column": [PointTuple(x=9.2, y=1.6)],
    })

    schema.validate(data)

Pandera uses `typeguard <https://typeguard.readthedocs.io/en/latest/>`__ for
data type validation and `pydantic <https://docs.pydantic.dev/latest/>`__ for
data value coercion, in case you've specified ``coerce=True`` at the
column-, index-, or dataframe-level.

.. note::

    For certain types like ``List[T]``, ``typeguard`` will only check the type
    of the first value. For example, if you specify ``List[int]``, a data value
    of ``[1, "foo", 1.0]`` will still pass. Checking all values will be
    configurable in future versions of pandera, when ``typeguard > 4.*.*`` is
    supported.
19 changes: 19 additions & 0 deletions docs/source/dtypes.rst
@@ -18,6 +18,15 @@ Pandera defines its own interface for data types in order to abstract the
specifics of dataframe-like data structures in the python ecosystem, such
as Apache Spark, Apache Arrow and xarray.

The pandera type system serves two functions:

1. To provide a standardized API for data types that works well within pandera,
   so users can define data types with it if they so desire.
2. To add a logical data type interface on top of the physical data type
   representation. For example, on top of the ``str`` data type, one can define
   an ``IPAddress`` or ``Name`` data type, which needs to actually check the
   underlying data values for correctness.

.. note:: In the following section ``Pandera Data Type`` refers to a
:class:`pandera.dtypes.DataType` object whereas ``native data type`` refers
to data types used by third-party libraries that Pandera supports (e.g. pandas).
@@ -30,6 +39,16 @@ interface by:
* modifying the behavior of the **coerce** argument for :class:`~pandera.schemas.DataFrameSchema`.
* adding your **own custom data types**.

The classes that define this data type hierarchy are in the following modules:

- :py:mod:`~pandera.dtypes`: these define semantic types, which are not
  user-facing, and are meant to be inherited by framework-specific engines.
- :py:mod:`~pandera.engines.numpy_engine`: this module implements numpy datatypes,
which pandas relies on.
- :py:mod:`~pandera.engines.pandas_engine`: this module uses the ``numpy_engine``
where appropriate, and adds support for additional pandas-specific data types,
e.g. ``pd.DatetimeTZDtype``.

DataType basics
~~~~~~~~~~~~~~~

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -355,6 +355,7 @@ page or reach out to the maintainers and pandera community on
dataframe_schemas
dataframe_models
series_schemas
dtype_validation
checks
hypothesis
dtypes
2 changes: 1 addition & 1 deletion environment.yml
@@ -20,7 +20,6 @@ dependencies:
- pyarrow
- pydantic
- multimethod
- typeguard

# mypy extra
- pandas-stubs <= 1.5.2.221213
@@ -82,6 +81,7 @@ dependencies:
- pip:
- furo
- ray
- typeguard >= 3.0.2
- types-click
- types-pyyaml
- types-pkg_resources
1 change: 1 addition & 0 deletions pandera/api/pandas/container.py
@@ -114,6 +114,7 @@ def __init__(

if columns is None:
columns = {}

_validate_columns(columns)
columns = _columns_renamed(columns)

