update docs for polars #1613

Merged
merged 1 commit into from May 6, 2024

Changes from all commits

1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -292,4 +292,5 @@ def linkcode_resolve(domain, info):
myst_heading_anchors = 3

nb_execution_mode = "auto"
nb_execution_timeout = 60
nb_execution_excludepatterns = ["_contents/try_pandera.ipynb"]
25 changes: 19 additions & 6 deletions docs/source/configuration.md
@@ -4,16 +4,29 @@

*New in version 0.17.3*

`pandera` provides a global config `~pandera.config.PanderaConfig`. The
global configuration is available through `pandera.config.CONFIG`. It can also
be modified with a configuration context `~pandera.config.config_context` and
fetched with `~pandera.config.get_config_context` in custom code.
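
For example, a minimal sketch of overriding the configuration in custom code
(assuming the `config_context` keyword arguments match the config fields):

```python
from pandera.config import config_context, get_config_context

# Temporarily disable validation inside the block; the global
# config is restored when the context exits.
with config_context(validation_enabled=False):
    assert get_config_context().validation_enabled is False
```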

This configuration can also be set using environment variables.

## Validation depth

Validation depth determines whether pandera only runs schema-level validations
(column names and datatypes), data-level validations (checks on actual values),
or both:

```
export PANDERA_VALIDATION_DEPTH=DATA_ONLY  # SCHEMA_AND_DATA, SCHEMA_ONLY, DATA_ONLY
```

## Enabling/disabling validation

Runtime data validation incurs a performance overhead. To mitigate this in the
appropriate contexts, you have the option to disable validation globally.

This can be achieved by setting the environment variable:

```
export PANDERA_VALIDATION_ENABLED=False
```

When validation is disabled, any `validate` call will not actually run any
validation checks.
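
These environment variables map onto the global config object. As a sketch,
the equivalent programmatic settings (assuming the `ValidationDepth` enum
listed in the API reference):

```python
from pandera.config import CONFIG, ValidationDepth

# Run both schema- and data-level validations globally.
CONFIG.validation_enabled = True
CONFIG.validation_depth = ValidationDepth.SCHEMA_AND_DATA
```
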
4 changes: 4 additions & 0 deletions docs/source/dataframe_schemas.md
@@ -472,6 +472,8 @@ df = pd.DataFrame({"a": [1, 2, 3]})
schema.validate(df)
```

(index-validation)=

## Index Validation

You can also specify an {class}`~pandera.api.pandas.components.Index` in the {class}`~pandera.api.pandas.container.DataFrameSchema`.
@@ -509,6 +511,8 @@ except pa.errors.SchemaError as exc:
print(exc)
```
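
For instance, a minimal sketch of index validation (the column and index
names here are hypothetical):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={"a": pa.Column(int)},
    # Validate the index datatype and its values.
    index=pa.Index(str, pa.Check.str_startswith("idx_")),
)

schema.validate(pd.DataFrame({"a": [1, 2]}, index=["idx_1", "idx_2"]))
```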

(multiindex-validation)=

## MultiIndex Validation

`pandera` also supports multi-index column and index validation.
53 changes: 53 additions & 0 deletions docs/source/index.md
@@ -326,6 +326,59 @@ extra column and the `None` value.
This error report can be useful for debugging, with each item in the various
lists corresponding to a `SchemaError`.


(supported-features)=

## Supported Features by DataFrame Backend

Currently, pandera provides three validation backends: `pandas`, `pyspark`, and
`polars`. The table below shows which of pandera's features are available for the
{ref}`supported dataframe libraries <dataframe-libraries>`:

:::{table}
:widths: auto
:align: left

| feature | pandas | pyspark | polars |
| :------ | ------ | ------- | ------ |
| {ref}`DataFrameSchema validation <dataframeschemas>` | ✅ | ✅ | ✅ |
| {ref}`DataFrameModel validation <dataframe-models>` | ✅ | ✅ | ✅ |
| {ref}`SeriesSchema validation <seriesschemas>` | ✅ | 🚫 | ❌ |
| {ref}`Index/MultiIndex validation <index-validation>` | ✅ | 🚫 | 🚫 |
| {ref}`Built-in and custom Checks <checks>` | ✅ | ✅ | ✅ |
| {ref}`Groupby checks <column-check-groups>` | ✅ | ❌ | ❌ |
| {ref}`Custom check registration <extensions>` | ✅ | ✅ | ❌ |
| {ref}`Hypothesis testing <hypothesis>` | ✅ | ❌ | ❌ |
| {ref}`Built-in <dtype-validation>` and {ref}`custom <dtypes>` `DataType`s | ✅ | ✅ | ✅ |
| {ref}`Preprocessing with Parsers <parsers>` | ✅ | ❌ | ❌ |
| {ref}`Data synthesis strategies <data-synthesis-strategies>` | ✅ | ❌ | ❌ |
| {ref}`Validation decorators <decorators>` | ✅ | ✅ | ✅ |
| {ref}`Lazy validation <lazy-validation>` | ✅ | ✅ | ✅ |
| {ref}`Dropping invalid rows <drop-invalid-rows>` | ✅ | ❌ | ✅ |
| {ref}`Pandera configuration <configuration>` | ✅ | ✅ | ✅ |
| {ref}`Schema Inference <schema-inference>` | ✅ | ❌ | ❌ |
| {ref}`Schema persistence <schema-persistence>` | ✅ | ❌ | ❌ |
| {ref}`Data Format Conversion <data-format-conversion>` | ✅ | ❌ | ❌ |
| {ref}`Pydantic type support <pydantic-integration>` | ✅ | ❌ | ❌ |
| {ref}`FastAPI support <fastapi-integration>` | ✅ | ❌ | ❌ |

:::

:::{admonition} Legend
:class: important

- ✅: Supported
- ❌: Not supported
- 🚫: Not applicable
:::


:::{note}
The `dask`, `modin`, `geopandas`, and `pyspark.pandas` support in pandera all
leverage the pandas validation backend.
:::


## Contributing

All contributions, bug reports, bug fixes, documentation improvements,
4 changes: 4 additions & 0 deletions docs/source/parsers.md
@@ -18,6 +18,10 @@ series objects before running the validation checks. This is useful when you wan
to normalize, clip, or otherwise clean data values before applying validation
checks.

:::{important}
This feature is only available in the pandas validation backend.
:::
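
As a minimal sketch of the idea in the pandas backend (hypothetical column
name; `Parser` takes a function applied to the data before checks run):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    # Clip negative values to zero before any downstream checks run.
    "a": pa.Column(int, parsers=pa.Parser(lambda s: s.clip(lower=0))),
})

schema.validate(pd.DataFrame({"a": [-1, 2, 3]}))
```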

## Parsing versus validation

Pandera distinguishes between data validation and parsing. Validation is the act
96 changes: 83 additions & 13 deletions docs/source/polars.md
@@ -27,6 +27,14 @@ pip install 'pandera[polars]'
:::{important}
If you're on an Apple Silicon machine, you'll need to install polars via
`pip install polars-lts-cpu`.

You may have to uninstall `polars` first if it's already installed:

```
pip uninstall polars
pip install polars-lts-cpu
```

:::

Then you can use pandera schemas to validate polars dataframes. In the example
@@ -89,14 +97,18 @@ schema.validate(lf).collect()

You can also validate {py:class}`polars.DataFrame` objects, which are objects that
execute computations eagerly. Under the hood, `pandera` will convert
the `polars.DataFrame` to a `polars.LazyFrame` before validating it. This is done
so that the internal validation routine that pandera implements can take
advantage of the optimizations that the polars lazy API provides.

```{code-cell} python
df: pl.DataFrame = lf.collect()
schema.validate(df)
```

## Synthesizing data for testing

:::{warning}
The {ref}`data-synthesis-strategies` functionality is not yet supported in
the polars integration. At this time you can use the polars-native
[parametric testing](https://docs.pola.rs/py-polars/html/reference/testing.html#parametric-testing)
Expand All @@ -107,7 +119,7 @@ functions to generate test data for polars.

Compared to the way `pandera` handles `pandas` dataframes, `pandera`
attempts to leverage the `polars` [lazy API](https://docs.pola.rs/user-guide/lazy/using/)
as much as possible to take advantage of its query optimization benefits.

At a high level, this is what happens during schema validation:

@@ -130,19 +142,19 @@ informative error messages since all failure cases can be reported.
:::

`pandera`'s validation behavior aligns with the way `polars` handles lazy
vs. eager operations. When you call `schema.validate()` on a `polars.LazyFrame`,
`pandera` will apply all of the parsers and checks that can be done without
any `collect()` operations. This means that it only does validations
at the schema-level, e.g. column names and data types.

However, if you validate a `polars.DataFrame`, `pandera` performs
schema-level and data-level validations.
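
A small sketch of this behavior (hypothetical schema and data):

```python
import polars as pl
import pandera.polars as pa

schema = pa.DataFrameSchema({"a": pa.Column(pl.Int64, pa.Check.gt(0))})
lf = pl.LazyFrame({"a": [1, 2, 3]})

schema.validate(lf)            # schema-level checks only; stays lazy
schema.validate(lf.collect())  # schema- and data-level checks
```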

:::{note}
Under the hood, `pandera` will convert `polars.DataFrame`s to
`polars.LazyFrame`s before validating them. This is done to leverage the
polars lazy API during the validation process. While this feature isn't
fully optimized in the `pandera` library, this design decision lays the
groundwork for future performance improvements.
:::

@@ -411,6 +423,7 @@ pandera.errors.SchemaErrors: {

::::

(supported-polars-dtypes)=

## Supported Data Types

@@ -491,6 +504,53 @@ class ModelWithDtypeKwargs(pa.DataFrameModel):

::::

### Time-agnostic DateTime

In some use cases, it may not matter whether a column containing `pl.Datetime`
data has a timezone or not. In that case, you can use the pandera-native
polars datatype:

::::{tab-set}

:::{tab-item} DataFrameSchema

```{testcode} polars
from pandera.engines.polars_engine import DateTime


schema = pa.DataFrameSchema({
"created_at": pa.Column(DateTime(time_zone_agnostic=True)),
})
```

:::

:::{tab-item} DataFrameModel (Annotated)

```{testcode} polars
from pandera.engines.polars_engine import DateTime


class DateTimeModel(pa.DataFrameModel):
created_at: Annotated[DateTime, True]
```

:::

:::{tab-item} DataFrameModel (Field)

```{testcode} polars
from pandera.engines.polars_engine import DateTime


class DateTimeModel(pa.DataFrameModel):
created_at: DateTime = pa.Field(dtype_kwargs={"time_zone_agnostic": True})
```

:::

::::
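
As a quick usage sketch (assuming the `schema` defined in the
DataFrameSchema tab above), a column with or without a time zone should
pass validation:

```python
from datetime import datetime, timezone

import polars as pl

lf = pl.LazyFrame(
    {"created_at": [datetime(2024, 1, 1, tzinfo=timezone.utc)]}
)
schema.validate(lf)  # a timezone-naive column should also pass
```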


## Custom checks

@@ -620,7 +680,7 @@ For column-level checks, the custom check function should return a

### DataFrame-level Checks

If you need to validate values on an entire dataframe, you can specify a check
at the dataframe level. The expected output is a `polars.LazyFrame` containing
multiple boolean columns, a single boolean column, or a scalar boolean.
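
For example, a hedged sketch of a dataframe-level custom check, assuming
`PolarsData` is importable from `pandera.polars` and exposes the validated
data under a `lazyframe` attribute:

```python
import polars as pl
import pandera.polars as pa
from pandera.polars import PolarsData


def all_positive(data: PolarsData) -> pl.LazyFrame:
    # Return one boolean column per input column.
    return data.lazyframe.select(pl.col("*").gt(0))


schema = pa.DataFrameSchema(
    columns={"a": pa.Column(pl.Int64), "b": pa.Column(pl.Int64)},
    checks=[pa.Check(all_positive)],
)
```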

@@ -737,11 +797,11 @@ lf: pl.LazyFrame = (
```

This syntax is nice because it's clear what's happening just from reading the
code. Pandera schemas serve as a clear point in the method chain where the data
is materialized.

However, if you don't mind a little magic 🪄, you can set the
`PANDERA_VALIDATION_DEPTH` environment variable to `SCHEMA_AND_DATA` to
validate data-level properties on a `polars.LazyFrame`. This will be equivalent
to the explicit code above:

@@ -761,3 +821,13 @@ lf: pl.LazyFrame = (
Under the hood, the validation process will make `.collect()` calls on the
LazyFrame in order to run data-level validation checks, and it will still
return a `pl.LazyFrame` after validation is done.

## Supported and Unsupported Functionality

Since the pandera-polars integration is less mature than pandas support, some
of the functionality offered by pandera with pandas DataFrames is not yet
supported with polars DataFrames.

Refer to the {ref}`supported features matrix <supported-features>` to see
which features are implemented in the polars validation backend.
11 changes: 11 additions & 0 deletions docs/source/pyspark_sql.md
@@ -338,3 +338,14 @@ nature. It only works with `Config`.

Use with caution.
:::


## Supported and Unsupported Functionality

Since the pandera-pyspark-sql integration is less mature than pandas support,
some of the functionality offered by pandera with pandas DataFrames is not yet
supported with pyspark sql DataFrames.

Refer to the {ref}`supported features matrix <supported-features>` to see
which features are implemented in the pyspark-sql validation backend.
14 changes: 14 additions & 0 deletions docs/source/reference/core.rst
@@ -51,3 +51,17 @@ Data Objects

pandera.api.polars.types.PolarsData
pandera.api.pyspark.types.PysparkDataframeColumnObject

Configuration
-------------

.. autosummary::
:toctree: generated
:template: class.rst
:nosignatures:

pandera.config.PanderaConfig
pandera.config.ValidationDepth
pandera.config.ValidationScope
pandera.config.config_context
pandera.config.get_config_context
6 changes: 4 additions & 2 deletions docs/source/schema_inference.md
@@ -7,14 +7,16 @@ file_format: mystnb

(schema-inference)=

# Schema Inference and Persistence

*New in version 0.4.0*

With simple use cases, writing a schema definition manually is pretty
straightforward with pandera. However, it can get tedious to do this with
dataframes that have many columns of various data types.

## Inferring a schema from data

To help you handle these cases, the {func}`~pandera.schema_inference.pandas.infer_schema` function enables
you to quickly infer a draft schema from a pandas dataframe or series. Below
is a simple example:
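
A minimal sketch of the usage (assuming a simple numeric dataframe):

```python
import pandas as pd
import pandera as pa

df = pd.DataFrame({"column1": [5, 10, 20], "column2": ["a", "b", "c"]})

# The inferred schema is a draft: refine its columns and checks by hand.
schema = pa.infer_schema(df)
print(schema)
```
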
@@ -52,7 +54,7 @@

(schema-persistence)=

## Persisting a schema

The schema persistence feature requires a pandera installation with the `io`
extension. See the {ref}`installation<installation>` instructions for more
details.
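
As a sketch of the round-trip workflow (assuming the `io` extension is
installed and `to_yaml`/`from_yaml` are available on the schema):

```python
import pandera as pa

schema = pa.DataFrameSchema({"a": pa.Column(int)})

yaml_schema = schema.to_yaml()  # serialize the schema to a YAML string
schema_from_yaml = pa.DataFrameSchema.from_yaml(yaml_schema)
```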