Pandera: A flexible and expressive pandas data validation library.

Submitting Author: Niels Bantilan (@cosmicbboy)
All current maintainers: (@cosmicbboy)
Package Name: pandera
One-Line Description of Package: validate the types, properties, and statistics of pandas data structures
Repository Link: https://github.com/unionai-oss/pandera
Version submitted: 0.1.5
Editor: @lwasser
Reviewer 1: @mbjoseph
Reviewer 2: @xmnlab 
Archive: https://github.com/pandera-dev/pandera/releases/tag/v0.2.3
Version accepted: v0.2.3
Date Accepted: 10/10/2019

---

## Description

`pandas` data structures can hide a lot of information, and explicitly
validating them at runtime in production-critical or reproducible research
settings is a good idea for building reliable data transformation pipelines.
`pandera` enables users to:

1. Check the types and properties of columns in a `DataFrame` or values in
   a `Series`.
1. Perform descriptive and inferential statistical validation, e.g. two-sample
   t-tests.
1. Seamlessly integrate with existing data analysis/processing pipelines
   via function decorators.

`pandera` provides a flexible and expressive API for performing data validation
on tidy (long-form) and wide data to make data processing pipelines more
readable and robust.


## Scope
- Please indicate which [category or categories](https://www.pyopensci.org/dev_guide/peer_review/aims_scope.html) this package falls under:
    - [ ] Data retrieval
    - [ ] Data extraction
    - [X] Data munging
    - [ ] Data deposition
    - [X] Reproducibility
    - [ ] Geospatial
    - [ ] Education
    - [ ] Data visualization*

\* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see [this section](https://www.pyopensci.org/dev_guide/peer_review/aims_scope.html#notes-on-categories) of our guidebook.

- Explain how and why the package falls under these categories (briefly, 1-2 sentences):

**Data munging**: the package makes ETL, data analysis, and data processing
pipelines more robust and reliable by providing users with tools to validate
assumptions about the schema and statistical properties of datasets.
This package supports validation on long (tidy) data and wide data.

**Reproducibility**: This package enables users to validate `DataFrame` or `Series`
objects at runtime or as unit/integration tests, and can easily be integrated
into existing pipelines using the `check_input` and `check_output` decorators.
It also supports collaboration and reproducible research by programmatically
enforcing assertions made about the statistical properties of a dataset in
addition to making it easier to review pandas code in production-critical
contexts.

-   Who is the target audience and what are scientific applications of this package?

The target audience of `pandera` consists of data scientists, data engineers,
machine learning engineers, and machine learning scientists who use `pandas` in
their data processing pipelines for various purposes, e.g. transforming data
for reporting, analytics, model training, and data visualization. This tool is
built on top of `pandas` and `scipy` to provide a user-friendly interface for
explicitly specifying the set of properties that a `DataFrame` or `Series` must
fulfill in order to be considered valid. Since `pandera` makes no assumptions
about the domain of study or contents of these `pandas` data structures, it
could be used in a wide variety of quantitative fields that involve the
analysis of tabular data.

-   Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There are a few alternatives to pandera in the Python ecosystem. Here
is how they compare:

- https://github.com/alecthomas/voluptuous
  - not specific to pandas, applies to JSON/YAML etc.
  - very flexible and reasonably simple
  - no decorators, hypothesis or sophisticated checks
- https://github.com/keleshev/schema
  - similar to voluptuous
  - validation of generic python data structures
- https://github.com/TMiguelT/PandasSchema
  - has a wider range of 'built-in' validator types
  - limited type support (only has a conversion/coercion check)
  - no decorators
  - implementation has less flexibility than pandera's
  - has generic 'check'-like validators
- https://github.com/danielvdende/opulent-pandas
  - similar to voluptuous, and conceptually similar to pandera, but lacking
    functionality
- https://github.com/c-data/pandas-validator
  - not maintained, inflexible syntax
- https://github.com/xguse/table_enforcer
  - not maintained
  - the `Enforcer` and `Column` objects are very similar to pandera, but it's a
    little difficult to follow

Key differentiators of pandera:

- column data types, nullability, and uniqueness are first-class concepts.
- `check_input` and `check_output` decorators enable seamless integration with
  existing code.
- `Check`s provide flexibility and performance by providing access to `pandas`
  API by design.
- `Hypothesis` class provides a tidy-first interface for statistical hypothesis
  testing.
- `Check`s and `Hypothesis` objects support both tidy and wide data validation.
- Comprehensive documentation on key functionality.

- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or `@tag` the editor you contacted:

https://pyopensci.discourse.group/t/candidate-package-pandera-a-flexible-pandas-data-structure-validation-package/92

## Technical checks

For details about the pyOpenSci packaging requirements, see our [packaging guide](https://www.pyopensci.org/dev_guide/packaging/packaging_guide.html). Confirm each of the following by checking the box.  This package:

- [X] does not violate the Terms of Service of any service it interacts with.
- [X] has an [OSI approved license](https://opensource.org/licenses)
- [X] contains a README with instructions for installing the development version.
- [X] includes documentation with examples for all functions.
- [X] contains a vignette with examples of its essential functions and uses.
- [X] has a test suite.
- [X] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

## Publication options

- [ ] Do you wish to automatically submit to the [Journal of Open Source Software](http://joss.theoj.org/)? If so:

<details>
 <summary>JOSS Checks</summary>

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

</details>

## Are you OK with Reviewers Submitting Issues to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

- [X] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

## Code of conduct

- [X] I agree to abide by [pyOpenSci's Code of Conduct](https://www.pyopensci.org/dev_guide/peer_review/coc.html) during the review process and in maintaining my package should it be accepted.


**P.S.** *Have feedback/comments about our review process? Leave a comment [here](https://github.com/pyOpenSci/governance/issues/8)*

## Editor and Review Templates

[Editor and review templates can be found here](https://www.pyopensci.org/dev_guide/appendices/templates.html)

Previous Repo: https://github.com/cosmicBboy/pandera 
