
Support native PySpark.sql on Pandera #1213

Merged
merged 306 commits into unionai-oss:dev
Jun 9, 2023

Conversation

NeerajMalhotra-QB
Collaborator

Finally, this branch has 'almost' all the changes to support PySpark.sql on Pandera. It also comes with additional functionality as follows:

  • Support native PySpark.sql on Pandera
  • kill switch to enable/disable validation in production (only for pyspark.sql schemas)
  • control over the depth of validation (SCHEMA_ONLY, DATA_ONLY, SCHEMA_DATA_BOTH) (only for pyspark.sql schemas)
  • optional metadata dict object at the column, field, schema, and model levels, for both pyspark.sql and pandas schemas
  • lint and mypy fixes

Super excited for this major milestone. In the coming days, we will be releasing documentation around the above changes.
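To give a feel for these features, here is a minimal sketch of what validating a pyspark.sql DataFrame might look like. The module path (pandera.pyspark), the metadata argument, and the errors accessor are based on the feature list above and the change log below, but the exact names may differ from the merged API.

import pyspark.sql.types as T
from pyspark.sql import SparkSession

import pandera.pyspark as pa  # assumed module path for the pyspark.sql integration

spark = SparkSession.builder.getOrCreate()

# Schema with an optional metadata dict at the column and schema levels.
schema = pa.DataFrameSchema(
    {
        "product": pa.Column(T.StringType(), metadata={"owner": "catalog-team"}),
        "price": pa.Column(T.IntegerType(), pa.Check.gt(0)),
    },
    metadata={"category": "product_details"},
)

df = spark.createDataFrame([("Bread", 9), ("Butter", 15)], ["product", "price"])

# Validation is governed by the kill switch and validation-depth settings
# described above; errors are collected rather than raised immediately.
validated_df = schema.validate(df)
print(validated_df.pandera.errors)  # errors accessor mentioned in the change log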

NeerajMalhotra-QB and others added 30 commits March 29, 2023 11:20
support pyspark checks, validations, schema matching, etc
Implementation of checks import and spark columns information check
enhancing __call__, checks classes and builtin_checks

I don't think we need to add another arg, column_name:

backend = self.get_backend(check_obj)(self)
return backend(check_obj, column, column_name)

since column is already present there.

Moreover, this is in pandera/api/checks.py, which will impact pandas and other backends' functionality too, so I reverted this code for now.

There's a kwargs dict which we can leverage as follows:

from pyspark.sql import DataFrame

# builtin-check registration decorator (defined in pandera/api/extensions.py)
from pandera.api.extensions import register_builtin_check


@register_builtin_check(
    error="str_startswith('{string}')",
)
def str_startswith(data: DataFrame, string: str, kwargs: dict) -> bool:
    """Ensure that all values start with a certain string.

    :param string: String all values should start with
    :param kwargs: keyword arguments passed into the ``Check`` initializer.
    """
    breakpoint()  # TODO: change to accept column and perform check on it

    return True
With this, I can access both the column and the value to check:

> /Users/Neeraj_Malhotra/qb_assets/os/forked/pandera/pandera/backends/pyspark/builtin_checks.py(226)str_startswith()
-> breakpoint()  # TODO: change to accept column and perform check on it
(Pdb) kwargs
{'string': 'B'}
(Pdb) string
'product'
(Pdb) data.show()
+-------+-----+
|product|price|
+-------+-----+
|  Bread|    9|
| Butter|   15|
+-------+-----+
Next, we can do something like return df.filter(condition).limit(1).count() == 1 to produce a boolean check result.
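Sketching that idea a bit further, the check body might end up looking something like the following. Treating the filter as a search for violating rows is an assumption, as is passing the column name positionally (which is what the pdb session above suggests).

import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def str_startswith(data: DataFrame, column_name: str, kwargs: dict) -> bool:
    """Ensure all values in `column_name` start with the configured prefix."""
    prefix = kwargs["string"]  # the check value, e.g. 'B' in the session above
    # Look for a single row that violates the condition; an empty result
    # means every value starts with the prefix and the check passes.
    violations = data.filter(~F.col(column_name).startswith(prefix)).limit(1)
    return violations.count() == 0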
Changes to fix the implementation of checks. Changed Apply function to send list with dataframe and column name; builtin function registers functions with lists which includes the dataframe
Fixed builtin check bug and added test for supported builtin checks for pyspark
@NeerajMalhotra-QB
Collaborator Author

I'll take a look.

> Can you update the docs to fix the docs failures? https://github.com/unionai-oss/pandera/actions/runs/5180399308/jobs/9334552071?pr=1213. It's due to the new metadata property.
>
> You can check this locally with make docs.

absolutely, will do.

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
@NeerajMalhotra-QB
Collaborator Author

NeerajMalhotra-QB commented Jun 5, 2023

@cosmicBboy
It's still running for the others, but it looks like the doctest issue is resolved for both pyspark and pandas.

https://github.com/unionai-oss/pandera/actions/runs/5182895439/jobs/9340171409?pr=1213
https://github.com/unionai-oss/pandera/actions/runs/5182895439/jobs/9340171960?pr=1213

@cosmicBboy
Collaborator

Note that docs CI is only run for python 3.8 and 3.9: https://github.com/unionai-oss/pandera/blob/main/.github/workflows/ci-tests.yml#L235-L241

So the failure in this CI run still needs to be fixed: https://github.com/unionai-oss/pandera/actions/runs/5182895439/jobs/9340171657?pr=1213

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

@cosmicBboy
Collaborator

https://github.com/NeerajMalhotra-QB/pandera/blob/union/pandera/api/extensions.py#L144-L148 this change needs to be reverted: it needs to include ps.DataFrame if pyspark is installed.

We should add a test for this in the unit tests, but good thing the docs caught it!

@NeerajMalhotra-QB
Collaborator Author

> https://github.com/NeerajMalhotra-QB/pandera/blob/union/pandera/api/extensions.py#L144-L148 this change needs to be reverted: it needs to include ps.DataFrame if pyspark is installed.
>
> We should add a test for this in the unit tests, but good thing the docs caught it!

oh yeah!!!
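For readers following along, a rough sketch of the kind of conditional type handling being discussed, assuming ps refers to pyspark.pandas; the tuple name here is illustrative, and the actual logic lives in pandera/api/extensions.py and may differ.

import pandas as pd

# Only accept pyspark.pandas objects when pyspark (with the pandas API) is installed.
try:
    import pyspark.pandas as ps

    VALID_DATA_TYPES = (pd.DataFrame, pd.Series, ps.DataFrame, ps.Series)
except ImportError:
    VALID_DATA_TYPES = (pd.DataFrame, pd.Series)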

@cosmicBboy cosmicBboy mentioned this pull request Jun 6, 2023
@NeerajMalhotra-QB
Collaborator Author

NeerajMalhotra-QB commented Jun 6, 2023

@cosmicBboy / Niels, I think issues related to docs and doctests are resolved now: https://github.com/unionai-oss/pandera/actions/runs/5193388249/jobs/9363869457?pr=1213

I am also seeing CI failures on tests/pyspark/test_schemas_on_pyspark_pandas.py, although it works locally for me. Can you please take a look when possible?
https://github.com/unionai-oss/pandera/actions/runs/5193388249/jobs/9363869160?pr=1213

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
NeerajMalhotra-QB and others added 2 commits June 8, 2023 15:55
Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
@cosmicBboy
Collaborator

cosmicBboy commented Jun 9, 2023

@NeerajMalhotra-QB tests are now passing!

The broken tests were actually due to issues in test_pyspark_params.py (which I renamed to test_pyspark_config.py)... the fact that we needed to do the whole del module cache thing made me think that it shouldn't be so hard to test a piece of functionality. I then backtracked to the way ParamsConfig was done... then I realized that the problem of converting env vars to config is exactly the problem that pydantic solves...

So I took the liberty of simplifying the way the config works, with a pandera.config module, see here.

I also expose a global pandera.config.CONFIG variable, which is an instance of PanderaConfig (the new parameter config class), and refactored validate_scope and a bunch of other places to use this global value.

I'm not sure if you wanted a per-backend configuration, which we can certainly do, but as the code was only using a global configuration this seemed like the simplest solution. Let me know your thoughts!
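As a reference for the idea, here is a minimal sketch of an env-var-driven config built on pydantic. The field names, env var prefix, and defaults are assumptions for illustration, and the validation-depth values mirror the ones listed at the top of this PR; the actual implementation is in the linked pandera.config module.

import os
from enum import Enum

from pydantic import BaseSettings  # pydantic v1-style settings


class ValidationDepth(Enum):
    SCHEMA_ONLY = "SCHEMA_ONLY"
    DATA_ONLY = "DATA_ONLY"
    SCHEMA_DATA_BOTH = "SCHEMA_DATA_BOTH"


class PanderaConfig(BaseSettings):
    """Global pandera settings, populated from PANDERA_* environment variables."""

    validation_enabled: bool = True
    validation_depth: ValidationDepth = ValidationDepth.SCHEMA_DATA_BOTH

    class Config:
        env_prefix = "pandera_"


# Single global instance that validate_scope and the backends can consult.
CONFIG = PanderaConfig()

if __name__ == "__main__":
    # Environment variables override the defaults at instantiation time.
    os.environ["PANDERA_VALIDATION_ENABLED"] = "False"
    print(PanderaConfig())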

@NeerajMalhotra-QB
Collaborator Author

> @NeerajMalhotra-QB tests are now passing!
>
> The broken tests were actually due to issues in test_pyspark_params.py (which I renamed to test_pyspark_config.py)... the fact that we needed to do the whole del module cache thing made me think that it shouldn't be so hard to test a piece of functionality. I then backtracked to the way ParamsConfig was done... then I realized that the problem of converting env vars to config is exactly the problem that pydantic solves...
>
> So I took the liberty of simplifying the way the config works, with a pandera.config module, see here.
>
> I also expose a global pandera.config.CONFIG variable, which is an instance of PanderaConfig (the new parameter config class), and refactored validate_scope and a bunch of other places to use this global value.
>
> I'm not sure if you wanted a per-backend configuration, which we can certainly do, but as the code was only using a global configuration this seemed like the simplest solution. Let me know your thoughts!

hi @cosmicBboy,
It looks good to me, and I like the idea of pandera.config since we can leverage it for pandas and other backends. I don't think we need different configs for different backends; in fact, it should be one config for the entire setup, irrespective of which client is used in the pipeline, e.g. pyspark or pandas. It's good that we can globalize this under pandera.config.

@cosmicBboy
Collaborator

okay! will merge this by the end of the day... I will likely do another look-over to see if there are any dangling abstractions, e.g. I don't think the ColumnBase and Column generic types are needed anymore, correct?

If so, we can clean up things like that next week. If you can help identify other clean-up points that would be great too.

@NeerajMalhotra-QB
Collaborator Author

Sure, I can take a look at it next week, including ColumnBase/Column (maybe in a new branch). I just want to ensure removing it doesn't break anything, but for now we should be good to merge this branch, and next week we can push another branch to dev before the release is cut. Agree?

Collaborator

@cosmicBboy cosmicBboy left a comment


🚀

@cosmicBboy cosmicBboy merged commit d3410a6 into unionai-oss:dev Jun 9, 2023
41 checks passed
cosmicBboy added a commit that referenced this pull request Jul 6, 2023
* Support for Native PySpark (#1185)

* init

* init

* init structure

* disable imports

* adding structure for pyspark

* setting dependency

* update class import

* fixing datatype cls

* adding dtypes for pyspark

* keep only bool type

* remove pydantic schema

* register pyspark data types

* add check method

* updating equivalents to native types

* update column schema

* refactor array to column

* rename array to column_schema

* remove pandas imports

* remove index and multiindex functionality

* adding pydantic schema class

* adding model components

* add model config

* define pyspark BaseConfig class

* removing index and multi-indexes

* remove modify schema

* Pyspark backend components, base, container, accessor, test file for accessor

* Pyspark backend components, base, container, accessor, test file for accessor

* Pyspark backend components, base, container, accessor, test file for accessor

* Pyspark backend components, base, container, accessor, test file for accessor

* add pyspark model components and types

* remove hypothesis

* remove synthesis and hypothesis

* Pyspark backend components, base, container, accessor, test file for accessor

* test for pyspark dataframeschema class

* test schema with alias types

* ensuring treat dataframes as tables types

* update container for pyspark dataframe

* adding negative test flow

* removing series and index on pyspark dataframes

* remove series

* revert series from pyspark.pandas

* adding checks for pyspark

* registering pysparkCheckBackend

* cleaning base

* Fixing the broken type cast check, validation of schema fix.

* define spark level schema

* fixing check flow

* setting apply fn

* add sub sample functionality

* adjusting test case against common attributes

* need apply for column level check

* adding builtin checks for pyspark

* adding checks for pyspark df

* getting check registered

* fixing a bug in error handling for schema check

* check_name validation fixed

* implementing dtype checks for pyspark

* updating error msg

* fixing dtype reason_code

* updating builtin checks for pyspark

* registration

* Implementation of checks import and spark columns information check

* enhancing __call__, checks classes and builtin_checks

* delete junk files

* Changes to fix the implementation of checks. Changed Apply function to send list with dataframe and column name; builtin function registers functions with lists which includes the dataframe

* extending pyspark checks

* Fixed builtin check bug and added test for supported builtin checks for pyspark

* add todos

* by default validate all checks

* fixing issue with sqlctx

* add dtypes pytests

* setting up schema

* add negative and positive tests

* add fixtures and refactor tests

* generalize spark_df func

* refactor to use conftest

* use conftest

* add support for decimal dtype and fixing other types

* Added new Datatypes support for pyspark, test cases for dtypes pyspark, created test file for error

* refactor ArraySchema

* rename array to column.py

* 1) Changes in test cases to look for summarised error raise instead of fast fail, since default behaviour is changed to summarised.
2) Added functionality to accept and check the precision and scale in Decimal Datatypes.

* add neg test

* add custom ErrorHandler

* Added functionality to DayTimeIntervalType datatype to accept parameters

* Added functionality to DayTimeIntervalType datatype to accept parameters

* return summarized error report

* replace dataframe to dict for return obj

* Changed checks input datatype to custom named tuple from the existing list. Also started changing the pyspark checks to include more datatypes

* refactor

* introduce error categories

* rename error categories

* fixing bug in schema.dtype.check

* fixing error category to be dynamic

* Added checks for each datatype in test cases. Reduced the code redundancy of the code in test file. Refactored the name of custom datatype object for checks.

* error_handler pass through

* add ErrorHandler to column api

* removed SchemaErrors since we now aggregate in errorHandler

* fixing dict keys

* Added Decorator to raise TypeError in case of unexpected input type for the check function.

* replace validator with report_errors

* cleaning debugs

* Support DataModels and Field

* Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop

* Fix to run using the class schema type

* use alias types

* clean up

* add new typing for pyspark.sql

* Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop

* Added changes to support raising error for use of datatype not supported by the check and support for map and array type.

* support bare dtypes for DataFrameModel

* remove resolved TODOs and breakpoints

* change to bare types

* use spark types instead of bare types

* using SchemaErrorReason instead of hardcode in container

* fixing an issue with error reason codes

* minor fix

* fixing checks and errors in pyspark

* Changes include the following:
1) Updated dtypes test functionality to make it more readable
2) Changes in accessor tests to support the new functionality
3) Changes in engine class to conform to check class everywhere else

* enhancing dataframeschema and model classes

* Changes to remove the pandas dependency

* Refactoring of the checks test functions

* Fixing the test case breaking

* Isort and Black formatting

* Container Test function failure

* Isort and black linting

* Changes to remove the pandas dependency

* Refactoring of the checks test functions

* Isort and black linting

* Added Changes to refactor the checks class. Fixes to some test cases failures.

* Removing breakpoint

* fixing raise error

* adding metadata dict

* Removing the reference of pandas from docstrings

* Removing redundant code block in utils

* Changes to return dataframe with errors property

* add accessor for errorHandler

* support errors access on pyspark.sql

* updating pyspark error tcs

* fixing model test cases

* adjusting errors to use pandera.errors

* use accessor instead of dict

* revert to develop

* Removal of imports which are not needed and improved test case.

* setting independent pyspark import

* pyspark imports

* revert comments

* store and retrieve metadata at schema levels

* adding metadata support

* Added changes to support parameter based run.
1) Added parameters.yaml file to hold the configurations
2) Added code in utility to read the config
3) Updated the test cases to support the parameter based run
4) Moved pyspark decorators to a new file decorators.py in backend
5) Type fix in get_matadata property in container.py file

* Changing the default value in config

* change to consistent interface

* cleaning api/pyspark

* backend and tests

* adding setter on errors accessors for pyspark

* reformatting error dict

* doc

* run black linter

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* fix lint

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* update pylintrc

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

---------

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com>
Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com>
Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* Support native PySpark.sql on Pandera (#1213)

* fixing check flow

* setting apply fn

* add sub sample functionality

* adjusting test case against common attributes

* need apply for column level check

* adding builtin checks for pyspark

* adding checks for pyspark df

* getting check registered

* fixing a bug in error handling for schema check

* check_name validation fixed

* implementing dtype checks for pyspark

* updating error msg

* fixing dtype reason_code

* updating builtin checks for pyspark

* registration

* Implementation of checks import and spark columns information check

* enhancing __call__, checks classes and builtin_checks

* delete junk files

* Changes to fix the implementation of checks. Changed Apply function to send list with dataframe and column name; builtin function registers functions with lists which includes the dataframe

* extending pyspark checks

* Fixed builtin check bug and added test for supported builtin checks for pyspark

* add todos

* by default validate all checks

* fixing issue with sqlctx

* add dtypes pytests

* setting up schema

* add negative and positive tests

* add fixtures and refactor tests

* generalize spark_df func

* refactor to use conftest

* use conftest

* add support for decimal dtype and fixing other types

* Added new Datatypes support for pyspark, test cases for dtypes pyspark, created test file for error

* refactor ArraySchema

* rename array to column.py

* 1) Changes in test cases to look for summarised error raise instead of fast fail, since default behaviour is changed to summarised.
2) Added functionality to accept and check the precision and scale in Decimal Datatypes.

* add neg test

* add custom ErrorHandler

* Added functionality to DayTimeIntervalType datatype to accept parameters

* Added functionality to DayTimeIntervalType datatype to accept parameters

* return summarized error report

* replace dataframe to dict for return obj

* Changed checks input datatype to custom named tuple from the existing list. Also started changing the pyspark checks to include more datatypes

* refactor

* introduce error categories

* rename error categories

* fixing bug in schema.dtype.check

* fixing error category to be dynamic

* Added checks for each datatype in test cases. Reduced the code redundancy of the code in test file. Refactored the name of custom datatype object for checks.

* error_handler pass through

* add ErrorHandler to column api

* removed SchemaErrors since we now aggregate in errorHandler

* fixing dict keys

* Added Decorator to raise TypeError in case of unexpected input type for the check function.

* replace validator with report_errors

* cleaning debugs

* Support DataModels and Field

* Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop

* Fix to run using the class schema type

* use alias types

* clean up

* add new typing for pyspark.sql

* Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop

* Added changes to support raising error for use of datatype not supported by the check and support for map and array type.

* support bare dtypes for DataFrameModel

* remove resolved TODOs and breakpoints

* change to bare types

* use spark types instead of bare types

* using SchemaErrorReason instead of hardcode in container

* fixing an issue with error reason codes

* minor fix

* fixing checks and errors in pyspark

* Changes include the following:
1) Updated dtypes test functionality to make it more readable
2) Changes in accessor tests to support the new functionality
3) Changes in engine class to conform to check class everywhere else

* enhancing dataframeschema and model classes

* Changes to remove the pandas dependency

* Refactoring of the checks test functions

* Fixing the test case breaking

* Isort and Black formatting

* Container Test function failure

* Isort and black linting

* Changes to remove the pandas dependency

* Refactoring of the checks test functions

* Isort and black linting

* Added Changes to refactor the checks class. Fixes to some test cases failures.

* Removing breakpoint

* fixing raise error

* adding metadata dict

* Removing the reference of pandas from docstrings

* Removing redundant code block in utils

* Changes to return dataframe with errors property

* add accessor for errorHandler

* support errors access on pyspark.sql

* updating pyspark error tcs

* fixing model test cases

* adjusting errors to use pandera.errors

* use accessor instead of dict

* revert to develop

* Removal of imports which are not needed and improved test case.

* setting independent pyspark import

* pyspark imports

* revert comments

* store and retrieve metadata at schema levels

* adding metadata support

* Added changes to support parameter based run.
1) Added parameters.yaml file to hold the configurations
2) Added code in utility to read the config
3) Updated the test cases to support the parameter based run
4) Moved pyspark decorators to a new file decorators.py in backend
5) Type fix in get_matadata property in container.py file

* Changing the default value in config

* change to consistent interface

* Changes to remove config yaml and introduce environment variables for parameterized runs

* cleaning api/pyspark

* backend and tests

* adding setter on errors accessors for pyspark

* reformatting error dict

* Changes to remove config yaml and introduce environment variables for parameterized runs

* Changes to rename the config object and call only in utils.py

* Fixing merge conflict issue

* Updating the test cases to support new checks types

* Added individualized test for each configuration type.

* Removing unnecessary prints

* The changes include the following:
1) Fixed test case for validating the environment variable
2) Improved docstrings for test cases and few test cases asserts

* Fix reference to wrong key in test_pyspark_schema_data_checks

* minor change

* Added Support for docstring substitution method.

* Removing an extra indent

* Removing commented docstring substitution from __new__ method

* remove union

* cleaning

* Feature to add metadata dictionary for pandas schema

* Added test to check the docstring substitution decorator

* Added test to check the docstring substitution decorator

* Feature to add metadata dictionary for pandas schema

* Changes to ensure only pandas run does not import pyspark dependencies

* Fix of imports for pandas and pyspark for separation

* Rename the function from pyspark to pandas

* black lint and isort

* black lint and isort

* Fixes of pylint issues and suppression wherever necessary

* Fixes of mypy failures and redone black linting post changes.

* Added new test cases, removed redundant codes and black lint.

* Fixed the doc strings, added functionality and test for custom checks

* add rst for pyspark.sql

* removing rst

* Renamed check name and Fixed pylint and mypy issues

* add rst for pyspark.sql

* Fixed the doc strings, added functionality and test for custom checks

* removing rst

* Renamed check name and Fixed pylint and mypy issues

* add rst for pyspark.sql

* Rename for environment variable key name

* removing rst

* Black lint

* Removed daytime interval type

* refactor

* override pyspark patching of __class_getitem__

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* fixing mypy error

* lint fixes

* lint fixes

* fixing more lint and type issues

* fixing mypy issues

* fixing doctest

* doctest

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* fixing doctest

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* adding doctest:metadata for pandas container classes

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* doctest

* fixing doctest

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* fixing rst

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* black formatting

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* fixing str repr for DataFrameSchema across rst

* add ps.DataFrame

* fixing tests

* fix lint

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* use full class name in pandas accessor

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* use os.environ instead of parameters.yaml

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* simplify config

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

---------

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com>
Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com>
Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* support multiple pyspark versions (#1221)

support pyspark 3.2 and 3.3 for string representation of dtypes.

---------

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com>
Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com>
Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* Refactors for dead code (#1229)

* updating builtin checks for pyspark

* registration

* Implementation of checks import and spark columns information check

* enhancing __call__, checks classes and builtin_checks

* delete junk files

* Changes to fix the implementation of checks. Changed Apply function to send list with dataframe and column name; builtin function registers functions with lists which includes the dataframe

* extending pyspark checks

* Fixed builtin check bug and added test for supported builtin checks for pyspark

* add todos

* by default validate all checks

* fixing issue with sqlctx

* add dtypes pytests

* setting up schema

* add negative and positive tests

* add fixtures and refactor tests

* generalize spark_df func

* refactor to use conftest

* use conftest

* add support for decimal dtype and fixing other types

* Added new Datatypes support for pyspark, test cases for dtypes pyspark, created test file for error

* refactor ArraySchema

* rename array to column.py

* 1) Changes in test cases to look for summarised error raise instead of fast fail, since default behaviour is changed to summarised.
2) Added functionality to accept and check the precision and scale in Decimal Datatypes.

* add neg test

* add custom ErrorHandler

* Added functionality to DayTimeIntervalType datatype to accept parameters

* Added functionality to DayTimeIntervalType datatype to accept parameters

* return summarized error report

* replace dataframe to dict for return obj

* Changed checks input datatype to custom named tuple from the existing list. Also started changing the pyspark checks to include more datatypes

* refactor

* introduce error categories

* rename error categories

* fixing bug in schema.dtype.check

* fixing error category to be dynamic

* Added checks for each datatype in test cases. Reduced the code redundancy of the code in test file. Refactored the name of custom datatype object for checks.

* error_handler pass through

* add ErrorHandler to column api

* removed SchemaErrors since we now aggregate in errorHandler

* fixing dict keys

* Added Decorator to raise TypeError in case of unexpected input type for the check function.

* replace validator with report_errors

* cleaning debugs

* Support DataModels and Field

* Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop

* Fix to run using the class schema type

* use alias types

* clean up

* add new typing for pyspark.sql

* Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop

* Added changes to support raising error for use of datatype not supported by the check and support for map and array type.

* support bare dtypes for DataFrameModel

* remove resolved TODOs and breakpoints

* change to bare types

* use spark types instead of bare types

* using SchemaErrorReason instead of hardcode in container

* fixing an issue with error reason codes

* minor fix

* fixing checks and errors in pyspark

* Changes include the following:
1) Updated dtypes test functionality to make it more readable
2) Changes in accessor tests to support the new functionality
3) Changes in engine class to conform to check class everywhere else

* enhancing dataframeschema and model classes

* Changes to remove the pandas dependency

* Refactoring of the checks test functions

* Fixing the test case breaking

* Isort and Black formatting

* Container Test function failure

* Isort and black linting

* Changes to remove the pandas dependency

* Refactoring of the checks test functions

* Isort and black linting

* Added Changes to refactor the checks class. Fixes to some test cases failures.

* Removing breakpoint

* fixing raise error

* adding metadata dict

* Removing the reference of pandas from docstrings

* Removing redundant code block in utils

* Changes to return dataframe with errors property

* add accessor for errorHandler

* support errors access on pyspark.sql

* updating pyspark error tcs

* fixing model test cases

* adjusting errors to use pandera.errors

* use accessor instead of dict

* revert to develop

* Removal of imports which are not needed and improved test case.

* setting independent pyspark import

* pyspark imports

* revert comments

* store and retrieve metadata at schema levels

* adding metadata support

* Added changes to support parameter based run.
1) Added parameters.yaml file to hold the configurations
2) Added code in utility to read the config
3) Updated the test cases to support the parameter based run
4) Moved pyspark decorators to a new file decorators.py in backend
5) Type fix in get_matadata property in container.py file

* Changing the default value in config

* change to consistent interface

* Changes to remove config yaml and introduce environment variables for parameterized runs

* cleaning api/pyspark

* backend and tests

* adding setter on errors accessors for pyspark

* reformatting error dict

* Changes to remove config yaml and introduce environment variables for parameterized runs

* Changes to rename the config object and call only in utils.py

* Fixing merge conflict issue

* Updating the test cases to support new checks types

* Added individualized test for each configuration type.

* Removing unnecessary prints

* The changes include the following:
1) Fixed test case for validating the environment variable
2) Improved docstrings for test cases and few test cases asserts

* Fix reference to wrong key in test_pyspark_schema_data_checks

* minor change

* Added Support for docstring substitution method.

* Removing an extra indent

* Removing commented docstring substitution from __new__ method

* remove union

* cleaning

* Feature to add metadata dictionary for pandas schema

* Added test to check the docstring substitution decorator

* Added test to check the docstring substitution decorator

* Feature to add metadata dictionary for pandas schema

* Changes to ensure only pandas run does not import pyspark dependencies

* Fix of imports for pandas and pyspark for separation

* Rename the function from pyspark to pandas

* black lint and isort

* black lint and isort

* Fixes of pylint issues and suppression wherever necessary

* Fixes of mypy failures and redone black linting post changes.

* Added new test cases, removed redundant codes and black lint.

* Fixed the doc strings, added functionality and test for custom checks

* add rst for pyspark.sql

* removing rst

* Renamed check name and Fixed pylint and mypy issues

* add rst for pyspark.sql

* Fixed the doc strings, added functionality and test for custom checks

* removing rst

* Renamed check name and Fixed pylint and mypy issues

* add rst for pyspark.sql

* Rename for environment variable key name

* removing rst

* Black lint

* Removed daytime interval type

* refactor

* override pyspark patching of __class_getitem__

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* fixing mypy error

* lint fixes

* lint fixes

* fixing more lint and type issues

* fixing mypy issues

* fixing doctest

* doctest

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* fixing doctest

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* adding doctest:metadata for pandas container classes

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* doctest

* Fix to support both pyspark 3.2 and 3.3. The string representation of datatypes changed between 3.2 and 3.3; the fix ensures both versions are supported.

* Black Lint

* fixing doctest

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* fixing rst

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* black formatting

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* fixing str repr for DataFrameSchema across rst

* add ps.DataFrame

* fixing tests

* fix lint

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* use full class name in pandas accessor

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* use os.environ instead of parameters.yaml

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* simplify config

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* merge with develop

* Black Lint

* refactor

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* lint fix

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* refactor

* remove Column class due to redundancy

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* linting

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

---------

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com>
Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com>
Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* Adding readme for pyspark_sql enhancements (#1236)

* add dtypes pytests

* setting up schema

* add negative and positive tests

* add fixtures and refactor tests

* generalize spark_df func

* refactor to use conftest

* use conftest

* add support for decimal dtype and fixing other types

* Added new Datatypes support for pyspark, test cases for dtypes pyspark, created test file for error

* refactor ArraySchema

* rename array to column.py

* 1) Changes in test cases to look for summarised error raise instead of fast fail, since default behaviour is changed to summarised.
2) Added functionality to accept and check the precision and scale in Decimal Datatypes.

* add neg test

* add custom ErrorHandler

* Added functionality to DayTimeIntervalType datatype to accept parameters

* Added functionality to DayTimeIntervalType datatype to accept parameters

* return summarized error report

* replace dataframe to dict for return obj

* Changed checks input datatype to custom named tuple from the existing list. Also started changing the pyspark checks to include more datatypes

* refactor

* introduce error categories

* rename error categories

* fixing bug in schema.dtype.check

* fixing error category to be dynamic

* Added checks for each datatype in test cases. Reduced the code redundancy of the code in test file. Refactored the name of custom datatype object for checks.

* error_handler pass through

* add ErrorHandler to column api

* removed SchemaErrors since we now aggregate in errorHandler

* fixing dict keys

* Added Decorator to raise TypeError in case of unexpected input type for the check function.

* replace validator with report_errors

* cleaning debugs

* Support DataModels and Field

* Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop

* Fix to run using the class schema type

* use alias types

* clean up

* add new typing for pyspark.sql

* Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop

* Added changes to support raising error for use of datatype not supported by the check and support for map and array type.

* support bare dtypes for DataFrameModel

* remove resolved TODOs and breakpoints

* change to bare types

* use spark types instead of bare types

* using SchemaErrorReason instead of hardcode in container

* fixing an issue with error reason codes

* minor fix

* fixing checks and errors in pyspark

* Changes include the following:
1) Updated dtypes test functionality to make it more readable
2) Changes in accessor tests to support the new functionality
3) Changes in engine class to conform to check class everywhere else

* enhancing dataframeschema and model classes

* Changes to remove the pandas dependency

* Refactoring of the checks test functions

* Fixing the test case breaking

* Isort and Black formatting

* Container Test function failure

* Isort and black linting

* Changes to remove the pandas dependency

* Refactoring of the checks test functions

* Isort and black linting

* Added Changes to refactor the checks class. Fixes to some test cases failures.

* Removing breakpoint

* fixing raise error

* adding metadata dict

* Removing the reference of pandas from docstrings

* Removing redundant code block in utils

* Changes to return dataframe with errors property

* add accessor for errorHandler

* support errors access on pyspark.sql

* updating pyspark error tcs

* fixing model test cases

* adjusting errors to use pandera.errors

* use accessor instead of dict

* revert to develop

* Removal of imports which are not needed and improved test case.

* setting independent pyspark import

* pyspark imports

* revert comments

* store and retrieve metadata at schema levels

* adding metadata support

* Added changes to support parameter based run.
1) Added parameters.yaml file to hold the configurations
2) Added code in utility to read the config
3) Updated the test cases to support the parameter based run
4) Moved pyspark decorators to a new file decorators.py in backend
5) Type fix in get_matadata property in container.py file

* Changing the default value in config

* change to consistent interface

* Changes to remove config yaml and introduce environment variables for parameterized runs

* cleaning api/pyspark

* backend and tests

* adding setter on errors accessors for pyspark

* reformatting error dict

* Changes to remove config yaml and introduce environment variables for parameterized runs

* Changes to rename the config object and call only in utils.py

* Fixing merge conflict issue

* Updating the test cases to support new checks types

* Added individualized test for each configuration type.

* Removing unnecessary prints

* The changes include the following:
1) Fixed test case for validating the environment variable
2) Improved docstrings for test cases and few test cases asserts

* Fix reference to wrong key in test_pyspark_schema_data_checks

* minor change

* Added Support for docstring substitution method.

* Removing an extra indent

* Removing commented docstring substitution from __new__ method

* remove union

* cleaning

* Feature to add metadata dictionary for pandas schema

* Added test to check the docstring substitution decorator

* Added test to check the docstring substitution decorator

* Feature to add metadata dictionary for pandas schema

* Changes to ensure only pandas run does not import pyspark dependencies

* Fix of imports for pandas and pyspark for separation

* Rename the function from pyspark to pandas

* black lint and isort

* black lint and isort

* Fixes of pylint issues and suppression wherever necessary

* Fixes of mypy failures and redone black linting post changes.

* Added new test cases, removed redundant codes and black lint.

* Fixed the doc strings, added functionality and test for custom checks

* add rst for pyspark.sql

* removing rst

* Renamed check name and Fixed pylint and mypy issues

* add rst for pyspark.sql

* Fixed the doc strings, added functionality and test for custom checks

* removing rst

* Renamed check name and Fixed pylint and mypy issues

* add rst for pyspark.sql

* Rename for environment variable key name

* removing rst

* Black lint

* Removed daytime interval type

* refactor

* override pyspark patching of __class_getitem__

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* fixing mypy error

* lint fixes

* lint fixes

* fixing more lint and type issues

* fixing mypy issues

* fixing doctest

* doctest

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* fixing doctest

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* adding doctest:metadata for pandas container classes

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* doctest

* Fix to support both pyspark 3.2 and 3.3. The string representation of datatypes changed between 3.2 and 3.3; the fix ensures both versions are supported.

* Black Lint

* fixing doctest

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* fixing rst

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* black formatting

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* fixing str repr for DataFrameSchema across rst

* add ps.DataFrame

* fixing tests

* fix lint

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* use full class name in pandas accessor

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* use os.environ instead of parameters.yaml

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* simplify config

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* merge with develop

* Black Lint

* refactor

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* lint fix

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* readme

* refine example

* about granular controls

* native pyspark.sql documentation

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* docs

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* Add to index

* add to index

* Delete oryx-build-commands.txt

* fix index

* revert

* fix index

* Update pyspark_sql.rst

* Update pyspark_sql.rst

* update error print

* Update pyspark_sql.rst

* Update pyspark_sql.rst

* adding import in code block

Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>

* Update pyspark_sql.rst

* Update pyspark_sql.rst

* Update pyspark_sql.rst

* clean up docs

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

---------

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com>
Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com>
Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* fix metadata arg in SeriesSchema

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* improve coverage

* improve coverage in extensions

---------

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com>
Co-authored-by: Neeraj Malhotra <52220398+NeerajMalhotra-QB@users.noreply.github.com>
Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com>
Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com>