Support for Native PySpark #1185

NeerajMalhotra-QB · 2023-05-12T15:38:03Z

This PR introduces several enhancements and new features to support native PySpark on Pandera. The key changes implemented include:

Native PySpark Support - The pull request enables native PySpark support in Pandera, allowing seamless integration and utilization of PySpark functionalities.
Kill Switch - A kill switch feature has been added, providing the ability to halt the validation process temporarily.
Granular Control over Depth of Validation - The pull request introduces finer control over the depth of validation, letting users choose a scope such as SchemaOnly, DataOnly, or Schema_And_Data.
Error Reporting - An error reporting mechanism has been incorporated, improving the visibility and clarity of error messages and generating a dictionary object that can be used for diagnostics later.
Metadata at Schema & Column levels - The PR also introduces the concept of metadata at the schema and column levels, enabling inclusion of additional contextual information and attributes associated with the data being validated.
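
The kill switch and the validation-depth scopes listed above can be pictured as two settings read at validation time. The following is a minimal standard-library sketch of that idea, not Pandera's implementation: the variable names `SKETCH_VALIDATION_ENABLED` and `SKETCH_VALIDATION_DEPTH`, the `ValidationDepth` enum, and `run_validation` are all hypothetical names chosen for illustration.

```python
import os
from enum import Enum


class ValidationDepth(Enum):
    """Hypothetical scopes mirroring SchemaOnly / DataOnly / Schema_And_Data."""

    SCHEMA_ONLY = "SCHEMA_ONLY"
    DATA_ONLY = "DATA_ONLY"
    SCHEMA_AND_DATA = "SCHEMA_AND_DATA"


def read_settings(env=None):
    """Return (enabled, depth) from an environment-style mapping."""
    env = os.environ if env is None else env
    raw_enabled = env.get("SKETCH_VALIDATION_ENABLED", "true")
    enabled = raw_enabled.strip().lower() in {"1", "true", "yes"}
    raw_depth = env.get("SKETCH_VALIDATION_DEPTH", "SCHEMA_AND_DATA")
    return enabled, ValidationDepth(raw_depth.strip().upper())


def run_validation(schema_checks, data_checks, env=None):
    """Run the checks selected by the settings; return the names that ran.

    Each check is a (name, callable) pair.
    """
    enabled, depth = read_settings(env)
    if not enabled:
        return []  # kill switch: validation is skipped entirely

    selected = []
    if depth is not ValidationDepth.DATA_ONLY:
        selected += schema_checks  # e.g. column presence, dtypes
    if depth is not ValidationDepth.SCHEMA_ONLY:
        selected += data_checks  # e.g. value-level checks

    ran = []
    for name, check in selected:
        check()
        ran.append(name)
    return ran


schema = [("dtype", lambda: True)]
data = [("gt_zero", lambda: True)]
print(run_validation(schema, data, env={"SKETCH_VALIDATION_DEPTH": "SCHEMA_ONLY"}))
```

Keeping both settings outside the schema definition means a job can be switched to schema-only validation, or have validation disabled, without touching application code.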

These changes significantly enhance Pandera's capabilities, empowering users to leverage Pandera in native PySpark applications.
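
To make the error-reporting idea concrete, here is a small sketch of collecting failures into a dictionary grouped by category instead of raising on the first one. It uses plain Python rows rather than a PySpark DataFrame, and the `ErrorReport` class, the `SCHEMA`/`DATA` category keys, and the reason codes are hypothetical illustrations, not the structure Pandera actually returns.

```python
from collections import defaultdict


class ErrorReport:
    """Accumulates failures instead of raising on the first one."""

    def __init__(self):
        self._errors = defaultdict(lambda: defaultdict(list))

    def add(self, category, reason, column, message):
        self._errors[category][reason].append(
            {"column": column, "error": message}
        )

    def as_dict(self):
        # Convert nested defaultdicts to plain dicts for easy serialization.
        return {cat: dict(reasons) for cat, reasons in self._errors.items()}


def validate_rows(rows, expected_types, checks):
    """Validate a list of dict rows and return a summarized error dict.

    expected_types maps column -> Python type (schema-level check).
    checks maps column -> (check_name, predicate) (data-level check).
    """
    report = ErrorReport()
    for column, expected in expected_types.items():
        values = [row.get(column) for row in rows]
        if not all(isinstance(v, expected) for v in values):
            report.add("SCHEMA", "WRONG_DATATYPE", column,
                       f"expected {expected.__name__}")
    for column, (name, predicate) in checks.items():
        failures = sum(1 for row in rows if not predicate(row[column]))
        if failures:
            report.add("DATA", "CHECK_FAILED", column,
                       f"{name} failed for {failures} row(s)")
    return report.as_dict()


rows = [{"id": 1, "price": 5.0}, {"id": "2", "price": -1.0}]
report = validate_rows(
    rows,
    expected_types={"id": int, "price": float},
    checks={"price": ("positive", lambda v: v > 0)},
)
print(report)
```

Because every failure is recorded before anything is returned, a caller can inspect schema-level and data-level problems together after a single validation pass.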

Pyspark engine

updating equivalents to native types

…accessor

implementing component and container changes for pyspark dataframes

Feature kill switch

change to consistent interface

removing dead code and refactoring as needed

adding setter for errors object in pyspark accessor

reformatting error dict

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

cosmicBboy · 2023-05-12T20:25:02Z

hey @NeerajMalhotra-QB, so there're a considerable number of pylint and mypy errors.

I'm going to silence them before merging into the pyspark branch on the unionai-oss pandera repo. From there we can work on either fixing the errors or disabling them with pylint disable and type: ignore.

On this front, I want to create an alpha release 0.16.0a1 to include all these changes, but ideally we fix all test and linting issues before creating a beta release. How does that sound?

NeerajMalhotra-QB · 2023-05-12T20:34:51Z

> hey @NeerajMalhotra-QB, so there're a considerable number of pylint and mypy errors.
>
> I'm going to silence them before merging into the pyspark branch on the unionai-oss pandera repo. From there we can work on either fixing the errors or disabling them with pylint disable and type: ignore.
>
> On this front, I want to create an alpha release 0.16.0a1 to include all these changes, but ideally we fix all test and linting issues before creating a beta release. How does that sound?

That sounds good to me. Please let me know when you merge it into the branch so we can start pulling from it and updating it going forward. Thanks @cosmicBboy

cosmicBboy · 2023-05-12T20:44:29Z

> I'm going to silence them before merging into the pyspark branch on the unionai-oss pandera repo.

Quick update: I'm going to use the unionai-oss/pandera dev branch instead of pyspark. That branch is already configured to run tests on CI, so we might as well use it!

codecov · 2023-05-12T21:32:12Z

Codecov Report

Patch coverage: 12.43% and project coverage change: -21.37% ⚠️

Comparison is base (f401617) 97.23% compared to head (534e6ce) 75.87%.

Additional details and impacted files

@@             Coverage Diff             @@
##              dev    #1185       +/-   ##
===========================================
- Coverage   97.23%   75.87%   -21.37%     
===========================================
  Files          65       88       +23     
  Lines        5067     6674     +1607     
===========================================
+ Hits         4927     5064      +137     
- Misses        140     1610     +1470
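
The rounded figures in the diff do not subtract exactly (97.23 - 75.87 = 21.36, not 21.37). One reading consistent with every number shown is that each percentage is computed from the raw hit and line counts and then rounded down to two decimals. This is an assumption inferred from the figures, not documented Codecov behavior, and the helper name `floor_pct` is hypothetical.

```python
import math
from fractions import Fraction


def floor_pct(ratio):
    """Convert a ratio to a percentage rounded down to two decimals."""
    return math.floor(ratio * 10000) / 100


# Hits and line counts taken from the coverage diff.
base = Fraction(4927, 5067)
head = Fraction(5064, 6674)

print(floor_pct(base))         # 97.23
print(floor_pct(head))         # 75.87
print(floor_pct(head - base))  # -21.37
```

Using exact fractions avoids floating-point noise near the rounding boundary; the change is computed from the unrounded ratios, which is why it differs from the difference of the two displayed values.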

| Impacted Files | Coverage Δ |
|---|---|
| pandera/accessors/pyspark_sql_accessor.py | `0.00% <0.00%> (ø)` |
| pandera/api/base/checks.py | `100.00% <ø> (ø)` |
| pandera/api/base/model.py | `97.22% <ø> (ø)` |
| pandera/api/checks.py | `98.54% <ø> (ø)` |
| pandera/api/extensions.py | `99.01% <ø> (ø)` |
| pandera/api/hypotheses.py | `100.00% <ø> (ø)` |
| pandera/api/pandas/components.py | `99.11% <ø> (ø)` |
| pandera/api/pandas/container.py | `99.25% <ø> (ø)` |
| pandera/api/pyspark/__init__.py | `0.00% <0.00%> (ø)` |
| pandera/api/pyspark/column_schema.py | `0.00% <0.00%> (ø)` |
... and 57 more

... and 2 files with indirect coverage changes


cosmicBboy · 2023-05-13T03:20:40Z

okay, so there are more issues with the tests and linters, which I wasn't able to address @NeerajMalhotra-QB.

I'll go ahead and merge this into the dev branch and cut an alpha release. You and the QB team can then iterate from there!

NeerajMalhotra-QB · 2023-05-15T17:24:25Z

Thanks @cosmicBboy, we will look into it.

* Support for Native PySpark (#1185) * init * init * init structure * disable imports' * adding structure for pyspark * setting dependency * update class import * fixing datatype cls * adding dtypes for pyspark * keep only bool type * remove pydantic schema * register pyspark data types * add check method * updating equivalents to native types * update column schema * refactor array to column * rename array to column_schema * remove pandas imports * remove index and multiindex functionality * adding pydantic schema class * adding model components * add model config * define pyspark BaseConfig class * removing index and multi-indexes * remove modify schema * Pyspark backend components, base, container, accessor, test file for accessor * Pyspark backend components, base, container, accessor, test file for accessor * Pyspark backend components, base, container, accessor, test file for accessor * Pyspark backend components, base, container, accessor, test file for accessor * add pyspark model components and types * remove hypothesis * remove synthesis and hypothesis * Pyspark backend components, base, container, accessor, test file for accessor * test for pyspark dataframeschema class * test schema with alias types * ensuring treat dataframes as tables types * update container for pyspark dataframe * adding negative test flow * removing series and index on pysparrk dataframes * remove series * revert series from pyspark.pandas * adding checks for pyspark * registering pysparkCheckBackend * cleaning base * Fixing the broken type cast check, validation of schema fix. 
* define spark level schema * fixing check flow * setting apply fn * add sub sample functionality * adjusting test case against common attributes * need apply for column level check * adding builtin checks for pyspark * adding checks for pyspark df * getting check registered * fixing a bug a in error handling for schema check * check_name validation fixed * implementing dtype checks for pyspark * updating error msg * fixing dtype reason_code * updating builtin checks for pyspark * registeration * Implementation of checks import and spark columns information check * enhancing __call__, checks classes and builtin_checks * delete junk files * Changes to fix the implemtation of checks. Changed Apply function to send list with dataframe and column name, builtin function registers functions with lists which inculdes the dataframe * extending pyspark checks * Fixed builtin check bug and added test for supported builtin checks for pyspark * add todos * bydefault validate all checks * fixing issue with sqlctx * add dtypes pytests * setting up schema * add negative and positive tests * add fixtures and refactor tests * generalize spark_df func * refactor to use conftest * use conftest * add support for decimal dtype and fixing other types * Added new Datatypes support for pyspark, test cases for dtypes pyspark, created test file for error * refactor ArraySchema * rename array to column.py * 1) Changes in test cases to look for summarised error raise instead of fast fail, since default behaviour is changed to summarised. 2) Added functionality to accept and check the precision and scale in Decimal Datatypes. * add neg test * add custom ErrorHandler * Added functionality to DayTimeIntervalType datatype to accept parameters * Added functionality to DayTimeIntervalType datatype to accept parameters * return summarized error report * replace dataframe to dict for return obj * Changed checks input datatype to custom named tuple from the existing list. 
Also started changing the pyspark checks to include more datatypes * refactor * introduce error categories * rename error categories * fixing bug in schema.dtype.check * fixing error category to by dynamic * Added checks for each datatype in test cases. Reduced the code redundancy of the code in test file. Refactored the name of custom datatype object for checks. * error_handler pass through * add ErrorHandler to column api * removed SchemaErrors since we now aggregate in errorHandler * fixing dict keys * Added Decorator to raise TypeError in case of unexpected input type for the check function. * replace validator with report_errors * cleaning debugs * Support DataModels and Field * Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop * Fix to run using the class schema type * use alias types * clean up * add new typing for pyspark.sql * Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop * Added changes to support raising error for use of datatype not supported by the check and support for map and array type. 
* support bare dtypes for DataFrameModel * remove resolved TODOs and breakpoints * change to bare types * use spark types instead of bare types * using SchemaErrorReason instead of hardcode in container * fixing an issue with error reason codes * minor fix * fixing checks and errors in pyspark * Changes include the following: 1) Updated dtypes test functionality to make it more readable 2) Changes in accessor tests to support the new functionality 3) Changes in engine class to conform to check class everywhere else * enhancing dataframeschema and model classes * Changes to remove the pandas dependency * Refactoring of the checks test functions * Fixing the test case breaking * Isort and Black formatting * Container Test function failure * Isort and black linting * Changes to remove the pandas dependency * Refactoring of the checks test functions * Isort and black linting * Added Changes to refactor the checks class. Fixes to some test cases failures. * Removing breakpoint * fixing raise error * adding metadata dict * Removing the reference of pandas from docstrings * Removing redundant code block in utils * Changes to return dataframe with errors property * add accessor for errorHandler * support errors access on pyspark.sql * updating pyspark error tcs * fixing model test cases * adjusting errors to use pandera.errors * use accessor instead of dict * revert to develop * Removal of imports which are not needed and improved test case. * setting independent pyspark import * pyspark imports * revert comments * store and retrieve metadata at schema levels * adding metadata support * Added changes to support parameter based run. 
1) Added parameters.yaml file to hold the configurations 2) Added code in utility to read the config 3) Updated the test cases to support the parameter based run 4) Moved pyspark decorators to a new file decorators.py in backend 5) Type fix in get_matadata property in container.py file * Changing the default value in config * change to consistent interface * cleaning api/pyspark * backend and tests * adding setter on errors accessors for pyspark * reformatting error dict * doc * run black linter Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * fix lint Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * update pylintrc Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> --------- Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com> Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com> Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * Support native PySpark.sql on Pandera (#1213) * fixing check flow * setting apply fn * add sub sample functionality * adjusting test case against common attributes * need apply for column level check * adding builtin checks for pyspark * adding checks for pyspark df * getting check registered * fixing a bug a in error handling for schema check * check_name validation fixed * implementing dtype checks for pyspark * updating error msg * fixing dtype reason_code * updating builtin checks for pyspark * registeration * Implementation of checks import and spark columns information check * enhancing __call__, checks classes and builtin_checks * delete junk files * Changes to fix the implemtation of checks. 
Changed Apply function to send list with dataframe and column name, builtin function registers functions with lists which inculdes the dataframe * extending pyspark checks * Fixed builtin check bug and added test for supported builtin checks for pyspark * add todos * bydefault validate all checks * fixing issue with sqlctx * add dtypes pytests * setting up schema * add negative and positive tests * add fixtures and refactor tests * generalize spark_df func * refactor to use conftest * use conftest * add support for decimal dtype and fixing other types * Added new Datatypes support for pyspark, test cases for dtypes pyspark, created test file for error * refactor ArraySchema * rename array to column.py * 1) Changes in test cases to look for summarised error raise instead of fast fail, since default behaviour is changed to summarised. 2) Added functionality to accept and check the precision and scale in Decimal Datatypes. * add neg test * add custom ErrorHandler * Added functionality to DayTimeIntervalType datatype to accept parameters * Added functionality to DayTimeIntervalType datatype to accept parameters * return summarized error report * replace dataframe to dict for return obj * Changed checks input datatype to custom named tuple from the existing list. Also started changing the pyspark checks to include more datatypes * refactor * introduce error categories * rename error categories * fixing bug in schema.dtype.check * fixing error category to by dynamic * Added checks for each datatype in test cases. Reduced the code redundancy of the code in test file. Refactored the name of custom datatype object for checks. * error_handler pass through * add ErrorHandler to column api * removed SchemaErrors since we now aggregate in errorHandler * fixing dict keys * Added Decorator to raise TypeError in case of unexpected input type for the check function. 
* replace validator with report_errors * cleaning debugs * Support DataModels and Field * Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop * Fix to run using the class schema type * use alias types * clean up * add new typing for pyspark.sql * Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop * Added changes to support raising error for use of datatype not supported by the check and support for map and array type. * support bare dtypes for DataFrameModel * remove resolved TODOs and breakpoints * change to bare types * use spark types instead of bare types * using SchemaErrorReason instead of hardcode in container * fixing an issue with error reason codes * minor fix * fixing checks and errors in pyspark * Changes include the following: 1) Updated dtypes test functionality to make it more readable 2) Changes in accessor tests to support the new functionality 3) Changes in engine class to conform to check class everywhere else * enhancing dataframeschema and model classes * Changes to remove the pandas dependency * Refactoring of the checks test functions * Fixing the test case breaking * Isort and Black formatting * Container Test function failure * Isort and black linting * Changes to remove the pandas dependency * Refactoring of the checks test functions * Isort and black linting * Added Changes to refactor the checks class. Fixes to some test cases failures. 
* Removing breakpoint * fixing raise error * adding metadata dict * Removing the reference of pandas from docstrings * Removing redundant code block in utils * Changes to return dataframe with errors property * add accessor for errorHandler * support errors access on pyspark.sql * updating pyspark error tcs * fixing model test cases * adjusting errors to use pandera.errors * use accessor instead of dict * revert to develop * Removal of imports which are not needed and improved test case. * setting independent pyspark import * pyspark imports * revert comments * store and retrieve metadata at schema levels * adding metadata support * Added changes to support parameter based run. 1) Added parameters.yaml file to hold the configurations 2) Added code in utility to read the config 3) Updated the test cases to support the parameter based run 4) Moved pyspark decorators to a new file decorators.py in backend 5) Type fix in get_matadata property in container.py file * Changing the default value in config * change to consistent interface * Changes to remove config yaml and introduce environment variables for parameterized runs * cleaning api/pyspark * backend and tests * adding setter on errors accessors for pyspark * reformatting error dict * Changes to remove config yaml and introduce environment variables for parameterized runs * Changes to rename the config object and call only in utils.py * Fixing merge conflict issue * Updating the test cases to support new checks types * Added individualized test for each configuration type. * Removing unnecessary prints * The changes include the following: 1) Fixed test case for validating the environment variable 2) Improved docstrings for test cases and few test cases asserts * Fix reference to with wrong key in test_pyspark_schema_data_checks * minor change * Added Support for docstring substitution method. 
* Removing an extra indent * Removing commented docstring substitution from __new__ method * remove union * cleaning * Feature to add metadata dictionary for pandas schema * Added test to check the docstring substitution decorator * Added test to check the docstring substitution decorator * Feature to add metadata dictionary for pandas schema * Changes to ensure only pandas run does not import pyspark dependencies * Fix of imports for pandas and pyspark for separation * Rename the function from pyspark to pandas * black lint and isort * black lint and isort * Fixes of pyliny issue and suppression wherever necessary * Fixes of mypy failures and redone black linting post changes. * Added new test cases, removed redundant codes and black lint. * Fixed the doc strings, added functionality and test for custom checks * add rst for pyspark.sql * removing rst * Renamed check name and Fixed pylint and mypy issues * add rst for pyspark.sql * Fixed the doc strings, added functionality and test for custom checks * removing rst * Renamed check name and Fixed pylint and mypy issues * add rst for pyspark.sql * Rename for environment variable key name * removing rst * Black lint * Removed daytime interval type * refactor * override pyspark patching of __class_getitem__ Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * fixiing mypy error * lint fixes * lint fixes * fixing more lint and type issues * fixing mypy issues * fixing doctest * doctest Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * fixing doctest Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * adding doctest:metadata for pandas container classes Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * doctest * fixing doctest Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * fixing rst Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * black formatting Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * fixing str repr for 
DataFrameSchema across rst * add ps.DataFrame * fixing tests * fix lint Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * use full class name in pandas accessor Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * use os.environ instead of parameters.yaml Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * simplify config Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> --------- Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com> Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com> Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * support multiple pyspark versions (#1221) support pyspark 3.2 and 3.3 for string representation of dtypes. --------- Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com> Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com> Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * Refactors for dead code (#1229) * updating builtin checks for pyspark * registeration * Implementation of checks import and spark columns information check * enhancing __call__, checks classes and builtin_checks * delete junk files * Changes to fix the implemtation of checks. 
* Removing an extra indent * Removing commented docstring substitution from __new__ method * remove union * cleaning * Feature to add metadata dictionary for pandas schema * Added test to check the docstring substitution decorator * Added test to check the docstring substitution decorator * Feature to add metadata dictionary for pandas schema * Changes to ensure only pandas run does not import pyspark dependencies * Fix of imports for pandas and pyspark for separation * Rename the function from pyspark to pandas * black lint and isort * black lint and isort * Fixes of pyliny issue and suppression wherever necessary * Fixes of mypy failures and redone black linting post changes. * Added new test cases, removed redundant codes and black lint. * Fixed the doc strings, added functionality and test for custom checks * add rst for pyspark.sql * removing rst * Renamed check name and Fixed pylint and mypy issues * add rst for pyspark.sql * Fixed the doc strings, added functionality and test for custom checks * removing rst * Renamed check name and Fixed pylint and mypy issues * add rst for pyspark.sql * Rename for environment variable key name * removing rst * Black lint * Removed daytime interval type * refactor * override pyspark patching of __class_getitem__ Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * fixiing mypy error * lint fixes * lint fixes * fixing more lint and type issues * fixing mypy issues * fixing doctest * doctest Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * fixing doctest Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * adding doctest:metadata for pandas container classes Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * doctest * Fix to support pyspark 3.2 and 3.3 both. The string representation of datatype changed in 3.2 and 3.3. Fix ensure both versions are supported. 
* Black Lint * fixing doctest Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * fixing rst Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * black formatting Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * fixing str repr for DataFrameSchema across rst * add ps.DataFrame * fixing tests * fix lint Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * use full class name in pandas accessor Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * use os.environ instead of parameters.yaml Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * simplify config Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * merge with develop * Black Lint * refactor Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * lint fix Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * refactor * remove Column class due to redundancy Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * linting Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> --------- Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com> Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com> Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * Adding readme for pyspark_sql enhancements (#1236) * add dtypes pytests * setting up schema * add negative and positive tests * add fixtures and refactor tests * generalize spark_df func * refactor to use conftest * use conftest * add support for decimal dtype and fixing other types * Added new Datatypes support for pyspark, test cases for dtypes pyspark, created test file for error * refactor ArraySchema * rename array to column.py * 1) Changes in test cases to look for summarised error raise instead of fast fail, since default behaviour is changed to summarised. 
2) Added functionality to accept and check the precision and scale in Decimal Datatypes. * add neg test * add custom ErrorHandler * Added functionality to DayTimeIntervalType datatype to accept parameters * Added functionality to DayTimeIntervalType datatype to accept parameters * return summarized error report * replace dataframe to dict for return obj * Changed checks input datatype to custom named tuple from the existing list. Also started changing the pyspark checks to include more datatypes * refactor * introduce error categories * rename error categories * fixing bug in schema.dtype.check * fixing error category to by dynamic * Added checks for each datatype in test cases. Reduced the code redundancy of the code in test file. Refactored the name of custom datatype object for checks. * error_handler pass through * add ErrorHandler to column api * removed SchemaErrors since we now aggregate in errorHandler * fixing dict keys * Added Decorator to raise TypeError in case of unexpected input type for the check function. * replace validator with report_errors * cleaning debugs * Support DataModels and Field * Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop * Fix to run using the class schema type * use alias types * clean up * add new typing for pyspark.sql * Added Decorator to raise TypeError in case of unexpected input type for the check function. Merged with Develop * Added changes to support raising error for use of datatype not supported by the check and support for map and array type. 
* support bare dtypes for DataFrameModel * remove resolved TODOs and breakpoints * change to bare types * use spark types instead of bare types * using SchemaErrorReason instead of hardcode in container * fixing an issue with error reason codes * minor fix * fixing checks and errors in pyspark * Changes include the following: 1) Updated dtypes test functionality to make it more readable 2) Changes in accessor tests to support the new functionality 3) Changes in engine class to conform to check class everywhere else * enhancing dataframeschema and model classes * Changes to remove the pandas dependency * Refactoring of the checks test functions * Fixing the test case breaking * Isort and Black formatting * Container Test function failure * Isort and black linting * Changes to remove the pandas dependency * Refactoring of the checks test functions * Isort and black linting * Added Changes to refactor the checks class. Fixes to some test cases failures. * Removing breakpoint * fixing raise error * adding metadata dict * Removing the reference of pandas from docstrings * Removing redundant code block in utils * Changes to return dataframe with errors property * add accessor for errorHandler * support errors access on pyspark.sql * updating pyspark error tcs * fixing model test cases * adjusting errors to use pandera.errors * use accessor instead of dict * revert to develop * Removal of imports which are not needed and improved test case. * setting independent pyspark import * pyspark imports * revert comments * store and retrieve metadata at schema levels * adding metadata support * Added changes to support parameter based run. 
1) Added parameters.yaml file to hold the configurations 2) Added code in utility to read the config 3) Updated the test cases to support the parameter based run 4) Moved pyspark decorators to a new file decorators.py in backend 5) Type fix in get_matadata property in container.py file * Changing the default value in config * change to consistent interface * Changes to remove config yaml and introduce environment variables for parameterized runs * cleaning api/pyspark * backend and tests * adding setter on errors accessors for pyspark * reformatting error dict * Changes to remove config yaml and introduce environment variables for parameterized runs * Changes to rename the config object and call only in utils.py * Fixing merge conflict issue * Updating the test cases to support new checks types * Added individualized test for each configuration type. * Removing unnecessary prints * The changes include the following: 1) Fixed test case for validating the environment variable 2) Improved docstrings for test cases and few test cases asserts * Fix reference to with wrong key in test_pyspark_schema_data_checks * minor change * Added Support for docstring substitution method. * Removing an extra indent * Removing commented docstring substitution from __new__ method * remove union * cleaning * Feature to add metadata dictionary for pandas schema * Added test to check the docstring substitution decorator * Added test to check the docstring substitution decorator * Feature to add metadata dictionary for pandas schema * Changes to ensure only pandas run does not import pyspark dependencies * Fix of imports for pandas and pyspark for separation * Rename the function from pyspark to pandas * black lint and isort * black lint and isort * Fixes of pyliny issue and suppression wherever necessary * Fixes of mypy failures and redone black linting post changes. * Added new test cases, removed redundant codes and black lint. 
* Fixed the doc strings, added functionality and test for custom checks * add rst for pyspark.sql * removing rst * Renamed check name and Fixed pylint and mypy issues * add rst for pyspark.sql * Fixed the doc strings, added functionality and test for custom checks * removing rst * Renamed check name and Fixed pylint and mypy issues * add rst for pyspark.sql * Rename for environment variable key name * removing rst * Black lint * Removed daytime interval type * refactor * override pyspark patching of __class_getitem__ Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * fixiing mypy error * lint fixes * lint fixes * fixing more lint and type issues * fixing mypy issues * fixing doctest * doctest Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * fixing doctest Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * adding doctest:metadata for pandas container classes Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * doctest * Fix to support pyspark 3.2 and 3.3 both. The string representation of datatype changed in 3.2 and 3.3. Fix ensure both versions are supported. 
* Black Lint * fixing doctest Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * fixing rst Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * black formatting Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * fixing str repr for DataFrameSchema across rst * add ps.DataFrame * fixing tests * fix lint Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * use full class name in pandas accessor Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * use os.environ instead of parameters.yaml Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * simplify config Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * merge with develop * Black Lint * refactor Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * lint fix Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * readme * refine example * about granular controls * native pyspark.sql documentation Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * docs Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * Add to index * add to index * Delete oryx-build-commands.txt * fix index * revert * fix index * Update pyspark_sql.rst * Update pyspark_sql.rst * update error print * Update pyspark_sql.rst * Update pyspark_sql.rst * adding import in code block Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> * Update pyspark_sql.rst * Update pyspark_sql.rst * Update pyspark_sql.rst * clean up docs Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> --------- Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com> Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com> Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com> * fix metadata arg in SeriesSchema Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> * improve coverage * 
improve coverage in extensions --------- Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> Signed-off-by: Neeraj Malhotra <neeraj_malhotra@mckinsey.com> Co-authored-by: Neeraj Malhotra <52220398+NeerajMalhotra-QB@users.noreply.github.com> Co-authored-by: jaskaransinghsidana <jaskaran_singh_sidana@mckinsey.com> Co-authored-by: jaskaransinghsidana <112083212+jaskaransinghsidana@users.noreply.github.com>

NeerajMalhotra-QB and others added 30 commits March 22, 2023 11:08

init

b9eb217

init

4a8937f

init structure

c5da3e4

disable imports'

2db1433

adding structure for pyspark

66d0243

setting dependency

08488d3

update class import

ac31fe1

fixing datatype cls

c56dca0

adding dtypes for pyspark

7c961b4

keep only bool type

5510e66

remove pydantic schema

bfd9a05

register pyspark data types

40f10af

add check method

519dc64

Merge pull request #3 from NeerajMalhotra-QB/pyspark_engine

8f52456

Pyspark engine

updating equivalents to native types

8bccf01

Merge pull request #4 from NeerajMalhotra-QB/pyspark_engine

ed172a9

updating equivalents to native types

update column schema

1ff202f

refactor array to column

b300380

rename array to column_schema

717506b

remove pandas imports

3e9b009

remove index and multiindex functionality

c735670

adding pydantic schema class

29ca61f

adding model components

524756e

add model config

a674495

define pyspark BaseConfig class

76924ee

removing index and multi-indexes

ebd4e4a

remove modify schema

40a3903

Pyspark backend components, base, container, accessor, test file for accessor

a271164

Pyspark backend components, base, container, accessor, test file for accessor

a24c1c4

Merge pull request #5 from NeerajMalhotra-QB/pyspark_col

90707dc

implementing component and container changes for pyspark dataframes

NeerajMalhotra-QB and others added 11 commits May 9, 2023 09:14

Merge pull request #56 from NeerajMalhotra-QB/feature_kill_switch

6cde57c

Feature kill switch

change to consistent interface

5b01e87

Merge pull request #57 from NeerajMalhotra-QB/validate

1269eab

change to consistent interface

cleaning api/pyspark

ae86903

backend and tests

6e7cb8a

Merge pull request #59 from NeerajMalhotra-QB/refactor

2d706a0

removing dead code and refactoring as needed

adding setter on errors accessors for pyspark

36d648b

Merge pull request #60 from NeerajMalhotra-QB/getter_for_error

9d7fc7d

adding setter for errors object in pyspark accessor

reformatting error dict

dd98d4c

Merge pull request #61 from NeerajMalhotra-QB/pretty_errs

97ac681

reformatting error dict

doc

f4a6b3e

NeerajMalhotra-QB self-assigned this May 12, 2023

NeerajMalhotra-QB added the enhancement New feature or request label May 12, 2023

NeerajMalhotra-QB requested a review from cosmicBboy May 12, 2023 17:57

cosmicBboy added 2 commits May 12, 2023 15:02

merge pyspark with main

104ca7b

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

run black linter

84a4ee4

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

cosmicBboy changed the base branch from main to pyspark May 12, 2023 19:22

fix lint

8f85a47

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

cosmicBboy changed the base branch from pyspark to dev May 12, 2023 20:43

cosmicBboy closed this May 12, 2023

cosmicBboy reopened this May 12, 2023

update pylintrc

534e6ce

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

cosmicBboy merged commit 74be58c into unionai-oss:dev May 13, 2023
4 of 41 checks passed

NeerajMalhotra-QB commented May 12, 2023 • edited

cosmicBboy commented May 12, 2023

NeerajMalhotra-QB commented May 12, 2023

cosmicBboy commented May 12, 2023

codecov bot commented May 12, 2023 • edited

cosmicBboy commented May 13, 2023

NeerajMalhotra-QB commented May 15, 2023
