Bug Report - Correlation.corr() fails when input DataFrame is empty in Spark #1722
Labels: spark ⚡, PySpark features
Current Behaviour
When using ydata-profiling with Spark, if the dataset is empty after the numeric columns are filtered, an exception is raised because Correlation.corr() does not handle empty DataFrames. The issue occurs in _compute_spark_corr_natively, which converts the DataFrame into a feature vector and then computes the correlation matrix.
The code does not check whether df_vector is empty before calling Correlation.corr(), so Spark raises a RuntimeException and the profiling run crashes instead of handling the case gracefully (see the reproduction sketch below).
Expected Behaviour
If the DataFrame is empty after filtering, Correlation.corr() should be skipped gracefully instead of raising an exception.
The function _compute_spark_corr_natively should check whether df_vector is empty before calling Correlation.corr(), as in the sketch after this list.
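A guard along these lines would work. This is a minimal sketch, not the actual ydata-profiling source: the function name follows the description above, while the signature and the "features" output column are assumptions.

```python
from pyspark.ml.stat import Correlation
from pyspark.sql import DataFrame


def _compute_spark_corr_natively(df_vector: DataFrame, method: str = "pearson"):
    """Sketch: compute the correlation matrix, skipping empty inputs."""
    # head(1) avoids a full count() and works on all Spark versions;
    # on Spark >= 3.3, df_vector.isEmpty() is an equivalent check.
    if len(df_vector.head(1)) == 0:
        return None  # nothing to correlate; the caller can skip this metric

    # "features" is the assumed name of the assembled vector column.
    return Correlation.corr(df_vector, "features", method)
```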
Data Description
The error occurs when the Spark DataFrame has columns whose cells are all empty, or columns with roughly 98-99% missing values.
(I did not see this error when I converted the data to a pandas DataFrame.)
Code that reproduces the bug
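A minimal sketch that follows the failure path described under Current Behaviour, assuming a local Spark session and hypothetical columns a and b whose cells are all null:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Numeric columns whose cells are all null, as in the Data Description.
df = spark.createDataFrame(
    [(None, None), (None, None)],
    schema="a double, b double",
)

# Null rows must be dropped before assembling the feature vector
# (VectorAssembler rejects nulls by default), which leaves no rows at all.
df_clean = df.dropna()
assembler = VectorAssembler(inputCols=["a", "b"], outputCol="features")
df_vector = assembler.transform(df_clean).select("features")

# df_vector is empty, so Spark raises an exception here (reported above
# as a RuntimeException) instead of returning a correlation matrix.
Correlation.corr(df_vector, "features")
```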
pandas-profiling version
v2.2.3
Dependencies
OS
macOS
Checklist