
Bug Report - Correlation.corr() fails when input DataFrame is empty in Spark #1722

Open · 3 tasks done
minseokim12 opened this issue Mar 6, 2025 · 0 comments
Labels: spark ⚡ PySpark features!

Current Behaviour

When using ydata-profiling with Spark, if the dataset is empty after filtering down to the numeric columns, an exception is raised because Correlation.corr() does not handle empty DataFrames. The failure occurs in _compute_spark_corr_natively, when the DataFrame is converted into a feature vector and the correlation is computed.

The code does not check whether df_vector is empty before calling Correlation.corr(), so Spark raises a RuntimeException and the process crashes instead of handling the case gracefully.
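
For context, the failing call can be reproduced in isolation. The sketch below is illustrative (the column names, the assembler setup, and an active SparkSession named spark are my assumptions, not ydata-profiling internals); on my setup the last line raises instead of returning an empty result.

# Minimal sketch of the failing path: Correlation.corr() on a
# zero-row vector DataFrame. Assumes an active SparkSession `spark`.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# An empty DataFrame with two numeric columns
df_empty = spark.createDataFrame([], "a double, b double")

assembler = VectorAssembler(inputCols=["a", "b"], outputCol="features")
df_vector = assembler.transform(df_empty).select("features")

# With zero rows, Spark cannot compute the correlation matrix and
# raises an exception rather than handling the empty input.
Correlation.corr(df_vector, "features")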

Expected Behaviour

If the DataFrame is empty after filtering, Correlation.corr() should be skipped gracefully instead of raising an exception.
The function _compute_spark_corr_natively should check if df_vector is empty before calling Correlation.corr().
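
A minimal sketch of the kind of guard I have in mind (the function name and return convention are illustrative, not the actual ydata-profiling code; DataFrame.isEmpty() requires pyspark >= 3.3):

# Hypothetical guard that skips correlation for empty input.
from pyspark.ml.stat import Correlation
from pyspark.sql import DataFrame

def corr_or_none(df_vector: DataFrame, vector_col: str = "features"):
    """Return the Pearson correlation matrix, or None for empty input."""
    if df_vector.isEmpty():
        # Nothing to correlate; skip instead of letting Spark raise.
        return None
    return Correlation.corr(df_vector, vector_col).head()[0]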

Data Description

The error occurs when the Spark DataFrame contains columns whose cells are all empty, or that are ~98–99% missing.
(I didn't see this error when I converted the same data to a pandas DataFrame.)
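
A toy frame with this shape (illustrative, not my real data) would be:

# One column entirely null, one ~99% missing, as described above.
rows = [(None, 1.0)] + [(None, None)] * 99
df_sparse = spark.createDataFrame(rows, "a double, b double")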

Code that reproduces the bug

# Imports assumed by this snippet
from pyspark.sql.functions import col, to_json
from pyspark.sql.types import (
    ArrayType, DateType, MapType, StructType, TimestampType,
)
from ydata_profiling import ProfileReport

# Sample 10% of the table
df = spark.sql(
    "select * from @@@@.@@@@ where rand() < 0.1"
).cache()
# Type casting 1: render date/timestamp columns as strings
df_casted = df.select(
    [
        (
            col(field.name).cast("string").alias(field.name)
            if isinstance(field.dataType, (DateType, TimestampType))
            else col(field.name)
        )
        for field in df.schema
    ]
)
# Type casting 2: serialize complex (array/map/struct) columns to JSON
complex_columns = [
    field.name
    for field in df.schema.fields
    if isinstance(field.dataType, (ArrayType, MapType, StructType))
]
for col_name in complex_columns:
    df_casted = df_casted.withColumn(col_name, to_json(col(col_name)))

# app_name is defined elsewhere in my job
profile = ProfileReport(df_casted, title=app_name, explorative=True)
profile.to_file("/tmp/ydata.html")
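
Until this is handled upstream, the workaround I'm using is to drop columns that contain no values before profiling, so the correlation step never sees an empty vector DataFrame (a sketch under the same session and imports as above; the filtering helper is mine, not part of ydata-profiling):

# Hedged workaround: keep only columns with at least one non-null value.
from pyspark.sql import functions as F

counts = df_casted.select(
    [F.count(F.col(c)).alias(c) for c in df_casted.columns]  # non-null counts
).first()
non_empty = [c for c in df_casted.columns if counts[c] > 0]

profile = ProfileReport(df_casted.select(non_empty), title=app_name, explorative=True)
profile.to_file("/tmp/ydata.html")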

pandas-profiling version

v2.2.3

Dependencies

dependencies:
  - bzip2=1.0.8
  - ca-certificates=2025.1.31
  - conda-pack=0.8.1
  - libffi=3.4.2
  - liblzma=5.6.4
  - libsqlite=3.49.1
  - libzlib=1.3.1
  - ncurses=6.5
  - openssl=3.4.1
  - pip=25.0.1
  - pyspark=3.5.3
  - python=3.9.21
  - readline=8.2
  - setuptools=75.8.2
  - tk=8.6.13
  - wheel=0.45.1
  - pip:
      - executing==2.2.0
      - fastjsonschema==2.21.1
      - great-expectations==0.18.22
      - jupyter-events==0.12.0
      - notebook-shim==0.2.4
      - pandocfilters==1.5.1
      - phik==0.12.4
      - pydantic-core==2.27.2
      - python-json-logger==3.2.1
      - ruamel-yaml-clib==0.2.12
      - soupsieve==2.6
      - stack-data==0.6.3
      - tzdata==2025.1
      - ydata-profiling==4.12.2

OS

macOS

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.
fabclmnt added spark ⚡ PySpark features! and removed needs-triage labels Mar 10, 2025