
Bug Report - Correlation.corr() fails when input DataFrame is empty in Spark #1722

Open · 3 tasks done
minseokim12 opened this issue Mar 6, 2025 · 0 comments
Labels: spark ⚡ PySpark features!

Current Behaviour

When using ydata-profiling with Spark, if the dataset is empty after filtering down to the numeric columns, an exception is raised because Correlation.corr() does not handle empty DataFrames. The failure occurs in _compute_spark_corr_natively, when the DataFrame is converted into a feature vector and the correlation is computed.

The code does not check whether df_vector is empty before calling Correlation.corr(), so Spark raises a RuntimeException and the process crashes instead of handling the case gracefully.
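
For context, the failing call can be reproduced in isolation. The sketch below is illustrative (the column names, the assembler setup, and an active SparkSession named spark are my assumptions, not ydata-profiling internals); on my setup the last line raises instead of returning an empty result.

# Minimal sketch of the failing path: Correlation.corr() on a
# zero-row vector DataFrame. Assumes an active SparkSession `spark`.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# An empty DataFrame with two numeric columns
df_empty = spark.createDataFrame([], "a double, b double")

assembler = VectorAssembler(inputCols=["a", "b"], outputCol="features")
df_vector = assembler.transform(df_empty).select("features")

# With zero rows, Spark cannot compute the correlation matrix and
# raises an exception rather than handling the empty input.
Correlation.corr(df_vector, "features")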

Expected Behaviour

If the DataFrame is empty after filtering, Correlation.corr() should be skipped gracefully instead of raising an exception.
The function _compute_spark_corr_natively should check if df_vector is empty before calling Correlation.corr().
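
A minimal sketch of the kind of guard I have in mind (the function name and return convention are illustrative, not the actual ydata-profiling code; DataFrame.isEmpty() requires pyspark >= 3.3):

# Hypothetical guard that skips correlation for empty input.
from pyspark.ml.stat import Correlation
from pyspark.sql import DataFrame

def corr_or_none(df_vector: DataFrame, vector_col: str = "features"):
    """Return the Pearson correlation matrix, or None for empty input."""
    if df_vector.isEmpty():
        # Nothing to correlate; skip instead of letting Spark raise.
        return None
    return Correlation.corr(df_vector, vector_col).head()[0]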

Data Description

The error occurs when the Spark DataFrame contains columns whose cells are all empty, or that are ~98–99% missing.
(I didn't see this error when I converted the same data to a pandas DataFrame.)
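
A toy frame with this shape (illustrative, not my real data) would be:

# One column entirely null, one ~99% missing, as described above.
rows = [(None, 1.0)] + [(None, None)] * 99
df_sparse = spark.createDataFrame(rows, "a double, b double")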

Code that reproduces the bug

# Imports assumed by this snippet
from pyspark.sql.functions import col, to_json
from pyspark.sql.types import (
    ArrayType, DateType, MapType, StructType, TimestampType,
)
from ydata_profiling import ProfileReport

# Sample 10% of the table
df = spark.sql(
    "select * from @@@@.@@@@ where rand() < 0.1"
).cache()
# Type casting 1: render date/timestamp columns as strings
df_casted = df.select(
    [
        (
            col(field.name).cast("string").alias(field.name)
            if isinstance(field.dataType, (DateType, TimestampType))
            else col(field.name)
        )
        for field in df.schema
    ]
)
# Type casting 2: serialize complex (array/map/struct) columns to JSON
complex_columns = [
    field.name
    for field in df.schema.fields
    if isinstance(field.dataType, (ArrayType, MapType, StructType))
]
for col_name in complex_columns:
    df_casted = df_casted.withColumn(col_name, to_json(col(col_name)))

# app_name is defined elsewhere in my job
profile = ProfileReport(df_casted, title=app_name, explorative=True)
profile.to_file("/tmp/ydata.html")
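
Until this is handled upstream, the workaround I'm using is to drop columns that contain no values before profiling, so the correlation step never sees an empty vector DataFrame (a sketch under the same session and imports as above; the filtering helper is mine, not part of ydata-profiling):

# Hedged workaround: keep only columns with at least one non-null value.
from pyspark.sql import functions as F

counts = df_casted.select(
    [F.count(F.col(c)).alias(c) for c in df_casted.columns]  # non-null counts
).first()
non_empty = [c for c in df_casted.columns if counts[c] > 0]

profile = ProfileReport(df_casted.select(non_empty), title=app_name, explorative=True)
profile.to_file("/tmp/ydata.html")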

pandas-profiling version

v2.2.3

Dependencies

dependencies:
  - bzip2=1.0.8
  - ca-certificates=2025.1.31
  - conda-pack=0.8.1
  - libffi=3.4.2
  - liblzma=5.6.4
  - libsqlite=3.49.1
  - libzlib=1.3.1
  - ncurses=6.5
  - openssl=3.4.1
  - pip=25.0.1
  - pyspark=3.5.3
  - python=3.9.21
  - readline=8.2
  - setuptools=75.8.2
  - tk=8.6.13
  - wheel=0.45.1
  - pip:
      - executing==2.2.0
      - fastjsonschema==2.21.1
      - great-expectations==0.18.22
      - jupyter-events==0.12.0
      - notebook-shim==0.2.4
      - pandocfilters==1.5.1
      - phik==0.12.4
      - pydantic-core==2.27.2
      - python-json-logger==3.2.1
      - ruamel-yaml-clib==0.2.12
      - soupsieve==2.6
      - stack-data==0.6.3
      - tzdata==2025.1
      - ydata-profiling==4.12.2

OS

macOS

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.
fabclmnt added spark ⚡ PySpark features! and removed needs-triage labels Mar 10, 2025