[pyspark] Validate the validation indicator column type. #11535


Open: wants to merge 2 commits into master

Conversation

trivialfis (Member)

Closes #11496.

@trivialfis trivialfis changed the title [pyspark] Validation the validation indicator column type. [pyspark] Validate the validation indicator column type. Jun 26, 2025
trivialfis (Member, Author)

cc @wbo4958 @WeichenXu123

@trivialfis trivialfis requested a review from Copilot June 27, 2025 04:41

Copilot AI left a comment


Pull Request Overview

This pull request introduces validation to ensure that the validation indicator column is of boolean type. Key changes include:

  • Addition of a new test (test_valid_type) in the Spark test suite to verify that non-boolean indicator values result in an error.
  • Introduction of the helper function is_bool_column in the Spark data module.
  • Update of the make_blob function to raise a TypeError when the validation indicator column is not boolean.
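The change described above can be sketched as follows. This is a hypothetical illustration, not the actual code from `python-package/xgboost/spark/data.py`; the function `check_indicator` and its error message are made up for the example, and only `is_bool_column` is named in the PR.

```python
import numpy as np


def is_bool_column(dtype) -> bool:
    # NumPy and classic pandas boolean dtypes report kind "b";
    # pandas' nullable BooleanDtype reports the name "boolean".
    return getattr(dtype, "kind", None) == "b" or getattr(dtype, "name", "") == "boolean"


def check_indicator(values: np.ndarray) -> None:
    # Mirrors the behavior described above: reject a non-boolean
    # validation indicator column with a TypeError.
    if not is_bool_column(values.dtype):
        raise TypeError(
            f"The validation indicator column must be boolean, got {values.dtype}."
        )
```

With this check in place, passing an integer 0/1 indicator column fails fast with a clear `TypeError` instead of surfacing later as an opaque Spark barrier-task failure.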

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
tests/test_distributed/test_with_spark/test_spark_local.py | Added a negative test case to confirm the validation indicator column type check
python-package/xgboost/spark/data.py | Added the is_bool_column helper and updated make_blob to raise an error if the column type is invalid
Comments suppressed due to low confidence (2)

tests/test_distributed/test_with_spark/test_spark_local.py:1311

  • [nitpick] Consider adding a brief inline comment to explain why a generic Exception is expected in this test case due to the variability of Spark's error types.
    def test_valid_type(self, spark: SparkSession) -> None:

python-package/xgboost/spark/data.py:75

  • [nitpick] It may be helpful to include an inline comment clarifying the dual check using both cuDF and pandas boolean type verifications to improve maintainability.
            if not is_bool_column(col.dtype):


Successfully merging this pull request may close these issues.

XGBoost Stage failed because barrier task ResultTask