Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race detected in Parquet writer #14598

Merged
merged 4 commits into from
Dec 9, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Dec 8, 2023

Description

While investigating #14597 it was found that there was a race introduced by #14000. Adding a sync between invocations of a block reduce resolves the error.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner December 8, 2023 17:59
Copy link

copy-pr-bot bot commented Dec 8, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 8, 2023
@davidwendt davidwendt added bug Something isn't working 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels Dec 8, 2023
@davidwendt
Copy link
Contributor

/ok to test

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -206,7 +206,8 @@ void __device__ calculate_frag_size(frag_init_state_s* const s, int t)
}
}
__syncthreads();
auto const total_len = block_reduce(reduce_storage).Sum(len);
auto const total_len = block_reduce(reduce_storage).Sum(len);
__syncthreads();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I hesitated to bring this up since the CI didn't complain. Generally, a sync in-between is required when the same temp storage is used for different reductions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was worried too, but dug into the Sum code and saw that there were syncthreads inside and convinced myself it was ok :(

Comment on lines 208 to 210
__syncthreads();
auto const total_len = block_reduce(reduce_storage).Sum(len);
auto const total_len = block_reduce(reduce_storage).Sum(len);
__syncthreads();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we really need two sync like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps the first one is unecessary, but now I'm scared to change it 😬

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, maybe we should remove it if racecheck does not complain

Copy link
Contributor Author

@etseidl etseidl Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try removing it and see...

Yep, no racecheck problems (beyond the nvcomp and decode ones) without the first sync.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is racecheck?

Copy link
Contributor

@vuule vuule Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try removing it and see

trying this on my end as well

What is racecheck?

racecheck tool in compute-sanitizer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no racecheck errors, and, looking at the code, no shared data is modified before the reductions. IMHO we should be good to remove the first sync.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for looking into this issue promptly!
Hopefully that's the last of the ManyFragments failures.

@PointKernel
Copy link
Member

/ok to test

@vuule
Copy link
Contributor

vuule commented Dec 9, 2023

/merge

@rapids-bot rapids-bot bot merged commit 899e392 into rapidsai:branch-24.02 Dec 9, 2023
67 checks passed
@etseidl etseidl deleted the block_reduce_sync branch December 9, 2023 01:02
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
While investigating rapidsai#14597 it was found that there was a race introduced by rapidsai#14000. Adding a sync between invocations of a block reduce resolves the error.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#14598
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

5 participants