Automate include grouping order in .clang-format #14993

harrism · 2024-02-07T08:35:04Z

Description

This uses the IncludeCategories settings in .clang-format to attempt to enforce our documented #include order in libcudf. See https://docs.rapids.ai/api/libcudf/stable/developer_guide

I realize that there was a previous attempt at this by @bdice that met with some resistance. Reading it, I wouldn't say it was vetoed; rather, reviewers requested something much simpler. I have a few reasons to attempt this again.

1. To make a separate task much easier. We are undertaking a refactoring of RMM that will replace rmm::mr::device_memory_resource* with rmm::device_async_resource_ref everywhere in RAPIDS (not just cuDF). This requires adding an include to MANY files. Getting the location of the include right everywhere is very difficult without automatic grouping of headers. I started out writing a bash script to do this before realizing that clang-format has the necessary feature, and that my script would never properly handle files like this.
2. To increase velocity. Everywhere in RAPIDS that we have automated code standard/style/formatting/other, the benefits to velocity have outweighed the costs. To paraphrase @bdice, $auto \nearrow \rightarrow \mu \searrow \rightarrow v \nearrow$.
3. The previous PR, "Automatic include sorting with clang-format" (#12760), had nearly 50 categories of headers. There was no way this could be applied universally across RAPIDS repos. My proposal has 10 categories. I tried to reduce it further but realized that it wouldn't be much less configuration to maintain, so I stopped at 10.

Note that one reason a small number of categories can still produce clear groups is that this PR updates many files to use quotes ("") instead of angle brackets (<>) for local cuDF headers that do not live in cudf/cpp/include. With our "near to far" include ordering policy, these are arguably the nearest files, and using quotes allows our first category to simply check for quotes. These files will be grouped and sorted without blank lines, but in practice this does not lose clarity because a single file typically does not include headers from more than two directories. The downside of this change is that I don't yet know how to enforce it automatically. I hope that when developers accidentally use <> for internal includes that don't start with (e.g.) "cudf", those includes will be grouped into one of the lowest-priority categories, and perhaps this will induce them to switch to "" to get the headers listed at the top. The rule is simple: if it's in libcudf but not in cpp/include/cudf, then use quotes. For everything else, use angle brackets.
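As a rough illustration of the quotes-versus-angle-brackets rule, here is a heuristic sketch in Python. It is not part of this PR and it is not a clang-format feature. The allow-list `ANGLE_BRACKET_ROOTS`, the function name, and the example header paths are all assumptions made for illustration. Third-party roots (for example a test framework directory) would have to be added to the list to avoid false positives.

```python
import re

# Hypothetical allow-list: first path components that are expected to use
# angle brackets. This is an assumption, not taken from the PR.
ANGLE_BRACKET_ROOTS = {
    "cudf", "cudf_test", "nvtext", "cudf_kafka", "benchmarks", "tests",
    "rmm", "thrust", "cub", "cuda",
}

# Matches '#include <path>' and captures the path between the brackets.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*<([^>]+)>')


def suspicious_angle_includes(source):
    """Return angle-bracket includes whose first directory is not allow-listed."""
    flagged = []
    for line in source.splitlines():
        match = INCLUDE_RE.match(line)
        if not match:
            continue
        path = match.group(1)
        root, sep, _ = path.partition("/")
        # Headers without a directory part (<vector>, <cuda_runtime.h>) are skipped.
        if sep and root not in ANGLE_BRACKET_ROOTS:
            flagged.append(path)
    return flagged
```

A check like this could only suggest candidates for switching to quotes. Whether a flagged header really is an internal libcudf header still needs a human decision.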

Other than headers from RAPIDS repos, I group all other headers that have a file extension in a single group, and all headers that have no file extension in another group. Since the latter also matches some headers from libcu++, I have an explicit category for <cuda/ includes to keep them separate from STL includes. A frequent effect of the single "." group is that cub and thrust headers get grouped without a blank line between them. I don't think this is a problem.

Below I'm listing the (fairly simple, in my opinion) .clang-format settings for this PR. Note that categories 2-5 will require tweaking for different RAPIDS repos.

Some may ask why I ordered cudf_test headers before cudf headers. I tried both orders, and putting cudf_test first generated significantly fewer changes in the PR, meaning that it's already the more common ordering (I suppose cudf_test is closer to the files that include it, since they are libcudf tests).

I've opened a similar PR for RMM with only 5 groups. rapidsai/rmm#1463

CC @davidwendt @vyasr @wence- @GregoryKimball for feedback

@isVoid contributed to this PR via pair programming.

IncludeBlocks: Regroup
IncludeCategories:
  - Regex:           '^"' # quoted includes
    Priority:        1
  - Regex:           '^<(benchmarks|tests)/' # benchmark and test includes
    Priority:        2
  - Regex:           '^<cudf_test/' # cudf_test includes
    Priority:        3
  - Regex:           '^<(nvtext|cudf_kafka)' # other libcudf includes
    Priority:        4
  - Regex:           '^<cudf/' # cuDF includes
    Priority:        5
  - Regex:           '^<(cugraph|cuml|cuspatial|raft|kvikio)' # Other RAPIDS includes
    Priority:        6
  - Regex:           '^<rmm/' # RMM includes
    Priority:        7
  - Regex:           '^<(thrust|cub|cuda)/' # CCCL includes
    Priority:        8
  - Regex:           '^<(cuda.h|cuda_runtime|device_types|math_constants|cooperative_groups)' # CUDART includes
    Priority:        8
  - Regex:           '^<.*\..*' # other system includes (e.g. with a '.')
    Priority:        9
  - Regex:           '^<[^.]+' # STL includes (no '.')
    Priority:        10
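To see how these categories play out, the following Python sketch assigns each include the priority of the first regex that matches it, which is how the clang-format documentation describes category assignment. It is an approximation only: clang-format uses POSIX extended regular expressions rather than Python's `re`, and the sketch ignores main-include detection and clang-format's exact sort order. The function names and example header names are chosen for illustration.

```python
import re

# (regex, priority) pairs copied from the IncludeCategories list above.
INCLUDE_CATEGORIES = [
    (r'^"', 1),
    (r'^<(benchmarks|tests)/', 2),
    (r'^<cudf_test/', 3),
    (r'^<(nvtext|cudf_kafka)', 4),
    (r'^<cudf/', 5),
    (r'^<(cugraph|cuml|cuspatial|raft|kvikio)', 6),
    (r'^<rmm/', 7),
    (r'^<(thrust|cub|cuda)/', 8),
    (r'^<(cuda.h|cuda_runtime|device_types|math_constants|cooperative_groups)', 8),
    (r'^<.*\..*', 9),
    (r'^<[^.]+', 10),
]


def include_priority(name):
    """Return the priority of the first category whose regex matches `name`.

    `name` includes its delimiters, e.g. '<vector>' or '"foo.hpp"'.
    """
    for pattern, priority in INCLUDE_CATEGORIES:
        if re.search(pattern, name):
            return priority
    raise ValueError(f"no category matches {name!r}")


def regroup(names):
    """Group includes by priority; each group is sorted alphabetically."""
    groups = {}
    for name in names:
        groups.setdefault(include_priority(name), []).append(name)
    return [sorted(groups[p]) for p in sorted(groups)]
```

For example, `regroup(['<vector>', '<cudf/types.hpp>', '"reader_impl.hpp"', '<thrust/sort.h>', '<cub/cub.cuh>'])` puts the quoted header first, then the cudf header, then cub and thrust together in one group (both have priority 8), and `<vector>` last.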

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

harrism · 2024-02-07T10:24:54Z

Note that these changes did turn up two places where needed headers were not included.

…rapidsai#14868) Changes some internal offset arrays used for managing temporary unicode code-points to int64 type. This affects the nvtext normalize and subword-tokenizer functions. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: rapidsai#14868

This PR ensures that all calls to `ctest` include the flag `--no-tests=error`. See rapidsai/build-planning#18. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Jake Awe (https://github.com/AyodeAwe) URL: rapidsai#14983

…ai#14989) This PR filters all `DeprecationWarning`'s that are being originated by `ArrowTable.to_pandas` Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: rapidsai#14989

bdice

Flushing comments for now. I'll look further at this later.

cpp/benchmarks/io/orc/orc_writer_chunks.cpp

cpp/benchmarks/synchronization/synchronization.hpp

.clang-format

…idsai#14995) This PR fixes `DataFrame.sort_index` to properly ignore indexes for all values of `axis`. This is fixed in pandas-2.2, hence xfailing the tests with a version check. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) URL: rapidsai#14995

davidwendt · 2024-02-08T13:31:31Z

Seems something here makes the jitify code-paths fail. I know jitify has issues with certain include files.
I would recommend addressing the build/test errors sooner than later.

Adds offsetalator in place of hardcoded offset type arrays to the strings split functions: - `cudf::strings::split()` - `cudf::strings::rsplit()` - `cudf::strings::split_record()` - `cudf::strings::rsplit_record()` - `cudf::strings::split_re()` - `cudf::strings::rsplit_re()` - `cudf::strings::split_record_re()` - `cudf::strings::rsplit_record_re()` Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) URL: rapidsai#14757

For comparison: pandas-dev/pandas#55856 pandas-dev/pandas#55895 pandas-dev/pandas#55499 The `errors="ignore"` parameter is the only one that is implemented so just added a test for that deprecation Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: rapidsai#14984

…#14998) A small bug in our previous implementation leads to a segfault when calling `.get_groups()` with no `values`. Thankfully, the cuDF Python API always calls this function with a value, but it's possible `pylibcudf` consumers will not. Authors: - Ashwin Srinath (https://github.com/shwina) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#14998

wence- · 2024-02-12T10:55:21Z

I am very much in favour of this. I like the groupings. One minor question around the colour of the bikeshed:

Do we want (probably not in this change, due to the churn it would cause) separate groups for public (<cudf/public_header.hpp>) and detail (<cudf/detail/private_header.hpp>) includes?

Updates `cudf::strings::wrap()` to use the offsetalator instead of hardcoded int32 type for offsets column data. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Karthikeyan (https://github.com/karthikeyann) URL: rapidsai#15002

…i#14956) Resolves [10219](NVIDIA/spark-rapids#10219) This PR introduces a new class named `GetJsonObjectOptions` that holds the configurations to control the behavior of the underlying `cudf::get_json_object` function. It incorporates this new class into the `getJSONObject` JAVA API as an additional argument but also keeps the previous API to maintain backwards compatibility. It also includes a test case, `testGetJSONObjectWithSingleQuotes`, validating the behavior of `getJSONObject` when single quotes are enabled. Authors: - Suraj Aralihalli (https://github.com/SurajAralihalli) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - MithunR (https://github.com/mythrocks) - Karthikeyan (https://github.com/karthikeyann) URL: rapidsai#14956

I think unpinning `numba` in the conda recipe was just missed in rapidsai#14616. I discovered this issue [trying to build the `24.02` release](https://github.com/rapidsai/cudf/actions/runs/7878153691/job/21496377912#step:7:1674). PRs & nightly builds are working because the `rapidsai-nightly` channel has an older version of `pynvjitlink` that supported `numba>=0.57` whereas the `rapidsai` channel only has the latest version which pins to `numba>=0.58`. Authors: - Ray Douglass (https://github.com/raydouglass) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Bradley Dice (https://github.com/bdice)

…apidsai#15018) Found while working on large strings where copy-if is called. In places where the `copy_if_safe` utility is called, the non-stencil overload calls the stencil-ed function by forwarding the `first` iterator as the `stencil` parameter. This works logically because both values will return the same result. Unfortunately, this can be a performance issue if the iterator is a complex/slow transform iterator, since it would be called twice (and inlined twice). Changing the non-stencil version to call `thrust::copy_if` directly fixes the potential issue. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Mike Wilson (https://github.com/hyperbolic2346) URL: rapidsai#15018

Updates `cudf::get_json_object()` to use the offsetalator to build the output strings column. It adds a sizes vector to hold the output row lengths which is then converted to offsets using the new `make_offsets_child_column()` utility. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mark Harris (https://github.com/harrism) - Vukasin Milovanovic (https://github.com/vuule) URL: rapidsai#15009

Updates `cudf::interleave_columns()` to use the new `make_offsets_child_column` utility and the offsetalator to build the output strings column. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: rapidsai#15004

…#15052) Developers expect that 'cleaning' a build directory will remove all forms of cached files (objects, libraries, jit cache, etc.). To ensure that happens consistently we also need to remove the jitify cache objects for cudf. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Bradley Dice (https://github.com/bdice) - Mark Harris (https://github.com/harrism) URL: rapidsai#15052

Reworks the `cudf::strings::detail::copy_if_else()` to improve performance for long strings. The rework builds a vector of rows to pass to the `make_strings_column` factory that uses the optimized `gather_chars` function. Also includes a benchmark for copy_if_else specifically for strings columns. Closes rapidsai#15014 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: rapidsai#15017

Closes rapidsai#14495 Adds support for reading and writing ORC and Parquet files with LZ4 compression. Also adds the new value to the Python API. Included basic C++ and Python tests so that the option is exercised in CI. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Bradley Dice (https://github.com/bdice) - Shruti Shivakumar (https://github.com/shrshi) - MithunR (https://github.com/mythrocks) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: rapidsai#14906

copy-pr-bot · 2024-02-15T00:12:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

harrism · 2024-02-15T00:14:06Z

The commit history on this PR has gotten out of hand somehow. I think it's best to start a new PR with the latest agreed-upon .clang-format changes and then rerun the pre-commit hooks and apply any other applicable reviewer suggestions from this PR. This way I will minimize changes and have a better chance of figuring out any CI failures.

wence- · 2024-02-15T09:33:29Z

Closing in favour of #15063

harrism added 2 commits February 7, 2024 05:02

Enforce include grouping order in .clang-format

66a1be1

Slightly expand header groupings

f425436

harrism added the "feature request" and "non-breaking" labels Feb 7, 2024

harrism self-assigned this Feb 7, 2024

Change internal includes to quotes, and run clang-format on all files…

4906fba

… using new grouping rules.

github-actions bot added the "libcudf" and "Python" labels Feb 7, 2024

harrism mentioned this pull request Feb 7, 2024

Automate include grouping using clang-format rapidsai/rmm#1463

Merged

3 tasks

harrism changed the title ~~Enforce include grouping order in .clang-format~~ Automate include grouping order in .clang-format Feb 7, 2024

harrism added 6 commits February 7, 2024 09:24

Somehow pre-commit missed two files...

1f7665a

copyright

2100247

Fix include quote mismatch

cb20306

Add missing include of json.hpp and missing namespace qualification.

f1702b9

add missing include of cassert

caf7214

copyright

e23d1f0

isVoid reviewed Feb 7, 2024

cpp/benchmarks/join/generate_input_tables.cuh

davidwendt and others added 4 commits February 7, 2024 15:08

Additional single group for CCCL and CUDA includes

0e39e4d

bdice reviewed Feb 8, 2024

galipremsagar and others added 2 commits February 8, 2024 08:02

Deprecate groupby fillna (rapidsai#15000)

7f28f2f

Deprecated in pandas 2.2 pandas-dev/pandas#55719 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: rapidsai#15000

mroeschke and others added 4 commits February 8, 2024 13:41

Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (r…

03f63ec

…apidsai#14999) This only really affected `to_hdf` Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: rapidsai#14999

harrism added 2 commits February 12, 2024 09:43

Remove stale comment.

8508b6a

Remove unnecessary cuda_runtime include

cc0e4a1

davidwendt and others added 5 commits February 12, 2024 15:46

Update Changelog [skip ci]

82f6a53

Merge pull request rapidsai#15029 from rapidsai/branch-24.02

d848707

Forward-merge branch-24.02 to branch-24.04

Revert copyright-only change

06d5009

harrism mentioned this pull request Feb 12, 2024

Remove obsolete restrictions of __CUDA_ARCH__ >= 700. #15030

Closed

raydouglass and others added 12 commits February 13, 2024 09:22

Merge branch-24.02 into branch-24.04

2e17a40

Merge pull request rapidsai#15039 from bdice/branch-24.04-merge-24.02

bb6ae07

Forward-merge branch-24.02 to branch-24.04

Deprecate delim_whitespace in read_csv for pandas 2.2 (rapidsai#14986)

ac4debd

Toward pandas 2.2 compat: Deprecated in pandas in pandas-dev/pandas#56557 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: rapidsai#14986

Fix broken link for developer guide (rapidsai#15025)

ee1c768

Closes rapidsai#14991 Authors: - Sanjana Gajendran (https://github.com/sanjana098) Approvers: - Bradley Dice (https://github.com/bdice) - Mark Harris (https://github.com/harrism) URL: rapidsai#15025

Merge branch 'fea/clang-format-include-grouping' of github.com:harris…

7bf3b5b

…m/cudf into fea/clang-format-include-grouping

github-actions bot added the "CMake", "conda", and "Java" labels Feb 15, 2024

harrism mentioned this pull request Feb 15, 2024

Automate include grouping order in .clang-format #15063

Merged

3 tasks

wence- closed this Feb 15, 2024
