[BUG] columns parameter is being ignored in specific cases in DataFrame constructor #6821
Comments
rapids-bot bot pushed a commit that referenced this issue on Apr 23, 2021
This PR rewrites much of the internals of `cudf.concat`. It includes the following feature improvements and bug fixes:

1. Bug: When passed a single `cudf.Series` object with `ignore_index=True`, `cudf.concat` would upcast the output to a `DataFrame`.
2. Bug: When initializing a `cudf.DataFrame` using a `dict` _and_ providing column names, if none of the column names are actually in the dictionary the resulting `DataFrame` would be placed in an invalid state where the data would contain empty rows of size 0 but the index would have a size corresponding to the provided data columns. `pandas` returns a `DataFrame` with the expected shape (number of rows corresponding to the size of each provided data column, number of columns corresponding to the number specified in the column argument list), except all the data is filled with NaN. `cudf` agrees with `pandas` as long as at least one valid column exists in the data, it just needed to be modified for the case where no valid columns were provided.
   **EDIT**: Actually, the behavior is a bit more complicated than I originally stated. The number of rows in the resulting DataFrame depends on the index provided, so it should be empty when no index is given but have rows corresponding to `len(index)` when there is an index.
3. Feature: Column sorting was disabled in this function due to #6821. Now that bug has been resolved, sorting can be reenabled.
4. Bug: Various places were using the check `if columns`, which is invalid when the input is an `Index`.
5. Feature/Bug: Previously, concatenations based on the "inner" pattern would always sort the input columns simply by choice of algorithm. `pandas` does not do this, so we should be matching that behavior.

In addition, this PR includes the following changes to improve code quality and clarity:

1. Moves the definition of `Frame._concat` to `DataFrame`. The other `Frame` subclasses (`Index` and `Series`) each provide their own implementations, so the "default" implementation is really the `DataFrame` implementation.
2. Use Python `set` operations for identifying column intersections rather than a mix of `numpy` and `pandas` functions.
3. Various other minor changes to simplify parts of the code and reduce overhead from unnecessary operations like dictionary copies or complex iterations.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- https://github.com/brandon-b-miller
- Michael Wang (https://github.com/isVoid)

URL: #7867
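
For reference, the `pandas` behavior described in bug 2 (including the EDIT about the index) and the `if columns` pitfall from bug 4 can be sketched as below. This is a minimal illustration that uses `pandas` only, so it runs without a GPU; it is not code from the PR, and `has_columns` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd

data = {"a": [1, 2, 3]}

# None of the requested columns exist in the dict and no index is given:
# pandas returns a frame with zero rows that still carries both columns.
no_match = pd.DataFrame(data, columns=["b", "c"])

# Same request with an explicit index: len(index) rows, every value missing.
no_match_indexed = pd.DataFrame(data, columns=["b", "c"], index=[0, 1, 2])

# At least one requested column exists: rows come from that column,
# and the unknown column "b" is filled with NaN.
partial = pd.DataFrame(data, columns=["a", "b"])


def has_columns(columns):
    """Hypothetical helper: an explicit check that is safe for list and Index."""
    return columns is not None and len(columns) > 0


# `if columns:` is unsafe because the truth value of a multi-element
# Index is ambiguous and pandas raises ValueError.
try:
    bool(pd.Index(["a", "b"]))
    truthiness_ok = True
except ValueError:
    truthiness_ok = False
```

An explicit `None` and length check keeps working for plain lists, tuples, and `Index` objects alike, which is one way to avoid the ambiguity that `if columns` runs into.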
Describe the bug
The `columns` parameter is ignored for a few specific types of `data` input.

Steps/Code to reproduce bug
Expected behavior
The `columns` parameter should be honored for every supported input type.
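
As a point of comparison for the expected behavior, here is a small sketch of how `pandas` treats `columns` for a few common input types. The input types chosen are an assumption made for illustration; the issue does not state which types were affected in `cudf`, and this is not the original reproduction.

```python
import pandas as pd

# Positional data (a list of rows): `columns` supplies the labels.
from_rows = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

# Dict of columns: `columns` selects and orders the keys to keep.
from_dict = pd.DataFrame({"a": [1, 3], "b": [2, 4]}, columns=["b", "a"])

# List of records: `columns` again selects which keys become columns.
from_records = pd.DataFrame(
    [{"a": 1, "b": 2}, {"a": 3, "b": 4}], columns=["b"]
)

print(list(from_rows.columns))
print(list(from_dict.columns))
print(list(from_records.columns))
```

In each case the resulting frame's columns are exactly the ones requested, which is the behavior a `cudf.DataFrame` constructor would presumably need to match.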