Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] cudf.read_csv and internal ioutils should indicate how **kwargs are used. #10823

Closed
bdice opened this issue May 10, 2022 · 5 comments
Closed
Assignees
Labels
doc Documentation Python Affects Python cuDF API.

Comments

@bdice
Copy link
Contributor

bdice commented May 10, 2022

Report incorrect documentation

Location of incorrect documentation
https://docs.rapids.ai/api/cudf/nightly/api_docs/api/cudf.read_csv.html#cudf.read_csv

Describe the problems or issues found in the documentation
The cudf.read_csv function accepts **kwargs but it does not indicate how they are used. We should avoid accepting arbitrary keyword arguments because it makes it difficult for users to recognize when unsupported keywords are passed. For example, this silently ignores the user's intention (which might be to use usecols and dtype instead of cols and dtypes):

data = cudf.read_csv(data_file, cols=cols, dtypes=dtypes)[col]

These **kwargs are passed on in a couple places but I'm not familiar with their intended use:

Suggested fix for documentation
Investigate read_csv and related utility functions to determine how **kwargs are used, and document them accordingly.

If possible, I would prefer to remove **kwargs from the I/O functions in favor of named keyword(s) that can serve the same purpose.

Tagging @rjzamora @ayushdg who may have insight from editing this code in the past.

@bdice bdice added doc Documentation Needs Triage Need team to review and classify labels May 10, 2022
@github-actions github-actions bot added this to Needs prioritizing in Other Issues May 10, 2022
@bdice bdice added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 10, 2022
@bdice bdice added this to Needs prioritizing in Bug Squashing via automation May 10, 2022
@bdice bdice removed this from Needs prioritizing in Other Issues May 10, 2022
@ayushdg
Copy link
Member

ayushdg commented May 10, 2022

Thanks for the ping @bdice. iirc the origin for kwargs was to allow users to pass in the storage_options kwarg for filesystem specific options (for eg the s3 key or gcs token) while reading data from remote filesystems. At the time this could not be an explicit argument in cudf apis since the goal was to match Pandas API which did not support this at the time.

However since then Pandas has also adopted fsspec, the same library used by cudf and dask for handling different filesystems and as a result does expose storage_options (more details in #6916).
IMO we should be good to remove kwargs and expose storage_options for the relevant io api's that expose the same param as well.

@rjzamora
Copy link
Member

IMO we should be good to remove kwargs and expose storage_options for the relevant io api's that expose the same param as well.

Right - We can remove kwargs if we expose storage_options and open_file_options (used for remote-storage optimizations) in read_csv. In an ideal world, the information in open_file_options could be included in storage_options, but this isn't quite possible yet.

@github-actions
Copy link

github-actions bot commented Jun 9, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

github-actions bot commented Sep 7, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@galipremsagar galipremsagar self-assigned this Sep 27, 2022
@galipremsagar galipremsagar added this to Issue-Needs prioritizing in v22.10 Release via automation Sep 27, 2022
rapids-bot bot pushed a commit that referenced this issue Sep 27, 2022
Fixes: #11683, #10823

This PR:

- [x] Removes `kwargs` in CSV reader & writer such that users get clear errors when they misspell a parameter.
- [x] Re-orders `read_csv` & `to_csv` parameters which will now match to pandas.

The diff is actually adding `storage_options` to `read_csv` & `to_csv` after removing `kwargs`, and the rest of it all re-ordering appropriately.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11762
@galipremsagar
Copy link
Contributor

Resolved by #11762

Bug Squashing automation moved this from Needs prioritizing to Closed Sep 27, 2022
v22.10 Release automation moved this from Issue-Needs prioritizing to Done Sep 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Documentation Python Affects Python cuDF API.
Projects
No open projects
Development

No branches or pull requests

4 participants