ENH Add parameter as_frame to fetch_20newsgroups_vectorized #18262
Conversation
ping @adrinjalali @thomasjpfan I am intending to finish the original PR.
Thanks @glemaitre
```diff
@@ -433,12 +454,14 @@ def fetch_20newsgroups_vectorized(*, subset="train", remove=(), data_home=None,
         download_if_missing=download_if_missing)

     if os.path.exists(target_file):
-        X_train, X_test = joblib.load(target_file)
+        X_train, X_test, feature_names = joblib.load(target_file)
```
We probably should handle the case where the file exists from an earlier version but doesn't have the feature_names.
What would be the strategy there? I would almost advocate for asking to remove the cached file.
I am okay with asking to remove as long as the full path is shown to the user.
We could also version those files somehow, so that multiple versions of sklearn sharing the same home directory with the default paths would all work. But I guess that's too much work.
As an alternative, I'm happy to ask the user and remove the file, while explaining why we need to remove it and that the new one works for sklearn>=0.xx.
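The versioning idea mentioned above could be sketched roughly like this; the helper name and file-name scheme are hypothetical, not anything scikit-learn actually ships:

```python
import os

def versioned_cache_path(data_home, name, version):
    # Hypothetical helper: embed the writing library's version in the
    # cache file name so caches written by different sklearn releases
    # can coexist in the same data home directory.
    return os.path.join(data_home, f"{name}_v{version}.pkl")
```

A fetcher could then fall back to re-downloading whenever the path for the currently running version does not exist yet, leaving older versions' caches untouched.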
Thank you @glemaitre!
Co-authored-by: Juan Carlos Alfaro Jiménez <JuanCarlos.Alfaro@uclm.es>
```diff
@@ -433,12 +451,22 @@ def fetch_20newsgroups_vectorized(*, subset="train", remove=(), data_home=None,
         download_if_missing=download_if_missing)

     if os.path.exists(target_file):
-        X_train, X_test = joblib.load(target_file)
+        try:
+            X_train, X_test, feature_names = joblib.load(target_file)
```
Running this locally forces us to delete /Users/thomasfan/scikit_learn_data/20newsgroup_vectorized_py3.pkl. Would there be a downside to removing the target_file when a ValueError is raised?
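The deletion being proposed here could be sketched as follows; this is a minimal sketch with a hypothetical helper name and plain pickle in place of joblib, not the PR's actual code:

```python
import os
import pickle

def load_or_invalidate(target_file):
    """Sketch: try to load the new three-element cache; if the file holds
    the old two-element format, unpacking raises ValueError, so delete
    the stale cache and signal the caller to re-download."""
    try:
        with open(target_file, "rb") as fh:
            X_train, X_test, feature_names = pickle.load(fh)
        return X_train, X_test, feature_names
    except ValueError:
        # Old cache layout from an earlier release: remove it so the
        # next call re-downloads and re-caches in the new format.
        os.remove(target_file)
        return None
```

The OS-lock concern raised below would surface as an OSError from `os.remove`, which this sketch deliberately does not catch.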
It would be fine with me as well. Can you think of a case where trying to delete the file would fail (with some OS lock)?
It may not be completely thread safe. On the other hand, we do something like this here:
scikit-learn/sklearn/datasets/_openml.py, lines 48 to 54 in 0864c58:
```python
def _retry_with_clean_cache(
    openml_path: str, data_home: Optional[str]
) -> Callable:
    """If the first call to the decorated function fails, the local cached
    file is removed, and the function is called again. If ``data_home`` is
    ``None``, then the function is called once.
    """
```
where we remove files in the cache when we are not able to read and redownload the file.
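A decorator of that shape could look roughly like this; a simplified sketch of the pattern, not the actual `_openml.py` implementation, which takes an OpenML path and data home rather than a file path:

```python
import functools
import os

def retry_with_clean_cache(cache_file):
    """Sketch: if the wrapped function fails, remove the cached file
    (assumed to be the cause of the failure) and call it once more."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                if os.path.exists(cache_file):
                    os.remove(cache_file)
                # Second call reads from a clean cache (or re-downloads).
                return func(*args, **kwargs)
        return wrapper
    return decorator
```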
Good point, it would make sense to do something similar then.
Actually I am unsure if this is a good idea. We might postpone this part and come up with something consistent across all fetchers.
I am thinking about the following points:
- it might be better to have a download-and-retry mechanism common to all fetchers
- in case download_if_missing=False and we have an invalidated cache, which type of error do we raise?
> We might postpone this part and come up with something consistent across all fetchers.

I think I am okay with this. I hope there aren't any more testing-related issues with this PR. Once this gets merged, developers will need to manually remove the pickled file.

> it might be better to have a download and retry common to all fetchers

Yea... it makes sense to do.

> in case download_if_missing=False and we have an invalidated cache, which type of error do we raise?

`Cache is invalid, please set download_if_missing=True to redownload`?
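That error path could be sketched as below; the helper name, error type, and return values are all hypothetical, since this part was postponed rather than merged:

```python
import os

def check_cache(target_file, download_if_missing):
    """Sketch: decide what to do when the cache file is absent,
    e.g. because an invalidated cache was removed earlier."""
    if not os.path.exists(target_file):
        if not download_if_missing:
            raise OSError(
                "Cache is invalid, please set download_if_missing=True "
                "to redownload."
            )
        return "download"
    return "load"
```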
OK so we would need to detect that we removed the file before.
@cmarmo since we postponed the part with the downloader, this PR is ready to be reviewed and approved. @adrinjalali @thomasjpfan Could you make a new pass on it?
LGTM
Why does Codecov complain about sklearn/datasets/_base.py#L77?
some of the codecov complaints seem legit, like sparse checks?
@lorentzenchr because we don't run the tests related to the fetchers in the CI.
Be prepared to delete your cached file locally once this is merged.
LGTM
Fingers crossed
Supersedes #17499. Closes #17499.

Adds the parameter `as_frame` to `fetch_20newsgroups_vectorized`.