
ENH Add parameter as_frame to fetch_20newsgroups_vectorized #18262

Merged (22 commits, Nov 1, 2020)

Conversation

@glemaitre (Member)

Supersedes #17499
Closes #17499

Add the parameter as_frame to fetch_20newsgroups_vectorized.
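A rough sketch of what `as_frame=True` provides (illustrating the conversion, not scikit-learn's actual implementation): the vectorized term counts stay sparse but are wrapped in a pandas DataFrame, with the vocabulary as column names and the target appended as an extra column. The matrix, feature names, and targets below are made-up stand-ins.

```python
# Sketch of the as_frame conversion, using a tiny stand-in matrix.
import numpy as np
import pandas as pd
import scipy.sparse as sp

X = sp.csr_matrix(np.array([[0, 1, 2], [3, 0, 0]]))  # stand-in for the term-count matrix
feature_names = ["alpha", "beta", "gamma"]            # stand-in vocabulary
target = np.array([0, 1])

# Sparse columns keep memory usage comparable to the scipy matrix.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=feature_names)
df["target"] = target
print(df.shape)  # (2, 4)
```

The point of the sparse accessor here is that densifying the full 20newsgroups matrix (~130k features) would be prohibitively large; a sparse-backed DataFrame keeps column names without that cost.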

@glemaitre (Member, Author)

ping @adrinjalali @thomasjpfan I am intending to finish the original PR.

@adrinjalali (Member) left a comment

Thanks @glemaitre

Review threads (resolved) on: doc/whats_new/v0.24.rst, sklearn/datasets/_base.py, sklearn/datasets/_twenty_newsgroups.py
```diff
@@ -433,12 +454,14 @@ def fetch_20newsgroups_vectorized(*, subset="train", remove=(), data_home=None,
         download_if_missing=download_if_missing)

     if os.path.exists(target_file):
-        X_train, X_test = joblib.load(target_file)
+        X_train, X_test, feature_names = joblib.load(target_file)
```
Member:

we probably should handle the case that the file exists from an earlier version, but doesn't have the feature_names

Member (Author):

What would be the strategy there? I would almost advocate for asking to remove the cached file.

Member:

I am okay with asking to remove as long as the full path is shown to the user.

Member:

we could also version those files somehow so that it works with multiple versions of sklearn in the same home directory with the default paths. But I guess that's too much work.

As an alternative I'm happy to ask the user and remove while explaining why we need to remove it and that the new one works for sklearn>=0.xx
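A minimal sketch of the "ask the user to remove the cached file" strategy discussed above, showing the full path in the error. `load_cache` is a hypothetical helper, and plain `pickle` stands in for joblib only to keep the demo self-contained; scikit-learn itself uses joblib for this cache.

```python
import os
import pickle
import tempfile

def load_cache(target_file):
    """Hypothetical loader: accept only the new three-element cache layout."""
    with open(target_file, "rb") as f:
        payload = pickle.load(f)
    try:
        X_train, X_test, feature_names = payload
    except ValueError as exc:
        # Old caches stored only (X_train, X_test): show the full path so
        # the user knows exactly which file to delete.
        raise ValueError(
            f"The cached dataset at {target_file} was written by an older "
            "scikit-learn version and lacks feature names. "
            "Please delete it and call the fetcher again."
        ) from exc
    return X_train, X_test, feature_names

# Simulate an old-format (two-element) cache file.
old_cache = os.path.join(tempfile.mkdtemp(), "20newsgroup_vectorized.pkl")
with open(old_cache, "wb") as f:
    pickle.dump(("X_train", "X_test"), f)

try:
    load_cache(old_cache)
except ValueError as err:
    print("refused old cache")
```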

Review thread (resolved) on: sklearn/datasets/tests/test_20news.py
@alfaro96 (Member) left a comment

Thank you @glemaitre!

Review threads (resolved) on: sklearn/datasets/_twenty_newsgroups.py, doc/whats_new/v0.24.rst, sklearn/datasets/tests/test_20news.py
glemaitre and others added 5 commits August 28, 2020 09:35
Review threads (resolved) on: sklearn/datasets/tests/test_20news.py

@glemaitre glemaitre added this to WAITING FOR REVIEW in Guillaume's pet Sep 7, 2020
```diff
@@ -433,12 +451,22 @@ def fetch_20newsgroups_vectorized(*, subset="train", remove=(), data_home=None,
         download_if_missing=download_if_missing)

     if os.path.exists(target_file):
-        X_train, X_test = joblib.load(target_file)
+        try:
+            X_train, X_test, feature_names = joblib.load(target_file)
```
Member:

Running this locally forces us to delete /Users/thomasfan/scikit_learn_data/20newsgroup_vectorized_py3.pkl.

Would there be a downside with removing the target_file when a ValueError is raised?

Member (Author):

It would be fine with me as well. Can you think of a case where trying to delete the file would fail (e.g. with an OS lock)?

Member:

It may not be completely thread safe. On the other hand, we do something like this here:

```python
def _retry_with_clean_cache(
    openml_path: str, data_home: Optional[str]
) -> Callable:
    """If the first call to the decorated function fails, the local cached
    file is removed, and the function is called again. If ``data_home`` is
    ``None``, then the function is called once.
    """
```

where we remove files from the cache when we are not able to read them, and redownload the file.
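A hedged sketch of that delete-and-retry pattern. `retry_with_clean_cache` below is an illustrative stand-in modelled loosely on the OpenML helper quoted above, not the actual scikit-learn API.

```python
import functools
import os

def retry_with_clean_cache(cache_path):
    """If the wrapped loader fails, delete the cached file and retry once."""
    def decorator(load):
        @functools.wraps(load)
        def wrapper(*args, **kwargs):
            try:
                return load(*args, **kwargs)
            except Exception:
                # Assume the cache is stale or corrupt: remove it so the
                # second call falls back to a fresh download.
                if os.path.exists(cache_path):
                    os.remove(cache_path)
                return load(*args, **kwargs)
        return wrapper
    return decorator
```

As noted in the thread, this is not fully thread-safe: two processes reading the same cache can race on the `os.remove` call.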

Member (Author):

Good point, it would make sense to do something similar then.

Member (Author):

Actually I am unsure if this is a good idea. We might postpone this part and come up with something consistent across all fetchers.

I am thinking about the following points:

  • it might be better to have a download-and-retry mechanism common to all fetchers
  • if download_if_missing=False and we have an invalidated cache, which type of error do we raise?

Member:

> We might postpone this part and come up with something consistent across all fetchers.

I think I am okay with this. I hope there aren't any more testing-related issues with this PR. Once this gets merged, developers will need to manually remove the pickled file.

> it might be better to have a download-and-retry mechanism common to all fetchers

Yeah... it makes sense to do.

> if download_if_missing=False and we have an invalidated cache, which type of error do we raise?

"Cache is invalid, please set download_if_missing=True to redownload"?
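A toy sketch of the error path proposed above, for the case where the cache is invalidated but download_if_missing=False. The function name, return values, and exact exception type are illustrative, not an existing API.

```python
def load_or_fail(cache_is_valid, download_if_missing):
    """Illustrative decision logic for an invalidated cache."""
    if not cache_is_valid:
        if not download_if_missing:
            # The message suggested in the discussion above.
            raise OSError(
                "Cache is invalid, please set download_if_missing=True "
                "to redownload."
            )
        return "redownloaded"
    return "loaded from cache"

print(load_or_fail(cache_is_valid=False, download_if_missing=True))  # redownloaded
```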

Member (Author):

OK so we would need to detect that we removed the file before.

@glemaitre (Member, Author)

@cmarmo since we postponed the part with the downloader, this PR is ready to be reviewed and approved.
I am adding back the label.

@adrinjalali @thomasjpfan Could you make a new pass on it?

@glemaitre glemaitre added this to the 0.24 milestone Oct 25, 2020
@agramfort (Member) left a comment

LGTM

@lorentzenchr (Member) left a comment

Why does Codecov complain about sklearn/datasets/_base.py#L77?

Review thread on: sklearn/datasets/_twenty_newsgroups.py
@adrinjalali (Member) left a comment

some of the codecov complaints seem legit, like sparse checks?

Review threads (resolved) on: sklearn/datasets/_base.py, sklearn/datasets/_twenty_newsgroups.py, sklearn/datasets/tests/test_20news.py
@glemaitre (Member, Author)

> Why does Codecov complain about sklearn/datasets/_base.py#L77?

@lorentzenchr because we don't run the tests related to the fetchers in the CI.

@thomasjpfan (Member) left a comment

Be prepared for us to delete the cached file locally once this is merged.

LGTM

@adrinjalali (Member) left a comment

Fingers crossed

@adrinjalali adrinjalali merged commit 15cb869 into scikit-learn:master Nov 1, 2020
@glemaitre glemaitre moved this from WAITING FOR REVIEW to WAITING FOR CONSENSUS in Guillaume's pet Jun 30, 2021
@glemaitre glemaitre moved this from WAITING FOR CONSENSUS to TO BE MERGED in Guillaume's pet Jun 30, 2021
@glemaitre glemaitre moved this from TO BE MERGED to MERGED in Guillaume's pet Jun 30, 2021
8 participants