ENH Add parameter as_frame to fetch_20newsgroups_vectorized #18262
Conversation
ping @adrinjalali @thomasjpfan I am intending to finish the original PR.
Thanks @glemaitre
```diff
@@ -433,12 +454,14 @@ def fetch_20newsgroups_vectorized(*, subset="train", remove=(), data_home=None,
         download_if_missing=download_if_missing)

     if os.path.exists(target_file):
-        X_train, X_test = joblib.load(target_file)
+        X_train, X_test, feature_names = joblib.load(target_file)
```
We probably should handle the case where the file exists from an earlier version but doesn't have the feature_names.
What would be the strategy there? I would almost advocate for asking to remove the cached file.
I am okay with asking to remove as long as the full path is shown to the user.
We could also version those files somehow, so that multiple versions of sklearn sharing the same home directory with the default paths would all work. But I guess that's too much work.
As an alternative, I'm happy to ask the user and remove the file, while explaining why we need to remove it and that the new one works for sklearn>=0.xx.
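The versioning idea mentioned above could be sketched roughly like this; the helper name and file-name scheme are hypothetical, not anything scikit-learn actually ships:

```python
import os

def versioned_cache_path(data_home, name, version):
    # Hypothetical helper: embed the writing library's version in the
    # cache file name so caches written by different sklearn releases
    # can coexist in the same data home directory.
    return os.path.join(data_home, f"{name}_v{version}.pkl")
```

A fetcher could then fall back to re-downloading whenever the path for the currently running version does not exist yet, leaving older versions' caches untouched.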
Thank you @glemaitre!
Co-authored-by: Juan Carlos Alfaro Jiménez <JuanCarlos.Alfaro@uclm.es>
```diff
@@ -433,12 +451,22 @@ def fetch_20newsgroups_vectorized(*, subset="train", remove=(), data_home=None,
         download_if_missing=download_if_missing)

     if os.path.exists(target_file):
-        X_train, X_test = joblib.load(target_file)
+        try:
+            X_train, X_test, feature_names = joblib.load(target_file)
```
Running this locally forces us to delete /Users/thomasfan/scikit_learn_data/20newsgroup_vectorized_py3.pkl. Would there be a downside to removing the target_file when a ValueError is raised?
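The deletion being proposed here could be sketched as follows; this is a minimal sketch with a hypothetical helper name and plain pickle in place of joblib, not the PR's actual code:

```python
import os
import pickle

def load_or_invalidate(target_file):
    """Sketch: try to load the new three-element cache; if the file holds
    the old two-element format, unpacking raises ValueError, so delete
    the stale cache and signal the caller to re-download."""
    try:
        with open(target_file, "rb") as fh:
            X_train, X_test, feature_names = pickle.load(fh)
        return X_train, X_test, feature_names
    except ValueError:
        # Old cache layout from an earlier release: remove it so the
        # next call re-downloads and re-caches in the new format.
        os.remove(target_file)
        return None
```

The OS-lock concern raised below would surface as an OSError from `os.remove`, which this sketch deliberately does not catch.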
It would be fine with me as well. Can you think of a case where trying to delete the file would fail (with some OS lock)?
It may not be completely thread safe. On the other hand, we do something like this here:
scikit-learn/sklearn/datasets/_openml.py, lines 48 to 54 in 0864c58:
```python
def _retry_with_clean_cache(
    openml_path: str, data_home: Optional[str]
) -> Callable:
    """If the first call to the decorated function fails, the local cached
    file is removed, and the function is called again. If ``data_home`` is
    ``None``, then the function is called once.
    """
```
where we remove files in the cache when we are not able to read and redownload the file.
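A decorator of that shape could look roughly like this; a simplified sketch of the pattern, not the actual `_openml.py` implementation, which takes an OpenML path and data home rather than a file path:

```python
import functools
import os

def retry_with_clean_cache(cache_file):
    """Sketch: if the wrapped function fails, remove the cached file
    (assumed to be the cause of the failure) and call it once more."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                if os.path.exists(cache_file):
                    os.remove(cache_file)
                # Second call reads from a clean cache (or re-downloads).
                return func(*args, **kwargs)
        return wrapper
    return decorator
```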
Good point, it would make sense to do something similar then.
Actually I am unsure if this is a good idea. We might postpone this part and come up with something consistent across all fetchers.
I am thinking about the following points:
- it might be better to have a download-and-retry mechanism common to all fetchers
- in case download_if_missing=False and we have an invalidated cache, which type of error do we raise?
> We might postpone this part and come up with something consistent across all fetchers.

I think I am okay with this. I hope there aren't any more testing-related issues with this PR. Once this gets merged, developers will need to manually remove the pickled file.

> it might be better to have a download and retry common to all fetchers

Yea... it makes sense to do.

> in case download_if_missing=False and we have an invalidated cache, which type of error do we raise?

`Cache is invalid, please set download_if_missing=True to redownload`?
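That error path could be sketched as below; the helper name, error type, and return values are all hypothetical, since this part was postponed rather than merged:

```python
import os

def check_cache(target_file, download_if_missing):
    """Sketch: decide what to do when the cache file is absent,
    e.g. because an invalidated cache was removed earlier."""
    if not os.path.exists(target_file):
        if not download_if_missing:
            raise OSError(
                "Cache is invalid, please set download_if_missing=True "
                "to redownload."
            )
        return "download"
    return "load"
```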
OK so we would need to detect that we removed the file before.
@cmarmo since we postponed the part with the downloader, this PR is ready to be reviewed and approved. @adrinjalali @thomasjpfan Could you make a new pass on it?
LGTM
Why does Codecov complain about sklearn/datasets/_base.py#L77?
some of the codecov complaints seem legit, like sparse checks?
@lorentzenchr because we don't run the tests related to the fetchers in the CI.
Be prepared to delete your cached file locally once this is merged.
LGTM
Fingers crossed
Supersedes #17499. Closes #17499.

Adds the parameter `as_frame` to `fetch_20newsgroups_vectorized`.