
FIX introduce a refresh_cache param to fetch_... functions. #14197

Merged (21 commits) on Jul 12, 2019

Conversation

adrinjalali (Member):

This kinda fixes #14177.

This does not change the warning message. But it introduces a refresh_cache parameter to the fetch_... functions that download and persist the data using joblib.

The proposal is to add this parameter, with some variation after reviews:

+    refresh_cache : str or bool, optional (default='joblib')
+        - ``True``: remove the previously downloaded data and fetch it again.
+        - ``'joblib'``: only re-fetch the data if the previously downloaded
+          data was persisted using the previously vendored `joblib`.
+        - ``False``: do not re-fetch the data.
+
+        From version 0.23, ``'joblib'`` as an input value will be ignored and
+        treated as ``False``.
+
+        .. versionadded:: 0.21.3
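
For illustration, a minimal sketch (not part of this PR) of how a fetch_* function might act on such a parameter; the fetcher, the vendored-pickle check, and the download stand-in below are all hypothetical:

import os
import joblib

def fetch_example(filepath, refresh_cache='joblib'):
    # Hypothetical fetcher illustrating the proposed parameter.
    def looks_like_vendored_pickle(path):
        # A real check would inspect the pickle for references to
        # ``sklearn.externals.joblib``; always False in this sketch.
        return False

    if os.path.exists(filepath):
        if refresh_cache is True:
            os.remove(filepath)  # always re-fetch
        elif refresh_cache == 'joblib' and looks_like_vendored_pickle(filepath):
            os.remove(filepath)  # re-fetch only old vendored-joblib pickles
        # refresh_cache=False: leave the existing cache untouched

    if not os.path.exists(filepath):
        data = {"example": [1, 2, 3]}  # stand-in for the real download
        joblib.dump(data, filepath, compress=6)
    return joblib.load(filepath)

With refresh_cache=False the cache is never touched, which covers the read-only case discussed below.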

I've changed only one of the dataset files, will fix the others once we agree on a solution.

@glemaitre @jnothman

adrinjalali added this to the 0.21.3 milestone on Jun 26, 2019
adrinjalali (Member, Author):

Also, with this change, I think we can remove the if not hasattr(sys, "_is_pytest_session"): check from joblib/__init__.py.

qinhanmin2014 (Member):

Is it possible to only introduce a function (i.e., refresh_cache)? If we introduce a parameter, we'll need a deprecation cycle to remove it.
I'm not able to run experiments, but it seems that some other functions also suffer from this issue (e.g., fetch_20newsgroups_vectorized)

adrinjalali (Member, Author):

I'm not able to run experiments, but it seems that some other functions also suffer from this issue (e.g., fetch_20newsgroups_vectorized)

Yes, as I said: I've changed only one of the dataset files, will fix the others once we agree on a solution.

Is it possible to only introduce a function (i.e., refresh_cache)? If we introduce a parameter, we'll need a deprecation cycle to remove it.

Yes, as Joel suggested:

I would be okay to detect this case and re-cache over a deprecation period... (but we have to also allow for the cache not being writable, I think.)

This also allows the user to set the parameter to False, for the case when the data is read-only.

qinhanmin2014 (Member):

Yes, as Joel suggested:

What's Joel's suggestion? I saw something like "I would be okay to detect this case and re-cache over a deprecation period". Does that mean that we need to download/construct the dataset every time we use it?

jnothman (Member):

I had mostly just intended this to happen automatically without a parameter to control it, and with that functionality disappearing after a couple of versions.

adrinjalali (Member, Author):

I had mostly just intended this to happen automatically without a parameter to control it, and with that functionality disappearing after a couple of versions.

The issue is when users don't have a reliable or easy connection on those systems: the functionality would delete the files while they can't easily re-download them. Although with the default value here the same thing happens. I can remove the parameter and have the deprecation cycle done completely silently, if that's what you prefer. I find the parameter useful anyway, even without the 'joblib' option.

jnothman (Member) commented Jun 27, 2019 via email.

qinhanmin2014 (Member):

Thinking more about it, it seems difficult to solve the issue by only introducing a function to refresh the cache, because it's difficult to determine which files to remove. So I'm OK with this solution (i.e., adding a parameter). If we don't like it later, we can deprecate it.
Why do we need the joblib option? True and False seem enough?

qinhanmin2014 (Member):

If I understand correctly, users can still use the datasets they pickled previously (e.g., before sklearn 0.21), right? So it doesn't seem good to download/construct the datasets again by default?

jnothman (Member) commented Jun 27, 2019 via email.

adrinjalali (Member, Author):

I don't see why we should download again here at all... Shouldn't we just unpickle (with sklearn.externals.joblib) and re-pickle (with joblib)?

Didn't think of that, will do.
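
A minimal sketch of that idea, assuming the old pickle still loads through plain joblib and that the cache directory may be read-only (as Joel notes above):

import joblib

def recache(path, compress=6):
    # Loading goes through plain joblib; pickles written by the vendored
    # shim are what trigger the deprecation warning on load.
    data = joblib.load(path)
    try:
        # Re-persist with the real joblib so future loads are warning-free.
        joblib.dump(data, path, compress=compress)
    except IOError:
        pass  # the cache may be read-only; keep the old file
    return data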

jnothman (Member) left a comment:

Yes, that's more what I was thinking ;)

qinhanmin2014 (Member):

I'll vote +1 for this solution.

@@ -919,3 +920,25 @@ def _fetch_remote(remote, dirname=None):
                      "file may be corrupted.".format(file_path, checksum,
                                                      remote.checksum))
     return file_path


def _refresh_cache(path):
Member:

You put this function in base.py, but isn't it designed only for fetch_covtype?

adrinjalali (Member, Author):

It still suppresses the warning if the folder/file is read-only. Trying to fix that.

qinhanmin2014 (Member):

Are you replying to my comment? I mean, things like

samples_path = _pkl_filepath(path, "samples")
targets_path = _pkl_filepath(path, "targets")

won't work for other functions (e.g., fetch_20newsgroups_vectorized)

adrinjalali (Member, Author):

I'll see what I need to do once I check the other instances that need to be fixed; for now I'm still working on this one. It was not a response to your comment, it was a response to @jnothman's comment regarding handling the case where I/O fails.

adrinjalali (Member, Author):

This now fixes all the instances of the issue, except for sklearn/datasets/lfw.py, which is kinda more complicated.

The codecov check fails. I've tested all of the examples, both with old persisted data being converted to the new format and for the case when the files are read-only. But I don't think we can easily test them (unless we do some monkeypatching; not sure if it's worth it). How does this look, @jnothman?

adrinjalali (Member, Author):

Codecov doesn't seem to be working properly on this patch anyway. I suppose it's kinda ready now.

sklearn/datasets/base.py (review thread, outdated, resolved)

    other_warns = [w for w in warns if not str(w.message).startswith(msg)]
    for w in other_warns:
        warnings.warn(message=w.message, category=w.category)
Member:

Ideally you should also be using the original module and line number. Is there an alternative function better suited to issuing the warning?

adrinjalali (Member, Author):

There's warnings.warn_explicit, but it didn't actually end up showing the warning, and I couldn't fix it in 5 minutes, so I switched back to warn.
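
For reference, a hedged sketch of re-issuing captured warnings with their original location via warnings.warn_explicit; its registry machinery can deduplicate repeats, which is one plausible reason the warning didn't show up:

import warnings

# Capture a warning, then re-issue it with its original file/line intact.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("example", UserWarning)

for w in caught:
    warnings.warn_explicit(
        message=w.message,    # passing the Warning instance keeps its text
        category=w.category,
        filename=w.filename,  # original location, unlike plain warn()
        lineno=w.lineno,
        registry=None,        # a populated registry can suppress repeats
    )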

jnothman (Member):

Add a test refreshing a pretend pickle?

adrinjalali (Member, Author):

Add a test refreshing a pretend pickle?

Not really testing the pickle itself, but testing different routes through the warnings now.

    return 0

def _load_warn_unrelated(*args, **kwargs):
    warnings.warn("unrelated warning", UserWarning)
Member:

For a tighter test, this could be a DeprecationWarning.
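
Sketched out, such a test might look like this; _refresh_cache_stub is a stand-in reconstruction of the PR's helper (the deprecation message text is assumed), not the actual implementation:

import warnings
import pytest

def _refresh_cache_stub(load):
    # Stand-in for the PR's helper: capture warnings raised while loading,
    # then re-issue any that are unrelated to the vendored-joblib message.
    msg = "sklearn.externals.joblib is deprecated"
    with warnings.catch_warnings(record=True) as warns:
        warnings.simplefilter("always")
        data = load()
    for w in warns:
        if not str(w.message).startswith(msg):
            warnings.warn(message=w.message, category=w.category)
    return data

def test_unrelated_warning_is_reraised():
    def _load_warn_unrelated():
        warnings.warn("unrelated warning", UserWarning)
        return 0
    with pytest.warns(UserWarning, match="unrelated warning"):
        assert _refresh_cache_stub(_load_warn_unrelated) == 0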

glemaitre (Member) left a comment:

Could you add an entry in the what's new? We will need it to remind us to make the changes (it was quite useful when I wanted to remove deprecated code). I don't know if we should have a specific section, like maintenance, to group things like deprecations or these types of changes.



def _refresh_cache(files, compress):
    # REMOVE in v0.23
Member:

The TODO marker would be easier to catch:

Suggested change:
-    # REMOVE in v0.23
+    # TODO: REMOVE in v0.23

    warnings.warn(message=message, category=DeprecationWarning)

    if len(data) == 1:
        return data[0]
Member:

One-liner:

return data[0] if len(data) == 1 else data

@@ -129,7 +130,9 @@ def fetch_california_housing(data_home=None, download_if_missing=True,
         remove(archive_path)

     else:
-        cal_housing = joblib.load(filepath)
+        cal_housing = _refresh_cache([filepath], 6)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

-        X = joblib.load(samples_path)
-        y = joblib.load(targets_path)
+        X, y = _refresh_cache([samples_path, targets_path], 9)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

-        X = joblib.load(samples_path)
-        y = joblib.load(targets_path)
+        X, y = _refresh_cache([samples_path, targets_path], 0)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

@@ -107,7 +108,9 @@ def fetch_olivetti_faces(data_home=None, shuffle=False, random_state=0,
         joblib.dump(faces, filepath, compress=6)
         del mfile
     else:
-        faces = joblib.load(filepath)
+        faces = _refresh_cache([filepath], 6)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

-        X = joblib.load(samples_path)
-        sample_id = joblib.load(sample_id_path)
+        X, sample_id = _refresh_cache([samples_path, sample_id_path], 9)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

-        y = joblib.load(sample_topics_path)
-        categories = joblib.load(topics_path)
+        y, categories = _refresh_cache([sample_topics_path, topics_path], 9)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

@@ -259,6 +260,8 @@ def fetch_species_distributions(data_home=None,
                                        **extra_params)
         joblib.dump(bunch, archive_path, compress=9)
     else:
-        bunch = joblib.load(archive_path)
+        bunch = _refresh_cache([archive_path], 9)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

glemaitre (Member):

Otherwise LGTM

adrinjalali (Member, Author):

Codecov doesn't seem to be properly analyzing this PR. Otherwise, would you please check the what's new entry and let me know if that's what you thought, @glemaitre?

jnothman (Member) left a comment:

You approve, @glemaitre?

doc/whats_new/v0.21.rst (review thread, outdated, resolved)
:mod:`sklearn.datasets`
.......................

- |Fix| :func:`fetch_california_housing`, :func:`fetch_covtype`,
Member:

Links are missing.

Member:

I think this may just work without the :func: role. Otherwise you can use the ~ prefix.

adrinjalali (Member, Author):

They're resolved now.

jnothman (Member):

Is this good to merge?

glemaitre merged commit da66111 into scikit-learn:master on Jul 12, 2019
glemaitre (Member):

Thanks @adrinjalali


Successfully merging this pull request may close these issues.

Better warning message with deprecation of sklearn.externals._joblib