
FIX introduce a refresh_cache param to fetch_... functions. #14197

Merged (21 commits) on Jul 12, 2019

Conversation

adrinjalali (Member):

This kinda fixes #14177.

This does not change the warning message. But it introduces a refresh_cache parameter to the fetch_... functions that download and persist the data using joblib.

The proposal is to add this parameter, with some variation after reviews:

+    refresh_cache : str or bool, optional (default='joblib')
+        - ``True``: remove the previously downloaded data and fetch it again.
+        - ``'joblib'``: only re-fetch the data if the previously downloaded
+          data was persisted using the previously vendored `joblib`.
+        - ``False``: do not re-fetch the data.
+
+        From version 0.23, ``'joblib'`` as an input value will be ignored and
+        treated as ``False``.
+
+        .. versionadded:: 0.21.3
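
For illustration, a minimal sketch (not part of this PR) of how a fetch_* function might act on such a parameter; the fetcher, the vendored-pickle check, and the download stand-in below are all hypothetical:

import os
import joblib

def fetch_example(filepath, refresh_cache='joblib'):
    # Hypothetical fetcher illustrating the proposed parameter.
    def looks_like_vendored_pickle(path):
        # A real check would inspect the pickle for references to
        # ``sklearn.externals.joblib``; always False in this sketch.
        return False

    if os.path.exists(filepath):
        if refresh_cache is True:
            os.remove(filepath)  # always re-fetch
        elif refresh_cache == 'joblib' and looks_like_vendored_pickle(filepath):
            os.remove(filepath)  # re-fetch only old vendored-joblib pickles
        # refresh_cache=False: leave the existing cache untouched

    if not os.path.exists(filepath):
        data = {"example": [1, 2, 3]}  # stand-in for the real download
        joblib.dump(data, filepath, compress=6)
    return joblib.load(filepath)

With refresh_cache=False the cache is never touched, which covers the read-only case discussed below.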

I've changed only one of the dataset files, will fix the others once we agree on a solution.

@glemaitre @jnothman

adrinjalali added this to the 0.21.3 milestone on Jun 26, 2019
adrinjalali (Member, Author):

Also, with this change, I think we can remove the if not hasattr(sys, "_is_pytest_session"): check from joblib/__init__.py.

qinhanmin2014 (Member):

Is it possible to only introduce a function (i.e., refresh_cache)? If we introduce a parameter, we'll need a deprecation cycle to remove it.
I'm not able to run experiments, but it seems that some other functions also suffer from this issue (e.g., fetch_20newsgroups_vectorized)

adrinjalali (Member, Author):

I'm not able to run experiments, but it seems that some other functions also suffer from this issue (e.g., fetch_20newsgroups_vectorized)

Yes, as I said: I've changed only one of the dataset files, will fix the others once we agree on a solution.

Is it possible to only introduce a function (i.e., refresh_cache)? If we introduce a parameter, we'll need a deprecation cycle to remove it.

Yes, as Joel suggested:

I would be okay to detect this case and re-cache over a deprecation period... (but we have to also allow for the cache not being writable, I think.)

This also allows the user to set the parameter to False, for the case when the data is read-only.

qinhanmin2014 (Member):

Yes, as Joel suggested:

What's Joel's suggestion? I saw something like "I would be okay to detect this case and re-cache over a deprecation period". Does that mean that we need to download/construct the dataset every time we use it?

jnothman (Member):

I had mostly just intended this to happen automatically without a parameter to control it, and with that functionality disappearing after a couple of versions.

adrinjalali (Member, Author):

I had mostly just intended this to happen automatically without a parameter to control it, and with that functionality disappearing after a couple of versions.

The issue is when users don't have a reliable or easy connection on those systems: the functionality would delete the files while they can't easily re-download them. Although with the default value here the same thing happens. I can remove the parameter and have the deprecation cycle done completely silently, if that's what you prefer. I find the parameter useful anyway, even without the 'joblib' option.

jnothman (Member) commented Jun 27, 2019 via email.

qinhanmin2014 (Member):

Thinking more about it, it seems difficult to solve the issue by only introducing a function to refresh the cache, because it's difficult to determine which files to remove. So I'm OK with this solution (i.e., adding a parameter). If we don't like it later, we can deprecate it.
Why do we need the joblib option? True and False seem enough?

qinhanmin2014 (Member):

If I understand correctly, users can still use the datasets they pickled previously (e.g., before sklearn 0.21), right? So it doesn't seem good to download/construct the datasets again by default?

jnothman (Member) commented Jun 27, 2019 via email.

adrinjalali (Member, Author):

I don't see why we should download again here at all... Shouldn't we just unpickle (with sklearn.externals.joblib) and re-pickle (with joblib)?

Didn't think of that, will do.
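
A minimal sketch of that idea, assuming the old pickle still loads through plain joblib and that the cache directory may be read-only (as Joel notes above):

import joblib

def recache(path, compress=6):
    # Loading goes through plain joblib; pickles written by the vendored
    # shim are what trigger the deprecation warning on load.
    data = joblib.load(path)
    try:
        # Re-persist with the real joblib so future loads are warning-free.
        joblib.dump(data, path, compress=compress)
    except IOError:
        pass  # the cache may be read-only; keep the old file
    return data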

jnothman (Member) left a comment:

Yes, that's more what I was thinking ;)

qinhanmin2014 (Member):

I'll vote +1 for this solution.

@@ -919,3 +920,25 @@ def _fetch_remote(remote, dirname=None):
                      "file may be corrupted.".format(file_path, checksum,
                                                      remote.checksum))
     return file_path


def _refresh_cache(path):
Member:

You put this function in base.py, but isn't it designed only for fetch_covtype?

adrinjalali (Member, Author):

It still suppresses the warning if the folder/file is read-only. Trying to fix that.

qinhanmin2014 (Member):

Are you replying to my comment? I mean, things like

samples_path = _pkl_filepath(path, "samples")
targets_path = _pkl_filepath(path, "targets")

won't work for other functions (e.g., fetch_20newsgroups_vectorized)

adrinjalali (Member, Author):

I'll see what I need to do once I check the other instances that need to be fixed; for now I'm still working on this one. It was not a response to your comment, it was a response to @jnothman's comment regarding handling the case where I/O fails.

adrinjalali (Member, Author):

This now fixes all the instances of the issue, except for sklearn/datasets/lfw.py, which is kinda more complicated.

The codecov check fails. I've tested all of the examples, both with old persisted data being converted to the new format and for the case when the files are read-only. But I don't think we can easily test them (unless we do some monkeypatching; not sure if it's worth it). How does this look, @jnothman?

adrinjalali (Member, Author):

Codecov doesn't seem to be working properly on this patch anyway. I suppose it's kinda ready now.

sklearn/datasets/base.py (review thread, outdated, resolved)

    other_warns = [w for w in warns if not str(w.message).startswith(msg)]
    for w in other_warns:
        warnings.warn(message=w.message, category=w.category)
Member:

Ideally you should also be using the original module and line number. Is there an alternative function better suited to issuing the warning?

adrinjalali (Member, Author):

There's warnings.warn_explicit, but it didn't actually end up showing the warning, and I couldn't fix it in 5 minutes, so I switched back to warn.
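
For reference, a hedged sketch of re-issuing captured warnings with their original location via warnings.warn_explicit; its registry machinery can deduplicate repeats, which is one plausible reason the warning didn't show up:

import warnings

# Capture a warning, then re-issue it with its original file/line intact.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("example", UserWarning)

for w in caught:
    warnings.warn_explicit(
        message=w.message,    # passing the Warning instance keeps its text
        category=w.category,
        filename=w.filename,  # original location, unlike plain warn()
        lineno=w.lineno,
        registry=None,        # a populated registry can suppress repeats
    )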

jnothman (Member):

Add a test refreshing a pretend pickle?

adrinjalali (Member, Author):

Add a test refreshing a pretend pickle?

Not really testing the pickle itself, but testing different routes through the warnings now.

    return 0

def _load_warn_unrelated(*args, **kwargs):
    warnings.warn("unrelated warning", UserWarning)
Member:

For a tighter test, this could be a DeprecationWarning.
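
Sketched out, such a test might look like this; _refresh_cache_stub is a stand-in reconstruction of the PR's helper (the deprecation message text is assumed), not the actual implementation:

import warnings
import pytest

def _refresh_cache_stub(load):
    # Stand-in for the PR's helper: capture warnings raised while loading,
    # then re-issue any that are unrelated to the vendored-joblib message.
    msg = "sklearn.externals.joblib is deprecated"
    with warnings.catch_warnings(record=True) as warns:
        warnings.simplefilter("always")
        data = load()
    for w in warns:
        if not str(w.message).startswith(msg):
            warnings.warn(message=w.message, category=w.category)
    return data

def test_unrelated_warning_is_reraised():
    def _load_warn_unrelated():
        warnings.warn("unrelated warning", UserWarning)
        return 0
    with pytest.warns(UserWarning, match="unrelated warning"):
        assert _refresh_cache_stub(_load_warn_unrelated) == 0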

glemaitre (Member) left a comment:

Could you add an entry in the what's new? We will need it to remind us to make the changes (it was quite useful when I wanted to remove deprecated code). I don't know if we should have a specific section, like maintenance, to group things like deprecations or these types of changes.



def _refresh_cache(files, compress):
    # REMOVE in v0.23
Member:

The TODO marker would be easier to catch:

Suggested change:
-    # REMOVE in v0.23
+    # TODO: REMOVE in v0.23

    warnings.warn(message=message, category=DeprecationWarning)

    if len(data) == 1:
        return data[0]
Member:

One-liner:

return data[0] if len(data) == 1 else data

@@ -129,7 +130,9 @@ def fetch_california_housing(data_home=None, download_if_missing=True,
         remove(archive_path)

     else:
-        cal_housing = joblib.load(filepath)
+        cal_housing = _refresh_cache([filepath], 6)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

-        X = joblib.load(samples_path)
-        y = joblib.load(targets_path)
+        X, y = _refresh_cache([samples_path, targets_path], 9)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

-        X = joblib.load(samples_path)
-        y = joblib.load(targets_path)
+        X, y = _refresh_cache([samples_path, targets_path], 0)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

@@ -107,7 +108,9 @@ def fetch_olivetti_faces(data_home=None, shuffle=False, random_state=0,
         joblib.dump(faces, filepath, compress=6)
         del mfile
     else:
-        faces = joblib.load(filepath)
+        faces = _refresh_cache([filepath], 6)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

-        X = joblib.load(samples_path)
-        sample_id = joblib.load(sample_id_path)
+        X, sample_id = _refresh_cache([samples_path, sample_id_path], 9)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

-        y = joblib.load(sample_topics_path)
-        categories = joblib.load(topics_path)
+        y, categories = _refresh_cache([sample_topics_path, topics_path], 9)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

@@ -259,6 +260,8 @@ def fetch_species_distributions(data_home=None,
                                        **extra_params)
         joblib.dump(bunch, archive_path, compress=9)
     else:
-        bunch = joblib.load(archive_path)
+        bunch = _refresh_cache([archive_path], 9)
+        # Revert to the following two lines in v0.23
Member:

Suggested change:
-        # Revert to the following two lines in v0.23
+        # TODO: Revert to the following two lines in v0.23

glemaitre (Member):

Otherwise LGTM

adrinjalali (Member, Author):

Codecov doesn't seem to be properly analyzing this PR. Otherwise, would you please check the what's new entry and let me know if that's what you thought, @glemaitre?

jnothman (Member) left a comment:

You approve, @glemaitre?

doc/whats_new/v0.21.rst (review thread, outdated, resolved)
:mod:`sklearn.datasets`
.......................

- |Fix| :func:`fetch_california_housing`, :func:`fetch_covtype`,
Member:

Links are missing.

Member:

I think this may just work without the :func: role. Otherwise you can use the ~ prefix.

adrinjalali (Member, Author):

They're resolved now.

jnothman (Member):

Is this good to merge?

glemaitre merged commit da66111 into scikit-learn:master on Jul 12, 2019
glemaitre (Member):

Thanks @adrinjalali


Successfully merging this pull request may close these issues.

Better warning message with deprecation of sklearn.externals._joblib