[MRG+2] use distinct filenames for pickled datasets under Python 2 and 3 #5355

ogrisel · 2015-10-07T09:00:04Z

This is a fix for #3596.

Note that I changed the covtype loader to be consistent with the others even though that might re-trigger a useless download once.

ogrisel · 2015-10-07T09:14:25Z

Hum, I will fix the issue under Python 2.

ogrisel · 2015-10-07T10:25:11Z

The what's new entry is conflicting. As it's going to change a lot today, I will do a final rebase once the tests are all green and reviewers are ok with the change.

jnothman · 2015-10-07T10:31:32Z

Is it possible to just use the lowest pickle version setting? Or is the issue that the constructors differ?

Should we anticipate different Py3 versions producing mutually incompatible pickles? Someone may have multiple versions of Py3 installed in different venvs but share the same default data dir.

TomDLT · 2015-10-07T10:38:17Z

LGTM

ogrisel · 2015-10-07T10:50:25Z

> Is it possible to just use the lowest pickle version setting? Or is the issue that the constructors differ?

This is what @lesteve tried to do in joblib, but there are still issues: the numpy pickler used in some cases (small arrays with compression) represents the buffer content with a str instance under Python 2 (which is fine), but a str instance under Python 3 implies decoding. Those values should be unpickled as bytes under Python 3, but only the str instances that represent the binary content of a numpy buffer. This is very complex to get right and I feel it is going to be a serious maintenance headache.
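The str/bytes problem described above can be illustrated with a minimal sketch that uses only the standard `pickle` module under Python 3. The byte string `PY2_PICKLE` is hand-written to mimic what Python 2 emits for a `str` at protocol 2; it is an assumption made for the demo and is not taken from joblib's numpy pickler.

```python
import pickle

# Hand-written protocol-2 pickle of a Python 2 str holding raw buffer bytes:
# PROTO 2, SHORT_BINSTRING with a 5-byte payload, STOP.
PY2_PICKLE = b"\x80\x02U\x05\xff\x00abc."


def load_py2_str(data, encoding=None):
    """Unpickle under Python 3, optionally choosing how Python 2 str is read."""
    if encoding is None:
        return pickle.loads(data)  # default: decode Python 2 str as ASCII text
    return pickle.loads(data, encoding=encoding)


try:
    load_py2_str(PY2_PICKLE)
    default_ok = True
except UnicodeDecodeError:
    # The payload is binary, so decoding it as ASCII text fails on 0xff.
    default_ok = False

# Keep every Python 2 str as bytes (correct for buffers, wrong for real text).
as_bytes = load_py2_str(PY2_PICKLE, encoding="bytes")

# Decode every Python 2 str as latin1 text (never fails, but yields str).
as_latin1 = load_py2_str(PY2_PICKLE, encoding="latin1")
```

Note that `encoding="bytes"` applies to every Python 2 str in the pickle, including genuine text, which is why only the buffer-holding strings would need special treatment.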

I prefer to make it explicit that we expect pickles not to be compatible across versions of Python.
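As a rough sketch of the idea, a small path helper can append a suffix to the pickle filename when running under Python 3, so that each major version reads and writes its own cache file. The helper name `pkl_filepath`, the `_py3` suffix and the example paths below are assumptions chosen for illustration; they are not necessarily what this PR merged.

```python
import sys
from os.path import join, splitext


def pkl_filepath(*path_parts, **kwargs):
    """Join path parts, tagging the file basename with a suffix on Python 3.

    The suffix is read from **kwargs so the same code also parses on Python 2.
    """
    py3_suffix = kwargs.get("py3_suffix", "_py3")
    basename, ext = splitext(path_parts[-1])
    if sys.version_info[0] >= 3:
        basename += py3_suffix
    return join(*(path_parts[:-1] + (basename + ext,)))


# Hypothetical cache location, for illustration only.
cache_path = pkl_filepath("scikit_learn_data", "covertype", "samples.pkl")
```

With this layout a Python 2 and a Python 3 environment can share the same data directory without one trying to unpickle the other's file, at the cost of caching each dataset twice.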

ogrisel · 2015-10-07T10:51:28Z

> Should we anticipate different Py3 versions producing mutually incompatible pickles? Someone may have multiple versions of Py3 installed in different venvs but share the same default data dir.

That should not be a problem as the str type behavior should not change again under Python 3.

jnothman · 2015-10-07T11:31:43Z

Okay.


jnothman · 2015-10-07T11:36:50Z

sklearn/datasets/species_distributions.py

@@ -72,9 +72,8 @@ def _load_coverage(F, header_length=6, dtype=np.int16):
    header = dict([make_tuple(line) for line in header])

    M = np.loadtxt(F, dtype=dtype)
-    nodata = header[b'NODATA_value']
+    nodata = int(header[b'NODATA_value'])


jnothman: What's going on here? Does this fix make a practical difference that should be recorded as a bug fix?

ogrisel: It fixes a numpy warning about indexing with float values. The value is 128.0. I do not really understand this code, but the change should not impact the behavior besides silencing the warning.

jnothman: Ah, of course.
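The warning discussed in this review thread can be reproduced with a minimal sketch. It assumes, for illustration, that the float value ends up being used as an array index; the array shape and the `float_index_is_rejected` helper are hypothetical. Depending on the NumPy version, a float index either triggers a deprecation warning or is refused with an `IndexError`, so the sketch treats both as a rejection.

```python
import warnings

import numpy as np

M = np.zeros((200, 3), dtype=np.int16)  # stand-in for the loaded grid
nodata = 128.0  # value reported in the thread


def float_index_is_rejected(arr, idx):
    """Return True if NumPy warns about or refuses the given index."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # escalate warnings to exceptions
        try:
            arr[idx]
        except (IndexError, Warning):
            return True
    return False


rejected = float_index_is_rejected(M, nodata)

# Casting to int first selects the same row without any complaint.
M[int(nodata)] = -9999
```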

jnothman · 2015-10-07T12:44:26Z

LGTM

ogrisel · 2015-10-07T13:04:33Z

Ok I will merge by rebase then.

jnothman · 2015-10-07T13:08:05Z

Ta

amueller · 2015-10-12T22:25:27Z

Maybe it would also be a good idea to include the joblib and scikit-learn versions in the files. That might belong more inside the pickle, though? We have already had some issues with changes in scikit-learn.

ogrisel force-pushed the py3-datasets branch from f2853cb to c5317da October 7, 2015 09:01

ogrisel added this to the 0.17 milestone Oct 7, 2015

ogrisel added 8 commits October 7, 2015 11:17

ENH utility to have distinct dataset .pkl filenames

58335f9

FIX separate filenames for RCV1

c821cfa

FIX separate filenames for species distributions

4f03d2f

FIX separate filenames covtype

b9ced33

FIX separate filenames for 20 newsgroups

0b6dcac

FIX separate filenames for california housing

4bfb273

FIX separate filenames for Olivetti faces

bc1768e

DOC what's new entry for dataset fetchers

7c33b3d

ogrisel force-pushed the py3-datasets branch from c5317da to 7c33b3d October 7, 2015 09:22

jnothman reviewed Oct 7, 2015

TomDLT changed the title ~~[MRG] use distinct filenames for pickled datasets under Python 2 and 3~~ [MRG+2] use distinct filenames for pickled datasets under Python 2 and 3 Oct 7, 2015

ogrisel closed this Oct 7, 2015

ogrisel mentioned this pull request Oct 7, 2015

Python 2 / 3 incompatibility when fetching joblib compressed datasets #3596

Closed

jakirkham mentioned this pull request May 22, 2017

Iiboost recipe conda-forge/staged-recipes#2931

Closed
