New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetch_20newsgroups 'remove' argument (Python 2.7.11) #6196

Closed
boskaiolo opened this Issue Jan 20, 2016 · 6 comments

Comments

Projects
None yet
3 participants
@boskaiolo

boskaiolo commented Jan 20, 2016

On Python 2.7.11, the remove argument of the fetch_20newsgroups method doesn't work.
Here's an example (you can change '10' with another index, the problem appear again):

from sklearn.datasets import fetch_20newsgroups
print fetch_20newsgroups(shuffle=False, remove=('headers', 'footers', 'quotes')).data[10]

Although the removal of headers, footers and quotes is set, this is the output:

From: maler@vercors.imag.fr (Oded Maler)
Subject: Re: FLAME and a Jewish home in Palestine
Nntp-Posting-Host: pelvoux
Organization: IMAG, University of Grenoble, France
Lines: 40

In article <C5HJBC.1HC@bony1.bony.com>, jake@bony1.bony.com (Jake Livni) writes:
|> In article <1993Apr13.172422.2407@newshub.ariel.yorku.ca> nabil@ariel.yorku.ca (Nabil Gangi) writes:
|>
|> >According to Exodus, there were 600,000 Jews that marched out of Egypt.
|>
|> This is only the number of adult males.  The total number of Jewish
|> slaves leaving Egypt was much larger.
|>
|> >The number which could have arrived to the Holy Lands must have been
|> >substantially less ude to the harsh desert and the killings between the
|> >Jewish tribes on the way..
|> >
|> >NABIL
|>
|> Typical Arabic thinking.  If we are guilty of something, so is
|> everyone else.  Unfortunately for you, Nabil, Jewish tribes are not
|> nearly as susceptible to the fratricidal murdering that is still so
|> common among Arabs in the Middle East.  There were no " killings
|> between the Jewish tribes on the way."

I don't like this comment about "Typical" thinking. You could state
your interpretation of Exodus without it. As I read Exodus I can see
a lot of killing there, which is painted by the author of the bible
in ideological/religious colors. The history in the desert can be seen
as an ethos of any nomadic people occupying a land. That's why I think
it is a great book with which descendants Arabs, Turks and Mongols can
unify as well.


|> Jake
|> --
|> Jake Livni  jake@bony1.bony.com           Ten years from now, George Bush will
|> American-Occupied New York                   have replaced Jimmy Carter as the
|> My opinions only - employer has no opinions.    standard of a failed President.

--
===============================================================
Oded Maler, LGI-IMAG, Bat D, B.P. 53x, 38041 Grenoble, France
Phone:  76635846     Fax: 76446675      e-mail: maler@imag.fr
===============================================================

Problem doesn't appear in Python 3.5.1

@lesteve

This comment has been minimized.

Member

lesteve commented Jan 20, 2016

I can reproduce. Looks like this bug was introduced in scikit-learn 0.17 and is still in master.

@GaelVaroquaux GaelVaroquaux added the Bug label Jan 20, 2016

@lesteve

This comment has been minimized.

Member

lesteve commented Jan 25, 2016

Actually @boskaiolo can you try removing ~/scikit_learn_data/20news-bydate.pkz and rerunning your snippet? It seems like this fixes the problem for me.

From some little investigation I believe this is related to #4600 that changes the Bunch class. So basically if you start from a fresh ~/scikit_learn_data, run your snippet with scikit-learn 0.16.1 (meaning you download the data) and then run your snippet with 0.17 (meaning you load the data from the saved joblib pickle) you'll see the problem. If you download with 0.17 everything works fine.

@boskaiolo

This comment has been minimized.

boskaiolo commented Jan 25, 2016

I can confirm: after removing the file, downloaded with scikit-learn (<0.17), the same code works.

@lesteve

This comment has been minimized.

Member

lesteve commented Jan 26, 2016

I reckon this one should be closed. This looks like a gotcha that we will have to live with when people transition from 0.16 to 0.17 I am afraid ...

@lesteve

This comment has been minimized.

Member

lesteve commented Jan 26, 2016

For completeness, here is a snippet I use to investigate more. As noted in #4600 (comment) there is some interesting interaction between the changes in the Bunch class and what happens when we load a Bunch from a pickle which was generated before theses changes.

import pickle

import sklearn
from sklearn.datasets.base import Bunch


def write():
    filename = '/tmp/test_{}.pkl'.format(sklearn.__version__)
    bunch = Bunch(key='original')
    pickle.dump(bunch, open(filename, 'wb'))


def read():
    print('Bunch pickles read with scikit-learn {}\n{}'.format(
        sklearn.__version__, 80*'-'))
    for sklearn_version_used_for_pickling in ('0.16.1', '0.17'):
        print('Bunch pickled with scikit-learn version: {}'.format(
            sklearn_version_used_for_pickling))
        filename = '/tmp/test_{}.pkl'.format(
            sklearn_version_used_for_pickling)
        bunch = pickle.load(open(filename, 'rb'))
        bunch.key = 'changed'
        print('bunch.key: {}'.format(bunch.key))
        print("bunch['key']: {}".format(bunch['key']))


# write()
read()

You'll need to write the pickles for both sklearn 0.16 and 0.17 first.

Output for sklearn 0.16:

Bunch pickles read with scikit-learn 0.16.1
--------------------------------------------------------------------------------
Bunch pickled with scikit-learn version: 0.16.1
bunch.key: changed
bunch['key']: original
Bunch pickled with scikit-learn version: 0.17
bunch.key: changed
bunch['key']: original

This was the bug fixed in #4600: when the Bunch is loaded from a pickle changing the attribute of the bunch doesn't change the value associated to the key.

Output for sklearn 0.17:

Bunch pickles read with scikit-learn 0.17
--------------------------------------------------------------------------------
Bunch pickled with scikit-learn version: 0.16.1
bunch.key: original
bunch['key']: changed
Bunch pickled with scikit-learn version: 0.17
bunch.key: changed
bunch['key']: changed

In the case "Bunch pickled with 0.16 and read with 0.17" the huge WTF is that you do bunch.key = 'changed' but then you still have bunch.key = 'original' ! This is pretty much the problem happening in this issue. Note the problem is fixed when the bunch has been pickled with 0.17 and read with 0.17.

@lesteve

This comment has been minimized.

Member

lesteve commented Jan 26, 2016

I understood the problem a bit more: 0.16 Bunch objects have a non empty __dict__. This non empty __dict__ interferes with the 0.17 Bunch implementation of __setattr__: reading bunch.key uses __dict__ whereas assigning into bunch.key uses __setattr__ which only changes self[key].

I opened a PR with a fix.

For completeness here is the Bunch class code for 0.16.1 and 0.17.

lesteve added a commit to lesteve/scikit-learn that referenced this issue Jan 26, 2016

ogrisel added a commit that referenced this issue Jan 28, 2016

glemaitre added a commit to glemaitre/scikit-learn that referenced this issue Feb 13, 2016

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this issue Feb 19, 2016

Merge tag '0.17.1' into releases
* tag '0.17.1': (29 commits)
  Release 0.17.1
  MAINT remove non-existing cache folder in 0.17.X branch
  FIX cythonize TSNE
  MAINT simplify freeing logic for Barnes-Hut SNE memory leak fix
  Fix memory leak in Barnes-Hut SNE
  FIX check_build_doc.py false positive detections
  MAINT more informative output to circle/check_build_doc.py
  FIX fetch_california_housing
  FIX in randomized_svd flip sign
  Updated examples and tests that use scipy's lena
  DOC whats_new entry for scikit-learn#6258
  fix joblib error in LatentDirichletAllocation
  MAINT fix / speedup travis on 0.17.X
  MAINT Upgrade pip in appveyor and display version
  DOC missing changelog entry for scikit-learn#5857
  DOC add fix for scikit-learn#6147 to the changelog
  FIX 6147: ensure that AUC is always a float
  TST non-regression test for scikit-learn#6147, roc_auc on memmap data
  Added changelog entry about scikit-learn#6196
  Fix reading of bunch pickles
  ...

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this issue Feb 19, 2016

Merge branch 'releases' into dfsg
* releases: (29 commits)
  Release 0.17.1
  MAINT remove non-existing cache folder in 0.17.X branch
  FIX cythonize TSNE
  MAINT simplify freeing logic for Barnes-Hut SNE memory leak fix
  Fix memory leak in Barnes-Hut SNE
  FIX check_build_doc.py false positive detections
  MAINT more informative output to circle/check_build_doc.py
  FIX fetch_california_housing
  FIX in randomized_svd flip sign
  Updated examples and tests that use scipy's lena
  DOC whats_new entry for scikit-learn#6258
  fix joblib error in LatentDirichletAllocation
  MAINT fix / speedup travis on 0.17.X
  MAINT Upgrade pip in appveyor and display version
  DOC missing changelog entry for scikit-learn#5857
  DOC add fix for scikit-learn#6147 to the changelog
  FIX 6147: ensure that AUC is always a float
  TST non-regression test for scikit-learn#6147, roc_auc on memmap data
  Added changelog entry about scikit-learn#6196
  Fix reading of bunch pickles
  ...

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this issue Feb 19, 2016

Merge branch 'dfsg' into debian
* dfsg: (29 commits)
  Release 0.17.1
  MAINT remove non-existing cache folder in 0.17.X branch
  FIX cythonize TSNE
  MAINT simplify freeing logic for Barnes-Hut SNE memory leak fix
  Fix memory leak in Barnes-Hut SNE
  FIX check_build_doc.py false positive detections
  MAINT more informative output to circle/check_build_doc.py
  FIX fetch_california_housing
  FIX in randomized_svd flip sign
  Updated examples and tests that use scipy's lena
  DOC whats_new entry for scikit-learn#6258
  fix joblib error in LatentDirichletAllocation
  MAINT fix / speedup travis on 0.17.X
  MAINT Upgrade pip in appveyor and display version
  DOC missing changelog entry for scikit-learn#5857
  DOC add fix for scikit-learn#6147 to the changelog
  FIX 6147: ensure that AUC is always a float
  TST non-regression test for scikit-learn#6147, roc_auc on memmap data
  Added changelog entry about scikit-learn#6196
  Fix reading of bunch pickles
  ...

mannby pushed a commit to mannby/scikit-learn that referenced this issue Apr 22, 2016

TomDLT added a commit to TomDLT/scikit-learn that referenced this issue Oct 3, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment