New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Take over PR #7647 and #7627- Add a "filename" attribute to datasets that have a CSV file #9669

Closed
wants to merge 18 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@tobiasgaertner

tobiasgaertner commented Sep 1, 2017

Reference Issue

Fixes #7627
Fixes #7647

(original PR description)

  • As per issue description, this change provide filename for load_iris, load_boston, load_breast_cancer
  • data_filename and target_filenames are added to load_diabetes, load_linnerud
  • fix PEP8 issues in dataset/base.py and relevant test file

Any other comments?

  • Taking over old PR
  • Thank you to Loic for his support
@lesteve

This comment has been minimized.

Member

lesteve commented Sep 1, 2017

I think if you add a commit, the fix in paster will be picked up and this will pass.

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Sep 2, 2017

Remove Take over PR #7647 and #7627 from the title. Put it inside the description.
Prepend [WIP] if you think that you still have work to do on the PR and this is not ready for merge and [MRG] if you think this is ready for couple of reviews

@glemaitre

@lesteve Would you add an integration test to check that filename contains the supposed CSV file inside the string or unit testing in base is enough?

>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> print(boston.filename) # doctest: +SKIP

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

not sure that we need to skip. you could use ELLIPSIS instead.

>>> from sklearn.datasets import load_diabetes
>>> diabetes = load_diabetes()
>>> print(diabetes.data_filename) # doctest: +SKIP
(some-path)/sklearn/datasets/data/diabetes_data.csv.gz

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

ELLIPSIS instead of SKIP

>>> print(diabetes.data_filename) # doctest: +SKIP
(some-path)/sklearn/datasets/data/diabetes_data.csv.gz
>>> print(diabetes.target_filename) # doctest: +SKIP
(some-path)/sklearn/datasets/data/diabetes_target.csv.gz

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

ELLIPSIS instead of SKIP

``predict()`` returns a 2d array with multiple predicted labels for each instance.

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

I don't see the diff here

>>> import numpy as np
>>> boston_data = np.loadtxt(boston.filename, delimiter=",", skiprows=2)
>>> boston.data.shape # sklearn dataset

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

I would be a bit more explicit be cause I did not see it at first

>>> # in scikit-learn the data and target are stored separately
>>> boston.data.shape  # data
(506, 13)
>>> boston.target.shape  # target
(506,)
>>> # when reading using numpy, the data and target are concatenated
>>> boston_data.shape
(506, 14)
@@ -416,7 +419,8 @@ def load_breast_cancer(return_X_y=False):
'data', the data to learn, 'target', the classification labels,
'target_names', the meaning of the labels, 'feature_names', the
meaning of the features, and 'DESCR', the
full description of the dataset.
full description of the dataset, 'filename' (added in version 0.20),
the physical location of breast cancer csv dataset.

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

I am not in favor of (added in version 0.20) in the narrative.

I would add a directive on the bottom of data

        the physical location of iris csv dataset.

        .. versionadded:: 0.20
           The ``filename`` attribute.
'data', the data to learn, 'target', the regression target for each
sample, 'data_filename' (added in version 0.20), the physical location
of diabetes data csv dataset, and 'target_filename' (added in
version 0.20), the physical location of diabetes targets csv datataset.

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

I am not in favor of (added in version 0.20) in the narrative.

I would add a directive on the bottom of data

        the physical location of iris csv dataset.

        .. versionadded:: 0.20
           The ``filename`` attribute.
In addition, you will also have access to 'data_filename'
(added in version 0.20), the physical location of linnerud data csv
dataset, and 'target_filename' (added in version 0.20), the
physical location of linnerud targets csv datataset.

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

I am not in favor of (added in version 0.20) in the narrative.

I would add a directive on the bottom of data

        the physical location of iris csv dataset.

        .. versionadded:: 0.20
           The ``filename`` attribute.
In addition, you will also have access to 'data_filename'
(added in version 0.20), the physical location of linnerud data csv
dataset, and 'target_filename' (added in version 0.20), the
physical location of linnerud targets csv datataset.
(data, target) : tuple if ``return_X_y`` is True
.. versionadded:: 0.18
"""
base_dir = join(dirname(__file__), 'data/')

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

'data/' -> 'data'

and 'DESCR', the full description of the dataset.
'DESCR', the full description of the dataset,
and 'filename' (added in version 0.20), the physical location
of boston csv dataset.

This comment has been minimized.

@glemaitre

glemaitre Sep 2, 2017

Contributor

I am not in favor of (added in version 0.20) in the narrative.

I would add a directive on the bottom of data

        the physical location of iris csv dataset.

        .. versionadded:: 0.20
           The ``filename`` attribute.
@lesteve

This comment has been minimized.

Member

lesteve commented Dec 5, 2017

Closed by #9101.

@lesteve lesteve closed this Dec 5, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment