Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EHN Add parameter as_frame to fetch_covtype #17491

Merged

Conversation

tianchuliang
Copy link
Contributor

Reference Issues/PRs

Reference: #10733

What does this implement/fix? Explain your changes.

This adds as_frame argument for covtype dataset loader.

Any other comments?

@amueller
Copy link
Member

amueller commented Jun 6, 2020

Thanks for the PR. Generally looks good, please check the linting error.

@tianchuliang
Copy link
Contributor Author

@amueller yep, will resolve linter's complain :)

@reshamas
Copy link
Member

reshamas commented Jun 6, 2020

#DataUmbrella sprint

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @tianchuliang

@@ -80,6 +82,8 @@ def fetch_covtype(*, data_home=None, download_if_missing=True,
If True, returns ``(data.data, data.target)`` instead of a Bunch
object.

as_frame : boolean, default=False.
If True, returns ``pandas.DataFrame`` instead of a Bunch object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the result will still be a bunch object, it's the X and y which are a DataFrame instead of an ndarray.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the result is a bunch object, I added .frame attribute to the bunch object. The frame attribute is a combined dataframe of X and y, with properly specified columns. Please see discussion #10733 for reference, this is the design from that thread of discussion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but here you say If True, returns ``pandas.DataFrame`` instead of a Bunch object which is not correct.

You can see the documentation and the implementation https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_base.py#L297 as an example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, got it. will adjust, thanks!

@@ -145,4 +152,32 @@ def fetch_covtype(*, data_home=None, download_if_missing=True,
if return_X_y:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the if as_frame part should be before this if.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjusted accordingly, thanks for pointing it out.

@adrinjalali
Copy link
Member

You may have not pushed your changes to this branch yet. Once you do I can give another review :)

@tianchuliang
Copy link
Contributor Author

@adrinjalali yep, it's been a busy week this week; but I will look at the codecov issue this weekend and try to get it pass all checks.

@amueller
Copy link
Member

the codecov error seems strange and unrelated, or I'm not looking at the right thing. Maybe don't worry about this one for now and just address the other comments.

@tianchuliang
Copy link
Contributor Author

@amueller just pushed changes to address previous comments; the codecov error is about some of the if statements not being fully tested out. It's a bit tedious to add in all the tests, let me know if they needed to be done.

@glemaitre glemaitre changed the title #10733 fetch covtype return df EHN Add parameter as_frame to fetch_covtype Jun 16, 2020
appropriate dtypes (numeric). The target is
a pandas DataFrame or Series depending on the number of target columns.
If `return_X_y` is True, then (`data`, `target`) will be pandas
DataFrames or Series as described below.
.. versionadded:: 0.20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version added was part of the previous parameter. You should add another versionadded for this parameter

.. versionadded:: 0.24

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

@@ -98,6 +106,9 @@ def fetch_covtype(*, data_home=None, download_if_missing=True,

(data, target) : tuple if ``return_X_y`` is True

dataframe: :class: `pandas.DataFrame`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be added under DESCPR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE.

@@ -98,6 +106,9 @@ def fetch_covtype(*, data_home=None, download_if_missing=True,

(data, target) : tuple if ``return_X_y`` is True

dataframe: :class: `pandas.DataFrame`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our guideline would say to write

frame : dataframe of shape (581012, 53)

@@ -98,6 +106,9 @@ def fetch_covtype(*, data_home=None, download_if_missing=True,

(data, target) : tuple if ``return_X_y`` is True

dataframe: :class: `pandas.DataFrame`
Pandas dataframe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Pandas dataframe
Only present when `as_frame=True`. Contains `data` and `target`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE.

Column names reference:
https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info
"""
feat_cols = ["Elevation",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's define those as constant on the top of the file.

rename feat_cols by FEATURE_NAMES and target_col by TARGET_NAMES

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE.

feat_cols += ['Soil_Type_'+str(i) for i in range(1, 41)]
target_col = ["Cover_Type"]

frame, X, y = _convert_data_dataframe("fetch_covtype", X, y,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call the function by passing only by keyword arguments it will be more explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE.

@@ -142,7 +153,37 @@ def fetch_covtype(*, data_home=None, download_if_missing=True,
with open(join(module_path, 'descr', 'covtype.rst')) as rst_file:
fdescr = rst_file.read()

if return_X_y:
return X, y
if as_frame or return_X_y:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can have something like

frame = None
if as_frame:
    _convert_to_frame(...)
if return_X_y:
    return data, target
return Bunch(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE.

import pytest
from sklearn.datasets.tests.test_common import check_return_X_y
from functools import partial
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import pytest
from sklearn.datasets.tests.test_common import check_return_X_y
from functools import partial
from functools import partial
import pytest
from sklearn.datasets.tests.test_common import check_return_X_y

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE.

pd = pytest.importorskip('pandas')
bunch = fetch_covtype_fxt(as_frame=True)
frame = bunch.frame
assert hasattr(bunch, frame) is True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert hasattr(bunch, frame) is True
assert hasattr(bunch, frame)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE.

def test_fetch_asframe(fetch_covtype_fxt):
pd = pytest.importorskip('pandas')
bunch = fetch_covtype_fxt(as_frame=True)
frame = bunch.frame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert below will be useful because the test will already crash here if frame does not exist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE.

@adrinjalali
Copy link
Member

It's a bit tedious to add in all the tests, let me know if they needed to be done.

Writing the tests is in many cases more work than fixing the issue or implementing a feature itself; yet we need them to ensure the correctness and the quality of the code, and in many cases writing tests will uncover a few bugs which were not seen before. So yes, please add the tests.

frame = bunch.frame
assert hasattr(bunch, frame) is True
assert frame.shape == (581012, 55)
assert isinstance(bunch.frame, pd.DataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the same here. It should be in the common test regarding the type checking. You can only check the shape of the data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will get rid of the type checking here.

@tianchuliang
Copy link
Contributor Author

@amueller @glemaitre hey guys, I am confused by the codecov/patch error; I have no idea what I am supposed to do because I can't interpret the diff <-> coverage change. As far as I can see, there are improved code coverages due to my diff. Please help, this codecov/patch error is quite irratating as it is not helping.

@tianchuliang
Copy link
Contributor Author

Screen Shot 2020-06-19 at 1 02 05 PM
@amueller @glemaitre As you can see, the two files where my code changes live show positive increased codecov. What's happening with this error?

@thomasjpfan
Copy link
Member

The fetch_* functions are skipped on the CI with PRs thus the coverage would not measured for them. Only the build on travis will run the fetch_* tests. You can activate them by adding a [scipy-dev] to an empty commit message and push to this PR. (There may be errors that are unrelated to this PR because [scipy-dev] test on the development version of our dependencies.

@thomasjpfan
Copy link
Member

thomasjpfan commented Jun 19, 2020

Going over our travis config, it looks we do not upload coverage there either. This was done to not have false positive with PRs that never run the fetch_* tests.

Given that, it looks like it did pass the travis ci.

@tianchuliang
Copy link
Contributor Author

@glemaitre @amueller @thomasjpfan is there a definitive answer as to what I should do with the codecov/patch issue? The amount of comments and rework on this PR is getting long IMO, I'd really appreciate it if there is a final set of clear check points, so we can close this?

@thomasjpfan
Copy link
Member

is there a definitive answer as to what I should do with the codecov/patch issue? T

I think we ignore it for now.

The amount of comments and rework on this PR is getting long IMO,

This is the norm for non-trival PRs. Every merged PR increases the maintenance of the project, thus we do reviews and ask for code changes to keep the code quality high.

Thank you for your patience.

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good apart from some nitpicks!

@@ -80,7 +100,13 @@ def fetch_covtype(*, data_home=None, download_if_missing=True,
If True, returns ``(data.data, data.target)`` instead of a Bunch
object.

.. versionadded:: 0.20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will add it back, thanks for pointing it out.

@@ -80,7 +100,13 @@ def fetch_covtype(*, data_home=None, download_if_missing=True,
If True, returns ``(data.data, data.target)`` instead of a Bunch
object.

.. versionadded:: 0.20
as_frame : bool, default=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part that's missing in the other PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks

DESCR : str
Description of the forest covertype dataset.

(data, target) : tuple if ``return_X_y`` is True

.. versionadded:: 0.20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be changed, I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing it back

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good!

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan
Copy link
Member

Please add an entry to the change log at doc/whats_new/v0.24.rst with tag |Enhancement|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

aliang added 2 commits June 20, 2020 22:09
@@ -53,6 +53,12 @@ Changelog
unless data is sparse.
:pr:`17396` by :user:`Jiaxiang <fujiaxiang>`.

- |Enhancement| :func:`datasets.fetch_covtype` now now supports optional
argument `as_frame`; when it is set to true, the returned Bunch object's
`data` and `target` members are in pandas DataFrame format, and the Bunch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, target is a pandas series.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this, will merge when test pass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasjpfan Thank you sir!

@thomasjpfan
Copy link
Member

Merging the tests that are failing are also failing on master. I'll work on a fix tomorrow if it is not fixed by then.

Thank you for the PR @tianchuliang !

@thomasjpfan thomasjpfan merged commit 6ef4487 into scikit-learn:master Jun 21, 2020
6 of 7 checks passed
dsandeep0138 pushed a commit to dsandeep0138/scikit-learn that referenced this pull request Jun 21, 2020
Co-authored-by: aliang <aliang@air>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@tianchuliang tianchuliang deleted the 10733_fetch_covtype_return_df branch June 21, 2020 18:31
rubywerman pushed a commit to MLH-Fellowship/scikit-learn that referenced this pull request Jun 24, 2020
Co-authored-by: aliang <aliang@air>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
Co-authored-by: aliang <aliang@air>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Co-authored-by: aliang <aliang@air>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants