New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EHN Add parameter as_frame to fetch_covtype #17491
EHN Add parameter as_frame to fetch_covtype #17491
Conversation
Thanks for the PR. Generally looks good, please check the linting error. |
@amueller yep, will resolve linter's complain :) |
#DataUmbrella sprint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @tianchuliang
sklearn/datasets/_covtype.py
Outdated
@@ -80,6 +82,8 @@ def fetch_covtype(*, data_home=None, download_if_missing=True, | |||
If True, returns ``(data.data, data.target)`` instead of a Bunch | |||
object. | |||
|
|||
as_frame : boolean, default=False. | |||
If True, returns ``pandas.DataFrame`` instead of a Bunch object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the result will still be a bunch object, it's the X
and y
which are a DataFrame instead of an ndarray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the result is a bunch object, I added .frame attribute to the bunch object. The frame attribute is a combined dataframe of X and y, with properly specified columns. Please see discussion #10733 for reference, this is the design from that thread of discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but here you say If True, returns ``pandas.DataFrame`` instead of a Bunch object
which is not correct.
You can see the documentation and the implementation https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_base.py#L297 as an example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, got it. will adjust, thanks!
@@ -145,4 +152,32 @@ def fetch_covtype(*, data_home=None, download_if_missing=True, | |||
if return_X_y: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the if as_frame
part should be before this if
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adjusted accordingly, thanks for pointing it out.
You may have not pushed your changes to this branch yet. Once you do I can give another review :) |
@adrinjalali yep, it's been a busy week this week; but I will look at the codecov issue this weekend and try to get it pass all checks. |
the codecov error seems strange and unrelated, or I'm not looking at the right thing. Maybe don't worry about this one for now and just address the other comments. |
@amueller just pushed changes to address previous comments; the codecov error is about some of the if statements not being fully tested out. It's a bit tedious to add in all the tests, let me know if they needed to be done. |
sklearn/datasets/_covtype.py
Outdated
appropriate dtypes (numeric). The target is | ||
a pandas DataFrame or Series depending on the number of target columns. | ||
If `return_X_y` is True, then (`data`, `target`) will be pandas | ||
DataFrames or Series as described below. | ||
.. versionadded:: 0.20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version added was part of the previous parameter. You should add another versionadded for this parameter
.. versionadded:: 0.24
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
sklearn/datasets/_covtype.py
Outdated
@@ -98,6 +106,9 @@ def fetch_covtype(*, data_home=None, download_if_missing=True, | |||
|
|||
(data, target) : tuple if ``return_X_y`` is True | |||
|
|||
dataframe: :class: `pandas.DataFrame` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be added under DESCPR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE.
sklearn/datasets/_covtype.py
Outdated
@@ -98,6 +106,9 @@ def fetch_covtype(*, data_home=None, download_if_missing=True, | |||
|
|||
(data, target) : tuple if ``return_X_y`` is True | |||
|
|||
dataframe: :class: `pandas.DataFrame` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our guideline would say to write
frame : dataframe of shape (581012, 53)
sklearn/datasets/_covtype.py
Outdated
@@ -98,6 +106,9 @@ def fetch_covtype(*, data_home=None, download_if_missing=True, | |||
|
|||
(data, target) : tuple if ``return_X_y`` is True | |||
|
|||
dataframe: :class: `pandas.DataFrame` | |||
Pandas dataframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas dataframe | |
Only present when `as_frame=True`. Contains `data` and `target`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE.
sklearn/datasets/_covtype.py
Outdated
Column names reference: | ||
https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info | ||
""" | ||
feat_cols = ["Elevation", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's define those as constant on the top of the file.
rename feat_cols
by FEATURE_NAMES
and target_col
by TARGET_NAMES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE.
sklearn/datasets/_covtype.py
Outdated
feat_cols += ['Soil_Type_'+str(i) for i in range(1, 41)] | ||
target_col = ["Cover_Type"] | ||
|
||
frame, X, y = _convert_data_dataframe("fetch_covtype", X, y, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call the function by passing only by keyword arguments it will be more explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE.
sklearn/datasets/_covtype.py
Outdated
@@ -142,7 +153,37 @@ def fetch_covtype(*, data_home=None, download_if_missing=True, | |||
with open(join(module_path, 'descr', 'covtype.rst')) as rst_file: | |||
fdescr = rst_file.read() | |||
|
|||
if return_X_y: | |||
return X, y | |||
if as_frame or return_X_y: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can have something like
frame = None
if as_frame:
_convert_to_frame(...)
if return_X_y:
return data, target
return Bunch(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE.
import pytest | ||
from sklearn.datasets.tests.test_common import check_return_X_y | ||
from functools import partial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import pytest | |
from sklearn.datasets.tests.test_common import check_return_X_y | |
from functools import partial | |
from functools import partial | |
import pytest | |
from sklearn.datasets.tests.test_common import check_return_X_y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE.
pd = pytest.importorskip('pandas') | ||
bunch = fetch_covtype_fxt(as_frame=True) | ||
frame = bunch.frame | ||
assert hasattr(bunch, frame) is True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert hasattr(bunch, frame) is True | |
assert hasattr(bunch, frame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE.
def test_fetch_asframe(fetch_covtype_fxt): | ||
pd = pytest.importorskip('pandas') | ||
bunch = fetch_covtype_fxt(as_frame=True) | ||
frame = bunch.frame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assert below will be useful because the test will already crash here if frame does not exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE.
Writing the tests is in many cases more work than fixing the issue or implementing a feature itself; yet we need them to ensure the correctness and the quality of the code, and in many cases writing tests will uncover a few bugs which were not seen before. So yes, please add the tests. |
frame = bunch.frame | ||
assert hasattr(bunch, frame) is True | ||
assert frame.shape == (581012, 55) | ||
assert isinstance(bunch.frame, pd.DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the same here. It should be in the common test regarding the type checking. You can only check the shape of the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will get rid of the type checking here.
@amueller @glemaitre hey guys, I am confused by the codecov/patch error; I have no idea what I am supposed to do because I can't interpret the diff <-> coverage change. As far as I can see, there are improved code coverages due to my diff. Please help, this codecov/patch error is quite irratating as it is not helping. |
|
The |
Going over our travis config, it looks we do not upload coverage there either. This was done to not have false positive with PRs that never run the Given that, it looks like it did pass the travis ci. |
@glemaitre @amueller @thomasjpfan is there a definitive answer as to what I should do with the codecov/patch issue? The amount of comments and rework on this PR is getting long IMO, I'd really appreciate it if there is a final set of clear check points, so we can close this? |
I think we ignore it for now.
This is the norm for non-trival PRs. Every merged PR increases the maintenance of the project, thus we do reviews and ask for code changes to keep the code quality high. Thank you for your patience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good apart from some nitpicks!
@@ -80,7 +100,13 @@ def fetch_covtype(*, data_home=None, download_if_missing=True, | |||
If True, returns ``(data.data, data.target)`` instead of a Bunch | |||
object. | |||
|
|||
.. versionadded:: 0.20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will add it back, thanks for pointing it out.
@@ -80,7 +100,13 @@ def fetch_covtype(*, data_home=None, download_if_missing=True, | |||
If True, returns ``(data.data, data.target)`` instead of a Bunch | |||
object. | |||
|
|||
.. versionadded:: 0.20 | |||
as_frame : bool, default=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the part that's missing in the other PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks
DESCR : str | ||
Description of the forest covertype dataset. | ||
|
||
(data, target) : tuple if ``return_X_y`` is True | ||
|
||
.. versionadded:: 0.20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be changed, I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changing it back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Please add an entry to the change log at |
…ng/scikit-learn into 10733_fetch_covtype_return_df
doc/whats_new/v0.24.rst
Outdated
@@ -53,6 +53,12 @@ Changelog | |||
unless data is sparse. | |||
:pr:`17396` by :user:`Jiaxiang <fujiaxiang>`. | |||
|
|||
- |Enhancement| :func:`datasets.fetch_covtype` now now supports optional | |||
argument `as_frame`; when it is set to true, the returned Bunch object's | |||
`data` and `target` members are in pandas DataFrame format, and the Bunch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, target
is a pandas series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this, will merge when test pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasjpfan Thank you sir!
Merging the tests that are failing are also failing on master. I'll work on a fix tomorrow if it is not fixed by then. Thank you for the PR @tianchuliang ! |
Co-authored-by: aliang <aliang@air> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: aliang <aliang@air> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: aliang <aliang@air> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: aliang <aliang@air> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Reference Issues/PRs
Reference: #10733
What does this implement/fix? Explain your changes.
This adds as_frame argument for covtype dataset loader.
Any other comments?