-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
EHN Add parameter as_frame to fetch_20newsgroups_vectorized #17499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @bsipocz !
#DataUmbrella sprint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @bsipocz
sklearn/datasets/_base.py
Outdated
data_df = pd.DataFrame.sparse.from_spmatrix(data, | ||
columns=feature_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from_spmatrix
is new in pandas=0.25.0. Our CI uses the latest pandas, our documentation says we need pandas>=0.18 for some examples. Should we bump that to 0.25? @NicolasHug @rth maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas 0.18 was released in 2016 while 0.25.3 was released less than a year ago. We can update a bit but I think 0.25 is too recent.
I think it would be OK to raise an exception there for older versions of pandas saying that pandas 0.25+ is required for this functionality, and then skip the corresponding tests (and check that the exception is raised).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we won't have a way to test for that error in any way on our CI though, since we don't test for our minimum supported pandas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we won't have a way to test for that error in any way on our CI though, since we don't test for our minimum supported pandas.
Fair enough, but we can still skip the test for pandas <0.25.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think? Or odeally we'd actually have a CI which tests for the min supported pandas since we're doing more and more of it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm more than happy to add an exception (or go with the none sparse version for the older pandas case).
But, I also think there should be a test with the oldest supported versions of all the dependencies, having one of those certainly helped astropy to notice issues. Happy to add such a job, but that's probably best to do outside of this PR (and I also need to familiarize myself with your CI approaches first).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am planning to update our CI and make sure we test our min dependencies this week.
(I want to update the CI before the global scikit-learn sprint.)
frame : pandas DataFrame | ||
Only present when `as_frame=True`. DataFrame with ``data`` and | ||
``target``. | ||
|
||
.. versionadded:: 0.24 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the frame
is a part of the Bunch
object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, thanks. Copied this from the housing dataset, I'll open a PR to fix it there, too.
target: array, shape [n_samples] | ||
The target labels. | ||
target_names: list, length [n_classes] | ||
The names of target classes. | ||
If ``as_frame`` is True, ``target`` is a pandas object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably deserves a .. versionchanged
directive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here I feel the versionchanged
is misleading, as target_names
isn't added in v0.24. The directive has already been added to the new as_frame
kwarg, so it should be unambiguous to users that this line belongs to that addition.
X = data | ||
y = target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this renaming not necessary
@@ -88,3 +90,16 @@ def test_20news_normalization(fetch_20newsgroups_vectorized_fxt): | |||
|
|||
assert_allclose_dense_sparse(X_norm, normalize(X)) | |||
assert np.allclose(np.linalg.norm(X_norm.todense(), axis=1), 1) | |||
|
|||
|
|||
def test_20news_asframe(fetch_20newsgroups_vectorized_fxt): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we somehow check the column names? Do we test for column names in other places where we've been returning a data frame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some cases where it's tested, and there are others where it's not. I assume you suggest to check a subset of the column names, not all the 130k of them, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I guess that'd make sense.
doc/whats_new/v0.24.rst
Outdated
@@ -52,6 +52,10 @@ Changelog | |||
unless data is sparse. | |||
:pr:`17396` by :user:`Jiaxiang <fujiaxiang>`. | |||
|
|||
- |Enhancement| :func:`datasets.fetch_20newsgroups_vectorized` now supports | |||
heterogeneous data using pandas by setting `as_frame=True`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the comment, the feature are always numeric, right? That's why it's vectorized? The target is strings but I wouldn't consider that heterogeneous as X is homogeneous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I just blindly copy pasted the changelog from the california housing dataset without thinking about the meaning of it. I'll fix it.
🤦♀️ apparently pipelines still don't know about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of nitpicks for the docstring otherwise LGTM.
if LooseVersion(pd.__version__) < '0.25': | ||
raise ValueError("Loading sparse datasets as a DataFrame requires " | ||
"Pandas v0.25+.") | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove the else to gain an indentation level and put the next statement on a single line.
|
||
(data, target) : tuple if ``return_X_y`` is True | ||
|
||
.. versionadded:: 0.20 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should remove this line. Numpydoc will not be happy :)
@@ -232,6 +233,7 @@ def fetch_20newsgroups(*, data_home=None, subset='train', categories=None, | |||
|
|||
(data, target) : tuple if `return_X_y=True` | |||
.. versionadded:: 0.22 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should remove this line
@@ -467,10 +487,24 @@ def fetch_20newsgroups_vectorized(*, subset="train", remove=(), data_home=None, | |||
with open(join(module_path, 'descr', 'twenty_newsgroups.rst')) as rst_file: | |||
fdescr = rst_file.read() | |||
|
|||
frame = None | |||
target_name = ['Category_class', ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not put a capital letter here if the feature_names
do not contain capital letter as well.
If True, the data is a pandas DataFrame including columns with | ||
appropriate dtypes (numeric, string or categorical). The target is | ||
a pandas DataFrame or Series depending on the number of target_columns. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A second thought. we could add a little note here to mention that the dataframe will be sparse and that it will require pandas 0.25+
DESCR: str | ||
The full description of the dataset. | ||
frame : pandas DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
frame : pandas DataFrame | |
frame : dataframe of shape (n_samples, n_features + 1) |
frame = bunch.frame | ||
|
||
assert frame.shape == (11314, 130108) | ||
assert isinstance(bunch.data, pd.DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we should have a common test now for the type part. Could you only check the shape (and the sparse because this is really specific to this fetcher).
@bsipocz Would you be able to merge master into your branch and address the reviews? |
@bsipocz |
Yes, sorry, it fell through the cracks. I'll try to come back to it by the weekend. |
@reshamas, @bsipocz some work has already been done by @glemaitre in #18262. |
I did open a PR to address my own comments. It is actually awaiting for reviewing. Feel free to make a pass on it. |
This PR is to fix partially #10733
The one inconsistency I see is that rather than having all 20 category names as a column in the DF with 0/1 values, I added a single one as target_name for
convert_data_dataframe
, but it's inconsistent with thetarget_name
of the bunch.