Skip to content

Conversation

bsipocz
Copy link
Contributor

@bsipocz bsipocz commented Jun 6, 2020

This PR is to fix partially #10733

The one inconsistency I see is that rather than having all 20 category names as a column in the DF with 0/1 values, I added a single one as target_name for convert_data_dataframe, but it's inconsistent with the target_name of the bunch.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @bsipocz !

@reshamas
Copy link
Member

reshamas commented Jun 6, 2020

#DataUmbrella sprint

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @bsipocz

Comment on lines 77 to 78
data_df = pd.DataFrame.sparse.from_spmatrix(data,
columns=feature_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_spmatrix is new in pandas=0.25.0. Our CI uses the latest pandas, our documentation says we need pandas>=0.18 for some examples. Should we bump that to 0.25? @NicolasHug @rth maybe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas 0.18 was released in 2016 while 0.25.3 was released less than a year ago. We can update a bit but I think 0.25 is too recent.

I think it would be OK to raise an exception there for older versions of pandas saying that pandas 0.25+ is required for this functionality, and then skip the corresponding tests (and check that the exception is raised).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we won't have a way to test for that error in any way on our CI though, since we don't test for our minimum supported pandas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we won't have a way to test for that error in any way on our CI though, since we don't test for our minimum supported pandas.

Fair enough, but we can still skip the test for pandas <0.25.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think? Or odeally we'd actually have a CI which tests for the min supported pandas since we're doing more and more of it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm more than happy to add an exception (or go with the none sparse version for the older pandas case).

But, I also think there should be a test with the oldest supported versions of all the dependencies, having one of those certainly helped astropy to notice issues. Happy to add such a job, but that's probably best to do outside of this PR (and I also need to familiarize myself with your CI approaches first).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am planning to update our CI and make sure we test our min dependencies this week.

(I want to update the CI before the global scikit-learn sprint.)

Comment on lines 424 to 428
frame : pandas DataFrame
Only present when `as_frame=True`. DataFrame with ``data`` and
``target``.

.. versionadded:: 0.24
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the frame is a part of the Bunch object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thanks. Copied this from the housing dataset, I'll open a PR to fix it there, too.

target: array, shape [n_samples]
The target labels.
target_names: list, length [n_classes]
The names of target classes.
If ``as_frame`` is True, ``target`` is a pandas object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably deserves a .. versionchanged directive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I feel the versionchanged is misleading, as target_names isn't added in v0.24. The directive has already been added to the new as_frame kwarg, so it should be unambiguous to users that this line belongs to that addition.

Comment on lines 491 to 492
X = data
y = target
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this renaming not necessary

@@ -88,3 +90,16 @@ def test_20news_normalization(fetch_20newsgroups_vectorized_fxt):

assert_allclose_dense_sparse(X_norm, normalize(X))
assert np.allclose(np.linalg.norm(X_norm.todense(), axis=1), 1)


def test_20news_asframe(fetch_20newsgroups_vectorized_fxt):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we somehow check the column names? Do we test for column names in other places where we've been returning a data frame?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some cases where it's tested, and there are others where it's not. I assume you suggest to check a subset of the column names, not all the 130k of them, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I guess that'd make sense.

@bsipocz bsipocz changed the title Adding as_frame kwarg to fetch_20newsgroups_vectorized [MRG] Adding as_frame kwarg to fetch_20newsgroups_vectorized Jun 12, 2020
@glemaitre glemaitre changed the title [MRG] Adding as_frame kwarg to fetch_20newsgroups_vectorized EHN Add parameter as_frame to fetch_20newsgroups_vectorized Jun 15, 2020
@@ -52,6 +52,10 @@ Changelog
unless data is sparse.
:pr:`17396` by :user:`Jiaxiang <fujiaxiang>`.

- |Enhancement| :func:`datasets.fetch_20newsgroups_vectorized` now supports
heterogeneous data using pandas by setting `as_frame=True`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the comment, the feature are always numeric, right? That's why it's vectorized? The target is strings but I wouldn't consider that heterogeneous as X is homogeneous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I just blindly copy pasted the changelog from the california housing dataset without thinking about the meaning of it. I'll fix it.

@bsipocz
Copy link
Contributor Author

bsipocz commented Jun 16, 2020

🤦‍♀️ apparently pipelines still don't know about [skip ci], I'm sorry of not reading the docs to use [ci skip] (other CI providers work with both).

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of nitpicks for the docstring otherwise LGTM.

if LooseVersion(pd.__version__) < '0.25':
raise ValueError("Loading sparse datasets as a DataFrame requires "
"Pandas v0.25+.")
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove the else to gain an indentation level and put the next statement on a single line.


(data, target) : tuple if ``return_X_y`` is True

.. versionadded:: 0.20

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should remove this line. Numpydoc will not be happy :)

@@ -232,6 +233,7 @@ def fetch_20newsgroups(*, data_home=None, subset='train', categories=None,

(data, target) : tuple if `return_X_y=True`
.. versionadded:: 0.22

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should remove this line

@@ -467,10 +487,24 @@ def fetch_20newsgroups_vectorized(*, subset="train", remove=(), data_home=None,
with open(join(module_path, 'descr', 'twenty_newsgroups.rst')) as rst_file:
fdescr = rst_file.read()

frame = None
target_name = ['Category_class', ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not put a capital letter here if the feature_names do not contain capital letter as well.

If True, the data is a pandas DataFrame including columns with
appropriate dtypes (numeric, string or categorical). The target is
a pandas DataFrame or Series depending on the number of target_columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A second thought. we could add a little note here to mention that the dataframe will be sparse and that it will require pandas 0.25+

DESCR: str
The full description of the dataset.
frame : pandas DataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
frame : pandas DataFrame
frame : dataframe of shape (n_samples, n_features + 1)

frame = bunch.frame

assert frame.shape == (11314, 130108)
assert isinstance(bunch.data, pd.DataFrame)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we should have a common test now for the type part. Could you only check the shape (and the sparse because this is really specific to this fetcher).

@glemaitre
Copy link
Member

@bsipocz Would you be able to merge master into your branch and address the reviews?

@glemaitre glemaitre self-assigned this Aug 25, 2020
@glemaitre glemaitre added the Superseded PR has been replace by a newer PR label Aug 26, 2020
@reshamas
Copy link
Member

@bsipocz
Are you still working on this PR?

@bsipocz
Copy link
Contributor Author

bsipocz commented Oct 20, 2020

Yes, sorry, it fell through the cracks. I'll try to come back to it by the weekend.

@cmarmo
Copy link
Contributor

cmarmo commented Oct 20, 2020

@reshamas, @bsipocz some work has already been done by @glemaitre in #18262.

@glemaitre
Copy link
Member

I did open a PR to address my own comments. It is actually awaiting for reviewing. Feel free to make a pass on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:datasets Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants