EHN Add parameter as_frame to fetch_20newsgroups_vectorized #17499

bsipocz · 2020-06-06T19:34:25Z

This PR partially fixes #10733.

The one inconsistency I see: rather than adding all 20 category names as columns in the DataFrame with 0/1 values, I passed a single column name as target_name to convert_data_dataframe, which is inconsistent with the target_name of the bunch.
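To make the two layouts concrete, here is a minimal sketch in plain Python (no pandas). The three category names are a shortened stand-in for the 20 real ones, the target values are made up, and `one_hot` is a hypothetical helper; this illustrates the trade-off and is not the code in this PR.

```python
# Shortened stand-in for the 20 category names.
target_names = ["alt.atheism", "comp.graphics", "sci.space"]
# Integer-coded target, one entry per sample (made-up values).
target = [2, 0, 1, 2]


def one_hot(codes, n_classes):
    """Expand integer class codes into one 0/1 indicator row per sample."""
    return [[1 if code == k else 0 for k in range(n_classes)] for code in codes]


# Layout A (not chosen): one 0/1 column per category name.
indicator_rows = one_hot(target, len(target_names))

# Layout B (chosen here): a single target column holding the class code.
# "Category_class" is the column name used in this PR's diff.
single_column = {"Category_class": target}

print(indicator_rows)
print(single_column)
```

Layout B keeps the frame one column wider than the data, but its single column name no longer lines up with the list of class names stored on the bunch, which is the inconsistency described above.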

thomasjpfan

Thank you for the PR @bsipocz !

sklearn/datasets/_twenty_newsgroups.py

reshamas · 2020-06-06T22:58:36Z

#DataUmbrella sprint

adrinjalali

Thanks for the PR @bsipocz

adrinjalali · 2020-06-09T08:39:44Z

sklearn/datasets/_base.py

+        data_df = pd.DataFrame.sparse.from_spmatrix(data,
+                                                    columns=feature_names)


from_spmatrix is new in pandas=0.25.0. Our CI uses the latest pandas, but our documentation says we need pandas>=0.18 for some examples. Should we bump that to 0.25? @NicolasHug @rth maybe?

Pandas 0.18 was released in 2016 while 0.25.3 was released less than a year ago. We can update a bit but I think 0.25 is too recent.

I think it would be OK to raise an exception there for older versions of pandas saying that pandas 0.25+ is required for this functionality, and then skip the corresponding tests (and check that the exception is raised).

We won't have a way to test for that error on our CI though, since we don't test against our minimum supported pandas.

Fair enough, but we can still skip the test for pandas <0.25.
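The skip suggested here could look roughly like the sketch below. It uses the standard-library `unittest` only to stay self-contained; the project's test suite uses pytest, where `pytest.importorskip("pandas", minversion="0.25")` would serve the same purpose. The names `major_minor`, `PANDAS_TOO_OLD` and `TestSparseFrame` are hypothetical.

```python
import re
import unittest

try:
    import pandas as pd
    PANDAS_VERSION = pd.__version__
except ImportError:  # pandas is an optional dependency
    pd = None
    PANDAS_VERSION = None


def major_minor(version):
    """Return (major, minor) parsed from a version string such as '0.25.3'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Cannot parse version {version!r}")
    return int(match.group(1)), int(match.group(2))


# True when pandas is missing or predates DataFrame.sparse.from_spmatrix.
PANDAS_TOO_OLD = PANDAS_VERSION is None or major_minor(PANDAS_VERSION) < (0, 25)


class TestSparseFrame(unittest.TestCase):
    @unittest.skipIf(PANDAS_TOO_OLD, "requires pandas 0.25+")
    def test_sparse_accessor_exists(self):
        # Only runs when a recent enough pandas is installed.
        self.assertTrue(hasattr(pd.DataFrame.sparse, "from_spmatrix"))
```

Run it with `python -m unittest`; on an environment with an older pandas, or none at all, the test is reported as skipped rather than failed.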

Yes, I think? Or ideally we'd actually have a CI job that tests the minimum supported pandas, since we're doing more and more of this.

I'm more than happy to add an exception (or fall back to the non-sparse version for older pandas).

But, I also think there should be a test with the oldest supported versions of all the dependencies, having one of those certainly helped astropy to notice issues. Happy to add such a job, but that's probably best to do outside of this PR (and I also need to familiarize myself with your CI approaches first).

I am planning to update our CI and make sure we test our min dependencies this week.

(I want to update the CI before the global scikit-learn sprint.)

adrinjalali · 2020-06-09T08:47:32Z

sklearn/datasets/_twenty_newsgroups.py

+    frame : pandas DataFrame
+        Only present when `as_frame=True`. DataFrame with ``data`` and
+        ``target``.
+
+        .. versionadded:: 0.24


the frame is a part of the Bunch object

Indeed, thanks. Copied this from the housing dataset, I'll open a PR to fix it there, too.

adrinjalali · 2020-06-09T08:47:48Z

sklearn/datasets/_twenty_newsgroups.py

        target: array, shape [n_samples]
            The target labels.
        target_names: list, length [n_classes]
            The names of target classes.
+            If ``as_frame`` is True, ``target`` is a pandas object.


probably deserves a .. versionchanged directive

here I feel the versionchanged is misleading, as target_names isn't added in v0.24. The directive has already been added to the new as_frame kwarg, so it should be unambiguous to users that this line belongs to that addition.

adrinjalali · 2020-06-09T08:48:12Z

sklearn/datasets/_twenty_newsgroups.py

+    X = data
+    y = target


I don't think this renaming is necessary.

adrinjalali · 2020-06-09T08:48:59Z

sklearn/datasets/tests/test_20news.py

@@ -88,3 +90,16 @@ def test_20news_normalization(fetch_20newsgroups_vectorized_fxt):

    assert_allclose_dense_sparse(X_norm, normalize(X))
    assert np.allclose(np.linalg.norm(X_norm.todense(), axis=1), 1)
+
+
+def test_20news_asframe(fetch_20newsgroups_vectorized_fxt):


should we somehow check the column names? Do we test for column names in other places where we've been returning a data frame?

There are some cases where it's tested, and there are others where it's not. I assume you suggest to check a subset of the column names, not all the 130k of them, right?

yes I guess that'd make sense.
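Checking only a subset of the column names might look like the sketch below. The helper `missing_columns` is hypothetical, and the `feature_<i>` names are invented placeholders for the real vocabulary; in the actual test, `columns` would come from `bunch.frame.columns`.

```python
def missing_columns(columns, expected):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)  # set lookup keeps the check cheap for ~130k names
    return [name for name in expected if name not in present]


# Stand-in for list(bunch.frame.columns): 130107 features plus one target column.
columns = [f"feature_{i}" for i in range(130107)] + ["category_class"]

# Check only a handful of names: first feature, last feature, and the target.
expected = ["feature_0", "feature_130106", "category_class"]

assert missing_columns(columns, expected) == []
assert len(columns) == 130108
```

Returning the list of missing names, instead of a bare boolean, gives a more useful failure message when the assertion trips.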

amueller · 2020-06-15T16:49:46Z

doc/whats_new/v0.24.rst

@@ -52,6 +52,10 @@ Changelog
  unless data is sparse.
  :pr:`17396` by :user:`Jiaxiang <fujiaxiang>`.

+- |Enhancement| :func:`datasets.fetch_20newsgroups_vectorized` now supports
+  heterogeneous data using pandas by setting `as_frame=True`.


I'm not sure I understand the comment. The features are always numeric, right? That's why it's vectorized? The target is strings, but I wouldn't consider that heterogeneous, as X is homogeneous.

Oh, I just blindly copy-pasted the changelog entry from the California housing dataset without thinking about what it means. I'll fix it.

bsipocz · 2020-06-16T02:26:44Z

🤦‍♀️ Apparently pipelines still don't know about [skip ci]. I'm sorry for not reading the docs and using [ci skip] (other CI providers work with both).

glemaitre

Just a couple of nitpicks for the docstring otherwise LGTM.

glemaitre · 2020-06-15T15:26:08Z

sklearn/datasets/_base.py

+        if LooseVersion(pd.__version__) < '0.25':
+            raise ValueError("Loading sparse datasets as a DataFrame requires "
+                             "Pandas v0.25+.")
+        else:


We can remove the else to save an indentation level and put the next statement on a single line.

glemaitre · 2020-06-15T15:28:45Z

sklearn/datasets/_twenty_newsgroups.py


    (data, target) : tuple if ``return_X_y`` is True

        .. versionadded:: 0.20
+


You should remove this line. Numpydoc will not be happy :)

glemaitre · 2020-06-15T15:28:56Z

sklearn/datasets/_twenty_newsgroups.py

@@ -232,6 +233,7 @@ def fetch_20newsgroups(*, data_home=None, subset='train', categories=None,

    (data, target) : tuple if `return_X_y=True`
        .. versionadded:: 0.22
+


You should remove this line

glemaitre · 2020-06-16T07:09:03Z

sklearn/datasets/_twenty_newsgroups.py

@@ -467,10 +487,24 @@ def fetch_20newsgroups_vectorized(*, subset="train", remove=(), data_home=None,
    with open(join(module_path, 'descr', 'twenty_newsgroups.rst')) as rst_file:
        fdescr = rst_file.read()

+    frame = None
+    target_name = ['Category_class', ]


I would not use a capital letter here if the feature_names do not contain capital letters either.

glemaitre · 2020-06-16T07:33:09Z

sklearn/datasets/_twenty_newsgroups.py

+        If True, the data is a pandas DataFrame including columns with
+        appropriate dtypes (numeric, string or categorical). The target is
+        a pandas DataFrame or Series depending on the number of target_columns.
+


On second thought, we could add a short note here mentioning that the dataframe will be sparse and that it requires pandas 0.25+.

glemaitre · 2020-06-16T07:39:44Z

sklearn/datasets/_twenty_newsgroups.py

        DESCR: str
            The full description of the dataset.
+        frame : pandas DataFrame


Suggested change

frame : pandas DataFrame

frame : dataframe of shape (n_samples, n_features + 1)

glemaitre · 2020-06-16T12:46:23Z

sklearn/datasets/tests/test_20news.py

+        frame = bunch.frame
+
+        assert frame.shape == (11314, 130108)
+        assert isinstance(bunch.data, pd.DataFrame)


We should now have a common test for the type part. Could you check only the shape (and the sparsity, because that is really specific to this fetcher)?

glemaitre · 2020-08-21T18:52:06Z

@bsipocz Would you be able to merge master into your branch and address the reviews?

reshamas · 2020-10-20T15:35:53Z

@bsipocz
Are you still working on this PR?

bsipocz · 2020-10-20T16:33:47Z

Yes, sorry, it fell through the cracks. I'll try to come back to it by the weekend.

cmarmo · 2020-10-20T16:46:13Z

@reshamas, @bsipocz some work has already been done by @glemaitre in #18262.

glemaitre · 2020-10-21T10:09:52Z

I opened a PR to address my own comments. It is currently awaiting review. Feel free to make a pass on it.

github-actions bot added the module:datasets label Jun 6, 2020

thomasjpfan reviewed Jun 6, 2020

bsipocz force-pushed the datasets_as_DF branch from 0c73bc8 to da88a3b Compare June 9, 2020 07:45

adrinjalali reviewed Jun 9, 2020

bsipocz added 4 commits June 12, 2020 01:35

Adding as_frame kwarg to fetch_20newsgroups_vectorized

f1f1ed1

Using SparseArray to store the data

5a6ae1d

Adding changelog

915b400

Adding pandas version dependency, and more tests

28f3dea

bsipocz force-pushed the datasets_as_DF branch from da88a3b to 28f3dea Compare June 12, 2020 08:37

bsipocz changed the title ~~Adding as_frame kwarg to fetch_20newsgroups_vectorized~~ [MRG] Adding as_frame kwarg to fetch_20newsgroups_vectorized Jun 12, 2020

glemaitre changed the title ~~[MRG] Adding as_frame kwarg to fetch_20newsgroups_vectorized~~ EHN Add parameter as_frame to fetch_20newsgroups_vectorized Jun 15, 2020

amueller reviewed Jun 15, 2020

DOC: rephrasing changelog [skip ci]

34b47bc

glemaitre reviewed Jun 16, 2020

glemaitre self-assigned this Aug 25, 2020

glemaitre mentioned this pull request Aug 26, 2020

EHN Add parameter as_frame to fetch_20newsgroups_vectorized #18262

Merged

glemaitre added the Superseded PR has been replace by a newer PR label Aug 26, 2020

adrinjalali closed this in #18262 Nov 1, 2020
