
[MRG] Weights parameter of datasets.make_classification changed to array-like from list only - Issue 14760 #14764

Merged
merged 37 commits into scikit-learn:master on Sep 18, 2019

Conversation

@CatChenal (Contributor) commented Aug 24, 2019

Fixes #14760

What does this implement/fix?

The weights parameter can now be any array-like (e.g. a list or a numpy array), not just a list; the docstring is updated accordingly to read "weights is array-like or None".

CatChenal added 21 commits Dec 18, 2018
…# doctest: +NORMALIZE_WHITESPACE +DONT_ACCEPT_BLANKLINE +ELLIPSIS for print statements.
…# doctest: +NORMALIZE_WHITESPACE +DONT_ACCEPT_BLANKLINE for print statements.
…BLANKLINE (@jnothmam, @reshamas)
@@ -162,14 +162,14 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
if n_informative < np.log2(n_classes * n_clusters_per_class):
raise ValueError("n_classes * n_clusters_per_class must"
" be smaller or equal 2 ** n_informative")
if weights and len(weights) not in [n_classes, n_classes - 1]:
if all(weights) and len(weights) not in [n_classes, n_classes - 1]:

@amueller (Member) Aug 24, 2019

maybe `weights is not None` would be clearer?

@@ -337,22 +337,37 @@ def extract_patches_2d(image, patch_size, max_patches=None, random_state=None):
Examples
--------
<<<<<<< HEAD

@amueller (Member) Aug 24, 2019

merge issues here

@CatChenal changed the title from "Issue 14760" to "Weights parameter of datasets.make_classification changed to sequence from list - Issue 14760" on Aug 24, 2019
@@ -162,17 +162,18 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
if n_informative < np.log2(n_classes * n_clusters_per_class):
raise ValueError("n_classes * n_clusters_per_class must"
" be smaller or equal 2 ** n_informative")
if weights and len(weights) not in [n_classes, n_classes - 1]:
raise ValueError("Weights specified but incompatible with number "
if not weights is None:

@amueller (Member) Aug 24, 2019

The idiomatic Python way is `weights is not None`.
I would use a single `if` with `and`. You can put the whole condition in parentheses and make it multi-line.
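
A sketch of what that combined check could look like (a minimal illustration, not necessarily the exact code that was merged):

# single `if`, condition split across lines inside parentheses
if (weights is not None and
        len(weights) not in [n_classes, n_classes - 1]):
    raise ValueError("Weights specified but incompatible with number "
                     "of classes.")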

@CatChenal (Author, Contributor) Aug 24, 2019

ok.

@@ -337,22 +337,22 @@ def extract_patches_2d(image, patch_size, max_patches=None, random_state=None):
Examples
--------
>>> from sklearn.datasets import load_sample_image
>>> from sklearn.datasets import load_sample_images

@amueller (Member) Aug 24, 2019

this change seems unrelated

@NicolasHug (Contributor) Sep 3, 2019

ping @CatChenal, can you please revert these changes?

@amueller (Member) commented Aug 24, 2019

Please add a non-regression test that would fail on master but pass in this PR.

NicolasHug (Contributor) left a comment

Thanks for the PR @CatChenal, I made a few comments.

@@ -91,7 +91,7 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
n_clusters_per_class : int, optional (default=2)
The number of clusters per class.
weights : list of floats or None (default=None)
weights : sequence of floats or None (default=None)

@NicolasHug (Contributor) Aug 26, 2019

We call this an array-like.

Suggested change
weights : sequence of floats or None (default=None)
weights : array-like of shape (n_classes,) or (n_classes - 1,), default=None
@@ -162,17 +162,18 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
if n_informative < np.log2(n_classes * n_clusters_per_class):
raise ValueError("n_classes * n_clusters_per_class must"
" be smaller or equal 2 ** n_informative")
if weights and len(weights) not in [n_classes, n_classes - 1]:
w_ok = (weights is not None) and all(weights)

@NicolasHug (Contributor) Aug 26, 2019

shouldn't an error be raised if all(weights) is false?


if weights is None:
if weights is not None:
if all(weights) and len(weights) == (n_classes - 1):

@NicolasHug (Contributor) Aug 26, 2019
Suggested change
if all(weights) and len(weights) == (n_classes - 1):
if len(weights) == (n_classes - 1):
if weights is not None:
if all(weights) and len(weights) == (n_classes - 1):
weights = weights + [1.0 - sum(weights)]
else:
weights = [1.0 / n_classes] * n_classes
weights[-1] = 1.0 - sum(weights[:-1])

@NicolasHug (Contributor) Aug 26, 2019

That's not you, but that line is useless ;)

@CatChenal (Author, Contributor) Aug 26, 2019

Which line?
Line 175 resizes the (n_classes - 1) array with the missing weight, so it makes sense.
Line 178 recalculates the last position of weights according to the values set on line 177;
line 178 is the useless one, no?

@NicolasHug (Contributor) Aug 26, 2019
weights = [1.0 / n_classes] * n_classes
weights[-1] = 1.0 - sum(weights[:-1])  # <-- this one

# w as array: should pass in PR_14764, fail in master
w = np.array([0.25, 0.75])
X, y = make_classification(weights=w)
assert X.shape == (100, 20), "X shape mismatch"

@NicolasHug (Contributor) Aug 26, 2019

We like to parametrize these kinds of tests. You can look for some inspiration in e.g. this test.

Let us know if you need any help

@CatChenal (Author, Contributor) Aug 27, 2019

None of the tests in test_samples_generator.py are parametrized. Do you want me to parametrize all of them or just the tests for make_classification()?

@CatChenal (Author, Contributor) Aug 27, 2019

Oops, found one: test_make_blobs_n_samples_centers_none()

@NicolasHug (Contributor) Aug 27, 2019

I was just suggesting to parametrize the test you wrote

@CatChenal (Author, Contributor) Aug 27, 2019

Good...
I just found out one wrong way to do it:

@pytest.mark.parametrize(
    'params, err_msg',
    [({'weights': 0}, "object of type 'int' has no len()"),
     ({'weights': -1}, "object of type 'int' has no len()"),
     ({'weights': []}, "Weights specified but incompatible with number of classes."),
     ({'weights': [.25,.75,.1]}, "Weights specified but incompatible with number of classes."),
     ({'weights': np.array([])},"Weights specified but incompatible with number of classes."),
     ({'weights': np.array([.25,.75,.1])},"Weights specified but incompatible with number of classes.")]
)
def test_make_classification_weights_type(params, err_msg):
    make = partial(make_classification,
                   n_samples=100,
                   n_features=20,
                   n_informative=2,
                   n_redundant=2,
                   n_repeated=0,
                   n_classes=2,
                   n_clusters_per_class=2,
                   flip_y=0.01,
                   class_sep=1.0,
                   hypercube=True,
                   shift=0.0,
                   scale=1.0,
                   shuffle=True,
                   random_state=0)
    
    for i in range(len(params)):
        with pytest.raises(ValueError, match=err_msg[i]):
            make(weights=params[i]['weights'])

The first problem is that the mark.parametrize statement is incorrect: the weights passed to the partial function are not split into separate cases, and I have not found out how to fix it yet.
The other problem is likely the iteration inside the context manager (which should not be needed...).
Thanks for pointing me in the right direction.

@NicolasHug (Contributor) Aug 28, 2019

You're almost there. Here is a basic example:

@pytest.mark.parametrize(
    'weights, err_msg',
    [
        ([1, 2, 3], "incompatible with number of classes"),
        # add other test cases here
    ]
)
def test_make_classification_weights_type(weights, err_msg):

    with pytest.raises(ValueError, match=err_msg):
        make_classification(weights=weights)

NicolasHug (Contributor) left a comment

Thanks @CatChenal, I made a few more comments but it mostly looks good. Could you also please add a very simple test that makes sure passing e.g. [1, 2, 3] gives the same result as passing np.array([1, 2, 3])? Thanks!

@@ -337,22 +337,22 @@ def extract_patches_2d(image, patch_size, max_patches=None, random_state=None):
Examples
--------
>>> from sklearn.datasets import load_sample_image
>>> from sklearn.datasets import load_sample_images

@NicolasHug (Contributor) Sep 3, 2019

ping @CatChenal, can you please revert these changes?

@@ -380,7 +390,7 @@ def sample_example():
X_indices = array.array('i')
X_indptr = array.array('i', [0])
Y = []
for i in range(n_samples):
for _ in range(n_samples):

@NicolasHug (Contributor) Sep 3, 2019

Please avoid unrelated changes

@CatChenal (Author, Contributor) Sep 3, 2019

ok

@@ -1237,7 +1247,7 @@ def make_spd_matrix(n_dim, random_state=None):
generator = check_random_state(random_state)

A = generator.rand(n_dim, n_dim)
U, s, V = linalg.svd(np.dot(A.T, A))
U, _, V = linalg.svd(np.dot(A.T, A))

@NicolasHug (Contributor) Sep 3, 2019

same here

@CatChenal (Author, Contributor) Sep 3, 2019

ok

@@ -11,6 +11,8 @@
from sklearn.utils.testing import assert_array_almost_equal
from sklearn.utils.testing import assert_raise_message

from sklearn.utils.validation import assert_all_finite

@NicolasHug (Contributor) Sep 3, 2019

same here

@CatChenal (Author, Contributor) Sep 3, 2019

ok

n_informative, 2**n_informative))

if weights is not None:
if isinstance(weights, int):

@NicolasHug (Contributor) Sep 3, 2019

I don't think we need a specific check for int (otherwise we would need specific checks for pretty much every type). I guess a safe way is to convert the weights to a numpy array. You can then just check the length as you do below, and use np.sum everywhere.
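
A sketch of that approach (illustrative only, assuming `weights` and `n_classes` are the arguments of make_classification; the merged code may differ):

import numpy as np

# Convert once so the length check and the sums work for lists and arrays alike.
weights = np.asarray(weights)
if len(weights) not in [n_classes, n_classes - 1]:
    raise ValueError("Weights specified but incompatible with number "
                     "of classes.")
if len(weights) == n_classes - 1:
    # append the inferred weight of the last class so the total sums to 1
    weights = np.append(weights, 1.0 - np.sum(weights))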

if len(weights) not in [n_classes, n_classes - 1]:
raise ValueError("Weights specified but incompatible with number "
"of classes.")
if len(weights) == (n_classes - 1):

@NicolasHug (Contributor) Sep 3, 2019
Suggested change
if len(weights) == (n_classes - 1):
if len(weights) == n_classes - 1:

@CatChenal (Author, Contributor) Sep 3, 2019

ok

CatChenal added 2 commits Sep 3, 2019

NicolasHug (Contributor) left a comment

Minor comment about test coverage, but LGTM anyway. Thanks @CatChenal!

if isinstance(weights, list):
weights = weights + [1.0 - sum(weights)]
else:
weights = np.resize(weights, n_classes)

@NicolasHug (Contributor) Sep 4, 2019

That part isn't covered by the tests. I think you can cover it easily by setting n_classes=3 in test_make_classification_weights_array_or_list_ok.

…ts_array_or_list_ok` as per @NicolasHug.

thomasjpfan (Member) left a comment

Thank you @CatChenal for working on this!

random_state=0)
X2, y2 = make_classification(weights=np.array([.1, .9]),
random_state=0)
assert (X1.all() == X2.all()) and (y1.all() == y2.all())

@thomasjpfan (Member) Sep 4, 2019

X1.all() returns True if X1 is all non-zero. Is this assertion meant to do the following:

assert_almost_equal(X1, X2)
assert_almost_equal(y1, y2)
make_classification(weights=weights)


def test_make_classification_weights_array_or_list_ok():

@thomasjpfan (Member) Sep 4, 2019

This can be parametrized:

@pytest.mark.parametrize("kwargs", [{}, {"n_classes": 3, "n_informative": 3}])
def test_make_classification_weights_array_or_list_ok(kwargs):
    X1, y1 = make_classification(weights=[.1, .9],
                                 random_state=0, **kwargs)
    X2, y2 = make_classification(weights=np.array([.1, .9]),
                                 random_state=0, **kwargs)
    assert_almost_equal(X1, X2)
    assert_almost_equal(y1, y2)

@CatChenal (Author, Contributor) Sep 4, 2019

Thank you!

@@ -91,7 +91,8 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
n_clusters_per_class : int, optional (default=2)
The number of clusters per class.
weights : list of floats or None (default=None)
weights : array-like of shape (n_classes,) or (n_classes - 1,),

@thomasjpfan (Member) Sep 5, 2019

Currently this is not rendered nicely.

To render nicely:

Suggested change
weights : array-like of shape (n_classes,) or (n_classes - 1,),
weights : array-like of shape (n_classes,) or (n_classes - 1,),\
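
For reference, a sketch of how the full parameter entry could look with the trailing backslash continuation (the description text below is illustrative, not quoted from this PR):

weights : array-like of shape (n_classes,) or (n_classes - 1,),\
        default=None
    The proportions of samples assigned to each class. If None, then
    classes are balanced.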

@CatChenal (Author, Contributor) Sep 5, 2019

Thanks, @thomasjpfan.
Would you please document how you reached that end-point to verify the rendering? My doc tree does not have a /modules/generated/ path.

@thomasjpfan (Member) Sep 5, 2019

When you build the HTML documentation using these instructions, there will be a new folder, doc/_build, which contains doc/_build/html/stable/index.html, the landing page of the scikit-learn documentation. From there you can navigate to the make_classification docs by going to the API page.

@thomasjpfan (Member) commented Sep 5, 2019

Please add an Enhancement entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@@ -85,6 +85,11 @@ Changelog
:func:`datasets.fetch_20newsgroups` and :func:`datasets.fetch_olivetti_faces`
. :pr:`14259` by :user:`Sourav Singh <souravsingh>`.

- |Enhancement| `make_classification` in :func:`datasets.samples_generator` now

@thomasjpfan (Member) Sep 5, 2019
Suggested change
- |Enhancement| `make_classification` in :func:`datasets.samples_generator` now
- |Enhancement| :func:`datasets.make_classification` now

@CatChenal (Author, Contributor) Sep 5, 2019

Of course!

- |Enhancement| `make_classification` in :func:`datasets.samples_generator` now
accepts array-like `weights` parameter, i.e. list or numpy.array, instead of
list only.
:pr:`14764` by :user:`Cat Chenal <CatChenal>`, with thanks to `WiMLDS <WiMLDS>`.

@thomasjpfan (Member) Sep 5, 2019
Suggested change
:pr:`14764` by :user:`Cat Chenal <CatChenal>`, with thanks to `WiMLDS <WiMLDS>`.
:pr:`14764` by :user:`Cat Chenal <CatChenal>`, with thanks to *WiMLDS*.

@CatChenal (Author, Contributor) Sep 5, 2019

Certainly not. That would downgrade @WiMLDS's contributions to open source, and to scikit-learn in particular.

@thomasjpfan (Member) Sep 5, 2019

This would link both you and @WiMLDS:

- |Enhancement| :func:`datasets.make_classification` now accepts array-like
  `weights` parameter, i.e. list or numpy.array, instead of list only.
  :pr:`14764` by :user:`Cat Chenal <CatChenal>`, with thanks to
  :user:`WiMLDS <WiMLDS>`.

@thomasjpfan (Member) Sep 6, 2019

Certainly not. That would downgrade @WiMLDS's contributions to open source, and to scikit-learn in particular.

Sorry, I misunderstood the intent of the string. I can see now that you were trying to link to the organization on GitHub. The above snippet should correctly link to their organization.

@rth (Member) Sep 8, 2019

I'm not sure about this; we usually acknowledge individuals, not organizations, in release notes. Funding organizations are typically mentioned in https://scikit-learn.org/stable/about.html#funding. I think we could maybe add a section there for WiMLDS and similar partner non-profit organizations. The problem with acknowledging organizations in release notes is that most contributions have some sort of organization behind them (NumFOCUS, the conference where the sprint happened, the company that allowed its employee to contribute during work time, etc.), and then deciding to acknowledge some but not others is tricky.

@CatChenal (Author, Contributor) Sep 8, 2019

Thanks for the information. The mention is then out of place in the release notes. I will remove it & add #WiMLDS in the final commit.

@reshamas (Contributor) commented Sep 9, 2019

@NicolasHug
Does this title need to include "MRG"?

cc: @kellycarmody

@NicolasHug changed the title from "Weights parameter of datasets.make_classification changed to sequence from list - Issue 14760" to "[MRG] Weights parameter of datasets.make_classification changed to sequence from list - Issue 14760" on Sep 9, 2019
@jnothman (Member) commented Sep 9, 2019

@reshamas (Contributor) commented Sep 9, 2019

@jnothman

Is WiMLDS being listed here as a sponsor, or rather a way that the contributor was able to learn to contribute? I like the WiMLDS mention in the change log.

WiMLDS contributed in the following ways:

  • organized the event: provided a way for people to contribute
  • sponsored the event: funding
  • sprint contributors
  • donated to NumFOCUS, for scikit-learn

Any way that is acknowledged would be cool.

@CatChenal changed the title from "[MRG] Weights parameter of datasets.make_classification changed to sequence from list - Issue 14760" to "[MRG] Weights parameter of datasets.make_classification changed to array-like from list only - Issue 14760" on Sep 9, 2019
@NicolasHug (Contributor) commented Sep 18, 2019

@thomasjpfan @rth it seems the comments were addressed; let's merge?

@thomasjpfan merged commit 8720684 into scikit-learn:master on Sep 18, 2019

18 checks passed

LGTM analysis: C/C++ - No code changes detected
LGTM analysis: JavaScript - No code changes detected
LGTM analysis: Python - No new or fixed alerts
ci/circleci: deploy - Your tests passed on CircleCI!
ci/circleci: doc - Your tests passed on CircleCI!
ci/circleci: doc artifact - Link to 0/doc/_changed.html
ci/circleci: doc-min-dependencies - Your tests passed on CircleCI!
ci/circleci: lint - Your tests passed on CircleCI!
codecov/patch - 100% of diff hit (target 96.9%)
codecov/project - Absolute coverage decreased by -0.2% but relative coverage increased by +3.09% compared to af2bad4
scikit-learn.scikit-learn - Build #20190908.16 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) - succeeded
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) - succeeded
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) - succeeded
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) - succeeded
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) - succeeded
scikit-learn.scikit-learn (Windows py37_conda_mkl) - succeeded
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) - succeeded

@NicolasHug (Contributor) commented Sep 18, 2019

Thanks @CatChenal !!

@CatChenal (Author, Contributor) commented Sep 18, 2019

Thank you @NicolasHug and @thomasjpfan!
