[MRG] Weights parameter of datasets.make_classification changed to array-like from list only - Issue 14760 #14764
Conversation
…addition of # doctest: +NORMALIZE_WHITESPACE +DONT_ACCEPT_BLANKLINE +ELLIPSIS for print statements.
…addition of # doctest: +NORMALIZE_WHITESPACE +DONT_ACCEPT_BLANKLINE for print statements.
…DONT_ACCEPT_BLANKLINE (@jnothmam, @reshamas)
@@ -162,14 +162,14 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
     if n_informative < np.log2(n_classes * n_clusters_per_class):
         raise ValueError("n_classes * n_clusters_per_class must"
                          " be smaller or equal 2 ** n_informative")
-    if weights and len(weights) not in [n_classes, n_classes - 1]:
+    if all(weights) and len(weights) not in [n_classes, n_classes - 1]:
maybe `weights is not None` would be more clear?
sklearn/feature_extraction/image.py
Outdated
@@ -337,22 +337,37 @@ def extract_patches_2d(image, patch_size, max_patches=None, random_state=None):
     Examples
     --------
<<<<<<< HEAD
merge issues here
@@ -162,17 +162,18 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
     if n_informative < np.log2(n_classes * n_clusters_per_class):
         raise ValueError("n_classes * n_clusters_per_class must"
                          " be smaller or equal 2 ** n_informative")
-    if weights and len(weights) not in [n_classes, n_classes - 1]:
-        raise ValueError("Weights specified but incompatible with number "
+    if not weights is None:
the idiomatic Python way is `weights is not None`. I would use a single `if` and `and`. You can put the whole condition in parentheses and make it multi-line
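A minimal sketch of what that suggestion might look like (illustrative only, using the names from the diff above; this is not the final PR code):

```python
def check_weights(weights, n_classes):
    # A single `if` joined with `and`; the whole condition is wrapped
    # in parentheses so it can span multiple lines.
    if (weights is not None
            and len(weights) not in (n_classes, n_classes - 1)):
        raise ValueError("Weights specified but incompatible with number "
                         "of classes.")
```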
ok.
sklearn/feature_extraction/image.py
Outdated
@@ -337,22 +337,22 @@ def extract_patches_2d(image, patch_size, max_patches=None, random_state=None):
     Examples
     --------
-    >>> from sklearn.datasets import load_sample_image
+    >>> from sklearn.datasets import load_sample_images
this change seems unrelated
ping @CatChenal can you please revert these changes
Please add a non-regression test that would fail at master but pass in this PR.
@@ -91,7 +91,7 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
     n_clusters_per_class : int, optional (default=2)
         The number of clusters per class.
-    weights : list of floats or None (default=None)
+    weights : sequence of floats or None (default=None)
We call this an array-like.
-    weights : sequence of floats or None (default=None)
+    weights : array-like of shape (n_classes,) or (n_classes - 1,), default=None
@@ -162,17 +162,18 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
     if n_informative < np.log2(n_classes * n_clusters_per_class):
         raise ValueError("n_classes * n_clusters_per_class must"
                          " be smaller or equal 2 ** n_informative")
-    if weights and len(weights) not in [n_classes, n_classes - 1]:
+    w_ok = (weights is not None) and all(weights)
shouldn't an error be raised if `all(weights)` is false?
-    if weights is None:
+    if weights is not None:
+        if all(weights) and len(weights) == (n_classes - 1):
-        if all(weights) and len(weights) == (n_classes - 1):
+        if len(weights) == (n_classes - 1):
    if weights is not None:
        if all(weights) and len(weights) == (n_classes - 1):
            weights = weights + [1.0 - sum(weights)]
    else:
        weights = [1.0 / n_classes] * n_classes
        weights[-1] = 1.0 - sum(weights[:-1])
That's not you, but that line is useless ;)
Which line?
Line 175 resizes the (n_classes - 1) array with the missing weight, so it makes sense.
Line 178 recalculates the last position of weights according to the values set on line 177;
so line 178 is the useless one, no?
weights = [1.0 / n_classes] * n_classes
weights[-1] = 1.0 - sum(weights[:-1]) # <-- this one
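A quick illustration of why that last line is redundant (n_classes = 3 is an arbitrary choice for the demo): the uniform weights already sum to 1.0 up to rounding, so the reassignment changes weights[-1] by at most a few ulps.

```python
n_classes = 3  # any value works; 3 is just for illustration
weights = [1.0 / n_classes] * n_classes
before = weights[-1]
weights[-1] = 1.0 - sum(weights[:-1])  # effectively a no-op
assert abs(weights[-1] - before) < 1e-15
```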
    # w as array: should pass in PR_14764, fail in master
    w = np.array([0.25, 0.75])
    X, y = make_classification(weights=w)
    assert X.shape == (100, 20), "X shape mismatch"
We like to parametrize these kinds of tests. You can look for some inspiration in e.g. this test
Let us know if you need any help
None of the tests in `test_samples_generator.py` are parametrized. Do you want me to parametrize all of them or just the tests for `make_classification()`?
Oops, found one: test_make_blobs_n_samples_centers_none()
I was just suggesting to parametrize the test you wrote
Good...
I just found out one wrong way to do it:
@pytest.mark.parametrize(
'params, err_msg',
[({'weights': 0}, "object of type 'int' has no len()"),
({'weights': -1}, "object of type 'int' has no len()"),
({'weights': []}, "Weights specified but incompatible with number of classes."),
({'weights': [.25,.75,.1]}, "Weights specified but incompatible with number of classes."),
({'weights': np.array([])},"Weights specified but incompatible with number of classes."),
({'weights': np.array([.25,.75,.1])},"Weights specified but incompatible with number of classes.")]
)
def test_make_classification_weights_type(params, err_msg):
make = partial(make_classification,
n_samples=100,
n_features=20,
n_informative=2,
n_redundant=2,
n_repeated=0,
n_classes=2,
n_clusters_per_class=2,
flip_y=0.01,
class_sep=1.0,
hypercube=True,
shift=0.0,
scale=1.0,
shuffle=True,
random_state=0)
for i in range(len(params)):
with pytest.raises(ValueError, match=err_msg[i]):
make(weights=params[i]['weights'])
The first problem is that the `mark.parametrize` statement is incorrect: the weights in the partial functions are not split, and I have not found out how to fix it yet.
The other problem is likely the iteration inside the context manager (it should not be needed...).
Thanks for pointing me in the right direction.
You're almost there. Here is a basic example:
@pytest.mark.parametrize(
'weights, err_msg',
[
([1, 2, 3], "incompatible with number of classes"),
# add other test cases here
]
)
def test_make_classification_weights_type(weights, err_msg):
with pytest.raises(ValueError, match=err_msg):
make_classification(weights=weights)
Thanks @CatChenal, I made a few more comments but mostly looks good. Could you also please add a very simple test that makes sure passing e.g. `[1, 2, 3]` gives the same result as passing `np.array([1, 2, 3])`. Thanks!
sklearn/feature_extraction/image.py
Outdated
@@ -337,22 +337,22 @@ def extract_patches_2d(image, patch_size, max_patches=None, random_state=None):
     Examples
     --------
-    >>> from sklearn.datasets import load_sample_image
+    >>> from sklearn.datasets import load_sample_images
ping @CatChenal can you please revert these changes
@@ -380,7 +390,7 @@ def sample_example():
     X_indices = array.array('i')
     X_indptr = array.array('i', [0])
     Y = []
-    for i in range(n_samples):
+    for _ in range(n_samples):
Please avoid unrelated changes
ok
@@ -1237,7 +1247,7 @@ def make_spd_matrix(n_dim, random_state=None):
     generator = check_random_state(random_state)

     A = generator.rand(n_dim, n_dim)
-    U, s, V = linalg.svd(np.dot(A.T, A))
+    U, _, V = linalg.svd(np.dot(A.T, A))
same here
ok
@@ -11,6 +11,8 @@
 from sklearn.utils.testing import assert_array_almost_equal
 from sklearn.utils.testing import assert_raise_message

+from sklearn.utils.validation import assert_all_finite
same here
ok
                         n_informative, 2**n_informative))

    if weights is not None:
        if isinstance(weights, int):
I don't think we need a specific check for int (else it means we would need specific checks for pretty much every type). I guess a safe way is to convert the weights to a numpy array. You can then just check the length as you do below, and use np.sum everywhere.
        if len(weights) not in [n_classes, n_classes - 1]:
            raise ValueError("Weights specified but incompatible with number "
                             "of classes.")
        if len(weights) == (n_classes - 1):
-        if len(weights) == (n_classes - 1):
+        if len(weights) == n_classes - 1:
ok
Minor comment about test coverage, but LGTM anyway. Thanks @CatChenal !
        if isinstance(weights, list):
            weights = weights + [1.0 - sum(weights)]
        else:
            weights = np.resize(weights, n_classes)
That part isn't covered by the tests. I think you can cover it easily by setting n_classes=3 in `test_make_classification_weights_array_or_list_ok`.
…ts_array_or_list_ok` as per @NicolasHug.
                                 random_state=0)
    X2, y2 = make_classification(weights=np.array([.1, .9]),
                                 random_state=0)
    assert (X1.all() == X2.all()) and (y1.all() == y2.all())
`X1.all()` returns True if X1 is all non-zero. Is this assertion meant to do the following?

assert_almost_equal(X1, X2)
assert_almost_equal(y1, y2)
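To see why the original assertion is too weak: `.all()` collapses each array to a single boolean, so any two all-nonzero arrays pass the comparison even when they differ everywhere. A small demonstration (the arrays here are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([5.0, 6.0])
# Both arrays are all-nonzero, so both .all() calls return True and
# the "equality" holds despite a and b sharing no values.
assert a.all() == b.all()
assert not np.array_equal(a, b)
```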
        make_classification(weights=weights)


def test_make_classification_weights_array_or_list_ok():
This can be parametrized:
@pytest.mark.parametrize("kwargs", [{}, {"n_classes": 3, "n_informative": 3}])
def test_make_classification_weights_array_or_list_ok(kwargs):
X1, y1 = make_classification(weights=[.1, .9],
random_state=0, **kwargs)
X2, y2 = make_classification(weights=np.array([.1, .9]),
random_state=0, **kwargs)
assert_almost_equal(X1, X2)
assert_almost_equal(y1, y2)
Thank you!
@@ -91,7 +91,8 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
     n_clusters_per_class : int, optional (default=2)
         The number of clusters per class.
-    weights : list of floats or None (default=None)
+    weights : array-like of shape (n_classes,) or (n_classes - 1,),
Currently this is not rendered nicely.
To render nicely:
-    weights : array-like of shape (n_classes,) or (n_classes - 1,),
+    weights : array-like of shape (n_classes,) or (n_classes - 1,),\
Thanks, @thomasjpfan.
Would you please document how you reached that end-point to verify the rendering? My doc tree does not have a `/modules/generated/` path.
When you build the html documentation using these instructions, there will be a new folder, `doc/_build`, which contains `doc/_build/html/stable/index.html`, the landing page of the scikit-learn docs. From there you can navigate to the `make_classification` docs by going to the API page.
Please add
doc/whats_new/v0.22.rst
Outdated
@@ -85,6 +85,11 @@ Changelog
   :func:`datasets.fetch_20newsgroups` and :func:`datasets.fetch_olivetti_faces`.
   :pr:`14259` by :user:`Sourav Singh <souravsingh>`.

+- |Enhancement| `make_classification` in :func:`datasets.samples_generator` now
-- |Enhancement| `make_classification` in :func:`datasets.samples_generator` now
+- |Enhancement| :func:`datasets.make_classification` now
Of course!
doc/whats_new/v0.22.rst
Outdated
- |Enhancement| `make_classification` in :func:`datasets.samples_generator` now
  accepts array-like `weights` parameter, i.e. list or numpy.array, instead of
  list only.
  :pr:`14764` by :user:`Cat Chenal <CatChenal>`, with thanks to `WiMLDS <WiMLDS>`.
-  :pr:`14764` by :user:`Cat Chenal <CatChenal>`, with thanks to `WiMLDS <WiMLDS>`.
+  :pr:`14764` by :user:`Cat Chenal <CatChenal>`, with thanks to *WiMLDS*.
Certainly not. That would downgrade @wimlds's contributions to Opensource & Scikit-learn in particular.
This would link you and @wimlds
- |Enhancement| :func:`datasets.make_classification` now accepts array-like
`weights` parameter, i.e. list or numpy.array, instead of list only.
:pr:`14764` by :user:`Cat Chenal <CatChenal>`, with thanks to
:user:`WiMLDS <WiMLDS>`.
> Certainly not. That would downgrade @wimlds's contributions to Opensource & Scikit-learn in particular.

Sorry, I misunderstood the intent of the string. I can see now you were trying to link to the organization on GitHub. The above snippet should correctly link to their organization.
I'm not sure about this; we usually acknowledge individuals, not organizations, in release notes. Funding organizations are typically mentioned in https://scikit-learn.org/stable/about.html#funding. I think we could maybe add a section there for WiMLDS and similar partner non-profit organizations. The problem with acknowledging organizations in release notes is that most contributions have some sort of organization behind them (NumFOCUS, the conference where the sprint happened, the company that allowed its employee to contribute during work time, etc.), and then deciding to acknowledge some but not others is tricky.
Thanks for the information. The mention is then out of place in the release notes. I will remove it & add #WiMLDS in the final commit.
@NicolasHug cc: @kellycarmody
Is WiMLDS being listed here as a sponsor, or rather a way that the
contributor was able to learn to contribute? I like the WiMLDS mention in
the change log.
WiMLDS contributed in the following ways:
Any way that is acknowledged would be cool.
@thomasjpfan @rth comments were addressed it seems, let's merge?
Thanks @CatChenal !!
Thank you @NicolasHug and @thomasjpfan!
Fixes #14760
What does this implement/fix?
The weights parameter can be a list or an array, not just a list; e.g. "`weights` is array-like or None" in the docstring.