[MRG] Weights parameter of datasets.make_classification changed to array-like from list only - Issue 14760 #14764

Merged on Sep 18, 2019 (37 commits)
Changes from 33 commits
5cb4bda
Finalizes fix for #12202 from abandonned PR by @parul-l
CatChenal Dec 18, 2018
a39baed
Completes 12202 fix abandoned by @parul-l
CatChenal Dec 18, 2018
a5b73bb
Completes 12202 fix abandoned by @parul-l
CatChenal Dec 18, 2018
e9f9594
Extends 12202 fix over feature_extraction/image.py
CatChenal Dec 18, 2018
35da14a
Closes #12202; white space removal
CatChenal Dec 18, 2018
0730c7a
Closes #12202; white space removal2
CatChenal Dec 18, 2018
f81beda
Closes #12202; 3.5 compliance; added >>> in docstring code.
CatChenal Dec 18, 2018
294ce61
Closes #12202; indentation discrep.
CatChenal Dec 18, 2018
4f12330
Closes #12202; indentation discrep.2
CatChenal Dec 18, 2018
2a8d782
Example output formating; @jnotham
CatChenal Dec 21, 2018
d14b3b6
Example output formating; forgot flake8
CatChenal Dec 21, 2018
3f8b0b1
Closes #12202; Removed excessive indentation in docstring (#wimlds)
CatChenal Jan 5, 2019
5fc5496
conflict resolution?
CatChenal Jan 5, 2019
236dafb
Closes #12202; Fixed inconsistent indentation in docstring (#wimlds)
CatChenal Jan 5, 2019
0bbbea4
Closes #12202 (#wimlds); intentation, v3.5 compliance
CatChenal Jan 9, 2019
f173bf6
Closes #12202 (#wimlds); Output format issue solved with addition of …
CatChenal Jan 16, 2019
aeae3b0
Closes #12202 (#wimlds); Output format issue solved with addition of …
CatChenal Jan 16, 2019
e38bca9
Closes #12202 (#wimlds); Testing doctest direc.: removed DONT_ACCEPT_…
CatChenal Jan 25, 2019
f9352ab
Closes #12202 (#wimlds); Removed blank lines in doctest example.
CatChenal Jan 30, 2019
de75f0b
weigts in make_classification as sequence not list (#wimlds)
CatChenal Aug 24, 2019
106916f
resolved merge
CatChenal Aug 24, 2019
ec833ed
fix conflicts with upstream/master
CatChenal Aug 24, 2019
8e3cb1b
split if-statement
CatChenal Aug 24, 2019
776e74e
added test `test_make_classification_weights_type` in test_samples_ge…
CatChenal Aug 24, 2019
4b700b8
fixed flake8 & pylint errors
CatChenal Aug 26, 2019
0c7fec4
flake8 err in test file
CatChenal Aug 26, 2019
cac527a
Added parametrized tests for weights type.
CatChenal Aug 28, 2019
6110038
Added untrapped TypeError in samples_generator.py and tests.
CatChenal Aug 28, 2019
7946476
Added untrapped TypeError in samples_generator.py and tests.
CatChenal Aug 28, 2019
fd2eae6
Minor changes as per @NicolasHug
CatChenal Sep 3, 2019
a6dd8d9
Corrected `assert` statement in `test_make_multilabel_classification_…
CatChenal Sep 3, 2019
e9f89bf
Added coverage for weiths resizing in `test_make_classification_weigh…
CatChenal Sep 4, 2019
0c2a124
Corrected `test_make_classification_weights_array_or_list_ok` as per …
CatChenal Sep 4, 2019
b185474
Prettified docstr + updated whats_new/v0.22.rst.
CatChenal Sep 5, 2019
9e7c60b
Fixed rst problem in whats_new/v0.22.rst. :func:package.module.method…
CatChenal Sep 5, 2019
d0bf043
Fixed rst :func: ref as per @thomasjpfan.
CatChenal Sep 5, 2019
11a179d
Removed organization link in release notes. #WiMLDS`
CatChenal Sep 8, 2019
32 changes: 19 additions & 13 deletions sklearn/datasets/samples_generator.py
@@ -91,7 +91,8 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
n_clusters_per_class : int, optional (default=2)
The number of clusters per class.

weights : list of floats or None (default=None)
weights : array-like of shape (n_classes,) or (n_classes - 1,),
Member

Currently this is not rendered nicely.

To render nicely:

Suggested change
weights : array-like of shape (n_classes,) or (n_classes - 1,),
weights : array-like of shape (n_classes,) or (n_classes - 1,),\
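
The trailing backslash in the suggestion works because, in an ordinary (non-raw) Python string literal, a backslash immediately followed by a newline is dropped, so the two source lines reach the documentation tooling as a single parameter line. A minimal sketch of that behavior, where `demo` is a hypothetical function used only for illustration:

```python
def demo(weights=None):
    """Hypothetical function illustrating a wrapped parameter line.

    Parameters
    ----------
    weights : array-like of shape (n_classes,) or (n_classes - 1,),\
              (default=None)
        The proportions of samples assigned to each class.
    """


# The backslash-newline pair is removed when the literal is parsed, so the
# type description and "(default=None)" end up on the same docstring line.
param_line = next(
    line
    for line in demo.__doc__.splitlines()
    if line.strip().startswith("weights :")
)
print(param_line)
```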

Contributor Author

Thanks, @thomasjpfan.
Would you please document how you reached that end-point to verify the rendering? My doc tree does not have a /modules/generated/ path.

Member

@thomasjpfan thomasjpfan Sep 5, 2019

When you build the HTML documentation using these instructions, a new folder, doc/_build, is created. It contains doc/_build/html/stable/index.html, which is the landing page of the scikit-learn documentation. From there you can navigate to the make_classification docs by going to the API page.

(default=None)
The proportions of samples assigned to each class. If None, then
classes are balanced. Note that if ``len(weights) == n_classes - 1``,
then the last class weight is automatically inferred.
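
To make the inference rule in the docstring concrete: when only `n_classes - 1` proportions are given, the remaining class receives whatever is left of 1. The sketch below is illustrative arithmetic in plain Python, not the library code, and `infer_last_weight` is a hypothetical helper name.

```python
def infer_last_weight(weights):
    """Append the proportion left over after the given class weights."""
    return list(weights) + [1.0 - sum(weights)]


# Three classes, two proportions supplied: the third is inferred.
full = infer_last_weight([0.2, 0.3])
print(full)
```
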
@@ -160,22 +161,27 @@ def make_classification(n_samples=100, n_features=20, n_informative=2,
" features")
# Use log2 to avoid overflow errors
if n_informative < np.log2(n_classes * n_clusters_per_class):
raise ValueError("n_classes * n_clusters_per_class must"
" be smaller or equal 2 ** n_informative")
if weights and len(weights) not in [n_classes, n_classes - 1]:
raise ValueError("Weights specified but incompatible with number "
"of classes.")
msg = "n_classes({}) * n_clusters_per_class({}) must be"
msg += " smaller or equal 2**n_informative({})={}"
raise ValueError(msg.format(n_classes, n_clusters_per_class,
n_informative, 2**n_informative))

if weights is not None:
if len(weights) not in [n_classes, n_classes - 1]:
raise ValueError("Weights specified but incompatible with number "
"of classes.")
if len(weights) == n_classes - 1:
if isinstance(weights, list):
weights = weights + [1.0 - sum(weights)]
else:
weights = np.resize(weights, n_classes)
Member

That part isn't covered by the tests. I think you can cover it easily by setting n_classes=3 in test_make_classification_weights_array_or_list_ok.

weights[-1] = 1.0 - sum(weights[:-1])
else:
weights = [1.0 / n_classes] * n_classes

n_useless = n_features - n_informative - n_redundant - n_repeated
n_clusters = n_classes * n_clusters_per_class

if weights and len(weights) == (n_classes - 1):
weights = weights + [1.0 - sum(weights)]

if weights is None:
weights = [1.0 / n_classes] * n_classes
weights[-1] = 1.0 - sum(weights[:-1])

# Distribute samples among clusters by weight
n_samples_per_cluster = [
int(n_samples * weights[k % n_classes] / n_clusters_per_class)
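
The move from `if weights and ...` to `if weights is not None:` matters because truth-testing a NumPy array with more than one element raises `ValueError`. The sketch below restates the new branch as a standalone function so the behavior can be tried in isolation. `complete_weights` is a hypothetical name, and the indentation is reconstructed from the flattened diff above, so treat it as an approximation of the patched code rather than a copy.

```python
import numpy as np


def complete_weights(weights, n_classes):
    """Validate class weights and infer the last one if it is omitted."""
    if weights is None:
        # Balanced classes by default.
        return [1.0 / n_classes] * n_classes
    if len(weights) not in (n_classes, n_classes - 1):
        raise ValueError("Weights specified but incompatible with number "
                         "of classes.")
    if len(weights) == n_classes - 1:
        if isinstance(weights, list):
            weights = weights + [1.0 - sum(weights)]
        else:
            # np.resize returns a new, longer array; the padded last slot
            # is then overwritten with the remaining proportion.
            weights = np.resize(weights, n_classes)
            weights[-1] = 1.0 - sum(weights[:-1])
    return weights


# Truth-testing a multi-element array is ambiguous, which is what a guard
# of the form `if weights and ...` runs into for array input.
try:
    bool(np.array([0.25, 0.75]))
    truth_test_failed = False
except ValueError:
    truth_test_failed = True

from_list = complete_weights([0.3], n_classes=2)
from_array = complete_weights(np.array([0.3]), n_classes=2)
print(truth_test_failed, from_list, from_array)
```
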
30 changes: 30 additions & 0 deletions sklearn/datasets/tests/test_samples_generator.py
@@ -146,6 +146,36 @@ def test_make_classification_informative_features():
n_clusters_per_class=2)


@pytest.mark.parametrize(
'weights, err_type, err_msg',
[
([], ValueError,
"Weights specified but incompatible with number of classes."),
([.25, .75, .1], ValueError,
"Weights specified but incompatible with number of classes."),
(np.array([]), ValueError,
"Weights specified but incompatible with number of classes."),
(np.array([.25, .75, .1]), ValueError,
"Weights specified but incompatible with number of classes."),
(np.random.random(3), ValueError,
"Weights specified but incompatible with number of classes.")
]
)
def test_make_classification_weights_type(weights, err_type, err_msg):
with pytest.raises(err_type, match=err_msg):
make_classification(weights=weights)
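
Two details of these cases are easy to miss. First, `make_classification` defaults to `n_classes=2`, so only lengths 1 and 2 are acceptable and every parametrized input (length 0 or 3) must be rejected. Second, an empty list is falsy, so a guard of the form `if weights and ...` silently skips it, whereas an explicit `is not None` test does not. A plain-Python sketch of the two guards follows; the function names are hypothetical and only the length check is modeled.

```python
def old_guard_rejects(weights, n_classes=2):
    """Length check guarded by truthiness, as in the pre-change code."""
    return bool(weights and len(weights) not in [n_classes, n_classes - 1])


def new_guard_rejects(weights, n_classes=2):
    """Length check guarded by an explicit None test."""
    return (weights is not None
            and len(weights) not in [n_classes, n_classes - 1])


for candidate in ([], [0.25, 0.75, 0.1], [0.5], None):
    print(candidate, old_guard_rejects(candidate),
          new_guard_rejects(candidate))
```
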


@pytest.mark.parametrize("kwargs", [{}, {"n_classes": 3, "n_informative": 3}])
def test_make_classification_weights_array_or_list_ok(kwargs):
X1, y1 = make_classification(weights=[.1, .9],
random_state=0, **kwargs)
X2, y2 = make_classification(weights=np.array([.1, .9]),
random_state=0, **kwargs)
assert_almost_equal(X1, X2)
assert_almost_equal(y1, y2)
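
The `n_classes=3` parametrization is what reaches the `np.resize` branch, since `[.1, .9]` then has `n_classes - 1` entries. `np.resize` pads by repeating the input, and the padded slot is overwritten afterwards, so the list and array paths should agree. A small check of that claim, independent of scikit-learn:

```python
import numpy as np

given = [0.1, 0.9]
n_classes = 3

# List path: concatenate the leftover proportion.
list_path = given + [1.0 - sum(given)]

# Array path: np.resize pads with repeated copies of the input, then the
# last entry is replaced by the leftover proportion.
array_path = np.resize(np.array(given), n_classes)
padded_value = array_path[-1]  # copied from the first element of `given`
array_path[-1] = 1.0 - sum(array_path[:-1])

np.testing.assert_allclose(list_path, array_path)
print(padded_value, array_path)
```

Note that with these particular weights the two given proportions already sum to 1, so the inferred third proportion is 0.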


def test_make_multilabel_classification_return_sequences():
for allow_unlabeled, min_length in zip((True, False), (0, 1)):
X, Y = make_multilabel_classification(n_samples=100, n_features=20,