[MRG] Use resample to compute the small training set in HistGBT #14194
Conversation
```python
return X_binned_small_train, y_small_train
indices = np.arange(X_binned_train.shape[0])
indices = resample(indices, n_samples=subsample_size,
                   replace=False, random_state=seed)
```
If `self` is a classifier (if `is_classifier(self)`) and `y_train` is a set of integer class labels, `y_train` should be passed to the `stratify` kwarg of `resample`.
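For concreteness, a minimal sketch of what the method could look like with this suggestion applied — an illustration based on the diff above, not the PR's exact code:

```python
import numpy as np
from sklearn.base import is_classifier
from sklearn.utils import resample

def _get_small_trainset(self, X_binned_train, y_train, seed, subsample_size=10000):
    # Stratify on the class labels only when the estimator is a classifier.
    stratify = y_train if is_classifier(self) else None
    indices = np.arange(X_binned_train.shape[0])
    indices = resample(indices, n_samples=subsample_size,
                       replace=False, random_state=seed, stratify=stratify)
    return X_binned_train[indices], y_train[indices]
```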
Good catch!

I think that `y_train` is always encoded using `LabelEncoder()` when `self` is a classifier, so there's no need to check that `y_train` is a set of integer class labels: `y = self._encode_y(y)`
scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
Lines 949 to 962 in 5310b87
```python
def _encode_y(self, y):
    # encode classes into 0 ... n_classes - 1 and sets attributes classes_
    # and n_trees_per_iteration_
    check_classification_targets(y)
    label_encoder = LabelEncoder()
    encoded_y = label_encoder.fit_transform(y)
    self.classes_ = label_encoder.classes_
    n_classes = self.classes_.shape[0]
    # only 1 tree for binary classification. For multiclass classification,
    # we build 1 tree per class.
    self.n_trees_per_iteration_ = 1 if n_classes <= 2 else n_classes
    encoded_y = encoded_y.astype(Y_DTYPE, copy=False)
    return encoded_y
```
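For reference, a quick demonstration that `LabelEncoder` always yields integer labels `0 ... n_classes - 1`, which is what makes the extra check unnecessary:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(["spam", "ham", "spam", "eggs"]))  # [2 1 2 0]
print(le.classes_)                                        # ['eggs' 'ham' 'spam']
```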
The codecov failures reveal that this branch of the `if` condition is never tested. I think we need a smoke test with the smallest dataset that can trigger this branch. I am not sure if there is an easy way to check that stratification is enabled on the small training set; if you don't see any, don't bother. Just a smoke test to ensure that this code is covered is enough.
```python
def test_stratification_small_trainset():
    # Make sure that the small trainset is stratified
    X, y = make_classification(n_samples=20000, n_features=2,
                               n_informative=2, n_redundant=0)
```
`y` is balanced by default. I think you should update the test to use an imbalanced `y`:

```python
n_samples = 20000
rng = np.random.RandomState(42)
X = rng.randn(n_samples).reshape(-1, 1)
y = (rng.rand(n_samples) > 0.9).astype(np.int32)
```
Is it really important? If `stratify=None`, only the expected value of the mean of `y_train` will be equal to the mean of `y` (according to the law of large numbers), whereas if `stratify=y`, the mean of `y_train` will be equal to the mean of `y` (excluding rounding approximation -- that's why I set `decimal=3` in the test).

I don't think that changing the balance of `y` will change the tested behavior, because it is independent of the balance of `y` imo.
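A small illustration of the distinction being made here (the numbers are illustrative, not the test's exact values): without stratification the subsample mean of `y` only matches the population mean in expectation, while with `stratify=y` it matches up to rounding of the per-class counts.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
y = (rng.rand(20000) > 0.9).astype(np.int32)

# Plain subsampling: the mean is only close to y.mean() on average.
y_plain = resample(y, n_samples=10000, replace=False, random_state=0)
# Stratified subsampling: the mean equals y.mean() up to rounding.
y_strat = resample(y, n_samples=10000, replace=False, random_state=0,
                   stratify=y)
print(y.mean(), y_plain.mean(), y_strat.mean())
```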
I made an imbalanced dataset with a deterministic ratio (10% for the minority class), so that we can test that the balance is strictly equal to the expected balance (instead of using an approximation because of rounding issues from a finite sample).
```python
class_one_prop = 0.1
rng = np.random.RandomState(42)
X = rng.randn(n_samples).reshape(n_samples, 1)
y = np.asarray(
```
Use `np.random.binomial`?
Mainly I don't like that it's not shuffled. If you really want an exact proportion (though I doubt it's really useful) maybe use something like:

```python
y = np.zeros(...)
y[:n_samples * class_one_prop] = 1
shuffle(y)
```
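A runnable version of this idea, with the `int` cast on the slice bound that the sketch above leaves implicit (the author applies the same fix later in the thread):

```python
import numpy as np
from sklearn.utils import shuffle

n_samples = 20000
class_one_prop = 0.1

# Exact class proportion by construction, then shuffled so the positive
# samples are not all at the start of the array.
y = np.zeros(n_samples, dtype=np.int32)
y[:int(n_samples * class_one_prop)] = 1
y = shuffle(y, random_state=42)
assert y.mean() == class_one_prop
```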
```python
def test_small_trainset(GradientBoosting, data):
    # Make sure that a small trainset has the expected length (10k samples)
    X, y = data
    gb = GradientBoosting(random_state=42)
```
I think you can merge this test into the other one; we don't really need to test both classes. The classifier is enough (others might disagree?).

Also, nit: do you really need to set the random state of the estimator?
Indeed, the random state of the estimator is not needed (I don't think that it is used: the dataset is small enough that the binning process will use the whole dataset).
```python
)
gb = HistGradientBoostingClassifier(random_state=42)
X_small_train, y_small_train = gb._get_small_trainset(X, y, seed=42)
np.testing.assert_equal(y_small_train.mean(), class_one_prop)
```
Use `assert y_small_train.mean() == pytest.approx(class_one_prop)`.

There is still randomness in the subsampling though, right?
Yes! Which samples end up in the small training set is random, but not the balance of the classes.
Imho, one issue with using a random proportion for `y` is that the test can only check an approximate equality. To tackle this issue, having a fixed proportion with a fixed number of samples makes it possible to check a strict equality instead of an approximation, which makes the test better.

Edit: I use

```python
y = np.zeros(...)
y[:int(n_samples * class_one_prop)] = 1
shuffle(y)
```
```python
class_one_prop = 0.1
rng = np.random.RandomState(42)
X = rng.randn(n_samples).reshape(n_samples, 1)
y = rng.binomial(1, p=0.1, size=n_samples)
```
Suggested change:

```diff
-y = rng.binomial(1, p=0.1, size=n_samples)
+y = rng.binomial(1, p=class_one_prop, size=n_samples)
```
Yeah, checking stratification is always tricky because of the issue you mentioned. I'm OK with a fixed proportion.
We could also increase the number of classes, to increase the variance of the class distribution in the small training set when there is no stratification.
I added more classes with a fixed distribution:

```python
original_prop = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
```

and the test compares the original distribution with the distribution in the small training set:

```python
unique, counts = np.unique(y_small, return_counts=True)
small_prop = {class_: count / 10000 for (class_, count) in zip(unique, counts)}
assert small_prop == original_prop
```

Edit: I updated the code to use …
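Putting the pieces together, a self-contained sketch of this final test idea, using `resample` with `stratify=y` as a stand-in for the estimator's `_get_small_trainset` call:

```python
import numpy as np
from sklearn.utils import resample, shuffle

n_samples = 20000
original_prop = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

# Build y with an exact class distribution, then shuffle it.
y = np.concatenate([np.full(int(n_samples * prop), klass)
                    for klass, prop in original_prop.items()])
y = shuffle(y, random_state=42)

# Stand-in for the stratified subsampling done inside the estimator.
y_small = resample(y, n_samples=10000, replace=False, random_state=42,
                   stratify=y)

unique, counts = np.unique(y_small, return_counts=True)
small_prop = {klass: count / 10000 for klass, count in zip(unique, counts)}
assert small_prop == original_prop
```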
Thanks @johannfaouzi ! |
What does this implement/fix? Explain your changes.
This small PR replaces the use of `numpy.random.choice` with `sklearn.utils.resample` in order to compute the small training set in Histogram Gradient Boosting Trees.
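For context, a hedged sketch of the change (the variable names are illustrative, not the exact diff):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(42)
n_total, subsample_size = 20000, 10000
indices = np.arange(n_total)

# Before: plain NumPy subsampling, which cannot stratify.
indices_old = rng.choice(indices, subsample_size, replace=False)

# After: sklearn.utils.resample, which additionally accepts stratify=y
# for classifiers (see the review thread above).
indices_new = resample(indices, n_samples=subsample_size,
                       replace=False, random_state=42)
```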