
[MRG] Use resample to compute the small training set in HistGBT #14194

Merged: 12 commits merged into scikit-learn:master from johannfaouzi:use_resample_in_HistGBT on Jul 8, 2019

Conversation

johannfaouzi (Contributor) commented Jun 26, 2019

What does this implement/fix? Explain your changes.

This small PR replaces the use of numpy.random.choice with sklearn.utils.resample in order to compute the small training set in Histogram Gradient Boosting Trees.
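For context, the swap amounts to something like this (a minimal sketch; the variable names are illustrative, not the actual diff):

import numpy as np
from sklearn.utils import resample

indices = np.arange(1000)

# before (sketch): subsample without replacement using NumPy
rng = np.random.RandomState(0)
subset_np = rng.choice(indices, size=100, replace=False)

# after (sketch): the sklearn equivalent, which additionally supports
# a stratify= keyword (this becomes relevant later in the discussion)
subset_sk = resample(indices, n_samples=100, replace=False, random_state=0)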

@johannfaouzi johannfaouzi changed the title Use resample to compute the small training set in HistGBT [MRG] Use resample to compute the small training set in HistGBT Jun 26, 2019
indices = np.arange(X_binned_train.shape[0])
indices = resample(indices, n_samples=subsample_size,
                   replace=False, random_state=seed)
...
return X_binned_small_train, y_small_train

ogrisel (Member) commented Jul 1, 2019
If self is a classifier (i.e. is_classifier(self)) and y_train is a set of integer class labels, y_train should be passed to the stratify kwarg of resample.

johannfaouzi (Contributor, Author) commented Jul 1, 2019

Good catch!

I think that y_train is always encoded using LabelEncoder() when self is a classifier, so there's no need to check that y_train is a set of integer class labels:

def _encode_y(self, y):
    # encode classes into 0 ... n_classes - 1 and sets attributes classes_
    # and n_trees_per_iteration_
    check_classification_targets(y)
    label_encoder = LabelEncoder()
    encoded_y = label_encoder.fit_transform(y)
    self.classes_ = label_encoder.classes_
    n_classes = self.classes_.shape[0]
    # only 1 tree for binary classification. For multiclass classification,
    # we build 1 tree per class.
    self.n_trees_per_iteration_ = 1 if n_classes <= 2 else n_classes
    encoded_y = encoded_y.astype(Y_DTYPE, copy=False)
    return encoded_y
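Putting these two comments together, a minimal sketch of what the stratified subsampling could look like (the helper name is hypothetical; est, y_train, subsample_size and seed mirror the names in the diff above):

import numpy as np
from sklearn.base import is_classifier
from sklearn.utils import resample

def get_small_trainset_indices(est, y_train, subsample_size=10000, seed=0):
    # y_train is already label-encoded for classifiers (see _encode_y above),
    # so it can be passed directly as the stratification target; for
    # regressors, stratify stays None
    stratify = y_train if is_classifier(est) else None
    indices = np.arange(y_train.shape[0])
    return resample(indices, n_samples=subsample_size,
                    replace=False, random_state=seed,
                    stratify=stratify)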

johannfaouzi added 4 commits Jul 1, 2019
ogrisel (Member) commented Jul 1, 2019

The codecov failures reveal that this branch of the if condition is never tested. I think we need a smoke test with the smallest dataset that can trigger this branch (e.g. make_classification(n_samples=10001, n_features=2, n_informative=2, n_redundant=0) and make_regression(n_samples=10001, n_features=2, n_informative=2)).

I am not sure if there is an easy way to check that stratification is enabled on the small training set. If you don't see any, don't bother. Just a smoke test to ensure that this code is covered is enough.
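A smoke test along those lines might look like this (a sketch; the test name and max_iter value are assumptions, and at the time the estimators had to be enabled via sklearn.experimental):

import pytest
from sklearn.datasets import make_classification, make_regression
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              HistGradientBoostingRegressor)

@pytest.mark.parametrize('Est, X_y', [
    (HistGradientBoostingClassifier,
     make_classification(n_samples=10001, n_features=2,
                         n_informative=2, n_redundant=0, random_state=0)),
    (HistGradientBoostingRegressor,
     make_regression(n_samples=10001, n_features=2,
                     n_informative=2, random_state=0)),
])
def test_small_trainset_is_used(Est, X_y):
    # more than 10000 training samples triggers the subsampling branch
    X, y = X_y
    Est(max_iter=2).fit(X, y)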

def test_stratification_small_trainset():
    # Make sure that the small trainset is stratified
    X, y = make_classification(n_samples=20000, n_features=2,
                               n_informative=2, n_redundant=0)

ogrisel (Member) commented Jul 1, 2019

y is balanced by default. I think you should update the test to use an imbalanced y.

n_samples = 20000
rng = np.random.RandomState(42)
X = rng.randn(n_samples).reshape(-1, 1)
y = (rng.rand(n_samples) > 0.9).astype(np.int32)

johannfaouzi (Contributor, Author) commented Jul 1, 2019

Is it really important? If stratify=None, only the expected value of the mean of y_train is equal to the mean of y (the law of large numbers just makes the realized mean close to it), whereas if stratify=y, the mean of y_train is equal to the mean of y (up to rounding, which is why I set decimal=3 in the test).

I don't think that changing the balance of y will change the tested behavior, because the behavior is independent of the balance of y IMO.
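The difference is easy to see on a toy example (a sketch; the sizes are chosen so the class proportions divide evenly):

import numpy as np
from sklearn.utils import resample

y = np.array([0] * 18000 + [1] * 2000)  # exactly 10% positives

plain = resample(y, n_samples=10000, replace=False, random_state=0)
strat = resample(y, n_samples=10000, replace=False, random_state=0,
                 stratify=y)

print(plain.mean())  # close to 0.1, but only in expectation
print(strat.mean())  # exactly 0.1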

johannfaouzi (Contributor, Author) commented Jul 1, 2019

I made an imbalanced dataset with a deterministic ratio (10% for the minority class), so that we can test that the balance is strictly equal to the expected balance (instead of testing an approximation subject to sampling noise from a finite sample).

class_one_prop = 0.1
rng = np.random.RandomState(42)
X = rng.randn(n_samples).reshape(n_samples, 1)
y = np.asarray(

NicolasHug (Contributor) commented Jul 1, 2019

use np.random.binomial?

NicolasHug (Contributor) commented Jul 1, 2019

Mainly I don't like that it's not shuffled. If you really want an exact proportion (though I doubt it's really useful) maybe use something like:

y = np.zeros(...)
y[:int(n_samples * class_one_prop)] = 1
shuffle(y)
def test_small_trainset(GradientBoosting, data):
    # Make sure that a small trainset has the expected length (10k samples)
    X, y = data
    gb = GradientBoosting(random_state=42)

NicolasHug (Contributor) commented Jul 1, 2019

I think you can merge this test into the other one; we don't really need to test both classes. The classifier is enough (others might disagree?).

Also, nit: do you really need to set the random state of the estimator?

johannfaouzi (Contributor, Author) commented Jul 1, 2019

Indeed, the random state of the estimator is not needed (I don't think it is used; the dataset is small enough that the binning process will use the whole dataset).

)
gb = HistGradientBoostingClassifier(random_state=42)
X_small_train, y_small_train = gb._get_small_trainset(X, y, seed=42)
np.testing.assert_equal(y_small_train.mean(), class_one_prop)

NicolasHug (Contributor) commented Jul 1, 2019

use assert y_small_train.mean() == pytest.approx(class_one_prop)

NicolasHug (Contributor) commented Jul 1, 2019

> we can test that the balance is strictly equal to the expected balance

There is still randomness in the subsampling though right?

johannfaouzi (Contributor, Author) commented Jul 1, 2019

> we can test that the balance is strictly equal to the expected balance
>
> There is still randomness in the subsampling though right?

Yes! Which samples end up in the small training set is random, but not the balance of the classes.

johannfaouzi (Contributor, Author) commented Jul 1, 2019

IMHO one issue with using a random proportion for y instead of a fixed one is that the expected value of the proportion in y_train is equal to the proportion in y either way. Since the number of samples is quite large, the law of large numbers means that, even if stratify=None, the realized proportion in y_train will probably be close to the proportion in y.

To tackle this issue, having a fixed proportion with a fixed number of samples makes it possible to check a strict equality instead of an approximation, which makes the test better.

Edit: I am using np.random.binomial right now, but I think it would be better with a fixed proportion. If you agree, I will change it and use your code:

y = np.zeros(...)
y[:int(n_samples * class_one_prop)] = 1
shuffle(y)
NicolasHug (Contributor) left a comment

Thanks @johannfaouzi!

class_one_prop = 0.1
rng = np.random.RandomState(42)
X = rng.randn(n_samples).reshape(n_samples, 1)
y = rng.binomial(1, p=0.1, size=n_samples)

NicolasHug (Contributor) commented Jul 1, 2019
Suggested change:
- y = rng.binomial(1, p=0.1, size=n_samples)
+ y = rng.binomial(1, p=class_one_prop, size=n_samples)
NicolasHug (Contributor) commented Jul 1, 2019

Yeah checking stratification is always tricky because of the issue you mentioned. I'm OK with a fixed proportion.

johannfaouzi (Contributor, Author) commented Jul 2, 2019

I added more classes with a fixed distribution:

original_prop = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

and the test compares the original distribution with the distribution in the small training set:

unique, counts = np.unique(y_small, return_counts=True)
small_prop = {class_: count / 10000 for (class_, count) in zip(unique, counts)}
assert small_prop == original_prop

Edit: I updated the code to use pytest.approx() when comparing the dictionaries.
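Reconstructed end to end, the check described above could look like this (a sketch; the stratified resample call stands in for what _get_small_trainset does internally):

import numpy as np
import pytest
from sklearn.utils import resample

n_samples = 20000
original_prop = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

# build y with the exact class distribution above, then shuffle
y = np.concatenate([np.full(int(n_samples * prop), class_)
                    for class_, prop in original_prop.items()])
rng = np.random.RandomState(42)
rng.shuffle(y)

# stratified subsample of 10000 samples, as done for classifiers
y_small = resample(y, n_samples=10000, replace=False,
                   random_state=42, stratify=y)
unique, counts = np.unique(y_small, return_counts=True)
small_prop = {class_: count / 10000
              for class_, count in zip(unique, counts)}
assert small_prop == pytest.approx(original_prop)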

ogrisel approved these changes Jul 8, 2019
ogrisel merged commit db2342f into scikit-learn:master Jul 8, 2019

12 of 14 checks passed:
- ci/circleci: doc: CircleCI is running your tests
- ci/circleci: doc-min-dependencies: CircleCI is running your tests
- LGTM analysis: C/C++: No code changes detected
- LGTM analysis: JavaScript: No code changes detected
- LGTM analysis: Python: No new or fixed alerts
- ci/circleci: lint: Your tests passed on CircleCI!
- scikit-learn.scikit-learn: Build #20190708.4 succeeded
- scikit-learn.scikit-learn (Linux py35_conda_openblas): succeeded
- scikit-learn.scikit-learn (Linux py35_np_atlas): succeeded
- scikit-learn.scikit-learn (Linux pylatest_conda): succeeded
- scikit-learn.scikit-learn (Linux32 py35_np_atlas_32bit): succeeded
- scikit-learn.scikit-learn (Windows py35_32): succeeded
- scikit-learn.scikit-learn (Windows py37_64): succeeded
- scikit-learn.scikit-learn (macOS pylatest_conda): succeeded
ogrisel (Member) commented Jul 8, 2019

Thanks @johannfaouzi!

@johannfaouzi johannfaouzi deleted the johannfaouzi:use_resample_in_HistGBT branch Jul 8, 2019
koenvandevelde added a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019