
[MRG] Use resample to compute the small training set in HistGBT #14194

Merged

merged 12 commits into scikit-learn:master from johannfaouzi:use_resample_in_HistGBT on Jul 8, 2019

Conversation


@johannfaouzi johannfaouzi commented Jun 26, 2019

What does this implement/fix? Explain your changes.

This small PR replaces the use of numpy.random.choice with sklearn.utils.resample in order to compute the small training set in Histogram Gradient Boosting Trees.
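The substitution can be sketched as follows (an illustrative comparison, not the exact diff; `n_total`, `subsample_size`, and `seed` are placeholder names):

```python
import numpy as np
from sklearn.utils import resample

n_total, subsample_size, seed = 20000, 10000, 42

# Before: draw subsample indices without replacement via NumPy
rng = np.random.RandomState(seed)
indices_before = rng.choice(n_total, size=subsample_size, replace=False)

# After: the same kind of draw via sklearn.utils.resample, which
# additionally supports stratified subsampling via its `stratify` parameter
indices_after = resample(np.arange(n_total), n_samples=subsample_size,
                         replace=False, random_state=seed)

assert len(indices_after) == subsample_size
assert len(np.unique(indices_after)) == subsample_size  # no duplicates
```

Both calls draw 10000 distinct indices; the practical gain from `resample` is the optional stratification.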

@johannfaouzi johannfaouzi changed the title Use resample to compute the small training set in HistGBT [MRG] Use resample to compute the small training set in HistGBT Jun 26, 2019
```python
return X_binned_small_train, y_small_train
indices = np.arange(X_binned_train.shape[0])
indices = resample(indices, n_samples=subsample_size,
                   replace=False, random_state=seed)
```

@ogrisel ogrisel Jul 1, 2019


If self is a classifier (if is_classifier(self)) and y_train is a set of integer class labels, y_train should be passed to the stratify kwarg of resample.
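A minimal sketch of the suggested stratified call (illustrative data; it relies on the `stratify` parameter of `sklearn.utils.resample`):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
n_total, subsample_size = 20000, 10000
y_train = (rng.rand(n_total) > 0.9).astype(np.int32)  # imbalanced labels

indices = np.arange(n_total)
# Stratified draw: per-class proportions in the subsample follow y_train
indices = resample(indices, n_samples=subsample_size, replace=False,
                   random_state=0, stratify=y_train)
y_small = y_train[indices]

# The subsample balance matches the full set up to per-class rounding
assert abs(y_small.mean() - y_train.mean()) < 1e-3
```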


@johannfaouzi johannfaouzi Jul 1, 2019


Good catch!

I think that y_train is always encoded using LabelEncoder() when self is a classifier, so there's no need to check that y_train is a set of integer class labels:

```python
def _encode_y(self, y):
    # encode classes into 0 ... n_classes - 1 and sets attributes classes_
    # and n_trees_per_iteration_
    check_classification_targets(y)
    label_encoder = LabelEncoder()
    encoded_y = label_encoder.fit_transform(y)
    self.classes_ = label_encoder.classes_
    n_classes = self.classes_.shape[0]
    # only 1 tree for binary classification. For multiclass classification,
    # we build 1 tree per class.
    self.n_trees_per_iteration_ = 1 if n_classes <= 2 else n_classes
    encoded_y = encoded_y.astype(Y_DTYPE, copy=False)
    return encoded_y
```


@ogrisel ogrisel commented Jul 1, 2019

The codecov failure reveals that this branch of the if condition is never tested. I think we need a smoke test with the smallest dataset that can trigger this branch (e.g. make_classification(n_samples=10001, n_features=2, n_informative=2, n_redundant=0) and make_regression(n_samples=10001, n_features=2, n_informative=2)).

I am not sure if there is an easy way to check that stratification is enabled on the small training set. If you don't see any, don't bother. Just a smoke test to ensure that this code is covered is enough.

```python
def test_stratification_small_trainset():
    # Make sure that the small trainset is stratified
    X, y = make_classification(n_samples=20000, n_features=2,
                               n_informative=2, n_redundant=0)
```

@ogrisel ogrisel Jul 1, 2019


y is balanced by default. I think you should update the test to use an imbalanced y.

```python
n_samples = 20000
rng = np.random.RandomState(42)
X = rng.randn(n_samples).reshape(-1, 1)
y = (rng.rand(n_samples) > 0.9).astype(np.int32)
```


@johannfaouzi johannfaouzi Jul 1, 2019


Is it really important? If stratify=None, only the expected value of the mean of y_train will be equal to the mean of y (according to the law of large numbers), whereas if stratify=y, the mean of y_train will be equal to the mean of y (excluding rounding approximation -- that's why I set decimal=3 in the test).

I don't think that changing the balance of y will change the tested behavior, because it is independent of the balance of y imo.


@johannfaouzi johannfaouzi commented Jul 1, 2019

I made an imbalanced dataset with a deterministic ratio (10% for the minority class), so that we can test that the balance is strictly equal to the expected balance (instead of using an approximation because of rounding issues from a finite sample).

```python
class_one_prop = 0.1
rng = np.random.RandomState(42)
X = rng.randn(n_samples).reshape(n_samples, 1)
y = np.asarray(
```

@NicolasHug NicolasHug Jul 1, 2019


use np.random.binomial?


@NicolasHug NicolasHug Jul 1, 2019


Mainly I don't like that it's not shuffled. If you really want an exact proportion (though I doubt it's really useful) maybe use something like

```python
y = np.zeros(...)
y[:n_samples * class_one_prop] = 1
shuffle(y)
```

```python
def test_small_trainset(GradientBoosting, data):
    # Make sure that a small trainset has the expected length (10k samples)
    X, y = data
    gb = GradientBoosting(random_state=42)
```

@NicolasHug NicolasHug Jul 1, 2019


I think you can merge this test into the other one, we don't really need to test both classes. The classifier is enough (others might disagree?).

also nit: do you really need to set the random state of the estimator?


@johannfaouzi johannfaouzi Jul 1, 2019


Indeed, the random state of the estimator is not needed (I don't think that it is used, the dataset is small enough that the binning process will use the whole dataset).

```python
)
gb = HistGradientBoostingClassifier(random_state=42)
X_small_train, y_small_train = gb._get_small_trainset(X, y, seed=42)
np.testing.assert_equal(y_small_train.mean(), class_one_prop)
```

@NicolasHug NicolasHug Jul 1, 2019


use

```python
assert y_small_train.mean() == pytest.approx(class_one_prop)
```


@NicolasHug NicolasHug commented Jul 1, 2019

> we can test that the balance is strictly equal to the expected balance

There is still randomness in the subsampling though right?


@johannfaouzi johannfaouzi commented Jul 1, 2019

> we can test that the balance is strictly equal to the expected balance
>
> There is still randomness in the subsampling though right?

Yes! Which samples end up in the small training set is random, but not the balance of the classes.
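That distinction can be checked directly (a sketch using a fixed 10% class-one proportion, as in the test under discussion; exact equality holds because the per-class counts divide evenly):

```python
import numpy as np
from sklearn.utils import resample, shuffle

n_samples, class_one_prop = 20000, 0.1
y = np.zeros(n_samples, dtype=np.int32)
y[:int(n_samples * class_one_prop)] = 1   # exactly 2000 ones
y = shuffle(y, random_state=0)

idx_a = resample(np.arange(n_samples), n_samples=10000, replace=False,
                 random_state=0, stratify=y)
idx_b = resample(np.arange(n_samples), n_samples=10000, replace=False,
                 random_state=1, stratify=y)

# Different seeds pick different samples...
assert not np.array_equal(np.sort(idx_a), np.sort(idx_b))
# ...but the class balance of the subsample is deterministic
assert y[idx_a].mean() == class_one_prop
assert y[idx_b].mean() == class_one_prop
```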


@johannfaouzi johannfaouzi commented Jul 1, 2019

Imho one issue with using a random proportion for y instead of a fixed one is that the law of large numbers implies that the expected value of the proportion of y_train is equal to the proportion of y. Since the number of samples is quite large, even if stratify=None, the proportion of y_train will probably be close to the proportion of y.

To tackle this issue, having a fixed proportion with a fixed number of samples makes it possible to check a strict equality, instead of an approximation, which makes the test better.

Edit: I use np.random.binomial right now, but I think it would be better with a fixed proportion. If you agree, I will change it and use your code:

```python
y = np.zeros(...)
y[:int(n_samples * class_one_prop)] = 1
shuffle(y)
```


@NicolasHug NicolasHug left a comment

Thanks @johannfaouzi !

```python
class_one_prop = 0.1
rng = np.random.RandomState(42)
X = rng.randn(n_samples).reshape(n_samples, 1)
y = rng.binomial(1, p=0.1, size=n_samples)
```

@NicolasHug NicolasHug Jul 1, 2019


Suggested change

```diff
-y = rng.binomial(1, p=0.1, size=n_samples)
+y = rng.binomial(1, p=class_one_prop, size=n_samples)
```


@NicolasHug NicolasHug commented Jul 1, 2019

Yeah checking stratification is always tricky because of the issue you mentioned. I'm OK with a fixed proportion.


@johannfaouzi johannfaouzi commented Jul 2, 2019

I added more classes with a fixed distribution:

```python
original_prop = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
```

and the test compares the original distribution with the distribution in the small training set:

```python
unique, counts = np.unique(y_small, return_counts=True)
small_prop = {class_: count / 10000 for (class_, count) in zip(unique, counts)}
assert small_prop == original_prop
```

Edit: I updated the code to use pytest.approx() when comparing the dictionaries.

ogrisel approved these changes Jul 8, 2019
@ogrisel ogrisel merged commit db2342f into scikit-learn:master Jul 8, 2019
12 of 14 checks passed

@ogrisel ogrisel commented Jul 8, 2019

Thanks @johannfaouzi !

@johannfaouzi johannfaouzi deleted the use_resample_in_HistGBT branch Jul 8, 2019