[MRG+2] LogisticRegression convert to float64 (newton-cg) #8835

Merged
merged 26 commits on Jun 7, 2017
Changes from 19 commits
Commits (26)
520a2b3
Add a test to ensure not changing the input's data type
massich May 1, 2017
0d332bf
[WIP] Force X to remain float32. (self.coef_ remains float64 even if …
massich May 2, 2017
9ec920b
[WIP] ensure self.coef_ same type as X
massich May 5, 2017
d3cf91c
keep the np.float32 when multi_class='multinomial'
massich May 5, 2017
cb56481
Avoid hardcoded type for multinomial
massich May 5, 2017
14450a2
pass flake8
massich May 19, 2017
7adbb24
Ensure that the results in 32bits are the same as in 64
massich May 19, 2017
c0f6930
Address Gael's comments for multi_class=='ovr'
massich May 30, 2017
e6a3c23
Add multi_class=='multinominal' to test
massich May 30, 2017
4ac33e8
Add support for multi_class=='multinominal'
massich May 30, 2017
2f194b6
prefer float64 to float32
massich May 30, 2017
8ea7682
Force X and y to have the same type
massich May 30, 2017
75dc160
Revert "Add support for multi_class=='multinominal'"
massich May 30, 2017
03b6a57
remvert more stuff
massich May 30, 2017
18a5ac3
clean up some commmented code
massich May 30, 2017
05b750e
allow class_weight to take advantage of float32
massich May 30, 2017
4bef1da
Add a test where X.dtype is different of y.dtype
massich May 31, 2017
f63ef77
Address @raghavrv comments
massich Jun 2, 2017
1a29e0b
address the rest of @raghavrv's comments
massich Jun 2, 2017
366f751
Revert class_weight
massich Jun 2, 2017
093f25a
Avoid copying if dtype matches
massich Jun 2, 2017
fb545de
Address alex comment to the cast from inside _multinomial_loss_grad
massich Jun 6, 2017
15d5079
address alex comment
massich Jun 6, 2017
057e2e0
add sparsity test
Jun 6, 2017
0a07a2e
Merge pull request #3 from Henley13/is/8769
massich Jun 6, 2017
b80cd39
Addressed Tom comment of checking that we keep the 64 aswell
massich Jun 6, 2017
21 changes: 14 additions & 7 deletions sklearn/linear_model/logistic.py
@@ -337,10 +337,12 @@ def _multinomial_loss_grad(w, X, Y, alpha, sample_weight):
n_classes = Y.shape[1]
n_features = X.shape[1]
fit_intercept = (w.size == n_classes * (n_features + 1))
- grad = np.zeros((n_classes, n_features + bool(fit_intercept)))
+ grad = np.zeros((n_classes, n_features + bool(fit_intercept)),
+                 dtype=X.dtype)
loss, p, w = _multinomial_loss(w, X, Y, alpha, sample_weight)

Member:
Doesn't _multinomial_loss return float32 if the dtype of X and y is float32? That would avoid the diff = diff.astype(X.dtype, copy=False) below.

Contributor Author:
Actually, _multinomial_loss does keep all the types as float32. The problem is that Y in diff = sample_weight * (p - Y) (here) is of type int64, and therefore diff becomes float64.

Y is set (here) as the second element of args, based on target, which is in turn set from y_bin or Y_multi. The former is fine, while the latter is determined by the transform() of LabelBinarizer or LabelEncoder (here, here).

We could transform target to X.dtype as in the saga case (here), in the following manner:

target = target.astype(X.dtype)
args = (X, target, 1. / C, sample_weight)

or we could propagate the change inside fit_transform.

Any thoughts, @agramfort?

cc: @Henley13, @GaelVaroquaux, @raghavrv
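
For reference, a minimal NumPy sketch of the promotion described above (illustrative shapes and values, not part of the diff): an integer-typed Y pulls the float32 arithmetic up to float64, while casting it to X.dtype first keeps everything in float32.

import numpy as np

p = np.random.rand(5, 3).astype(np.float32)        # float32 probabilities
Y = np.eye(3, dtype=np.int64)[[0, 2, 1, 0, 2]]     # one-hot labels, int64
sample_weight = np.ones((5, 1), dtype=np.float32)

diff = sample_weight * (p - Y)
print(diff.dtype)                                   # float64: the int64 Y promotes the result

diff32 = sample_weight * (p - Y.astype(np.float32))
print(diff32.dtype)                                 # float32: promotion avoided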

Member:
I would convert Y to the dtype at the top.

Member:
> We could transform target to X.dtype as in the saga case (here), in the following manner:
>
> target = target.astype(X.dtype)
> args = (X, target, 1. / C, sample_weight)

+1

Contributor Author:
@agramfort, y is already in the correct form. Are you proposing:

Y_multi = le.fit_transform(y).astype(X.dtype, copy=False)

Member:
yes.

sample_weight = sample_weight[:, np.newaxis]
diff = sample_weight * (p - Y)
+ diff = diff.astype(X.dtype)

Member:
This line introduces a memory copy. I think that we should do it only if X.dtype != diff.dtype.

Member:
Yeah, we should set copy=False to avoid copying if it is already of the same dtype...
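
A small NumPy illustration of the copy=False suggestion (not part of the diff): astype(dtype, copy=False) returns the array itself when the dtype already matches, so the extra allocation only happens when a conversion is actually needed.

import numpy as np

diff = np.random.rand(4, 2).astype(np.float32)

same = diff.astype(np.float32, copy=False)       # dtype already matches: no copy is made
assert same is diff

converted = diff.astype(np.float64, copy=False)  # dtype differs: a copy is unavoidable
assert converted is not diff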

grad[:, :n_features] = safe_sparse_dot(diff.T, X)
grad[:, :n_features] += alpha * w
if fit_intercept:
@@ -608,10 +610,10 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
# and check length
# Otherwise set them to 1 for all examples
if sample_weight is not None:
- sample_weight = np.array(sample_weight, dtype=np.float64, order='C')
+ sample_weight = np.array(sample_weight, dtype=X.dtype, order='C')

Member:
Should it be y.dtype?

cc: @agramfort

Contributor Author:
We discussed with @glemaitre (and @GaelVaroquaux) forcing X.dtype and y.dtype to be the same.

Member:
Yes, I think the idea should be that the dtype of X conditions the dtype of the computation.

There should be an RFC about this, and it should be included in the docs.

Contributor Author:
see #8976

check_consistent_length(y, sample_weight)
else:
- sample_weight = np.ones(X.shape[0])
+ sample_weight = np.ones(X.shape[0], dtype=X.dtype)

# If class_weights is a dict (provided by the user), the weights
# are assigned to the original labels. If it is "balanced", then
@@ -624,10 +626,10 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
# For doing a ovr, we need to mask the labels first. for the
# multinomial case this is not necessary.
if multi_class == 'ovr':
- w0 = np.zeros(n_features + int(fit_intercept))
+ w0 = np.zeros(n_features + int(fit_intercept), dtype=X.dtype)
mask_classes = np.array([-1, 1])
mask = (y == pos_class)
- y_bin = np.ones(y.shape, dtype=np.float64)
+ y_bin = np.ones(y.shape, dtype=X.dtype)
y_bin[~mask] = -1.
# for compute_class_weight

@@ -648,7 +650,7 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
Y_multi = le.fit_transform(y)

w0 = np.zeros((classes.size, n_features + int(fit_intercept)),
-                 order='F')
+                 order='F', dtype=X.dtype)

if coef is not None:
# it must work both giving the bias term and not
@@ -1203,7 +1205,12 @@ def fit(self, X, y, sample_weight=None):
raise ValueError("Tolerance for stopping criteria must be "
"positive; got (tol=%r)" % self.tol)

- X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64,
+ if self.solver in ['newton-cg']:
+     _dtype = [np.float64, np.float32]

Member:
Sorry if I am missing something, but why?

Contributor Author:
The idea is that previously check_X_y converted X and y to np.float64. That is fine if the user passes a list as X, but if a user deliberately passes np.float32 data, converting it to np.float64 penalizes them in both memory and speed.

Therefore, we are trying to keep the data in np.float32 when the user provides it in that dtype.
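
A short sketch of that behaviour (hypothetical data, not part of the diff; it relies on check_X_y accepting a list of dtypes and converting only when the input dtype is not already in the list):

import numpy as np
from sklearn.utils import check_X_y

X32 = np.random.rand(20, 3).astype(np.float32)
y = np.array([0, 1] * 10)

# a float32 array is in the accepted list, so its dtype is preserved
X_checked, y_checked = check_X_y(X32, y, dtype=[np.float64, np.float32])
print(X_checked.dtype)   # float32

# a plain Python list has no dtype, so it is converted to the first entry, float64
X_checked, y_checked = check_X_y(X32.tolist(), y, dtype=[np.float64, np.float32])
print(X_checked.dtype)   # float64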

Member:
The fact that @raghavrv asked a question suggests that a short comment explaining the logic would be useful here.

Contributor Author:
I think @raghavrv was more concerned about the fact that we pass a list of dtypes rather than forcing one or the other. Once we checked that check_X_y takes care of it, he was OK with it.

@raghavrv, any comments?

+ else:
+     _dtype = np.float64
+
+ X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype,
order="C")
check_classification_targets(y)
self.classes_ = np.unique(y)
22 changes: 22 additions & 0 deletions sklearn/linear_model/tests/test_logistic.py
@@ -1136,3 +1136,25 @@ def test_saga_vs_liblinear():
liblinear.fit(X, y)
# Convergence for alpha=1e-3 is very slow
assert_array_almost_equal(saga.coef_, liblinear.coef_, 3)


def test_dtype_match():
# Test that np.float32 input data is not cast to np.float64 when possible

X_32 = np.array(X).astype(np.float32)
y_32 = np.array(Y1).astype(np.float32)
X_64 = np.array(X).astype(np.float64)
y_64 = np.array(Y1).astype(np.float64)

for solver in ['newton-cg']:

Member:
Not all solvers are safe with the cast to float64? If so, is it documented somewhere?

Contributor Author:
Not all the solvers are safe with a cast to float32. The idea is to have X_64 and X_32 so that adding a solver that is not safe breaks the test and lets us track the failure.

Member:
Do one solver at a time, but I think it's doable for all solvers.

for multi_class in ['ovr', 'multinomial']:

Member:
Can you remove this new line?

# Check type consistency
lr_32 = LogisticRegression(solver=solver, multi_class=multi_class)
lr_32.fit(X_32, y_32)
assert_equal(lr_32.coef_.dtype, X_32.dtype)

# Check accuracy consistency
lr_64 = LogisticRegression(solver=solver, multi_class=multi_class)
lr_64.fit(X_64, y_64)

Member:
Please add:

assert_equal(lr_64.coef_.dtype, X_64.dtype)

Otherwise this test passes even when we transform everything to 32 bits.

assert_almost_equal(lr_32.coef_, lr_64.coef_.astype(np.float32))
8 changes: 5 additions & 3 deletions sklearn/utils/class_weight.py
@@ -41,12 +41,14 @@ def compute_class_weight(class_weight, classes, y):
# Import error caused by circular imports.
from ..preprocessing import LabelEncoder

+ _dtype = y.dtype

if set(y) - set(classes):
raise ValueError("classes should include all valid labels that can "
"be in y")
if class_weight is None or len(class_weight) == 0:
# uniform class weights
- weight = np.ones(classes.shape[0], dtype=np.float64, order='C')
+ weight = np.ones(classes.shape[0], dtype=_dtype, order='C')

Member:
Hmm, isn't there a risk here of casting to integer dtypes, which could later create numerical errors?
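
A toy example of the concern (simplified, not the actual compute_class_weight code): if y holds integer labels, _dtype = y.dtype is an integer type, and fractional class weights written into an integer-typed array are silently truncated.

import numpy as np

y = np.array([0, 0, 1, 1, 1])                  # integer labels, so y.dtype is an int type
classes = np.array([0, 1])
class_weight = {0: 0.5, 1: 1.5}

weight = np.ones(classes.shape[0], dtype=y.dtype, order='C')
for i, c in enumerate(classes):
    weight[i] = class_weight[c]                # 0.5 and 1.5 are truncated to 0 and 1

print(weight)                                   # [0 1] -- the fractional weights are lost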

Contributor Author:
Actually, class_weight doesn't need to be modified. I thought I had fixed it.

Member:
I think I disagree. You will have an operation between sample_weight and class_weight, for instance there. Because the multiplication is done in place, I agree that sample_weight will end up with the expected type. However, class_weight should not always be cast to np.float64 when sample_weight is np.float32: that implies an up-conversion whenever the two arrays are combined.

What I mean is something like that:

x = np.random.random((1000000, 1)).astype(np.float32)
y = np.random.random((1000000, 1)).astype(np.float32)
%timeit x * y
# 807 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

x = np.random.random((1000000, 1)).astype(np.float32)
y = np.random.random((1000000, 1)).astype(np.float64)
%timeit x * y
# 1.83 ms ± 8.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Does it make sense?

elif class_weight == 'balanced':
# Find the weight of each class as present in y.
le = LabelEncoder()
@@ -55,11 +57,11 @@ def compute_class_weight(class_weight, classes, y):
raise ValueError("classes should have valid labels that are in y")

recip_freq = len(y) / (len(le.classes_) *
-                       bincount(y_ind).astype(np.float64))
+                       bincount(y_ind).astype(_dtype))
weight = recip_freq[le.transform(classes)]
else:
# user-defined dictionary
- weight = np.ones(classes.shape[0], dtype=np.float64, order='C')
+ weight = np.ones(classes.shape[0], dtype=_dtype, order='C')
if not isinstance(class_weight, dict):
raise ValueError("class_weight must be dict, 'balanced', or None,"
" got: %r" % class_weight)