GradientBoostingClassifier: sparse data and warmstarting do not work together #9991

Closed
mfeurer opened this Issue Oct 24, 2017 · 8 comments

@mfeurer
Contributor

mfeurer commented Oct 24, 2017

Description

Warm starting and sparse data do not work together for gradient boosting.

Steps/Code to Reproduce

import numpy as np
import scipy.sparse
import sklearn.datasets
import sklearn.ensemble


def get_dataset(dataset='iris', make_sparse=False):
    iris = getattr(sklearn.datasets, "load_%s" % dataset)()
    X = iris.data.astype(np.float32)
    Y = iris.target
    rs = np.random.RandomState(42)
    indices = np.arange(X.shape[0])
    train_size = int(len(indices) / 3. * 2.)
    rs.shuffle(indices)
    X = X[indices]
    Y = Y[indices]
    X_train = X[:train_size]
    Y_train = Y[:train_size]
    X_test = X[train_size:]
    Y_test = Y[train_size:]

    if make_sparse:
        X_train[:, 0] = 0
        X_train[rs.random_sample(X_train.shape) > 0.5] = 0
        X_train = scipy.sparse.csc_matrix(X_train)
        X_train.eliminate_zeros()
        X_test[:, 0] = 0
        X_test[rs.random_sample(X_test.shape) > 0.5] = 0
        X_test = scipy.sparse.csc_matrix(X_test)
        X_test.eliminate_zeros()

    return X_train, Y_train, X_test, Y_test


X_train, Y_train, _, _ = get_dataset(dataset='iris')
print(type(X_train))

classifier = sklearn.ensemble.GradientBoostingClassifier(warm_start=True)
classifier.fit(X_train, Y_train)
classifier.n_estimators += 1
classifier.fit(X_train, Y_train)
print('Fitted dense data', flush=True)

X_train, Y_train, _, _ = get_dataset(dataset='iris', make_sparse=True)
print(type(X_train))

classifier = sklearn.ensemble.GradientBoostingClassifier(warm_start=True)
classifier.fit(X_train, Y_train)
classifier.n_estimators += 1
classifier.fit(X_train, Y_train)
print('Fitted sparse data', flush=True)
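
The pattern in the script above (fit, increase n_estimators, fit again) relies on the warm_start contract: the second fit is meant to keep the stages that were already trained and only train the additional one. A toy sketch of that contract in plain Python follows. The class `ToyWarmStartEnsemble` is made up for illustration, is not scikit-learn code, and does not train anything; it only mimics the bookkeeping.

```python
class ToyWarmStartEnsemble:
    """Hypothetical stand-in that mimics only the warm_start bookkeeping."""

    def __init__(self, n_estimators=3, warm_start=False):
        self.n_estimators = n_estimators
        self.warm_start = warm_start
        self.estimators_ = []

    def fit(self, X, y):
        # Without warm_start every fit starts from scratch.
        if not self.warm_start:
            self.estimators_ = []
        n_existing = len(self.estimators_)
        if self.n_estimators < n_existing:
            raise ValueError("n_estimators must not shrink when warm starting")
        # Only the missing stages are "trained"; earlier ones are kept as-is.
        for stage in range(n_existing, self.n_estimators):
            self.estimators_.append(("stage", stage))
        return self


model = ToyWarmStartEnsemble(n_estimators=3, warm_start=True)
model.fit(None, None)
stages_after_first_fit = list(model.estimators_)

model.n_estimators += 1
model.fit(None, None)

# The first three stages are untouched and exactly one stage was added.
print(len(model.estimators_))
print(model.estimators_[:3] == stages_after_first_fit)
```

In the real estimator the second fit is the call that has to handle the sparse input again, which is where the script above stops producing output.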

Expected Results

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>
Fitted sparse data

Actual Results

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>

Versions

Linux-4.4.0-96-generic-x86_64-with-debian-stretch-sid
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.13.3
SciPy 0.19.1
Scikit-Learn 0.19.1

Update

The same issue happens for regression:

import numpy as np
import scipy.sparse
import sklearn.datasets
import sklearn.ensemble


def get_dataset(dataset='boston', make_sparse=False):
    iris = getattr(sklearn.datasets, "load_%s" % dataset)()
    X = iris.data.astype(np.float32)
    Y = iris.target
    rs = np.random.RandomState(42)
    indices = np.arange(X.shape[0])
    train_size = int(len(indices) / 3. * 2.)
    rs.shuffle(indices)
    X = X[indices]
    Y = Y[indices]
    X_train = X[:train_size]
    Y_train = Y[:train_size]
    X_test = X[train_size:]
    Y_test = Y[train_size:]

    if make_sparse:
        X_train[:, 0] = 0
        X_train[rs.random_sample(X_train.shape) > 0.5] = 0
        X_train = scipy.sparse.csc_matrix(X_train)
        X_train.eliminate_zeros()
        X_test[:, 0] = 0
        X_test[rs.random_sample(X_test.shape) > 0.5] = 0
        X_test = scipy.sparse.csc_matrix(X_test)
        X_test.eliminate_zeros()

    return X_train, Y_train, X_test, Y_test


X_train, Y_train, _, _ = get_dataset(dataset='boston')
print(type(X_train))

classifier = sklearn.ensemble.GradientBoostingRegressor(warm_start=True)
classifier.fit(X_train, Y_train)
classifier.n_estimators += 1
classifier.fit(X_train, Y_train)
print('Fitted dense data', flush=True)

X_train, Y_train, _, _ = get_dataset(dataset='boston', make_sparse=True)
print(type(X_train))

classifier = sklearn.ensemble.GradientBoostingRegressor(warm_start=True)
classifier.fit(X_train, Y_train)
classifier.n_estimators += 1
classifier.fit(X_train, Y_train)
print('Fitted sparse data', flush=True)

@mfeurer mfeurer closed this Oct 24, 2017

@mfeurer mfeurer reopened this Oct 24, 2017

Member

jnothman commented Oct 24, 2017

What's the difference between your expected and actual results?

Contributor

mfeurer commented Oct 24, 2017

My bad, the snippet I pasted here showed no difference between the two. I have fixed the snippet. This is what the script prints:

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>

vs what I expect:

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>
Fitted sparse data

Member

jnothman commented Oct 24, 2017

So there's a silent crash??

Member

amueller commented Oct 24, 2017

yes!? (could reproduce)

@amueller amueller added the Bug label Oct 24, 2017

@amueller amueller added this to the 0.20 milestone Oct 24, 2017

Contributor

mfeurer commented Oct 25, 2017

@jnothman please excuse my description missing THE relevant information. Here's the full output:

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
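
For anyone reading the exit status above: 139 is 128 + 11, the convention shells and IDEs use for a process killed by signal 11 (SIGSEGV). Below is a minimal sketch of decoding such a status, plus a way to re-run a script with faulthandler enabled so that the Python-level traceback is dumped before the process dies. The helper names `describe_exit` and `run_script` are made up for illustration and are not part of scikit-learn.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a process exit status into a readable string.

    Shells and IDEs report 128 + N for a process killed by signal N,
    while the subprocess module reports -N. Both are handled here.
    """
    if returncode < 0:
        signum = -returncode
    elif returncode > 128:
        signum = returncode - 128
    else:
        return "exited with code %d" % returncode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return "killed by signal %d (%s)" % (signum, name)


def run_script(path):
    """Run a Python script in a child process with faulthandler enabled."""
    # -X faulthandler makes the interpreter dump the Python traceback
    # to stderr when it receives a fatal signal such as SIGSEGV.
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return describe_exit(proc.returncode), proc.stdout, proc.stderr


print(describe_exit(139))  # killed by signal 11 (SIGSEGV)
```

Calling `run_script` on the reproduction script (saved under any file name) should return the decoded status together with whatever faulthandler wrote to stderr, which makes the crash visible instead of silent.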
Member

jnothman commented Oct 26, 2017

Contributor

srajanpaliwal commented Oct 26, 2017

Hi,
I will try to solve this.
Thanks

Member

jnothman commented Oct 27, 2017
