GradientBoostingClassifier: sparse data and warmstarting do not work together #9991

Closed
mfeurer opened this issue Oct 24, 2017 · 8 comments

@mfeurer
Contributor

@mfeurer mfeurer commented Oct 24, 2017

Description

Warm starting and sparse data do not work together for gradient boosting.

Steps/Code to Reproduce

import numpy as np
import scipy.sparse
import sklearn.datasets
import sklearn.ensemble


def get_dataset(dataset='iris', make_sparse=False):
    iris = getattr(sklearn.datasets, "load_%s" % dataset)()
    X = iris.data.astype(np.float32)
    Y = iris.target
    rs = np.random.RandomState(42)
    indices = np.arange(X.shape[0])
    train_size = int(len(indices) / 3. * 2.)
    rs.shuffle(indices)
    X = X[indices]
    Y = Y[indices]
    X_train = X[:train_size]
    Y_train = Y[:train_size]
    X_test = X[train_size:]
    Y_test = Y[train_size:]

    if make_sparse:
        X_train[:, 0] = 0
        X_train[rs.random_sample(X_train.shape) > 0.5] = 0
        X_train = scipy.sparse.csc_matrix(X_train)
        X_train.eliminate_zeros()
        X_test[:, 0] = 0
        X_test[rs.random_sample(X_test.shape) > 0.5] = 0
        X_test = scipy.sparse.csc_matrix(X_test)
        X_test.eliminate_zeros()

    return X_train, Y_train, X_test, Y_test


X_train, Y_train, _, _ = get_dataset(dataset='iris')
print(type(X_train))

classifier = sklearn.ensemble.GradientBoostingClassifier(warm_start=True)
classifier.fit(X_train, Y_train)
classifier.n_estimators += 1
classifier.fit(X_train, Y_train)
print('Fitted dense data', flush=True)

X_train, Y_train, _, _ = get_dataset(dataset='iris', make_sparse=True)
print(type(X_train))

classifier = sklearn.ensemble.GradientBoostingClassifier(warm_start=True)
classifier.fit(X_train, Y_train)
classifier.n_estimators += 1
classifier.fit(X_train, Y_train)
print('Fitted sparse data', flush=True)

Expected Results

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>
Fitted sparse data

Actual Results

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>

Versions

Linux-4.4.0-96-generic-x86_64-with-debian-stretch-sid
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.13.3
SciPy 0.19.1
Scikit-Learn 0.19.1

Update

The same issue happens for regression:

import numpy as np
import scipy.sparse
import sklearn.datasets
import sklearn.ensemble


def get_dataset(dataset='boston', make_sparse=False):
    data = getattr(sklearn.datasets, "load_%s" % dataset)()
    X = data.data.astype(np.float32)
    Y = data.target
    rs = np.random.RandomState(42)
    indices = np.arange(X.shape[0])
    train_size = int(len(indices) / 3. * 2.)
    rs.shuffle(indices)
    X = X[indices]
    Y = Y[indices]
    X_train = X[:train_size]
    Y_train = Y[:train_size]
    X_test = X[train_size:]
    Y_test = Y[train_size:]

    if make_sparse:
        X_train[:, 0] = 0
        X_train[rs.random_sample(X_train.shape) > 0.5] = 0
        X_train = scipy.sparse.csc_matrix(X_train)
        X_train.eliminate_zeros()
        X_test[:, 0] = 0
        X_test[rs.random_sample(X_test.shape) > 0.5] = 0
        X_test = scipy.sparse.csc_matrix(X_test)
        X_test.eliminate_zeros()

    return X_train, Y_train, X_test, Y_test


X_train, Y_train, _, _ = get_dataset(dataset='boston')
print(type(X_train))

regressor = sklearn.ensemble.GradientBoostingRegressor(warm_start=True)
regressor.fit(X_train, Y_train)
regressor.n_estimators += 1
regressor.fit(X_train, Y_train)
print('Fitted dense data', flush=True)

X_train, Y_train, _, _ = get_dataset(dataset='boston', make_sparse=True)
print(type(X_train))

regressor = sklearn.ensemble.GradientBoostingRegressor(warm_start=True)
regressor.fit(X_train, Y_train)
regressor.n_estimators += 1
regressor.fit(X_train, Y_train)
print('Fitted sparse data', flush=True)
@mfeurer mfeurer closed this Oct 24, 2017
@mfeurer mfeurer reopened this Oct 24, 2017
@jnothman
Member

@jnothman jnothman commented Oct 24, 2017

What's the difference between your expected and actual results?

@mfeurer
Contributor Author

@mfeurer mfeurer commented Oct 24, 2017

My bad, there was none in the snippet I pasted here. I fixed the snippet. That's what the script results in:

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>

vs what I expect:

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>
Fitted sparse data
@jnothman
Member

@jnothman jnothman commented Oct 24, 2017

So there's a silent crash??

@amueller
Member

@amueller amueller commented Oct 24, 2017

yes!? (could reproduce)

@amueller amueller added the Bug label Oct 24, 2017
@amueller amueller added this to the 0.20 milestone Oct 24, 2017
@mfeurer
Contributor Author

@mfeurer mfeurer commented Oct 25, 2017

@jnothman please excuse my description missing THE relevant information. Here's the full output:

<class 'numpy.ndarray'>
Fitted dense data
<class 'scipy.sparse.csc.csc_matrix'>

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
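
Since the interpreter dies without a Python traceback, one way to make the crash visible is to run the reproduction in a child process and inspect how it ended. The following is only a minimal sketch using the standard library; the helper name `run_isolated` is made up for illustration, and the snippet passed to it is a placeholder, not the script above. It relies on two documented behaviours: `-X faulthandler` makes CPython dump a traceback to stderr on fatal signals, and on POSIX `subprocess` reports a child killed by signal N as return code -N (the 139 shown above is the shell convention 128 + 11).

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run `code` in a fresh interpreter and report how it ended.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # -X faulthandler asks CPython to dump a traceback on fatal signals.
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means "killed by that signal".
        status = "killed by " + signal.Signals(-proc.returncode).name
    else:
        status = "exit code %d" % proc.returncode
    return status, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Placeholder snippet; substitute the reproduction script from this issue.
    status, out, err = run_isolated("print('Fitted sparse data')")
    print(status)
    print(out, end="")
    print(err, end="", file=sys.stderr)
```

If the child is killed by a segmentation fault, the status should read `killed by SIGSEGV`, and stderr should contain the faulthandler dump pointing at the Python line that was executing.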
@jnothman
Member

@jnothman jnothman commented Oct 26, 2017

@srajanpaliwal
Contributor

@srajanpaliwal srajanpaliwal commented Oct 26, 2017

Hi,
I will try to solve this.
Thanks

@jnothman
Member

@jnothman jnothman commented Oct 27, 2017
