LIBLINEAR models' raw_coef_ variable takes up extra space when saving models #3413

Closed
mheilman opened this Issue · 5 comments

4 participants

@mheilman

It appears that the raw_coef_ attribute on LIBLINEAR-backed models is only used during training, to set the final model parameters (coef_ and intercept_), so it doesn't seem to need to be kept around as an instance attribute. Keeping it effectively causes the parameters to be saved twice when the model is written to disk with joblib (see the code below).

I noticed this after removing columns of zeros from a logistic regression fit with an L1 penalty. Even though I'd removed all but 5% of the features, the model on disk still ended up at about 50% of its original size, because raw_coef_ was still large.

See https://github.com/scikit-learn/scikit-learn/blob/9c51bc954718146cb1108f1d8c0a7483d7d6da8d/sklearn/svm/base.py#L697.
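
For reference, the split there looks roughly like this (a simplified, hypothetical rendering of the linked code, not a verbatim copy):

import numpy as np

def split_raw_coef(raw_coef, fit_intercept, intercept_scaling=1.0):
    """Hypothetical helper mirroring what the liblinear wrapper does:
    when an intercept is fit, the last column of raw_coef_ holds the
    (scaled) intercept, and coef_ duplicates the remaining columns."""
    if fit_intercept:
        coef = raw_coef[:, :-1]
        intercept = intercept_scaling * raw_coef[:, -1]
    else:
        coef = raw_coef
        intercept = np.zeros(raw_coef.shape[0])
    return coef, intercept

So coef_ and intercept_ together contain everything raw_coef_ does, which is why pickling both stores the weights twice.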

import numpy as np
from sklearn.externals import joblib
from sklearn import linear_model, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, Y)
joblib.dump(logreg, 'original_model.pkl')

# Dropping raw_coef_ leaves coef_ and intercept_ intact but keeps the
# learned weights from being pickled a second time.
logreg.raw_coef_ = None
joblib.dump(logreg, 'smaller_model.pkl')
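
A quick way to see the difference on disk (note that joblib may write companion .npy files next to the main pickle for large arrays, so it is safest to total everything that matches each stem):

import glob
import os

# Total the main pickle plus any companion .npy files joblib may have written.
for stem in ('original_model', 'smaller_model'):
    size = sum(os.path.getsize(f) for f in glob.glob(stem + '*'))
    print(stem, size, 'bytes')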
@agramfort
Owner
@jnothman
Owner
@mheilman

Ah, good point. In my case, I actually wanted to remove zero-weighted features from a DictVectorizer object in addition to removing them from the model. The vectorizer's output included a lot of features for which the model's weights were zero, and that was slowing down my application.

Just calling sparsify on the model wouldn't be enough on its own: the coef_ matrix would become sparse but stay high-dimensional, whereas the restricted DictVectorizer's outputs would have columns only for the nonzero-weight features, so the column indices wouldn't match up.
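
To illustrate (a minimal sketch, assuming a fitted estimator like the logreg above):

    from scipy import sparse

    # sparsify() changes how coef_ is stored, not how many columns it has,
    # so a restricted vectorizer's output would no longer line up with it.
    logreg.sparsify()
    print(sparse.issparse(logreg.coef_))  # True
    print(logreg.coef_.shape)             # unchanged: (n_classes, n_features)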

I ended up doing it the following way. If you have any suggestions, I'd be glad to hear them. Regardless, thanks for looking into this.

    nonzero_feat_mask = ~np.all(model.coef_ == 0, axis=0)
    model.coef_ = model.coef_[:, nonzero_feat_mask]
    model.feat_vectorizer.restrict(nonzero_feat_mask)  # this method is awesome!
    model.raw_coef_ = None  # this step won't be necessary in the future
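
For readers who haven't used restrict(): here's a minimal, self-contained sketch of the same idea on toy data (the weights below are made up):

    from sklearn.feature_extraction import DictVectorizer
    import numpy as np

    vec = DictVectorizer(sparse=False)
    vec.fit([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])   # features: ['a', 'b', 'c']
    coef = np.array([[0.5, 0.0, -1.2]])             # made-up weights; 'b' is zero

    mask = ~np.all(coef == 0, axis=0)               # keep nonzero-weight columns
    coef = coef[:, mask]
    vec.restrict(mask)

    print(vec.get_feature_names())                  # ['a', 'c']
    print(vec.transform([{'a': 1, 'b': 2, 'c': 3}]))  # [[1. 3.]] -- 'b' is dropped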

(Sorry, this is getting a bit off-topic now.)

@jnothman
Owner
@mblondel
Owner

Fixed by @MechCoder in #3416.

@mblondel mblondel closed this