It appears that the raw_coef_ attribute of LIBLINEAR-based models is only used during training to set the final model parameters (coef_ and intercept_), so it doesn't seem to need to be kept around as a member variable. Keeping it essentially causes the parameters to be saved twice when the model is written to disk with joblib (see the code below).
I noticed this after removing columns of zeros from a logistic regression fit with an L1 penalty. Even though I'd removed all but 5% of the features, the model on disk was still about 50% of its original size because raw_coef_ was still large.
import numpy as np
from sklearn.externals import joblib
from sklearn import linear_model, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features
Y = iris.target

logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, Y)
joblib.dump(logreg, 'logreg.pkl')  # raw_coef_ gets pickled alongside coef_ and intercept_

# dropping the duplicate shrinks the pickle without changing predictions
logreg.raw_coef_ = None
joblib.dump(logreg, 'logreg_small.pkl')
Ah, good point. In my case, I actually wanted to remove the zero-weighted features from a DictVectorizer object as well as from the model. The vectorizer's output represented a lot of features for which the model's weights were zero, which was slowing down my application.
Just calling sparsify on the model would leave the column indices mismatched: coef_ would be sparse but still high-dimensional, whereas the DictVectorizer's outputs would have dimensions only for the nonzero features.
I ended up doing it the following way. If you have any suggestions, I'd be glad to hear them. Regardless, thanks for looking into this.
import numpy as np

# keep only the columns whose weights are not all zero
nonzero_feat_mask = ~np.all(model.coef_ == 0, axis=0)
model.coef_ = model.coef_[:, nonzero_feat_mask]
model.feat_vectorizer.restrict(nonzero_feat_mask)  # this method is awesome!
model.raw_coef_ = None  # this step won't be necessary in the future
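For what it's worth, the masking-plus-restrict step can be sanity-checked on a toy DictVectorizer (a hypothetical example with made-up weights, not my actual model):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy setup: three features, one of which ("b") has weight zero.
vec = DictVectorizer(sparse=False)
vec.fit([{"a": 1.0, "b": 2.0, "c": 3.0}])
coef = np.array([[0.5, 0.0, 1.2]])  # stand-in for a fitted coef_ matrix

# Drop zero-weight columns from coef and restrict the vectorizer to match.
mask = ~np.all(coef == 0, axis=0)
coef = coef[:, mask]
vec.restrict(mask)

# The columns line up again: the vectorizer now emits only the two
# features that have nonzero weights, so coef[:, j] matches column j.
print(coef.shape[1])            # 2
print(vec.feature_names_)       # ['a', 'c']
print(vec.transform([{"a": 1.0, "b": 2.0, "c": 3.0}]))  # [[1. 3.]]
```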
(sorry, this is getting a bit off topic now.)
Fixed by @MechCoder in #3416.