### How does feature scaling affect logistic regression coefficients with L2 regularization?

Consider a text classification problem, where each document is represented by a binary feature vector.

If we don't scale the feature matrix, then multiplying one column by some scalar $W$ should increase the impact that feature has on classification. The reason is that the L2 penalty constrains that coefficient less: because the feature value is larger, a smaller coefficient produces the same contribution to the decision function, and a smaller coefficient incurs a (relatively) smaller L2 penalty.
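
To make this argument concrete, here is a sketch of the objective, assuming scikit-learn's parameterization in which $C$ multiplies the data-fit term. The symbols $\theta$, $\ell$, $x_{ij}$, and $\tilde{\theta}_j$ are notation introduced here for illustration, and the intercept is omitted for brevity:

$$
J(\theta) = \frac{1}{2} \sum_k \theta_k^2 + C \sum_i \ell\Big(y_i, \sum_k \theta_k x_{ik}\Big)
$$

If column $j$ is replaced by $W x_{ij}$, the substitution $\tilde{\theta}_j = \theta_j / W$ leaves every score $\sum_k \theta_k x_{ik}$ unchanged, so the data-fit term is the same, while that coordinate's penalty drops:

$$
\frac{1}{2} \tilde{\theta}_j^2 = \frac{1}{2W^2} \theta_j^2
$$

Under these assumptions, inflating a column by $W$ is equivalent to dividing its penalty by $W^2$ in the original units, which is why the effective weight $W \tilde{\theta}_j$ can grow even though the learned coefficient shrinks.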

Below, we illustrate this on the 20 newsgroups dataset. We vary $W$ and print both the learned coefficient and the posterior probability of a document containing only that term. The posterior indeed increases with $W$, indicating that the term has a greater impact on classification.
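
For a document whose only nonzero feature is the inflated term, the posterior for the positive class in a binary logistic regression is the logistic function of the intercept plus $W$ times the coefficient. A minimal sketch of that arithmetic follows; the helper name `single_term_posterior` and its arguments are hypothetical and not part of the notebook:

```python
import math


def single_term_posterior(coef, intercept, term_weight):
    """Posterior P(y=1 | x) for a document whose only nonzero feature
    has value `term_weight` (hypothetical helper, for illustration)."""
    # Decision function: intercept plus the single term's contribution.
    z = intercept + coef * term_weight
    # Logistic (sigmoid) link maps the score to a probability.
    return 1.0 / (1.0 + math.exp(-z))
```

With the intercept held fixed and a positive coefficient, the posterior rises whenever the product `coef * term_weight` rises, so a coefficient can shrink while the posterior still goes up.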

In [1]:
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
from sklearn.metrics import f1_score

In [2]:
# get data.
categories = ['alt.atheism','talk.religion.misc']
remove = ('headers', 'footers', 'quotes')

data = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)
print('%d instances' % len(data.data))

857 instances


In [3]:
# featurize.
vec = CountVectorizer(binary=True, min_df=2)
data.X = vec.fit_transform(data.data)
data.X.shape

(857, 6427)

In [4]:
type(data)

sklearn.datasets.base.Bunch

In [5]:
# cross validation experiment.
def expt(data, vec, term, term_weight, C):
    """
    data.............sklearn Bunch dataset.
    vec..............CountVectorizer
    term.............The term to inflate.
    term_weight......How much to inflate this term.
    C................Inverse of L2 regularization strength, for logistic regression
    """
    vocab = {f:i for i, f in enumerate(vec.get_feature_names())}
    f1s = []
    X = data.X.copy()
    idx = vocab[term]
    X[:,idx] *= term_weight
    for train, test in KFold(len(data.data), 5, random_state=42):
        clf = LogisticRegression(C=C)
        clf.fit(X[train], data.target[train])
        preds = clf.predict(X[test])
        f1s.append(f1_score(data.target[test], preds))
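    # Refit on all of the data, then score a synthetic document that
    # contains only the inflated term.
    clf.fit(X, data.target)
    xx = np.zeros(len(vocab))
    xx[idx] = term_weight
    proba = clf.predict_proba(csr_matrix(xx))[0][1]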
    print('%20s\tcoef\tposterior' % 'term')
    print('%20s\t%.3f\t%.3f' % (term, clf.coef_[0][idx], proba))
    # return np.mean(f1s)

In [6]:
print('vanilla logreg, C=1')
expt(data, vec, 'order', 1, 1)
print('term weight 5, C=1')
expt(data, vec, 'order', 5, 1)
print('term weight 10, C=1')
expt(data, vec, 'order', 10, 1)
print('term weight 100, C=1')
expt(data, vec, 'order', 100, 1)


print('\n\nvanilla logreg, C=.1')
expt(data, vec, 'order', 1, .1)
print('term weight 10, C=.1')
expt(data, vec, 'order', 10, .1)


vanilla logreg, C=1
                term	coef	posterior
               order	1.124	0.793
term weight 5, C=1
                term	coef	posterior
               order	0.728	0.980
term weight 10, C=1
                term	coef	posterior
               order	0.397	0.985
term weight 100, C=1
                term	coef	posterior
               order	0.041	0.987


vanilla logreg, C=.1
                term	coef	posterior
               order	0.444	0.626
term weight 10, C=.1
                term	coef	posterior
               order	0.229	0.915


In [7]:
def expt2(data, vec, term, term_weight, C):
    """
    Repeat the experiment with the X matrix reduced to just one column
    (i.e., each document has a single feature).
    """
    vocab = {f:i for i, f in enumerate(vec.get_feature_names())}
    f1s = []
    X = data.X.copy()
    idx = vocab[term]
    X[:,idx] *= term_weight
    X = X[:,idx]
    for train, test in KFold(len(data.data), 5, random_state=42):
        clf = LogisticRegression(C=C)
        clf.fit(X[train], data.target[train])
        preds = clf.predict(X[test])
        f1s.append(f1_score(data.target[test], preds))
    clf.fit(X, data.target)
    proba = clf.predict_proba([[term_weight]])[0][1]
    print('%20s\tcoef\tposterior' % 'term')
    print('%20s\t%.3f\t%.3f' % (term, clf.coef_[0][0], proba))

In [8]:
print('vanilla logreg, C=1')
expt2(data, vec, 'order', 1, 1)
print('term weight 5, C=1')
expt2(data, vec, 'order', 5, 1)
print('term weight 10, C=1')
expt2(data, vec, 'order', 10, 1)
print('term weight 100, C=1')
expt2(data, vec, 'order', 100, 1)


print('\n\nvanilla logreg, C=.1')
expt2(data, vec, 'order', 1, .1)
print('term weight 10, C=.1')
expt2(data, vec, 'order', 10, .1)

vanilla logreg, C=1
                term	coef	posterior
               order	1.079	0.685
term weight 5, C=1
                term	coef	posterior
               order	0.236	0.705
term weight 10, C=1
                term	coef	posterior
               order	0.118	0.706
term weight 100, C=1
                term	coef	posterior
               order	0.012	0.706


vanilla logreg, C=.1
                term	coef	posterior
               order	0.612	0.586
term weight 10, C=.1
                term	coef	posterior
               order	0.116	0.704


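In every setting, increasing the feature value increases the posterior for a document containing only that term. The learned coefficient shrinks as $W$ grows, but the product of $W$ and the coefficient still rises (for the full model at C=1, from 1.124 at $W=1$ to roughly 4.1 at $W=100$), and the gain levels off for large $W$.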
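
The same effect can be checked without downloading the newsgroups data. The sketch below is an illustration under stated assumptions, not the experiment above: it uses synthetic binary features, a hypothetical function name `effective_weights`, and `solver="liblinear"` chosen here so the fit is well behaved when one column is much larger than the others.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def effective_weights(weights=(1, 5, 10, 100), C=1.0, seed=0):
    """Scale column 0 of a synthetic binary matrix by each W and return
    (W, coef, W * coef) for that column (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    n_docs, n_terms = 500, 20
    # Synthetic binary "term present" features (an assumption, not real text).
    X = (rng.rand(n_docs, n_terms) < 0.3).astype(float)
    # Labels depend on feature 0 plus Gaussian noise.
    score = 1.5 * X[:, 0] - 0.5 + 0.5 * rng.randn(n_docs)
    y = (score > 0).astype(int)

    rows = []
    for W in weights:
        Xw = X.copy()
        Xw[:, 0] *= W  # inflate the single column
        clf = LogisticRegression(C=C, solver="liblinear").fit(Xw, y)
        coef = clf.coef_[0][0]
        rows.append((W, coef, W * coef))
    return rows


if __name__ == "__main__":
    for W, coef, eff in effective_weights():
        print("W=%d\tcoef=%.3f\tW*coef=%.3f" % (W, coef, eff))
```

Comparing the last two columns shows the learned coefficient and the effective weight `W * coef` side by side, which is the quantity that drives the posterior for a single-term document.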