[MRG + 2] Allow f_regression to accept a sparse matrix with centering #8065

Merged
merged 3 commits into from Dec 20, 2016
3 changes: 3 additions & 0 deletions doc/whats_new.rst
@@ -99,6 +99,9 @@ Enhancements
      A ``TypeError`` will be raised for any other kwargs. :issue:`8028`
      by :user:`Alexander Booth <alexandercbooth>`.

+   - Added ability to use sparse matrices in :func:`feature_selection.f_regression`
+     with ``center=True``. :issue:`8065` by :user:`Daniel LeJeune <acadiansith>`.
+
 Bug fixes
 .........

6 changes: 6 additions & 0 deletions sklearn/feature_selection/tests/test_feature_select.py
@@ -92,6 +92,12 @@ def test_f_regression():
     assert_true((pv[:5] < 0.05).all())
     assert_true((pv[5:] > 1.e-4).all())

+    # with centering, compare with sparse
+    F, pv = f_regression(X, y, center=True)
+    F_sparse, pv_sparse = f_regression(sparse.csr_matrix(X), y, center=True)
+    assert_array_almost_equal(F_sparse, F)
+    assert_array_almost_equal(pv_sparse, pv)
+
     # again without centering, compare with sparse
     F, pv = f_regression(X, y, center=False)
     F_sparse, pv_sparse = f_regression(sparse.csr_matrix(X), y, center=False)
20 changes: 15 additions & 5 deletions sklearn/feature_selection/univariate_selection.py
@@ -266,17 +266,27 @@ def f_regression(X, y, center=True):
     f_classif: ANOVA F-value between label/feature for classification tasks.
     chi2: Chi-squared stats of non-negative features for classification tasks.
     """
-    if issparse(X) and center:
-        raise ValueError("center=True only allowed for dense data")
     X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64)
+    n_samples = X.shape[0]
+
+    # compute centered values
+    # note that E[(x - mean(x))*(y - mean(y))] = E[x*(y - mean(y))], so we
This comment is applicable only when y is mean-centered, correct? In which case it would be E[x*y]? (Sorry if I'm misunderstanding.)

The comment is applicable even when y is not centered.
Yet you are right: here we compute E[x*(y - mean(y))] by first centering y and then computing E[x*y].

Ah yes sorry my math is a bit rusty... Thanks heaps for the clarification (online and offline)!
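
The identity discussed in this thread can be checked numerically. The following is a minimal sketch that assumes only NumPy; the variable names are illustrative and are not taken from the PR.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(50)
y = rng.rand(50)

# Covariance with both variables centered.
both_centered = np.mean((x - x.mean()) * (y - y.mean()))
# Covariance with only y centered, as the code comment states.
only_y_centered = np.mean(x * (y - y.mean()))

# The two agree because mean(x) * sum(y - mean(y)) is zero.
assert np.isclose(both_centered, only_y_centered)
```

Since centering y is enough, X never has to be modified, which is what lets a sparse X keep its sparsity.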

+    # need not center X
     if center:
         y = y - np.mean(y)
-        X = X.copy('F')  # faster in fortran
-        X -= X.mean(axis=0)
+        if issparse(X):
+            X_means = X.mean(axis=0).getA1()
+        else:
+            X_means = X.mean(axis=0)
+        # compute the scaled standard deviations via moments
+        X_norms = np.sqrt(row_norms(X.T, squared=True) -
+                          n_samples * X_means ** 2)
+    else:
+        X_norms = row_norms(X.T)

     # compute the correlation
     corr = safe_sparse_dot(y, X)
-    corr /= row_norms(X.T)
+    corr /= X_norms
     corr /= norm(y)

     # convert to p-value
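
The moment-based norm in this hunk relies on sum((x - mean(x))^2) = sum(x^2) - n * mean(x)^2, so the centered column norms can be obtained without centering (and thereby densifying) a sparse X. The sketch below illustrates the idea under stated assumptions: it uses plain NumPy and SciPy calls as stand-ins for scikit-learn's `row_norms` and `safe_sparse_dot`, and the data and variable names are made up for illustration.

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
X_dense = rng.rand(30, 4)
X_dense[X_dense < 0.5] = 0.0  # zero out roughly half of the entries
X = sparse.csr_matrix(X_dense)
y = rng.rand(30)
n_samples = X.shape[0]

y_centered = y - y.mean()

# Column means and sums of squares, computed without densifying X.
X_means = np.asarray(X.mean(axis=0)).ravel()
sq_sums = np.asarray(X.multiply(X).sum(axis=0)).ravel()

# Norms of the centered columns via moments:
# sum((x - mean)^2) = sum(x^2) - n * mean^2
X_norms = np.sqrt(sq_sums - n_samples * X_means ** 2)

# Pearson correlation of each column with y; only y is centered explicitly.
corr = X.T.dot(y_centered) / (X_norms * np.linalg.norm(y_centered))

# Reference values from the dense data.
expected = np.array([np.corrcoef(X_dense[:, j], y)[0, 1] for j in range(4)])
assert np.allclose(corr, expected)
```

The subtraction in the moment formula can lose precision when a column's mean is large relative to its spread, so this is a trade-off in favor of preserving sparsity rather than a numerically ideal formulation.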