[WIP] Implement PCA on sparse noncentered data #24415

andportnoy · 2022-09-10T15:44:23Z

Will fix #12794 when complete.

TODOs

andportnoy · 2022-09-10T15:47:57Z

This test is expected to fail at the moment. I will expand test coverage in the future.

andportnoy · 2022-09-11T00:52:14Z

Ran into an issue with the @ operator and LinearOperator when testing randomized_svd: #18689 (comment).

This is an intermediate commit with a lot of debug print code. All tests are passing though.

andportnoy · 2022-10-08T16:58:49Z

I dodged the issue by using the transpose identity $AB = ((AB)^T)^T = (B^TA^T)^T$, where $A$ is a NumPy array and $B$ is a LinearOperator, to force $B$ to be on the LHS.

That enables randomized SVD in addition to ARPACK.
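
A minimal sketch of that workaround, for illustration only. It assumes SciPy's `aslinearoperator` wrapping a random sparse matrix as a stand-in for the operator used in the PR; the shapes and variable names are made up.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import aslinearoperator

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 300))  # plain NumPy array (left factor)
B_sparse = sp.random(300, 40, density=0.1, format="csr", random_state=0)
B = aslinearoperator(B_sparse)  # LinearOperator (right factor)

# Instead of A @ B (ndarray on the left of a LinearOperator), use
# AB = (B^T A^T)^T so that the LinearOperator is the left operand.
AB = B.T.matmat(A.T).T

# Reference result computed with a dense copy of B.
expected = A @ B_sparse.toarray()
np.testing.assert_allclose(AB, expected, atol=1e-12)
```

The product never materializes $B$ as a dense array; only `B.T.matmat` is called, which is the operation a `LinearOperator` supports directly.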

I'll need to squash these intermediate commits later.

ogrisel

Is this PR still WIP? What remains to be done?

Here are some suggestions to move it forward:

- test with larger data than iris (e.g. a few hundred data points and features);
- use the global_random_seed fixture in the new test (see "Improve tests by using global_random_seed fixture to make them less seed-sensitive" #22827 for more details);
- parametrize the new test to also check with whiten set to True;
- please also check that transforming a batch of random test data points (ideally not from the training set) yields the same result with assert_allclose;
- check that it's possible to call transform on a dense array of points with a model that was trained on sparse data, and vice versa;
- document the change in the changelog for 1.2 (we will move it to 1.3 if the PR is not ready to merge by then).
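
The dense-versus-sparse equivalence check suggested above can be prototyped outside scikit-learn along the following lines. This is a sketch, not the PR's code: `implicitly_centered` and `fix_signs` are hypothetical helpers, and the synthetic data rescales three columns (an assumption) so that the leading components are well separated and the comparison is numerically stable.

```python
import numpy as np
import scipy.sparse as sp
from numpy.testing import assert_allclose
from scipy.sparse.linalg import LinearOperator, svds


def implicitly_centered(X):
    """Operator acting like X - mean(X) without densifying X (hypothetical)."""
    mean = np.asarray(X.mean(axis=0)).ravel()
    ones = np.ones(X.shape[0])

    def matmat(V):
        # (X - 1 mean^T) V = X V - 1 (mean^T V)
        return X @ V - np.outer(ones, mean @ V)

    def rmatmat(U):
        # (X - 1 mean^T)^T U = X^T U - mean (1^T U)
        return X.T @ U - np.outer(mean, ones @ U)

    op = LinearOperator(
        X.shape,
        matvec=lambda v: matmat(v.reshape(-1, 1)).ravel(),
        rmatvec=lambda u: rmatmat(u.reshape(-1, 1)).ravel(),
        matmat=matmat,
        rmatmat=rmatmat,
        dtype=X.dtype,
    )
    return op, mean


def fix_signs(components):
    """Flip each component so its largest-magnitude entry is positive."""
    idx = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    return components * signs[:, None]


n_samples, n_features, k = 300, 40, 3
rng = np.random.default_rng(0)
scales = np.ones(n_features)
scales[:k] = [8.0, 5.0, 3.0]  # assumption: separates the top-k components
X_raw = sp.random(n_samples, n_features, density=0.1, format="csr", random_state=0)
X_sparse = (X_raw @ sp.diags(scales)).tocsr()
X_dense = X_sparse.toarray()

# Sparse path: truncated SVD of the implicitly centered operator.
op, mean_sparse = implicitly_centered(X_sparse)
_, s, vt = svds(op, k=k, v0=rng.standard_normal(n_features))
components_sparse = fix_signs(vt[np.argsort(s)[::-1]])  # sort descending

# Dense path: full SVD of the explicitly centered array.
mean_dense = X_dense.mean(axis=0)
_, _, vt_dense = np.linalg.svd(X_dense - mean_dense, full_matrices=False)
components_dense = fix_signs(vt_dense[:k])

# Transform held-out points with both models and compare.
X_test = rng.random((20, n_features)) * scales
T_sparse = (X_test - mean_sparse) @ components_sparse.T
T_dense = (X_test - mean_dense) @ components_dense.T
assert_allclose(T_sparse, T_dense, atol=1e-6)
```

Singular vectors are only defined up to sign, which is why both sets of components go through `fix_signs` before the transformed points are compared.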

sklearn/decomposition/tests/test_pca.py

andportnoy · 2022-10-23T19:07:30Z

@ogrisel Thank you so much for taking a look and for the suggestions, I will implement those.

I was also planning to add support for LOBPCG and PROPACK as sparse SVD methods. That could go in via this PR or as a follow up.

When is the merge window closing for 1.2?

ogrisel · 2022-10-23T19:23:09Z

Soonish I think :) /cc @jeremiedbb

andportnoy · 2022-10-23T21:05:28Z

Uh oh. A couple of days?

andportnoy · 2022-10-25T21:30:57Z

@ogrisel Let me know if I interpreted the suggestions correctly, I put a TODO list at the top of the PR.

andportnoy · 2022-11-05T11:00:43Z

(re the force push) Had to kill some unwanted commits pulled from main directly as opposed to via a merge commit.

andportnoy · 2022-11-05T12:45:12Z

@ogrisel Only 2080 out of 16000 tests (13%) are passing when testing on 400x300 random sparse matrices of varying densities across the 100 global_random_seed values.

Command:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" OMP_NUM_THREADS=1 pytest -v --tb=no -n `nproc --all` sklearn/decomposition/tests/test_pca.py::test_pca_sparse

Test matrix:

SPARSE_M, SPARSE_N = 400, 300
PCA_SOLVERS = ["full", "arpack", "randomized", "auto"]

@pytest.mark.parametrize("density", [0.01, 0.05, 0.10, 0.30])
@pytest.mark.parametrize("n_components", [1, 2, 3, 10, min(SPARSE_M, SPARSE_N)])
@pytest.mark.parametrize("format", ["csr", "csc"])
@pytest.mark.parametrize("svd_solver", PCA_SOLVERS)
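
For reference, the 16000 figure follows from the parametrization above multiplied by the 100 seeds. A quick check of the arithmetic (the variable names are illustrative):

```python
from itertools import product

densities = [0.01, 0.05, 0.10, 0.30]
n_components = [1, 2, 3, 10, min(400, 300)]
formats = ["csr", "csc"]
solvers = ["full", "arpack", "randomized", "auto"]
n_seeds = 100  # number of global_random_seed values quoted in the comment

# One test case per combination of parameters, repeated for every seed.
cases_per_seed = len(list(product(densities, n_components, formats, solvers)))
total = cases_per_seed * n_seeds
pass_rate = 2080 / total

print(cases_per_seed, total, f"{pass_rate:.0%}")  # 160 16000 13%
```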

Looking at some of the results manually, the failures are due to 1-2% of the elements mismatching; I'll try to gather better statistics on that in particular. Below is a high-level breakdown of the pass rate by parameter.
A plurality of seeds has no tests passing.

Plot repro

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" OMP_NUM_THREADS=1 pytest -v --tb=no -n `nproc --all` sklearn/decomposition/tests/test_pca.py::test_pca_sparse > test-pca-sparse-all-seeds.log
grep -P 'PASSED|FAILED' test-pca-sparse-all-seeds.log | sed -E -e 's/^.*(FAILED|PASSED).*\[(.*)\]/\2 \1/' -e 's/-/ /g' -e 's/ $//' -e 's/ /,/g' > test-pca-sparse-all-seeds.csv

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(
    'test-pca-sparse-all-seeds.csv',
    header=None,
    names=['seed', 'solver', 'layout', 'ncomp', 'density', 'outcome']
)

# Boolean column: True where the test passed.
df['pass'] = df.outcome == 'PASSED'

def passrate_by(x):
    # Fraction of passing tests within each group of column `x`
    # (the mean of a boolean column equals sum / count).
    return df.groupby(x)['pass'].mean()

fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=200)

seed = passrate_by('seed').hist(ax=axes[0][0])
seed.set_title('pass rate by seed (histogram)')
seed.set_xlabel('pass rate')
seed.set_ylabel('seed count')
seed.set_ylim(top=100)
seed.set_xlim(right=1)

solver = passrate_by('solver').plot.bar(ax=axes[0][1])
solver.set_title('pass rate by solver')
solver.set_xlabel('solver')
solver.set_ylabel('pass rate')

density = passrate_by('density').plot.bar(ax=axes[1][0])
density.set_title('pass rate by density')
density.set_xlabel('density')
density.set_ylabel('pass rate')

ncomp = passrate_by('ncomp').plot.bar(ax=axes[1][1])
ncomp.set_title('pass rate by number of components')
ncomp.set_xlabel('# components')
ncomp.set_ylabel('pass rate')

for bp in (solver, density, ncomp):
    bp.set_xticklabels(bp.get_xticklabels(), rotation=0)
    bp.set_ylim(top=1)
fig.tight_layout()
fig.savefig('test-pca-sparse-pass-rate.png', facecolor='white', transparent=False)
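
The grep/sed step of the repro above can also be written in Python with the standard library only. This is a sketch under assumptions: `log_to_rows` is a hypothetical helper, and the two sample log lines are invented to match the shape the sed expression expects (outcome first, then a test ID whose bracketed parameters are separated by hyphens).

```python
import csv
import io
import re

# Same idea as the sed expression: find the outcome, then the last [...] group.
LINE_RE = re.compile(r"(PASSED|FAILED).*\[(.*)\]")


def log_to_rows(lines):
    """Turn pytest log lines into rows of parameters followed by the outcome."""
    rows = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a per-test result line
        outcome, test_id = match.groups()
        rows.append(test_id.split("-") + [outcome])
    return rows


# Invented sample lines, for illustration only.
sample_log = [
    "[gw0] [  0%] PASSED test_pca.py::test_pca_sparse[0-full-csr-1-0.01] ",
    "[gw1] [  0%] FAILED test_pca.py::test_pca_sparse[1-arpack-csc-10-0.3] ",
    "============ test session starts ============",
]

rows = log_to_rows(sample_log)
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Like the sed version, this splits on every hyphen, so it only works while no parameter value contains one.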


andportnoy · 2022-12-17T20:07:54Z

Updates are posted in the linked issue #12794.


Add small test of PCA on sparse data

225a848

github-actions bot added the module:decomposition label Sep 10, 2022

andportnoy mentioned this pull request Sep 10, 2022

PCA on sparse, noncentered data #12794

Open

andportnoy added 7 commits September 24, 2022 11:12

Merge branch 'main' into pca-on-sparse-noncentered-data

1520c5f

Merge branch 'main' into pca-on-sparse-noncentered-data

d4e7daf

Merge branch 'main' into pca-on-sparse-noncentered-data

bce3d62

Merge branch 'main' into pca-on-sparse-noncentered-data

e4490f9

Merge branch 'main' into pca-on-sparse-noncentered-data

3bf477e

Merge branch 'main' into pca-on-sparse-noncentered-data

f5a30e4

Add support for PCA on sparse matrices using ARPACK + randomized SVD

460c368

This is an intermediate commit with a lot of debug print code. All tests are passing though.

andportnoy added 6 commits October 8, 2022 16:12

Blacken PCA on sparse data code

ead8bf7

I'll need to squash these intermediate commits later.

Merge branch 'main' into pca-on-sparse-noncentered-data

ef071f3

Merge branch 'main' into pca-on-sparse-noncentered-data

331ba6f

Merge branch 'main' into pca-on-sparse-noncentered-data

d8b5283

PCA/helpers: remove debug prints from _center_implicitly

5fb7aa6

PCA/helpers: remove redundant variable from _center_implicitly

b1528cd

ogrisel reviewed Oct 22, 2022

sklearn/decomposition/tests/test_pca.py

andportnoy added 3 commits October 25, 2022 17:35

Merge branch 'main' into pca-on-sparse-noncentered-data

9f9b8c8

Merge branch 'main' into pca-on-sparse-noncentered-data

bfd7a0e

PCA/tests: test PCA on larger random sparse matrix

1dff900

andportnoy force-pushed the pca-on-sparse-noncentered-data branch from 76f7f32 to 1dff900 November 5, 2022 10:56

andportnoy added 9 commits November 26, 2022 09:28

Merge branch 'main' into pca-on-sparse-noncentered-data

fa670b4

Merge branch 'main' into pca-on-sparse-noncentered-data

647b735

PCA: add LOBPCG support for sparse data

1b9a851

CI [all random seeds]

4786e88

test_pca_sparse

Merge branch 'main' into pca-on-sparse-noncentered-data

7cbf497

Merge branch 'main' into pca-on-sparse-noncentered-data

480aa5b

PCA/tests: parametrize test_pca_sparse on rtol [all random seeds]

b2f8d64

test_pca_sparse [azure parallel]

CI [azure parallel] [all random seeds]

ef18778

test_pca_sparse

PCA/tests: leave only default rtol value [azure parallel] [all random…

73dd609

… seeds] test_pca_sparse

andportnoy added 10 commits February 23, 2023 15:44

Merge branch 'main' into pca-on-sparse-noncentered-data

8a66dfd

Merge branch 'main' into pca-on-sparse-noncentered-data

58c7b90

PCA/tests: use density parameter

0884861

Previously it was completely ignored and as a result defaulted to 0.01.

PCA/tests: check in directory with debug scripts

8569def

PCA/debug: mkdir data and plot directories if necessary

b1edffb

PCA/tests: use 300 dpi in plots

c80f883

Merge branch 'main' into pca-on-sparse-noncentered-data

d9ca26d

Merge branch 'main' into pca-on-sparse-noncentered-data

600f2f1

Merge branch 'main' into pca-on-sparse-noncentered-data

2c01c70

Merge branch 'main' into pca-on-sparse-noncentered-data

8d6c90e

andportnoy mentioned this pull request May 23, 2023

ENH Allow fitting PCA on sparse X with arpack solvers #18689

Merged

2 tasks

Merge branch 'main' into pca-on-sparse-noncentered-data

3f52d5c

ogrisel mentioned this pull request Sep 27, 2023

Solve PCA via np.linalg.eigh(X_centered.T @ X_centered) instead of np.linalg.svd(X_centered) when X.shape[1] is small enough. #27483

Open
