Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Implement PCA on sparse noncentered data #24415

Open
wants to merge 37 commits into
base: main
Choose a base branch
from

Conversation

andportnoy
Copy link
Contributor

@andportnoy andportnoy commented Sep 10, 2022

Will fix #12794 when complete.

TODOs

  • test on large random data
    • $min(m, n)$ at least a few hundred
    • parametrize on whiten
    • use global_random_seed
      • test locally using SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest sklearn/decomposition/tests/test_pca.py::test_pca_sparse
  • test .transform
    • on random inputs
    • sparse model on dense inputs
    • dense model on sparse inputs
  • make a changelog entry
  • implement support for LOBPCG
  • implement support for PROPACK

@andportnoy
Copy link
Contributor Author

This test is expected to fail at the moment. I will expand test coverage in the future.

@andportnoy
Copy link
Contributor Author

Ran into an issue with the @ operator and LinearOperator when testing randomized_svd: #18689 (comment).

@andportnoy
Copy link
Contributor Author

andportnoy commented Oct 8, 2022

I dodged the issue by using the transpose identity $AB = ((AB)^T)^T = (B^TA^T)^T$, where $A$ is a NumPy array and $B$ is a LinearOperator, to force $B$ to be on the LHS.

That enables randomized SVD in addition to ARPACK.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this PR still WIP? What remains to be done?

Here are some suggestions to move it forward:

  • test with larger data than iris (e.g. a few hundred data points and features);
  • use the global_random_seed fixture in the new test (see Improve tests by using global_random_seed fixture to make them less seed-sensitive #22827 for more details);
  • parametrize the new test to also check with whiten set to True;
  • please also check that transforming a batch of random test data points (ideally not from the training set) yields the same result with assert_allclose;
  • check that it's possible to call transform on dense array of points on a model that was trained with sparse data and vice versa;
  • document the change in the changelog for 1.2 (we will move it to 1.3 is the PR is not ready to merge by then).

sklearn/decomposition/tests/test_pca.py Show resolved Hide resolved
@andportnoy
Copy link
Contributor Author

@ogrisel Thank you so much for taking a look and for the suggestions, I will implement those.

I was also planning to add support for LOBPCG and PROPACK as sparse SVD methods. That could go in via this PR or as a follow up.

When is the merge window closing for 1.2?

@ogrisel
Copy link
Member

ogrisel commented Oct 23, 2022

Soonish I think :) /cc @jeremiedbb

@andportnoy
Copy link
Contributor Author

Uh oh. A couple of days?

@andportnoy
Copy link
Contributor Author

@ogrisel Let me know if I interpreted the suggestions correctly, I put a TODO list at the top of the PR.

@andportnoy andportnoy force-pushed the pca-on-sparse-noncentered-data branch from 76f7f32 to 1dff900 Compare November 5, 2022 10:56
@andportnoy
Copy link
Contributor Author

(re the force push) Had to kill some unwanted commits pulled from main directly as opposed to via a merge commit.

@andportnoy
Copy link
Contributor Author

andportnoy commented Nov 5, 2022

@ogrisel Only 2080 out of 16000 tests are passing when testing on 400x300 random sparse matrices of varying densities across the 100 global_random_seed's.

Command:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" OMP_NUM_THREADS=1 pytest -v --tb=no -n `nproc --all` sklearn/decomposition/tests/test_pca.py::test_pca_sparse

Test matrix:

SPARSE_M, SPARSE_N = 400, 300
PCA_SOLVERS = ["full", "arpack", "randomized", "auto"]

@pytest.mark.parametrize("density", [0.01, 0.05, 0.10, 0.30])
@pytest.mark.parametrize("n_components", [1, 2, 3, 10, min(SPARSE_M, SPARSE_N)])
@pytest.mark.parametrize("format", ["csr", "csc"])
@pytest.mark.parametrize("svd_solver", PCA_SOLVERS)

Looking at some of the results manually, the errors are due to 1-2% elements mismatching, I'll try to gather better statistics on that in particular. Below is a high level breakdown of the pass rate by parameter.
A plurality of seeds has no tests passing.

Plot repro
SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" OMP_NUM_THREADS=1 pytest -v --tb=no -n `nproc --all` sklearn/decomposition/tests/test_pca.py::test_pca_sparse > test-pca-sparse-all-seeds.log
grep -P 'PASSED|FAILED' test-pca-sparse-all-seeds.log | sed -E -e 's/^.*(FAILED|PASSED).*\[(.*)\]/\2 \1/' -e 's/-/ /g' -e 's/ $//' -e 's/ /,/g' > test-sparse-pca-all-seeds.csv
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(
    'test-pca-sparse-all-seeds.csv',
    header=None,
    names=['seed', 'solver', 'layout', 'ncomp', 'density', 'outcome']
)

df['pass'] = df.outcome.apply(lambda x: True if x=='PASSED' else False)

def passrate_by(x):
    passes = df.groupby(x)['pass']
    counts = passes.count()
    sums = passes.sum()
    return sums / counts

fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=200)

seed = passrate_by('seed').hist(ax=axes[0][0])
seed.set_title('pass rate by seed (histogram)')
seed.set_xlabel('pass rate')
seed.set_ylabel('seed count')
seed.set_ylim(top=100)
seed.set_xlim(right=1)

solver = passrate_by('solver').plot.bar(ax=axes[0][1])
solver.set_title('pass rate by solver')
solver.set_xlabel('solver')
solver.set_ylabel('pass rate')

density = passrate_by('density').plot.bar(ax=axes[1][0])
density.set_title('pass rate by density')
density.set_xlabel('density')
density.set_ylabel('pass rate')

ncomp = passrate_by('ncomp').plot.bar(ax=axes[1][1])
ncomp.set_title('pass rate by number of components')
ncomp.set_xlabel('# components')
ncomp.set_ylabel('pass rate')

for bp in (solver, density, ncomp):
    bp.set_xticklabels(bp.get_xticklabels(), rotation=0)
    bp.set_ylim(top=1)
fig.tight_layout()
fig.savefig('test-pca-sparse-pass-rate.png', facecolor='white', transparent=False)

test-pca-sparse-pass-rate

@andportnoy
Copy link
Contributor Author

Updates are posted in the linked issue #12794.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PCA on sparse, noncentered data
2 participants