enable PCA on data to exploit sparseness #2333

Open: wants to merge 5 commits into base: develop

jan-glx (Contributor) commented Nov 18, 2019

This PR

  • adds a slot parameter to RunPCA.Seurat and RunPCA.Assay
    • this allows running RunPCA on raw (counts) or normalized (data) values without running ScaleData first
  • adds scale and center parameters to simplify use of implicit scaling/centering in RunPCA
    • Performing PCA with implicit scaling on a sparse matrix gives a significant speedup (~6x) over explicit scaling through ScaleData, which converts the data to a dense matrix (see comment below for details)
  • is based on #2330 (Fix weight.by.var for approx=FALSE), which should be merged first
  • is not yet ready for merge because:
    1. Should the center, scale and slot parameter values be saved in the DimReduc object? Where?
    2. To be more consistent with ScaleData, scale and center might be better named do.scale and do.center. Values of the scale, scale. and center parameters should then be used as defaults if supplied, to keep existing code that passes these (through ...) functional
    3. For approx=FALSE only, RunPCA has performed centering by default. This is only a problem if the user set do.center=FALSE in ScaleData, but it is inconsistent and should be changed
    4. The implementation could be simplified by switching from irlba to prcomp_irlba if bwlewis/irlba#52 (simplify and optimize prcomp_irlba) gets merged
    5. ?
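
For readers wondering why implicit centering and scaling keeps the sparse matrix sparse, here is a short sketch of the algebra. The notation is an assumption for illustration and does not appear in the PR: A is the sparse n × p cells-by-features matrix, μ the vector of column means, D = diag(s) the diagonal matrix of column scale factors, and 1 the all-ones vector of length n.

```latex
\begin{aligned}
\tilde{A} &= (A - \mathbf{1}\mu^{\top})\,D^{-1}, \\
\tilde{A}\,v &= A\,(D^{-1}v) - \mathbf{1}\,\bigl(\mu^{\top}D^{-1}v\bigr), \\
\tilde{A}^{\top}u &= D^{-1}\bigl(A^{\top}u - \mu\,(\mathbf{1}^{\top}u)\bigr).
\end{aligned}
```

Each product costs one sparse matrix-vector multiplication with A plus O(n + p) extra work, so an iterative SVD solver that only needs these two products never has to form the dense matrix Ã.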

control behavior through option, default: old behavior, warning
simplify implementation
Adds an additional argument `slot` that can be used to select the sparse `'data'` matrix instead of the dense `'scale.data'` matrix. Sparseness can be exploited by passing the centering and scaling factors to `irlba` as additional arguments. This gives a ~6x speedup.
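
The idea described above can be tried outside Seurat with a small sketch. This is not the code from this PR: the toy matrix, its dimensions and all variable names are assumptions made for illustration, and the matrix is laid out as cells × features (Seurat stores features × cells and transposes before the decomposition). It relies only on the `center` and `scale` arguments of `irlba::irlba` and on `Matrix::rsparsematrix`.

```r
library(Matrix)
library(irlba)

set.seed(42)

# Toy stand-in for a normalized expression matrix (hypothetical data):
# 200 cells (rows) x 50 features (columns), about 10% non-zero entries.
x <- rsparsematrix(nrow = 200, ncol = 50, density = 0.1)

# Column means and standard deviations, computed without densifying x.
col_means <- Matrix::colMeans(x)
col_sds <- sqrt((Matrix::colSums(x^2) - nrow(x) * col_means^2) / (nrow(x) - 1))

# Implicit centering/scaling: x stays sparse, irlba receives the factors.
pca_implicit <- irlba(x, nv = 5, center = col_means, scale = col_sds)

# Explicit centering/scaling for comparison: this creates a dense matrix.
x_dense <- scale(as.matrix(x), center = TRUE, scale = TRUE)
pca_explicit <- irlba(x_dense, nv = 5)

# Largest difference between the leading singular values of both routes.
print(max(abs(pca_implicit$d - pca_explicit$d)))
```

On a matrix this small the timing difference is negligible. The speedup reported in this PR refers to real data sets, where the dense scaled matrix is far larger than the sparse one.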