Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCA of Raster data [Feature Request] #1361

Closed
ailich opened this issue Nov 30, 2023 · 23 comments
Closed

PCA of Raster data [Feature Request] #1361

ailich opened this issue Nov 30, 2023 · 23 comments

Comments

@ailich
Copy link
Contributor

ailich commented Nov 30, 2023

I used to use RSToolbox for conducting PCA's on raster data, but it relies on the raster package and is no longer actively maintained (I actually can't get it to install anymore). Since PCA is a common procedure it might be nice to have directly in terra like how kmeans was recently added. I've taken the code from their function and converted it to terra code and it appears to work with the development version.

#' Principal Component Analysis for Rasters
#'
#' Calculates R-mode PCA for SpatRasters and returns a named list containing the PCA model and the SpatRaster with the principal component scores.
#'
#' Internally rasterPCA relies on the use of \link[stats]{princomp} (R-mode PCA). If nSamples is given the PCA will be calculated
#' based on a random sample of pixels and then predicted for the full raster. If nSamples is NULL then the covariance matrix will be calculated
#' first and will then be used to calculate princomp and predict the full raster. The latter is more precise, since it considers all pixels,
#' however, it may be slower than calculating the PCA only on a subset of pixels.
#'
#' Pixels with missing values in one or more bands will be set to NA. The built-in check for such pixels can lead to a slow-down of rasterPCA.
#' However, if you make sure or know beforehand that all pixels have either only valid values or only NAs throughout all layers you can disable this check
#' by setting maskCheck=FALSE which speeds up the computation.
#'
#' Standardised PCA (SPCA) can be useful if imagery or bands of different dynamic ranges are combined. SPC uses the correlation matrix instead of the covariance matrix, which
#' has the same effect as using normalised bands of unit variance.
#'
#' @param x SpatRaster
#' @param nSamples Integer or NULL. Number of pixels to sample for PCA fitting. If NULL, all pixels will be used.
#' @param nComp Integer. Number of PCA components to return.
#' @param spca Logical. If \code{TRUE}, perform standardized PCA. Corresponds to centered and scaled input image. This is usually beneficial for equal weighting of all layers. (\code{FALSE} by default)
#' @param maskCheck Logical. Masks all pixels which have at least one NA (default TRUE is reccomended but introduces a slow-down, see Details when it is wise to disable maskCheck).
#' Takes effect only if nSamples is NULL.
#' @param ... further arguments to be passed to \link[terra]{writeRaster}, e.g. filename.
#' @return Returns a named list containing the PCA model object ($model) and the SpatRaster with the principal component layers ($map).
#' @details
#' This code is largely from the rasterPCA function from the [RSToolbox package](https://bleutner.github.io/RStoolbox/), but has been adapted to work with the terra package.
#' @export
#' @examples
#'

rasterPCA <- function(x, nSamples = NULL, nComp = nlyr(x), spca = FALSE,  maskCheck = TRUE, ...){
  if(nlyr(x) <= 1) stop("Need at least two layers to calculate PCA.")
  ellip <- list(...)

  if(nComp > nlyr(x)) nComp <- nlyr(x)

  if(!is.null(nSamples)){
    trainData <- spatSample(x, size = nSamples, method="random", replace=FALSE, na.rm = TRUE)
    if(nrow(trainData) < nlyr(x)) stop("nSamples too small or x contains a layer with NAs only")
    model <- princomp(trainData, scores = FALSE, cor = spca)
  } else {
    if(maskCheck) {
      totalMask <- !sum(app(x, is.na))
      if(global(totalMask, sum)$sum == 0) stop("x contains either a layer with NAs only or no single pixel with valid values across all layers")
      x <- mask(x, totalMask , maskvalue = 0) ## NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite
    }
    covMat <- layerCor(x, fun = "cov", na.rm = TRUE)
    model  <- princomp(covmat = covMat[[1]], cor=spca)
    model$center <- diag(covMat$mean)
    model$n.obs  <- global(!any(is.na(x)), sum)$sum
    if(spca) {
      ## Calculate scale as population sd like in in princomp
      S <- diag(covMat$covariance)
      model$scale <- sqrt(S * (model$n.obs-1)/model$n.obs)
    }
  }
  ## Predict
  out<- terra::predict(x, model = model, na.rm = TRUE, index = 1:nComp, wrArgs = ellip)
  names(out) <- paste0("PC", 1:nComp)
  return(structure(list(call = match.call(), model = model, map = out)))
}
@ailich
Copy link
Contributor Author

ailich commented Nov 30, 2023

Here's an example of how it could be used

library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.62

rasterPCA <- function(x, nSamples = NULL, nComp = nlyr(x), spca = FALSE,  maskCheck = TRUE, ...){
  if(nlyr(x) <= 1) stop("Need at least two layers to calculate PCA.")
  ellip <- list(...)
  
  if(nComp > nlyr(x)) nComp <- nlyr(x)
  
  if(!is.null(nSamples)){
    trainData <- spatSample(x, size = nSamples, method="random", replace=FALSE, na.rm = TRUE)
    if(nrow(trainData) < nlyr(x)) stop("nSamples too small or x contains a layer with NAs only")
    model <- princomp(trainData, scores = FALSE, cor = spca)
  } else {
    if(maskCheck) {
      totalMask <- !sum(app(x, is.na))
      if(global(totalMask, sum)$sum == 0) stop("x contains either a layer with NAs only or no single pixel with valid values across all layers")
      x <- mask(x, totalMask , maskvalue = 0) ## NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite
    }
    covMat <- layerCor(x, fun = "cov", na.rm = TRUE)
    model  <- princomp(covmat = covMat[[1]], cor=spca)
    model$center <- diag(covMat$mean)
    model$n.obs  <- global(!any(is.na(x)), sum)$sum
    if(spca) {
      ## Calculate scale as population sd like in in princomp
      S <- diag(covMat$covariance)
      model$scale <- sqrt(S * (model$n.obs-1)/model$n.obs)
    }
  }
  ## Predict
  out<- terra::predict(x, model = model, na.rm = TRUE, index = 1:nComp, wrArgs = ellip)
  names(out) <- paste0("PC", 1:nComp)
  return(structure(list(call = match.call(), model = model, map = out)))
}


#Create Data
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
values(r)[sample(1:(ncell(r)*nlyr(r)), size = 1000, replace = FALSE)]<- NA #Add NA's

plot(r)

PCA<- rasterPCA(r, spca = TRUE)
summary(PCA$model)
#> Importance of components:
#>                          Comp.1   Comp.2    Comp.3    Comp.4    Comp.5
#> Standard deviation     1.065035 1.060924 1.0353062 1.0232500 1.0026015
#> Proportion of Variance 0.113430 0.112556 0.1071859 0.1047041 0.1005210
#> Cumulative Proportion  0.113430 0.225986 0.3331719 0.4378759 0.5383969
#>                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
#> Standard deviation     0.9866337 0.98439181 0.96219138 0.95073612 0.91861112
#> Proportion of Variance 0.0973446 0.09690272 0.09258123 0.09038992 0.08438464
#> Cumulative Proportion  0.6357415 0.73264422 0.82522545 0.91561536 1.00000000
plot(PCA$map)

Created on 2023-11-30 with reprex v2.0.2

@rhijmans
Copy link
Member

rhijmans commented Nov 30, 2023

I am not opposed but not convinced either given the simplicity of this example in ?terra::predict

library(terra)
logo <- rast(system.file("ex/logo.tif", package="terra"))   
### principal components of a SpatRaster
pca <- princomp(logo)

# or use sampling if you have a large raster 
# and cannot process all cell values
sr <- spatSample(logo, 100000, "regular")
pca <- prcomp(sr)

x <- predict(logo, pca)
plot(x)

@ailich
Copy link
Contributor Author

ailich commented Nov 30, 2023

@rhijmans, I think this method doesn't require sampling for large rasters because it calls layerCor and then works off the covariance matrix. That is unless layerCor needs to read in the whole dataset to memory.

rhijmans added a commit that referenced this issue Dec 1, 2023
@rhijmans
Copy link
Member

rhijmans commented Dec 1, 2023

I added a princomp<SpatRaster> method. Please have a look. I think it is best to have model building and prediction in separate steps as one might want to inspect the model and because this is standard R practice.

rhijmans added a commit that referenced this issue Dec 1, 2023
@ailich
Copy link
Contributor Author

ailich commented Dec 1, 2023

Thanks @rhijmans, I think splitting it up into model and prediction makes sense and like the idea of adding a dedicated princomp method in terra. I do have a few comments though.

  1. I think the call needs to stats::princomp needs to have cor=cor to link to what the user specified, and it should have ... for other arguments that can be passed to it. Since currently that line only says model <- princomp(covmat = xcov$covariance), cor and other user specified arguments will not be passed along.
  2. I believe NA values need to be masked from all layers prior to calculating the covariance matrix with LayerCor. In RSToolbox the comment said "NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite"
  3. Assign $n.obs in model output (should be number of non-NA cells)
library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.62

rasterPCA <- function(x, nSamples = NULL, nComp = nlyr(x), spca = FALSE,  maskCheck = TRUE, ...){
  if(nlyr(x) <= 1) stop("Need at least two layers to calculate PCA.")
  ellip <- list(...)
  
  if(nComp > nlyr(x)) nComp <- nlyr(x)
  
  if(!is.null(nSamples)){
    trainData <- spatSample(x, size = nSamples, method="random", replace=FALSE, na.rm = TRUE)
    if(nrow(trainData) < nlyr(x)) stop("nSamples too small or x contains a layer with NAs only")
    model <- princomp(trainData, scores = FALSE, cor = spca)
  } else {
    if(maskCheck) {
      totalMask <- !sum(app(x, is.na))
      if(global(totalMask, sum)$sum == 0) stop("x contains either a layer with NAs only or no single pixel with valid values across all layers")
      x <- mask(x, totalMask , maskvalue = 0) ## NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite
    }
    covMat <- layerCor(x, fun = "cov", na.rm = TRUE)
    model  <- princomp(covmat = covMat[[1]], cor=spca)
    model$center <- diag(covMat$mean)
    model$n.obs  <- global(!any(is.na(x)), sum)$sum
    if(spca) {
      ## Calculate scale as population sd like in in princomp
      S <- diag(covMat$covariance)
      model$scale <- sqrt(S * (model$n.obs-1)/model$n.obs)
    }
  }
  ## Predict
  out<- terra::predict(x, model = model, na.rm = TRUE, index = 1:nComp, wrArgs = ellip)
  names(out) <- paste0("PC", 1:nComp)
  return(structure(list(call = match.call(), model = model, map = out)))
}


princomp_edit<- function(x, cor=FALSE, ...) {
            stopifnot(nlyr(x) > 1)
            xcov <- layerCor(x, fun="cov", na.rm=TRUE)
            model <- princomp(covmat = xcov$covariance, cor=cor, ...)
            model$center <- diag(xcov$mean)
            if (cor) {
              ## Calculate scale as population sd like in in princomp
              S <- diag(xcov$covariance)
              n <- diag(xcov$n)
              model$scale <- sqrt(S * (n-1) / n)
            }
            model
          }

#Create Data
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
values(r)[sample(1:(ncell(r)*nlyr(r)), size = 1000, replace = FALSE)]<- NA #Add NA's

r_df<- as.data.frame(r)
r_df<- na.omit(r_df) # If you do not omit NA's from the dataframe, you get Error in cov.wt(z) : 'x' must contain finite values only

# PCA cor=FALSE
m1<- princomp(r_df) # This is on a dataframe so it serves as a reference to the correct output
m1$n.obs
#> [1] 859
m1$sdev
#>    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
#> 1.0826671 1.0704703 1.0429269 1.0255518 1.0028366 0.9855091 0.9820035 0.9674348 
#>    Comp.9   Comp.10 
#> 0.9487587 0.9176803

m2<- rasterPCA(r)$model
m2$n.obs
#> [1] 859
m2$sdev
#>    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
#> 1.0832979 1.0710939 1.0435344 1.0261493 1.0034208 0.9860832 0.9825756 0.9679984 
#>    Comp.9   Comp.10 
#> 0.9493114 0.9182149

m3<- terra::princomp(r)
m3$n.obs
#> [1] NA
m3$sdev
#>    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
#> 1.0745181 1.0494377 1.0409172 1.0199349 1.0037053 1.0022756 0.9830014 0.9800315 
#>    Comp.9   Comp.10 
#> 0.9710888 0.9527138

# PCA cor=TRUE
m4<- terra::princomp(r, cor=TRUE)
m4$sdev==m3$sdev #cor= TRUE did not change result
#>  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 
#>    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE

m5<- princomp_edit(r, cor=TRUE)
m5$sdev==m3$sdev #Edited version properly passes along cor argument
#>  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 
#>   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE

Created on 2023-12-01 with reprex v2.0.2

@rhijmans
Copy link
Member

rhijmans commented Dec 1, 2023

I think the [call needs to stats::princomp] needs to have cor=cor

Yes, thanks, fixed that.

it should have ... for other arguments

The only other meaningful argument is fix_sign, so I have added that explicitly.

I believe NA values need to be masked from all layers prior to calculating the covariance matrix with LayerCor.

I do not think that is strictly necessary. Our best estimate of the covariance matrix would use all pairs of data that are not NA. We can add a note in the manual that you can do mask(x, anyNA(x), maskvalue=TRUE) if you want the same number of observations for all layers.

It may affect the computation of the scale (if cov=TRUE) as that uses (n-1) / n. In most cases n is very large with raster data so I am not worried about that. But again that can be pointed out.

Assign $n.obs in model output (should be number of non-NA cells)

I put it back in. However, there is not one n as it can vary between layers. So you now get a "dist" object that shows that.

@ailich
Copy link
Contributor Author

ailich commented Dec 5, 2023

Thank you!

@ailich ailich closed this as completed Dec 5, 2023
@ailich
Copy link
Contributor Author

ailich commented Dec 5, 2023

@rhijmans, although RSToolbox used the population standard deviation to calculate scale, on further investigation that may have not been the correct choice (See example and discussion here). That being said though, as you mentioned rasters tend to have a lot of observations, so it likely doesn't make much difference.

@rhijmans
Copy link
Member

rhijmans commented Dec 6, 2023

I would prefer to use model$scale <- sqrt(S) over model$scale <- sqrt(S * (n-1) / n) because the value of n is ambiguous if the number of non-NA cells varies.

The population covariance can be computed with asSample=FALSE.

xcov <- layerCor(x, fun="cov", na.rm=TRUE, asSample=FALSE)

I think it generally makes more sense to to treat raster data as the population and not as a sample. It could also be an option that may lead to oh so small differences for small rasters.

Anyway, this is confusing stuff, and I would appreciate your guidance.

@ailich
Copy link
Contributor Author

ailich commented Dec 6, 2023

@rhijmans, since the "scale" should be the standard deviation and the S (the diagonal of the covariance matrix) is the variance, the square root of that would be appropriate. The pairwise n's must still be used to calculate the covariance matrix so it's not really circumventing that, but we it does avoid multiplying by (n-1)/n which was just there to convert the sample standard deviation to the population standard deviation since

$$\text{Cov Population (x,y)}=\frac{\sum{(x_{i}-\bar{x})(y_{i}-\bar{y}})}{N}$$

$$\text{Cov Sample(x,y)}=\frac{\sum{(x_{i}-\bar{x})(y_{i}-\bar{y}})}{N-1}$$

That being said, I think it is slightly better as you suggested to use asSample=FALSE and scale=sqrt(S) since the covariance matrix and the scale are both calculated with the population formula rather than having a covariance matrix based on the sample formula and a scale based on the population formula.

I am unsure however why results of layerCor(asSample=TRUE) multiplied by (n-1)/n doesn't more closely match the results of layerCor(asSample=FALSE)

library(terra)
#> terra 1.7.62

cov_pop <- function(data) {
  # Calculate the means of each variable
  means <- colMeans(data)
  # Number of observations
  n_obs <- nrow(data)
  # Number of variables
  n_vars <- ncol(data)
  # Initialize an empty covariance matrix
  covariance_matrix <- matrix(0, nrow = n_vars, ncol = n_vars)
  # Calculate the covariance for each pair of variables
  for (i in 1:n_vars) {
    for (j in 1:n_vars) {
      covariance_matrix[i, j] <- sum((data[, i] - means[i]) * (data[, j] - means[j])) / n_obs
    }
  }
  return(covariance_matrix)
}

r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
r_df<- as.data.frame(r)
n<- ncell(r) #No NA's in this example

cov_samp_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSamle=TRUE)
cov_pop_terra<-  layerCor(r, fun="cov", na.rm=TRUE, asSamle=FALSE)

cov_samp_R<- cov(r_df)
cov_pop_R<- cov_pop(r_df)

# Terra has differences on order of e-05
max((cov_samp_terra$covariance*(n-1)/n)-cov_pop_terra$covariance)
#> [1] 3.426207e-05

# R has differences on order of e-16
max((cov_samp_R*(n-1)/n)-cov_pop_R)
#> [1] 2.220446e-16

Created on 2023-12-06 with reprex v2.0.2

@rhijmans
Copy link
Member

rhijmans commented Dec 6, 2023

b <- rast(system.file("ex/logo.tif", package="terra"))   
layerCor(b, "pearson")

x <- layerCor(b, "cov")
y <- layerCor(b, "cov", asSample=F)

x[["covariance"]] * (x[["n"]]-1)/x[["n"]] == y[["covariance"]]
#       red green blue
#red   TRUE  TRUE TRUE
#green TRUE  TRUE TRUE
#blue  TRUE  TRUE TRUE

or with your approach using asSample (and not asSamle ;)

cov_pop_terra<-  layerCor(r, fun="cov", na.rm=TRUE, asSample=FALSE)
cov_samp_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSample=TRUE)
max((cov_samp_terra$covariance*(n-1)/n)-cov_pop_terra$covariance)
#[1] 2.220446e-16

rhijmans added a commit that referenced this issue Dec 6, 2023
@rhijmans
Copy link
Member

rhijmans commented Dec 6, 2023

but it does avoid multiplying by (n-1)/n

Yes, that is the n I was thinking about. In the covariance computation, n is computed for each pair.

@ailich
Copy link
Contributor Author

ailich commented Dec 6, 2023

Oh that was dumb of me 🤦‍♂️🤦‍♂️🤦‍♂️. Running it again, it all looks good! Since you created the n matrix they're both pairwise and the scale will come out the same. I think it's cleaner and more consistent though to use asSample=FALSE and $scale<- sqrt(S) as you mentioned.

library(terra)
#> terra 1.7.62

#Create Data
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
values(r)[sample(1:(ncell(r)*nlyr(r)), size = 1000, replace = FALSE)]<- NA #Add NA's

cov_samp_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSample=TRUE)
cov_pop_terra<-  layerCor(r, fun="cov", na.rm=TRUE, asSample=FALSE)

n_mat<- cov_samp_terra$n
n<- diag(n_mat)
max((cov_samp_terra$covariance*(n_mat-1)/n_mat) - cov_pop_terra$covariance)
#> [1] 6.938894e-18

n # n is a vector of pairwise sample sizes corresponding to the length of S
#>  [1] 1497 1497 1509 1496 1493 1489 1506 1505 1507 1501
S1<- diag(cov_samp_terra$covariance)
scale1<- sqrt(S1*(n-1)/n)
S2<- diag(cov_pop_terra$covariance)
scale2<- sqrt(S2)

scale1==scale2
#>  lyr.1  lyr.2  lyr.3  lyr.4  lyr.5  lyr.6  lyr.7  lyr.8  lyr.9 lyr.10 
#>   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE

Created on 2023-12-06 with reprex v2.0.2

@ailich
Copy link
Contributor Author

ailich commented Dec 6, 2023

Thanks so much for adding this!

@rhijmans
Copy link
Member

rhijmans commented Dec 6, 2023

Thank you!

I have changed argument na.rm=TRUE/FALSE to use="everything"/"complete.obs"/ "pairwise.complete.obs"/"masked.obs"

@ailich
Copy link
Contributor Author

ailich commented Dec 7, 2023

@rhijmans, an added bonus is I believe this also fixed a bug in layerCor when calculating "cov" with maxcell specified. In the latest commit I get approximately the same results whereas in the CRAN and a few commits ago I get half the covariance if sampling half the cells so it seems like the number of cells rather than the sample size was used.

Older Version

library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.62
r<- rast(nrow=40, ncol=40, nlyrs=2)
set.seed(5)
values(r[[1]])<- rnorm(ncell(r))
values(r[[2]])<- values(r[[1]])*2+rnorm(ncell(r))

cov_full<- layerCor(r, fun="cov", maxcell=Inf)
cov_half<- layerCor(r, fun="cov", maxcell=ncell(r)/2)

cov_half$covariance
#>          lyr.1    lyr.2
#> lyr.1 0.541334 1.058303
#> lyr.2 1.058303 2.621351
cov_full$covariance
#>          lyr.1    lyr.2
#> lyr.1 1.019472 2.029023
#> lyr.2 2.029023 5.057638

cov_half$covariance/cov_full$covariance
#>           lyr.1     lyr.2
#> lyr.1 0.5309946 0.5215826
#> lyr.2 0.5215826 0.5182955

df_half<- spatSample(r, 800, method="regular")
cov(df_half)
#>          lyr.1    lyr.2
#> lyr.1 1.030468 2.014555
#> lyr.2 2.014555 4.989929

Created on 2023-12-07 with reprex v2.0.2

Latest Commit

library(terra)
#> terra 1.7.62
r<- rast(nrow=40, ncol=40, nlyrs=2)
set.seed(5)
values(r[[1]])<- rnorm(ncell(r))
values(r[[2]])<- values(r[[1]])*2+rnorm(ncell(r))

cov_full<- layerCor(r, fun="cov", maxcell=Inf)
cov_half<- layerCor(r, fun="cov", maxcell=ncell(r)/2)

cov_half$covariance
#>          lyr.1    lyr.2
#> lyr.1 1.030468 2.014555
#> lyr.2 2.014555 4.989929
cov_full$covariance
#>          lyr.1    lyr.2
#> lyr.1 1.019472 2.029023
#> lyr.2 2.029023 5.057638

cov_half$covariance/cov_full$covariance
#>           lyr.1     lyr.2
#> lyr.1 1.0107862 0.9928697
#> lyr.2 0.9928697 0.9866125

df_half<- spatSample(r, 800, method="regular")
cov(df_half)
#>          lyr.1    lyr.2
#> lyr.1 1.030468 2.014555
#> lyr.2 2.014555 4.989929

Created on 2023-12-07 with reprex v2.0.2

@ailich
Copy link
Contributor Author

ailich commented Dec 7, 2023

Since "cov" and maxcell seems to be working, the only extra thing I'd add to this function is allowing a specification of maxcell in princomp. By default that would be maxcell=Inf but could be changed. The only thing that would need to be done is adding a maxcell argument to princomp and passing that to layerCor via model <- princomp(covmat=xcov$covariance, cor=cor, fix_sign=fix_sign, maxcell=maxcell).

@rhijmans
Copy link
Member

rhijmans commented Dec 8, 2023

I did not include the maxcell argument because I figured that you might as well use princomp<data.frame> at that point; but I have added it now anyway.

It is interesting to see how few you cells you need with this (unrealistically simple) example

library(terra)
f <- system.file("ex/logo.tif", package = "terra")
r <- rast(f)
pca <- princomp(r)
pca100 <- princomp(r, maxcell=100)
plot(predict(r, pca), predict(r, pca100))

@Nowosad
Copy link
Contributor

Nowosad commented Dec 18, 2023

@rhijmans, that for adding this function -- I think it will be helpful for many.

Do you also plan to add some updates to prcomp? Currently, princomp works for data with NAs, but prcomp not:

library(terra)
#> terra 1.7.65
f <- system.file("ex/logo.tif", package = "terra")
r <- rast(f)
r[100] <- NA

# works
pca <- princomp(r)

# fails (due to NA's)
pca2 <- prcomp(r)
#> Error in svd(x, nu = 0, nv = k): infinite or missing values in 'x'

From the princomp help file (?princomp), it appears that the prcomp should be preferred, in general:

"The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp."

@ailich
Copy link
Contributor Author

ailich commented Dec 18, 2023

@Nowosad, the point about prcomp vs princomp was raised in bleutner/RStoolbox#49 where it was stated that "prcomp is not a suitable choice for large remote sensing data sets as it operates on X as opposed to princomp which operates on the correlation or covariance matrix, both of which can be computed in a piecewise fashion. I.e. for prcomp you'd have to load all data into memory at once, while for princomp you can first compute the covariance matrix and the run eigen on it, which is not only faster, but more importantly memory-safe. In the end the results of both methods are equal. While numerical differences can exist, with SVD being more accurate, this is generally negligible compared to the measurement uncertainty inherent in typical remote sensing data."

@rhijmans
Copy link
Member

rhijmans commented Dec 18, 2023

If there are no NAs in SpatRaster x you can do

prcomp(x) 

Otherwise you need something like

maxcell <- Inf
v <- na.omit(spatSample(x, maxcell, "regular"))
prcomp(v)

I could add a prcomp method that implements that to avoid that people trip over the NA issue.

@ailich
Copy link
Contributor Author

ailich commented Dec 19, 2023

@rhijmans @Nowosad, it could be useful to add a prcomp method to make it a bit smoother for those who want to use that version of PCA. Seems like SpatRasters work with prcomp by default since objects are automatically coerced to a matrix in prcomp.

rhijmans added a commit that referenced this issue Dec 20, 2023
@Nowosad
Copy link
Contributor

Nowosad commented Dec 20, 2023

@ailich thanks for the explanation -- I was suspecting something like this, but was unable to find any discussion on the topic. And yes -- my thinking was to make prcomp and princomp more consistent (e.g., both allowing for NA values in input rasters and both having the maxcell argument). I believe this was done a few hours ago by @rhijmans -- thank you a lot. I plan to test it in the coming days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants