Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

as.data.frame defaults to na.rm = T for spatraster with many layers #792

Closed
ECarnell opened this issue Sep 5, 2022 · 8 comments
Closed

Comments

@ECarnell
Copy link

ECarnell commented Sep 5, 2022

The argument na.rm now defaults to TRUE in the as.data.frame function.

While this makes sense for single layer rasters its really unhelpful for multi-band rasters.

This is also a change from the previous raster package, where the default for na.rm was FALSE

require(terra)
r <- rast(ext = ext(c(xmin = 0, xmax = 2, ymin = 0, ymax = 2)),  res = 1)
r1 <- setValues(r, c(1, rep(NA,3)))
r2 <- setValues(r, c(rep(NA,3),1))

## no rows
c(r1,r2) |> as.data.frame(xy = T) 
#[1] x     y     lyr.1 lyr.1
#<0 rows> (or 0-length row.names)

# all results 
c(r1,r2) |> as.data.frame(xy = T, na.rm = F) 
#    x   y lyr.1 lyr.1
#1 0.5 1.5     1    NA
#2 1.5 1.5    NA    NA
#3 0.5 0.5    NA    NA
#4 1.5 0.5    NA     1
@rhijmans
Copy link
Member

rhijmans commented Sep 5, 2022

With trepidation I changed the default to FALSE. That may cause pain in existing scripts so I am a bit on the fence about this.
I have also added a third option na.rm=NA to only remove records (cells) where all layers are NA.

c(r1,r2) |> as.data.frame(xy = T, na.rm=NA) 
#    x   y lyr.1 lyr.1
#1 0.5 1.5     1    NA
#4 1.5 0.5    NA     1

That could also be sensible default, as it would not change the behavior for a single layer SpatRaster

@kadyb
Copy link
Contributor

kadyb commented Feb 7, 2023

What is the status of this issue? In 57ab669 you changed na.rm=FALSE to na.rm=NA. Personally, I would vote to set this argument to FALSE by default.

@kadyb
Copy link
Contributor

kadyb commented Feb 12, 2023

Or if missing values are to be removed by default, perhaps there should be message that some observations have been removed from the data frame?

@rhijmans
Copy link
Member

I will check how many revdeps break with that (but of course there is other code that will break that is not in dependent packages). But whether they do or not, why you would you want records with only NAs. I suppose that would only be on the rare case that you would want to manipulate the data.frame and then recreate a SpatRaster from that. Is that necessary?

@rhijmans
Copy link
Member

To better express myself: I would hope and think that going from a SpatRaster to a data.frame and back would be rare. Also, it would be easy to detect the problem when creating the new SpatRaster as the number of rows in the data.frame won't match ncell of the SpatRaster.

@kadyb
Copy link
Contributor

kadyb commented Feb 13, 2023

I suppose that would only be on the rare case that you would want to manipulate the data.frame and then recreate a SpatRaster from that. Is that necessary?

Yes, exactly. I have some workflows where I convert the raster to data frame and then use kmeans clustering or convert to xgboost::xgb.DMatrix class for machine learning. Later I convert back the results to SpatRaster, but I need to know where the missing values were. BTW: In values() function the default argument is na.rm = FALSE.

Here is example:

library("terra")
r = rast(system.file("ex/elev.tif", package = "terra"))
df = as.data.frame(r, na.rm = FALSE)
mdl = kmeans(na.omit(df), centers = 3)
output = rep(NA, ncell(r))
output[complete.cases(df)] = mdl$cluster
clustering = rast(r, vals = output)

### if `as.data.frame(r, na.rm = NA)` there is warning
### but I didn't know why after updating terra this code stopped working
# Warning message:
# In output[complete.cases(data)] = mdl$cluster :
#   number of items to replace is not a multiple of replacement length

@Hongwuliang
Copy link

The predict function of the terra package cannot be combined with models trained with XGBoost. The raster data is extensive, and cannot be converted to a data frame, which is causing me a great deal of distress.

@rhijmans
Copy link
Member

@Hongwuliang: I think it can. See: https://stackoverflow.com/a/74330713/635245

Please create a minimal, self-contained, reproducible example and use that for a question on stackoverflow.com. This github site is not for answering this type of "how do I do this?" questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants