Helper function for creating file targets with multiple files #257

Closed
tiernanmartin opened this issue Feb 16, 2018 · 14 comments

Comments

@tiernanmartin
Contributor

I have a command that creates multiple files every time it runs. The command writes a spatial object in the shapefile format, which creates four files:

st_write(spatial_data,"spatial_data.shp", driver = "ESRI Shapefile")

## Creates:
##   - "spatial_data.shp"
##   - "spatial_data.shx"
##   - "spatial_data.prj"
##   - "spatial_data.dbf"

All shapefiles need these four file types in order to work properly (actually, they need 3 of the 4 but that's irrelevant to this example).

This creates issues for any plan that includes this st_write() command. For instance, if I have the following plan:

plan <- drake_plan( 
    'spatial_data.shp' = st_write(spatial_data,"spatial_data.shp"),
    file_targets = TRUE,
    strings_in_dots = "literals"
  ) 

The plan tracks only one of the four necessary files (spatial_data.shp). If I delete any of the untracked files and re-run the plan, drake reports that all targets are already up to date.
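To make the gap concrete, here is a minimal base-R sketch that lists the component files of a shapefile and reports which ones are missing on disk. The helpers `shapefile_parts()` and `missing_parts()` are hypothetical names made up for illustration; they are not part of drake or sf, and the default extension list is just the four files named above.

```r
# Hypothetical helper (not part of drake or sf): given the path to a .shp
# file, return the paths of the files that make up the shapefile.
shapefile_parts <- function(shp, extensions = c("shp", "shx", "prj", "dbf")) {
  stem <- tools::file_path_sans_ext(shp)
  paste(stem, extensions, sep = ".")
}

# Hypothetical helper: return the component files that do not exist on disk.
missing_parts <- function(shp) {
  parts <- shapefile_parts(shp)
  parts[!file.exists(parts)]
}

shapefile_parts("spatial_data.shp")
```

A check like `length(missing_parts("spatial_data.shp")) == 0` could run before `make()`, but it only detects the problem; it does not make drake rebuild the target.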

Could there be a function that allows users to create a list of file targets that come from a single command?

spatial_data_files <- drake::file_target_list('spatial_data.shp', 'spatial_data.shx', 'spatial_data.dbf', 'spatial_data.prj')  # proposed function

plan <- drake_plan( 
    spatial_data_files = st_write(spatial_data,"spatial_data.shp"),
    file_targets = TRUE,
    strings_in_dots = "literals"
  ) 

Thanks!

@wlandau
Member

wlandau commented Feb 16, 2018

@tiernanmartin It's a good point. I was actually trying to solve this sort of problem once and for all in #232, but as I explain here, it is extremely difficult to make drake break the one-file-per-target rule. For now, you can use wildcard templating to make the non-.shp files depend on spatial_data.shp.

EDIT: 2018-02-24

I modified the next bit to be FAQ-friendly. The "first solution" that @tiernanmartin refers to next is actually the solution for drake <= 5.0.0.

Solution for drake > 5.0.0

library(drake)
library(magrittr)
drake_plan(
  st_write(spatial_data, file_out("spatial_data.shp"), driver = "ESRI Shapefile"),
  c(file_out("spatial_data.EXTN"), file_in("spatial_data.shp"))
) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
#> # A tibble: 4 x 2
#>   target                 command                                          
#>   <chr>                  <chr>                                            
#> 1 "\"spatial_data.shp\"" "st_write(spatial_data, file_out(\"spatial_data.…
#> 2 "\"spatial_data.shx\"" "c(file_out(\"spatial_data.shx\"), file_in(\"spa…
#> 3 "\"spatial_data.prj\"" "c(file_out(\"spatial_data.prj\"), file_in(\"spa…
#> 4 "\"spatial_data.dbf\"" "c(file_out(\"spatial_data.dbf\"), file_in(\"spa…

Solution for drake <= 5.0.0

library(drake)
library(magrittr)
plan <- drake_plan(list = c(
  spatial_data.shp = "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI Shapefile\")",
  spatial_data = "c(\"spatial_data.EXTN\", 'spatial_data.shp')"
)) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
plan$target <- drake_quotes(plan$target, single = TRUE)
plan

## # A tibble: 4 x 2
##   target             command                                                       
##   <chr>              <chr>                                                         
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"                 
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"                 
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"  

@tiernanmartin
Contributor Author

Thanks for the explanation and for pointing me toward the wildcard feature. I'm looking forward to #232 getting merged into the master branch!

In the meantime, I'll use the first approach you recommended. Quick question: I notice that the file targets in the first solution you demonstrated lack file extensions (e.g., 'spatial_data_prj' instead of 'spatial_data.prj'):

## # A tibble: 4 x 2
##   target             command                                                       
##   <chr>              <chr>                                                         
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"                 
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"                 
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"  

Is there a way to convert those _'s into .'s in the evaluate_plan() call, or do I need to figure out a post-processing step?

@wlandau
Member

wlandau commented Feb 16, 2018

Sorry, I forgot about the way drake automatically uses underscore-delimited suffixes. For now, you'll probably have to do plan$target <- gsub("data_", "data\\.", plan$target). Unless I'm overruled in #232, you won't have to worry about setting the target column yourself when it comes to file outputs.
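As a quick illustration of that post-processing step, here is a minimal sketch on a plain character vector rather than a real plan. The vector `targets` is an assumed stand-in for plan$target, and this variant passes fixed = TRUE so that both strings are treated literally and no escaping is needed.

```r
# Assumed stand-in for plan$target after evaluate_plan() and drake_quotes().
targets <- c("'spatial_data.shp'", "'spatial_data_shx'", "'spatial_data_prj'")

# Replace the underscore-delimited suffix with a real file extension.
# fixed = TRUE matches "data_" literally instead of as a regular expression.
fixed_targets <- gsub("data_", "data.", targets, fixed = TRUE)
fixed_targets
```

The .shp target is left alone because it contains no "data_" substring. The pattern is tied to this particular file stem, so a different stem would need a different pattern.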

@tiernanmartin
Contributor Author

Is there a reason why a target cannot be a directory?

In this example, I realized that the command st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile") will create a directory that contains the four files:

spatial_data/ 
    ├── spatial_data.dbf
    ├── spatial_data.prj
    ├── spatial_data.shp
    └── spatial_data.shx

Having drake track the spatial_data directory rather than the individual files it contains would be nice because it eliminates the need for a many-to-one relationship between the plan's targets and command.

But when I tried implementing it I see the following error:

## Error: The specified pathname is not a file: spatial_data

Reprex
library(drake)
library(sf) 

spatial_data <- st_read(system.file("shape/nc.shp", package = "sf")) 

plan <- drake_plan(
  spatial_data = st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile"),
  strings_in_dots = "literals",
  file_targets = TRUE
)

plan
## # A tibble: 1 x 2
##   target         command                                                  
##   <chr>          <chr>                                                    
## 1 'spatial_data' "st_write(spatial_data, \"spatial_data\", driver = \"ESR~

make(plan)
## cache C:/Users/UrbanDesigner/AppData/Local/Temp/Rtmp40h7BI/.drake
## connect 2 imports: plan, spatial_data
## connect 1 target: 'spatial_data'
## check 2 items: spatial_data, st_write
## check 1 item: 'spatial_data'
## target 'spatial_data'
## Writing layer `spatial_data' to data source `spatial_data' using driver `ESRI Shapefile'
## features:       100
## fields:         14
## geometry type:  Multi Polygon
## Error: The specified pathname is not a file: spatial_data

@wlandau
Member

wlandau commented Feb 19, 2018

I believe the "specified pathname" error does not actually come from drake. Last time I checked, directories were not safe to use as file targets, and the standard file hashing tools refuse to hash them.

$ md5sum file.csv
6463474bfe6973a81dc7cbc4a71e8dd1  file.csv
$ md5sum ~/projects
md5sum: projects: Is a directory
$ man md5sum # has no recursive option

Drake uses the digest package to hash files, and digest avoids hashing directories too.

> library(digest)
> digest("~/projects/", file = TRUE)
Error: The specified pathname is not a file: /home/landau/projects/
> file.exists("~/projects")
[1] TRUE

Directory targets would be nice to have, but I do not think it is drake's responsibility to figure out how to hash them quickly and efficiently. It's a thorny problem, maybe for a separate package, maybe called dirgest.
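For readers wondering what such a tool might look like, here is a rough base-R sketch, not anything drake or digest provides. The function name `hash_directory()` is hypothetical. It hashes every file with tools::md5sum(), writes a manifest of hashes and relative paths, and hashes the manifest. It makes no attempt to be fast on large directories, which is exactly the hard part mentioned above.

```r
# Hypothetical helper (not part of drake or digest): compute a single MD5
# fingerprint for a directory from its files' contents and relative paths.
hash_directory <- function(dir) {
  stopifnot(dir.exists(dir))
  # Relative paths of all files, sorted in a locale-independent order so the
  # result does not depend on the order the file system returns them in.
  rel <- sort(list.files(dir, recursive = TRUE, all.files = TRUE),
              method = "radix")
  sums <- unname(tools::md5sum(file.path(dir, rel)))
  # Write one "hash path" line per file, then hash that manifest.
  manifest <- tempfile()
  on.exit(unlink(manifest))
  writeLines(paste(sums, rel), manifest)
  unname(tools::md5sum(manifest))
}

# Usage (assumes the directory exists):
# hash_directory("spatial_data")
```

Because the manifest includes relative paths, renaming a file changes the fingerprint even if its contents stay the same. Empty subdirectories are ignored.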

@wlandau wlandau closed this as completed Feb 19, 2018
@wlandau
Member

wlandau commented Feb 19, 2018

Oops: forgot this issue was about more than just directory hashes. Reopening.

@wlandau
Member

wlandau commented Feb 24, 2018

I just updated #257 (comment) to be more FAQ-friendly, and this thread is now part of our automatically-generated FAQ. I think we can close. We should discuss potential further development on #12.

@wlandau
Member

wlandau commented May 2, 2018

FYI: the best practices guide now has detailed guidance on output file targets, including the main drawback and main alternative to the workaround we talked about earlier in the thread.

@tiernanmartin
Contributor Author

@wlandau awesome work implementing this feature 🎉

You asked for a shapefile workflow so I did my best to put something together:

Drake Shapefile Example
# SETUP -------------------------------------------------------------------


library(tibble)
library(purrr)
library(sf)
library(drake) # devtools::install_github("ropensci/drake")



# PLAN --------------------------------------------------------------------


make_place <- function(Name, Latitude, Longitude){
  tibble(Name = Name,
                    Latitude = Latitude,
                    Longitude = Longitude) %>% 
  st_as_sf(coords = c("Longitude", "Latitude")) %>% 
  st_set_crs(4326)
}
  

st_write_multiple <- function(..., file_outputs){ 
  pwalk(list(...), st_write)
}

u_auckland_plan <- drake_plan(u_auckland = make_place(Name = "University of Auckland", Latitude = -36.8521369, Longitude = 174.7688785),
                              u_aukland_shapefile = st_write_multiple(list(u_auckland), dsn = file_out("u-auckland.shp"), driver = "ESRI Shapefile",delete_dsn=TRUE, 
                                                file_outputs = file_out(c("u-auckland.prj","u-auckland.shx","u-auckland.dbf"))),
                              strings_in_dots = "literals")

u_auckland_plan

make(u_auckland_plan)

# TEST --------------------------------------------------------------------


file.remove("u-auckland.shp")

make(u_auckland_plan)


file.remove("u-auckland.prj")

make(u_auckland_plan)

file.remove("u-auckland.dbf")

make(u_auckland_plan)

It is more complicated than I hoped it would be. The drake side works perfectly, but I needed to write a wrapper function around st_write() to allow the extra file outputs to be tracked.

Perhaps someone else who is experimenting with using drake to manage a spatial data workflow can offer a simpler example? cc: @noamross @krlmlr @pat-s

@wlandau
Member

wlandau commented Jul 16, 2018

@tiernanmartin Thanks for the quick start on this example! I am optimistic. drake commands can be arbitrary multi-line code chunks, so I do not think we need the wrapper around st_write(). What about this plan?

library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_aukland_shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)

I am not sure my changes are totally correct because make(u_auckland_plan) gives warnings:

Warning: target u_aukland_shapefile warnings:
  GDAL Error 1: u-auckland.shp does not appear to be a file or directory.

On the other hand, the four files do appear, including u-auckland.shp.

@tiernanmartin
Contributor Author

Ah, I didn't realize that commands were so flexible.

Your code looks good to me! The GDAL error is annoying but not a deal breaker. It shows up because I set delete_dsn = TRUE, which makes GDAL try to delete an existing file before it creates the replacement. The default setting of delete_dsn = FALSE will not allow the files to be overwritten.

@tiernanmartin
Contributor Author

I just noticed there is a typo - here's the complete version with your suggested revisions:

library(tibble)
library(magrittr) # provides %>%, which make_place() uses
library(sf)
library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")

make_place <- function(Name, Latitude, Longitude){
  tibble(Name = Name,
                    Latitude = Latitude,
                    Longitude = Longitude) %>% 
  st_as_sf(coords = c("Longitude", "Latitude")) %>% 
  st_set_crs(4326)
}


u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_auckland_shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)

make(u_auckland_plan)

file.remove("u-auckland.shp")

make(u_auckland_plan)


file.remove("u-auckland.prj")

make(u_auckland_plan)

file.remove("u-auckland.dbf")

make(u_auckland_plan)

@wlandau
Member

wlandau commented Jul 19, 2018

Thanks, @tiernanmartin. This is nice inspiration for a chapter in the docs.

@wlandau
Member

wlandau commented Mar 22, 2019

FYI: effective #795, you can write st_write(spatial_data, file_out("spatial_data"), driver = "ESRI Shapefile") and drake will track the entire directory of output files.
