Helper function for creating file targets with multiple files #257

tiernanmartin · 2018-02-16T21:40:07Z

I have a command that creates multiple files every time it runs. The command writes a spatial object in the shapefile format which results in the creation of four files:

st_write(spatial_data,"spatial_data.shp", driver = "ESRI Shapefile")

## Creates:
##   - "spatial_data.shp"
##   - "spatial_data.shx"
##   - "spatial_data.prj"
##   - "spatial_data.dbf"

All shapefiles need these four file types in order to work properly (actually, they need 3 of the 4 but that's irrelevant to this example).

This creates issues for any plan that includes this st_write() command. For instance, if I have the following plan:

plan <- drake_plan( 
    'spatial_data.shp' = st_write(spatial_data,"spatial_data.shp"),
    file_targets = TRUE,
    strings_in_dots = "literals"
  )

The plan tracks only one of the four necessary file targets (spatial_data.shp), so if I delete any of the untracked files and re-run the plan, it tells me that all targets are already up to date.
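As a rough illustration of the gap described above, the companion files can be listed and checked outside of drake. This is a minimal sketch in base R; `shapefile_parts()` is a hypothetical helper invented for this example, not a drake or sf function, and the default extension list is taken from the four files named earlier in this post.

```r
# Hypothetical helper: derive the companion file names that a shapefile
# write is expected to produce from the path of the .shp file.
shapefile_parts <- function(shp_path,
                            extensions = c("shp", "shx", "prj", "dbf")) {
  stem <- tools::file_path_sans_ext(shp_path)  # drop the ".shp" extension
  paste(stem, extensions, sep = ".")
}

parts <- shapefile_parts("spatial_data.shp")
print(parts)

# Which of the expected files are absent from the working directory?
missing_parts <- parts[!file.exists(parts)]
print(missing_parts)
```

A check like this only reports missing files; it does not make drake rebuild anything, which is why tracking the extra outputs in the plan is still needed.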

Could there be a function that allows users to create a list of file targets that come from a single command?

spatial_data_files <- drake::file_target_list('nc.shp', 'nc.shx','nc.dbf','nc.prj')  # proposed function

plan <- drake_plan( 
    spatial_data_files = st_write(spatial_data,"spatial_data.shp"),
    file_targets = TRUE,
    strings_in_dots = "literals"
  )

Thanks!


wlandau · 2018-02-16T22:38:22Z

@tiernanmartin It's a good point. I was actually trying to solve this sort of problem once and for all in #232, but as I explain here, it is extremely difficult to make drake break the one-file-per-target rule. For now, you can use wildcard templating to make the non-.shp files depend on spatial_data.shp.

EDIT: 2018-02-24

I modified the next bit to be FAQ-friendly. The "first solution" that @tiernanmartin refers to next is actually the solution for drake <= 5.0.0.

Solution for `drake` > 5.0.0

library(drake)
library(magrittr)
drake_plan(
  st_write(spatial_data, file_out("spatial_data.shp"), driver = "ESRI Shapefile"),
  c(file_out("spatial_data.EXTN"), file_in("spatial_data.shp"))
) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
#> # A tibble: 4 x 2
#>   target                 command                                          
#>   <chr>                  <chr>                                            
#> 1 "\"spatial_data.shp\"" "st_write(spatial_data, file_out(\"spatial_data.…
#> 2 "\"spatial_data.shx\"" "c(file_out(\"spatial_data.shx\"), file_in(\"spa…
#> 3 "\"spatial_data.prj\"" "c(file_out(\"spatial_data.prj\"), file_in(\"spa…
#> 4 "\"spatial_data.dbf\"" "c(file_out(\"spatial_data.dbf\"), file_in(\"spa…

Solution for `drake` <= 5.0.0

library(drake)
library(magrittr)
plan <- drake_plan(list = c(
  spatial_data.shp = "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI Shapefile\")",
  spatial_data = "c(\"spatial_data.EXTN\", 'spatial_data.shp')"
)) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
plan$target <- drake_quotes(plan$target, single = TRUE)
plan

## # A tibble: 4 x 2
##   target             command                                                       
##   <chr>              <chr>                                                         
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"                 
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"                 
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"

tiernanmartin · 2018-02-16T23:20:28Z

Thanks for the explanation and for pointing me toward the wildcard feature. I'm looking forward to #232 getting merged into the master branch!

In the meantime, I'll use the first approach you recommended. Quick question: I notice that the file targets in the first solution you demonstrated lack file extensions (e.g., 'spatial_data_prj' instead of 'spatial_data.prj'):

## # A tibble: 4 x 2
##   target             command                                                       
##   <chr>              <chr>                                                         
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"                 
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"                 
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"

Is there a way to convert those _'s into .'s in the evaluate_plan() call, or do I need to figure out a post-processing step?

wlandau · 2018-02-16T23:40:11Z

Sorry, I forgot about the way drake automatically uses underscore-delimited suffixes. For now, you'll probably have to do plan$target <- gsub("data_", "data\\.", plan$target). Unless I'm overruled in #232, you won't have to worry about setting the target column yourself when it comes to file outputs.
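The substitution can be tried on a plain character vector without running drake at all. This is a minimal sketch; the `targets` vector is made up for illustration and mimics the target names printed earlier in the thread.

```r
# Hypothetical target names, shaped like the ones evaluate_plan() produced.
targets <- c("'spatial_data.shp'", "'spatial_data_shx'",
             "'spatial_data_prj'", "'spatial_data_dbf'")

# Replace the underscore that follows "data" with a dot.
# In the replacement string, "\\." comes out as a plain ".".
fixed <- gsub("data_", "data\\.", targets)
print(fixed)
```

The pattern only matches "data_", so the first element, which already ends in ".shp", is left unchanged.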

tiernanmartin · 2018-02-18T23:24:40Z

Is there a reason why a target cannot be a directory?

In this example, I realized that the command st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile") will create a directory that contains the four files:

spatial_data/ 
    ├── spatial_data.dbf
    ├── spatial_data.prj
    ├── spatial_data.shp
    └── spatial_data.shx

Having drake track the spatial_data directory rather than the individual files it contains would be nice because it eliminates the need for a many-to-one relationship between the plan's targets and command.

But when I tried implementing it, I saw the following error:

## Error: The specified pathname is not a file: spatial_data

Reprex

library(drake)
library(sf) 

spatial_data <- st_read(system.file("shape/nc.shp", package = "sf")) 

plan <- drake_plan(
  spatial_data = st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile"),
  strings_in_dots = "literals",
  file_targets = TRUE
)

plan
## # A tibble: 1 x 2
##   target         command                                                  
##   <chr>          <chr>                                                    
## 1 'spatial_data' "st_write(spatial_data, \"spatial_data\", driver = \"ESR~

make(plan)
## cache C:/Users/UrbanDesigner/AppData/Local/Temp/Rtmp40h7BI/.drake
## connect 2 imports: plan, spatial_data
## connect 1 target: 'spatial_data'
## check 2 items: spatial_data, st_write
## check 1 item: 'spatial_data'
## target 'spatial_data'
## Writing layer `spatial_data' to data source `spatial_data' using driver `ESRI Shapefile'
## features:       100
## fields:         14
## geometry type:  Multi Polygon
## Error: The specified pathname is not a file: spatial_data

wlandau · 2018-02-19T01:19:52Z

I believe the "specified pathname" error does not actually come from drake. Last time I checked, directories were not safe as file targets. The standard file hashing tools refuse to hash them.

$ md5sum file.csv
6463474bfe6973a81dc7cbc4a71e8dd1  file.csv
$ md5sum ~/projects
md5sum: projects: Is a directory
$ man md5sum # has no recursive option

Drake uses the digest package to hash files, and digest avoids hashing directories too.

library(digest)
> digest("~/projects/", file = TRUE)
Error: The specified pathname is not a file: /home/landau/projects/
> file.exists("~/projects")
[1] TRUE

Directory targets would be nice to have, but I do not think it is drake's responsibility to figure out how to hash them quickly and efficiently. It's a thorny problem, maybe for a separate package, maybe called dirgest.
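To make the difficulty concrete, here is one naive way a directory fingerprint could be built from per-file hashes. It is only a sketch under stated assumptions: `hash_dir()` is a hypothetical helper, not part of drake or digest, and it uses base R's `tools::md5sum()` rather than digest. It reads every file on every call, which is exactly the speed problem mentioned above.

```r
# Hypothetical helper: fingerprint a directory by hashing each file in it.
hash_dir <- function(dir) {
  # Relative paths, sorted so the result does not depend on listing order.
  rel <- sort(list.files(dir, recursive = TRUE))
  hashes <- unname(tools::md5sum(file.path(dir, rel)))
  # Pair each path with its hash so that renames also change the fingerprint.
  paste(rel, hashes, sep = ":", collapse = "\n")
}

# Demonstration on a throwaway directory.
d <- file.path(tempdir(), "hash_dir_demo")
dir.create(d, showWarnings = FALSE)
writeLines("a", file.path(d, "one.txt"))
before <- hash_dir(d)

writeLines("b", file.path(d, "one.txt"))  # change the file's contents
after <- hash_dir(d)

print(identical(before, after))
```

The fingerprint here is a plain string of path and hash pairs; a real tool would presumably hash that string again and cache results by modification time to avoid rereading unchanged files.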

wlandau · 2018-02-19T01:20:55Z

Oops: forgot this issue was about more than just directory hashes. Reopening.

wlandau · 2018-02-24T19:35:56Z

I just updated #257 (comment) to be more FAQ-friendly, and this thread is now part of our automatically-generated FAQ. I think we can close. We should discuss potential further development on #12.

wlandau · 2018-05-02T03:44:18Z

FYI: the best practices guide now has detailed guidance on output file targets, including the main drawback and main alternative to the workaround we talked about earlier in the thread.

tiernanmartin · 2018-07-16T17:26:20Z

@wlandau awesome work implementing this feature 🎉

You asked for a shapefile workflow so I did my best to put something together:

Drake Shapefile Example

# SETUP -------------------------------------------------------------------


library(tibble)
library(purrr)
library(sf)
library(drake) # devtools::install_github("ropensci/drake")



# PLAN --------------------------------------------------------------------


make_place <- function(Name, Latitude, Longitude){
  tibble(Name = Name,
                    Latitude = Latitude,
                    Longitude = Longitude) %>% 
  st_as_sf(coords = c("Longitude", "Latitude")) %>% 
  st_set_crs(4326)
}
  

st_write_multiple <- function(..., file_outputs){ 
  pwalk(list(...), st_write)
}

u_auckland_plan <- drake_plan(u_auckland = make_place(Name = "University of Auckland", Latitude = -36.8521369, Longitude = 174.7688785),
                              u_aukland_shapefile = st_write_multiple(list(u_auckland), dsn = file_out("u-auckland.shp"), driver = "ESRI Shapefile",delete_dsn=TRUE, 
                                                file_outputs = file_out(c("u-auckland.prj","u-auckland.shx","u-auckland.dbf"))),
                              strings_in_dots = "literals")

u_auckland_plan

make(u_auckland_plan)

# TEST --------------------------------------------------------------------


file.remove("u-auckland.shp")

make(u_auckland_plan)


file.remove("u-auckland.prj")

make(u_auckland_plan)

file.remove("u-auckland.dbf")

make(u_auckland_plan)

It is more complicated than I hoped it would be. The drake side works perfectly, but I needed to write a wrapper function around st_write() to allow the extra file outputs to be tracked.

Perhaps someone else who is experimenting with using drake to manage a spatial data workflow can offer a simpler example? cc: @noamross @krlmlr @pat-s

wlandau · 2018-07-16T18:09:13Z

@tiernanmartin Thanks for the quick start on this example! I am optimistic. drake commands can be arbitrary multi-line code chunks, so I do not think we need the wrapper around st_write(). What about this plan?

library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_aukland_shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)

I am not sure my changes are totally correct because make(u_auckland_plan) gives warnings:

Warning: target u_aukland_shapefile warnings:
  GDAL Error 1: u-auckland.shp does not appear to be a file or directory.

On the other hand, the four files do appear, including u-auckland.shp.

tiernanmartin · 2018-07-16T18:20:55Z

Ah, I didn't realize that commands were so flexible.

Your code looks good to me! The GDAL error is annoying but not a deal breaker. The reason it shows up is because I set delete_dsn = TRUE, causing GDAL to expect to have to delete a file before it creates the replacement. The default setting of delete_dsn = FALSE will not allow the files to be overwritten.

tiernanmartin · 2018-07-16T18:25:04Z

I just noticed there is a typo - here's the complete version with your suggested revisions:

library(tibble)
library(sf)
library(drake)
library(magrittr) # provides %>%, which make_place() uses now that purrr is no longer loaded
pkgconfig::set_config("drake::strings_in_dots" = "literals")

make_place <- function(Name, Latitude, Longitude){
  tibble(Name = Name,
                    Latitude = Latitude,
                    Longitude = Longitude) %>% 
  st_as_sf(coords = c("Longitude", "Latitude")) %>% 
  st_set_crs(4326)
}


u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_auckland_shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)

make(u_auckland_plan)

file.remove("u-auckland.shp")

make(u_auckland_plan)


file.remove("u-auckland.prj")

make(u_auckland_plan)

file.remove("u-auckland.dbf")

make(u_auckland_plan)

wlandau · 2018-07-19T19:06:54Z

Thanks, @tiernanmartin. This is nice inspiration for a chapter in the docs.

wlandau · 2019-03-22T19:40:56Z

FYI: effective #795, you can write st_write(spatial_data, file_out("spatial_data"), driver = "ESRI Shapefile") and drake will track the entire directory of output files.

wlandau added topic: documentation type: faq topic: api labels Feb 16, 2018

wlandau closed this as completed Feb 19, 2018

wlandau reopened this Feb 19, 2018

wlandau mentioned this issue Feb 20, 2018

Directories (folders) are not reproducibly tracked. #12

Closed

wlandau closed this as completed Feb 24, 2018

wlandau mentioned this issue Feb 26, 2018

Allow multiple output files for each command #283

Closed

bmchorse mentioned this issue Apr 23, 2018

How to speed up construction of a (very) large plan? #366

Closed

wlandau mentioned this issue Jul 15, 2018

Multiple output files per command: complete implementation #469

Merged

7 tasks

wlandau mentioned this issue Jul 19, 2018

New chapter: example file-based data analysis project ropensci-books/drake#19

Closed

wlandau mentioned this issue Aug 2, 2018

pretty printing of plans #489

Closed

wlandau removed the type: faq label Dec 21, 2018
