get_stats19 fail with multiple years and casualties #154

Robinlovelace · 2020-02-14T16:19:56Z

Reprex:

casualties = get_stats19(2012:2018, type = "cas")

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

The text was updated successfully, but these errors were encountered:

agila5 · 2020-02-18T19:31:00Z

This is a reprex of the problem:

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

casualties_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
ncol(casualties_2013)
#> [1] 14
casualties_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
ncol(casualties_2014)
#> [1] 15
get_stats19(2013:2014, type = "cas")
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpiVw2cU/DfTRoadSafety_Casualties_2013/DfTRoadSafety_Casualties_2013.csv
#> Files identified: DfTRoadSafety_Casualties_2014.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2014.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpiVw2cU/DfTRoadSafety_Casualties_2014/DfTRoadSafety_Casualties_2014.csv
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

^{Created on 2020-02-18 by the reprex package (v0.3.0)}

while in other cases it works

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

casualties_2017 <- get_stats19(2017, type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2017/Cas.csv
ncol(casualties_2017)
#> [1] 16
casualties_2018 <- get_stats19(2018, type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Attempt downloading from:
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2018.csv
ncol(casualties_2018)
#> [1] 16
casualties_2017_2018 <- get_stats19(c(2017, 2018), type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2017/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2018.csv
ncol(casualties_2017_2018)
#> [1] 16
nrow(casualties_2017) + nrow(casualties_2018) == nrow(casualties_2017_2018)
#> [1] TRUE

^{Created on 2020-02-18 by the reprex package (v0.3.0)}

So this is a problem that occurs when we use rbind with two or more data.frame/tibbles/sf having different columns.

rbind(data.frame(x = 1), data.frame(x = 1, y = 2))
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

library(tibble)
rbind(tibble(x = 1), tibble(x = 1, y = 2))
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

^{Created on 2020-02-18 by the reprex package (v0.3.0)}

I'm not sure what's the best solution. My first idea now would be using dplyr::bind_rows (and maybe that implies moving dplyr to Imports, I'm not sure) but, as far as I know, at the moment there are some frictions between bind_rows and sf objects: r-spatial/sf#49.

# packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

a = st_sf(a=1, geom = st_sfc(st_point(0:1)), crs = 3857)
b = st_sf(a=1, b = 2, geom = st_sfc(st_linestring(matrix(1:4,2))), crs = 3857)

rbind(a, b)
#> Error in rbind.data.frame(...): numbers of columns of arguments do not match
bind_rows(a, b)
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_POINT' elements may not preserve
#> their attributes
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_LINESTRING' elements may not
#> preserve their attributes
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index

^{Created on 2020-02-18 by the reprex package (v0.3.0)}

Maybe we could simply raise an error in case of multiple years and output_format == "sf". Another solution would be to just take the common columns and return the rbind of common columns. Suggestion?

Robinlovelace · 2020-02-18T21:14:40Z

I suggest the base R solution that keeps all columns would be good, e.g. as shown here: https://stackoverflow.com/a/46635610/1694378

Robinlovelace · 2020-02-18T21:31:07Z

Update, having thought about it and played with it I've changed my mind... I think dplyr::bind_rows() may be the least bad option, e.g.:

cas_list = lapply(2013:2018, function(i) stats19::get_stats19(i, type = "cas"))
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/DfTRoadSafety_Casualties_2013/DfTRoadSafety_Casualties_2013.csv
#> Files identified: DfTRoadSafety_Casualties_2014.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2014.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/DfTRoadSafety_Casualties_2014/DfTRoadSafety_Casualties_2014.csv
#> Files identified: RoadSafetyData_Casualties_2015.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/RoadSafetyData_Casualties_2015.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/RoadSafetyData_Casualties_2015/Casualties_2015.csv
#> Files identified: dftRoadSafetyData_Casualties_2016.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2016.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2016/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2017/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2018.csv
sapply(cas_list, ncol)
#> [1] 14 15 16 16 16 16
# sapply(cas_list, names)
names_diff = setdiff(names(cas_list[[5]]), names(cas_list[[1]]))
names_all = unique(unlist(lapply(cas_list, names)))

res = dplyr::bind_rows(cas_list)
res
#> # A tibble: 1,077,310 x 16
#>    accident_index vehicle_referen… casualty_refere… casualty_class
#>    <chr>                     <int>            <int> <chr>         
#>  1 201301BS70003                 2                1 Driver or rid…
#>  2 201301BS70005                 1                1 Driver or rid…
#>  3 201301BS70005                 1                2 Pedestrian    
#>  4 201301BS70006                 1                1 Passenger     
#>  5 201301BS70007                 1                1 Driver or rid…
#>  6 201301BS70009                 1                1 Driver or rid…
#>  7 201301BS70010                 1                1 Pedestrian    
#>  8 201301BS70012                 2                1 Driver or rid…
#>  9 201301BS70013                 1                1 Pedestrian    
#> 10 201301BS70015                 1                1 Pedestrian    
#> # … with 1,077,300 more rows, and 12 more variables: sex_of_casualty <chr>,
#> #   age_band_of_casualty <chr>, casualty_severity <chr>,
#> #   pedestrian_location <chr>, pedestrian_movement <chr>, car_passenger <chr>,
#> #   bus_or_coach_passenger <chr>, pedestrian_road_maintenance_worker <chr>,
#> #   casualty_type <chr>, casualty_home_area_type <chr>, age_of_casualty <int>,
#> #   casualty_imd_decile <chr>
res[names_diff]
#> # A tibble: 1,077,310 x 2
#>    age_of_casualty casualty_imd_decile
#>              <int> <chr>              
#>  1              NA <NA>               
#>  2              NA <NA>               
#>  3              NA <NA>               
#>  4              NA <NA>               
#>  5              NA <NA>               
#>  6              NA <NA>               
#>  7              NA <NA>               
#>  8              NA <NA>               
#>  9              NA <NA>               
#> 10              NA <NA>               
#> # … with 1,077,300 more rows
summary(.Last.value)
#>    Mode    TRUE 
#> logical       1

^{Created on 2020-02-18 by the reprex package (v0.3.0)}

Robinlovelace · 2020-02-18T21:31:42Z

Remember: cas will never be an sf object...

Robinlovelace · 2020-02-18T21:33:20Z

Plan to give this some thought next week, any further ideas welcome. Thanks for the reprex @agila5, definitely food for thought and eventually action to solve this pesky bug!

layik · 2020-02-19T13:34:07Z

Hi Andrea and hi Robin. I think you both know my position. I will only resort to non-base if it is the only option. I hope I can find some time to give it some thought too.

Robinlovelace · 2020-02-19T17:20:17Z

Happy with non-base solution, if I can understand it! Thoughts @agila5 ?

agila5 · 2020-02-19T18:35:41Z

I would also prefer a base-R solution wrt dplyr::bind_rows but I'm not 100% happy with "rbind-ing using common columns" because it drops the age_of_casuality and imd_decile variables in the previous examples, i.e.

# packages
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

cas_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
cas_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
cas_2015 <- get_stats19(2015, type = "cas", silent = TRUE)

names(cas_2014)[!names(cas_2014) %in% names(cas_2013)]
#> [1] "age_of_casualty"
names(cas_2015)[!names(cas_2015) %in% names(cas_2014)]
#> [1] "casualty_imd_decile"

^{Created on 2020-02-19 by the reprex package (v0.3.0)}

and I think they could be important from a modeling perspective. I will think about it in the next days.

layik · 2020-02-19T22:12:48Z

Thanks @agila5, it seems there is a base way of doing this which would not be "very" expensive in our case, ~~but I just could not play with the do.call(rbind, all) to get you some code~~.

rbind(
        data.frame(c(df1, sapply(setdiff(names(df2), names(df1)), function(x) NA))),
        data.frame(c(df2, sapply(setdiff(names(df1), names(df2)), function(x) NA)))
      )

From great SO.

layik · 2020-02-19T22:16:40Z

I assume that is or something similar to that what dplyr::bind_rows does, and possibly run even more lines of code.

layik · 2020-02-20T01:06:33Z

This should go into tests really:

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
cas_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
cas_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
cas_2015 <- get_stats19(2015, type = "cas", silent = TRUE)

names(cas_2014)[!names(cas_2014) %in% names(cas_2013)]
#> [1] "age_of_casualty"
names(cas_2015)[!names(cas_2015) %in% names(cas_2014)]
#> [1] "casualty_imd_decile"

c = get_stats19(2013:2016, type = "cas")
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Data already exists in data_dir, not downloading
#> Data saved at ...

any(grepl("age_of_casualty", names(c)))
#> [1] TRUE
any(grepl("casualty_imd_decile", names(c)))
#> [1] TRUE

t = get_stats19(2013:2016, output_format = "sf")
#> Files identified: DfTRoadSafety_Accidents_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Accidents_2013.zip
#> Attempt downloading from:
#> Data saved at ....
#> 7 rows removed with no coordinates
class(t)
#> [1] "sf"         "data.frame"

^{Created on 2020-02-20 by the reprex package (v0.3.0)}

layik · 2020-02-20T01:11:25Z

Do we need a separate ticket to change "year" to "years", would that be a breaking change?

ping #153

agila5 · 2020-02-20T12:02:58Z

Thanks for the solution, I will test it carefully in the next days!

Anyway I'm not sure I understand your question. Do you want to change the name of the parameter from year to years? I wouldn't do that and I would just update the docs overwriting the description for the year parameter.

Years fix cas in #154

layik · 2020-02-25T22:03:30Z

Please open if more work is needed.

Robinlovelace · 2020-02-26T09:15:23Z

Great work @layik, harder than expected but will make important mult-year analysis more accessible.

layik added a commit that referenced this issue Feb 20, 2020

addressing do.call rbind on multiple years: #154, #99 and #145

374d152

layik added a commit that referenced this issue Feb 20, 2020

add #154 comment as tests

c9e46e6

layik added a commit that referenced this issue Feb 21, 2020

fix output_format in #154

4f37c64

Robinlovelace added a commit that referenced this issue Feb 25, 2020

Merge pull request #157 from ropensci/years

fe420d5

Years fix cas in #154

layik closed this as completed Feb 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get_stats19 fail with multiple years and casualties #154

get_stats19 fail with multiple years and casualties #154

Robinlovelace commented Feb 14, 2020

agila5 commented Feb 18, 2020

Robinlovelace commented Feb 18, 2020

Robinlovelace commented Feb 18, 2020

Robinlovelace commented Feb 18, 2020

Robinlovelace commented Feb 18, 2020

layik commented Feb 19, 2020

Robinlovelace commented Feb 19, 2020

agila5 commented Feb 19, 2020

layik commented Feb 19, 2020 •

edited

Loading

layik commented Feb 19, 2020 •

edited

Loading

layik commented Feb 20, 2020 •

edited

Loading

layik commented Feb 20, 2020 •

edited

Loading

agila5 commented Feb 20, 2020

layik commented Feb 25, 2020

Robinlovelace commented Feb 26, 2020

get_stats19 fail with multiple years and casualties #154

get_stats19 fail with multiple years and casualties #154

Comments

Robinlovelace commented Feb 14, 2020

agila5 commented Feb 18, 2020

Robinlovelace commented Feb 18, 2020

Robinlovelace commented Feb 18, 2020

Robinlovelace commented Feb 18, 2020

Robinlovelace commented Feb 18, 2020

layik commented Feb 19, 2020

Robinlovelace commented Feb 19, 2020

agila5 commented Feb 19, 2020

layik commented Feb 19, 2020 • edited Loading

layik commented Feb 19, 2020 • edited Loading

layik commented Feb 20, 2020 • edited Loading

layik commented Feb 20, 2020 • edited Loading

agila5 commented Feb 20, 2020

layik commented Feb 25, 2020

Robinlovelace commented Feb 26, 2020

layik commented Feb 19, 2020 •

edited

Loading

layik commented Feb 19, 2020 •

edited

Loading

layik commented Feb 20, 2020 •

edited

Loading

layik commented Feb 20, 2020 •

edited

Loading