Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_stats19 fail with multiple years and casualties #154

Closed
Robinlovelace opened this issue Feb 14, 2020 · 15 comments
Closed

get_stats19 fail with multiple years and casualties #154

Robinlovelace opened this issue Feb 14, 2020 · 15 comments

Comments

@Robinlovelace
Copy link
Member

Reprex:

casualties = get_stats19(2012:2018, type = "cas")

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match
@agila5
Copy link
Collaborator

agila5 commented Feb 18, 2020

This is a reprex of the problem:

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

casualties_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
ncol(casualties_2013)
#> [1] 14
casualties_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
ncol(casualties_2014)
#> [1] 15
get_stats19(2013:2014, type = "cas")
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpiVw2cU/DfTRoadSafety_Casualties_2013/DfTRoadSafety_Casualties_2013.csv
#> Files identified: DfTRoadSafety_Casualties_2014.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2014.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpiVw2cU/DfTRoadSafety_Casualties_2014/DfTRoadSafety_Casualties_2014.csv
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

Created on 2020-02-18 by the reprex package (v0.3.0)

while in other cases it works

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

casualties_2017 <- get_stats19(2017, type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2017/Cas.csv
ncol(casualties_2017)
#> [1] 16
casualties_2018 <- get_stats19(2018, type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Attempt downloading from:
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2018.csv
ncol(casualties_2018)
#> [1] 16
casualties_2017_2018 <- get_stats19(c(2017, 2018), type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2017/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2018.csv
ncol(casualties_2017_2018)
#> [1] 16
nrow(casualties_2017) + nrow(casualties_2018) == nrow(casualties_2017_2018)
#> [1] TRUE

Created on 2020-02-18 by the reprex package (v0.3.0)

So this is a problem that occurs when we use rbind with two or more data.frame/tibbles/sf having different columns.

rbind(data.frame(x = 1), data.frame(x = 1, y = 2))
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

library(tibble)
rbind(tibble(x = 1), tibble(x = 1, y = 2))
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

Created on 2020-02-18 by the reprex package (v0.3.0)

I'm not sure what's the best solution. My first idea now would be using dplyr::bind_rows (and maybe that implies moving dplyr to Imports, I'm not sure) but, as far as I know, at the moment there are some frictions between bind_rows and sf objects: r-spatial/sf#49.

# packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

a = st_sf(a=1, geom = st_sfc(st_point(0:1)), crs = 3857)
b = st_sf(a=1, b = 2, geom = st_sfc(st_linestring(matrix(1:4,2))), crs = 3857)

rbind(a, b)
#> Error in rbind.data.frame(...): numbers of columns of arguments do not match
bind_rows(a, b)
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_POINT' elements may not preserve
#> their attributes
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_LINESTRING' elements may not
#> preserve their attributes
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index

Created on 2020-02-18 by the reprex package (v0.3.0)

Maybe we could simply raise an error in case of multiple years and output_format == "sf". Another solution would be to just take the common columns and return the rbind of common columns. Suggestion?

@Robinlovelace
Copy link
Member Author

I suggest the base R solution that keeps all columns would be good, e.g. as shown here: https://stackoverflow.com/a/46635610/1694378

@Robinlovelace
Copy link
Member Author

Update, having thought about it and played with it I've changed my mind... I think dplyr::bind_rows() may be the least bad option, e.g.:

cas_list = lapply(2013:2018, function(i) stats19::get_stats19(i, type = "cas"))
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/DfTRoadSafety_Casualties_2013/DfTRoadSafety_Casualties_2013.csv
#> Files identified: DfTRoadSafety_Casualties_2014.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2014.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/DfTRoadSafety_Casualties_2014/DfTRoadSafety_Casualties_2014.csv
#> Files identified: RoadSafetyData_Casualties_2015.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/RoadSafetyData_Casualties_2015.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/RoadSafetyData_Casualties_2015/Casualties_2015.csv
#> Files identified: dftRoadSafetyData_Casualties_2016.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2016.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2016/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2017/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2018.csv
sapply(cas_list, ncol)
#> [1] 14 15 16 16 16 16
# sapply(cas_list, names)
names_diff = setdiff(names(cas_list[[5]]), names(cas_list[[1]]))
names_all = unique(unlist(lapply(cas_list, names)))

res = dplyr::bind_rows(cas_list)
res
#> # A tibble: 1,077,310 x 16
#>    accident_index vehicle_referen… casualty_refere… casualty_class
#>    <chr>                     <int>            <int> <chr>         
#>  1 201301BS70003                 2                1 Driver or rid…
#>  2 201301BS70005                 1                1 Driver or rid…
#>  3 201301BS70005                 1                2 Pedestrian    
#>  4 201301BS70006                 1                1 Passenger     
#>  5 201301BS70007                 1                1 Driver or rid…
#>  6 201301BS70009                 1                1 Driver or rid…
#>  7 201301BS70010                 1                1 Pedestrian    
#>  8 201301BS70012                 2                1 Driver or rid…
#>  9 201301BS70013                 1                1 Pedestrian    
#> 10 201301BS70015                 1                1 Pedestrian    
#> # … with 1,077,300 more rows, and 12 more variables: sex_of_casualty <chr>,
#> #   age_band_of_casualty <chr>, casualty_severity <chr>,
#> #   pedestrian_location <chr>, pedestrian_movement <chr>, car_passenger <chr>,
#> #   bus_or_coach_passenger <chr>, pedestrian_road_maintenance_worker <chr>,
#> #   casualty_type <chr>, casualty_home_area_type <chr>, age_of_casualty <int>,
#> #   casualty_imd_decile <chr>
res[names_diff]
#> # A tibble: 1,077,310 x 2
#>    age_of_casualty casualty_imd_decile
#>              <int> <chr>              
#>  1              NA <NA>               
#>  2              NA <NA>               
#>  3              NA <NA>               
#>  4              NA <NA>               
#>  5              NA <NA>               
#>  6              NA <NA>               
#>  7              NA <NA>               
#>  8              NA <NA>               
#>  9              NA <NA>               
#> 10              NA <NA>               
#> # … with 1,077,300 more rows
summary(.Last.value)
#>    Mode    TRUE 
#> logical       1

Created on 2020-02-18 by the reprex package (v0.3.0)

@Robinlovelace
Copy link
Member Author

Remember: cas will never be an sf object...

@Robinlovelace
Copy link
Member Author

Plan to give this some thought next week, any further ideas welcome. Thanks for the reprex @agila5, definitely food for thought and eventually action to solve this pesky bug!

@layik
Copy link
Member

layik commented Feb 19, 2020

Hi Andrea and hi Robin. I think you both know my position. I will only resort to non-base if it is the only option. I hope I can find some time to give it some thought too.

@Robinlovelace
Copy link
Member Author

Happy with non-base solution, if I can understand it! Thoughts @agila5 ?

@agila5
Copy link
Collaborator

agila5 commented Feb 19, 2020

I would also prefer a base-R solution wrt dplyr::bind_rows but I'm not 100% happy with "rbind-ing using common columns" because it drops the age_of_casuality and imd_decile variables in the previous examples, i.e.

# packages
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

cas_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
cas_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
cas_2015 <- get_stats19(2015, type = "cas", silent = TRUE)

names(cas_2014)[!names(cas_2014) %in% names(cas_2013)]
#> [1] "age_of_casualty"
names(cas_2015)[!names(cas_2015) %in% names(cas_2014)]
#> [1] "casualty_imd_decile"

Created on 2020-02-19 by the reprex package (v0.3.0)

and I think they could be important from a modeling perspective. I will think about it in the next days.

@layik
Copy link
Member

layik commented Feb 19, 2020

Thanks @agila5, it seems there is a base way of doing this which would not be "very" expensive in our case, but I just could not play with the do.call(rbind, all) to get you some code.

rbind(
        data.frame(c(df1, sapply(setdiff(names(df2), names(df1)), function(x) NA))),
        data.frame(c(df2, sapply(setdiff(names(df1), names(df2)), function(x) NA)))
      )

From great SO.

@layik
Copy link
Member

layik commented Feb 19, 2020

I assume that is or something similar to that what dplyr::bind_rows does, and possibly run even more lines of code.

@layik
Copy link
Member

layik commented Feb 20, 2020

This should go into tests really:

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
cas_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
cas_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
cas_2015 <- get_stats19(2015, type = "cas", silent = TRUE)

names(cas_2014)[!names(cas_2014) %in% names(cas_2013)]
#> [1] "age_of_casualty"
names(cas_2015)[!names(cas_2015) %in% names(cas_2014)]
#> [1] "casualty_imd_decile"

c = get_stats19(2013:2016, type = "cas")
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Data already exists in data_dir, not downloading
#> Data saved at ...

any(grepl("age_of_casualty", names(c)))
#> [1] TRUE
any(grepl("casualty_imd_decile", names(c)))
#> [1] TRUE

t = get_stats19(2013:2016, output_format = "sf")
#> Files identified: DfTRoadSafety_Accidents_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Accidents_2013.zip
#> Attempt downloading from:
#> Data saved at ....
#> 7 rows removed with no coordinates
class(t)
#> [1] "sf"         "data.frame"

Created on 2020-02-20 by the reprex package (v0.3.0)

@layik
Copy link
Member

layik commented Feb 20, 2020

Do we need a separate ticket to change "year" to "years", would that be a breaking change?

ping #153

@agila5
Copy link
Collaborator

agila5 commented Feb 20, 2020

Thanks for the solution, I will test it carefully in the next days!

Anyway I'm not sure I understand your question. Do you want to change the name of the parameter from year to years? I wouldn't do that and I would just update the docs overwriting the description for the year parameter.

layik added a commit that referenced this issue Feb 20, 2020
layik added a commit that referenced this issue Feb 21, 2020
Robinlovelace added a commit that referenced this issue Feb 25, 2020
@layik
Copy link
Member

layik commented Feb 25, 2020

Please open if more work is needed.

@layik layik closed this as completed Feb 25, 2020
@Robinlovelace
Copy link
Member Author

Great work @layik, harder than expected but will make important mult-year analysis more accessible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants