-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
get_stats19 fail with multiple years and casualties #154
Comments
This is a reprex of the problem: library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
casualties_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
ncol(casualties_2013)
#> [1] 14
casualties_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
ncol(casualties_2014)
#> [1] 15
get_stats19(2013:2014, type = "cas")
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpiVw2cU/DfTRoadSafety_Casualties_2013/DfTRoadSafety_Casualties_2013.csv
#> Files identified: DfTRoadSafety_Casualties_2014.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2014.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpiVw2cU/DfTRoadSafety_Casualties_2014/DfTRoadSafety_Casualties_2014.csv
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match Created on 2020-02-18 by the reprex package (v0.3.0) while in other cases it works library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
casualties_2017 <- get_stats19(2017, type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2017/Cas.csv
ncol(casualties_2017)
#> [1] 16
casualties_2018 <- get_stats19(2018, type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Attempt downloading from:
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2018.csv
ncol(casualties_2018)
#> [1] 16
casualties_2017_2018 <- get_stats19(c(2017, 2018), type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2017/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2018.csv
ncol(casualties_2017_2018)
#> [1] 16
nrow(casualties_2017) + nrow(casualties_2018) == nrow(casualties_2017_2018)
#> [1] TRUE Created on 2020-02-18 by the reprex package (v0.3.0) So this is a problem that occurs when we use rbind(data.frame(x = 1), data.frame(x = 1, y = 2))
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
library(tibble)
rbind(tibble(x = 1), tibble(x = 1, y = 2))
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match Created on 2020-02-18 by the reprex package (v0.3.0) I'm not sure what's the best solution. My first idea now would be using # packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
a = st_sf(a=1, geom = st_sfc(st_point(0:1)), crs = 3857)
b = st_sf(a=1, b = 2, geom = st_sfc(st_linestring(matrix(1:4,2))), crs = 3857)
rbind(a, b)
#> Error in rbind.data.frame(...): numbers of columns of arguments do not match
bind_rows(a, b)
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_POINT' elements may not preserve
#> their attributes
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_LINESTRING' elements may not
#> preserve their attributes
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index Created on 2020-02-18 by the reprex package (v0.3.0) Maybe we could simply raise an error in case of multiple years and |
I suggest the base R solution that keeps all columns would be good, e.g. as shown here: https://stackoverflow.com/a/46635610/1694378 |
Update, having thought about it and played with it I've changed my mind... I think cas_list = lapply(2013:2018, function(i) stats19::get_stats19(i, type = "cas"))
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/DfTRoadSafety_Casualties_2013/DfTRoadSafety_Casualties_2013.csv
#> Files identified: DfTRoadSafety_Casualties_2014.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2014.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/DfTRoadSafety_Casualties_2014/DfTRoadSafety_Casualties_2014.csv
#> Files identified: RoadSafetyData_Casualties_2015.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/RoadSafetyData_Casualties_2015.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/RoadSafetyData_Casualties_2015/Casualties_2015.csv
#> Files identified: dftRoadSafetyData_Casualties_2016.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2016.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2016/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2017/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2018.csv
sapply(cas_list, ncol)
#> [1] 14 15 16 16 16 16
# sapply(cas_list, names)
names_diff = setdiff(names(cas_list[[5]]), names(cas_list[[1]]))
names_all = unique(unlist(lapply(cas_list, names)))
res = dplyr::bind_rows(cas_list)
res
#> # A tibble: 1,077,310 x 16
#> accident_index vehicle_referen… casualty_refere… casualty_class
#> <chr> <int> <int> <chr>
#> 1 201301BS70003 2 1 Driver or rid…
#> 2 201301BS70005 1 1 Driver or rid…
#> 3 201301BS70005 1 2 Pedestrian
#> 4 201301BS70006 1 1 Passenger
#> 5 201301BS70007 1 1 Driver or rid…
#> 6 201301BS70009 1 1 Driver or rid…
#> 7 201301BS70010 1 1 Pedestrian
#> 8 201301BS70012 2 1 Driver or rid…
#> 9 201301BS70013 1 1 Pedestrian
#> 10 201301BS70015 1 1 Pedestrian
#> # … with 1,077,300 more rows, and 12 more variables: sex_of_casualty <chr>,
#> # age_band_of_casualty <chr>, casualty_severity <chr>,
#> # pedestrian_location <chr>, pedestrian_movement <chr>, car_passenger <chr>,
#> # bus_or_coach_passenger <chr>, pedestrian_road_maintenance_worker <chr>,
#> # casualty_type <chr>, casualty_home_area_type <chr>, age_of_casualty <int>,
#> # casualty_imd_decile <chr>
res[names_diff]
#> # A tibble: 1,077,310 x 2
#> age_of_casualty casualty_imd_decile
#> <int> <chr>
#> 1 NA <NA>
#> 2 NA <NA>
#> 3 NA <NA>
#> 4 NA <NA>
#> 5 NA <NA>
#> 6 NA <NA>
#> 7 NA <NA>
#> 8 NA <NA>
#> 9 NA <NA>
#> 10 NA <NA>
#> # … with 1,077,300 more rows
summary(.Last.value)
#> Mode TRUE
#> logical 1 Created on 2020-02-18 by the reprex package (v0.3.0) |
Remember: cas will never be an |
Plan to give this some thought next week, any further ideas welcome. Thanks for the reprex @agila5, definitely food for thought and eventually action to solve this pesky bug! |
Hi Andrea and hi Robin. I think you both know my position. I will only resort to non-base if it is the only option. I hope I can find some time to give it some thought too. |
Happy with non-base solution, if I can understand it! Thoughts @agila5 ? |
I would also prefer a base-R solution wrt # packages
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
cas_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
cas_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
cas_2015 <- get_stats19(2015, type = "cas", silent = TRUE)
names(cas_2014)[!names(cas_2014) %in% names(cas_2013)]
#> [1] "age_of_casualty"
names(cas_2015)[!names(cas_2015) %in% names(cas_2014)]
#> [1] "casualty_imd_decile" Created on 2020-02-19 by the reprex package (v0.3.0) and I think they could be important from a modeling perspective. I will think about it in the next days. |
Thanks @agila5, it seems there is a base way of doing this which would not be "very" expensive in our case, rbind(
data.frame(c(df1, sapply(setdiff(names(df2), names(df1)), function(x) NA))),
data.frame(c(df2, sapply(setdiff(names(df1), names(df2)), function(x) NA)))
) From great SO. |
I assume that is or something similar to that what |
This should go into tests really: library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
cas_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
cas_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
cas_2015 <- get_stats19(2015, type = "cas", silent = TRUE)
names(cas_2014)[!names(cas_2014) %in% names(cas_2013)]
#> [1] "age_of_casualty"
names(cas_2015)[!names(cas_2015) %in% names(cas_2014)]
#> [1] "casualty_imd_decile"
c = get_stats19(2013:2016, type = "cas")
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Data already exists in data_dir, not downloading
#> Data saved at ...
any(grepl("age_of_casualty", names(c)))
#> [1] TRUE
any(grepl("casualty_imd_decile", names(c)))
#> [1] TRUE
t = get_stats19(2013:2016, output_format = "sf")
#> Files identified: DfTRoadSafety_Accidents_2013.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Accidents_2013.zip
#> Attempt downloading from:
#> Data saved at ....
#> 7 rows removed with no coordinates
class(t)
#> [1] "sf" "data.frame" Created on 2020-02-20 by the reprex package (v0.3.0) |
Do we need a separate ticket to change "year" to "years", would that be a breaking change? ping #153 |
Thanks for the solution, I will test it carefully in the next days! Anyway I'm not sure I understand your question. Do you want to change the name of the parameter from |
Please open if more work is needed. |
Great work @layik, harder than expected but will make important mult-year analysis more accessible. |
Reprex:
The text was updated successfully, but these errors were encountered: