An R package for downloading data from GOV.UK publications
This package is, for now, only available on GitHub and can be installed by running:
Setting Up RSelenium
The primary dependency is RSelenium, a
package that can run a headless browser, or a regular browser on a headless server.
RgovUK can either launch a browser internally or connect to an
existing Docker instance. The Docker approach is preferred
because it is more stable. See the RSelenium: Docker Containers
vignette for details on setting up Docker.
Before any functionality of the package can be used, the browser needs to be instantiated:
Or, if the approach with Docker is used:
start_browser(port = 4445L, docker = TRUE)
port should correspond to the host port that the container port is mapped to. For example,
docker run -d -p 4445:4444 selenium/standalone-firefox:3.10.0 maps container port 4444 to host port 4445.
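Before connecting, it can help to confirm that something is actually listening on the mapped host port. The following is a minimal sketch using only base R; port_is_open is a hypothetical helper written for this illustration, not a function of RgovUK or RSelenium, and the port number assumes the 4445:4444 mapping shown above.

```r
# Hypothetical helper (not part of RgovUK): returns TRUE if a TCP
# connection to host:port can be opened, FALSE otherwise.
port_is_open <- function(port, host = "localhost", timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL  # connection refused or timed out
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

# With `-p 4445:4444`, the host port to probe is 4445 (assumed mapping).
if (port_is_open(4445L)) {
  message("Something is listening on port 4445")
} else {
  message("Nothing is listening on port 4445; is the container running?")
}
```

An open port only shows that a process is listening there; it does not prove that the process is a Selenium server.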
After the browser is launched, it should be pointed to the main page of the website by running:
The website contains two key fields: filters and results.
get_filters and use_filter are the two functions for retrieving the available
filters and applying them to narrow down the required documents.
f <- get_filters(field = "descriptors")
# f
# $descriptors
# [1] "Contains" "Publication type" "Policy area" "Department"
# [5] "Official document status" "World locations" "Published after" "Published before"

filters <- get_filters()
head(as.data.frame(filters))
#                      values descriptors.txts    opt.groups           opt.values       opt.descriptors
# 1                  keywords         Contains          <NA>                 <NA>                  <NA>
# 2 publication_filter_option Publication type          <NA>                  all All publication types
# 3 publication_filter_option Publication type Consultations        consultations     All consultations
# 4 publication_filter_option Publication type Consultations closed-consultations  Closed consultations
# 5 publication_filter_option Publication type Consultations   open-consultations    Open consultations
# 6 publication_filter_option Publication type     Corporate    corporate-reports     Corporate reports

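# dplyr provides %>%, filter() and select()
library(dplyr)

# Collect the option values of all ministerial departments
departments <- as_data_frame(filters) %>%
  filter(opt.groups == "Ministerial departments") %>%
  select(opt.values) %>%
  unlist()
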
use_filter(departments)
use_filter("meetings", filter_type = "text")
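
The selection step can also be done with base R alone. The sketch below is a minimal illustration: it rebuilds the six rows printed above as a mock data frame, so it runs without a browser session, and extracts the option values of one group. The names mock_filters and consultation_values are illustrative and not part of the package.

```r
# Mock of the first six rows shown above; real data would come from
# as.data.frame(get_filters()) in a live browser session.
mock_filters <- data.frame(
  values = c("keywords", rep("publication_filter_option", 5)),
  descriptors.txts = c("Contains", rep("Publication type", 5)),
  opt.groups = c(NA, NA, "Consultations", "Consultations",
                 "Consultations", "Corporate"),
  opt.values = c(NA, "all", "consultations", "closed-consultations",
                 "open-consultations", "corporate-reports"),
  stringsAsFactors = FALSE
)

# which() drops the NA results that `==` yields for rows without a group.
keep <- which(mock_filters$opt.groups == "Consultations")
consultation_values <- mock_filters$opt.values[keep]
print(consultation_values)
```

A character vector of option values like this has the same shape as the departments vector passed to use_filter above.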
To download the documents that match the filter criteria as described above,
the package contains two functions:
The former downloads the files that are linked from the
pages listed in the search results. The latter downloads the pages themselves.
This can be particularly useful when the document search criteria are more
complicated than the website's filters allow,
or when meta information about the files is required.
temp <- tempdir()
download_files(temp, limit = 10, type = "csv")
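
Once pages are saved locally, criteria that the website's filters cannot express can be applied with ordinary text processing. The following is a minimal sketch using only base R. The HTML snippet and the extract_links helper are invented for illustration; they do not reflect the actual markup of the website or any function in the package.

```r
# Hypothetical helper: pull href targets out of raw HTML and keep those
# that end in the given file extension.
extract_links <- function(html, extension = "csv") {
  # Find every href="..." attribute in the text.
  matches <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  # Strip the surrounding href="..." wrapper.
  links <- sub('^href="', "", sub('"$', "", matches))
  # Keep only links with the requested extension (case-insensitive).
  links[grepl(paste0("\\.", extension, "$"), links, ignore.case = TRUE)]
}

# Invented content standing in for a downloaded results page.
page <- paste(
  '<a href="/files/meetings-2017.csv">Meetings</a>',
  '<a href="/files/report.pdf">Report</a>',
  '<a href="/files/gifts-2017.CSV">Gifts</a>'
)

csv_links <- extract_links(page)
print(csv_links)
```

For real pages, a proper HTML parser would be more robust than a regular expression; the regex keeps this sketch dependency-free.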