Scraping Web data using rvest
========================================================
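Load the rvest package, installing it first if necessary.
```{r}
# require() returns FALSE when the package is missing, so install it and then load it
if (!require(rvest)) {
  install.packages("rvest")
  library(rvest)
}
```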
Set the master URL. The tables we wish to extract are spread across multiple pages under this address.
```{r}
url <- "https://www.londonstockexchange.com/exchange/prices-and-markets/ETFs/ETFs.html"
```
Create a directory to do our work in, if it doesn't already exist.
```{r echo=FALSE}
# showWarnings = FALSE silences the warning if the directory already exists
dir.create("C:/rvest_test", showWarnings = FALSE)
```
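Find out how many pages there are.
```{r}
# All file paths have to be explicit in Rmd, otherwise they won't load
download.file(url, "C:/rvest_test/ETFs.html")
# read_html() replaces the deprecated html() function in rvest
page <- read_html("C:/rvest_test/ETFs.html")
# text of the pagination element, assumed to end with the total page count
pages <- html_text(html_node(page, "p.floatsx"))
# take the last two characters, so this assumes a two-digit page count
pages <- as.numeric(substr(pages, nchar(pages) - 1, nchar(pages)))
```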
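Taking the last two characters only works while the page count has exactly two digits. As a minimal sketch of a less brittle alternative, the chunk below pulls the last run of digits out of a string. The helper name `extract_last_number` and the sample text `"Page 1 of 112"` are assumptions made for illustration; the real wording of the pagination element may differ.

```{r}
# Hypothetical helper: return the last run of digits in a string as a number,
# or NA if the string contains no digits at all
extract_last_number <- function(txt) {
  digits <- regmatches(txt, gregexpr("[0-9]+", txt))[[1]]
  if (length(digits) == 0) {
    return(NA_real_)
  }
  as.numeric(digits[length(digits)])
}

# Sample text is an assumption, not the actual content of the page
extract_last_number("Page 1 of 112")
```

Under that assumption, `pages <- extract_last_number(pages)` could stand in for the `substr()` call.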
Remove the file when we're finished.
```{r}
file.remove("C:/rvest_test/ETFs.html")
```
Initialise an empty data frame, then loop over each page: download the table, record the time, and append the result to the master data frame.
```{r warning=FALSE}
etf_table <- data.frame()
# for each page
for (p in seq_len(pages)) {
  cur_url <- paste0(url, "?&page=", p)
  page_file <- paste0("C:/rvest_test/", p, "_rvest.html")
  # download the file
  download.file(cur_url, page_file)
  # create the html object (read_html() replaces the deprecated html())
  page <- read_html(page_file)
  # look for tables on the page and get the first one
  page_table <- html_table(html_nodes(page, "table")[[1]])
  # only the first 6 columns contain information that we need
  page_table <- page_table[1:6]
  # stick a timestamp at the end
  page_table["Timestamp"] <- Sys.time()
  # add into the final results table
  etf_table <- rbind(etf_table, page_table)
  # remove the downloaded file
  file.remove(page_file)
}
```
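Growing a data frame with `rbind()` inside a loop copies the accumulated rows on every iteration. A minimal sketch of a common alternative is to collect each page's table in a list and bind them once at the end. The function `fake_page_table` and the value of `n_pages` are hypothetical stand-ins so that the chunk runs without network access; they are not part of the scraping code above.

```{r}
# Hypothetical stand-in for the scraped table of page p (no network access needed)
fake_page_table <- function(p) {
  data.frame(Name = paste0("ETF_", p), Price = p * 10, stringsAsFactors = FALSE)
}

n_pages <- 3  # assumed page count, for illustration only
page_tables <- vector("list", n_pages)
for (p in seq_len(n_pages)) {
  page_table <- fake_page_table(p)
  # stick a timestamp at the end, as in the loop above
  page_table["Timestamp"] <- Sys.time()
  page_tables[[p]] <- page_table
}
# bind all pages together in a single call
combined <- do.call(rbind, page_tables)
nrow(combined)
```

For a few dozen pages the difference is negligible, so this is a matter of taste rather than a required change.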
```{r eval=FALSE,echo=FALSE}
## **WARNING** this deletes the working directory and everything in it. Set the chunk option `eval=TRUE` only if required. **WARNING**
unlink("C:/rvest_test", recursive = TRUE)
```
Summarize the ETF table.
```{r}
summary(etf_table)
```