Easier way to scrape regular html data to data frame #12

renkun-ken · 2014-09-13T16:07:35Z

Consider a function, say html_df(), that creates a data frame directly by specifying each column with a CSS selector or an XPath query.

The function can be roughly written like (not fully implemented yet):

html_df <- function(x, columns, ...) {
  coldata <- lapply(columns, function(col) {
    # Not implemented yet: dispatch on whether col is a css or an xpath selector
    nodes <- html_node(x, col)
    html_text(nodes)
  })
  data.frame(coldata, ...)
}

An example for http://pyvideo.org/category/50/pycon-us-2014:

library(rvest)
html("http://pyvideo.org/category/50/pycon-us-2014") %>%
  html_node("div.video-summary-data") %>%
  html_df(list(
    title = xpath("div[1]//a//text()"),
    author = xpath("div[3]//a//text()"),
    date = xpath("div[4]//small[1]//text()"),
    language = xpath("div[4]//small[2]//text()"),
    description = xpath("div[5]//p//text()")),
    stringsAsFactors = FALSE)

Here the result should be a data.frame in which each column is specified by a selector, either CSS or XPath, so that a data frame can be created in a single step from web data that is regular enough.


abresler · 2014-09-13T16:15:19Z

Man, this is amazing! Soon people will see how powerful R can be for web scraping.

hadley · 2014-09-13T17:48:09Z

I like it - I need to think through the syntax a bit though. It's a bit too reliant on xpath currently - i.e. there's no way to access an attribute with a css selector. It'd also be nice if it worked in a way where you could easily test each component before building up into a data frame.
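As a rough illustration of that point, one way to reach an attribute without XPath is to select the elements with a CSS selector and then read the attribute in a separate step. The sketch below assumes a current rvest release, where read_html() parses an HTML string and html_attr() reads an attribute; the markup and the column names are made up for the example. Each vector can be inspected on its own before it goes into the data frame.

```r
library(rvest)

# Made-up markup standing in for a real page
page <- read_html('<ul>
  <li><a href="/talk/1" title="Keynote">Opening</a></li>
  <li><a href="/talk/2">Closing</a></li>
</ul>')

# The CSS selector picks the elements; attributes are read afterwards
links <- html_nodes(page, "li a")

# Each component can be checked on its own
text  <- html_text(links)
href  <- html_attr(links, "href")
title <- html_attr(links, "title")  # NA where the attribute is absent

# Only then combine the pieces into a data frame
df <- data.frame(text, href, title, stringsAsFactors = FALSE)
```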

(Also I think this will require a variant of html_node() that doesn't unlist(), and assumes the selector only pulls out one node. Now that I think of it, html_node() is basically [, and we need an equivalent of [[. Maybe html_node() should be renamed to html_nodes() and html_node() should only ever pull out one node per input element.)
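A small sketch of the distinction, assuming the semantics proposed in this comment (which, as far as I can tell, is how current rvest releases behave): html_nodes() returns every match flattened into one set, while html_node() returns exactly one result per input element, with a missing value where nothing matches. The markup is invented for the example.

```r
library(rvest)

# Two made-up items; only the first one has a <small> element
page <- read_html('<div class="item"><a>First</a><small>2014</small></div>
<div class="item"><a>Second</a></div>')

items <- html_nodes(page, "div.item")

# html_nodes(): all matches across the inputs, so the length can differ
all_small <- html_nodes(items, "small")
length(all_small)

# html_node(): one result per input element, missing where absent
one_small <- html_node(items, "small")
length(one_small)
html_text(one_small)
```

Because html_node() keeps its output aligned with its input, the resulting vectors all have the same length and can sit side by side as data frame columns.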

hadley · 2014-09-13T19:14:33Z

Another idea is to implement the css selector extensions used by https://github.com/EricChiang/pup

renkun-ken · 2014-09-14T00:50:54Z

Great thinking! The design of html_node() and html_nodes() handles many problematic cases in which the nodes are not quite regular. The code above does not actually work because of some exceptional cases on the web page, but with these two functions it works correctly, without much worry about missing or blank values in some fields.

renkun-ken · 2014-09-14T11:23:44Z

Using html_node() and html_nodes(), the code can be much less error-prone.

library(rvest)

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  html %>%
  html_nodes("div.video-summary-data")

cols <- list(
  title = "div[1]//a//text()",
  author = "div[3]//a//text()",
  date = "div[4]//small[1]//text()",
  language = "div[4]//small[2]//text()",
  description = "div[5]//p//text()")

df <- cols %>%
  lapply(function(col) {
    nodes %>%
      html_node(xpath = col) %>%
      html_text
  }) %>%
  data.frame(stringsAsFactors = FALSE)

hadley · 2014-09-15T11:00:56Z

Alternatively, you could write it like this:

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  html %>%
  html_nodes("div.video-summary-data")

column <- function(x) nodes %>% html_node(xpath = x) %>% html_text()

df <- data.frame(
  title = column("div[1]//a"),
  author = column("div[3]//a"),
  date = column("div[4]//small[1]"),
  language = column("div[4]//small[2]"),
  description = column("div[5]//p"),
  stringsAsFactors = FALSE
)

I think this is simple enough that you don't need a wrapper
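The same pattern can be tried offline. The sketch below is an adaptation, not the code from this thread: it assumes a current rvest release (where read_html() replaces the html() call used above), uses CSS selectors instead of XPath, and runs against invented markup with hypothetical class names. The second record has no author, which shows how the helper keeps the columns aligned.

```r
library(rvest)

# Invented markup; the second record deliberately lacks an author
page <- read_html('<div class="video-summary-data">
  <div class="title"><a>Talk one</a></div>
  <div class="author"><a>Ann</a></div>
</div>
<div class="video-summary-data">
  <div class="title"><a>Talk two</a></div>
</div>')

nodes <- html_nodes(page, "div.video-summary-data")

# One value per record: html_node() yields a missing node where nothing matches
column <- function(css) nodes %>% html_node(css) %>% html_text()

df <- data.frame(
  title = column(".title a"),
  author = column(".author a"),
  stringsAsFactors = FALSE
)
```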

renkun-ken · 2014-09-15T11:03:17Z

@hadley, fantastic use, just forget about using function :) This workflow looks very natural now.

hadley closed this as completed Sep 15, 2014
