Easier way to scrape regular html data to data frame #12

Closed
renkun-ken opened this issue Sep 13, 2014 · 7 comments

Comments

@renkun-ken
Contributor

Consider a function, say html_df(), that creates a data frame directly, with each column specified by a css selector or an xpath query.

The function could be roughly written like this (not fully implemented yet):

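html_df <- function(x, columns, ...) {
  # Extract the text of each column's nodes, one list element per column
  coldata <- lapply(columns, function(col) {
    # Placeholder: choosing between css and xpath for `col` is not implemented
    nodes <- html_node(x, "css | xpath")
    html_text(nodes)
  })
  # Extra arguments (e.g. stringsAsFactors) are passed on to data.frame()
  data.frame(coldata, ...)
}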

An example for http://pyvideo.org/category/50/pycon-us-2014:

library(rvest)
html("http://pyvideo.org/category/50/pycon-us-2014") %>%
  html_node("div.video-summary-data") %>%
  html_df(list(
    title = xpath("div[1]//a//text()"),
    author = xpath("div[3]//a//text()"),
    date = xpath("div[4]//small[1]//text()"),
    language = xpath("div[4]//small[2]//text()"),
    description = xpath("div[5]//p//text()")),
    stringsAsFactors = FALSE)

Here the result should be a data.frame in which each column is specified by a selector, either css or xpath, so that a single step creates a data frame from web data that is regular enough.

@abresler

Man, this is amazing!! Soon people will see how powerful R can be for web scraping.

@hadley
Member

hadley commented Sep 13, 2014

I like it - I need to think through the syntax a bit though. It's a bit too reliant on xpath currently - i.e. there's no way to access an attribute with a css selector. It'd also be nice if it worked in a way where you could easily test each component before building up into a data frame.
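
One way around the attribute limitation, sketched here rather than taken from this thread, is to select nodes with a css selector and pull the attribute out in a separate step. The sketch assumes a later rvest release in which read_html() replaces html() and html_attr() is available; the inline HTML and the variable names are made up for illustration. Each intermediate value can be inspected on its own before anything is combined.

```r
library(rvest)

# Hypothetical inline page used only for illustration
page <- read_html('<p><a href="/talk/1">One</a> <a href="/talk/2">Two</a></p>')

# Step 1: select nodes with a css selector
links <- html_nodes(page, "p a")

# Step 2: extract text and an attribute separately, so each piece can be checked
titles <- html_text(links)
hrefs <- html_attr(links, "href")
```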

(Also I think this will require a variant of html_node() that doesn't unlist(), and assumes the selector only pulls out one node. Now that I think of it, html_node() is basically [, and we need an equivalent of [[. Maybe html_node() should be renamed to html_nodes() and html_node() should only ever pull out one node per input element.)
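
The split proposed above can be illustrated with a minimal sketch, assuming a later rvest release where both html_nodes() and html_node() exist and read_html() replaces html(). The inline HTML is hypothetical; its second item deliberately has no author.

```r
library(rvest)

# Hypothetical inline page: the second item has no author link
page <- read_html('
<div class="item"><a class="title">A</a><a class="author">Ann</a></div>
<div class="item"><a class="title">B</a></div>
<div class="item"><a class="title">C</a><a class="author">Cy</a></div>
')

items <- html_nodes(page, "div.item")

# html_nodes() flattens every match, so the result is shorter than `items`
all_authors <- items %>% html_nodes("a.author") %>% html_text()

# html_node() yields exactly one result per input element, NA where missing
per_item <- items %>% html_node("a.author") %>% html_text()
```

Because per_item stays aligned with items, it can serve as a data frame column, while all_authors cannot be matched back to its rows.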

@hadley
Member

hadley commented Sep 13, 2014

Another idea is to implement the css selector extensions used by https://github.com/EricChiang/pup

@renkun-ken
Contributor Author

Great thinking! The design of html_node() and html_nodes() handles many problematic cases in which the nodes are not quite regular. The code above does not really work because of some exceptional cases in the webpage, but with these two functions it works correctly, without much worry about missing or blank values in some fields.

@renkun-ken
Contributor Author

Using html_node() and html_nodes(), the code can be much less error-prone.

library(rvest)

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  html %>%
  html_nodes("div.video-summary-data")

cols <- list(
  title = "div[1]//a//text()",
  author = "div[3]//a//text()",
  date = "div[4]//small[1]//text()",
  language = "div[4]//small[2]//text()",
  description = "div[5]//p//text()")

df <- cols %>%
  lapply(function(col) {
    nodes %>%
      html_node(xpath = col) %>%
      html_text
  }) %>%
  data.frame(stringsAsFactors = FALSE)

@hadley
Member

hadley commented Sep 15, 2014

Alternatively, you could write it like this:

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  html %>%
  html_nodes("div.video-summary-data")

column <- function(x) nodes %>% html_node(xpath = x) %>% html_text()

df <- data.frame(
  title = column("div[1]//a"),
  author = column("div[3]//a"),
  date = column("div[4]//small[1]"),
  language = column("div[4]//small[2]"),
  description = column("div[5]//p"),
  stringsAsFactors = FALSE
)

I think this is simple enough that you don't need a wrapper.
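
The same pattern can be tried offline. The following is a minimal sketch, assuming a later rvest release where read_html() replaces html(); the inline HTML is hypothetical and only mimics the nested div layout, with the second entry missing its author link.

```r
library(rvest)

# Hypothetical inline page standing in for the real site
page <- read_html('
<div class="video"><div><a>Talk one</a></div><div><a>Ann</a></div></div>
<div class="video"><div><a>Talk two</a></div><div></div></div>
')

nodes <- html_nodes(page, "div.video")

# One value per node; the xpath is evaluated relative to each node
column <- function(x) nodes %>% html_node(xpath = x) %>% html_text()

df <- data.frame(
  title = column("div[1]//a"),
  author = column("div[2]//a"),
  stringsAsFactors = FALSE
)
```

The missing author shows up as NA in its own row instead of shifting the remaining values, which is what keeps the columns the same length.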

@renkun-ken
Contributor Author

@hadley, fantastic use, I just forgot about using a function :) This workflow looks very natural now.

@hadley hadley closed this as completed Sep 15, 2014

3 participants