Easier way to scrape regular html data to data frame #12
Comments
Man, this is amazing!! Soon people will see how powerful R can be for web scraping.
I like it - I need to think through the syntax a bit though. It's a bit too reliant on XPath currently - i.e. there's no way to access an attribute with a CSS selector. It'd also be nice if it worked in a way where you could easily test each component before building up into a data frame. (Also I think this will require a variant of
Another idea is to implement the CSS selector extensions used by https://github.com/EricChiang/pup
Great thinking! The design of
Using

library(rvest)

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  html %>%
  html_nodes("div.video-summary-data")

cols <- list(
  title = "div[1]//a//text()",
  author = "div[3]//a//text()",
  date = "div[4]//small[1]//text()",
  language = "div[4]//small[2]//text()",
  description = "div[5]//p//text()")

df <- cols %>%
  lapply(function(col) {
    nodes %>%
      html_node(xpath = col) %>%
      html_text
  }) %>%
  data.frame(stringsAsFactors = FALSE)
Alternatively, you could write it like this:

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  html %>%
  html_nodes("div.video-summary-data")

column <- function(x) nodes %>% html_node(xpath = x) %>% html_text()

df <- data.frame(
  title = column("div[1]//a"),
  author = column("div[3]//a"),
  date = column("div[4]//small[1]"),
  language = column("div[4]//small[2]"),
  description = column("div[5]//p"),
  stringsAsFactors = FALSE
)

I think this is simple enough that you don't need a wrapper.
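For the attribute case raised earlier in the thread, one possible variation (only a sketch, not something proposed by the commenters) is to let the helper take a CSS selector plus an optional attribute name. The inline HTML, the `attr` argument, and the `link` column are made up for illustration, and `read_html()` is used on a literal string so the example runs without a network connection.

```r
library(rvest)

# Made-up markup that loosely mimics the repeated blocks on the page.
page <- read_html('
  <div class="video-summary-data">
    <div><a href="/video/1">First talk</a></div>
  </div>
  <div class="video-summary-data">
    <div><a href="/video/2">Second talk</a></div>
  </div>')

nodes <- html_nodes(page, "div.video-summary-data")

# Hypothetical helper: returns the text of the first match in each block,
# or the named attribute when `attr` is supplied.
column <- function(css, attr = NULL) {
  matched <- html_node(nodes, css = css)
  if (is.null(attr)) html_text(matched) else html_attr(matched, attr)
}

df <- data.frame(
  title = column("a"),
  link = column("a", attr = "href"),
  stringsAsFactors = FALSE
)
```

Each call to `column()` can be run on its own to check a single selector before the data frame is assembled.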
@hadley, fantastic use, I just forgot about using a function :) This workflow looks very natural now.
Consider a function, say, html_df(), that creates a data frame directly by specifying each column with a CSS selector or an XPath query. The function can be roughly written like (not fully implemented yet):
An example for http://pyvideo.org/category/50/pycon-us-2014:
Here the result should be a data.frame in which each column is specified by a selector, either CSS or XPath, so that a single step can create a data frame from web data that is regular enough.
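The snippets from the original post are not preserved here, so the following is only a guess at what such a function might look like, not the author's code. The `css` and `xpath` argument names, and the inline HTML used to exercise it, are assumptions chosen for illustration; `html_df()` is not an rvest function.

```r
library(rvest)

# Hypothetical html_df(): one column per named selector, one row per node.
# `css` and `xpath` are named lists of selectors evaluated relative to each node.
html_df <- function(nodes, css = list(), xpath = list()) {
  from_css <- lapply(css, function(sel) html_text(html_node(nodes, css = sel)))
  from_xpath <- lapply(xpath, function(sel) html_text(html_node(nodes, xpath = sel)))
  data.frame(c(from_css, from_xpath), stringsAsFactors = FALSE)
}

# Made-up markup standing in for a regular, repeated page structure.
page <- read_html('
  <div class="video"><a href="/v/1">First talk</a><small>English</small></div>
  <div class="video"><a href="/v/2">Second talk</a><small>German</small></div>')

df <- html_df(
  html_nodes(page, "div.video"),
  css = list(title = "a"),
  xpath = list(language = ".//small")
)
```

Because `html_node()` returns one result per input node, blocks with no match would give `NA` in that column instead of shifting the rows.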