Scraping college rankings data? #17
Using the following packages:
library(rvest) # devtools::install_github("rvest","hadley")
library(stringr) # install.packages("stringr")

# URL template; %d is replaced by the page number
url <- "http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data/page+%d"

# Helpers for cleaning the scraped text
trimNewline <- function(p) str_replace(p, "\n", "")
asInteger <- function(x) as.integer(str_replace_all(x, ",", ""))
fromPercent <- function(x) as.numeric(str_replace(x, "%", "")) / 100

table <- 1:3 %>%
  lapply(function(page) {
    # One node per table row on this page
    nodes <- sprintf(url, page) %>%
      read_html() %>% # html() in older versions of rvest
      html_nodes("table tbody tr")
    # Extract the text found at `xpath`, relative to each row
    column <- function(xpath) nodes %>% html_node(xpath = xpath) %>% html_text(trim = TRUE)
    data.frame(
      rank = column("td[1]/div[1]/span") %>%
        str_replace("#(\\d+)(Tie)?", "\\1") %>%
        as.integer,
      score = column("td[1]/span[1]/span") %>%
        str_replace("(\\d+) out of 100.", "\\1") %>%
        asInteger,
      name = column("td[2]/a"),
      location = column("td[2]/p/text()[1]"),
      tuitionAndFees = column("td[3]/text()[1]") %>% trimNewline,
      totalEnrollment = column("td[4]/text()[1]") %>% asInteger,
      fall2013AcceptanceRate = column("td[5]/text()[1]") %>% fromPercent,
      averageFreshmanRetentionRate = column("td[6]/text()[1]") %>% fromPercent,
      sixYearGraduationRate = column("td[7]/text()[1]") %>% fromPercent,
      stringsAsFactors = FALSE
    )
  }) %>%
  # Stack the per-page data frames into one
  do.call(rbind, .)
You may change Note that in the later pages, some cells have tips (which is annoying) so that I have to use
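To illustrate why the `text()[1]` step in the XPath expressions above matters, here is a minimal sketch on made-up HTML. The markup and the `tip` class are assumptions for illustration only, not copied from the rankings page. It compares the text of a whole cell with the text of its first text node:

```r
library(rvest)

# Made-up table row: the second cell carries an extra "tip" element after its value.
page <- read_html('
  <table><tbody>
    <tr>
      <td>27,000</td>
      <td>7.4%<span class="tip">Fall 2013 data</span></td>
    </tr>
  </tbody></table>')

rows <- html_nodes(page, "table tbody tr")

# Whole cell: the tip text is glued onto the value.
whole <- rows %>% html_node(xpath = "td[2]") %>% html_text(trim = TRUE)

# First text node only: just the value, without the tip.
first <- rows %>% html_node(xpath = "td[2]/text()[1]") %>% html_text(trim = TRUE)

print(whole)
print(first)
```

With the whole cell, a helper like `fromPercent` would be handed a string that still contains the tip text; with `text()[1]` it only sees the value.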
Very impressive. I need to go over it line by line to make sure I understand how you did your magic.
Thanks, @ignacio82! If you have any questions about it, just ask here. Let me first point out the basic knowledge you need:
You don't have to dive deep, but do get to know the very basics. You don't have to be a professional web developer just to scrape some web pages.
Where can I read about:
http://www.w3schools.com/ offers good, basic tutorials on a wide variety of web topics. You can quickly go through HTML, CSS and XPath there.

Basically speaking, HTML is the markup language behind web pages; it defines the content and structure of a page. A web page like the rankings one is described by a deeply nested collection of tags expressed in plain text, so that your web browser can receive the text from the server, analyze its structure and figure out how to render it.

CSS is a language for writing style sheets: rules that match tags or classes in the HTML, so that different groups of elements can have different styles (color, border, etc.) without redundant inline style declarations on each element. A CSS selector helps the browser (and us) select a particular group of elements in the web page.

Note that HTML is very close to XML, which is used to store and transmit data between different services. XML has no predefined tags, whereas HTML defines a set of tags so that the browser knows how to interpret an element from its tag name. XPath is a flexible and powerful language for describing a query for a particular set of nodes in XML, and it mostly applies to HTML as well.

Once you understand the basic motivation and know-how behind these three, you are ready to get started scraping web pages :)
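As a rough illustration of how a CSS selector and an XPath query can pick out the same elements, here is a small sketch with rvest. The HTML fragment, the school names and the `schools` class are invented for the example and have nothing to do with the actual rankings page:

```r
library(rvest)

# Made-up page fragment, not taken from the rankings site.
page <- read_html('
  <html><body>
    <ul class="schools">
      <li><a href="/a">Alpha University</a> <span class="rank">#1</span></li>
      <li><a href="/b">Beta College</a> <span class="rank">#2</span></li>
    </ul>
  </body></html>')

# CSS selector: every <a> inside an <li> of the list with class "schools".
names_css <- page %>% html_nodes("ul.schools li a") %>% html_text()

# XPath: the same nodes, described as a path through the document tree.
names_xpath <- page %>%
  html_nodes(xpath = "//ul[@class='schools']/li/a") %>%
  html_text()

# Attributes of the selected nodes are read with html_attr().
links <- page %>% html_nodes("ul.schools li a") %>% html_attr("href")

print(names_css)
print(names_xpath)
print(links)
```

CSS selectors tend to be shorter for class-based lookups, while XPath can also address things CSS cannot, such as individual text nodes.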
Is it possible to scrape the college rankings data from this website
http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data
using rvest? Can you point me to a tutorial or example that can get me started?
Thanks!