Scraping college rankings data? #17

ignacio82 · 2014-09-16T22:23:13Z

Is it possible to scrape the college rankings data from this website
http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data
using rvest? Can you point me to a tutorial or example that can get me started?

Thanks!

renkun-ken · 2014-09-17T01:30:51Z

Using the following packages:

rvest: for easy scraping
stringr: for easy string manipulation

library(rvest)    # devtools::install_github("hadley/rvest")
library(stringr)  # install.packages("stringr")

url <- "http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data/page+%d"

trimNewline <- function(p) str_replace(p, "\n","")
asInteger <- function(x) as.integer(str_replace_all(x, ",", ""))  # drop every thousands separator
fromPercent <- function(x) as.numeric(str_replace(x, "%", ""))/100

table <- 1:3 %>%
  lapply(function(page) {
    nodes <- sprintf(url, page) %>%
      read_html() %>%  # read_html() replaces the deprecated html()
      html_nodes("table tbody tr")

    column <- function(xpath) nodes %>% html_node(xpath = xpath) %>% html_text(trim = TRUE)

    data.frame(
      rank = column("td[1]/div[1]/span") %>%
        str_replace("#(\\d+)(Tie)?","\\1") %>%
        as.integer,
      score = column("td[1]/span[1]/span") %>%
        str_replace("(\\d+) out of 100.","\\1") %>%
        asInteger,
      name = column("td[2]/a"),
      location = column("td[2]/p/text()[1]"),
      tuitionAndFees = column("td[3]/text()[1]") %>% trimNewline,
      totalEnrollment = column("td[4]/text()[1]") %>% asInteger,
      fall2013AcceptanceRate = column("td[5]/text()[1]") %>% fromPercent,
      averageFreshmanRetentionRate = column("td[6]/text()[1]") %>% fromPercent,
      sixYearGraduationRate = column("td[7]/text()[1]") %>% fromPercent,
      stringsAsFactors = FALSE
    )
  }) %>%
  do.call(rbind, .)

> head(table)
  rank score                  name      location tuitionAndFees
1    1   100  Princeton University Princeton, NJ        $41,820
2    2    99    Harvard University Cambridge, MA        $43,938
3    3    98       Yale University New Haven, CT        $45,800
4    4    95   Columbia University  New York, NY        $51,008
5    4    95   Stanford University  Stanford, CA        $44,757
6    4    95 University of Chicago   Chicago, IL        $48,253
  totalEnrollment fall2013AcceptanceRate averageFreshmanRetentionRate
1            8014                  0.074                         0.98
2           19882                  0.058                         0.97
3           12109                  0.069                         0.99
4           23606                  0.069                         0.99
5           18136                  0.057                         0.98
6           12539                  0.088                         0.99
  sixYearGraduationRate
1                  0.97
2                  0.97
3                  0.98
4                  0.96
5                  0.96
6                  0.93

You can change 1:3 to cover more pages, e.g. 1:11 for all 11. :)

Note that on the later pages some cells contain tips (which is annoying), so I use XPath expressions such as td[5]/text()[1] to ensure that only the first text node of the cell is selected.
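To see why text()[1] matters, here is a minimal sketch on a made-up HTML fragment. The markup, the "tip" class and the tip text are assumptions for illustration, not copied from the real page. It compares the text of the whole cell with the text of its first text node only.

```r
library(rvest)

# Hypothetical fragment: the second row's cell carries an extra "tip" element,
# loosely imitating the cells on the later pages of the rankings table.
page <- read_html('
  <table><tbody>
    <tr><td>7.4%</td></tr>
    <tr><td>
      8.8%
      <span class="tip">Data reported for a prior year</span>
    </td></tr>
  </tbody></table>')

rows <- page %>% html_nodes("table tbody tr")

# Text of the whole cell: the tip text is included for the second row.
whole_cell <- rows %>% html_node(xpath = "td[1]") %>% html_text(trim = TRUE)

# Text of the first text node only: the tip is left out.
first_text <- rows %>% html_node(xpath = "td[1]/text()[1]") %>% html_text(trim = TRUE)

print(whole_cell)
print(first_text)
```

With the whole-cell version, a helper like fromPercent would fail on the second row, because the string is no longer just a percentage.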

ignacio82 · 2014-09-17T12:05:04Z

Very impressive. I need to go over it line by line to make sure I understand how you did your magic.
Thanks!

renkun-ken · 2014-09-17T12:09:41Z

Thanks, @ignacio82! If you have any questions about it, just ask here. Let me first point out the basic knowledge you need:

HTML
CSS selector
XPath selector
Regular expression

You don't have to dive deep; just get to know the very basics. You don't have to be a professional web developer to scrape a few web pages.
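As a small illustration of the regular-expression item, the patterns from the scraping code above can be tried on their own. The input strings here are made up to resemble the raw cell text and are not taken from the live page.

```r
library(stringr)

# Made-up samples of the raw rank and score text.
rank_raw  <- c("#1", "#4Tie")
score_raw <- c("100 out of 100.", "95 out of 100.")

# "\\1" is a backreference to the first parenthesised group, i.e. the digits.
rank  <- as.integer(str_replace(rank_raw, "#(\\d+)(Tie)?", "\\1"))
score <- as.integer(str_replace(score_raw, "(\\d+) out of 100.", "\\1"))

# The same idea without groups: strip a character, then convert.
enrollment <- as.integer(str_replace_all("19,882", ",", ""))
acceptance <- as.numeric(str_replace("7.4%", "%", "")) / 100

print(rank)
print(score)
print(enrollment)
print(acceptance)
```

Each pattern matches the whole string and keeps only the captured digits, which is why the result can be passed straight to as.integer.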

ignacio82 · 2014-09-17T12:11:03Z

Where can I read about:

CSS selector
XPath selector
?
This is the first time I've heard about that stuff...

renkun-ken · 2014-09-17T12:37:55Z

http://www.w3schools.com/ offers great and basic tutorials on a wide variety of web stuff.

You can quickly go through HTML, CSS and XPath.

Basically speaking,

HTML is the markup language behind web pages; it defines the content and layout of a page. A web page like the rankings page is described by a deeply nested collection of tags expressed in plain text, so your web browser can receive the text from the server, analyze its structure, and figure out how to render it.

CSS is a language for defining style sheets whose rules match tags or classes in the HTML, so that different groups of elements can have different styles (color, border, etc.) without redundant inline style declarations on each element. A CSS selector helps the browser (and us) select a particular group of elements in the web page.
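A minimal sketch of a CSS selector in rvest, assuming a made-up list where the class names "school" and "note" are invented for illustration:

```r
library(rvest)

# Hypothetical fragment: two items share the class "school", one does not.
page <- read_html('
  <ul>
    <li class="school">Princeton University</li>
    <li class="school">Harvard University</li>
    <li class="note">Updated yearly</li>
  </ul>')

# "li.school" selects every <li> element whose class is "school".
schools <- page %>% html_nodes("li.school") %>% html_text()

print(schools)
```

The same selector string would work in a style sheet to style those two items, which is why CSS selectors are a convenient way to pick out elements when scraping.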

Note that HTML is very close to XML, which is used to store and transmit data between different services. XML has no predefined tags, whereas HTML defines a set of tags so that the browser knows how to interpret an element by its tag name. XPath is a flexible and powerful language for describing a query for a particular set of nodes in XML, and it mostly applies to HTML as well.
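A minimal sketch of XPath queries in rvest, again on a made-up table (the rows and the href values are assumptions). It shows selection by position and by the content of a sibling cell, which is where XPath tends to be handier than a CSS selector.

```r
library(rvest)

# Hypothetical two-row table: rank in the first cell, linked name in the second.
page <- read_html('
  <table>
    <tr><td>1</td><td><a href="/princeton">Princeton University</a></td></tr>
    <tr><td>2</td><td><a href="/harvard">Harvard University</a></td></tr>
  </table>')

# By position: the link inside the second cell of every row.
school_names <- page %>% html_nodes(xpath = "//tr/td[2]/a") %>% html_text()

# By content: the link in the row whose first cell is "2".
second_href <- page %>%
  html_nodes(xpath = "//tr[td[1] = '2']/td[2]/a") %>%
  html_attr("href")

print(school_names)
print(second_href)
```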

So you need to understand the basic motivation and know-how to get started scraping web pages :)

hadley closed this as completed Nov 21, 2014
