Scraping college rankings data? #17

Closed
ignacio82 opened this issue Sep 16, 2014 · 5 comments

Comments

@ignacio82

Is it possible to scrape the college rankings data from this website
http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data
using rvest? Can you point me out to a tutorial or example that can get me started?

Thanks!

@renkun-ken
Contributor

Using the following packages:

  • rvest: for easy scraping
  • stringr: for easy string manipulation

library(rvest)    # install.packages("rvest")
library(stringr)  # install.packages("stringr")

url <- "http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data/page+%d"

# str_replace_all removes every match; str_replace would only remove the first one
trimNewline <- function(p) str_replace_all(p, "\n", "")
asInteger <- function(x) as.integer(str_replace_all(x, ",", ""))
fromPercent <- function(x) as.numeric(str_replace(x, "%", ""))/100

table <- 1:3 %>%
  lapply(function(page) {
    nodes <- sprintf(url, page) %>%
      read_html() %>%  # html() was renamed to read_html() in later rvest releases
      html_nodes("table tbody tr")

    column <- function(xpath) nodes %>% html_node(xpath = xpath) %>% html_text(trim = TRUE)

    data.frame(
      rank = column("td[1]/div[1]/span") %>%
        str_replace("#(\\d+)(Tie)?","\\1") %>%
        as.integer,
      score = column("td[1]/span[1]/span") %>%
        str_replace("(\\d+) out of 100.","\\1") %>%
        asInteger,
      name = column("td[2]/a"),
      location = column("td[2]/p/text()[1]"),
      tuitionAndFees = column("td[3]/text()[1]") %>% trimNewline,
      totalEnrollment = column("td[4]/text()[1]") %>% asInteger,
      fall2013AcceptanceRate = column("td[5]/text()[1]") %>% fromPercent,
      averageFreshmanRetentionRate = column("td[6]/text()[1]") %>% fromPercent,
      sixYearGraduationRate = column("td[7]/text()[1]") %>% fromPercent,
      stringsAsFactors = FALSE
    )
  }) %>%
  do.call(rbind, .)
> head(table)
  rank score                  name      location tuitionAndFees
1    1   100  Princeton University Princeton, NJ        $41,820
2    2    99    Harvard University Cambridge, MA        $43,938
3    3    98       Yale University New Haven, CT        $45,800
4    4    95   Columbia University  New York, NY        $51,008
5    4    95   Stanford University  Stanford, CA        $44,757
6    4    95 University of Chicago   Chicago, IL        $48,253
  totalEnrollment fall2013AcceptanceRate averageFreshmanRetentionRate
1            8014                  0.074                         0.98
2           19882                  0.058                         0.97
3           12109                  0.069                         0.99
4           23606                  0.069                         0.99
5           18136                  0.057                         0.98
6           12539                  0.088                         0.99
  sixYearGraduationRate
1                  0.97
2                  0.97
3                  0.98
4                  0.96
5                  0.96
6                  0.93

You can change 1:3 to cover more pages, e.g. 1:11 for all 11. :)

Note that on the later pages some cells contain tooltips (which is annoying), so I have to use an XPath such as td[5]/text()[1] to ensure that only the first text node is selected.
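
To see why text()[1] helps, here is a small sketch on a made-up table cell. The HTML string and the tooltip markup are hypothetical, not copied from the site, and it uses xml2, the parser that current rvest releases build on:

```r
library(xml2)  # install.packages("xml2")

# Hypothetical cell: a value followed by a tooltip element, standing in
# for the annotated cells on the later pages.
doc <- read_html("<table><tr><td>7.4%<span class='tip'>Fall 2013</span></td></tr></table>")
cell <- xml_find_first(doc, "//td")

# All text below the cell, tooltip included.
all_text <- xml_text(cell)

# Only the first text node that is a direct child of the cell.
first_text <- xml_text(xml_find_first(cell, "text()[1]"))
```

Here all_text still carries the tooltip wording, while first_text holds just the value that fromPercent can parse.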

@ignacio82
Author

Very impressive. I need to go over it line by line to make sure I understand how you did your magic.
Thanks!

@renkun-ken
Contributor

Thanks, @ignacio82! If you have any questions about it, just ask here. Let me first point out the basic knowledge you need:

  • HTML
  • CSS selector
  • XPath selector
  • Regular expression

You don't have to dive deep, just get to know the very basics. You don't have to be a professional web developer to scrape a few web pages.
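
As a taste of the regular-expression part, here is a minimal base-R sketch of the kind of cleanup the script above does. The input strings are made up for illustration, and sub/gsub stand in for stringr's str_replace:

```r
# Made-up raw cell values, shaped like the ones on the rankings page.
raw_rank       <- c("#1", "#4Tie")
raw_enrollment <- c("8,014", "1,234,567")
raw_rate       <- c("7.4%", "98%")

# Keep the digits after "#", dropping an optional "Tie" suffix.
rank <- as.integer(sub("^#(\\d+)(Tie)?$", "\\1", raw_rank))

# Remove every thousands separator (gsub replaces all matches, sub only the first).
enrollment <- as.integer(gsub(",", "", raw_enrollment, fixed = TRUE))

# Strip the percent sign and rescale to a proportion.
rate <- as.numeric(sub("%", "", raw_rate, fixed = TRUE)) / 100
```

The parentheses in the first pattern form capture groups, and "\\1" in the replacement refers back to the first group, i.e. the digits.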

@ignacio82
Author

Where can I read about the following?

  • CSS selector
  • XPath selector

This is the first time I've heard about that stuff...

@renkun-ken
Contributor

http://www.w3schools.com/ offers great and basic tutorials on a wide variety of web stuff.

You can quickly go through HTML, CSS and XPath.

Basically speaking,

HTML is the markup language behind web pages; it defines the content and layout of a page. A page like the rankings one is described by a deeply nested collection of tags expressed in plain text, so that your web browser can receive the text from the server, analyze its structure and figure out how to render it.

CSS is a style sheet language: its rules match tags or classes in the HTML so that different groups of elements can have different styles (color, border, etc.) without redundant inline style declarations on each element. A CSS selector helps the browser (and us) select a particular group of elements in the web page.
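
A minimal sketch of a CSS selector in rvest, assuming a hypothetical page fragment (the class names and school names are invented for this example):

```r
library(rvest)  # install.packages("rvest")

# Hypothetical page fragment, not taken from the rankings site.
page <- read_html("<table><tr class='school'><td class='name'>Alpha University</td><td class='fee'>$10</td></tr><tr class='school'><td class='name'>Beta College</td><td class='fee'>$20</td></tr></table>")

# "tr.school td.name" selects td elements with class "name"
# that sit inside tr elements with class "school".
school_names <- page %>%
  html_nodes("tr.school td.name") %>%
  html_text()
```

The same selector string would work in a browser's developer tools, which is a convenient place to try selectors out before putting them in a script.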

Note that HTML is very close to XML, which is used to store and transmit data between different services. XML does not predefine any tags, whereas HTML defines a set of tags so that the browser knows how to interpret an element from its tag name. XPath is a flexible and powerful way to describe a query for a particular set of nodes in XML, and it mostly applies to HTML as well.
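
For comparison, a small XPath sketch with xml2 on an invented fragment (the markup, names and links are assumptions for illustration). Paths such as td[1]/a are evaluated relative to each row, much like the column() helper in the script above:

```r
library(xml2)  # install.packages("xml2")

# Hypothetical two-row table, not taken from the rankings site.
doc <- read_html("<table><tr><td><a href='/alpha'>Alpha University</a><p>Springfield</p></td><td>$10</td></tr><tr><td><a href='/beta'>Beta College</a><p>Shelbyville</p></td><td>$20</td></tr></table>")

rows <- xml_find_all(doc, "//tr")

# td[1]/a: the link inside the first cell of each row.
school_names <- xml_text(xml_find_first(rows, "td[1]/a"))

# td[2]: the second cell of each row.
fees <- xml_text(xml_find_first(rows, "td[2]"))

# XPath can reach attributes too; here the href of every link in a cell.
links <- xml_attr(xml_find_all(doc, "//td/a"), "href")
```

Position predicates like [1] and [2] are what make XPath handy for tables whose cells carry no useful class names.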

So you only need to understand the basic motivation and know-how to get started scraping web pages :)

@hadley hadley closed this as completed Nov 21, 2014