In [None]:
#| echo: false
from utils import build_buttons
from importlib import reload
import utils
reload(utils)
utils.build_buttons(link= 'live_scraping', 
                    github= 'https://github.com/yinleon/inspect-element/blob/main/browser_automation.ipynb',
                    colab = False,
                    citation= True)

Live web scraping is a handy tool in any data journalist's toolbox — one more powerful than traditional scraping, but easier to implement than browser automation with Selenium or Playwright.

In this tutorial, I'll cover simple live scraping techniques using the [rvest](https://rvest.tidyverse.org/reference/read_html_live.html) library of the R programming language. No knowledge of R is assumed, but those without any prior familiarity with the language may consider starting with Python-based browser automation.

::: {.callout-note}
The R language can be more difficult than Python for beginning scrapers, and parts of the rvest library covered here are currently in the **experimental** stage of their lifecycle, meaning that they are under active development and may contain bugs. While I find these tools elegant and powerful compared to their Python equivalents, everything done here can also be done with the Python techniques covered in Leon's excellent [browser automation tutorial](https://inspectelement.org/browser_automation.html).
:::

# Intro
Browser automation can feel like magic, but it often requires thinking about a page from the perspective of a browser. rvest takes a more human-centered approach. Behind the scenes, the code we write here will run an automated browser much like those demonstrated in the [browser automation tutorial](https://inspectelement.org/browser_automation.html). But, unless called for, it will remain invisible, and we'll be able to read in data more intuitively.

## Installations
Before getting started, make sure you have a copy of the Google Chrome browser installed. rvest's live scraping tools rely on the chromote package, which drives a local copy of Chrome behind the scenes. If using the R language for the first time, install the latest version of R [here](https://www.r-project.org/). I recommend also installing [RStudio](https://posit.co/download/rstudio-desktop/), an R-focused IDE.

# Building a Scraper
## Getting Started
To get started, do one of the following:

1) Open RStudio (recommended) 

    - Create a new project titled live-scraping-demo 

    - Create an R script called scrape.R

2) Or, open a new Jupyter Notebook and create an R cell

3) Or, (Mac users) open an R session from the Terminal by typing the 'R' command

We're now ready to run our first commands, which will install the libraries we need to complete the rest of the project. These libraries are large packages of pre-written code, largely created by Posit Chief Scientist and R developer Hadley Wickham. This tutorial is in part based on Hadley's excellent NICAR session on live scraping, which you can find [here](https://github.com/hadley/web-scraping).

In [None]:
# General tools for the R language
install.packages("tidyverse")
# Web scraping tools
install.packages("rvest")
# Browser driver that rvest's live scraping relies on
install.packages("chromote")

To access these libraries, we'll have to call them from the body of our R script. Add these lines to your scrape.R file.

In [None]:
library(tidyverse)
library(rvest)

Now that we have the libraries we need, we're ready to start scraping. For demonstration purposes, we'll be collecting data from the [*Forbes* rankings of U.S. colleges](https://www.forbes.com/top-colleges/), but these same techniques can be applied to a wide range of sites. Let's start with the simplest possible scraper: specifying a URL, and reading in that page so that we can work with it. We'll progressively revise this code throughout the tutorial until we've collected all the data we need.

In [137]:
# retrieve the page
page <- read_html("https://www.forbes.com/top-colleges/")

Let's print out the first few lines of the page to verify that we have the right one.

In [138]:
page

{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<div id="__next">\n<div class="ForbesHeader_mainHeader__XuFcZ"><h ...

Everything looks good here. Time to start retrieving our data.

## Retrieving the First Table
Before writing any more code, we'll need to identify the HTML elements that contain our data. Open your browser's developer tools. (In most standard browsers, this can be done by clicking the dot menu in the upper right and selecting "Developer Tools.") I recommend keeping the developer panel open in a split-screen window in order to compare the page source to its visible elements. Using the selector tool in the upper left, click on the central table that contains college listings.

<figure>
<img src="assets/forbes_table-selection.png"
    alt="The college rankings table highlighted with the selector tool in the browser's developer tools"
     style="width:100%" 
    />
</figure>

The table we want has the class `ListTable_listTable__-N5U5`. Typically, it's best practice to use the most specific selector available when choosing elements, to avoid mixups. There are tradeoffs to that approach, however. The suffix `-N5U5` looks largely random, so if Forbes updates this page and the suffix changes, our code will stop working.
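One possible middle ground, sketched here as an aside, is to match only the stable-looking prefix of the class name with a CSS attribute selector. The example below runs offline against a tiny hand-written page built with rvest's `minimal_html`. The HTML snippet is my own stand-in, and the idea that the `ListTable_listTable` prefix would survive a site update is an assumption, not something Forbes guarantees.

```r
library(rvest)

# Hand-written stand-in for the real page (an assumption for illustration)
toy_page <- minimal_html('
  <table class="ListTable_listTable__-N5U5">
    <tr><th>Rank</th><th>Name</th></tr>
    <tr><td>1</td><td>Example College</td></tr>
  </table>
')

# [class^="..."] matches elements whose class attribute starts with the given text
prefix_match <- toy_page |> html_elements('table[class^="ListTable_listTable"]')
length(prefix_match)
```

This keeps the selector tied to the rankings table without depending on the random-looking suffix.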

Instead, I'll recommend a simpler approach. Since the table we're interested in is the only one on the page (that we can see), we can simply select the tag `table`. Let's try it.

In [139]:
page |> html_nodes("table")

{xml_nodeset (0)}

The code returns no results. Why? Despite the fact that we can see the table in the browser, it's not present in the HTML we accessed with the read_html command. That's because this is a "live", or dynamic, page, one that renders significant parts of its layout — including our data — using JavaScript. Normally, scraping such a page would require browser automation. But the power of rvest will allow us to fix the problem by changing only one command.

In [140]:
page <- read_html_live("https://www.forbes.com/top-colleges/")
page |> html_nodes("table")

{xml_nodeset (1)}
[1] <table class="ListTable_listTable__-N5U5">\n<thead><tr>\n<th class="ListT ...
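As an offline illustration of why the first attempt came back empty, here is a minimal sketch using a hand-written page in which the table is created by a script rather than written into the HTML. The page content is invented for the example. A static parse reads the markup but never runs the script, so the table never comes into existence.

```r
library(rvest)

# Hand-written page: the <table> is created by JavaScript, not written in the HTML
toy_dynamic <- minimal_html('
  <div id="app"></div>
  <script>
    var t = document.createElement("table");
    document.getElementById("app").appendChild(t);
  </script>
')

# A static parse never executes the script, so there is no table element to find
static_tables <- toy_dynamic |> html_elements("table")
length(static_tables)
```

A real browser would run the script and build the table, which is the step a live scraper adds.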

When we use the experimental read_html_live command instead of the traditional read_html, the code works as expected. We can now access the table, and all other data on this dynamic page, as a human user would. Let's save the data we retrieved as an R dataframe.

In [141]:
df <- page |> html_node("table") |> html_table()

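We've now retrieved the data we set out to obtain in only five lines. The complete code for this scraper is below. The one addition is `Sys.sleep(1)`, which pauses for a second to give the page time to finish rendering before we read the table: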

In [142]:
library(tidyverse)
library(rvest)
page <- read_html_live("https://www.forbes.com/top-colleges/")
Sys.sleep(1)
df <- page |> html_node("table") |> html_table()
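A fixed `Sys.sleep(1)` is a guess about how long the page needs. A slightly more defensive sketch is to poll until the element shows up and give up after a timeout. `wait_for_element` below is a hypothetical helper of my own, not part of rvest. It assumes that each `html_elements` call re-queries the live page, and it is demonstrated here only on static toy pages so that it runs offline.

```r
library(rvest)

# Hypothetical helper: poll for a CSS selector until it matches or time runs out
wait_for_element <- function(page, css, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Success as soon as at least one matching element exists
    if (length(html_elements(page, css)) > 0) {
      return(TRUE)
    }
    # Give up once the deadline has passed
    if (Sys.time() >= deadline) {
      return(FALSE)
    }
    Sys.sleep(interval)
  }
}

# Offline checks against hand-written pages
found     <- wait_for_element(minimal_html("<table></table>"), "table")
not_found <- wait_for_element(minimal_html("<p>no table</p>"), "table", timeout = 0)
```

With a live page, one might try `wait_for_element(page, "table")` in place of the fixed pause.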

## Data Cleaning
Unfortunately, the table we've retrieved isn't easy to work with. While largely correct, it contains the long-form callouts that *Forbes* includes for some institutions, such as Princeton. While these are interesting, they don't belong in our dataframe. Let's filter them out.

There are many possible approaches to removing junk data, but we'll seek the simplest. By closely examining the table, we can see that the listings we're interested in all have a numeric rank in the 'Rank' column, while the callouts have a rank of NA. Let's filter the dataframe to only include listings with a non-NA rank. We'll use the filter command, which returns the rows of a dataframe for which the provided expression is `TRUE`. We'll specify the Rank column, and require that the Rank value for each row evaluate to some non-NA value.

In [143]:
df <- df |> filter(!is.na(Rank))
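The dollar columns in this table arrive as text such as "$56,639", which can't be summed or sorted numerically. As an optional extra step, here is a minimal sketch that applies the same filter and then converts those columns with readr's `parse_number`. The toy rows are modeled on the Forbes table, and the callout row and its text are made up for illustration.

```r
library(dplyr)
library(readr)

# Toy rows modeled on the Forbes table; the callout row is invented
toy_df <- tibble(
  Rank = c(51L, NA, 52L),
  Name = c("Middlebury College", "A long-form callout", "Tufts University"),
  `Av. Grant Aid` = c("$56,639", NA, "$51,613"),
  `Av. Debt` = c("$8,541", NA, "$7,034")
)

cleaned <- toy_df |>
  # Drop callout rows, which have no rank
  filter(!is.na(Rank)) |>
  # Strip "$" and "," and convert the text to numbers
  mutate(across(c(`Av. Grant Aid`, `Av. Debt`), parse_number))
```

The column names need backticks because they contain spaces and periods.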

## Retrieving all the Tables
Our data table is now clean. Unfortunately, it only contains the first 50 entries from *Forbes'* list. The *Forbes* site is paginated, meaning we'll have to click to a new page to get more data. Luckily, rvest provides a simple syntax for this.

In [144]:
page$click('button[aria-label="Next"]')

With only one line of code, we can select the 'Next' button at the bottom of the page and click it. If we rerun our code for scraping the data table, we'll see that we're now collecting the next page of data.

In [145]:
page |> html_node("table") |> html_table() |> filter(!is.na(Rank)) |> head()

Rank,Name,State,Type,Av. Grant Aid,Av. Debt,Median 10-year Salary,Financial Grade
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
51,Middlebury College,VT,Private not-for-profit,"$56,639","$8,541","$139,500",A-
52,Tufts University,MA,Private not-for-profit,"$51,613","$7,034","$144,400",B+
53,Boston University,MA,Private not-for-profit,"$51,565","$10,425","$144,400",B
54,Wesleyan University,CT,Private not-for-profit,"$59,825","$9,716","$149,200",B+
55,William & Mary,VA,Public,"$15,134","$8,882","$134,200",
56,Barnard College,NY,Private not-for-profit,"$53,817","$7,339","$149,300",A-


Unfortunately, the raw table on this page has the same data quality problems as the first one, so we had to repeat our filter. Let's define a simple function for collecting ranking data from a page so that we don't have to repeat ourselves.

In [146]:
get_college_rankings <- function(page) {
    df <- page |> html_node("table") |> html_table()
    df <- df |> filter(!is.na(Rank))
    return(df)
}
get_college_rankings(page) |> head()

Rank,Name,State,Type,Av. Grant Aid,Av. Debt,Median 10-year Salary,Financial Grade
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
51,Middlebury College,VT,Private not-for-profit,"$56,639","$8,541","$139,500",A-
52,Tufts University,MA,Private not-for-profit,"$51,613","$7,034","$144,400",B+
53,Boston University,MA,Private not-for-profit,"$51,565","$10,425","$144,400",B
54,Wesleyan University,CT,Private not-for-profit,"$59,825","$9,716","$149,200",B+
55,William & Mary,VA,Public,"$15,134","$8,882","$134,200",
56,Barnard College,NY,Private not-for-profit,"$53,817","$7,339","$149,300",A-


We can now retrieve data from each page of results with only one command. Let's use that new shortcut to scrape all ten pages of data. Start by scraping the first page.

In [109]:
page <- read_html_live("https://www.forbes.com/top-colleges/")
Sys.sleep(1)
df <- get_college_rankings(page)

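Now, we'll loop through the remaining pages, clicking the Next button and then adding each new page of results to the main table. The loop condition works because `html_attr` returns `NA` when an element lacks the requested attribute, so the loop keeps running for as long as the button has no `disabled` attribute. We'll stop the first time we find that the Next button has been disabled, indicating that there are no more pages available.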

In [147]:
while (is.na(page |> html_node('button[aria-label="Next"]') |> html_attr("disabled"))) {
    page$click('button[aria-label="Next"]')
    Sys.sleep(1)
    df <- rbind(df, get_college_rankings(page))
}
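One caveat with this condition: if the selector ever stops matching, `html_attr` also returns `NA`, so the check cannot tell an enabled button apart from a missing one, and the loop would go on to try clicking a button that isn't there. A hedged sketch of a stricter check follows. `has_next_page` is a hypothetical helper of mine, and it is tested here only on hand-written static pages rather than the live *Forbes* site.

```r
library(rvest)

# Hypothetical helper: TRUE only when a Next button exists and is not disabled
has_next_page <- function(page, css = 'button[aria-label="Next"]') {
  buttons <- html_elements(page, css)
  # No matching button at all: treat as "no more pages"
  if (length(buttons) == 0) {
    return(FALSE)
  }
  # html_attr() gives NA when the attribute is absent, i.e. the button is enabled
  is.na(html_attr(buttons[[1]], "disabled"))
}

# Hand-written pages covering the three cases
enabled_page  <- minimal_html('<button aria-label="Next">Next</button>')
disabled_page <- minimal_html('<button aria-label="Next" disabled="disabled">Next</button>')
missing_page  <- minimal_html('<p>No pagination here</p>')
```

In the loop, the condition could then read `while (has_next_page(page))`.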

As a final step, we'll export the data we collected to a CSV file for later use. Our complete program for scraping the *Forbes* college rankings can be written in fewer than twenty lines:

In [132]:
library(tidyverse)
library(rvest)

get_college_rankings <- function(page) {
    df <- page |> html_node("table") |> html_table()
    df <- df |> filter(!is.na(Rank))
    return(df)
}

page <- read_html_live("https://www.forbes.com/top-colleges/")
Sys.sleep(1)
df <- get_college_rankings(page)

while (is.na(page |> html_node('button[aria-label="Next"]') |> html_attr("disabled"))) {
    page$click('button[aria-label="Next"]')
    Sys.sleep(1)
    df <- rbind(df, get_college_rankings(page))
}

write_csv(df, "forbes-rankings.csv")

As I wrote above, read_html_live is an experimental tool, and more features are in active development. For more information, see Hadley Wickham's documentation, including [a summary of live scraping in rvest](https://rvest.tidyverse.org/reference/read_html_live.html) and [an overview of commands for interacting with a live page's UI](https://rvest.tidyverse.org/reference/LiveHTML.html). Feel free to reach out to me at [declanrjb@gmail.com](mailto:declanrjb@gmail.com) with any questions or suggestions for future tutorials.

# Citation

To cite this chapter, please use the following BibTeX entry:

<pre>
@incollection{inspect2023browser,
  author    = {Bradley, Declan},
  title     = {Live Scraping With R},
  booktitle = {Inspect Element: A practitioner's guide to auditing algorithms and hypothesis-driven investigations},
  year      = {2024},
  editor    = {Yin, Leon and Sapiezynski, Piotr and Raji, Inioluwa Deborah},
  note      = {\url{https://inspectelement.org}}
}
</pre>

## Acknowledgements

Thank you to Leon Yin for publishing this tutorial, and to Hadley Wickham for developing these tools and taking the time to teach journalists like me how to use them.