<a href="https://colab.research.google.com/github/xvariable/rvest_tutorial/blob/master/rvest_tutorial_dmi_winterschool_2020_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial Webscraping in R

Jule Scheper, M.A. & Dr. Julia Niemann-Lenz<br />Hanover University of Music, Drama and Media | Department of Journalism and Communication Research <br /><br />Solution


# 0. Prerequisites

*   Tutorial code as a Jupyter notebook, download here:
*   Register at https://colab.research.google.com, upload your copy of the notebook there and open it
*   Selector Gadget plugin for Google Chrome: https://chrome.google.com/webstore/detail/selectorgadget/

# 1. What is Webscraping?


<b>Saving data from websites automatically</b> <br /> Step 1: Calling up a website <br />Step 2: Extracting the relevant content<br /><br /> <b>Step 1: Calling up a website</b> <br /> Ideally, all data is located on a single web page and can be extracted directly. If the data is spread across several web pages, the links to those pages can be extracted as well and called up one after the other.<br /><br /><b>Step 2: Extracting the relevant content</b><br />Textual data on websites is available as HTML. Using HTML tags and CSS selectors, you can select specific content elements and save them.




# 2. HTML & CSS


<b>HTML (Hypertext Markup Language)</b> <br /> 
<li>Markup language that describes the semantic structure of digital documents</li> 
<li>Formatting is done by the web browser or by CSS</li> 
<li>Closely related to XML (both derive from SGML)</li> 
<li>Built from "tags"</li><br /> 
<b>CSS (Cascading Style Sheets)</b>
<li>Design instructions for the HTML</li><br /><b>

Example HTML:
```
<html>
<body>
<h1>Star Wars Intro</h1>
<p class = "sw_text">A long time ago in a galaxy far, far away....</p>
...
```

Example CSS class:
```
p.error {
  font-family: courier;
  color: red;
  font-size: 160%;
}
```
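To see how tags and classes turn into selectors, here is a minimal sketch that parses the HTML example above from a string instead of a live page. The object `sw_html` and the second paragraph are made up for illustration. A plain tag name such as `p` selects by tag, while a leading dot such as `.sw_text` selects by CSS class.

```r
library(rvest)

# Hypothetical snippet based on the HTML example above, parsed from a string
sw_html <- read_html('<html><body>
  <h1>Star Wars Intro</h1>
  <p class="sw_text">A long time ago in a galaxy far, far away....</p>
  <p>Some other paragraph</p>
</body></html>')

# Select by tag name: every <p> element
all_paragraphs <- html_text(html_nodes(sw_html, "p"))

# Select by CSS class: only elements with class "sw_text"
intro_text <- html_text(html_nodes(sw_html, ".sw_text"))

length(all_paragraphs)
intro_text
```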



</b><b>Exercise</b> <br />Visit https://www.imdb.com/title/tt2527338/?ref_=nv_sr_srsg_1 and activate the Selector Gadget in Chrome. <li>Find the HTML tag or CSS class that identifies the rating (6.9) of Star Wars Episode IX</li> <li>How can you select several elements at once?</li> <li>How can you deselect elements?</li>




# 3. Webscraping with R | Scraping content from a single web page

For web scraping you need the R package rvest (Easily Ha<b>rvest</b> (Scrape) Web Pages).<br /> Important notes and documentation on rvest: https://cran.r-project.org/web/packages/rvest/rvest.pdf<br /> <br /> We practice web scraping using the website of the EU Parliament. As a first exercise, we want to scrape the names of the members of parliament, starting with only the names that are visible on the first page of the overview (https://www.europarl.europa.eu/meps/en/full-list). <br /> <br />Run the following code to scrape the names of the members of parliament visible on the first page.





In [0]:
# install.packages("rvest")
library(rvest)

# Store the URL you want to scrape as a new object 
eu_url <- "https://www.europarl.europa.eu/meps/en/full-list"

# Download HTML code from that URL and save it in a new object
eu_html <- read_html(eu_url, encoding = "UTF-8")

# View the source code we just downloaded: 
eu_html

{html_document}
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body xml:lang="en" lang="en" data-isie="true">\r\n\t<div id="website">\r ...

Now we have extracted the whole HTML code of the web page. But if we only want to extract certain content - such as the names of the members of parliament - we have to address certain HTML nodes that contain this content. 

Since we don't know how the node can be identified, we use the Selector Gadget in Chrome to find the CSS class that the site uses to format the parliamentarians' names.

In [0]:
# Exercise: Find the CSS class used for the names of parliament members
eu_names_node <- html_nodes(eu_html,".member-name") 

# View the list of html_nodes:
eu_names_node

{xml_nodeset (37)}
 [1] <span class="ep_name member-name">Magdalena ADAMOWICZ</span>
 [2] <span class="ep_name member-name">Asim ADEMOV</span>
 [3] <span class="ep_name member-name">Isabella ADINOLFI</span>
 [4] <span class="ep_name member-name">Matteo ADINOLFI</span>
 [5] <span class="ep_name member-name">Alex AGIUS SALIBA</span>
 [6] <span class="ep_name member-name">Mazaly AGUILAR</span>
 [7] <span class="ep_name member-name">Clara AGUILERA</span>
 [8] <span class="ep_name member-name">Scott AINSLIE</span>
 [9] <span class="ep_name member-name">Alexander ALEXANDROV YORDANOV</span>
[10] <span class="ep_name member-name">François ALFONSI</span>
[11] <span class="ep_name member-name">Atidzhe ALIEVA-VELI</span>
[12] <span class="ep_name member-name">Christian ALLARD</span>
[13] <span class="ep_name member-name">Abir AL-SAHLANI</span>
[14] <span class="ep_name member-name">Álvaro AMARO</span>
[15] <span class="ep_name member-name">Andris AMERIKS</span>
[16] <span class="ep_name member-na

Alright, the names are here. But the HTML tags are still wrapped around the parliamentarians' names. To get rid of any HTML, we can use the function `html_text()`:

In [0]:
#Now we can extract the text of the specific node
eu_parl_names <- html_text(eu_names_node)

#Viewing list of names:
eu_parl_names
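Scraped text often carries leading or trailing whitespace from the page layout. As a small sketch on a made-up snippet (the names and the `demo_html` object are illustrative and not taken from the EU Parliament site), `html_text()` accepts `trim = TRUE` to strip it:

```r
library(rvest)

# Hypothetical snippet with untidy whitespace around the names
demo_html <- read_html('<div>
  <span class="member-name">  Jane DOE </span>
  <span class="member-name">
    John ROE
  </span>
</div>')

demo_nodes <- html_nodes(demo_html, ".member-name")

# Without trimming, the surrounding whitespace is kept
raw_names <- html_text(demo_nodes)

# With trim = TRUE, leading and trailing whitespace is removed
clean_names <- html_text(demo_nodes, trim = TRUE)
clean_names
```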

<b>Exercise</b><br /> Now you know how to scrape text from a single web page. Try to reproduce the code and scrape the country of origin of every member of parliament (instead of just the name).

In [0]:
#Fill in your code (you can copy the code from above and simply change the specific elements)
eu_parl_countries <- html_nodes(eu_html,".ep-layout_country .ep_name") 
eu_parl_countries <- html_text(eu_parl_countries)

# We can now combine the two variables together and have a nice little dataset:
data <- cbind(eu_parl_names, eu_parl_countries)
head(data)

eu_parl_names,eu_parl_countries
Magdalena ADAMOWICZ,Poland
Asim ADEMOV,Bulgaria
Isabella ADINOLFI,Italy
Matteo ADINOLFI,Italy
Alex AGIUS SALIBA,Malta
Mazaly AGUILAR,Spain
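Note that `cbind()` on two character vectors returns a character matrix rather than a data frame. If you prefer named columns that can be accessed with `$`, `data.frame()` is one alternative. The sketch below uses three values copied from the output above in place of the scraped vectors; the `demo_` object names are made up.

```r
# Stand-ins for the scraped vectors
demo_names <- c("Magdalena ADAMOWICZ", "Asim ADEMOV", "Isabella ADINOLFI")
demo_countries <- c("Poland", "Bulgaria", "Italy")

# cbind() yields a character matrix
demo_matrix <- cbind(demo_names, demo_countries)
class(demo_matrix)

# data.frame() yields a data frame with one column per variable
demo_df <- data.frame(name = demo_names, country = demo_countries,
                      stringsAsFactors = FALSE)
demo_df$country
```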


# 4. Webscraping with R | Scraping text from multiple web pages

By now you know how to scrape text from a single URL. But what if we don't want to scrape the names of the members of parliament, but their dates of birth? These are not on the first page, but at the bottom of each member's page. <br /><br />To get them, we need to visit the subpage of every parliamentarian and extract the birth date from there. But first we need a list of the links to all subpages we want to visit.
Afterwards we can use a loop to visit each subpage one after the other and extract the date of birth from there.

But how do we get a list of the URLs? They are not displayed on the page, but they are in the source code. Links are stored in the `href` attribute of the HTML tag `<a>`. We can extract attributes of tags using the rvest function `html_attr()`.


In [0]:
# We need a slightly different node this time:
eu_links_nodes <- html_nodes(eu_html,".single-member-container a") 

# Now using the html_attr() function on the new nodes
# we are telling it to extract the "href" attribute:
eu_links <- html_attr(eu_links_nodes, "href")

# The links are relative, so we prepend the base URL:
eu_links <- paste0("https://www.europarl.europa.eu", eu_links)

# View the list of links:
eu_links
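Prefixing with `paste0()` assumes that every `href` is a relative path. If a page mixes relative and absolute links, prefixing all of them would corrupt the absolute ones. A possible alternative, sketched here on a made-up snippet (the paths and the external URL are hypothetical), is `url_absolute()` from the xml2 package, which rvest builds on:

```r
library(rvest)
library(xml2)

# Hypothetical snippet with one relative and one absolute link
demo_html <- read_html('<div class="single-member-container">
  <a href="/meps/en/12345">Jane DOE</a>
  <a href="https://example.org/profile">John ROE</a>
</div>')

demo_hrefs <- html_attr(html_nodes(demo_html, ".single-member-container a"), "href")

# Resolve each link against the base URL; absolute links stay unchanged
demo_links <- url_absolute(demo_hrefs, "https://www.europarl.europa.eu")
demo_links
```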

Now we have the URLs to the subpages of the members of parliament. We can use the `read_html()` function again to scrape text (birth dates) from each subpage. 
But since we want to scrape several pages, we would have to adapt the code for each page and execute them one after the other. We can work around this with a for loop that automatically executes the same code for each page in turn.

In [0]:
# Before we start our for loop, we create an empty container object.
# The for loop fills this container with content (in our case the dates of birth).
eu_birthdates <- character()

# The loop runs once for every element i of the link list.
for (i in seq_along(eu_links)) {

  # Download HTML of the current parliamentarian's subpage
  html_i <- read_html(eu_links[i], encoding = "UTF-8")

  # Find the birth date node (identified with the Selector Gadget)
  birth_node_i <- html_nodes(html_i, "#birthDate")

  # Now we extract the text of our specific node
  birth_i <- html_text(birth_node_i)

  # Not every parliamentarian has a birth date on their page.
  # If the node is missing, we store NA instead
  if (length(birth_i) == 0) {
    birth_i <- NA
  }

  print(birth_i)

  # We append the new birth date to our previously empty container object
  eu_birthdates <- append(eu_birthdates, birth_i)
}

# We now have a list of names and a list of birth dates. 
# Let's bind them to our data set
data <- cbind(data, eu_birthdates)

# Outputting content to our console
head(data)

# Within a for loop you can include multiple nodes from which you want to extract content. 
# For each content you want to scrape, you only need to create an empty container variable first, which is then filled in the for loop. 

[1] NA
[1] "03-12-1968"
[1] NA
[1] "24-12-1963"
[1] "31-01-1989"
[1] "20-09-1949"
[1] "03-01-1964"
[1] "27-12-1968"
[1] "13-02-1952"
[1] "14-09-1953"
[1] "18-09-1981"
[1] "31-03-1964"
[1] "18-05-1976"
[1] "25-05-1953"
[1] "05-03-1961"
[1] NA
[1] "16-04-1962"
[1] "20-02-1986"
[1] "14-04-1960"
[1] "03-07-1984"
[1] "07-02-1979"
[1] "12-03-1963"
[1] "08-11-1958"
[1] "01-10-1956"
[1] "14-01-1953"
[1] "17-12-1966"
[1] "30-06-1970"
[1] "25-09-1974"
[1] "30-12-1971"
[1] "28-06-1973"
[1] "09-06-1964"
[1] "27-03-1967"
[1] "22-12-1989"
[1] NA
[1] "16-05-1963"
[1] "08-05-1966"
[1] "20-01-1976"


eu_parl_names,eu_parl_countries,eu_birthdates
Magdalena ADAMOWICZ,Poland,
Asim ADEMOV,Bulgaria,03-12-1968
Isabella ADINOLFI,Italy,
Matteo ADINOLFI,Italy,24-12-1963
Alex AGIUS SALIBA,Malta,31-01-1989
Mazaly AGUILAR,Spain,20-09-1949
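The loop body can also be wrapped in a small helper, which makes the missing-value handling easy to check without hitting the website. The following is only a sketch: `extract_birthdate()` is a hypothetical helper name, and the two HTML strings stand in for downloaded member pages.

```r
library(rvest)

# Hypothetical helper: return the birth date text, or NA if the node is missing
extract_birthdate <- function(html) {
  birth <- html_text(html_nodes(html, "#birthDate"))
  if (length(birth) == 0) NA_character_ else birth[1]
}

# Stand-ins for two member pages, one with and one without a birth date
demo_pages <- list(
  read_html('<div><span id="birthDate">03-12-1968</span></div>'),
  read_html('<div><span>No date given</span></div>')
)

# vapply() returns a character vector with exactly one entry per page
demo_birthdates <- vapply(demo_pages, extract_birthdate, character(1))
demo_birthdates
```

Because the helper always returns exactly one value, the result stays aligned with the list of pages even when a birth date is missing.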


<b>Exercise</b><br /> So far we have only scraped the subpages of parliamentarians who are listed on the first page. And we know only a few things about them: their name, their country and their birth date. 

But what if that's not enough? We probably want to create a dataset of all members of the parliament (from A to Z). Also, we need more information, e.g. on their parliamentary group and party. 

This is your task! Use the following script to create a small dataset of all EU parliamentarians including the variables name, birth_date, group, country and party. 

In [0]:
# First find the right link for your URL
url_fulllist <- "https://www2.europarl.europa.eu/meps/en/full-list/all"

url <- read_html(url_fulllist, encoding = "UTF-8")

# Now extract the variables eu_names, eu_group, eu_country and eu_party
eu_names_nodes <- html_nodes(url, ".member-name")
eu_names <- html_text(eu_names_nodes)

eu_group_nodes <- html_nodes(url, ".ep-layout_group .ep_name")
eu_group <- html_text(eu_group_nodes)

eu_country_nodes <- html_nodes(url, ".ep-layout_country .ep_name")
eu_country <- html_text(eu_country_nodes) 

eu_party_nodes <- html_nodes(url, ".ep-layout_party .ep_name")
eu_party <- html_text(eu_party_nodes) 

# We need a slightly different node to scrape all links:
eu_links_nodes <- html_nodes(url,".single-member-container a") 

# Now using the html_attr() function on the new nodes
# we are telling it to extract the "href" attribute:
eu_links <- html_attr(eu_links_nodes, "href")

# The links are relative, so we prepend the base URL:
eu_links <- paste0("https://www.europarl.europa.eu", eu_links)

eu_birthdates <- character()
# Insert a for loop that scrapes all the birth dates
for (i in seq_along(eu_links)) {

  # Download HTML of the current parliamentarian's subpage
  html_i <- read_html(eu_links[i], encoding = "UTF-8")

  # Find the birth date node (identified with the Selector Gadget)
  birth_node_i <- html_nodes(html_i, "#birthDate")

  # Now we extract the text of our specific node
  birth_i <- html_text(birth_node_i)

  # Store NA if the page has no birth date
  if (length(birth_i) == 0) {
    birth_i <- NA
  }

  # We append the new birth date to our previously empty container object
  eu_birthdates <- append(eu_birthdates, birth_i)
}

# combine all variables into one matrix
data <- cbind(eu_names,eu_group,eu_country,eu_party,eu_birthdates)

# View dataset
head(data)

eu_names,eu_group,eu_country,eu_party,eu_birthdates
Magdalena ADAMOWICZ,Group of the European People's Party (Christian Democrats),Poland,Independent,
Asim ADEMOV,Group of the European People's Party (Christian Democrats),Bulgaria,Citizens for European Development of Bulgaria,03-12-1968
Isabella ADINOLFI,Non-attached Members,Italy,Movimento 5 Stelle,
Matteo ADINOLFI,Identity and Democracy Group,Italy,Lega,24-12-1963
Alex AGIUS SALIBA,Group of the Progressive Alliance of Socialists and Democrats in the European Parliament,Malta,Partit Laburista,31-01-1989
Mazaly AGUILAR,European Conservatives and Reformists Group,Spain,VOX,20-09-1949


# 5. Analysis

What to do with your scraped data? For example, you can calculate the average age per parliamentary group. 


In [0]:
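library(dplyr)    # filter()
library(ggplot2)  # ggplot(), stat_summary()

# Birth dates have the form "DD-MM-YYYY", so characters 7 to 10 hold the year.
# Age is approximated as of 2020.
age <- 2020 - as.integer(substr(eu_birthdates, 7, 10))
summary(age)

# Now we have the age of every parliament member. To get the average age per
# parliamentary group, we compute the mean within each group and ignore missing values
tapply(age, eu_group, mean, na.rm = TRUE)

# data.frame() keeps age numeric; cbind() would convert it to character
data <- data.frame(eu_group, age)
summary(data)

# Finally, we can visualize the mean (point) and the range (line) per group.
# The arguments fun, fun.min and fun.max require ggplot2 version 3.3.0 or later
ggplot(filter(data, !is.na(age))) +
  stat_summary(
    aes(x = eu_group, y = age),
    fun.min = min,
    fun.max = max,
    fun = mean
  ) +
  coord_flip()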
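Subtracting the birth year from 2020 ignores whether the birthday has already passed. As a hedged alternative, the dates can be parsed with `as.Date()` and compared to a reference date. The reference date below (1 January 2020) and the sample values are assumptions chosen for illustration.

```r
# Sample birth dates in the scraped "DD-MM-YYYY" format, including a missing one
demo_birthdates <- c("03-12-1968", NA, "31-01-1989")

# Parse to Date objects; NA stays NA
demo_dates <- as.Date(demo_birthdates, format = "%d-%m-%Y")

# Assumed reference date for the age calculation
ref_date <- as.Date("2020-01-01")

# Age in completed years: year difference, minus one if the birthday
# has not yet occurred in the reference year
year_diff <- as.integer(format(ref_date, "%Y")) - as.integer(format(demo_dates, "%Y"))
had_birthday <- format(ref_date, "%m%d") >= format(demo_dates, "%m%d")
demo_age <- year_diff - ifelse(had_birthday, 0L, 1L)
demo_age
```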

# 6. Outlook

<b>Scraping as a suitable method in the post-API era</b>
<li>Web scraping offers a good opportunity for data collection, even in the post-API era</li>
<li>Web scraping is particularly easy when a website is cleanly structured</li>
<li>Always allow enough time for data cleansing (it often takes longer with scraped data)</li>
<li>Besides extracting text and links, it is also possible to fill out and submit forms</li> <br />
<b>Ethical and legal aspects</b>
<li>It is particularly important to act in a legally and ethically responsible manner (e.g. with personal data)</li>
<li>Before scraping, check the terms of use of a website. If scraping is prohibited, the page should not be scraped</li>
<li>Never scrape data behind a login or paywall (even if you technically could)</li><br /><br />
<b>Using XPath</b>
<li>XPath can be used to navigate through elements and attributes in an XML document</li>
<li>With XPath, you can extract data based on the text content of elements, and not only on the page structure</li>
<li>So when you run into a hard-to-scrape website, XPath may just save the day</li>
<li>You can use the Selector Gadget to find the XPath (just click on XPath in the box)</li>
<li>Instead of using the CSS class in your code <b>html_nodes(object_name, "CSS class")</b> you use the XPath <b>html_nodes(object_name, xpath = 'XPath expression')</b></li><br />
Example: You want to scrape one specific element of a web page, but its CSS class is assigned to several elements on the page. Selecting by CSS class would therefore return more than the element you want. XPath can help here because an XPath expression can address a single element by its exact position in the document. <br /><br />
<b>Further reading</b><br />
Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2015). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, United Kingdom: John Wiley & Sons. 

For data wrangling and cleansing see https://r4ds.had.co.nz/
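As a sketch of the XPath approach described in the outlook above, the snippet below selects an element by its text content, something a CSS class alone cannot express. The HTML and the object names are made up for illustration.

```r
library(rvest)

# Hypothetical snippet: three elements share the same CSS class
demo_html <- read_html('<ul>
  <li class="info">Country: Poland</li>
  <li class="info">Group: EPP</li>
  <li class="info">Party: Independent</li>
</ul>')

# The CSS class matches all three elements
css_hits <- html_text(html_nodes(demo_html, ".info"))

# XPath can additionally filter on the text inside the element
xpath_hits <- html_text(
  html_nodes(demo_html, xpath = '//li[contains(text(), "Group")]')
)
xpath_hits
```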

