# Intro to web scraping in R

**First, install and load the rvest library.**

# **Step 1: Install and Load rvest**
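
The cell that follows only loads the packages, so they have to be installed beforehand. A minimal sketch, assuming a CRAN mirror is configured and reachable from your session; the `needed` and `missing_pkgs` names are illustrative:

```r
# Install rvest and tidyverse only when they are not already available.
# Assumption: a CRAN mirror is configured and reachable.
needed <- c("rvest", "tidyverse")
missing_pkgs <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Checking first avoids reinstalling packages every time the notebook is rerun.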

In [2]:
library(tidyverse)
library(rvest)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.5     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.0.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag()     masks stats::lag()

Attaching package: ‘rvest’

The following object is masked from ‘package:readr’:

    guess_encoding

# **Step 2: Retrieve the HTML Page**

In [3]:
# retrieving the target web page 
document <- read_html("https://scrapeme.live/shop")

The **read_html()** function downloads the HTML at the URL passed as its argument, parses it, and returns the resulting data structure, which is assigned here to the document variable.
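
read_html() also accepts a literal HTML string, which makes it possible to try the parsing step without a network connection. A small sketch with made-up markup (the `snippet` name and its contents are illustrative, not taken from the target site):

```r
library(rvest)

# Parse a literal HTML string instead of a URL. The markup is invented
# purely for illustration.
snippet <- read_html("<html><body><h1>Shop</h1><p>16 products</p></body></html>")

# The result is an xml_document that the selector functions can query.
class(snippet)

# Pull the heading text back out of the parsed document.
snippet %>% html_element("h1") %>% html_text2()  # "Shop"
```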

In [4]:
# print out document contents
print(document)

{html_document}
<html lang="en-GB">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="archive post-type-archive post-type-archive-product woocomme ...


# **Step 3: Identify and Select the Most Important HTML elements**

In [5]:
# selecting the list of product HTML elements 
html_products <- document %>% html_elements("li.product")

Inspecting the page's HTML shows that each li.product element includes:
- An **a** that stores the product URL.
- An **img** that contains the product image.
- An **h2** that keeps the product name.
- A **span** that stores the product price.
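
The structure listed above can be sketched with a simplified product card. The markup below is hypothetical and much shorter than the real page; the example.com URLs and the `card` and `product` names are assumptions made for illustration:

```r
library(rvest)

# A simplified, hypothetical product card; the real page carries more
# attributes and nested elements than shown here.
card <- minimal_html('
  <ul>
    <li class="product">
      <a href="https://example.com/shop/item-1/">
        <img src="https://example.com/img/item-1.png">
        <h2>Item 1</h2>
        <span class="price">10.00</span>
      </a>
    </li>
  </ul>
')

product <- card %>% html_element("li.product")

# Each of the four elements is reachable from the product node.
product %>% html_element("a") %>% html_name()
product %>% html_element("img") %>% html_name()
product %>% html_element("h2") %>% html_name()
product %>% html_element("span") %>% html_name()
```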

In [6]:
print(html_products)

{xml_nodeset (16)}
 [1] <li class="post-759 product type-product status-publish has-post-thumbna ...
 [2] <li class="post-729 product type-product status-publish has-post-thumbna ...
 [3] <li class="post-730 product type-product status-publish has-post-thumbna ...
 [4] <li class="post-731 product type-product status-publish has-post-thumbna ...
 [5] <li class="post-732 product type-product status-publish has-post-thumbna ...
 [6] <li class="post-733 product type-product status-publish has-post-thumbna ...
 [7] <li class="post-734 product type-product status-publish has-post-thumbna ...
 [8] <li class="post-735 product type-product status-publish has-post-thumbna ...
 [9] <li class="post-736 product type-product status-publish has-post-thumbna ...
[10] <li class="post-737 product type-product status-publish has-post-thumbna ...
[11] <li class="post-738 product type-product status-publish has-post-thumbna ...
[12] <li class="post-739 product type-product status-publish has-post-thumbna ...

The html_products assignment above runs rvest's html_elements() function on document through the **%>% pipe operator** (provided by magrittr and re-exported by rvest). html_elements() returns all HTML elements that match a CSS selector or XPath expression.
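
To illustrate the two selector styles, here is a short sketch on made-up markup. The XPath expression is an assumption that only holds for this simplified case, because it matches the class attribute exactly, whereas the CSS selector matches any element whose class list contains product:

```r
library(rvest)

# Made-up markup with two products and one unrelated list item.
page <- minimal_html('
  <ul>
    <li class="product"><h2>Item 1</h2></li>
    <li class="product"><h2>Item 2</h2></li>
    <li class="banner">Sale</li>
  </ul>
')

# CSS selector: every <li> whose class list contains "product".
by_css <- page %>% html_elements("li.product")

# XPath expression selecting the same nodes in this simplified markup.
by_xpath <- page %>% html_elements(xpath = "//li[@class='product']")

length(by_css)    # 2
length(by_xpath)  # 2
```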

In [7]:
# selecting the "a" HTML element storing the product URL 
a_element <- html_products %>% html_element("a") 
# selecting the "img" HTML element storing the product image 
img_element <- html_products %>% html_element("img") 
# selecting the "h2" HTML element storing the product name 
h2_element <- html_products %>% html_element("h2") 
# selecting the "span" HTML element storing the product price 
span_element <- html_products %>% html_element("span")

In [8]:
print(a_element)

{xml_nodeset (16)}
 [1] <a href="https://scrapeme.live/shop/Bulbasaur/" class="woocommerce-LoopP ...
 [2] <a href="https://scrapeme.live/shop/Ivysaur/" class="woocommerce-LoopPro ...
 [3] <a href="https://scrapeme.live/shop/Venusaur/" class="woocommerce-LoopPr ...
 [4] <a href="https://scrapeme.live/shop/Charmander/" class="woocommerce-Loop ...
 [5] <a href="https://scrapeme.live/shop/Charmeleon/" class="woocommerce-Loop ...
 [6] <a href="https://scrapeme.live/shop/Charizard/" class="woocommerce-LoopP ...
 [7] <a href="https://scrapeme.live/shop/Squirtle/" class="woocommerce-LoopPr ...
 [8] <a href="https://scrapeme.live/shop/Wartortle/" class="woocommerce-LoopP ...
 [9] <a href="https://scrapeme.live/shop/Blastoise/" class="woocommerce-LoopP ...
[10] <a href="https://scrapeme.live/shop/Caterpie/" class="woocommerce-LoopPr ...
[11] <a href="https://scrapeme.live/shop/Metapod/" class="woocommerce-LoopPro ...
[12] <a href="https://scrapeme.live/shop/Butterfree/" class="woocommerce-Loop ...

# **Step 4: Extract the Data from the HTML Elements**
In each pipeline below, rvest applies the last function to every HTML element that html_element() selected from html_products. html_attr() returns the string stored in a single attribute. Similarly, html_text2() returns the text of an HTML element roughly as it appears in a browser.
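
A small sketch of how these functions behave on a node set, using made-up markup in which the second product has no price. Because html_element() returns one result per input node, a missing element shows up as NA instead of shortening the vector; the `items`, `urls`, `titles`, and `prices` names are illustrative:

```r
library(rvest)

# Made-up markup: the second product has no price <span>.
page <- minimal_html('
  <ul>
    <li class="product"><a href="/shop/item-1/"><h2>Item 1</h2></a><span>10.00</span></li>
    <li class="product"><a href="/shop/item-2/"><h2>Item 2</h2></a></li>
  </ul>
')
items <- page %>% html_elements("li.product")

# html_element() yields exactly one result per product, so all three
# vectors have the same length as `items`.
urls   <- items %>% html_element("a") %>% html_attr("href")
titles <- items %>% html_element("h2") %>% html_text2()
prices <- items %>% html_element("span") %>% html_text2()

urls    # "/shop/item-1/" "/shop/item-2/"
titles  # "Item 1" "Item 2"
prices  # "10.00" NA
```

Keeping the vectors aligned this way is what lets them be combined column by column into a data frame afterwards.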

In [9]:
# extracting data from the list of products and storing the scraped data in 4 character vectors
product_urls <- html_products %>% 
	html_element("a") %>% 
	html_attr("href") 
product_images <- html_products %>% 
	html_element("img") %>% 
	html_attr("src") 
product_names <- html_products %>% 
	html_element("h2") %>% 
	html_text2() 
product_prices <- html_products %>% 
	html_element("span") %>% 
	html_text2()

In [10]:
# converting the vectors containing the scraped data into a data frame
products <- data.frame( 
	product_urls, 
	product_images, 
	product_names, 
	product_prices 
)

# **Step 5: Export the Scraped Data to CSV**
Before exporting the products data frame to CSV, change its column names using names(). This function gets or sets the names of a data frame's columns, so the exported CSV file will have short, readable headers.

In [11]:
# changing the column names of the data frame before exporting it into CSV 
names(products) <- c("url", "image", "name", "price")
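
As an alternative to renaming afterwards, the columns can be named when the data frame is created. A brief sketch on toy vectors (the values and the `products_a` and `products_b` names are made up for illustration):

```r
# Toy vectors standing in for the scraped data (values are made up).
product_urls  <- c("/shop/item-1/", "/shop/item-2/")
product_names <- c("Item 1", "Item 2")

# Option 1: build first, rename afterwards, as in the cell above.
products_a <- data.frame(product_urls, product_names)
names(products_a) <- c("url", "name")

# Option 2: name the columns when the data frame is created.
products_b <- data.frame(url = product_urls, name = product_names)

names(products_a)  # "url" "name"
names(products_b)  # "url" "name"
```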

Export the data frame to a CSV file using the **write.csv()** function, which writes the scraped data to a **products.csv** file in the current working directory.

In [12]:
# export the data frame containing the scraped data to a CSV file 
write.csv(products, file = "./products.csv", fileEncoding = "UTF-8")
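
By default, write.csv() also writes the row names as an extra first column. If that column is not wanted, row.names = FALSE leaves it out. A sketch on a toy data frame, written to a temporary file so nothing is left behind (the data and the `csv_path` and `check` names are illustrative):

```r
# Toy data frame standing in for the scraped products (values are made up).
products <- data.frame(
  url = c("/shop/item-1/", "/shop/item-2/"),
  name = c("Item 1", "Item 2")
)

# Write to a temporary file so the example leaves nothing behind.
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE omits the automatic row-number column that
# write.csv() would otherwise add as the first field.
write.csv(products, file = csv_path, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to confirm the header and the row count.
check <- read.csv(csv_path, fileEncoding = "UTF-8")
names(check)  # "url" "name"
nrow(check)   # 2
```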