---
title: "Iframe Tutorial: Extract Information Inside Iframes"
author: "yusuzech"
date: "July 30, 2018"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rvest)
```
In this tutorial, we use the website https://scrapethissite.com/pages/frames/ to show how to extract information from inside iframes.
## 1 Normal method should fail
First, let's try to extract each turtle's name, as shown in the image below:

Let's try the following code:
```{r}
library(rvest)
my_url <- "https://scrapethissite.com/pages/frames/"
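# Request the page and look for the turtle names directly.
# session() replaces html_session(), which is deprecated since rvest 1.0.0.
turtle_names <- session(my_url) %>%
  html_elements(".family-name") %>%
  html_text()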
print(turtle_names)
```
It failed and returned nothing. The reason is that this information actually comes from another HTML file that is embedded in the page you are viewing, so you can't extract it from the current page's own source.
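To see why, here is a minimal offline sketch. Both documents are made up with `minimal_html()` as stand-ins for the real pages, and the `src` value and the family name are assumptions for illustration only. It shows that the parent page holds only an `<iframe>` tag pointing elsewhere, while the matching nodes live in a separate document:

```{r}
library(rvest)

# Hypothetical parent page: it only contains an <iframe> tag that points
# at another document (the src value is an assumed example).
parent_page <- minimal_html(
  '<iframe id="iframe" src="/pages/frames/?frame=i"></iframe>'
)

# Hypothetical document that the iframe would load.
frame_page <- minimal_html(
  '<h3 class="family-name">Carettochelyidae</h3>'
)

# The selector matches nothing in the parent page...
n_in_parent <- length(html_elements(parent_page, ".family-name"))

# ...but it does match inside the framed document.
n_in_frame <- length(html_elements(frame_page, ".family-name"))

print(c(parent = n_in_parent, frame = n_in_frame))
```
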
## 2 Find the iframe
To extract the turtles' names, we need to find the link to the iframe. Let's use Chrome Developer Tools to find it:
We can press Ctrl+F and search for the keyword "iframe", and here we find the link.

Let's do this in R and this time it should succeed:
```{r}
library(rvest)
library(stringr)
my_url <- "https://scrapethissite.com/pages/frames/"
# extract the iframe's src attribute, as shown in the image above
# (session() replaces html_session(), which is deprecated since rvest 1.0.0)
iframe_src <- session(my_url) %>%
  html_element("#iframe") %>%
  html_attr("src")
# build the full URL of the iframe (src is relative to the site root)
iframe_url <- str_c("https://scrapethissite.com", iframe_src)
# extract the turtle names from the iframe's own page
turtle_names <- session(iframe_url) %>%
  html_elements(".family-name") %>%
  html_text()
print(turtle_names)
```
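Pasting the domain in front of `src` works when `src` is a root-relative path. As a possible alternative, `xml2::url_absolute()` resolves the `src` against the page URL and also copes with values that are already absolute. This is only a sketch: the `iframe_src` value below is an assumed example, not output captured from the site.

```{r}
library(xml2)

base_url <- "https://scrapethissite.com/pages/frames/"

# Assumed example of what html_attr("src") might return for the iframe.
iframe_src <- "/pages/frames/?frame=i"

# Resolve the (possibly relative) src against the URL of the parent page.
iframe_url <- url_absolute(iframe_src, base_url)
print(iframe_url)

# A src that is already absolute is returned unchanged.
already_absolute <- url_absolute("https://example.com/frame.html", base_url)
print(already_absolute)
```

The resolved `iframe_url` can then be passed to `session()` or `read_html()` in the same way as in the chunk above.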