Skip to content
Permalink
master
Go to file
 
 
Cannot retrieve contributors at this time
136 lines (103 sloc) 5 KB
---
title: "SelectorGadget"
author: "Hadley Wickham"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{SelectorGadget}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo = FALSE}
embed_png <- function(path, dpi = NULL) {
meta <- attr(png::readPNG(path, native = TRUE, info = TRUE), "info")
if (!is.null(dpi)) meta$dpi <- rep(dpi, 2)
knitr::asis_output(paste0(
"<img src='", path, "'",
" width=", round(meta$dim[1] / (meta$dpi[1] / 96)),
" height=", round(meta$dim[2] / (meta$dpi[2] / 96)),
" />"
))
}
knitr::opts_chunk$set(comment = "#>", collapse = TRUE)
```
```{r results="asis", echo=FALSE}
# directly adding css to output html without ruining css style https://stackoverflow.com/questions/29291633/adding-custom-css-tags-to-an-rmarkdown-html-document
cat("
<style>
img {
border: 0px;
outline: 0 ;
}
</style>
")
```
SelectorGadget is a JavaScript bookmarklet that allows you to interactively figure out what css selector you need to extract desired components from a page.
## Installation
To install it, open this page in your browser, and then drag the following link to your bookmark bar: <a href="javascript:(function(){var%20s=document.createElement('div');s.innerHTML='Loading...';s.style.color='black';s.style.padding='20px';s.style.position='fixed';s.style.zIndex='9999';s.style.fontSize='3.0em';s.style.border='2px%20solid%20black';s.style.right='40px';s.style.top='40px';s.setAttribute('class','selector_gadget_loading');s.style.background='white';document.body.appendChild(s);s=document.createElement('script');s.setAttribute('type','text/javascript');s.setAttribute('src','https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js');document.body.appendChild(s);})();">SelectorGadget</a>.
## Use
To use it, open the page
1. Click on the element you want to select. SelectorGadget will make a first
guess at what css selector you want. It's likely to be bad since it only
has one example to learn from, but it's a start. Elements that match the
selector will be highlighted in yellow.
2. Click on elements that shouldn't be selected. They will turn red.
Click on elements that *should* be selected. They will turn green.
3. Iterate until only the elements you want are selected. SelectorGadget
isn't perfect and sometimes won't be able to find a useful css selector.
Sometimes starting from a different element helps.
For example, imagine we want to find the actors listed on an IMDB movie page, e.g. [The Lego Movie](http://www.imdb.com/title/tt1490017/).
1. Navigate to the page and scroll to the actors list.
```{r, echo = FALSE}
embed_png("selectorgadget-1-s.png")
```
2. Click on the SelectorGadget link in the bookmarks. The SelectorGadget
console will appear at the bottom of the screen, and element currently
under the mouse will be highlighted in orange.
```{r, echo = FALSE}
embed_png("selectorgadget-2-s.png")
```
3. Click on the element you want to select (the name of an actor). The
element you selected will be highlighted in green. SelectorGadget guesses
which css selector you want (`a` in this case), and highlights
all matches in yellow (see total count equal to 592 as indicated on
on the "Clear" button). This seems to be a little too excessive.
```{r, echo = FALSE}
embed_png("selectorgadget-3-s.png")
```
4. Scroll around the document to find elements that you don't want to match
and click on them. For example, we don't to match the character the actor
contributed to, so we click on it and it turns red. The css selector updates to
`.primary_photo+ td a`.
```{r, echo = FALSE}
embed_png("selectorgadget-4-s.png")
embed_png("selectorgadget-5-s.png")
```
Once we've determined the css selector, we can use it in R to extract the values we want:
```{r}
library(rvest)
lego_url <- "http://www.imdb.com/title/tt1490017/"
html <- read_html(lego_url)
cast <- html_nodes(html, ".primary_photo+ td a")
length(cast)
cast[1:2]
```
Finally, we can extract the text from the selected HTML nodes.
Looking carefully at this output, we see twice as many matches as we expected. That's because we've selected both the table cell and the text inside the cell. We can experiment with selectorgadget to find a better match or look at the html directly.
```{r}
html_text(cast, trim = TRUE)
```
Let's say we're also interested in extracting the links to the actors' pages. We can access html attributes of the selected nodes using `html_attrs()`.
```{r}
cast_attrs <- html_attrs(cast)
length(cast_attrs)
cast_attrs[1:2]
```
As we can see there's only one attribute called `href` which contains relative url to the actor's page. We can extract it using `html_attr()`, indicating the name of the attribute of interest. Relative urls can be turned to absolute urls using `url_absolute()`.
```{r}
cast_rel_urls <- html_attr(cast, "href")
length(cast_rel_urls)
cast_rel_urls[1:2]
cast_abs_urls <- html_attr(cast, "href") %>%
url_absolute(lego_url)
cast_abs_urls[1:2]
```
You can’t perform that action at this time.