# Julia scrapes too!

## A worked example

Scraping with Julia is very similar to scraping with R (or to scraping with any other language).

An advantage is that broadcasting and mapping are built into Julia through the dot `.` syntax and the `map()` function. (If this is completely new to you, it's time to look it up in the previous labs or in the Julia documentation.)
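As a quick refresher, here is a minimal sketch of both idioms on a toy vector. The names `shout` and `words` are made up for illustration and are not used in the rest of the lab.

```julia
# A toy function and a toy input (both hypothetical, for illustration only)
shout(s) = uppercase(s) * "!"
words = ["scrape", "parse", "select"]

# Broadcasting: the dot applies `shout` to every element of `words`
with_dot = shout.(words)

# Mapping: `map` does the same thing, taking the function as its first argument
with_map = map(shout, words)

# Both approaches build the same vector
with_dot == with_map
```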

The packages we introduce this time are: `Cascadia` for searching and dealing with the html pages, `Gumbo` for transforming the downloaded page into something Julia and Cascadia can deal with, and `HTTP` to handle internet connections.

In [None]:
using Pkg 

In [None]:
Pkg.add("HTTP")

In [None]:
Pkg.add("Gumbo")

In [None]:
Pkg.add("Cascadia")

In [None]:
using HTTP, Gumbo, Cascadia

The first step is getting a page. This is done with an HTTP GET request (refer to the R part of this lab to learn more about API queries).

In [None]:
stallman_page = HTTP.get("https://stallman.org/")

Now that we have the whole response, we take the part we are interested in: its body. We then parse it with Gumbo into a document tree that Julia and Cascadia can work with. Notice that we go through a `String` conversion before parsing. Try to think about why we need it (it has to do with the type of the objects we create).

In [None]:
parsed_page = stallman_page.body |>
  String |>
  parsehtml
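If you want to check your answer: by default, HTTP.jl hands back the body of a response as a vector of raw bytes (`Vector{UInt8}`), while `parsehtml` expects text. The sketch below mimics this without a network connection; `fake_body` is a made-up stand-in for `stallman_page.body`.

```julia
# A stand-in for a response body: raw bytes, not text (hypothetical data)
fake_body = Vector{UInt8}("<p>Hello</p>")

# The body is a byte vector, so it cannot be parsed as text directly
fake_body isa Vector{UInt8}

# Converting a copy to a String keeps the original byte vector intact
page_text = String(copy(fake_body))

# Now we have text that a parser can read
page_text isa String
```

Calling `String` directly on a byte vector may take ownership of its memory and leave the vector empty, so copying first is the safer choice if you need the bytes again.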

Then, as in the R part, we use the browser to identify what we are looking for. In this case I want to extract all the links where Stallman speaks about "What's bad about X". They sit inside elements with the CSS class "c2", and the links themselves are `<a>` elements nested inside those (this is always the tag used for links: "a" stands for "anchor"). I discovered this by looking at the source of Stallman's website using the "inspector mode" in Firefox.

In [None]:
# we create a selector to get the thing we want
sel_for_c2 = Selector(".c2 a")
# and we extract each matching node in the HTML document
c2_part_stallman_page = eachmatch(sel_for_c2, parsed_page.root)

Now we need to take a little care. `eachmatch()` returns an array. If you carefully read the output of the previous cell, you will see something like:

> 1-element Array{Gumbo.HTMLNode,1}:  
> Gumbo.HTMLElement{:div}:

In this case it is an array of length 24, because 24 pieces of HTML in the page matched our selector (links inside the "c2" class). To get to any of them we can index the array:

In [None]:
c2_part_stallman_page[1]

Now, if you read carefully, the result is a `Gumbo.HTMLElement`, and so we can use Cascadia to work on it.

All these blocks of HTML are contained within `<a> ... </a>` delimiters. The links themselves are in an "href" attribute.

The information is there, we are quite close! If we want to extract the "href" attribute from one element we can use `getattr()`:

In [None]:
bad_about_links = getattr(c2_part_stallman_page[1],"href")

Now, we may be tempted to use broadcasting (i.e., adding a dot `.` after the function name) to apply `getattr()` to all the elements in the array. It may even work if we are lucky! It does for us in this case :-)

In [None]:
badabout_links = getattr.(c2_part_stallman_page,"href")

But be careful: it does not work in general, because you can never be sure that every element in your array contains the right information. If even one of those elements does not have an "href" attribute, the whole call fails.
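One way to guard against a missing attribute is to look it up with a fallback value. The following is only a sketch on a tiny hand-written page (`toy_html` and the other `toy_` names are made up for illustration); it relies on Gumbo's `attrs()`, which returns the attribute dictionary of an element, and on Base's `get()`.

```julia
using Gumbo, Cascadia

# A tiny hand-written page: the second anchor has no href (hypothetical data)
toy_html = """
<div class="c2">
  <a href="/one.html">one</a>
  <a>no link here</a>
</div>
"""
toy_page = parsehtml(toy_html)
toy_anchors = eachmatch(Selector(".c2 a"), toy_page.root)

# attrs() gives the attribute dictionary of an element;
# get() falls back to `missing` when "href" is absent
toy_links = [get(attrs(a), "href", missing) for a in toy_anchors]

# Keep only the links that actually exist
found_links = collect(skipmissing(toy_links))
```

With this approach an anchor without a link produces a `missing` value instead of an error, and you can decide later what to do with it.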

In Julia, the functionality of `purrr`'s `map()` is provided by the built-in `map`. Let's use it to read all those links. First we need to define which function to apply to each of those strings.

In [None]:
const baseURL = "https://stallman.org"
# warning! I didn't include the last "/" because all the links already start with one!

We first try it out on one link:

In [None]:
first_link = badabout_links[1] # let's focus only on the first link
response = HTTP.request("GET","$baseURL$first_link") # we read the right page glueing together the base url and the link we got before

And parse it as we have done above:

In [None]:
# we get the body of the response, convert it into a String, and parse it as a Gumbo HTML document
pagebad = response.body |> String |> parsehtml

And we extract all the text we can find in that page:

In [None]:
result_string = nodeText(pagebad.root) 
result_string |> println

OK, let's turn this flow into a function and output everything to a DataFrame (if the package is not installed yet, run `Pkg.add("DataFrames")` first).

In [None]:
using DataFrames

In [None]:
function get_badness(link)
  response = HTTP.request("GET","$baseURL$link") # we read the right page glueing together the base url and the link we got before
  pagebad = response.body |> String |> parsehtml
  result_string = nodeText(pagebad.root)
  df = DataFrame(Link = link, Badness = result_string)
  return df
end

In [None]:
badabout_links[24] |> get_badness #|> println

It seems to work, at least for one element at a time. Let's try it on two elements:

In [None]:
allbad = vcat(map(get_badness,badabout_links[1:2])...)
# notice the use of ... after the argument of vcat
# if you don't know what it does, look at the documentation for it (type ?... in the REPL)
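To see what the `...` (splat) is doing, here is a minimal sketch on plain vectors; `pieces` is toy data standing in for the per-link results.

```julia
# Three small vectors standing in for the per-link results (toy data)
pieces = [[1, 2], [3], [4, 5]]

# Splatting: vcat(pieces...) is the same as vcat([1, 2], [3], [4, 5])
splatted = vcat(pieces...)

# reduce(vcat, ...) stacks the pieces one after the other, without splatting
reduced = reduce(vcat, pieces)

# Both give one flat vector
splatted == reduced
```

The `reduce(vcat, ...)` form should also work on a vector of DataFrames, and it avoids passing a very long argument list when there are many pieces.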

Yup, now let's do all of them:

In [None]:
allbad = map(get_badness,badabout_links)

No luck: some of the things we are trying to scrape are not HTML files as we expect, but who-knows-what. We need to play it a bit safer.  
In Julia we can do this with a `try`/`catch` construct: we `try` to do something and, if we get an error instead of a result, we `catch` it and do something else.

In [None]:
function get_badness(link)
  try
    response = HTTP.request("GET","$baseURL$link") # we read the right page glueing together the base url and the link we got before 
    pagebad = response.body |> String |> parsehtml
    result_string = nodeText(pagebad.root)
    df = DataFrame(Link = link, Badness = result_string)
    return df 
  catch
    result_string = "no html content" 
    df = DataFrame(Link = link, Badness = result_string)
    return df
  end
end


Let's try that:

In [None]:
allbad = vcat(map(get_badness,badabout_links)...)
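Note that a bare `catch` labels every failure the same way, whether the page was not HTML or the connection dropped. If you want to know why something failed, you can bind the exception to a name and keep its message. The sketch below shows the idea on a deliberately simple, hypothetical helper (`safe_parse`) that needs no network.

```julia
# A hypothetical helper: try to read a number, report the problem otherwise
function safe_parse(text)
  try
    return (value = parse(Int, text), problem = "")
  catch err
    # `err` holds the exception; sprint(showerror, err) turns it into a message
    return (value = missing, problem = sprint(showerror, err))
  end
end

good = safe_parse("42")
bad = safe_parse("not a number")
```

The same pattern could be used in `get_badness` to store the error message in an extra column instead of a fixed string.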

### Your turn

Now that we have worked through a scraping example together, try selecting another page (or something else in Stallman's page) and scrape it on your own.

# API

API queries work just the same as in R. Here, the package you will rely on most often is HTTP.jl. The logic you will employ (using a function to paste together the necessary URL and waiting for an answer from a remote server) is always the same.
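As a minimal sketch of the "paste together the URL" step: the endpoint `https://api.example.org/search`, the parameter names, and the helper `build_query_url` are all made up for illustration, and the helper assumes that keys and values are already URL-safe.

```julia
# Hypothetical endpoint, for illustration only
api_base = "https://api.example.org/search"

# Glue a base URL and a list of key => value pairs into a query URL.
# Assumes keys and values are already URL-safe (no spaces or special characters).
function build_query_url(base, params)
  query = join(("$key=$value" for (key, value) in params), "&")
  return "$base?$query"
end

query_url = build_query_url(api_base, ["q" => "julia", "page" => 2])
```

The resulting string can then be handed to `HTTP.get`, exactly as we did with the page URLs in the scraping example.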

You can learn more about HTTP at its webpage: https://juliaweb.github.io/HTTP.jl/stable/