# Web Scraping in Julia

This notebook shows how to do web scraping in Julia. It follows equivalent notebooks in R and Python (see [here](https://github.com/tyleransom/DScourseS18/blob/master/WebData/TrumpLies.Rmd) for R and [here](https://github.com/justmarkham/trump-lies/blob/master/trump_lies.ipynb) for Python). The task is to scrape "lies" and "explanations" from a *New York Times* article that evaluates President Trump's statements during the first 10 months of his presidency.

A screenshot of the text we're interested in scraping is below:

![figure](images/trumpLies.png)

## Load required packages
To start, we're going to need the following packages:

In [43]:
using Cascadia
using Gumbo
using Requests
using DataFrames

A quick primer on what these packages are used for:

* **Cascadia** is a package for selecting HTML elements with CSS selectors
* **Gumbo** is a package for parsing HTML files
* **Requests** is a package for downloading webpages and storing them as text
* **DataFrames** is the package that allows us to store arrays of numbers or strings as data frames

## Begin scraping!

Now let's scrape the [webpage we're interested in](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html)!

In [44]:
r = get("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html");
webpage = parsehtml(convert(String, r.data));

Let's check now and see what types of objects these are:

In [45]:
println(typeof(r))
println(typeof(webpage))

HttpCommon.Response
Gumbo.HTMLDocument


So we see that the output of `get()` from the Requests package is something called an `HttpCommon.Response` and once we convert the data field of that object (`r.data`) to a string and parse it as HTML, the resulting object is a `Gumbo.HTMLDocument`.

**Types** are the fundamental units of the Julia ecosystem, so it's always important to know what type the object you're looking at actually is. For that reason, the `typeof()` command gets used frequently when working in Julia.
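As a small illustration of that habit, the sketch below inspects two stand-in values with `typeof()`, `isa`, and `size()`. The variable names are hypothetical and the values are not objects from the scrape.

```julia
# Stand-in values; the names are hypothetical and not part of the scrape.
sample_text = "Jan. 21 "
sample_list = ["first", "second"]

println(typeof(sample_text))             # String
println(sample_text isa AbstractString)  # true: String is a subtype of AbstractString
println(sample_list isa Vector{String})  # true: a one-dimensional array of strings
println(size(sample_list))               # (2,)
```

A one-dimensional array reports its size as a one-element tuple, which is the same shape of output that `size()` produces for the scraped records later on.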

Now let's use the Cascadia package to select the elements we're interested in. A preliminary step in this process is to use [SelectorGadget](http://selectorgadget.com/) to find which CSS selector matches the data you're interested in scraping.

In our case, we found that all of the data we want sits in elements with the class `short-desc`, which the CSS selector `.short-desc` matches.

In [46]:
allrecords = matchall(Selector(".short-desc"),webpage.root);
println(typeof(allrecords))
println(size(allrecords))

Array{Gumbo.HTMLNode,1}
(180,)


So now we have a one-dimensional array (a vector) of 180 records, one for each element matching the `.short-desc` selector. We can browse what the first one looks like:

In [47]:
allrecords[1]

Gumbo.HTMLElement{:span}:
<span class="short-desc">
  <strong>
    Jan. 21 
  </strong>
  “I wasn't a fan of Iraq. I didn't want to go into Iraq.”
  <span class="short-truth">
    <a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"target="_blank">
      (He was for an invasion before he was against it.)
    </a>
  </span>
</span>


This is exactly what we want: it has the date of the statement, the quote itself, a URL for the explanation of how it was a lie, and a short statement of the explanation.
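To experiment with class selectors without hitting the network, here is a minimal offline sketch. The HTML snippet is hypothetical, modeled on the record shown above. It also assumes a recent Cascadia release, where selector matching goes through `eachmatch`; the `matchall` call used in this notebook comes from an older release.

```julia
using Gumbo
using Cascadia

# Hypothetical page with two records and one unrelated element,
# modeled on the structure of the record shown above.
html = """
<html><body>
<span class="short-desc"><strong>Jan. 21 </strong>“First quote.” <span class="short-truth"><a href="https://example.com/one" target="_blank">(First explanation.)</a></span></span>
<span class="short-desc"><strong>Jan. 23 </strong>“Second quote.” <span class="short-truth"><a href="https://example.com/two" target="_blank">(Second explanation.)</a></span></span>
<span class="other">Not a record.</span>
</body></html>
"""

doc = parsehtml(html)

# Assumption: recent Cascadia releases expose selector matching via `eachmatch`.
records = eachmatch(Selector(".short-desc"), doc.root)

println(length(records))   # number of elements carrying the class "short-desc"
println(typeof(records))   # the container type returned by the selector match
```

Only the two `short-desc` spans should match; the nested `short-truth` spans and the unrelated `other` span carry different classes.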

## How Gumbo works: parent and children nodes
Gumbo works a bit differently from other HTML parsers (e.g. `rvest` in R or `BeautifulSoup` in Python). It organizes the structure of the HTML file as a "tree": each node has a parent node (except the root) and may have any number of child nodes.

Once you get the hang of it, it's actually very straightforward to access the data you need. Let's start by looking again at the first record:

In [48]:
firstrecord = allrecords[1]
println(fieldnames(firstrecord))
println(firstrecord.children)
println(size(firstrecord.children))

Symbol[:children, :parent, :attributes]
Gumbo.HTMLNode[Gumbo.HTMLElement{:strong}:
<strong>
  Jan. 21 
</strong>
, HTML Text: “I wasn't a fan of Iraq. I didn't want to go into Iraq.” , Gumbo.HTMLElement{:span}:
<span class="short-truth">
  <a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"target="_blank">
    (He was for an invasion before he was against it.)
  </a>
</span>
]
(3,)


So we see that the fieldnames of `firstrecord` are `children`, `parent`, and `attributes`. We can show its children, which are:

1. a node consisting of a `strong` HTML element (this is what the dates are surrounded by) 
2. a node containing some HTML text (the lie)
3. a node consisting of a span tagged as "short-truth" 

The final line of output, `(3,)`, confirms that these three are the only "children."
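The same inspection can be written as a loop over the `children` field. This is a minimal sketch on a hypothetical one-record snippet, not the scraped page. It relies on `parsehtml` wrapping a fragment in `<html>`, `<head>`, and `<body>` elements, so the record is reached through the body.

```julia
using Gumbo

# Hypothetical single record, modeled on the one shown above.
html = """<span class="short-desc"><strong>Jan. 21 </strong>“A quote.” <span class="short-truth"><a href="https://example.com/one" target="_blank">(An explanation.)</a></span></span>"""
doc = parsehtml(html)

# The root <html> element has <head> and <body> as children;
# the record is the first child of <body>.
body = doc.root[2]
record = body[1]

# Text nodes carry a `text` field; element nodes carry a tag and children.
for child in record.children
    if child isa HTMLText
        println("text node: ", child.text)
    else
        println("element <", tag(child), "> with ", length(child.children), " child node(s)")
    end
end
```

Distinguishing `HTMLText` nodes from element nodes matters because only elements have `children` and `attributes` fields.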

### Accessing children
To access a child, we index the node directly (`firstrecord[1]` is equivalent to `firstrecord.children[1]`):

In [49]:
firstrecord[1]

Gumbo.HTMLElement{:strong}:
<strong>
  Jan. 21 
</strong>


So the first child of `firstrecord` is the date, surrounded by a `strong` HTML tag.

### Converting HTMLElements to strings or numbers
So how can we extract the date from this element? The function we need is `nodeText()`, which is exported by the Cascadia package:

#### Extracting the Date

In [50]:
nodeText(firstrecord[1])

"Jan. 21 "

Bingo! That's exactly what we need. 

#### Extracting the Lie
Now let's extract the lie:

In [51]:
nodeText(firstrecord[2])

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

Aside from a trailing space (same with the date) and some funky Unicode quotation marks, this looks good!

Now let's tackle the third child. This one has our two remaining elements of interest (the URL for the explanation, as well as the explanation itself) lumped together:

In [52]:
firstrecord[3]

Gumbo.HTMLElement{:span}:
<span class="short-truth">
  <a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"target="_blank">
    (He was for an invasion before he was against it.)
  </a>
</span>


But recall that our `firstrecord` is just a tree with parent and child nodes! Let's check to see if the third child of `firstrecord` has any children itself:

In [53]:
println(firstrecord[3].children)
println(firstrecord[3][1])

Gumbo.HTMLNode[Gumbo.HTMLElement{:a}:
<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"target="_blank">
  (He was for an invasion before he was against it.)
</a>
]
<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"target="_blank">(He was for an invasion before he was against it.)</a>

It has exactly one child: an `<a>` element. The explanation text we are interested in is nested within that `<a>` tag, and the URL we are interested in is stored in its `href` attribute.

### Accessing HTML attributes
The way to access what's in an HTML attribute is to use the `attributes` field of the child:

In [54]:
firstrecord[3][1].attributes

Dict{AbstractString,AbstractString} with 2 entries:
  "href"   => "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-sa…
  "target" => "_blank"

The `attributes` field of any element may contain data we're interested in. In this case, we want the value stored under `"href"`, which is the URL. Because `attributes` is a dictionary, all we need to do is index it by key.

#### Extracting the URL

In [55]:
println(firstrecord[3][1].attributes["href"])
println(typeof(firstrecord[3][1].attributes["href"]))
url = firstrecord[3][1].attributes["href"];

https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the
String


So this gives us the string that contains the URL.
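Indexing a dictionary with a key that is absent raises a `KeyError`, which matters if some element lacks an `href`. The sketch below uses a plain `Dict` as a stand-in for the `attributes` field shown above and looks keys up with `get()` and a default. The `"title"` key is a hypothetical example of a missing attribute.

```julia
# Stand-in for the `attributes` dictionary shown above.
attributes = Dict(
    "href" => "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the",
    "target" => "_blank",
)

# get(dict, key, default) returns the default instead of throwing when the key is absent.
url = get(attributes, "href", missing)
title = get(attributes, "title", missing)   # "title" is a hypothetical absent key

println(url)
println(ismissing(title))            # true, because there is no "title" entry
println(haskey(attributes, "href"))  # true
```

Using `missing` as the default keeps the row in the data set while flagging that the value was not found.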

#### Extracting the explanation
To extract the explanation, all we need to do is apply `nodeText()` again, just as we did for the date and the lie:

In [56]:
println(nodeText(firstrecord[3][1]))

(He was for an invasion before he was against it.)


## Cleaning up the text
Now we've got all of the elements we need, but we do need to clean things up just a bit since there were Unicode quotation marks as well as trailing spaces and extraneous parentheses in each of our objects.
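The sections below do this cleanup step by step with indexing and `replace`. As a point of comparison, here is a minimal sketch that uses `strip()` from Julia's standard library instead. The input strings are copied from the outputs above; the variable names are hypothetical.

```julia
# Raw strings copied from the outputs above.
raw_date = "Jan. 21 "
raw_lie = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
raw_explanation = "(He was for an invasion before he was against it.)"

# strip(s) removes surrounding whitespace; strip(s, chars) removes any of the
# listed characters from both ends and leaves the interior untouched.
date_str = strip(raw_date) * ", 2017"
lie = strip(raw_lie, [' ', '“', '”'])
explanation = strip(raw_explanation, ['(', ')'])

println(date_str)
println(lie)
println(explanation)
```

Unlike `replace`, `strip` only touches the ends of a string, so parentheses or quotation marks inside the text survive. It also works on whole characters, whereas range indexing such as `s[1:end-1]` counts bytes and, on current Julia versions, raises a `StringIndexError` when the cut lands inside a multi-byte character such as `”`.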

### Adding 2017 to the dates
To concatenate two strings in Julia, we can use the `*` operator. Note that we also want to remove the trailing space, so we can execute the following code:

In [57]:
date = nodeText(firstrecord[1])
date = date[1:end-1]*", 2017"

"Jan. 21, 2017"

That gives us the format that we're looking for. 

### Dates in Julia
To convert a date string like `"Jan. 21, 2017"` to a `Date` object, we use the `Date()` function in the following way:

In [58]:
date = Date(date, "u. d, y")

2017-01-21

The first argument of the `Date()` function is the string to be converted, and the second argument is the format of that string. `u` refers to the abbreviated month name (while `U` would be the full month name, like "January"), `d` to the day, and `y` to the year. It's crucial that the period and comma in the format match those in the string, so that the `Date()` function knows where to find the elements it needs.
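To see how the format string behaves on inputs that do not match, here is a minimal sketch. On Julia 0.7 and later the date functions live in the `Dates` standard library, hence the `using Dates` line. `parse_article_date` is a hypothetical helper, and `"May 1, 2017"` is a made-up input illustrating a month written without a period.

```julia
using Dates

# Hypothetical helper: try the abbreviated-month format first, then the
# full-month format. Returns a Date, or `nothing` if neither format matches.
function parse_article_date(s::AbstractString)
    d = tryparse(Date, s, dateformat"u. d, y")
    return d === nothing ? tryparse(Date, s, dateformat"U d, y") : d
end

println(parse_article_date("Jan. 21, 2017"))  # matches "u. d, y"
println(parse_article_date("May 1, 2017"))    # no period, so only "U d, y" matches
println(parse_article_date("not a date"))     # neither format matches
```

`tryparse` returns `nothing` on failure instead of throwing, which avoids wrapping each attempt in `try`/`catch`.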



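The string-building step can also be written in one line by indexing directly into `allrecords` (note that this reassigns `date` to a string):

In [59]: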
date = nodeText(allrecords[1][1])[1:end-1]*", 2017"

"Jan. 21, 2017"

### String operations
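The common operations we will use in this example are `[1:end-1]` (which trims the last character of a string) and the `replace(var,string1,string2)` function, which replaces every occurrence of `string1` in variable `var` with `string2`. Below, `x` and `y` are the first and last characters of the lie (the opening and closing quotation marks), and replacing each with an empty string removes them; the second cell strips the parentheses from the explanation in the same way: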

In [60]:
lie = nodeText(firstrecord[2])[1:end-1]
x = lie[1]
y = lie[end]
lie = replace(lie,x,"")
lie = replace(lie,y,"")
lie

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

In [61]:
explanation = nodeText(firstrecord[3])
explanation = replace(explanation,"(","")
explanation = replace(explanation,")","")

"He was for an invasion before he was against it."

### Concatenating data frames
Now that we have all of the string operations required to clean the text data, all we need to do is store each of the elements of interest in a data frame, and then concatenate the data frames together.

To start, we need to initialize an empty data frame:

In [62]:
df = DataFrame(date=DateTime[], lie=String[], explanation=String[], url=String[])

| Row | date | lie | explanation | url |
|---|---|---|---|---|


Then, we can create a temporary data frame containing `firstrecord`'s data and concatenate them together:

In [63]:
temp = DataFrame(date=date, lie=lie, explanation=explanation, url=url)
df = [df; temp]

| Row | date | lie | explanation | url |
|---|---|---|---|---|
| 1 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go into Iraq. | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the |
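Concatenating with `[df; temp]` builds a new data frame on every step. As an alternative, here is a minimal sketch that appends a row in place with `push!`, assuming a recent DataFrames.jl release where `push!` accepts a named tuple. It also declares the `date` column as `Date` rather than `DateTime`, which is a choice made for this sketch only.

```julia
using DataFrames
using Dates

# Empty data frame with typed columns (Date instead of DateTime is an assumption of this sketch).
df = DataFrame(date=Date[], lie=String[], explanation=String[], url=String[])

# Append one row in place; the named tuple's fields are matched to columns by name.
push!(df, (
    date = Date(2017, 1, 21),
    lie = "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
    explanation = "He was for an invasion before he was against it.",
    url = "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the",
))

println(nrow(df))
```

Because the columns are typed, pushing a value of the wrong type (for example a date that is still a string) fails immediately rather than silently widening the column.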


## Putting it all together in a loop and exporting to CSV
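The remaining task is to loop through all of the records, instead of referencing only `firstrecord`, and append each one to the end of the data frame.

Two details in the loop deal with dates that do not fit the `"u. d, y"` format. First, "Sept." is replaced with "Sep." to match the three-letter abbreviation that `u` expects. Second, if parsing with the abbreviated format fails, the code falls back to `"U d, y"`, which handles months written out in full.

Once the complete data frame is created, we can export it to CSV.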

The full code is below:

In [64]:
using Cascadia
using Gumbo
using Requests
using DataFrames

r = get("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
webpage = parsehtml(convert(String, r.data))

allrecords = matchall(Selector(".short-desc"),webpage.root)

df = DataFrame(date=DateTime[], lie=String[], explanation=String[], url=String[])

for q = 1:size(allrecords,1)
    # Step 1: extract date and convert to DateTime format
    date = nodeText(allrecords[q][1])[1:end-1]*", 2017"
    date = replace(date, "Sept.", "Sep.")
    date = try
        Date(date, "u. d, y")
    catch
        Date(date, "U d, y")
    end

    # Step 2: extract lie
    lie = nodeText(allrecords[q][2])[1:end-1]
    x = lie[1]
    y = lie[end]
    lie = replace(lie,x,"")
    lie = replace(lie,y,"")

    # Step 3: extract the explanation
    explanation = nodeText(allrecords[q][3])
    explanation = replace(explanation,"(","")
    explanation = replace(explanation,")","")

    # Step 4: extract the URL
    url = allrecords[q][3][1].attributes["href"]

    # Step 5: put all together in a dataframe
    temp = DataFrame(date=date, lie=lie, explanation=explanation, url=url)
    df = [df; temp]
end

# Export the dataframe to a CSV
writetable("trumpLies.csv", df)

head(df)

| Row | date | lie | explanation | url |
|---|---|---|---|---|
| 1 | 2017-01-21T00:00:00 | I wasn't a fan of Iraq. I didn't want to go into Iraq. | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the |
| 2 | 2017-01-21T00:00:00 | A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine. | Trump was on the cover 11 times and Nixon appeared 55 times. | http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/ |
| 3 | 2017-01-23T00:00:00 | Between 3 million and 5 million illegal votes caused me to lose the popular vote. | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html |
| 4 | 2017-01-25T00:00:00 | Now, the audience was the biggest ever. But this crowd was massive. Look how far back it goes. This crowd was massive. | Official aerial photos show Obama's 2009 inauguration was much more heavily attended. | https://www.nytimes.com/2017/01/21/us/politics/trump-white-house-briefing-inauguration-crowd-size.html |
| 5 | 2017-01-25T00:00:00 | Take a look at the Pew reports (which show voter fraud.) | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics/unauthorized-immigrant-voting-trump-lie.html |
| 6 | 2017-01-25T00:00:00 | You had millions of people that now aren't insured anymore. | The real number is less than 1 million, according to the Urban Institute. | https://www.nytimes.com/2017/03/13/us/politics/fact-check-trump-obamacare-health-care.html |
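Several calls used in this notebook (`get` from Requests, `matchall`, `writetable`, `head`) come from older package versions. As far as we know, their current counterparts are `HTTP.get`, `eachmatch`, `CSV.write`, and `first`. The sketch below is one possible adaptation under that assumption, not a tested port of the loop above: `parse_record` and `parse_records` are hypothetical helper names, the curly quotes are removed by name rather than by position, and the parsing is exercised offline on a single record copied from the first output.

```julia
using Cascadia, Gumbo, DataFrames, Dates

# Hypothetical helper: convert one ".short-desc" node into a named tuple.
function parse_record(rec)
    # Date: trim whitespace, add the year, and normalize "Sept." as in the loop above.
    raw_date = replace(strip(nodeText(rec[1])) * ", 2017", "Sept." => "Sep.")
    date = tryparse(Date, raw_date, dateformat"u. d, y")
    if date === nothing
        date = Date(raw_date, dateformat"U d, y")  # throws if this format fails too
    end

    # Assumption: the quotes to drop are the curly ones seen in the output above.
    lie = replace(replace(strip(nodeText(rec[2])), "“" => ""), "”" => "")
    explanation = replace(replace(nodeText(rec[3]), "(" => ""), ")" => "")
    url = rec[3][1].attributes["href"]

    return (date=date, lie=lie, explanation=explanation, url=String(url))
end

# Hypothetical helper: parse every ".short-desc" record in an HTML string.
function parse_records(html::AbstractString)
    doc = parsehtml(String(html))
    df = DataFrame(date=Date[], lie=String[], explanation=String[], url=String[])
    for rec in eachmatch(Selector(".short-desc"), doc.root)
        push!(df, parse_record(rec))
    end
    return df
end

# Offline check on one record copied from the first output above.
sample = """<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>"""
df = parse_records(sample)
println(df)

# Against the live page (requires network access plus the HTTP and CSV packages):
# using HTTP, CSV
# r = HTTP.get("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
# df = parse_records(String(r.body))
# CSV.write("trumpLies.csv", df)
# first(df, 6)
```

Separating the parsing from the download makes it possible to check the extraction logic on a small snippet before running it against the full page.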
