# Advanced Programming for Language Technologists
Artur Kulmizev, Spring 2020

### Pre-requisites:

For this lecture, we'll be working with several Python libraries. A good way to keep track of the libraries you may use for various projects is to use virtual environments. You can do this in Python by first installing the `virtualenv` library:

`$ pip install --user virtualenv`

Following the installation, you can create a virtual environment as follows:

`$ virtualenv web_scraping`

After doing this, a directory named `web_scraping` will be created in your local directory. This will hold all the necessary packages that you'll be working with on this project. 

To load a virtual env, issue the following command:

`$ source web_scraping/bin/activate`

After loading an env, you should see its name at the beginning of your shell prompt:

`(web_scraping)$`

Once you're inside an environment, you can install packages as follows:

`(web_scraping)$ pip install beautifulsoup4 selenium`

Once the packages have finished installing, they will live inside the `web_scraping` directory, isolated from the rest of the system. After deactivating the environment, your Python settings will revert to how they were before. To deactivate an environment, simply run `(web_scraping)$ deactivate`.
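
If you're ever unsure whether a script is actually running inside a virtual environment, one quick check is to compare the interpreter's prefixes. This is a minimal sketch; the helper name `in_virtual_environment` is made up for illustration, and it only covers `virtualenv`/`venv`-style environments.

```python
import sys

def in_virtual_environment():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print(in_virtual_environment())
```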

#### Anaconda

Anaconda is a Python distribution that comes bundled with many essential data science and machine learning packages. It can be downloaded [here](https://www.anaconda.com/distribution/#download-section). It is recommended that you install Anaconda on your personal computers since you'll be working with many relevant packages that it includes. Anaconda also features `conda`, which is a package manager and environment manager in one. You can create and activate a `conda` virtual environment like so:

`$ conda create -n web_scraping`

`$ conda activate web_scraping`

Once inside the environment, you can also use `conda` to install packages:

`(web_scraping) $ conda install beautifulsoup4`

To deactivate the environment, simply do:

`(web_scraping) $ conda deactivate`

## Lecture 1: Web Scraping

### _What is web scraping?_

Web scraping refers to the practice of writing computer programs that can extract data from web pages. People typically scrape the web for images or text, but other types of data - like geolocation and demographics - are also very prevalent. 

### _How is web scraping relevant to LT?_

As language technologists-in-training, so far you've mostly worked with curated corpus text. This type of text (whether annotated or not) is immensely useful for training machine learning-based NLP systems, but is often very limited in terms of access and scope. 

Say you wanted to use Twitter data to investigate dialect differences across various regions of Switzerland. You'll likely need a dataset or corpus of Tweets, along with user-provided geolocation or something of the sort. Perhaps there is an existing dataset you could use, but, in most cases, there likely isn't. Similarly, what if you did manage to get your hands on a perfect corpus, but it only collected Tweets from 2010? Given the rapid pace of language change on the internet, it is unlikely that the data would be sufficiently up-to-date for your purposes. In such cases, you'll probably need to get the data yourself. 

### _How can we do it?_

Typically, when you've determined that you need to acquire your own data from the internet, there are two scenarios that arise. 

#### Easy Scenario: the website provides an API:

API stands for Application Programming Interface. APIs are essentially enterprise-provided tools and guidelines that help users programmatically interact with their services. Instead of attempting to navigate the maze of data structures in a particular website from scratch, users can use a website's API to make the process significantly easier - most of the work has already been done for them. For example, [Twitter offers an API](https://developer.twitter.com/en/docs) that specifies how one can retrieve Tweets from a specific time/date, post one's own Tweets, and find relevant information, like geolocation, etc. This can then be used to write programs in the user's language of choice, say, Python. 

Unfortunately, APIs are often only provided by enterprise systems like Twitter, Google, and Facebook. As such, someone looking to crawl disparate sources of information (e.g. English-language blog posts about the Iraq War circa 2001-2004) will need to get creative.
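
To make the API scenario a bit more concrete, here is a minimal sketch of the two steps that most web APIs boil down to: building a request URL with query parameters, and decoding the JSON that comes back. The endpoint, the parameter names, and the response body below are all made up for illustration; a real API (Twitter's included) defines its own, and usually requires authentication as well.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.org/v1/posts"

def build_query_url(base_url, **params):
    """Attach URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"

url = build_query_url(BASE_URL, q="grüezi", region="Zurich")
print(url)

# A made-up response body, shaped the way many JSON APIs answer.
sample_response = '{"posts": [{"text": "Grüezi mitenand", "region": "Zurich"}]}'

# Decode the JSON text into Python dictionaries and lists.
payload = json.loads(sample_response)
texts = [post["text"] for post in payload["posts"]]
print(texts)
```

With a real service, the response text would come from the network rather than from a string literal, but the decoding step looks the same.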

#### Hard Scenario: the website provides no API:

In the case that you don't have access to a website's API, you'll probably have to interact with webpages directly. Webpages are built upon HTML, which stands for Hypertext Markup Language. Scraping is typically done by processing lots of pages, parsing their HTML structure, and returning relevant bits of that structure, like paragraphs of text. Though this might seem daunting, there are a multitude of libraries that can help you make this process relatively painless. For the sake of this tutorial, we'll focus on these. 

### HTML review

A very simple HTML page might look something like this.

```
<!DOCTYPE html>
<html>
<head>
    <title>My Title</title>
</head>
<body>

    <h1>My First Heading</h1>
    <p>My first paragraph.</p>

</body>
</html> 
```

HTML documents (that is, webpages) consist of elements that tell browsers how to display content. Elements are typically things like headings, paragraphs, and lists, which can be used to format, structure, and stylize webpages. (Most) elements appear between tags, which mark where they begin and end, like so: `<h1>This is a heading</h1>`. A typical HTML document consists of the following structure:

* a `<!DOCTYPE html>` declaration, which specifies to the browser that the document is indeed an HTML5 document.
* an `<html>` element that contains the entire page's contents.
* a `<head>` element that contains information and metadata about the page, like its title, etc.
* a `<body>` element that contains the main part of the page: its structure, headings, text, etc.

The goal of web scraping is to extract information that is structured in this format. It is important to note that very few real webpages are as simple to parse as this one. Also, HTML isn't the only language at play in web development - far from it. A typical webpage might involve a complicated interplay of various languages: HTML, CSS, JavaScript, server-side languages like Ruby, etc. This is what makes scraping a challenging task.
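
To see the nesting described above, we can walk through the simple page with Python's built-in `html.parser` module and print each element at its depth. This is only a sketch to illustrate the structure: the class name `OutlineBuilder` is made up, and it assumes every tag is properly closed, which holds for this toy page but not for real ones.

```python
from html.parser import HTMLParser

class OutlineBuilder(HTMLParser):
    """Collect an indented outline of the elements in an HTML document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An opening tag starts a new element one level deeper.
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level.
        self.depth -= 1

    def handle_data(self, data):
        # Record text content, ignoring whitespace between elements.
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))

simple_page = """<!DOCTYPE html>
<html>
<head>
    <title>My Title</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
</body>
</html>"""

outline = OutlineBuilder()
outline.feed(simple_page)
print("\n".join(outline.lines))
```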

### How can we parse HTML with Python?

Naively? Regex. 

In [1]:
import re

my_page = '''
<!DOCTYPE html>
<html>
<head>
    <title>My Title</title>
</head>
<body>

    <h1>My First Heading</h1>
    <p>My first paragraph.</p>

    <h1>My Second Heading</h1>
    <p>My second paragraph.</p>
    
</body>
</html> 
'''

# Search for all top-level headings (<h1>) on the page

heading_search = re.compile(r'<h1.*?>(.+?)</h1>')
headings = heading_search.findall(my_page)

print(headings)

['My First Heading', 'My Second Heading']


This works well for very simple, toy HTML pages. On real pages, where tags carry attributes, nest inside each other, and span multiple lines, the patterns quickly become messy and unreliable, so regex is a poor tool for the job. 
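
Here is a small sketch of how quickly the same pattern goes wrong. The `messy_page` snippet is made up, but each of its three headings is perfectly ordinary HTML:

```python
import re

heading_pattern = re.compile(r'<h1.*?>(.+?)</h1>')

# A made-up snippet with three everyday complications.
messy_page = '''
<h1 class="main">First <em>Heading</em></h1>
<h1>
    Second Heading
</h1>
<!-- <h1>Commented-out Heading</h1> -->
'''

matches = heading_pattern.findall(messy_page)
print(matches)
# ['First <em>Heading</em>', 'Commented-out Heading']
```

The first match still contains markup, the second heading is missed entirely because `.` does not match newlines by default, and the heading inside the HTML comment is returned even though a browser would never display it.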

Better to use an HTML parser, like Beautiful Soup. 

In [2]:
from bs4 import BeautifulSoup

# We can call BeautifulSoup on our page to parse the document

soup = BeautifulSoup(my_page, "html.parser")

# This gives us a parsed object that we can interact with

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   My Title
  </title>
 </head>
 <body>
  <h1>
   My First Heading
  </h1>
  <p>
   My first paragraph.
  </p>
  <h1>
   My Second Heading
  </h1>
  <p>
   My second paragraph.
  </p>
 </body>
</html>



In [4]:
# We can extract any element out of the document...

print(soup.title)

# ... as well as its corresponding text

print(soup.title.get_text())

<title>My Title</title>
My Title


In [5]:
# We can use soup to find the first instance of an element...

print(soup.p)

# ... or all instances of an element...

print(soup.find_all("p"))

# ... and their corresponding text

for paragraph in soup.find_all("p"):
    print(paragraph.get_text())

<p>My first paragraph.</p>
[<p>My first paragraph.</p>, <p>My second paragraph.</p>]
My first paragraph.
My second paragraph.


What if we want to parse live web pages? We can do the following using `urllib`:

In [6]:
import urllib.request

site_string = "https://cl.lingfil.uu.se/~miryam/"

miryam_site = urllib.request.urlopen(site_string)
soup2 = BeautifulSoup(miryam_site, "html.parser")

print(soup2.title.string)

 Miryam de Lhoneux


### Exceptions

Let's try loading the webpages of several computational linguists at Uppsala and see what they call their page:

In [17]:
for name in ["gongbo", "miryam", "sara", "artur"]:
    site_string = f"https://cl.lingfil.uu.se/~{name}/"
    cl_site = urllib.request.urlopen(site_string)
    soup = BeautifulSoup(cl_site, "html.parser")
    print(soup.title.string)

Gongbo Tang
 Miryam de Lhoneux
Sara Stymne


HTTPError: HTTP Error 404: Not Found

It seems our code ran for the first three names, but failed on mine. The last site caused `urllib` to raise an `HTTPError`; the traceback points to the line that raised it: `--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)`. Let's look at some other types of exceptions in Python:

In [9]:
# ZeroDivisionError

12/0

ZeroDivisionError: division by zero

In [9]:
# NameError 

variable_we_did_not_define + 0

NameError: name 'variable_we_did_not_define' is not defined

In [10]:
# TypeError

12 + "cat"

TypeError: unsupported operand type(s) for +: 'int' and 'str'

We can handle exceptions using `try` and `except` statements. `try` tells Python to try running your code, with the expectation that something might go awry. `except` tells Python that, when something does go awry, you have code to handle the situation accordingly. Let's try them in our code above:

In [10]:
for name in ["gongbo", "miryam", "sara", "artur"]:
    site_string = f"https://cl.lingfil.uu.se/~{name}/"
    try:
        cl_site = urllib.request.urlopen(site_string)
        soup = BeautifulSoup(cl_site, "html.parser")
        print(soup.title.string)
    except:
        print("Site not found.")

Gongbo Tang
 Miryam de Lhoneux
Sara Stymne
Site not found.


For more specificity, we can include the exact type of exception we're hoping to catch:

In [11]:
from urllib.error import HTTPError

for name in ["gongbo", "miryam", "sara", "artur"]:
    site_string = f"https://cl.lingfil.uu.se/~{name}/"
    try:
        cl_site = urllib.request.urlopen(site_string)
        soup = BeautifulSoup(cl_site, "html.parser")
        print(soup.title.string)
    except HTTPError:
        print("Site not found.")

Gongbo Tang
 Miryam de Lhoneux
Sara Stymne
Site not found.
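
A `try` statement can also carry several `except` clauses, and binding the exception with `as` lets us inspect it. The sketch below distinguishes an HTTP error (the server answered, but with an error code) from a more general `URLError` (the server could not be reached at all). The helper `try_open` and the two stand-in openers are hypothetical; they are only there so the example runs without a network connection.

```python
from urllib.error import HTTPError, URLError

def try_open(url, opener):
    """Return (response, None) on success or (None, message) on failure."""
    try:
        return opener(url), None
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it has to be listed first.
        return None, f"HTTP error {err.code}: {err.reason}"
    except URLError as err:
        return None, f"Could not reach the server: {err.reason}"

# Stand-in openers (hypothetical) that fail the way urlopen can fail.
def missing_page(url):
    raise HTTPError(url, 404, "Not Found", None, None)

def unreachable_server(url):
    raise URLError("connection refused")

print(try_open("https://example.org/missing", missing_page))
print(try_open("https://example.org/down", unreachable_server))
```

In real use, you would pass `urllib.request.urlopen` as the `opener` argument.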


We can also use an `else` clause to have more control over the program. Much like in `if-else` statements, the `else` block of a `try` statement is the fallback branch: it runs only when the `try` block finishes *without* raising an exception. For the purposes of this code, then, we can just move `print(soup.title.string)` inside `else`: 

In [12]:
for name in ["gongbo", "miryam", "sara", "artur"]:
    site_string = f"https://cl.lingfil.uu.se/~{name}/"
    try:
        cl_site = urllib.request.urlopen(site_string)
        soup = BeautifulSoup(cl_site, "html.parser")
    except HTTPError:
        print("Site not found.")
    else:
        print(soup.title.string)

Gongbo Tang
 Miryam de Lhoneux
Sara Stymne
Site not found.


Finally, we can use a `finally` clause to execute code regardless of whether an exception was raised. This code will always execute. 

In [13]:
for name in ["gongbo", "miryam", "sara", "artur"]:
    site_string = f"https://cl.lingfil.uu.se/~{name}/"
    try:
        cl_site = urllib.request.urlopen(site_string)
        soup = BeautifulSoup(cl_site, "html.parser")
    except HTTPError:
        print("*SITE NOT FOUND.*")
    else:
        print(soup.title.string)
    finally:
        print(f"Finished processing page for {name}.")

Gongbo Tang
Finished processing page for gongbo.
 Miryam de Lhoneux
Finished processing page for miryam.
Sara Stymne
Finished processing page for sara.
*SITE NOT FOUND.*
Finished processing page for artur.
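
To summarize the order in which the four blocks run, here is a small network-free sketch. The function `trace_blocks` is made up for illustration; it simply records which blocks were entered.

```python
def trace_blocks(action):
    """Run action() and record which blocks of the try statement execute."""
    events = []
    try:
        events.append("try")
        action()
    except ValueError:
        events.append("except")
    else:
        events.append("else")
    finally:
        events.append("finally")
    return events

# No exception: the else block runs, then finally.
print(trace_blocks(lambda: int("42")))
# ['try', 'else', 'finally']

# int("cat") raises a ValueError: the except block runs, then finally.
print(trace_blocks(lambda: int("cat")))
# ['try', 'except', 'finally']
```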


### A more challenging example: scraping presidential speeches

So far, we've been working with fairly straightforward HTML pages that are easy to scrape. In most situations, this won't be the case. For the next module, we will be working with the text transcripts of all US presidential speeches. However, there is no corpus that collects this information. Luckily, The Miller Center (a University of Virginia research center that studies the history of the US presidency) has made all popular presidential speeches public at [this address](https://millercenter.org/the-presidency/presidential-speeches). We can use these, but since the raw text of these speeches is not available for download, we will have to scrape the necessary content ourselves. 

Judging from the site, the actual transcripts live on separate pages, which are structured like this: https://millercenter.org/the-presidency/presidential-speeches/january-8-2020-statement-iran. It is these pages that we will need to parse using BeautifulSoup. In order to do so, however, we need to know the link to each individual page. Looking back at the main page, the links to the transcript pages appear inside `div` elements with the class `views-row`. If we look inside these `views-row` blocks, we should be able to extract the `href` attributes of the anchor (`<a>`) elements, which point to the corresponding transcript pages.

Unfortunately, the page is set up in such a way that only 12 `views-row` blocks are loaded at a time. The way to load a new set of blocks is to scroll down the page. This mechanism is called an _infinite scroll_ and is notoriously difficult to scrape. What we need to do in this case is to _keep scrolling down_ until the full set of links is on the page. This is tedious to do manually, so we can use a package called `selenium`, which allows you to write programs that take control of your browser.

Using `selenium`, we will open the site and repeatedly send the scroll-down keystroke (<kbd>COMMAND</kbd>+<kbd>DOWN</kbd> on macOS) until we've loaded the full set of transcript links. Once the full page is loaded, we can extract the `href` attributes from all `views-row` blocks and create a list of links, which we will later parse individually using BeautifulSoup. 

Let's look at how this works:

In [14]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

In [None]:
from selenium.webdriver.common.by import By

# Tell selenium that we will open the page using Safari.
browser = webdriver.Safari()

# Open the page in Safari.
browser.get('https://millercenter.org/the-presidency/presidential-speeches')

# Find the element that we need to interact with, in this case the 'body'.
# (The older find_element_by_* helpers were removed in Selenium 4, so we use find_element with By.)
elem = browser.find_element(By.TAG_NAME, 'body')

# Specify the number of times we need to scroll down. 200 should be OK.
no_of_pagedowns = 200

# Scroll the requested number of times.
for _ in range(no_of_pagedowns):

    # Press COMMAND+DOWN to scroll down.
    elem.send_keys(Keys.COMMAND + Keys.DOWN)

    # Wait a second so that the new set of links can load.
    time.sleep(1.0)

# Now that the entire page is loaded, find all anchors inside the 'views-row' blocks.
post_links = browser.find_elements(By.CSS_SELECTOR, ".views-row a")

# Extract the 'href' attribute of every anchor we found.
links = [link.get_attribute('href') for link in post_links]
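
A fixed number of scrolls is a guess: too few and we miss links, too many and we wait around for nothing. One common alternative, sketched below, is to keep scrolling until the page height stops growing. The function name `scroll_until_stable` is made up, and whether this particular site reacts to JavaScript scrolling the same way it reacts to key presses is an assumption you would need to verify. The only selenium feature used is the driver's `execute_script` method.

```python
import time

def scroll_until_stable(browser, pause=1.0, max_rounds=500):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page.
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the site time to load the next batch of blocks.
        time.sleep(pause)
        new_height = browser.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            # Nothing new was loaded, so we assume we've reached the end.
            break
        last_height = new_height
    return rounds
```

You would call it as `scroll_until_stable(browser)` after `browser.get(...)`. The `max_rounds` cap guards against pages that never stop growing, and a slow connection may call for a longer `pause`.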

Now that we have the full list of links, we should save it so that we don't have to re-run the extraction script. We can do so using JSON, a text format for storing data structures like lists and dictionaries. This storing process is called *serialization*. When we serialize an object, we _encode_ it into a form (here, JSON text) that can be written to disk and later _decoded_, or _deserialized_, back into an object. Let's try this for our list in Python. 

In [89]:
import json

with open("speech_links.json", "w") as outfile:
    json.dump(links, outfile)

Now our list is saved in the file named `speech_links.json`. If we'd like to de-serialize a JSON file and load it into a Python object, we can do so in a similar way:

In [90]:
with open("speech_links.json", "r") as infile:
    links = json.load(infile)

This code opens the JSON file, deserializes it, and loads it into a Python list object, which we can interact with. 
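
The same round trip can be done in memory with `json.dumps` and `json.loads`, which makes it easy to see what serialization produces. The links below are made up and merely stand in for the scraped list.

```python
import json

# Made-up links, standing in for the scraped list.
sample_links = [
    "https://example.org/speeches/first",
    "https://example.org/speeches/second",
]

encoded = json.dumps(sample_links)   # serialize: Python list -> JSON text
decoded = json.loads(encoded)        # deserialize: JSON text -> Python list

print(type(encoded).__name__)    # str
print(decoded == sample_links)   # True

# JSON has no tuple type, so tuples come back as lists.
print(json.loads(json.dumps({"span": (1, 2)})))   # {'span': [1, 2]}
```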

Now let's define a function that can take a link, parse its HTML and extract a cleaned version of the transcript text. We can use a dictionary to make it easy to search for presidents by their names. 

In [104]:
# Import unicodedata library to help with cleaning scraped text
import unicodedata

# Define a function that takes a miller center speech transcript link as an argument
def get_speeches(link):
    
    # Get HTML from live web page
    html_body = urllib.request.urlopen(link)
    
    # Parse HTML with BS
    soup = BeautifulSoup(html_body,'html.parser')
    
    # Select appropriate element for transcript
    transcript = soup.select('.transcript-inner p')
    
    # If nothing was found (e.g. the speech is captured on video and the text
    # sits behind "View Transcript"), fall back to that container instead
    if not transcript:
        transcript = soup.select('.view-transcript p')
        
    # Go through every <p> that BS finds, extract the text, and clean it
    transcripts = [unicodedata.normalize("NFKD", elem.get_text()) for elem in transcript]
    
    # Go through list of <p>'s and join them into a single string
    speech = ' '.join(transcripts)
    
    # Get the speechgiver's name and the date of the speech
    president_name = soup.select('.president-name')[0].get_text()
    speech_date = soup.select(".episode-date")[0].get_text()
    # Create a dictionary consisting of three items: 
    # * the president's name
    # * the president's speech
    # * the date of the speech
    speech_dic = {'Name': president_name, 'Speech': speech, 'Date': speech_date}
    
    return speech_dic
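
The `unicodedata.normalize("NFKD", ...)` call in the function above deserves a closer look. Scraped text often contains characters that look ordinary but aren't, such as no-break spaces. NFKD normalization replaces such "compatibility" characters with their plain equivalents. The sample string below is made up for illustration.

```python
import unicodedata

# Made-up sample containing a no-break space (U+00A0) and an "fi" ligature (U+FB01).
raw = "Fellow\u00a0Citizens of the \ufb01rst Congress"
clean = unicodedata.normalize("NFKD", raw)
print(clean)

# NFKD also splits accented letters into a base letter plus a combining mark,
# so a single "é" (U+00E9) becomes two code points.
decomposed = unicodedata.normalize("NFKD", "\u00e9")
print(len(decomposed))
```

The second effect rarely matters for English speeches, but for text with many accented characters you may prefer the `"NFKC"` form, which recombines the letters after cleaning.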

In [105]:
get_speeches(links[0])

{'Name': 'George Washington',
 'Speech': 'Fellow Citizens of the Senate and the House of Representatives:  Among the vicissitudes incident to life, no event could have filled me with   greater anxieties than that of which the notification was transmitted by your   order, and received on the fourteenth day of the present month. On the one hand,   I was summoned by my Country, whose voice I can never hear but with veneration   and love, from a retreat which I had chosen with the fondest predilection, and,   in my flattering hopes, with an immutable decision, as the asylum of my declining   years: a retreat which was rendered every day more necessary as well as more   dear to me, by the addition of habit to inclination, and of frequent interruptions   in my health to the gradual waste committed on it by time. On the other hand,   the magnitude and difficulty of the trust to which the voice of my Country called   me, being sufficient to awaken in the wisest and most experienced of her citi

Now let's loop through our list of links and extract the speeches. We will save these dictionaries as single `.json` files in a directory called `us_presidential_speeches`. 

In [106]:
import os

speech_dir = "./us_presidential_speeches"

# Create the output directory if it doesn't exist yet
os.makedirs(speech_dir, exist_ok=True)

for i, link in enumerate(links[::-1]):
    # Zero-pad the index to three digits (e.g. 7 -> "007") so the files sort correctly
    speech_idx = f"{i:03d}"
    # Scrape first, so that a failed request doesn't leave an empty file behind
    speech_dic = get_speeches(link)
    with open(os.path.join(speech_dir, f"speech_{speech_idx}.json"), "w") as speech_out:
        json.dump(speech_dic, speech_out)
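
Once the files are written, reading them back is the mirror image of the loop above. The sketch below uses a made-up helper, `load_speeches`, and demonstrates it on a throwaway directory with placeholder records. It also shows why the zero-padded file names matter: a plain alphabetical sort then puts speech 2 before speech 10.

```python
import json
import os
import tempfile

def load_speeches(speech_dir):
    """Load every speech_*.json file in speech_dir, in file name order."""
    speeches = []
    for filename in sorted(os.listdir(speech_dir)):
        if filename.startswith("speech_") and filename.endswith(".json"):
            with open(os.path.join(speech_dir, filename)) as infile:
                speeches.append(json.load(infile))
    return speeches

# Demonstration on a temporary directory with placeholder records.
with tempfile.TemporaryDirectory() as demo_dir:
    for idx in (10, 2):
        record = {"Name": f"President {idx}", "Speech": "...", "Date": "..."}
        with open(os.path.join(demo_dir, f"speech_{idx:03d}.json"), "w") as out:
            json.dump(record, out)
    loaded = load_speeches(demo_dir)

print([speech["Name"] for speech in loaded])
# ['President 2', 'President 10']
```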

We'll leave off here. Next lecture, we'll see how we can work with this raw text, transform it into a numeric format, and visualize how different speeches may relate to each other quantitatively. 

Web scraping code largely adapted (stolen) from [here](https://github.com/hajir-almahdi/the-data-behind-presidental-charisma/blob/master/The-Data-Behind-Presidental-Charisma.ipynb). 