# Introduction to Data Science – Lecture 12 – Web Scraping
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/* 

In this lecture we will explore how we can extract data from a web-page using automatic scraping and crawling with [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

First, we'll talk a bit about HTML though. 

## HTML and the DOM

We will scrape web pages that are (partially) written in HTML and represented in the DOM. DOM stands for Document Object Model, while HTML stands for “HyperText Markup Language”. 25 years ago, that was a meaningful description of what HTML actually did: it has links (hypertext), and it is a markup language. The latest version of HTML, however, the HTML5 standard, does much, much more: graphics, audio, video, etc. So it is easier to think of HTML as “whatever it is that web browsers know how to interpret”, and just not think about the actual term.

### Elements

The important thing about HTML is that the markup is represented by elements. An HTML element is a portion of the content that is surrounded by a pair of tags of the same name. Like this:

```html
<strong>This is an HTML element.</strong>
```

In this element, strong is the name of the tag; the open tag is `<strong>`, and the matching closing tag is `</strong>`. The way you should interpret this is that the text “This is an HTML element” should be “strong”, i.e., typically this will be bold text.

HTML elements can and commonly do nest:

```html
<strong>This is strong, and <u>this is underlined and strong.</u></strong>
```

In addition to the names, opening tags can contain extra information about the element. These are called attributes:

```html
<a href='http://www.google.com'>A link to Google's main page</a>
```

In this case, we’re using the `a` element, which originally stood for “anchor” but is now almost universally used as a “link”. The attribute `href` means “hypertext reference”, which actually makes sense for a change. The meaning given to each attribute changes from element to element.

Important attributes for our purposes are `id` and `class`. The `id` attribute gives the element a unique identifier, which can then be used to access the element programmatically. Think of it as making the element accessible via a global variable.

The `class` attribute is conceptually similar but is intended to be applied to a whole “class” of elements.

HTML pages require some boilerplate. Here is a minimal page: 

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
Hello World! What's up?
</body>
</html>
``` 

The `<head>` contains meta-information such as the title of the site; the `<body>` contains the actual data.

### Hierarchy

Data in HTML is often structured hierarchically:

```html
<body>
  <article>
    <span class="date">Published: 2016-08-25</span>
    <span class="author">Led Zeppelin</span>
    <h1>Ramble On</h1>
    <div class="content">
    Leaves are falling all around, It's time I was on my way. 
    Thanks to you, I'm much obliged for such a pleasant stay. 
    But now it's time for me to go. The autumn moon lights my way. 
    For now I smell the rain, and with it pain, and it's headed my way. 
    </div>
  </article>
  <article>
    <span class="date">Published: 2016-08-23</span>
    <span class="author">Radiohead</span>
    <h1>Burn the Witch</h1>
    <div class="content">
    Stay in the shadows
    Cheer at the gallows
    This is a round up
    This is a low flying panic attack
    Sing a song on the jukebox that goes
    Burn the witch
    Burn the witch
    We know where you live
    </div>
  </article>
</body>
```

Here, the title of the song is nested three levels deep: `body > article > h1`.
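The nesting can also be made explicit in code. The following is a minimal sketch that uses Python's built-in `html.parser` module (not Beautiful Soup, which is introduced later in this lecture). The class name `PathCollector` and the shortened HTML snippet are made up for illustration, and the sketch assumes that every opened tag is properly closed.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record the chain of ancestors for every opening tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.paths = []   # one "a > b > c" string per opening tag

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append(" > ".join(self.stack))

    def handle_endtag(self, tag):
        # Assumes well-formed input: the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


snippet = """
<body>
  <article>
    <span class="author">Led Zeppelin</span>
    <h1>Ramble On</h1>
  </article>
</body>
"""

collector = PathCollector()
collector.feed(snippet)
for path in collector.paths:
    print(path)
```

The last line printed is `body > article > h1`, the same path as described above.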

### Tables

Data is also often stored in HTML tables. `<tr>` indicates a row (table row), while `<th>` and `<td>` mark cells, either header cells (`<th>`) or regular cells (`<td>`).

```html
<table>
    <tr>
        <th></th>
        <th>The Beatles</th>
        <th>Led Zeppelin</th>
    </tr>
    <tr>
        <td># Band Members</td>
        <td>4</td>
        <td>4</td>
    </tr>
</table>
```
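As a preview of what comes later in this lecture, here is a small sketch of how such a table could be turned into a list of rows with Beautiful Soup. It assumes the `bs4` package is installed; the variable names (`table_html`, `table_soup`, `rows`) are only illustrative.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
    <tr>
        <th></th>
        <th>The Beatles</th>
        <th>Led Zeppelin</th>
    </tr>
    <tr>
        <td># Band Members</td>
        <td>4</td>
        <td>4</td>
    </tr>
</table>
"""

table_soup = BeautifulSoup(table_html, "html.parser")

rows = []
for tr in table_soup.find_all("tr"):
    # header cells (<th>) and regular cells (<td>) are treated alike here
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

Each inner list corresponds to one `<tr>`; the empty header cell shows up as an empty string.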

### The DOM

As we have seen above, a markup document looks a lot like a tree: it has a root, the HTML element, and elements can have children that are themselves elements.

While HTML is a textual representation of a markup document, the DOM is a programming interface for it. The DOM also represents the state of a page as it is rendered. Nowadays that does not mean that there is an underlying HTML document that corresponds to it exactly; rather, the DOM is often generated or modified dynamically with, e.g., JavaScript.

In this class we will use “DOM” to mean the tree created by the web browsers to represent the document.

#### Inspecting the DOM in a browser

Perhaps the most important habit when scraping is to investigate the source of a page using the Developer Tools. In this case, we’ll look at the Element tree, by clicking on the menu bar: View → Developer → Developer Tools.

Alternatively, you can right click on any part of the webpage, and choose “Inspect Element”. Notice that there can be a big difference between what is in the DOM and what is in the source.

Take a look at the DOM of [this html page](lyrics.html). Next, we'll scrape the data from this page! 

## Scraping with BeautifulSoup

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library designed for computationally extracting data from HTML documents. It supports navigating the DOM and retrieving exactly the data elements you need.

Let's start with a simple example using the [lyrics.html](lyrics.html) file.

In [1]:
# install Beautiful Soup with `pip install beautifulsoup4`; the PyPI package named BeautifulSoup is the much older version 3
from bs4 import BeautifulSoup

# we open the file and tell BeautifulSoup which parser to use
with open("lyrics.html") as lyrics_file:
    song_soup = BeautifulSoup( lyrics_file, "html.parser" )
# the output corresponds to the content of the html file
song_soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Lyrics</title>
</head>
<body>
<article id="zep">
<span class="date">Published: 2016-08-25</span>
<span class="author">Led Zeppelin</span>
<h1>Ramble On</h1>
<div class="content">
    Leaves are falling all around, It's time I was on my way.
    Thanks to you, I'm much obliged for such a pleasant stay.
    But now it's time for me to go. The autumn moon lights my way.
    For now I smell the rain, and with it pain, and it's headed my way.
    </div>
</article>
<article id="radio">
<span class="date">Published: 2016-08-23</span>
<span class="author">Radiohead</span>
<h1>Burn the Witch</h1>
<div class="content">
    Stay in the shadows
    Cheer at the gallows
    This is a round up
    This is a low flying panic attack
    Sing a song on the jukebox that goes
    Burn the witch
    Burn the witch
    We know where you live
    </div>
</article>
</body>
</html>

In [2]:
# sometimes that can be hard to read, so we can format it
print( song_soup.prettify() )

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Lyrics
  </title>
 </head>
 <body>
  <article id="zep">
   <span class="date">
    Published: 2016-08-25
   </span>
   <span class="author">
    Led Zeppelin
   </span>
   <h1>
    Ramble On
   </h1>
   <div class="content">
    Leaves are falling all around, It's time I was on my way.
    Thanks to you, I'm much obliged for such a pleasant stay.
    But now it's time for me to go. The autumn moon lights my way.
    For now I smell the rain, and with it pain, and it's headed my way.
   </div>
  </article>
  <article id="radio">
   <span class="date">
    Published: 2016-08-23
   </span>
   <span class="author">
    Radiohead
   </span>
   <h1>
    Burn the Witch
   </h1>
   <div class="content">
    Stay in the shadows
    Cheer at the gallows
    This is a round up
    This is a low flying panic attack
    Sing a song on the jukebox that goes
    Burn the witch
    Burn the witch
    We know where you live

We can access content by tags:

In [3]:
# get the title tag
song_soup.title

<title>Lyrics</title>

And get the text out of the tag:

In [5]:
song_soup.title.string

'Lyrics'
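One caveat worth knowing: `.string` only returns text when a tag has a single child. The sketch below, on a made-up one-line snippet, contrasts it with `get_text()`, which collects all the text below a tag.

```python
from bs4 import BeautifulSoup

snippet = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# <b> has exactly one child, a text node, so .string returns it
print(snippet.b.string)      # world

# <p> has two children (a text node and <b>), so .string is None
print(snippet.p.string)      # None

# get_text() concatenates all the text found below the tag
print(snippet.p.get_text())  # Hello world
```

If you are not sure whether an element contains nested tags, `get_text()` is usually the safer choice.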

Directly accessing an element by its tag name returns only the first occurrence of that tag; we don't get the others.

In [6]:
song_soup.div

<div class="content">
    Leaves are falling all around, It's time I was on my way.
    Thanks to you, I'm much obliged for such a pleasant stay.
    But now it's time for me to go. The autumn moon lights my way.
    For now I smell the rain, and with it pain, and it's headed my way.
    </div>

Again, we can retrieve the text content of an element:

In [7]:
print( song_soup.div.string )


    Leaves are falling all around, It's time I was on my way.
    Thanks to you, I'm much obliged for such a pleasant stay.
    But now it's time for me to go. The autumn moon lights my way.
    For now I smell the rain, and with it pain, and it's headed my way.
    


We can use attributes to find a specific element:

In [13]:
zep = song_soup.find( id="zep" )
print(zep.prettify())

<article id="zep">
 <span class="date">
  Published: 2016-08-25
 </span>
 <span class="author">
  Led Zeppelin
 </span>
 <h1>
  Ramble On
 </h1>
 <div class="content">
  Leaves are falling all around, It's time I was on my way.
    Thanks to you, I'm much obliged for such a pleasant stay.
    But now it's time for me to go. The autumn moon lights my way.
    For now I smell the rain, and with it pain, and it's headed my way.
 </div>
</article>
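Once we have an element, we can also read its attributes. The following sketch uses a shortened, self-contained copy of the `zep` article (not the full `lyrics.html` file) to show dictionary-style access and the more forgiving `.get()` method.

```python
from bs4 import BeautifulSoup

html = '<article id="zep"><span class="date">Published: 2016-08-25</span></article>'
article = BeautifulSoup(html, "html.parser").find(id="zep")

article_id = article["id"]            # dictionary-style access; raises KeyError if missing
span_classes = article.span["class"]  # class can hold several values, so it is a list
missing = article.get("href")         # .get() returns None instead of raising

print(article_id, span_classes, missing)
```

Using `.get()` is convenient when looping over many tags of which only some carry the attribute you are after.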



We can also get only the text, not the HTML markup:

In [14]:
text = song_soup.find( id="zep" ).get_text()
print( text )


Published: 2016-08-25
Led Zeppelin
Ramble On

    Leaves are falling all around, It's time I was on my way.
    Thanks to you, I'm much obliged for such a pleasant stay.
    But now it's time for me to go. The autumn moon lights my way.
    For now I smell the rain, and with it pain, and it's headed my way.
    



We can also use find_all to get all instances of a tag:

In [15]:
h1s = song_soup.find_all( "h1" )
h1s

[<h1>Ramble On</h1>, <h1>Burn the Witch</h1>]

This returns a list-like `ResultSet` of Beautiful Soup `Tag` objects:

In [16]:
type( h1s[0] )

bs4.element.Tag

It's easy to get the text out of this:

In [17]:
string_h1s = [ tag.get_text() for tag in h1s ]
string_h1s

['Ramble On', 'Burn the Witch']
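Because `find` and `find_all` can also be called on a single element, we can search within each article separately and keep related values together. Here is a sketch on a shortened copy of the lyrics markup; the names `mini_soup` and `songs` are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<article id="zep">
  <span class="date">Published: 2016-08-25</span>
  <span class="author">Led Zeppelin</span>
  <h1>Ramble On</h1>
</article>
<article id="radio">
  <span class="date">Published: 2016-08-23</span>
  <span class="author">Radiohead</span>
  <h1>Burn the Witch</h1>
</article>
"""
mini_soup = BeautifulSoup(html, "html.parser")

songs = []
for article in mini_soup.find_all("article"):
    # search only inside this article; class_ filters on the class attribute
    author = article.find("span", class_="author").get_text()
    title = article.find("h1").get_text()
    songs.append((author, title))

print(songs)
```

Searching per article avoids mixing up authors and titles, which could happen if we collected all spans and all headings in two separate lists.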

Since `find_all` is so commonly used, there is a shortcut: calling a soup object (or a tag) directly is equivalent to calling `find_all` on it:

In [18]:
song_soup( "div" )

[<div class="content">
     Leaves are falling all around, It's time I was on my way.
     Thanks to you, I'm much obliged for such a pleasant stay.
     But now it's time for me to go. The autumn moon lights my way.
     For now I smell the rain, and with it pain, and it's headed my way.
     </div>,
 <div class="content">
     Stay in the shadows
     Cheer at the gallows
     This is a round up
     This is a low flying panic attack
     Sing a song on the jukebox that goes
     Burn the witch
     Burn the witch
     We know where you live
     </div>]

We can address the elements in the returned object directly:

In [19]:
song_soup("div")[1]

<div class="content">
    Stay in the shadows
    Cheer at the gallows
    This is a round up
    This is a low flying panic attack
    Sing a song on the jukebox that goes
    Burn the witch
    Burn the witch
    We know where you live
    </div>

Or iterate over it:

In [20]:
for p in song_soup.find_all("div"):
    print("---")
    print(p)

---
<div class="content">
    Leaves are falling all around, It's time I was on my way.
    Thanks to you, I'm much obliged for such a pleasant stay.
    But now it's time for me to go. The autumn moon lights my way.
    For now I smell the rain, and with it pain, and it's headed my way.
    </div>
---
<div class="content">
    Stay in the shadows
    Cheer at the gallows
    This is a round up
    This is a low flying panic attack
    Sing a song on the jukebox that goes
    Burn the witch
    Burn the witch
    We know where you live
    </div>


### CSS Selectors

We can also use CSS selectors. CSS Selectors apply, among others, to elements, classes, and IDs.

Below is an example of how CSS is used to style different elements. 


```CSS
/* Element Selector */
article {
  color: FireBrick;
}

/* ID selector */
#myID {
  color: Tomato;
}

/* Class selector */
.myClass {
  color: Aquamarine;
}

/* Child selector. Only DIRECT children match */
p > b {
  color: SteelBlue;
}

/* Descendant selector. Every time a b is nested within a div this matches */
div b {
  color: green;
}

```

[Here is an example](https://jsfiddle.net/gxhqv26m/1/) with all the important selectors.

Let's try this in Python:


In [21]:
# selecting all elements of class .content
song_soup.select( ".content" )

[<div class="content">
     Leaves are falling all around, It's time I was on my way.
     Thanks to you, I'm much obliged for such a pleasant stay.
     But now it's time for me to go. The autumn moon lights my way.
     For now I smell the rain, and with it pain, and it's headed my way.
     </div>,
 <div class="content">
     Stay in the shadows
     Cheer at the gallows
     This is a round up
     This is a low flying panic attack
     Sing a song on the jukebox that goes
     Burn the witch
     Burn the witch
     We know where you live
     </div>]

In [24]:
# selecting all divs that are somewhere below the id radio in the tree
song_soup.select( "#radio div" )

[<div class="content">
     Stay in the shadows
     Cheer at the gallows
     This is a round up
     This is a low flying panic attack
     Sing a song on the jukebox that goes
     Burn the witch
     Burn the witch
     We know where you live
     </div>]
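The difference between the child selector and the descendant selector is easy to check on a tiny example. The snippet below is made up for illustration; it also uses `select_one`, which returns only the first match instead of a list.

```python
from bs4 import BeautifulSoup

html = "<div><p><b>direct child</b> and <i><b>nested deeper</b></i></p></div>"
demo = BeautifulSoup(html, "html.parser")

# child selector: only <b> elements that are DIRECT children of a <p>
child_matches = [b.get_text() for b in demo.select("p > b")]

# descendant selector: every <b> anywhere below a <div>
descendant_matches = [b.get_text() for b in demo.select("div b")]

# select_one returns the first matching element (or None if nothing matches)
first_b = demo.select_one("div b").get_text()

print(child_matches)
print(descendant_matches)
print(first_b)
```

The second `<b>` is wrapped in an `<i>`, so it is matched by `div b` but not by `p > b`.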

OK, now we know how to extract information out of a website. Now let's look at a complete example. 

## Fetching a Website

Downloading websites is easy and very efficient. As a consequence, you can cause quite a high load on a server when you scrape a lot, so webmasters usually publish what kinds of scraping they allow on their websites. You should check a website's terms of service and the `robots.txt` of a domain before crawling excessively. Terms of service are usually broad, so searching for “scraping” or “crawling” is a good idea.

Let's take a look at [Google Scholar's robots.txt](https://scholar.google.com/robots.txt):

```
User-agent: *
Disallow: /search
Allow: /search/about
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
...
Disallow: /scholar
Disallow: /citations?
...
```

Here it specifies that you're not allowed to crawl a lot of the pages. The `/scholar` subdirectory is especially painful because it prohibits you from generating queries dynamically. 
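Python's standard library can evaluate such rules for us. The sketch below feeds a few of the `Disallow` lines from the excerpt above to `urllib.robotparser`, so no network access is needed. Only a subset of the rules is used to keep the example simple, and the second URL is a hypothetical path chosen because none of these rules match it.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules shown above, supplied as text instead of being downloaded.
robots_lines = """
User-agent: *
Disallow: /search
Disallow: /scholar
Disallow: /citations?
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)

# A dynamically generated query under /scholar is not allowed ...
query_allowed = parser.can_fetch("*", "https://scholar.google.com/scholar?q=web+scraping")
# ... while a path that matches none of the rules is (hypothetical URL).
other_allowed = parser.can_fetch("*", "https://scholar.google.com/intl/en/scholar/about.html")

print(query_allowed, other_allowed)
```

For a live site you would instead call `parser.set_url(...)` followed by `parser.read()` to download the actual `robots.txt` once before crawling.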

It's also common that sites ask you to delay crawling:

```
Crawl-delay: 30 
Request-rate: 1/30 
```

You should respect those restrictions. Now, no one can stop you from running a request through a crawler, but sites like Google Scholar will block you VERY quickly if you request too many pages in a short time frame.
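Respecting a crawl delay can be as simple as pausing between requests. The following is a sketch, not a full crawler: `fetch_politely` and `fake_fetch` are hypothetical names, the URLs are placeholders, and the `sleep` function is passed in as a parameter so that the demonstration below runs instantly without contacting any server.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=30.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page content;
    `sleep` is injectable so the pause can be replaced in tests.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages


# Demonstration with a stand-in fetch function, so no real server is contacted.
def fake_fetch(url):
    return f"<html>content of {url}</html>"


waits = []
pages = fetch_politely(
    ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
    fetch=fake_fetch,
    delay_seconds=30,
    sleep=waits.append,  # record the pauses instead of actually sleeping
)
print(len(pages), waits)
```

In real use you would leave `sleep` at its default and pass a function that actually downloads the page as `fetch`.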

An alternative strategy to dynamic crawling (as we're doing in the next example) is to download a local copy of the website and crawl that. This ensures that you hit the site only once per page. A good tool to achieve that is [wget](https://www.gnu.org/software/wget/). 
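The same idea can be sketched directly in Python: look for a local copy first and only hit the server when there is none. The helper names `download` and `load_page` are made up for this example, and the sketch assumes the page is encoded as UTF-8.

```python
from pathlib import Path
import urllib.request


def download(url):
    """Fetch a URL and return its body decoded as UTF-8 (assumes a UTF-8 page)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def load_page(url, cache_path, downloader=download):
    """Return the page from the local copy if present, otherwise download and save it."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = downloader(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Calling `load_page(url, "some_page.html")` twice contacts the server only the first time; the second call reads the saved file. Keep in mind that the local copy can become outdated, so delete the file when you need fresh data.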

### Example: Utah's course enrollments

We're going to build a dataset of classes offered this fall at the U and look at the enrollment numbers. We'll use the catalog listed here:  
https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/

The U doesn't seem to care whether or how we crawl its websites: the [fine print](https://www.utah.edu/disclaimer/) doesn't mention it, and there is no `robots.txt`: http://www.utah.edu/robots.txt

We'll use the [`urllib.request`](https://docs.python.org/3.0/library/urllib.request.html) library to retrieve the websites.

In [4]:
import urllib.request
url = "https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/"
# here we actually access the website
with urllib.request.urlopen( url ) as response:
    html = response.read()
    html = html.decode( 'utf-8' )

# save the file
with open( 'class_schedule.html', 'w' ) as new_file:
    new_file.write(html)

# here it's already a local operation
soup = BeautifulSoup( html, 'html.parser' )

Let's take a look at the first 1000 characters of this page:

In [5]:
print( soup.prettify()[0:1000] )

<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   Class Schedules
  </title>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300,300i,700" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,600,600i,700,700i,800,800i" rel="stylesheet"/>
  <link href="/uofu/stu/ClassSchedules/css/main.css" rel="stylesheet"/>
  <link href="/uofu/stu/ClassSchedules/css/addins.css" rel="stylesheet"/>
  <link crossorigin="anonymous" href="https://cdnjs.cloudflare.com/ajax/libs/tablesaw/3.1.2/tablesaw.css" integrity="sha256-RKWkj+/VPTTBpmJvgbTZhLSjgt7/i5qqwdo6uULEVT4=" rel="stylesheet"/>
  <link href="/uofu/stu/ClassSchedules/images/favicon.ico" rel="icon" type="image/png"/>
  <link crossorigin="anonymous" href="https://use.fontawesome.com/r

If you play around with the inspector in Chrome, you can find the div elements that correspond to a subject description + link:
```
CS: (Computer Science, https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=CS)
```

Let's try to grab that info for all subjects:

In [None]:
# OLD (Now Broken) VERSION

# # subjects = {}

# for subject in soup.find_all( class_="subject-list" ):
#     # the url is relative. 
#     # We can get the tail by retrieving the link out the href attribute of the a tag
#     link_tail = subject.find("a").get("href")
#     # concatenate the base URL and the tail of the link
#     link = url + link_tail
#     # the subject shortname is embedded within the <a> tag
#     subject_short = subject.find("a").get_text()
#     # the subject name is embedded within the span 
#     subject_long = subject.span.get_text()
#     # write it to the dictionary
#     subjects[subject_short] = (subject_long, link)

# subjects

Oof... The U changed the formatting of this data since the last time I taught this class. Scraping is VERY FRAGILE. It looks like NONE of the elements have any sort of useful semantic data in the class/id attributes, so let's try to pull our course links by looking at the `href` attribute:

In [7]:
subjects = {}
for atag in soup.find_all( "a" ):
    link = atag.get( "href" )
    if link and link.startswith( "class_list.html?subject" ):
        subjectShort, subjectLong = atag.get_text().split(' - ', maxsplit=1)
        subjects[subjectShort] = (subjectLong, url + link)
display( subjects )

{'ACCTG': ('Accounting',
  'https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=ACCTG'),
 'AEROS': ('Aerospace Studies',
  'https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=AEROS'),
 'ANAT': ('Neurobiology and Anatomy',
  'https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=ANAT'),
 'ANES': ('Anesthesiology',
  'https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=ANES'),
 'ANTH': ('Anthropology',
  'https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=ANTH'),
 'ARAB': ('Arabic',
  'https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=ARAB'),
 'ARCH': ('Architecture',
  'https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=ARCH'),
 'ART': ('Art',
  'https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=ART

In [8]:
subjects["MATH"]

('Mathematics',
 'https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/class_list.html?subject=MATH')

That's what we want. 

As an aside: we could have taken a different approach here. Note how the URL has a deterministic query parameter that matches the subject:

```
class_list.html?subject=MATH
```

We could use this to also retrieve the links if we only had the subject shortnames. 
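A sketch of that alternative, using only the standard library: `urlencode` builds the query string and `urljoin` combines it with the base URL. The function name `subject_url` is made up for this example.

```python
from urllib.parse import urlencode, urljoin

base_url = "https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1208/"


def subject_url(subject_short):
    """Build the class-list URL for a subject shortname such as 'MATH'."""
    query = urlencode({"subject": subject_short})
    return urljoin(base_url, "class_list.html?" + query)


print(subject_url("MATH"))
```

Compared with plain string concatenation, `urljoin` also handles relative links that start with `/` or `../`, which makes it a bit more robust when a site changes how it writes its links.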

#### Getting a list of classes

Next, it's time to get the courses. Let's look at the [website for CS](https://student.apps.utah.edu/uofu/stu/ClassSchedules/main/1184/class_list.html?subject=CS).

We'll fetch this class list in a function to which we pass the URL of the subject's class list:

In [29]:
def getWebsiteAsSoup(url):
    """
    Retrieve a website and return it as a BeautifulSoup object.
    """
    # Optional: some servers only answer requests that send a browser-like
    # User-Agent header. To send one, build the request like this instead:
    # user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
    # req = urllib.request.Request(url, headers={'User-Agent': user_agent})
    req = urllib.request.Request( url )
    with urllib.request.urlopen( req ) as response:
        classlist_html = response.read()

    # parse the raw bytes; BeautifulSoup works out the encoding itself
    class_soup = BeautifulSoup(classlist_html, 'html.parser')

    # keep a local copy of the parsed page for later inspection
    with open('class_list.html', 'w', encoding='utf-8') as new_file:
        new_file.write(str(class_soup))

    return class_soup

Let's run this function for CS and look at the output:

In [30]:
class_soup = getWebsiteAsSoup( subjects["CS"][1] )

In [31]:
print( class_soup.select('.class-info')[1].prettify() )

<div class="class-info card mt-3" id="19247">
 <div class="card-body row d-none d-md-block">
  <div class="col-12 p-0">
   <div class="buttons">
    <a class="btn btn-secondary btn-sm" href="description.html?subj=CS&amp;catno=1030&amp;section=010">
     Class Details
    </a>
   </div>
   <h3>
    <a href="sections.html?subj=CS&amp;catno=1030">
     CS 1030
    </a>
    -
    <span>
     010
    </span>
    <a href="https://class-tools.app.utah.edu/syllabus/1208/19247/syllabus.pdf">
     Foundations of CS
    </a>
   </h3>
  </div>
  <div class="col-12 p-0">
   <ul class="row breadcrumb-list list-unstyled">
    <li class="col-sm-auto">
     Class Number:
     <a id="19247">
     </a>
    </li>
    <li class="col-sm-auto">
     Instructor:
     <span>
      <a href="http://faculty.utah.edu/u6032171/teaching/index.hml" target="_blank">
       BROWN, CHRISTOPHER S
      </a>
     </span>
    </li>
    <li class="col-sm-auto">
     Instructor:
     <span>
      <a href="http://faculty.utah

Let's see if we can find out the number of empty seats for each instructor.

Looks like instructors are in a link that starts with http://faculty.utah.edu/

In [32]:
classes = class_soup.select( '.class-info' )

In [33]:
def instructors( classes ):
    names = []
    for ci in classes:
        # instructor names are the text of links that point to the faculty site
        instTag = [at.get_text() for at in ci.select("a") 
                   if at.get("href") and at.get("href").startswith("http://faculty.utah.edu")]
        # keep only the first instructor listed for each class entry
        if len( instTag ) > 0:
            names.append( instTag[0] )
    return names


In [34]:
instructors( classes )

['RILOFF, ELLEN',
 'BROWN, CHRISTOPHER S',
 'BROWN, CHRISTOPHER S',
 'BROWN, CHRISTOPHER S',
 'BROWN, CHRISTOPHER S',
 'BROWN, CHRISTOPHER S',
 'Huang, Tsung-Wei',
 'Huang, Tsung-Wei',
 'Huang, Tsung-Wei',
 'Huang, Tsung-Wei',
 'Huang, Tsung-Wei',
 'Huang, Tsung-Wei',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'PHILLIPS, BEI WANG',
 'PHILLIPS, BEI WANG',
 'PHILLIPS, BEI WANG',
 'PHILLIPS, BEI WANG',
 'PHILLIPS, BEI WANG',
 'PHILLIPS, BEI WANG',
 'PARKER, ERIN',
 'PARKER, ERIN',
 'PARKER, ERIN',
 'PARKER, ERIN',
 'PARKER, ERIN',
 'PARKER, ERIN',
 'PARKER, ERIN',
 'Elhabian, Shireen',
 'GOPALAKRISHNAN, GANESH',
 'COTTER, NEIL E',
 'XIANG, YU',
 'PHILLIPS, JEFF',
 'KOPTA, DANIEL',
 'KOPTA, DANIEL',
 'KOPTA, DANIEL',
 'KOPTA, DANIEL',
 'KOPTA, DANIEL',
 'KOPTA, DANIEL',
 'JENSEN, PETER',
 'JENSEN, PETER',
 'JENSEN, PETER',
 '

Lots of repeats... If we look, we'll see that many of our entries are for labs, which we don't care about. Let's filter those out:

In [35]:
def isLecture(ci):
    for st in ci.select("span"):
        if st.get_text() == "Lecture":
            return True
    return False
lectures = filter(isLecture, classes)
lecturingInstructors = instructors(lectures)
display(lecturingInstructors)

['RILOFF, ELLEN',
 'BROWN, CHRISTOPHER S',
 'Huang, Tsung-Wei',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'JOHNSON, DAVID',
 'PHILLIPS, BEI WANG',
 'PHILLIPS, BEI WANG',
 'PARKER, ERIN',
 'GOPALAKRISHNAN, GANESH',
 'COTTER, NEIL E',
 'XIANG, YU',
 'PHILLIPS, JEFF',
 'KOPTA, DANIEL',
 'JENSEN, PETER',
 'FLATT, Matthew',
 'WIESE, ELIANE S',
 'Yu, Cunxi',
 'NAZM BOJNORDI, MAHDI',
 'NAZM BOJNORDI, MAHDI',
 'NAZM BOJNORDI, MAHDI',
 'NAZM BOJNORDI, MAHDI',
 "DE ST GERMAIN, H. JAMES 'JIM'",
 'Sullivan, Blair',
 'SADAYAPPAN, SADAY',
 'SADAYAPPAN, SADAY',
 'Zhang, Mu',
 'JOSHI, SWAROOP',
 'JOSHI, SWAROOP',
 "DE ST GERMAIN, H. JAMES 'JIM'",
 'YUKSEL, CEM',
 'BHASKARA, ADITYA',
 'HOLLERBACH, JOHN',
 'RILOFF, ELLEN',
 'SRIKUMAR, VIVEK',
 'Cardona-Rivera, Rogelio E',
 'LEX, ALEXANDER',
 'GAILLARDON, PIERRE-EMMANUEL J',
 'RAKAMARIC, Zvonimir',
 'BHASKARA, ADITYA',
 'BHASKARA, ADITYA',
 'BERZINS, MARTIN',
 'SADAYAPPAN, SADAY',
 'HOLLERBACH, JOHN',
 'RILOFF, ELLEN',
 'SRIKUMAR, VIVEK',
 'SRIKUMAR, VIVEK'
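A list like this is easy to summarize with `collections.Counter`. The sketch below uses a short hand-copied sample in the same format as the output above rather than the full result; in the notebook you would pass `lecturingInstructors` instead of `sample_instructors`.

```python
from collections import Counter

# A short sample in the same format as the output above (not the full list).
sample_instructors = [
    "RILOFF, ELLEN",
    "BROWN, CHRISTOPHER S",
    "JOHNSON, DAVID",
    "JOHNSON, DAVID",
    "JOHNSON, DAVID",
    "PHILLIPS, BEI WANG",
    "PHILLIPS, BEI WANG",
]

lecture_counts = Counter(sample_instructors)

# print the two instructors with the most lecture entries in the sample
for name, count in lecture_counts.most_common(2):
    print(name, count)
```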

So, this was super brittle and complicated! The webpage was clearly NOT designed with us in mind. Hopefully this convinces you that scraping is a last-resort method of getting data.

## Scraping Wrap-Up

Scraping is a way to get information from websites that were not designed to make data accessible. As such, it can often be **brittle**: a website change will break your scraping script. It is also often not welcome, as a scraper can cause a lot of traffic.

The way we scraped information here also made the **assumption that HTML is generated consistently** based just on the URL. That is, unfortunately, less and less common, as websites adapt to browser types, resolutions, and locales, and as a lot of content is loaded dynamically, e.g., via web-sockets. For example, many websites now automatically load more data once you scroll to the bottom of the page. These websites can't be scraped with our approach; instead, a browser-emulation approach using, e.g., Selenium would be necessary. [Here is a tutorial](https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72) on how to do that.

Finally, many services make their data available through a well-defined interface, an API. Using an API is always a better idea than scraping, but scraping is a good fallback!

## Exercise 1: Exceptional Olympians

Scrape data from [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_multiple_Olympic_medalists) about exceptional Olympic medalists.

1. Download the html using urllib. 
2. Parse this html with BeautifulSoup.
3. Extract the html that corresponds to the big table from the soup.
4. Parse the table into a pandas dataframe. Hint: both the "No." and the "Total." column use row-spans which are tricky to parse, both with a pandas reader and manually. For the purpose of this exercise, exclude all rows that are not easy to parse (the first one is Bjørn Dæhlie).
5. Create a table that shows for each country how many gold, silver, bronze, and total medals it won in that list.
