# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# Mining the web, part 1 (13 points)

Perhaps the richest source of openly available data today is [the Web](http://www.computerhistory.org/revolution/networking/19/314)! In this lab, you'll explore some of the basic programming tools you need to scrape web data.

## Review: The Requests module

In Lab 1, you used Python's [Requests module](http://requests.readthedocs.io/en/latest/user/quickstart/) to download a file.

For instance, here is a code fragment that downloads the GT home page and prints its first 250 characters. You might also want to [view the source](http://www.computerhope.com/issues/ch000746.htm) of Georgia Tech's home page in your browser to get a nicely formatted view, and compare it to the output you see below.

In [None]:
import requests

response = requests.get ('http://www.gatech.edu/')
webpage = response.text  # or response.content for raw bytes

print (webpage[0:250]) # Prints the first 250 characters only

**Exercise 1** (3 points). Given the contents of the GT home page as above, write a function that returns a list of links (URLs) of the "top stories" on the page.

For instance, on Friday, September 9, 2016, here was the front page:

![www.gatech.edu as of Fri Sep 9, 2016](./www.gatech.edu--2016-09-09--annotated-medium.png)

The top stories cycle through in the large image placeholder shown above. We want your function to return the list of URLs behind each of the "Full Story" links, highlighted in red. If no URLs can be found, the function should return an empty list.

In [None]:
import re # Maybe you want to use a regular expression?

def get_gt_top_stories (webpage_text):
    """Given the HTML text for the GT front page, returns a list
    of the URLs of the top stories or an empty list if none are
    found.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
top_stories = get_gt_top_stories (webpage)
print ("Links to GT's top stories:", top_stories)

In [None]:
assert "http://www.news.gatech.edu/features/beltline-impact" in top_stories

In [None]:
assert len (get_gt_top_stories ('')) == 0

## A more complex example

Go to [Yelp!](http://www.yelp.com) and look up `ramen` in `Atlanta, GA`. Take note of the URL:

![Yelp! search for ramen in ATL](./yelp-search-example.png)

This URL encodes what is known as an _HTTP "get"_ method (or request). It basically means a URL with two parts: a _command_ followed by one or more _arguments_. In this case, the command is everything up to and including the word `search`; the arguments are the rest, where individual arguments are separated by the `&` or `#`.

> "HTTP" stands for "HyperText Transfer Protocol," which is a standardized communication protocol that allows _web clients_, like your web browser or your Python program, to communicate with _web servers_.
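
To make the command/argument split concrete, here is a minimal sketch that takes a URL apart using only Python's standard library. The URL itself is a hypothetical stand-in modeled on the search described above, not necessarily the exact one in the screenshot. Note that in this sketch the arguments begin after the `?` character.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, modeled on the ramen search described above.
example_url = "http://www.yelp.com/search?find_desc=ramen&find_loc=Atlanta%2C+GA"

parts = urlsplit(example_url)

# The "command": everything up to and including the word `search`.
command = f"{parts.scheme}://{parts.netloc}{parts.path}"

# The "arguments": the query string, split on `&` and decoded.
arguments = parse_qs(parts.query)

print(command)    # http://www.yelp.com/search
print(arguments)  # {'find_desc': ['ramen'], 'find_loc': ['Atlanta, GA']}
```

Each argument value comes back as a list because a name may legally appear more than once in a query string.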

In this next example, let's see how to build a "get request" with the `requests` module. It's pretty easy!

In [None]:
url_command = 'http://yelp.com/search'
url_args = {'find_desc': "ramen",
            'find_loc': "atlanta, ga"}
response = requests.get (url_command, params=url_args)

print ("==> Downloading from: '%s'" % response.url) # confirm URL
print ("\n==> Excerpt from this URL:\n\n%s\n" % response.text[0:100])
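
If you are curious what the `params=` argument does to the URL, the following sketch approximates it offline with the standard library's `urlencode`. This is an illustration under the assumption that no redirects occur; the `response.url` you actually see may differ, for example if the server redirects you to another address.

```python
from urllib.parse import urlencode

url_command = 'http://yelp.com/search'
url_args = {'find_desc': "ramen",
            'find_loc': "atlanta, ga"}

# Encode the arguments: spaces become `+`, commas become `%2C`,
# and individual arguments are joined with `&`.
query_string = urlencode(url_args)
full_url = url_command + '?' + query_string

print(full_url)  # http://yelp.com/search?find_desc=ramen&find_loc=atlanta%2C+ga
```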

**Exercise 2** (6 points). Given a search topic, location, and a rank $k$, return the name of the $k$-th item of a Yelp! search. If there is no $k$-th item, return `None`.

> The demo query above only gives you a website with the top 10 items, meaning you could only use it for $k \leq 10$. Figure out how to modify it to solve the problem when $k > 10$.

In [None]:
def find_yelp_item (topic, location, k):
    """Returns the k-th suggested item from Yelp! for the given topic
    and location, or None if there is no k-th item."""
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert find_yelp_item ('fried chicken', 'Atlanta, GA', -1) is None # Tests an invalid value for 'k'

> Search queries on Yelp! don't always return the same answers, since the site is always changing! Also, your results might not match a query you do via your web browser (_why not?_). As such, you should manually check your answers.

In [None]:
item = find_yelp_item ('fried chicken', 'Atlanta, GA', 1)
print (item)

# Correct answer as of September 10, 2016:
#assert item == 'Gus’s World Famous <span class="highlighted">Fried</span> <span class="highlighted">Chicken</span>'

In [None]:
item = find_yelp_item ('fried chicken', 'Atlanta, GA', 5)
print (item)

# Correct answer as of September 10, 2016:
#assert item == 'Mary Mac’s Tea Room'

In [None]:
item = find_yelp_item ('fried chicken', 'Atlanta, GA', 17)
print (item)

# Correct answer as of September 10, 2016:
#assert item == 'Fox Bros. Bar-B-Q'

## Parsing HTML: Beautiful Soup

HTML files are, of course, highly structured files. As such, you can imagine there are much more systematic methods and tools to process them. One such package is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/). The following is a quick tutorial on how to use it.

Any HTML document may be modeled as a tree:

![HTML as a tree](./html-slide.png)

> For whatever reason, [computer scientists usually view trees upside down](https://www.quora.com/Why-are-trees-in-computer-science-generally-drawn-upside-down-from-how-trees-are-in-real-life), with the "root" at the top and the "leaves" at the bottom.
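
As a rough illustration of the tree view, the sketch below prints an indented outline of a tiny document, where deeper indentation means deeper in the tree. It deliberately uses only the standard library's `html.parser` module rather than Beautiful Soup, and both the `OutlinePrinter` class and the `demo_html` string are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

class OutlinePrinter(HTMLParser):
    """Records each tag and text node, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Entering a tag: record it, then descend one level.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Leaving a tag: climb back up one level.
        self.depth -= 1

    def handle_data(self, data):
        # Text nodes are leaves; skip whitespace-only ones.
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))

demo_html = "<html><body><p>Hello, <b>world</b>!</p></body></html>"

printer = OutlinePrinter()
printer.feed(demo_html)
printer.close()

print("\n".join(printer.lines))
```

The root (`html`) appears at the left margin, and the text fragments show up as the most deeply indented leaves.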

The Beautiful Soup package gives you a data structure for traversing this tree. For instance, consider an HTML file with the contents below, shown both as code and pictorially.

In [None]:
some_page = """
<html>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph, which links to the <a href="http://www.gatech.edu">Georgia Tech website</a>.</p>
    <p>Third paragraph.</p>
  </body>
</html>
"""

![Two visual representations of `some_page`](./html-viz.png)

**Exercise 3** (1 point). Besides HTML files, what else have we seen in this class that could be represented by a tree? Briefly and roughly explain what and how.

YOUR ANSWER HERE

Here is how you might use Beautiful Soup to inspect the structure of `some_page`.

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup (some_page, "lxml")
print (type (soup.html.body.contents), '::', soup.html.body.contents)

**Exercise 4** (1 point). Write some code to display the contents of `soup`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Exercise 5** (1 point). Write a statement that navigates to the tag representing the GT website link. Store this resulting tag object in a variable called `link`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

> Observe how the following test code checks your result!

In [None]:
print (link)

import bs4
assert type (link) is bs4.element.Tag
assert link.name == 'a'
assert link['href'] == 'http://www.gatech.edu'
assert link.contents == ['Georgia Tech website']

### Other navigation tools

This lab includes a static copy of the Yelp! results for a search of "universities" in ATL. Here is some code that opens that file and prints the number 1 result.

In [None]:
with open ('yelp_atl_unies.html', 'r') as uni_file:
    uni_html_text = uni_file.read ()
uni_soup = BeautifulSoup (uni_html_text, "lxml")

print ("The number 1 ATL university according to Yelp!:")

uni_1 = uni_soup.html.body \
    .contents[7] \
    .contents[9] \
    .contents[3] \
    .contents[1] \
    .contents[3] \
    .contents[1] \
    .contents[1] \
    .contents[7] \
    .contents[3] \
    .contents[5] \
    .contents[1] \
    .contents[1] \
    .contents[1] \
    .contents[1] \
    .contents[3] \
    .contents[1] \
    .contents[1] \
    .contents[1] \
    .contents[0] \
    .contents[0]
    
print (uni_1)

We hope it is self-evident that the above method to navigate to a particular tag or element is not terribly productive or robust.
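
One reason index chains like this are fragile is that `.contents` includes text nodes, even ones consisting solely of whitespace between tags, so a stray newline in the HTML shifts every index. The sketch below shows the effect on a tiny hypothetical snippet (`demo_html`). It assumes the built-in `"html.parser"` backend rather than `"lxml"` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: two paragraphs separated by newlines and indentation.
demo_html = "<div>\n  <p>one</p>\n  <p>two</p>\n</div>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

children = demo_soup.div.contents
print(len(children))  # 5: three whitespace strings plus two <p> tags

# Keep only the children that are actual tags.
tags_only = [child for child in children if isinstance(child, Tag)]
print([tag.get_text() for tag in tags_only])  # ['one', 'two']
```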

Here is an alternative. Inspect the raw HTML and observe that every non-ad search result appears in a tag of the form,

```html
<span class="indexed-biz-name">1.         <a class="biz-name js-analytics-click" data-analytics-label="biz-name" href="/biz/georgia-institute-of-technology-atlanta-2" data-hovercard-id="gBX8UvhOwtdD5tGJeU-hxg" ><span >Georgia Institute of Technology</span></a>
</span>
```

Beautiful Soup gives us a way to search for specific tags.

In [None]:
indexed_unies = uni_soup.find_all (attrs={'class': 'indexed-biz-name'})
print (indexed_unies)
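
To see what `find_all` hands back, here is a small sketch on a made-up snippet. The HTML, the class names `dish-name` and `ad-name`, and the variable names are all hypothetical, and the parser is assumed to be the built-in `"html.parser"`. Each result is a tag object that you can navigate further, and the `class_` keyword is an alternative way to write the same search.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two "indexed" entries and one ad-like entry.
demo_html = """
<ul>
  <li><span class="dish-name">1. <a href="/dish/tonkotsu">Tonkotsu</a></span></li>
  <li><span class="dish-name">2. <a href="/dish/shoyu">Shoyu</a></span></li>
  <li><span class="ad-name"><a href="/ad/instant">Instant noodles</a></span></li>
</ul>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Two equivalent ways to search by CSS class.
by_attrs = demo_soup.find_all(attrs={'class': 'dish-name'})
by_keyword = demo_soup.find_all('span', class_='dish-name')
print(len(by_attrs), len(by_keyword))  # 2 2

# Each match is a Tag, so you can keep navigating from it.
dish_names = [tag.a.get_text(strip=True) for tag in by_attrs]
print(dish_names)  # ['Tonkotsu', 'Shoyu']
```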

**Exercise 6** (4 points). Based on the above, write a function that, given a Yelp! search results page such as `uni_soup` above, returns the name of the number 1 indexed search result.

In [None]:
def get_top_yelp_result (soup):
    """Given a Yelp! search result as a Beautiful Soup page,
    returns the name of the number 1 indexed result.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
print (get_top_yelp_result (uni_soup))
assert get_top_yelp_result (uni_soup) == 'Georgia Institute of Technology'

This mini-tutorial only scratches the surface of what is possible with Beautiful Soup. As always, refer to the [package's documentation](https://www.crummy.com/software/BeautifulSoup/) for all the awesome deets!