## APIs and Web Scraping

## Introduction

Welcome to the homework on web scraping. While many people might view working with data (including scraping, parsing, storing, etc.) a necessary evil to get to the "fun" stuff (i.e. modeling), we think that if presented in the right way this munging can be quite empowering. Imagine you never had to worry or ask those _what if_ questions about data existing or being accessible... but that you can get it yourself!

By the end of this exercise hopefully you should look at the wonderful world wide web without fear, comforted by the fact that anything you can see with your human eyes, a computer can see with its computer eyes...
 
### Objectives

More concretely, this homework will teach you (and test you on):

* HTTP Requests (and lifecycle)
* RESTful APIs
    * Authentication (OAuth)
    * Pagination
    * Rate limiting
* JSON vs. HTML (and how to parse each)
* HTML traversal (CSS selectors)

### Library Documentation

* Standard Library: 
    * [io](https://docs.python.org/2/library/io.html)
    * [time](https://docs.python.org/2/library/time.html)
    * [json](https://docs.python.org/2/library/json.html)

* Third Party
    * [requests](http://docs.python-requests.org/en/master/)
    * [Beautiful Soup (version 4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
    * [yelp-fusion](https://www.yelp.com/developers/documentation/v3/get_started)

**Note:** You may come across a `yelp-python` library online. The library is deprecated and incompatible with the current Yelp API, so do not use the library.

## Working with APIs

Since everyone loves food (presumably), the ultimate end goal of this homework will be to acquire the data to answer some questions and hypotheses about the restaurant scene in Pittsburgh (which we will get to later). We will download __both__ the metadata on restaurants in Pittsburgh from the Yelp API and with this metadata, retrieve the comments/reviews and ratings from users on restaurants.

But first things first, let's do the "hello world" of making web requests with Python to get a sense for how to programmatically access web pages: an (unauthenticated) HTTP GET to download a web page.

---

In [5]:
# setup library imports
import io, time, json
import requests
from bs4 import BeautifulSoup
import re
import mugrade
import os
import dotenv
import time
import datetime
import playwright
import math

In [6]:
dotenv.load_dotenv()

True

In [23]:
YELP_API_KEY = dotenv.get_key(".env", "YELP_API_KEY")

## Q0: Basic HTTP Requests

Fill in the funtion to use `requests` to download and return the raw HTML content of the URL passed in as an argument. As an example try the following NYT article (on Facebook's algorithmic news feed): [http://www.nytimes.com/2016/08/28/magazine/inside-facebooks-totally-insane-unintentionally-gigantic-hyperpartisan-political-media-machine.html](http://www.nytimes.com/2016/08/28/magazine/inside-facebooks-totally-insane-unintentionally-gigantic-hyperpartisan-political-media-machine.html)

> Your function should return a tuple of: (`<status_code>`, `<raw_html>`)

```python
>>> facebook_article = retrieve_html('http://www.nytimes.com/2016/08/28/magazine/inside-facebooks-totally-insane-unintentionally-gigantic-hyperpartisan-political-media-machine.html')
>>> print(facebook_article)
(200, u'<!DOCTYPE html>\n<!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js section-magazine...')
```

In [7]:
@mugrade.local_tests
def retrieve_html(url):
    """
    Return the raw HTML at the specified URL.

    Args:
        url (string): 

    Returns:
        status_code (integer):
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """
    response = requests.get(url)

    return response.status_code, response.text

Running local tests for function retrieve_html():
  Test 1 PASSED
  Test 2 PASSED
  Test 3 PASSED
  Test 4 PASSED


Now while this example might have been fun, we haven't yet done anything more than we could with a web browser. To really see the power of programmatically making web requests we will need to interact with a API. For the rest of this homework we will be working with the [Yelp API](https://www.yelp.com/developers/documentation/v3/get_started) and Yelp data (for an extensive data dump see their [Academic Dataset Challenge](https://www.yelp.com/dataset_challenge)). The reasons for using the Yelp API are 3 fold:

1. Incredibly rich dataset that combines:
    * entity data (users and businesses)
    * preferences (i.e. ratings)
    * geographic data (business location and check-ins)
    * temporal data
    * text in the form of reviews
    * and even images.
2. Well [documented API](https://www.yelp.com/developers/documentation/v3/get_started) with thorough examples.
3. Extensive data coverage so that you can find data that you know personally (from your home town/city or account). This will help with understanding and interpreting your results.

## Authentication

To access the Yelp API however we will need to go through a few more steps than we did with the first NYT example. Most large web scale companies use a combination of authentication and rate limiting to control access to their data to ensure that everyone using it abides. The first step (even before we make any request) is to setup a Yelp account if you do not have one and get API credentials.

## Yelp API Access

1. Create a Yelp account (if you do not have one already)
2. [Generate API keys](https://www.yelp.com/developers/v3/manage_app) (if you haven't already). You will only need the API Key (not the Client ID or Client Secret) -- more on that later.


Now that we have our accounts setup we can start making requests! There are various authentication schemes that APIs use, including:

* No authentication
* [HTTP basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication)
* Cookie based user login
* OAuth (v1.0 & v2.0, see this [post](http://stackoverflow.com/questions/4113934/how-is-oauth-2-different-from-oauth-1) explaining the differences)
* API keys
* Custom Authentication

For the NYT example, since it is a publicly visible page we did not need to authenticate. HTTP basic authentication isn't too common for consumer sites/applications that have the concept of user accounts (like Facebook, LinkedIn, Twitter, etc.) but is simple to setup quickly and you often encounter it on with individual password protected pages/sites. I'm sure you have seen [this](http://i.stack.imgur.com/QnUZW.png) before somewhere.

Cookie based user login is what the majority of services use when you login with a browser (i.e. username and password). Once you sign in to a service like Facebook, the response stores a cookie in your browser to remember that you have logged in (HTTP is stateless). Each subsequent request to the same domain (i.e. any page on `facebook.com`) also sends the cookie that contains the authentication information to remind Facebook's servers that you have already logged in.

Many REST APIs however use OAuth (authentication using tokens) which can be thought of a programmatic way to "login" _another_ user. Using tokens, a user (or application) only needs to send the login credentials once in the initial authentication and as a response from the server gets a special signed token. This signed token is then sent in future requests to the server (in place of the user credentials).

A similar concept common used by many APIs is to assign API Keys to each client that needs access to server resources. The client must then pass the API Key along with _every_ request it makes to the API to authenticate. This is because the server is typically relatively stateless and does not maintain a session between subsequent calls from the same client. Most APIs (including Yelp) allow you to pass the API Key via a special HTTP Header: `Authorization: Bearer <API_KEY>`. Check out the [docs](https://www.yelp.com/developers/documentation/v3/authentication) for more information.

Yelp used to use OAuth tokens but has now switched to API Keys. **For the sake of backwards compatibility Yelp still provides a Client ID and Secret for OAuth, but you will not need those for this assignment.** 

---


## Q1: Authenticated HTTP Request with the Yelp API

Using the Yelp API, fill in the following function stub to make an authenticated request to the [search](https://docs.developer.yelp.com/reference/v3_business_search) endpoint. Note that you'll need to pass your API key to the function, but you can store this any way you like (it's best practice to store the API key in a separate file, but fine to simply include it in your function for this simple example).

When writing the python request, you'll need to pass in a custom header as well as a parameter. Here are some examples for [response headers](http://docs.python-requests.org/en/master/user/quickstart/#response-headers) and [passing parameters in urls](http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls)

```python
>>> num_records, data = yelp_search('Pittsburgh')
>>> print(num_records)
240
>>> print(list(map(lambda x: x['name'], data)))
['Gaucho Parrilla Argentina', 'täkō', 'Noodlehead', "Bae Bae's Kitchen", "DiAnoia's Eatery", ...]
```

(Note that there are of course more than 240 businesses in Pittsburgh, but the Yelp API will limit the count to return this amount.  More on this later).

In [8]:
@mugrade.local_tests
def yelp_search(query):
    """
    Make an authenticated request to the Yelp API.

    Args:
        query (string): Search term

    Returns:
        total (integer): total number of businesses on Yelp corresponding to the query
        businesses (list): list of dicts representing each business
    """
    YELP_API_KEY = os.getenv("YELP_API_KEY")
    url = f"https://api.yelp.com/v3/businesses/search?location={query}&limit=50"
    
    if not YELP_API_KEY:
        raise Exception("no api key")
    
    headers = {"accept": "application/json", "Authorization": f"Bearer {YELP_API_KEY}"}
    data = requests.get(url, headers=headers).json()
 
    businesses = data["businesses"]
    total = len(businesses)
    return total, businesses 


Running local tests for function yelp_search():
  Test 1 PASSED
  Test 2 PASSED


## Parameterization and Pagination

And before we can get any reviews on restaurants, we need to actually get the metadata on ALL of the restaurants in Pittsburgh. Notice above that while Yelp told us that there are more than 1000, the response contained far fewer actual `Business` objects. This is due to pagination and is a safeguard against returning __TOO__ much data in a single request (what would happen if there were 100,000 restaurants?) and can be used in conjuction with _rate limiting_ as well as a way to throttle and protect access to Yelp data.

If an API has 1,000,000 records, but only returns 10 records per page and limits you to 5 requests per second... how long will it take to acquire ALL of the records contained in the API?

One of the ways that APIs are an improvement over plain web scraping is the ability to make __parameterized__ requests. Just like the Python functions you have been writing have arguments (or parameters) that allow you to customize its behavior/actions (an output) without having to rewrite the function entirely, we can parameterize the queries we make to the Yelp API to filter the results it returns.

---

## Q2: Aquire all of the restaurants in Pittsburgh (on Yelp)

Again using the [API documentation](https://www.yelp.com/developers/documentation/v3/business_search) for the `search` endpoint, fill in the following function to retrieve all of the _Restaurants_ (using the categories parameter)** for a given query, within a radius of 1500 meters.  To get the correct results here, **you will need to specificially issue the query with the `categories` parameter set to `restaurants` and the `radius` set to `1500`;**  If you fail to do these, your answers will not match those in our tests.

You will need to account for __pagination__ and __[rate limiting](https://www.yelp.com/developers/faq)__ to:

1. Retrieve all of the Business objects (# of business objects should equal `total` in the response). Paginate by querying 20 restaurants each request.
2. Pause slightly (at least 200 milliseconds) between subsequent requests so as to not overwhelm the API (and get blocked).  

As always with API access, make sure you follow all of the [API's policies](https://www.yelp.com/developers/api_terms) and use the API responsibly and respectfully.

**DO NOT MAKE TOO MANY REQUESTS TOO QUICKLY OR YOUR KEY MAY BE BLOCKED**

Again, you can test your function with an individual neighborhod in Pittsburgh (I recommend Polish Hill). Pittsburgh itself has a lot of restaurants... meaning it will take a lot of time to download them all.

```python
>>> data = all_restaurants('Polish Hill, Pittsburgh')
>>> print(len(data))
77
>>> print([x['name'] for x in data])
['Church Brew Works', "Salem's Market & Grill", 'Poulet Bleu', 'Morcilla', ...]
```

In [9]:
@mugrade.local_tests
def all_restaurants(query):
    """
    Retrieve ALL the restaurants on Yelp for a given query.

    Args:
        query (string): Search term

    Returns:
        results (list): list of dicts representing each business
    """
    YELP_API_KEY = os.getenv("YELP_API_KEY")
    url = f"https://api.yelp.com/v3/businesses/search"


    if not YELP_API_KEY:
        raise ValueError("missing Yelp api key")
    
    headers = {"accept": "application/json", "Authorization": f"Bearer {YELP_API_KEY}"}
    params = {"location": query, "limit": 20, "offset": 0, "radius": 1500, "categories": "restaurants"}
    # delay in seconds
    request_delay = 0.2
    # in order to get consecutive results
    offset_increment = params["limit"]
    businesses = []

    # offset is restricted to 1000 by API despite that we can get only 240 businesses 
    while params["offset"] <= 1000:
        try:
            response = requests.get(url, headers=headers, params=params)
            response.raise_for_status()
            data = response.json()
            businesses += data["businesses"]
            params["offset"] += offset_increment  
            time.sleep(request_delay)
        except requests.RequestException as error:
            print(f"Error occured: {error}")
            break
        
    print(f"Total business retreived: {len(businesses)}")
    return businesses

Running local tests for function all_restaurants():
Error occured: 400 Client Error: Bad Request for url: https://api.yelp.com/v3/businesses/search?location=Polish+Hill%2C+Pittsburgh&limit=20&offset=240&radius=1500&categories=restaurants
Total business retreived: 240
  Test 1 PASSED
  Test 2 PASSED


---

Now that we have the metadata on all of the restaurants in Pittsburgh (or at least the ones listed on Yelp), we can retrieve the reviews and ratings. The Yelp API gives us aggregate information on ratings but it doesn't give us the review text or individual users' ratings for a restaurant. For that we need to turn to web scraping, but to find out what pages to scrape we first need to parse our JSON from the API to extract the URLs of the restaurants.

In general, it is a best practice to seperate the act of __downloading__ data and __parsing__ data. This ensures that your data processing pipeline is modular and extensible (and autogradable). This decoupling also solves the problem of expensive downloading but cheap parsing (in terms of computation and time).

---

## Working with Web Pages (and HTML)

Think of APIs as similar to accessing a application's database itself (something you can interactively query and receive structured data back). But the results are usually in a somewhat raw form with no formatting or visual representation (like the results from a database query). This is a benefit _AND_ a drawback depending on the end use case. For data science and _programatic_ analysis this raw form is quite ideal, but for an end user requesting information from a _graphical interface_ (like a web browser) this is very far from ideal since it takes some cognitive overhead to interpret the raw information. And vice versa, if we have HTML it is quite easy for a human to visually interpret it, but to try to perform some type of programmatic analysis we first need to parse the HTML into a more structured form.

As a general rule of thumb, if the data you need can be accessed or retrieved in a structured form (either from a bulk download or API) prefer that first. But if the data you want (and need) is not as in our case we need to resort to alternative (messier) means.

Going back to the "hello world" example of question 1 with the NYT, we will do something similar to retrieve the HTML of the Yelp site itself (rather than going through the API) programmatically as text. 

---

## Q3: Parse a Yelp restaurant Page

Using `BeautifulSoup`, parse the HTML of a single Yelp restaurant page to extract the reviews in a structured form as well as the total number of pages.  A call to this function could look like the following

```python
reviews, num_pages = parse_yelp_page("https://www.yelp.com/biz/the-porch-at-schenley-pittsburgh")
```

You will want to fill in the following function stubs to parse a single page of reviews and return:
* the reviews as a structured Python dictionary
* the total number of pages of reviews.

For each review be sure to structure your Python dictionary as follows (to be graded correctly). The order of the keys doesn't matter, only the keys and the data type of the values:

```python
{
    'author': 'Aaron W.' # str
    'rating': 4          # int
    'date': '2019-01-03' # str, yyyy-mm-dd
    'description': "Wonderful!" # str
}
```

Return reviews in the order that they are present on the page.  There can be issues with Beautiful Soup using various parsers, for maximum conpatibility (and fewest errors) initialize the library with the default (and Python standard library parser): `BeautifulSoup(markup, "html.parser")`. You may notice that the HTML is automatically generated. Yelp uses a modern web application technology called [React](https://reactjs.org/), which generates the markup from Javascript code. This is a common hazard of scraping data from HTML, because the resulting code is not actually that readable.

#### Hints:
1. As a general strategy, you should try using the Chrome browser (others probably have similar features), and the "View -> Developer -> Inspect Elements" command.  This lets you mouse over individual elements in the text and find the corresponding html entry.

2. Use the `.find()` and `.find_all()` commands in BeautifulSoup liberally.  In addition to finding exact tags, e.g. `element.find("tag")` you can specify search terms in addition, including with regular exprssions.  For example, to find all sub-elements that have  `class = " raw__<something>"` you can use `element.find("tag", class_=re.compile(r"raw"))`.

3. You can also search the element with a regular expression using `element.find("tag", string=re.compile(r"<regex>")`

4. You can get the text content of a tag, converted to text (and preserving newlines), using `element.get_text("\n")`

#### Static downloads

Finally, we have learned through extensive experience that requiring a class to parse live content pages over a 2 week assignment nearly _always_ will lead to a situation where people add and delete reviews during the assignment period, causing the grader tests to go out of date.  There are also sometimes issues in actually getting the HTML properly from headless nodes like the ones that run on Colab.  Thus, while you would normally, as the first line in your function, issue a command like the following:
```python
html = retrieve_html(url)[1]
```
we are also providing a `parse_yelp_page_dict` dictionary along with the assigment that has all the necessary pages (correct as of the assignment release).  So instead of the line above you could call:

```python
html = parse_yelp_page_dict[url]
```
These are literally just downloads of the Yelp page HTML as of the time the assignment is released.  While it will be nice to test your function using the real Yelp page, to see it working in action, you probably should do your final testing and submission using these pre-downloaded files.  But note that regardless of whether you use these saved files or the live web version, you will definitely need to use a web browser on the real system to inspect the elements of the live page.  Attempting to manually parse the HTML as it's is downloaded to these files would be extremely difficult.

In [10]:
import pickle
import gzip

with open("THE PORCH AT SCHENLEY.html", "r", encoding="utf-8") as file:
    html_content = file.read()

with gzip.open("parse_yelp_page_dict.pkl.gz", "wb") as binary_dump:
    pickle.dump(html_content, binary_dump)

In [11]:
from playwright.async_api import async_playwright

async def fetch_html(url, request_delay=0.5):
    """
    Fetches html of single review page. Has standard request delay of 0.5 second. 
    Needs to be async in order to work with Jupyter.

    Args:
        url (string): URL string corresponding to a Yelp review page
        request_delay: time to wait before doing a request

    Returns:
        string: page html content
    """
    async with async_playwright() as p:
        # browser = await p.chromium.launch(headless=True)
        with open('session_data.json', 'r') as file:
            session_data = json.load(file)


        browser = await p.chromium.launch(headless=True, args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-dev-shm-usage",
            "--no-sandbox",
            "--disable-gpu",
            "--disable-features=site-per-process"
        ])
        context = await browser.new_context()

        # await context.add_cookies(session_data["cookies"])

        page = await context.new_page()
        for key, value in session_data["localStorage"].items():
            await context.add_init_script(f"window.localStorage.setItem('{key}', '{value}')")        

        for key, value in session_data["sessionStorage"].items():
            await context.add_init_script(f"window.sessionStorage.setItem('{key}', '{value}')")        
    
        
        time.sleep(request_delay)


        await page.goto(url)
        page_html = await page.content()
        sanitize_filename = lambda c: c if str.isalpha(c) else '_'

        with open(f"html/{sanitize_filename(url)}", "w", encoding="utf-8") as file:
            file.write(page_html)
        await browser.close()
        return page_html

In [None]:
async def parse_yelp_page(url):
    """
    Parse the reviews on a single page of a restaurant.
    
    Args:
        url (string): URL string corresponding to a Yelp restaurant

    Returns:  
        tuple(list, int): a tuple of two elements
            first element: list of dictionaries corresponding to the extracted review information
            second element: Number of pages total
    """

    headers = {"accept": "application/json", "Authorization": f"Bearer {YELP_API_KEY}"}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        html = response.json()  
    except requests.RequestException as error:
        print(f"Error occured: {error}")

    # html = await fetch_html(url)
    html_soup = BeautifulSoup(html)
    reviews_number = int(str(html_soup.select("#reviews > section > div.y-css-mhg9c5 > ul > li:nth-child(n)"
                               "> div > div.y-css-9vtc3g > div > div.y-css-8x4us > div > div"
                               "> div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.y-css-mhg9c5"
                               "> div.user-passport-info.y-css-mhg9c5 > span > a")).split()[0])
    print("Reviews:", reviews_number)
    pages = math.ceil(reviews_number // 10)
    reviews_soup = html_soup.select("#reviews > section > div.y-css-mhg9c5")
    reviews_info = []
    for review in reviews_soup:
        print(review)
        author = html_soup.select("#reviews > section > div.y-css-mhg9c5 > ul > li:nth-child(1)"
                                 "> div > div.y-css-9vtc3g > div > div.y-css-8x4us > div > div"
                                 "> div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.y-css-mhg9c5"
                                 "> div.user-passport-info.y-css-mhg9c5 > span > a")
        # same for those
        rating = 0
        date = ""
        description = ""
        reviews_info.append({"author": author,
                             "rating": rating,
                             "date": date,
                             "description": description})
    
    return reviews_info, pages  

In [24]:
import asyncio
# result = await parse_yelp_page("https://www.yelp.com/biz/holbox-los-angeles-2")
result = await parse_yelp_page("https://www.yelp.com/biz/red-o-cantina-santa-monica")
print(result)


Error occured: 403 Client Error: Forbidden for url: https://www.yelp.com/biz/red-o-cantina-santa-monica


UnboundLocalError: cannot access local variable 'html' where it is not associated with a value

---

## Q 3.5: Extract all of the Yelp reviews for a Single Restaurant

So now that we have parsed a single page, and figured out a method to go from one page to the next we are ready to combine these two techniques and actually crawl through web pages! 

Using `requests`, programmatically retrieve __ALL__ of the reviews for a __single__ business (provided as a parameter). Just like the API was paginated, the HTML paginates its reviews (it would be a very long web page to show 300 reviews on a single page) and to get all the reviews you will need to parse and traverse the HTML. As input your function will receive a URL corresponding to a Yelp business. As output return a list of dictionaries (structured the same as question 3 containing the relevant information from the reviews.

Return reviews in the order that they are present on the page.

You will need to get the number of pages on the first request and generate the URL for subsequent pages automatically. Use the Yelp website to see how the URL changes for subsequent pages.

In [None]:
@mugrade.local_tests
async def extract_yelp_reviews(url, request_delay=0.5):
    """
    Retrieve ALL of the reviews for a single business on Yelp.

    Parameters:
        url (string): Yelp URL corresponding to the business of interest.

    Returns:
        reviews (list): list of dictionaries containing extracted review information
    """ 
    reviews, pages_number = await parse_yelp_page(url) 
    for iteration in range(2, pages_number+1):
        url_next = f"{url}?start={iteration}"
        try:
            review, _ = await parse_yelp_page(url_next)
            reviews.append(review)
            time.sleep(request_delay)
        except requests.RequestException as error:
            print(f"Error occured: {error}")
            break
    print(f"Total reviews retreived: {len(reviews)}")
    return reviews


Running local tests for function extract_yelp_reviews():


NameError: name 'reviews_number' is not defined