## Continuation of Lab 3 to complete review extraction

## Parameterization and Pagination

Before we can get any reviews on restaurants, we need to get the metadata on ALL of the restaurants in Pittsburgh. Notice above that while Yelp told us that there are ~14900, the response contained far fewer actual `Business` objects. This is due to pagination, a safeguard against returning __TOO__ much data in a single request (what would happen if there were 100,000 restaurants?). It can be used in conjunction with _rate limiting_ as a way to throttle and protect access to Yelp data.

> If an API has 1,000,000 records, but only returns 10 records per page and limits you to 5 requests per second... how long will it take to acquire ALL of the records contained in the API?
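
As a back-of-the-envelope check on the question above, the sketch below computes a lower bound on the download time. It assumes the rate limit is the only bottleneck (network latency and processing time are ignored), and the function name is illustrative only.

```python
import math

def min_download_seconds(total_records, page_size, requests_per_second):
    """Lower bound on download time when the rate limit is the only bottleneck."""
    # Each request returns at most one page, so round up to whole requests
    n_requests = math.ceil(total_records / page_size)
    return n_requests / requests_per_second

seconds = min_download_seconds(1_000_000, 10, 5)
print(seconds)         # 20000.0 seconds
print(seconds / 3600)  # roughly five and a half hours
```

Increasing the page size reduces the number of requests proportionally, which is why the page size matters as much as the rate limit.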

One of the ways that APIs are an improvement over plain web scraping is the ability to make __parameterized__ requests. Just as the Python functions you have been writing take arguments (or parameters) that let you customize their behavior (and output) without rewriting the function entirely, we can parameterize the queries we make to the Yelp API to filter the results it returns.
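
To make the idea of a parameterized request concrete, here is a minimal sketch of how a dictionary of parameters becomes the query string of a URL. It uses only the standard library and makes no network call. The parameter values are examples, and the `requests` library performs an equivalent encoding when given a `params` dictionary.

```python
from urllib.parse import urlencode

base_url = "https://api.yelp.com/v3/businesses/search"

# Example parameter values; changing them changes the results, not the code
params = {"term": "Indian food", "location": "Houston, TX", "offset": 20}

# urlencode escapes spaces and punctuation so the values are safe in a URL
query_string = urlencode(params)
full_url = base_url + "?" + query_string
print(full_url)
# https://api.yelp.com/v3/businesses/search?term=Indian+food&location=Houston%2C+TX&offset=20
```

Because the encoding is handled for you, the values can be passed as ordinary strings with spaces and commas.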

## Acquire all of the restaurants in Houston (on Yelp)

Again using the [API documentation](https://www.yelp.com/developers/documentation/v3/business_search) for the `search` endpoint, fill in the following function to retrieve all of the _Restaurants_ (using categories) for a given query. As before, use your `read_api_key()` function outside of the `all_restaurants()` stub to read the API key used for the requests. You will need to account for __pagination__ and __[rate limiting](https://www.yelp.com/developers/faq)__ to:

1. Retrieve all of the Business objects (# of business objects should equal `total` in the response). Paginate by querying 20 restaurants each request.
2. Pause slightly (at least 200 milliseconds) between subsequent requests so as to not overwhelm the API (and get blocked).  

As always with API access, make sure you follow all of the [API's policies](https://www.yelp.com/developers/api_terms) and use the API responsibly and respectfully.

__DO NOT MAKE TOO MANY REQUESTS TOO QUICKLY OR YOUR KEY MAY BE BLOCKED__

> Again, you can test your function with an individual neighborhood in Houston (I recommend Rice Village). Houston has a lot of restaurants, so downloading them all will take a long time.

Hint: 
- time.sleep(n) is the function that will implement the pause between calls to yelp_search
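
The two requirements above (paginate, and pause between requests) can be rehearsed offline before spending any real API calls. The sketch below is one possible shape for such a loop, not the required solution. `fake_search` is a hypothetical stand-in for the API that returns `(total, page)` just like a paginated endpoint would, and `fetch_all` is an illustrative name.

```python
import time

def fake_search(offset, limit, total=45):
    """Hypothetical stand-in for an API call: returns (total, one page of items)."""
    page = [{"id": i} for i in range(offset, min(offset + limit, total))]
    return total, page

def fetch_all(fetch_page, limit=20, pause=0.2):
    """Collect every item by advancing the offset one page at a time."""
    results = []
    offset = 0
    total = None  # unknown until the first response arrives
    while total is None or offset < total:
        total, page = fetch_page(offset, limit)
        if not page:  # stop if nothing further is returned
            break
        results.extend(page)
        offset += limit
        if offset < total:
            time.sleep(pause)  # pause between consecutive requests
    return results

items = fetch_all(fake_search, limit=20, pause=0.01)
print(len(items))  # 45
```

Passing the fetch function as an argument keeps the looping logic separate from the network call, so the same loop can be tested with a fake and then run against the real API.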


In [3]:
import requests
def read_api_key(file):
    # the context manager closes the file even if reading fails
    with open(file, 'r') as f:
        api_key = f.read().strip()
    return api_key

api_key = read_api_key('api_key.txt')


import json
def yelp_search(api_key, query, location, offset=0):
    """
    Make an authenticated request to the Yelp API.

    Args:
        api_key (string): Yelp API key
        query (string): Search term
        location (string): Location to search in, e.g. 'Houston, TX'
        offset (integer): Number of results to skip (used for pagination)

    Returns:
        total (integer): total number of businesses on Yelp corresponding to the query
        businesses (list): list of dicts representing each business
    """
    url = "https://api.yelp.com/v3/businesses/search"
    headers = {"Authorization": "Bearer %s" % api_key}
    # requests URL-encodes the parameters itself, so pass the raw strings
    url_params = {
        'term': query,
        'location': location,
        'limit': 20,
        'offset': offset
    }
    response = requests.request('GET', url, headers=headers, params=url_params)
    response.raise_for_status()  # fail loudly on HTTP errors instead of a confusing KeyError
    data = json.loads(response.content)
    num = data['total']
    objects = data['businesses']

    return (num, objects)


import time
def all_restaurants(api_key, query, location):
    """
    Retrieve ALL the restaurants on Yelp for a given query.

    Args:
        api_key (string): Yelp API key
        query (string): Search term
        location (string): Location to search in

    Returns:
        results (list): list of dicts representing each business
    """
    
    # Write solution here (about 10-12 lines)
    all_rest = []
    n = 0
    num = float('Inf')
    while n < num:
        (num, objects) = yelp_search(api_key, query, location, offset=n)
        if not objects:
            # stop on an empty page so the loop cannot run forever
            break
        all_rest.extend(objects)
        n += 20
        time.sleep(0.200)  # pause between requests (must be inside the loop)
    return all_rest
        
        

## Test all_restaurants function

In [10]:
data = all_restaurants(api_key, 'Indian','Houston, TX')

Now that we have the metadata on all of the restaurants in Houston (or at least the ones listed on Yelp), we can retrieve the reviews and ratings. The Yelp API gives us aggregate information on ratings but it doesn't give us the review text or individual users' ratings for a restaurant. For that we need to turn to web scraping, but to find out what pages to scrape we first need to parse our JSON from the API to extract the URLs of the restaurants.

In general, it is a best practice to separate the act of __downloading__ data from __parsing__ data. This ensures that your data processing pipeline is modular and extensible (and autogradable ;). This decoupling also addresses the fact that downloading is expensive while parsing is cheap (in terms of computation and time): you download once and can then parse as many times as you like.
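
One common way to realize this separation is to write the downloaded data to disk once and have the parsing step read from that file. The sketch below illustrates the idea with a tiny hand-written sample in place of real API output. The helper names and the file name `businesses.json` are assumptions made for illustration.

```python
import json
import os
import tempfile

def save_businesses(businesses, path):
    """Write the downloaded list of business dicts to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(businesses, f)

def load_businesses(path):
    """Read previously downloaded businesses back without touching the API."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Tiny sample standing in for the output of the download step
downloaded = [{"name": "Four Barrel Coffee", "rating": 4,
               "url": "https://www.yelp.com/biz/four-barrel-coffee-san-francisco"}]

path = os.path.join(tempfile.mkdtemp(), "businesses.json")
save_businesses(downloaded, path)
reloaded = load_businesses(path)
print(reloaded == downloaded)  # True
```

With the data cached this way, a bug in the parsing code can be fixed and rerun without making any further requests.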

## Parse the API Responses and Extract the URLs

Because we want to separate the __downloading__ from the __parsing__, fill in the following function to parse the URLs pointing to the restaurants on `yelp.com`. As input your function should expect a string of [properly formatted JSON](http://www.json.org/) (which is similar to __BUT__ not the same as a Python dictionary) and as output should return a Python list of strings. The input JSON will be structured as follows (extracted from the [sample](https://www.yelp.com/developers/documentation/v3/business_search) on the Yelp API page):

```json
 [
    {
      "rating": 4,
      "price": "$",
      "phone": "+14152520800",
      "id": "four-barrel-coffee-san-francisco",
      "is_closed": false,
      "categories": [
        {
          "alias": "coffee",
          "title": "Coffee & Tea"
        }
      ],
      "review_count": 1738,
      "name": "Four Barrel Coffee",
      "url": "https://www.yelp.com/biz/four-barrel-coffee-san-francisco",
      "coordinates": {
        "latitude": 37.7670169511878,
        "longitude": -122.42184275
      },
      "image_url": "http://s3-media2.fl.yelpcdn.com/bphoto/MmgtASP3l_t4tPCL1iAsCg/o.jpg",
      "location": {
        "city": "San Francisco",
        "country": "US",
        "address2": "",
        "address3": "",
        "state": "CA",
        "address1": "375 Valencia St",
        "zip_code": "94103"
      },
      "distance": 1604.23,
      "transactions": ["pickup", "delivery"]
    }
  ]
```
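
The distinction between JSON text and Python objects can be seen in a short sketch. The record here is a shortened version of the sample above, and the `null` price is added purely for illustration.

```python
import json

# JSON is text: note the lowercase false/null and the double quotes
json_text = '[{"name": "Four Barrel Coffee", "is_closed": false, "price": null}]'

records = json.loads(json_text)      # text -> Python objects
print(type(json_text).__name__)      # str
print(type(records).__name__)        # list
print(records[0]["is_closed"])       # False (a Python bool)
print(records[0]["price"])           # None

round_trip = json.dumps(records[0])  # Python objects -> text
print(round_trip)
```

Indexing such as `records[0]["name"]` only works after `json.loads`; applied to the raw string it would index characters instead.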

In [84]:
import json
import pandas
def parse_api_response(data):
    """
    Parse Yelp API results to extract restaurant URLs.
    
    Args:
        data (string or list): String of properly formatted JSON, or the
            already-parsed list of business dicts.

    Returns:
        Pandas dataframe with name, rating, price, review_count and URL in the input data
    """
    # Write solution here (about 12 lines of code)
    
    # Accept either raw JSON text or the list returned by all_restaurants()
    if isinstance(data, str):
        data = json.loads(data)
    columns = ['name', 'rating', 'price', 'review_count', 'url']
    rows = [
        # 'price' is missing for some businesses, so fall back to 'n/a'
        [obj['name'], obj['rating'], obj.get('price', 'n/a'), obj['review_count'], obj['url']]
        for obj in data
    ]
    return pandas.DataFrame(rows, columns=columns).set_index('name')
     


## Test parse_api_response function

In [85]:
df= parse_api_response(data)
df

| name | rating | price | review_count | url |
|---|---|---|---|---|
| Surya India | 4.0 | $$ | 196 | https://www.yelp.com/biz/surya-india-houston?a... |
| Tarka Indian Kitchen | 4.0 | $ | 350 | https://www.yelp.com/biz/tarka-indian-kitchen-... |
| Govinda's Vegetarian Cuisine | 4.5 | $$ | 244 | https://www.yelp.com/biz/govindas-vegetarian-c... |
| Desi Kitchen | 4.5 | $$ | 19 | https://www.yelp.com/biz/desi-kitchen-houston-... |
| Indika | 4.0 | $$$ | 408 | https://www.yelp.com/biz/indika-houston?adjust... |
| Sai Bhog | 4.5 | | 22 | https://www.yelp.com/biz/sai-bhog-houston?adju... |
| India's Restaurant | 3.5 | $$ | 377 | https://www.yelp.com/biz/indias-restaurant-hou... |
| Cowboys & Indians Tex-In Kitchen | 4.0 | $$ | 221 | https://www.yelp.com/biz/cowboys-and-indians-t... |
| Pondicheri | 3.5 | $$ | 940 | https://www.yelp.com/biz/pondicheri-houston?ad... |
| Mayuri Express | 4.0 | $$ | 36 | https://www.yelp.com/biz/mayuri-express-housto... |


## Sort restaurants by rating and price

In [86]:
# sort df descending in rating, and ascending in price
df.sort_values(['rating', 'price'], ascending=[False, True])

| name | rating | price | review_count | url |
|---|---|---|---|---|
| Dilpasand Mithai & Snacks | 5.0 | $ | 2 | https://www.yelp.com/biz/dilpasand-mithai-and-... |
| India in a Box | 5.0 | $$ | 22 | https://www.yelp.com/biz/india-in-a-box-pearla... |
| Deli Deluxe | 4.5 | $ | 64 | https://www.yelp.com/biz/deli-deluxe-houston-2... |
| Flying Idlis | 4.5 | $ | 133 | https://www.yelp.com/biz/flying-idlis-houston-... |
| Desi Bun Kabab & Grill Cafe | 4.5 | $ | 94 | https://www.yelp.com/biz/desi-bun-kabab-and-gr... |
| Khan BBQ & Grill | 4.5 | $ | 108 | https://www.yelp.com/biz/khan-bbq-and-grill-ho... |
| Masala Munchies | 4.5 | $ | 23 | https://www.yelp.com/biz/masala-munchies-houst... |
| Tacos 'N' Frankies | 4.5 | $ | 60 | https://www.yelp.com/biz/tacos-n-frankies-hous... |
| Egg N More | 4.5 | $ | 27 | https://www.yelp.com/biz/egg-n-more-houston?ad... |
| Doshi House Café | 4.5 | $ | 337 | https://www.yelp.com/biz/doshi-house-caf%C3%A9... |


As we can see, JSON is quite trivial to parse (which is not the case with HTML, as we will see in a second) and to work with programmatically. This is why it is one of the most ubiquitous data serialization formats (especially for RESTful APIs), and it is a huge benefit of working with a well-defined API if one exists. But APIs do not always exist or provide the data we might need, and as a last resort we can always scrape web pages...

# Optional part of the lab

## Working with Web Pages (and HTML)

Think of APIs as similar to accessing an application's database itself (something you can interactively query and receive structured data back from). But the results are usually in a somewhat raw form with no formatting or visual representation (like the results from a database query). This is a benefit _AND_ a drawback depending on the end use case. For data science and _programmatic_ analysis this raw form is ideal, but for an end user requesting information from a _graphical interface_ (like a web browser) it is far from ideal, since it takes some cognitive overhead to interpret the raw information. Conversely, HTML is easy for a human to interpret visually, but to perform any programmatic analysis we first need to parse the HTML into a more structured form.

> As a general rule of thumb, if the data you need can be accessed or retrieved in a structured form (either from a bulk download or an API), prefer that first. But if the data you want (and need) is not available that way, as in our case, we need to resort to alternative (messier) means.

Going back to the "hello world" example of question 1 with the NYT, we will do something similar to retrieve the HTML of the Yelp site itself (rather than going through the API) programmatically as text. 

## Extract reviews from a Yelp restaurant page

Using `BeautifulSoup`, parse the HTML of a Yelp restaurant page to extract information about the restaurant as well as all the reviews in a structured form. This requires you to follow the URL to the next page of reviews until there are no more pages. Fill in the following function stub to parse a restaurant page and return a dictionary:
* information: name, address, telephone, cuisine, aggregate rating
* the reviews as a structured Python dictionary


For each review be sure to structure your Python dictionary as follows. The order of the keys doesn't matter, only the keys and the data type of the values:

```python
{
    'review_id': str,
    'user_id': str,
    'rating': float,
    'date': str,  # formatted as 'yyyy-mm-dd'
    'text': str
}

# Example
{
    'review_id': '12345',
    'user_id': '6789',
    'rating': 4.7,
    'date': '2016-01-23',
    'text': "Wonderful!"
}
```
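
Scraped values usually arrive as strings, so they need converting to match the structure above. The following is a minimal sketch of that conversion. The raw date format (`1/23/2016`) and the helper names are assumptions for illustration, since the format on the actual page may differ.

```python
from datetime import datetime

def normalize_date(raw, input_format="%m/%d/%Y"):
    """Convert a scraped date string to 'yyyy-mm-dd' (the input format is an assumption)."""
    return datetime.strptime(raw.strip(), input_format).strftime("%Y-%m-%d")

def make_review(review_id, user_id, raw_rating, raw_date, text):
    """Assemble one review dict with the required keys and value types."""
    return {
        "review_id": str(review_id),
        "user_id": str(user_id),
        "rating": float(raw_rating),       # e.g. the string '4.0' becomes 4.0
        "date": normalize_date(raw_date),
        "text": text.strip(),
    }

review = make_review("12345", "6789", "4.0", " 1/23/2016 ", "Wonderful!\n")
print(review["date"])    # 2016-01-23
print(review["rating"])  # 4.0
```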

> There can be issues with Beautiful Soup using various parsers. For maximum compatibility (and fewest errors), initialize the library with the default (Python standard library) parser: `BeautifulSoup(markup, "html.parser")`
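
As a warm-up, the sketch below shows the basic Beautiful Soup workflow on a small hand-written HTML snippet. The tag structure, class names, and `data-review-id` attribute are hypothetical and will not match the real Yelp markup, which you will need to inspect in your browser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; real pages are structured differently
html = """
<html><body>
  <h1 class="biz-name">Example Cafe</h1>
  <div class="review" data-review-id="12345">
    <p class="review-text">Wonderful!</p>
  </div>
  <div class="review" data-review-id="67890">
    <p class="review-text">Would visit again.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match; find_all() returns every match
name = soup.find("h1", class_="biz-name").get_text(strip=True)
reviews = [
    {
        "review_id": div["data-review-id"],  # attributes are read like dict keys
        "text": div.find("p", class_="review-text").get_text(strip=True),
    }
    for div in soup.find_all("div", class_="review")
]

print(name)          # Example Cafe
print(len(reviews))  # 2
```

The same pattern (locate a container per review, then pull fields out of each container) carries over to the real page once you have identified the actual tags and attributes.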

In [None]:
def extract_reviews(url):
    """
    Parse the reviews of a restaurant.
    
    Args:
        url (string): url corresponding to a Yelp restaurant

    Returns:
        dictionary:
            keys 1:5: name, address, telephone, aggregate rating, cuisine
            key 6: list of dictionaries corresponding to the extracted review information
            
    """
    
    # Write solution here (20 lines of code)
    pass

## Test extract_reviews on any restaurant you have gathered in the previous part

In [None]:
#extract_reviews(your_url_here)