# Collecting Yelp reviews for Houston restaurants
- Continuation of Yelp Scraper classwork to complete Yelp review extraction

## Parameterization and Pagination

Before we can get any reviews, we want the metadata for ALL of the Indian restaurants in Houston. According to Yelp there are ~264 of them, but our function yelp_search from yelp_classwork.ipynb returns only 20. This is due to pagination, a safeguard against returning __TOO__ much data in a single request (what would happen if there were 100,000 restaurants?). Pagination can be used in conjunction with _rate limiting_ to throttle and protect access to Yelp data.

> If an API has 1,000,000 records, but only returns 10 records per page and limits you to 5 requests per second... how long will it take to acquire ALL of the records contained in the API?

Answer: 10 records per request × 5 requests per second = 50 records per second, so 1,000,000 records take 20,000 seconds (about 5.6 hours).
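
The arithmetic behind that answer can be checked with a few lines of Python. This is only a sketch: the function name `seconds_to_fetch_all` is made up for illustration, and it assumes requests are issued at exactly the rate limit with no other overhead.

```python
import math

def seconds_to_fetch_all(total_records, records_per_page, requests_per_second):
    """Lower bound on the time needed to page through an entire API."""
    pages = math.ceil(total_records / records_per_page)  # number of requests needed
    return pages / requests_per_second

seconds = seconds_to_fetch_all(1_000_000, 10, 5)
print(seconds)          # 20000.0 seconds
print(seconds / 3600)   # roughly 5.6 hours
```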

One of the ways that APIs are an improvement over plain web scraping is the ability to make __parameterized__ requests. Just as the Python functions you have been writing take arguments (or parameters) that let you customize their behavior (and output) without rewriting the function, we can parameterize the queries we make to the Yelp API to filter the results it returns.
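
To make the analogy concrete, here is a small sketch of how query parameters turn into a request URL. It uses only the standard library's `urllib.parse.urlencode` and never contacts Yelp. The helper name `build_search_url` is hypothetical; the endpoint and the `term`, `location`, `limit` and `offset` parameters are the ones used with the search endpoint in this notebook.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.yelp.com/v3/businesses/search"

def build_search_url(term, location, limit=20, offset=0):
    """Return the search URL for one page of results (no request is sent)."""
    params = {"term": term, "location": location, "limit": limit, "offset": offset}
    # urlencode escapes spaces and punctuation, so raw strings can be passed in
    return SEARCH_URL + "?" + urlencode(params)

print(build_search_url("Indian", "Houston, TX"))
print(build_search_url("Indian", "Houston, TX", offset=20))
```

Changing only the `offset` argument asks for a different page of the same query, which is the basis for the pagination described above.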

## Acquire all of the restaurants on Yelp for a specific query/location
Again using the API documentation for the search endpoint, fill in the following function (get_all_restaurants) to retrieve all of the restaurants matching a specific query term in a given location. As before, call your read_api_key() function outside of the get_all_restaurants() stub to read the API key used for the requests. You will need to account for pagination and rate limiting:

- Given a search term (e.g., 'Indian'), and a location (e.g., 'Houston') and your api_key
    - Retrieve all of the restaurants that match the query and location. 
    - Paginate by querying 20 restaurants at each request.
    - Pause slightly (at least 200 milliseconds) between subsequent requests so as to not overwhelm the API (and get blocked).

As always with API access, make sure you follow all of the API's policies and use the API responsibly and respectfully.

**DO NOT MAKE TOO MANY REQUESTS TOO QUICKLY OR YOUR KEY MAY BE BLOCKED**

Again, you can test your function with Indian restaurants in Houston, since there are only about 264 of them on Yelp. If you use a more popular cuisine, select a more restricted geographical location (e.g., 'Rice Village, Houston') to keep the number below 300.

Hint: *time.sleep(n) is the function that will implement the pause between calls to yelp_search*
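
One possible shape for the paging loop is sketched below. To keep it runnable without an API key, it pages through a hypothetical in-memory `fake_search` function rather than `yelp_search`; `fetch_all`, `fake_search` and the `pause` argument are illustrative names, not part of the assignment.

```python
import time

def fetch_all(search_page, page_size=20, pause=0.2):
    """Collect every item from a paginated source.

    search_page(offset) must return (total, items_on_that_page).
    """
    total, items = search_page(0)        # the first page also reports the total
    results = list(items)
    for offset in range(page_size, total, page_size):
        time.sleep(pause)                # throttle to stay under the rate limit
        results.extend(search_page(offset)[1])
    return results

# Stand-in for an API holding 45 records, served 20 at a time
RECORDS = ["restaurant-%d" % i for i in range(45)]

def fake_search(offset):
    return len(RECORDS), RECORDS[offset:offset + 20]

everything = fetch_all(fake_search, pause=0.01)
print(len(everything))   # 45
```

Reusing the first response for both the total and the first page avoids spending one request just to learn the count.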

## Preliminary step 1: Read your api_key

In [15]:
def read_api_key(file):
    # the context manager closes the file even if reading fails
    with open(file, 'r') as f:
        return f.read().replace('\n', '')

api_key = read_api_key('api_key.txt')

## Preliminary step 2: Copy your solution to yelp_search from Lab 3 here

In [16]:
import requests
import json

def yelp_search(api_key, query, location, offset=0):
    """
    Make an authenticated request to the Yelp API.

    Args:
        api_key (string): Yelp API key
        query (string): Search term
        location (string): Location to search in
        offset (integer): index of the first business to return (for pagination)

    Returns:
        total (integer): total number of businesses on Yelp corresponding to the query
        businesses (list): list of dicts representing each business
    """

    # Write solution here (6-8 lines of code)
    # START YOUR CODE
    # requests URL-encodes the parameters itself, so pass the raw strings
    url_params = {
        'term': query,
        'location': location,
        'limit': 20,
        'offset': offset
    }
    headers = {'Authorization': 'Bearer %s' % api_key}
    url = "https://api.yelp.com/v3/businesses/search"
    response = requests.get(url, headers=headers, params=url_params)
    response.raise_for_status()  # fail loudly on HTTP errors (bad key, rate limit, ...)
    data = response.json()       # parse the body once

    return data["total"], data["businesses"]

    # END YOUR CODE

# Problem 1: construct the get_all_restaurants function (10 points)
- make a call to the API using your yelp_search function and get the first 20 restaurants (with offset = 0).
- make repeated calls to the API using yelp_search with the offset augmented by 20 each time, till you have covered all the restaurants.
- make sure you have a time.sleep() call for at least a second between successive calls to yelp_search
- assemble the results into a list of JSON objects, each object representing a restaurant


In [38]:
import time

def get_all_restaurants(api_key, query, location):
    """
    Retrieve ALL the restaurants on Yelp for a given query (search term) and location.

    Args:
        api_key: your Yelp api_key (for calling yelp_search)
        query (string): Search term
        location (string): location

    Returns:
        results (list): list of JSON objects each representing a restaurant
    """
   
    # Write solution here (about 10-12 lines)
    # START YOUR CODE

    # The first request returns the total count along with the first page
    num_business, businesses = yelp_search(api_key, query, location)
    result = list(businesses)
    for offset in range(20, num_business, 20):
        time.sleep(1)  # time.sleep takes seconds; pause between successive requests
        businesses = yelp_search(api_key, query, location, offset)[1]
        result.extend(businesses)
    print(num_business)
    return result
    # END YOUR CODE

# Problem 2: test the get_all_restaurants function (5 points)

In [39]:
results = get_all_restaurants(api_key, 'Indian', 'Houston, TX')  # returns a list of dictionaries,
# because the print on the next line expects a dictionary
print('Keys: ',results[0].keys())

# Write code for printing name, rating and review_count for items in results (1-2 lines of code)
# START YOUR CODE
print("---Output---")
for business in results:
    print("Name: %s, Rating: %.1f, review_count: %i" %(business['name'], business['rating'], business['review_count']))
# END YOUR CODE
# show the output below  

267
Keys:  dict_keys(['id', 'alias', 'name', 'image_url', 'is_closed', 'url', 'review_count', 'categories', 'rating', 'coordinates', 'transactions', 'price', 'location', 'phone', 'display_phone', 'distance'])
---Output---
Name: Surya India, Rating: 4.0, review_count: 318
Name: Aga's Restaurant & Catering, Rating: 4.5, review_count: 1742
Name: Musaafer, Rating: 4.0, review_count: 170
Name: Tarka Indian Kitchen, Rating: 4.5, review_count: 76
Name: Tarka Indian Kitchen, Rating: 4.0, review_count: 558
Name: Govinda's Vegetarian Cuisine, Rating: 4.5, review_count: 389
Name: Maharaja Bhog, Rating: 4.0, review_count: 678
Name: Kiran's, Rating: 4.0, review_count: 259
Name: Himalayan Taj, Rating: 4.5, review_count: 30
Name: Khyber North Indian Grill, Rating: 4.0, review_count: 331
Name: Pondicheri, Rating: 3.5, review_count: 1111
Name: Flying Idlis - Downtown, Rating: 5.0, review_count: 26
Name: Nirvana Indian Restaurant, Rating: 4.0, review_count: 285
Name: India's Restaurant, Rating: 3.5, rev

# Problem 3: Parse  results of get_all_restaurants to get meta_data (10 points)

Fill in the function extract_meta_data.
Because we want to separate the __downloading__ from the __parsing__, fill in the following function to parse the results returned by get_all_restaurants, including the URLs pointing to the restaurants on `yelp.com`. As input your function should expect a list of [properly formatted JSON](http://www.json.org/) objects (which are similar to __BUT__ not the same as Python dictionaries) and as output it should return a dataframe with the id, name, rating, price, review_count, display_address and url of each restaurant. The input JSON will be structured as follows (see https://www.yelp.com/developers/documentation/v3/business_search on the Yelp API page):

```python
[
    {'id': 'LWshhOwxnlPm5JWJZ95TRg',
      'alias': 'surya-india-houston',
      'name': 'Surya India',
      'image_url': 'https://s3-media3.fl.yelpcdn.com/bphoto/j0HI3Z2v7kdql1A58a0ugA/o.jpg',
      'is_closed': False,
      'url': 'https://www.yelp.com/biz/surya-india-houston?adjust_creative=xMb9__p6q0fOsegfkh_Yg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=xMb9__p6q0fOs-egfkh_Yg',
      'review_count': 267,
      'categories': [{'alias': 'indpak', 'title': 'Indian'}],
      'rating': 4.0,
      'coordinates': {'latitude': 29.7685441737493,
                      'longitude': -95.4102170839906},
      'transactions': ['pickup', 'delivery'],
      'price': '$$',
      'location': {'address1': '700 Durham Dr',
                   'address2': 'Ste 200',
                   'address3': '',
                   'city': 'Houston',
                   'zip_code': '77007',
                   'country': 'US',
                   'state': 'TX',
                   'display_address': ['700 Durham Dr', 'Ste 200', 'Houston, TX 77007']},
      'phone': '+17138646667',
      'display_phone': '(713) 864-6667',
      'distance': 4890.165402890303},
    
    {'id': 'hFC_CJ5N9x9Tfol5RBYbdA',
      'alias': 'tarka-indian-kitchen-houston',
      'name': 'Tarka Indian Kitchen',
      'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/iq6Cty7_fNjAHwvvDeGUBw/o.jpg',
      'is_closed': False,
      'url': 'https://www.yelp.com/biz/tarka-indian-kitchen-houston?adjust_creative=xMb9__p6q0fOs-egfkh_Yg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=xMb9__p6q0fOs-egfkh_Yg',
      'review_count': 479,
      'categories': [{'alias': 'indpak', 'title': 'Indian'}],
      'rating': 4.0,
      'coordinates': {'latitude': 29.80303, 'longitude': -95.41111},
      'transactions': ['pickup', 'delivery'],
      'price': '$',
      'location': {'address1': '721 W 19th St',
                   'address2': 'Ste 7',
                   'address3': '',
                   'city': 'Houston',
                   'zip_code': '77008',
                   'country': 'US',
                   'state': 'TX',
                   'display_address': ['721 W 19th St', 'Ste 7', 'Houston, TX 77008']},
      'phone': '+13468022096',
      'display_phone': '(346) 802-2096',
      'distance': 6643.782465168795}]
```
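
The sample above is printed the way Python displays a list of dicts (single quotes, `False`), which is why it is not strictly valid JSON. The short sketch below uses only the standard `json` module and a made-up three-field record to show the difference and how to convert between the two.

```python
import json

# Made-up record for illustration; real results carry many more fields
business = {"name": "Surya India", "is_closed": False, "price": None}

as_json = json.dumps(business)     # Python dict -> JSON text
print(as_json)                     # {"name": "Surya India", "is_closed": false, "price": null}

round_trip = json.loads(as_json)   # JSON text -> Python dict
print(round_trip == business)      # True

# The Python repr uses single quotes and False, so it cannot be parsed as JSON
try:
    json.loads(str(business))
except json.JSONDecodeError as err:
    print("not JSON:", err.msg)
```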

In [19]:
import pandas as pd

def extract_meta_data(results):
    """
    Parse results list from get_all_restaurants to extract restaurant meta-data.
    
    Args:
        results (list): list of properly formatted JSON objects (returned by get_all_restaurants)

    Returns:
        metadata_df: Pandas dataframe with id, name, rating, price, review_count, display_address and URL 
                     for each restaurant in results
    """
    
    # Write solution here (about 12 lines of code)
    # START YOUR CODE
    
    # Build one row per restaurant; this avoids shadowing the built-in id()
    rows = []
    for business in results:
        rows.append({
            'id': business['id'],
            'name': business['name'],
            'rating': business['rating'],
            'price': business.get('price'),  # None when Yelp lists no price tier
            'review_count': business['review_count'],
            'display_address': " ".join(business['location']['display_address']),
            'url': business['url'],
        })
    columns = ['id', 'name', 'rating', 'price', 'review_count', 'display_address', 'url']
    data = pd.DataFrame(rows, columns=columns)

    return data
    
    # END YOUR CODE

# Problem 4: Test extract_meta_data (5 points)

In [37]:
metadata_df = extract_meta_data(results)
print(metadata_df.shape)
# write code to sort metadata_df first by ratings (descending order) and then by price (ascending order) (1 line)
# START YOUR CODE
# keep the sorted frame as the last expression so the notebook displays it
metadata_df.sort_values(by=['rating', 'price'], ascending=[False, True], na_position='last').head(10)
# END YOUR CODE
# make sure you display the top 10 rows of the sorted dataframe below

(267, 7)

# Scraping reviews
The function extract_meta_data extracts ['id',  'name',  'rating',  'price', 'review_count', 'display_address','url']  for all of the Indian restaurants in Houston (or at least the ones listed on Yelp), but does not provide  individual user reviews and ratings. There are two approaches to getting that information.

- Approach 1: Use the Yelp API with url = "https://api.yelp.com/v3/businesses/" + business_id + "/reviews" and use the same approach as we did for yelp_search. This approach only yields three reviews for each restaurant, and no matter how many times you call the function, you get the same three reviews. I have implemented this approach at the very end of this notebook.

- Approach 2: Use web-scraping (like the NYT article example in Lab 3)
  - Go to the URL of the restaurant's Yelp page (which the extract_meta_data function retrieves).
  - Retrieve the page and all 20 reviews on it, then go get the next 20 by adding '&start=20' to the URL (then 40, ...)
  - And so on, until there are no more reviews to retrieve. 
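
The offsets in Approach 2 follow directly from the review count. As a sketch (the helper name `review_page_urls` and the example URL are placeholders, and the page size of 20 is taken from the description above):

```python
def review_page_urls(url, review_count, page_size=20):
    """List the page URLs needed to cover review_count reviews."""
    # The business URLs returned by the API already carry a query string,
    # so the offset is appended with '&'; otherwise it must start with '?'.
    separator = "&" if "?" in url else "?"
    return [url + separator + "start=" + str(offset)
            for offset in range(0, review_count, page_size)]

pages = review_page_urls("https://www.yelp.com/biz/example-restaurant?utm_source=demo", 45)
for page in pages:
    print(page)
```

For 45 reviews this yields three pages, with offsets 0, 20 and 40.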

# Problem 5: Get reviews given restaurant URL and offset (10 points)

Using `BeautifulSoup`, parse the HTML of a Yelp restaurant page to extract the 20 reviews starting at a given offset on that page.

Return a DataFrame with 20 rows and columns: author, datePublished, description, reviewRating. 
```python
	author	datePublished	description	reviewRating
0	Farhan I.	2019-09-04	Best Indian restaurant near downtown (includin...	5
1	Kathy B.	2019-08-31	I walked inside with my husband and son, this ...	1
2	Tina N.	2019-06-25	I loveee spicy food. To the point where one da...	3
3	Spat B.	2019-06-24	A good friend of mine had never had Indian foo...	1
4	Jared F.	2019-09-14	The first time that I've ever gone out to eat ...	1
```

Beautiful Soup can behave differently depending on the parser it uses. For maximum compatibility (and fewest errors) use `BeautifulSoup(response.content, "html.parser")`, where `response` is the result of a GET request on the restaurant URL.
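
The reviews are read from JSON embedded in `<script type="application/ld+json">` tags. To illustrate that idea without any third-party dependency, the sketch below swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` and runs on a tiny hand-written page. The class name `JsonLdExtractor` and the sample HTML are invented for this example, and a real Yelp page may be structured differently.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.objects = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.objects.append(json.loads("".join(self._chunks)))
            self._in_json_ld = False

SAMPLE_HTML = """
<html><head>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Example Cafe",
 "review": [
   {"author": "A. Reader", "datePublished": "2021-01-01",
    "description": "Tasty.", "reviewRating": {"ratingValue": 5}},
   {"author": "B. Reader", "datePublished": "2021-01-02",
    "description": "Fine.", "reviewRating": {"ratingValue": 3}}
 ]}
</script>
</head><body></body></html>
"""

parser = JsonLdExtractor()
parser.feed(SAMPLE_HTML)
parser.close()

# Keep only objects that carry a 'review' list, then flatten the nested rating
reviews = [r for obj in parser.objects if "review" in obj for r in obj["review"]]
ratings = [r["reviewRating"]["ratingValue"] for r in reviews]
print(ratings)   # [5, 3]
```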

In [21]:
# return a pandas DataFrame of all 20 reviews in a page with a given URL and with offset start
from bs4 import BeautifulSoup
def get_reviews_df(url,start):
    """
    Use requests.get to obtain response from specified URL with offset start `url+'&start='+str(start)`
    Parse response.content with Beautiful Soup using html.parser into an object called `soup`
    Get all tags in `soup` named `script`
    For each of these tags: if the object is a JSON string, load it into a json_object using `json.loads()`
    Extract reviews from json_object from key `review`
    Convert json_object['review'] into a Pandas dataframe

    Args:
        url: URL to restaurant page on Yelp 
        start: offset number (gets 20 reviews from that offset)

    Returns:
        review_df: DataFrame with author, datePublished, description, reviewRating (20 rows)
    
    """
    # 8 lines of code expected
    # START YOUR CODE
    response = requests.get(url + "&start=" + str(start))
    soup = BeautifulSoup(response.content, 'html.parser')
    columns = ['author', 'datePublished', 'description', 'reviewRating']
    frames = []
    for script in soup.select('script[type="application/ld+json"]'):
        if not script.string:
            continue  # skip empty script tags
        jsobj = json.loads(script.string)
        # only objects that carry a 'review' list are of interest
        if isinstance(jsobj, dict) and 'review' in jsobj:
            frames.append(pd.DataFrame(jsobj['review']))
    if not frames:
        return pd.DataFrame(columns=columns)
    # DataFrame.append was removed in pandas 2.0, so combine with pd.concat
    data1 = pd.concat(frames, ignore_index=True)
    data1['reviewRating'] = data1['reviewRating'].apply(lambda r: r['ratingValue'])
    return data1

    # END YOUR CODE

# Problem 6: Test get_reviews_df (5 points)

In [42]:
review_df = get_reviews_df(metadata_df.iloc[1].url,0)
# check that the length of review_df is 20 (1 line of code)
len(review_df)
# START YOUR CODE
review_df.head(10)
# END YOUR CODE
# show the first 10 rows of your dataframe below.

Unnamed: 0,author,datePublished,description,reviewRating
0,Typhani W.,2021-02-15,Dude! This place was so good and so clean! My ...,5
1,Dale B.,2021-01-02,Brought the family here to try some new cuisin...,5
2,Monifa M.,2020-10-24,"The real deal, Authentic Indian food!\nI order...",5
3,Christien W.,2021-03-09,Aga&apos;s is definitely the best Indo-Pak res...,5
4,Justine H.,2021-03-05,Better than Himalaya and delivers to my doorst...,5
5,Allyson C.,2020-12-28,Our friends moved to Spring a few months ago. ...,5
6,Jennifer V.,2020-12-01,I am not very proficient in Indian cuisine so ...,3
7,Cynthia J.,2021-01-13,I was in the mood for Indian food today. I kno...,5
8,Samir M.,2021-01-16,New to Houston and overwhelmed by this spot. B...,5
9,Rabab R.,2021-02-24,"As a south Asian, I&apos;m always on the looko...",4


# Problem 7: Get all reviews of a specific restaurant (10 points)

In [43]:
def get_all_reviews_df(metadata_df,index):
    """
    Make repeated calls to get_reviews_df with offsets 0, 20, 40, ... till the total number of reviews
    Remember to pause between calls to get_reviews_df with time.sleep
    Concatenate the dataframes returned in each call to form a composite dataframe all_reviews_df
    Arguments: 
        metadata_df: meta data on restaurants extracted before
        index: row of metadata_df
    Returns:
        all_reviews_df: all the reviews of that restaurant in row index of metadata_df
    """

    url = metadata_df.iloc[index]['url']
    # use the review count of the requested row, not of row 0
    nreviews = int(metadata_df.iloc[index]['review_count'])
    # START YOUR CODE about 6 lines of code expected
    frames = []
    for offset in range(0, nreviews, 20):
        time.sleep(1)  # time.sleep takes seconds; pause between page requests
        frames.append(get_reviews_df(url, offset))
    # concatenate once at the end instead of growing a DataFrame in the loop
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    # END YOUR CODE


# Problem 8: Test get_all_reviews_df (5 points)

In [47]:
reviews = get_all_reviews_df(metadata_df,1)
# sort reviews in descending order of ratings (1 line of code)
# START YOUR CODE
reviews.head(40)
# END YOUR CODE
# show the first 10 rows of your sorted dataframe below

Unnamed: 0,author,datePublished,description,reviewRating
0,Typhani W.,2021-02-15,Dude! This place was so good and so clean! My ...,5
1,Dale B.,2021-01-02,Brought the family here to try some new cuisin...,5
2,Monifa M.,2020-10-24,"The real deal, Authentic Indian food!\nI order...",5
3,Christien W.,2021-03-09,Aga&apos;s is definitely the best Indo-Pak res...,5
4,Justine H.,2021-03-05,Better than Himalaya and delivers to my doorst...,5
5,Allyson C.,2020-12-28,Our friends moved to Spring a few months ago. ...,5
6,Jennifer V.,2020-12-01,I am not very proficient in Indian cuisine so ...,3
7,Cynthia J.,2021-01-13,I was in the mood for Indian food today. I kno...,5
8,Samir M.,2021-01-16,New to Houston and overwhelmed by this spot. B...,5
9,Rabab R.,2021-02-24,"As a south Asian, I&apos;m always on the looko...",4
