## Collecting Yelp reviews for Houston restaurants

Welcome to the lab on web scraping. While many people might view working with data (including scraping, parsing, storing, etc.) as a necessary evil on the way to the "fun" stuff (i.e. modeling), I think that, presented in the right way, this munging can be quite empowering. Imagine never having to ask those _what if_ questions about whether data exists or is accessible, because you can get it yourself!

By the end of this lab, you should be able to look at the wonderful World Wide Web without fear, comforted by the fact that anything you can see with your eyes, a computer can see with its eyes...
 
## Objectives

More concretely, this lab covers:

* HTTP Requests (and lifecycle)
* RESTful APIs
    * Authentication (OAuth)
   
(This lab is based on the Practical Data Science course at CMU, designed by Z Kolter.)

## Working with APIs

Since everyone loves food (presumably), the ultimate end goal of this assignment will be to acquire the data to answer some questions and hypotheses about the restaurant scene in Houston (which we will get to later). We will download __both__ the metadata on restaurants in Houston from the Yelp API and, using this metadata, the comments/reviews and ratings that users have left for those restaurants.

But first things first, let's do the "hello world" of making web requests with Python to get a sense for how to programmatically access web pages: an (unauthenticated) HTTP GET to download a web page.
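
To make the request/response lifecycle concrete without depending on an outside website, here is a minimal sketch that uses only Python's standard library (not `requests`, which the exercise below asks for). It starts a throwaway server on your own machine, sends it one GET, and returns the three parts of a response that matter in this lab: the status code, a header, and the body. The names `HelloHandler` and `fetch_from_local_server` are made up for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><p>hello world</p></body></html>"
        self.send_response(200)                      # status line
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()                           # blank line ends the headers
        self.wfile.write(body)                       # response body

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        # An empty ProxyHandler makes sure the request goes straight to localhost.
        opener = build_opener(ProxyHandler({}))
        with opener.open(url) as response:
            status = response.status
            content_type = response.headers["Content-Type"]
            html = response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return status, content_type, html


status, content_type, html = fetch_from_local_server()
print(status, content_type)
print(html)
```

The same three pieces (status, headers, body) are what `requests` exposes as `status_code`, `headers`, and `text` on its response object.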

## Basic HTTP Requests

Fill in the function to use `requests` to download and return the raw HTML content of the URL passed in as an argument. As an example, try the following NYT article: [https://www.nytimes.com/2018/04/11/technology/personaltech/i-downloaded-the-information-that-facebook-has-on-me-yikes.html](https://www.nytimes.com/2018/04/11/technology/personaltech/i-downloaded-the-information-that-facebook-has-on-me-yikes.html)

> Your function should return a tuple of: (`<status_code>`, `<raw_html>`)

```python
>>> facebook_article = retrieve_html('https://www.nytimes.com/2018/04/11/technology/personaltech/i-downloaded-the-information-that-facebook-has-on-me-yikes.html')
>>> print(facebook_article)
(200, u'<!DOCTYPE html>\n<!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js section-magazine...')
```

In [17]:
import requests

def retrieve_html(url):
    """
    Return the raw HTML at the specified URL.

    Args:
        url (string): the URL of the page to download

    Returns:
        status_code (integer): the HTTP status code of the response
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """
    
    # Write solution here (2 lines of code expected)
    content = requests.get(url)
    return (content.status_code, content.text)

## Test retrieve_html function

In [22]:
url = 'https://www.nytimes.com/2018/04/11/technology/personaltech/i-downloaded-the-information-that-facebook-has-on-me-yikes.html'
status, nyt_article = retrieve_html(url)
print(status)

200


## How to extract articles using BeautifulSoup

Using `BeautifulSoup`, parse the HTML of the retrieved URL to extract the article.
Fill in the following function stub to get the article text out. You will have to look at the structure of the HTML returned by the NYT to determine which elements to extract and how to piece together the items of interest.

- to convert the HTML string `article` into a soup object: use `BeautifulSoup(article,"lxml")`
- to find all paragraph tags in a soup object: use `soup.find_all("p")`
- to extract text from a paragraph object: use `p.getText()`
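
Before pointing these three calls at a full NYT page, it can help to try them on a snippet small enough to read in full. The sketch below does that on made-up HTML; `SAMPLE_HTML` and `article_paragraphs` are hypothetical names, and it uses Python's built-in `"html.parser"` instead of `"lxml"` only so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up page: one paragraph of navigation "chrome" and a two-paragraph article.
SAMPLE_HTML = """
<html><body>
  <nav><p>Sign in</p></nav>
  <article>
    <p>First paragraph.</p>
    <p>Second <em>paragraph</em>.</p>
  </article>
</body></html>
"""


def article_paragraphs(html):
    """Return the text of every <p> inside the first <article> element."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    if article is None:  # page has no <article> element
        return []
    # getText() flattens nested tags such as <em> into plain text.
    return [p.getText() for p in article.find_all("p")]


print(article_paragraphs(SAMPLE_HTML))
# ['First paragraph.', 'Second paragraph.']
```

Restricting the search to the `<article>` element is what keeps the "Sign in" paragraph out of the result; on a real page you would find the right container by inspecting the HTML.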



In [70]:
from bs4 import BeautifulSoup


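def extract_article(article):
    """
    Input: html_string
    Output: the article text, joined from the paragraphs in the parsed html_string
    """
    # 5 lines of code expected.
    soup = BeautifulSoup(article, "lxml")
    # Only look at <p> tags inside the <article> element.
    paragraphs = soup.find('article').find_all("p")
    text = [para.getText() for para in paragraphs]
    # The slice drops the first 4 and last 5 paragraphs; these offsets were
    # chosen for this particular page and will not generalize to other pages.
    return ' '.join(text[4:-5])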


## Test extract_article function

In [71]:
extract_article(nyt_article)

'When I downloaded a copy of my Facebook data last week, I didn’t expect to see much. My profile is sparse, I rarely post anything on the site, and I seldom click on ads. (I’m what some call a Facebook “lurker.”) But when I opened my file, it was like opening Pandora’s box. With a few clicks, I learned that about 500 advertisers — many that I had never heard of, like Bad Dad, a motorcycle parts store, and Space Jesus, an electronica band — had my contact information, which could include my email address, phone number and full name. Facebook also had my entire phone book, including the number to ring my apartment buzzer. The social network had even kept a permanent record of the roughly 100 people I had deleted from my friends list over the last 14 years, including my exes. There was so much that Facebook knew about me — more than I wanted to know. But after looking at the totality of what the Silicon Valley company had obtained about yours truly, I decided to try to better understand h

## Moving to APIs

Now, while this example might have been fun, we haven't yet done anything more than we could with a web browser. To really see the power of programmatically making web requests, we will need to interact with an API. For the rest of this homework we will be working with the [Yelp API](https://www.yelp.com/developers/documentation/v3/get_started) and Yelp data (for an extensive data dump see their [Academic Dataset Challenge](https://www.yelp.com/dataset_challenge)). The reasons for using the Yelp API are threefold:

1. Incredibly rich dataset that combines:
    * entity data (users and businesses)
    * preferences (i.e. ratings)
    * geographic data (business location and check-ins)
    * temporal data
    * text in the form of reviews
    * and even images.
2. Well [documented API](https://www.yelp.com/developers/documentation/v3/get_started) with thorough examples.
3. Extensive data coverage so that you can find data that you know personally (from your home town/city or account). This will help with understanding and interpreting your results.

## Authentication

To access the Yelp API, however, we will need to go through a few more steps than we did with the first NYT example. Most large web-scale companies use a combination of authentication and rate limiting to control access to their data and to ensure that everyone using it abides by their terms of use. The first step (even before we make any request) is to set up a Yelp account, if you do not have one, and get API credentials.
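
Rate limits are enforced by the server, but a polite client can stay under them by pacing its own requests. A minimal sketch of client-side pacing is below. The `Throttle` class is a made-up helper, and any interval you pass it is your own choice, not a value taken from Yelp's documentation.

```python
import time


class Throttle:
    """Make sure consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a scraping loop you would create one `Throttle` (for example `Throttle(0.5)`) and call `wait()` immediately before each request; the first call returns right away and later calls pause only as long as needed.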

## Yelp API Access

1. [Create a Yelp account](https://www.yelp.com/signup) (if you do not have one already)
2. [Generate API keys](https://www.yelp.com/developers/v3/manage_app) (if you haven't already). You will only need the API Key (not the Client ID or Client Secret) -- more on that later. This step will ask you to create an App. Just make one up, indicate your industry (Education), provide a short description of your app, and then your email, and you are good to go.


Now that we have our accounts set up, we can start making requests! There are various authentication schemes that APIs use, listed here in relative order of complexity:

* No authentication
* [HTTP basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication)
* Cookie based user login
* OAuth (v1.0 & v2.0, see this [post](http://stackoverflow.com/questions/4113934/how-is-oauth-2-different-from-oauth-1) explaining the differences)
* API keys
* Custom Authentication

For the NYT example, since it is a publicly visible page, we did not need to authenticate. HTTP basic authentication isn't too common for consumer sites/applications that have the concept of user accounts (like Facebook, LinkedIn, Twitter, etc.), but it is simple to set up quickly and you often encounter it on individual password-protected pages/sites. I'm sure you have seen this before somewhere:

![http-basic](http://i.stack.imgur.com/QnUZW.png)
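
Under the hood, that password prompt results in a single request header: the username and password joined by a colon and base64-encoded. A small sketch follows; the helper name `basic_auth_header` is made up, and the credentials are the example pair used in the HTTP specification itself.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP basic authentication."""
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}


print(basic_auth_header("Aladdin", "open sesame"))
# {'Authorization': 'Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=='}
```

Note that base64 is an encoding, not encryption, so basic authentication is only safe over HTTPS. With `requests` you rarely build this header by hand: passing `auth=(username, password)` to `requests.get` does it for you.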

Cookie-based user login is what the majority of services use when you log in with a browser (i.e. with a username and password). Once you sign in to a service like Facebook, the response stores a cookie in your browser to remember that you have logged in (this is needed because HTTP itself is stateless). Each subsequent request to the same domain (i.e. any page on `facebook.com`) also sends the cookie that contains the authentication information to remind Facebook's servers that you have already logged in.
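
In header terms, the login response carries a `Set-Cookie` header, and the browser echoes the name/value pair back in a `Cookie` header on later requests. The sketch below uses the standard library's `http.cookies` module to show that round trip on a made-up session cookie; the helper name and the cookie value are hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Turn a server's Set-Cookie value into the Cookie value a client sends back."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Attributes such as Path or HttpOnly are instructions to the browser;
    # only the name=value pairs are sent back to the server.
    return "; ".join("%s=%s" % (name, morsel.value) for name, morsel in jar.items())


print(cookie_header_from_set_cookie("sessionid=abc123; Path=/; HttpOnly"))
# sessionid=abc123
```

When scripting against a site like this, a `requests.Session` object keeps track of cookies across requests for you, so you normally never touch these headers directly.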

Many REST APIs, however, use OAuth (authentication using tokens), which can be thought of as a programmatic way to "log in" _another_ user. Using tokens, a user (or application) only needs to send the login credentials once, in the initial authentication, and gets back a special signed token in the server's response. This signed token is then sent in future requests to the server (in place of the user credentials).

A similar concept commonly used by many APIs is to assign API Keys to each client that needs access to server resources. The client must then pass the API Key along with _every_ request it makes to the API to authenticate. This is because the server is typically relatively stateless and does not maintain a session between subsequent calls from the same client. Many APIs (including Yelp) allow you to pass the API Key via a special HTTP header: `Authorization: Bearer <API_KEY>`. Check out the [docs](https://www.yelp.com/developers/documentation/v3/authentication) for more information.


##  Authenticated HTTP Request with the Yelp API

First, store your Yelp credentials in a local file (kept out of version control) which you can read in to authenticate with the API. This file can be any format/structure since you will fill in the function stub below.

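For example, you may want to store your key in a file called `api_key.txt` that contains nothing but the key itself.

You can then read from the file using:
```python
def read_api_key(file):
    # The context manager closes the file even if reading fails.
    with open(file, 'r') as f:
        # strip() removes the trailing newline and any stray whitespace
        return f.read().strip()
```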

**KEEP THE API KEY FILE PRIVATE AND OUT OF VERSION CONTROL**

In [182]:
def read_api_key(file):
    # The context manager closes the file even if reading fails.
    with open(file, 'r') as f:
        # strip() removes the trailing newline and any stray whitespace
        return f.read().strip()

api_key = read_api_key('api_key.txt')

Using the Yelp API, fill in the following function stub to make an authenticated request to the [search](https://www.yelp.com/developers/documentation/v3/business_search) endpoint.

> As a test, search for Indian restaurants in Houston. You should find roughly 254 in total (the exact number depends on when you search). Note that this total differs from the number of Business objects actually returned... more on this in the next section.

When writing the Python request, you'll need to pass in a custom header as well as a parameter dictionary. See Yelp's sample code:
https://github.com/Yelp/yelp-fusion/blob/master/fusion/python/sample.py
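
`requests` can also build a request without sending it, which is a handy way to check the header and the encoded parameters before spending an API call. A sketch, using a placeholder key (`FAKE_KEY`) and a made-up helper name (`build_search_request`):

```python
import requests
from urllib.parse import parse_qs, urlsplit


def build_search_request(api_key, term, location):
    """Prepare (but do not send) a GET request to the Yelp search endpoint."""
    request = requests.Request(
        "GET",
        "https://api.yelp.com/v3/businesses/search",
        headers={"Authorization": "Bearer %s" % api_key},  # custom header
        params={"term": term, "location": location},       # parameter dictionary
    )
    return request.prepare()


prepared = build_search_request("FAKE_KEY", "Indian", "Houston, TX")
print(prepared.url)                       # the full URL, query string included
print(prepared.headers["Authorization"])  # Bearer FAKE_KEY
print(parse_qs(urlsplit(prepared.url).query))
```

Printing the prepared URL shows that `requests` URL-encodes the parameter values itself, so characters such as the space and comma in `'Houston, TX'` can be passed as they are.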

```python
>>> api_key = read_api_key('api_key.txt')
>>> num_records, data = yelp_search(api_key, 'Indian','Houston, TX')
>>> print(num_records)
254
>>> for x in data:
...     print(x['name'])
Surya India
Tarka Indian Kitchen
Govinda's Vegetarian Cuisine
Indika
Cowboys & Indians Tex-In Kitchen
...
```

In [183]:
import json
def yelp_search(api_key, query, location, offset=0):
    """
    Make an authenticated request to the Yelp API.

    Args:
        api_key (string): Yelp API key
        query (string): Search term
        location (string): Location to search in, e.g. 'Houston, TX'
        offset (integer): Number of results to skip (used for paging)

    Returns:
        total (integer): total number of businesses on Yelp corresponding to the query
        businesses (list): list of dicts representing each business
    """
    url = "https://api.yelp.com/v3/businesses/search"
    headers = {"Authorization": "Bearer %s" % api_key}
    # requests URL-encodes parameter values itself, so pass them unmodified.
    url_params = {
        'term': query,
        'location': location,
        'offset': offset
    }
    response = requests.get(url, headers=headers, params=url_params)
    data = json.loads(response.content)
    num = data['total']
    objects = data['businesses']

    return (num, objects)



## Test yelp_search function

In [184]:
#print(yelp_search(read_api_key('api_key.txt'),'Indian','Houston, TX'))
num_records, data = yelp_search(api_key, 'Indian','Houston, TX')
print(num_records)
for x in data: 
    print(x['name'])

254
Surya India
Tarka Indian Kitchen
Govinda's Vegetarian Cuisine
Sai Bhog
India's Restaurant
Indika
Desi Kitchen
Cowboys & Indians Tex-In Kitchen
Pondicheri
Maharaja Bhog
Sangam Chettinad Indian Cuisine
Kiran's
Mayuri Express
Shiva Indian Restaurant
Tandoori Hut
Khyber North Indian Grill
Deli Deluxe
Nirvana Indian Restaurant
Aga's Restaurant & Catering
Shri Balaji Bhavan
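
Notice that the search reports 254 matches but only 20 names are printed: the results come back one page at a time. Collecting all of them means repeating the request with a growing `offset`. The helper below is a hypothetical sketch that only computes which offsets to ask for; the page size of 20 is inferred from the output above rather than taken from Yelp's documentation, so check the docs before relying on it.

```python
def page_offsets(total, page_size=20, max_results=None):
    """Return the list of offsets needed to page through `total` results.

    max_results optionally caps how many results to fetch, in case the API
    refuses to go past a certain point.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    limit = total if max_results is None else min(total, max_results)
    return list(range(0, limit, page_size))


offsets = page_offsets(254)
print(len(offsets), offsets[:3], offsets[-1])
# 13 [0, 20, 40] 240
```

Each offset would then be passed to one search request, with the returned business lists concatenated into a single list.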


## End of "Hello World" of the Yelp API
Now that we have completed the "hello world" of working with the Yelp API, we are ready to really fly! The next lab will have a bit less direction, since there are a variety of ways to retrieve the requested information, but you should have all the component knowledge at this point to work with the API. Yelp, being a fairly general platform, actually has many more businesses than just restaurants, but by using the flexibility of the API we can ask it to return only the restaurants.