# Creating a Dataset of Brewery Data via Yelp Fusion API

### *Note: due to daily Yelp API query limits, this notebook must be run four times to obtain the full brewery data.*

### The DSCI 511 Brewery Team:

Wynton Britton

Russell Destremps

Hao Deng

Evan Falkowski

DSCI 511, Data Acquisition and Pre-Processing


In [1]:
# import required libraries 

import pandas as pd
import requests as rq
import os; import json; import sys
from pandas import DataFrame

## Bring in master brewery list 

In [2]:
### *** Note: for the functions created below, you will have to change the drive mapping within each function ***
# mount drive

from google.colab import drive
drive.mount('/content/drive')
#drive.mount('/content/gdrive')

Mounted at /content/drive


In [3]:
#Path must also be adjusted to current working directory

master_list = pd.read_csv("/content/drive/MyDrive/Python/DSCI_511/Project/team_project/brewery_master.csv")
#master_list = pd.read_csv('/content/gdrive/My Drive/Drexel/Brewery_Project/brewery_master.csv')

## Data Source Exploration  

**YELP:**

The first step in accessing the Yelp API is to create an app in Yelp Fusion to obtain a Client ID and an API Key. Register the app at https://www.yelp.com/fusion, where you can also browse the Yelp Fusion API documentation.

Once you have the Client ID and API Key, code like the cells further down in this notebook lets you search Yelp and collect the information you request.
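
As a minimal sketch of how the key could be supplied without pasting it into the notebook, the snippet below reads it from an environment variable and builds the request header. The variable name `YELP_API_KEY` and the helper `build_headers` are assumptions made for illustration, not part of Yelp's API or of the original notebook.

```python
import os


def build_headers(api_key):
    """Return the Authorization header for a Yelp Fusion request."""
    if not api_key:
        raise ValueError("API key is empty; register an app at https://www.yelp.com/fusion first")
    return {"Authorization": "Bearer %s" % api_key}


# YELP_API_KEY is an assumed variable name; set it before starting the notebook.
api_key = os.environ.get("YELP_API_KEY", "")
HEADERS = build_headers(api_key) if api_key else None
```

Keeping the key outside the notebook also makes it easier for each team member to run the same cells with their own key.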


**Yelp API Introduction:**

The Yelp Fusion API provides several business endpoints, each returning different fields through an API call. In our research project, the goal is to collect brewery names, ratings, and reviews from the Yelp API. Pulling every field we need takes several API calls, made in the steps below.

**Step 1:** Get the unique Yelp business ID through Business Search
    Yelp assigns a unique business ID to every business in its database. We need that ID to look up the business in Business Details. To start, we run a Business Search by term, which here is "brewery".
    
    Limitation:
    A single Business Search call returns at most 50 businesses, and at most 1,000 businesses can be retrieved for a term in total. Getting past the first 50 requires the 'offset' parameter. For example, with 'limit' set to 50, the first call (offset 0) returns records 1 to 50, and the next call (offset 50) returns records 51 to 100. Retrieving all 1,000 records therefore takes up to 20 calls.
    For location, passing the abbreviation for the United States may allow a nationwide search. However, any matches beyond the first 1,000 would NOT be returned.
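
The paging arithmetic above can be sketched as follows. The generator `search_pages` is a hypothetical helper, not part of the original notebook; it only produces the `limit` and `offset` values for each call and makes no request itself.

```python
def search_pages(total=1000, limit=50):
    """Yield the limit/offset parameters for each Business Search call."""
    for offset in range(0, total, limit):
        # The last page is trimmed so that offset + limit never exceeds total.
        yield {"limit": min(limit, total - offset), "offset": offset}


pages = list(search_pages())
print(len(pages))           # 20 calls for 1,000 records at 50 per call
print(pages[0], pages[1])   # offsets 0 and 50
```

Each dictionary would be merged with the search term and location before being passed as the request parameters.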
    
**Step 2:** Get business name from Business Details using Business ID
    With the business IDs from step 1 in hand, we next get each business name through a Business Details API call.
    We pass each business ID into the endpoint URL and output a list of business names matched to their business IDs.
    
**Step 3:** Get reviews from Reviews using Business ID
    The Reviews endpoint is keyed by business ID, which is how the code in this notebook calls it: each ID is placed in the Reviews endpoint URL to get that business's reviews from Yelp.
    
    Limitation:
    Yelp returns at most three review excerpts for a given business, ordered by Yelp's default sort order.
    What is Yelp's default sort order?
    According to Yelp, the default sort order shows reviews that help consumers make informed decisions. The order is determined by recency, user voting, and other review quality factors, which is why an older review may appear before a newer one. To personalize the experience, Yelp also favors reviews from a user's friends and the users they follow. On the Yelp site, reviews can be sorted in a few other ways: by date, by star rating, and by those written by Elites.
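
A minimal sketch of handling a Reviews response, assuming it arrives as a dictionary with a `reviews` list (as the code later in this notebook expects). The helpers `reviews_endpoint` and `top_review` and the `sample` payload are made up for illustration; only the endpoint URL pattern is taken from the notebook.

```python
REVIEWS_URL = "https://api.yelp.com/v3/businesses/{}/reviews"


def reviews_endpoint(business_id):
    """Build the Reviews endpoint URL for one business ID."""
    return REVIEWS_URL.format(business_id)


def top_review(payload):
    """Return (rating, text) of the first review, or ('Na', 'Na') if none."""
    reviews = payload.get("reviews") or []
    if not reviews:
        # Covers error responses and businesses with zero reviews.
        return "Na", "Na"
    first = reviews[0]
    return first.get("rating", "Na"), first.get("text", "Na")


# Invented sample payload, shaped like the responses the notebook parses.
sample = {"reviews": [{"rating": 5, "text": "Great IPA."},
                      {"rating": 3, "text": "Fine."}]}
print(top_review(sample))
```

Because the first element follows Yelp's default sort order, the "top" review here is not necessarily the newest or the highest rated.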

**Step 4:** Merge data outcome into dataframe


Limitations of Yelp API:
  - The link below explains the daily limits the Yelp API places on a client, a client being the developer or product requesting information from the API. Each client receives one API key, which allows at most 5,000 calls per day across all of Yelp's APIs. The number of calls allowed daily, the number of calls remaining for the day, and the time remaining before a reset are all tracked by Yelp and visible to the developer. Each key's daily limit resets at midnight UTC (00:00 UTC), which is 19:00 EST (20:00 EDT). Our data totaled 8,000 breweries, so we split it into four buckets of 2,000 breweries each. Each of us created a key and processed one bucket, which let us collect all the data in one day. For each brewery, we first used the key to get the business ID and then made a second call with that ID to get the reviews.

- https://www.yelp.com/developers/documentation/v3/rate_limiting
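
The bucket sizing follows from simple arithmetic, sketched below with the figures quoted above. The constant names are illustrative only.

```python
import math

DAILY_LIMIT = 5000       # calls per key per day, as stated above
CALLS_PER_BREWERY = 2    # one Business Search call plus one Reviews call
TOTAL_BREWERIES = 8000   # size of the master list, as stated above

# Most breweries a single key can process in one day.
max_per_key = DAILY_LIMIT // CALLS_PER_BREWERY

# Fewest buckets (and keys) needed to cover the whole list in one day.
buckets_needed = math.ceil(TOTAL_BREWERIES / max_per_key)

# Resulting bucket size and the calls each bucket consumes.
bucket_size = math.ceil(TOTAL_BREWERIES / buckets_needed)
calls_per_bucket = bucket_size * CALLS_PER_BREWERY

print(max_per_key, buckets_needed, bucket_size, calls_per_bucket)
```

Three buckets would need about 2,667 breweries each, or more than 5,300 calls per key, which is why four is the minimum.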

In [4]:
#################################
# Whatever size we split master_list into, each brewery costs two API calls
# (one search request and one reviews request). Splitting the list into four
# buckets of about 2,000 breweries keeps each bucket at about 4,000 calls,
# under the daily limit of 5,000.

l = len(master_list)  # number of rows in the master list
n = l // 4            # number of rows per bucket

# .copy() makes each bucket an independent DataFrame, so adding columns to it
# later does not trigger pandas' SettingWithCopyWarning.
copy1 = master_list.iloc[0:n].copy()
copy2 = master_list.iloc[n:n*2].copy()
copy3 = master_list.iloc[n*2:n*3].copy()
copy4 = master_list.iloc[n*3:l].copy()  # the last bucket takes any remainder

#print(len(copy1));print(len(copy2));print(len(copy3));print(len(copy4))
copy1.head()

Unnamed: 0,type,location_name,region,latitude,longitude
0,Brewpub,101 Brewery,WA,47.822407,-122.875356
1,Brewpub,122 West Brewing Co,WA,48.762557,-122.485773
2,Brewpub,12Degree Brewing,CO,39.978215,-105.131876
3,Brewpub,15 24 Brew House,KS,39.376021,-97.127491
4,Brewpub,16 Stone Brewpub,NY,43.241849,-75.256302


In [None]:
# Use our copy to obtain name, lat, long, and use those to search yelp api. 
# Define the API Key, the EndPoint, and the Header. All offered from Yelp's site. 

API_KEY = '-1v4CcJvsknOCxy_E0wskNC6FAT-UIs8P93vNqDEW4_XIXHLSHvhxY7nwFq5GPCOYBSwRIDGoftUmwKfdVla4G9VT2iC1aRFpdbmnkPqqVSqJxc2xGTFSMFJDFfFX3Yx'
ENDPOINT = 'https://api.yelp.com/v3/businesses/search'
HEADERS = {'Authorization': 'bearer %s' % API_KEY}

# Create the lists for the given copy in use.
cp1_id = [];cp1_rating = [];cp1_num = []

# Loop over the copy in use and read the name, the latitude, & the longitude
# by column label rather than by position.
for _, row in copy1.iterrows():
  brew_name = row['location_name']
  lat = row['latitude']
  lon = row['longitude']

# For the given brewery, build the search parameters from its name, lat, and lon
  parameters = {
      'term': brew_name,
      'radius': 1000,
      'latitude': lat,
      'longitude':lon 
  }

# Make the call to the business search and retain the response for given brewery
  response = rq.get(url = ENDPOINT, params = parameters, headers = HEADERS)
  business_data = response.json()

# Each response falls into one of three cases, which we handle accordingly:
# a complete output, an empty output, or an error output. If we receive
# anything other than a complete response, we enter 'Na'. If we receive a
# successful response, we keep the first match's Yelp ID, average rating,
# & number of ratings.
  if 'businesses' in business_data and len(business_data['businesses']) != 0:
    cp1_id.append(business_data['businesses'][0]['id'])  # Yelp business ID
    cp1_rating.append(business_data['businesses'][0]['rating'])  # average rating
    cp1_num.append(business_data['businesses'][0]['review_count'])  # number of ratings
  else:
    cp1_id.append('Na'); cp1_rating.append('Na'); cp1_num.append('Na')

# The loop ends, and we add our new information back into the copy of our data.
copy1['yelp_id'] = cp1_id
copy1['yelp_ave'] = cp1_rating
copy1['yelp_reviews'] = cp1_num
copy1.head()


In [None]:
## Pulling the top Yelp review and its comment for a given Yelp ID ##
# Define the same key and headers as before.
API_KEY = '-1v4CcJvsknOCxy_E0wskNC6FAT-UIs8P93vNqDEW4_XIXHLSHvhxY7nwFq5GPCOYBSwRIDGoftUmwKfdVla4G9VT2iC1aRFpdbmnkPqqVSqJxc2xGTFSMFJDFfFX3Yx'
HEADERS = {'Authorization': 'bearer %s' % API_KEY}
# Define the parameters
PARAMETERS_REVIEW = {'locale':'en_US'}
# Define the list that will hold each brewery's reviews
cp1_review = []

# Make a request to the Yelp Reviews API for the Yelp ID of each brewery
for i in cp1_id:
    ENDPOINT_REVIEW = 'https://api.yelp.com/v3/businesses/{}/reviews'.format(i)
    # PARAMETERS_REVIEW was defined but never sent in the original call
    response_review = rq.get(url = ENDPOINT_REVIEW, params = PARAMETERS_REVIEW, headers = HEADERS)

# Store the response and handle each condition. Either the ID is 'Na', which
# produces an error message, or the reviews are collected successfully.
    business_data_review = response_review.json()
    if 'reviews' in business_data_review:
        cp1_review.append(business_data_review['reviews'])
    else:
        cp1_review.append('Na')

print(cp1_review[0])


In [None]:
# Each entry in cp1_review is either 'Na' or a list of 0 to 3 reviews. We were
# not able to retain all of the reviews, so we currently keep only the top
# review's rating along with its comment. 'Na' is filled in where there is none.
cp1_text = []; cp1_rating = []

for i in cp1_review:
  if i == 'Na' or len(i) == 0:  # error response, or a business with no reviews
    cp1_text.append('Na')
    cp1_rating.append('Na')
  else:
    cp1_text.append(i[0]['text'])
    cp1_rating.append(i[0]['rating'])

In [None]:
# Add the top review's rating and its comment (or the 'Na') back into the copy
# of the dataframe from master_list, as new columns 'Ratings' and 'Review_Text'.
copy1['Ratings'] = cp1_rating
copy1['Review_Text'] = cp1_text

# Check the output
copy1.head()

# Save the dataframe to csv, changing the number in the **FILE NAME** for each bucket.
# The path below must also be adjusted to match your own Drive mount point.
copy1.to_csv('/content/gdrive/My Drive/Drexel/Brewery_Project/copy1YelpData.csv')
