# Yelp Web Scraping

**In this practice, we will use requests and json to develop a simple but efficient Yelp crawler (based on the official API).**

In many situations, we can get data from popular platforms such as Google Maps and Yelp through their official Web APIs. To use these Web APIs, you need a developer account for the target platform, and in general the free quota comes with limits on queries per second (QPS) and on the total number of requests. You can always buy more quota if you want, although it is a little expensive for students.
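One way to stay under a QPS limit is to space requests out on the client side. The following is a minimal sketch under stated assumptions: the `Throttle` class and the figure of 4 requests per second are hypothetical illustrations, not part of any Yelp or Google library or their published limits.

```python
import time


class Throttle:
    """Keep consecutive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first call

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed limit for illustration: at most 4 requests per second.
throttle = Throttle(min_interval=0.25)
```

Calling `throttle.wait()` right before each request would then replace a fixed `time.sleep`, so that slow responses do not add unnecessary extra delay.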

For Google's and Yelp's APIs, such as Yelp's "Business Search", each request returns only a small amount of data (the full result set is divided into pages), and there is a limit on the total number of results you can get from a single search. For example, Seattle, WA may have 30,000 restaurants, but a single search returns no more than 1,000 of them, divided into 20 pages of 50 records each (one request per page). To approach the real number of 30,000, you would need advanced techniques such as a grid search or recursive search algorithm. We will not cover these advanced techniques in this tutorial, but we will get the maximum number of results allowed in a single search by simply retrieving the data page by page.
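The page arithmetic described above can be sketched as follows. The helper name `page_offsets` and its `cap` parameter are hypothetical; the numbers (30,000 businesses, 1,000 reachable results, 50 records per page) are the ones from the example.

```python
import math


def page_offsets(total, limit, cap=1000):
    """Return the offset of each page needed to fetch min(total, cap) records."""
    reachable = min(total, cap)
    return [page * limit for page in range(math.ceil(reachable / limit))]


offsets = page_offsets(total=30000, limit=50)
print(len(offsets), offsets[:3], offsets[-1])  # 20 [0, 50, 100] 950
```

Each offset corresponds to one request, so 20 requests are enough to reach the single-search maximum.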

In [None]:
# Read the following source code to learn the basic skills of developing a web crawler based on an official API

# pip install requests pandas
# json is part of the Python standard library; on Colab, requests and pandas are pre-installed

import requests
import csv
import json
import math
import pandas as pd
import time


# API Path
BUSINESS_SEARCH="https://api.yelp.com/v3/businesses/search"
REVIEWS="https://api.yelp.com/v3/businesses/id/reviews"

# API Key (replace the placeholder with your own key; never publish a real key)
# See https://www.yelp.com/developers/v3/manage_app
API_KEY='YOUR_API_KEY'
HEADER = {
      'Authorization': 'Bearer %s' % API_KEY,
  }

# constant params
MAX=50
LIMIT=10


def yelp_search(term='business',location='Green Lake, Seattle, WA',pause=0.25,includeComments=False):
  """Extract information of Yelp (By Yifan)

  Retrieves basic information (max 1000 items) and attached comments (max 3 comments each item) of Yelp search
  Key reference: https://www.yelp.com/developers/documentation/v3/business_search
  
  Args:
    term: Search term, for example "food" or "restaurants". The term may also be business names, such as "Starbucks".
    location: Geographic area to be used when searching for businesses. Examples: "New York City", "NYC", "350 5th Ave, New York, NY 10118". Businesses returned in the response may not be strictly within the specified location.
    pause: Speed control; pause time in seconds between requests, to avoid exceeding the QPS limit
    includeComments: whether to retrieve the attached comments

  Returns:
    A pandas dataframe
    if includeComments=False: the dataframe contains 12 columns (content_id,name,rating,reviews,phone,address,city,state,country,postcode,latitude,longitude)
    else: the dataframe contains 5 additional columns (comment_id,user_id,user_name,user_rating,comment)
    the content_id is important - we need it to further request comments
  """

  params = {
        'term': term, 
        'location': location,
        'limit': LIMIT,
        }
  # Determine how many matching businesses are at the location and how many we can get.
  # The "total" field in the current API is capped at 240; in other words, it no longer reflects the real number of businesses in the area.
  # If "total" becomes reliable again, the alternative code below can replace "total=MAX".
  #--------------------alternative code----------
  # total= json.loads(requests.request('GET',BUSINESS_SEARCH, headers=HEADER,params=params).text)['total']
  # print('there are {} {} at {}'.format(total,term,location))
  # if total>MAX:
  #   total=MAX
  
  total=MAX
  
  # calculate the list of offsets needed to retrieve the search results
  offsets=[i*LIMIT for i in range(math.ceil(total/LIMIT))]
  
  # initialize the result list
  rst=[]
  print('**********Start**********')
  # retrieve data according to the offset list
  for offset in offsets:
    print("Retrieving: {}/{}".format(offset,total))
    time.sleep(pause)
    params['offset']=offset
    response = requests.request('GET',BUSINESS_SEARCH, headers=HEADER,params=params)
    data = json.loads(response.text)['businesses'] # turn the response's JSON string into a dictionary
    for item in data:             # extract data we need in loop
      # get basic information
      content_id=item['id']
      name=item['name']
      rating=item['rating']
      reviews=item['review_count']
      phone=item['phone']
      address=item['location']['address1']
      city=item['location']['city']
      state=item['location']['state']
      country=item['location']['country']
      postcode=item['location']['zip_code']
      latitude=item['coordinates']['latitude']
      longitude=item['coordinates']['longitude']
      rst.append([content_id,name,rating,reviews,phone,address,city,state,country,postcode,latitude,longitude])
  # rst=rst[0:total] # drop redundant duplicates caused by the offset mechanism
  # reformat the result list into a pandas dataframe
  df=pd.DataFrame(data=rst,columns=["content_id","name","rating","reviews",
                                    "phone","address","city","state","country",
                                    "postcode","latitude","longitude"])
  df.drop_duplicates(subset='content_id',keep='first',inplace=True)
  print("{} {} at {} Retrieved.".format(len(df),term,location))
  # retrieve additional comments
  if includeComments:
    rst=[]
    for i,content_id in enumerate(df['content_id']):
      if (i+1)%5==0:
        print("Comments Retrieving: {}/{}".format(i+1,len(df['content_id'])))
      time.sleep(pause)  # pause before every request to respect the QPS limit
      response = requests.request('GET',REVIEWS.replace('id',content_id), headers=HEADER)
      data = json.loads(response.text)['reviews']
      for comment in data:
        comment_id=comment['id']
        user_id=comment['user']['id']
        user_name=comment['user']['name']
        user_rating=comment['rating']
        text=comment['text']
        rst.append([content_id,comment_id,user_id,user_name,user_rating,text])
    cdf=pd.DataFrame(data=rst,columns=["content_id","comment_id","user_id",
                                     "user_name","user_rating","comment"])
    print('{} attached comments retrieved.'.format(len(cdf)))
    df=pd.merge(df,cdf,on='content_id')
  print('**********FINISH**********')
  return df
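The function above reads every field with direct indexing such as `item['phone']`, which raises a `KeyError` if a record ever lacks one of those keys. A more forgiving variant could look like the sketch below. The helper name `flatten_business` and the sample record are made up for illustration; only the field names are taken from the code above.

```python
def flatten_business(item):
    """Flatten one business record into a row, using None for missing fields."""
    location = item.get('location') or {}
    coordinates = item.get('coordinates') or {}
    return [
        item.get('id'),
        item.get('name'),
        item.get('rating'),
        item.get('review_count'),
        item.get('phone'),
        location.get('address1'),
        location.get('city'),
        location.get('state'),
        location.get('country'),
        location.get('zip_code'),
        coordinates.get('latitude'),
        coordinates.get('longitude'),
    ]


# Made-up record with no phone and no coordinates, for illustration only.
sample = {
    'id': 'abc123',
    'name': 'Example Cafe',
    'rating': 4.0,
    'review_count': 12,
    'location': {'address1': None, 'city': 'Seattle', 'state': 'WA',
                 'country': 'US', 'zip_code': '98103'},
}
row = flatten_business(sample)
print(row)
```

The returned list has the same 12 positions as the rows appended to `rst`, so it could be used in place of the individual assignments inside the loop.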

In [None]:
# Execute Demo: Extract basic information from the Yelp search results
MAX=50 # Just for demo to reduce API usage. Please delete this line in real practice, or set "MAX=1000". 
df1=yelp_search(term='business',location='Green Lake, Seattle, WA',pause=0.25,includeComments=False)
df1

**********Start**********
Retrieving: 0/50
Retrieving: 10/50
Retrieving: 20/50
Retrieving: 30/50
Retrieving: 40/50
50 business at Green Lake, Seattle, WA Retrieved.
**********FINISH**********

| | content_id | name | rating | reviews | phone | address | city | state | country | postcode | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OuxP_uWXB-YG8EYGVOX7FA | Seattle CPA Professionals | 4.5 | 25 | 12064207329 | 559 NE 80th St | Seattle | WA | US | 98115 | 47.686613 | -122.3204 |
| 1 | lY3YoAx3XmBnlOv8MKysXw | Clear Sky Bookkeeping | 5.0 | 4 | 14252432029 | | Seattle | WA | US | 98103 | 47.660009 | -122.342557 |
| 2 | m_x3dvMGCvXF0Fj6AeauoA | Morsel | 4.5 | 1155 | 12062680154 | 5000 University Way NE | Seattle | WA | US | 98105 | 47.665441 | -122.312814 |
| 3 | L2bswPTN84fdTny7y1EYRA | Works Progress | 5.0 | 10 | 12064661624 | 8001 14th Ave NE | Seattle | WA | US | 98115 | 47.68684 | -122.313617 |
| 4 | 75nF3g8q4RQHD17yQQ47HQ | The Wise Owl Books and Music | 5.0 | 3 | 12065803211 | 2223 N 56th St | Seattle | WA | US | 98103 | 47.668829 | -122.331963 |
| 5 | q_rN813GkQD8ryeT9QeWug | Stretch and Staple | 4.5 | 41 | 12066079277 | 8005 Greenwood Ave N | Seattle | WA | US | 98133 | 47.687095 | -122.355302 |
| 6 | PQPVZyr-ssrcF28c680O9Q | Clear Skies Cleaning | 5.0 | 136 | 12066697706 | | Seattle | WA | US | 98107 | 47.689575 | -122.354844 |
| 7 | 0lHUizq6U2UIf4KLPTUdng | Up Time Technology | 4.5 | 35 | 12065471817 | 2408 N 45th St | Seattle | WA | US | 98103 | 47.661575 | -122.329643 |
| 8 | PpOAc_6PXtC8gHmae4P1Jw | Tula's Cleaning Service | 5.0 | 31 | 12067870805 | 8800 Nesbit Ave N | Seattle | WA | US | 98112 | 47.69254 | -122.34314 |
| 9 | MQRXvwKGLvkxhEmWN8Tdkg | David Drake - Windermere Property Management | 5.0 | 21 | 12063946614 | 819 NE 65th St | Seattle | WA | US | 98115 | 47.67561 | -122.319092 |


In [None]:
# Execute Demo: Extract basic information with comments from the Yelp search results
MAX=10 # Just for demo to reduce API usage. Please delete this line in real practice, or set "MAX=1000". 
df1=yelp_search(term='restaurants',location='Green Lake, Seattle, WA',pause=0.25,includeComments=True)
df1

**********Start**********
Retrieving: 0/10
10 restaurants at Green Lake, Seattle, WA Retrieved.
Comments Retrieving: 5/10
Comments Retrieving: 10/10
30 attached comments retrieved.
**********FINISH**********


| | content_id | name | rating | reviews | phone | address | city | state | country | postcode | latitude | longitude | comment_id | user_id | user_name | user_rating | comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gn5erxCRML47GgbGYdxzFA | Bongos | 4.5 | 1215 | 12064208548 | 6501 Aurora Ave N | Seattle | WA | US | 98103 | 47.676745 | -122.346925 | gsqMRGFcQqEk_8_AU4xh6w | 2Mnd6FEaHuI1p01f-LmG1g | Ana K S. | 5 | I am forever grateful to the Yelp community fo... |
| 1 | Gn5erxCRML47GgbGYdxzFA | Bongos | 4.5 | 1215 | 12064208548 | 6501 Aurora Ave N | Seattle | WA | US | 98103 | 47.676745 | -122.346925 | Bfu8DA-2VJHeaptKM2drdg | LMBh_gkkp_tHQuIw72l65g | Luba K. | 5 | Yes, yes, this place is as good as everyone sa... |
| 2 | Gn5erxCRML47GgbGYdxzFA | Bongos | 4.5 | 1215 | 12064208548 | 6501 Aurora Ave N | Seattle | WA | US | 98103 | 47.676745 | -122.346925 | fd2q02UYdyxfjiu8PHPEbg | VGPcMHDu2Ni0qzJtlprNXw | Katy H. | 4 | After hearing such good reviews I was pleasant... |
| 3 | RS-Hlsx7k90m5QODHDs5Cg | Tapas Lab | 4.5 | 235 | 12067751744 | 7012 Woodlawn Ave NE | Seattle | WA | US | 98115 | 47.67934 | -122.32439 | vCpqMiCoZzJaF9J4nSBWDA | iBim7ih7ue_EGczvtHVU7Q | Vy M. | 5 | Finally tried Tapas Lab and I am very pleased!... |
| 4 | RS-Hlsx7k90m5QODHDs5Cg | Tapas Lab | 4.5 | 235 | 12067751744 | 7012 Woodlawn Ave NE | Seattle | WA | US | 98115 | 47.67934 | -122.32439 | t2TXbaBTfGA5U2IqPboJwQ | fDPw0phD3xzEdMrAP0iutw | Vita L. | 4 | There aren't many vegan options. I was on a da... |
| 5 | RS-Hlsx7k90m5QODHDs5Cg | Tapas Lab | 4.5 | 235 | 12067751744 | 7012 Woodlawn Ave NE | Seattle | WA | US | 98115 | 47.67934 | -122.32439 | TSJx05vVg2n6eH1iGcPXOQ | X75ftsvDkgvjcjhjrMuakw | Will L. | 5 | I've been meaning to try this place for a whil... |
| 6 | DgNKaOrCZg4CB51QY2DHMQ | Eight Row | 4.5 | 159 | 12062943178 | 7102 Woodlawn Ave NE | Seattle | WA | US | 98115 | 47.68012 | -122.32437 | mBKSQGs-0Sn2lcEwyhXNSQ | C6LV0p8L6IfBfh1YFRkWzQ | Elizabeth L. | 5 | Outstanding farm to table prix fixe, seasonal ... |
| 7 | DgNKaOrCZg4CB51QY2DHMQ | Eight Row | 4.5 | 159 | 12062943178 | 7102 Woodlawn Ave NE | Seattle | WA | US | 98115 | 47.68012 | -122.32437 | YUQmZRh6iPYmupvQQs2mXA | weG2lafFtd00B9XklzHfWg | Steve W. | 2 | We were so looking forward to dinner here and ... |
| 8 | DgNKaOrCZg4CB51QY2DHMQ | Eight Row | 4.5 | 159 | 12062943178 | 7102 Woodlawn Ave NE | Seattle | WA | US | 98115 | 47.68012 | -122.32437 | u1FArf4TjMNYfdZwXwEJ5A | 2E9njyTMI0qzQOKhe8mHDg | Benjamin H. | 4 | Lovely space, very nice servers. I'm told it's... |
| 9 | MR_l59qW3luE161L6_dPgQ | Restaurant Christine | 4.5 | 41 | 12064204781 | 2227 N 56th St | Seattle | WA | US | 98103 | 47.66876 | -122.33183 | jZe0guWQWfgeG_KEQaPbEg | mGmLgnaZfLwuO93Bsc-Ghw | Rikki N. | 5 | So far I've tried the pulled pork sammy with t... |


The following code filters the scraped data and stores the columns we need for the next step in a new data frame named myData.

In [None]:
myData=df1.filter(items=['name','rating','comment', 'latitude', 'longitude'])
myData

| | name | rating | comment | latitude | longitude |
|---|---|---|---|---|---|
| 0 | Bongos | 4.5 | I am forever grateful to the Yelp community fo... | 47.676745 | -122.346925 |
| 1 | Bongos | 4.5 | Yes, yes, this place is as good as everyone sa... | 47.676745 | -122.346925 |
| 2 | Bongos | 4.5 | After hearing such good reviews I was pleasant... | 47.676745 | -122.346925 |
| 3 | Tapas Lab | 4.5 | Finally tried Tapas Lab and I am very pleased!... | 47.67934 | -122.32439 |
| 4 | Tapas Lab | 4.5 | There aren't many vegan options. I was on a da... | 47.67934 | -122.32439 |
| 5 | Tapas Lab | 4.5 | I've been meaning to try this place for a whil... | 47.67934 | -122.32439 |
| 6 | Eight Row | 4.5 | Outstanding farm to table prix fixe, seasonal ... | 47.68012 | -122.32437 |
| 7 | Eight Row | 4.5 | We were so looking forward to dinner here and ... | 47.68012 | -122.32437 |
| 8 | Eight Row | 4.5 | Lovely space, very nice servers. I'm told it's... | 47.68012 | -122.32437 |
| 9 | Restaurant Christine | 4.5 | So far I've tried the pulled pork sammy with t... | 47.66876 | -122.33183 |


Next, you will download the data as CSV or JSON and clean it.
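What "cleaning" means depends on your data. As one possible sketch, the code below strips stray whitespace from text fields, drops rows without a comment, and serializes the rest to CSV text. The helpers `clean_rows` and `rows_to_csv` and the sample rows are hypothetical; in the notebook, real rows could be obtained with `myData.to_dict('records')`.

```python
import csv
import io


def clean_rows(rows):
    """Strip whitespace from text fields and drop rows with no comment."""
    cleaned = []
    for row in rows:
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if row.get('comment'):
            cleaned.append(row)
    return cleaned


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows for illustration only.
rows = [
    {'name': ' Example Cafe ', 'rating': 4.0, 'comment': 'Great coffee. '},
    {'name': 'Empty Diner', 'rating': 3.5, 'comment': '   '},
]
fields = ['name', 'rating', 'comment']
csv_text = rows_to_csv(clean_rows(rows), fields)
print(csv_text)
```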

# Build a nested database

This step comes after cleaning the raw data.

The following code creates nested data grouped by business name. Each business node holds its name, latitude, longitude, and a list of reviews, and each review holds two values: a rating and a comment.  
Our particular radial dendrogram creates its nodes from the 'name' field. Note that business names and comments are both stored under 'name', at different depths of the data hierarchy.
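To make the hierarchy concrete, here is a small pandas-free sketch of the same idea on made-up rows. The helpers `build_tree` and `names_by_depth` are hypothetical and only illustrate the target structure; they are not the method used in this notebook.

```python
def build_tree(rows):
    """Group flat rows by business name into a root -> business -> comment tree."""
    businesses = {}
    for row in rows:
        node = businesses.setdefault(row['name'], {
            'name': row['name'],
            'latitude': row['latitude'],
            'longitude': row['longitude'],
            'children': [],
        })
        node['children'].append({'value': row['rating'], 'name': row['comment']})
    return {'name': 'review', 'children': list(businesses.values())}


def names_by_depth(node, depth=0):
    """Yield (depth, name) for every node so both uses of 'name' are visible."""
    yield depth, node['name']
    for child in node.get('children', []):
        yield from names_by_depth(child, depth + 1)


# Made-up rows for illustration only.
rows = [
    {'name': 'Cafe A', 'rating': 4.5, 'comment': 'Nice.', 'latitude': 47.68, 'longitude': -122.33},
    {'name': 'Cafe A', 'rating': 4.5, 'comment': 'Cozy.', 'latitude': 47.68, 'longitude': -122.33},
    {'name': 'Diner B', 'rating': 4.0, 'comment': 'Fast.', 'latitude': 47.67, 'longitude': -122.34},
]
tree = build_tree(rows)
for depth, name in names_by_depth(tree):
    print('  ' * depth + name)
```

The printed outline shows the root at depth 0, business names at depth 1, and comments at depth 2, all stored under the same 'name' key.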

In [None]:
def add_root(df_root):
    # Wrap all businesses under a single root node named 'review'.
    yield {"name": 'review',
           "children":list(pd_to_review_dict(df_root))
    }
def pd_to_review_dict(df):
    # Group by the column name itself (not a one-item list) so each key is a plain string.
    for name, df_name_grouped in df.groupby("name"):
        yield {
            "name": name,
            "latitude": df_name_grouped['latitude'].iloc[0],
            "longitude": df_name_grouped['longitude'].iloc[0],
            "children": list(split_line_items(df_name_grouped))
        }
        
def split_line_items(df_review):
    for row in df_review.itertuples():
        yield {
            "value": row.rating,
            "name": row.comment
        }

Now pass the data frame from the crawler into the functions above by running the following code block. It prints the data as JSON in an output cell and stores the JSON string in the variable 'myYelpData'.

In [None]:
df_pd = pd.DataFrame(myData)
review_list = list(add_root(df_pd))
review_dict = {}
for sub_dict in review_list:
        review_dict.update(sub_dict)
print(json.dumps(review_dict, indent=4))
myYelpData=json.dumps(review_dict)

{
    "name": "review",
    "children": [
        {
            "name": "Bongos",
            "latitude": 47.6767449533551,
            "longitude": -122.346925400198,
            "children": [
                {
                    "value": 4.5,
                    "name": "I am forever grateful to the Yelp community for helping us find this spot. It must feel like an echochamber of positive reviews for newcomers but this place..."
                },
                {
                    "value": 4.5,
                    "name": "Yes, yes, this place is as good as everyone says. Trust me, I've come here 3x and it was randomly closed...and I still came back to try it...and then came..."
                },
                {
                    "value": 4.5,
                    "name": "After hearing such good reviews I was pleasantly surprised that this food lives up to its reputation. The veggie platter was particularly amazing for..."
                }
            ]
        },
        

Check whether myYelpData has been created successfully.

In [None]:
myYelpData

'{"name": "review", "children": [{"name": "Bongos", "latitude": 47.6767449533551, "longitude": -122.346925400198, "children": [{"value": 4.5, "name": "I am forever grateful to the Yelp community for helping us find this spot. It must feel like an echochamber of positive reviews for newcomers but this place..."}, {"value": 4.5, "name": "Yes, yes, this place is as good as everyone says. Trust me, I\'ve come here 3x and it was randomly closed...and I still came back to try it...and then came..."}, {"value": 4.5, "name": "After hearing such good reviews I was pleasantly surprised that this food lives up to its reputation. The veggie platter was particularly amazing for..."}]}, {"name": "Eight Row", "latitude": 47.68012, "longitude": -122.32437, "children": [{"value": 4.5, "name": "Outstanding farm to table prix fixe, seasonal tasting menus feature local ingredients from independent farms and orchards. Creative dishes incorporating the..."}, {"value": 4.5, "name": "We were so looking forwar

Our data is now ready to be saved to a file and downloaded.

In [None]:
from google.colab import files

with open('myYelpData.json', 'w') as f:
  f.write(myYelpData)
# download after the file has been closed so that all content is flushed to disk
files.download('myYelpData.json')

myYelpData.json should now be in your downloads folder. Open it in a text editor and begin developing the context for your neighborhood story.
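Besides opening the file in a text editor, you could load it back in Python to check that it parses and to count what it contains. This is a minimal sketch: the helper `summarize_tree` is hypothetical, and it runs on made-up data written to a temporary file rather than on your actual download.

```python
import json
import os
import tempfile


def summarize_tree(path):
    """Load a nested review file and count businesses and comments."""
    with open(path, encoding='utf-8') as f:
        tree = json.load(f)
    businesses = tree.get('children', [])
    comments = sum(len(b.get('children', [])) for b in businesses)
    return len(businesses), comments


# Made-up data written to a temporary file so the sketch runs anywhere.
demo = {'name': 'review', 'children': [
    {'name': 'Cafe A', 'children': [{'value': 4.5, 'name': 'Nice.'},
                                    {'value': 4.5, 'name': 'Cozy.'}]},
]}
demo_path = os.path.join(tempfile.mkdtemp(), 'myYelpData.json')
with open(demo_path, 'w', encoding='utf-8') as f:
    json.dump(demo, f)

print(summarize_tree(demo_path))  # (1, 2)
```

Pointing `summarize_tree` at the path of your downloaded file would report how many businesses and comments it holds.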