**Below are the key features we scrape from the Yelp website:**
1. Restaurant name
2. Restaurant address
3. Restaurant rating (Yelp rating, 0-5 in 0.5 increments)
4. Hygiene (official scores: A, B, C)
5. Restaurant neighborhood (Morningside Heights, East Village, Chelsea, etc.)
6. Category (i.e. cuisine type: Chinese, Japanese, French, American, etc.)
7. Noise Level (Quiet, Noisy, Average)
8. Ambience (Romantic, Trendy, etc.)
9. Price range (under $10, $11-30, $31-60, ...)
10. Parking Options (Street, Private Lot, Valet, etc.)
11. Reservable? (Yes, No)
12. Has Gluten-free Options (Yes, No)
13. Alcohol (Beer & Wine Only, Full Bar, None)

     $ \vdots$
     
We ran the scraping code for the following cuisine types: Chinese, Korean, American, Indian, Japanese, Spanish, French, Italian, Greek, Thai, Mexican, Vietnamese, German, Iranian, Russian, and Turkish.

We changed the first argument of the "get_urls_from_search(term, location, num)" function to get the URLs for each cuisine type. We change this argument manually and run each cuisine type separately; otherwise, it is too easy to get IP-banned by Yelp and lose all the collected data.


For each cuisine type, we generate "{cuisine_type}_Restaurant.csv"
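
As a small illustration of this naming convention, the sketch below derives the search term and the output file name from a cuisine type. The helper name `cuisine_job` is hypothetical, and the `"<cuisine> restaurants"` term pattern is an assumption based on the Russian example used in this notebook.

```python
def cuisine_job(cuisine_type):
    """Return (search term, CSV file name) for one cuisine type."""
    # Assumed term pattern, matching the "Russian restaurants" search
    term = f"{cuisine_type} restaurants"
    filename = f"{cuisine_type}_Restaurant.csv"
    return term, filename


term, filename = cuisine_job("Russian")
print(term, filename)
```

Keeping both strings in one place makes it harder to save one cuisine's results under another cuisine's file name when the runs are done by hand.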

**Use the following code to scrape**

In [12]:
import re
import time
import urllib.request
from random import randint
from threading import Thread
from urllib.request import urlopen, Request

import pandas as pd
from bs4 import BeautifulSoup

In [2]:
opener = urllib.request.build_opener()
# IE 9 proved to be the most successful
opener.addheaders = [('User-agent', 'IE 9/Windows: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)')]
urllib.request.install_opener(opener)

In [3]:
# Scrape the key business features from a single Yelp business page
def scrape(ur):

    with urllib.request.urlopen(ur) as url:
        html = url.read()
    soup = BeautifulSoup(html, "lxml")

    # dictionary storing the key business features; the misspelled keys
    # 'retaurant_address' and 'restaurant_neighobrhood' are kept because
    # the header list and the CSV columns use them
    business_info = {}

    restaurant_name = soup.find('h1')
    if restaurant_name is not None:
        business_info['restaurant_name'] = restaurant_name.text.strip()

    restaurant_address = soup.find('span', itemprop="streetAddress")
    if restaurant_address is not None:
        business_info['retaurant_address'] = restaurant_address.text.strip()

    restaurant_zipcode = soup.find('span', itemprop="postalCode")
    if restaurant_zipcode is not None:
        business_info['restaurant_zipcode'] = restaurant_zipcode.text.strip()

    restaurant_reviewcount = soup.find('span', itemprop="reviewCount")
    if restaurant_reviewcount is not None:
        business_info['restaurant_reviewcount'] = restaurant_reviewcount.text.strip()

    rating = soup.find(itemprop="ratingValue")
    if rating is not None:
        business_info['restaurant_rating'] = rating.get("content")

    neighborhood = soup.find('span', {'class': 'neighborhood-str-list'})
    if neighborhood is not None:
        business_info['restaurant_neighobrhood'] = neighborhood.text.strip()

    hygiene_score = soup.find('dd', {'class': "nowrap health-score-description"})
    if hygiene_score is not None:
        business_info['Hygiene_score'] = hygiene_score.text.strip()

    price_range = soup.find('dd', {'class': "nowrap price-description"})
    if price_range is not None:
        business_info['price_range'] = price_range.text.strip()

    # each <dl> in the "short-def-list" block is one attribute (e.g. Parking: Street)
    attributes = soup.find('div', {'class': 'short-def-list'})
    if attributes is not None:
        for dl in attributes.find_all('dl'):
            key = dl.find('dt').text.strip()
            value = dl.find('dd').text.strip()
            business_info[key] = value

    latitude = soup.find(property="place:location:latitude")
    if latitude is not None:
        business_info['latitude'] = latitude.get("content")

    longitude = soup.find(property="place:location:longitude")
    if longitude is not None:
        business_info['longitude'] = longitude.get("content")

    # categories are joined into one string, each followed by '; '
    business_info['Category'] = ''
    categories = soup.find('span', {'class': 'category-str-list'})
    if categories is not None:
        for a in categories.find_all('a'):
            business_info['Category'] += a.text.strip() + '; '

    return business_info

In [4]:
# Example
d = scrape('https://www.yelp.com/biz/blue-ribbon-sushi-new-york?osq=blue+ribbon')
print(d)

{'restaurant_name': 'Blue Ribbon Sushi', 'retaurant_address': '119 Sullivan St', 'restaurant_zipcode': '10012', 'restaurant_reviewcount': '1024', 'restaurant_rating': '4.0', 'restaurant_neighobrhood': 'South Village', 'Hygiene_score': 'A', 'price_range': '$31-60', 'Takes Reservations': 'No', 'Delivery': 'No', 'Take-out': 'Yes', 'Accepts Credit Cards': 'Yes', 'Accepts Apple Pay': 'No', 'Good For': 'Dinner', 'Parking': 'Street', 'Bike Parking': 'Yes', 'Wheelchair Accessible': 'No', 'Good for Kids': 'No', 'Good for Groups': 'Yes', 'Attire': 'Casual', 'Noise Level': 'Average', 'Alcohol': 'Full Bar', 'Outdoor Seating': 'No', 'Wi-Fi': 'No', 'Has TV': 'No', 'Waiter Service': 'Yes', 'Caters': 'No', 'Gender Neutral Restrooms': 'Yes', 'Category': 'Sushi Bars; Japanese; '}
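
The `Category` value in the output above is a single string in which every category is followed by `'; '`. For later analysis it may be more convenient as a list; the sketch below shows one way to split it. The helper name `split_categories` is hypothetical and not part of the scraping code.

```python
def split_categories(category):
    """Split a 'A; B; ' style string into a clean list of category names."""
    # Empty pieces (from the trailing '; ') are dropped
    return [part.strip() for part in category.split(';') if part.strip()]


print(split_categories('Sushi Bars; Japanese; '))
```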


In [5]:
# Return the business-page URLs found on one page of Yelp search results
# term: search keyword (change it to search for another cuisine type)
# location: neighborhood to search in
# num: zero-based index of the results page (10 results per page)
def get_urls_from_search(term, location, num):

    term = term.replace(' ','+')
    location = location.replace(' ','+')
    query = 'https://www.yelp.com/search?find_desc='+term+'&find_loc='+location+'&start='+str(num*10)
    with urllib.request.urlopen(query) as url:
        contents = url.read()

    soup = BeautifulSoup(contents, "html.parser")

    business_url = []
    for result in soup.find_all('a',{'class':'biz-name js-analytics-click'}):
        business_url.append("http://www.yelp.com" + result['href'])
    return business_url
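
The query string above is assembled by hand, which leaves characters such as the apostrophe in `Hell's_Kitchen` unescaped. A minimal sketch of the same URL built with the standard library's `urllib.parse.urlencode` follows; `build_search_url` is a hypothetical helper, not the function used for the results in this notebook.

```python
from urllib.parse import urlencode


def build_search_url(term, location, num):
    """Build a Yelp search URL for results page `num` (10 results per page)."""
    params = {'find_desc': term, 'find_loc': location, 'start': num * 10}
    # urlencode turns spaces into '+' and percent-escapes other special characters
    return 'https://www.yelp.com/search?' + urlencode(params)


print(build_search_url('Russian restaurants', "Hell's_Kitchen", 1))
```

For terms and locations that contain only letters, spaces and underscores, this produces the same query as the manual string concatenation.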

In [6]:
# List of all locations
# 'Alphabet_City','Battery_Park','Chelsea','Chinatown','Civic_Center','East_Harlem','East_Village','Financial_District','Flatiron','Gramercy','Greenwich_Village','Harlem','Hell\'s_Kitchen','Inwood','Kips_Bay','Koreatown','Little_Italy','Lower_East_Side','Manhattan_Valley','Marble_Hill','Meatpacking_District','Midtown_East','Midtown_West','Morningside_Heights','Murray_Hill','NoHo','Nolita','Roosevelt_Island','SoHo','South_Street_Seaport','South_Village','Stuyvesant_Town','Theater_District','TriBeCa','Two_Bridges','Union_Square','Upper_East_Side','Upper_West_Side','Washington_Heights','West_Village', 'Yorkville'

# Yorkville entirely encompassed by Upper East Side
# Theater District entirely encompassed by Midtown West
searchLocations = ['Alphabet_City','Battery_Park','Chelsea','Chinatown','Civic_Center','East_Harlem','East_Village','Financial_District','Flatiron','Gramercy','Greenwich_Village','Harlem','Hell\'s_Kitchen','Inwood','Kips_Bay','Koreatown','Little_Italy','Lower_East_Side','Manhattan_Valley','Marble_Hill','Meatpacking_District','Midtown_East','Midtown_West','Morningside_Heights','Murray_Hill','NoHo','Nolita','Roosevelt_Island','SoHo','South_Street_Seaport','South_Village','Stuyvesant_Town','TriBeCa','Two_Bridges','Union_Square','Upper_East_Side','Upper_West_Side','Washington_Heights','West_Village']

**The following cell may take about 20 minutes to run**

In [7]:
# Get NYC Russian restaurant urls
max_num = 5
urls_set = []
for loc in searchLocations:
    # for a fixed location and food type, collect the urls page by page;
    # pages are ordered by Yelp's relevance ranking
    for num in range(0,max_num):
        urls = get_urls_from_search("Russian restaurants",loc, num)
        urls = urls[1:] # 0th link is irrelevant
        # len(urls)=0 if the starting page number exceeds the maximum possible
        if len(urls) == 0:
            break
        # range(0,len(urls)-1) also leaves out the last link on the page
        for j in range(0,len(urls)-1):
            urls_set.append(urls[j])

    # Delay between locations to reduce the query rate and the possibility of an IP ban
    time.sleep(5)

# check urls_set length
print(len(urls_set))
urls_set

549


['http://www.yelp.com/biz/mari-vanna-new-york-2?osq=Russian+restaurants',
 'http://www.yelp.com/biz/veselka-new-york?osq=Russian+restaurants',
 'http://www.yelp.com/biz/anyway-cafe-new-york?osq=Russian+restaurants',
 'http://www.yelp.com/biz/streecha-new-york?osq=Russian+restaurants',
 'http://www.yelp.com/biz/odessa-restaurant-new-york-129?osq=Russian+restaurants',
 'http://www.yelp.com/biz/onegin-new-york-4?osq=Russian+restaurants',
 'http://www.yelp.com/biz/babushkas-russian-dumplings-new-york?osq=Russian+restaurants',
 'http://www.yelp.com/biz/ukrainian-east-village-new-york?osq=Russian+restaurants',
 'http://www.yelp.com/biz/kafana-new-york?osq=Russian+restaurants',
 'http://www.yelp.com/biz/antons-dumplings-brooklyn?osq=Russian+restaurants',
 'http://www.yelp.com/biz/old-tbilisi-garden-new-york-5?osq=Russian+restaurants',
 'http://www.yelp.com/biz/klimat-lounge-new-york?osq=Russian+restaurants',
 'http://www.yelp.com/biz/edi-and-the-wolf-new-york?osq=Russian+restaurants',
 'http:

In [8]:
pd_urls_set = pd.DataFrame(urls_set)
# drop duplicates so that each restaurant page is scraped only once
unique_rows = pd_urls_set.drop_duplicates().values.tolist()
urls = [row[0] for row in unique_rows]
# check the number of restaurant pages we need to scrape
print(len(urls))

74
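
The same restaurant appears under several neighborhoods, which is why 549 collected links reduce to 74 unique pages. As an alternative to the pandas round trip, the sketch below removes duplicates with a plain dictionary while keeping the first occurrence of each URL. `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


sample = [
    'http://www.yelp.com/biz/veselka-new-york?osq=Russian+restaurants',
    'http://www.yelp.com/biz/streecha-new-york?osq=Russian+restaurants',
    'http://www.yelp.com/biz/veselka-new-york?osq=Russian+restaurants',
]
print(len(unique_in_order(sample)))
```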


**It takes the scrape function about 20-30 seconds to scrape each webpage.**

**So the following cell takes at least $20 \times$ len(urls) seconds to run.**

For example, with 1000 URLs the following code takes about 6 hours to run.
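
The estimate can be reproduced with a little arithmetic. The sketch below assumes the lower bound of 20 seconds per page plus the 2-second pause from `time.sleep(2)` in the scraping loop; `estimated_hours` is a hypothetical helper, and real run times depend on the network and on Yelp's response times.

```python
def estimated_hours(n_urls, seconds_per_page=20, pause_seconds=2):
    """Rough running time in hours for scraping n_urls pages one by one."""
    return n_urls * (seconds_per_page + pause_seconds) / 3600


# 1000 pages: 1000 * 22 s = 22000 s
print(round(estimated_hours(1000), 1))
# the 74 unique Russian restaurant pages: 74 * 22 s = 1628 s
print(round(estimated_hours(74) * 60))
```

Under these assumptions, 1000 URLs take about 6.1 hours and the 74 URLs above take about 27 minutes.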

In [9]:
# header contains the info we want to scrape
header=['restaurant_name', 'retaurant_address', 'restaurant_zipcode', 'restaurant_reviewcount', 'restaurant_rating', 'restaurant_neighobrhood', 'Hygiene_score', 'price_range', 'Liked by Vegetarians', 'Takes Reservations', 'Delivery', 'Take-out', 'Accepts Credit Cards', 'Accepts Bitcoin', 'Parking', 'Bike Parking', 'Wheelchair Accessible', 'Good for Kids', 'Good for Groups', 'Attire', 'Noise Level', 'Alcohol', 'Happy Hour', 'Outdoor Seating', 'Wi-Fi', 'Has TV', 'Dogs Allowed', 'Waiter Service', 'Caters', 'Category', 'Has Soy-free Options', 'Has Dairy-free Options', 'Liked by Vegans', 'Has Gluten-free Options', 'Good For', 'Ambience', 'Gender Neutral Restrooms']
info={}
for u in urls:
    url_dict=scrape(u)
    for i in header:
        # fields missing from a page are recorded as 'NA'
        info.setdefault(i,[]).append(url_dict.get(i,'NA'))
    time.sleep(2)
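
Because the loop above keeps everything in memory until the end, an IP ban halfway through loses all rows collected so far. One possible safeguard, sketched below, is to write each row to the CSV file as soon as it is scraped. This is only an illustration under assumptions: `scrape_to_csv` is a hypothetical helper, `scrape_fn` stands for a function like `scrape`, and `pause` stands for a delay such as `lambda: time.sleep(2)`.

```python
import csv


def scrape_to_csv(urls, header, path, scrape_fn, pause=lambda: None):
    """Scrape each URL and write one row per page to `path` as we go.

    Returns the number of pages written before the first failure.
    """
    done = 0
    with open(path, 'w', newline='', encoding='utf-8') as f:
        # restval fills fields missing from a page; extra keys are ignored
        writer = csv.DictWriter(f, fieldnames=header, restval='NA',
                                extrasaction='ignore')
        writer.writeheader()
        for u in urls:
            try:
                row = scrape_fn(u)
            except Exception as exc:
                # stop at the first failure; rows written so far stay on disk
                print(f"stopped at {u}: {exc}")
                break
            writer.writerow(row)
            f.flush()
            done += 1
            pause()
    return done
```

A call such as `scrape_to_csv(urls, header, 'Russian_Restaurant.csv', scrape, pause=lambda: time.sleep(2))` would then leave a partial but usable file if the run is interrupted.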

In [10]:
df = pd.DataFrame(info)
df.to_csv('Russian_Restaurant.csv') 

In [13]:
df.head()

| Unnamed: 0 | Accepts Bitcoin | Accepts Credit Cards | Alcohol | Ambience | Attire | Bike Parking | Category | Caters | Delivery | Dogs Allowed | ... | Waiter Service | Wheelchair Accessible | Wi-Fi | price_range | restaurant_name | restaurant_neighobrhood | restaurant_rating | restaurant_reviewcount | restaurant_zipcode | retaurant_address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | Yes | Full Bar | | Casual | No | Russian; | Yes | Yes | | ... | | | Free | $31-60 | Mari Vanna | Flatiron | 4.0 | 587 | 10003 | 41 E 20th St |
| 1 | No | Yes | Beer & Wine Only | Casual | | Yes | Diners; Ukrainian; | Yes | Yes | No | ... | Yes | Yes | Free | $11-30 | Veselka | East Village | 4.0 | 2288 | 10003 | 144 2nd Ave |
| 2 | | Yes | Full Bar | Intimate | Casual | Yes | Russian; French; | No | No | | ... | | | No | $11-30 | Anyway Cafe | East Village | 4.0 | 198 | 10003 | 32 E 2nd St |
| 3 | | No | No | Casual | Casual | Yes | Ukrainian; | No | No | | ... | | | No | Under $10 | Streecha | East Village | 4.5 | 113 | 10003 | 33 E 7th St |
| 4 | | Yes | Full Bar | Casual | | Yes | Diners; Ukrainian; Greek; | No | Yes | | ... | | | No | Under $10 | Odessa Restaurant | East Village, Alphabet City | 3.5 | 326 | 10009 | 119 Ave A |


On our local machines, we generate one CSV file per cuisine type by replacing the term argument of the "get_urls_from_search" function with each of:
Chinese, Korean, American, Indian, Japanese, Spanish, French, Italian, Greek, Thai, Mexican, Vietnamese, German, Iranian, Russian, and Turkish.

**Caveat:** Calls to both "get_urls_from_search" and "scrape" can easily trigger an IP ban from Yelp, so we had to change IP addresses repeatedly to collect all of the cuisine types above, which is very time consuming.
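
Changing IP addresses is outside the scope of this notebook's code, but the request rate is something the code can control. The sketch below retries a failed request with exponentially growing, randomized delays. It is an illustration only: `fetch_with_backoff` is a hypothetical helper, `fetch` stands for any function that downloads a URL, and there is no guarantee that slower requests avoid a ban.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_tries=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url); on failure wait longer each time, then retry.

    The wait before retry k is base_delay * 2**k seconds plus up to
    one second of random jitter. The last failure is re-raised.
    """
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except OSError:
            # urllib.error.URLError and HTTPError are subclasses of OSError
            if attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```

With the defaults, the waits are roughly 5, 10 and 20 seconds before the fourth and final attempt. The `sleep` parameter exists so that the delays can be replaced in a dry run.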