# Webscraping restaurant listings from Yelp

This project is a part of the [Zero to Data Analyst Bootcamp by Jovian](https://www.jovian.ai/data-analyst-bootcamp)
![](https://imgur.com/YKrRS4U.png)


Yelp is a directory listing service headquartered in San Francisco, CA. Its unique selling point (USP) is user reviews, recommendations and ratings that help people find nearby restaurants, shopping, nightlife, food, entertainment, things to do and other services.

**Problem Statement**: In this notebook we will write Python functions that create a CSV file containing the following details of
    restaurants listed for the city of New York, USA on www.yelp.com.

        Name: Name of the restaurant
        Cuisine: Type of food
        Stars: Rating based on user inputs for this restaurant
        Reviews: Number of users who rated this restaurant
        Address: Address of the restaurant
        Contact: phone number
        Website: Yelp url for the restaurant

**Why is this data important?**

Restaurant listing services are extremely helpful whether you are a foodie looking to explore world cuisine, a traveller in the city looking for the comforts of home, or someone celebrating an occasion with friends and family.
    
But did you know?

- The restaurant business in the US recorded sales of \$863 billion in 2019 (\$659 billion in 2020 due to the pandemic) and employs 12.5 million people. This is about 4% of the US GDP.

- 1 in 10 Americans are employed in the restaurant industry. Nearly one in three Americans had their first job at a restaurant.

- Consumers spent 49% of their food budget in restaurants before the pandemic in 2019.


**Who will find this data useful?**

 _1) Diners/Customers:_ Find the best restaurant in an area or for a given cuisine based on user reviews, and compare prices

 _2) Restaurants:_ Use the data for various kinds of decision making.
 
   - customer profiling, site selection, forecasting, customer relationship management, menu design            
   - improve the brand, evaluate whether to open new branches, offer franchises
   - understand changing customer behaviour post pandemic
   - upgrade to newer technology to improve social engagement, integration of tech into delivery service

_3) Startups in the restaurant space:_ Research trends to select suitable location, cuisine, menu design etc.

_4) Investors:_ 

   - Identify upstream and downstream businesses, from restaurant aggregation to food delivery, e.g., Zomato, Swiggy
   - Evaluate brand and social impact of the business, analyse competition
   - New investing opportunities based on trends


## How to run the code

- Click the "Run" button at the top of this page and select Binder to run the code.

- If you want to make changes, duplicate this notebook, then edit and save it on [Jovian](https://jovian.ai).
  Run the following commands in your Jupyter notebook to save your changes.

In [None]:
project_name='restaurant-listings-notebook'

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
jovian.commit(project=project_name, privacy='secret')

[jovian] Attempting to save notebook..
[jovian] Updating notebook "anushree-k/restaurant-listings-notebook" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/anushree-k/restaurant-listings-notebook


'https://jovian.ai/anushree-k/restaurant-listings-notebook'

## How to read this notebook?

 This notebook is organised as follows:
 - **Section 1** : Import libraries
 - **Section 2** : Helper functions
 - **Section 3** : Exploratory code to identify relevant tags and attributes
 - **Section 4** : Main function which does the webscraping and writes the output to a CSV file.

### Section 1: Import libraries

In [None]:
!pip install jovian requests beautifulsoup4 --upgrade --quiet

In [None]:
import jovian
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import time

### Section 2: Helper functions

This section was built up gradually while exploring the BeautifulSoup object in Section 3.

In [None]:
# Helper functions 

def get_topic_page(url):
    """Read a URL and get the HTML page contents as text from the response object.
    Output: If successful returns the HTML text of the URL else raises an exception"""
    
    #Use the url argument rather than the global topic_url
    response = requests.get(url)
    
    if response.status_code != 200:
        print("Status code:", response.status_code)
        raise Exception("Failed to fetch web page " + url)
    
    return response.text

def create_file(file_name, file_contents):
    """Create a HTML file of name file_name with the file_contents.
    Output: Creates a HTML file with the file_name which includes path"""
    
    with open(file_name, 'w', encoding = 'utf-8') as file:
        file.write(file_contents)
    return

def read_file(file_name):
    """Read a file (HTML in this case) and return the contents.
    Output: Returns contents of the HTML file, or an empty string if the file is missing"""
    
    try:
        #Read with the same encoding that create_file uses for writing
        with open(file_name, "r", encoding = 'utf-8') as file:
            page_contents = file.read()
    except FileNotFoundError as fnf_error:
        page_contents = ""
        print(fnf_error)
        
    return page_contents

#Helper functions

def write_csv(items, path):
    """Write the contents of items, which is a list of dictionaries, into a CSV file.
    Output: Creates a CSV file in the path specified"""
    
    #Raise an error before creating the file if there are no items
    if len(items) == 0:
        raise Exception("Restaurant list is empty")
    
    with open(path, 'w', encoding = 'utf-8') as f:
        
        #Write headers
        headers = list(items[0].keys())
        f.write(",".join(headers) +'\n')
        
        #Write one item per line
        for item in items:
            values = []
            for header in headers:
                #Write "" if there is no value. Double quote each field so that commas
                #inside a value stay in one column, doubling any embedded quotes
                value = str(item.get(header,"")).replace('"', '""')
                values.append('"'+ value +'"') 
            f.write(",".join(values)+'\n')
    return

def parse_rating(rating_tag):
    """Clean up the rating_tag contents to extract the rating.
    Output: Returns rating as a string"""
    
    rating = rating_tag['aria-label']
    bad_chars = ['star rating']
    for char in bad_chars:
        rating = rating.replace(char,"").strip()
        
    return rating

def is_possible_phonenumber(phone_str):
    """Check if the string contains a valid phone number. Simple validations for now
    Output: Boolean True or False"""
    
    bad_chars = ['(',")","-"," "]
    for bad_char in bad_chars:
        phone_str =  phone_str.replace(bad_char, "")
    try:
        #int() raises ValueError if the remaining text is not a whole number
        int(phone_str)
        return True
    except ValueError:
        return False

def parse_contacts(contact_details):
    """Extracts phone and address from the list of strings in contact details.
    Output: Dictionary with phone and address"""
   
    
    phone = ""
    address = ""

    for item in contact_details:
        if is_possible_phonenumber(item):
            phone = item
            break

    #Everything other than the phone number is treated as part of the address
    address = ",".join(item for item in contact_details if item != phone)

    return {'phone': phone,
            'address': address}
       

def parse_cuisine(cuisine_tag):
    """Clean up the cuisine_tag contents to extract the different cuisines.
    Output: Returns cuisine as a single string"""
    
    cuisine = [link.text for link in cuisine_tag]
    
    return ",".join(cuisine)


def parse_restaurant(business_tag):
    """For each chosen unique tag that contains all the details about a restaurant, extract name, rating, 
    reviews, cuisine address, phone number.
    
    Output: Returns the details as a dictionary"""
    #Default to empty strings so the returned dictionary is complete even if parsing fails part-way
    restaurant_name = rating = num_reviews = cuisine = phone = address = yelp_link = ""
    try:
        restaurant_name = business_tag.find("h4", class_="css-1l5lt1i").find("a").text

        rating_tag = business_tag.find("div", class_="i-stars__09f24__1T6rz")
        rating = parse_rating(rating_tag)

        num_reviews = business_tag.find("span", class_="reviewCount__09f24__EUXPN css-e81eai").text

        cuisine_tag = business_tag.find_all("a", class_ ="css-1joxor6")
        cuisine = parse_cuisine(cuisine_tag)

        contact = [contact_details.text for contact_details in business_tag.find_all("p", class_="css-8jxw1i")]
        parsed_contacts = parse_contacts(contact)
        phone = parsed_contacts['phone']
        address = parsed_contacts['address']

        yelp_link_tag = business_tag.find("a", class_="css-166la90")
        yelp_link = base_url + yelp_link_tag['href']
        
    except Exception as error:
        #Convert the exception to a string before concatenating it with the message
        print("Error in method parse_restaurant\n" + str(error))
        
    return {"restaurant": restaurant_name,
            "rating": rating,
            "reviews": num_reviews,
            "cuisine": cuisine,
            "phone": phone,
            "address": address,
            "yelp_link": yelp_link}

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project= project_name)

### Section 3: Exploratory code to identify relevant tags and attributes

Let us first visualise the output we are aiming for, shown below in Google Sheets.

![](https://imgur.com/2HTmkQh.png)

Let us now explore the response text using BeautifulSoup search and filter functions to identify the unique tags and attributes which will give us the details we need.

In [None]:
#Identify the main website to be scraped, the specific city - New York and store the relevant details
base_url = "https://www.yelp.com"
rel_url = "/search?cflt=restaurants&find_loc=New+York%2C+NY"

topic_url = base_url+rel_url
topic_url

'https://www.yelp.com/search?cflt=restaurants&find_loc=New+York%2C+NY'

In [None]:
#Identify city. Helpful for future use if we want to search for listings in different cities
pos = rel_url.find("find_loc=")
rel_url[pos+len("find_loc="):]
city = "New_York"


#Choose a sample url to start exploring relevant tags
topic_url = "https://www.yelp.com/search?cflt=restaurants&find_loc=New%20York%2C%20NY&start=10"
file_name = city+"_restaurants_page_test.html"
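The cell above locates `find_loc=` in the URL but sets `city` by hand. As a rough sketch of how the label could be derived from the URL itself, the standard library's `urllib.parse` can decode the query string. The helper name `city_from_url` is an assumption made for illustration and is not part of the notebook.

```python
from urllib.parse import urlsplit, parse_qs

def city_from_url(url):
    """Return a file-name-friendly city label from a Yelp search URL (hypothetical helper)."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs decodes '+' and '%20' to spaces and '%2C' to a comma
    location = query.get("find_loc", [""])[0]     # e.g. "New York, NY"
    city_name = location.split(",")[0].strip()    # e.g. "New York"
    return city_name.replace(" ", "_")

print(city_from_url("https://www.yelp.com/search?cflt=restaurants&find_loc=New+York%2C+NY"))
print(city_from_url("https://www.yelp.com/search?cflt=restaurants&find_loc=New%20York%2C%20NY&start=10"))
```

Both calls print `New_York`, so the same helper works for either URL encoding used in this notebook.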


In [None]:
#Store the HTML file offline and read the contents
create_file(file_name, get_topic_page(topic_url))
page_contents = read_file(file_name)
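Saving each page to disk also makes it possible to skip the download on later runs. The sketch below shows one way this could look, assuming a hypothetical `load_page` helper that takes the fetch function as an argument. A stub stands in for the real download so the example runs offline.

```python
import tempfile
from pathlib import Path

def load_page(file_path, url, fetch):
    """Return cached HTML if file_path exists; otherwise fetch, save and return it."""
    path = Path(file_path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html

calls = []

def fake_fetch(url):
    # Stand-in for a real download; records each call so we can count them
    calls.append(url)
    return "<html><body>stub page</body></html>"

sample_url = "https://www.yelp.com/search?cflt=restaurants&find_loc=New+York%2C+NY"
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "New_York_restaurants_page_test.html"
    first = load_page(cache_file, sample_url, fake_fetch)   # downloads and saves
    second = load_page(cache_file, sample_url, fake_fetch)  # served from the file

print(first == second, len(calls))
```

In the notebook, `get_topic_page` could be passed as `fetch`, so that repeated exploration does not hit the website again.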

In [None]:
#Convert HTML contents to a BeautifulSoup object, naming the parser explicitly
soup_doc = BeautifulSoup(page_contents, 'html.parser')

Web browsers such as Chrome provide an `Inspect` tool which helps us navigate through the HTML of the page that we are interested in.

Let us do the following:
- Open the topic_url in the browser to load the webpage
- Place your cursor over the area you are interested in
- Right click and select Inspect

![](https://i.imgur.com/cycc84b.png)

We can now explore the HTML code to visually inspect the tags and attributes that we may find interesting. Let us start with the `<h4>` tag and then work our way up gradually to find the tag that holds all the details that we need.

![](https://i.imgur.com/mXikE6v.png)

After a bit of trial and error, starting from the `h4` tag as the unique tag for each restaurant and working up through its parents, we arrive at `"div", class_="scrollablePhotos__09f24__1PpB8"`.

In [None]:
#h4_tag = soup_doc.find_all("h4")
#main_tag = h4_tag[0].find_parent().find_parent()
#div_tags = main_tag.find_all("div")
#len(div_tags)
#for div in div_tags:
#    print(div.attrs)

In [None]:
super_tags = soup_doc.find_all("div", class_="scrollablePhotos__09f24__1PpB8")
len(super_tags)

10

`<div>` tag with `class = "scrollablePhotos__09f24__1PpB8"` seems unique enough, since it matches 10 times and each results page lists 10 restaurants. Let us now try to find the relevant tags to get the complete details for the restaurant.

In [None]:
super_tag_0 = super_tags[0]

In [None]:
business_name_tag = super_tag_0.find("h4", class_="css-1l5lt1i").find("a")

In [None]:
restaurant_name = business_name_tag.text
restaurant_name

'La Contenta'

In [None]:
rating = super_tag_0.find("div", class_="i-stars__09f24__1T6rz")["aria-label"]
rating

'4.5 star rating'

In [None]:
bad_chars = ['star rating']
for char in bad_chars:
    rating = rating.replace(char,"").strip()
rating

'4.5'
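Stripping the fixed text `star rating` works for the label seen here. If the wording of the label ever varies, a regular expression that pulls out the first number may be more forgiving. This is only a sketch: `extract_rating` is a hypothetical name, and returning `None` for labels without a number is an assumption.

```python
import re

def extract_rating(aria_label):
    """Return the first number in the label as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", aria_label)
    return float(match.group()) if match else None

print(extract_rating("4.5 star rating"))  # 4.5
print(extract_rating("4 star rating"))    # 4.0
print(extract_rating("no rating yet"))    # None
```

Returning a float rather than a string makes later numeric work, such as sorting or averaging, a little easier.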

In [None]:
num_reviews = super_tag_0.find("span", class_="reviewCount__09f24__EUXPN css-e81eai").text
num_reviews

'777'

In [None]:
cuisine_tag = super_tag_0.find_all("a", class_ ="css-1joxor6")
cuisine = [link.text for link in cuisine_tag]
cuisine

['Mexican', 'Bars', 'Breakfast & Brunch']

In [None]:
address_tag = super_tag_0.find("div", class_="secondaryAttributes__09f24__3db5x")
contact = [contact_details.text for contact_details in super_tag_0.find_all("p", class_="css-8jxw1i")]
phone = contact[0]
address = ",".join(contact[1:])
print("Phone:", phone)
print("Address: ", address)

Phone: (212) 432-4180
Address:  102 Norfolk St,Lower East Side


In [None]:
sample_contact = [ "(212) 432-4180",
            "102 Norfolk St",
            "Lower East Side"]

In [None]:
phone = ""
address = ""

for item in sample_contact:
    if(is_possible_phonenumber(item)):
        phone = item
        address = ",".join(sample_contact[1:])
                     
if not phone:
    address = ",".join(sample_contact) 

print(phone)
print(address)

(212) 432-4180
102 Norfolk St,Lower East Side


In [None]:
parsed_contacts = parse_contacts(sample_contact)
parsed_contacts['phone']

'(212) 432-4180'
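`is_possible_phonenumber` accepts any string that is a whole number once brackets, dashes and spaces are removed, so a bare number such as a ZIP code would also pass. A stricter check could match the phone format seen in the sample. The sketch below assumes that numbers look like `(212) 432-4180`. The names `US_PHONE` and `split_contacts` are hypothetical. It also keeps every non-phone item in the address, wherever the phone number appears in the list.

```python
import re

# Assumption: phone numbers are formatted like "(212) 432-4180"
US_PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

def split_contacts(contact_details):
    """Separate the first phone-like item from the remaining address parts."""
    phone = ""
    address_parts = []
    for item in contact_details:
        if not phone and US_PHONE.fullmatch(item.strip()):
            phone = item.strip()
        else:
            address_parts.append(item)
    return {"phone": phone, "address": ",".join(address_parts)}

print(split_contacts(["(212) 432-4180", "102 Norfolk St", "Lower East Side"]))
print(split_contacts(["424 E 9th St", "East Village"]))
print(split_contacts(["10003", "(212) 432-4180"]))
```

The last call shows a digits-only item staying in the address instead of being mistaken for a phone number.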

In [None]:
yelp_link_tag = super_tag_0.find("a", class_="css-166la90")
yelp_link = base_url + yelp_link_tag['href']
yelp_link

'https://www.yelp.com/biz/la-contenta-new-york'

In [None]:
print("Restaurant name: ", restaurant_name)
print("Rating: ", rating)
print("Reviews: ", num_reviews)
print("Cuisine: ", ",".join(cuisine))
print("Phone: ", phone)
print("Address: ", address)
print("Yelp link: ", yelp_link)

Restaurant name:  La Contenta
Rating:  4.5
Reviews:  777
Cuisine:  Mexican,Bars,Breakfast & Brunch
Phone:  (212) 432-4180
Address:  102 Norfolk St,Lower East Side
Yelp link:  https://www.yelp.com/biz/la-contenta-new-york


In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project= project_name)

### Section 4: Main function which scrapes yelp.com for restaurants in NY, USA and writes the output to a CSV file.

Now let us write the main function with all the features that we have extracted so far.

**Steps we'll follow:**
1. Identify the webpage to be scraped
2. Download the webpage using `requests` and save it into an HTML file
3. Parse the HTML code using BeautifulSoup
4. Compile the extracted details into Python lists and dictionaries using `parse_restaurant()`
5. Introduce a time delay of 1 second between page requests for polite scraping
6. Extract and combine data for multiple pages using a `for` loop
7. Write the combined data into a CSV file
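The steps above can be sketched as a small loop that builds one URL per results page and pauses between requests. This is an outline under stated assumptions: `page_urls` and `crawl` are hypothetical names, and the fetch, parse and sleep functions are passed in so the example runs offline and without waiting.

```python
import time

SEARCH_URL = "https://www.yelp.com/search?cflt=restaurants&find_loc=New%20York%2C%20NY&start="

def page_urls(num_pages, page_size=10):
    """Build one search URL per results page; `start` is treated as a result offset."""
    return [SEARCH_URL + str(page * page_size) for page in range(num_pages)]

def crawl(urls, fetch, parse, delay_seconds=1.0, sleep=time.sleep):
    """Fetch and parse each URL in turn, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # polite delay before every request after the first
        results.extend(parse(fetch(url)))
    return results

# Stand-ins for downloading and parsing; the pauses are recorded instead of slept
urls = page_urls(3)
pauses = []
records = crawl(urls,
                fetch=lambda url: url,
                parse=lambda page: [{"source": page}],
                sleep=pauses.append)

print(urls[-1])
print(len(records), pauses)
```

In the notebook, `get_topic_page` could serve as `fetch`, and `parse` could be a function that builds the BeautifulSoup object and calls `parse_restaurant()` on each business tag.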

In [None]:
#Note: introduce a delay between each page request call for polite crawling


restaurant_list = []
base_url = "https://www.yelp.com"
search_url = "https://www.yelp.com/search?cflt=restaurants&find_loc=New%20York%2C%20NY&start="

#Loop for at least 10 search page results
try:
    
    #for i in range(0,24):
    #Using just a single page for trial runs
    for i in range(0,1):
        topic_url = search_url+str((i)*10)
        file_name = city+"_restaurants_page_"+ str((i))+".html"
        create_file(file_name, get_topic_page(topic_url))
        page_contents = read_file(file_name)
        soup_doc = BeautifulSoup(page_contents, 'html.parser')
        business_tags = soup_doc.find_all("div", class_="scrollablePhotos__09f24__1PpB8")
        restaurant_list.extend([parse_restaurant(business_tag) for business_tag in business_tags])
        time.sleep(1) 
except Exception as error:
    print(error)

In [None]:
write_csv(restaurant_list, "Yelp_NY_restaurant_list.csv")
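The `write_csv` helper builds each CSV line by hand. As a possible alternative, the standard library's `csv` module can take care of quoting, including values that contain commas or double quotes. The sketch below is illustrative: `write_csv_with_module` is a hypothetical name, and the second sample row is made up to show the quoting.

```python
import csv
import io

def write_csv_with_module(items, file_obj):
    """Write a list of dictionaries to an open text file as CSV."""
    if not items:
        raise ValueError("Restaurant list is empty")
    headers = list(items[0].keys())
    # QUOTE_ALL quotes every field; embedded double quotes are doubled automatically
    writer = csv.DictWriter(file_obj, fieldnames=headers, restval="",
                            quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)

sample_rows = [
    {"restaurant": "La Contenta", "cuisine": "Mexican,Bars,Breakfast & Brunch", "rating": "4.5"},
    {"restaurant": 'Joe\'s "Famous" Diner', "cuisine": "Diners", "rating": "4"},  # made-up row
]

buffer = io.StringIO()  # in-memory file so the example leaves nothing on disk
write_csv_with_module(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same function could be called with a file opened as `open(path, "w", newline="", encoding="utf-8")`.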

In [None]:
pd.DataFrame(restaurant_list)

Unnamed: 0,restaurant,rating,reviews,cuisine,phone,address,yelp_link
0,The Cabin NYC,4,275,"American (New),Cocktail Bars,Breakfast & Brunch",(212) 777-0454,"205 E 4th St,East Village",https://www.yelp.com/biz/the-cabin-nyc-new-york-2
1,The Osprey,4,231,American (New),(929) 414-0012,60 Furman St,https://www.yelp.com/biz/the-osprey-brooklyn
2,Thursday Kitchen,4.5,1385,"Korean,American (New),Tapas/Small Plates",,"424 E 9th St,East Village",https://www.yelp.com/biz/thursday-kitchen-new-...
3,Cecconi’s Dumbo,3.5,766,Italian,(718) 650-3900,"55 Water St,DUMBO",https://www.yelp.com/biz/cecconis-dumbo-brooklyn
4,Amélie,4.5,2718,"French,Wine Bars",(212) 533-2962,"22 W 8th St,Greenwich Village",https://www.yelp.com/biz/am%C3%A9lie-new-york
...,...,...,...,...,...,...,...
235,MONOMONO,4,493,"Korean,Cocktail Bars",(917) 285-5034,"116 E 4th St,East Village",https://www.yelp.com/biz/monomono-new-york-2
236,Juicy King Crab Express,4.5,7,"Seafood,Cajun/Creole",(917) 639-3088,"213 East Broadway,Lower East Side",https://www.yelp.com/biz/juicy-king-crab-expre...
237,The Fulton,3.5,209,"Seafood,Pasta Shops,Sandwiches",(212) 838-1200,"89 S St,South Street Seaport",https://www.yelp.com/biz/the-fulton-new-york-2
238,Da Andrea,4,819,Tuscan,(212) 367-1979,"35 W 13th St,Greenwich Village",https://www.yelp.com/biz/da-andrea-new-york


In [None]:
# Execute this to save the notebook and the CSV file that we have compiled using webscraping
jovian.commit(project= project_name, out="Yelp_NY_restaurant_list.csv")

[jovian] Attempting to save notebook..


## Summary

The project of extracting the list of restaurants and writing the contents into a CSV file is now complete.
We used Python, Requests and BeautifulSoup to extract the information that we required.

The CSV file contains **7 columns and 240 rows of data** from **24 pages of HTML** from www.yelp.com

        1. Name: Name of the restaurant
        2. Cuisine: Type of food
        3. Stars: Rating based on user inputs for this restaurant
        4. Reviews: Number of users who rated this restaurant
        5. Address: Address of the restaurant
        6. Contact: phone number
        7. Website: Yelp url for the restaurant

Here is a snapshot of the CSV file

![](https://i.imgur.com/TMsuhNV.png)

### Future Work

1. Extract additional information such as dietary preferences, dining in or takeaway etc.
2. Extract details from individual restaurant listings on www.yelp.com using the URLs collected.
3. Exploratory analysis on the collected data. Visit this [blog post on Zomato data on towardsdatascience](https://towardsdatascience.com/zomato-bangalore-data-analysis-6ee83652890f) for ideas 
4. Launch your Startup with your own product similar to [RestaurantData](https://www.restaurantdata.com/) and many more


### References

1. More on webscraping

    [Jovian tutorial on webscraping](https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis)      
    [Jovian workshop on webscraping](https://www.youtube.com/watch?v=RKsLLG-bzEY)      
    [Requests documentation](https://docs.python-requests.org/en/master/)     
    [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)  
    
    
2. US Restaurant Industry statistics

   [US restaurant industry facts](https://restaurant.org/research/restaurant-statistics/restaurant-industry-facts-at-a-glance)      
   [restaurant industry statistics](https://upserve.com/restaurant-insider/industry-statistics/) 
   
        
3. Online coding forums, tutorials and repositories

    www.stackoverflow.com      
    www.geeksforgeeks.org  
    www.towardsdatascience.com

