# Readme

This script downloads Yelp information about Korean restaurants in ten US cities and compiles the data into a Pandas DataFrame.

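## Prerequisites
- Python 3.x
- Requests library
- Beautiful Soup 4 library
- lxml (the parser the script passes to Beautiful Soup for the search result pages)
- Pandas library
- Pymongo
- json (part of the Python standard library)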

## Usage
1. Clone or download the repository to your local machine.
2. Open the file yelp_korean_restaurants.py in your Python environment.
3. Run the script. It first downloads the Yelp search results for Korean restaurants in each of the ten cities and saves the HTML files to your working directory.
4. The script then extracts information about each restaurant from the downloaded HTML files and compiles the data into a Pandas DataFrame.
5. The script also downloads the Yelp webpage of each restaurant to gather more information.
6. The data is saved in a Pandas DataFrame and printed to the console.

Note that the script includes a time delay of one second between downloading each Yelp search results page and a delay of five seconds between downloading each restaurant's Yelp webpage. You can adjust these delays if necessary.
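
One way to keep both delays adjustable in a single place is sketched below. This is a minimal illustration, not the script's own code: the constant names and the `fetch_all` helper are hypothetical, and `fetch` stands in for whatever function performs the download. Unlike the script, which sleeps after every download, this sketch only pauses between downloads.

```python
import time

# Delay values taken from the note above; the constant names are hypothetical.
SEARCH_PAGE_DELAY = 1      # seconds between search result pages
RESTAURANT_PAGE_DELAY = 5  # seconds between restaurant pages


def fetch_all(urls, fetch, delay, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing `delay` seconds between calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay)  # no pause before the first download
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function and no real waiting
pages = fetch_all(["page_1", "page_2"], str.upper, SEARCH_PAGE_DELAY, sleep=lambda s: None)
print(pages)
```

Passing `sleep` as a parameter makes the helper easy to try out without actually waiting.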

## Acknowledgments
This script was created using Python and the following libraries:

- Requests
- Beautiful Soup 4
- Pandas

In [244]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import pymongo
import json
from pymongo import MongoClient

In [13]:
pd.set_option('display.max_rows', None)


The code below sets up a list of 10 cities and a matching list of URL-encoded city names, in which spaces are replaced with '+' and commas with '%2C'. It then defines the result offsets (0, 10, 20) that select the first three pages of Yelp's search results. Using a nested for loop, it requests each of the three pages for each city with the requests library and writes the HTML content to a file. After each download it prints a confirmation message and sleeps for 1 second before moving on to the next page.
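
The hand-written encoded city names can also be produced programmatically. The sketch below is an illustration under assumptions, not part of the original script: `build_search_url` is a hypothetical helper, and it relies on `urllib.parse.quote_plus` from the standard library, which turns spaces into '+' and commas into '%2C'.

```python
from urllib.parse import quote_plus


def build_search_url(city, start, term="Korean"):
    """Return a Yelp search URL for `term` in `city`, starting at result offset `start`."""
    return (
        "https://www.yelp.com/search"
        f"?find_desc={quote_plus(term)}&find_loc={quote_plus(city)}&start={start}"
    )


# Offsets 0, 10 and 20 correspond to search result pages 1, 2 and 3
for start in (0, 10, 20):
    print(build_search_url("Oakland, CA", start))
```

With this approach a single list of plain city names would be enough, since the encoded form is derived from it.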

In [60]:
# Download the first three pages (30 results) of the "Korean" search for each city.
list_city = ['New York','Los Angeles','Chicago','Oakland,CA','Houston','Philadelphia','Seattle','Atlanta','Dallas','San Jose,CA']
list_city_name = ['New+York','Los+Angeles','Chicago','Oakland%2C+CA','Houston','Philadelphia','Seattle','Atlanta','Dallas','San+Jose%2C+CA']
item = [0,10,20]  # values of the 'start' URL parameter for pages 1, 2 and 3
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'}

for i in range(len(list_city_name)):
    for n in range(1,4):
        url = f'https://www.yelp.com/search?find_desc=Korean&find_loc={list_city_name[i]}&start={item[n-1]}'
        response = requests.get(url, headers=headers)
        webcontent = response.content
        # Save the raw HTML so it can be parsed later without requesting it again
        with open(f'Yelp Korean food in {list_city[i]} page_{n}.htm', 'wb') as f:
            f.write(webcontent)
        print(f"Download Yelp Korean food in {list_city[i]} page_{n}.htm successfully")
        time.sleep(1)


Download Yelp Korean food in New York page_1.htm successfully
Download Yelp Korean food in New York page_2.htm successfully
Download Yelp Korean food in New York page_3.htm successfully
Download Yelp Korean food in Los Angeles page_1.htm successfully
Download Yelp Korean food in Los Angeles page_2.htm successfully
Download Yelp Korean food in Los Angeles page_3.htm successfully
Download Yelp Korean food in Chicago page_1.htm successfully
Download Yelp Korean food in Chicago page_2.htm successfully
Download Yelp Korean food in Chicago page_3.htm successfully
Download Yelp Korean food in Oakland,CA page_1.htm successfully
Download Yelp Korean food in Oakland,CA page_2.htm successfully
Download Yelp Korean food in Oakland,CA page_3.htm successfully
Download Yelp Korean food in Houston page_1.htm successfully
Download Yelp Korean food in Houston page_2.htm successfully
Download Yelp Korean food in Houston page_3.htm successfully
Download Yelp Korean food in Philadelphia page_1.htm successf

The below code sets up empty lists to store the restaurant data scraped from the downloaded Yelp search result pages. It then loops through each city and each page (3 pages * 10 cities = 30 pages) of search results for that city and uses the BeautifulSoup library to extract information about each restaurant listed on the page. The information includes the restaurant name, rank (i.e., position on the search result page), Yelp link, rating, number of reviews, image URL, price range, and location. It stores this information in the corresponding lists. Once all the search result pages for all the cities have been scraped, it combines the information in the lists into a pandas DataFrame.
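
Two of the conversions described above, the rating and the price range, can be isolated into small functions. The sketch below is a simplified illustration: the helper names are hypothetical, and the sample label "4.5 star rating" is an assumed example of an aria-label whose first word is the numeric rating.

```python
def parse_rating(aria_label):
    """Return the leading number of a rating label as a float."""
    return float(aria_label.split(" ")[0])


def price_to_int(price_text):
    """Map a price indicator to 1-3; any other non-empty value maps to 4."""
    if price_text is None:
        return None  # the restaurant has no price indicator
    return {"$": 1, "$$": 2, "$$$": 3}.get(price_text, 4)


print(parse_rating("4.5 star rating"))
print(price_to_int("$$"))
print(price_to_int(None))
```

Keeping the conversions in functions makes them easy to check without a downloaded page.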

In [233]:
name_list = []
rank_list = []
link_list = []
rating_list = []
img_list = []
price_list = []
city_loc_list = []
num_review_list = []
city = []
price_int = []
for i in range(len(list_city)): #len(list_city)
    for n in range(1,4):
        with open(f'Yelp Korean food in {list_city[i]} page_{n}.htm') as f:
            soup = BeautifulSoup(f, 'lxml')
            # Check list items 8 to 18 of the results list for restaurant cards
            for k in range(8,19):
                info = soup.select(f'div > ul > li:nth-child({k}) > div.container__09f24__mpR8_.hoverable__09f24__wQ_on.border-color--default__09f24__NPAKY')
                for restaurant in info:
                    city.append(list_city[i])
                    # Restaurant Name
                    name = restaurant.find('h3',{'class':'css-1agk4wl'}).find("a").text
                    name_list.append(name)
                    
                    # Restaurant Rank
                    rank = restaurant.find('span',{'class': 'css-1egxyvc'}).text.split('.')[0]
                    rank_list.append(rank)
                    
                    # Restaurant Yelp Link
                    link = restaurant.find('h3',{'class':'css-1agk4wl'}).find("a").get("href")
                    link_list.append('https://www.yelp.com'+ link)
                    
                    # Restaurant Rating
                    rating =  restaurant.find('span',{'class':"border-color--default__09f24__NPAKY"}).find('div').get('aria-label')
                    ## Convert the leading number of the label to float for further calculation
                    rating = float(rating.split(" ")[0])
                    rating_list.append(rating)
                        
                    # Number of Reviews
                    review = restaurant.find('span',{'class':'css-chan6m'}).text
                    num_review_list.append(review)
                    
                    # Image of the restaurant
                    img = restaurant.find('img',{'class':'css-xlzvdl'}).get('src')
                    img_list.append(img)
                    
                    # Price of Restaurant
                    price = restaurant.find('span',{'class':'priceRange__09f24__mmOuH css-1s7bx9e'})
                    ## Store the price text and convert it to int for further calculation
                    if price is None:
                        price_list.append(None)
                        price_int.append(None)
                    else:
                        price_list.append(price.text)
                        # "$" -> 1, "$$" -> 2, "$$$" -> 3, anything else -> 4
                        price_int.append({"$": 1, "$$": 2, "$$$": 3}.get(price.text, 4))
                        
                        
                    # City Location
                    ## select_one returns None when nothing matches, instead of raising IndexError like select(...)[0]
                    city_loc = restaurant.select_one('div > div > div > p > span.css-chan6m')
                    
                    if city_loc is None:
                        city_loc_list.append(None)
                    else:
                        city_loc_list.append(city_loc.get_text())


Next, we create a DataFrame from the lists built in the code above for ease of use and access.

In [234]:
df = pd.DataFrame({'City':city,
    'Name': name_list,
    'Rank': rank_list,
    'Link': link_list,
    'Rating': rating_list,
    'Image': img_list,
    'Price Indicator': price_list,
    'Price Indicator number': price_int,
    'Location': city_loc_list,
    'Review Count': num_review_list
})

In [235]:
df['Rank'].value_counts()

1     10
2     10
29    10
28    10
27    10
26    10
25    10
24    10
23    10
22    10
21    10
20    10
19    10
18    10
17    10
16    10
15    10
14    10
13    10
12    10
11    10
10    10
9     10
8     10
7     10
6     10
5     10
4     10
3     10
30    10
Name: Rank, dtype: int64

### Now, we scrape each restaurant's Yelp page by requesting the stored links for 30 x 10 = 300 restaurants

In [228]:
# More information needs to be scraped from each restaurant's Yelp webpage
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'}
for i in range(len(df)):
    response = requests.get(link_list[i], headers=headers)
    webcontent = response.content
    # The file is named after the restaurant, so restaurants that share a name overwrite each other's file
    with open(f'{name_list[i]}.htm', 'wb') as f:
        f.write(webcontent)
    time.sleep(5)
    print(f"Downloaded {name_list[i]}'s yelp webpage successfully. Num:{i+1}")


Downloaded Kuun's yelp webpage successfully. Num:1
Downloaded Cho Dang Gol Korean Restaurant's yelp webpage successfully. Num:2
Downloaded Her Name Is Han's yelp webpage successfully. Num:3
Downloaded Thursday Kitchen's yelp webpage successfully. Num:4
Downloaded ARIARI's yelp webpage successfully. Num:5
Downloaded Barn Joo's yelp webpage successfully. Num:6
Downloaded BCD Tofu House's yelp webpage successfully. Num:7
Downloaded Sobak's yelp webpage successfully. Num:8
Downloaded Tofu Tofu's yelp webpage successfully. Num:9
Downloaded Woori Korean's yelp webpage successfully. Num:10
Downloaded Atti's yelp webpage successfully. Num:11
Downloaded Jongro BBQ's yelp webpage successfully. Num:12
Downloaded KJUN's yelp webpage successfully. Num:13
Downloaded Take31's yelp webpage successfully. Num:14
Downloaded 8282's yelp webpage successfully. Num:15
Downloaded JeJu Noodle Bar's yelp webpage successfully. Num:16
Downloaded Woorijip's yelp webpage successfully. Num:17
Downloaded Jang Dok Dae

Downloaded Korean Noodle House's yelp webpage successfully. Num:131
Downloaded So Gong Dong Tofu House's yelp webpage successfully. Num:132
Downloaded Ohn Korean Eatery's yelp webpage successfully. Num:133
Downloaded Wa Jang Chang Resaurant and Grill's yelp webpage successfully. Num:134
Downloaded Jin Korean BBQ's yelp webpage successfully. Num:135
Downloaded Tree Garden Korean Restaurant's yelp webpage successfully. Num:136
Downloaded Bon KBBQ's yelp webpage successfully. Num:137
Downloaded Joon’s Kitchen's yelp webpage successfully. Num:138
Downloaded Soju 101's yelp webpage successfully. Num:139
Downloaded MDK Noodles's yelp webpage successfully. Num:140
Downloaded Honey Pig's yelp webpage successfully. Num:141
Downloaded Tasty Ko's yelp webpage successfully. Num:142
Downloaded Karne's yelp webpage successfully. Num:143
Downloaded Dokdo Restaurant's yelp webpage successfully. Num:144
Downloaded Bon Galbi's yelp webpage successfully. Num:145
Downloaded BBQ Garden's yelp webpage succe

Downloaded Dansungsa's yelp webpage successfully. Num:260
Downloaded Komé's yelp webpage successfully. Num:261
Downloaded Dong Bo Sung's yelp webpage successfully. Num:262
Downloaded Kim’s House Grill & BBQ's yelp webpage successfully. Num:263
Downloaded Gen Korean BBQ House's yelp webpage successfully. Num:264
Downloaded Tofu Factory Korean Cuisine's yelp webpage successfully. Num:265
Downloaded LA Hanbat's yelp webpage successfully. Num:266
Downloaded K Pop Ramen's yelp webpage successfully. Num:267
Downloaded Hot Stone & Korean Kitchen's yelp webpage successfully. Num:268
Downloaded Hampyong Cold Noodles's yelp webpage successfully. Num:269
Downloaded E Rae Korean Restaurant's yelp webpage successfully. Num:270
Downloaded SJ Omogari's yelp webpage successfully. Num:271
Downloaded Kunjip's yelp webpage successfully. Num:272
Downloaded Daeho Kalbi Jjim & Beef Soup's yelp webpage successfully. Num:273
Downloaded Sodam's yelp webpage successfully. Num:274
Downloaded Danbi Korean Restaur
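
The download loop above names each file after the restaurant. Two restaurants with the same name would then share one file, and a name containing a path separator could not be saved at all. A possible way around both problems is sketched below. The `safe_filename` helper is hypothetical and not part of the original notebook; any code that later reopens the files would have to build the names the same way.

```python
import re


def safe_filename(name, index):
    """Build a unique, file-system-friendly file name for one restaurant page."""
    # Replace characters that are not allowed in file names on common systems
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    # The zero-padded row index keeps restaurants with the same name apart
    return f"{index:03d}_{cleaned}.htm"


print(safe_filename("BCD Tofu House", 7))
print(safe_filename("Name/With Slash", 12))
```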

## API Call and more web scraping
This code parses the saved Yelp webpages of the restaurants and extracts information such as their addresses, phone numbers, geolocations, and popular dishes. It starts by creating four empty lists: address_list, phone_num_list, city, and restaurant.

It then creates a pandas DataFrame df2 with the columns City, Name, Address, Geolocation, Phone Number, and Popular Dishes. A for loop goes through each restaurant in the original DataFrame df, opens the saved Yelp webpage for that restaurant, and parses it with BeautifulSoup.

The code then tries to extract the restaurant's address by finding the p tag with the text Get Directions using the soup.find() method and reading the p tag that follows it. It then sends the address to the positionstack API to obtain its geolocation. If the API returns a result, the code formats the longitude and latitude coordinates into a string and stores it in geolocation.
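
The request and the formatting step can be sketched separately from the scraping loop. The example below is an illustration under assumptions: the helper names and the `POSITIONSTACK_KEY` environment variable are hypothetical, and the sample response is made up, containing only the `data`, `longitude` and `latitude` fields that the notebook code reads. Building the query with `urllib.parse.urlencode` escapes the spaces and commas in the address, and reading the key from the environment keeps it out of the notebook.

```python
import os
from urllib.parse import urlencode

API_URL = "http://api.positionstack.com/v1/forward"  # endpoint used in the notebook


def build_forward_url(address, access_key):
    """Return the forward-geocoding URL with a properly escaped query string."""
    return f"{API_URL}?{urlencode({'access_key': access_key, 'query': address})}"


def format_geolocation(geo_info):
    """Return '(longitude,latitude)' with 4 decimals, or None if no usable result exists."""
    results = geo_info.get("data") or []
    if not results or not isinstance(results[0], dict):
        return None
    long, lat = results[0].get("longitude"), results[0].get("latitude")
    if long is None or lat is None:
        return None
    return f"({long:.4f},{lat:.4f})"


# Hypothetical variable name; "demo-key" is only a placeholder
access_key = os.environ.get("POSITIONSTACK_KEY", "demo-key")
print(build_forward_url("290 Livingston St Brooklyn, NY 11217", access_key))

# Made-up sample response with the fields the notebook reads
sample = {"data": [{"longitude": -73.98312, "latitude": 40.68849}]}
print(format_geolocation(sample))
print(format_geolocation({"data": []}))
```

Returning None for an empty result also avoids carrying over the coordinates of the previous restaurant.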

The code also tries to extract the restaurant's phone number by finding the p tag with the text Phone number and reading the p tag that follows it. If the phone number is successfully obtained, it is stored in phone_number.

Next, the code tries to extract the popular dishes for the restaurant using the soup.select() method and the css-nyjpex class. It loops through all the dish elements and extracts their names. If a dish name is successfully obtained, it is appended to the popular_dishes_list.

Finally, the extracted information for each restaurant is stored in a new row of the df2 DataFrame using the .loc method, and the restaurant's phone number is appended to phone_num_list. Note that address_list is created but never filled.

In [236]:
# Something wrong with the code below
address_list = []
phone_num_list = []

city =[]
restaurant=[]
df2= pd.DataFrame(columns=['City','Name','Address', 'Geolocation','Phone Number', 'Popular Dishes'])

for i in range(len(df)):
    with open(f'{name_list[i]}.htm') as f:
        soup = BeautifulSoup(f, 'html.parser')
        popular_dishes_list = []
        
        city.append(df['City'].loc[i])
        restaurant.append(df['Name'].loc[i])
        # Restaurant address and geolocation
        # Reset the coordinates first so an empty API result cannot reuse the previous restaurant's values
        long = None
        lat = None
        geolocation = None
        try:
            # The address is in the <p> tag that follows the "Get Directions" <p> tag
            address_tag = soup.find('p', text='Get Directions')
            address = address_tag.find_next('p').text

            access_key = 'f76819fd4043da5b70bb1582a2b7523c'  # access key of the free positionstack account
            api = 'http://api.positionstack.com/v1/forward'
            # Passing params lets requests URL-encode the address
            response = requests.get(api, params={'access_key': access_key, 'query': address})
            geo_info = response.json()
            if geo_info['data']:
                long = geo_info['data'][0]['longitude']
                lat = geo_info['data'][0]['latitude']
                geolocation = f'({long:.4f},{lat:.4f})'

        except Exception as ex:
            # Missing tag, network error, or unexpected API response
            address = None
            long = None
            lat = None
            geolocation = None
    
        # Restaurant phone number: the <p> tag that follows the "Phone number" <p> tag
        try:
            phone_tag = soup.find('p', text='Phone number')
            phone_number = phone_tag.find_next('p').text
        except Exception as ex:
            phone_number = None
        phone_num_list.append(phone_number)
      
    
        # Popular dishes: collect the text of every dish name element
        try:
            popular_dishes_elements = soup.select('div.margin-t2__09f24__b0bxj.border-color--default__09f24__NPAKY div div div div div div div a div div:nth-child(2) div div')
            for dish_element in popular_dishes_elements:
                dish_name_element = dish_element.find('p', class_='css-nyjpex')
                if dish_name_element:
                    popular_dishes_list.append(dish_name_element.text)
        except Exception as ex:
            # Keep whatever dishes were collected before the error
            pass
        # Store the details of this restaurant as a new row of df2
        df2.loc[i] = [df['City'].loc[i],df['Name'].loc[i], address,geolocation, phone_number, popular_dishes_list]


In [240]:
df2

Unnamed: 0,City,Name,Address,Geolocation,Phone Number,Popular Dishes
0,New York,Kuun,"290 Livingston St Brooklyn, NY 11217","(-73.9831,40.6885)",(917) 909-1466,"[Bibimbap, Hot Stone Bibimbap, Kimchi Fried Ri..."
1,New York,Cho Dang Gol Korean Restaurant,"55 W 35th St New York, NY 10001","(-73.9865,40.7503)",(212) 695-8222,"[Tofu Stew, Homemade Tofu, Galbi Jjim, Baby Oc..."
2,New York,Her Name Is Han,,,(212) 779-9990,"[Pork Belly, Rice Cakes, Fried Chicken, Black ..."
3,New York,Thursday Kitchen,,,,"[Truffle Mac, Kimchi Paella, Edamame Dumplings..."
4,New York,ARIARI,,,(646) 422-7466,"[Soft Shell Crab, Beef Tartare, Seafood Stew, ..."
5,New York,Barn Joo,"35 Union Sq W New York, NY 10003","(-73.9907,40.7369)",(646) 398-9663,"[Chicken Wings, Uni Bibimbap, Tiger Roll, Kore..."
6,New York,BCD Tofu House,"5W 32nd St New York, NY 10001","(-73.9861,40.7475)",(212) 967-1900,"[Tofu Soup, Fried Fish, Original Bcd Tofu, Soo..."
7,New York,Sobak,"51B Canal St New York, NY 10002","(-73.9917,40.7149)",,[]
8,New York,Tofu Tofu,"96 Bowery New York, NY 10013","(-73.9954,40.7176)",(917) 442-5001,"[Fried Chicken Wings, Short Rib Soon Tofu, Bee..."
9,New York,Woori Korean,"336 Myrtle Ave Brooklyn, NY 11205","(-73.9731,40.6931)",(516) 460-8606,"[Bulgogi, Japchae, Galbitang, Bibimbap, La La ..."


In [241]:
final_df = pd.merge(df, df2, on=['City','Name'], how='inner')

In [242]:
final_df

Unnamed: 0,City,Name,Rank,Link,Rating,Image,Price Indicator,Price Indicator number,Location,Review Count,Address,Geolocation,Phone Number,Popular Dishes
0,New York,Kuun,1,https://www.yelp.com/biz/kuun-brooklyn?osq=Korean,4.5,https://s3-media0.fl.yelpcdn.com/bphoto/Y6Aas_...,$$$,3.0,Boerum Hill,215,"290 Livingston St Brooklyn, NY 11217","(-73.9831,40.6885)",(917) 909-1466,"[Bibimbap, Hot Stone Bibimbap, Kimchi Fried Ri..."
1,New York,Cho Dang Gol Korean Restaurant,2,https://www.yelp.com/biz/cho-dang-gol-korean-r...,4.0,https://s3-media0.fl.yelpcdn.com/bphoto/reoblM...,$$,2.0,Midtown West,1425,"55 W 35th St New York, NY 10001","(-73.9865,40.7503)",(212) 695-8222,"[Tofu Stew, Homemade Tofu, Galbi Jjim, Baby Oc..."
2,New York,Her Name Is Han,3,https://www.yelp.com/biz/her-name-is-han-new-y...,4.5,https://s3-media0.fl.yelpcdn.com/bphoto/SBjw3d...,$$,2.0,Midtown East,1704,,,(212) 779-9990,"[Pork Belly, Rice Cakes, Fried Chicken, Black ..."
3,New York,Thursday Kitchen,4,https://www.yelp.com/biz/thursday-kitchen-new-...,4.5,https://s3-media0.fl.yelpcdn.com/bphoto/rNxrGd...,$$,2.0,East Village,1701,,,,"[Truffle Mac, Kimchi Paella, Edamame Dumplings..."
4,New York,ARIARI,5,https://www.yelp.com/biz/ariari-new-york?osq=K...,4.5,https://s3-media0.fl.yelpcdn.com/bphoto/x5UIZS...,,,East Village,82,,,(646) 422-7466,"[Soft Shell Crab, Beef Tartare, Seafood Stew, ..."
5,New York,Barn Joo,6,https://www.yelp.com/biz/barn-joo-new-york-3?o...,4.0,https://s3-media0.fl.yelpcdn.com/bphoto/MoNgwY...,$$,2.0,Union Square,1862,"35 Union Sq W New York, NY 10003","(-73.9907,40.7369)",(646) 398-9663,"[Chicken Wings, Uni Bibimbap, Tiger Roll, Kore..."
6,New York,BCD Tofu House,7,https://www.yelp.com/biz/bcd-tofu-house-new-yo...,4.0,https://s3-media0.fl.yelpcdn.com/bphoto/pZI_fF...,$$,2.0,Koreatown,2391,"5W 32nd St New York, NY 10001","(-73.9861,40.7475)",(212) 967-1900,"[Tofu Soup, Fried Fish, Original Bcd Tofu, Soo..."
7,New York,Sobak,8,https://www.yelp.com/biz/sobak-new-york-2?osq=...,5.0,https://s3-media0.fl.yelpcdn.com/bphoto/jIzMOv...,$$,2.0,Chinatown,16,"51B Canal St New York, NY 10002","(-73.9917,40.7149)",,[]
8,New York,Tofu Tofu,9,https://www.yelp.com/biz/tofu-tofu-new-york-5?...,4.5,https://s3-media0.fl.yelpcdn.com/bphoto/89k2KX...,$$,2.0,Chinatown,485,"96 Bowery New York, NY 10013","(-73.9954,40.7176)",(917) 442-5001,"[Fried Chicken Wings, Short Rib Soon Tofu, Bee..."
9,New York,Woori Korean,10,https://www.yelp.com/biz/woori-korean-brooklyn...,5.0,https://s3-media0.fl.yelpcdn.com/bphoto/PrgNEq...,,,Fort Greene,49,"336 Myrtle Ave Brooklyn, NY 11205","(-73.9731,40.6931)",(516) 460-8606,"[Bulgogi, Japchae, Galbitang, Bibimbap, La La ..."


## MongoDB
This code connects to a local MongoDB instance and creates a database called "ucdavis". If there is no collection called "restaurant" in the database, it creates one.

The code then deletes all documents in the "restaurant" collection (if it exists). After that, it reads a pandas DataFrame called "final_df" and iterates over each row of the DataFrame. For each row, it converts the row to a dictionary and uses the dictionary to update (or insert if not exists) a document in the "restaurant" collection.

Next, it retrieves all the documents from the "restaurant" collection and loads them into another pandas DataFrame called "data". Finally, it writes the contents of the "data" DataFrame to an Excel file called "output.xlsx".
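
The update-or-insert step can be illustrated without a running MongoDB server by building the filter and update documents for one row. The sketch below rests on assumptions: `clean_document` and `upsert_spec` are hypothetical helpers, and matching on City and Name is a design choice that differs from the notebook, which uses the whole row as the filter. With pymongo, the resulting pair would be passed to `restaurant.update_one(query, update, upsert=True)`.

```python
import math


def clean_document(row):
    """Replace float NaN values (pandas' marker for missing data) with None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in row.items()
    }


def upsert_spec(row, key_fields=("City", "Name")):
    """Return the (filter, update) pair for an upsert keyed on `key_fields`."""
    document = clean_document(row)
    query = {field: document[field] for field in key_fields}
    return query, {"$set": document}


# Example row shaped like row.to_dict() for one restaurant; the missing phone number is NaN
row = {"City": "New York", "Name": "Sobak", "Rating": 5.0, "Phone Number": float("nan")}
query, update = upsert_spec(row)
print(query)
print(update)
```

Keying on City and Name treats two restaurants with the same name in the same city as one document, so a more specific key such as the Yelp link may be preferable.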

In [245]:
# Connect to MongoDB
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["ucdavis"]

In [246]:
# If a collection (table) called restaurant doesn't already exist, create one

if 'restaurant' not in db.list_collection_names():
    db.create_collection('restaurant', capped=False)

restaurant = db['restaurant'] 

In [249]:
# Start from an empty collection
if 'restaurant' in db.list_collection_names():
    db.restaurant.delete_many({})

# Add each row of the DataFrame to the collection if it doesn't already exist
for index, row in final_df.iterrows():
    insert_row = row.to_dict()
    update = {'$set': insert_row}
    restaurant.update_many(insert_row, update, upsert=True)

In [250]:
# retrieve data from collection as a pandas DataFrame
data = pd.DataFrame(list(restaurant.find()))

# write data to an Excel file
data.to_excel("output.xlsx", index=False)