### Unit 1 Homework:  Scraping the Yelp Website

Welcome! For this homework assignment, you'll build a web scraper that extends what we covered in our web scraping class.

The assignment builds on the lab work from that class, where we created a dataset listing the name, number of reviews, and price range for each restaurant on the following web page: https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1

Your most basic task is to build a dataset from the above website that has the following characteristics:

 - Has at least 5 columns (they can overlap with what we created in class)
 - Has at least 100 rows (this means you will have to scrape more than 1 page)
 
Your final product will be a Jupyter notebook that has the following characteristics:

 - It results in the creation of a pandas dataframe
 - It includes comments in every cell explaining what you are doing and your line of thinking
 
**Bonus:**

 - If you'd like, you can cycle through different pages manually, but see if you can do so programmatically, i.e. using loops (**hint:** `while` loops can help here, as well as `try/except` blocks to catch errors)
 - Some values are not present for every entry, so you might have to check whether a value exists at all
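
The paging hint can be sketched as a `while` loop wrapped around a `try/except` block. This is only a minimal sketch under stated assumptions: `scrape_all_pages`, `fetch_page` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real scraping function so that the example runs without network access.

```python
def scrape_all_pages(fetch_page, page_size=30, min_rows=100, max_pages=10):
    """Collect rows page by page until min_rows is reached or pages run out.

    fetch_page is a hypothetical callable: it takes a start offset and
    returns a list of rows for that page (an empty list when nothing is left).
    """
    rows = []
    start = 0
    pages_seen = 0
    while len(rows) < min_rows and pages_seen < max_pages:
        try:
            page_rows = fetch_page(start)
        except Exception as err:  # e.g. a network error or an unexpected layout
            print("stopping at start={}: {}".format(start, err))
            break
        if not page_rows:
            break
        rows.extend(page_rows)
        start += page_size
        pages_seen += 1
    return rows


# stand-in for a real scraper so the sketch runs offline
def fake_fetch(start):
    if start >= 120:
        raise ValueError("no such page")
    return ["restaurant {}".format(n) for n in range(start, start + 30)]


rows = scrape_all_pages(fake_fetch)
print(len(rows))
```

The `max_pages` cap is a design choice to keep the loop from running forever if the stopping condition is never met.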

In [1]:
# imports
from bs4 import BeautifulSoup
import pandas as pd
import requests


In [2]:
#create empty list for names from all pages
all_names = []

# create list of values to amend url with
start_from = [0, 30, 60, 90]

# loop through the start_from values to build one url per results page, then scrape that page
for i in start_from:
    url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start={}".format(i)
    req = requests.get(url)
    # name the parser explicitly so the result does not depend on which parsers are installed
    hwk_scraper = BeautifulSoup(req.text, "html.parser")

    # find all restaurant names
    # note: a dict literal keeps only the last of its duplicate "class" keys,
    # so this search effectively filters on the final class alone
    names = hwk_scraper.find_all("a", {"class": "lemon--a__373c0__IEZFH",
                                   "class": "link__373c0__1G70M",
                                   "class": "link-color--inherit__373c0__3dzpk",
                                   "class": "link-size--inherit__373c0__1VFlE"})
    # convert names to string data type
    names = [str(name) for name in names]
    # remove </a> tag
    names = [name.replace("</a>", "") for name in names]
    # split each item to retain only the restaurant name
    names = [name.split(">")[-1] for name in names]
    # remove the "more" links from the names list
    names = [name for name in names if name != "more"]
    # trim the last 8 items, which are empty entries ""
    names = names[:-8]
    # check the length of names: if there are 31 items, drop the last one so each page yields 30
    if len(names) == 31:
        names = names[:-1]
    print(names)
    print(len(names))

    # store this page's names in the all_names list
    all_names.append(names)

['The Mayfair Chippy', 'Dishoom', 'Flat Iron', 'Restaurant Gordon Ramsay', 'Ffiona’s Restaurant', 'Mother Mash', 'The Queens Arms', 'Padella', 'The Grazing Goat', 'The Golden Chippy', 'BAO - Soho', 'Duck &amp; Waffle', 'Ye Olde Cheshire Cheese', 'Dishoom', 'The Pig and Butcher', 'Gordon Ramsay Street Pizza', 'Sketch', 'NOPI', 'The Churchill Arms', 'Abeno', 'The Victoria', 'The Shed', 'Shoryu Ramen', 'Busaba Soho', 'The Colonel Fawcett', 'Burger &amp; Lobster', 'Hawksmoor Seven Dials', 'Savoir Faire', 'Yauatcha', 'London House by Gordon Ramsay']
30
['Dinner by Heston Blumenthal', 'Kazan', 'Wright Brothers - South Kensington', 'Barrafina', 'Belgo Centraal', 'Lanzhou Noodle Bar', 'Yasmeen Restaurant', 'Barrafina', 'The Ledbury', 'Piccolino', 'Smoking Goat', 'Roti Chai', 'Steak &amp; Co', 'Naru', 'The George Inn', 'Homeslice Neal’s Yard', 'St John Bar and Restaurant', 'Wahaca', 'José', 'Hoppers', 'Silk Road', 'The Palomar Restaurant', 'Negril', 'The Ninth London', 'Kiln', 'Da Mario Restaur
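
One thing worth knowing about the `find_all` calls in this notebook: a Python dict literal that repeats the key `"class"` keeps only the last value, so the search matches on a single class. The sketch below illustrates this on made-up markup (the class names and HTML are assumptions for illustration, not Yelp's real ones) and shows a CSS selector via `select` as one way to require every class at once.

```python
from bs4 import BeautifulSoup

# duplicate keys in a dict literal silently collapse to the last one
attrs = {"class": "link", "class": "link-color", "class": "link-size"}
print(attrs)

# made-up markup for illustration; the real page uses different class names
html = """
<a class="link link-color link-size" href="/biz/a">Dishoom</a>
<a class="link-size" href="/more">more</a>
"""
soup = BeautifulSoup(html, "html.parser")

# matches both links, because only "link-size" is actually being checked
loose = soup.find_all("a", attrs)

# a CSS selector requires every listed class to be present on the tag
strict = soup.select("a.link.link-color.link-size")

print([a.get_text() for a in loose])
print([a.get_text() for a in strict])
```

A stricter selector may reduce the need for positional trimming such as `names[:-8]`, though that depends on the page's actual markup.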

In [3]:
# repeat but for number of reviews
all_reviews = []

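# loop through the same pages, this time collecting the number of reviews
for i in start_from:
    url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start={}".format(i)
    req = requests.get(url)
    hwk_scraper = BeautifulSoup(req.text, "html.parser")

    # find the spans that hold review counts (only the last "class" key takes effect)
    num_reviews = hwk_scraper.find_all("span", {"class": "lemon--span__373c0__3997G",
                                        "class": "text__373c0__2Kxyz",
                                        "class": "reviewCount__373c0__2r4xT",
                                        "class": "text-color--black-extra-light__373c0__2OyzO"})

    # convert to strings, strip the closing tag and keep only the text after the last ">"
    num_reviews = [str(review) for review in num_reviews]
    num_reviews = [review.replace("</span>", "") for review in num_reviews]
    num_reviews = [review.split(">")[-1] for review in num_reviews]
    # keep only the purely numeric items, which are the review counts
    num_reviews = [review for review in num_reviews if review.isdigit()]
    print(num_reviews)
    print(len(num_reviews))
    # store this page's review counts in the all_reviews list
    all_reviews.append(num_reviews)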

['277', '1842', '377', '204', '270', '468', '119', '202', '239', '108', '183', '701', '353', '544', '109', '30', '830', '271', '358', '101', '212', '78', '375', '381', '26', '292', '342', '194', '482', '22']
30
['289', '112', '22', '61', '319', '351', '14', '37', '168', '47', '30', '297', '110', '65', '142', '227', '294', '310', '82', '113', '44', '104', '54', '22', '51', '139', '93', '3', '89', '54']
30
['23', '145', '683', '50', '22', '149', '57', '94', '64', '243', '142', '10', '11', '11', '95', '110', '33', '340', '105', '12', '3', '33', '115', '213', '91', '48', '35', '276', '105', '27']
30
['121', '23', '26', '119', '46', '97', '139', '68', '38', '95', '124', '163', '35', '90', '23', '49', '126', '18', '41', '78', '110', '51', '17', '93', '10', '98', '27', '22', '30', '170']
30


In [25]:
# repeat for price ranges
all_price_ranges = []

for i in start_from:
    url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start={}".format(i)
    req = requests.get(url)
    hwk_scraper = BeautifulSoup(req.text, "html.parser")

    # same search as for the review counts (only the last "class" key takes effect)
    price_ranges = hwk_scraper.find_all("span", {"class": "lemon--span__373c0__3997G",
                                        "class": "text__373c0__2Kxyz",
                                        "class": "reviewCount__373c0__2r4xT",
                                        "class": "text-color--black-extra-light__373c0__2OyzO"})

    price_ranges = [str(price) for price in price_ranges]
    price_ranges = [price.replace("</span>", "") for price in price_ranges]
    price_ranges = [price.split(">")[-1] for price in price_ranges]
    # price ranges immediately follow the number of reviews, so once a review count
    # is found, the next item in the list is checked for a price range
    # fix: the original indexed the string `price` instead of the list `price_ranges`,
    # which raised "IndexError: string index out of range"
    for index, price in enumerate(price_ranges):
        if price.isdigit():
            # guard against a review count being the last item in the list
            next_item = price_ranges[index + 1] if index + 1 < len(price_ranges) else ""
            # "\xa3" is the pound sign; store None when a restaurant has no price range
            all_price_ranges.append(next_item if "\xa3" in next_item else None)

    # all_price_ranges is one flat list across pages, unlike the per-page lists above
    print(all_price_ranges)

In [None]:
#    price_ranges = price_ranges[1::3]
#    price_ranges = [price for price in price_ranges if '\xA3' in price]

#    if len(price_ranges) != 30:

#    print(price_ranges)
#    print(len(price_ranges))
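
An alternative to pairing items by position is to loop over one container per restaurant and look up each field inside it, recording `None` when a field is absent. The sketch below assumes made-up markup: the `card`, `name`, `reviews` and `price` class names and the `text_or_none` helper are hypothetical, so the selectors would need to be replaced with whatever the real page uses.

```python
from bs4 import BeautifulSoup

# made-up markup and class names for illustration only
html = """
<div class="card"><a class="name">Padella</a><span class="reviews">202</span><span class="price">££</span></div>
<div class="card"><a class="name">Kiln</a><span class="reviews">51</span></div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(card, selector):
    """Return the text of the first match inside card, or None if it is missing."""
    tag = card.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


rows = []
for card in soup.select("div.card"):
    rows.append({
        "Name": text_or_none(card, "a.name"),
        "NumReviews": text_or_none(card, "span.reviews"),
        "PriceRange": text_or_none(card, "span.price"),
    })
print(rows)
```

Because every restaurant produces exactly one row, the columns stay aligned even when a price range is missing.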



In [5]:
all_addresses = []

# loop through the same pages, this time collecting street addresses
for i in start_from:
    url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start={}".format(i)
    req = requests.get(url)
    hwk_scraper = BeautifulSoup(req.text, "html.parser")

    # find the spans that hold addresses (only the last "class" key takes effect)
    addresses = hwk_scraper.find_all("span", {"class": "lemon--span__373c0__3997G",
                                   "class": "raw__373c0__3rcx7"})
    # convert to strings, strip the closing tag and keep only the text after the last ">"
    addresses = [str(address) for address in addresses]
    addresses = [address.replace("</span>", "") for address in addresses]
    addresses = [address.split(">")[-1] for address in addresses]
    # trim the first item, which is not an address
    addresses = addresses[1:]
    # remove additional tags "Delivery" and "Takeout"
    addresses = [address for address in addresses if address != "Delivery" and "Takeout" not in address]
    print(addresses)
    print(len(addresses))
    # store this page's addresses in the all_addresses list
    all_addresses.append(addresses)

['14 North Audley Street', "12 Upper Saint Martin's Lane", '17 Beak Street', '68 Royal Hospital Road', '51 Kensington Church Street', '26 Ganton Street', '11 Warwick Way', '6 Southwark Street', '6 New Quebec Street', '62 Greenwich High Road', '53 Lexington Street', '110 Bishopsgate', '145 Fleet Street', '22 Kingly Street', '80 Liverpool Road', '10 Bread Street', '9 Conduit Street', '21-22 Warwick Street', '119 Kensington Church Street', '47 Museum Street', '10a Strathearn Place', '122 Palace Gardens Terrace', '9 Regent Street', '106-110 Wardour Street', '1 Randolph Street', '36 Dean Street', '11 Langley Street', '42 New Oxford Street', '15-17 Broadwick Street', '7-9 Battersea Square']
30
['66 Knightsbridge', '93-94 Wilton Road', '56 Old Brompton Road', '43 Drury Lane', '50 Earlham Street', '33 Cranbourne Street', '1 Blenheim Terrace', '26 Dean Street', '127 Ledbury Road', '21 Heddon Street', '64 Shoreditch High Street', '3 Portman Mews South', '3/5 Charing Cross Road', '230 Shaftsbury 

In [6]:
all_areas = []

# loop through the same pages, this time collecting the area each restaurant is in
for i in start_from:
    url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start={}".format(i)
    req = requests.get(url)
    hwk_scraper = BeautifulSoup(req.text, "html.parser")

    # find the paragraphs that hold areas (only the last "class" key takes effect)
    areas = hwk_scraper.find_all("p", {"class": "lemon--p__373c0__3Qnnj",
                                   "class": "text__373c0__2Kxyz",
                                   "class": "text-color--black-extra-light__373c0__2OyzO",
                                   "class": "text-align--right__373c0__1f0KI",
                                   "class": "text-size--small__373c0__3NVWO"})
    # convert to strings, strip the closing tag and keep only the text after the last ">"
    areas = [str(area) for area in areas]
    areas = [area.replace("</p>", "") for area in areas]
    areas = [area.split(">")[-1] for area in areas]
    areas = [area for area in areas if area != ""]
    # remove telephone numbers from items in list
    areas = [area for area in areas if "020" not in area and "04420" not in area]
    # trim the first two items and the last item, which are not areas
    areas = areas[2:-1]
    # check contents of list: all lists are of equal length, so there are no missing or extra values and no further changes are needed
    print(areas)
    print(len(areas))
    # store this page's areas in the all_areas list
    all_areas.append(areas)

['Mayfair', 'Covent Garden', 'Soho', 'Chelsea', 'Kensington', 'Soho', 'Victoria', 'London Bridge', 'Marylebone', 'Deptford', 'Soho', 'Aldgate', 'Blackfriars', 'Soho', 'Islington', 'The City', 'Mayfair', 'Soho', 'Notting Hill', 'Bloomsbury', 'Paddington', 'Notting Hill', "St James's", 'Soho', 'Camden Town', 'Soho', 'Covent Garden', 'Bloomsbury', 'Soho', 'Battersea']
30
['Hyde Park', 'Victoria', 'South Kensington', 'Covent Garden', 'Covent Garden', 'Covent Garden', "St John's Wood", 'Soho', 'Notting Hill', 'Mayfair', 'Shoreditch', 'Marylebone', 'Leicester Square', 'Covent Garden', 'Borough', 'Covent Garden', 'Farringdon', 'Soho', 'Borough', 'Soho', 'Camberwell', 'Chinatown', 'Brixton Hill', 'Fitzrovia', 'Soho', 'Kensington', 'Camden Town', 'Covent Garden', 'London Bridge', 'Bayswater']
30
['Barons Court', 'Marylebone', 'Whitechapel', 'Pimlico', 'Kentish Town', 'Paddington', 'Farringdon', 'Chalk Farm', 'Covent Garden', 'Westminster', 'Leicester Square', 'Swiss Cottage', 'Fitzrovia', 'Nine

In [8]:
# make a flat list for each of the lists created 

all_names_flat = []

for result in all_names:
    for name in result:
        all_names_flat.append(name)

all_reviews_flat = []

for result in all_reviews:
    for review in result:
        all_reviews_flat.append(review)
        
all_addresses_flat = []

for result in all_addresses:
    for address in result:
        all_addresses_flat.append(address)
        
all_areas_flat = []

for result in all_areas:
    for area in result:
        all_areas_flat.append(area)
        
print(all_names_flat)
print(all_reviews_flat)
print(all_addresses_flat)
print(all_areas_flat)

['The Mayfair Chippy', 'Dishoom', 'Flat Iron', 'Restaurant Gordon Ramsay', 'Ffiona’s Restaurant', 'Mother Mash', 'The Queens Arms', 'Padella', 'The Grazing Goat', 'The Golden Chippy', 'BAO - Soho', 'Duck &amp; Waffle', 'Ye Olde Cheshire Cheese', 'Dishoom', 'The Pig and Butcher', 'Gordon Ramsay Street Pizza', 'Sketch', 'NOPI', 'The Churchill Arms', 'Abeno', 'The Victoria', 'The Shed', 'Shoryu Ramen', 'Busaba Soho', 'The Colonel Fawcett', 'Burger &amp; Lobster', 'Hawksmoor Seven Dials', 'Savoir Faire', 'Yauatcha', 'London House by Gordon Ramsay', 'Dinner by Heston Blumenthal', 'Kazan', 'Wright Brothers - South Kensington', 'Barrafina', 'Belgo Centraal', 'Lanzhou Noodle Bar', 'Yasmeen Restaurant', 'Barrafina', 'The Ledbury', 'Piccolino', 'Smoking Goat', 'Roti Chai', 'Steak &amp; Co', 'Naru', 'The George Inn', 'Homeslice Neal’s Yard', 'St John Bar and Restaurant', 'Wahaca', 'José', 'Hoppers', 'Silk Road', 'The Palomar Restaurant', 'Negril', 'The Ninth London', 'Kiln', 'Da Mario Restaurant'
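
The nested loops above can also be written with `itertools.chain.from_iterable`, and the `&amp;` sequences visible in names such as 'Duck &amp; Waffle' can be decoded with `html.unescape`. A minimal sketch, using a shortened stand-in for `all_names` rather than the scraped data:

```python
from html import unescape
from itertools import chain

# shortened stand-in for the nested all_names list (one inner list per page)
all_names = [["Dishoom", "Duck &amp; Waffle"], ["Steak &amp; Co", "Kiln"]]

# chain.from_iterable walks each inner list in turn, like the nested loops
all_names_flat = list(chain.from_iterable(all_names))

# unescape turns HTML entities such as &amp; back into plain characters
all_names_flat = [unescape(name) for name in all_names_flat]
print(all_names_flat)
```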

In [11]:
df_dict = {
    "Name": all_names_flat,
    "NumReviews": all_reviews_flat,
    "Location": all_areas_flat,
    "Address": all_addresses_flat
}
df = pd.DataFrame(df_dict)

In [12]:
df

| | Name | NumReviews | Location | Address |
|---|---|---|---|---|
| 0 | The Mayfair Chippy | 277 | Mayfair | 14 North Audley Street |
| 1 | Dishoom | 1842 | Covent Garden | 12 Upper Saint Martin's Lane |
| 2 | Flat Iron | 377 | Soho | 17 Beak Street |
| 3 | Restaurant Gordon Ramsay | 204 | Chelsea | 68 Royal Hospital Road |
| 4 | Ffiona’s Restaurant | 270 | Kensington | 51 Kensington Church Street |
| ... | ... | ... | ... | ... |
| 115 | Maroush II | 98 | Chelsea | 38 Beauchamp Place |
| 116 | Jakobs | 27 | Kensington | 20 Gloucester Road |
| 117 | Texas Joe’s | 22 | Borough | 8-9 Snowsfields |
| 118 | Lobos Meat and Tapas | 30 | London Bridge | 14 Borough High Street |
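
Before building the dataframe, it may help to confirm that every column has the same length and meets the 100-row requirement, and to convert the review counts from strings to integers. The sketch below is one possible check: `check_columns` is a hypothetical helper, and the two short lists stand in for the `*_flat` lists in the notebook.

```python
def check_columns(columns, min_rows=100):
    """Raise ValueError if columns differ in length or have fewer than min_rows rows."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("columns differ in length: {}".format(lengths))
    n_rows = next(iter(lengths.values()), 0)
    if n_rows < min_rows:
        raise ValueError("only {} rows, need at least {}".format(n_rows, min_rows))
    return n_rows


# tiny stand-in data; in the notebook these would be the *_flat lists
columns = {
    "Name": ["Dishoom", "Flat Iron"],
    "NumReviews": ["1842", "377"],
}
# review counts are scraped as strings, so convert them before any arithmetic
columns["NumReviews"] = [int(count) for count in columns["NumReviews"]]
print(check_columns(columns, min_rows=2))
```

A dict that passes this check can be handed to `pd.DataFrame` in the same way as `df_dict` above.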


In [None]:
# CAR PARK - how to deal with multiple tags per entry
# get cuisines, convert to string and clean 
#cuisines = hwk_scraper.find_all("a", {"class": "lemon--a__373c0__IEZFH",
#                                      "class": "link__373c0__1G70M",
#                                      "class": "link-color--inherit__373c0__3dzpk",
#                                      "class": "link-size--default__373c0__7tls6"})
#cuisines = [str(cuisine) for cuisine in cuisines]
#cuisines = [cuisine.replace("</a>", "") for cuisine in cuisines]
#cuisines = [cuisine.split(">")[-1] for cuisine in cuisines]
#cuisines = [cuisine for cuisine in cuisines if cuisine != ""]
#cuisines = cuisines[4:-30]

In [None]:
#contact_numbers = hwk_scraper.find_all("p", {"class": "lemon--p__373c0__3Qnnj",
#                                   "class": "text__373c0__2Kxyz",
#                                   "class": "text-color--black-extra-light__373c0__2OyzO",
#                                   "class": "text-align--right__373c0__1f0KI",
#                                   "class": "text-size--small__373c0__3NVWO"})
#contact_numbers = [str(contact) for contact in contact_numbers]
#contact_numbers = [contact.replace("</p>", "") for contact in contact_numbers]
#contact_numbers = [contact.split(">")[-1] for contact in contact_numbers]