### Unit 1 Homework:  Scraping the Yelp Website

Welcome! For this homework assignment you'll build a web scraper that extends what we covered in our web scraping class.

The assignment builds on the lab work from that class, where we built a dataset listing the name, number of reviews, and price range for each restaurant on the following web page: https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1

Your most basic task is to build a dataset from the above website that has the following characteristics:

 - Has at least 5 columns (they can overlap with what we created in class)
 - Has at least 100 rows (this means you will have to scrape more than 1 page)
 
Your final product will be a Jupyter notebook that has the following characteristics:

 - It results in the creation of a pandas DataFrame
 - Every cell has comments explaining what you are doing and your line of thinking
 
**Bonus:**

 - If you'd like, you can cycle through the different pages manually, but see if you can do so programmatically, i.e. using loops (**hint:** `while` loops can help here, as well as `try/except` blocks to catch errors)
 - Some values do not appear in every entry, so you may need to check whether a value exists at all
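
As a rough illustration of the first bonus hint, here is a minimal sketch of a `while` loop wrapped around a `try/except` block. The names `collect_rows`, `fetch_page`, and `fake_fetch_page` are hypothetical, and the page size of 30 is an assumption based on the number of restaurants the search page showed in class. The stand-in fetcher exists only so the loop can be tried without touching the website; in the homework it would be replaced by your own function that requests and parses one page.

```python
def collect_rows(fetch_page, min_rows=100, page_size=30, max_pages=10):
    """Call fetch_page(offset) repeatedly until min_rows rows are collected.

    fetch_page is a hypothetical callable: it takes a result offset and
    returns a list of rows (dicts) scraped from that page.
    """
    rows = []
    offset = 0
    pages_seen = 0
    while len(rows) < min_rows and pages_seen < max_pages:
        try:
            page_rows = fetch_page(offset)
        except Exception as error:  # e.g. a network error or a parsing failure
            print(f"stopping at offset {offset}: {error}")
            break
        if not page_rows:  # an empty page means there is nothing left to scrape
            break
        rows.extend(page_rows)
        offset += page_size
        pages_seen += 1
    return rows


# stand-in for a real scraper, so the loop can be tried offline
def fake_fetch_page(offset):
    if offset >= 120:
        raise ValueError("no more results")
    return [{"Name": f"Restaurant {offset + i}"} for i in range(30)]


rows = collect_rows(fake_fetch_page)
print(len(rows))
```

The `max_pages` cap is a safety net so that a mistake in the stopping condition cannot turn into an endless stream of requests.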

In [79]:
# imports
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [80]:
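# initialize the request
# {} is a placeholder for the `start` query parameter, filled in with base_url.format(...)
# note: `start` looks like a result offset rather than a page number, so the second
# page would be base_url.format(30) if each page lists 30 restaurants
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=London%2C%20United%20Kingdom&ns=1&start={}"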


In [81]:
req = requests.get(base_url.format(1))

In [None]:
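# parse the response with the lxml parser (requires the lxml package)
# only the BeautifulSoup class was imported, so it is called without the bs4. prefix
scraper = BeautifulSoup(req.text, 'lxml')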

In [82]:
req

<Response [200]>

In [83]:
# feed the text into a scraper
yelp_scraper = BeautifulSoup(req.text)

In [84]:
# a tag's class names are separated by spaces
# find every single link
## use the find_all method to select every <a> tag by class
## note: a dict literal keeps only the last value for a repeated key, so listing
## 'class' four times matched on the last class only; we pass that one directly
titles = yelp_scraper.find_all('a', {'class': 'link-size--inherit__373c0__1VFlE'})
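
If you want to require several classes at once, keep in mind that a Python dict cannot repeat the `'class'` key: only the last value survives. One option is a CSS selector passed to `select()`. The sketch below uses made-up HTML and made-up class names (`link`, `big`) purely for illustration; it is not the markup of the page being scraped.

```python
from bs4 import BeautifulSoup

# a dict literal silently keeps only the last value for a repeated key
attrs = {'class': 'first', 'class': 'second'}
print(attrs)

# made-up HTML with made-up class names, just to illustrate the selector
html = '''
<a class="link big" href="/a">A</a>
<a class="link" href="/b">B</a>
<a class="big" href="/c">C</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# chaining .class names in a CSS selector requires ALL of them on the tag
both = soup.select('a.link.big')
print([tag.get_text() for tag in both])
```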

In [85]:
titles[0]

<a class="lemon--a__373c0__IEZFH link__373c0__1G70M link-color--inherit__373c0__3dzpk link-size--inherit__373c0__1VFlE" href="/biz/the-champion-notting-hill-london?osq=Restaurants" name="The Champion- Notting Hill" rel="" target="">The Champion- Notting Hill</a>

In [86]:
# check the data type of an item: it's not a string,
# but a specialized BeautifulSoup object (bs4.element.Tag)
type(titles[0])

bs4.element.Tag

In [87]:
# first convert everything into a string
titles = [str(title) for title in titles]

In [88]:
# remove the </a> tag at the end
titles = [title.replace('</a>', '') for title in titles]

In [89]:
# and then split the items and grab the appropriate spot in the list to get the actual title
titles = [title.split('>')[1] for title in titles]
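
The string-splitting approach above works, but a `Tag` can also hand over its text and attributes directly. The sketch below is only an alternative to consider: it parses a shortened copy of the first link shown earlier (the class attribute is trimmed for readability) and then uses the standard library's `html.unescape` to decode an entity such as `&amp;`, which survives when tags are converted with `str()`.

```python
import html
from bs4 import BeautifulSoup

# shortened copy of the first <a> tag from the notebook output
snippet = ('<a class="link-size--inherit__373c0__1VFlE" '
           'href="/biz/the-champion-notting-hill-london?osq=Restaurants" '
           'name="The Champion- Notting Hill">The Champion- Notting Hill</a>')
link = BeautifulSoup(snippet, 'html.parser').find('a')

# get_text() returns the text between the tags, with entities already decoded
name = link.get_text()
# attributes are read like dictionary keys
href = link['href']
print(name, href)

# if you already have a raw string, html.unescape decodes entities such as &amp;
decoded = html.unescape('The Lion &amp; Unicorn')
print(decoded)
```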

In [90]:
titles

['The Champion- Notting Hill',
 'more',
 'Escudo de Cuba',
 'more',
 'BunBunBun Vietnamese Food',
 'more',
 'Goddards at Greenwich',
 'more',
 'Piccolino',
 'more',
 'Xi’An BiangBiang Noodles',
 'more',
 'Hoppers',
 'more',
 'Greenberry Café',
 'more',
 'Fish in a Tie',
 'more',
 'Chokhi Dhani',
 'more',
 'Da Mario Restaurant',
 'more',
 'Texas Joe’s',
 'more',
 'Bizzarro',
 'more',
 'Dinerama',
 'more',
 'The Barbary',
 'more',
 'Flat Iron',
 'more',
 'Black Axe Mangal',
 'more',
 'Busaba Soho',
 'more',
 'Cambridge Street',
 'more',
 'Restaurante Santafereño',
 'more',
 'Circolo Popolare',
 'more',
 'The Ivy',
 'more',
 'Fondue Factory London',
 'more',
 'The Lion &amp; Unicorn',
 'more',
 'Phat Phuc Noodle Bar',
 'more',
 'Balaio Brazilian Grill',
 'more',
 'Electric Diner',
 'more',
 'Bird of Smithfield',
 'more',
 'Kiln',
 'more',
 'CASK Pub and Kitchen',
 'more',
 '<span aria-hidden="true" class="lemon--span__373c0__3997G icon__373c0__ehCWV icon--24-chevron-left-v2 icon--currentC

In [91]:
# get rid of the 'more' entries that sit between the restaurant names
titles = [title for title in titles if title != 'more']

In [92]:
# drop the last 8 entries, which follow the restaurant names
titles = titles[:-8]

In [93]:
titles

['The Champion- Notting Hill',
 'Escudo de Cuba',
 'BunBunBun Vietnamese Food',
 'Goddards at Greenwich',
 'Piccolino',
 'Xi’An BiangBiang Noodles',
 'Hoppers',
 'Greenberry Café',
 'Fish in a Tie',
 'Chokhi Dhani',
 'Da Mario Restaurant',
 'Texas Joe’s',
 'Bizzarro',
 'Dinerama',
 'The Barbary',
 'Flat Iron',
 'Black Axe Mangal',
 'Busaba Soho',
 'Cambridge Street',
 'Restaurante Santafereño',
 'Circolo Popolare',
 'The Ivy',
 'Fondue Factory London',
 'The Lion &amp; Unicorn',
 'Phat Phuc Noodle Bar',
 'Balaio Brazilian Grill',
 'Electric Diner',
 'Bird of Smithfield',
 'Kiln',
 'CASK Pub and Kitchen',
 '<span aria-hidden="true" class="lemon--span__373c0__3997G icon__373c0__ehCWV icon--24-chevron-left-v2 icon--currentColor__373c0__x-sG2 icon--v2__373c0__1yp8c navigation-button-icon__373c0__1WyUh" style="width:24px;height:24px"']

In [97]:
# drop 'more' plus any leftover markup fragments
titles = [title for title in titles if title != 'more' and '<div' not in title and '<span' not in title]


In [98]:
titles

['The Champion- Notting Hill',
 'Escudo de Cuba',
 'BunBunBun Vietnamese Food',
 'Goddards at Greenwich',
 'Piccolino',
 'Xi’An BiangBiang Noodles',
 'Hoppers',
 'Greenberry Café',
 'Fish in a Tie',
 'Chokhi Dhani',
 'Da Mario Restaurant',
 'Texas Joe’s',
 'Bizzarro',
 'Dinerama',
 'The Barbary',
 'Flat Iron',
 'Black Axe Mangal',
 'Busaba Soho',
 'Cambridge Street',
 'Restaurante Santafereño',
 'Circolo Popolare',
 'The Ivy',
 'Fondue Factory London',
 'The Lion &amp; Unicorn',
 'Phat Phuc Noodle Bar',
 'Balaio Brazilian Grill',
 'Electric Diner',
 'Bird of Smithfield',
 'Kiln',
 'CASK Pub and Kitchen']

In [99]:
# a dict literal keeps only the last value for a repeated 'class' key,
# so the search matches on this one class (the non-numeric matches are filtered out below)
num_reviews = yelp_scraper.find_all('span', {'class': 'text-color--black-extra-light__373c0__2OyzO'})

In [100]:
num_reviews[2]

<span class="lemon--span__373c0__3997G text__373c0__2Kxyz text-color--black-extra-light__373c0__2OyzO text-align--left__373c0__2XGa-"><a class="lemon--a__373c0__IEZFH link__373c0__1G70M link-color--inherit__373c0__3dzpk link-size--default__373c0__7tls6" href="/search?cflt=pubs&amp;find_desc=Restaurants&amp;find_loc=London%2C+United+Kingdom" name="" rel="" role="link" target="">Pubs</a>, </span>

In [101]:
#change to a data type we can work with
# we'll convert everything into a string
num_reviews = [str(review) for review in num_reviews]

In [102]:
num_reviews[0]

'<span class="lemon--span__373c0__3997G text__373c0__2Kxyz reviewCount__373c0__2r4xT text-color--black-extra-light__373c0__2OyzO text-align--left__373c0__2XGa-">108</span>'

In [103]:
# remove the closing </span> tag by replacing it with an empty string
num_reviews = [review.replace('</span>', '') for review in num_reviews]

In [104]:
num_reviews[0]

'<span class="lemon--span__373c0__3997G text__373c0__2Kxyz reviewCount__373c0__2r4xT text-color--black-extra-light__373c0__2OyzO text-align--left__373c0__2XGa-">108'

In [105]:
# split on the > to get the last item
num_reviews = [review.split('>')[-1] for review in num_reviews]

In [106]:
num_reviews[4]

'8'

In [107]:
#it's a string that encodes a number
num = '701'

In [108]:
num.isdigit()

True

In [109]:
#return the item in the list if this statement returns true
# do a check and just select the items that are numeric
# the isdigit() string method will be helpful here
num_reviews = [int(review) for review in num_reviews if review.isdigit()]

In [110]:
num_reviews

[108,
 8,
 33,
 84,
 47,
 17,
 113,
 32,
 33,
 1,
 139,
 22,
 149,
 59,
 64,
 3,
 10,
 381,
 50,
 7,
 11,
 170,
 1,
 18,
 56,
 1,
 108,
 23,
 51,
 139]

In [111]:
scraper = BeautifulSoup(req.text)

In [112]:
# price ranges sit in the same kind of <span> as the review counts;
# a dict literal keeps only the last 'class' value, so we match on that one class
price_ranges = scraper.find_all('span', {'class': 'text-color--black-extra-light__373c0__2OyzO'})

In [113]:
price_ranges = [str(range_) for range_ in price_ranges]

In [114]:
price_ranges = [range_.replace('</span>', '') for range_ in price_ranges]

In [115]:
price_ranges = [range_.split('>')[1] for range_ in price_ranges]

In [116]:
# \xA3 is unicode for the pound symbol
price_ranges = [range_ for range_ in price_ranges if '\xA3' in range_]
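
One caveat about filtering like this: a restaurant without a price range simply disappears from the list, so the list can end up shorter than the list of names. A possible way around that, sketched below, is to loop over one container per restaurant and record `None` when a value is missing. The HTML, the class names (`card`, `name`, `price`), and the helper `text_or_none` are all made up for illustration; the real page uses different markup.

```python
from bs4 import BeautifulSoup

# made-up markup: the second restaurant has no price range
html = '''
<div class="card"><a class="name">Cafe One</a><span class="price">££</span></div>
<div class="card"><a class="name">Cafe Two</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')


def text_or_none(parent, tag_name, class_name):
    """Return the text of the first matching child, or None if there is none."""
    found = parent.find(tag_name, class_=class_name)
    return found.get_text(strip=True) if found is not None else None


rows = []
for card in soup.find_all('div', class_='card'):
    rows.append({
        'Name': text_or_none(card, 'a', 'name'),
        'PriceRange': text_or_none(card, 'span', 'price'),
    })
print(rows)
```

Because every restaurant produces exactly one row, the columns stay aligned even when a value is absent.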

In [117]:
# only the last 'class' value in a dict literal is used, so this matches
# the 4.5-star <div> elements only, not every rating
star_ratings = yelp_scraper.find_all('div', {'class': 'i-stars--regular-4-half__373c0__1YrPo'})

In [118]:
star_ratings[0]

<div aria-label="4.5 star rating" class="lemon--div__373c0__1mboc i-stars__373c0__1T6rz i-stars--regular-4-half__373c0__1YrPo border-color--default__373c0__3-ifU overflow--hidden__373c0__2y4YK" role="img"><img alt="" class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yelp_design_web.yji-52d3d7a328db670d4402843cbddeed89.png" width="132"/></div>

In [119]:
type(star_ratings[0])

bs4.element.Tag

In [120]:
# convert each tag into a string (the loop variable gets its own name for readability)
star_ratings = [str(rating) for rating in star_ratings]

In [121]:
type(star_ratings[0])

str

In [122]:
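# strip closing </a> tags (these <div> strings contain none, so this changes nothing)
star_ratings = [rating.replace('</a>', '') for rating in star_ratings]

In [123]:
# note: each string ends with '</div>', so the piece after the final '>' is empty;
# the rating itself lives in the aria-label attribute
star_ratings = [rating.split('>')[-1] for rating in star_ratings]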

In [124]:
star_ratings

['', '', '', '', '', '', '', '', '', '', '', '', '', '']
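
The empty strings above are a hint that the rating is not stored between the tags. In the `<div>` printed earlier it sits in the `aria-label` attribute ("4.5 star rating"). The sketch below shows one way it could be read, starting again from `Tag` objects rather than strings. The helper name `rating_from_tag` is hypothetical, and the sample `<div>` is a trimmed copy of the one shown above with the class names shortened.

```python
import re
from bs4 import BeautifulSoup

# trimmed copy of the star-rating <div>; class names shortened for readability
html = '<div aria-label="4.5 star rating" class="i-stars" role="img"><img alt=""/></div>'
soup = BeautifulSoup(html, 'html.parser')


def rating_from_tag(tag):
    """Pull the number out of an aria-label such as '4.5 star rating'."""
    label = tag.get('aria-label', '')
    match = re.search(r'(\d+(?:\.\d+)?) star rating', label)
    return float(match.group(1)) if match else None


# a compiled regex as the attribute value matches any label containing 'star rating',
# so every rating is found, not just the 4.5-star ones
rating_divs = soup.find_all('div', attrs={'aria-label': re.compile('star rating')})
ratings = [rating_from_tag(div) for div in rating_divs]
print(ratings)
```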

In [125]:
# match the cuisine links on their one distinguishing class
# (a dict literal keeps only the last 'class' value, and a trailing space
# in the class name prevents any match)
cuisines = scraper.find_all('a', {'class': 'link-size--default__373c0__7tls6'})

In [126]:
cuisines[0]

In [127]:
# convert each tag into a string
cuisines = [str(cuisine) for cuisine in cuisines]

In [128]:
type(cuisines[0])

In [130]:
# remove the </a> tag at the end
cuisines = [cuisine.replace('</a>', '') for cuisine in cuisines]

In [131]:
# split on '>' and keep the link text
cuisines = [cuisine.split('>')[1] for cuisine in cuisines]

In [132]:
cuisines

In [133]:
# drop 'more' and any leftover markup fragments
cuisines = [cuisine for cuisine in cuisines if cuisine != 'more' and '<div' not in cuisine and '<span' not in cuisine]

In [134]:
cuisines

In [135]:
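# every dictionary entry must be separated by a comma
df_dict = {
    'Name': titles,
    'NumReviews': num_reviews,
    'PriceRange': price_ranges,
    'StarRatings': star_ratings,
    'Cuisines': cuisines
}

# the lists are not guaranteed to have the same length, which pd.DataFrame rejects,
# so each one is wrapped in a Series (shorter columns are padded with NaN);
# rows are matched by position only, so check that the values really line up
df = pd.DataFrame({column: pd.Series(values) for column, values in df_dict.items()})

In [137]:
df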
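
Another way to build the frame, sketched here under the assumption that you collect one dictionary per restaurant instead of one list per column, is to pass a list of dictionaries to `pd.DataFrame`. The two rows below are made-up sample data, not scraped values.

```python
import pandas as pd

# made-up sample rows: one dict per restaurant, None where a value was missing
rows = [
    {'Name': 'Cafe One', 'NumReviews': 12, 'PriceRange': '££'},
    {'Name': 'Cafe Two', 'NumReviews': 3, 'PriceRange': None},
]

# each dict becomes one row, and the keys become the column names
df = pd.DataFrame(rows)
print(df.shape)

# missing values are easy to count once the data is in a frame
missing_prices = int(df['PriceRange'].isna().sum())
print(missing_prices)
```

With this layout a missing value affects only its own cell, so the columns cannot drift out of step with each other.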