# Scraping HTML

### Introduction

Now that we know a bit of HTML, we can use that knowledge to extract data from a website by parsing its HTML.  In this lesson, we'll give that a shot by scraping a small part of the Yelp website.  Let's get started.

### Scraping Yelp

In [13]:
import requests
url = 'https://www.yelp.com/search?find_desc=chinese&find_loc=New+York%2C+NY+10001'
response = requests.get(url)

This time, instead of calling `response.json()`, let's get back the raw text of the response.

In [14]:
response_text = response.text

In [16]:
response_text[:200]

'<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentEleme'

We can see that this text is the HTML that we get back from the Yelp website.  Unlike JSON, parsing it takes a bit more work.  Fortunately, we can use Beautiful Soup to help us search through it and find the information that we want.
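Before parsing, it can be worth confirming that the request actually succeeded, since a blocked or failed request still returns text. The following is a minimal sketch, not part of the original lesson: the `fetch_html` helper, the `User-Agent` value, and the ten-second timeout are all assumptions chosen for illustration, and a site may still refuse automated requests.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising if the request failed.

    Hypothetical helper: the header value and timeout are illustrative.
    """
    headers = {"User-Agent": "Mozilla/5.0 (compatible; lesson-scraper)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With a helper like this, `response_text = fetch_html(url)` would either give us HTML to parse or fail loudly, rather than handing an error page to the parser.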

### Using Beautiful Soup

In [6]:
from bs4 import BeautifulSoup as bs

In [17]:
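# name the parser explicitly so the result does not depend on which parsers are installed
parsed_html = bs(response_text, 'html.parser')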

By passing our string into Beautiful Soup, we can now search the HTML by tag or attribute.  For example, if we want to find all elements with a certain tag, we can search our HTML like so.

In [56]:
parsed_html.select('h1')

[<h1 class="heading--h3__09f24__3gZ0A"><span class="raw__09f24__3Obuy">Best chinese <span class="">near New York, NY 10001</span></span></h1>]

So here, we get a list of all of the 'h1' elements on the page -- there's only one.  
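To see how `select` behaves with tags, classes, and attributes, here is a small sketch on a made-up HTML snippet. The markup, class names, and phone number are invented for illustration and are not taken from the Yelp page.

```python
from bs4 import BeautifulSoup

# Made-up HTML, standing in for a much larger real page
html = """
<div id="main">
  <h1 class="title">Best chinese</h1>
  <p class="phone">(212) 555-0100</p>
  <a href="/biz/example">Example</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("h1")         # every <h1> element
by_class = soup.select("p.phone")  # <p> elements with the class "phone"
by_attr = soup.select("a[href]")   # <a> elements that have an href attribute

print(by_tag[0].text)
print(by_class[0].text)
print(by_attr[0]["href"])
```

In each case `select` returns a list, even when only one element matches, which is why we index into the result before reading `.text`.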

### Finding list of elements

What we would prefer to do is find the list of restaurants returned.  Looking through the browser's inspector, we can find the element that contains the results by its class, which begins with "leftRail...", and select it with something like the following:

<img src="./yelp-scrape.png" width="70%">

In [62]:
selected_divs = parsed_html.select("div[class^=leftRail]", limit = 1)

selected_divs[0].text[:20]

'Filters$$$$$$$$$$Sug'

This says to find the divs whose class begins with the string 'leftRail', and to return at most one of them.  Then we select the first div from the list, extract its text, and slice the first 20 characters.  It turns out this gives us back more of the page than we want: the text starts with the filters rather than the restaurants.  What we would like to do is get as close as possible to our list of restaurants, so let's try to find those `li` elements.
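The `^=` prefix match can be sketched on a made-up snippet. The class names below are invented to resemble hashed class names and are assumptions, not the real Yelp markup. Note that `^=` compares against the start of the whole `class` attribute value, so a class that appears second in the attribute does not match, whereas `*=` matches the substring anywhere.

```python
from bs4 import BeautifulSoup

# Made-up divs with hashed-looking class names
html = """
<div class="leftRail__09f24__abc12 padding">Filters and results</div>
<div class="mainContent leftRailExtra">Other</div>
<div class="leftRail__09f24__zzz99">Second rail</div>
"""
soup = BeautifulSoup(html, "html.parser")

starts_with = soup.select("div[class^=leftRail]")        # attribute value starts with leftRail
contains = soup.select("div[class*=leftRail]")           # leftRail appears anywhere in the value
first_only = soup.select("div[class^=leftRail]", limit=1)  # stop after the first match

print(len(starts_with), len(contains), len(first_only))
```

A prefix match is handy when the end of a class name looks auto-generated and may change between page builds.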

### Finding the li elements

<img src="./sponsored_results.png" width="80%">

Ok, so when we inspect the list in the HTML, it turns out that the first list item contains the text "Sponsored Results".  The list items that follow appear to be the restaurants.  

So we would like to find this "Sponsored Results" list item, and then find its sibling list items.  That is, find the `li` elements at the same level as the Sponsored Results `li`.  Here's how we can do this.

In [68]:
# select() only takes a CSS selector, so we check each li's text ourselves
found_li = [li for li in parsed_html.select('li') if li.text.strip() == "Sponsored Results"][0]

Now that we've identified our Sponsored Results list item, we can find the `li` elements that follow it at the same level.

In [85]:
li_results = found_li.find_next_siblings('li')

In [100]:
len(li_results)

13
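The sibling search can be sketched on a tiny made-up list. The markup and restaurant names here are invented for illustration. `find_next_siblings` is the snake_case name for `findNextSiblings`, and it only returns elements that share the same parent and come after the starting element.

```python
from bs4 import BeautifulSoup

# Made-up lists: only the first one contains the marker item
html = """
<ul>
  <li>Sponsored Results</li>
  <li>Restaurant A</li>
  <li>Restaurant B</li>
</ul>
<ul><li>Unrelated list item</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the marker item by its exact text
marker = soup.find("li", string="Sponsored Results")

# Later <li> elements with the same parent as the marker
siblings = marker.find_next_siblings("li")
names = [li.text for li in siblings]
print(names)
```

The item in the second list is left out because it has a different parent, which is what keeps the search confined to the one list we care about.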

Another way that we can find siblings is to get our `found_li`'s parent, and then find that parent's children.

In [115]:
parent = found_li.find_parent()

In [123]:
parent.find_all('li')[0].text

'Sponsored Results'
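One detail of the parent-and-children approach is worth a quick sketch: `findChildren` is an alias of `find_all`, which searches all descendants by default, not only direct children. The made-up snippet below (invented markup, not Yelp's) shows the difference when a list item contains a nested list.

```python
from bs4 import BeautifulSoup

# Made-up list where one item contains a nested list
html = """
<ul>
  <li>Sponsored Results</li>
  <li>Restaurant A
    <ul><li>Nested detail</li></ul>
  </li>
  <li>Restaurant B</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

marker = soup.find("li", string="Sponsored Results")
parent = marker.parent  # the enclosing <ul>; same element as marker.find_parent()

all_li = parent.find_all("li")                      # every descendant <li>
direct_li = parent.find_all("li", recursive=False)  # only direct children

print(len(all_li), len(direct_li))
```

If the real list items contain nested lists, passing `recursive=False` would keep the result at the same level as the sibling approach.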

In [124]:
parent.find_all('li')[1].text

"Yong Kang Street76$$Asian Fusion, Sushi Bars, Poke(212) 765-87771000 8th AveHell's KitchenDeliveryTakeout“A nice option for cheap and freshly made sushi. I've tried it a few times now and find the quality to be good for the price. The people running the place are welcoming and friendly.…”\xa0moreStart OrderOffers takeout and delivery"

### Finding the restaurant details

Once we have the list items that we wish to search through, we can select the first result.

In [125]:
first_result = li_results[0]

From there, we can pick out just the data that is relevant.  For example, the phone number is in the first `p` element.

In [128]:
first_result.find('p').text

'(212) 765-8777'

And the title is located in the `h4`.

In [127]:
first_result.find('h4').text

'Yong Kang Street'

So from there, we may wish to extract this information for each of our found restaurants.

In [145]:
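restaurants = []
# skip the first three list items and loop over the remaining cards
for card in li_results[3:]:
    title = card.find('h4')
    phone_number = card.find('p')
    if title is None:
        # no h4 means this is not a restaurant card, so there is nothing to record
        continue
    # the name is the part of the h4 text after the non-breaking space ('\xa0')
    name = title.text.split('\xa0')[-1]
    # reset the number for every card so a missing p does not reuse the previous one
    number = phone_number.text if phone_number else None
    restaurants.append({'name': name, 'number': number})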

In [146]:
restaurants

[{'name': 'China Xiang 中国湘', 'number': '(212) 967-6088'},
 {'name': 'Fu Xing', 'number': '(212) 575-6978'},
 {'name': 'Golden City Chinese Restaurant', 'number': '(212) 736-4004'},
 {'name': 'Dim Sum Chelsea', 'number': '(212) 645-0100'},
 {'name': 'Lan Sheng Restaurant', 'number': '(212) 575-8899'},
 {'name': 'Dim Sum Palace', 'number': '(646) 861-1910'},
 {'name': 'Da Tang Szechuan', 'number': '(646) 478-8345'},
 {'name': 'New Li Yuan', 'number': '(212) 575-6978'},
 {'name': 'Grand Sichuan', 'number': '(212) 620-5200'},
 {'name': 'Ming', 'number': '(212) 868-1378'}]

And now we have our list of restaurants.
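The per-card extraction can also be pulled into a small function, which makes the missing-element cases easier to see. This is a sketch under assumptions: `parse_card` is a hypothetical helper, and the markup, names, and phone number below are made up rather than taken from the Yelp response.

```python
from bs4 import BeautifulSoup


def parse_card(card):
    """Return {'name', 'number'} for one list item, or None if it has no h4."""
    title_tag = card.find("h4")
    if title_tag is None:
        return None
    phone_tag = card.find("p")
    # Keep the text after the non-breaking space, e.g. "1.\xa0Name" -> "Name"
    name = title_tag.get_text().split("\xa0")[-1]
    number = phone_tag.get_text() if phone_tag else None
    return {"name": name, "number": number}


# Made-up cards: one complete, one without a phone number, one without a title
html = """
<ul>
  <li><h4>1.\xa0Example Garden</h4><p>(212) 555-0100</p></li>
  <li><h4>2.\xa0Sample Noodle House</h4></li>
  <li><span>Not a restaurant card</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

parsed = (parse_card(li) for li in soup.select("li"))
restaurants = [r for r in parsed if r is not None]
print(restaurants)
```

Returning `None` for the number, rather than leaving a variable unset, means a card without a phone number cannot pick up the number from the card before it.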

### Summary