
# A simple web scraping example using Python and BeautifulSoup

Here's a simple example using http://books.toscrape.com/, a "fake" online book store set up for practising web scraping. The aim is to get a list of all the books and their prices and then save it as JSON for later use.

1. Go to http://books.toscrape.com/ and use right-click "Inspect" (Chrome) or "Inspect Element" (Firefox) to see the HTML code that creates the web page.
2. Identify the tags surrounding each book.
3. Use requests to get the HTML of the web page.
4. Turn the HTML into a searchable, indexable object using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
5. Find the tags for each book and extract the Title and Price values.
6. Find the next page link and repeat steps 3-5.
7. Save each Title and Price in a Python list and convert the list to a JSON string for storage.


In [None]:
import requests
from bs4 import BeautifulSoup
import json

In [None]:
url = "http://books.toscrape.com/"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml') 

Now let's find each book's title and price inside its `article` tag:

In [None]:
for article in soup.find_all('article'):   # find all article tags in the document
    title = article.find('h3').find('a')['title']    # get the h3->a tag where the Title is stored in the 'title' attribute
    price = article.find('p', {'class':'price_color'}).get_text()     # the price is in the <p class="price_color"> tag

    ## TODO: can you also find and include the star rating?

    print((title, price))           # print the (Title, Price) pair
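
The price comes back as text such as `£51.77`, which cannot be summed or sorted numerically. Here is a minimal sketch of converting it, assuming every price contains a single decimal number; `parse_price` is a hypothetical helper, not part of the original notebook.

```python
import re

def parse_price(price_text):
    """Return the numeric part of a price string such as '£51.77' as a float."""
    # Take the first run of digits (with an optional decimal part) and
    # ignore the currency symbol or any stray characters in front of it.
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no number found in {price_text!r}")
    return float(match.group())

print(parse_price("£51.77"))  # 51.77
```

A float is used here because it can be passed straight to `json.dumps`; for exact currency arithmetic, `decimal.Decimal` would be a safer choice.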

But this is just the first page! Let's find the address of the next page by getting the link from the "next" button.

In [None]:
next_href = soup.find('li', {'class':'next'}).find('a')['href']   # named next_href so the built-in next() is not shadowed
print(next_href)

In [None]:
next_url = url+next_href    # "http://books.toscrape.com/" + "catalogue/page-2.html"
response = requests.get(next_url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
for article in soup.find_all('article'):   # find all article tags in the document
    title = article.find('h3').find('a')['title']    # get the h3->a tag where the Title is stored in the 'title' attribute
    price = article.find('p', {'class':'price_color'}).get_text()     # the price is in the <p class="price_color"> tag
    print((title, price))           # print the (Title, Price) pair
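
Plain string concatenation works here only because the link on the first page happens to be relative to the site root. A more general approach is sketched below, using `urljoin` from the standard library to resolve each link against the URL of the page it was found on. The `page-3.html` value stands in for the shorter link that later pages return.

```python
from urllib.parse import urljoin

base = "http://books.toscrape.com/"

# The link found on the first page includes the "catalogue/" directory.
page2 = urljoin(base, "catalogue/page-2.html")

# Links found on later pages are relative to the page they appear on.
page3 = urljoin(page2, "page-3.html")

print(page2)  # http://books.toscrape.com/catalogue/page-2.html
print(page3)  # http://books.toscrape.com/catalogue/page-3.html
```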

Great, so we've got the next page. Let's use this to write a loop that keeps going until the last page. This is a clumsy example of crawling a site to get all the data. The result is a list of (Title, Price) tuples that should contain 1000 books.

In [None]:
url = "http://books.toscrape.com/"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')

myPriceList = []

# collect the books on the first page
for article in soup.find_all('article'):
    title = article.find('h3').find('a')['title']
    price = article.find('p', {'class':'price_color'}).get_text()
    myPriceList.append((title, price))

# follow the "next" link until a page no longer has one
next_item = soup.find('li', {'class':'next'})
while next_item is not None:
    next_href = next_item.find('a')['href']
    # links on later pages omit the "catalogue/" prefix, so add it back
    if not next_href.startswith('catalogue/'):
        next_href = "catalogue/" + next_href
    next_url = url + next_href
    print(next_url)
    response = requests.get(next_url)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    for article in soup.find_all('article'):
        title = article.find('h3').find('a')['title']
        price = article.find('p', {'class':'price_color'}).get_text()
        myPriceList.append((title, price))
    next_item = soup.find('li', {'class':'next'})
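
The extraction code above appears twice, once for the first page and once inside the loop. One way to remove the duplication is to move it into a function. This is a sketch only: `parse_page` is a hypothetical name, the sample HTML is a cut-down imitation of the site's markup, and the parser is Python's built-in `html.parser` so that the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

def parse_page(html):
    """Return ([(title, price), ...], next_href or None) for one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.find_all("article"):
        title = article.find("h3").find("a")["title"]
        price = article.find("p", {"class": "price_color"}).get_text()
        books.append((title, price))
    # The last page has no <li class="next">, so report None in that case.
    next_item = soup.find("li", {"class": "next"})
    next_href = next_item.find("a")["href"] if next_item is not None else None
    return books, next_href

# Made-up sample that imitates the structure of one listing page.
sample = """
<article><h3><a href="#" title="Example Book">Example ...</a></h3>
<p class="price_color">£10.00</p></article>
<ul><li class="next"><a href="page-2.html">next</a></li></ul>
"""
books, next_href = parse_page(sample)
print(books, next_href)
```

With a helper like this, the crawl reduces to one loop that fetches a page, calls the function, and stops when the returned link is `None`.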

In [None]:
len(myPriceList)   # should be 1000 entries ...

In [None]:
json.dumps(myPriceList)   # this could be saved to a file for later use
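
Below is a sketch of writing the list to disk and reading it back. The helper names and the `books.json` file name are assumptions made for illustration, and the demo writes to a temporary directory so that it leaves nothing behind. Note that JSON has no tuple type, so each (Title, Price) pair comes back as a two-item list.

```python
import json
import os
import tempfile

def save_books(books, path):
    """Write a list of (title, price) pairs to path as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as £ readable in the file
        json.dump(books, f, ensure_ascii=False, indent=2)

def load_books(path):
    """Read the list back; each pair is returned as a two-item list."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

books = [("Example Book", "£10.00")]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "books.json")
    save_books(books, path)
    restored = load_books(path)

print(restored)  # [['Example Book', '£10.00']]
```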

Hopefully this simple example shows you:
* how useful scraping can be when there's no other way to get dynamic data
* how complicated navigating HTML can be, especially if it's not well structured
* how fragile scraping can be -- this code would easily break if any changes were made to the web site