# Section 10: HTML, CSS, and Web Scraping

### Terminology

Web pages can be represented by the objects that make up their structure and content. This representation is known as the **Document Object Model (DOM)**. The DOM represents a document as a tree of nodes and objects, and it gives programs an interface for changing the structure, style, and content of a page. Among other things, this lets programming languages interact with a page and change its HTML!

Together, the DOM and HTML form a hierarchy of elements. You can navigate this structure much like a family tree, and that is one of Beautiful Soup's main navigation mechanisms. Once you select a specific element on a page, you can move to related elements with methods and attributes that return a tag's siblings, parent, or descendants.
  
To learn more about the DOM see:  
https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction

<img src="images/DOM-model.svg.png" width="500">
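The family-tree idea above can be sketched on a tiny, made-up HTML snippet. The markup, the `shelf` id, and the class names below are hypothetical and exist only for illustration; the Beautiful Soup calls (`.parent`, `find_next_sibling`, `.descendants`) are part of the library.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup, used only to illustrate tree navigation
html = """
<html><body>
  <div id="shelf">
    <p class="title">First book</p>
    <p class="price">10.00</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
title = soup.find('p', {'class': 'title'})

# Parent: the tag that directly contains the title
parent_id = title.parent['id']

# Sibling: the next <p> at the same level as the title
sibling_text = title.find_next_sibling('p').text

# Descendants: every node nested under <body>; keep only tags, not text nodes
descendant_tags = [node.name for node in soup.find('body').descendants
                   if isinstance(node, Tag)]

print(parent_id, sibling_text, descendant_tags)
```

Note that `.descendants` also yields the text nodes between tags, which is why the list comprehension filters on `Tag`.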

### Beautiful Soup     

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library designed for quick scraping projects. It allows you to select and navigate the tree-like structure of HTML documents, searching for particular tags, attributes or ids. It also allows you to then further traverse the HTML documents through relations like children or siblings. In other words, with Beautiful Soup, you could first select a specific `div` tag and then search through all of its nested tags. 
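As a minimal sketch of "select a `div`, then search inside it", the example below uses invented markup with two `div` tags and collects links from only one of them. The ids `sidebar` and `main` and the link targets are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs, each containing links
html = """
<div id="sidebar"><a href="/about">About</a></div>
<div id="main">
  <a href="/book-1">Book 1</a>
  <a href="/book-2">Book 2</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Searching the whole document finds every link
all_links = soup.find_all('a')

# Selecting one div first limits the search to its nested tags
main_div = soup.find('div', id='main')
main_links = [a['href'] for a in main_div.find_all('a')]

print(len(all_links), main_links)
```

Narrowing to a container first is usually more robust than searching the whole page, because unrelated links elsewhere on the page are never considered.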


## Scraping a Single Page

In [None]:
from bs4 import BeautifulSoup
import requests

We'll scrape http://books.toscrape.com/, a demo bookstore built for scraping practice.

In [None]:
html_page = requests.get('http://books.toscrape.com/') # Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') # Pass the page contents to beautiful soup for parsing
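The request above assumes the server answers successfully. One hedged way to make that explicit is a small helper that sets a timeout and raises on HTTP error codes before parsing. The name `fetch_soup` and its `session` parameter are hypothetical, introduced only for this sketch; `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, session=None, timeout=10):
    """Download a page and return it parsed, raising on HTTP errors.

    `session` is optional: pass a requests.Session to reuse connections,
    or leave it as None to use the module-level requests.get.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                # raise on 4xx/5xx responses
    return BeautifulSoup(response.content, 'html.parser')
```

With this helper, `soup = fetch_soup('http://books.toscrape.com/')` would replace the two lines above and fail loudly instead of silently parsing an error page.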


In [None]:
print(soup.prettify())  # Call the method to get the indented HTML as a string

In [None]:
soup.find_all('li', {'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})

In [None]:
first_20 = soup.find_all('li', {'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})

In [None]:
len(first_20)

In [None]:
first = first_20[0]

In [None]:
first

In [None]:
first.find('a')['href']

In [None]:
first.find('h3').find('a')['title']

In [None]:
first.find('p', {'class': 'price_color'})

In [None]:
first.find('p', {'class': 'price_color'}).text
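The `.text` of the price tag is a string that includes a currency symbol, so it cannot be used in arithmetic directly. A minimal sketch of converting it to a number is below; `parse_price` is a hypothetical helper name, and the sample string is only an illustrative value.

```python
import re

def parse_price(price_text):
    """Return the first decimal number found in a price string as a float."""
    match = re.search(r'\d+(?:\.\d+)?', price_text)
    if match is None:
        raise ValueError(f'No number found in {price_text!r}')
    return float(match.group())

print(parse_price('£51.77'))  # → 51.77
```

Searching for the number, rather than slicing off the first character, also copes with stray characters that can appear in front of the symbol when a page is decoded with the wrong encoding.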

In [None]:
first.find('p', {'class': "instock availability"})

In [None]:
# This one uses a regular expression (regex), a Mod 4 topic, but it could come in handy!
# The pattern matches a full class string such as "star-rating Three".

import re
regex = re.compile("star-rating (.*)")
first.find('p', {'class': regex})

In [None]:
first.find('p', {'class': regex})['class']
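The `class` attribute comes back as a list of strings, with the rating spelled out as a word. If a numeric rating is more convenient, a small lookup can translate it. The sketch below assumes the rating words run from `One` to `Five`; `STAR_WORDS` and `stars_to_int` are hypothetical names.

```python
# Assumed vocabulary of rating words used in the class attribute
STAR_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def stars_to_int(class_list):
    """Return the numeric rating found in a class list, or None if absent."""
    for name in class_list:
        if name in STAR_WORDS:
            return STAR_WORDS[name]
    return None

print(stars_to_int(['star-rating', 'Three']))  # → 3
```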

In [None]:
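def clean_scrape(book):
    """Extract title, price, availability, star rating, and URL from one book's <li> tag."""
    info = {}

    info['title'] = book.find('h3').find('a')['title']
    info['price'] = book.find('p', {'class': 'price_color'}).text

    # Check this book's own availability text
    info['in_stock'] = 'In stock' in book.find('p', {'class': "instock availability"}).text

    # The class list looks like ['star-rating', 'Three']; keep the last item
    info['stars'] = book.find('p', {'class': regex})['class'][-1]

    # Assumes the href is relative to the site root, as on the home page
    info['url'] = 'http://books.toscrape.com/' + book.find('a')['href']

    return info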
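`clean_scrape` builds the book URL by concatenating the site root with the `href`. That works when the link is relative to the root, but links on pages under `/catalogue/` appear to be relative to that folder instead. A hedged alternative is `urllib.parse.urljoin`, which resolves a relative link against the URL of the page it was found on. The `href` values below are illustrative samples, not scraped output.

```python
from urllib.parse import urljoin

# Illustrative link as it might appear on the home page
home_url = 'http://books.toscrape.com/'
home_href = 'catalogue/a-light-in-the-attic_1000/index.html'

# Illustrative link as it might appear on a paginated catalogue page
page_url = 'http://books.toscrape.com/catalogue/page-2.html'
page_href = 'a-light-in-the-attic_1000/index.html'

# urljoin resolves each href relative to the page that contained it
from_home = urljoin(home_url, home_href)
from_page = urljoin(page_url, page_href)

print(from_home)
print(from_page)
```

Both calls resolve to the same absolute address, so passing the page URL alongside each book would keep the `url` field correct regardless of which page the book came from.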

In [None]:
book_dicts = [clean_scrape(book) for book in first_20]

In [None]:
book_dicts

In [None]:
import pandas as pd
pd.DataFrame(book_dicts)

## Scraping Multiple Pages (Pagination!)

In [None]:
url = 'http://books.toscrape.com/catalogue/page-1.html'

In [None]:
# Build one URL per catalogue page, numbered 1 through 50
urls = ['http://books.toscrape.com/catalogue/page-{}.html'.format(i) for i in range(1, 51)]
urls
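Generating the URLs up front requires knowing the page count in advance. An alternative sketch is to read the "next" link from each page and stop when there is none. This assumes the pager marks that link as `<li class="next"><a href="...">`, which should be checked against the real page source; `next_page_url` is a hypothetical helper and the sample markup is invented.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, 'html.parser')
    next_item = soup.find('li', {'class': 'next'})  # assumed pager markup
    if next_item is None or next_item.find('a') is None:
        return None
    return urljoin(current_url, next_item.find('a')['href'])

# Invented sample markup for a middle page and a last page
middle = '<ul class="pager"><li class="next"><a href="page-3.html">next</a></li></ul>'
last = '<ul class="pager"><li class="previous"><a href="page-49.html">previous</a></li></ul>'

print(next_page_url(middle, 'http://books.toscrape.com/catalogue/page-2.html'))
print(next_page_url(last, 'http://books.toscrape.com/catalogue/page-50.html'))
```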

In [None]:
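def get_20_books(url):
    """Request one catalogue page and return a list of dicts, one per book on that page."""
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')

    # Each book sits in an <li> with this exact class string
    raw = soup.find_all('li', {'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
    to_dicts = [clean_scrape(book) for book in raw]

    return to_dicts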

In [None]:
all_dicts = []

# Scrape every page and collect all of the books in one flat list
for url in urls:
    all_dicts.extend(get_20_books(url))

print(len(all_dicts))
all_dicts
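The loop above sends 50 requests back to back and stops at the first error. A gentler variant, sketched below, pauses between requests and records failures instead of crashing. `scrape_pages` is a hypothetical helper, and the one-second default delay is an arbitrary assumption rather than a site requirement.

```python
import time

def scrape_pages(urls, scrape_page, delay_seconds=1.0):
    """Call scrape_page on each URL, pausing between requests.

    Returns (results, failed), where failed lists (url, error message) pairs.
    """
    results, failed = [], []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be polite: wait before the next request
        try:
            results.extend(scrape_page(url))
        except Exception as error:     # broad on purpose: log and keep going
            failed.append((url, str(error)))
    return results, failed
```

Here the call would be `all_dicts, failed = scrape_pages(urls, get_20_books)`, after which `failed` shows which pages, if any, need a retry.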

In [None]:
df = pd.DataFrame(all_dicts)

In [None]:
df