## Task 3: Web Scraping with Beautiful Soup
**Goal**: Learn how to scrape a webpage with Beautiful Soup.

**Learning Outcomes**: Use Beautiful Soup to scrape data from different websites.

**Prerequisites**: Basic understanding of Python.

### Part 1: Introduction to Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with different parsers to provide ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Here is an example HTML document:
```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```

You can import BeautifulSoup and create a BeautifulSoup object as follows:
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
```

Now you can try several different ways to navigate the BeautifulSoup data structure:

In [3]:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.title.name)
# title

print(soup.title.string)
# The Dormouse's story

print(soup.title.parent.name)
# head

print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

print(soup.p['class'])
# ['title']

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
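
Beyond plain tag names, `find_all` accepts attribute filters and `select` accepts CSS selectors. The sketch below is a minimal illustration on a shortened copy of the example document; the variable names (`names`, `ids`, `missing`) and the lookup of a non-existent `link4` are chosen here purely for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document, so this block runs on its own
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Filter by tag name and CSS class (class_ avoids the Python keyword "class")
sisters = soup.find_all('a', class_='sister')
names = [a.get_text() for a in sisters]

# The same links via a CSS selector: <a> tags inside <p class="story">
story_links = soup.select('p.story a')
ids = [a['id'] for a in story_links]

# find() returns None when nothing matches, so check before using the result
missing = soup.find(id='link4')

print(names)
print(ids)
print(missing)
```

Because `find` returns `None` for a missing element, chaining attribute access directly onto its result can raise an error on pages that do not match your expectations.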


Two common tasks are:
1. Extracting all the URLs
2. Extracting all the text from a page

The cell below shows how to do both:

In [4]:
print("Extracting all the URLs")
for link in soup.find_all('a'):
    print(link.get('href'))

print("Extracting all the text from a page")
print(soup.get_text())

Extracting all the URLs
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
Extracting all the text from a page

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
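
The raw `get_text()` output keeps the line breaks and blank lines of the source document. A minimal sketch of one way to tidy it, using the `separator` and `strip` arguments of `get_text()` on a small made-up snippet (`snippet` is not part of the original example):

```python
from bs4 import BeautifulSoup

# Made-up snippet with untidy whitespace around the text
snippet = "<p>  Once upon a time  <b> there were </b> three sisters. </p>"
soup = BeautifulSoup(snippet, 'html.parser')

# strip=True trims each text fragment and drops whitespace-only fragments;
# separator is placed between the fragments that remain
clean_text = soup.get_text(separator=' ', strip=True)
print(clean_text)
```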



### Part 2: Scraping from the Web
Where do we get the HTML/XML pages from? Python's `requests` library handles HTTP communication by sending requests (GET, POST, etc.) to web servers, and we will use it to retrieve the raw HTML of a page. When a request fails at the network level (for example, a connection error or a timeout), `requests` raises an exception derived from `requests.exceptions.RequestException`.

```python
import requests
response = requests.get('http://books.toscrape.com/')
```

In [11]:
import requests
response = requests.get('http://books.toscrape.com/')

soup = BeautifulSoup(response.text, 'html.parser')

title = soup.title.text
all_paragraphs = soup.find_all('p')
specific_div = soup.find('div', class_='content')
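
A request can fail before any HTML is returned. The sketch below wraps the download in a hypothetical helper, `fetch_html`; the name, the 10-second timeout, and the choice to return `None` on failure are assumptions made here, not part of the original notebook. It relies on `raise_for_status()` and the `requests.exceptions.RequestException` base class.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        # Covers connection errors, timeouts, invalid URLs and bad statuses
        print(f'Request failed: {err}')
        return None
    return response.text

# A URL without a scheme is rejected before any network traffic happens
result = fetch_html('not-a-valid-url')
print(result)
```

Returning `None` keeps the calling code simple; re-raising the exception would be an equally reasonable design when a failed page should stop the whole scrape.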

### Part 3: Build an inventory by scraping a website

Your task is to scrape a book website and collect the prices for all the books you find.
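
Before scraping the live site, it can help to try the extraction logic on a single page fragment. The sketch below uses a made-up fragment (`sample_page`) that mimics the structure the task relies on: `article.product_pod` entries, a `title` attribute on the link, a `price_color` paragraph, and a `li.next` pagination link. The fragment, the `page_url` value, and the use of `urljoin` are assumptions for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up page fragment that mimics the structure the scraper relies on
sample_page = """
<article class="product_pod">
  <h3><a href="a-book_1/index.html" title="A Sample Book">A Sample ...</a></h3>
  <p class="price_color">£12.50</p>
</article>
<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>
"""
# Assumed URL of the page the fragment came from
page_url = 'http://books.toscrape.com/catalogue/page-1.html'
soup = BeautifulSoup(sample_page, 'html.parser')

prices = {}
for book in soup.find_all('article', class_='product_pod'):
    # The link text may be truncated; the title attribute holds the full title
    title = book.find('h3').find('a')['title']
    price_text = book.find('p', class_='price_color').get_text()
    # Keep only the part after the currency symbol and convert it to a float
    prices[title] = float(price_text.split('£')[-1])

# Resolve the relative "next" link against the URL of the current page
next_link = soup.select_one('li.next a')
next_url = urljoin(page_url, next_link['href']) if next_link else None

print(prices)
print(next_url)
```

Resolving the link with `urljoin` avoids assembling the URL by hand, which matters when relative links differ between the front page and later pages.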

In [58]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# This is the URL of the website we will scrape
base_url = 'http://books.toscrape.com/'
book_inventory = {}  # store your results in this dictionary

### YOUR CODE STARTS HERE
# Your task is to scrape a book website and collect the prices for all the books you find.
# Make sure you store the results in the book_inventory dictionary with the price as a float.
# Make sure you scrape all the pages so you have 1000 books in total.

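response = requests.get(base_url)
# Pass the raw bytes so from_encoding applies; decoding via response.text
# can turn the pound sign into the garbled 'Â£'
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
NUM_PAGES = 0  # number of "next" links followed so far
while True:
    all_books = soup.find_all('article', class_='product_pod')
    if len(all_books) != 20:
        # Each catalogue page is expected to list 20 books
        print(f'Unexpected number of books on this page: {len(all_books)}')
    for book in all_books:
        title = book.find('h3').find('a')['title']
        price = book.find('p', class_='price_color').text
        # Keep the part after the currency symbol and store it as a float
        book_inventory[title] = float(price.split('£')[-1])

    next_page = soup.select_one('li.next')
    if not next_page:
        break
    # The "next" href is relative; keep only the file name and rebuild the URL
    url = base_url + 'catalogue/' + next_page.find('a')['href'].split('/')[-1]
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
    NUM_PAGES += 1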
### YOUR CODE ENDS HERE

print(pd.DataFrame.from_dict(book_inventory, orient='index', columns=['Price'])) # your book inventory should be a dictionary with the book title as the key and the price as the value




                                                    Price
A Light in the Attic                                51.77
Tipping the Velvet                                  53.74
Soumission                                          50.10
Sharp Objects                                       47.82
Sapiens: A Brief History of Humankind               54.23
...                                                   ...
Alice in Wonderland (Alice's Adventures in Wond...  55.53
Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)    57.06
A Spy's Devotion (The Regency Spies of London #1)   16.97
1st to Die (Women's Murder Club #1)                 53.98
1,000 Places to See Before You Die                  26.08

[999 rows x 1 columns]
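
The output above has 999 rows, one short of the 1000 books the task mentions. One possible cause, assuming two listings on the site share the same title, is that a dictionary keeps only one value per key, so a repeated title overwrites the earlier price. The sketch below reproduces that effect with made-up data and shows a list of records as one alternative.

```python
# Made-up scrape results; the second and third entries share a title
scraped = [
    ('Book A', 10.0),
    ('Book B', 20.0),
    ('Book B', 25.0),
]

# Keyed by title: the later 'Book B' silently replaces the earlier one
by_title = {}
for title, price in scraped:
    by_title[title] = price

# A list of records keeps every listing, duplicates included
records = [{'Title': title, 'Price': price} for title, price in scraped]

print(len(by_title))
print(len(records))
```

A list of records like this can be passed to `pd.DataFrame(records)` if every listing needs to be kept.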


In [59]:
all_books_df = pd.DataFrame.from_dict(book_inventory, orient='index', columns=['Price'])
all_books_df.describe()

       Price
count  999.0
mean   35.059389
std    14.449765
min    10.0
25%    22.105
50%    35.96
75%    47.475
max    59.99
