![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Web scraping in Python

## Beautiful Soup

In this project, you will use Beautiful Soup to scrape content from additional websites.  Use the third-party `requests` module to obtain the web page contents; Beautiful Soup then handles parsing the pages and extracting the parts of interest.

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 1

**A fictional bookstore**

The site at http://books.toscrape.com/ is a collection of pages that resemble an online bookstore.  Its prices and ratings are randomly assigned.  The titles and authors appear to be those of actual books, although I have not verified all of them.

For a first task, identify all the "Autobiography" titles and their prices.  Save this information in a Python dictionary or, if you are familiar with Pandas, in a Pandas DataFrame.  Ideally, for this exercise, your web crawler will begin with the home page and navigate between pages programmatically (i.e. do not manually find nested URLs).

As with other web scraping tasks, getting the steps right will certainly require some trial-and-error, and examination of partial results.

In [1]:
import requests
from bs4 import BeautifulSoup

url = 'http://books.toscrape.com/'
bookstore = requests.get(url)

In [2]:
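# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(bookstore.text, 'html.parser')
# The category links live in the sidebar <aside>
sidebar = soup.find('aside')
for a in sidebar.find_all('a'):
    if 'autobiography' in a.text.lower():
        # The href is relative to the home page URL
        url_autobio = url + a['href']
        break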

In [3]:
page = requests.get(url_autobio).text
autobiography = BeautifulSoup(page, 'html.parser')

In [4]:
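# Book links carry the full title in their `title` attribute
titles = []
for title in autobiography.find_all('a'):
    if title.get('title'):
        titles.append(title['title'])

# Each product has an element with class "product_price"; its first <p> holds the price
prices = []
for price in autobiography.find_all(class_="product_price"):
    # Stray character appears before the pound symbol
    prices.append(price.p.text.replace('Â', ''))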
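
A stray `Â` before a pound sign is the usual symptom of UTF-8 bytes being decoded as ISO-8859-1 (Latin-1), which `requests` may fall back to when the response headers name no charset. That is a likely explanation here, not a verified one. The sketch below reproduces the effect with the standard library only; the price string is an illustrative sample, not data fetched from the site.

```python
# A sketch of the suspected cause: UTF-8 bytes decoded with the wrong codec.
# The price string is an illustrative sample, not fetched from the site.
raw = "£10.93".encode("utf-8")        # bytes as a UTF-8 server would send them

mojibake = raw.decode("iso-8859-1")   # wrong codec: every byte becomes one character
correct = raw.decode("utf-8")         # right codec: the two-byte pound sign is restored

print(repr(mojibake))
print(repr(correct))
```

If this is indeed the cause, two common remedies are setting `bookstore.encoding = 'utf-8'` before reading `.text`, or passing the bytes in `.content` to `BeautifulSoup` so that it can detect the encoding itself.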

In [5]:
import pandas as pd
autobiographies = pd.DataFrame(list(zip(titles, prices)),
                               columns=['Title', 'Price'])
autobiographies

|   | Title | Price |
|---|-------|-------|
| 0 | The Argonauts | £10.93 |
| 1 | M Train | £27.18 |
| 2 | Lab Girl | £40.85 |
| 3 | Approval Junkie: Adventures in Caring Too Much | £58.81 |
| 4 | Running with Scissors | £12.91 |
| 5 | Me Talk Pretty One Day | £57.60 |
| 6 | Lust & Wonder | £11.87 |
| 7 | Life Without a Recipe | £59.04 |
| 8 | A Heartbreaking Work of Staggering Genius | £54.29 |
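
The task also allows a plain dictionary instead of a DataFrame. The following is a minimal sketch of that variant, mapping each title to a numeric price. It uses the first three rows of the output above as sample data, and `parse_price` is a hypothetical helper introduced only for illustration.

```python
# Sample rows copied from the output above; in the notebook these would be
# the `titles` and `prices` lists built in the earlier cell.
titles = ["The Argonauts", "M Train", "Lab Girl"]
prices = ["£10.93", "£27.18", "£40.85"]


def parse_price(text):
    """Hypothetical helper: drop everything up to the last '£' and return a float."""
    return float(text.rpartition("£")[2])


# Pair each title with its price; zip() stops at the shorter of the two lists.
price_by_title = {title: parse_price(price) for title, price in zip(titles, prices)}

# With numeric values, comparisons such as "cheapest title" become straightforward.
cheapest = min(price_by_title, key=price_by_title.get)
print(price_by_title)
print(cheapest)
```

Because the helper keeps only what follows the pound sign, it also tolerates a stray character in front of it.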


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 2

**NOAA factoids**

The website for the United States National Oceanic and Atmospheric Administration (https://www.noaa.gov/) contains a sidebar on the left.  The links in the sidebar lead to subject-area sections like "Fisheries" and "Satellites."  Each of those contains a large "quick fact" that is thematically related to that area.  For example:

<img src="img/NOAA-fact.png" width="50%" />

The quote shown is chosen at random, and a link on the page loads a new one (still relevant to the section).

For this task, pull one quote from each sidebar link whose page has one.  Skip any page that lacks such a quote.

In [6]:
url = 'https://www.noaa.gov/'
noaa = requests.get(url)

In [7]:
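# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(noaa.text, 'html.parser')
sidebar = soup.find(id="navigation-main")
# Each sidebar entry has class "leaf" and wraps a link to its section
sections = [e.a['href'] for e in sidebar.find_all(class_="leaf")]
urls = [url+section for section in sections]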
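
Plain concatenation works when the base URL ends in `/` and the `href` is relative, but an `href` that starts with `/` would produce a double slash, and an absolute `href` would produce a malformed URL. One way to handle all three cases is `urllib.parse.urljoin` from the standard library. The sketch below uses illustrative `href` values that are assumptions, not links taken from the live page.

```python
from urllib.parse import urljoin

base = "https://www.noaa.gov/"

# Illustrative hrefs (assumed, not taken from the live page):
# root-relative, relative, and already absolute.
hrefs = ["/weather", "oceans-coasts", "https://www.noaa.gov/climate"]

# urljoin resolves each href against the base the way a browser would.
resolved = [urljoin(base, href) for href in hrefs]

for href, full in zip(hrefs, resolved):
    print(href, "->", full)
```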

In [9]:
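import re
for section_url in urls:
    topic = BeautifulSoup(requests.get(section_url).text, 'html.parser')
    fact = topic.find(class_='field-collection-item-field-quick-facts')
    # Skip pages that have no quick fact
    if fact:
        # Collapse each run of line breaks so the headline ends with a colon
        fact = re.sub(r"[\r\n]+", ":\n", fact.text.strip())
        print(fact)
        print('-----')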

6.3 billion observations per day:
-----
2019 was second-warmest year on record:
Earth’s global average surface temperature was 1.71°F (0.95°C) above the 20th-century average in 2019. Nine of the 10 warmest years on record have occurred since 2005.
-----
More than 80% unexplored:
Most of the ocean is unseen by human eyes -- more than 80% of our ocean is unmapped, unobserved and unexplored.
-----
47:
The number of fish stocks rebuilt since 2000 as a result of our fishery management process. The number of stocks on the overfishing list dropped to an all-time low in 2019.
-----
Orbiting 1 million miles from Earth:
-----
20 :
The number of NOAA Cooperatative Institutes (CIs). Partnerships CIs, which are located at degree-granting institutions, play a vital role in increasing NOAA’s research capacity and expertise. CIs support NOAA's mission and educate the next generation of the nation’s scientific workforce to prepare NOAA for the future.
-----
More than 650:
The number of data collection 
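
The `re.sub` call in the cell above replaces each run of line breaks with a colon and a single newline, which is what separates a fact's headline from its body in the output. Here is a small sketch of that step in isolation; the sample text is an assumption about what `fact.text` might look like, and the real page's whitespace may differ.

```python
import re

# Assumed sample: a headline and a body separated by a run of line breaks,
# with extra whitespace at both ends.
raw_text = "\n\n47\r\n\r\nThe number of fish stocks rebuilt since 2000.\n"

# strip() removes the outer whitespace; the pattern then matches the whole
# "\r\n\r\n" run at once, so only one colon is inserted.
cleaned = re.sub(r"[\r\n]+", ":\n", raw_text.strip())
print(cleaned)
```

Text that contains no line break at all passes through this substitution unchanged.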

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)