# Web Scraping Challenges

For these exercises, we will parse information from [Quotes to Scrape](http://quotes.toscrape.com/) and [Books to Scrape](http://books.toscrape.com/). Both sites are built and offered as testing grounds for learning different web scraping techniques. We will focus on using [Requests](https://requests.readthedocs.io/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/).

If you are looking for an extra challenge, here are a few things you can try:
* Only parse content with:
    * BeautifulSoup [`find()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) and [`find_all()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) methods
    * [CSS selectors](https://www.w3.org/TR/selectors-4/) (BeautifulSoup)
    * [XPath selectors](https://www.w3.org/TR/2017/REC-xpath-31-20170321/) (lxml)
* Use [Selenium](https://selenium.dev/selenium/docs/api/py/) to scrape a [JavaScript version of Quotes to Scrape](http://quotes.toscrape.com/js/)
* Use [`urllib`](https://docs.python.org/3/library/urllib.html) from the Python Standard Library instead of Requests
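
To make the first two parsing options concrete, here is a minimal sketch that extracts the same elements once with `find_all()` and once with a CSS selector. The HTML snippet is hypothetical and only loosely modelled on a quote block; the class names `quote`, `text` and `author` are assumptions you should check against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are assumptions, not copied from the site
html = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <small class="author">Sample Author</small>
</div>
<div class="quote">
  <span class="text">Another sample quote.</span>
  <small class="author">Another Author</small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: find_all() with a tag name and a class filter
authors_find = [tag.get_text() for tag in soup.find_all("small", class_="author")]

# Option 2: a CSS selector passed to select()
authors_css = [tag.get_text() for tag in soup.select("div.quote small.author")]

print(authors_find)
print(authors_css)
```

Both lists should contain the same names, so the choice between the two styles is mostly a matter of taste and of how deeply nested the target elements are.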

## Getting started

To get started, we need to import the necessary modules. Unless you're attempting one of the challenges, we'll give you this one for free. However, you might find it useful to import some other modules later to help answer some questions.

```python
import requests
from bs4 import BeautifulSoup

books_url = 'http://books.toscrape.com'
quotes_url = 'http://quotes.toscrape.com'
```

## Getting Page Content

The first challenge is to [get](https://requests.readthedocs.io/en/master/user/quickstart/#make-a-request) the page content. For now, let's just look at the Quotes site.
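
One possible starting point is a small helper around `requests.get`. The name `get_page` and the optional `session` argument are hypothetical conveniences, not part of Requests; the timeout value is an arbitrary assumption.

```python
import requests


def get_page(url, session=None, timeout=10):
    """Fetch a URL and return the response, raising on HTTP error codes.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections across many requests, or leave it as None.
    """
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response


# Example usage (performs a real HTTP request):
# response = get_page('http://quotes.toscrape.com')
# print(response.status_code)
# print(response.text[:200])
```

Reusing one session for the whole notebook tends to be faster than calling `requests.get` repeatedly, because the underlying connection can be kept open.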

### Metadata

What kind of information was returned in the response header?

What kind of information did we send in the request header?
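
As a hint for the request side, the sketch below prepares a request without sending it, which is enough to see the default headers Requests would attach. After a real request, the same information is available on the response object, as noted in the comments.

```python
import requests

quotes_url = 'http://quotes.toscrape.com'

# Build the request without sending it, so the outgoing headers can be inspected
session = requests.Session()
prepared = session.prepare_request(requests.Request('GET', quotes_url))

for name, value in prepared.headers.items():
    print(f'{name}: {value}')

# After an actual request, both sets of headers live on the response:
# response = session.get(quotes_url)
# response.headers          # headers the server sent back
# response.request.headers  # headers Requests sent
```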

### Actual Data

How many quotes are on the first page?
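
A rough sketch of one way to count and collect quotes follows. It runs against a hypothetical HTML snippet rather than the live page, and the selectors (`div.quote`, `span.text`, `small.author`, `a.tag`) are assumptions about the markup that you should confirm in your browser's inspector. `parse_quotes` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for response.text
sample_html = """
<div class="quote">
  <span class="text">Sample quote one.</span>
  <span>by <small class="author">Author A</small></span>
  <div class="tags">
    <a class="tag" href="/tag/life/">life</a>
    <a class="tag" href="/tag/love/">love</a>
  </div>
</div>
<div class="quote">
  <span class="text">Sample quote two.</span>
  <span>by <small class="author">Author B</small></span>
  <div class="tags">
    <a class="tag" href="/tag/life/">life</a>
  </div>
</div>
"""


def parse_quotes(html):
    """Return one dict per quote block found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    quotes = []
    for block in soup.select('div.quote'):
        quotes.append({
            'text': block.select_one('span.text').get_text(strip=True),
            'author': block.select_one('small.author').get_text(strip=True),
            'tags': [a.get_text(strip=True) for a in block.select('a.tag')],
        })
    return quotes


quotes = parse_quotes(sample_html)
print(len(quotes))
```

With a real response, `parse_quotes(response.text)` would be the equivalent call, and `len()` of the result answers the question.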

## Pagination

How many pages of quotes are there to scrape?

How many quotes are there to scrape in total? Store them in a list for further processing; this avoids repeating slow HTTP requests.
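
One way to walk through the pages is to follow the "next" link until it disappears. The sketch below assumes the link sits in `li.next > a`, which is a guess about the markup, and it takes the fetching function as an argument so the loop can be tried on a fake two-page site without any network access. `iter_pages` and `fake_site` are hypothetical names.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_pages(start_url, fetch_html):
    """Yield (url, html) for each page, following the assumed 'next' link."""
    url = start_url
    while url:
        html = fetch_html(url)
        yield url, html
        soup = BeautifulSoup(html, 'html.parser')
        next_link = soup.select_one('li.next > a')
        # Relative links such as /page/2/ are resolved against the current URL
        url = urljoin(url, next_link['href']) if next_link else None


# Fake two-page site used only to exercise the loop offline
fake_site = {
    'http://example.test/': '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>',
    'http://example.test/page/2/': '<ul></ul>',
}

pages = list(iter_pages('http://example.test/', fake_site.__getitem__))
print(len(pages))
```

Against the real site, `fetch_html` could be something like `lambda url: requests.get(url).text`, and each page's HTML can be parsed and appended to one list as you go.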

### Tags

How many tags are there?

What are the top 20 tags? How many quotes does each tag have? What is its URL?
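
Once the quotes are in a list, `collections.Counter` does most of the counting work. The data below is hypothetical, and storing each tag as a `(name, href)` pair is just one possible layout.

```python
from collections import Counter
from urllib.parse import urljoin

quotes_url = 'http://quotes.toscrape.com'

# Hypothetical scraped data: each quote carries (tag name, relative link) pairs
quotes = [
    {'author': 'Author A', 'tags': [('life', '/tag/life/'), ('love', '/tag/love/')]},
    {'author': 'Author B', 'tags': [('life', '/tag/life/')]},
    {'author': 'Author A', 'tags': [('humor', '/tag/humor/'), ('life', '/tag/life/')]},
]

tag_counts = Counter()
tag_urls = {}
for quote in quotes:
    for name, href in quote['tags']:
        tag_counts[name] += 1
        tag_urls[name] = urljoin(quotes_url, href)  # make the link absolute

print(len(tag_counts))  # number of distinct tags
for name, count in tag_counts.most_common(20):
    print(name, count, tag_urls[name])
```

The same pattern carries over to authors: count on the author name instead of the tag name.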

### Authors

How many authors are there?

What are the top 20 authors? How many quotes does each author have? What is each author's URL?

Who is the oldest author? Who is the youngest?
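
Comparing ages requires turning the birth date text into real dates. The sketch below assumes the dates are written like `March 14, 1879` (format `%B %d, %Y`), reads "oldest" as "earliest birth date", and uses made-up authors and dates.

```python
from datetime import datetime

# Hypothetical authors and birth dates, in the assumed "Month day, year" format
born_text = {
    'Author A': 'March 14, 1879',
    'Author B': 'December 16, 1775',
    'Author C': 'July 31, 1965',
}


def parse_born(text):
    """Convert text such as 'March 14, 1879' into a date object."""
    return datetime.strptime(text, '%B %d, %Y').date()


born = {name: parse_born(text) for name, text in born_text.items()}

oldest = min(born, key=born.get)    # earliest birth date
youngest = max(born, key=born.get)  # latest birth date
print(oldest, born[oldest])
print(youngest, born[youngest])
```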

Where were the authors born? Which countries have the most authors?
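
For the country question, a simple heuristic is to take the last comma-separated piece of the birthplace text. This sketch assumes the text looks like `in Town, Country`, which may not hold for every author; the locations are invented and `country_of` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical birthplace strings in the assumed "in Town, Country" shape
born_locations = [
    'in Town A, Country X',
    'in Town B, Region Q, Country Y',
    'in Town C, Country X',
]


def country_of(location):
    """Return the last comma-separated part, without a leading 'in '."""
    text = location.strip()
    if text.startswith('in '):
        text = text[3:]
    return text.rsplit(',', 1)[-1].strip()


country_counts = Counter(country_of(loc) for loc in born_locations)
print(country_counts.most_common())
```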

## Parsing JSON

How many pages does it take to scrape the whole API? Use [params](https://requests.readthedocs.io/en/master/user/quickstart/#passing-parameters-in-urls) to build the query string.

The API can be found at <http://quotes.toscrape.com/api/quotes>.
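
The sketch below shows how `params` ends up in the URL, and one way to loop over pages. The `page` parameter and the `quotes` and `has_next` keys in the JSON payload are assumptions about the API that you should verify by looking at one response first. `scrape_api` and `get_json_live` are hypothetical helper names, and the fake payloads exist only so the loop can be tried offline.

```python
import requests

api_url = 'http://quotes.toscrape.com/api/quotes'

# Prepare (but do not send) a request to see the query string built from params
prepared = requests.Request('GET', api_url, params={'page': 2}).prepare()
print(prepared.url)


def scrape_api(get_json, start_page=1):
    """Collect quotes page by page until the payload reports no next page.

    Returns (all_quotes, number_of_pages_fetched).
    """
    quotes, page = [], start_page
    while True:
        payload = get_json(page)
        quotes.extend(payload['quotes'])   # assumed key
        if not payload.get('has_next'):    # assumed key
            break
        page += 1
    return quotes, page - start_page + 1


def get_json_live(page):
    """Fetch one page of the real API (performs an HTTP request)."""
    response = requests.get(api_url, params={'page': page}, timeout=10)
    response.raise_for_status()
    return response.json()


# Offline trial with fake payloads; swap in get_json_live for the real thing
fake_api = {
    1: {'quotes': ['q1', 'q2'], 'has_next': True},
    2: {'quotes': ['q3'], 'has_next': False},
}
all_quotes, page_count = scrape_api(fake_api.__getitem__)
print(len(all_quotes), page_count)
```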

How many quotes are available on the API?

How many unique tags are in the API?

What are the top 20 tags? Do they match the previous result?

How many unique authors are in the API?

What are the top 20 authors? Do they match the previous result?