# Web Scraping Challenges

For these exercises, we will parse information from [Quotes to Scrape](http://quotes.toscrape.com/) and [Books to Scrape](http://books.toscrape.com/). Both sites are built and offered as testing grounds for learning web scraping techniques. We will focus on using [Requests](https://requests.readthedocs.io/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/).

If you are looking for an extra challenge, here are a few things you can try:
* Only parse content with:
    * BeautifulSoup [`find()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) and [`find_all()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) methods
    * [CSS selectors](https://www.w3.org/TR/selectors-4/) (BeautifulSoup)
    * [XPath selectors](https://www.w3.org/TR/2017/REC-xpath-31-20170321/) (lxml)
* Use [Selenium](https://selenium.dev/selenium/docs/api/py/) to scrape a [JavaScript version of Quotes to Scrape](http://quotes.toscrape.com/js/)
* Use [urllib](https://docs.python.org/3/library/urllib.html) from the Python Standard Library instead of Requests
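
To make the first two parsing options concrete, here is a minimal sketch that runs the same query once with `find_all()` and once with a CSS selector. The HTML is a hand-written stand-in that only loosely imitates the quote markup, so treat the tag and class names as assumptions rather than a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration; not copied from the site.
html = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author B</small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(): filter by tag name and class (class_ avoids the reserved word).
via_find = [tag.get_text() for tag in soup.find_all("small", class_="author")]

# select(): the same query written as a CSS selector.
via_css = [tag.get_text() for tag in soup.select("div.quote small.author")]

print(via_find)
print(via_css)
```

Both lists should hold the same author names, so the choice between the two styles is mostly a matter of which reads better for a given query.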

## Getting started

To get started, we need to import the necessary modules. Unless you're attempting one of the challenges, we'll give you this one for free. However, you might find it useful to import some other modules later to help answer some questions.

In [None]:
import requests
from bs4 import BeautifulSoup

books_url = 'http://books.toscrape.com'
quotes_url = 'http://quotes.toscrape.com'

In [None]:
# Additional Imports
from collections import Counter
from datetime import datetime
from itertools import chain
from tabulate import tabulate
from urllib.parse import urljoin

## Getting Page Content

The first challenge is to [get](https://requests.readthedocs.io/en/master/user/quickstart/#make-a-request) the page content. For now, let's just look at the Quotes site.

In [None]:
response = requests.get(quotes_url)
response

### Metadata

What kind of information was returned in the response header?

In [None]:
print(tabulate(response.headers.items()))

What kind of information did we send in the request header?

In [None]:
print(tabulate(response.request.headers.items()))
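
If you want to see or change the request headers before anything goes over the network, one option is to prepare the request through a `Session` without sending it. This is only a sketch: the User-Agent string is a made-up value for illustration.

```python
import requests

# Hypothetical User-Agent string, used only for illustration.
custom_headers = {"User-Agent": "scraping-workshop/0.1"}

session = requests.Session()
request = requests.Request("GET", "http://quotes.toscrape.com", headers=custom_headers)

# prepare_request() merges the session defaults with our headers; nothing is sent.
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Calling `session.send(prepared)` would then perform the request with exactly these headers.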

### Actual Data

How many quotes are on the first page?

In [None]:
soup = BeautifulSoup(response.content, 'html.parser')

items = soup.select('div.quote')
print(f"The first page has {len(items)} quotes")
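
Counting the items is only the first step; each `div.quote` can also be broken down into its fields. The sketch below does this on a simplified, hand-written quote block. The selectors match the ones used elsewhere in this notebook, but the sample markup and its values are assumptions, so the live page may need adjustments.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one quote block; the live markup may differ.
html = """
<div class="quote">
  <span class="text">An example quote.</span>
  <span>by <small class="author">Example Author</small>
    <a href="/author/Example-Author">(about)</a>
  </span>
  <div class="tags">
    <a class="tag" href="/tag/example/page/1/">example</a>
    <a class="tag" href="/tag/sample/page/1/">sample</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.select_one("div.quote")

# Pull each field out of the quote block with a narrow selector.
record = {
    "text": item.select_one("span.text").get_text(strip=True),
    "author": item.select_one("small.author").get_text(strip=True),
    "author_url": item.select_one("span a")["href"],
    "tags": [tag.get_text(strip=True) for tag in item.select("a.tag")],
}
print(record)
```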

## Pagination

How many pages of quotes are there to scrape?

In [None]:
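next_page = '/page/1/'
while next_page:
    response = requests.get(urljoin(quotes_url, next_page))
    soup = BeautifulSoup(response.content, 'html.parser')
    try:
        # select_one() returns None on the last page, so indexing it raises TypeError
        next_page = soup.select_one('li.next a')['href']
    except TypeError:
        break

# next_page still holds the path of the last page, e.g. '/page/N/'
print(f"There are {next_page.split('/')[2]} pages to scrape.")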
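
The same loop can be written without relying on the `TypeError`, by testing the result of `select_one()` directly and counting pages as they are visited. The sketch below is an alternative under stated assumptions: `count_pages`, `FAKE_PAGES`, and the injected `fetch` function are hypothetical names, and the in-memory pages stand in for real HTTP responses so the example runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "http://quotes.toscrape.com"

# Hypothetical in-memory pages standing in for real HTTP responses.
FAKE_PAGES = {
    f"{BASE}/page/1/": '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>',
    f"{BASE}/page/2/": '<ul><li class="next"><a href="/page/3/">Next</a></li></ul>',
    f"{BASE}/page/3/": "<ul></ul>",  # last page: no "next" link
}


def count_pages(fetch, start="/page/1/"):
    """Follow 'next' links until none is left and return the page count."""
    pages = 0
    next_page = start
    while next_page:
        soup = BeautifulSoup(fetch(urljoin(BASE, next_page)), "html.parser")
        pages += 1
        link = soup.select_one("li.next a")
        # select_one() returns None when nothing matches, so test it directly.
        next_page = link["href"] if link is not None else None
    return pages


print(count_pages(FAKE_PAGES.__getitem__))
```

Against the real site, `fetch` could be something like `lambda url: requests.get(url).content`.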

How many quotes are there to scrape? Store them in a list for further processing, so that later questions do not need to repeat the slow HTTP requests.

In [None]:
quotes = []

next_page = '/page/1/'
while next_page:
    response = requests.get(urljoin(quotes_url, next_page))
    soup = BeautifulSoup(response.content, 'html.parser')
    quotes.extend(soup.select('div.quote'))
    try:
        next_page = soup.select_one('li.next a')['href']
    except TypeError:
        break
        
print(f'There are {len(quotes)} quotes to process.')

### Tags
How many unique tags are there?

In [None]:
tags = Counter(
    chain.from_iterable([
        quote.select('a.tag') 
        for quote in quotes
    ])
)

print(f'There are {len(tags)} tags')
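
The counting pattern itself, `chain.from_iterable()` to flatten and `Counter` to tally, is easier to see on plain strings. The sketch below uses made-up tag names; keying the counter on tag text instead of on parsed elements is one alternative that also makes the result easy to compare with other sources later.

```python
from collections import Counter
from itertools import chain

# Hypothetical tag names per quote, made up for illustration.
tags_per_quote = [
    ["love", "life"],
    ["life", "humor"],
    ["love", "life", "books"],
]

# Flatten the nested lists into one stream of names, then tally them.
tag_counts = Counter(chain.from_iterable(tags_per_quote))

print(len(tag_counts))            # number of distinct tags
print(tag_counts.most_common(2))  # the two most frequent tags with their counts
```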

What are the top 20 tags? How many quotes does each tag have? What is its URL?

In [None]:
# headers='keys' labels each column with its dictionary key
print(tabulate([
    {
        'tag': tag.text,
        'num': count,
        'url': tag['href'],
    }
    for tag, count in tags.most_common(20)
], headers='keys'))

### Authors

How many authors are there?

In [None]:
authors = Counter([
    quote.select_one('small.author').text
    for quote in quotes
])
    
print(f'There are {len(authors)} authors')

What are the top 20 authors? How many quotes does each author have? What is their URL?

In [None]:
authors_dict = {name: {'count': count} for name, count in authors.most_common()}
for quote in quotes:
    authors_dict[quote.select_one('small.author').text]['url'] = quote.select_one('span a')['href']

for auth in authors_dict:
    authors_dict[auth]['name'] = auth

authors_list = list(authors_dict.values())

print(tabulate(authors_list[:20]))

Who is the oldest author? Who is the youngest?

In [None]:
for author in authors_list:
    # One request per author page
    response = requests.get(urljoin(quotes_url, author['url']))
    soup = BeautifulSoup(response.content, 'html.parser')
    author['birthday'] = datetime.strptime(
        soup.select_one('.author-born-date').text,
        '%B %d, %Y'
    )
    # Drop the leading 'in ' from the birthplace text
    author['home'] = soup.select_one('.author-born-location').text[3:]
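
The two parsing steps in that loop can be tried in isolation. The sketch below uses sample strings shaped like the text of the two page elements; they are illustrative values, not fetched from the site. Note that `%B` matches month names in the current locale, so this assumes an English locale.

```python
from datetime import datetime

# Sample strings shaped like the two page elements (illustrative values).
born_date = "March 14, 1879"
born_location = "in Ulm, Germany"

# '%B %d, %Y' reads a full month name, a day, and a four-digit year.
birthday = datetime.strptime(born_date, "%B %d, %Y")

# Strip the leading 'in ' only if it is really there.
home = born_location[3:] if born_location.startswith("in ") else born_location

# The country is the last comma-separated part of the location.
country = home.split(", ")[-1]

print(birthday.date(), home, country)
```

Checking for the prefix before slicing is a small safeguard in case a page formats the location differently.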

In [None]:
by_birthday = sorted(authors_list, key=lambda i: i['birthday'])

print(f"The oldest author is {by_birthday[0]['name']} born on {by_birthday[0]['birthday'].strftime('%B %d, %Y')}")
print(f"The youngest author is {by_birthday[-1]['name']} born on {by_birthday[-1]['birthday'].strftime('%B %d, %Y')}")
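
When only the two extremes are needed, `min()` and `max()` with a `key` function give the same answer without sorting the whole list. This is a small sketch on made-up records that use the same `name` and `birthday` keys as the cell above.

```python
from datetime import datetime

# Made-up records with the same keys as the author dictionaries above.
sample_authors = [
    {"name": "Author A", "birthday": datetime(1835, 11, 30)},
    {"name": "Author B", "birthday": datetime(1965, 7, 31)},
    {"name": "Author C", "birthday": datetime(1775, 12, 16)},
]

# The earliest birthday belongs to the oldest author, the latest to the youngest.
oldest = min(sample_authors, key=lambda author: author["birthday"])
youngest = max(sample_authors, key=lambda author: author["birthday"])

print(oldest["name"], youngest["name"])
```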

In which countries were the most authors born?

In [None]:
print(tabulate(
    Counter([
        auth['home'].split(', ')[-1]
        for auth in authors_list
    ]).most_common()
))

## Parsing JSON

How many pages does it take to scrape the whole API? Use [params](https://requests.readthedocs.io/en/master/user/quickstart/#passing-parameters-in-urls) to build the query string.

The API can be found at http://quotes.toscrape.com/api/quotes
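
To see what `params` does to the URL, you can prepare a request and print the result without sending anything. This is a minimal sketch; the page number is an arbitrary example value.

```python
import requests

# Build the request but do not send it; 2 is an arbitrary example page number.
request = requests.Request(
    "GET",
    "http://quotes.toscrape.com/api/quotes",
    params={"page": 2},
)
prepared = request.prepare()

# The params dictionary is encoded into the query string.
print(prepared.url)
```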

In [None]:
quotes_url_json = 'http://quotes.toscrape.com/api/quotes'

next_page = 1
while next_page:
    response = requests.get(quotes_url_json, params={'page': next_page})
    data = response.json()
    if data['has_next']:
        next_page += 1
    else:
        break
        
print(f"There are {next_page} pages to scrape.")
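
The `has_next` loop can also be wrapped in a generator, which keeps the paging logic in one place and lets the caller decide what to do with each page. The sketch below is an assumption-laden illustration: `iter_pages` and `FAKE_API` are hypothetical names, and the fake responses only mimic the `has_next` and `quotes` fields used in this notebook so the code runs offline.

```python
# Hypothetical API responses keyed by page number, mimicking only the
# 'has_next' and 'quotes' fields.
FAKE_API = {
    1: {"has_next": True, "quotes": [{"text": "q1"}, {"text": "q2"}]},
    2: {"has_next": True, "quotes": [{"text": "q3"}, {"text": "q4"}]},
    3: {"has_next": False, "quotes": [{"text": "q5"}]},
}


def iter_pages(fetch_page):
    """Yield each decoded page until 'has_next' is false."""
    page = 1
    while True:
        data = fetch_page(page)
        yield data
        if not data["has_next"]:
            break
        page += 1


pages = list(iter_pages(FAKE_API.__getitem__))
all_quotes = [quote for data in pages for quote in data["quotes"]]

print(len(pages), len(all_quotes))
```

With the real API, `fetch_page` could be something like `lambda page: requests.get(quotes_url_json, params={'page': page}).json()`.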

How many quotes are available on the API?

In [None]:
quotes = []

next_page = 1
while next_page:
    response = requests.get(quotes_url_json, params={'page': next_page})
    data = response.json()
    quotes.extend(data['quotes'])
    if data['has_next']:
        next_page += 1
    else:
        break
        
print(f'There are {len(quotes)} quotes to process.')

How many unique tags are in the API?

In [None]:
tags = Counter(
    chain.from_iterable([
        quote['tags']
        for quote in quotes
    ])
)

print(f'There are {len(tags)} tags')

What are the top 20 tags? Do they match the previous result?

In [None]:
print(tabulate(
    tags.most_common(20)
))
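
Comparing the two results by eye gets tedious, so here is one way to check it in code, assuming both counters are keyed by tag name. The counts below are made up for illustration; `html_tags` and `api_tags` are hypothetical names for the two results.

```python
from collections import Counter

# Made-up counts standing in for the HTML-based and API-based results.
html_tags = Counter({"love": 3, "life": 2, "humor": 1})
api_tags = Counter({"love": 3, "life": 2, "books": 1})

# Counters compare equal when every key has the same count.
same = html_tags == api_tags

# Subtraction keeps only the positive differences in each direction.
only_html = html_tags - api_tags
only_api = api_tags - html_tags

print(same, dict(only_html), dict(only_api))
```

Empty differences in both directions mean the two sources agree.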

How many unique authors are in the API?

In [None]:
authors = Counter([
    quote['author']['name']
    for quote in quotes
])
    
print(f'There are {len(authors)} authors')

What are the top 20 authors? Do they match the previous result?

In [None]:
print(tabulate(
    authors.most_common(20)
))