# Unit 2 Advanced Link Navigation and URL Management in Web Scraping

# Topic Overview

Welcome! In this lesson, we'll delve into advanced link navigation and URL management in web scraping with Python and BeautifulSoup. The goal is for you to navigate between linked web pages and manage URLs effectively for scalable web scraping.

## Navigating Author Details

To solidify your understanding of link navigation, we'll focus on a scenario where you scrape quotes from a website and navigate to author pages to extract additional information. This process involves extracting links from the main page, navigating to the linked pages, and scraping data from those pages. The following code scrapes quotes from the main page and navigates to each author page for more information:

```python
import requests
from bs4 import BeautifulSoup

def scrape_quotes(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    quotes = soup.select('.quote')
    for quote in quotes:
        text = quote.select_one('.text').get_text()
        author = quote.select_one('.author').get_text()
        print(f'{text} - {author}')
        endpoint_to_about_page = quote.select_one('span a')['href']
        url_to_about_page = base_url + endpoint_to_about_page

        response = requests.get(url_to_about_page)
        soup_about = BeautifulSoup(response.text, 'html.parser')
        born_date = soup_about.select_one('.author-born-date').get_text()
        born_location = soup_about.select_one('.author-born-location').get_text()
        print(f'{author} was born on {born_date} in {born_location}\n')

base_url = 'http://quotes.toscrape.com'
scrape_quotes(base_url)
```

First, we import the necessary libraries and define a soup object for the main page.

Then, we extract quotes from the main page and iterate over each quote to extract text and author information.

After that, we extract the endpoint to the author's page and construct the full URL:

```python
endpoint_to_about_page = quote.select_one('span a')['href']
url_to_about_page = base_url + endpoint_to_about_page
```

Remember that `select_one()` returns the first matching element. Indexing that element with `['href']` reads the link's `href` attribute, which holds the endpoint.
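Plain concatenation works here because `base_url` has no trailing slash and the `href` starts with one. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves a link against the page it was found on, whether the link is root-relative, page-relative, or already absolute. The paths below are illustrative values, not output captured from the site.

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com'

# Root-relative href, like the one used for author pages
root_relative = urljoin(base_url, '/author/Albert-Einstein')

# A trailing slash on the base no longer produces a double slash
with_slash = urljoin(base_url + '/', '/author/Albert-Einstein')

# An href that is already absolute is returned unchanged
absolute = urljoin(base_url, 'http://example.com/page')

# A page-relative href resolves against the directory of the current page
page_relative = urljoin('http://quotes.toscrape.com/page/2/', 'next.html')

print(root_relative)
print(with_slash)
print(absolute)
print(page_relative)
```

With `urljoin`, the same line of code keeps working if the base URL gains a trailing slash or a site mixes relative and absolute links.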

Once we have the full URL, we send a request to the author's page and create a new soup object to extract additional information:

```python
response = requests.get(url_to_about_page)
soup_about = BeautifulSoup(response.text, 'html.parser')
born_date = soup_about.select_one('.author-born-date').get_text()
born_location = soup_about.select_one('.author-born-location').get_text()
print(f'{author} was born on {born_date} in {born_location}\n')
```

Notice that this snippet also uses `select_one()` to extract the author's birth date and location.
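`select_one()` returns `None` when nothing matches, so chaining `.get_text()` directly raises `AttributeError` on a page that lacks the element. A minimal sketch of a guard follows. The helper name `text_or_default` and the inline HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

def text_or_default(soup, selector, default='unknown'):
    """Return the stripped text of the first match, or a default if nothing matches."""
    element = soup.select_one(selector)
    if element is None:
        return default
    return element.get_text(strip=True)

# Hypothetical author page that is missing its birth location
html = '<div><span class="author-born-date">March 14, 1879</span></div>'
soup_about = BeautifulSoup(html, 'html.parser')

born_date = text_or_default(soup_about, '.author-born-date')
born_location = text_or_default(soup_about, '.author-born-location')
print(born_date, '|', born_location)
```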

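The output will look like the following. The doubled "in in" appears because the text of the `.author-born-location` element already begins with "in ", and the f-string adds another: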

```
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” - Albert Einstein
Albert Einstein was born on March 14, 1879 in in Ulm, Germany

“It is our choices, Harry, that show what we truly are, far more than our abilities.” - J.K. Rowling
J.K. Rowling was born on July 31, 1965 in in Yate, South Gloucestershire, England, The United Kingdom

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” - Albert Einstein
Albert Einstein was born on March 14, 1879 in in Ulm, Germany

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” - Jane Austen
Jane Austen was born on December 16, 1775 in in Steventon Rectory, Hampshire, The United Kingdom
...
```
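In the output above, Albert Einstein appears more than once, so his author page is requested repeatedly. One way to avoid this, sketched here under the assumption that author details do not change during a run, is to remember results by URL. `get_author_details`, `fake_fetch`, and the `cache` dictionary are hypothetical names. In the real scraper, the fetch function would wrap `requests.get` and the BeautifulSoup parsing.

```python
def get_author_details(url, fetch, cache):
    """Fetch details for a URL once; later calls reuse the cached result."""
    if url not in cache:
        cache[url] = fetch(url)
    return cache[url]

# Stand-in for the network call so the sketch runs offline
calls = []

def fake_fetch(url):
    calls.append(url)
    return {'born_date': 'March 14, 1879'}

cache = {}
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein'
first = get_author_details(einstein_url, fake_fetch, cache)
second = get_author_details(einstein_url, fake_fetch, cache)

print(len(calls))  # the fetch function ran only once
```

Passing the fetch function in as an argument keeps the caching logic separate from the network code, which also makes it easy to test offline.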

## Lesson Summary and Practice

In this lesson, we've covered advanced link navigation and URL management in web scraping using Python and BeautifulSoup. We examined and extracted links, navigated between pages, handled relative and absolute URLs, and applied these concepts in a detailed code example. These skills will enable you to handle more complex web scraping tasks effectively.

The exercises that follow will help you practice and deepen your understanding of link navigation and URL management in web scraping, enhancing your proficiency in scalable scraping projects. Happy Scraping!

## Navigate and Scrape with Python

Great work on the lesson! Now, let's run the code provided in the lesson to see how it works in action.

Running this code shows how it fetches quotes and authors, then navigates to each author page to extract additional information such as birth dates and locations.

```python
import requests
from bs4 import BeautifulSoup

def scrape_quotes(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    quotes = soup.select('.quote')

    for quote in quotes:
        text = quote.select_one('.text').get_text()
        author = quote.select_one('.author').get_text()
        print(f'{text} - {author}')

        endpoint_to_about_page = quote.select_one('span a')['href']
        url_to_about_page = base_url + endpoint_to_about_page

        response = requests.get(url_to_about_page)
        soup_about = BeautifulSoup(response.text, 'html.parser')
        born_date = soup_about.select_one('.author-born-date').get_text()
        born_location = soup_about.select_one('.author-born-location').get_text()

        # Remove leading "in " from born_location if present
        if born_location.startswith("in "):
            born_location = born_location[3:]

        print(f'{author} was born on {born_date} in {born_location}\n')

base_url = 'http://quotes.toscrape.com'
scrape_quotes(base_url)

```

## Adding Author Descriptions for More Detail

Great progress so far!

Now, let's update the code to retrieve and print the author description along with the birth date and location. Follow the TODO instructions to complete the task.

```python
import requests
from bs4 import BeautifulSoup

def scrape_quotes(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    quotes = soup.select('.quote')[:2]

    for quote in quotes:
        text = quote.select_one('.text').get_text()
        author = quote.select_one('.author').get_text()
        print(f'{text} - {author}')

        endpoint_to_about_page = quote.select_one('span a')['href']
        url_to_about_page = base_url + endpoint_to_about_page

        response = requests.get(url_to_about_page)
        soup_about = BeautifulSoup(response.text, 'html.parser')
        born_date = soup_about.select_one('.author-born-date').get_text()
        born_location = soup_about.select_one('.author-born-location').get_text()
        print(f'{author} was born on {born_date} in {born_location}\n')

        # TODO: Add another line to retrieve the author description from the nested URL

        # TODO: Print the author description. Hint: You can use the element with class 'author-description'

base_url = 'http://quotes.toscrape.com'
scrape_quotes(base_url)

```

```python
import requests
from bs4 import BeautifulSoup

def scrape_quotes(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    quotes = soup.select('.quote')[:2]

    for quote in quotes:
        text = quote.select_one('.text').get_text()
        author = quote.select_one('.author').get_text()
        print(f'{text} - {author}')

        endpoint_to_about_page = quote.select_one('span a')['href']
        url_to_about_page = base_url + endpoint_to_about_page

        response = requests.get(url_to_about_page)
        soup_about = BeautifulSoup(response.text, 'html.parser')
        born_date = soup_about.select_one('.author-born-date').get_text()
        born_location = soup_about.select_one('.author-born-location').get_text()
        print(f'{author} was born on {born_date} in {born_location}\n')

        # TODO: Add another line to retrieve the author description from the nested URL
        author_description = soup_about.select_one('.author-description').get_text()

        # TODO: Print the author description. Hint: You can use the element with class 'author-description'
        print(f'Description: {author_description}\n')

base_url = 'http://quotes.toscrape.com'
scrape_quotes(base_url)
```

## Navigating Linked Author Pages

You've done well navigating through links and extracting data.

Now, let's practice filling in the missing parts of a Python script that scrapes quotes and additional author information. Complete the script so that it navigates to author pages to extract their birth dates and locations.

```python
import requests
from bs4 import BeautifulSoup

def scrape_quotes(base_url):
    # Send a request to get the main page content
    response = requests.get(base_url)
    # Create a BeautifulSoup object from the response text
    soup = BeautifulSoup(response.text, 'html.parser')

    # Select all quotes on the main page
    quotes = soup.select('.quote')

    for quote in quotes:
        # Extract the text of the quote
        text = quote.select_one('.text').get_text()
        # Extract the author's name
        author = quote.select_one('.author').get_text()
        print(f'{text} - {author}')

        # Extract the relative URL to the author's page
        endpoint_to_about_page = quote.select_one('span a')['href']

        # Generate the full URL
        url_to_about_page = base_url + endpoint_to_about_page

        # Send a request to get the author's page content
        response = requests.get(url_to_about_page)

        # Create a BeautifulSoup object from the author's page
        soup_about = BeautifulSoup(response.text, 'html.parser')

        # TODO: Extract the author's birth date from the linked page using a CSS selector and get its text
        # Hint: The class name of the necessary element is 'author-born-date'

        # TODO: Extract the author's birth location from the linked page using a CSS selector and get its text
        # Hint: The class name of the necessary element is 'author-born-location'

        print(f'{author} was born on {born_date} in {born_location}\n')

base_url = 'http://quotes.toscrape.com'
scrape_quotes(base_url)
```

```python
import requests
from bs4 import BeautifulSoup

def scrape_quotes(base_url):
    # Send a request to get the main page content
    response = requests.get(base_url)
    # Create a BeautifulSoup object from the response text
    soup = BeautifulSoup(response.text, 'html.parser')

    # Select all quotes on the main page
    quotes = soup.select('.quote')

    for quote in quotes:
        # Extract the text of the quote
        text = quote.select_one('.text').get_text()
        # Extract the author's name
        author = quote.select_one('.author').get_text()
        print(f'{text} - {author}')

        # Extract the relative URL to the author's page
        endpoint_to_about_page = quote.select_one('span a')['href']

        # Generate the full URL
        url_to_about_page = base_url + endpoint_to_about_page

        # Send a request to get the author's page content
        response = requests.get(url_to_about_page)

        # Create a BeautifulSoup object from the author's page
        soup_about = BeautifulSoup(response.text, 'html.parser')

        # TODO: Extract the author's birth date from the linked page using a CSS selector and get its text
        # Hint: The class name of the necessary element is 'author-born-date'
        born_date = soup_about.select_one('.author-born-date').get_text()

        # TODO: Extract the author's birth location from the linked page using a CSS selector and get its text
        # Hint: The class name of the necessary element is 'author-born-location'
        born_location = soup_about.select_one('.author-born-location').get_text()

        print(f'{author} was born on {born_date} in {born_location}\n')

base_url = 'http://quotes.toscrape.com'
scrape_quotes(base_url)
```

## Scrape Books and Author Bios

Awesome work on the previous tasks!

Now, let's shift to scraping books. In this task, we scrape the books listing page, then navigate to each individual book page and retrieve information about that book.

Follow the TODO instructions to fetch the product information for each book.

```python
import requests
from bs4 import BeautifulSoup

def parse_product_info(product_page_url):
    product_html = requests.get(product_page_url)
    product_soup = BeautifulSoup(product_html.text, 'html.parser')

    table = product_soup.select_one(".table.table-striped")

    info = {}

    # TODO: Find all 'tr' elements in the table
    
    # TODO: Iterate over all the rows
        
        # TODO: Extract 'th' of each row as key

        # TODO: Extract 'td' of each row as value

        # TODO: Store the key-value pair in the info dictionary

    return info

def scrape_books(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    books = soup.select('.product_pod')

    for book in books:
        title = book.select_one('h3 a')['title']
        # The link in the book's heading points to its product page
        product_page_endpoint = book.select_one('h3 a')['href']
        product_page_url = base_url + product_page_endpoint

        info = parse_product_info(product_page_url)

        print(f'Book: {title}')
        print(f'Info: {info}')

base_url = 'http://books.toscrape.com/'
scrape_books(base_url)

```

```python
import requests
from bs4 import BeautifulSoup

def parse_product_info(product_page_url):
    product_html = requests.get(product_page_url)
    product_soup = BeautifulSoup(product_html.text, 'html.parser')

    table = product_soup.select_one(".table.table-striped")

    info = {}

    # TODO: Find all 'tr' elements in the table
    rows = table.find_all('tr')
    
    # TODO: Iterate over all the rows
    for row in rows:
        
        # TODO: Extract 'th' of each row as key
        key = row.find('th').get_text(strip=True)

        # TODO: Extract 'td' of each row as value
        value = row.find('td').get_text(strip=True)

        # TODO: Store the key-value pair in the info dictionary
        info[key] = value

    return info

def scrape_books(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    books = soup.select('.product_pod')

    for book in books:
        title = book.select_one('h3 a')['title']
        # The link in the book's heading points to its product page
        product_page_endpoint = book.select_one('h3 a')['href']
        product_page_url = base_url + product_page_endpoint

        info = parse_product_info(product_page_url)

        print(f'Book: {title}')
        print(f'Info: {info}')

base_url = 'http://books.toscrape.com/'
scrape_books(base_url)
```
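The row-parsing step can also be tried offline against a small inline table. This sketch assumes each data row has one `th` and one `td`, and skips rows that do not. The HTML and its values are invented for the example, not copied from the site.

```python
from bs4 import BeautifulSoup

html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>abc123</td></tr>
  <tr><th>Availability</th><td> In stock </td></tr>
  <tr><td>row without a header cell</td></tr>
</table>
"""

table = BeautifulSoup(html, 'html.parser').select_one('.table.table-striped')

info = {}
for row in table.find_all('tr'):
    header = row.find('th')
    cell = row.find('td')
    # Skip malformed rows instead of raising AttributeError on None
    if header is None or cell is None:
        continue
    info[header.get_text(strip=True)] = cell.get_text(strip=True)

print(info)
```

Testing the parsing logic on a fixed snippet like this separates selector mistakes from network problems.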

## Navigate and Scrape Author Details

You've completed several exercises to practice navigating through links and extracting detailed information.

Now, it's time to put all that knowledge into action. Write a Python script using BeautifulSoup to scrape quotes from a webpage and navigate to the author's page to extract additional details.

Remember, great scraping starts with a great website analysis, so don't hesitate to open the URLs in the web browser and inspect their content.

```python
import requests
from bs4 import BeautifulSoup

# TODO: Define the function scrape_quotes(base_url)

    # TODO: Send a request to the main URL and get the page content

    # TODO: Create a BeautifulSoup object from the response text

    # TODO: Find all quotes in the main page using the appropriate CSS selector

    # TODO: Iterate over each quote and extract the text and author

    # TODO: For each quote, find the relative URL to the author's page

        # TODO: Create the full URL to the author's page by concatenating it with the base URL

        # TODO: Send a request to the author's page URL

        # TODO: Create a BeautifulSoup object from the author's page content

        # TODO: Extract the author's birth date

        # TODO: Extract the author's birth location

        # TODO: Print the author's name, birth date, and birth location

base_url = 'http://quotes.toscrape.com'
scrape_quotes(base_url)

```

```python
import requests
from bs4 import BeautifulSoup

def scrape_quotes(base_url):
    """
    Scrapes quotes from a webpage, navigates to the author's page, and
    extracts additional details about the author.
    """
    # Send a request to the main URL and get the page content
    main_page_response = requests.get(base_url)
    
    # Check if the request was successful
    if main_page_response.status_code != 200:
        print(f"Failed to retrieve the main page. Status code: {main_page_response.status_code}")
        return

    # Create a BeautifulSoup object from the response text
    soup = BeautifulSoup(main_page_response.text, 'html.parser')

    # Find all quotes in the main page using the appropriate CSS selector
    quotes = soup.find_all('div', class_='quote')

    # Iterate over each quote and extract the text and author
    for quote in quotes:
        text = quote.find('span', class_='text').get_text(strip=True)
        author_name = quote.find('small', class_='author').get_text(strip=True)
        
        # For each quote, find the relative URL to the author's page
        author_link_tag = quote.find('a', href=True)
        if author_link_tag:
            relative_url = author_link_tag['href']

            # Create the full URL to the author's page by concatenating it with the base URL
            author_url = base_url + relative_url

            # Send a request to the author's page URL
            author_page_response = requests.get(author_url)

            # Check if the request to the author's page was successful
            if author_page_response.status_code == 200:
                # Create a BeautifulSoup object from the author's page content
                author_soup = BeautifulSoup(author_page_response.text, 'html.parser')

                # Extract the author's birth date
                born_date = author_soup.find('span', class_='author-born-date').get_text(strip=True)

                # Extract the author's birth location
                born_location = author_soup.find('span', class_='author-born-location').get_text(strip=True)

                # Print the author's name, birth date, and birth location
                print(f"Quote: {text}")
                print(f"Author: {author_name}")
                print(f"Born: {born_date} {born_location}")
                print("-" * 20)
            else:
                print(f"Failed to retrieve author page for {author_name}. Status code: {author_page_response.status_code}")

base_url = 'http://quotes.toscrape.com'
scrape_quotes(base_url)
```
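The two status checks in the solution can be folded into one helper. This is a sketch rather than the lesson's method: `fetch_soup` is a hypothetical name, the 10-second timeout is an arbitrary choice, and the function expects a `requests.Session` (or any object with a compatible `get` method) to be passed in.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(session, url, timeout=10):
    """Return a parsed page, or None if the request fails or returns an error status."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.RequestException as exc:
        print(f'Skipping {url}: {exc}')
        return None
    return BeautifulSoup(response.text, 'html.parser')

# Typical use (performs real network requests):
# with requests.Session() as session:
#     soup = fetch_soup(session, 'http://quotes.toscrape.com')
```

Reusing one session across the main page and the author pages lets `requests` keep the connection open, and the timeout prevents a stalled server from hanging the whole run.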