# Unit 3 Scraping HTML Lists with Beautiful Soup

# Web Scraping HTML Lists

Welcome! In this lesson, we focus on scraping **HTML lists** with Beautiful Soup. Let's start with a brief introduction to HTML lists and why they matter for web scraping.

-----

## HTML Lists Overview

HTML lists are used to display a series of items in a structured manner. Broadly, there are two types of lists:

  * **Ordered Lists (`<ol>`)**: These lists are numbered (e.g., 1, 2, 3).
  * **Unordered Lists (`<ul>`)**: These lists are bulleted (e.g., •, •, •).

Each item in these lists is enclosed within `<li>` tags. Lists commonly appear on web pages as navigation menus, product listings, and similar repeated structures, which makes them convenient targets for web scraping.

Here is an example of an ordered list:

```html
<ol>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ol>
```
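
As a minimal sketch of how such a list looks from Beautiful Soup's side, the snippet below parses the same markup from an inline string (so no network request is involved) and collects the text of each `<li>`. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Inline copy of the ordered list shown above, so no network access is needed.
html = """
<ol>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# select("ol li") returns every <li> nested inside an <ol>.
items = [li.get_text(strip=True) for li in soup.select("ol li")]
print(items)
```

The same call works for an unordered list if the selector is changed to `"ul li"`.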

-----

## Loading Libraries and Fetching the Webpage

We start by importing the required libraries and fetching the HTML content of the webpage.

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
```

Next, we use a CSS selector to identify the specific list containing the books: `soup.select(".page_inner section ol li")`. This selects all `<li>` elements that are descendants of `.page_inner section ol`.

With that, we can loop through the selected items and extract the book titles:

```python
books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    print(title)
```

**Explanation of the code:**

  * `book.select("article h3 a")[0]`: Selects the `<a>` tag inside the `<h3>` of the `<article>` tag. Note that `select` returns a list, so we use `[0]` to access the first element.
  * `book.select("article h3 a")[0]["title"]`: Extracts the `title` attribute of the `<a>` tag.
  * `print(title)`: Prints the extracted book title.
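
To see the difference between `select` and `select_one`, and between the `title` attribute and the visible link text, here is a small offline sketch. The HTML is a simplified, hand-written stand-in for one book entry (the `href` value and the shortened link text are assumptions made for illustration), not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a single book entry; the real page has more markup.
html = """
<li>
  <article class="product_pod">
    <h3><a href="example/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  </article>
</li>
"""
book = BeautifulSoup(html, "html.parser")

links = book.select("article h3 a")           # list of all matching tags
first_link = book.select_one("article h3 a")  # first match only, or None

title_from_list = links[0]["title"]    # index the list, then read the attribute
title_from_one = first_link["title"]   # same result without the [0]
visible_text = first_link.get_text()   # text between <a> and </a>

print(title_from_list)
print(visible_text)
```

When the link text is shortened like this, reading the `title` attribute is the more reliable way to get the full book title.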

The output will display the titles of the books listed on the webpage:

```
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
...
```

-----

## Summary

In this lesson on HTML lists, we explored the basics of HTML lists and their significance in web scraping. We also learned how to fetch a webpage, identify specific lists using CSS selectors, and extract information from the selected list items. This knowledge will be invaluable as we proceed with more advanced web scraping techniques.

Now, let's put this knowledge into practice with some hands-on exercises!

## Run Web Scraping Code

Nice work on understanding the basics of HTML lists!

Let's run the code you saw in the lesson to see how it works in practice.

We will fetch and parse an HTML page to extract book titles, using the `requests` library to fetch the page content and `BeautifulSoup` to parse the HTML. This task shows how to use CSS selectors to navigate an HTML structure and extract specific information.

Here's the code that fetches and prints the titles of books listed on the "Books to Scrape" website:

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    print(title)

```

## Enhance Web Scraping Skills

Great job so far! You've learned how to extract book titles from an HTML list.

Now, let's enhance our scraping script to include the price of each book.

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    print(title)

    # TODO: Extract and print the price of each book alongside its title.
    # Hint: The price is in the element with class "price_color", nested inside the element with class "product_price", within each book article.

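```

One possible solution uses `select_one` to grab the first matching price element inside each book and reads its text:

```python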
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    # Solution: Select the price element (.product_price .price_color) inside each book and read its text.
    price = book.select_one(".product_price .price_color").text
    print(f"Title: {title}, Price: {price}")
```
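
The scraped price is still a string with a currency symbol, and depending on how the response was decoded the symbol may be preceded by a stray character. If you want to do arithmetic on prices, a sketch like the following can help. `parse_price` is a hypothetical helper written for this example, and the input strings are sample values, not live output.

```python
import re
from decimal import Decimal


def parse_price(price_text):
    """Return the numeric part of a price string such as '£51.77' as a Decimal."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"No number found in {price_text!r}")
    return Decimal(match.group())


# Sample strings, including one with a stray leading character and one with spaces.
prices = [parse_price(text) for text in ["£51.77", "Â£53.74", " £50.10 "]]
print(prices)
print(sum(prices))
```

`Decimal` is used here instead of `float` so that sums of prices do not pick up binary rounding errors.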

## Add Book Availability Status

You've done great so far by extracting book titles from the list.

Now, let's modify the script to also include the availability status of each book. The availability status is in the tag whose `class` attribute is `instock availability`.

Update the current code to include the availability status in the output.

Before you proceed, note how CSS selectors handle a `class` attribute that contains spaces. A class name itself cannot contain spaces, so a space-separated value means the element has multiple classes. For example, `class="instock availability"` gives the element two classes: `instock` and `availability`. To select an element that has both, chain the class selectors with no space between them: `.instock.availability`.
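
The difference between chaining classes and separating them with a space is easy to get wrong, so here is a small offline sketch. The HTML is hand-written for illustration and only loosely resembles the real page.

```python
from bs4 import BeautifulSoup

html = """
<article>
  <p class="instock availability">In stock</p>
  <div class="instock"><span class="availability">Nested</span></div>
  <p class="availability">Only one class</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# No space: one element that carries BOTH classes.
both_classes = [el.get_text(strip=True) for el in soup.select(".instock.availability")]

# With a space: an .availability element nested INSIDE an .instock element.
nested = [el.get_text(strip=True) for el in soup.select(".instock .availability")]

# Beautiful Soup exposes the class attribute as a list of class names.
classes = soup.select_one(".instock.availability")["class"]

print(both_classes)
print(nested)
print(classes)
```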

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    print(title)

    # TODO: Extract and print the availability status of each book alongside its title.
    # The availability status is in the tag with the classes `instock availability` within the article element.

```

Adding the availability status is a natural next step, since it gives a fuller picture of each book's details.

Using the chained class selector described above, here is the updated code that extracts and prints the availability status for each book.

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    
    # Solution: Extract the availability status
    availability = book.select_one(".instock.availability").text.strip()

    # Print the title and availability status
    print(f"Title: {title}, Availability: {availability}")
```

### Explanation of the new line:

  * `book.select_one(".instock.availability")`: Uses the chained class selector to target the first element inside the book that has both the `instock` and `availability` classes.
  * `.text.strip()`: After selecting the element, we get its text content using `.text`. The `.strip()` method then removes any leading or trailing whitespace (including newlines), so the output is clean.

Now your script will print the title and the availability status for every book on the page.
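
To illustrate why `.strip()` matters, the sketch below parses a hand-written fragment in which the status text is surrounded by indentation and newlines (the exact markup of the live page is an assumption here). It also shows `get_text(strip=True)`, which gives the same result for this fragment.

```python
from bs4 import BeautifulSoup

# Hand-written fragment with whitespace around the status text.
html = """
<p class="instock availability">
    <i class="icon-ok"></i>

        In stock

</p>
"""
tag = BeautifulSoup(html, "html.parser").select_one(".instock.availability")

raw = tag.text                    # includes the surrounding newlines and spaces
cleaned = tag.text.strip()        # whitespace at both ends removed
same = tag.get_text(strip=True)   # strips each text fragment before joining

print(repr(raw))
print(repr(cleaned))
```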

## Complete the Webpage Parsing Task

You've done a great job so far extracting book titles.

Now, let's add the missing line that selects the `<li>` elements in the ordered list from the parsed HTML using a CSS selector.

Follow the instructions in the starter code to complete the task.

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# TODO: Select the li elements in the ordered list from the parsed HTML using CSS selector
# Navigate from the element with the class "page_inner" - note that the li elements are within an ordered list under the section element

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    print(title)

```

The missing line is the one that selects the list items (`<li>`) containing the book information.

Following the instructions in the starter code, here is the completed script:

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Solution: Select the li elements in the ordered list from the parsed HTML using CSS selector
books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    print(title)
```

### Explanation of the added line:

  * `soup.select(".page_inner section ol li")`: The spaces in this selector are descendant combinators, so each part narrows the search to elements nested (at any depth) inside the previous match.
      * `.page_inner`: Matches the element with the class `page_inner`.
      * `section`: Matches the `<section>` element inside `.page_inner`.
      * `ol`: Matches the ordered list (`<ol>`) inside that `<section>`.
      * `li`: Matches all the list items (`<li>`) inside that ordered list.

This line correctly selects the list of book elements and assigns them to the `books_ordered_list` variable, allowing the rest of the code to function as intended.
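
As a rough illustration of what the space-separated selector does and does not require, the following sketch compares it with the stricter child combinator (`>`) on a hand-written page. The nesting below is invented for the example and is not claimed to match the live site.

```python
from bs4 import BeautifulSoup

html = """
<div class="page_inner">
  <aside><ul><li>Category link</li></ul></aside>
  <section>
    <div>
      <ol class="row">
        <li>Book A</li>
        <li>Book B</li>
      </ol>
    </div>
  </section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space): intermediate wrappers such as <div> are allowed.
descendant = [li.get_text() for li in soup.select(".page_inner section ol li")]

# Child combinator (>): every step must be a direct child, so the <div> breaks the chain.
direct_child = [li.get_text() for li in soup.select(".page_inner > section > ol > li")]

# A selector that is too broad also picks up the sidebar list.
all_items = [li.get_text() for li in soup.select(".page_inner li")]

print(descendant)
print(direct_child)
print(all_items)
```

Including `section` and `ol` in the selector is what keeps unrelated lists, such as a sidebar menu, out of the result.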

## Scrape Book Titles Efficiently

Awesome progress so far!

In this exercise, some parts of the code are missing. Your task is to fill in the missing blocks to complete the script. This practice will strengthen your understanding of using CSS selectors.

Complete the code to print the titles of the books listed on the webpage.

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    # TODO: Select 'a' tags in the 'h3' element in the 'article' and extract the 'title' attribute of the first 'a' tag
    
    print(title)

```

Great job on your progress! Filling in the missing pieces is an excellent way to solidify your skills.

Here is the completed script that efficiently scrapes and prints the book titles. The missing line uses a CSS selector to navigate to the correct element and extract the title.

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    # Solution: Select 'a' tags in the 'h3' element in the 'article' and extract the 'title' attribute of the first 'a' tag
    title = book.select("article h3 a")[0]["title"]
    
    print(title)
```

### Explanation of the completed line:

  * `book.select("article h3 a")`: This CSS selector finds the `<a>` tag that contains the book's title. It looks for an `<a>` tag nested inside an `<h3>` tag, which is itself inside an `<article>` tag.
  * `[0]`: Since `select` returns a list of all matching elements, we use `[0]` to get the very first (and in this case, only) one.
  * `["title"]`: Finally, we access the `title` attribute of the selected `<a>` tag to get the book's title.
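
Indexing with `[0]` raises an `IndexError` when nothing matches, and `["title"]` raises a `KeyError` when the attribute is missing. A more defensive variant, sketched here on hand-written HTML with deliberately incomplete entries, uses `select_one` and `.get()` and skips items that lack the expected data.

```python
from bs4 import BeautifulSoup

# Three entries: one complete, one without a link, one without a title attribute.
html = """
<ol>
  <li><article><h3><a href="#" title="Sharp Objects">Sharp Objects</a></h3></article></li>
  <li><article><h3>Entry without a link</h3></article></li>
  <li><article><h3><a href="#">Link without a title</a></h3></article></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for book in soup.select("ol li"):
    link = book.select_one("article h3 a")  # None when nothing matches
    if link is None:
        continue
    title = link.get("title")               # None when the attribute is missing
    if title:
        titles.append(title)

print(titles)
```

Whether to skip incomplete entries silently or to log them is a design choice; skipping is used here only to keep the sketch short.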

## Scrape Book Titles with Robustness

You’ve done great so far! Now, it's time to show what you've learned by writing the entire solution from scratch.

Remember to inspect the website to understand its structure and identify the elements you need to extract.

```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)

# TODO: Parse the fetched content using BeautifulSoup

# TODO: Select the list elements in the ordered list of books using the appropriate CSS selector starting with ".page_inner" and ending with "li"

# TODO: Loop through the list items and extract the title attribute of the <a> tag in each book

# TODO: Print each extracted book title

```
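
One possible way to approach this exercise is sketched below; it is not the only valid solution. `fetch_html` and `extract_titles` are hypothetical helper names, the `timeout` value is an arbitrary choice, and the sample page is hand-written so the parsing step can be tried without a network connection.

```python
from bs4 import BeautifulSoup
import requests


def fetch_html(url, timeout=10):
    """Download a page and raise an exception on HTTP errors (4xx/5xx)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_titles(html):
    """Return the title attribute of each book link in the ordered list."""
    soup = BeautifulSoup(html, "html.parser")
    # a[title] only matches links that actually carry a title attribute.
    links = soup.select(".page_inner section ol li article h3 a[title]")
    return [link["title"] for link in links]


# Offline check with a small hand-written page.
sample = """
<div class="page_inner"><section><ol>
  <li><article><h3><a title="Soumission">Soumission</a></h3></article></li>
  <li><article><h3><a title="The Requiem Red">The Requiem Red</a></h3></article></li>
</ol></section></div>
"""
for title in extract_titles(sample):
    print(title)
```

For the live site, the call would be `extract_titles(fetch_html("https://books.toscrape.com/"))`. Separating fetching from parsing keeps the parsing logic testable on saved or hand-written HTML.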

