# Unit 3 Navigating HTML Trees with BeautifulSoup
 
### Welcome to today's lesson on navigating the HTML tree structure using the Python BeautifulSoup library.

This tutorial walks you step by step through extracting specific elements from web pages. By the end of the lesson, you will understand the hierarchical nature of HTML pages and how to traverse these structures to extract the information you need.

### Understanding the HTML Tree Structure

The structure of an HTML document is like a tree, with parent, child, and sibling elements. Every individual element in an HTML document forms a node in the tree structure.

```
<html> - Root Node
|
|--<head> - Child of Root Node and Parent to <title>
|  |--<title> - Child Node of <head>
|
|--<body> - Child of Root Node and Parent to <div>
|  |--<div> - Child of <body> and parent of <p> and <span>
|  |  |--<p> - Child Node of <div>
|  |  |--<span> - Another Child Node of <div>
```

Let's break down the HTML tree relationships:

  * **Parent Nodes:** Elements that contain other elements. For example, `<body>` is a parent of `<div>`, which is a parent of `<p>`.
  * **Child Nodes:** Elements that are directly nested inside another element. For example, `<p>` is a child of `<div>`, which is a child of `<body>`.
  * **Sibling Nodes:** Elements that share the same parent. For instance, `<p>` and `<span>` are siblings because they are both children of the same `<div>` element.
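
As a quick preview, these three relationships can be checked in code. The following is a minimal sketch, assuming an HTML string that mirrors the diagram above; the string and its text content are made up for illustration and are not part of the lesson's examples.

```python
from bs4 import BeautifulSoup

# Assumed markup mirroring the diagram above (the text content is invented)
html = "<html><head><title>Demo</title></head><body><div><p>Text</p><span>More</span></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
div = soup.find("div")

parent_name = p.parent.name                      # parent of <p>
child_names = [child.name for child in div.children]  # children of <div>
sibling_name = p.find_next_sibling().name        # sibling of <p>

print("Parent of <p>:", parent_name)        # div
print("Children of <div>:", child_names)    # ['p', 'span']
print("Sibling of <p>:", sibling_name)      # span
```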

In the upcoming sections, we'll explore how BeautifulSoup enables us to traverse these relationships.

### Using BeautifulSoup to Navigate HTML Trees

BeautifulSoup offers several useful functions for traversing the HTML tree. One fundamental function is the `find()` method, which returns the first matching element.

To illustrate `find()`, we will use a simple HTML string:

```python
from bs4 import BeautifulSoup

html_content = '<html><body><div id="main"><h1>Welcome</h1><p>Learn web scraping.</p></div></body></html>'

soup = BeautifulSoup(html_content, 'html.parser')

# Access the main 'div' using find
main_div = soup.find('div', id='main')

print("Main div content:")
print(main_div.prettify())
```

The output of the above code will be:

```
Main div content:
<div id="main">
 <h1>
  Welcome
 </h1>
 <p>
  Learn web scraping.
 </p>
</div>
```

Here is what the code does:

1. `BeautifulSoup(html_content, 'html.parser')` parses the HTML content and creates a `BeautifulSoup` object, `soup`, which represents the HTML document as a nested data structure.
2. `soup.find('div', id='main')` finds the `div` element with an `id` of `main`.
3. `main_div.prettify()` returns that element's HTML as an indented string, which we then print.

Running this code will output the formatted HTML content within the `div` with an `id` of 'main'.
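
One detail worth knowing is that `find()` returns `None` when nothing matches. The sketch below reuses the HTML string from above and looks up an `id` of `sidebar`, a made-up value that is deliberately absent from the document.

```python
from bs4 import BeautifulSoup

html_content = '<html><body><div id="main"><h1>Welcome</h1><p>Learn web scraping.</p></div></body></html>'
soup = BeautifulSoup(html_content, 'html.parser')

# find() returns the first match...
heading = soup.find('h1')
print(heading.text)  # Welcome

# ...or None when nothing matches ('sidebar' is a hypothetical id, not in the HTML)
missing = soup.find('div', id='sidebar')
print(missing)  # None

# Guard before using the result to avoid an AttributeError on None
if missing is not None:
    print(missing.prettify())
```

Checking for `None` before chaining further calls keeps a scraper from crashing when a page's layout differs from what was expected.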

### Exploring HTML with `parent` and `children` Attributes

In addition to `find()`, BeautifulSoup provides the `.parent` and `.children` attributes for vertical traversal (up and down the tree). They give access to the parent and the direct children of a given node.

Let's explore these attributes with a slightly more complex HTML example. We first define an HTML string and then use BeautifulSoup to extract the main `div`:

```python
from bs4 import BeautifulSoup

html_content = '''<html><body><div id="main">    <h1>Welcome</h1>    <p>Learn web scraping.</p>    <p>It's a useful technique.</p></div></body></html>'''

soup = BeautifulSoup(html_content, 'html.parser')
main_div = soup.find('div', id='main')
```

Next, we will use the `.children` and `.parent` attributes to explore the HTML tree structure:

```python
# Finding the children of the 'main' div
children = main_div.children
print("Children of the main div:")
# Prints the h1 and two p tags, along with the whitespace text nodes before each of them
for child in children:
    print(child)

# Accessing the parent of the 'main' div
parent = main_div.parent
print("\nParent of the main div:")
print(parent.name)  # Prints the parent's tag name: body
```
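
Because the sample HTML has spaces between its tags, `.children` yields whitespace text nodes as well as tags. The following sketch shows one way to keep only the tags; filtering with `isinstance(child, Tag)` is a choice made for this illustration, not something the lesson prescribes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_content = '''<html><body><div id="main">    <h1>Welcome</h1>    <p>Learn web scraping.</p>    <p>It's a useful technique.</p></div></body></html>'''
soup = BeautifulSoup(html_content, 'html.parser')
main_div = soup.find('div', id='main')

# .children is an iterator, so copy it into a list to reuse it
all_children = list(main_div.children)

# Keep only element nodes, dropping the whitespace-only text nodes
tag_children = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children))                       # 6 (three text nodes + three tags)
print([child.name for child in tag_children])  # ['h1', 'p', 'p']
```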

### Using `find_next_sibling` and `find_previous_sibling` to Navigate Sibling Nodes

BeautifulSoup's `find_next_sibling` method allows us to navigate horizontal relationships within an HTML tree. Sibling nodes refer to nodes that share the same parent; hence `find_next_sibling` is used to find the next sibling of a given node (i.e., an element at the same structural level).

Let's inspect this using our HTML sample:

```python
from bs4 import BeautifulSoup

html_content = '''<html><body><div id="main">    <h1>Welcome</h1>    <p>Learn web scraping.</p>    <p>It's a useful technique.</p></div></body></html>'''

soup = BeautifulSoup(html_content, 'html.parser')

# Finding the first 'p' tag in our 'div'
first_p = soup.find('div', id='main').find('p')
print("First paragraph:", first_p)

# Finding the next sibling of the first 'p' tag (the second 'p' tag)
second_p = first_p.find_next_sibling()
print("Second paragraph:", second_p)
```

Here, `soup` represents our HTML document. We first locate the first `<p>` tag in the 'main' `<div>` using `find`, then call `find_next_sibling` to get its next sibling element, which is the second `<p>` tag in the 'main' `<div>`. Running this code prints both `<p>` elements.

The `find_next_sibling` method offers an effective way to navigate through an HTML document horizontally. Understanding how to move between sibling nodes allows for more precise and flexible web scraping.
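
It can help to contrast `find_next_sibling` with the `.next_sibling` attribute, which this lesson does not otherwise cover. The sketch below, using the same sample HTML, suggests why the method is usually the more convenient choice when the markup contains whitespace between tags.

```python
from bs4 import BeautifulSoup

html_content = '''<html><body><div id="main">    <h1>Welcome</h1>    <p>Learn web scraping.</p>    <p>It's a useful technique.</p></div></body></html>'''
soup = BeautifulSoup(html_content, 'html.parser')
first_p = soup.find('div', id='main').find('p')

# .next_sibling returns the very next node, here the whitespace between the tags
raw_sibling = first_p.next_sibling
print(repr(raw_sibling))

# find_next_sibling('p') skips text nodes and returns the next <p> tag
second_p = first_p.find_next_sibling('p')
print(second_p.text)  # It's a useful technique.

# The last <p> has no following <p> sibling, so the result is None
after_last = second_p.find_next_sibling('p')
print(after_last)  # None
```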

Similarly, we can use `find_previous_sibling` to get the previous sibling of a node:

```python
first_p_from_second = second_p.find_previous_sibling()
print("First paragraph:", first_p_from_second)
```
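
Both sibling methods also accept a tag name, which lets us skip over siblings we are not interested in. As a small self-contained sketch on the same sample HTML, we can jump from the second `<p>` straight back to the `<h1>`:

```python
from bs4 import BeautifulSoup

html_content = '''<html><body><div id="main">    <h1>Welcome</h1>    <p>Learn web scraping.</p>    <p>It's a useful technique.</p></div></body></html>'''
soup = BeautifulSoup(html_content, 'html.parser')

first_p = soup.find('div', id='main').find('p')
second_p = first_p.find_next_sibling('p')

# Passing 'h1' skips the first <p> and returns the earlier <h1> sibling
heading = second_p.find_previous_sibling('h1')
print("Heading:", heading.text)  # Heading: Welcome
```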

### Summary and Practice Exercises

Congrats on making it to this point! We hope this lesson has advanced your understanding of HTML tree structures and BeautifulSoup's different traversal methods.

To solidify and apply your newfound knowledge, we'll embark on some practical exercises. These exercises will immerse you in scenarios that mimic real-world web scraping tasks, providing you with opportunities to traverse complex HTML trees to extract valuable information. Let's get to it!

Remember, practice is the key to mastering Python web scraping. Happy coding!

## Exploring a Travel Booking Website with Beautiful Soup

The code below explores the HTML content of a travel booking website to find the first available flight and hotel options. Before running the script, can you determine what the output will be?

```python
from bs4 import BeautifulSoup

# Sample HTML content from a travel booking website
html_content = """
<html>
  <body>
    <div class="flights">
      <h1>Flight Options</h1>
      <ul>
        <li class="flight" id="flight1">Flight 101 to Rome - $499</li>
        <li class="flight" id="flight2">Flight 202 to Paris - $599</li>
      </ul>
    </div>
    <div class="hotels">
      <h2>Hotel Listings</h2>
      <ul>
        <li class="hotel" id="hotel1">Hotel Plaza - $99 per night</li>
        <li class="hotel" id="hotel2">Grand Hotel - $199 per night</li>
      </ul>
    </div>
    <div class="reviews">
      <h3>Customer Reviews</h3>
      <p id="review1">Great experience on Flight 101!</p>
      <p id="review2">Highly recommend Hotel Plaza for the price.</p>
    </div>
  </body>
</html>
"""

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Locate the flights div and then find the first flight option within it
flights_div = soup.find('div', class_='flights')
first_flight = flights_div.find('li', class_='flight').text
print(f"First flight option: {first_flight}")

# Navigate to the sibling element of the flights div, i.e., the hotels div, and find the first hotel option
hotels_div = flights_div.find_next_sibling('div', class_='hotels')
first_hotel = hotels_div.find('li', class_='hotel').text
print(f"First hotel option: {first_hotel}")

```

## Navigating Sibling Elements with BeautifulSoup

Next up, let's further sharpen your BeautifulSoup skills. Expand on what you've learned by modifying the existing code to print the "Second Deal" from the travel deals section, using the `find_next_sibling()` method. Remember, this method allows you to navigate horizontally in the HTML tree to find sibling elements.

```python
from bs4 import BeautifulSoup

# Simulating a section of an HTML document from a travel booking website
html_content = """
<div id='travel-deals'>
    <h2>Today's Top Deals</h2>
    <p>Save on flights to Hawaii!</p>
    <p>Discounted hotel rates in Paris!</p>
</div>
"""

# Creating a BeautifulSoup Object
soup = BeautifulSoup(html_content, 'html.parser')

# Find the 'h2' tag to see the section title
section_title = soup.find('h2').string
print("Section Title:", section_title)

# Find the first 'p' tag to see the first deal
first_deal = soup.find('p')

# TODO: Find the next sibling of the first 'p' tag (the second deal)

# TODO: Remember to update the print statement below to print the second deal instead
print("Second Deal:", first_deal)

```

One possible solution:

```python
from bs4 import BeautifulSoup

# Simulating a section of an HTML document from a travel booking website
html_content = """
<div id='travel-deals'>
    <h2>Today's Top Deals</h2>
    <p>Save on flights to Hawaii!</p>
    <p>Discounted hotel rates in Paris!</p>
</div>
"""

# Creating a BeautifulSoup Object
soup = BeautifulSoup(html_content, 'html.parser')

# Find the 'h2' tag to see the section title
section_title = soup.find('h2').string
print("Section Title:", section_title)

# Find the first 'p' tag to see the first deal
first_deal = soup.find('p')

# Find the next sibling of the first 'p' tag (the second deal)
second_deal = first_deal.find_next_sibling('p')

# Print the second deal
print("Second Deal:", second_deal.text)
```
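
The same idea extends to collecting every deal rather than just the second one. The loop below is a sketch of one way to do it, assuming the deals are all `<p>` siblings as in the sample HTML; it repeats `find_next_sibling('p')` until the method returns `None`.

```python
from bs4 import BeautifulSoup

html_content = """
<div id='travel-deals'>
    <h2>Today's Top Deals</h2>
    <p>Save on flights to Hawaii!</p>
    <p>Discounted hotel rates in Paris!</p>
</div>
"""
soup = BeautifulSoup(html_content, 'html.parser')

# Start at the first <p> and hop from sibling to sibling until none remain
deals = []
deal = soup.find('p')
while deal is not None:
    deals.append(deal.text)
    deal = deal.find_next_sibling('p')

print(deals)  # ['Save on flights to Hawaii!', 'Discounted hotel rates in Paris!']
```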
