## **Parsing HTML**

BeautifulSoup is a Python library that simplifies the process of web scraping by allowing developers to extract data from HTML documents easily. It transforms complicated HTML documents into a tree of Python objects, such as tags, navigable strings, and comments. This makes it straightforward to locate and manipulate the desired data.
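
To make the "tree of Python objects" concrete, here is a minimal sketch that walks a small document and labels each node by its type. The sample markup and the `describe` helper are illustrative assumptions, not part of the exercise set.

```python
from bs4 import BeautifulSoup, Comment, Tag

html = "<div><!-- note --><p>Hello, <b>world</b>!</p></div>"
soup = BeautifulSoup(html, "html.parser")


def describe(node, depth=0):
    """Return one indented line per node in the parse tree (hypothetical helper)."""
    indent = "  " * depth
    if isinstance(node, Tag):
        lines = [f"{indent}Tag <{node.name}>"]
        for child in node.children:
            lines.extend(describe(child, depth + 1))
        return lines
    # Comment is a subclass of NavigableString, so test for it explicitly
    kind = "Comment" if isinstance(node, Comment) else "NavigableString"
    return [f"{indent}{kind} {str(node)!r}"]


tree_lines = describe(soup.div)
print("\n".join(tree_lines))
```

Each `Tag` can contain further tags, plain text (`NavigableString`), or comments, which is why the later exercises can chain lookups from one element down to the next.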

In [1]:
from bs4 import BeautifulSoup as bs, Comment
import json

"""
Practice Exercise: BeautifulSoup Basics

Complete each function below by following the TODO instructions. 
Each function includes the objective of the task and the expected output.
"""

In [2]:
def convert_text_to_soup():
    """
    Objective: Convert the provided text (HTML content) into a BeautifulSoup object.
    Expected Output:
    Type of text before soup: <class 'str'>
    Type of text after soup: <class 'bs4.BeautifulSoup'>
    """
    text = """
        <html>
            <body>
                <div>
                    <p>Hello, world!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Use BeautifulSoup to convert text object type to soup object type
    # from bs4 import BeautifulSoup
    soup = bs(text, "html.parser")
    # TODO: Use print() to print the type of text before and after conversion
    print(f"Type of text before soup: {type(text)}")
    print(f"Type of text after soup: {type(soup)}")
convert_text_to_soup()

Type of text before soup: <class 'str'>
Type of text after soup: <class 'bs4.BeautifulSoup'>


In [3]:
def print_pretty():
    """
    Objective: Compare the printed output of the raw string, the soup object, and the `.prettify()` method.
    Expected Output:
    The raw HTML string, the same markup printed from the soup object,
    and the indented, one-tag-per-line markup produced by `.prettify()`.
    """
    text = """<html><body><div><p>Hello, world!</p></div></body></html>"""
    # TODO: Use BeautifulSoup to convert text object type to soup object type
    soup = bs(text, "html.parser")
    # TODO: Print text
    print(text)
    # TODO: Print soup directly
    print(soup)
    # TODO: Print using prettify method
    print(soup.prettify())
print_pretty()

<html><body><div><p>Hello, world!</p></div></body></html>
<html><body><div><p>Hello, world!</p></div></body></html>
<html>
 <body>
  <div>
   <p>
    Hello, world!
   </p>
  </div>
 </body>
</html>



In [4]:
def find_going_down():
    """
    Objective: Demonstrate how to traverse downward in the HTML structure using `.find()` method.
    Expected Output:
    The <body>, <div>, and <p> tags in sequence as they are traversed.
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p id="my-id">Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Navigate soup to get <body> and print it
    body = soup.find("body")
    print(body)
    # TODO: Navigate body to get <div> and print it
    div = body.find("div")
    print(div)
    # TODO: Navigate div to get <p> and print it 
    print(div.find_all("p"))

find_going_down()

<body>
<div>
<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>
</div>
</body>
<div>
<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>
</div>
[<p class="my-class">Hello, my class!</p>, <p id="my-id">Hello, my id!</p>]


In [5]:
def find_next_to():
    """
    Objective: Locate the <p> element that comes immediately after a specific <p>.
    Expected Output:
    <p>Hello, my id!</p>
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p>Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Use `.find()` to locate the <p> element with class="my-class"
    my_class = soup.find("p", "my-class")
    # TODO: Use `.find_next()` to locate the next <p> element
    return my_class.find_next("p")
find_next_to()

<p>Hello, my id!</p>

In [6]:
def use_css_selectors():
    """
    Objective: Locate elements using CSS selectors.
    Expected Output:
    <p class="my-class">Hello, my class!</p>
    <p id="my-id">Hello, my id!</p>
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p id="my-id">Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Use `.select_one()` to locate elements using class, then print them.
    print(soup.select_one(".my-class"))
    # TODO: Use `.select_one()` to locate elements using ID selectors, then print them.
    print(soup.select_one("#my-id"))
    
use_css_selectors()

<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>


In [7]:
def extract_text():
    """
    Objective: Extract and print the text content of a <p> element.
    Expected Output:
    Hello, world!
    """
    html = """
        <html>
            <body>
                <div>
                    <p>Hello, world!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Use `.find()` to locate the <p> element
    p = soup.find("p")
    # TODO: Extract text from the <p> element and print it
    print(p.string)
extract_text()

Hello, world!


In [8]:
def extract_attributes():
    """
    Objective: Extract and print the attributes of an <a> element.
    Expected Output:
    dict_keys(['class', 'href', 'target'])
    https://www.google.com
    """
    html = """
        <html>
            <body>
                <div>
                    <a class="my-link" href="https://www.google.com" target="_blank">Google</a>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Use `.find()` to locate the <a> element
    a = soup.find("a")
    # TODO: Print all available attributes from the <a> element
    attrs = a.attrs
    print(attrs.keys())
    # TODO: Print the href attribute from the <a> element
    print(attrs['href'])
extract_attributes()

dict_keys(['class', 'href', 'target'])
https://www.google.com


In [9]:
def extract_text_from_list():
    """
    Objective: Extract text from all <li> elements and return them as a list of strings.
    Expected Output:
    ['Item 1', 'Item 2', 'Item 3']
    """
    html = """
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Use `.find_all()` to locate all <li> elements
    li = soup.find_all("li")
    # TODO: Iterate over the <li> elements and extract their text into a list.
    result = [item.string for item in li]
    # TODO: Print the list
    print(result)
extract_text_from_list()

['Item 1', 'Item 2', 'Item 3']


In [10]:
def find_all_going_down():
    """
    Objective: Extract text from all <p> elements and return them as a list of dictionaries.
    Expected Output:
    [{'name': 'John Doe', 'age': '25'}, {'name': 'Nadia', 'age': '31'}, {'name': 'Serena', 'age': '23'}, {'name': 'Tessa', 'age': '17'}, {'name': 'Una', 'age': '23'}]
    """
    html = """
        <section>
            <div class="user-info">
                <p class="my-name">John Doe</p>
                <p class="my-age">25</p>
            </div>
            <div class="user-info">
                <p class="my-name">Nadia</p>
                <p class="my-age">31</p>
            </div>
            <div class="user-info">
                <p class="my-name">Serena</p>
                <p class="my-age">23</p>
            </div>
            <div class="user-info">
                <p class="my-name">Tessa</p>
                <p class="my-age">17</p>
            </div>
            <div class="user-info">
                <p class="my-name">Una</p>
                <p class="my-age">23</p>
            </div>
        </section>"""
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Use `.find_all()` to locate all <p> elements
    result = []
    p_names = soup.find_all("p", "my-name")
    p_ages = soup.find_all("p", "my-age")
    # TODO: Iterate over the <p> elements and extract their text into a list of dictionaries
    for name, age in zip(p_names, p_ages):
        data = {
            'name': name.string,
            'age': age.string
        }
        result.append(data)
    # TODO: Print the list of dictionaries
    print(result)
    
find_all_going_down()

[{'name': 'John Doe', 'age': '25'}, {'name': 'Nadia', 'age': '31'}, {'name': 'Serena', 'age': '23'}, {'name': 'Tessa', 'age': '17'}, {'name': 'Una', 'age': '23'}]


In [11]:
def extract_tables():
    """
    Objective: Extract data from an HTML table and return it as a list of dictionaries.
    Expected Output:
    [{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
    """
    html = """
    <table>
        <thead>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Alice</td>
                <td>30</td>
            </tr>
            <tr>
                <td>Bob</td>
                <td>25</td>
            </tr>
        </tbody>
    </table>
    """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Extract <tr> elements from <tbody>
    tr = soup.tbody.find_all("tr")
    # TODO: Iterate over the <td> elements, to construct dictionaries for each row.
    result = []
    for item in tr:
        td = item.find_all("td")
        data = {
            'name': td[0].string,
            'age': td[1].string
        }
        # TODO: Append each dictionary to a list
        result.append(data)
    # TODO: Print the list
    print(result)
    # Challenges: Can you extract it directly from the <tr> elements?
    tr = soup.find_all("tr")[1:]
    result = []
    for item in tr:
        td = item.find_all("td")
        data = {
            'name': td[0].string,
            'age': td[1].string
        }
        result.append(data)
    print(result)

extract_tables()

[{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
[{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]


In [12]:
def extract_scripts():
    """
    Objective: Extract JSON-like data embedded in a <script> tag and return it as a Python dictionary.
    Expected Output:
    {'id': 123, 'name': 'Alice'}
    """
    html = """
        <script>
            var userInfo = { "id": 123, "name": "Alice" };
        </script>
        """

    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Extract the JSON-like content from the <script> tag
    text = soup.script.string
    # TODO: Remove the "var userInfo = " and ";" from the JSON-like content
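    # Strip only the leading assignment and the trailing semicolon,
    # so semicolons inside the data itself would be left intact
    text = text.strip().replace("var userInfo = ", "", 1).rstrip(";").strip()
    # TODO: Convert the JSON-like content to a Python dictionary
    result = json.loads(text)  # json is already imported in the first cell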
    # TODO: Print the dictionary
    print(result)
extract_scripts()

{'id': 123, 'name': 'Alice'}


In [13]:
def extract_comments():
    """
    Objective: Extract a comment from the HTML and return it as a string.
    Expected Output:
    The comment text " User ID: 67890 ", printed without the quotes.
    """
    html = """
        <!-- User ID: 67890 -->
        <div class="user-info">Name: John Doe</div>
        """

    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Use BeautifulSoup to locate and extract the comment.
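    # Search by node type instead of relying on the comment's position in soup.contents
    comment = soup.find(string=lambda text: isinstance(text, Comment))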
    # TODO: Print the comment
    print(comment)
extract_comments()

 User ID: 67890 


In [14]:
def extract_dynamic_classes():
    """
    Objective: Extract text content from all <div> elements with class names starting with 'content-'.
    Expected Output:
    ['Item 1', 'Item 2', 'Item 3']
    """
    html = """
        <div class="content-1">Item 1</div>
        <div class="content-2">Item 2</div>
        <div class="content-3">Item 3</div>
        """
    
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")
    # TODO: Use `.find_all()` with a custom filter to locate the elements
    import re
    # Anchor the pattern so only class names that start with "content-" match
    data = soup.find_all("div", attrs={'class': re.compile(r"^content-")})
    # TODO: Iterate over the elements and extract text from the specified <div> elements.
    result = [item.string for item in data]
    # TODO: Print the list
    print(result)

extract_dynamic_classes()

['Item 1', 'Item 2', 'Item 3']
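
The TODO above mentions a custom filter. Besides a regular expression, `.find_all()` also accepts a function that receives each tag and returns `True` or `False`. The sketch below is one possible version; the `has_content_class` name and the extra sample `<div>` elements are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="content-1">Item 1</div>
<div class="main-content">Sidebar</div>
<div class="content-2 highlighted">Item 2</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_content_class(tag):
    """True for <div> tags with at least one class starting with 'content-'."""
    classes = tag.get("class", [])  # list of class names, or [] when absent
    return tag.name == "div" and any(c.startswith("content-") for c in classes)


items = [div.get_text() for div in soup.find_all(has_content_class)]
print(items)
```

Here `main-content` is skipped because it only contains the word, while `content-2 highlighted` is kept because one of its classes starts with the prefix.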


### **Reflection**
Which BeautifulSoup method do you prefer: `.find()` or `.select_one()`?

I prefer `.find()` when the element I need is simple enough to locate by tag name or attribute, without a CSS selector.
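
As a rough comparison, the sketch below runs both methods against the same markup. The sample HTML and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

html = """
<div id="profile">
    <p class="my-class">Hello, my class!</p>
    <p id="my-id">Hello, my id!</p>
</div>
<p class="my-class">Outside the div</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Simple lookup: both calls return the first matching element
by_find = soup.find("p", class_="my-class")
by_select = soup.select_one("p.my-class")
same_element = by_find is by_select

# Nested lookup: .find() needs chaining, a CSS selector says it in one string
nested_find = soup.find("div", id="profile").find("p", id="my-id")
nested_select = soup.select_one("div#profile > p#my-id")

# Both return None when nothing matches
missing = (soup.find("span"), soup.select_one("span"))

print(same_element, nested_find.string, nested_select.string, missing)
```

For a single tag or attribute the two read about the same; the selector form tends to pay off once the path involves several levels of nesting.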

### **Exploration**
Automate the process of getting HTML content by using the Requests library. Read the official documentation.
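
One possible starting point is sketched below. The `fetch_soup` helper, its `session` and `timeout` parameters, and the example URL are assumptions for illustration; check the Requests documentation for the options that fit your case.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Download a page and parse it with BeautifulSoup (hypothetical helper).

    Raises requests.HTTPError when the server answers with a 4xx/5xx status.
    """
    http = session if session is not None else requests.Session()
    response = http.get(url, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Example usage (needs network access):
# soup = fetch_soup("https://example.com")
# print(soup.title.string)
```

Passing in a session is optional; it mainly helps when several pages are fetched from the same site, or when a stand-in object is wanted for testing without network access.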