## **Parsing HTML**

BeautifulSoup is a Python library for extracting data from HTML documents, which makes it a common choice for web scraping. It parses a document into a tree of Python objects, such as tags, navigable strings, and comments, so the data you need is straightforward to locate and manipulate.
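
As a small illustration of that tree, the sketch below parses a made-up one-line snippet (the HTML is an assumption chosen to contain one tag, one text node, and one comment) and prints the Python type of each node.

```python
from bs4 import BeautifulSoup as bs, Comment, NavigableString, Tag

# Hypothetical snippet: one <p> tag holding a text node and a comment.
html = "<p>Hello<!-- note --></p>"
soup = bs(html, "html.parser")

p_tag = soup.p
children = list(p_tag.children)

# The element itself is a Tag object.
print(type(p_tag).__name__)

# Its children are string-like nodes: plain text and a comment.
for child in children:
    print(type(child).__name__, repr(str(child)))
```

Note that `Comment` is a subclass of `NavigableString`, which is why the comment exercise further down filters with `isinstance(s, Comment)`.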

In [1]:
from bs4 import BeautifulSoup as bs, Comment
import json

"""
Practice Exercise: BeautifulSoup Basics

Complete each function below by following the TODO instructions. 
Each function includes the objective of the task and the expected output.
"""


In [None]:
def convert_text_to_soup():
    """
    Objective: Convert the provided text (HTML content) into a BeautifulSoup object.
    Expected Output:
    Type of text before soup: <class 'str'>
    Type of text after soup: <class 'bs4.BeautifulSoup'>
    """
    text = """
        <html>
            <body>
                <div>
                    <p>Hello, world!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Use BeautifulSoup to convert text object type to soup object type
    # TODO: Use print() to print the type of text before and after conversion
    print(f"Type of text before soup: {type(text)}")

    text = bs(text, "html.parser")

    print(f"Type of text after soup: {type(text)}")
    
convert_text_to_soup()

Type of text before soup: <class 'str'>
Type of text after soup: <class 'bs4.BeautifulSoup'>


In [10]:
def print_pretty():
    """
    Objective: Compare the print output with and without BeautifulSoup.prettify() method.
    Expected Output:
    The raw HTML string, the same markup printed from the soup object, and the indented output of prettify().
    """
    text = """<html><body><div><p>Hello, world!</p></div></body></html>"""
    # TODO: Use BeautifulSoup to convert text object type to soup object type
    soup = bs(text, "html.parser")

    # TODO: Print text
    print(text)

    # TODO: Print soup directly
    print(soup)

    # TODO: Print using prettify method
    print(soup.prettify())


print_pretty()

<html><body><div><p>Hello, world!</p></div></body></html>
<html><body><div><p>Hello, world!</p></div></body></html>
<html>
 <body>
  <div>
   <p>
    Hello, world!
   </p>
  </div>
 </body>
</html>



In [17]:
def find_going_down():
    """
    Objective: Demonstrate how to traverse downward in the HTML structure using `.find()` method.
    Expected Output:
    The <body>, <div>, and <p> tags in sequence as they are traversed.
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p id="my-id">Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Navigate soup to get <body> and print it
    body_tag = soup.find("body")

    print(body_tag)
    # print(body_tag.prettify())

    # TODO: Navigate body to get <div> and print it
    div_tag = body_tag.find("div")

    print(div_tag)
    # print(div_tag.prettify())

    # TODO: Navigate div to get <p> and print it
    p_tag = div_tag.find("p")

    print(p_tag)
    # print(p_tag.prettify())

find_going_down()

<body>
<div>
<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>
</div>
</body>
<div>
<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>
</div>
<p class="my-class">Hello, my class!</p>


In [19]:
def find_next_to():
    """
    Objective: Locate the <p> element that comes immediately after a specific <p>.
    Expected Output:
    <p class="my-class">Hello, my class!</p>
    <p>Hello, my id!</p>
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p>Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Use `.find()` to locate the <p> element with class="my-class"
    p_my_class = soup.find("p", { "class": "my-class"})
    print(p_my_class)

    # TODO: Use `.find_next()` to locate the next <p> element
    next_p = p_my_class.find_next("p")
    print(next_p)

find_next_to()

<p class="my-class">Hello, my class!</p>
<p>Hello, my id!</p>
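
A point worth checking here is where `.find_next()` starts its search. The sketch below, using a made-up `user-info` snippet, suggests that it walks forward in document order from the element it is called on, descendants included, while `.find_next_sibling()` only looks at later siblings.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup for illustration only.
html = """
<div class="user-info">
    <p class="my-name">Nadia</p>
    <p class="my-age">31</p>
</div>
"""
soup = bs(html, "html.parser")

div = soup.find("div")
name_p = div.find("p")

# Called on the <div>, find_next() meets the div's own first <p> first.
first_after_div = div.find_next("p")

# Called on the name <p>, both methods move on to the age <p>.
next_in_document = name_p.find_next("p")
next_sibling = name_p.find_next_sibling("p")

print(first_after_div.text)
print(next_in_document.text)
print(next_sibling.text)
```

So calling `.find_next("p")` on a container returns the same element as `.find("p")` on that container, which matters when pairing fields inside a parent element.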


In [24]:
def use_css_selectors():
    """
    Objective: Locate elements using CSS selectors.
    Expected Output:
    <p class="my-class">Hello, my class!</p>
    <p id="my-id">Hello, my id!</p>
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p id="my-id">Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Use `.select_one()` to locate elements using class, then print them.
    p_my_class = soup.select_one(".my-class")
    print(p_my_class)

    # TODO: Use `.select_one()` to locate elements using ID selectors, then print them.
    p_my_id = soup.select_one("#my-id")
    print(p_my_id)

use_css_selectors()

<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>


In [27]:
def extract_text():
    """
    Objective: Extract and print the text content of a <p> element.
    Expected Output:
    Hello, world!
    """
    html = """
        <html>
            <body>
                <div>
                    <p>Hello, world!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Use `.find()` to locate the <p> element
    p_tag = soup.find("p")

    # TODO: Extract text from the <p> element and print it
    print(p_tag.text)

extract_text()

Hello, world!


In [32]:
def extract_attributes():
    """
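    Objective: Extract and print the attributes of an <a> element.
    Expected Output:
    dict_keys(['class', 'href', 'target'])
    dict_values([['my-link'], 'https://www.google.com', '_blank'])
    {'class': ['my-link'], 'href': 'https://www.google.com', 'target': '_blank'}
    https://www.google.com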
    """
    html = """
        <html>
            <body>
                <div>
                    <a class="my-link" href="https://www.google.com" target="_blank">Google</a>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Use `.find()` to locate the <a> element
    a = soup.find("a")

    # TODO: Print all available attributes from the <a> element
    print(a.attrs.keys())
    print(a.attrs.values())
    print(a.attrs)

    # TODO: Print the href attribute from the <a> element
    print(a.attrs.get("href"))

extract_attributes()

dict_keys(['class', 'href', 'target'])
dict_values([['my-link'], 'https://www.google.com', '_blank'])
{'class': ['my-link'], 'href': 'https://www.google.com', 'target': '_blank'}
https://www.google.com
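
Besides `.attrs`, a tag can be read like a dictionary. The sketch below (the two-class link is an assumed example) contrasts subscript access with `.get()` and shows that `class` comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical link with two classes and no target attribute.
html = '<a class="my-link external" href="https://www.google.com">Google</a>'
a = bs(html, "html.parser").find("a")

# Subscript access raises KeyError when the attribute is missing.
href = a["href"]

# .get() returns None (or a supplied default) instead of raising.
target = a.get("target")

# class is multi-valued, so it is parsed into a list of strings.
classes = a.get("class")

print(href)
print(target)
print(classes)
print(a.has_attr("href"))
```

Using `.get()` is the safer choice when scraping pages where an attribute may be absent on some elements.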


In [34]:
def extract_text_from_list():
    """
    Objective: Extract text from all <li> elements and return them as a list of strings.
    Expected Output:
    ['Item 1', 'Item 2', 'Item 3']
    """
    html = """
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>
        """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Use `.find_all()` to locate all <li> elements 
    li_tags = soup.find_all("li")

    # TODO: Iterate over the <li> elements and extract their text into a list.
    text_list = [li.text for li in li_tags]
    print(text_list)

    # TODO: Print the list
    for txt in text_list:
        print(txt)

extract_text_from_list()

['Item 1', 'Item 2', 'Item 3']
Item 1
Item 2
Item 3


In [42]:
def find_all_going_down():
    """
    Objective: Extract text from all <p> elements and return them as a list of dictionaries.
    Expected Output:
    [{'name': 'John Doe', 'age': '25'}, {'name': 'Nadia', 'age': '31'}, {'name': 'Serena', 'age': '23'}, {'name': 'Tessa', 'age': '17'}, {'name': 'Una', 'age': '23'}]
    """
    html = """
        <section>
            <div class="user-info">
                <p class="my-name">John Doe</p>
                <p class="my-age">25</p>
            </div>
            <div class="user-info">
                <p class="my-name">Nadia</p>
                <p class="my-age">31</p>
            </div>
            <div class="user-info">
                <p class="my-name">Serena</p>
                <p class="my-age">23</p>
            </div>
            <div class="user-info">
                <p class="my-name">Tessa</p>
                <p class="my-age">17</p>
            </div>
            <div class="user-info">
                <p class="my-name">Una</p>
                <p class="my-age">23</p>
            </div>
        </section>"""
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Use `.find_all()` to locate all <p> elements
    # Grouping by the parent <div class="user-info"> keeps each name paired with its age.
    divs = soup.find_all("div", class_="user-info")

    # TODO: Iterate over the <p> elements and extract their text into a list of dictionaries
    persons = []
    for div in divs:
        # Select each <p> by its class; div.find_next("p") would return the name <p> again.
        name = div.find("p", class_="my-name").text
        age = div.find("p", class_="my-age").text

        persons.append({"name": name, "age": age})

    print(persons)

    # TODO: Print the list of dictionaries
    for person in persons:
        print(person)

find_all_going_down()

[{'name': 'John Doe', 'age': '25'}, {'name': 'Nadia', 'age': '31'}, {'name': 'Serena', 'age': '23'}, {'name': 'Tessa', 'age': '17'}, {'name': 'Una', 'age': '23'}]
{'name': 'John Doe', 'age': '25'}
{'name': 'Nadia', 'age': '31'}
{'name': 'Serena', 'age': '23'}
{'name': 'Tessa', 'age': '17'}
{'name': 'Una', 'age': '23'}


In [60]:
def extract_tables():
    """
    Objective: Extract data from an HTML table and return it as a list of dictionaries.
    Expected Output:
    [{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
    """
    html = """
    <table>
        <thead>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Alice</td>
                <td>30</td>
            </tr>
            <tr>
                <td>Bob</td>
                <td>25</td>
            </tr>
        </tbody>
    </table>
    """
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Extract <tr> elements from <tbody>
    tr_tags = soup.find_all("tr")
    # print(tr_tags)

    # TODO: Iterate over the <td> elements, to construct dictionaries for each row.

    for tr in tr_tags:
        td_tags = tr.find_all("td")
        # The header row has no <td> cells, so it prints an empty line.
        print(f"{td_tags[0].text} - {td_tags[1].text}" if len(td_tags) == 2 else "")

    # TODO: Append each dictionary to a list
    my_dictionary = []
    for tr in tr_tags:
        td_tags = tr.find_all("td")
        # Skip the header row, which contains <th> cells instead of <td>.
        if len(td_tags) == 2:
            my_dictionary.append({"name": td_tags[0].text, "age": td_tags[1].text})

    print(my_dictionary)

    # TODO: Print the list
    for person in my_dictionary:
        print(person)
    
    # Challenges: Can you extract it directly from the <tr> elements?

extract_tables()


Alice - 30
Bob - 25
[{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
{'name': 'Alice', 'age': '30'}
{'name': 'Bob', 'age': '25'}
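
One possible take on the challenge, offered as a sketch rather than the intended solution: read the column names from the `<th>` cells and zip them with each body row, so the keys are not hard-coded. It assumes the table has a `<thead>` and a `<tbody>`, as in the exercise.

```python
from bs4 import BeautifulSoup as bs

html = """<table>
<thead><tr><th>Name</th><th>Age</th></tr></thead>
<tbody>
<tr><td>Alice</td><td>30</td></tr>
<tr><td>Bob</td><td>25</td></tr>
</tbody></table>"""
soup = bs(html, "html.parser")

# Column names come from the header cells, lowercased to match the exercise keys.
headers = [th.get_text(strip=True).lower() for th in soup.select("thead th")]

rows = []
for tr in soup.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # zip() pairs each header with the cell in the same column.
    rows.append(dict(zip(headers, cells)))

print(rows)
```

Selecting only `tbody tr` avoids the length check on the header row, at the cost of depending on the table actually containing a `<tbody>`.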


In [None]:
import json

def extract_scripts():
    """
    Objective: Extract JSON-like data embedded in a <script> tag and return it as a Python dictionary.
    Expected Output (final line, the parsed dictionary):
    {'id': 123, 'name': 'Alice'}
    """
    html = """
        <script>
            var userInfo = { "id": 123, "name": "Alice" };
        </script>
        """

    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Extract the JSON-like content from the <script> tag
    script = soup.script

    print(script.prettify())

    script_value = script.string

    print(script_value)

    index_of_open_curly_brace = script_value.index("{")

    print(index_of_open_curly_brace)
    
    # TODO: Remove the "var userInfo = " and ";" from the JSON-like content
    # Strip surrounding whitespace, then only the trailing ";", so a ";" inside a value survives.
    json_like_value = script_value[index_of_open_curly_brace:].strip().rstrip(";")
    print(json_like_value)
    
    # TODO: Convert the JSON-like content to a Python dictionary
    python_dict = json.loads(json_like_value)
    
    # TODO: Print the dictionary
    print(python_dict)

extract_scripts()

<script>
 var userInfo = { "id": 123, "name": "Alice" };
</script>


            var userInfo = { "id": 123, "name": "Alice" };
        
28
{ "id": 123, "name": "Alice" }
        
{'id': 123, 'name': 'Alice'}
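
Slicing at the first `{` works for this snippet, but a regular expression can make the intent more explicit. The sketch below is one possible variant; the pattern and the value `"Alice; admin"` are assumptions for illustration, and it only works when the JavaScript object literal also happens to be valid JSON (quoted keys, no trailing commas).

```python
import json
import re

from bs4 import BeautifulSoup as bs

# Hypothetical script whose value contains a ";" to show why blind replacement is risky.
html = """
<script>
    var userInfo = { "id": 123, "name": "Alice; admin" };
</script>
"""
script_text = bs(html, "html.parser").script.string

# Capture everything between the first "{" and the last "}" that precedes ";".
match = re.search(r"var\s+userInfo\s*=\s*(\{.*\})\s*;", script_text, re.DOTALL)

# json.loads() raises json.JSONDecodeError if the captured text is not valid JSON.
user_info = json.loads(match.group(1)) if match else None
print(user_info)
```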


In [83]:
def extract_comments():
    """
    Objective: Extract a comment from the HTML and return it as a string.
    Expected Output:
    ' User ID: 67890 '
    """
    html = """
        <!-- User ID: 67890 -->
        <div class="user-info">Name: John Doe</div>
        """

    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Use BeautifulSoup to locate and extract the comment.
    soup_string = soup.find_all(string=True)

    comments = []

    for s in soup_string:
        if isinstance(s, Comment):
            comments.append(s)


    # TODO: Print the comment
    for comment in comments:
        print(comment)

extract_comments()

 User ID: 67890 


In [93]:
def extract_dynamic_classes():
    """
    Objective: Extract text content from all <div> elements with class names starting with 'content-'.
    Expected Output:
    ['Item 1', 'Item 2', 'Item 3']
    """
    html = """
        <div class="content-1">Item 1</div>
        <div class="content-2">Item 2</div>
        <div class="content-3">Item 3</div>
        """
    
    # TODO: Convert html to soup
    soup = bs(html, "html.parser")

    # TODO: Use `.find_all()` with a custom filter to locate the elements
    # The filter receives each class value (or None when a tag has no class attribute).
    divs = soup.find_all("div", class_=lambda c: c is not None and c.startswith("content-"))

    # TODO: Iterate over the elements and extract text from the specified <div> elements.
    div_text = list()

    for div in divs:
        div_text.append(div.text.strip())

    # TODO: Print the list
    print(div_text)


extract_dynamic_classes()

['Item 1', 'Item 2', 'Item 3']


### **Reflection**
Which BeautifulSoup method do you prefer: `.find()` or `.select_one()`?

I don't think either `.find()` or `.select_one()` is clearly better, but I prefer to use `.find()`.
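
To make the comparison concrete, here is a small sketch, reusing the class and ID markup from the CSS selector exercise, in which both methods return equal elements for the same lookups.

```python
from bs4 import BeautifulSoup as bs

html = '<div><p class="my-class">Hello, my class!</p><p id="my-id">Hello, my id!</p></div>'
soup = bs(html, "html.parser")

# Lookup by class: keyword filter versus CSS class selector.
by_find_class = soup.find("p", class_="my-class")
by_select_class = soup.select_one("p.my-class")

# Lookup by id: keyword filter versus CSS id selector.
by_find_id = soup.find("p", id="my-id")
by_select_id = soup.select_one("p#my-id")

print(by_find_class == by_select_class)
print(by_find_id == by_select_id)

# A parent/child relationship fits in a single selector string.
nested = soup.select_one("div > p#my-id")
print(nested.text)
```

For simple lookups the two are interchangeable; CSS selectors tend to be more compact once the lookup involves nesting or several conditions at once.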


### **Exploration**
Automate the process of getting HTML content by using the Requests library. Read the official documentation.
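
A minimal starting point might look like the sketch below. The helper names `fetch_soup` and `page_title`, the 10-second timeout, and the placeholder URL are all assumptions for illustration, not part of the exercise.

```python
import requests
from bs4 import BeautifulSoup as bs


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return bs(response.text, "html.parser")


def page_title(soup):
    """Return the stripped <title> text, or None when the page has no <title>."""
    return soup.title.get_text(strip=True) if soup.title else None


# Example call (needs network access; the URL is only a placeholder):
# print(page_title(fetch_soup("https://example.com")))
```

Keeping the download and the parsing in separate functions lets the parsing logic be tested on static HTML strings, as in the exercises above, without any network access.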