## **Parsing HTML**

BeautifulSoup is a Python library for extracting data from HTML documents, which makes it a common choice for web scraping. It parses the markup into a tree of Python objects (tags, navigable strings, and comments) that you can search, traverse, and modify to reach the data you need.
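
To make the tree idea concrete, the short sketch below parses a one-line snippet and lists the Python type of every node under `<p>`. The HTML string and variable names are made up for illustration and are not part of the exercises.

```python
from bs4 import BeautifulSoup

# Illustrative snippet: text, a nested tag, and a comment inside one <p>.
html = "<p>Hello, <b>world</b>!<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

# .descendants walks every node below <p> in document order.
node_types = [type(node).__name__ for node in soup.p.descendants]
print(node_types)
# ['NavigableString', 'Tag', 'NavigableString', 'NavigableString', 'Comment']
```

Text, tags, and comments each get their own node type, which is why later exercises can filter on `Comment` or call `.get_text()` on a tag.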

In [5]:
from bs4 import BeautifulSoup as bs, Comment
import json

"""
Practice Exercise: BeautifulSoup Basics

Complete each function below by following the TODO instructions. 
Each function includes the objective of the task and the expected output.
"""



In [6]:
def convert_text_to_soup():
    """
    Objective: Convert the provided text (HTML content) into a BeautifulSoup object.
    Expected Output:
    Type of text before soup: <class 'str'>
    Type of text after soup: <class 'bs4.BeautifulSoup'>
    """
    text = """
        <html>
            <body>
                <div>
                    <p>Hello, world!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Use BeautifulSoup to convert text object type to soup object type
    # TODO: Use print() to print the type of text before and after conversion
    bs_obj = bs(text, 'html.parser')
    print(f"Type of text before soup: {type(text)}")
    print(f"Type of text after soup: {type(bs_obj)}")
    return bs_obj

convert_text_to_soup()

Type of text before soup: <class 'str'>
Type of text after soup: <class 'bs4.BeautifulSoup'>



<html>
<body>
<div>
<p>Hello, world!</p>
</div>
</body>
</html>

In [7]:
def print_pretty():
    """
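    Objective: Compare the printed output with and without the BeautifulSoup.prettify() method.
    Expected Output:
    Type of text before soup: <class 'str'>
    Type of text after soup: <class 'bs4.BeautifulSoup'>
    The raw text and the soup, each printed on a single line, followed by
    the indented output of prettify().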
    """
    text = """<html><body><div><p>Hello, world!</p></div></body></html>"""
    # TODO: Use BeautifulSoup to convert text object type to soup object type
    # TODO: Print text
    # TODO: Print soup directly
    # TODO: Print using prettify method
    bs_obj = bs(text, 'html.parser')
    print(f"Type of text before soup: {type(text)}")
    print(f"Type of text after soup: {type(bs_obj)}")
    print(text)
    print(bs_obj)
    print(bs_obj.prettify())
    return bs_obj

print_pretty()

Type of text before soup: <class 'str'>
Type of text after soup: <class 'bs4.BeautifulSoup'>
<html><body><div><p>Hello, world!</p></div></body></html>
<html><body><div><p>Hello, world!</p></div></body></html>
<html>
 <body>
  <div>
   <p>
    Hello, world!
   </p>
  </div>
 </body>
</html>



<html><body><div><p>Hello, world!</p></div></body></html>

In [8]:
def find_going_down():
    """
    Objective: Demonstrate how to traverse downward in the HTML structure using `.find()` method.
    Expected Output:
    The <body>, <div>, and <p> tags in sequence as they are traversed.
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p id="my-id">Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    # TODO: Navigate soup to get <body> and print it
    # TODO: Navigate body to get <div> and print it
    # TODO: Navigate div to get <p> and print it  
    bs_obj = bs(html, 'html.parser') 
    body = bs_obj.find('body')
    print(body)
    div = body.find('div')
    print(div)
    p = div.find('p')
    print(p)
    return p

find_going_down()

<body>
<div>
<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>
</div>
</body>
<div>
<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>
</div>
<p class="my-class">Hello, my class!</p>


<p class="my-class">Hello, my class!</p>
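
As a side note on downward traversal, BeautifulSoup also lets you reach the first matching child by using the tag name as an attribute. The sketch below reuses the same HTML and compares the two spellings; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<html><body><div>
<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>
</div></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Chained .find() calls, as in the exercise above.
chained = soup.find("body").find("div").find("p")

# Attribute-style shorthand: each step returns the first tag with that name.
shorthand = soup.body.div.p

print(shorthand)             # <p class="my-class">Hello, my class!</p>
print(chained is shorthand)  # True: both point at the same node in the tree

# .find() returns None when nothing matches, so check before chaining further.
missing = soup.find("table")
print(missing)               # None
```

Because a failed lookup yields `None`, a long chain raises `AttributeError` at the first missing tag; checking intermediate results keeps the failure easy to diagnose.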

In [9]:
def find_next_to():
    """
    Objective: Locate the <p> element that comes immediately after a specific <p>.
    Expected Output:
    <p>Hello, my id!</p>
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p>Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    # TODO: Use `.find()` to locate the <p> element with class="my-class"
    # TODO: Use `.find_next()` to locate the next <p> element
    bs_obj = bs(html, 'html.parser')
    p_class = bs_obj.find('p', class_='my-class')
    print(p_class)
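    # Pass the tag name so the search only stops at a <p>, not at whatever
    # element happens to come next in the document.
    p_next = p_class.find_next('p')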
    print(p_next)
    return p_next

find_next_to()

<p class="my-class">Hello, my class!</p>
<p>Hello, my id!</p>


<p>Hello, my id!</p>
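
`.find_next('p')` searches forward in document order, while `.find_next_sibling('p')` only looks at elements that share the same parent. In the exercise above both give the same answer, but they can differ once the markup is nested. The sketch below uses a made-up HTML snippet to show the difference.

```python
from bs4 import BeautifulSoup

# Illustrative markup: a nested <p> sits between the two sibling <p> elements.
html = """
<div>
    <p class="my-class">first</p>
    <section><p>nested</p></section>
    <p>sibling</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("p", class_="my-class")

# Next <p> anywhere later in the document, regardless of depth.
next_in_document = first.find_next("p")
print(next_in_document)  # <p>nested</p>

# Next <p> that has the same parent as `first`.
next_sibling = first.find_next_sibling("p")
print(next_sibling)      # <p>sibling</p>
```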

In [10]:
def use_css_selectors():
    """
    Objective: Locate elements using CSS selectors.
    Expected Output:
    <p class="my-class">Hello, my class!</p>
    <p id="my-id">Hello, my id!</p>
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p id="my-id">Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    # TODO: Use `.select_one()` to locate elements using class, then print them.
    # TODO: Use `.select_one()` to locate elements using ID selectors, then print them.
    bs_obj = bs(html, 'html.parser')
    p_class = bs_obj.select_one('.my-class')
    print(p_class)
    p_id = bs_obj.select_one('#my-id')
    print(p_id)
    return p_class, p_id

use_css_selectors()

<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>


(<p class="my-class">Hello, my class!</p>, <p id="my-id">Hello, my id!</p>)
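
`.select_one()` returns only the first match. When you need every match, or want to combine selectors, `.select()` accepts the same CSS syntax and returns a list. A minimal sketch on the same HTML, with illustrative variable names:

```python
from bs4 import BeautifulSoup

html = """
<div>
    <p class="my-class">Hello, my class!</p>
    <p id="my-id">Hello, my id!</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# .select() returns a list of every match; "div > p" means <p> directly inside <div>.
texts = [p.get_text() for p in soup.select("div > p")]
print(texts)  # ['Hello, my class!', 'Hello, my id!']

# Attribute selectors work as well.
by_attr = soup.select_one('p[id="my-id"]')
print(by_attr.get_text())  # Hello, my id!

# .select_one() returns None when nothing matches.
missing = soup.select_one(".no-such-class")
print(missing)  # None
```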

In [12]:
def extract_text():
    """
    Objective: Extract and print the text content of a <p> element.
    Expected Output:
    Hello, world!
    """
    html = """
        <html>
            <body>
                <div>
                    <p>Hello, world!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    # TODO: Use `.find()` to locate the <p> element
    # TODO: Extract text from the <p> element and print it
    bs_obj = bs(html, 'html.parser')
    p = bs_obj.find('p')
    print(p)
    text = p.get_text()
    print(text)
    return text

extract_text()

<p>Hello, world!</p>
Hello, world!


'Hello, world!'

In [13]:
def extract_attributes():
    """
    Objective: Extract and print the attributes of an <a> element.
    Expected Output:
    <a class="my-link" href="https://www.google.com" target="_blank">Google</a>
    dict_keys(['class', 'href', 'target'])
    https://www.google.com
    """
    html = """
        <html>
            <body>
                <div>
                    <a class="my-link" href="https://www.google.com" target="_blank">Google</a>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    # TODO: Use `.find()` to locate the <a> element
    # TODO: Print all available attributes from the <a> element
    # TODO: Print the href attribute from the <a> element
    bs_obj = bs(html, 'html.parser')
    a = bs_obj.find('a')
    print(a)
    attributes = a.attrs
    print(attributes.keys())
    print(a['href'])
    return attributes, a['href']

extract_attributes()

<a class="my-link" href="https://www.google.com" target="_blank">Google</a>
dict_keys(['class', 'href', 'target'])
https://www.google.com


({'class': ['my-link'], 'href': 'https://www.google.com', 'target': '_blank'},
 'https://www.google.com')
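
Indexing a tag like a dictionary (`a['href']`) raises `KeyError` when the attribute is absent, which is common on real pages. The sketch below shows the safer `.get()` lookup on the same `<a>` element; the `title` and `data-id` attribute names are just examples of attributes that are not present.

```python
from bs4 import BeautifulSoup

html = '<a class="my-link" href="https://www.google.com" target="_blank">Google</a>'
a = BeautifulSoup(html, "html.parser").find("a")

# .get() returns None (or a default you supply) instead of raising.
title = a.get("title")
data_id = a.get("data-id", "not set")
print(title, data_id)  # None not set

# Dictionary-style access raises KeyError for a missing attribute.
try:
    a["title"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# 'class' can hold several values, so BeautifulSoup returns it as a list.
classes = a["class"]
print(classes)  # ['my-link']
```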

In [14]:
def extract_text_from_list():
    """
    Objective: Extract text from all <li> elements and return them as a list of strings.
    Expected Output:
    ['Item 1', 'Item 2', 'Item 3']
    """
    html = """
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>
        """
    # TODO: Convert html to soup
    # TODO: Use `.find_all()` to locate all <li> elements 
    # TODO: Iterate over the <li> elements and extract their text into a list.
    # TODO: Print the list
    bs_obj = bs(html, 'html.parser')
    li_elements = bs_obj.find_all('li')
    items = [li.get_text() for li in li_elements]
    print(items)
    return items

extract_text_from_list()

['Item 1', 'Item 2', 'Item 3']


['Item 1', 'Item 2', 'Item 3']

In [15]:
def find_all_going_down():
    """
    Objective: Extract text from all <p> elements and return them as a list of dictionaries.
    Expected Output:
    [{'name': 'John Doe', 'age': '25'}, {'name': 'Nadia', 'age': '31'}, {'name': 'Serena', 'age': '23'}, {'name': 'Tessa', 'age': '17'}, {'name': 'Una', 'age': '23'}]
    """
    html = """
        <section>
            <div class="user-info">
                <p class="my-name">John Doe</p>
                <p class="my-age">25</p>
            </div>
            <div class="user-info">
                <p class="my-name">Nadia</p>
                <p class="my-age">31</p>
            </div>
            <div class="user-info">
                <p class="my-name">Serena</p>
                <p class="my-age">23</p>
            </div>
            <div class="user-info">
                <p class="my-name">Tessa</p>
                <p class="my-age">17</p>
            </div>
            <div class="user-info">
                <p class="my-name">Una</p>
                <p class="my-age">23</p>
            </div>
        </section>"""
    # TODO: Convert html to soup
    # TODO: Use `.find_all()` to locate all <p> elements
    # TODO: Iterate over the <p> elements and extract their text into a list of dictionaries
    # TODO: Print the list of dictionaries
    bs_obj = bs(html, 'html.parser')
    p_elements = bs_obj.find_all('p')
    user_info = []
    for i in range(0, len(p_elements), 2):
        user = {
            'name': p_elements[i].get_text(),
            'age': p_elements[i + 1].get_text()
        }
        user_info.append(user)
    print(user_info)
    return user_info

find_all_going_down()

[{'name': 'John Doe', 'age': '25'}, {'name': 'Nadia', 'age': '31'}, {'name': 'Serena', 'age': '23'}, {'name': 'Tessa', 'age': '17'}, {'name': 'Una', 'age': '23'}]


[{'name': 'John Doe', 'age': '25'},
 {'name': 'Nadia', 'age': '31'},
 {'name': 'Serena', 'age': '23'},
 {'name': 'Tessa', 'age': '17'},
 {'name': 'Una', 'age': '23'}]
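
Pairing `<p>` elements by index works only while every user block contains exactly one name and one age. As an alternative sketch, the code below iterates over each `div.user-info` and looks up the fields inside it, so a missing field becomes `None` instead of shifting every later pair. The shortened HTML, with one age deliberately removed, is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the second user has no age.
html = """
<section>
    <div class="user-info">
        <p class="my-name">John Doe</p>
        <p class="my-age">25</p>
    </div>
    <div class="user-info">
        <p class="my-name">Nadia</p>
    </div>
    <div class="user-info">
        <p class="my-name">Serena</p>
        <p class="my-age">23</p>
    </div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

users = []
for block in soup.find_all("div", class_="user-info"):
    # Search inside this block only, so fields cannot leak between users.
    name = block.find("p", class_="my-name")
    age = block.find("p", class_="my-age")
    users.append({
        "name": name.get_text() if name else None,
        "age": age.get_text() if age else None,
    })

print(users)
# [{'name': 'John Doe', 'age': '25'}, {'name': 'Nadia', 'age': None}, {'name': 'Serena', 'age': '23'}]
```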

In [None]:
def extract_tables():
    """
    Objective: Extract data from an HTML table and return it as a list of dictionaries.
    Expected Output:
    [{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
    """
    html = """
    <table>
        <thead>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Alice</td>
                <td>30</td>
            </tr>
            <tr>
                <td>Bob</td>
                <td>25</td>
            </tr>
        </tbody>
    </table>
    """
    # TODO: Convert html to soup
    # TODO: Extract <tr> elements from <tbody>
    # TODO: Iterate over the <td> elements, to construct dictionaries for each row.
    # TODO: Append each dictionary to a list
    # TODO: Print the list
    bs_obj = bs(html, 'html.parser')
    rows = bs_obj.find('tbody').find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        data.append({
            'name': cols[0].get_text(),
            'age': cols[1].get_text()
        })
    print(data)

    # Challenge: can you extract the rows directly from the <tr> elements?
    # A CSS selector reaches the body rows in a single call.
    all_rows = bs_obj.select('tbody tr')
    data_direct = []
    for row in all_rows:
        tds = row.find_all('td')
        data_direct.append({
            'name': tds[0].get_text(),
            'age': tds[1].get_text()
        })
    print(data_direct)
    return data_direct

extract_tables()

[{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
[{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]


[{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]

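The solution above hard-codes the keys `'name'` and `'age'`. One possible variation, sketched below, reads the keys from the `<th>` cells so the same loop works for tables with other columns. Lowercasing the header text is a choice made here to match the expected output, not something the exercise requires.

```python
from bs4 import BeautifulSoup

html = """
<table>
    <thead>
        <tr><th>Name</th><th>Age</th></tr>
    </thead>
    <tbody>
        <tr><td>Alice</td><td>30</td></tr>
        <tr><td>Bob</td><td>25</td></tr>
    </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Use the header cells as dictionary keys.
headers = [th.get_text(strip=True).lower() for th in soup.select("thead th")]

rows = []
for tr in soup.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # zip() pairs each header with the cell in the same column.
    rows.append(dict(zip(headers, cells)))

print(rows)
# [{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
```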

In [19]:
def extract_scripts():
    """
    Objective: Extract JSON-like data embedded in a <script> tag and return it as a Python dictionary.
    Expected Output:
    {'id': 123, 'name': 'Alice'}
    """
    html = """
        <script>
            var userInfo = { "id": 123, "name": "Alice" };
        </script>
        """

    # TODO: Convert html to soup
    # TODO: Extract the JSON-like content from the <script> tag
    # TODO: Remove the "var userInfo = " and ";" from the JSON-like content
    # TODO: Convert the JSON-like content to a Python dictionary
    # TODO: Print the dictionary
    bs_obj = bs(html, 'html.parser')
    script = bs_obj.find('script').string
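    # Drop the assignment prefix and only the trailing semicolon, so that a
    # ';' inside a string value would be left intact.
    json_data = script.strip().replace('var userInfo = ', '', 1).rstrip(';')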
    data_dict = json.loads(json_data)
    print(data_dict)
    return data_dict

extract_scripts()

{'id': 123, 'name': 'Alice'}


{'id': 123, 'name': 'Alice'}
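
String replacement depends on the exact spacing of the assignment. A regular expression is one way to tolerate small variations, and it avoids touching semicolons inside the data. The sketch below assumes the script contains a single `var userInfo = {...};` statement holding valid JSON; the `note` field is invented to show a value that contains a semicolon.

```python
import json
import re

from bs4 import BeautifulSoup

# Illustrative markup: the "note" value contains a semicolon on purpose.
html = """
<script>
    var userInfo = { "id": 123, "note": "a; b" };
</script>
"""
soup = BeautifulSoup(html, "html.parser")
script_text = soup.find("script").string

# Capture everything from the first "{" to the last "}" before the closing ";".
match = re.search(r"var\s+userInfo\s*=\s*(\{.*\})\s*;", script_text, re.DOTALL)
if match is None:
    raise ValueError("userInfo assignment not found in <script>")

user_info = json.loads(match.group(1))
print(user_info)  # {'id': 123, 'note': 'a; b'}
```

This only works when the embedded object is strict JSON; JavaScript object literals with unquoted keys or single quotes need a different parser.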

In [None]:
def extract_comments():
    """
    Objective: Extract a comment from the HTML and return it as a string.
    Expected Output:
    ' User ID: 67890 '
    """
    html = """
        <!-- User ID: 67890 -->
        <div class="user-info">Name: John Doe</div>
        """

    # TODO: Convert html to soup
    # TODO: Use BeautifulSoup to locate and extract the comment.
    # TODO: Print the comment
    bs_obj = bs(html, 'html.parser')
    # The comment is not inside any tag in the HTML above, so search the text
    # nodes by type (Comment). Is this what the exercise intends?
    comment = bs_obj.find(string=lambda text: isinstance(text, Comment))
    print(comment)
    return comment

extract_comments()

 User ID: 67890 


' User ID: 67890 '
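
The same type-based filter works with `.find_all()` when a page holds more than one comment. A minimal sketch, using made-up HTML with a second comment nested inside the `<div>`:

```python
from bs4 import BeautifulSoup, Comment

# Illustrative markup: one top-level comment and one nested comment.
html = """
<!-- User ID: 67890 -->
<div class="user-info">Name: John Doe<!-- verified --></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Comment is a subclass of NavigableString, so a string filter can select it.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

# .strip() removes the padding spaces kept from inside <!-- ... -->.
cleaned = [comment.strip() for comment in comments]
print(cleaned)  # ['User ID: 67890', 'verified']
```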

In [21]:
def extract_dynamic_classes():
    """
    Objective: Extract text content from all <div> elements with class names starting with 'content-'.
    Expected Output:
    ['Item 1', 'Item 2', 'Item 3']
    """
    html = """
        <div class="content-1">Item 1</div>
        <div class="content-2">Item 2</div>
        <div class="content-3">Item 3</div>
        """
    
    # TODO: Convert html to soup
    # TODO: Use `.find_all()` with a custom filter to locate the elements
    # TODO: Iterate over the elements and extract text from the specified <div> elements.
    # TODO: Print the list
    bs_obj = bs(html, 'html.parser')
    div_elements = bs_obj.find_all('div', class_=lambda x: x and x.startswith('content-'))
    items = [div.get_text() for div in div_elements]
    print(items)
    return items

extract_dynamic_classes()

['Item 1', 'Item 2', 'Item 3']


['Item 1', 'Item 2', 'Item 3']
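
Two other ways to express "class starts with `content-`" are a compiled regular expression and a CSS attribute selector. They are not fully equivalent: the `class_` filter is tested against each class name of an element, while `[class^="..."]` is tested against the whole attribute string. The sketch below uses made-up HTML in which one element carries two classes to show the difference.

```python
import re

from bs4 import BeautifulSoup

# Illustrative markup: the second <div> has an extra class in front.
html = """
<div class="content-1">Item 1</div>
<div class="card content-2">Item 2</div>
<div class="sidebar">Other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ is checked against each individual class name.
by_regex = [div.get_text() for div in soup.find_all("div", class_=re.compile(r"^content-"))]
print(by_regex)  # ['Item 1', 'Item 2']

# [class^="content-"] is checked against the full value "card content-2".
by_css = [div.get_text() for div in soup.select('div[class^="content-"]')]
print(by_css)    # ['Item 1']
```

When elements may carry several classes, the `class_` filter is usually the closer match to the intent of the exercise.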

### **Reflection**
Which BeautifulSoup method do you prefer: `.find()` or `.select_one()`?


It depends on the use case, but if I had to choose with no other factors in play, I prefer `.find()` because it directly returns the first match and is more concise.

### **Exploration**
Automate fetching the HTML content with the Requests library. Read the official documentation.
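
As a starting point for this exploration, here is a minimal sketch that downloads a page and hands the response text to BeautifulSoup. The function name `fetch_soup` and its `session` and `timeout` parameters are hypothetical choices for this example, not part of either library; only `requests.get()`, `raise_for_status()`, and `response.text` come from Requests.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Download `url` and return its HTML parsed as a BeautifulSoup object.

    `session` is an optional requests.Session (or any object with a
    compatible .get() method); it defaults to the module-level requests API.
    """
    http = session or requests
    response = http.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

A call such as `fetch_soup("https://example.com").title` would then return the page's `<title>` tag, where the URL is only a placeholder. Passing the session in as a parameter is a design choice that makes the function easy to try without network access, since a stand-in object can be supplied instead.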