## **Parsing HTML**

BeautifulSoup is a Python library that simplifies the process of web scraping by allowing developers to extract data from HTML documents easily. It transforms complicated HTML documents into a tree of Python objects, such as tags, navigable strings, and comments. This makes it straightforward to locate and manipulate the desired data.
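As a minimal sketch of that object tree, the snippet below parses a tiny made-up HTML string and inspects the types of the nodes it produces. The HTML and variable names are illustrative assumptions, not part of the exercises that follow.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Illustrative HTML: one tag containing a text node and a comment node
html = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
text_node, comment_node = p.contents  # the two children of <p>

print(type(p).__name__)             # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment
```

Each kind of node is a separate Python type, which is why later exercises can filter on `Comment` with `isinstance`.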

In [None]:
from bs4 import BeautifulSoup as bs, Comment
import json

"""
Practice Exercise: BeautifulSoup Basics

Complete each function below by following the TODO instructions. 
Each function includes the objective of the task and the expected output.
"""

In [2]:
from bs4 import BeautifulSoup

def convert_text_to_soup():
    """
    Objective: Convert the provided text (HTML content) into a BeautifulSoup object.
    Expected Output:
    Type of text before soup: <class 'str'>
    Type of text after soup: <class 'bs4.BeautifulSoup'>
    """
    text = """
        <html>
            <body>
                <div>
                    <p>Hello, world!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Use BeautifulSoup to convert text object type to soup object type
    print(f"Type of text before soup: {type(text)}")
    # TODO: Use print() to print the type of text before and after conversion
    soup = BeautifulSoup(text, 'html.parser')
    print(f"Type of text after soup: {type(soup)}")

convert_text_to_soup()

Type of text before soup: <class 'str'>
Type of text after soup: <class 'bs4.BeautifulSoup'>


In [3]:
def print_pretty():
    """
    Objective: Compare the print output with and without the BeautifulSoup.prettify() method.
    Expected Output:
    The original text, the soup printed directly (both on a single line),
    and the prettified soup with one tag per line, indented by nesting depth.
    """
    text = """<html><body><div><p>Hello, world!</p></div></body></html>"""
    # TODO: Use BeautifulSoup to convert text object type to soup object type
    # TODO: Print text
    # TODO: Print soup directly
    # TODO: Print using prettify method
    soup = BeautifulSoup(text, 'html.parser')
    
    # Print original text
    print("Original text:")
    print(text)
    print("\nSoup direct print:")
    print(soup)
    print("\nPrettified output:")
    print(soup.prettify())

print_pretty()

Original text:
<html><body><div><p>Hello, world!</p></div></body></html>

Soup direct print:
<html><body><div><p>Hello, world!</p></div></body></html>

Prettified output:
<html>
 <body>
  <div>
   <p>
    Hello, world!
   </p>
  </div>
 </body>
</html>



In [4]:
def find_going_down():
    """
    Objective: Demonstrate how to traverse downward in the HTML structure using `.find()` method.
    Expected Output:
    The <body>, <div>, and <p> tags in sequence as they are traversed.
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p id="my-id">Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    # TODO: Navigate soup to get <body> and print it
    body = soup.find('body')
    print("Body tag:")
    print(body)
    # TODO: Navigate body to get <div> and print it
    div = body.find('div')
    print("\nDiv tag:")
    print(div)
    # TODO: Navigate div to get <p> and print it   
    p = div.find('p')
    print("\nFirst p tag:")
    print(p)


find_going_down()

Body tag:
<body>
<div>
<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>
</div>
</body>

Div tag:
<div>
<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>
</div>

First p tag:
<p class="my-class">Hello, my class!</p>


In [5]:
def find_next_to():
    """
    Objective: Locate the <p> element that comes immediately after a specific <p>.
    Expected Output:
    <p>Hello, my id!</p>
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p>Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    # TODO: Use `.find()` to locate the <p> element with class="my-class"
    first_p = soup.find('p', class_='my-class')
    # TODO: Use `.find_next()` to locate the next <p> element
    next_p = first_p.find_next('p')
    print(next_p)
    
find_next_to()

<p>Hello, my id!</p>


In [6]:
def use_css_selectors():
    """
    Objective: Locate elements using CSS selectors.
    Expected Output:
    <p class="my-class">Hello, my class!</p>
    <p id="my-id">Hello, my id!</p>
    """
    html = """
        <html>
            <body>
                <div>
                    <p class="my-class">Hello, my class!</p>
                    <p id="my-id">Hello, my id!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    # TODO: Use `.select_one()` to locate elements using class, then print them.
    class_element = soup.select_one('p.my-class')
    print(class_element)
    # TODO: Use `.select_one()` to locate elements using ID selectors, then print them.
    id_element = soup.select_one('p#my-id')
    print(id_element)

use_css_selectors()

<p class="my-class">Hello, my class!</p>
<p id="my-id">Hello, my id!</p>


In [7]:
def extract_text():
    """
    Objective: Extract and print the text content of a <p> element.
    Expected Output:
    Hello, world!
    """
    html = """
        <html>
            <body>
                <div>
                    <p>Hello, world!</p>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    # TODO: Use `.find()` to locate the <p> element
    p_element = soup.find('p')
    # TODO: Extract text from the <p> element and print it
    print(p_element.text)

extract_text()

Hello, world!


In [8]:
def extract_attributes():
    """
    Objective: List the attributes of an <a> element and print its href value.
    Expected Output:
    dict_keys(['class', 'href', 'target'])
    https://www.google.com
    """
    html = """
        <html>
            <body>
                <div>
                    <a class="my-link" href="https://www.google.com" target="_blank">Google</a>
                </div>
            </body>
        </html>
        """
    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    # TODO: Use `.find()` to locate the <a> element
    a_element = soup.find('a')
    # TODO: Print all available attributes from the <a> element
    print(a_element.attrs.keys())
    # TODO: Print the href attribute from the <a> element
    print(a_element['href'])

extract_attributes()

dict_keys(['class', 'href', 'target'])
https://www.google.com


In [9]:
def extract_text_from_list():
    """
    Objective: Extract text from all <li> elements and return them as a list of strings.
    Expected Output:
    ['Item 1', 'Item 2', 'Item 3']
    """
    html = """
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>
        """
    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    # TODO: Use `.find_all()` to locate all <li> elements 
    li_elements = soup.find_all('li')
    # TODO: Iterate over the <li> elements and extract their text into a list.
    items = [li.text for li in li_elements]
    # TODO: Print the list
    print(items)

extract_text_from_list()

['Item 1', 'Item 2', 'Item 3']


In [10]:
def find_all_going_down():
    """
    Objective: Extract text from all <p> elements and return them as a list of dictionaries.
    Expected Output:
    [{'name': 'John Doe', 'age': '25'}, {'name': 'Nadia', 'age': '31'}, {'name': 'Serena', 'age': '23'}, {'name': 'Tessa', 'age': '17'}, {'name': 'Una', 'age': '23'}]
    """
    html = """
        <section>
            <div class="user-info">
                <p class="my-name">John Doe</p>
                <p class="my-age">25</p>
            </div>
            <div class="user-info">
                <p class="my-name">Nadia</p>
                <p class="my-age">31</p>
            </div>
            <div class="user-info">
                <p class="my-name">Serena</p>
                <p class="my-age">23</p>
            </div>
            <div class="user-info">
                <p class="my-name">Tessa</p>
                <p class="my-age">17</p>
            </div>
            <div class="user-info">
                <p class="my-name">Una</p>
                <p class="my-age">23</p>
            </div>
        </section>"""
    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    # TODO: Use `.find_all()` to locate all <div class="user-info"> containers
    user_divs = soup.find_all('div', class_='user-info')
    # TODO: Iterate over the containers and extract the name and age <p> text into a list of dictionaries
    users = []
    for div in user_divs:
        name = div.find('p', class_='my-name').text
        age = div.find('p', class_='my-age').text
        users.append({'name': name, 'age': age})
    
    # TODO: Print the list of dictionaries
    print(users)
    
find_all_going_down()

[{'name': 'John Doe', 'age': '25'}, {'name': 'Nadia', 'age': '31'}, {'name': 'Serena', 'age': '23'}, {'name': 'Tessa', 'age': '17'}, {'name': 'Una', 'age': '23'}]


In [12]:
def extract_tables():
    """
    Objective: Extract data from an HTML table and return it as a list of dictionaries.
    Expected Output:
    [{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
    """
    html = """
    <table>
        <thead>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Alice</td>
                <td>30</td>
            </tr>
            <tr>
                <td>Bob</td>
                <td>25</td>
            </tr>
        </tbody>
    </table>
    """
    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    # TODO: Extract <tr> elements from <tbody>
    rows = soup.find('tbody').find_all('tr')
    # TODO: Iterate over the <td> elements, to construct dictionaries for each row.
    table_data = []
    for row in rows:
        cols = row.find_all('td')
        data = {
            'name': cols[0].text,
            'age': cols[1].text
        }
        # TODO: Append each dictionary to a list
        table_data.append(data)
    # TODO: Print the list
    print(table_data)
    
    # Challenge: Can you extract the data directly from the <tr> elements?

extract_tables()

[{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
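One possible take on the challenge, as a sketch only: read the dictionary keys from the `<th>` cells and walk every `<tr>` directly, skipping rows that have no `<td>`. The function name `table_to_dicts` is an assumption introduced for this example, and lowercasing the header text is a choice made here to match the expected output.

```python
from bs4 import BeautifulSoup

html = """
<table>
    <thead>
        <tr><th>Name</th><th>Age</th></tr>
    </thead>
    <tbody>
        <tr><td>Alice</td><td>30</td></tr>
        <tr><td>Bob</td><td>25</td></tr>
    </tbody>
</table>
"""

def table_to_dicts(html):
    """Build one dictionary per data row, keyed by the lowercased header text."""
    soup = BeautifulSoup(html, "html.parser")
    headers = [th.get_text(strip=True).lower() for th in soup.find_all("th")]
    records = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # the header row has no <td>, so it is skipped
            records.append(dict(zip(headers, cells)))
    return records

print(table_to_dicts(html))
# [{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
```

Because the keys come from the header row, the same function keeps working if a column is added to the table.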


In [15]:
import json

def extract_scripts():
    """
    Objective: Extract JSON-like data embedded in a <script> tag and return it as a Python dictionary.
    Expected Output:
    {'id': 123, 'name': 'Alice'}
    """
    html = """
        <script>
            var userInfo = { "id": 123, "name": "Alice" };
        </script>
        """

    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    
    # TODO: Extract the JSON-like content from the <script> tag        
    # TODO: Remove the "var userInfo = " and ";" from the JSON-like content
    # TODO: Convert the JSON-like content to a Python dictionary
    # TODO: Print the dictionary
    script_content = soup.find('script').string
    json_str = script_content.strip().replace('var userInfo = ', '').rstrip(';')
    data = json.loads(json_str)
    print(data)
    

extract_scripts()

{'id': 123, 'name': 'Alice'}


In [18]:
def extract_comments():
    """
    Objective: Extract a comment from the HTML and return it as a string.
    Expected Output:
    ' User ID: 67890 '
    """
    # Import after the docstring so the string above is still the function's docstring
    from bs4 import Comment
    html = """
        <!-- User ID: 67890 -->
        <div class="user-info">Name: John Doe</div>
        """

    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    
    # TODO: Use BeautifulSoup to locate and extract the comment.
    comment = soup.find(string=lambda text: isinstance(text, Comment))
    
    # TODO: Print the comment
    print(comment)
    

extract_comments()

 User ID: 67890 


In [19]:
def extract_dynamic_classes():
    """
    Objective: Extract text content from all <div> elements with class names starting with 'content-'.
    Expected Output:
    ['Item 1', 'Item 2', 'Item 3']
    """
    html = """
        <div class="content-1">Item 1</div>
        <div class="content-2">Item 2</div>
        <div class="content-3">Item 3</div>
        """
    
    # TODO: Convert html to soup
    soup = BeautifulSoup(html, 'html.parser')
    
    # TODO: Use `.find_all()` with a custom filter to locate the elements
    divs = soup.find_all('div', class_=lambda x: x and x.startswith('content-'))
     
    # TODO: Iterate over the elements and extract text from the specified <div> elements.
    items = [div.text for div in divs]
    
    # TODO: Print the list
    print(items)
    

extract_dynamic_classes()

['Item 1', 'Item 2', 'Item 3']


### **Reflection**
Which BeautifulSoup method do you prefer: .find() or .select_one()?


I prefer .find() over .select_one() for several reasons:

1. Readability: .find() has a straightforward syntax that clearly shows what you're looking for. For example:

In [20]:
# Build a small soup first so the comparison runs on its own
soup = BeautifulSoup('<div class="content">Hello</div>', 'html.parser')

soup.find('div', class_='content')  # More readable
soup.select_one('div.content')      # Less intuitive for beginners

2. Flexibility: .find() offers more flexible filtering options through its parameters:

- Can use dictionaries for multiple attributes
- Supports custom filtering functions
- Easier to combine multiple conditions
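The filtering options listed above can be sketched as follows. The HTML snippet and the helper `is_relative_link` are made up for illustration; only `find()`, `attrs`, `class_`, and `re.compile` come from BeautifulSoup and the standard library.

```python
import re
from bs4 import BeautifulSoup

html = """
<div>
  <a class="my-link" href="https://www.google.com" target="_blank">Google</a>
  <a class="my-link" href="/about">About</a>
  <p class="content-1">Item 1</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A dictionary matches several attributes at once
external = soup.find("a", attrs={"class": "my-link", "target": "_blank"})

# A compiled regular expression matches class names by pattern
content = soup.find("p", class_=re.compile(r"^content-"))

# A custom function receives each tag and returns True for a match
def is_relative_link(tag):
    return tag.name == "a" and tag.get("href", "").startswith("/")

relative = soup.find(is_relative_link)

print(external.text)  # Google
print(content.text)   # Item 1
print(relative.text)  # About
```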

3. Performance: .find() is generally a little faster for simple searches because it skips CSS selector parsing, though the difference rarely matters on small pages.

However, .select_one() is better when:

- You need to use complex CSS selectors
- You're working with nested structures where CSS paths are clearer
- You're already familiar with CSS selector syntax
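As a rough illustration of those cases, the sketch below uses `.select_one()` and `.select()` on a nested structure modeled on the user-info exercise. The selectors are examples chosen here, not the only way to write them.

```python
from bs4 import BeautifulSoup

html = """
<section>
  <div class="user-info"><p class="my-name">John Doe</p><p class="my-age">25</p></div>
  <div class="user-info"><p class="my-name">Nadia</p><p class="my-age">31</p></div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# A child-combinator path expresses the nesting in a single string
first_name = soup.select_one("section > div.user-info > p.my-name")

# .select() returns every match for a descendant selector
ages = [p.get_text() for p in soup.select("div.user-info p.my-age")]

# A positional pseudo-class picks the second <div> among its siblings
second_name = soup.select_one("div.user-info:nth-of-type(2) p.my-name")

print(first_name.get_text())   # John Doe
print(ages)                    # ['25', '31']
print(second_name.get_text())  # Nadia
```

Writing the same lookups with chained `.find()` calls would take several lines each, which is where the selector syntax pays off.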

For most basic web scraping tasks, .find() is sufficient and more maintainable.

### **Exploration**
Automate the process of fetching HTML content with the Requests library. Read the official documentation.

Install:
# pip install requests

Key features demonstrated:

1. Making HTTP GET requests
2. Error handling for network issues
3. Status code checking
4. Converting response to BeautifulSoup object

The code shows how to:

- Send HTTP requests to web pages
- Handle potential network errors
- Parse the received HTML content
- Extract specific information from the parsed content

Remember to respect each website's robots.txt and add appropriate delays between requests when scraping multiple pages.

Example:

In [None]:
import requests
from bs4 import BeautifulSoup

def fetch_webpage(url):
    try:
        # Send GET request to the URL (the timeout keeps a stalled server from hanging the call)
        response = requests.get(url, timeout=10)
        
        # Raise an exception for bad status codes
        response.raise_for_status()
        
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup
        
    except requests.RequestException as e:
        print(f"Error fetching the webpage: {e}")
        return None

# Example usage
if __name__ == "__main__":
    # Example with a real website
    url = "https://jakarta.go.id"
    soup = fetch_webpage(url)
    
    if soup:
        # Find and print all h2 headings
        headings = soup.find_all('h2')
        print(f"Main headings on {url}:")
        for heading in headings:
            print(f"- {heading.text.strip()}")
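The reminder about robots.txt and delays can be sketched with the standard library's `urllib.robotparser`. Everything site-specific here is an assumption for illustration: the robots.txt rules are hard-coded instead of downloaded, the `example.com` URLs are placeholders, and `polite_urls` and the `my-scraper` user agent are hypothetical names.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real scraper would load them from the site
# with parser.set_url(".../robots.txt") followed by parser.read()
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
parser = RobotFileParser()
parser.parse(robots_lines)

def polite_urls(urls, user_agent="my-scraper", delay_seconds=1.0):
    """Yield only the URLs robots.txt allows, sleeping between consecutive ones."""
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            print(f"Skipping (disallowed): {url}")
            continue
        if not first:
            time.sleep(delay_seconds)  # pause before each further request
        first = False
        yield url

urls = [
    "https://example.com/news",
    "https://example.com/private/data",
    "https://example.com/about",
]
for url in polite_urls(urls, delay_seconds=0.1):
    print(f"OK to fetch: {url}")
```

Each URL yielded by `polite_urls` could then be passed to `fetch_webpage` from the example above.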