# Unit 1

## Searching the Web with DDGS in Python

# Introduction

Welcome to the first lesson of our course, **"Automating Web Content Retrieval and Parsing in Python"**. In this course, you will learn how to build a research tool that can search the web, gather content, and process information automatically.

Automating web search is a key part of building modern research tools. Instead of searching for information manually, you can write a Python module to do it for you. This saves time and allows you to collect and process large amounts of information quickly.

In this lesson, we will focus on using **DDGS**, a metasearch library that aggregates results from diverse web search services.

-----

### Learning Objectives

By the end of this lesson, you will know how to:

  * Search the web using `DDGS` in Python.
  * Fetch the first result from your search.
  * Convert the web page content into a readable format.

Let’s get started!

-----

## Using DDGS to Search the Web

The **DDGS** library allows you to perform web searches directly from Python. This is helpful because you can automate the process of finding information online.

`DDGS` works as a metasearch tool: it automatically selects and queries different search backends (such as DuckDuckGo, Brave, or others) to provide you with a diverse set of results. You do not need to choose the backend yourself; the library handles this for you.

First, let’s see how to import the `DDGS` library and perform a simple search.

```python
from ddgs import DDGS

query = "Python programming"
results = DDGS().text(query, max_results=1)
print(results)
```

**Code Explanation:**

  * We import the `DDGS` class from the library.
  * We define a search query, in this case, `"Python programming"`.
  * We call the **`.text()`** method to perform the search and ask for just one result (`max_results=1`).
  * The `results` variable will contain a list of search results.

**Sample Output:**

```
[{'title': 'Welcome to Python.org', 'href': 'https://www.python.org/', 'body': 'The official home of the Python Programming Language.'}]
```

Each result is a **dictionary** with keys like `title`, `href` (the URL), and `body` (a short description).
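
As a small illustration of that structure, the sketch below loops over a results list and reads each field with `dict.get()`. The data is copied from the sample output above rather than fetched live, and the `summary_lines` name is only an illustrative choice; a real search may return different values.

```python
# Sample data copied from the sample output above (not a live search)
results = [
    {
        "title": "Welcome to Python.org",
        "href": "https://www.python.org/",
        "body": "The official home of the Python Programming Language.",
    }
]

summary_lines = []
for index, result in enumerate(results, start=1):
    # .get() returns a fallback instead of raising KeyError if a key is missing
    title = result.get("title", "(no title)")
    url = result.get("href", "(no URL)")
    summary_lines.append(f"{index}. {title} <{url}>")

for line in summary_lines:
    print(line)
```

Using `.get()` with a fallback keeps the loop running even if one result lacks a field.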

> **Note:** On CodeSignal, the `ddgs` library is already installed, so you do not need to install it yourself. On your own computer, you would install it using `pip install ddgs`.

-----

## Extracting and Fetching the First Search Result

Now that we have search results, let’s extract the URL of the first result and fetch the web page content.

First, let’s check if we got any results and get the URL:

```python
if results and "href" in results[0]:
    url = results[0]["href"]
    print("First result URL:", url)
else:
    print("No valid results found.")
```

**Code Explanation:**

  * We check if `results` is not empty and if the first result has an `"href"` key.
  * If so, we extract the URL and print it.
  * If not, we print a message saying no valid results were found.

> Even though we request only one result with `max_results=1`, `DDGS().text()` always returns a **list**. To access the actual result, we need to extract the first item from the list using **`results[0]`**, even if there's only one result.

Next, let’s fetch the content of the web page using the **httpx** library:

```python
import httpx

try:
    response = httpx.get(url, timeout=10)
    response.raise_for_status()

    html_content = response.text
    print("Fetched web page content (first 200 characters):")
    print(html_content[:200])
except Exception as e:
    print(f"Error fetching URL {url}: {e}")
```

**Code Explanation:**

  * We use `httpx.get()` to download the web page at the given URL.
  * The `timeout=10` argument means the request will wait up to 10 seconds.
  * `response.raise_for_status()` raises an exception if the server responds with an error status code (for example, 404 or 500).
  * We print the first 200 characters of the fetched web page content.
  * If there is an error, we print an error message.

**Sample Output:**

```
Fetched web page content (first 200 characters):
<!doctype html><html lang="en"><head>    <title>Welcome to Python.org</title>    ...
```

-----

## Converting Web Content to Markdown

Web pages are usually written in HTML, which can be hard to read. To make the content easier to work with, we can convert it to **Markdown**, a simpler text format.

We will use the **`html_to_markdown`** library for this. Here’s how you can do it:

```python
from html_to_markdown import convert_to_markdown

markdown_content = convert_to_markdown(html_content)
print("Web page content in Markdown (first 200 characters):")
print(markdown_content[:200])
```

**Code Explanation:**

  * We import the `convert_to_markdown` function.
  * We pass the HTML content to this function, and it returns a Markdown version.
  * We print the first 200 characters of the Markdown content.

**Sample Output:**

```
Web page content in Markdown (first 200 characters):
# Welcome to Python.org

The official home of the Python Programming Language....
```

Markdown is much easier to read and process than raw HTML, which is why we use this step.
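
To get a feel for what such a conversion involves, here is a deliberately tiny sketch built on the standard library's `html.parser`. It is not how `html_to_markdown` works internally. It only handles `<h1>`, `<p>`, `<li>`, and `<strong>`, and the `ToyMarkdownConverter` and `toy_convert` names are made up for this illustration.

```python
from html.parser import HTMLParser


class ToyMarkdownConverter(HTMLParser):
    """Illustrative only: handles <h1>, <p>, <li>, and <strong>."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.chunks.append("# ")
        elif tag == "li":
            self.chunks.append("* ")
        elif tag == "strong":
            self.chunks.append("**")

    def handle_endtag(self, tag):
        if tag == "strong":
            self.chunks.append("**")
        elif tag in ("h1", "p"):
            self.chunks.append("\n\n")  # blank line after headings and paragraphs
        elif tag == "li":
            self.chunks.append("\n")

    def handle_data(self, data):
        # Skip whitespace-only text between tags
        if data.strip():
            self.chunks.append(data)


def toy_convert(html):
    parser = ToyMarkdownConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


sample = (
    "<h1>Welcome</h1><p>Python is <strong>powerful</strong>.</p>"
    "<ul><li>Easy</li><li>Versatile</li></ul>"
)
toy_markdown = toy_convert(sample)
print(toy_markdown)
```

A real converter also has to deal with links, nested lists, tables, scripts, and malformed HTML, which is why the lesson relies on a dedicated library instead.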

-----

## Summary and What’s Next

In this lesson, you learned how to:

  * Use the `DDGS` library to search the web from Python.
  * Extract the first search result and fetch its web page content using `httpx`.
  * Convert the web page’s HTML to Markdown for easier reading.

These steps are the **foundation** for building an automated research tool. In the practice exercises that follow, you will get hands-on experience with these skills. You will write your own code to search, fetch, and convert web content, preparing you for more advanced features in future lessons.

Let’s move on to the practice exercises and start building your **DeepResearcher**!

## Your First Web Search with DDGS

Now that you've learned about the DDGS library, let's put that knowledge into practice! In this exercise, you'll write a simple script that performs a web search with DDGS.

Your task is to complete the code that:

  * Uses DDGS to search for "Python programming"
  * Gets the first search result
  * Prints the raw results list to see its structure
  * Prints the first result dictionary to examine its contents

This hands-on experience will help you understand how search results are structured when using DDGS. Understanding this data structure is essential as you build more advanced search functionality in upcoming exercises.

Note: The DDGS library may output log messages to STDERR when you run your code. Don’t be alarmed—these logs show the steps the library is taking to perform your search (such as which backend it is using). You can read them to better understand what’s happening under the hood.

By completing this exercise, you'll take your first step toward creating an automated research tool that can search the web efficiently!


```python
from ddgs import DDGS

# Define our search query
query = "Python programming"

# TODO: Perform the search using DDGS().text() with the query and max_results=1

# TODO: Print the raw results list with a label "Search results:"

# TODO: Check if results exist and have at least one item
# If results exist, print the first result dictionary with a label "First result details:"
# Otherwise, print "No results found."
```

## Your First Web Search with DDGS (Completed Script)

Here is the completed Python script that performs the search, prints the raw result list, and then prints the details of the first result.

```python
from ddgs import DDGS

# Define our search query
query = "Python programming"

# TODO: Perform the search using DDGS().text() with the query and max_results=1
results = DDGS().text(query, max_results=1)

# TODO: Print the raw results list with a label "Search results:"
print("Search results:")
print(results)

# TODO: Check if results exist and have at least one item
# If results exist, print the first result dictionary with a label "First result details:"
# Otherwise, print "No results found."
print("-" * 30) # Separator for clarity

if results and len(results) > 0:
    first_result = results[0]
    print("First result details:")
    print(first_result)
else:
    print("No results found.")
```

-----

## Explanation and Learning Points 💡

The code you completed executes essential steps for understanding `DDGS` results:

1.  **The Search:** The line `results = DDGS().text(query, max_results=1)` performs the search. Although `max_results=1` is specified, the output (stored in the `results` variable) is always a **list**.
2.  **Data Structure:** When you print `results`, you see that it's a list containing a single element. This element is a **dictionary** that holds the actual result details (title, URL, etc.).
3.  **Accessing the First Result:** The `if results and len(results) > 0:` check ensures the list isn't empty. The line `first_result = results[0]` then uses the **index `[0]`** to extract the actual result dictionary from within the list, allowing you to access fields like `title`, `href`, and `body` individually.

You now know that the search results from `DDGS().text()` are a **list of dictionaries**, a crucial data structure for navigation in the next exercises.

## Extracting URLs from Search Results

Excellent job on your first search! Now that you understand the structure of search results, let's focus on extracting specific information from them.

In this exercise, you'll learn how to safely extract the URL from a search result.

Your task is to complete the code that:

  * Checks whether the search returned any results
  * Verifies whether the first result contains a valid URL (the `href` key)
  * Prints the URL with a helpful message if both conditions are met
  * Prints "No valid results found." if any condition fails

This skill is crucial for building reliable web tools that can handle unexpected data formats or empty results. By mastering URL extraction, you'll be ready to move on to fetching and processing web content in the next exercises!


```python
from ddgs import DDGS

# Define our search query
query = "Python programming"

# Perform the search and get results
results = DDGS().text(query, max_results=1)

# TODO: Check if results exist and if the first result contains a "href" key
# TODO: If both conditions are met, extract the URL and print it with a message
# TODO: If any condition fails, print "No valid results found."

```

You've successfully set up the search; now let's focus on the safe extraction of the URL.

Here is the completed Python script:

```python
from ddgs import DDGS

# Define our search query
query = "Python programming"

# Perform the search and get results
results = DDGS().text(query, max_results=1)

# TODO: Check if results exist and if the first result contains a "href" key
# TODO: If both conditions are met, extract the URL and print it with a message
# TODO: If any condition fails, print "No valid results found."

# Check that the results list is not empty (results) AND
# that the "href" key exists in the first item (results[0]).
if results and "href" in results[0]:
    # Both conditions met: Safely extract the URL
    url = results[0]["href"]
    print(f"Successfully extracted URL: {url}")
else:
    # Condition failed: Either no results or the result structure was unexpected
    print("No valid results found.")
```

-----

## Explanation of Safe Extraction

The core of this exercise lies in the robust conditional check: `if results and "href" in results[0]:`

  * **`results`**: In Python, when a list is used in a boolean context (like an `if` statement), it evaluates to `True` if it is **not empty**, and `False` if it is empty. This is the simplest way to check if any results were returned.
  * **`"href" in results[0]`**: The `results[0]` accesses the first dictionary item in the list. The `"href" in ...` part then checks if the necessary key exists within that dictionary. This prevents a `KeyError` if the DDGS result structure unexpectedly changes or if a result object is malformed.

By combining these two checks with the **`and`** operator, your script is **reliable** and ready to handle various search outcomes, ensuring you only proceed with URL extraction (`url = results[0]["href"]`) when the data is confirmed to be present and structured correctly.
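
The same check can be wrapped in a small helper that returns `None` when no usable URL is available. This is a sketch under assumptions: the `extract_first_url` name, the `isinstance` checks, and the `http://`/`https://` prefix test are additions for illustration and go slightly beyond what the exercise requires.

```python
def extract_first_url(results):
    """Return the first result's URL, or None if it is missing or malformed."""
    if not results:
        return None  # empty list (or None): no results at all

    first = results[0]
    if not isinstance(first, dict):
        return None  # unexpected result type

    url = first.get("href")
    # Accept only string URLs that start with an HTTP(S) scheme
    if isinstance(url, str) and url.startswith(("http://", "https://")):
        return url
    return None


good = [{"title": "Welcome to Python.org", "href": "https://www.python.org/"}]
print(extract_first_url(good))
print(extract_first_url([]))
print(extract_first_url([{"title": "No link here"}]))
```

Returning `None` instead of printing lets the calling code decide how to report the problem.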

## Fetching Web Content with httpx

Now that you've successfully extracted URLs from search results, let's learn how to actually fetch the content from those URLs! In this exercise, you'll create a function that downloads web page content using the httpx library.

Your task is to implement a `fetch_web_page` function that:

  * Takes a URL and timeout as input parameters
  * Uses `httpx` to make a GET request to that URL
  * Handles any errors that might occur during the request
  * Returns the HTML content if successful, or an error message if something goes wrong

This is an essential skill for any web automation tool: after finding a URL through search, you need to be able to reliably download its content for further processing.

By completing this exercise, you'll add another important component to your automated research toolkit. Soon, you'll be able to combine searching and fetching to create a powerful web research assistant!

```python
import httpx

def fetch_web_page(url, timeout=10):
    """
    Fetch the content of a web page using httpx.
    
    Args:
        url (str): The URL to fetch
        timeout (int): Maximum time to wait for response in seconds
        
    Returns:
        str: The HTML content of the page or an error message
    """
    # TODO: Use a try/except block to handle potential errors
    
    # TODO: Inside the try block, use httpx.get() to fetch the URL with the timeout parameter
    
    # TODO: Call raise_for_status() on the response to handle HTTP error codes
    
    # TODO: Return the text content of the response
    
    # TODO: In the except block, catch any exceptions and return an error message

# Test the function with a sample URL
if __name__ == "__main__":
    sample_url = "https://www.youtube.com/"
    content = fetch_web_page(sample_url)
    
    # Print the first 200 characters to see a preview
    print(f"Content from {sample_url} (first 200 characters):")
    print(content[:200])

```

Here is the completed Python script implementing the `fetch_web_page` function, with error handling built on the `httpx` library.

```python
import httpx

def fetch_web_page(url, timeout=10):
    """
    Fetch the content of a web page using httpx.
    
    Args:
        url (str): The URL to fetch
        timeout (int): Maximum time to wait for response in seconds
        
    Returns:
        str: The HTML content of the page or an error message
    """
    # TODO: Use a try/except block to handle potential errors
    try:
        # TODO: Inside the try block, use httpx.get() to fetch the URL with the timeout parameter
        response = httpx.get(url, timeout=timeout)
        
        # TODO: Call raise_for_status() on the response to handle HTTP error codes
        # This will raise an exception for 4xx or 5xx responses
        response.raise_for_status()
        
        # TODO: Return the text content of the response
        return response.text
        
    # TODO: In the except block, catch any exceptions and return an error message
    except httpx.HTTPStatusError as e:
        # Handles 4xx or 5xx errors (e.g., 404 Not Found)
        return f"ERROR: HTTP error occurred: {e}"
    except httpx.TimeoutException:
        # Handles timeout errors
        return f"ERROR: Request timed out after {timeout} seconds."
    except httpx.RequestError as e:
        # Handles other request-related issues (e.g., DNS error, connection failed)
        return f"ERROR: An error occurred during the request: {e}"
    except Exception as e:
        # Catches any other unexpected exceptions
        return f"ERROR: An unexpected error occurred: {e}"


# Test the function with a sample URL
if __name__ == "__main__":
    sample_url = "https://www.youtube.com/"
    content = fetch_web_page(sample_url)
    
    # Print the first 200 characters to see a preview
    print(f"Content from {sample_url} (first 200 characters):")
    
    # Print error messages in full; otherwise show only the first 200 characters.
    # (Slicing a short string is safe in Python, so the check is only for clarity.)
    if content.startswith("ERROR:"):
        print(content)
    else:
        print(content[:200])

```

-----

## Key Takeaways from Error Handling

The real power in this function is the **`try/except` block** and the strategic use of `response.raise_for_status()`:

1.  **`response.raise_for_status()`**: This method checks the HTTP status code. If the code is in the **4xx (client error)** or **5xx (server error)** range, it raises an `httpx.HTTPStatusError` exception, which we catch. This ensures your code doesn't try to process an error page.
2.  **Specific Exception Catching**: By catching `httpx.HTTPStatusError`, `httpx.TimeoutException`, and `httpx.RequestError` separately, your function can return highly specific error messages. This is much better than just catching a general `Exception` because it helps debug exactly *why* the request failed—was it a network issue, a timeout, or a server problem?

Your research tool now has the ability to **safely and reliably** download web content, a critical function for the next steps!

## Converting HTML to Readable Markdown

Fantastic work on fetching web content! Now that you can download HTML from websites, let's make that content more readable by converting it to Markdown format.

In this exercise, you'll implement the `html_to_markdown_converter` function, which transforms raw HTML into clean, readable Markdown text.

Your task is to complete the function that:

  * Takes an HTML string as input
  * Uses the `convert_to_markdown` function to transform the HTML
  * Removes leading and trailing whitespace
  * Trims the result to a specified length (default: 500 characters)
  * Returns the processed Markdown content

This conversion step is crucial for your research tool because it makes web content much easier to read and process programmatically. HTML can be complex and cluttered, while Markdown provides a clean, simple text format that preserves the important structure.

By completing this exercise, you'll have all the core components needed for a basic web research pipeline: searching, fetching, and converting content into a useful format!


```python
from html_to_markdown import convert_to_markdown

def html_to_markdown_converter(html_content, max_length=500):
    """
    Convert HTML content to Markdown format and trim to specified length.
    
    Args:
        html_content (str): The HTML content to convert
        max_length (int): Maximum length of the returned Markdown (default: 500)
        
    Returns:
        str: The converted and trimmed Markdown content
    """
    # TODO: Convert HTML to Markdown using the convert_to_markdown function
    
    # TODO: Remove leading and trailing whitespace
    
    # TODO: Trim the Markdown content to the specified length
    
    # TODO: Return the trimmed Markdown content

# Test the function with sample HTML
if __name__ == "__main__":
    # Sample HTML content
    sample_html = """
    <!DOCTYPE html>
    <html>
    <head>
        <title>Python Programming</title>
    </head>
    <body>
        <h1>Introduction to Python</h1>
        <p>Python is a <strong>powerful</strong> programming language that's great for beginners.</p>
        <ul>
            <li>Easy to learn</li>
            <li>Versatile</li>
            <li>Large community</li>
        </ul>
        <p>Learn more at <a href="https://www.python.org">python.org</a>.</p>
    </body>
    </html>
    """
    
    # Convert HTML to Markdown and print the result
    markdown_result = html_to_markdown_converter(sample_html)
    print("Converted Markdown content:")
    print(markdown_result)

```

Great job reaching the final step of the content pipeline! Converting HTML to clean Markdown makes the data instantly useful.

Here is the completed Python script for the `html_to_markdown_converter` function:

```python
from html_to_markdown import convert_to_markdown

def html_to_markdown_converter(html_content, max_length=500):
    """
    Convert HTML content to Markdown format and trim to specified length.
    
    Args:
        html_content (str): The HTML content to convert
        max_length (int): Maximum length of the returned Markdown (default: 500)
        
    Returns:
        str: The converted and trimmed Markdown content
    """
    # TODO: Convert HTML to Markdown using the convert_to_markdown function
    markdown_content = convert_to_markdown(html_content)
    
    # Remove leading and trailing whitespace
    markdown_content = markdown_content.strip()
    
    # TODO: Trim the Markdown content to the specified length
    trimmed_content = markdown_content[:max_length]
    
    # Optional: Add an ellipsis if the content was actually trimmed
    # (note: this makes the result up to 3 characters longer than max_length)
    if len(markdown_content) > max_length:
        trimmed_content += "..."
    
    # TODO: Return the trimmed Markdown content
    return trimmed_content

# Test the function with sample HTML
if __name__ == "__main__":
    # Sample HTML content
    sample_html = """
    <!DOCTYPE html>
    <html>
    <head>
        <title>Python Programming</title>
    </head>
    <body>
        <h1>Introduction to Python</h1>
        <p>Python is a <strong>powerful</strong> programming language that's great for beginners.</p>
        <ul>
            <li>Easy to learn</li>
            <li>Versatile</li>
            <li>Large community</li>
        </ul>
        <p>Learn more at <a href="https://www.python.org">python.org</a>. This is some extra filler text to make sure the trimming works correctly and demonstrates how the content gets cut off at the specified max_length. This is very useful for creating short previews or summaries of long web pages before deciding to process the full content.</p>
    </body>
    </html>
    """
    
    # Convert HTML to Markdown and print the result
    markdown_result = html_to_markdown_converter(sample_html, max_length=200) # Testing with a shorter length (200)
    print("Converted Markdown content (max 200 chars):")
    print(markdown_result)
    
    print("\n---")
    
    # Testing with a very long length to ensure full content is returned
    markdown_result_full = html_to_markdown_converter(sample_html, max_length=1000) 
    print("Converted Markdown content (full length):")
    print(markdown_result_full)
```

-----

## The Value of Markdown Conversion

This simple function finalizes your basic research pipeline. By implementing the conversion and trimming, you achieve two goals:

1.  **Readability**: The `convert_to_markdown` function transforms noisy HTML tags (like `<h1>`, `<ul>`, `<a href="...">`) into a clean, readable format (`#`, `*`, `[text](url)`).
2.  **Efficiency**: The **`.strip()`** call and **string slicing (`[:max_length]`)** ensure that the content is clean and limited in size. Trimming is vital in real-world applications to **save processing time and storage space** when dealing with vast amounts of web content.
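
The trimming step can be tried on its own with plain strings. The variant below is a sketch, not the exercise's required solution: the `trim_text` name and the choice to reserve room for the ellipsis (so the result never exceeds `max_length`) are assumptions made for illustration.

```python
def trim_text(text, max_length=500, ellipsis="..."):
    """Strip whitespace, then cut to at most max_length characters."""
    cleaned = text.strip()
    if len(cleaned) <= max_length:
        return cleaned  # short enough: nothing to trim

    # Reserve room for the ellipsis so the result never exceeds max_length
    keep = max(max_length - len(ellipsis), 0)
    return cleaned[:keep] + ellipsis[:max_length]


short_preview = trim_text("  hello world  ", max_length=8)
untouched = trim_text("  hi  ", max_length=8)
print(repr(short_preview))
print(repr(untouched))
```

Whether the ellipsis should count toward the limit depends on how the preview is used; a strict character budget favors the approach shown here.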

You now have a complete research tool skeleton capable of **searching**, **fetching**, and **processing** data into a usable format!
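
As a rough sketch of how the three pieces could be chained, the function below takes each step as a parameter so it can run offline with stand-ins. The `research` name and the `fake_*` helpers are hypothetical and exist only for this illustration.

```python
def research(query, search, fetch, convert, max_length=500):
    """Chain search -> fetch -> convert; each step is passed in as a function."""
    results = search(query)
    if not results or "href" not in results[0]:
        return "No valid results found."

    html = fetch(results[0]["href"])
    if html.startswith("ERROR:"):
        return html  # pass the fetch error through unchanged

    return convert(html).strip()[:max_length]


# Stand-in functions so the sketch runs without network access (all hypothetical)
def fake_search(query):
    return [{"title": "Example", "href": "https://example.test/", "body": "..."}]


def fake_fetch(url):
    return "<h1>Example</h1>"


def fake_convert(html):
    return html.replace("<h1>", "# ").replace("</h1>", "\n")


preview = research("Python programming", fake_search, fake_fetch, fake_convert)
print(preview)
```

In a real script, the stand-ins would be replaced by something like `lambda q: DDGS().text(q, max_results=1)`, the `fetch_web_page` function, and `convert_to_markdown` from the exercises above.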