# Unit 1 Handling Issues During Web Scraping

Hello! In today's lesson, we're diving into the world of **error handling** in web scraping. Error handling is crucial because it helps ensure that your scraping scripts run smoothly, even when they encounter issues such as **HTTP** errors, parsing errors, or missing data.

Before we begin, let's understand the common types of errors you might encounter while scraping the web:

  - **HTTP Errors**: These occur when there's a problem with fetching the web page, such as a 404 Not Found error or a 500 Internal Server Error.
  - **Parsing Errors**: These arise when the **HTML** content is malformed or unexpected, causing issues during parsing.
  - **Missing Data/Attributes**: Sometimes, the necessary **HTML** elements or attributes may be missing, leading to errors.

By handling these issues, you can build robust and reliable web scraping scripts that continue to perform well even in the face of challenges.

### Handling HTTP Errors

**HTTP** errors occur when there is a problem with the request made to the server. Common **HTTP** status codes include:

  - **200 OK**: The request was successful.
  - **404 Not Found**: The requested resource could not be found.
  - **500 Internal Server Error**: The server encountered an unexpected condition.

Handling these errors gracefully is essential to ensure your script can proceed effectively or log meaningful error messages.

In Python, the `requests` library makes it simple to handle **HTTP** errors using the `response.raise_for_status()` method. This method raises an `HTTPError` if the **HTTP** request returned an unsuccessful status code.

Here's how we can implement it:

```python
import requests

def fetch_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Check for HTTP errors
        return response.text
    except requests.HTTPError as e:
        print(f"HTTP error: {e}")
        return None

url = 'http://quotes.toscrape.com'
html = fetch_page(url)

if html:
    print(html[:500])  # Print the first 500 characters of the page content
else:
    print("An HTTP Error Occurred")
```

In this code:

  - We use a `try` block to attempt to fetch the page content.
  - The `response.raise_for_status()` method checks for **HTTP** errors.
  - In the `except` block, we catch `requests.HTTPError` and print an error message if an error occurs.
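
To see which status codes `raise_for_status()` treats as failures without making a network request, here is a minimal sketch. It builds `requests.Response` objects by hand purely for illustration, and `check_status` is a hypothetical helper, not part of the lesson's code:

```python
import requests

def check_status(status_code):
    """Return None if raise_for_status() accepts the code, else the error text."""
    response = requests.Response()  # constructed manually, no request is sent
    response.status_code = status_code
    try:
        response.raise_for_status()
        return None
    except requests.HTTPError as e:
        return str(e)

for code in (200, 404, 500):
    # 200 yields None; 404 and 500 yield an error description
    print(code, check_status(code))
```

Codes in the 4xx and 5xx ranges raise `HTTPError`, while 200 passes through silently.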

### Handling Parsing Errors with BeautifulSoup

Parsing errors can occur if the **HTML** content is malformed or unexpected. By using `try` and `except` blocks, you can handle these errors gracefully.

Here's an example using `BeautifulSoup` to parse **HTML** content and extract quotes from a webpage:

```python
from bs4 import BeautifulSoup
import requests

html = fetch_page('http://quotes.toscrape.com/')

def parse_and_extract_quotes(html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        quotes = soup.find_all('div', class_='quote')
        print(f'Found {len(quotes)} quotes')
    except Exception as e:
        print(f"Parsing error: {e}")

if html:
    parse_and_extract_quotes(html) # Will print the number of quotes found
parse_and_extract_quotes({}) # Will raise a parsing error
```

This code demonstrates how to handle parsing errors when using `BeautifulSoup`. The `try` block attempts to parse the **HTML** content and extract quotes. If an error occurs during parsing, the `except` block catches the exception and prints an error message.
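
In practice, `html.parser` tends to tolerate malformed markup rather than raise, so the `except` branch is more likely to fire for invalid input types than for broken HTML. The sketch below assumes a made-up HTML fragment with unclosed tags and a hypothetical `count_quotes` helper that validates its input before parsing:

```python
from bs4 import BeautifulSoup

def count_quotes(html):
    """Return the number of quote blocks, or None if the input is not markup.

    Hypothetical helper for illustration; not part of the lesson's code.
    """
    if not isinstance(html, (str, bytes)):
        print(f"Cannot parse input of type {type(html).__name__}")
        return None
    try:
        soup = BeautifulSoup(html, 'html.parser')
        return len(soup.find_all('div', class_='quote'))
    except Exception as e:
        print(f"Parsing error: {e}")
        return None

# The closing </span> and </div> tags are deliberately missing.
malformed_html = '<div class="quote"><span class="text">Unclosed quote'

print(count_quotes(malformed_html))  # the quote is still found
print(count_quotes({}))              # rejected before parsing
```

Checking the input type up front gives a clearer message than relying on whatever exception the parser happens to raise.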

### Handling Missing Attributes and Data

Attribute errors occur when an expected **HTML** element or attribute is missing. For instance, if a `span` tag with the class `text` is not found, `find()` returns `None`, and calling `.get_text()` on `None` raises an `AttributeError`.

We can use `try` and `except` blocks to handle missing attributes gracefully. Here's how:

```python
from bs4 import BeautifulSoup
import requests

html = fetch_page('http://quotes.toscrape.com/')

def parse_and_extract_quotes(html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        quotes = soup.find_all('div', class_='quote')
        quote = quotes[0]
        try:
            text = quote.find('span', class_='text').get_text()
            author = quote.find('small', class_='author').get_text()
            tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
            invalid_attribute = quote.find('invalid', class_='invalid').get_text() # This will raise an AttributeError
            print(text, author, tags, invalid_attribute)
        except AttributeError as e:
            print(f"Attribute error: {e}")
    except Exception as e:
        print(f"Parsing error: {e}")

if html:
    parse_and_extract_quotes(html)
```

In this code:

  - The inner `try` block extracts the expected `text`, `author`, and `tags` from the first quote. It then tries to read an `invalid` element that doesn't exist, so `find()` returns `None` and the call to `.get_text()` fails.
  - The `except AttributeError` block catches the resulting error and prints a message, so a missing element is reported without crashing the script.
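
An alternative to catching `AttributeError` is to check for `None` before calling `.get_text()`. The following is a minimal sketch on a made-up HTML fragment; `safe_text` and its `default` parameter are hypothetical names introduced here for illustration:

```python
from bs4 import BeautifulSoup

def safe_text(parent, name, class_name, default=None):
    """Return the text of the first matching child, or `default` if it is missing."""
    element = parent.find(name, class_=class_name)
    return element.get_text() if element is not None else default

# Sample markup invented for this example
sample_html = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <small class="author">Sample Author</small>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
quote = soup.find('div', class_='quote')

print(safe_text(quote, 'span', 'text'))                      # element exists
print(safe_text(quote, 'span', 'txt', default='(missing)'))  # element is absent
```

Which style to use is a design choice: an explicit check makes the missing case visible at the call site, while `try`/`except` keeps the extraction code shorter.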

### Summary

In this lesson, we covered the basics of error handling in web scraping. We discussed how to handle **HTTP** errors, parsing errors, and missing attributes gracefully. By now, you should feel comfortable handling various issues that may arise during web scraping. This will make your scripts more robust and reliable.

Keep practicing these concepts to master error management in web scraping. Happy scraping!

## Run and Observe Error Handling

Great work on learning about error handling in web scraping!

Let's observe the code you saw in the lesson to see how it manages different types of errors. Run the code to see the result.

As a refresher, this code will:

  - Fetch a webpage and handle HTTP errors.
  - Parse the HTML content and handle parsing errors.
  - Extract quote details and handle missing attribute errors.

```python
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = 'http://quotes.toscrape.com'

# Function to fetch the page content
def fetch_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Check for HTTP errors
        return response.text
    except requests.HTTPError as e:
        print(f"HTTP error: {e}")
        return None

# Function to parse and extract quote details
def parse_and_extract_quotes(html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            try:
                text = quote.find('span', class_='text').get_text()
                author = quote.find('small', class_='author').get_text()
                tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
                print({'text': text, 'author': author, 'tags': tags})

                novels = quote.find_all('a', class_='novel').get_text()
                print(novels)
            except AttributeError as e:
                print(f"Attribute error encountered: {e}")
    except Exception as e:
        print(f"Parsing error: {e}")

# Fetch the page content
html = fetch_page(url)

if html:
    # Parse and extract quotes if the page content is successfully retrieved
    parse_and_extract_quotes(html)
else:
    print("An HTTP Error occurred.")

```

Here is the output you can expect to see after running the script.

The code will first successfully fetch the page from `http://quotes.toscrape.com`. It will then proceed to find each quote on the page, and for every single one, it will print a similar block of output.

You will see a series of dictionaries printed, each containing the `text`, `author`, and `tags` for a quote. This shows the code is successfully extracting the main content.

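Directly below each of these dictionaries, you will see a repeated error message like this (the exact wording depends on your BeautifulSoup version):

```
Attribute error encountered: ResultSet object has no attribute 'get_text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
```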

### What's Happening

This error is an excellent example of a **robust scraper catching an unexpected issue**.

The `AttributeError` occurs at this line:

```python
novels = quote.find_all('a', class_='novel').get_text()
```

Here's a breakdown of why:

1.  **`find_all()` vs. `find()`:** The `find_all()` method always returns a list-like `ResultSet` of all matching elements, even if it contains only one item or is empty. In this case, no elements inside the quote have the class `novel`, so `quote.find_all(...)` returns an empty result.
2.  **Calling a method that doesn't exist:** The `get_text()` method belongs to a single tag object, not to a collection of tags. Because the code calls `get_text()` on the collection returned by `find_all()`, Python raises an `AttributeError`.
3.  **The code keeps running:** The `try...except AttributeError` block you see is what prevents the script from crashing. It gracefully catches the error, prints the informative message, and allows the `for` loop to continue processing the next quote.

This demonstrates how **error handling** is crucial for building reliable web scrapers that can handle inconsistencies in a webpage's structure without failing completely.
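
The usual way to read text from a `find_all()` result is to iterate over it. Here is a small sketch on a made-up HTML fragment, showing that an empty result simply produces an empty list while calling `get_text()` on the collection itself fails:

```python
from bs4 import BeautifulSoup

# Sample markup invented for this example
sample_html = '<div class="quote"><a class="tag">love</a><a class="tag">life</a></div>'
quote = BeautifulSoup(sample_html, 'html.parser').find('div', class_='quote')

# Iterate over the matches and call get_text() on each tag
tags = [a.get_text() for a in quote.find_all('a', class_='tag')]
novels = [a.get_text() for a in quote.find_all('a', class_='novel')]  # no matches

# Calling get_text() on the collection itself raises AttributeError
raised = False
try:
    quote.find_all('a', class_='novel').get_text()
except AttributeError:
    raised = True

print(tags, novels, raised)
```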

## Fix the Scraping Script

Great job handling errors so far! Now let's identify and fix an error in a script that scrapes quotes from a webpage.

The script is supposed to fetch quotes from a webpage and print them. We simulated an attribute error by changing the class name of the quote text element to `txt` instead of `text`, so that we can check whether the exception handling works correctly. However, the script does not handle the error as expected; it fails with a syntax error instead. Can you identify and fix the issue so that the script handles the attribute error gracefully?

```python
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = 'http://quotes.toscrape.com'

# Function to fetch the page content
def fetch_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Check for HTTP errors
        return response.text
    except requests.HTTPError as e:
        print(f"HTTP error: {e}")
        return None

# Function to parse and extract quote details
def parse_and_extract_quotes(html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            text = quote.find('span', class_='txt').get_text()
            print(f"Quote: {text}")
    catch Exception as e:
        print(f"Parsing error: {e}")

# Fetch the page content
html = fetch_page(url)

if html:
    # Parse and extract quotes if the page content is successfully retrieved
    parse_and_extract_quotes(html)
else:
    print("An HTTP Error occurred.")
```

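The syntax error comes from the `catch` keyword: Python uses `except`, so `catch Exception as e:` must become `except Exception as e:`. With that change, the script runs and prints an error message instead of crashing, which confirms that the exception handling works as intended.

The `try...except` block catches the `AttributeError` raised for the first quote and reports it as a parsing error. Because the `try` wraps the whole `for` loop, the loop stops at that first failure rather than moving on to the next quote.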

### Why You're Seeing the Error

The error, `'NoneType' object has no attribute 'get_text'`, is happening because of this line:

```python
text = quote.find('span', class_='txt').get_text()
```

The `find()` method returned **`None`** because it couldn't find a `<span>` tag with the class `'txt'`. As a result, when the code tried to call `.get_text()` on the `None` object, it raised the error.

To get the script to work correctly, you just need to fix the typo in the class name.

### The Final Fix

Change `'txt'` to the correct class name, which is **`'text'`**.

Here is the complete and final working script:

```python
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = 'http://quotes.toscrape.com'

# Function to fetch the page content
def fetch_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.HTTPError as e:
        print(f"HTTP error: {e}")
        return None

# Function to parse and extract quote details
def parse_and_extract_quotes(html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        quotes = soup.find_all('div', class_='quote')
        
        for quote in quotes:
            try:
                # FIX: Change 'txt' back to 'text'
                text_element = quote.find('span', class_='text')
                if text_element:
                    text = text_element.get_text()
                    print(f"Quote: {text}")
                else:
                    print("Quote text element not found. Skipping.")
            except AttributeError as e:
                print(f"Attribute error encountered: {e}. Skipping this quote.")
        
    except Exception as e:
        print(f"A general parsing error occurred: {e}")

# Fetch the page content
html = fetch_page(url)

if html:
    parse_and_extract_quotes(html)
else:
    print("An HTTP Error occurred.")
```
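
The final script places a `try` inside the loop as well as around it. The sketch below, using made-up HTML and two hypothetical helper functions, illustrates why the placement matters: a `try` around the loop abandons the remaining quotes after the first failure, while a `try` inside the loop skips only the broken one.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this example: the first quote has no text span
sample_html = """
<div class="quote"><small class="author">No Text Here</small></div>
<div class="quote"><span class="text">Second quote</span></div>
"""
quotes = BeautifulSoup(sample_html, 'html.parser').find_all('div', class_='quote')

def extract_outer_try(quotes):
    """try wraps the loop: the first AttributeError ends the whole loop."""
    results = []
    try:
        for quote in quotes:
            results.append(quote.find('span', class_='text').get_text())
    except AttributeError:
        pass
    return results

def extract_inner_try(quotes):
    """try sits inside the loop: only the failing quote is skipped."""
    results = []
    for quote in quotes:
        try:
            results.append(quote.find('span', class_='text').get_text())
        except AttributeError:
            continue
    return results

print(extract_outer_try(quotes))
print(extract_inner_try(quotes))
```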

## Handling Errors in Web Scraping

Great progress on learning about error handling so far!

For this practice, let's focus on handling errors while fetching and parsing web pages. You need to complete the `fetch_page` function to handle HTTP errors and the `parse_and_extract_quotes` function to handle parsing and attribute errors.

Complete the TODO comments and fill in the missing parts of the code to ensure the script is robust when encountering issues.

```python
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = 'http://quotes.toscrape.com'

# Function to fetch the page content
def fetch_page(url):
    try:
        response = requests.get(url)
        # TODO: Handle HTTP errors with the raise_for_status method

        return response.text
    except requests.HTTPError as e:
        print(f"HTTP error: {e}")
        return None

# Function to parse and extract quote details
def parse_and_extract_quotes(html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            try:
                text = quote.find('span', class_='text').get_text()
                author = quote.find('small', class_='author').get_text()
                tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
                print({'text': text, 'author': author, 'tags': tags})
            # TODO: Handle missing attribute errors with an AttributeError exception
                print(f"Attribute error: {e}")
    # TODO: Handle parsing errors with a generic exception
        print(f"Parsing error: {e}")

# Fetch the page content
html = fetch_page(url)

if html:
    # Parse and extract quotes if the page content is successfully retrieved
    parse_and_extract_quotes(html)
else:
    print("An HTTP Error occurred.")
```

A completed version of the script, with each TODO filled in, could look like this:

```python
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = 'http://quotes.toscrape.com'

# Function to fetch the page content
def fetch_page(url):
    try:
        response = requests.get(url)
        # TODO: Handle HTTP errors with the raise_for_status method
        response.raise_for_status()

        return response.text
    except requests.HTTPError as e:
        print(f"HTTP error: {e}")
        return None

# Function to parse and extract quote details
def parse_and_extract_quotes(html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            try:
                text = quote.find('span', class_='text').get_text()
                author = quote.find('small', class_='author').get_text()
                tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
                print({'text': text, 'author': author, 'tags': tags})
            # TODO: Handle missing attribute errors with an AttributeError exception
            except AttributeError as e:
                print(f"Attribute error: {e}")
    # TODO: Handle parsing errors with a generic exception
    except Exception as e:
        print(f"Parsing error: {e}")

# Fetch the page content
html = fetch_page(url)

if html:
    # Parse and extract quotes if the page content is successfully retrieved
    parse_and_extract_quotes(html)
else:
    print("An HTTP Error occurred.")
```

## Web Scraping Error Handling

You've done a great job so far! Now, let's put everything together.

In this final task, your goal is to write a complete web scraping script from scratch. Follow the instructions in the starter code to implement the logic.

Good luck!

```python
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = 'http://quotes.toscrape.com'

# Function to fetch the page content
def fetch_page(url):
    # TODO: Implement the function to fetch page content and handle HTTP errors
    # Make sure to return the page content if the request is successful, otherwise return None

    pass

# Function to parse and extract quote details
def parse_and_extract_quotes(html):
    # TODO: Implement the function to parse HTML content and handle parsing errors with generic Exception
    # Make sure to handle missing attributes for each quote while extracting quote text with AttributeError exception
    pass

# Fetch the page content
html = fetch_page(url)

if html:
    # Parse and extract quotes if the page content is successfully retrieved
    parse_and_extract_quotes(html)
else:
    print("An HTTP Error occurred.")

```

One way to complete the starter code:

```python
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = 'http://quotes.toscrape.com'

# Function to fetch the page content
def fetch_page(url):
    # TODO: Implement the function to fetch page content and handle HTTP errors
    # Make sure to return the page content if the request is successful, otherwise return None
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.HTTPError as e:
        print(f"HTTP error: {e}")
        return None
    except requests.RequestException as e:
        print(f"An error occurred during the request: {e}")
        return None

# Function to parse and extract quote details
def parse_and_extract_quotes(html):
    # TODO: Implement the function to parse HTML content and handle parsing errors with generic Exception
    # Make sure to handle missing attributes for each quote while extracting quote text with AttributeError exception
    try:
        soup = BeautifulSoup(html, 'html.parser')
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            try:
                text = quote.find('span', class_='text').get_text()
                author = quote.find('small', class_='author').get_text()
                tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
                print({'text': text, 'author': author, 'tags': tags})
            except AttributeError as e:
                print(f"Attribute error: {e}")
    except Exception as e:
        print(f"Parsing error: {e}")

# Fetch the page content
html = fetch_page(url)

if html:
    # Parse and extract quotes if the page content is successfully retrieved
    parse_and_extract_quotes(html)
else:
    print("An HTTP Error occurred.")

```
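
The solution above catches `requests.HTTPError` before `requests.RequestException`. The order matters because `HTTPError` is a subclass of `RequestException`, so the broader handler would otherwise catch everything. Here is a minimal sketch that exercises the second handler without touching the network; the `get` and `timeout` parameters and the `failing_get` function are assumptions added for illustration, not part of the exercise:

```python
import requests

def fetch_page(url, get=requests.get, timeout=10):
    """Fetch a page, returning None on any requests-level failure.

    `get` and `timeout` are illustrative parameters added for this sketch.
    """
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.HTTPError as e:
        print(f"HTTP error: {e}")
    except requests.RequestException as e:
        print(f"An error occurred during the request: {e}")
    return None

def failing_get(url, timeout):
    """Stand-in for requests.get that simulates a network failure."""
    raise requests.ConnectionError("simulated network failure")

print(issubclass(requests.HTTPError, requests.RequestException))
print(fetch_page('http://quotes.toscrape.com', get=failing_get))
```

Passing a `timeout` to `requests.get` is optional, but without one a request can wait indefinitely for a server that never responds.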