### **Introduction to Python Logging**

Python's `logging` module provides a flexible framework for emitting log messages from your code. Logs are essential for understanding and debugging a program, especially in production or in long-running jobs such as web scrapers.

---

#### **Why Use Logging?**
1. **Debugging:** Helps in tracking program execution without cluttering the code with `print()` statements.
2. **Persistence:** Logs can be saved to a file, enabling analysis after the program finishes.
3. **Control:** You can set logging levels to filter messages based on their importance.
4. **Structured Output:** With proper configuration, logs can include timestamps, severity levels, and more.

---

#### **Basic Concepts in Logging**
1. **Loggers:** The main entry point for logging. You can think of them as entities that emit log messages.
2. **Handlers:** Define where the log messages go (console, file, etc.).
3. **Levels:** Determine the severity of a log message. Common levels are:
   - `DEBUG`: Detailed information for diagnosing problems.
   - `INFO`: Confirmation that things are working as expected.
   - `WARNING`: An indication of something unexpected or an issue that isn’t critical yet.
   - `ERROR`: A more serious problem; the program could not perform some function.
   - `CRITICAL`: A very serious error; the program itself may be unable to continue running.
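
The three concepts above can be seen working together in a minimal sketch. The logger name `demo.concepts` and the in-memory `StringIO` stream are assumptions chosen for illustration; a real program would usually write to the console or a file.

```python
import io
import logging

# In-memory stream so the handler's output can be inspected (illustrative choice)
stream = io.StringIO()

# Handler: decides where records go
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

# Logger: the entry point your code calls
demo_logger = logging.getLogger("demo.concepts")  # hypothetical name
demo_logger.setLevel(logging.WARNING)  # Level: drop anything below WARNING
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo output away from the root logger

demo_logger.info("Filtered out: INFO is below WARNING")
demo_logger.warning("Kept: WARNING meets the threshold")
demo_logger.error("Kept: ERROR is above the threshold")

captured = stream.getvalue()
print(captured)

# Levels are plain integers, which is what makes the threshold comparison possible
print(logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL)
```

Only the `WARNING` and `ERROR` lines reach the stream, because the logger discards records below its own level before any handler sees them.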

---

#### **Basic Logging Example**

Here’s how to get started with Python's logging module:

```python
import logging

# Set up a basic logger
logging.basicConfig(
    level=logging.DEBUG,  # Set the minimum logging level
    format='%(asctime)s - %(levelname)s - %(message)s'  # Define the log message format
)

# Example log messages
logging.debug("This is a debug message. Used for detailed diagnostic output.")
logging.info("This is an info message. Indicates the program is running as expected.")
logging.warning("This is a warning message. Something unexpected happened.")
logging.error("This is an error message. A problem occurred.")
logging.critical("This is a critical message. A serious error happened.")
```

---

#### **Output Explanation**
When you run the code, you'll see output like this:

```
2024-12-23 14:23:01,123 - DEBUG - This is a debug message. Used for detailed diagnostic output.
2024-12-23 14:23:01,124 - INFO - This is an info message. Indicates the program is running as expected.
2024-12-23 14:23:01,125 - WARNING - This is a warning message. Something unexpected happened.
2024-12-23 14:23:01,126 - ERROR - This is an error message. A problem occurred.
2024-12-23 14:23:01,127 - CRITICAL - This is a critical message. A serious error happened.
```

- **Timestamp:** Indicates when the log was recorded.
- **Log Level:** Shows the severity of the log message.
- **Message:** The custom message provided.
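
The format string can carry more fields than the three listed above. The sketch below is one possible variation, not the only sensible layout: it adds the logger name, pads the level to a fixed width, and uses `datefmt` to drop the milliseconds. The logger name `demo.format` and the in-memory stream are assumptions made for the example.

```python
import io
import logging

stream = io.StringIO()  # in-memory stand-in for the console
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter(
        fmt="%(asctime)s | %(name)s | %(levelname)-8s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",  # no milliseconds
    )
)

format_logger = logging.getLogger("demo.format")  # hypothetical name
format_logger.setLevel(logging.DEBUG)
format_logger.addHandler(handler)
format_logger.propagate = False

format_logger.info("Scraping page %d", 1)

line = stream.getvalue().strip()
print(line)
```

The printed line has the shape `<date> <time> | demo.format | INFO     | Scraping page 1`, with the timestamp taken from the moment the record was created.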

---

#### **Key Functions**
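1. **`logging.basicConfig()`**: Configures the root logger (level, format, and destination). It has no effect if the root logger already has handlers, unless `force=True` is passed (Python 3.8+).
2. **Logging functions:** These module-level functions emit a message on the root logger at the given severity level: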
   - `logging.debug()`
   - `logging.info()`
   - `logging.warning()`
   - `logging.error()`
   - `logging.critical()`
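
A point that often surprises people in notebooks is that `logging.basicConfig()` only configures the root logger when it has no handlers yet. The following sketch, which assumes Python 3.8 or later for the `force` argument, shows a second call being ignored and a forced call taking effect. The format strings are arbitrary placeholders.

```python
import logging

root = logging.getLogger()

# Start from a clean root logger so the demo is repeatable
for h in list(root.handlers):
    root.removeHandler(h)

logging.basicConfig(level=logging.INFO, format="first: %(message)s")
handlers_after_first = list(root.handlers)

# Ignored: the root logger already has a handler, so nothing changes
logging.basicConfig(level=logging.DEBUG, format="second: %(message)s")
level_after_second = root.level
handlers_after_second = list(root.handlers)

# force=True removes the existing handlers and applies the new configuration
logging.basicConfig(level=logging.DEBUG, format="forced: %(message)s", force=True)
level_after_force = root.level

print(logging.getLevelName(level_after_second), logging.getLevelName(level_after_force))
```

If a configuration change seems to have no effect after re-running a cell, this behaviour is a likely cause.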

In [2]:
import time
import random
import logging

In [2]:
""" 
Example of scraping process
"""
def start_scraping():
    # Scraping page 1 to 10
    for i in range(1, 11):
        print(f"Scraping page {i}")
        time.sleep(1)
        # Getting page response
        print("Scraping completed successfully.")

start_scraping()

Scraping page 1
Scraping completed successfully.
Scraping page 2
Scraping completed successfully.
Scraping page 3
Scraping completed successfully.
Scraping page 4
Scraping completed successfully.
Scraping page 5
Scraping completed successfully.
Scraping page 6
Scraping completed successfully.
Scraping page 7
Scraping completed successfully.
Scraping page 8
Scraping completed successfully.
Scraping page 9
Scraping completed successfully.
Scraping page 10
Scraping completed successfully.


In [6]:
"""
Objective: Understand the basics of Python's logging module and why it's important. 
Logging helps you monitor your program's behavior and debug issues without relying on print statements.
"""

# TODO:
# 1. Set up a basic logger that logs messages at the INFO level.
# 2. Replace the print statements in start_scraping with logging calls.
# 3. Log the following messages at the INFO level:
#    - "Scraping page " followed by the page number
#    - "Scraping completed successfully."

import logging

# Create a file handler
file_handler = logging.FileHandler('scraper.log')

# Create a formatter and add it to the file handler
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
file_handler.setFormatter(formatter)

# Add the file handler to the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # INFO and above, as the task requires
logger.addHandler(file_handler)


def start_scraping():
    # Scraping page 1 to 10
    for i in range(1, 11):
        logging.info(f"Scraping page {i}")
        time.sleep(1)   
        # Getting page response
        logging.info("Scraping completed successfully.")

start_scraping()


INFO:root:Scraping page 1
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 2
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 3
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 4
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 5
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 6
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 7
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 8
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 9
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 10
INFO:root:Scraping completed successfully.
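
The logging calls above build each message with an f-string, which formats the text even when the level is filtered out. The logging functions also accept %-style arguments that are interpolated only when a record is actually emitted. A minimal sketch of the difference, where the `Expensive` class and the logger name `demo.lazy` are illustrative assumptions:

```python
import logging


class Expensive:
    """Counts how often it is converted to a string (illustrative helper)."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "page 1"


lazy_logger = logging.getLogger("demo.lazy")  # hypothetical name
lazy_logger.setLevel(logging.WARNING)  # INFO messages are filtered out
lazy_logger.addHandler(logging.NullHandler())
lazy_logger.propagate = False

value = Expensive()

# f-string: str(value) runs immediately, although the message is then discarded
lazy_logger.info(f"Scraping {value}")
calls_after_fstring = value.str_calls

# %-style arguments: the message is filtered before any formatting happens
lazy_logger.info("Scraping %s", value)
calls_after_lazy = value.str_calls

print(calls_after_fstring, calls_after_lazy)
```

For short messages the cost is negligible, so this is a matter of habit rather than a required change.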


In [None]:
""" 
Objective: Set up different log levels
"""

# TODO: Set up a logger that only logs error messages
logging.getLogger().setLevel(logging.ERROR)


def scraping_with_error_response():
    response_code = [200, 200, 200, 200, 200, 404, 503]

    # Scraping page 1 to 10
    for i in range(1, 11):
        # TODO: Add log message for tracking page number
        logging.info(f"Scraping page {i}")

        time.sleep(1)

        # Getting page response
        response = random.choice(response_code)
        
        if response == 200:   
            # TODO: Add log message for valid response
            logging.info("Scraping completed successfully.")

        else:
            # TODO: Add log message for invalid response
            logging.error("Scraping failed.")


scraping_with_error_response()

ERROR:root:Scraping failed.
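
The cell above is meant to let only `ERROR` messages through. One way to check how a level threshold is applied, without emitting anything, is to ask a logger for its effective level. In this sketch the logger names `scraper` and `scraper.pages` are hypothetical; the child has no level of its own and inherits the one set on its parent.

```python
import logging

parent = logging.getLogger("scraper")       # hypothetical parent logger
child = logging.getLogger("scraper.pages")  # hypothetical child logger

parent.setLevel(logging.ERROR)

# The child is NOTSET, so it walks up the hierarchy and finds ERROR on the parent
effective = child.getEffectiveLevel()
info_enabled = child.isEnabledFor(logging.INFO)
error_enabled = child.isEnabledFor(logging.ERROR)

print(logging.getLevelName(effective), info_enabled, error_enabled)
```

The same inheritance applies to the root logger: setting its level to `ERROR` silences `INFO` messages from every logger that does not set a level of its own.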


In [8]:
"""
Objective: Learn to configure logging to log messages to a file for persistent records. 
This is useful for analyzing scraping sessions or debugging after the program runs.
"""

# TODO:
# 1. Configure logging to log messages at the DEBUG level to a file named `scraper.log`.
# 2. Add timestamps to the log messages.
# 3. Reuse the previous function for this task.

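import logging

# Get the root logger and set the logging level to DEBUG
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Remove file handlers added by earlier cells so each message is written to scraper.log once
for handler in list(logger.handlers):
    if isinstance(handler, logging.FileHandler):
        logger.removeHandler(handler)
        handler.close()

# Create a file handler
file_handler = logging.FileHandler('scraper.log')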


# Create a formatter with timestamps
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

# Add the formatter to the file handler
file_handler.setFormatter(formatter)


# Add the file handler to the logger
logger.addHandler(file_handler)


scraping_with_error_response()


INFO:root:Scraping page 1
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 2
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 3
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 4
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 5
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 6
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 7
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 8
ERROR:root:Scraping failed.
INFO:root:Scraping page 9
INFO:root:Scraping completed successfully.
INFO:root:Scraping page 10
ERROR:root:Scraping failed.
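
Re-running a notebook cell that calls `addHandler()` attaches another handler each time, so every message is then written more than once. A possible guard, sketched here with a hypothetical helper `get_file_logger` and a temporary directory in place of the working directory, is to attach the file handler only if an equivalent one is not already present.

```python
import logging
import os
import tempfile


def get_file_logger(name, path):
    """Return a logger that writes to `path`, attaching the handler only once."""
    log = logging.getLogger(name)
    log.setLevel(logging.DEBUG)
    target = os.path.abspath(path)
    already_attached = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == target
        for h in log.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        log.addHandler(handler)
    return log


log_path = os.path.join(tempfile.mkdtemp(), "scraper.log")

# Simulate running the setup cell three times
for _ in range(3):
    file_logger = get_file_logger("demo.rerun", log_path)  # hypothetical name
file_logger.propagate = False

file_logger.info("Scraping page 1")

with open(log_path) as f:
    lines = f.read().splitlines()
print(len(lines))
```

Despite three setup calls, the logger ends up with a single handler and the file contains one line for the one message.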


In [10]:
"""
Objective: Apply logging to a full scraping workflow and use different logging levels for various stages.
This will help you monitor and troubleshoot scraping operations more effectively.
"""

# TODO:
# 1. Write a script that:
#    - Logs INFO when scraping starts.
#    - Logs DEBUG for each URL being processed.
#    - Logs ERROR if a request fails.
#    - Logs INFO when scraping ends.
# 2. Scrape data from multiple URLs, including one invalid URL to test the error logging.

import logging
import requests
from bs4 import BeautifulSoup

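# Configure logging
# Note: basicConfig() does nothing if the root logger already has handlers
# (for example, from an earlier cell); add force=True (Python 3.8+) to replace them.
logging.basicConfig(filename='scraper.log', level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')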

def scrape_url(url):
    try:
        # Log DEBUG for each URL being processed
        logging.debug(f'Processing URL: {url}')

        # Send a GET request to the URL (the timeout keeps a stalled request from hanging the script)
        response = requests.get(url, timeout=10)

        # Check if the request was successful
        if response.status_code == 200:
            # Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            # Scrape data from the HTML content (for example, extract all links)
            links = soup.find_all('a')

            # Log INFO for the scraped data
            logging.info(f'Scraped {len(links)} links from {url}')
        else:
            # Log ERROR if the request fails
            logging.error(f'Request failed for {url}: {response.status_code}')
    except requests.exceptions.RequestException as e:
        # Log ERROR if a request exception occurs
        logging.error(f'Request exception for {url}: {e}')

def main():
    # Log INFO when scraping starts
    logging.info('Scraping started')

    # Scrape data from multiple URLs
    urls = [
        'https://www.example.com',
        'https://www.google.com',
        'https://www.invali',  # Invalid URL to test error logging
        'https://www.stackoverflow.com'
    ]

    for url in urls:
        scrape_url(url)

    # Log INFO when scraping ends
    logging.info('Scraping ended')

if __name__ == '__main__':
    main()




INFO:root:Scraping started
DEBUG:root:Processing URL: https://www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.example.com:443
DEBUG:urllib3.connectionpool:https://www.example.com:443 "GET / HTTP/1.1" 200 648
INFO:root:Scraped 1 links from https://www.example.com
DEBUG:root:Processing URL: https://www.google.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.google.com:443
DEBUG:urllib3.connectionpool:https://www.google.com:443 "GET / HTTP/1.1" 200 None
INFO:root:Scraped 19 links from https://www.google.com
DEBUG:root:Processing URL: https://www.invali
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.invali:443
ERROR:root:Request exception for https://www.invali: HTTPSConnectionPool(host='www.invali', port=443): Max retries exceeded with url: / (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x000002B306806F60>: Failed to resolve 'www.invali' ([Errno 11003] getaddrinfo failed)"))
DE
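
When a request fails inside an `except` block, `logging.error()` records only the message passed to it. If the traceback is also useful, `Logger.exception()` logs at the `ERROR` level and appends it. The sketch below avoids the network entirely: `fetch` is a hypothetical stand-in that always fails, and the logger name `demo.exception` is likewise an assumption.

```python
import io
import logging

stream = io.StringIO()  # in-memory stand-in for the console
exc_logger = logging.getLogger("demo.exception")  # hypothetical name
exc_logger.setLevel(logging.DEBUG)
exc_logger.addHandler(logging.StreamHandler(stream))
exc_logger.propagate = False


def fetch(url):
    """Stand-in for a network call; always fails (hypothetical helper)."""
    raise ConnectionError(f"could not resolve {url}")


try:
    fetch("https://www.invali")
except ConnectionError:
    # Logs at ERROR level and appends the current traceback
    exc_logger.exception("Request failed for %s", "https://www.invali")

output = stream.getvalue()
print(output)
```

`exception()` is intended to be called from an exception handler, since it reads the exception that is currently being handled.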

In [4]:
"""
Objective: Explore advanced logging by using custom handlers to log messages to multiple destinations. 
This technique improves flexibility in handling log output.
"""
import logging
# Create handlers
console_handler = logging.StreamHandler()  # Shows log messages in the console
console_handler.setLevel(logging.DEBUG)

# TODO: 
# 1. Create another handler that stores logs in a file using logging.FileHandler('error.log')
# 2. Set its level to DEBUG
file_handler_error = logging.FileHandler('error.log')
file_handler_error.setLevel(logging.DEBUG)

# Create formatter
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

# Attach formatter to handlers
console_handler.setFormatter(formatter)
# TODO: Add formatter to the file handler
file_handler_error.setFormatter(formatter)


# Create logger and attach handlers
logger = logging.getLogger('ScraperLogger')
logger.setLevel(logging.DEBUG)
logger.addHandler(console_handler)  # Attach the stream handler to the logger
# TODO: Attach the file handler to the logger
logger.addHandler(file_handler_error)
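
Both handlers in the cell above use the `DEBUG` level, so they receive the same records. Giving each handler its own level sends different subsets to different destinations. The following sketch uses two in-memory streams as stand-ins for the console and `error.log`, and the logger name `demo.multi` is an assumption.

```python
import io
import logging

console_stream = io.StringIO()  # stands in for the console
error_stream = io.StringIO()    # stands in for error.log

console = logging.StreamHandler(console_stream)
console.setLevel(logging.DEBUG)  # receives everything

errors_only = logging.StreamHandler(error_stream)
errors_only.setLevel(logging.ERROR)  # receives ERROR and CRITICAL only

formatter = logging.Formatter("%(levelname)s - %(message)s")
console.setFormatter(formatter)
errors_only.setFormatter(formatter)

multi_logger = logging.getLogger("demo.multi")  # hypothetical name
multi_logger.setLevel(logging.DEBUG)  # the logger must let records through first
multi_logger.addHandler(console)
multi_logger.addHandler(errors_only)
multi_logger.propagate = False

multi_logger.debug("Processing URL")
multi_logger.error("Request failed")

console_lines = console_stream.getvalue().splitlines()
error_lines = error_stream.getvalue().splitlines()
print(console_lines)
print(error_lines)
```

A record is checked twice: first against the logger's level, then against the level of each handler, which is why the logger itself is set to `DEBUG` here.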



In [9]:
""" 
Objective: Handle failed requests using logging
"""
# TODO:
# 1. Create a function that loops through numbers and gets a random response,
# like the previous code, but modify it as you like
# 2. Send stream logs to the console and error logs to a file
# 3. Write all failed URLs to a file so you can retry them later
# 4. Automate the process (optional)

import logging
import random
import requests

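# Create a logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)  # Otherwise the logger inherits the root level (WARNING by default) and INFO is dropped

# Remove handlers left over from earlier runs of this cell to avoid duplicate lines
for handler in list(logger.handlers):
    logger.removeHandler(handler)
    handler.close()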

# Create a file handler for error logs
error_handler = logging.FileHandler('error.log')
error_handler.setLevel(logging.ERROR)

# Create a console handler for stream logs
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)

# Create a formatter
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# Add the formatter to the handlers
error_handler.setFormatter(formatter)
console_handler.setFormatter(formatter)

# Add the handlers to the logger
logger.addHandler(error_handler)
logger.addHandler(console_handler)

def get_random_response():
    numbers = [1, 2, 3, 4, 80]  # replace with your numbers
    failed_urls = []

    for num in numbers:
        url = f"https://books.toscrape.com/catalogue/category/books_1/page-{num}.html"  # replace with your URL
        try:
            response = requests.get(url, timeout=10)  # timeout so a stalled request cannot hang the loop
            response.raise_for_status()  # Raise an exception for HTTP errors
            logger.info(f"Successful response for {url}")
        except requests.exceptions.RequestException as e:
            logger.error(f"Error for {url}: {e}")
            failed_urls.append(url)

    return failed_urls

def save_failed_urls(failed_urls):
    with open('failed_urls.txt', 'w') as f:
        for url in failed_urls:
            f.write(url + '\n')

def main():
    failed_urls = get_random_response()
    save_failed_urls(failed_urls)

    if failed_urls:
        logger.info("Failed URLs saved to failed_urls.txt")
    else:
        logger.info("No failed URLs")

if __name__ == "__main__":
    main()

2025-02-09 22:45:34,228 - __main__ - ERROR - Error for https://books.toscrape.com/catalogue/category/books_1/page-80.html: 404 Client Error: Not Found for url: https://books.toscrape.com/catalogue/category/books_1/page-80.html
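
For the retry step in the TODO list, one possible approach is to read `failed_urls.txt` back, try each URL again, and rewrite the file with whatever still fails. The sketch below is an assumption about how that could look, not part of the original solution: `load_failed_urls`, `retry_failed`, and `fake_fetch` are hypothetical helpers, the example URLs are placeholders, and the fetch function is passed in so the demo runs without network access.

```python
import os
import tempfile


def load_failed_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def retry_failed(path, fetch):
    """Retry every URL in `path`; rewrite the file with the ones that still fail."""
    still_failing = []
    for url in load_failed_urls(path):
        try:
            fetch(url)
        except Exception:
            # Real code would catch requests.exceptions.RequestException instead
            still_failing.append(url)
    with open(path, "w") as f:
        f.writelines(url + "\n" for url in still_failing)
    return still_failing


def fake_fetch(url):
    """Pretend request: only the page-80 URL fails (demo stand-in)."""
    if url.endswith("page-80.html"):
        raise RuntimeError("404 Not Found")


demo_path = os.path.join(tempfile.mkdtemp(), "failed_urls.txt")
with open(demo_path, "w") as f:
    f.write("https://example.com/page-2.html\nhttps://example.com/page-80.html\n")

remaining = retry_failed(demo_path, fake_fetch)
print(remaining)
```

With a real scraper, `fetch` could be a small wrapper around `requests.get(url, timeout=10)` followed by `raise_for_status()`, and the function could be called again later until the file is empty.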

### **Reflection**
In what situations will logging help you the most?

(answer here)

### **Exploration**
Explore advanced logging and monitoring tools such as:
- Loguru
- Loggly
- Datadog