# Unit 4 Scraping Best Practices

Welcome! In this lesson, we will focus on the aspects of web scraping that keep your scrapers efficient, scalable, and respectful of the websites you scrape. We'll cover several techniques and best practices for improving your scraping scripts with Python and BeautifulSoup.

## Importance and Ethics of Web Scraping

When you scrape data from a website, it's important to do it ethically. This means understanding and respecting the website's terms of service and its `robots.txt` file, which outlines which parts of the site web scrapers and bots may crawl. Ignoring them can get your IP address blocked and may have legal consequences.

Ethical scraping involves:

  * **Honoring the `robots.txt` file:** Always check if the data you wish to scrape is allowed. This file usually can be found at the root of the website (e.g., `https://example.com/robots.txt`).
  * **Avoiding overloading the server:** Make your scraper polite by controlling the rate of requests to avoid putting unnecessary load on the server.
  * **Understanding data ownership:** Some data might be protected by copyright or require permission to be scraped.

Aggressive scraping can degrade the performance of target websites, making them slow or even unresponsive for users. This is why polite crawling, rate limiting, and adherence to best practices are crucial.
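The `robots.txt` check described above can be automated. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS_TXT` and the `my-scraper` user agent name are hypothetical and are inlined only so the example runs without network access. Against a real site you would likely call `set_url()` and `read()` on the parser instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)
user_agent = "my-scraper"  # hypothetical name for our scraper

# Allowed: no rule matches this path.
print(parser.can_fetch(user_agent, "https://example.com/page/1/"))
# Disallowed: the path starts with /private/.
print(parser.can_fetch(user_agent, "https://example.com/private/data"))
# The delay (in seconds) the site asks crawlers to keep between requests, if any.
print(parser.crawl_delay(user_agent))
```

Checking `can_fetch()` before each request, and using `crawl_delay()` as a lower bound for your own delay when the site provides one, is one reasonable way to turn these guidelines into code.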

## Rate Limiting

Rate limiting involves adding delays between requests to avoid overwhelming the server. You can use the `time.sleep()` function to achieve this.

```python
import requests
import time
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com"
page = "/page/1/"

while page:
    response = requests.get(url + page)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(f'Parsed page {url + page}')
    next_page = soup.select_one('.next a')
    page = next_page['href'] if next_page else None
    time.sleep(1)  # Add a delay of 1 second between requests to avoid overloading the server
```

The above code snippet demonstrates how to add a delay of 1 second between requests. This helps in controlling the rate of requests and ensures that the server is not overwhelmed.
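A fixed `time.sleep(1)` also pauses after the last page and ignores the time already spent parsing. As an optional refinement, here is a minimal sketch of a small helper that enforces a minimum interval between requests. The `PoliteDelay` class and its `jitter` option (a small random extra delay) are illustrative assumptions, not part of the lesson's code, and the loop only prints instead of sending real requests.

```python
import random
import time

class PoliteDelay:
    """Hypothetical helper: keeps at least `min_interval` seconds between wait() calls."""

    def __init__(self, min_interval=1.0, jitter=0.0):
        self.min_interval = min_interval
        self.jitter = jitter          # upper bound of the random extra delay, in seconds
        self._last_call = None        # time.monotonic() value of the previous call

    def wait(self):
        """Sleep only as long as needed to respect the minimum interval."""
        target = self.min_interval + random.uniform(0, self.jitter)
        if self._last_call is not None:
            elapsed = time.monotonic() - self._last_call
            if elapsed < target:
                time.sleep(target - elapsed)
        self._last_call = time.monotonic()

# Demo with placeholder paths; a real scraper would call requests.get() after wait().
limiter = PoliteDelay(min_interval=0.5, jitter=0.25)
for path in ["/page/1/", "/page/2/", "/page/3/"]:
    limiter.wait()  # the first call returns immediately, later calls may sleep
    print(f"Would request {path} now")
```

Calling `wait()` before each request, rather than sleeping after it, means the first request is not delayed and no time is wasted after the final page.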

## Handling Timeouts

When making requests to a server, it's important to handle timeouts gracefully. You can set a timeout value for your requests to avoid waiting indefinitely for a response.

```python
import requests

url = "https://httpbin.org/delay/20"  # Asks httpbin to delay its response (httpbin caps the delay at 10 seconds)
try:
    response = requests.get(url, timeout=2)  # Set a timeout of 2 seconds
    print(response.text)
except requests.Timeout:
    print("The request timed out")
```

Requests applies no timeout by default, so without one a call can hang for as long as the server keeps the connection open without answering, which stalls the whole scraper.
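Handling a timeout gracefully often means trying again a few times before giving up. Below is a minimal sketch of that idea. The function name `fetch_with_retries`, its parameters, and the exponential backoff policy are assumptions made for illustration. The `timeout=(3.05, 10)` tuple uses Requests' support for separate connect and read timeouts, and the `get` parameter exists only so the function can be tried with a stand-in instead of a real network call.

```python
import time
import requests

def fetch_with_retries(url, retries=3, timeout=(3.05, 10), backoff=1.0, get=requests.get):
    """Return the response, or None if every attempt timed out.

    timeout is a (connect, read) tuple in seconds.
    backoff is the base delay; it doubles after each failed attempt.
    """
    for attempt in range(retries):
        try:
            return get(url, timeout=timeout)
        except requests.Timeout:
            # Covers both connect and read timeouts.
            if attempt < retries - 1:
                time.sleep(backoff * 2 ** attempt)
    return None
```

A call such as `fetch_with_retries("https://quotes.toscrape.com")` would then return either a response object or `None`, so the caller can decide whether to skip the page or stop the crawl.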

## Blending In with Regular Traffic

Setting the `User-Agent` header can help your scraper blend in with regular browser traffic. This header provides information about the client's software environment.

```python
import requests

url = "https://quotes.toscrape.com"
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36' }

response = requests.get(url, headers=headers)
```

By setting the `User-Agent` header, you can make your scraper appear more like a regular browser, reducing the chances of being blocked. The exact `User-Agent` string depends on the browser and operating system in use. You can find lists of common user agents for different browsers and operating systems online.
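When a scraper makes many requests, passing `headers=` every time gets repetitive. One option is a `requests.Session`, which sends its default headers with every request it makes. The sketch below assumes a helper named `make_session` (a hypothetical name) and reuses the `User-Agent` string from the example above. It prepares a request without sending it, so it runs without network access.

```python
import requests

BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)

def make_session(user_agent):
    """Create a session that sends the given User-Agent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session

session = make_session(BROWSER_USER_AGENT)

# Build (but do not send) a request to confirm the header is attached.
prepared = session.prepare_request(requests.Request("GET", "https://quotes.toscrape.com"))
print(prepared.headers["User-Agent"])
```

From here, `session.get(url, timeout=10)` would send the request with that header. A session also reuses the underlying connection across requests to the same host, which tends to be lighter on both the scraper and the server.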

## Lesson Summary

We've covered essential best practices that keep your web scraper efficient, respectful, and robust. By following these guidelines, you can build reliable scrapers that extract data effectively without disrupting the websites you scrape. Remember, ethical scraping is the key to successful and sustainable web scraping. Happy scraping!

## Web Scraping Best Practices in Action

Great progress so far! Now, let's run a Python script that demonstrates best practices for web scraping. The script scrapes quotes from a website, collecting content from multiple pages while respecting the server.

This script uses:

  * **Requests and BeautifulSoup:** For fetching and parsing web page content.
  * **Timeouts:** To ensure the scraper doesn't hang indefinitely.
  * **Error handling:** To manage request errors gracefully.
  * **Polite crawling:** Adding delays between requests using `time.sleep()`.
  * **Navigation:** Handling multiple pages by detecting "Next" page links.

When you run this code, it will print each quote's text and author name. Pay attention to how errors are handled, how pages are navigated, and how the server is respected by limiting requests.

```python
import requests
from bs4 import BeautifulSoup
import time

def scrape(base_url, start_page):
    current_page = start_page
    while current_page:
        try:
            response = requests.get(base_url + current_page, timeout=10)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(e)
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        for quote in soup.select(".quote"):
            text = quote.select_one(".text").get_text(strip=True)
            author = quote.select_one(".author").get_text(strip=True)
            print(f"'{text}' by {author}")

        next_page = soup.select_one("li.next > a")
        current_page = next_page["href"] if next_page else None
        time.sleep(1)  # Respectful crawling by sleeping for 1 second
        print("\n")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
scrape(base_url, start_page)

```

## Fix the Web Scraper Bugs

You've done a great job so far! Let's now identify and fix an issue in the code to improve its performance.

Currently, the code sends a request to a URL and waits for the response. The response takes too long, which slows down execution. We need to fix this by setting a 2-second timeout on the request.

Can you identify the issue and fix it to improve the code performance?

```python
import requests

url = "https://postman-echo.com/delay/10"

try:
    response = requests.get(url)
    print(response.text)
except requests.Timeout:
    print("The request timed out")

```

To keep the code from taking too long to execute, add a `timeout` parameter to the `requests.get()` call. The request is sent to an endpoint with a 10-second delay, and because the code sets no timeout, it waits the full 10 seconds before continuing.

Here is the corrected code with a 2-second timeout:

```python
import requests

url = "https://postman-echo.com/delay/10"

try:
    response = requests.get(url, timeout=2)
    print(response.text)
except requests.Timeout:
    print("The request timed out")
```

## Complete the Web Scraper

Excellent progress so far! A well-behaved scraper follows best practices such as pausing between requests.

Your task is to fill in the missing part of the code so that the scraper waits between requests and does not overwhelm the server.

```python
import requests
from bs4 import BeautifulSoup
import time

def scrape_heroes(base_url, first_page):
    next_page = first_page
    while next_page:
        try:
            response = requests.get(base_url + next_page, timeout=10)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(e)
            break
        
        soup = BeautifulSoup(response.text, 'html.parser')
        for hero in soup.select(".quote"):
            name = hero.select_one(".text").get_text(strip=True)
            power = hero.select_one(".author").get_text(strip=True)
            print(f"'{name}' has power: {power}")

        next_page_link = soup.select_one("li.next > a")
        next_page = next_page_link["href"] if next_page_link else None
        
        # TODO: Add wait time of 1 second before continuing the scraping to not overwhelm the server
        
        print("\n")

base_url = 'http://quotes.toscrape.com'
first_page = '/page/1/'
scrape_heroes(base_url, first_page)

```

To add a delay to the web scraper, you should use the `time.sleep()` function. As per the instructions, a 1-second delay is needed to avoid overwhelming the server.

Here is the completed code with the missing part filled in:

```python
import requests
from bs4 import BeautifulSoup
import time

def scrape_heroes(base_url, first_page):
    next_page = first_page
    while next_page:
        try:
            response = requests.get(base_url + next_page, timeout=10)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(e)
            break
        
        soup = BeautifulSoup(response.text, 'html.parser')
        for hero in soup.select(".quote"):
            name = hero.select_one(".text").get_text(strip=True)
            power = hero.select_one(".author").get_text(strip=True)
            print(f"'{name}' has power: {power}")

        next_page_link = soup.select_one("li.next > a")
        next_page = next_page_link["href"] if next_page_link else None
        
        # Wait 1 second before requesting the next page so we don't overwhelm the server
        time.sleep(1)
        
        print("\n")

base_url = 'http://quotes.toscrape.com'
first_page = '/page/1/'
scrape_heroes(base_url, first_page)
```